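To learn threads and coroutines effectively and deepen my understanding of them, it helps to implement the same task in more than one way.
Here I crawl the chapters of Journey to the West (西游记) twice, once with multithreading and once with coroutines, and compare how long each approach takes.
First, the multithreaded crawler.
Concurrency is not limited here: the for loop in the main block iterates over pair and starts one thread per entry. Since pair holds 100 entries, 100 threads are actually created.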
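import json
import os
import threading

import requests

url = 'http://dushu.baidu.com/api/pc/getCatalog?data={"book_id":"4306063500"}'
path = 'xiyouji2'
# exist_ok avoids a FileExistsError when the script is run a second time
os.makedirs(path, exist_ok=True)


def getCatalog(url):
    # Fetch the catalog and return a list of (cid, title) tuples
    resp = requests.get(url=url)
    jsdata = resp.json()
    pair = []
    for data in jsdata['data']['novel']['items']:
        title = data['title']
        cid = data['cid']
        pair.append((cid, title))
    return pair


def download(p):
    # Download one chapter and write it to <path>/<title>.txt
    cid, title = p
    data = {
        "book_id": "4306063500",
        "cid": f"4306063500|{cid}",
        "need_bookinfo": 1
    }
    data = json.dumps(data)
    url = f'http://dushu.baidu.com/api/pc/getChapterContent?data={data}'
    response = requests.get(url=url)
    jsdata = response.json()
    # Explicit encoding so the Chinese text is written correctly on every platform
    with open(f'{path}/{title}.txt', 'w', encoding='utf-8') as f:
        f.write(jsdata['data']['novel']['content'])


if __name__ == '__main__':
    pair = getCatalog(url)
    threads = []
    for p in pair:
        # One thread per chapter, with no upper bound on concurrency
        t = threading.Thread(target=download, args=(p,))
        t.start()
        threads.append(t)
    # Wait for every download to finish before the script exits
    for t in threads:
        t.join()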
Multithreaded run time
# Finished in 661ms
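The script above starts one thread per chapter. If the thread count ever needs a ceiling, a thread pool is one possible approach. The following is only a sketch under stated assumptions: fake_download is a hypothetical stand-in that sleeps instead of calling the real download function, and max_workers=10 is an arbitrary value chosen for illustration.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(p):
    # Hypothetical stand-in for download(p): sleeps instead of making an HTTP request
    cid, title = p
    time.sleep(0.01)  # simulated network latency
    return title, threading.current_thread().name


def crawl_with_pool(pairs, max_workers=10):
    # The pool reuses at most max_workers threads, however many chapters there are
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the input pairs
        return list(pool.map(fake_download, pairs))


if __name__ == '__main__':
    pairs = [(i, f'chapter-{i}') for i in range(100)]
    results = crawl_with_pool(pairs)
    workers = {name for _, name in results}
    print(len(results), 'chapters handled by', len(workers), 'threads')
```

Swapping fake_download for the real download function would keep the rest of the sketch unchanged, at the cost of a longer total run time when max_workers is small.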
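Next, the same crawl implemented with coroutines.
Coroutine tasks all run in a single thread. Whereas threads are scheduled by the operating system, coroutines rely on the program itself to switch between tasks: a task is suspended with await whenever it reaches a blocking operation. Because no threads have to be created and destroyed, coroutines consume fewer resources than multithreading.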
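To make the switching at await concrete, here is a minimal sketch that is separate from the crawler. The worker function, its names 'a' and 'b', and the delays are all made up for illustration, and asyncio.sleep stands in for a blocking network call.

```python
import asyncio
import threading


async def worker(name, delay, log):
    log.append((name, 'start', threading.get_ident()))
    # await hands control back to the event loop so another task can run
    await asyncio.sleep(delay)
    log.append((name, 'end', threading.get_ident()))


async def main():
    log = []
    # Both workers are scheduled together on the same event loop
    await asyncio.gather(worker('a', 0.1, log), worker('b', 0.02, log))
    return log


if __name__ == '__main__':
    for entry in asyncio.run(main()):
        print(entry)
```

Worker 'b' starts before 'a' has finished, and every log entry carries the same thread id, which shows the two tasks interleaving on a single thread.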
import asyncio
import json
import os

import aiofiles
import aiohttp
import requests

url = 'http://dushu.baidu.com/api/pc/getCatalog?data={"book_id":"4306063500"}'
path = 'xiyouji'
# exist_ok avoids a FileExistsError when the script is run a second time
os.makedirs(path, exist_ok=True)


async def download(cid, title):
    # Download one chapter and write it to <path>/<title>.txt
    data = {
        "book_id": "4306063500",
        "cid": f"4306063500|{cid}",
        "need_bookinfo": 1
    }
    data = json.dumps(data)
    url = f'http://dushu.baidu.com/api/pc/getChapterContent?data={data}'
    print(url)
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            dic = await resp.json()
    # Open the file only after the response has arrived
    async with aiofiles.open(f'{path}/{title}.txt', 'w', encoding='utf-8') as f:
        await f.write(dic['data']['novel']['content'])


async def getCatalog(url):
    # The catalog is a single request, so the blocking requests call is kept here
    resp = requests.get(url=url)
    jsdata = resp.json()
    tasks = []
    for data in jsdata['data']['novel']['items']:
        title = data['title']
        cid = data['cid']
        print(title, cid)
        tasks.append(download(cid, title))
    # gather accepts coroutine objects directly; asyncio.wait stopped
    # accepting bare coroutines in Python 3.11
    await asyncio.gather(*tasks)


if __name__ == '__main__':
    asyncio.run(getCatalog(url))
Coroutine run time
# Finished in 1.2s
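The comparison shows that for this task the coroutine version took roughly twice as long as the multithreaded one (1.2 s versus 661 ms). As the crawl grows larger, however, coroutines should become more efficient than threads. In addition, because a coroutine only gives up control at an await, access to a shared resource generally does not need the locking that multithreading requires.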
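The point about locks can be illustrated with a shared counter. This is a sketch rather than part of the crawler: count_with_threads and count_with_coroutines are hypothetical helper names, and the counter stands in for any shared resource.

```python
import asyncio
import threading


def count_with_threads(workers=4, n=1000):
    total = 0
    lock = threading.Lock()

    def increment():
        nonlocal total
        for _ in range(n):
            # A thread can be preempted mid-update, so the read-modify-write is guarded
            with lock:
                total += 1

    threads = [threading.Thread(target=increment) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


async def count_with_coroutines(workers=4, n=1000):
    total = 0

    async def increment():
        nonlocal total
        for _ in range(n):
            # No await between the read and the write, so no other coroutine can interleave
            total += 1
            await asyncio.sleep(0)  # explicit switch point, after the update is complete

    await asyncio.gather(*(increment() for _ in range(workers)))
    return total


if __name__ == '__main__':
    print(count_with_threads())
    print(asyncio.run(count_with_coroutines()))
```

Both functions return workers * n. The coroutine version stays correct only as long as no await sits between reading and writing the shared value; if one did, an asyncio.Lock would be needed there as well.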



