Site: https://www.dbbqb.com/
Open any meme; its URL looks like this:
https://www.dbbqb.com/detail/320000.html
Trying different numbers in the URL reveals the URL pattern:
https://www.dbbqb.com/detail/<meme ID>.html

Page analysis
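Open the developer tools (F12): the page loads its data via AJAX.
Switch to the XHR tab: one field of the JSON response matches the image URL.
The API endpoint follows this pattern:
https://www.dbbqb.com/api/image/<meme ID>

Project structure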
From a shell, you can create it with:
touch main.py
mkdir image

Code
from threading import Thread
import json
import os

import requests

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4947.3 Safari/537.36'
HEADERS = {
    'User-Agent': USER_AGENT
}


def download_image(url: str, num: int):
    # Download the image at the given URL and save it as image/<num>.jpg
    response = requests.get(url, headers=HEADERS, timeout=10)
    with open(os.path.join('image', f'{num}.jpg'), 'wb') as f:
        f.write(response.content)


def download(image_num: int):
    # Fetch the meme with the given ID.
    # ':path' is an HTTP/2 pseudo-header derived from the URL,
    # so it is not set manually; only the User-Agent is sent.
    url = f'https://www.dbbqb.com/api/image/{image_num}'  # build the API URL
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.encoding = 'utf-8'
    if response.status_code != 200:  # guard against missing IDs and other failures
        print(f'Error (ID: {image_num})')
        return
    data = json.loads(response.text)
    try:
        path = data['path']
    except KeyError:
        print(f'Unexpected JSON data: {data} (ID: {image_num})')
        return
    img_url = f'https://image.dbbqb.com/{path}'
    download_image(img_url, image_num)
    print(f'Downloaded meme (ID: {image_num})')


def main():
    threads = []  # no thread queue, for simplicity: one thread per ID
    for i in range(1, 320001):
        th = Thread(target=download, args=(i,))  # note: a one-element tuple needs the trailing comma
        threads.append(th)
    for t in threads:
        t.start()


if __name__ == '__main__':
    main()
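The two URL patterns and the `path` lookup used in the script can also be isolated into small pure functions, which makes them easy to check without touching the network. This is only a sketch: the helper names (`detail_url`, `api_url`, `image_url_from_payload`) are made up for illustration, and the sample payload is invented, not a real API response.

```python
from typing import Optional

SITE = 'https://www.dbbqb.com'
IMAGE_HOST = 'https://image.dbbqb.com'


def detail_url(image_num: int) -> str:
    """Build the detail-page URL for a meme ID."""
    return f'{SITE}/detail/{image_num}.html'


def api_url(image_num: int) -> str:
    """Build the API URL for a meme ID."""
    return f'{SITE}/api/image/{image_num}'


def image_url_from_payload(data: dict) -> Optional[str]:
    """Return the image URL for a parsed JSON payload, or None if 'path' is missing."""
    path = data.get('path')
    if not path:
        return None
    return f'{IMAGE_HOST}/{path}'


if __name__ == '__main__':
    print(detail_url(320000))
    # Invented payload, only to show the shape the script relies on.
    print(image_url_from_payload({'path': 'example/abc.jpg'}))
```

Returning None for a missing `path` lets the caller decide how to report it, instead of raising KeyError.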
Note that some IDs have no meme, so the script prints error messages for them; this is expected.
Partial screenshots of the result:
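Starting one thread per ID means hundreds of thousands of threads, which many systems will refuse to create. A bounded pool is one possible alternative, and it can tally missing IDs instead of only printing them. The sketch below uses the standard-library `ThreadPoolExecutor`. `run_bounded` and `fake_download` are hypothetical names, and it assumes the download function is changed to return a label such as 'ok' or 'missing'; the script above does not do that.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def run_bounded(task: Callable[[int], str], ids: Iterable[int],
                max_workers: int = 16) -> Counter:
    """Run task(id) for every ID on at most max_workers threads and tally the labels."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order; Counter consumes them all
        # before the pool is shut down.
        return Counter(pool.map(task, ids))


def fake_download(image_num: int) -> str:
    # Stand-in for a real download: pretend every third ID has no meme.
    return 'missing' if image_num % 3 == 0 else 'ok'


if __name__ == '__main__':
    print(run_bounded(fake_download, range(1, 11), max_workers=4))
```

With a real download function, `max_workers` also caps how many requests hit the site at once, which is gentler on the server than starting every thread at the same time.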



