- 1. Page analysis
- 1.1 Page debugging
- 1.2 Request parameter analysis
- 2. Data scraping
- 2.1 Scraping test
- 2.2 Optimization
- 3. Saving the data
- 3.1 openpyxl
- 3.2 Full code
- Recommended reading
Hi everyone, I'm 【Python当打之年】.
In this post I'll walk through, step by step, how to scrape all the answer data under a Zhihu question. I hope you find it helpful.
Target URL:
https://www.zhihu.com/question/368550554
Fields scraped: for every answer under the question, the publish time, author, upvote count, and content (other fields can be added as needed).
1. Page Analysis
1.1 Page debugging
Press F12 to open the browser's developer tools and look for the URL that loads the answer data. Because the Zhihu question page only shows a limited number of answers at a time, scroll down to load a few more pages so the pattern becomes easier to spot.
Search the network responses for any visible page content. Here we use the sentence "推荐两部曾经被我低估的电影" as an example: as shown above, we find the answer content, matching exactly, in the response of the URL indexed as answers.
1.2 Request parameter analysis
Request URL:
datas parameters:
Pay particular attention to the limit and offset parameters. What exactly do they do? Search for the answers requests and compare them:
Pattern: limit is always 5, while offset increases in steps of 5 (5/10/15/20…).
In other words, limit is the number of answers returned per request, and offset is the index of the answer to start from.
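As a quick sanity check on that pattern, the offset for each page can be computed directly from the page index (a minimal sketch; the number of pages here is arbitrary):

```python
# Each request returns `limit` answers, so page N starts at offset N * limit.
limit = 5
offsets = [page * limit for page in range(5)]
print(offsets)  # [0, 5, 10, 15, 20]
```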
2. Data Scraping
2.1 Scraping test
Using the URL we analyzed above, let's build a test request:
import json
import time

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36',
}
for i in range(2):
    # offset must advance in steps of limit (5), so use i * 5
    url = f'https://www.zhihu.com/api/v4/questions/368550554/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Cis_labeled%2Cpaid_info%2Cpaid_info_content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_recognized%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cvip_info%2Cbadge%5B%2A%5D.topics%3Bdata%5B%2A%5D.settings.table_of_content.enabled&limit=5&offset={i * 5}&platform=desktop&sort_by=default'
    r = requests.get(url, headers=headers)
    r.raise_for_status()
    r.encoding = 'utf-8'
    datas = json.loads(r.text)
    for info in datas['data']:
        author = info['author']['name']
        created_time = time.strftime("%Y/%m/%d %H:%M:%S", time.localtime(info['created_time']))
        voteup_count = info['voteup_count']
        text = info['excerpt']
        oneinfo = [created_time, author, voteup_count, text]
        print(oneinfo)
    print('+++++++++++++++++++++++++++')
Result:
With this we can loop through the pages and collect all of the data.
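Rather than hard-coding that long percent-encoded URL, the query string can also be assembled with urllib.parse.urlencode. This is just a sketch of the paging part; the long include parameter from the original URL is left out for brevity:

```python
from urllib.parse import urlencode

# Build the answers API URL from a parameter dict instead of a literal string.
base = 'https://www.zhihu.com/api/v4/questions/368550554/answers'
params = {'limit': 5, 'offset': 10, 'platform': 'desktop', 'sort_by': 'default'}
url = f'{base}?{urlencode(params)}'
print(url)
```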
2.2 Optimization
Testing shows that limit can be raised to at most 20, which cuts the number of requests to a quarter.
With the approach above, however, we need to know in advance how many answers there are in order to work out how many iterations the loop needs:
Number of iterations = ceil(total answers / limit)
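For example, with a hypothetical total of 183 answers and limit=20, rounding up gives the request count (the totals here are made up for illustration):

```python
import math

total, limit = 183, 20          # hypothetical answer count and page size
loops = math.ceil(total / limit)  # round up so the last partial page is fetched
print(loops)  # 10
```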
Careful readers will have noticed that the response we fetch also contains a paging block:
It directly provides:
the link to the previous page (previous)
the link to the next page (next)
whether this is the first page (is_start)
whether this is the last page (is_end)
the total number of answers (totals)
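A minimal sketch of how that paging block can drive the crawl, using a mocked response dict in place of a live API reply (the field names match the real response; the values are made up):

```python
# Mocked API response: only the paging block matters here.
resp = {
    'paging': {
        'previous': 'https://www.zhihu.com/api/v4/questions/368550554/answers?limit=20&offset=0',
        'next': 'https://www.zhihu.com/api/v4/questions/368550554/answers?limit=20&offset=40',
        'is_start': False,
        'is_end': False,
        'totals': 1523,
    }
}
paging = resp['paging']
# Follow `next` until `is_end` is true; no need to precompute the loop count.
next_url = None if paging['is_end'] else paging['next']
print(next_url)
```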
Recursive scraping:
def getinfo(url, headers):
    allinfo = []
    try:
        r = requests.get(url, headers=headers)
        r.raise_for_status()
        r.encoding = 'utf-8'
        datas = json.loads(r.text)
        for info in datas['data']:
            author = info['author']['name']
            created_time = time.strftime("%Y/%m/%d %H:%M:%S", time.localtime(info['created_time']))
            voteup_count = info['voteup_count']
            text = info['excerpt']
            oneinfo = [created_time, author, voteup_count, text]
            print(oneinfo)
            allinfo.append(oneinfo)
        if datas['paging']['is_end']:
            print('----')
            return
        next_url = datas['paging']['next']
        time.sleep(random.uniform(0.1, 20.1))
        return getinfo(next_url, headers)
    except requests.RequestException:
        # Retry the same URL; next_url would be unbound if the request itself failed
        time.sleep(5)
        return getinfo(url, headers)
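One design note: Python's default recursion limit is roughly 1000 frames, so on a question with thousands of answer pages the recursive version can raise RecursionError. An iterative rewrite avoids that. This sketch takes a fetch callable (e.g. a thin wrapper around requests.get) so it can be exercised without hitting the live API:

```python
def getinfo_iter(url, headers, fetch):
    """Iterative pagination: follow paging['next'] until is_end is true."""
    allinfo = []
    while url:
        datas = fetch(url, headers)  # e.g. lambda u, h: requests.get(u, headers=h).json()
        for info in datas['data']:
            allinfo.append([info['created_time'], info['author']['name'],
                            info['voteup_count'], info['excerpt']])
        if datas['paging']['is_end']:
            break
        url = datas['paging']['next']
    return allinfo
```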
3. Saving the Data
3.1 openpyxl
Here we use openpyxl to save the data to an Excel file; you could just as well write to another file format or a database:
def insert2excel(filepath, allinfo):
    try:
        if not os.path.exists(filepath):
            # First run: create the workbook with a header row
            tableTitle = ['Posted at', 'Author', 'Upvotes', 'Content']
            wb = Workbook()
            ws = wb.active
            ws.title = 'sheet1'
            ws.append(tableTitle)
            wb.save(filepath)
            time.sleep(3)
        wb = load_workbook(filepath)
        ws = wb.active
        for info in allinfo:
            ws.append(info)
        wb.save(filepath)
        print('File updated')
    except Exception:
        print('File update failed')
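For reference, the basic openpyxl append flow looks like this in isolation (in-memory only, with placeholder data; nothing is written to disk):

```python
from openpyxl import Workbook

wb = Workbook()
ws = wb.active
ws.append(['Posted at', 'Author', 'Upvotes', 'Content'])   # header row
ws.append(['2020/02/03 12:00:00', 'someone', 42, '...'])   # one data row
print(ws.max_row)  # 2
```

Each append adds one row below the last, which is why the function above can simply reopen the file and append new pages of answers.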
Result:
3.2 Full code
Below is the complete script. It runs locally as-is; there is still plenty of room for optimization, so feel free to adapt it:
import os
import json
import time
import random

import requests
from openpyxl import load_workbook, Workbook

# Data collection
def getinfo(url, headers):
    allinfo = []
    try:
        r = requests.get(url, headers=headers)
        r.raise_for_status()
        r.encoding = 'utf-8'
        datas = json.loads(r.text)
        for info in datas['data']:
            author = info['author']['name']
            created_time = time.strftime("%Y/%m/%d %H:%M:%S", time.localtime(info['created_time']))
            voteup_count = info['voteup_count']
            text = info['excerpt']
            oneinfo = [created_time, author, voteup_count, text]
            print(oneinfo)
            allinfo.append(oneinfo)
        insert2excel(filepath, allinfo)
        if datas['paging']['is_end']:
            print('----')
            return
        next_url = datas['paging']['next']
        time.sleep(random.uniform(5.1, 20.1))
        return getinfo(next_url, headers)
    except requests.RequestException:
        # Retry the same URL; next_url would be unbound if the request itself failed
        time.sleep(5)
        return getinfo(url, headers)

# Data saving
def insert2excel(filepath, allinfo):
    try:
        if not os.path.exists(filepath):
            # First run: create the workbook with a header row
            tableTitle = ['Posted at', 'Author', 'Upvotes', 'Content']
            wb = Workbook()
            ws = wb.active
            ws.title = 'sheet1'
            ws.append(tableTitle)
            wb.save(filepath)
            time.sleep(3)
        wb = load_workbook(filepath)
        ws = wb.active
        for info in allinfo:
            ws.append(info)
        wb.save(filepath)
        print('File updated')
    except Exception:
        print('File update failed')

url = 'https://www.zhihu.com/api/v4/questions/368550554/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Cis_labeled%2Cpaid_info%2Cpaid_info_content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_recognized%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cvip_info%2Cbadge%5B%2A%5D.topics%3Bdata%5B%2A%5D.settings.table_of_content.enabled&limit=20&offset=0&platform=desktop&sort_by=default'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36',
}
filepath = '368550554.xlsx'
getinfo(url, headers)
That's all for this issue, so get practicing! Original content takes real effort; if you enjoyed this post, please like, bookmark, or share it (with attribution) so more people can find it.
Recommended reading
Scraping and visualizing 20,000+ reviews of "White Snake 2: The Tribulation of the Green Snake"
Visualization | Analyzing Mid-Autumn mooncakes with Python: these flavors are the real deal!!!
123 essential everyday Pandas commands, so good!
Scraping + visualization | Animated world map of Tokyo 2020 Olympic medals
Pandas + Pyecharts | Analysis and visualization of second-hand housing data from a Beijing platform
Pandas + Pyecharts | 2021 Chinese university rankings: analysis + visualization
Visualization | Drawing beautiful typhoon track maps with Python
Visualization | Analyzing nearly 5,000 tourist attractions with Python to tell you where to go on holiday
Visualization | Animated Python map of 20 years of provincial GDP across China
Visualization | Python celebrates 520 with you: by your side, you by my side
Scraping | Python gets you every skin from the official Honor of Kings site
Scraping | Build your own IP proxy pool in Python and never run short of IPs again!
Tips | The 20 most useful and efficient PyCharm shortcuts (animated)
Tips | A 5000-word deep dive into Python's three formatting styles [% / format / f-string]
Tips | Sending scheduled emails with Python (with automatic attachments)
This article first appeared on the WeChat official account "Python当打之年", which posts Python programming tips every day. I hope you enjoy it!



