- 1. Page analysis
- 1.1 Page debugging
- 1.2 Request parameter analysis
- 2. Data scraping
- 2.1 Scraping test
- 2.2 Optimization
- 3. Saving the data
- 3.1 openpyxl
- 3.2 Full code
- Recommended reading
Hi everyone, I'm 【Python当打之年】.
In this post I'll walk through, step by step, how to scrape all the answer data under a Zhihu question. I hope you find it helpful.
Target URL:
https://www.zhihu.com/question/368550554
Fields scraped: for every answer under the question, the publish time, author, upvote count, and content (other fields can be added as needed).
1. Page Analysis
1.1 Page debugging
Press F12 to open the browser's developer tools and look for the URL that loads the answer data. Because the Zhihu question page only shows a limited number of answers at a time, scroll down to load a few more pages so the pattern becomes easier to spot.
Search the network responses for any visible page content. Here we use the sentence "推荐两部曾经被我低估的电影" as an example: as shown above, we find the answer content, matching exactly, in the response of the URL indexed as answers.
1.2 Request parameter analysis
Request URL:
datas parameters:
Pay particular attention to the limit and offset parameters. What exactly do they do? Search for the answers requests and compare them:
Pattern: limit is always 5, while offset increases in steps of 5 (5/10/15/20…).
In other words, limit is the number of answers returned per request, and offset is the index of the answer to start from.
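As a quick sanity check on that pattern, the offset for each page can be computed directly from the page index (a minimal sketch; the number of pages here is arbitrary):

```python
# Each request returns `limit` answers, so page N starts at offset N * limit.
limit = 5
offsets = [page * limit for page in range(5)]
print(offsets)  # [0, 5, 10, 15, 20]
```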
2. Data Scraping
2.1 Scraping test
Using the URL we analyzed above, let's build a test request:
import json
import time

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36',
}
for i in range(2):
    # offset must advance in steps of limit (5), so use i * 5
    url = f'https://www.zhihu.com/api/v4/questions/368550554/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Cis_labeled%2Cpaid_info%2Cpaid_info_content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_recognized%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cvip_info%2Cbadge%5B%2A%5D.topics%3Bdata%5B%2A%5D.settings.table_of_content.enabled&limit=5&offset={i * 5}&platform=desktop&sort_by=default'
    r = requests.get(url, headers=headers)
    r.raise_for_status()
    r.encoding = 'utf-8'
    datas = json.loads(r.text)
    for info in datas['data']:
        author = info['author']['name']
        created_time = time.strftime("%Y/%m/%d %H:%M:%S", time.localtime(info['created_time']))
        voteup_count = info['voteup_count']
        text = info['excerpt']
        oneinfo = [created_time, author, voteup_count, text]
        print(oneinfo)
    print('+++++++++++++++++++++++++++')
Result:
With this we can loop through the pages and collect all of the data.
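Rather than hard-coding that long percent-encoded URL, the query string can also be assembled with urllib.parse.urlencode. This is just a sketch of the paging part; the long include parameter from the original URL is left out for brevity:

```python
from urllib.parse import urlencode

# Build the answers API URL from a parameter dict instead of a literal string.
base = 'https://www.zhihu.com/api/v4/questions/368550554/answers'
params = {'limit': 5, 'offset': 10, 'platform': 'desktop', 'sort_by': 'default'}
url = f'{base}?{urlencode(params)}'
print(url)
```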
2.2 Optimization
Testing shows that limit can be raised to at most 20, which cuts the number of requests to a quarter.
With the approach above, however, we need to know in advance how many answers there are in order to work out how many iterations the loop needs:
Number of iterations = ceil(total answers / limit)
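For example, with a hypothetical total of 183 answers and limit=20, rounding up gives the request count (the totals here are made up for illustration):

```python
import math

total, limit = 183, 20          # hypothetical answer count and page size
loops = math.ceil(total / limit)  # round up so the last partial page is fetched
print(loops)  # 10
```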
Careful readers will have noticed that the response we fetch also contains a paging block:
It directly provides:
the link to the previous page (previous)
the link to the next page (next)
whether this is the first page (is_start)
whether this is the last page (is_end)
the total number of answers (totals)
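A minimal sketch of how that paging block can drive the crawl, using a mocked response dict in place of a live API reply (the field names match the real response; the values are made up):

```python
# Mocked API response: only the paging block matters here.
resp = {
    'paging': {
        'previous': 'https://www.zhihu.com/api/v4/questions/368550554/answers?limit=20&offset=0',
        'next': 'https://www.zhihu.com/api/v4/questions/368550554/answers?limit=20&offset=40',
        'is_start': False,
        'is_end': False,
        'totals': 1523,
    }
}
paging = resp['paging']
# Follow `next` until `is_end` is true; no need to precompute the loop count.
next_url = None if paging['is_end'] else paging['next']
print(next_url)
```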
Recursive scraping:
def getinfo(url, headers):
    allinfo = []
    try:
        r = requests.get(url, headers=headers)
        r.raise_for_status()
        r.encoding = 'utf-8'
        datas = json.loads(r.text)
        for info in datas['data']:
            author = info['author']['name']
            created_time = time.strftime("%Y/%m/%d %H:%M:%S", time.localtime(info['created_time']))
            voteup_count = info['voteup_count']
            text = info['excerpt']
            oneinfo = [created_time, author, voteup_count, text]
            print(oneinfo)
            allinfo.append(oneinfo)
        if datas['paging']['is_end']:
            print('----')
            return
        next_url = datas['paging']['next']
        time.sleep(random.uniform(0.1, 20.1))
        return getinfo(next_url, headers)
    except requests.RequestException:
        # Retry the same URL; next_url would be unbound if the request itself failed
        time.sleep(5)
        return getinfo(url, headers)
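One design note: Python's default recursion limit is roughly 1000 frames, so on a question with thousands of answer pages the recursive version can raise RecursionError. An iterative rewrite avoids that. This sketch takes a fetch callable (e.g. a thin wrapper around requests.get) so it can be exercised without hitting the live API:

```python
def getinfo_iter(url, headers, fetch):
    """Iterative pagination: follow paging['next'] until is_end is true."""
    allinfo = []
    while url:
        datas = fetch(url, headers)  # e.g. lambda u, h: requests.get(u, headers=h).json()
        for info in datas['data']:
            allinfo.append([info['created_time'], info['author']['name'],
                            info['voteup_count'], info['excerpt']])
        if datas['paging']['is_end']:
            break
        url = datas['paging']['next']
    return allinfo
```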
3. Saving the Data
3.1 openpyxl
Here we use openpyxl to save the data to an Excel file; you could just as well write to another file format or a database:
def insert2excel(filepath, allinfo):
    try:
        if not os.path.exists(filepath):
            # First run: create the workbook with a header row
            tableTitle = ['Posted at', 'Author', 'Upvotes', 'Content']
            wb = Workbook()
            ws = wb.active
            ws.title = 'sheet1'
            ws.append(tableTitle)
            wb.save(filepath)
            time.sleep(3)
        wb = load_workbook(filepath)
        ws = wb.active
        for info in allinfo:
            ws.append(info)
        wb.save(filepath)
        print('File updated')
    except Exception:
        print('File update failed')
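For reference, the basic openpyxl append flow looks like this in isolation (in-memory only, with placeholder data; nothing is written to disk):

```python
from openpyxl import Workbook

wb = Workbook()
ws = wb.active
ws.append(['Posted at', 'Author', 'Upvotes', 'Content'])   # header row
ws.append(['2020/02/03 12:00:00', 'someone', 42, '...'])   # one data row
print(ws.max_row)  # 2
```

Each append adds one row below the last, which is why the function above can simply reopen the file and append new pages of answers.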
Result:
3.2 Full code
Below is the complete script. It runs locally as-is; there is still plenty of room for optimization, so feel free to adapt it:
import os
import json
import time
import random

import requests
from openpyxl import load_workbook, Workbook

# Data collection
def getinfo(url, headers):
    allinfo = []
    try:
        r = requests.get(url, headers=headers)
        r.raise_for_status()
        r.encoding = 'utf-8'
        datas = json.loads(r.text)
        for info in datas['data']:
            author = info['author']['name']
            created_time = time.strftime("%Y/%m/%d %H:%M:%S", time.localtime(info['created_time']))
            voteup_count = info['voteup_count']
            text = info['excerpt']
            oneinfo = [created_time, author, voteup_count, text]
            print(oneinfo)
            allinfo.append(oneinfo)
        insert2excel(filepath, allinfo)
        if datas['paging']['is_end']:
            print('----')
            return
        next_url = datas['paging']['next']
        time.sleep(random.uniform(5.1, 20.1))
        return getinfo(next_url, headers)
    except requests.RequestException:
        # Retry the same URL; next_url would be unbound if the request itself failed
        time.sleep(5)
        return getinfo(url, headers)

# Data saving
def insert2excel(filepath, allinfo):
    try:
        if not os.path.exists(filepath):
            # First run: create the workbook with a header row
            tableTitle = ['Posted at', 'Author', 'Upvotes', 'Content']
            wb = Workbook()
            ws = wb.active
            ws.title = 'sheet1'
            ws.append(tableTitle)
            wb.save(filepath)
            time.sleep(3)
        wb = load_workbook(filepath)
        ws = wb.active
        for info in allinfo:
            ws.append(info)
        wb.save(filepath)
        print('File updated')
    except Exception:
        print('File update failed')

url = 'https://www.zhihu.com/api/v4/questions/368550554/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Cis_labeled%2Cpaid_info%2Cpaid_info_content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_recognized%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cvip_info%2Cbadge%5B%2A%5D.topics%3Bdata%5B%2A%5D.settings.table_of_content.enabled&limit=20&offset=0&platform=desktop&sort_by=default'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36',
}
filepath = '368550554.xlsx'
getinfo(url, headers)
That's all for this issue, so get practicing! Original content takes real effort; if you enjoyed this post, please like, bookmark, or share it (with attribution) so more people can find it.
Recommended reading
Scraping and visualizing 20,000+ reviews of "White Snake 2: The Tribulation of the Green Snake"
Visualization | Analyzing Mid-Autumn mooncakes with Python: these flavors are the real deal!!!
123 essential everyday Pandas commands, so good!
Scraping + visualization | Animated world map of Tokyo 2020 Olympic medals
Pandas + Pyecharts | Analysis and visualization of second-hand housing data from a Beijing platform
Pandas + Pyecharts | 2021 Chinese university rankings: analysis + visualization
Visualization | Drawing beautiful typhoon track maps with Python
Visualization | Analyzing nearly 5,000 tourist attractions with Python to tell you where to go on holiday
Visualization | Animated Python map of 20 years of provincial GDP across China
Visualization | Python celebrates 520 with you: by your side, you by my side
Scraping | Python gets you every skin from the official Honor of Kings site
Scraping | Build your own IP proxy pool in Python and never run short of IPs again!
Tips | The 20 most useful and efficient PyCharm shortcuts (animated)
Tips | A 5000-word deep dive into Python's three formatting styles [% / format / f-string]
Tips | Sending scheduled emails with Python (with automatic attachments)
This article first appeared on the WeChat official account "Python当打之年", which posts Python programming tips every day. I hope you enjoy it!



