前言

环境部署

插件推荐

爬虫目标

项目创建

webdriver部署

项目代码

Item定义

中间件定义

定义爬虫

pipeline输出结果文本

配置文件改动

验证结果

总结

前言
闲来无聊，写了一个爬虫程序获取百度疫情数据。申明一下，研究而已。而且页面应该会进程做反爬处理，可能需要调整对应xpath。

Github仓库地址：代码仓库

本文主要使用的是scrapy框架。

环境部署
主要简单推荐一下

插件推荐
这里先推荐一个Google Chrome的扩展插件xpath helper，可以验证xpath语法是不是正确。

爬虫目标
需要爬取的页面：实时更新：新型冠状病毒肺炎疫情地图

主要爬取的目标选取了全国的数据以及各个身份的数据。

项目创建

使用scrapy命令创建项目

scrapy startproject yqsj

webdriver部署
这里就不重新讲一遍了，可以参考我这篇文章的部署方法：（Scrapy框架）爬虫2021年CSDN全站综合热榜标题热词 | 爬虫案例_阿良的博客-CSDN博客

项目代码
开始撸代码，看一下百度疫情省份数据的问题。

页面需要点击展开全部span。所以在提取页面源码的时候需要模拟浏览器打开后，点击该按钮。所以按照这个方向，我们一步步来。

Item定义

定义两个类YqsjProvinceItem和YqsjChinaItem，分别定义国内省份数据和国内数据。

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class YqsjProvinceItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    location = scrapy.Field()
    new = scrapy.Field()
    exist = scrapy.Field()
    total = scrapy.Field()
    cure = scrapy.Field()
    dead = scrapy.Field()


class YqsjChinaItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # 现有确诊
    exist_diagnosis = scrapy.Field()
    # 无症状
    asymptomatic = scrapy.Field()
    # 现有疑似
    exist_suspecte = scrapy.Field()
    # 现有重症
    exist_severe = scrapy.Field()
    # 累计确诊
    cumulative_diagnosis = scrapy.Field()
    # 境外输入
    overseas_input = scrapy.Field()
    # 累计治愈
    cumulative_cure = scrapy.Field()
    # 累计死亡
    cumulative_dead = scrapy.Field()

中间件定义

需要打开页面后点击一下展开全部。

完整代码

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals

# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter
from scrapy.http import HtmlResponse
from selenium.common.exceptions import TimeoutException
from selenium.webdriver import ActionChains
import time


class YqsjSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, or item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Request or item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class YqsjDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        # return None
        try:
            spider.browser.get(request.url)
            spider.browser.maximize_window()
            time.sleep(2)
            spider.browser.find_element_by_xpath("/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36'
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
   'yqsj.middlewares.YqsjSpiderMiddleware': 543,
}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
   'yqsj.middlewares.YqsjDownloaderMiddleware': 543,
}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'yqsj.pipelines.YqsjPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_ConCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

验证结果

看看结果文件

总结
emmmm，闲着无聊，写着玩，没啥好总结的。

分享：

修心，亦是修行之一。顺境修力，逆境修心，缺一不可。 ——《剑来》

如果本文对你有作用的话，不要吝啬你的赞，谢谢。

(Scrapy框架)爬虫获取百度新冠疫情数据 | 爬虫案例

前言
闲来无聊，写了一个爬虫程序获取百度疫情数据。申明一下，研究而已。而且页面应该会进程做反爬处理，可能需要调整对应xpath。

Github仓库地址：代码仓库

本文主要使用的是scrapy框架。

环境部署
主要简单推荐一下

插件推荐
这里先推荐一个Google Chrome的扩展插件xpath helper，可以验证xpath语法是不是正确。

爬虫目标
需要爬取的页面：实时更新：新型冠状病毒肺炎疫情地图

主要爬取的目标选取了全国的数据以及各个身份的数据。

项目创建
使用scrapy命令创建项目

scrapy startproject yqsj

webdriver部署
这里就不重新讲一遍了，可以参考我这篇文章的部署方法：（Scrapy框架）爬虫2021年CSDN全站综合热榜标题热词 | 爬虫案例_阿良的博客-CSDN博客

项目代码
开始撸代码，看一下百度疫情省份数据的问题。

页面需要点击展开全部span。所以在提取页面源码的时候需要模拟浏览器打开后，点击该按钮。所以按照这个方向，我们一步步来。

验证结果

看看结果文件

总结
emmmm，闲着无聊，写着玩，没啥好总结的。

分享：

修心，亦是修行之一。顺境修力，逆境修心，缺一不可。 ——《剑来》

如果本文对你有作用的话，不要吝啬你的赞，谢谢。

Python相关栏目本月热门文章

(Scrapy框架)爬虫获取百度新冠疫情数据 | 爬虫案例

前言 闲来无聊，写了一个爬虫程序获取百度疫情数据。申明一下，研究而已。而且页面应该会进程做反爬处理，可能需要调整对应xpath。 Github仓库地址：代码仓库 本文主要使用的是scrapy框架。

环境部署 主要简单推荐一下

插件推荐 这里先推荐一个Google Chrome的扩展插件xpath helper，可以验证xpath语法是不是正确。

爬虫目标 需要爬取的页面：实时更新：新型冠状病毒肺炎疫情地图 主要爬取的目标选取了全国的数据以及各个身份的数据。

项目创建 使用scrapy命令创建项目 scrapy startproject yqsj

webdriver部署 这里就不重新讲一遍了，可以参考我这篇文章的部署方法：（Scrapy框架）爬虫2021年CSDN全站综合热榜标题热词 | 爬虫案例_阿良的博客-CSDN博客

项目代码 开始撸代码，看一下百度疫情省份数据的问题。 页面需要点击展开全部span。所以在提取页面源码的时候需要模拟浏览器打开后，点击该按钮。所以按照这个方向，我们一步步来。

验证结果 看看结果文件

总结 emmmm，闲着无聊，写着玩，没啥好总结的。 分享： 修心，亦是修行之一。顺境修力，逆境修心，缺一不可。 ——《剑来》 如果本文对你有作用的话，不要吝啬你的赞，谢谢。

Python相关栏目本月热门文章

前言
闲来无聊，写了一个爬虫程序获取百度疫情数据。申明一下，研究而已。而且页面应该会进程做反爬处理，可能需要调整对应xpath。

Github仓库地址：代码仓库

本文主要使用的是scrapy框架。

环境部署
主要简单推荐一下

插件推荐
这里先推荐一个Google Chrome的扩展插件xpath helper，可以验证xpath语法是不是正确。

爬虫目标
需要爬取的页面：实时更新：新型冠状病毒肺炎疫情地图

主要爬取的目标选取了全国的数据以及各个身份的数据。

项目创建
使用scrapy命令创建项目

scrapy startproject yqsj

webdriver部署
这里就不重新讲一遍了，可以参考我这篇文章的部署方法：（Scrapy框架）爬虫2021年CSDN全站综合热榜标题热词 | 爬虫案例_阿良的博客-CSDN博客

项目代码
开始撸代码，看一下百度疫情省份数据的问题。

页面需要点击展开全部span。所以在提取页面源码的时候需要模拟浏览器打开后，点击该按钮。所以按照这个方向，我们一步步来。

验证结果

看看结果文件

总结
emmmm，闲着无聊，写着玩，没啥好总结的。

分享：

修心，亦是修行之一。顺境修力，逆境修心，缺一不可。 ——《剑来》

如果本文对你有作用的话，不要吝啬你的赞，谢谢。