
Learning the Scrapy Framework (Part 1)

1. Scrapy Overview

1. Why learn the Scrapy framework?
  • It is a must-have crawling technology, and interviews often ask about it.
  • It makes our crawlers faster and more powerful (it supports asynchronous crawling).
2. What is Scrapy?

  • An asynchronous crawler framework: Scrapy is a Python-based crawler framework for crawling websites and extracting structured data from their pages. It is currently the most popular crawler framework in the Python ecosystem; its architecture is clean and highly extensible, so it can handle all kinds of crawling needs flexibly and efficiently.
    Program state transition diagram:
3. How to learn Scrapy?
  • Official site: https://scrapy.org/
  • Official documentation 1 (Chinese): https://scrapy-chs.readthedocs.io/zh_CN/0.24/intro/tutorial.html
  • Official documentation 2 (English): https://docs.scrapy.org/en/latest/
4. The Scrapy Workflow



Division of labor

Component | What it does | Who implements it
Scrapy Engine | The conductor: passes data and signals between the other components | Implemented by Scrapy
Scheduler | A queue that stores the requests handed over by the engine | Implemented by Scrapy
Downloader | Downloads the requests sent by the engine and returns the page source (the response) to the engine | Implemented by Scrapy
Spider | Processes responses from the engine, extracts data and URLs, and hands them back to the engine | Written by you
Item Pipeline | Processes the data handed over by the engine, e.g. stores it | Written by you
Downloader Middlewares | Customizable download extensions, e.g. setting a proxy | Usually not needed
Spider Middlewares | Customize requests and filter responses | Usually not needed
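The division of labor above can be made concrete with a toy event loop in plain Python (no Scrapy here; every name below is invented for illustration): the engine pulls requests from the scheduler queue, hands them to a downloader, gives responses to a spider callback, and routes yielded items through a pipeline.

```python
from collections import deque

def toy_engine(start_urls, download, parse, pipeline):
    """A deliberately simplified sketch of Scrapy's data flow:
    Scheduler (queue) -> Downloader -> Spider -> Item Pipeline."""
    scheduler = deque(start_urls)        # Scheduler: queue of pending requests
    seen = set(start_urls)               # dedup, like Scrapy's request dupefilter
    items = []
    while scheduler:
        url = scheduler.popleft()        # engine takes a request from the scheduler
        response = download(url)         # Downloader fetches it
        for result in parse(response):   # Spider yields items and new URLs
            if isinstance(result, dict):       # an item -> Item Pipeline
                items.append(pipeline(result))
            elif result not in seen:           # a new URL -> back to the Scheduler
                seen.add(result)
                scheduler.append(result)
    return items

# A tiny fake "site" to drive the loop: url -> (quote text, next url)
pages = {"p1": ("quote A", "p2"), "p2": ("quote B", None)}

def download(url):
    return (url, *pages[url])

def parse(response):
    url, text, next_url = response
    yield {"text": text}        # extracted item
    if next_url:
        yield next_url          # follow-up request

print(toy_engine(["p1"], download, parse, lambda item: item))
# [{'text': 'quote A'}, {'text': 'quote B'}]
```

Scrapy's real engine does this asynchronously on top of Twisted, which is where the speed advantage mentioned above comes from.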
2. Scrapy Quick Start (a small example)

1. Installation
pip install scrapy
pip install scrapy==2.5.1   # install a specific version (2.5.1)


Run the "scrapy" command in a terminal to verify the installation:

If it prints Scrapy's version and the list of available commands, the installation succeeded.

2. Create a project
  • Open a terminal (cmd) in the directory where the project should be saved.
# scrapy startproject <project-name>
scrapy startproject my_Scrapy

3. Project structure

  • my_Scrapy
    • my_Scrapy
      • spiders
        • __init__.py
      • __init__.py
      • items.py
      • middlewares.py
      • pipelines.py
      • settings.py
    • scrapy.cfg

What each file does:

  • scrapy.cfg: the Scrapy project configuration file; it records the path to the project settings module and deployment information. (Usually left unchanged.)
  • items.py: defines the item data structures; all item definitions can live here. (Declares which fields the crawl collects.)
  • pipelines.py: defines the Item Pipeline implementations.
  • settings.py: defines the project-wide configuration.
  • middlewares.py: defines the Spider Middlewares and Downloader Middlewares.
  • spiders: contains the individual spiders, one .py file per spider.
4. Create a spider
# First change into the project directory:
cd my_Scrapy
# scrapy genspider <spider-file-name> <domain-to-crawl>
scrapy genspider spider1 www.baidu.com


  • Edit the spider1.py file:
import scrapy


class Spider1Spider(scrapy.Spider):
    # Spider name; remember it, since the spider is started by this name:
    name = 'spider1'
    # Domains the spider may crawl (keeps the crawler from wandering off to other sites).
    # Note: this should be a bare domain such as 'quotes.toscrape.com'; Scrapy ignores
    # URL entries here and logs a URLWarning (visible in the run log later on).
    allowed_domains = ['http://quotes.toscrape.com/']
    # Initial request URLs:
    start_urls = ['http://quotes.toscrape.com/']

    # Parse method:
    def parse(self, response):
        print(response.text)

Official example site used in this tutorial: http://quotes.toscrape.com/

5. Create the item
  • An item is the container that holds the scraped data; it defines the structure of what is crawled.
    Edit the project's items.py file as follows:
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy

class MyScrapyItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()

    # Target content to collect: the quote, the author, and the tags
    # Quote text:
    text = scrapy.Field()
    # Author:
    author = scrapy.Field()
    # Tags:
    tags = scrapy.Field()
6. Parse the Response

1. Crawl only the first page
  • Edit the parse() method in spider1.py; it extracts the target content from the page source.
import scrapy
from lxml import etree


class Spider1Spider(scrapy.Spider):
    # Spider name; the spider is started by this name:
    name = 'spider1'
    # Domains the spider may crawl (keeps the crawler from wandering off to other sites):
    allowed_domains = ['http://quotes.toscrape.com/']
    # Initial request URLs:
    start_urls = ['http://quotes.toscrape.com/']

    # Parse method:
    def parse(self, response):
        # print(response.text)
        """
        Parse the response: extract data, or generate further requests to process.
        :param response:
        :return:
        """
        # Method 1: parse with CSS selectors
        """
        quotes = response.css('.quote')   # a list of selectors
        for quote in quotes:
            # text = quote.css('span.text::text')   # ::text selects the inner text (note: this is still a selector object)
            # Old API:
            # extract_first()  returns the first match (a string)
            # extract()        returns all matches (a list of strings)
            text = quote.css('span.text::text').extract_first()   # extract_first() pulls the text out of the selector object
            # author = quote.css('small.author::text')  # author (a selector object)
            author = quote.css('small.author::text').extract_first()  # author (the text itself)
            # tags = quote.css('div.tags a.tag::text')   # tags (selector objects)
            tags = quote.css('div.tags a.tag::text').extract()   # all matches
            # print(tags)
            # print(text, '      ——————', author, tags)

            # New API:
            # get()     returns one match
            # getall()  returns all matches
            text = quote.css('span.text::text').get()
            author = quote.css('small.author::text').get()
            tags = quote.css('div.tags a.tag::text').getall()
            print(text, tags, '      ——————', author)
        """

        # Method 2: parse with XPath (via lxml)
        html = etree.HTML(response.text)
        quotes_divs = html.xpath('//div[@class="quote"]')
        for quote_div in quotes_divs:
            text = quote_div.xpath('./span[1]/text()')[0]
            author = quote_div.xpath('./span[2]/small/text()')[0]
            tags = quote_div.xpath('./div[@class="tags"]/a/text()')
            print(text, tags, '    ------', author)
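The XPath expressions used above can be tried without Scrapy or lxml: the standard library's `xml.etree.ElementTree` understands the same attribute and position predicates. A minimal sketch against a made-up, well-formed miniature of one quote block (not the real page markup):

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed stand-in for one quote block on quotes.toscrape.com:
html = """
<body>
  <div class="quote">
    <span class="text">Quote one</span>
    <span>by <small class="author">Author A</small></span>
    <div class="tags"><a class="tag">life</a><a class="tag">love</a></div>
  </div>
</body>
"""

root = ET.fromstring(html)
for quote_div in root.findall('.//div[@class="quote"]'):
    text = quote_div.find('./span[1]').text          # first <span>: the quote text
    author = quote_div.find('./span[2]/small').text  # second <span>: the author
    tags = [a.text for a in quote_div.findall('./div[@class="tags"]/a')]
    print(text, tags, '------', author)
```

In a real spider, the idiomatic route is `response.xpath(...)` (parsel) rather than re-parsing `response.text` with lxml; the tutorial uses lxml here to show that either works.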

Result of running the start.py file:

D:\Anaconda\python.exe C:/Users/lv/Desktop/scrapy框架的学习/my_Scrapy/start.py
2022-04-03 19:24:47 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: my_Scrapy)
2022-04-03 19:24:47 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 18.7.0, Python 3.7.0 (default, Jun 28 2018, 08:04:48) [MSC v.1912 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.0.2p  14 Aug 2018), cryptography 2.3.1, Platform Windows-10-10.0.19041-SP0
2022-04-03 19:24:47 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-04-03 19:24:47 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'my_Scrapy',
 'NEWSPIDER_MODULE': 'my_Scrapy.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['my_Scrapy.spiders']}
2022-04-03 19:24:47 [scrapy.extensions.telnet] INFO: Telnet Password: e2250e171a87ebd6
2022-04-03 19:24:47 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2022-04-03 19:24:47 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-04-03 19:24:47 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-04-03 19:24:47 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-04-03 19:24:47 [scrapy.core.engine] INFO: Spider opened
2022-04-03 19:24:47 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-04-03 19:24:47 [py.warnings] WARNING: D:\Anaconda\lib\site-packages\scrapy\spidermiddlewares\offsite.py:65: URLWarning: allowed_domains accepts only domains, not URLs. Ignoring URL entry http://quotes.toscrape.com/ in allowed_domains.
  warnings.warn(message, URLWarning)

2022-04-03 19:24:47 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-04-03 19:24:48 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2022-04-03 19:24:48 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/> (referer: None)
2022-04-03 19:24:48 [scrapy.core.engine] INFO: Closing spider (finished)
2022-04-03 19:24:48 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 448,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 2582,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/404': 1,
 'elapsed_time_seconds': 1.264597,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 4, 3, 11, 24, 48, 658256),
 'httpcompression/response_bytes': 11053,
 'httpcompression/response_count': 1,
 'log_count/DEBUG': 2,
 'log_count/INFO': 10,
 'log_count/WARNING': 1,
 'response_received_count': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/404': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2022, 4, 3, 11, 24, 47, 393659)}
2022-04-03 19:24:48 [scrapy.core.engine] INFO: Spider closed (finished)
“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.” ['change', 'deep-thoughts', 'thinking', 'world']     ------ Albert Einstein
“It is our choices, Harry, that show what we truly are, far more than our abilities.” ['abilities', 'choices']     ------ J.K. Rowling
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.” ['inspirational', 'life', 'live', 'miracle', 'miracles']     ------ Albert Einstein
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.” ['aliteracy', 'books', 'classic', 'humor']     ------ Jane Austen
“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.” ['be-yourself', 'inspirational']     ------ Marilyn Monroe
“Try not to become a man of success. Rather become a man of value.” ['adulthood', 'success', 'value']     ------ Albert Einstein
“It is better to be hated for what you are than to be loved for what you are not.” ['life', 'love']     ------ André Gide
“I have not failed. I've just found 10,000 ways that won't work.” ['edison', 'failure', 'inspirational', 'paraphrased']     ------ Thomas A. Edison
“A woman is like a tea bag; you never know how strong it is until it's in hot water.” ['misattributed-eleanor-roosevelt']     ------ Eleanor Roosevelt
“A day without sunshine is like, you know, night.” ['humor', 'obvious', 'simile']     ------ Steve Martin

Process finished with exit code 0

2. Crawl across pages
  • The main changes are to a few statements in spider1.py.
import scrapy
from lxml import etree
from my_Scrapy.items import MyScrapyItem   # import the MyScrapyItem class from items.py


class Spider1Spider(scrapy.Spider):
    # Spider name; the spider is started by this name:
    name = 'spider1'
    # # Domains the spider may crawl (keeps the crawler from wandering off to other sites)
    # allowed_domains = ['http://quotes.toscrape.com/']   # with no restriction, the spider can keep following "next page" links
    # Initial request URLs:
    start_urls = ['http://quotes.toscrape.com/']

    # Parse method:
    def parse(self, response):
        # print(response.text)
        """
        Parse the response: extract data, or generate further requests to process.
        :param response:
        :return:
        """
        # Method 1: parse with CSS selectors
        """
        quotes = response.css('.quote')   # a list of selectors
        for quote in quotes:
            # text = quote.css('span.text::text')   # ::text selects the inner text (note: this is still a selector object)
            # Old API:
            # extract_first()  returns the first match (a string)
            # extract()        returns all matches (a list of strings)
            text = quote.css('span.text::text').extract_first()   # extract_first() pulls the text out of the selector object
            # author = quote.css('small.author::text')  # author (a selector object)
            author = quote.css('small.author::text').extract_first()  # author (the text itself)
            # tags = quote.css('div.tags a.tag::text')   # tags (selector objects)
            tags = quote.css('div.tags a.tag::text').extract()   # all matches
            # print(tags)
            # print(text, '      ——————', author, tags)

            # New API:
            # get()     returns one match
            # getall()  returns all matches
            text = quote.css('span.text::text').get()
            author = quote.css('small.author::text').get()
            tags = quote.css('div.tags a.tag::text').getall()
            print(text, tags, '      ——————', author)
        """

        # Method 2: parse with XPath (via lxml)
        html = etree.HTML(response.text)
        quotes_divs = html.xpath('//div[@class="quote"]')
        for quote_div in quotes_divs:
            text = quote_div.xpath('./span[1]/text()')[0]
            author = quote_div.xpath('./span[2]/small/text()')[0]
            tags = quote_div.xpath('./div[@class="tags"]/a/text()')
            # print(text, tags, '    ------', author)

            # Put the data into the item container so it can be saved later
            item = MyScrapyItem()
            item['text'] = text
            item['author'] = author
            item['tags'] = tags
            # print(item)

            # Yield each item so it is handed to the Item Pipeline
            yield item

        # Pagination
        next = response.css('ul.pager li.next a::attr("href")').get()   # href of the "Next" link
        print(next)  # /page/2/
        url = self.start_urls[0]   # the URL currently being crawled (option 1)
        # print(url)
        url = response.url    # the URL currently being crawled (option 2)
        # print(url)
        # Join it with the relative href to form the next page's URL
        url = response.urljoin(next)
        print(url)
        # Hand a new request for it to the scheduler
        yield scrapy.Request(url, callback=self.parse)   # the new response is parsed by this same parse() method
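`response.urljoin(next)` resolves the (usually root-relative) href against the URL of the page being parsed, following standard URL-joining rules. The same resolution can be seen with the standard library's `urllib.parse.urljoin`:

```python
from urllib.parse import urljoin

base = 'http://quotes.toscrape.com/page/2/'   # URL of the page currently being parsed

# A root-relative href like the pager's "/page/3/" replaces the whole path:
print(urljoin(base, '/page/3/'))   # http://quotes.toscrape.com/page/3/

# A relative href (no leading slash) resolves under the current directory:
print(urljoin(base, 'tag/life/'))  # http://quotes.toscrape.com/page/2/tag/life/
```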

Partial output of the run:

7. Save the Data

1. Save via a scrapy command

1. Option 1: run the command in a terminal (the output format — csv, json, jl, or xml — is inferred from the file extension)
# scrapy crawl <spider-name> -o <output-file>
scrapy crawl spider1 -o demo.csv

2. Option 2: change the command line in the start.py launcher
# A Scrapy crawler is a project: you cannot right-click-run the spider file itself;
# it must be started from a terminal with the command "scrapy crawl <spider-name>".
# If you'd rather not type the command each time, create this start.py file.
from scrapy import cmdline

# cmdline.execute('scrapy crawl spider1'.split())   # invoke the terminal command
cmdline.execute('scrapy crawl spider1 -o demo.csv'.split())

# The red console text is not an error; it is Scrapy's own startup log. The white text is what print() outputs.

2. Save via custom code (edit pipelines.py)
  1. Edit pipelines.py as follows:
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class MyScrapyPipeline:
    def process_item(self, item, spider):
        with open('demo.txt', 'a', encoding="utf-8") as f:
            f.write(item['text'] + '           ——' + item['author'] + '\n')
        return item
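Opening the file once per item works, but a pipeline can also hold the file open for the whole run: Scrapy calls a pipeline's `open_spider()` and `close_spider()` hooks, if present, when the spider starts and stops. A sketch of that variant (same hypothetical demo.txt file; it must likewise be registered in ITEM_PIPELINES):

```python
class MyScrapyFilePipeline:
    """Variant of the pipeline above: keep one file handle for the whole run."""

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open('demo.txt', 'a', encoding='utf-8')

    def close_spider(self, spider):
        # Called once when the spider closes.
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(item['text'] + '           ——' + item['author'] + '\n')
        return item
```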

  2. Uncomment the ITEM_PIPELINES block in settings.py (otherwise the items never reach the pipeline and nothing is written to the txt file):
# Scrapy settings for my_Scrapy project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'my_Scrapy'

SPIDER_MODULES = ['my_Scrapy.spiders']
NEWSPIDER_MODULE = 'my_Scrapy.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'my_Scrapy (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'my_Scrapy.middlewares.MyScrapySpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'my_Scrapy.middlewares.MyScrapyDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'my_Scrapy.pipelines.MyScrapyPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

  3. Screenshot of the run result:
8. Run the Project

1. Run in a terminal
# scrapy crawl <spider-name>
scrapy crawl spider1

The first part of the output is the crawler's startup log:

The middle part is the page source:

The last part is the shutdown log:

2. Run from PyCharm

Create a launcher file start.py in the project folder:

# A Scrapy crawler is a project: you cannot right-click-run the spider file itself;
# it must be started from a terminal with the command "scrapy crawl <spider-name>".
# If you'd rather not type the command each time, create this start.py file.
from scrapy import cmdline

cmdline.execute('scrapy crawl spider1'.split())   # invoke the terminal command

# The red console text is not an error; it is the crawler's startup log. The white text is what print() outputs.

Run result:

D:\Anaconda\python.exe C:/Users/lv/Desktop/scrapy框架的学习/my_Scrapy/start.py
2022-04-03 16:34:35 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: my_Scrapy)
2022-04-03 16:34:35 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 18.7.0, Python 3.7.0 (default, Jun 28 2018, 08:04:48) [MSC v.1912 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.0.2p  14 Aug 2018), cryptography 2.3.1, Platform Windows-10-10.0.19041-SP0
2022-04-03 16:34:35 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-04-03 16:34:35 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'my_Scrapy',
 'NEWSPIDER_MODULE': 'my_Scrapy.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['my_Scrapy.spiders']}
2022-04-03 16:34:35 [scrapy.extensions.telnet] INFO: Telnet Password: b9d4a8fccbb5b978
2022-04-03 16:34:35 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2022-04-03 16:34:35 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-04-03 16:34:35 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-04-03 16:34:35 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-04-03 16:34:35 [scrapy.core.engine] INFO: Spider opened
2022-04-03 16:34:35 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-04-03 16:34:35 [py.warnings] WARNING: D:\Anaconda\lib\site-packages\scrapy\spidermiddlewares\offsite.py:65: URLWarning: allowed_domains accepts only domains, not URLs. Ignoring URL entry http://quotes.toscrape.com/ in allowed_domains.
  warnings.warn(message, URLWarning)

2022-04-03 16:34:35 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-04-03 16:34:36 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2022-04-03 16:34:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/> (referer: None)

(Here print(response.text) dumps the full HTML source of the "Quotes to Scrape" page, containing the ten quotes of page 1; omitted.)

2022-04-03 16:34:36 [scrapy.core.engine] INFO: Closing spider (finished)
2022-04-03 16:34:36 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 448,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 2578,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/404': 1,
 'elapsed_time_seconds': 1.29309,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 4, 3, 8, 34, 36, 608493),
 'httpcompression/response_bytes': 11053,
 'httpcompression/response_count': 1,
 'log_count/DEBUG': 2,
 'log_count/INFO': 10,
 'log_count/WARNING': 1,
 'response_received_count': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/404': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2022, 4, 3, 8, 34, 35, 315403)}
2022-04-03 16:34:36 [scrapy.core.engine] INFO: Spider closed (finished)

Process finished with exit code 0
3. Using the scrapy shell

1. Use the scrapy shell command in a terminal to test extraction against a single request

Target page: https://docs.scrapy.org/en/latest/_static/selectors-sample1.html
Enter the command in the terminal:

scrapy shell https://docs.scrapy.org/en/latest/_static/selectors-sample1.html

Microsoft Windows [Version 10.0.19042.1586]
(c) Microsoft Corporation. All rights reserved.

(base) C:\Users\吕成鑫\Desktop\scrapy框架的学习\my_Scrapy>scrapy shell https://docs.scrapy.org/en/latest/_static/selectors-sample1.html
2022-04-04 20:04:11 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: my_Scrapy)
2022-04-04 20:04:11 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 18.7.0, Python 3.7.0 (default, Jun 28 2018, 08:04:48) [MSC v.1912 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.0.2p  14 Aug 2018), cryptography 2.3.1, Platform Windows-10-10.0.19041-SP0
2022-04-04 20:04:11 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-04-04 20:04:11 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'my_Scrapy',
 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
 'LOGSTATS_INTERVAL': 0,
 'NEWSPIDER_MODULE': 'my_Scrapy.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['my_Scrapy.spiders']}
2022-04-04 20:04:11 [scrapy.extensions.telnet] INFO: Telnet Password: 358ca5f9dee7f2d7
2022-04-04 20:04:11 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole']
2022-04-04 20:04:11 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-04-04 20:04:11 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-04-04 20:04:11 [scrapy.middleware] INFO: Enabled item pipelines:
['my_Scrapy.pipelines.MyScrapyPipeline']
2022-04-04 20:04:11 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-04-04 20:04:11 [scrapy.core.engine] INFO: Spider opened
2022-04-04 20:04:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://docs.scrapy.org/robots.txt> (referer: None)
2022-04-04 20:04:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://docs.scrapy.org/en/latest/_static/selectors-sample1.html> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x...>
[s]   item       {}
[s]   request    <GET https://docs.scrapy.org/en/latest/_static/selectors-sample1.html>
[s]   response   <200 https://docs.scrapy.org/en/latest/_static/selectors-sample1.html>
[s]   settings   <scrapy.settings.Settings object at 0x...>
[s]   spider     <DefaultSpider 'default' at 0x...>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser

In [1]: response
Out[1]: <200 https://docs.scrapy.org/en/latest/_static/selectors-sample1.html>

In [2]: response.text        # the page's HTML source as a string (output omitted here)

In [3]: response.xpath('//a')               # a SelectorList of the page's five <a> elements

In [4]: response.xpath('//a').xpath('./img')    # chained XPath: the <img> inside each link

In [5]: response.xpath('//a').xpath('./img')[0]    # the first of those selectors

In [6]: response.xpath('//a').xpath('./img').getall()    # all five <img ...> strings

In [7]: response.xpath('//a').xpath('./img').get()       # just the first <img ...> string

In [8]: result = response.xpath('//a')

In [9]: result               # the same SelectorList as In [3]

In [10]: result.xpath('./img').getall()

In [11]: response.xpath("//img")     # the same five <img> elements, matched directly

In [12]: response.css('a')           # the same five links, via a CSS selector

In [13]: response.css('div#images')  # the <div id="images"> container

In [14]: response.css('div#images').get()    # its HTML as a string

In [15]: response.xpath('//a/text()').re(r'Name:\s(.*)')
Out[15]: ['My image 1 ', 'My image 2 ', 'My image 3 ', 'My image 4 ', 'My image 5 ']

In [16]: response.re('.*')      # re() cannot be called on the response itself; apply it to the result of .css()/.xpath()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-16-...> in <module>()
----> 1 response.re('.*')

AttributeError: 'HtmlResponse' object has no attribute 're'

In [17]:
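As the traceback shows, `re()` belongs to the SelectorList, not to the response; it simply runs a regular expression over each selector's extracted text. The equivalent with the standard `re` module, using two of the link texts from the sample page:

```python
import re

# Text content of two of the sample page's <a> elements:
links_text = ['Name: My image 1 ', 'Name: My image 2 ']

# What SelectorList.re(r'Name:\s(.*)') does, spelled out:
names = [m for t in links_text for m in re.findall(r'Name:\s(.*)', t)]
print(names)   # ['My image 1 ', 'My image 2 ']
```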

4. Implement Pagination

How do we go to the next page?

  • Recall: how did we send a next-page request with the requests module?

    • 1. Find the next page's URL
    • 2. Call requests.get(url)
  • The idea here is the same:

    • 1. Find the next page's URL
    • 2. Construct a request for that URL and hand it to the scheduler
1. Paginate by joining the next-page URL at the end of parse() and using a callback
import scrapy
from lxml import etree
from my_Scrapy.items import MyScrapyItem   # import the MyScrapyItem class from items.py


class Spider2Spider(scrapy.Spider):
    # Spider name; the spider is started by this name:
    name = 'spider2'
    # # Domains the spider may crawl (keeps the crawler from wandering off to other sites)
    # allowed_domains = ['quotes.toscrape.com/']   # with no restriction, the spider can keep following "next page" links
    # Initial request URL:
    base_url = 'http://quotes.toscrape.com/page/{}/'
    page = 1
    start_urls = [base_url.format(page)]

    # Parse method:
    def parse(self, response):
        # print(response.text)
        """
        Parse the response: extract data, or generate further requests to process.
        :param response:
        :return:
        """
        # Method 1: parse with CSS selectors
        """
        quotes = response.css('.quote')   # a list of selectors
        for quote in quotes:
            # text = quote.css('span.text::text')   # ::text selects the inner text (note: this is still a selector object)
            # Old API:
            # extract_first()  returns the first match (a string)
            # extract()        returns all matches (a list of strings)
            text = quote.css('span.text::text').extract_first()   # extract_first() pulls the text out of the selector object
            # author = quote.css('small.author::text')  # author (a selector object)
            author = quote.css('small.author::text').extract_first()  # author (the text itself)
            # tags = quote.css('div.tags a.tag::text')   # tags (selector objects)
            tags = quote.css('div.tags a.tag::text').extract()   # all matches
            # print(tags)
            # print(text, '      ——————', author, tags)

            # New API:
            # get()     returns one match
            # getall()  returns all matches
            text = quote.css('span.text::text').get()
            author = quote.css('small.author::text').get()
            tags = quote.css('div.tags a.tag::text').getall()
            print(text, tags, '      ——————', author)
        """

        # Method 2: parse with XPath (via lxml)
        html = etree.HTML(response.text)
        quotes_divs = html.xpath('//div[@class="quote"]')
        for quote_div in quotes_divs:
            text = quote_div.xpath('./span[1]/text()')[0]
            author = quote_div.xpath('./span[2]/small/text()')[0]
            tags = quote_div.xpath('./div[@class="tags"]/a/text()')
            # print(text, tags, '    ------', author)

            # Put the data into the item container so it can be saved later
            item = MyScrapyItem()
            item['text'] = text
            item['author'] = author
            item['tags'] = tags
            # print(item)
            # Yield each item so it is handed to the Item Pipeline
            yield item

        self.page += 1
        # Note: the pagination has to stop somewhere
        if self.page < 11:
            # Construct the next request (option 1):
            # yield scrapy.Request(self.base_url.format(self.page), callback=self.parse)

            # Construct the next request (option 2):
            yield from response.follow_all(response.css('.pager .next a::attr("href")'), callback=self.parse)
        """
        # The pagination logic as originally written:
        next = response.css('ul.pager li.next a::attr("href")').get()   # href of the "Next" link
        print(next)  # /page/2/
        url = self.start_urls[0]   # the URL currently being crawled (option 1)
        # print(url)
        url = response.url    # the URL currently being crawled (option 2)
        # print(url)
        # Join it with the relative href to form the next page's URL
        url = response.urljoin(next)
        print(url)
        # Hand a new request for it to the scheduler
        yield scrapy.Request(url, callback=self.parse)   # the new response is parsed by this same parse() method
        """
2. Paginate by overriding the start_requests() method
import scrapy
from lxml import etree
from my_Scrapy.items import MyScrapyItem   # import the MyScrapyItem class from items.py


class Spider3Spider(scrapy.Spider):
    # Spider name; remember it, the spider is started by this name:
    name = 'spider3'
    # # Allowed domains: optional (keeps the spider from wandering off to other sites)
    # allowed_domains = ['quotes.toscrape.com/']   # with no restriction the spider could keep following "next" links
    # Initial request:
    base_url = 'http://quotes.toscrape.com/page/{}/'
    page = 1
    start_urls = [base_url.format(page)]

    # Paging implemented by overriding a method:
    def start_requests(self):   # runs when the spider starts issuing requests
        for page in range(1, 11):
            url = self.base_url.format(page)
            yield scrapy.Request(url, callback=self.parse)

    # Parse method:
    def parse(self, response):
        # print(response.text)
        """
        Parse the response, extract data or generate further requests to process.
        :param response:
        :return:
        """
        # Method 1: parse with CSS selectors
        """
        quotes = response.css('.quote')   # a list
        for quote in quotes:
            # text = quote.css('span.text::text')   # ::text selects the inner text (note: this is still a Selector object)
            # Old API:
            # extract_first()  returns the first result (a string)
            # extract()  returns all results (a list of strings)
            text = quote.css('span.text::text').extract_first()   # extract_first() pulls the text out of the Selector
            # author = quote.css('small.author::text')  # author (a Selector object)
            author = quote.css('small.author::text').extract_first()  # author (the text itself)
            # tags = quote.css('div.tags a.tag::text')   # tags (Selector objects)
            tags = quote.css('div.tags a.tag::text').extract()   # all results
            # print(tags)
            # print(text, '      ——————', author, tags)

            # New API:
            # get()   returns one result
            # getall()  returns all results
            text = quote.css('span.text::text').get()
            author = quote.css('small.author::text').get()
            tags = quote.css('div.tags a.tag::text').getall()
            print(text, tags, '      ——————', author)
        """

        # Method 2: parse with XPath
        html = etree.HTML(response.text)
        quotes_divs = html.xpath('//div[@class="quote"]')
        for quote_div in quotes_divs:
            text = quote_div.xpath('./span[1]/text()')[0]
            author = quote_div.xpath('./span[2]/small/text()')[0]
            tags = quote_div.xpath('./div[@class="tags"]/a/text()')
            # print(text, tags, '    ------', author)

            # Put the data into an Item so it can be saved later
            item = MyScrapyItem()
            item['text'] = text
            item['author'] = author
            item['tags'] = tags
            # print(item)
            # Yield each item so it is handed to the pipeline one at a time
            yield item

        """
        self.page += 1
        # Note: the paging needs a stop condition
        if self.page < 11:
            # Build the next request (method 1):
            # yield scrapy.Request(self.base_url.format(self.page), callback=self.parse)

            # Build the next request (method 2):
            # follow_all() was added in Scrapy 2.0: it joins the urls and registers the callback
            yield from response.follow_all(response.css('.pager .next a::attr("href")'), callback=self.parse)
        """
        """
        # The paging logic as originally written:
        next = response.css('ul.pager li.next a::attr("href")').get()   # href of the "Next" link
        print(next)  # /page/2/
        url = self.start_urls[0]   # the url currently being crawled (method 1)
        # print(url)
        url = response.url    # the url currently being crawled (method 2)
        # print(url)
        # Join it with the current url to form the next page's url
        url = response.urljoin(next)
        print(url)
        # Hand the new request back to the scheduler
        yield scrapy.Request(url, callback=self.parse)   # the new response is again parsed by parse()
        """
3. Edit start.py to run the spider and save the data
# A Scrapy spider is part of a project: you cannot run the spider file directly with right-click "Run".
# It must be started from a terminal with "scrapy crawl <spider name>".
# If you would rather not type the command each time, create this "start.py" file instead:
from scrapy import cmdline

# cmdline.execute('scrapy crawl spider1'.split())   # invoke the terminal command
# cmdline.execute('scrapy crawl spider1 -o demo.csv'.split())   # -o exports the scraped items to demo.csv
# cmdline.execute('scrapy crawl spider2'.split())
cmdline.execute('scrapy crawl spider3'.split())   # invoke the terminal command

# The red output is not an error: it is the spider's startup/log information. The white lines are the actual output.
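The `-o demo.csv` flag above asks Scrapy's feed exporter to write every yielded item as one CSV row. A framework-free sketch of the same idea with the standard `csv` module (the file name and field names mirror the items used in this tutorial; the sample items are made up for illustration):

```python
import csv

# Items in the shape the spider yields them
items = [
    {'text': 'Quote one.', 'author': 'Author A', 'tags': ['a', 'b']},
    {'text': 'Quote two.', 'author': 'Author B', 'tags': ['c']},
]

with open('demo.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['text', 'author', 'tags'])
    writer.writeheader()
    for item in items:
        # Scrapy joins list fields with commas in CSV output; do the same here
        writer.writerow({**item, 'tags': ','.join(item['tags'])})
```

Scrapy picks the exporter from the file extension, so `-o demo.json` or `-o demo.jl` would export the same items as JSON or JSON lines instead.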

5. Scrapy framework - Case 2 1. Analyze the website
  1. Target website: the Tencent careers site
  2. Goals:
    1. Scrape the job posting information
    2. Handle paging
      Page url shown in the address bar (the data is not in this page's source): https://careers.tencent.com/search.html
  3. Data loading: both dynamic and static
    Data urls captured from the browser's network panel:
    Page 1:
    https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1649127078399&countryId=&cityId=&bgIds=&productId=&categoryId=40001001,40001002,40001003,40001004,40001005,40001006&parentCategoryId=&attrId=&keyword=&pageIndex=1&pageSize=10&language=zh-cn&area=cn
    Page 2:
    https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1649127078399&countryId=&cityId=&bgIds=&productId=&categoryId=40001001,40001002,40001003,40001004,40001005,40001006&parentCategoryId=&attrId=&keyword=&pageIndex=2&pageSize=10&language=zh-cn&area=cn
    Detail page:
    url: https://careers.tencent.com/jobdesc.html?postId=1310124481703845888
    data-url: https://careers.tencent.com/tencentcareer/api/post/ByPostId?timestamp=1649156817199&postId=1310124481703845888&language=zh-cn
  4. Crawling approach:
    1. Request the first page's data url
    2. Parse the postId of every job on the first page
    3. Build the detail-page data urls from those postIds
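Comparing the two list urls above, only the `pageIndex` query parameter changes between pages, so the paged data-url can be built by simple string formatting. A sketch under that observation (the timestamp is kept fixed here, though the site generates a fresh one per request):

```python
# The list-page data url with pageIndex left as a placeholder
one_url = ('https://careers.tencent.com/tencentcareer/api/post/Query'
           '?timestamp=1649127078399&countryId=&cityId=&bgIds=&productId='
           '&categoryId=40001001,40001002,40001003,40001004,40001005,40001006'
           '&parentCategoryId=&attrId=&keyword='
           '&pageIndex={}&pageSize=10&language=zh-cn&area=cn')

# Build the urls for the first three pages
for page in range(1, 4):
    print(one_url.format(page))
```

The detail-page data url works the same way, with `postId` substituted instead of `pageIndex`.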
2. Implementation steps
  1. Create the project
scrapy startproject tencent
  2. Create the spider
cd tencent
scrapy genspider spider1 tencent.com

Output:

C:\Users\lv\Desktop\scrapy框架的学习>scrapy startproject tencent
New Scrapy project 'tencent', using template directory 'd:\anaconda\lib\site-packages\scrapy\templates\project', created in: C:\Users\lv\Desktop\scrapy框架的学习\tencent

You can start your first spider with:
    cd tencent
    scrapy genspider example example.com

C:\Users\lv\Desktop\scrapy框架的学习>cd tencent

C:\Users\lv\Desktop\scrapy框架的学习\tencent>scrapy genspider spider1 tencent.com
Created spider 'spider1' using template 'basic' in module:
  tencent.spiders.spider1

C:\Users\lv\Desktop\scrapy框架的学习\tencent>
  3. Open the tencent project in PyCharm:
  4. Generate a spider1.py file on the command line with:
scrapy genspider spider1 tencent.com
  5. Edit spider1.py as follows:

  6. Uncomment the pipelines entry in settings.py:
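"Uncommenting the pipelines entry" means enabling an `ITEM_PIPELINES` entry in settings.py and giving the pipeline class a `process_item()` method in pipelines.py. A minimal sketch (`TencentPipeline` is the class name `scrapy startproject` generates by default for a project named tencent):

```python
# In settings.py — the number is the pipeline's order; lower numbers run first:
ITEM_PIPELINES = {
    'tencent.pipelines.TencentPipeline': 300,
}

# In pipelines.py — a minimal pipeline that passes items straight through:
class TencentPipeline:
    def process_item(self, item, spider):
        # clean or save the item here; returning it hands it on to the next pipeline
        return item
```

Every item the spider yields flows through each enabled pipeline's `process_item()` in order, which is why the spiders above yield items instead of saving them directly.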
Supplement 1: Using the Spider class 1. How a Spider runs
  1. Defines the logic for crawling the site
  2. Parses the pages that are crawled
2. Attributes of the Spider class
  • name: the spider's name.
  • allowed_domains: the domains the spider may visit; keeps it from crawling other sites.
  • start_urls: the list of urls requested first.
  • custom_settings: a dict of settings specific to this spider; it overrides the project's global settings, and it must be defined as a class attribute, because it is read before the spider is instantiated.
  • crawler: set by the from_crawler() method; the crawler object this spider belongs to. The project's settings can be read through it.
  • closed: called when the spider closes; use it to release resources.
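The override behaviour of `custom_settings` can be pictured as a dict merge in which the spider's own dict wins. This is a simplified sketch of the idea, not Scrapy's real Settings implementation (the setting names are standard Scrapy settings):

```python
# Project-wide settings (settings.py)
global_settings = {'DOWNLOAD_DELAY': 0, 'CONCURRENT_REQUESTS': 16}

# Spider-specific settings (a class attribute on one spider)
custom_settings = {'DOWNLOAD_DELAY': 2}

# Per-spider values override the global ones; everything else is inherited
effective = {**global_settings, **custom_settings}
print(effective)  # {'DOWNLOAD_DELAY': 2, 'CONCURRENT_REQUESTS': 16}
```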
Supplement 2: The Request object 1. Introduction
  • Request is the Scrapy object used when building a new request.
    For example:
yield scrapy.Request(url=detail_url, callback=self.parse_detail)
2. Parameters
  • url: the url of the new request; it is put on the scheduler's queue.
  • callback: the function that will parse the response.
  • priority: the request's priority (controls which queued url is fetched first). Defaults to 0; the scheduler uses it when ordering requests, and larger values are scheduled earlier.
  • method: the HTTP method, "GET" by default.
  • dont_filter: whether to skip the duplicate filter; defaults to False (duplicate requests are filtered out). Set it to True to request the same url again.
  • errback: a method called when the request fails; defaults to None. (Rarely used.)
    For example:
    def parse(self, response):
        ...
        yield scrapy.Request(url=detail_url, callback=self.parse_detail, errback=self.func)

    def func(self, failure):   # the errback receives the Failure describing the error
        print("called after the request fails:", failure)
  • body: the request body.
  • headers: the request headers.
  • cookies: cookies to send with the request.
  • meta: extra data attached to the request; it travels with the response and can be read in the callback.
    For example:
    def parse(self, response):
        # Parse the data (the response here is not an HTML page but a JSON data packet)
        data = json.loads(response.text)
        for job in data['Data']['Posts']:
            item = TencentItem()

            post_id = job['PostId']
            # print(post_id)
            item['job_name'] = job['RecruitPostName']

            # Build the detail-page url
            detail_url = self.two_url.format(post_id)
            print(detail_url)

            # Build the request:
            yield scrapy.Request(url=detail_url, callback=self.parse_detail, meta={"item": item})

    # Parse the detail page
    def parse_detail(self, response):
        item = response.meta.get('item')
        print(item)
  • encoding: the encoding, "utf-8" by default.
  • cb_kwargs: extra keyword arguments for the callback, passed as a dict.
    For example:
    def parse(self, response):
        ...
        # Build the request:
        yield scrapy.Request(url=detail_url, callback=self.parse_detail, cb_kwargs={"num": 1})

    # Parse the detail page
    def parse_detail(self, response, num):
        print(num)
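How meta and cb_kwargs reach the callback can be imitated without the framework: the engine calls `callback(response, **cb_kwargs)` and exposes the request's meta dict on the response. A framework-free sketch (`FakeResponse` is a stand-in invented here, not a Scrapy class):

```python
class FakeResponse:
    """Stand-in for scrapy.http.Response: carries meta through to the callback."""
    def __init__(self, meta):
        self.meta = meta

def parse_detail(response, num=0):
    # Reads both channels: the item from meta, num from cb_kwargs
    return response.meta.get('item'), num

# Roughly what the engine does: call the callback with the response and cb_kwargs
cb_kwargs = {"num": 1}
response = FakeResponse(meta={"item": {"job_name": "dev"}})
result = parse_detail(response, **cb_kwargs)
print(result)  # ({'job_name': 'dev'}, 1)
```

In short: meta rides on the request/response pair, while cb_kwargs become extra named parameters of the callback itself.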
Supplement 3: CSS selectors
"""
Parsing tools:
    1. Regular expressions               fastest        hardest syntax
    2. XPath                             medium speed   medium syntax
    3. BS4 (bs syntax + CSS selectors)   slowest        simplest syntax
"""
from bs4 import BeautifulSoup
# A recommended third-party library: parsel
import parsel   # bundles all three selector styles: regex, xpath and css


html = """
The Dormouse's story

The Dormouse's story

Once upon a time there were three little sisters; and their names were Elsie, Lacie and Tillie; and they lived at the bottom of a wel.

...

""" # 一、通过BeautifulSoup模块使用css选择器: # 解析 # lxml是第三方的解析器,比起默认的html.parser解析器速度快很多 soup = BeautifulSoup(html, features="lxml") # BeautifulSoup会自动补全不完整的html(例如加上、等) # print(soup) # 1. 通过标签名称进行查找 a_tags = soup.select('a') print(a_tags) # 2. 通过类名称进行查找 sister_class = soup.select('.sister') print(sister_class) # 3. 通过id名进行查找 link1_id = soup.select("#link1") print(link1_id) # 4. 组合查找 a_link2 = soup.select("p #link2") print(a_link2) a_link2 = soup.select("p > #link2") # > 代表直接的下一级 print(a_link2) p_sister_class = soup.select("p > .sister") print(p_sister_class) # 同一个标签的id和class不能同时用 # p_sister_class_id = soup.select("p > .sister#link1") # print(p_sister_class_id) # 5. 通过属性查找 a_href = soup.select('a[href="http://example.com/elsie"]') print(a_href) # 6. 获取标签内的文本内容 text1 = soup.select('title')[0].get_text() print(text1) # 7. 获取标签属性的值(如获取href属性的值) href = soup.select('a#link1')[0]['href'] print(href) print("---"*20) # 二、通过parsel模块使用CSS选择器: selector = parsel.Selector(html) # 创建选择器对象 # selector.re() # selector.xpath() # selector.css() # 1. 通过标签名查找 object_list = selector.css("a") print(object_list.getall()) # getall()方法获取全部满足的结果 # for item in object_list: # print(item.get()) # 2. 通过类名称进行查找 print(selector.css('.sister').get()) # get()方法获取第一个满足条件的结果 print(selector.css('.sister').getall()) # 3. 通过id名进行查找 print(selector.css('#link1').getall()) # 4. 组合查找 print(selector.css('p.story a#link2').getall()) # 5. 通过属性查找 print(selector.css('.story').get()) # 6. 获取标签内的文本内容 print(selector.css('p > #link1::text').get()) # 7. 获取标签属性的值(如获取href属性的值) print(selector.css('p > #link1::attr(href)').get()) # 8. 伪类选择器 print(selector.css('a').getall()[1]) print(selector.css('a:nth-child(1)').getall()) # 选择第几个