Python Web Scraping Series: A Summary from requests to the Scrapy Framework

Web Scraping

Author: Ychhh_

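Preliminaries

Crawler categories

General-purpose crawler:
- A core component of a crawling system; it fetches whole pages
Focused crawler:
- Built on top of a general-purpose crawler
- Extracts only a specific part of each fetched page
Incremental crawler:
- Monitors a site for updates to its data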

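Anti-scraping mechanisms

A website can adopt strategies that prevent crawlers from harvesting its data.
Counter-anti-scraping strategies: the crawler works around those measures in order to obtain the data.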

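Related protocols

robots.txt protocol:
- A gentlemen's agreement. It states which data on a site may be crawled and which may not; nothing enforces it technically
HTTP protocol:
- The protocol most commonly used for communication between client and server
Common request headers:
- User-Agent: identifies the software that sends the request
- Connection: whether to close the connection or keep it alive once the request completes
Common response headers:
- Content-Type: the type of the data the server sends back to the client
HTTPS protocol:
- The secure version of the Hypertext Transfer Protocol
Encryption schemes:
- Symmetric-key encryption:
  The client sends both the ciphertext and the key to the server
  Weakness: the key and the ciphertext can both be intercepted by a party in the middle
- Asymmetric-key encryption:
  The client sends the ciphertext to the server
  The server sends its (public) key to the client
  Weakness: the client cannot be sure that the key it received really came from the server
- Certificate-based encryption (HTTPS):
  A third-party certificate authority vouches for the key so that it cannot be forged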
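
The robots.txt rules described above can be checked before a crawler requests a page. The following is a minimal sketch using the standard library's urllib.robotparser. The rules in ROBOTS_TXT, the crawler name my-crawler and the example.com URLs are made up for illustration; a real crawler would download the target site's own robots.txt instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for this illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


public_ok = is_allowed(ROBOTS_TXT, "my-crawler", "https://example.com/index.html")
private_ok = is_allowed(ROBOTS_TXT, "my-crawler", "https://example.com/private/data.html")
print(public_ok, private_ok)
```

Because the protocol is only an agreement, the check is something the crawler chooses to run; the server does not enforce it.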

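The requests module

What requests does

It simulates a browser sending requests.

Forms of the response data:
- text: the body decoded to a string
- json(): the body parsed from JSON into Python objects
- content: the raw bytes of the body (used for binary data such as images)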
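
The relationship between the three forms can be sketched without sending a request. The byte string below is a hand-written stand-in for a response body (an assumption for illustration, not output from a real site); the variable names only mirror what response.content, response.text and response.json() would hold.

```python
import json

# Stand-in for response.content: the raw bytes received from the server.
content = '{"title": "demo", "pages": 3}'.encode("utf-8")

# Stand-in for response.text: the bytes decoded with the response encoding.
text = content.decode("utf-8")

# Stand-in for response.json(): the text parsed into Python objects.
data = json.loads(text)

print(type(content).__name__, type(text).__name__, type(data).__name__)
print(data["pages"])
```

Binary payloads such as images have no meaningful text form, which is why the image samples in this article write response.content to disk in 'wb' mode.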

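UA spoofing (countering an anti-scraping check)

If a website detects that a request was sent by the requests library rather than by a browser, it may refuse access. Sending a browser's User-Agent header avoids this check.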
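
To see what the check reacts to, the sketch below compares the User-Agent that requests identifies itself with by default against a spoofed one. It only builds the request and never sends it; example.com and the helper name prepare_get are assumptions for illustration.

```python
import requests

BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36"
}


def prepare_get(url, headers=None):
    """Build a GET request object without sending it."""
    return requests.Request("GET", url, headers=headers).prepare()


# What requests announces when no User-Agent is supplied.
default_ua = requests.utils.default_user_agent()

# The same request with a browser User-Agent attached.
spoofed = prepare_get("https://example.com/", BROWSER_HEADERS)

print("default:", default_ua)
print("spoofed:", spoofed.headers["User-Agent"])
```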

Focused crawlers

Data-parsing approaches

Regular expressions
bs4
xpath

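bs4

How data parsing works
1. Locate the tag
2. Extract the text or attribute values stored in the tag

How bs4 parses data:

 1. Instantiate a BeautifulSoup object and load the page source into it
 2. Call the object's attributes and methods to locate tags and extract data

Locating tags:
- soup.tagName: returns the first occurrence of that tag
- soup.find():
  1. find(tagName): equivalent to soup.tagName
  2. find(tagName, class_ / attr / id …): locates the tag by attribute
- soup.find_all(): returns every matching tag as a list; it also accepts attribute filters
- soup.select():
  1. Tag selectors (any CSS selector works, e.g. .class or #id)
  2. Hierarchy selectors:
  - parent > child (exactly one level)
  - a space ' ' spans any number of levels
- Attention: find() returns a single tag (or None), whereas select() and find_all() return lists
Getting the content of a tag:
- tag.text
- tag.string
- tag.get_text()
text and get_text() return all text inside the tag, including that of its descendants; string returns the text only when the tag has a single child, otherwise None.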
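
The locating and extraction calls listed above can be tried on a small inline document. This is a sketch only: the HTML snippet is invented for illustration, and it uses Python's built-in "html.parser" so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Invented HTML, loosely shaped like a table-of-contents page.
HTML = """
<div class="book-mulu">
  <ul>
    <li><a href="/book/1.html">Chapter 1</a></li>
    <li><a href="/book/2.html">Chapter 2</a></li>
  </ul>
</div>
<p id="intro">Hello <b>world</b></p>
"""

soup = BeautifulSoup(HTML, "html.parser")

first_link = soup.a                                  # first <a> in the document
toc = soup.find("div", class_="book-mulu")           # locate by attribute
all_links = soup.find_all("a")                       # list of every <a>
child_links = soup.select(".book-mulu > ul > li > a")  # one level per ">"
any_depth_links = soup.select(".book-mulu a")        # space: any number of levels

titles = [a.text for a in any_depth_links]
hrefs = [a["href"] for a in all_links]

intro = soup.find("p", id="intro")
print(titles, hrefs)
print(intro.text, intro.get_text(), intro.string)
```

Here intro.string is None because the paragraph has two children (a text node and a b tag), while intro.text joins the text of both.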

Code sample (scraping Romance of the Three Kingdoms)

import requests
from bs4 import BeautifulSoup

if __name__ == "__main__":

    url = "https://www.shicimingju.com/book/sanguoyanyi.html"

    headers = {
        "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36"
    }

    response = requests.get(url=url,headers=headers)
    # response.encoding is taken from the charset in the response headers and
    # falls back to ISO-8859-1 when none is given; response.apparent_encoding is
    # guessed from the page content. Assigning one to the other avoids garbled text.
    response.encoding = response.apparent_encoding

    html = response.text

    soup = BeautifulSoup(html,'lxml')
    muluList = soup.select(".book-mulu a")  # chapter links in the table of contents
    muluRecord = []
    for mulu in muluList:
        muluRecord.append(mulu.text)
    dataTotalUrl = "https://www.shicimingju.com/book/sanguoyanyi/%d.html"
    # Output directory; adjust to your machine
    saveDir = r"C:\Users\Y_ch\Desktop\spider_test\data\text\sanguo"
    for i,title in enumerate(muluRecord):
        dataUrl = dataTotalUrl%(i + 1)
        response = requests.get(url=dataUrl,headers=headers)
        response.encoding = response.apparent_encoding
        dataHtml = response.text

        dataSoup = BeautifulSoup(dataHtml,'lxml')

        data = dataSoup.find("div",class_="chapter_content").text
        data = data.replace("　　","\n")  # turn paragraph indents (two full-width spaces) into line breaks
        path = saveDir + "\\" + title + ".txt"
        with open(path,'w',encoding="utf-8") as fp:
            fp.write(data)
        print("Chapter %d downloaded"%(i + 1))

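xpath

How data parsing works:
1. Instantiate an etree object and load the page source to be parsed into it
2. Call the etree object's xpath method with an xpath expression to locate tags and extract data
Loading the source:
1. Load HTML source from a local file into etree
  - etree.parse(filepath)
2. Load source fetched from the internet into etree
  - etree.HTML(text)
xpath usage:
- Absolute path: /xx/xx/x
- Abbreviated path (any depth): //xx
- Attribute match: //tagName[@attr="value"]
- Index match: //tagName[@attr="value"]/xx[n] (indices start at 1)
- Index below any descendant: //tagName[@attr]//p[pos]
- Getting text: //tagName/text()
- Getting an attribute: //tagName/@attr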
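
The expressions above can be tried against a small inline page. This is a sketch under stated assumptions: the HTML is invented for illustration and only mimics the ul/img layout used by the image sample in this article.

```python
from lxml import etree

# Invented HTML used only to exercise the xpath forms listed above.
HTML = """
<html><body>
  <ul class="clearfix">
    <li><a href="/a.html"><img src="/img/a.jpg" alt="A"></a></li>
    <li><a href="/b.html"><img src="/img/b.jpg" alt="B"></a></li>
  </ul>
  <div id="main"><p>first</p><p>second</p></div>
</body></html>
"""

tree = etree.HTML(HTML)

# Attribute match plus attribute extraction.
srcs = tree.xpath('//ul[@class="clearfix"]//img/@src')

# Index match: xpath indices start at 1, so p[2] is the second paragraph.
second_p = tree.xpath('//div[@id="main"]/p[2]/text()')

# Absolute path from the document root.
first_p = tree.xpath('/html/body/div/p[1]/text()')

print(srcs, second_p, first_p)
```

Every xpath call returns a list, even when only one node matches, which is why the samples in this article index the result with [0].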

Code sample (scraping 4K images)

from lxml import etree
import requests

if __name__ == "__main__":
    url = "https://pic.netbian.com/4kdongman/index_%d.html"

    headers = {
        "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36"
    }
    pageNum = 2

    for page in range(pageNum):
        # The first page has no index suffix in its url
        if page == 0:
            new_url = "https://pic.netbian.com/4kdongman/"
        else:
            new_url = url % (page + 1)

        response = requests.get(url=new_url, headers=headers)

        html_code = response.text

        tree = etree.HTML(html_code)

        urlList = tree.xpath('//ul[@class="clearfix"]//img/@src')

        urlHead = "https://pic.netbian.com"
        # Output directory; adjust to your machine
        path = "C:\\Users\\Y_ch\\Desktop\\spider_test\\data\\pic\\4K\\"
        picUrlList = []
        for eachUrl in urlList:
            picUrlList.append(urlHead + eachUrl)

        for index, picUrl in enumerate(picUrlList):
            picReq = requests.get(url=picUrl, headers=headers)
            pic = picReq.content  # binary image data

            picPath = path + str(page)+ "." +str(index) + ".jpg"
            with open(picPath, 'wb') as fp:
                fp.write(pic)
            print("Page %d, image %d downloaded!" % ((page + 1),index + 1))

CAPTCHA recognition

A CAPTCHA is one of the anti-scraping mechanisms websites use.
The crawler downloads the CAPTCHA img and passes it to a third-party recognition service to read the code.

Code sample

import  requests
from lxml import etree
from verication import vercation  # helper defined in verication.py

if __name__ == "__main__":
    url = "https://so.gushiwen.cn/user/login.aspx?from=http://so.gushiwen.cn/user/collect.aspx"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36"
    }
    response = requests.get(url=url,headers=headers)

    tree = etree.HTML(response.text)

    # Relative url of the CAPTCHA image on the login page
    varication_path = tree.xpath('//img[@id="imgCode"]/@src')
    picUrl = "https://so.gushiwen.cn" + varication_path[0]

    pic = requests.get(url=picUrl,headers=headers).content
    print(vercation(pic=pic))





#!/usr/bin/env python
# coding:utf-8
# verication.py: client for the Chaojiying CAPTCHA recognition service

import requests
from hashlib import md5

class Chaojiying_Client(object):

    def __init__(self, username, password, soft_id):
        self.username = username
        password =  password.encode('utf8')
        self.password = md5(password).hexdigest()  # the service expects the MD5 of the password
        self.soft_id = soft_id
        self.base_params = {
            'user': self.username,
            'pass2': self.password,
            'softid': self.soft_id,
        }
        self.headers = {
            'Connection': 'Keep-Alive',
            'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
        }

    def PostPic(self, im, codetype):
        """
        im: image bytes
        codetype: CAPTCHA type, see http://www.chaojiying.com/price.html
        """
        params = {
            'codetype': codetype,
        }
        params.update(self.base_params)
        files = {'userfile': ('ccc.jpg', im)}
        r = requests.post('http://upload.chaojiying.net/Upload/Processing.php', data=params, files=files, headers=self.headers)
        return r.json()

    def ReportError(self, im_id):
        """
        im_id: ID of the image whose answer was wrong
        """
        params = {
            'id': im_id,
        }
        params.update(self.base_params)
        r = requests.post('http://upload.chaojiying.net/Upload/ReportError.php', data=params, headers=self.headers)
        return r.json()


def vercation(pic,picCode=1902,picMoudle=None):
    # Replace the placeholders with your own Chaojiying username, password and software ID
    chaojiying = Chaojiying_Client('<username>', '<password>', '<soft_id>')
    if picMoudle is None:
        # pic already holds the image bytes
        return chaojiying.PostPic(pic, picCode)["pic_str"]
    else:
        # pic is the path of a local image file
        with open(pic, 'rb') as f:
            im = f.read()
        return chaojiying.PostPic(im, picCode)["pic_str"]
# if __name__ == '__main__':
# 	chaojiying = Chaojiying_Client('<username>', '<password>', '<soft_id>')	# software ID: generate one under User Center >> Software ID
# 	im = open('a.jpg', 'rb').read()													# path of a local image file
# 	print (chaojiying.PostPic(im, 1902))												# 1902 is the CAPTCHA type, see the price list on the official site

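Proxies

What a proxy is:
- A proxy server
What a proxy is for:
- Getting around restrictions placed on your own IP
- Hiding your real IP
Proxy listing sites:
- Kuaidaili (快代理)
- Xici Proxy (西祠代理)
- www.goubanjia.com
Proxy anonymity levels:
- Transparent: the server knows both the proxy IP and the real IP
- Anonymous: the server knows a proxy is being used but does not know the real IP
- Elite (high anonymity): the server can neither tell that a proxy is being used nor learn the real IP
In Python, pass the proxy to requests through the proxies argument.
The keys of the proxies dict are URL schemes: the "http" entry is used only for requests to http urls, and the "https" entry only for requests to https urls.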
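
The scheme rule can be observed without sending anything. The sketch below uses select_proxy from requests.utils to show which entry of a proxies dict applies to a given url; the proxy address is the one from the code sample in this article and is used here purely as a string.

```python
from requests.utils import select_proxy

# Only an "https" entry is configured, as in the sample in this article.
proxies = {"https": "221.110.147.50:3128"}

for_https = select_proxy("https://www.baidu.com/s?wd=ip", proxies)
for_http = select_proxy("http://www.baidu.com/s?wd=ip", proxies)

print("https url ->", for_https)
print("http url  ->", for_http)
```

An http url gets no proxy from this dict, so such a request would go out directly from the real IP. Adding an "http" key covers both cases.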

Code sample

import requests

if __name__ == "__main__":
    url = "https://www.baidu.com/s?wd=ip"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36"
    }
    # The key is the scheme of the target url; replace the address with a working proxy
    proxies = {
        "https":"221.110.147.50:3128"
    }

    response = requests.get(url=url,headers=headers,proxies=proxies)

    # Output file; adjust to your machine
    with open(r"C:\Users\Y_ch\Desktop\spider_test\dd.html",'w',encoding="utf-8") as fp:
        fp.write(response.text)

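Asynchronous crawlers

Purpose: use asynchrony to achieve high-throughput data scraping

Ways to implement an asynchronous crawler

Multithreading / multiprocessing (not recommended):
- Advantage: each blocking operation gets its own thread or process, so blocking operations run asynchronously
- Drawback: threads and processes cannot be created without limit
Thread pool / process pool:
- Advantage: reduces how often threads or processes are created and destroyed, which lowers system overhead
- Drawback: the number of threads in the pool has an upper limit
Single thread + asynchronous coroutines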
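
The thread-pool and coroutine approaches can be compared side by side with a simulated download. In this sketch time.sleep and asyncio.sleep stand in for a blocking network request, and the example.com urls and function names are assumptions for illustration; no real requests are sent.

```python
import asyncio
import time
from multiprocessing.dummy import Pool  # thread pool with the Pool interface

URLS = ["https://example.com/page/%d" % i for i in range(4)]


def fake_download(url):
    """Blocking stand-in for requests.get(url)."""
    time.sleep(0.2)
    return url + " done"


def crawl_with_pool(urls, workers=4):
    """Thread pool: each blocking call runs in its own worker thread."""
    with Pool(workers) as pool:
        return pool.map(fake_download, urls)


async def fake_download_async(url):
    """Non-blocking stand-in: yields control while 'waiting' on the network."""
    await asyncio.sleep(0.2)
    return url + " done"


async def crawl_async(urls):
    """Single thread + coroutines: all waits overlap inside one event loop."""
    return await asyncio.gather(*(fake_download_async(u) for u in urls))


start = time.perf_counter()
pool_results = crawl_with_pool(URLS)
print("thread pool: %.2fs" % (time.perf_counter() - start))

start = time.perf_counter()
async_results = asyncio.run(crawl_async(URLS))
print("coroutines:  %.2fs" % (time.perf_counter() - start))
```

Both variants should finish in roughly the time of a single simulated download rather than four, since the waits overlap. With coroutines the blocking call itself has to be replaced by an awaitable one, which is why a blocking library such as requests cannot simply be dropped into the asynchronous version.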

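The selenium module

Browser driver (Chrome):
- http://chromedriver.storage.googleapis.com/index.html
How selenium relates to crawlers:
- It makes dynamically loaded data easy to obtain (content rendered by JavaScript that etree and BeautifulSoup cannot see in the raw page source)
- It makes simulated login easy

Code sample (scraping Pear Video):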

from selenium import webdriver
from lxml import etree
import requests
import time
from multiprocessing.dummy import Pool

# Warning: crawling with a thread pool is easily blocked by anti-scraping measures!

headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36"
    }

def getUrlList():
    url = "https://www.pearvideo.com/category_5"

    response = requests.get(url=url, headers=headers)
    htmlHead = 'https://www.pearvideo.com/'
    initHtml = etree.HTML(response.text)
    videoUrlList = initHtml.xpath('//ul[@class="category-list clearfix"]//a/@href')
    print(videoUrlList)

    # Turn the relative links into absolute detail-page urls
    videoHtml = []
    for each in videoUrlList:
        videoHtml.append(htmlHead + each)

    return videoHtml


def get_video(url):
    if url is None:
        return
    bro.get(url=url)  # bro is the browser created below; note that all pool threads share it
    page_text = bro.page_source
    tree = etree.HTML(page_text)

    try:
        videoUrl = tree.xpath('//div[@class="main-video-box"]/div//@src')[0]
        name = tree.xpath('//h1[@class="video-tt"]/text()')[0]
        video = requests.get(url=videoUrl, headers=headers).content
        # Output directory; adjust to your machine
        path = "C:\\Users\\Y_ch\\Desktop\\spider_test\\data\\video\\pear\\" + name + ".mp4"
        with open(path, 'wb') as fp:
            fp.write(video)
        print(name + " downloaded!")
    except IndexError:
        # No video source or title was found on this page
        print(url)



# Selenium 3 style; Selenium 4 expects webdriver.Chrome(service=Service('./chromedriver.exe'))
bro = webdriver.Chrome('./chromedriver.exe')

url = getUrlList()
get_video(url[1])
pool = Pool(len(url))
print(len(url))
pool.map(get_video,url)
pool.close()
pool.join()

time.sleep(10)
bro.quit()

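Sending a request:
- Request a url with the get method
Locating tags:
- Use the find family of functions to get the target element
Interacting with a tag:
- Use send_keys("xxx") to type into an element
Executing JavaScript:
- Call execute_script("") to make the page run JavaScript code
Navigating back and forward:
- back()
- forward()
Closing the browser:
- close()
Saving a screenshot of the page:
- save_screenshot("./filepath")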

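Handling iframes

If the target tag is nested inside an iframe sub-page, the bro.find functions cannot locate it directly. Switch into the frame first:
```
bro.switch_to.frame("iframeResult") # switch into the iframe
bro.find_element_by_id("1")
```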

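Action chains

When the crawler needs to perform gestures in the browser, such as dragging, use webdriver's action chains.

Code sample:

import time
from selenium import webdriver
from selenium.webdriver import ActionChains

def drop_test():
    bro = webdriver.Chrome("chromedriver.exe")
    bro.get("https://www.runoob.com/try/try.php?filename=jqueryui-api-droppable")

    # The draggable element lives inside an iframe
    bro.switch_to.frame("iframeResult")
    div = bro.find_element_by_id("draggable")

    # Build the action chain
    action = ActionChains(bro) # action chain instance
    action.click_and_hold(div) # press and hold the element

    for i in range(5):
        # move_by_offset(xoffset, yoffset): pixels to move in each direction
        # perform(): run the queued actions immediately
        action.move_by_offset(xoffset=18,yoffset=0).perform()
        time.sleep(0.3)

    # Release the mouse button; release() only takes effect once perform() runs
    action.release().perform()
    time.sleep(5)
    bro.close()
    print(div)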

Headless browser

Runs the browser without a visible window.

Code to add:

from selenium import webdriver
from selenium.webdriver.chrome.options import  Options

chrome_option = Options()
chrome_option.add_argument("--headless")
chrome_option.add_argument("--disable-gpu")
bro = webdriver.Chrome("./chromedriver.exe",options=chrome_option) # pass the options when instantiating the driver (older Selenium versions name this argument chrome_options)

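Evading selenium detection

Some sites reject requests that come from selenium, so selenium cannot get a usable response from the server. The following code works around the detection.

Code to add:

from selenium import webdriver
from selenium.webdriver import ChromeOptions

# Chrome versions before 79
def evade():
    option = ChromeOptions()
    option.add_experimental_option("excludeSwitches",["enable-automation"])
    bro = webdriver.Chrome("./chromedriver.exe",options=option)
    return bro

# Chrome 79 and later
driver = webdriver.Chrome()
# Hide navigator.webdriver in every new document before the page's own scripts run
driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
    "source": """
      Object.defineProperty(navigator, 'webdriver', {
        get: () => undefined
      })
    """
})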

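The Scrapy framework

Getting to know Scrapy

What a framework is:
- A project template that bundles many features and is highly reusable
What Scrapy is:
- A ready-made framework for crawlers
- Features:
  1. High-performance persistent storage
  2. Asynchronous data downloading
  3. High-performance data parsing
  4. Distributed crawling
Installing Scrapy:
- pip install wheel
- From https://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted download the Twisted wheel that matches your Python version, put it in a directory of your choice, and from that directory run
  
  pip install <downloaded Twisted wheel file>

pip install pywin32

pip install scrapy

Creating a Scrapy project:
1. Activate the Python environment in which Scrapy is installed
2. Run scrapy startproject XXXPro
3. The project is created
Create the spider source file inside the project's spiders subdirectory:

scrapy genspider <spiderName> <start url>
Run the project:

scrapy crawl <spiderName>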

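Explanation of the attributes generated in the spider:

    # Name of the spider; it uniquely identifies the spider source file
    name = 'test'

    # Domains the spider may request; urls outside this list are not requested (usually left unused)
    allowed_domains = ['www.xxx.com'] # normally this line is commented out

    # Start urls, i.e. the urls the spider requests automatically
    start_urls = ['http://www.xxx.com/']

Set the robots.txt option to False

ROBOTSTXT_OBEY = False # with True, Scrapy obeys robots.txt and skips the urls the site disallows

Hiding the log output printed along with the response:

scrapy crawl <spiderName> --nolog

Drawback: if the request fails, no error message is shown at all

A better alternative to --nolog:

In settings.py set: LOG_LEVEL = "ERROR"

The response to each requested url is passed to parse as the response argument. Parse it with response.xpath; the selected data is then pulled out with extract

    def parse(self, response):
        div_list = response.xpath('//div[@class="content"]/span/text()').extract()
        print(''.join(div_list))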

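Persistent storage in Scrapy

Persistent storage options:
- Terminal-based storage:
  
  scrapy crawl <spiderName> -o <filepath>
  
  Notes:
  1. Only the return value of the parse function can be stored, and only to a local file (not directly to a database)
  2. Only these file types are supported: 'json', 'jsonlines', 'jl', 'csv', 'xml', 'marshal', 'pickle'
- Pipeline-based storage:
  - Workflow:
    1. Parse the data
    2. Define the fields used to hold the data in the item class
    3. Put the parsed data into an item
    4. Submit the item object to the pipeline for persistent storage
    5. Save the data in the pipeline class's process_item method
    6. Enable the pipeline in the settings file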

Example: saving to a local file

  ```python
  # items.py
  import scrapy

  class QiuabaiproItem(scrapy.Item):
      # define the fields for your item here like:
      # name = scrapy.Field()
      content = scrapy.Field() # field that holds the scraped data
  
  ```
  
  ```python
  # pipelines.py
  class QiuabaiproPipeline(object):
      fp = None
      def open_spider(self,spider):
          # called once when the spider starts
          print("start")
          self.fp = open("./qiubi.txt",'w',encoding='utf-8')
  
      def process_item(self, item, spider):
          content = item["content"]
          self.fp.write(content)
          return item  # hand the item on to the next pipeline class
  
      def close_spider(self,spider):
          # called once when the spider finishes
          print("finish")
          self.fp.close()
  
  ```
  
  ```python
  # enable the pipeline
  # settings.py
  ITEM_PIPELINES = {
     'qiuabaiPro.pipelines.QiuabaiproPipeline': 300, # the number is the priority; lower numbers run first
  }
  
  ```
  
  ```python
  # spider file
  # yield submits the item to the pipeline
  def parse(self, response):
          div_list = response.xpath('//div[@class="content"]/span/text()').extract()
          item = QiuabaiproItem()  # imported from the project's items.py
          item["content"] = div_list[0]
          yield item  # pipelines receive item objects, not bare strings
  ```

Example: saving to a database

# pipelines.py
import pymysql

class MysqlPipeline(object):
    conn = None
    cursor = None
    def open_spider(self,spider):
        # charset must be "utf8" here, not "utf-8"; replace <password> with your own
        self.conn = pymysql.Connect(host='localhost',port=3307,user="root",passwd="<password>",db="test",charset="utf8")
    def process_item(self,item,spider):
        self.cursor =  self.conn.cursor()
        try:
            print(len(item["name"]))
            # pass the value as a parameter so the driver quotes it safely
            self.cursor.execute("insert into spider (`name`) values (%s)", (item["name"],))
            self.conn.commit()

        except Exception as e:
            print(e)
            self.conn.rollback()

        return item

    def close_spider(self,spider):
        # close the cursor before the connection
        self.cursor.close()
        self.conn.close()

# settings.py
ITEM_PIPELINES = {
   'qiuabaiPro.pipelines.QiuabaiproPipeline': 300, 
   'qiuabaiPro.pipelines.MysqlPipeline': 301,  # register the new pipeline class here; because it has lower priority, the process_item of the higher-priority class must return the item for it to use
}
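
When scraped strings are written to a database, passing them as query parameters avoids broken statements when a value contains quotes. The sketch below shows the idea with the standard library's sqlite3 as a stand-in for MySQL, so it runs without a server. Note the difference in placeholder syntax: sqlite3 uses ?, whereas pymysql uses %s. The table layout and the function name save_name are assumptions for illustration.

```python
import sqlite3

# In-memory database as a stand-in for the MySQL server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE spider (name TEXT)")


def save_name(connection, name):
    """Insert one scraped value, committing on success and rolling back on error."""
    try:
        # The value travels separately from the SQL text, so quotes are harmless.
        connection.execute("INSERT INTO spider (name) VALUES (?)", (name,))
        connection.commit()
    except sqlite3.Error:
        connection.rollback()
        raise


tricky = 'O\'Reilly "quoted" name'
save_name(conn, tricky)

rows = [row[0] for row in conn.execute("SELECT name FROM spider")]
print(rows)
```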

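Storage summary

Two ways to implement persistent storage:
- Command line (requires parse to return the data, supports a fixed set of file types, and cannot write to a database)
- Pipelines: more configuration, but otherwise more flexible in every respect
Interview question: how do you store one copy of the scraped data locally and another in a database?
- Create two pipeline classes and register both in the settings file
- When several pipeline classes store the same item, each higher-priority process_item must return the item so that the lower-priority pipeline classes receive it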
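
The hand-off between pipeline classes can be illustrated with a toy model in plain Python. This is not Scrapy's implementation; it only imitates two rules stated above: pipelines run in ascending priority number, and each one receives whatever the previous process_item returned. All class and function names are hypothetical.

```python
class RecordingPipeline:
    """Stand-in for a pipeline class: remembers every item it receives."""

    def __init__(self, forward_item=True):
        self.forward_item = forward_item
        self.received = []

    def process_item(self, item, spider):
        self.received.append(item)
        # Forgetting to return the item is the mistake the summary warns about.
        return item if self.forward_item else None


def run_pipelines(item, pipelines):
    """pipelines maps pipeline object -> priority, in the spirit of ITEM_PIPELINES."""
    for pipeline in sorted(pipelines, key=pipelines.get):  # lowest number first
        item = pipeline.process_item(item, spider=None)
    return item


# Both pipelines return the item: each one gets its copy to store.
local_store = RecordingPipeline()
mysql_store = RecordingPipeline()
run_pipelines({"name": "demo"}, {mysql_store: 301, local_store: 300})
print(local_store.received, mysql_store.received)

# The first pipeline forgets to return the item: the second one gets None.
broken_local = RecordingPipeline(forward_item=False)
starved_mysql = RecordingPipeline()
run_pipelines({"name": "demo"}, {broken_local: 300, starved_mysql: 301})
print(starved_mysql.received)
```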

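Crawling a whole site

The start url is usually the home page of the site to be crawled; the urls of the remaining pages are derived from the index or page-number pattern.

Yield scrapy.Request objects that call back into parse to fetch every page recursively

import scrapy

class BeautiSpider(scrapy.Spider):
    name = 'beauti'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.duodia.com/daxuexiaohua/']
    url = "https://www.duodia.com/daxuexiaohua/list_%d.html"  # url template for the following pages
    pageNum = 1
    def parse(self, response):
        div_list = response.xpath('//*[@id="main"]/div[2]/article')
        for div in div_list:
            name = div.xpath(".//a/@title").extract()
            print("".join(name))

        # Request the next page until the page limit is reached
        if self.pageNum <= 5:
            new_url = self.url % self.pageNum
            self.pageNum += 1
            yield scrapy.Request(url=new_url,callback=self.parse) # recursive call; callback names the method that parses the response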

The five core components

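Engine (Scrapy Engine):
- Controls the data flow of the whole system and triggers events (the core of the framework)
Scheduler:
- Receives the requests sent by the engine, pushes them onto a queue, and hands them back when the engine asks for the next one. Think of it as a priority queue of URLs (the addresses of the pages to fetch): it decides which URL is fetched next and also removes duplicate URLs
Downloader:
- Downloads page content and returns it to the spider (the Scrapy downloader is built on Twisted, an efficient asynchronous model)
Item Pipeline:
- Processes the entities (items) the spider extracts from pages. Its main jobs are persisting items, validating them, and removing unwanted information. After a page has been parsed by the spider, its items are sent to the item pipeline and processed in a fixed order of steps
Spider:
- The spider does the main work: it extracts the information you need, the so-called entities (items), from specific pages. It can also extract links so that Scrapy goes on to crawl the next page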
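
The scheduler's two jobs, ordering and de-duplication, can be pictured with a toy model. The class below is a simplified sketch and not Scrapy's scheduler, which works on request fingerprints and pluggable queues; ToyScheduler and its method names are hypothetical.

```python
import heapq


class ToyScheduler:
    """Priority queue of urls that silently drops duplicates."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = 0  # keeps first-in-first-out order among equal priorities

    def enqueue(self, url, priority=0):
        """Queue a url; return False if it was already seen."""
        if url in self._seen:
            return False
        self._seen.add(url)
        # heapq pops the smallest tuple, so negate: higher priority comes out first.
        heapq.heappush(self._heap, (-priority, self._counter, url))
        self._counter += 1
        return True

    def next_request(self):
        """Return the next url to fetch, or None when the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


scheduler = ToyScheduler()
added = [
    scheduler.enqueue("https://example.com/list_1.html"),
    scheduler.enqueue("https://example.com/detail.html", priority=5),
    scheduler.enqueue("https://example.com/list_1.html"),  # duplicate
]
order = [scheduler.next_request() for _ in range(3)]
print(added, order)
```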

Passing data between requests

When crawling a whole site, scraping detail pages requires passing data along with the request, i.e. handing the item object from one callback to another

Implementation:

    # inside the spider class; BoproItem comes from the project's items.py
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.zhipin.com/job_detail/?query=python']
    home_url = "https://www.zhipin.com/"


    def detail_parse(self,response):
        item = response.meta["item"]  # the item attached in parse
        content = response.xpath('//*[@id="main"]/div[3]/div/div[2]/div[2]/div[1]/div//text()').extract()
        item["content"] = content
        yield item


    def parse(self, response):
        print(response)
        li_list = response.xpath('//*[@id="main"]/div/div[3]//li')
        for li in li_list:
            # a leading "." keeps each query relative to the current node
            name_div = li.xpath('.//span[@class="job-title"]')
            title = name_div.xpath('./span[@class="job-name"]/a/@title').extract_first()
            name = name_div.xpath('./span[@class="job-area-wrapper"]/span/text()').extract_first()
            li_name = title + " " + name


            detail = li.xpath('.//div[@class="primary-wrapper"]/div/@href').extract_first()
            new_url = "https://www.zhipin.com/" + detail
            item =  BoproItem()
            item["name"] = li_name

            yield scrapy.Request(url=new_url,callback=self.detail_parse,meta={"item":item}) # meta carries the item to detail_parse

Using the image pipeline class (ImagesPipeline)

The ImagesPipeline class in scrapy.pipelines.images requests image urls and downloads the images automatically.
To use it, subclass ImagesPipeline and override its methods.

Set the image storage path in settings.py.

# pipelines.py
from scrapy.pipelines.images import ImagesPipeline
import scrapy
class ImageLine(ImagesPipeline):

    # request the image by its url
    def get_media_requests(self, item, info):
        yield scrapy.Request(item["src"][0]) # note: this must be scrapy.Request

    # file name of the image, relative to IMAGES_STORE
    def file_path(self, request, response=None, info=None, *, item=None):
        return item["name"][0] + ".jpg"

    def item_completed(self, results, item, info):
        return item # pass the item on to the next pipeline class

# settings.py
ITEM_PIPELINES = {
   'imagePro.pipelines.ImageLine': 300,
}
IMAGES_STORE = "./data/pic/beauty"

Using middlewares:

Intercepting requests:

UA spoofing: process_request

 def process_request(self, request, spider): # spoof the User-Agent
        request.headers["User-Agent"] = xxx  # xxx: the User-Agent string to send
        return None
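
A common way to fill in the xxx above is to pick a User-Agent at random for every request. The sketch below is one possible shape for such a middleware, not code from the original project: the class name RandomUserAgentMiddleware and the USER_AGENTS list are hypothetical. Since a downloader middleware is an ordinary class, the example can be exercised with a stand-in request object instead of a running Scrapy project.

```python
import random
from types import SimpleNamespace

# Hypothetical pool of browser User-Agent strings.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:93.0) Gecko/20100101 Firefox/93.0",
]


class RandomUserAgentMiddleware:
    """Downloader-middleware sketch: sets a random User-Agent on each request."""

    def __init__(self, user_agents=None):
        self.user_agents = list(user_agents or USER_AGENTS)

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.user_agents)
        return None  # None tells Scrapy to continue processing this request


# Stand-in for a scrapy Request: only the headers mapping is needed here.
fake_request = SimpleNamespace(headers={})
middleware = RandomUserAgentMiddleware()
result = middleware.process_request(fake_request, spider=None)
print(fake_request.headers["User-Agent"])
```

In a real project the class would live in middlewares.py and be enabled through DOWNLOADER_MIDDLEWARES in settings.py.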

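Proxy IP: process_exception

 def process_exception(self, request, exception, spider): # switch to another proxy IP when a request fails
       request.meta["proxy"] = xxx  # xxx: the address of the new proxy
       return request # re-schedule the request so that it is retried through the new proxy

Intercepting responses:

Modify the response data or replace the response object: process_response

 def process_response(self, request, response, spider):
        # HtmlResponse comes from scrapy.http
        # Pick out the responses that need to be replaced:
        # the url identifies the request,
        # and the request identifies its response

        if request.url in spider.href_list: # pages whose content is loaded dynamically
            bro = spider.bro  # selenium browser created by the spider
            bro.get(request.url)
            page_text = bro.page_source
            new_response = HtmlResponse(url=request.url,body=page_text,encoding="utf-8",request=request)
            return new_response
        else:
            return response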
