I. Introduction to downloader middlewares
1. Usage:
Writing a downloader middleware is just like writing a pipeline: define a class, then enable it in settings.py.
2. Default methods of a downloader middleware:
a. process_request(self, request, spider):
Called for each request as it passes through the downloader middleware (useful for rotating the IP and the User-Agent). Returning None lets processing continue normally.
b. process_response(self, request, response, spider):
Called when the downloader has finished the HTTP request and is passing the response back to the engine.
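The two hooks above can be sketched as a minimal middleware skeleton; the class name here is illustrative, and the stubs only show the expected return-value contract:

```python
class SketchDownloaderMiddleware:
    """Minimal downloader-middleware skeleton (illustrative name)."""

    def process_request(self, request, spider):
        # Inspect or mutate the outgoing request here.
        # Return None to let the request continue through the chain.
        return None

    def process_response(self, request, response, spider):
        # Inspect or replace the incoming response here.
        # Must return a response (or a new request) object.
        return response
```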
II. Rotating the USER_AGENT
1. Add a custom USER_AGENTS list to settings.py.
2. Define a class in middlewares.py, override process_request, and set the User-Agent header on the request:
import random

class RandomUserAgent(object):
    def process_request(self, request, spider):
        useragent = random.choice(spider.settings.get('USER_AGENTS'))
        request.headers['User-Agent'] = useragent
3. Enable it in settings.py (downloader middlewares go in DOWNLOADER_MIDDLEWARES, not SPIDER_MIDDLEWARES):
DOWNLOADER_MIDDLEWARES = {
    'myFistScrapy.middlewares.RandomUserAgent': 543,
}
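The middleware logic above can be exercised offline with stub objects standing in for Scrapy's Request and Spider; the USER_AGENTS values below are placeholders:

```python
import random

# Placeholder user-agent strings; in a real project these live in settings.py.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)',
]

class RandomUserAgent(object):
    def process_request(self, request, spider):
        useragent = random.choice(spider.settings.get('USER_AGENTS'))
        request.headers['User-Agent'] = useragent

# Stubs standing in for Scrapy objects, just to exercise the logic.
class StubSpider:
    settings = {'USER_AGENTS': USER_AGENTS}

class StubRequest:
    headers = {}

request = StubRequest()
RandomUserAgent().process_request(request, StubSpider())
print(request.headers['User-Agent'] in USER_AGENTS)  # True
```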
III. Rotating the IP
The steps are the same as for rotating the USER_AGENT, but middlewares.py is written differently for two cases:
a. Free proxies (no username/password required):
import random

class IpProxyDownloadMiddleware(object):
    # PROXIES can be kept in a file and read at runtime, or defined in settings.py.
    PROXIES = [
        'http://156.151.55.8:53624',  # example only; substitute a working proxy
    ]

    def process_request(self, request, spider):
        proxy = random.choice(self.PROXIES)
        request.meta['proxy'] = proxy
b. Paid proxies (username/password required):
import base64

class IpProxyDownloadMiddleware(object):
    def process_request(self, request, spider):
        proxy = 'http://ip:port'             # placeholder
        user_password = 'username:password'  # placeholder
        request.meta['proxy'] = proxy
        user_password = base64.b64encode(user_password.encode('utf-8'))
        # Note the space after 'Basic' - the header is 'Basic <token>'
        request.headers['Proxy-Authorization'] = 'Basic ' + user_password.decode('utf-8')
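The Proxy-Authorization value is standard HTTP Basic authentication, so the encoding step can be verified on its own; the credentials below are placeholders:

```python
import base64

user_password = 'username:password'  # placeholder credentials
token = base64.b64encode(user_password.encode('utf-8')).decode('utf-8')
header_value = 'Basic ' + token
print(header_value)  # Basic dXNlcm5hbWU6cGFzc3dvcmQ=
```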
IV. Testing the proxy IP
Test with code like the following in the spider (httpbin.org/ip echoes the requesting IP back as JSON with an 'origin' key, which is what parse reads):
import scrapy
import json

class IpSpiderSpider(scrapy.Spider):
    name = 'ip_spider'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/ip']

    def parse(self, response):
        origin = json.loads(response.text)['origin']
        print('-' * 20)
        print(origin)
        print('-' * 20)
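The JSON-parsing step in parse can be checked offline against a sample httpbin-style body; the body below is a made-up example using a documentation IP:

```python
import json

# A made-up response body in the shape httpbin.org/ip returns.
sample_body = '{"origin": "203.0.113.7"}'
origin = json.loads(sample_body)['origin']
print(origin)  # 203.0.113.7
```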



