How to write a DownloadHandler for Scrapy that makes requests through socksipy?

After running pip install txsocksx, I needed to replace Scrapy's ScrapyAgent with txsocksx.http.SOCKS5Agent.

I simply copied the code for HTTP11DownloadHandler and ScrapyAgent from scrapy/core/downloader/handlers/http.py, subclassed both, and wrote this code:

# Import paths as found in the Scrapy releases this answer targets;
# verify them against your installed version.
from twisted.internet import reactor
from twisted.internet.endpoints import TCP4ClientEndpoint
from txsocksx.http import SOCKS5Agent

from scrapy.core.downloader.handlers.http11 import HTTP11DownloadHandler, ScrapyAgent
from scrapy.core.downloader.webclient import _parse


class TorProxyDownloadHandler(HTTP11DownloadHandler):

    def download_request(self, request, spider):
        """Return a deferred for the HTTP download"""
        agent = ScrapyTorAgent(contextFactory=self._contextFactory, pool=self._pool)
        return agent.download_request(request)


class ScrapyTorAgent(ScrapyAgent):

    def _get_agent(self, request, timeout):
        bindaddress = request.meta.get('bindaddress') or self._bindAddress
        proxy = request.meta.get('proxy')
        if proxy:
            _, _, proxyHost, proxyPort, proxyParams = _parse(proxy)
            scheme = _parse(request.url)[0]
            omitConnectTunnel = proxyParams.find('noconnect') >= 0
            if scheme == 'https' and not omitConnectTunnel:
                # HTTPS through a proxy: keep Scrapy's CONNECT tunnelling agent.
                proxyConf = (proxyHost, proxyPort,
                             request.headers.get('Proxy-Authorization', None))
                return self._TunnelingAgent(reactor, proxyConf,
                                            contextFactory=self._contextFactory,
                                            connectTimeout=timeout,
                                            bindAddress=bindaddress, pool=self._pool)
            else:
                # Everything else goes to the proxy over SOCKS5.
                _, _, host, port, proxyParams = _parse(request.url)
                proxyEndpoint = TCP4ClientEndpoint(reactor, proxyHost, proxyPort,
                                                   timeout=timeout,
                                                   bindAddress=bindaddress)
                agent = SOCKS5Agent(reactor, proxyEndpoint=proxyEndpoint)
                return agent

        # No proxy set on the request: fall back to the regular agent.
        return self._Agent(reactor, contextFactory=self._contextFactory,
                           connectTimeout=timeout, bindAddress=bindaddress,
                           pool=self._pool)
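
The branching inside _get_agent may be easier to follow in isolation. What follows is a minimal sketch, not part of the original answer, that mirrors the same decision using only the standard library. The function name choose_agent and its return labels are hypothetical, introduced here purely for illustration; it creates no Twisted agents.

```python
from urllib.parse import urlparse


def choose_agent(request_url, proxy_url=None):
    """Return a label for the agent type the _get_agent logic would pick.

    Hypothetical helper for illustration only.
    """
    if not proxy_url:
        # No proxy in request.meta: the regular, direct agent is used.
        return "direct"

    # Mirror the source: whatever follows host:port in the proxy URL is
    # searched for the 'noconnect' marker.
    parsed = urlparse(proxy_url)
    proxy_params = "".join((parsed.path, parsed.params, parsed.query))
    omit_connect_tunnel = "noconnect" in proxy_params

    scheme = urlparse(request_url).scheme
    if scheme == "https" and not omit_connect_tunnel:
        # HTTPS requests keep the HTTP CONNECT tunnelling agent.
        return "connect-tunnel"

    # All remaining proxied requests are sent through the SOCKS5 agent.
    return "socks5"
```

Read this way, only proxied plain-HTTP requests (or HTTPS requests whose proxy URL carries noconnect) reach SOCKS5Agent; proxied HTTPS requests still take the CONNECT tunnelling branch.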

In settings.py, register the handler as follows:

DOWNLOAD_HANDLERS = {
    'http': 'crawler.http.TorProxyDownloadHandler'
}

Scrapy now sends its requests through a SOCKS proxy such as Tor.
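
The handler only takes the SOCKS branch when request.meta carries a 'proxy' value, so each request still needs one. One way to supply it, sketched here under stated assumptions rather than taken from the original answer, is a small downloader middleware. The class name TorProxyMiddleware and the constant TOR_SOCKS_PROXY are hypothetical; the address assumes a Tor client listening locally on its default SOCKS port, 9050. The URL scheme is a placeholder, since the handler code discards it and uses only the host and port.

```python
# Assumption: a local Tor client on its default SOCKS port (9050).
TOR_SOCKS_PROXY = "http://127.0.0.1:9050"


class TorProxyMiddleware:
    """Hypothetical downloader middleware that tags requests with a proxy."""

    def process_request(self, request, spider):
        # Keep any proxy a spider has already set on the request.
        request.meta.setdefault("proxy", TOR_SOCKS_PROXY)
        # Returning None lets Scrapy continue handling the request.
        return None
```

To try it, the class would be listed under DOWNLOADER_MIDDLEWARES in settings.py using whatever module path it lives at in your project. A spider can equally set meta={'proxy': ...} on individual requests instead.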



When reposting, please credit the source: www.mshxw.com
Article URL: https://www.mshxw.com/it/381148.html