There are many solutions to this and it's a very deep topic. But if you want to apply the logic described in the post, you can use scrapy Downloader Middlewares.
Something like:
```python
import logging

from scrapy.exceptions import IgnoreRequest


class CaptchaMiddleware(object):
    max_retries = 5

    def process_response(self, request, response, spider):
        if not request.meta.get('solve_captcha', False):
            return response  # only solve requests that are marked with meta key
        captcha = find_captcha(response)
        if not captcha:  # it might not have a captcha at all!
            return response
        solved = solve_captcha(captcha)
        if solved:
            request.meta['captcha'] = captcha
            request.meta['solved_captcha'] = solved
            return response
        else:
            # retry page for new captcha
            # prevent endless loop
            if request.meta.get('captcha_retries', 0) >= self.max_retries:
                logging.warning('max retries for captcha reached for {}'.format(request.url))
                raise IgnoreRequest
            request = request.replace(dont_filter=True)
            request.meta['captcha_retries'] = request.meta.get('captcha_retries', 0) + 1
            return request
```

This example will intercept every response and try to solve the captcha. If solving fails, it retries the page to get a fresh captcha; on success, it adds some meta keys carrying the solved captcha values, which the spider callback can read via `response.meta`. (`find_captcha` and `solve_captcha` are placeholders for your own detection and solving logic.)
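For the middleware to run at all, it also has to be activated in your project settings. A minimal sketch, assuming the class lives in `myproject.middlewares` (adjust the path to wherever you actually define it; the priority value is just a reasonable default):

```python
# settings.py (sketch): register the middleware so scrapy runs it.
# 'myproject.middlewares.CaptchaMiddleware' is an assumed module path.
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CaptchaMiddleware': 543,
}
```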
In your spider you can use it like this:
```python
class MySpider(scrapy.Spider):
    def parse(self, response):
        url = ''  # url that requires captcha
        yield Request(url, callback=self.parse_captchad,
                      meta={'solve_captcha': True},
                      errback=self.parse_fail)

    def parse_captchad(self, response):
        solved = response.meta['solved_captcha']
        # do stuff

    def parse_fail(self, response):
        # failed to retrieve captcha in 5 tries :(
        # do stuff
```
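`find_captcha` and `solve_captcha` are left undefined above. A minimal sketch of what they might look like, assuming the captcha is served as an `<img class="captcha">` tag (the CSS selector is an assumption, match it to the actual page markup) and that you plug in whatever solver you use:

```python
def find_captcha(response):
    # Return the captcha image URL if the page contains one, else None.
    # 'img.captcha::attr(src)' is a hypothetical selector.
    return response.css('img.captcha::attr(src)').get()


def solve_captcha(captcha_url):
    # Hand the image off to your solver (an OCR library, a solving
    # service, manual input, ...); return the solved text, or None
    # on failure so the middleware retries.
    raise NotImplementedError
```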


