
Following links with the Scrapy web crawler framework



CrawlSpider inherits from BaseSpider. It simply adds rules for extracting and following links. If those rules are not flexible enough for your case, use BaseSpider instead:

import urlparse

from scrapy import log
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider


class USpider(BaseSpider):
    """My spider."""
    start_urls = ['http://www.amazon.com/s/?url=search-alias%3Dapparel&sort=relevance-fs-browse-rank']
    allowed_domains = ['amazon.com']

    def parse(self, response):
        '''Parse main category search page and extract subcategory search links.'''
        self.log('Downloaded category search page.', log.DEBUG)
        if response.meta['depth'] > 5:
            self.log('Categories depth limit reached (recursive links?). Stopping further following.', log.WARNING)
            return
        hxs = HtmlXPathSelector(response)
        # NOTE: the attribute test inside span[@] was lost when the original post was published.
        subcategories = hxs.select("//div[@id='refinements']/*[starts-with(.,'Department')]"
                                   "/following-sibling::ul[1]/li/a[span[@]]/@href").extract()
        for subcategory in subcategories:
            subcategorySearchlink = urlparse.urljoin(response.url, subcategory)
            yield Request(subcategorySearchlink, callback=self.parseSubcategory)

    def parseSubcategory(self, response):
        '''Parse subcategory search page and extract item links.'''
        hxs = HtmlXPathSelector(response)
        # NOTE: the attribute test inside a[@] was lost in the original post.
        for itemlink in hxs.select('//a[@]/@href').extract():
            itemlink = urlparse.urljoin(response.url, itemlink)
            self.log('Requesting item page: ' + itemlink, log.DEBUG)
            yield Request(itemlink, callback=self.parseItem)
        try:
            nextPagelink = hxs.select("//a[@id='pagnNextlink']/@href").extract()[0]
            nextPagelink = urlparse.urljoin(response.url, nextPagelink)
            self.log('\nGoing to next search page: ' + nextPagelink + '\n', log.DEBUG)
            yield Request(nextPagelink, callback=self.parseSubcategory)
        except IndexError:
            # No "next page" link: this category has been fully crawled.
            self.log('Whole category parsed: ' + response.url, log.DEBUG)

    def parseItem(self, response):
        '''Parse item page and extract product info.'''
        hxs = HtmlXPathSelector(response)
        item = UItem()
        # NOTE: the attribute test inside div[@] was lost in the original post;
        # extractText is a helper method not shown here.
        item['brand'] = self.extractText("//div[@]/span[1]/a[1]", hxs)
        item['title'] = self.extractText("//span[@id='btAsinTitle']", hxs)
        ...

If even BaseSpider's start_urls is not flexible enough for you, override the start_requests method.



Please credit the source when republishing: reproduced from www.mshxw.com
Original URL: https://www.mshxw.com/it/660858.html

Copyright (c) 2021-2022 MSHXW.COM
