Python data scraping with Scrapy

Basically, you have plenty of tools to choose from:

  • scrapy
  • beautifulsoup
  • lxml
  • mechanize
  • requests (and grequests)
  • selenium
  • ghost.py

These tools serve different purposes, but they can be mixed together depending on the task.

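Scrapy is a powerful and very smart tool for crawling web sites and extracting data. But when it comes to manipulating the page (clicking buttons, filling in forms), things get more complicated:

  • sometimes it is easy to simulate filling in and submitting a form by making the underlying form request directly in scrapy
  • sometimes you have to use other tools to help with the scraping, such as mechanize or selenium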
If you make your question more specific, it will be easier to tell which tools you should use or choose from.

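Let's look at an interesting example of a scrapy & selenium mix. Here, selenium's task is to click the button and provide the data for the scrapy items: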

    # Written for the Scrapy and Selenium APIs of the time (BaseSpider,
    # find_element_by_*); later releases provide scrapy.Spider and
    # driver.find_element(By.XPATH, ...) instead.
    import time

    from scrapy.item import Item, Field
    from scrapy.spider import BaseSpider
    from selenium import webdriver


    class ElyseAvenueItem(Item):
        name = Field()


    class ElyseAvenueSpider(BaseSpider):
        name = "elyse"
        allowed_domains = ["ehealthinsurance.com"]
        start_urls = [
            'http://www.ehealthinsurance.com/individual-family-health-insurance?action=changeCensus&census.zipCode=48341&census.primary.gender=MALE&census.requestEffectiveDate=06/01/2013&census.primary.month=12&census.primary.day=01&census.primary.year=1971']

        def __init__(self, *args, **kwargs):
            super(ElyseAvenueSpider, self).__init__(*args, **kwargs)
            # One browser instance is shared by all responses of this spider.
            self.driver = webdriver.Firefox()

        def parse(self, response):
            # Load the same URL in the browser so the page's JavaScript runs.
            self.driver.get(response.url)
            el = self.driver.find_element_by_xpath("//input[contains(@class,'btn go-btn')]")
            if el:
                el.click()

            # Give the page time to load the results after the click.
            time.sleep(10)

            plans = self.driver.find_elements_by_class_name("plan-info")
            for plan in plans:
                item = ElyseAvenueItem()
                item['name'] = plan.find_element_by_class_name('primary').text
                yield item

            self.driver.close()
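
The fixed time.sleep(10) in this spider waits the full ten seconds even when the results appear sooner, and it fails when they take longer. A common alternative is to poll for a condition with a timeout; Selenium ships its own WebDriverWait for this purpose. The helper below is a minimal, library-free sketch of the same idea. wait_until, its parameters and fake_find_plans are hypothetical names, not part of Scrapy or Selenium.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Poll condition() until it returns a truthy value or the timeout expires.

    Returns the truthy value, or raises TimeoutError.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(interval)


# Stand-in for a page whose results only show up on the third check.
attempts = []


def fake_find_plans():
    attempts.append(1)
    return ["plan A", "plan B"] if len(attempts) >= 3 else []


plans = wait_until(fake_find_plans, timeout=2.0, interval=0.01)
print(plans)
```

In the spider, the condition could be a lambda that calls self.driver.find_elements_by_class_name("plan-info"), which returns an empty list until the plans have loaded.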

UPDATE:

Here is an example of how to use scrapy in your case:

    # Written for the Scrapy API of the time (BaseSpider, HtmlXPathSelector);
    # later releases provide scrapy.Spider and response.xpath() instead.
    from scrapy.http import FormRequest
    from scrapy.item import Item, Field
    from scrapy.selector import HtmlXPathSelector
    from scrapy.spider import BaseSpider


    class AcrisItem(Item):
        borough = Field()
        block = Field()
        doc_type_name = Field()


    class AcrisSpider(BaseSpider):
        name = "acris"
        allowed_domains = ["a836-acris.nyc.gov"]
        start_urls = ['http://a836-acris.nyc.gov/DS/documentSearch/documentType']

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            document_classes = hxs.select('//select[@name="combox_doc_doctype"]/option')

            # The search form carries a verification token that has to be posted back.
            form_token = hxs.select('//input[@name="__RequestVerificationToken"]/@value').extract()[0]

            # Submit one search per document type offered in the drop-down.
            for document_class in document_classes:
                if document_class:
                    doc_type = document_class.select('.//@value').extract()[0]
                    doc_type_name = document_class.select('.//text()').extract()[0]
                    formdata = {'__RequestVerificationToken': form_token,
                                'hid_selectdate': '7',
                                'hid_doctype': doc_type,
                                'hid_doctype_name': doc_type_name,
                                'hid_max_rows': '10',
                                'hid_ISIntranet': 'N',
                                'hid_SearchType': 'DOCTYPE',
                                'hid_page': '1',
                                'hid_borough': '0',
                                'hid_borough_name': 'ALL BOROUGHS',
                                'hid_ReqID': '',
                                'hid_sort': '',
                                'hid_datefromm': '',
                                'hid_datefromd': '',
                                'hid_datefromy': '',
                                'hid_datetom': '',
                                'hid_datetod': '',
                                'hid_datetoy': '', }
                    yield FormRequest(url="http://a836-acris.nyc.gov/DS/documentSearch/documentTypeResult",
                                      method="POST",
                                      formdata=formdata,
                                      callback=self.parse_page,
                                      meta={'doc_type_name': doc_type_name})

        def parse_page(self, response):
            hxs = HtmlXPathSelector(response)

            rows = hxs.select('//form[@name="DATA"]/table/tbody/tr[2]/td/table/tr')
            for row in rows:
                item = AcrisItem()
                borough = row.select('.//td[2]/div/font/text()').extract()
                block = row.select('.//td[3]/div/font/text()').extract()

                # Skip rows that do not contain both cells.
                if borough and block:
                    item['borough'] = borough[0]
                    item['block'] = block[0]
                    item['doc_type_name'] = response.meta['doc_type_name']

                    yield item

Save it as spider.py and run it with scrapy runspider spider.py -o output.json. In output.json you will see:

    {"doc_type_name": "CONDEMNATION PROCEEDINGS ", "borough": "Borough", "block": "Block"}
    {"doc_type_name": "CERTIFICATE OF REDUCTION ", "borough": "Borough", "block": "Block"}
    {"doc_type_name": "COLLATERAL MORTGAGE ", "borough": "Borough", "block": "Block"}
    {"doc_type_name": "CERTIFIED COPY OF WILL ", "borough": "Borough", "block": "Block"}
    {"doc_type_name": "CONFIRMATORY DEED ", "borough": "Borough", "block": "Block"}
    {"doc_type_name": "CERT NONATTCHMENT FED TAX LIEN ", "borough": "Borough", "block": "Block"}
    ...
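
Two things stand out in this output: the doc_type_name values carry trailing whitespace, and the rows shown hold the literal values "Borough" and "Block", which suggests that the table's header row is being scraped along with the data. One possible post-processing step is sketched below. clean_items is a hypothetical helper, the second sample row is invented, and treating "Borough"/"Block" as a header row is an assumption based only on the sample above.

```python
def clean_items(items):
    """Strip whitespace from string fields and drop rows that look like table headers."""
    cleaned = []
    for item in items:
        row = {key: value.strip() if isinstance(value, str) else value
               for key, value in item.items()}
        # Assumption: a row reading "Borough"/"Block" is the header, not data.
        if row.get("borough") == "Borough" and row.get("block") == "Block":
            continue
        cleaned.append(row)
    return cleaned


sample = [
    {"doc_type_name": "COLLATERAL MORTGAGE ", "borough": "Borough", "block": "Block"},
    # Made-up data row, for illustration only.
    {"doc_type_name": "COLLATERAL MORTGAGE ", "borough": "1", "block": "123"},
]
print(clean_items(sample))
```

The same check could also live in parse_page itself, so that header rows are never yielded in the first place.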

Hope that helps.


