Basically, you have plenty of tools to choose from:
- scrapy
- beautifulsoup
- lxml
- mechanize
- requests (and grequests)
- selenium
- ghost.py
These tools have different purposes, but they can be mixed together depending on the task.
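For instance, when no browser interaction is needed, the page source can be parsed with beautifulsoup alone. A minimal sketch, assuming `beautifulsoup4` is installed; the HTML fragment and class names here are invented for illustration (in practice you would fetch the page with `requests.get(url).text` first):

```python
from bs4 import BeautifulSoup

# Invented HTML fragment standing in for a downloaded page.
html = """
<html><body>
  <div class="plan-info"><span class="primary">Plan A</span></div>
  <div class="plan-info"><span class="primary">Plan B</span></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# CSS selectors pull out the text of every matching element.
names = [el.get_text() for el in soup.select(".plan-info .primary")]
print(names)  # ['Plan A', 'Plan B']
```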
Scrapy is a powerful and very smart tool for crawling web sites and extracting data. But when it comes to manipulating the page: clicking buttons, filling out forms, it becomes more complicated:
- sometimes it is easy to simulate filling/submitting a form by making the underlying form action directly in scrapy
- sometimes you have to use other tools to help with scraping, like mechanize or selenium
If you make your question more specific, it will be easier to say which tool you should use or choose.
Let's take a look at an example of an interesting scrapy & selenium mix. Here, the selenium task is to click a button and provide data for the scrapy items:

```python
import time

from scrapy.item import Item, Field
from scrapy.spider import BaseSpider
from selenium import webdriver


class ElyseAvenueItem(Item):
    name = Field()


class ElyseAvenueSpider(BaseSpider):
    name = "elyse"
    allowed_domains = ["ehealthinsurance.com"]
    start_urls = [
        'http://www.ehealthinsurance.com/individual-family-health-insurance?action=changeCensus&census.zipCode=48341&census.primary.gender=MALE&census.requestEffectiveDate=06/01/2013&census.primary.month=12&census.primary.day=01&census.primary.year=1971']

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)

        # selenium does the clicking; the "go" button triggers the search
        el = self.driver.find_element_by_xpath("//input[contains(@class,'btn go-btn')]")
        if el:
            el.click()

        # wait for the results to load
        time.sleep(10)

        # scrapy items are filled from the rendered page
        plans = self.driver.find_elements_by_class_name("plan-info")
        for plan in plans:
            item = ElyseAvenueItem()
            item['name'] = plan.find_element_by_class_name('primary').text
            yield item

        self.driver.close()
```

UPDATE:
Here's an example of how to use scrapy in your case:

```python
from scrapy.http import FormRequest
from scrapy.item import Item, Field
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider


class AcrisItem(Item):
    borough = Field()
    block = Field()
    doc_type_name = Field()


class AcrisSpider(BaseSpider):
    name = "acris"
    allowed_domains = ["a836-acris.nyc.gov"]
    start_urls = ['http://a836-acris.nyc.gov/DS/documentSearch/documentType']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)

        document_classes = hxs.select('//select[@name="combox_doc_doctype"]/option')
        form_token = hxs.select('//input[@name="__RequestVerificationToken"]/@value').extract()[0]
        for document_class in document_classes:
            if document_class:
                doc_type = document_class.select('.//@value').extract()[0]
                doc_type_name = document_class.select('.//text()').extract()[0]
                formdata = {'__RequestVerificationToken': form_token,
                            'hid_selectdate': '7',
                            'hid_doctype': doc_type,
                            'hid_doctype_name': doc_type_name,
                            'hid_max_rows': '10',
                            'hid_ISIntranet': 'N',
                            'hid_SearchType': 'DOCTYPE',
                            'hid_page': '1',
                            'hid_borough': '0',
                            'hid_borough_name': 'ALL BOROUGHS',
                            'hid_ReqID': '',
                            'hid_sort': '',
                            'hid_datefromm': '',
                            'hid_datefromd': '',
                            'hid_datefromy': '',
                            'hid_datetom': '',
                            'hid_datetod': '',
                            'hid_datetoy': ''}
                yield FormRequest(url="http://a836-acris.nyc.gov/DS/documentSearch/documentTypeResult",
                                  method="POST",
                                  formdata=formdata,
                                  callback=self.parse_page,
                                  meta={'doc_type_name': doc_type_name})

    def parse_page(self, response):
        hxs = HtmlXPathSelector(response)

        rows = hxs.select('//form[@name="DATA"]/table/tbody/tr[2]/td/table/tr')
        for row in rows:
            item = AcrisItem()
            borough = row.select('.//td[2]/div/font/text()').extract()
            block = row.select('.//td[3]/div/font/text()').extract()
            if borough and block:
                item['borough'] = borough[0]
                item['block'] = block[0]
                item['doc_type_name'] = response.meta['doc_type_name']
                yield item
```

Save the spider into `spider.py` and run it via `scrapy runspider spider.py -o output.json`, and in `output.json` you will see:

```
{"doc_type_name": "CONDEMNATION PROCEEDINGS ", "borough": "Borough", "block": "Block"}
{"doc_type_name": "CERTIFICATE OF REDUCTION ", "borough": "Borough", "block": "Block"}
{"doc_type_name": "COLLATERAL MORTGAGE ", "borough": "Borough", "block": "Block"}
{"doc_type_name": "CERTIFIED COPY OF WILL ", "borough": "Borough", "block": "Block"}
{"doc_type_name": "CONFIRMATORY DEED ", "borough": "Borough", "block": "Block"}
{"doc_type_name": "CERT NONATTCHMENT FED TAX LIEN ", "borough": "Borough", "block": "Block"}
...
```

Hope that helps.



