Basically, you have plenty of tools to choose from:
- scrapy
- beautifulsoup
- lxml
- mechanize
- requests (and grequests)
- selenium
- ghost.py
These tools have different purposes, but they can be mixed together depending on the task.
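For instance, when no browser interaction is needed, the page source can be parsed with beautifulsoup alone. A minimal sketch, assuming `beautifulsoup4` is installed; the HTML fragment and class names here are invented for illustration (in practice you would fetch the page with `requests.get(url).text` first):

```python
from bs4 import BeautifulSoup

# Invented HTML fragment standing in for a downloaded page.
html = """
<html><body>
  <div class="plan-info"><span class="primary">Plan A</span></div>
  <div class="plan-info"><span class="primary">Plan B</span></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# CSS selectors pull out the text of every matching element.
names = [el.get_text() for el in soup.select(".plan-info .primary")]
print(names)  # ['Plan A', 'Plan B']
```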
Scrapy is a powerful and very smart tool for crawling web sites and extracting data. But when it comes to manipulating the page: clicking buttons, filling out forms, it becomes more complicated:
- sometimes it is easy to simulate filling/submitting a form by making the underlying form action directly in scrapy
- sometimes you have to use other tools to help with scraping, like mechanize or selenium
If you make your question more specific, it will be easier to say which tool you should use or choose.
Let's take a look at an example of an interesting scrapy & selenium mix. Here, the selenium task is to click a button and provide data for the scrapy items:

```python
import time

from scrapy.item import Item, Field
from scrapy.spider import BaseSpider
from selenium import webdriver


class ElyseAvenueItem(Item):
    name = Field()


class ElyseAvenueSpider(BaseSpider):
    name = "elyse"
    allowed_domains = ["ehealthinsurance.com"]
    start_urls = [
        'http://www.ehealthinsurance.com/individual-family-health-insurance?action=changeCensus&census.zipCode=48341&census.primary.gender=MALE&census.requestEffectiveDate=06/01/2013&census.primary.month=12&census.primary.day=01&census.primary.year=1971']

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)

        # selenium does the clicking; the "go" button triggers the search
        el = self.driver.find_element_by_xpath("//input[contains(@class,'btn go-btn')]")
        if el:
            el.click()

        # wait for the results to load
        time.sleep(10)

        # scrapy items are filled from the rendered page
        plans = self.driver.find_elements_by_class_name("plan-info")
        for plan in plans:
            item = ElyseAvenueItem()
            item['name'] = plan.find_element_by_class_name('primary').text
            yield item

        self.driver.close()
```

UPDATE:
Here's an example of how to use scrapy in your case:

```python
from scrapy.http import FormRequest
from scrapy.item import Item, Field
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider


class AcrisItem(Item):
    borough = Field()
    block = Field()
    doc_type_name = Field()


class AcrisSpider(BaseSpider):
    name = "acris"
    allowed_domains = ["a836-acris.nyc.gov"]
    start_urls = ['http://a836-acris.nyc.gov/DS/documentSearch/documentType']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)

        document_classes = hxs.select('//select[@name="combox_doc_doctype"]/option')
        form_token = hxs.select('//input[@name="__RequestVerificationToken"]/@value').extract()[0]
        for document_class in document_classes:
            if document_class:
                doc_type = document_class.select('.//@value').extract()[0]
                doc_type_name = document_class.select('.//text()').extract()[0]
                formdata = {'__RequestVerificationToken': form_token,
                            'hid_selectdate': '7',
                            'hid_doctype': doc_type,
                            'hid_doctype_name': doc_type_name,
                            'hid_max_rows': '10',
                            'hid_ISIntranet': 'N',
                            'hid_SearchType': 'DOCTYPE',
                            'hid_page': '1',
                            'hid_borough': '0',
                            'hid_borough_name': 'ALL BOROUGHS',
                            'hid_ReqID': '',
                            'hid_sort': '',
                            'hid_datefromm': '',
                            'hid_datefromd': '',
                            'hid_datefromy': '',
                            'hid_datetom': '',
                            'hid_datetod': '',
                            'hid_datetoy': ''}
                yield FormRequest(url="http://a836-acris.nyc.gov/DS/documentSearch/documentTypeResult",
                                  method="POST",
                                  formdata=formdata,
                                  callback=self.parse_page,
                                  meta={'doc_type_name': doc_type_name})

    def parse_page(self, response):
        hxs = HtmlXPathSelector(response)

        rows = hxs.select('//form[@name="DATA"]/table/tbody/tr[2]/td/table/tr')
        for row in rows:
            item = AcrisItem()
            borough = row.select('.//td[2]/div/font/text()').extract()
            block = row.select('.//td[3]/div/font/text()').extract()
            if borough and block:
                item['borough'] = borough[0]
                item['block'] = block[0]
                item['doc_type_name'] = response.meta['doc_type_name']
                yield item
```

Save the spider into `spider.py` and run it via `scrapy runspider spider.py -o output.json`, and in `output.json` you will see:

```
{"doc_type_name": "CONDEMNATION PROCEEDINGS ", "borough": "Borough", "block": "Block"}
{"doc_type_name": "CERTIFICATE OF REDUCTION ", "borough": "Borough", "block": "Block"}
{"doc_type_name": "COLLATERAL MORTGAGE ", "borough": "Borough", "block": "Block"}
{"doc_type_name": "CERTIFIED COPY OF WILL ", "borough": "Borough", "block": "Block"}
{"doc_type_name": "CONFIRMATORY DEED ", "borough": "Borough", "block": "Block"}
{"doc_type_name": "CERT NONATTCHMENT FED TAX LIEN ", "borough": "Borough", "block": "Block"}
...
```

Hope that helps.



