
Python Daily Learning Summary (5)



1. Scrapy crawlers

1. Installing the Scrapy framework:

(1) What is Scrapy: Scrapy is an open-source Python framework for writing web crawlers (spiders).

(2) A low-pitfall way to install it:
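The notes do not record the actual steps. On a recent Python, installing from PyPI with pip is usually enough (on older Windows setups the Twisted dependency sometimes required a separately downloaded wheel):

```shell
# install Scrapy and its dependencies from PyPI
pip install scrapy
# verify the installation
scrapy version
```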

2. Common Scrapy commands in practice:

Global commands (listed by scrapy -h): fetch (download and print a page); runspider (run a standalone spider) ......

Project commands:
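A sketch of the commands the two lists cover (run scrapy -h anywhere for the global list, and inside a project directory for the full list, which is authoritative for your installed version):

```shell
# Global commands (usable anywhere):
scrapy fetch <url>          # download a page and print its body
scrapy runspider spider.py  # run a standalone spider file
scrapy shell <url>          # interactive scraping console
scrapy startproject <name>  # create a new project
scrapy version

# Project commands (only available inside a project directory):
scrapy crawl <spider>       # run a spider by name
scrapy list                 # list the project's spiders
scrapy check                # run contract checks
scrapy genspider <name> <domain>  # generate a spider skeleton
```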

3. Scrapy spiders:

The first Scrapy spider, using qiushibaike.com (Qiushi Baike) as an example:

scrapy startproject name (create a new crawler project)

scrapy crawl name (run a spider)

4. Automatic crawling with Scrapy in practice:

(1) Automatic crawling of qiushibaike.com (CrawlSpider):

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.http import Request
from qsauto.items import QsautoItem


class Qiushi1Spider(CrawlSpider):
    name = 'qiushi1'
    allowed_domains = ['qiushibaike.com']
    # start_urls is replaced by start_requests() below so that a
    # browser-like User-Agent header can be sent with the first request
    # start_urls = ['http://qiushibaike.com/']

    rules = (
        Rule(LinkExtractor(allow=r'article'), callback='parse_item', follow=True),
    )

    def start_requests(self):
        ua = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) "
                            "AppleWebKit/537.36 (KHTML, like Gecko) "
                            "Chrome/49.0.2623.22 SE 2.X MetaSr 1.0"}
        yield Request('http://www.qiushibaike.com/', headers=ua)

    def parse_item(self, response):
        i = QsautoItem()  # instantiate the item class
        i["content"] = response.xpath("//div[@class='content']/span/text()").extract()
        i["link"] = response.xpath("//a[@class='contentHerf']/@href").extract()
        print(i["content"])
        print(i["link"])
        return i
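The extraction step in parse_item can be mirrored without Scrapy at all. A minimal standard-library sketch of the same XPath logic (the HTML snippet below is invented for illustration and need not match qiushibaike.com's real markup):

```python
# Scrapy-free sketch of what parse_item extracts, using only the stdlib.
import xml.etree.ElementTree as ET

html = """
<html><body>
  <div class="content"><span>First joke text</span></div>
  <a class="contentHerf" href="/article/1">link</a>
  <div class="content"><span>Second joke text</span></div>
  <a class="contentHerf" href="/article/2">link</a>
</body></html>
"""

root = ET.fromstring(html)
# equivalent of response.xpath("//div[@class='content']/span/text()").extract()
contents = [span.text for span in root.findall(".//div[@class='content']/span")]
# equivalent of response.xpath("//a[@class='contentHerf']/@href").extract()
links = [a.get("href") for a in root.findall(".//a[@class='contentHerf']")]
print(contents)  # ['First joke text', 'Second joke text']
print(links)     # ['/article/1', '/article/2']
```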

2. Simulated login crawling in practice

(1) Simulated login crawling in practice (douban.com):
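The notes leave this section empty. A minimal sketch of the usual approach with Scrapy's FormRequest.from_response, which fills and submits the login form found in the response; the login URL and form field names below are illustrative assumptions, not verified against douban.com's current login page:

```python
import scrapy
from scrapy.http import FormRequest

class DbSpider(scrapy.Spider):
    name = 'db'
    allowed_domains = ['douban.com']
    # assumed login URL, for illustration only
    start_urls = ['https://accounts.douban.com/login']

    def parse(self, response):
        # fill the login form found in the page and submit it;
        # the form field names here are hypothetical
        yield FormRequest.from_response(
            response,
            formdata={"form_email": "user@example.com",
                      "form_password": "secret"},
            callback=self.after_login,
        )

    def after_login(self, response):
        # pages that require a logged-in session can be parsed here
        print(response.url)
```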


3. Dangdang crawler in practice

(1) Dangdang mall crawler in practice (writing the scraped content to a database):

import scrapy
from dangdang.items import DangdangItem
from scrapy.http import Request

class DdSpider(scrapy.Spider):
    name = 'dd'
    allowed_domains = ['dangdang.com']
    start_urls = ['http://dangdang.com/']

    def parse(self, response):
        item = DangdangItem()
        item["title"] = response.xpath("//a[@class='pic']/@title").extract()
        item["link"] = response.xpath("//a[@class='pic']/@href").extract()
        item["comment"] = response.xpath("//a[@name='_1_p']/text()").extract()
        yield item
        # follow the remaining category pages (the page range is an
        # assumption; the original notes left range() empty)
        for i in range(2, 81):
            url = "http://category.dangdang.com/pg" + str(i) + "-cp01.54.06.00.00.00.html"
            yield Request(url, callback=self.parse)
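The follow-up requests above just substitute a page number into a fixed URL template. That pagination logic as a standalone sketch (the page numbers passed in are arbitrary; the original notes did not record the real range):

```python
# build the category page URLs that the spider schedules
def page_urls(first_page, last_page):
    template = "http://category.dangdang.com/pg{}-cp01.54.06.00.00.00.html"
    return [template.format(i) for i in range(first_page, last_page + 1)]

urls = page_urls(2, 4)
print(urls[0])  # http://category.dangdang.com/pg2-cp01.54.06.00.00.00.html
```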

pipelines.py:

class DangdangPipeline:
    def process_item(self, item, spider):
        # title, link and comment are parallel lists; walk them together
        for i in range(len(item["title"])):
            title = item["title"][i]
            link = item["link"][i]
            comment = item["comment"][i]
            print(title)
            print(link)
            print(comment)
        return item
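The pipeline's loop over three parallel lists can be tried out on a plain dict, independent of Scrapy (the field values are invented for illustration, and the field names are normalized to title/link/comment):

```python
# stand-alone sketch of the pipeline's parallel-list walk
def process_item(item):
    rows = []
    for i in range(len(item["title"])):
        rows.append((item["title"][i], item["link"][i], item["comment"][i]))
    return rows

item = {"title": ["Book A", "Book B"],
        "link": ["http://a", "http://b"],
        "comment": ["100", "200"]}
rows = process_item(item)
print(rows[0])  # ('Book A', 'http://a', '100')
```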

items.py:

class DangdangItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    link = scrapy.Field()
    comment = scrapy.Field()

Reprinted from www.mshxw.com; original at https://www.mshxw.com/it/715298.html