
NetEase Job Listings Scraper: Daily Exercise (4)



A worked example: scraping NetEase's job listings with Scrapy.

1. Install Scrapy

Scrapy depends on the following libraries (PyWin32 is needed on Windows only):

lxml

pyOpenSSL

Twisted

PyWin32

After installing the libraries above, install Scrapy itself:

pip install Scrapy

2. Create the project

scrapy startproject wangyi
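The command generates a project skeleton along these lines (file roles noted against the later steps of this exercise):

```shell
wangyi/
├── scrapy.cfg          # deploy configuration
└── wangyi/
    ├── __init__.py
    ├── items.py        # step 3: the item model
    ├── middlewares.py
    ├── pipelines.py    # step 6: saving the data
    ├── settings.py     # request headers, ITEM_PIPELINES
    └── spiders/        # step 4: job.py is generated here
```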

3. Model the items

Open items.py and define a model for the fields to be scraped.

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy

class WangyiItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()      # job title
    link = scrapy.Field()      # detail-page link
    depart = scrapy.Field()    # department
    category = scrapy.Field()  # job category
    type = scrapy.Field()      # employment type
    address = scrapy.Field()   # work location
    num = scrapy.Field()       # number of openings
    date = scrapy.Field()      # publication date
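As a side note, a Scrapy Item behaves much like a dict restricted to its declared fields. A minimal stand-in (not Scrapy's actual implementation) that mimics this behavior without Scrapy installed:

```python
# Sketch: an Item acts like a dict that rejects undeclared keys.
class FakeItem(dict):
    fields = {'name', 'link', 'depart', 'category',
              'type', 'address', 'num', 'date'}

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f'{key} is not a declared field')
        super().__setitem__(key, value)

item = FakeItem()
item['name'] = 'Backend Engineer'
item['num'] = 3
# item['salary'] = 10000  # would raise KeyError: not a declared field
```

This is why `dict(item)` works later in the pipeline: the item already stores its data as key/value pairs.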

4. Generate the spider

From inside the wangyi project directory, run:

scrapy genspider job 163.com

5. Write the spider file job.py

'''
        enumerate means to count or list off.
        enumerate takes any iterable (e.g. a list or a string) as its argument.
        It is most often used in for loops to get a running count: it yields
        both the index and the value, so use it whenever you need the index
        and the value at the same time.
        enumerate() returns an enumerate object.
'''
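A quick illustration of the behavior described above:

```python
seasons = ['spring', 'summer', 'fall']

# enumerate yields (index, value) pairs...
pairs = list(enumerate(seasons))
# ...and accepts an optional start index:
offset = list(enumerate(seasons, start=1))
```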

import scrapy

from wangyi.items import WangyiItem

class JobSpider(scrapy.Spider):
    name = 'job'
    # 2. check the allowed domain
    allowed_domains = ['163.com']
    # 1. replace the original start URL
    # start_urls = ['http://163.com/']
    start_urls = ['https://hr.163.com/position/list.do']

    def parse(self, response):
        # 3. extract the data
        # get the list of all job-posting nodes:
        # copy the XPath of the first tr row from the browser and paste it
        # between the quotes
        # ... (the rest of parse() is truncated in the original)
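Since the body of parse() is cut off above, here is a sketch of the row-extraction logic it describes: enumerate the rows of the job table and skip the header row. The HTML snippet, the ElementTree parsing (standing in for response.xpath), and the field mapping are illustrative assumptions, not the author's exact code:

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of the job-listing table on the page.
HTML = """<table>
<tr><th>Title</th><th>Department</th></tr>
<tr><td>Backend Engineer</td><td>NetEase Cloud Music</td></tr>
<tr><td>Data Analyst</td><td>NetEase Games</td></tr>
</table>"""

def parse_rows(html):
    # Mirrors the spider's intent: enumerate the <tr> nodes and skip
    # the header row (index 0), yielding one record per job posting.
    root = ET.fromstring(html)
    for i, tr in enumerate(root.findall('tr')):
        if i == 0:          # first row is the table header
            continue
        cells = [td.text for td in tr.findall('td')]
        yield {'name': cells[0], 'depart': cells[1]}

jobs = list(parse_rows(HTML))
```

In the real spider the same loop would run over `response.xpath(...)` node lists and yield `WangyiItem` instances instead of plain dicts.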

In settings.py, set the default request headers so that every request carries a browser user-agent (the Accept line below is reconstructed from Scrapy's settings template; only its tail survived in the source):

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.82 Safari/537.36'
}

# Enable or disable spider middlewares

# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html

#SPIDER_MIDDLEWARES = {

#    'wangyi.middlewares.WangyiSpiderMiddleware': 543,

#}

# Enable or disable downloader middlewares

# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html

#DOWNLOADER_MIDDLEWARES = {

#    'wangyi.middlewares.WangyiDownloaderMiddleware': 543,

#}

# Enable or disable extensions

# See https://docs.scrapy.org/en/latest/topics/extensions.html

#EXTENSIONS = {

#    'scrapy.extensions.telnet.TelnetConsole': None,

#}

# Configure item pipelines

# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# Enable the item pipeline
ITEM_PIPELINES = {
   'wangyi.pipelines.WangyiPipeline': 300,
}
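The value 300 is the pipeline's order: items flow through all enabled pipelines in ascending order of these values (conventionally 0-1000). A small sketch, with a second, purely hypothetical pipeline added for illustration:

```python
# Hypothetical ITEM_PIPELINES mapping: class path -> order value.
pipelines = {
    'wangyi.pipelines.WangyiPipeline': 300,
    'wangyi.pipelines.SomeOtherPipeline': 100,   # hypothetical
}

# Items would visit pipelines in ascending order of their values.
order = sorted(pipelines, key=pipelines.get)
```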

# Enable and configure the AutoThrottle extension (disabled by default)

# See https://docs.scrapy.org/en/latest/topics/autothrottle.html

#AUTOTHROTTLE_ENABLED = True

# The initial download delay

#AUTOTHROTTLE_START_DELAY = 5

# The maximum download delay to be set in case of high latencies

#AUTOTHROTTLE_MAX_DELAY = 60

# The average number of requests Scrapy should be sending in parallel to

# each remote server

#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

# Enable showing throttling stats for every response received:

#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)

# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings

#HTTPCACHE_ENABLED = True

#HTTPCACHE_EXPIRATION_SECS = 0

#HTTPCACHE_DIR = 'httpcache'

#HTTPCACHE_IGNORE_HTTP_CODES = []

#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

6. Save the data by writing pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import json

class WangyiPipeline:

    def __init__(self):
        # Always pass encoding when opening the file; otherwise the
        # Chinese output is garbled even with ensure_ascii=False.
        self.file = open('wangyi.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Cast the scrapy Item to a plain dict
        item = dict(item)
        # Serialize the dict: json.dumps() converts a Python object
        # into a JSON-formatted string.
        str_data = json.dumps(item, ensure_ascii=False) + ',\n'
        print(str_data)
        # Write the data to the file
        self.file.write(str_data)
        # By default the pipeline hands the item back to the engine when done.
        # Use return here, not yield; otherwise the JSON file stays empty.
        return item

    def __del__(self):
        self.file.close()
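A quick demonstration of why ensure_ascii=False (paired with encoding='utf-8' on the file) is needed to keep the Chinese text readable:

```python
import json

item = {'name': '后端工程师', 'num': 3}

# With the default ensure_ascii=True, non-ASCII characters are
# written as \uXXXX escape sequences:
escaped = json.dumps(item)

# With ensure_ascii=False the Chinese text is kept as-is; the output
# file must then be opened with encoding='utf-8'.
readable = json.dumps(item, ensure_ascii=False)
```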

7. Run and debug

For convenient debugging, create a run.py file in the wangyi project root:

#  -*- coding: utf-8 -*-
'''
Author: YTNetMan
Date: 2022-05-02
File: run.py
Purpose:
'''
from scrapy import cmdline  # simulates running the command in a terminal

cmdline.execute('scrapy crawl job'.split())  # replace job with your own spider name
# cmdline.execute('scrapy crawl job --nolog'.split())
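cmdline.execute() expects an argv-style list rather than a single string, which is why the command string is .split() first:

```python
# .split() turns the command string into the argv list that
# cmdline.execute() expects.
argv = 'scrapy crawl job --nolog'.split()
```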

Note: the source material and instruction come from Bilibili.

Please credit when reprinting: reprinted from www.mshxw.com
Original URL: https://www.mshxw.com/it/853428.html