An example: scraping the NetEase (网易) recruitment site
1. Install Scrapy
Scrapy depends on the following libraries:
lxml
pyOpenSSL
Twisted
PyWin32 (Windows only)
Once those are installed, install Scrapy itself: pip install Scrapy
2. Create the project
scrapy startproject wangyi
3. Model the data
Open items.py and define fields for the information to be scraped:

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class WangyiItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()      # job title
    link = scrapy.Field()      # detail-page link
    depart = scrapy.Field()    # department
    category = scrapy.Field()  # job category
    type = scrapy.Field()      # employment type
    address = scrapy.Field()   # work location
    num = scrapy.Field()       # number of openings
    date = scrapy.Field()      # publication date
4. Generate the spider
From inside the wangyi project directory, run:
scrapy genspider job 163.com
5. Write the spider file, job.py
'''
enumerate means to enumerate or list out.
enumerate() takes any iterable (e.g. a list or a string) as its argument.
It is mostly used in for loops to get a running count: it yields the index
and the value together, so reach for it whenever you need both.
enumerate() returns an enumerate object, not a list.
'''
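The enumerate() behaviour described above, in a quick demo:

```python
# enumerate() pairs each value of an iterable with a running index.
jobs = ['Web Developer', 'Game Designer', 'Data Analyst']

for index, value in enumerate(jobs):
    print(index, value)
# 0 Web Developer
# 1 Game Designer
# 2 Data Analyst

# It returns a lazy enumerate object, not a list:
print(type(enumerate(jobs)).__name__)   # enumerate

# An optional start argument offsets the count:
print(list(enumerate(jobs, start=1)))
# [(1, 'Web Developer'), (2, 'Game Designer'), (3, 'Data Analyst')]
```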
import scrapy
from wangyi.items import WangyiItem


class JobSpider(scrapy.Spider):
    name = 'job'
    # 2. Check the allowed domain.
    allowed_domains = ['163.com']
    # 1. Replace the original start URL.
    # start_urls = ['http://163.com/']
    start_urls = ['https://hr.163.com/position/list.do']
    def parse(self, response):
        # 3. Extract the data.
        # Get the list of job-posting nodes: in the browser dev tools,
        # copy the XPath of the first table row's tr and paste it
        # between the quotes of response.xpath('...').
        ...

Next, modify the project's settings.py. Set the default request headers, including a browser user-agent so the site does not reject the requests:

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.82 Safari/537.36'
}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# Enable the item pipeline (the number is the running order, 0-1000; lower runs first).
ITEM_PIPELINES = {
    'wangyi.pipelines.WangyiPipeline': 300,
}
6. Save the data: write pipelines.py
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import json


class WangyiPipeline:
    def __init__(self):
        # Do not omit encoding= when opening the file; without it the
        # Chinese output is garbled even with ensure_ascii=False.
        self.file = open('wangyi.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Cast the Item to a plain dict (this conversion works because
        # scrapy Items support the dict interface).
        item = dict(item)
        # Serialize the dict: json.dumps() converts a Python object
        # into a JSON string.
        str_data = json.dumps(item, ensure_ascii=False) + ',\n'
        print(str_data)
        # Write the line to the file.
        self.file.write(str_data)
        # By default the item is handed back to the engine after the
        # pipeline. Use return here, not yield, or the JSON file
        # ends up empty.
        return item

    def __del__(self):
        self.file.close()
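The ensure_ascii=False argument used in the pipeline matters for Chinese text: without it json.dumps() escapes every non-ASCII character. A quick comparison:

```python
import json

item = {'name': '前端开发工程师', 'num': '1'}

# Default: non-ASCII characters are escaped to \uXXXX sequences.
print(json.dumps(item))

# ensure_ascii=False keeps the Chinese characters readable,
# which is why the pipeline also opens the file with encoding='utf-8'.
print(json.dumps(item, ensure_ascii=False))
```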
7. Run and debug
For easier debugging, create a run.py file in the wangyi project directory:

# -*- coding: utf-8 -*-
'''
Author: YTNetMan
Date: 2022-05-02
File: run.py
Purpose:
'''
from scrapy import cmdline  # simulates typing the command in a terminal
cmdline.execute('scrapy crawl job'.split())  # use your own spider name in place of job
# cmdline.execute('scrapy crawl job --nolog'.split())
Note: material and instruction taken from Bilibili.