An example: scraping the NetEase (网易) recruitment site
1. Install Scrapy
Scrapy depends on the following libraries:
lxml
pyOpenSSL
Twisted
PyWin32 (Windows only)
Once those are installed, install Scrapy itself: pip install Scrapy
2. Create the project
scrapy startproject wangyi
3. Model the data
Open items.py and define fields for the information to be scraped:

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class WangyiItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()      # job title
    link = scrapy.Field()      # detail-page link
    depart = scrapy.Field()    # department
    category = scrapy.Field()  # job category
    type = scrapy.Field()      # employment type
    address = scrapy.Field()   # work location
    num = scrapy.Field()       # number of openings
    date = scrapy.Field()      # publication date
4. Generate the spider
From inside the wangyi project directory, run:
scrapy genspider job 163.com
5. Write the spider file, job.py
'''
enumerate means to enumerate or list out.
enumerate() takes any iterable (e.g. a list or a string) as its argument.
It is mostly used in for loops to get a running count: it yields the index
and the value together, so reach for it whenever you need both.
enumerate() returns an enumerate object, not a list.
'''
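The enumerate() behaviour described above, in a quick demo:

```python
# enumerate() pairs each value of an iterable with a running index.
jobs = ['Web Developer', 'Game Designer', 'Data Analyst']

for index, value in enumerate(jobs):
    print(index, value)
# 0 Web Developer
# 1 Game Designer
# 2 Data Analyst

# It returns a lazy enumerate object, not a list:
print(type(enumerate(jobs)).__name__)   # enumerate

# An optional start argument offsets the count:
print(list(enumerate(jobs, start=1)))
# [(1, 'Web Developer'), (2, 'Game Designer'), (3, 'Data Analyst')]
```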
import scrapy
from wangyi.items import WangyiItem


class JobSpider(scrapy.Spider):
    name = 'job'
    # 2. Check the allowed domain.
    allowed_domains = ['163.com']
    # 1. Replace the original start URL.
    # start_urls = ['http://163.com/']
    start_urls = ['https://hr.163.com/position/list.do']
    def parse(self, response):
        # 3. Extract the data.
        # Get the list of job-posting nodes: in the browser dev tools,
        # copy the XPath of the first table row's tr and paste it
        # between the quotes of response.xpath('...').
        ...

Next, modify the project's settings.py. Set the default request headers, including a browser user-agent so the site does not reject the requests:

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.82 Safari/537.36'
}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# Enable the item pipeline (the number is the running order, 0-1000; lower runs first).
ITEM_PIPELINES = {
    'wangyi.pipelines.WangyiPipeline': 300,
}
6. Save the data: write pipelines.py
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import json


class WangyiPipeline:
    def __init__(self):
        # Do not omit encoding= when opening the file; without it the
        # Chinese output is garbled even with ensure_ascii=False.
        self.file = open('wangyi.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Cast the Item to a plain dict (this conversion works because
        # scrapy Items support the dict interface).
        item = dict(item)
        # Serialize the dict: json.dumps() converts a Python object
        # into a JSON string.
        str_data = json.dumps(item, ensure_ascii=False) + ',\n'
        print(str_data)
        # Write the line to the file.
        self.file.write(str_data)
        # By default the item is handed back to the engine after the
        # pipeline. Use return here, not yield, or the JSON file
        # ends up empty.
        return item

    def __del__(self):
        self.file.close()
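The ensure_ascii=False argument used in the pipeline matters for Chinese text: without it json.dumps() escapes every non-ASCII character. A quick comparison:

```python
import json

item = {'name': '前端开发工程师', 'num': '1'}

# Default: non-ASCII characters are escaped to \uXXXX sequences.
print(json.dumps(item))

# ensure_ascii=False keeps the Chinese characters readable,
# which is why the pipeline also opens the file with encoding='utf-8'.
print(json.dumps(item, ensure_ascii=False))
```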
7. Run and debug
For easier debugging, create a run.py file in the wangyi project directory:

# -*- coding: utf-8 -*-
'''
Author: YTNetMan
Date: 2022-05-02
File: run.py
Purpose:
'''
from scrapy import cmdline  # simulates typing the command in a terminal
cmdline.execute('scrapy crawl job'.split())  # use your own spider name in place of job
# cmdline.execute('scrapy crawl job --nolog'.split())
Note: material and instruction taken from Bilibili.