
Python Web Scraping: bs4 from Beginner to Expert (Part 1)



Contents
  • Python Web Scraping: bs4 from Beginner to Expert (Part 1)
  • BeautifulSoup4 overview
    • Basic concepts
    • Source-code legend
  • bs4 quick start
    • 1. Installation
    • 2. Importing the module
    • 3. Creating a soup object
  • bs4 object types
    • Code demo with notes
  • Traversing the document tree
    • contents, children, descendants
      • Code demo with notes
    • string, strings, stripped_strings
      • Code demo with notes
    • parent and parents
      • Code demo with notes
  • find() and find_all() [key topic]
    • Code demo with notes
  • Practice example and bs4 recap


BeautifulSoup4 overview

Beautiful Soup is a Python library for extracting data from HTML and XML files. Working through your parser of choice, it gives you natural ways to navigate, search, and modify the parse tree, and it can save you hours or even days of work.

1. It is used to parse data.

2. Different parsing approaches suit different page structures:

Regex: match the data with regular expressions; patterns get complicated quickly
XPath: the node-relationship syntax can be hard to get right
bs4: the find() and find_all() methods
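As a quick comparison of the three approaches, the sketch below extracts the same link text with a regex, with an XPath query, and with bs4; the one-line snippet it parses is made up for illustration.

```python
import re
from bs4 import BeautifulSoup
from lxml import etree

html = '<p><a href="/a">Elsie</a> and <a href="/b">Lacie</a></p>'

# Regex: works on raw text, so the pattern must mirror the markup exactly
names_re = re.findall(r'<a href="[^"]*">([^<]+)</a>', html)

# XPath: requires knowing the node relationships
names_xpath = etree.HTML(html).xpath('//a/text()')

# bs4: find_all() plus .string reads almost like English
soup = BeautifulSoup(html, 'lxml')
names_bs4 = [a.string for a in soup.find_all('a')]

print(names_re, names_xpath, names_bs4)  # all three: ['Elsie', 'Lacie']
```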

Basic concepts

Beautiful Soup is a web-page information-extraction library: it pulls data out of HTML or XML documents.

Source-code legend

When browsing the bs4 source in an IDE such as PyCharm, the structure-view icons mean:

c  class
m  method
f  field
p  a method decorated with @property, usable like an attribute
v  variable
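A minimal illustration of the p case, a method wrapped with @property that is read like an attribute (the Page class here is invented for the example):

```python
class Page:
    def __init__(self, html):
        self._html = html

    @property
    def length(self):          # defined as a method...
        return len(self._html)

page = Page('<p>hi</p>')
print(page.length)             # ...but accessed like a field, no parentheses: 9
```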

bs4 quick start

1. Installation

Install lxml first: pip install lxml
Then install bs4: pip install bs4
(the bs4 name on PyPI is a shim that pulls in the beautifulsoup4 package; installing beautifulsoup4 directly also works)

2. Importing the module

from bs4 import BeautifulSoup

3. Creating a soup object

soup = BeautifulSoup(html_doc, 'lxml')
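Putting the three steps together, a minimal end-to-end sketch (the html_doc snippet here is a made-up placeholder):

```python
from bs4 import BeautifulSoup

html_doc = '<html><head><title>Demo</title></head><body><p>hello</p></body></html>'

# The second argument picks the parser; 'lxml' must be installed separately
soup = BeautifulSoup(html_doc, 'lxml')

print(soup.title.string)  # Demo
print(soup.p.string)      # hello
```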
bs4 object types

● Tag: a markup tag
● NavigableString: a navigable string
● BeautifulSoup: the soup object itself
● Comment: an HTML comment

Code demo with notes
# 1. Import the module
from bs4 import BeautifulSoup

"""
For reference:
● Tag: a markup tag
● NavigableString: a navigable string
● BeautifulSoup: the soup object itself
● Comment: an HTML comment
"""

# The classic sample document from the Beautiful Soup docs (a <span> wrapping
# an HTML comment is included so the Comment demo at the end has something
# to find)
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story"><span><!-- a comment --></span>...</p>
</body></html>
"""

# 2. Create the soup object
soup = BeautifulSoup(html_doc, features='lxml')

print(type(soup.title))         # Tag
print(type(soup.a))             # Tag
print(type(soup.p))             # Tag
print(type(soup.body))          # Tag
print(type(soup.title.string))  # NavigableString
print(type(soup))               # BeautifulSoup
print(type(soup.span.string))   # Comment
Traversing the document tree

contents, children, descendants

● contents returns a list of the tag's direct children
● children returns an iterator over the direct children
● descendants returns a generator that walks all descendants, recursively
代码演示,详细注解
from bs4 import BeautifulSoup

html_doc = """
The Dormouse's story

The Dormouse's story

Once upon a time there were three little sisters; and their names were Elsie, Lacie and Tillie; and they lived at the bottom of a well.

...

""" """ contents children descendants ● contents 返回的是一个所有子节点的列表 ● children 返回的是一个子节点的迭代器通 ● descendants 返回的是一个生成器遍历子子孙孙 """ soup = BeautifulSoup(html_doc, 'lxml') head = soup.head a = soup.a # print(head.contents) # [The Dormouse's story] 回的是一个所有子节点的列表 # print('-' * 30) # for i in head.contents: # print(i) # print(head.children) # 返回的是一个子节点的迭代器的对象 # print('-' * 30) # for i in head.children: # (凡是迭代器 都是可以遍历的) # print(i) # print(head.descendants) # 返回的是一个生成器遍历子子孙孙 # print('-' * 30) # for i in a.descendants: # print(i) # 会把换行也当成子节点 匹配到 html = soup.html # print(html.contents) # print(html.descendants) """ [The Dormouse's story, 'n',

The Dormouse's story

Once upon a time there were three little sisters; and their names were Elsie, Lacie and Tillie; and they lived at the bottom of a well.

...

] """ # for h in html.descendants: # print(h)
string, strings, stripped_strings

● string returns the text inside a single tag
● strings returns a generator over the text of every tag
● stripped_strings behaves like strings but strips the surrounding whitespace

Code demo with notes
from bs4 import BeautifulSoup

# The same sample document used throughout
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, 'lxml')
html = soup.html

# string: get the text inside one tag; these three deserve close study
print(soup.title.string)  # The Dormouse's story

# strings: a generator over every text node in the tree
print(html.strings)
for i in html.strings:
    print(i)

# stripped_strings: the same, but whitespace-stripped, so the blank
# lines between tags disappear from the output
print(html.stripped_strings)
for i in html.stripped_strings:
    print(i)
parent and parents

● parent returns the direct parent node
● parents returns a generator over all ancestors

Code demo with notes
from bs4 import BeautifulSoup

# The same sample document used throughout
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, 'lxml')
title = soup.title

# parent: the direct parent node
print(title.parent)  # <head><title>The Dormouse's story</title></head>

# parents: a generator over every ancestor
print(title.parents)
for p in title.parents:
    print(p)

"""
Walking title.parents prints, in order:
1. title's direct parent: the head tag
2. head's parent: the html tag
3. html's parent: the entire document (the BeautifulSoup object itself)
"""
find() and find_all() [key topic]

● find_all() returns every matching tag as a list
● find() returns only the first match
● find_all() parameters:

● name: the tag name
● attrs: the tag's attributes
● recursive: whether to search descendants recursively
● text: match tags by their text content
● limit: cap the number of results returned
● kwargs: keyword arguments (e.g. id, class_)
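A short sketch of those parameters in action, on a made-up snippet:

```python
from bs4 import BeautifulSoup

html = """
<div>
  <a id="first">one</a>
  <p><a>two</a></p>
  <a>three</a>
</div>
"""
soup = BeautifulSoup(html, 'lxml')

# name + limit: stop after the first two a tags
print(soup.find_all('a', limit=2))

# recursive=False: only direct children of div, so the a inside p is skipped
print(len(soup.div.find_all('a', recursive=False)))  # 2

# kwargs: match by attribute, e.g. id
print(soup.find_all(id='first'))

# match by text content (string= is the current spelling of text=)
print(soup.find_all('a', string='three'))
```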

Code demo with notes
from bs4 import BeautifulSoup

# html_doc is the page source (the same sample document used throughout)
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, 'lxml')

# String filter: get every a tag; find_all() puts all matches in a list
# (findAll is a legacy alias for find_all)
a_list = soup.find_all('a')
print(a_list)
print('-' * 70)
for a in a_list:
    print(a)
print('-' * 70)

# find() returns only the first match
print(soup.find('a'))

# Find both the title and the p tags
# result = soup.find_all('title', 'p')  # wrong: the second positional argument is attrs, so this returns []
result = soup.find_all(['title', 'p'])  # pass a list of names instead
print(result)
Practice example and bs4 recap
from bs4 import BeautifulSoup

# The job table for the exercise. The href values and the id="test"/
# class="test" attributes on one link are illustrative placeholders so the
# attribute-lookup steps below have something to match.
html = """
<table class="tablelist">
<tr class="h"><td>职位名称</td><td>职位类别</td><td>人数</td><td>地点</td><td>发布时间</td></tr>
<tr class="even"><td><a href="#">22989-金融云区块链高级研发工程师(深圳)</a></td><td>技术类</td><td>1</td><td>深圳</td><td>2017-11-25</td></tr>
<tr class="odd"><td><a href="#">22989-金融云高级后台开发</a></td><td>技术类</td><td>2</td><td>深圳</td><td>2017-11-25</td></tr>
<tr class="even"><td><a href="#">SNG16-腾讯音乐运营开发工程师(深圳)</a></td><td>技术类</td><td>2</td><td>深圳</td><td>2017-11-25</td></tr>
<tr class="odd"><td><a href="#">SNG16-腾讯音乐业务运维工程师(深圳)</a></td><td>技术类</td><td>1</td><td>深圳</td><td>2017-11-25</td></tr>
<tr class="even"><td><a href="#">TEG03-高级研发工程师(深圳)</a></td><td>技术类</td><td>1</td><td>深圳</td><td>2017-11-24</td></tr>
<tr class="odd"><td><a href="#">TEG03-高级图像算法研发工程师(深圳)</a></td><td>技术类</td><td>1</td><td>深圳</td><td>2017-11-24</td></tr>
<tr class="even"><td><a href="#">TEG11-高级AI开发工程师(深圳)</a></td><td>技术类</td><td>4</td><td>深圳</td><td>2017-11-24</td></tr>
<tr class="odd"><td><a href="#">15851-后台开发工程师</a></td><td>技术类</td><td>1</td><td>深圳</td><td>2017-11-24</td></tr>
<tr class="even"><td><a id="test" class="test" href="#">15851-后台开发工程师</a></td><td>技术类</td><td>1</td><td>深圳</td><td>2017-11-24</td></tr>
<tr class="odd"><td><a href="#">SNG11-高级业务运维工程师(深圳)</a></td><td>技术类</td><td>1</td><td>深圳</td><td>2017-11-24</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'lxml')

# 1. Get all tr nodes
tr_list = soup.find_all('tr')
for tr in tr_list:
    print(tr)
    print('-' * 74)

# 2. Get the second tr node
print(tr_list[1])

# 3. Find the tr nodes with class "even"
# Option 1: the class_ keyword (class alone is a Python keyword)
class_list = soup.find_all('tr', class_='even')
for c in class_list:
    print(c)
    print('-' * 74)
# Option 2: pass the attributes as a dict
class_list = soup.find_all('tr', attrs={'class': 'even'})

# 4. Locate the a tag with id="test" (several attributes can be combined)
a_list = soup.find_all('a', id='test')
a_lists = soup.find_all('a', attrs={'id': 'test', 'class': 'test'})

# 5. Get the href value of every a tag; the first form is recommended
for a in soup.find_all('a'):
    print(a.get('href'))
    # print(a.attrs['href'])
    # print(a['href'])

# 6. Get every position name; the first tr is the header row, so skip it
tr_list = soup.find_all('tr')[1:]
for tr in tr_list:
    a_list = tr.find_all('a')
    print(a_list)
Reproduced from www.mshxw.com. Original article: https://www.mshxw.com/it/887575.html