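Table of contents
- Python Web Scraping: bs4 from Beginner to Mastery (Part 1)
- Introduction to BeautifulSoup4
- Basic concepts
- Source code analysis
- bs4 quick start
- 1. Installation
- 2. Import the module
- 3. Create the soup object
- bs4 object types
- Code demo, with detailed comments
- Traversing the document tree
- contents, children, descendants
- Code demo, with detailed comments
- string, strings, stripped_strings
- Code demo, with detailed comments
- parent and parents
- Code demo, with detailed comments
- find() and find_all() [key topic]
- Code demo, with detailed comments
- Case practice: review and summary of bs4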
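Introduction to BeautifulSoup4
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying the document, and it can save hours or even days of work.
1. It is used to parse data.
2. Different parser modules can be used for sites with different page structures.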
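Basic concepts
Regular expressions: match the data with regex patterns; relatively complex.
XPath: expression syntax; the relationships between nodes can be hard to work out.
bs4: the find() and find_all() methods.
Beautiful Soup is a library for extracting information from web pages, that is, from HTML or XML files.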
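To make the comparison concrete, here is a minimal sketch that pulls the same title out of a page with a regular expression and with bs4. The markup is a hypothetical snippet written for this example, and the built-in "html.parser" is used (an assumption, chosen so the sketch runs even without lxml).

```python
import re

from bs4 import BeautifulSoup

# Hypothetical snippet, used only for this comparison.
html = "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"

# Regular expression: the pattern has to describe the markup by hand.
match = re.search(r"<title>(.*?)</title>", html)
title_by_regex = match.group(1) if match else None

# Beautiful Soup: parse once, then ask for the tag by name.
soup = BeautifulSoup(html, "html.parser")
title_by_bs4 = soup.find("title").string

print(title_by_regex)
print(title_by_bs4)
```

Both approaches return the same text here; the bs4 version keeps working if the tag gains attributes or the whitespace changes, whereas the pattern would need to be adjusted.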
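Source code analysis
Letter markers seen while reading the bs4 source:
c  class
m  method
f  field
p  property: a method decorated with @property, which can be used like an attribute
v  variable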
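The difference between a field, a method, and a property is easier to see in code. The sketch below uses a toy Node class that is hypothetical and not part of bs4; it only illustrates why an attribute such as soup.title.string is read without parentheses.

```python
class Node:
    """Toy class (hypothetical, not part of bs4) with a field, a method, and a property."""

    def __init__(self, name):
        self.name = name      # field: plain data stored on the instance
        self.contents = []    # field: list of child nodes

    def append(self, child):
        """Method: called with parentheses."""
        self.contents.append(child)

    @property
    def children(self):
        """Property: defined with def, but read like an attribute."""
        return iter(self.contents)


root = Node("head")
root.append(Node("title"))

# No parentheses after `children`, even though it is defined as a function.
child_names = [child.name for child in root.children]
print(child_names)
```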
bs4 quick start

1. Installation
Install lxml first: pip install lxml
Then install bs4: pip install bs4
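After installing, a quick check along the following lines can confirm that both packages are importable. This is only a sketch: it assumes bs4 is already installed in the active environment, and it treats lxml as optional.

```python
import importlib.util

import bs4

# Version string of the installed Beautiful Soup package.
bs4_version = bs4.__version__

# True if the optional lxml parser can be imported in this environment.
lxml_available = importlib.util.find_spec("lxml") is not None

print("bs4 version:", bs4_version)
print("lxml available:", lxml_available)
```

If lxml is reported as unavailable, BeautifulSoup(html_doc, 'lxml') will fail until it is installed; the built-in 'html.parser' works without it.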
2. Import the module

    from bs4 import BeautifulSoup

3. Create the soup object

    soup = BeautifulSoup(html_doc, 'lxml')

bs4 object types
Code demo, with detailed comments
● Tag: a tag
● NavigableString: a navigable string
● BeautifulSoup: the soup object (the whole parsed document)
● Comment: a comment
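The four object types can be checked with a short sketch. The markup here is hypothetical (the span holds an HTML comment on purpose), and "html.parser" is used as an assumption so that the example does not depend on lxml.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Hypothetical markup; the <span> holds an HTML comment on purpose.
html = "<html><head><title>Demo</title></head><body><span><!--a note--></span></body></html>"
soup = BeautifulSoup(html, "html.parser")

print(type(soup))               # the BeautifulSoup object itself
print(type(soup.title))         # a Tag
print(type(soup.title.string))  # a NavigableString
print(type(soup.span.string))   # a Comment
```

Comment is a subclass of NavigableString, so code that must treat comments differently should test for Comment first.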
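    # 1. Import the module
    from bs4 import BeautifulSoup

    """
    For reference only:
    ● Tag: a tag
    ● NavigableString: a navigable string
    ● BeautifulSoup: the soup object
    ● Comment: a comment
    """

    # Sample markup: the "three sisters" document from the Beautiful Soup documentation
    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>

    <p class="story">...</p>
    """

    # 2. Create the soup object
    soup = BeautifulSoup(html_doc, features='lxml')

    # print(type(soup.title))         # Tag
    # print('-' * 30)
    # print(type(soup.a))             # Tag
    # print('-' * 30)
    # print(type(soup.p))             # Tag
    # print('-' * 30)
    # print(type(soup.body))          # Tag
    # print('-' * 30)
    # print(type(soup.title.string))  # NavigableString
    # print('-' * 30)
    # print(type(soup))               # BeautifulSoup object
    # print('-' * 30)
    # print(type(soup.span.string))   # Comment (needs a <span> holding an HTML comment in html_doc)

Traversing the document tree

contents, children, descendants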
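Code demo, with detailed comments
● contents returns a list of all direct child nodes
● children returns an iterator over the direct child nodes
● descendants returns a generator that walks all descendants (children, grandchildren, and so on)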
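A minimal sketch of the three attributes side by side. The markup is hypothetical and written without line breaks so that no whitespace nodes show up; "html.parser" is an assumption made to keep the example free of the lxml dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, written without line breaks so no whitespace nodes appear.
html = "<html><head><title>Demo</title></head><body><p>Hi <b>there</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")
head = soup.head

contents = head.contents              # list of direct children
children = list(head.children)        # iterator over the same direct children
descendants = list(head.descendants)  # generator: children, grandchildren, ...

print(contents)
print(children)
print(descendants)
```

For head, contents and children both hold only the title tag, while descendants also reaches the text node inside it.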
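    from bs4 import BeautifulSoup

    # Sample markup: the "three sisters" document from the Beautiful Soup documentation
    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>

    <p class="story">...</p>
    """

    """
    contents, children, descendants
    ● contents returns a list of all direct child nodes
    ● children returns an iterator over the direct child nodes
    ● descendants returns a generator that walks all descendants
    """

    soup = BeautifulSoup(html_doc, 'lxml')
    head = soup.head
    a = soup.a

    # print(head.contents)  # [<title>The Dormouse's story</title>], a list of all direct child nodes
    # print('-' * 30)
    # for i in head.contents:
    #     print(i)

    # print(head.children)  # an iterator object over the direct child nodes
    # print('-' * 30)
    # for i in head.children:  # any iterator can be looped over
    #     print(i)

    # print(head.descendants)  # a generator that walks all descendants
    # print('-' * 30)
    # for i in a.descendants:
    #     print(i)

    # Line breaks are also matched as child nodes
    html = soup.html
    # print(html.contents)
    # print(html.descendants)
    r"""
    [<head><title>The Dormouse's story</title></head>, '\n', <body>
    <p class="title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    </body>]
    """
    # for h in html.descendants:
    #     print(h)

string, strings, stripped_strings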
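Code demo, with detailed comments
● string gets the text inside a single tag
● strings returns a generator, used to get the text of multiple tags
● stripped_strings works like strings, but removes the surplus whitespace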
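The difference between the three attributes shows up as soon as a tag has more than one child. The sketch below uses hypothetical markup with deliberate extra whitespace, and "html.parser" is an assumption made so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with extra whitespace around the text.
html = "<html><head><title>Demo</title></head><body><p>  first  </p>\n<p>second</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

title_text = soup.title.string                    # one tag holding one string
all_strings = list(soup.body.strings)             # every string, whitespace kept
clean_strings = list(soup.body.stripped_strings)  # blank strings dropped, the rest stripped
body_string = soup.body.string                    # None, because <body> has several children

print(repr(title_text))
print(all_strings)
print(clean_strings)
print(body_string)
```

When string returns None for a container tag, iterating over strings or stripped_strings is the usual way to collect its text.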
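    from bs4 import BeautifulSoup

    # Sample markup: the "three sisters" document from the Beautiful Soup documentation
    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>

    <p class="story">...</p>
    """

    soup = BeautifulSoup(html_doc, 'lxml')
    head = soup.head
    a = soup.a
    html = soup.html  # needed by the strings examples below

    """
    Important to master: string, strings, stripped_strings
    ● string gets the text inside a single tag
    ● strings returns a generator, used to get the text of multiple tags
    ● stripped_strings works like strings, but removes the surplus whitespace
    """

    # Get the text inside a tag
    # print(soup.title.string)

    # strings returns a generator object covering the text of multiple tags
    # print(html.strings)
    # for i in html.strings:
    #     print(i)

    # Like strings, but with the surplus whitespace removed
    # print(html.stripped_strings)
    # for i in html.stripped_strings:
    #     print(i)
    '''
    The Dormouse's story
    The Dormouse's story
    Once upon a time there were three little sisters; and their names were
    Elsie
    ,
    Lacie
    and
    Tillie
    ;
    and they lived at the bottom of a well.
    ...
    '''

parent and parents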
Code demo, with detailed comments
● parent gets the direct parent node
● parents gets all ancestor nodes, from the nearest outward
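A minimal sketch of both attributes, starting from the title tag. The markup is hypothetical and "html.parser" is used as an assumption to avoid depending on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
html = "<html><head><title>Demo</title></head><body><p>Hi</p></body></html>"
soup = BeautifulSoup(html, "html.parser")
title = soup.title

parent_name = title.parent.name                              # direct parent only
ancestor_names = [parent.name for parent in title.parents]  # every ancestor, nearest first

print(parent_name)
print(ancestor_names)
```

The last ancestor is the BeautifulSoup object itself, whose name is "[document]".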
    from bs4 import BeautifulSoup

    # Sample markup: the "three sisters" document from the Beautiful Soup documentation
    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>

    <p class="story">...</p>
    """

    soup = BeautifulSoup(html_doc, 'lxml')
    head = soup.head
    a = soup.a

    """
    Traversing the document tree: parent nodes
    parent and parents
    ● parent gets the direct parent node
    ● parents gets all ancestor nodes
    """

    title = soup.title

    # parent finds the direct parent node
    # print(title.parent)

    # parents returns a generator
    # print(title.parents)
    # for p in title.parents:
    #     print(p)
    '''
    1. First, the parent of title:
       <head><title>The Dormouse's story</title></head>
    2. Next, the parent of that parent (the parent of head):
       the entire <html> element
    3. Finally, the parent of the parent's parent (the parent of html):
       the entire document
    '''
    # The parent of html is the whole document

find() and find_all() [key topic]
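Code demo, with detailed comments
● find_all() returns all matching tags as a list
● find() returns the first match only

find_all() parameters:
● name: the tag name
● attrs: the tag's attributes
● recursive: whether to search recursively
● text: the text content (called string in Beautiful Soup 4.4.0 and later)
● limit: the maximum number of results to return
● kwargs: keyword arguments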
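The parameters listed above can be tried out on a small page. This is a sketch under stated assumptions: the markup is a hypothetical three-link snippet, "html.parser" stands in for lxml, and the text/string filter is left out.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: three links inside one paragraph.
html = """
<html><body>
<p class="story">
<a class="sister" id="link1" href="http://example.com/elsie">Elsie</a>
<a class="sister" id="link2" href="http://example.com/lacie">Lacie</a>
<a class="sister" id="link3" href="http://example.com/tillie">Tillie</a>
</p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

first_link = soup.find("a")                             # name: first match only
all_links = soup.find_all("a")                          # name: every match, as a list
by_attrs = soup.find_all("a", attrs={"id": "link2"})    # attrs: filter by an attribute dict
by_kwargs = soup.find_all("a", id="link3")              # kwargs: the same filter as a keyword
limited = soup.find_all("a", limit=2)                   # limit: stop after two matches
direct_only = soup.body.find_all("a", recursive=False)  # recursive=False: direct children only

print(first_link)
print(len(all_links), len(limited), direct_only)
```

With recursive=False the search from body finds nothing, because the links are grandchildren of body rather than direct children.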
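    from bs4 import BeautifulSoup

    # html_doc is the page source
    # Sample markup: the "three sisters" document from the Beautiful Soup documentation
    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>

    <p class="story">...</p>
    """

    soup = BeautifulSoup(html_doc, 'lxml')

    # String filter
    # Get all <a> tags; find_all returns every match in a list
    # a_list = soup.find_all('a')
    # print(a_list)
    # print('-' * 70)
    # for a in a_list:
    #     print(a)
    # print('-' * 70)
    # print(soup.find('a'))  # find() returns the first match

    # Find the title node and the p nodes
    # result = soup.find_all('title', 'p')  # [] : does not work, the second positional argument is read as attrs (a CSS class), not as another tag name
    result = soup.find_all(['title', 'p'])  # list filter: matches any of the tag names
    print(result)

Case practice: review and summary of bs4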
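    from bs4 import BeautifulSoup

    html = """

| Position | Category | Headcount | Location | Posted |
| 22989-Financial Cloud Blockchain Senior R&D Engineer (Shenzhen) | Technical | 1 | Shenzhen | 2017-11-25 |
| 22989-Financial Cloud Senior Backend Developer | Technical | 2 | Shenzhen | 2017-11-25 |
| SNG16-Tencent Music Operations Development Engineer (Shenzhen) | Technical | 2 | Shenzhen | 2017-11-25 |
| SNG16-Tencent Music Business Operations and Maintenance Engineer (Shenzhen) | Technical | 1 | Shenzhen | 2017-11-25 |
| TEG03-Senior R&D Engineer (Shenzhen) | Technical | 1 | Shenzhen | 2017-11-24 |
| TEG03-Senior Image Algorithm R&D Engineer (Shenzhen) | Technical | 1 | Shenzhen | 2017-11-24 |
| TEG11-Senior AI Development Engineer (Shenzhen) | Technical | 4 | Shenzhen | 2017-11-24 |
| 15851-Backend Development Engineer | Technical | 1 | Shenzhen | 2017-11-24 |
| 15851-Backend Development Engineer | Technical | 1 | Shenzhen | 2017-11-24 |
| SNG11-Senior Business Operations and Maintenance Engineer (Shenzhen) | Technical | 1 | Shenzhen | 2017-11-24 |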
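One way to practice on data like this is to turn each table row into a dictionary keyed by the header row. The sketch below is not the original exercise solution: it assumes plain tr/td markup, uses a hypothetical two-row excerpt of the table, and uses "html.parser" instead of lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row excerpt; the plain <tr>/<td> layout is an assumption.
html = """
<table>
<tr><td>Position</td><td>Category</td><td>Headcount</td><td>Location</td><td>Posted</td></tr>
<tr><td>TEG11-Senior AI Development Engineer (Shenzhen)</td><td>Technical</td><td>4</td><td>Shenzhen</td><td>2017-11-24</td></tr>
<tr><td>15851-Backend Development Engineer</td><td>Technical</td><td>1</td><td>Shenzhen</td><td>2017-11-24</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

rows = soup.find_all("tr")               # every table row, header included
header = list(rows[0].stripped_strings)  # column names from the first row

jobs = []
for row in rows[1:]:
    cells = list(row.stripped_strings)   # cell texts with whitespace removed
    jobs.append(dict(zip(header, cells)))

for job in jobs:
    print(job)
```

This combines find_all() for locating the rows with stripped_strings for reading the cell text, the two techniques covered in the earlier sections.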



