
Python Web Scraping: bs4 from Beginner to Expert (Part 1)



Contents
  • Python Web Scraping: bs4 from Beginner to Expert (Part 1)
  • BeautifulSoup4 overview
    • Basic concepts
    • Source-code legend
  • bs4 quick start
    • 1. Installation
    • 2. Importing the module
    • 3. Creating a soup object
  • bs4 object types
    • Code demo with notes
  • Traversing the document tree
    • contents, children, descendants
      • Code demo with notes
    • string, strings, stripped_strings
      • Code demo with notes
    • parent and parents
      • Code demo with notes
  • find() and find_all() [key topic]
    • Code demo with notes
  • Practice example and bs4 recap


BeautifulSoup4 overview

Beautiful Soup is a Python library for extracting data from HTML and XML files. Working through your parser of choice, it gives you natural ways to navigate, search, and modify the parse tree, and it can save you hours or even days of work.

1. It is used to parse data.

2. Different parsing approaches suit different page structures:

Regex: match the data with regular expressions; patterns get complicated quickly
XPath: the node-relationship syntax can be hard to get right
bs4: the find() and find_all() methods
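As a quick comparison of the three approaches, the sketch below extracts the same link text with a regex, with an XPath query, and with bs4; the one-line snippet it parses is made up for illustration.

```python
import re
from bs4 import BeautifulSoup
from lxml import etree

html = '<p><a href="/a">Elsie</a> and <a href="/b">Lacie</a></p>'

# Regex: works on raw text, so the pattern must mirror the markup exactly
names_re = re.findall(r'<a href="[^"]*">([^<]+)</a>', html)

# XPath: requires knowing the node relationships
names_xpath = etree.HTML(html).xpath('//a/text()')

# bs4: find_all() plus .string reads almost like English
soup = BeautifulSoup(html, 'lxml')
names_bs4 = [a.string for a in soup.find_all('a')]

print(names_re, names_xpath, names_bs4)  # all three: ['Elsie', 'Lacie']
```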

Basic concepts

Beautiful Soup is a web-page information-extraction library: it pulls data out of HTML or XML documents.

Source-code legend

When browsing the bs4 source in an IDE such as PyCharm, the structure-view icons mean:

c  class
m  method
f  field
p  a method decorated with @property, usable like an attribute
v  variable
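A minimal illustration of the p case, a method wrapped with @property that is read like an attribute (the Page class here is invented for the example):

```python
class Page:
    def __init__(self, html):
        self._html = html

    @property
    def length(self):          # defined as a method...
        return len(self._html)

page = Page('<p>hi</p>')
print(page.length)             # ...but accessed like a field, no parentheses: 9
```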

bs4 quick start

1. Installation

Install lxml first: pip install lxml
Then install bs4: pip install bs4
(the bs4 name on PyPI is a shim that pulls in the beautifulsoup4 package; installing beautifulsoup4 directly also works)

2. Importing the module

from bs4 import BeautifulSoup

3. Creating a soup object

soup = BeautifulSoup(html_doc, 'lxml')
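Putting the three steps together, a minimal end-to-end sketch (the html_doc snippet here is a made-up placeholder):

```python
from bs4 import BeautifulSoup

html_doc = '<html><head><title>Demo</title></head><body><p>hello</p></body></html>'

# The second argument picks the parser; 'lxml' must be installed separately
soup = BeautifulSoup(html_doc, 'lxml')

print(soup.title.string)  # Demo
print(soup.p.string)      # hello
```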
bs4 object types

● Tag: a markup tag
● NavigableString: a navigable string
● BeautifulSoup: the soup object itself
● Comment: an HTML comment

Code demo with notes
# 1. Import the module
from bs4 import BeautifulSoup

"""
For reference:
● Tag: a markup tag
● NavigableString: a navigable string
● BeautifulSoup: the soup object itself
● Comment: an HTML comment
"""

# The classic sample document from the Beautiful Soup docs (a <span> wrapping
# an HTML comment is included so the Comment demo at the end has something
# to find)
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story"><span><!-- a comment --></span>...</p>
</body></html>
"""

# 2. Create the soup object
soup = BeautifulSoup(html_doc, features='lxml')

print(type(soup.title))         # Tag
print(type(soup.a))             # Tag
print(type(soup.p))             # Tag
print(type(soup.body))          # Tag
print(type(soup.title.string))  # NavigableString
print(type(soup))               # BeautifulSoup
print(type(soup.span.string))   # Comment
Traversing the document tree

contents, children, descendants

● contents returns a list of the tag's direct children
● children returns an iterator over the direct children
● descendants returns a generator that walks all descendants, recursively
代码演示,详细注解
from bs4 import BeautifulSoup

html_doc = """
The Dormouse's story

The Dormouse's story

Once upon a time there were three little sisters; and their names were Elsie, Lacie and Tillie; and they lived at the bottom of a well.

...

""" """ contents children descendants ● contents 返回的是一个所有子节点的列表 ● children 返回的是一个子节点的迭代器通 ● descendants 返回的是一个生成器遍历子子孙孙 """ soup = BeautifulSoup(html_doc, 'lxml') head = soup.head a = soup.a # print(head.contents) # [The Dormouse's story] 回的是一个所有子节点的列表 # print('-' * 30) # for i in head.contents: # print(i) # print(head.children) # 返回的是一个子节点的迭代器的对象 # print('-' * 30) # for i in head.children: # (凡是迭代器 都是可以遍历的) # print(i) # print(head.descendants) # 返回的是一个生成器遍历子子孙孙 # print('-' * 30) # for i in a.descendants: # print(i) # 会把换行也当成子节点 匹配到 html = soup.html # print(html.contents) # print(html.descendants) """ [The Dormouse's story, 'n',

The Dormouse's story

Once upon a time there were three little sisters; and their names were Elsie, Lacie and Tillie; and they lived at the bottom of a well.

...

] """ # for h in html.descendants: # print(h)
string, strings, stripped_strings

● string returns the text inside a single tag
● strings returns a generator over the text of every tag
● stripped_strings behaves like strings but strips the surrounding whitespace

Code demo with notes
from bs4 import BeautifulSoup

# The same sample document used throughout
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, 'lxml')
html = soup.html

# string: get the text inside one tag; these three deserve close study
print(soup.title.string)  # The Dormouse's story

# strings: a generator over every text node in the tree
print(html.strings)
for i in html.strings:
    print(i)

# stripped_strings: the same, but whitespace-stripped, so the blank
# lines between tags disappear from the output
print(html.stripped_strings)
for i in html.stripped_strings:
    print(i)
parent and parents

● parent returns the direct parent node
● parents returns a generator over all ancestors

Code demo with notes
from bs4 import BeautifulSoup

# The same sample document used throughout
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, 'lxml')
title = soup.title

# parent: the direct parent node
print(title.parent)  # <head><title>The Dormouse's story</title></head>

# parents: a generator over every ancestor
print(title.parents)
for p in title.parents:
    print(p)

"""
Walking title.parents prints, in order:
1. title's direct parent: the head tag
2. head's parent: the html tag
3. html's parent: the entire document (the BeautifulSoup object itself)
"""
find() and find_all() [key topic]

● find_all() returns every matching tag as a list
● find() returns only the first match
● find_all() parameters:

● name: the tag name
● attrs: the tag's attributes
● recursive: whether to search descendants recursively
● text: match tags by their text content
● limit: cap the number of results returned
● kwargs: keyword arguments (e.g. id, class_)
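A short sketch of those parameters in action, on a made-up snippet:

```python
from bs4 import BeautifulSoup

html = """
<div>
  <a id="first">one</a>
  <p><a>two</a></p>
  <a>three</a>
</div>
"""
soup = BeautifulSoup(html, 'lxml')

# name + limit: stop after the first two a tags
print(soup.find_all('a', limit=2))

# recursive=False: only direct children of div, so the a inside p is skipped
print(len(soup.div.find_all('a', recursive=False)))  # 2

# kwargs: match by attribute, e.g. id
print(soup.find_all(id='first'))

# match by text content (string= is the current spelling of text=)
print(soup.find_all('a', string='three'))
```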

Code demo with notes
from bs4 import BeautifulSoup

# html_doc is the page source (the same sample document used throughout)
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, 'lxml')

# String filter: get every a tag; find_all() puts all matches in a list
# (findAll is a legacy alias for find_all)
a_list = soup.find_all('a')
print(a_list)
print('-' * 70)
for a in a_list:
    print(a)
print('-' * 70)

# find() returns only the first match
print(soup.find('a'))

# Find both the title and the p tags
# result = soup.find_all('title', 'p')  # wrong: the second positional argument is attrs, so this returns []
result = soup.find_all(['title', 'p'])  # pass a list of names instead
print(result)
Practice example and bs4 recap
from bs4 import BeautifulSoup

# The job table for the exercise. The href values and the id="test"/
# class="test" attributes on one link are illustrative placeholders so the
# attribute-lookup steps below have something to match.
html = """
<table class="tablelist">
<tr class="h"><td>职位名称</td><td>职位类别</td><td>人数</td><td>地点</td><td>发布时间</td></tr>
<tr class="even"><td><a href="#">22989-金融云区块链高级研发工程师(深圳)</a></td><td>技术类</td><td>1</td><td>深圳</td><td>2017-11-25</td></tr>
<tr class="odd"><td><a href="#">22989-金融云高级后台开发</a></td><td>技术类</td><td>2</td><td>深圳</td><td>2017-11-25</td></tr>
<tr class="even"><td><a href="#">SNG16-腾讯音乐运营开发工程师(深圳)</a></td><td>技术类</td><td>2</td><td>深圳</td><td>2017-11-25</td></tr>
<tr class="odd"><td><a href="#">SNG16-腾讯音乐业务运维工程师(深圳)</a></td><td>技术类</td><td>1</td><td>深圳</td><td>2017-11-25</td></tr>
<tr class="even"><td><a href="#">TEG03-高级研发工程师(深圳)</a></td><td>技术类</td><td>1</td><td>深圳</td><td>2017-11-24</td></tr>
<tr class="odd"><td><a href="#">TEG03-高级图像算法研发工程师(深圳)</a></td><td>技术类</td><td>1</td><td>深圳</td><td>2017-11-24</td></tr>
<tr class="even"><td><a href="#">TEG11-高级AI开发工程师(深圳)</a></td><td>技术类</td><td>4</td><td>深圳</td><td>2017-11-24</td></tr>
<tr class="odd"><td><a href="#">15851-后台开发工程师</a></td><td>技术类</td><td>1</td><td>深圳</td><td>2017-11-24</td></tr>
<tr class="even"><td><a id="test" class="test" href="#">15851-后台开发工程师</a></td><td>技术类</td><td>1</td><td>深圳</td><td>2017-11-24</td></tr>
<tr class="odd"><td><a href="#">SNG11-高级业务运维工程师(深圳)</a></td><td>技术类</td><td>1</td><td>深圳</td><td>2017-11-24</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'lxml')

# 1. Get all tr nodes
tr_list = soup.find_all('tr')
for tr in tr_list:
    print(tr)
    print('-' * 74)

# 2. Get the second tr node
print(tr_list[1])

# 3. Find the tr nodes with class "even"
# Option 1: the class_ keyword (class alone is a Python keyword)
class_list = soup.find_all('tr', class_='even')
for c in class_list:
    print(c)
    print('-' * 74)
# Option 2: pass the attributes as a dict
class_list = soup.find_all('tr', attrs={'class': 'even'})

# 4. Locate the a tag with id="test" (several attributes can be combined)
a_list = soup.find_all('a', id='test')
a_lists = soup.find_all('a', attrs={'id': 'test', 'class': 'test'})

# 5. Get the href value of every a tag; the first form is recommended
for a in soup.find_all('a'):
    print(a.get('href'))
    # print(a.attrs['href'])
    # print(a['href'])

# 6. Get every position name; the first tr is the header row, so skip it
tr_list = soup.find_all('tr')[1:]
for tr in tr_list:
    a_list = tr.find_all('a')
    print(a_list)
Reproduced from www.mshxw.com. Original article: https://www.mshxw.com/it/887575.html