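Using find in BeautifulSoup

BeautifulSoup in Detail

BeautifulSoup is a library commonly used in Python web scraping to parse pages. Our instructor did not cover it in much detail in class, so I pulled together material from around the web and wrote this blog post to learn it properly~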

First, install the BeautifulSoup library.

Run this on the command line:

pip3 install beautifulsoup4

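BeautifulSoup parsers:
The html.parser parser is the one we use most often.

Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(response.read(), "html.parser")	Built into Python; moderate speed; tolerant of malformed documents	Poor tolerance of malformed documents in versions before Python 2.7.3 / 3.2.2
lxml HTML parser	BeautifulSoup(response.read(), "lxml")	Fast; tolerant of malformed documents	Requires installing a C library
lxml XML parser	BeautifulSoup(response.read(), "xml")	Fast; the only parser that supports XML	Requires installing a C library
html5lib	BeautifulSoup(response.read(), "html5lib")	Most lenient; parses pages the way a browser does; produces HTML5 output	Slow; external Python dependency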
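To make the "Usage" column concrete, here is a minimal sketch of passing the parser name as the second argument. The markup string and variable names are invented for this example; swapping "html.parser" for "lxml" or "html5lib" should work the same way, but only if those packages are installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this example (not taken from the forum)
markup = "<html><body><p class='intro'>Hello <b>world</b></p></body></html>"

# The second argument selects the parser; html.parser ships with Python
soup = BeautifulSoup(markup, "html.parser")

intro = soup.find("p")
print(intro.get("class"))  # class can hold several values, so it comes back as a list
print(intro.b.string)      # text inside the nested <b> tag
```

Parsing a string like this is handy for trying things out before pointing the same code at a live response.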

This post uses the Hainan University Qidian (起点) forum as its example.

Standard selectors

find_all(name, attrs, recursive, text, **kwargs)

Searches the document by tag name, attributes, or text content.
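The three kinds of lookup can be sketched on a small inline document. The markup below is made up (loosely shaped like a thread list) and the variable names are illustrative. Note that recent Beautiful Soup versions call the text argument string; text is the older name for the same thing.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modeled on a forum thread list
markup = """
<ul id="threads">
  <li><a class="xst" href="/t/1">First thread</a></li>
  <li><a class="xst" href="/t/2">Second thread</a></li>
  <li><a class="other" href="/t/3">Pinned notice</a></li>
</ul>
"""
soup = BeautifulSoup(markup, "html.parser")

by_name = soup.find_all("a")                           # by tag name
by_class = soup.find_all("a", class_="xst")            # by attribute; class_ avoids the Python keyword
by_text = soup.find_all("a", string="Pinned notice")   # by exact text content

print(len(by_name), len(by_class), len(by_text))
print([a["href"] for a in by_class])
```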

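Some commonly used methods

find()
Takes the same arguments as find_all(), but returns only the first match.
find_parents(), find_parent()
find_parents() returns all ancestor nodes; find_parent() returns the direct parent node.
find_next_siblings(), find_next_sibling()
The first returns all following siblings; the second returns the first following sibling.
find_previous_siblings(), find_previous_sibling()
The first returns all preceding siblings; the second returns the first preceding sibling.
find_all_next(), find_next()
The first returns all matching nodes after the current node; the second returns the first matching node after it.
find_all_previous(), find_previous()
The first returns all matching nodes before the current node; the second returns the first matching node before it.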
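The navigation methods listed above can be tried on a tiny tree. This is only a sketch: the markup and the id "mid" are invented for the example, and each call starts from the middle list item.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a div wrapping a three-item list
markup = "<div id='nv'><ul><li>one</li><li id='mid'>two</li><li>three</li></ul></div>"
soup = BeautifulSoup(markup, "html.parser")
mid = soup.find("li", id="mid")

parent = mid.find_parent()                        # direct parent: the <ul>
ancestors = [p.name for p in mid.find_parents()]  # <ul>, then <div>, then the document itself
next_li = mid.find_next_sibling("li")             # first sibling after mid
prev_li = mid.find_previous_sibling("li")         # first sibling before mid
after = mid.find_all_next("li")                   # every <li> later in the document
before = mid.find_all_previous("li")              # every <li> earlier in the document

print(parent.name, ancestors)
print(prev_li.text, next_li.text)
print([li.text for li in before], [li.text for li in after])
```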

Code:

from bs4 import BeautifulSoup
from urllib import request

url = 'http://www.ihain.cn/forum.php?mod=guide&view=newthread'
# Send a browser-like User-Agent header with the request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36 Edg/94.0.992.31'
}
req = request.Request(url, headers=headers)
response = request.urlopen(req)
# response is a file-like object, so BeautifulSoup can read and parse it directly
html = BeautifulSoup(response, 'html.parser')

# Nested selection: find every <ul>, then the <li> tags inside each one
uls = html.find_all('ul')
for ul in uls:
    print(ul.find_all('li'))

# Selecting by attribute:

titles = html.find_all(class_='xst')
for title in titles:
    print(title.text)

# Pass a dict as the attrs parameter to match on several attributes at once
# (find() accepts attrs in the same way as find_all())

content = html.find(attrs={'class': 'p_opt', 'id': 'card_1550_menu_content'})
print(content)

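find(name, attrs, recursive, text, **kwargs)

Common parameters:

Parameter	Purpose
name	Tag name to search for
text	Text content to search for
attrs	Dictionary of attribute names and values to match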
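A minimal sketch of the three parameters in the table, run against a made-up document (the ids menu_a and menu_b are hypothetical). It also shows two details worth knowing: searching by text alone returns the string itself rather than a tag, and find() returns None when nothing matches. The example uses string, the newer name for the text argument.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this example
markup = """
<div class="p_opt" id="menu_a">first menu</div>
<div class="p_opt" id="menu_b">second menu</div>
<p>plain paragraph</p>
"""
soup = BeautifulSoup(markup, "html.parser")

first_div = soup.find("div")                                    # name: first <div> in the document
by_attrs = soup.find(attrs={"class": "p_opt", "id": "menu_b"})  # attrs: all listed attributes must match
by_string = soup.find(string="plain paragraph")                 # text only: returns the string node
missing = soup.find("table")                                    # no match: returns None

print(first_div["id"])
print(by_attrs.text)
print(by_string.parent.name)
print(missing)
```

Because find() can return None, it is worth checking the result before calling .text or .get() on it.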

find_all
Returns every match, unlike find(), which returns only the first one.
Syntax:
find_all(name, attrs, recursive, text, limit, **kwargs)
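The limit and recursive parameters in this signature are easy to overlook, so here is a short sketch of what they do. The nested markup is invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one <p> is nested a level deeper than the other two
markup = "<div id='outer'><p>a</p><div><p>b</p></div><p>c</p></div>"
soup = BeautifulSoup(markup, "html.parser")
outer = soup.find("div", id="outer")

all_p = outer.find_all("p")                         # every descendant <p>
first_two = outer.find_all("p", limit=2)            # stop after two matches
direct_only = outer.find_all("p", recursive=False)  # direct children only

print([p.text for p in all_p])
print([p.text for p in first_two])
print([p.text for p in direct_only])
```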

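Multi-level indexing:

Search target: [screenshot]

Corresponding HTML: [screenshot]

What we want to obtain by indexing: [screenshot]

Index of each item in the list: [screenshot]

We narrow the search step by step until we reach what we want.

Corresponding code: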

from bs4 import BeautifulSoup
from urllib import request

url = 'http://www.ihain.cn/forum.php?mod=guide&view=newthread'
# Send a browser-like User-Agent header with the request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36 Edg/94.0.992.31'
}
req = request.Request(url, headers=headers)
response = request.urlopen(req)
# response is a file-like object, so BeautifulSoup can read and parse it directly
html = BeautifulSoup(response, 'html.parser')
# Use find_all to index into nested tags:
div = html.find_all('div', id='nv')
# find_all returns a list, so div is a list even with a single match

# print(div[0].contents[1], '\n')
# print(div[0].contents[2], '\n')
# print(div[0].contents[3], '\n')

print(div[0].contents[3], '\n')  # keep indexing downward from here

print(div[0].contents[3].contents[1])  # the second li under the ul

print(div[0].contents[3].contents[1].contents[0])  # first child of that li, i.e. the <a> tag and its contents

print(div[0].contents[3].contents[1].contents[0].text)  # text of the <a> tag (起点百事通)

print(div[0].contents[3].contents[1].contents[0].get('href'))  # the link target of the <a> tag
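One thing to keep in mind when picking indices for .contents: the list holds text nodes as well as tags, so whitespace between tags takes up positions of its own. The sketch below uses made-up markup shaped like a navigation bar (not the forum's real HTML) to show this, and filters the children down to tags before indexing.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup; the line breaks between tags become text nodes
markup = """<div id="nv">
<a href="/quick">Quick nav</a>
<ul>
<li><a href="/a">Item A</a></li>
<li><a href="/b">Item B</a></li>
</ul>
</div>"""
soup = BeautifulSoup(markup, "html.parser")
nav = soup.find("div", id="nv")

# Tags and whitespace strings alternate, which is why the <ul> sits at index 3
print([type(child).__name__ for child in nav.contents])

ul = nav.contents[3]
# Keep only real tags so that index 1 reliably means "the second <li>"
tags_only = [child for child in ul.contents if isinstance(child, Tag)]
second_link = tags_only[1].find("a")

print(second_link.text, second_link.get("href"))
```

If the page's whitespace changes, raw .contents indices shift, while the filtered list stays stable.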

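At this point, BeautifulSoup's find_all covers most lookup needs. XPath and similar approaches can be more convenient, but they are a different way of locating content and are not covered in this post~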

If this post helped you, remember to give it a like~
