PaddlePaddle AI Studio: Python Basics Course, Practice Assignment 2 (Web Crawler)

Practice Assignment 2: Baidu Baike Crawler

Task description

In this exercise we use Python to scrape the information of every contestant in 《青春有你2》 from Baidu Baike.
Data source: https://baike.baidu.com/item/青春有你第二季

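How a crawler works:

Simulate a browser --> send a request to the target site --> receive the response data --> extract the useful data --> save it locally or to a database.

The crawling process:

Send the request (requests module)
Receive the response data (returned by the server)
Parse and extract the data (BeautifulSoup, a third-party library)
Save the data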
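The four steps above can be sketched as four small functions. This is only a minimal offline sketch, not the assignment's code: `fetch`, `parse`, `save`, and `crawl` are hypothetical names, the HTML is a hard-coded stand-in for a real response, and a regular expression stands in for BeautifulSoup so that only the standard library is needed.

```python
import json
import re


def fetch(url):
    # Stand-in for requests.get(url).text so the sketch runs without a network.
    return '<ul><li>Alice</li><li>Bob</li></ul>'


def parse(html):
    # Stand-in for BeautifulSoup; a regex is enough for this toy markup.
    return re.findall(r'<li>(.*?)</li>', html)


def save(items):
    # Serialize instead of writing to disk; a real crawler would write a file.
    return json.dumps(items, ensure_ascii=False)


def crawl(url):
    html = fetch(url)    # steps 1 and 2: send the request, receive the response
    items = parse(html)  # step 3: parse and extract
    return save(items)   # step 4: save


print(crawl('https://example.com'))  # prints ["Alice", "Bob"]
```

Keeping each step in its own function makes it easier to swap in the real request and parsing code later without touching the other steps.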

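1. Library: the requests module

requests is a simple, easy-to-use HTTP library for Python. requests.get(url) sends an HTTP GET request and returns a response object holding what the server sent back (for example, response.text for the body as text).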
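The target URL in this assignment contains Chinese characters. requests is expected to percent-encode such characters when it prepares the request, so the URL can be passed as written. The sketch below uses only the standard library's `urllib.parse` to show what that encoded form looks like; the names `BASE`, `title`, and `encoded_url` are illustrative.

```python
from urllib.parse import quote, unquote

BASE = 'https://baike.baidu.com/item/'
title = '青春有你第二季'

# quote() encodes the text as UTF-8 and writes each byte as %XX.
encoded_url = BASE + quote(title)
print(encoded_url)

# unquote() reverses the encoding and restores the readable URL.
print(unquote(encoded_url))
```

Each of the seven characters in the title takes three UTF-8 bytes, so the encoded title is 63 characters long.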

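2. Library: BeautifulSoup

BeautifulSoup is a Python library for extracting data from HTML or XML files. It supports the HTML parser in the Python standard library as well as several third-party parsers, one of which is lxml.
Use BeautifulSoup(markup, "html.parser") or BeautifulSoup(markup, "lxml"). lxml is the recommended parser because it is generally faster.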
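If lxml may not be installed, one option is to fall back to the standard-library parser. The following is a minimal sketch under that assumption; `make_soup` is a hypothetical helper name, and it relies on bs4 raising `FeatureNotFound` when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Prefer lxml, but fall back to the built-in parser if it is missing."""
    try:
        return BeautifulSoup(markup, 'lxml')
    except FeatureNotFound:
        return BeautifulSoup(markup, 'html.parser')


soup = make_soup('<p class="intro">Hello, <b>Baike</b></p>')
# find() returns the first matching tag; get_text() joins its text nodes.
print(soup.find('p', {'class': 'intro'}).get_text())  # prints Hello, Baike
```

The two parsers can build slightly different trees for malformed HTML, so it is worth testing selectors with the parser that will actually be used.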

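Supplement: installing and using libraries persistently

To keep installed libraries across sessions on the PaddlePaddle (AI Studio) server, create a persistent directory and install into it:

# Use a persistent path, as in the example below (in a notebook cell, prefix each command with !):
mkdir /home/aistudio/external-libraries
pip install beautifulsoup4 -t /home/aistudio/external-libraries
pip install lxml -t /home/aistudio/external-libraries

Then add the path in Python so that the libraries can be imported:

# Add the following code and run it each time the environment (kernel) starts:
import sys
sys.path.append('/home/aistudio/external-libraries')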
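Because the cell above may be run several times in one session, a small guard keeps the path from being appended repeatedly. This is only a sketch; `enable_external_libs` and `EXTERNAL_LIBS` are hypothetical names.

```python
import sys

EXTERNAL_LIBS = '/home/aistudio/external-libraries'


def enable_external_libs(path=EXTERNAL_LIBS):
    """Append path to sys.path once; return True if it was newly added."""
    if path in sys.path:
        return False
    sys.path.append(path)
    return True


added = enable_external_libs()
print('added' if added else 'already present')
```

Appending (rather than inserting at the front) means packages already installed in the environment still take precedence over the copies in the persistent directory.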

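Code walkthrough

Import the required packages and prepare the name of the JSON file that stores the initially scraped information

import json
import re
import requests
from bs4 import BeautifulSoup
import os
import datetime

# Get today's date and format it for naming files later, e.g. 20211217
today = datetime.date.today().strftime('%Y%m%d')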
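To make the date format concrete, the sketch below formats a fixed date and builds the JSON file path from it. `json_path_for` is a hypothetical helper; the `work` folder matches the directory used later in this article.

```python
import datetime
import os


def json_path_for(day, folder='work'):
    """Build the JSON file path for a given date, e.g. <folder>/20211217.json."""
    return os.path.join(folder, day.strftime('%Y%m%d') + '.json')


sample_day = datetime.date(2021, 12, 17)
print(sample_day.strftime('%Y%m%d'))  # prints 20211217
print(json_path_for(sample_day))
```

%Y is the four-digit year, %m the zero-padded month, and %d the zero-padded day, so file names sort chronologically.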

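Simulate a browser request and extract the useful part of the response

def crawl_wiki_data():
    """
    Scrape the contestant information for 《青春有你2》 from Baidu Baike and return the HTML of the matching table
    """
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
    }
    url = 'https://baike.baidu.com/item/青春有你第二季'

    try:
        response = requests.get(url, headers=headers, timeout=15)

        # Passing a document (here, a string) to the BeautifulSoup constructor yields a document object
        soup = BeautifulSoup(response.text, 'lxml')

        # Whether the target is marked by a class or by some other attribute depends on the actual page.
        # Inspect it with the browser's developer tools (F12) every time. Understanding this step is
        # essential for everything that follows: sites differ, and page updates can break an older crawler.
        tables = soup.find_all('table', {'log-set-param': 'table_view'})

        # Locate the wanted table by the title that precedes it
        crawl_table_title = "参赛学员"

        for table in tables:
            # Search the tags that come before the current node; what to look for depends on the actual page
            table_titles = table.find_previous('div').find_all('h3')

            for title in table_titles:
                if crawl_table_title in title.get_text():
                    return table
    except Exception as e:
        print(e)
    # No matching table was found, or the request failed
    return None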
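The way a table is located by its preceding title can be tried offline on a small HTML snippet. The sketch below is an illustration only: `SAMPLE_HTML` is made-up markup that imitates the structure assumed above, `find_table_by_title` is a hypothetical name, and `html.parser` is used so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Made-up markup: each table is preceded by a div that holds its h3 title.
SAMPLE_HTML = """
<div><h3>Other section</h3></div>
<table log-set-param="table_view"><tr><td>x</td></tr></table>
<div><h3>参赛学员</h3></div>
<table log-set-param="table_view"><tr><th>Name</th></tr><tr><td>A</td></tr></table>
"""


def find_table_by_title(html, wanted_title):
    soup = BeautifulSoup(html, 'html.parser')
    for table in soup.find_all('table', {'log-set-param': 'table_view'}):
        # find_previous() walks backwards through the document from this table.
        heading_div = table.find_previous('div')
        if heading_div is None:
            continue
        for h3 in heading_div.find_all('h3'):
            if wanted_title in h3.get_text():
                return table
    return None


table = find_table_by_title(SAMPLE_HTML, '参赛学员')
print(table.find('td').get_text())  # prints A
```

If the real page changes its structure, only the attribute filter and the `find_previous` target should need adjusting.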

Clean the data and store it as formatted JSON

def parse_wiki_data(table_html):
    '''
    Parse the contestant information from the HTML returned by Baidu Baike and save it
    as a JSON file in the work directory, named after the current date
    '''
    bs = BeautifulSoup(str(table_html), 'lxml')

    all_trs = bs.find_all('tr')

    # Quote characters to strip from the flower message
    error_list = ["'", '"', '“', '”']

    stars = []

    # Skip the header row
    for tr in all_trs[1:]:

        all_tds = tr.find_all('td')
        star = {}
        # Name
        star["name"] = all_tds[0].text
        # Link to the contestant's own Baidu Baike page
        star["link"] = 'https://baike.baidu.com' + all_tds[0].find('a').get('href')
        # Place of origin
        star["zone"] = all_tds[1].text
        # Zodiac sign
        star["constellation"] = all_tds[2].text

        # Flower message, with single and double quotes removed
        flower_word = all_tds[3].text
        for c in error_list:
            flower_word = flower_word.replace(c, '')
        star["flower_word"] = flower_word

        # Company: prefer the link text when the cell contains a link
        if all_tds[4].find('a') is not None:
            star["company"] = all_tds[4].find('a').text
        else:
            star["company"] = all_tds[4].text

        stars.append(star)

    # Dump the list directly; converting it with str() and swapping quotes breaks on values that contain quotes
    with open('work/' + today + '.json', 'w', encoding='UTF-8') as f:
        # ensure_ascii=False stores Chinese text as-is instead of as \uXXXX escapes
        json.dump(stars, f, ensure_ascii=False)
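The quote stripping and the effect of `ensure_ascii=False` can be checked in isolation. The sketch below uses made-up sample data and hypothetical helper names (`clean_text`, `save_stars`, `load_stars`), and writes to a temporary directory rather than to work/.

```python
import json
import os
import tempfile

# Characters to delete: straight and curly single/double quotes are assumed here.
STRIP_QUOTES = str.maketrans('', '', '\'"“”')


def clean_text(text):
    """Remove quote characters in one pass with str.translate."""
    return text.translate(STRIP_QUOTES)


def save_stars(stars, path):
    with open(path, 'w', encoding='UTF-8') as f:
        json.dump(stars, f, ensure_ascii=False)


def load_stars(path):
    with open(path, 'r', encoding='UTF-8') as f:
        return json.load(f)


stars = [{'name': '选手A', 'flower_word': clean_text('“It\'s bright”')}]

with tempfile.TemporaryDirectory() as tmp_dir:
    json_path = os.path.join(tmp_dir, 'demo.json')
    save_stars(stars, json_path)
    loaded = load_stars(json_path)

print(loaded[0]['flower_word'])  # prints Its bright
print(json.dumps(stars))                      # Chinese text appears as \uXXXX escapes
print(json.dumps(stars, ensure_ascii=False))  # Chinese text is kept readable
```

Both forms load back to the same data; `ensure_ascii=False` only changes how the file looks when opened in an editor.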

Fetch each person's pictures from the Baike album
Visit each contestant's own Baike page, then collect all the photos in that page's album

def crawl_pic_urls():
    '''
    Scrape each contestant's pictures from Baidu Baike and save them
    '''
    with open('work/' + today + '.json', 'r', encoding='UTF-8') as file:
        json_array = json.loads(file.read())

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
    }
    for star in json_array:

        name = star['name']
        link = star['link']

        # Some pages differ (duplicate names, different layouts), so failures are skipped by catching exceptions
        try:
            # Send the request
            response = requests.get(link, headers=headers, timeout=15)
            bs = BeautifulSoup(response.text, 'lxml')

            # Get the URL of the album from the contestant's Baike page
            pic_list_url = bs.select('.summary-pic a')[0].get('href')
            pic_list_url = 'https://baike.baidu.com' + pic_list_url
            print(name)

            # Request the album page
            pic_list_response = requests.get(pic_list_url, headers=headers, timeout=15)

            # Parse the album page and collect the photo links
            bs = BeautifulSoup(pic_list_response.text, 'lxml')
            pic_list_html = bs.select('.pic-list img')

            pic_urls = []
            for pic_html in pic_list_html:
                pic_url = pic_html.get('src')
                print(pic_url)
                pic_urls.append(pic_url)
            down_pic(name, pic_urls)
        except Exception as e:
            # Skip this contestant and move on to the next one
            print('Skipped %s: %s' % (name, e))
            continue
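The two CSS selectors used above can be exercised offline. In the sketch below, `PERSON_PAGE` and `ALBUM_PAGE` are made-up fragments that only imitate the class names assumed by the crawler, and `album_url` and `picture_urls` are hypothetical helper names. It also shows one way to avoid the IndexError that `[0]` raises when no album link exists.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = 'https://baike.baidu.com'

# Made-up fragments imitating the assumed page structure.
PERSON_PAGE = '<div class="summary-pic"><a href="/pic/demo/1"><img src="cover.jpg"></a></div>'
ALBUM_PAGE = (
    '<div class="pic-list">'
    '<a><img src="https://example.com/a.jpg"></a>'
    '<a><img src="https://example.com/b.jpg"></a>'
    '<img alt="no src">'
    '</div>'
)


def album_url(person_html):
    """Return the absolute album URL, or None when the page has no album link."""
    soup = BeautifulSoup(person_html, 'html.parser')
    links = soup.select('.summary-pic a')
    if not links:
        return None
    return urljoin(BASE_URL, links[0].get('href'))


def picture_urls(album_html):
    """Collect the src of every image in the list, skipping images without one."""
    soup = BeautifulSoup(album_html, 'html.parser')
    return [img.get('src') for img in soup.select('.pic-list img') if img.get('src')]


print(album_url(PERSON_PAGE))
print(picture_urls(ALBUM_PAGE))
```

`urljoin` handles both relative and absolute href values, which plain string concatenation does not.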

Download module

def down_pic(name, pic_urls):
    '''
    Download every picture in the list pic_urls and save it in a folder named after name
    '''
    path = 'work/' + 'pics/' + name + '/'

    if not os.path.exists(path):
        os.makedirs(path)

    for i, pic_url in enumerate(pic_urls):
        try:
            pic = requests.get(pic_url, timeout=15)
            string = str(i + 1) + '.jpg'
            with open(path + string, 'wb') as f:
                f.write(pic.content)
                print('Downloaded picture %s: %s' % (str(i + 1), str(pic_url)))
        except Exception as e:
            print('Failed to download picture %s: %s' % (str(i + 1), str(pic_url)))
            print(e)
            continue
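The file-writing part of the download step can be separated from the network request and tested on its own. The sketch below is an assumption-laden illustration: `save_picture` is a hypothetical helper, the bytes are fake data rather than a real image, and a temporary directory replaces work/pics/.

```python
import os
import tempfile


def save_picture(content, folder, index):
    """Write raw bytes to <folder>/<index>.jpg, creating the folder if needed."""
    os.makedirs(folder, exist_ok=True)  # no error if the folder already exists
    file_path = os.path.join(folder, '%d.jpg' % index)
    with open(file_path, 'wb') as f:
        f.write(content)
    return file_path


with tempfile.TemporaryDirectory() as tmp_dir:
    folder = os.path.join(tmp_dir, 'pics', 'demo_name')
    fake_bytes = b'\xff\xd8 not a real image'
    saved = [save_picture(fake_bytes, folder, i + 1) for i in range(2)]
    names = sorted(os.listdir(folder))
    size = os.path.getsize(saved[0])

print(names)  # prints ['1.jpg', '2.jpg']
```

In a real crawler the bytes would come from the response body, ideally after checking that the request succeeded, so that an error page is not saved as a .jpg file.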

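Print the results and start the program

def show_pic_path(path):
    '''
    Walk through every scraped picture and print each picture's absolute path
    '''
    pic_num = 0
    for (dirpath, dirnames, filenames) in os.walk(path):
        for filename in filenames:
            pic_num += 1
            print("Picture %d: %s" % (pic_num, os.path.join(dirpath, filename)))
    print("Scraped %d pictures of 《青春有你2》 contestants in total" % pic_num)

if __name__ == '__main__':

    # Scrape the contestant information for 《青春有你2》 from Baidu Baike and return the HTML
    html = crawl_wiki_data()

    # Parse the HTML, extract the contestant information, and save it as a JSON file
    parse_wiki_data(html)

    # Scrape and save the pictures from each contestant's Baidu Baike page
    crawl_pic_urls()

    # Print the paths of the scraped pictures
    show_pic_path('/home/aistudio/work/pics/')

    print("All information has been scraped!")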
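The counting logic in show_pic_path relies on os.walk visiting every subfolder. A minimal sketch of that behavior, using a temporary directory with made-up folder and file names (`count_files` is a hypothetical helper):

```python
import os
import tempfile


def count_files(root):
    """Count the files under root, including those in nested folders."""
    total = 0
    # os.walk yields one (dirpath, dirnames, filenames) tuple per directory.
    for dirpath, dirnames, filenames in os.walk(root):
        total += len(filenames)
    return total


with tempfile.TemporaryDirectory() as tmp_dir:
    for person, n_pics in (('person_a', 2), ('person_b', 1)):
        person_dir = os.path.join(tmp_dir, person)
        os.makedirs(person_dir)
        for i in range(n_pics):
            with open(os.path.join(person_dir, '%d.jpg' % (i + 1)), 'wb') as f:
                f.write(b'demo')
    total_pics = count_files(tmp_dir)

print(total_pics)  # prints 3
```

If the path passed to os.walk does not exist, the loop simply does not run and the count stays at zero, which is worth keeping in mind when the reported total is unexpectedly 0.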

Run results:

Reference

https://aistudio.baidu.com/aistudio/projectdetail/3250433
