Getting Started with Python Web Scraping

Open Settings -> Project -> Project Interpreter.

The Package list shows every installed Python package. Click the plus sign on the right to search for and install new packages.

1. Scraping Douban Top 250 movie information

Required packages: beautifulsoup4, requests
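Before running the scraper, it can help to confirm that both packages are visible to the selected interpreter. The following is a minimal sketch using only the standard library; the helper name missing_packages and the REQUIRED mapping are illustrative assumptions, not part of the original tutorial.

```python
import importlib.util

# Install names and import names can differ: beautifulsoup4 is imported as bs4.
REQUIRED = {"beautifulsoup4": "bs4", "requests": "requests"}


def missing_packages(required):
    """Return the install names whose import name cannot be found."""
    return [
        package
        for package, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


# An empty list means everything needed is importable.
print(missing_packages(REQUIRED))
```

Any name printed here still needs to be installed through the package list described above.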

 
Scraping the Douban Top 250 movies
import requests
import bs4

# 1. Open the page to scrape
# 2. Check the order in which the endpoints (page URLs) are called
# 3. Inspect the structure of the data
result = []
header = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/93.0.4577.82 Safari/537.36'
}
start = 0
for i in range(0, 10):
    response = requests.get('https://movie.douban.com/top250?start=' + str(start) + '&filter=', headers=header)
    response.encoding = 'utf-8'
    html = response.text
    # Each page lists 25 movies, so advance the offset for the next request
    start += 25
    soup = bs4.BeautifulSoup(html, 'html.parser')
    # Get all div elements with class "info" and iterate over them
    for item in soup.find_all('div', 'info'):
        title = item.div.a.span.string
        # Locate the p element that holds the release year and take its second line
        year_line = item.find('div', 'bd').p.contents[2].string
        year_line = year_line.replace('\n', '')
        year_line = year_line.replace(' ', '')
        year = year_line[0:4]
        # rating = item.find('div', 'bd').div.find('span', 'rating_num').string
        rating = item.find('span', {'class': 'rating_num'}).get_text()
        one_result = [title, rating, year]
        result.append(one_result)
    print('Page %d scraped...' % (i + 1))
print(result)
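The paging and year-cleaning steps in the script above can be tried without touching the network. The sketch below is one possible way to isolate them: page_urls, extract_year, and the sample line are hypothetical names and data introduced here for illustration, and the regular expression is a substitute for the replace-and-slice approach in the script, not the original method.

```python
import re

BASE_URL = "https://movie.douban.com/top250"
PAGE_SIZE = 25  # the script advances the start offset by 25 per page


def page_urls(pages=10):
    """Build the URL of each listing page: start=0, 25, 50, ..."""
    return [f"{BASE_URL}?start={page * PAGE_SIZE}&filter=" for page in range(pages)]


def extract_year(year_line):
    """Return the first run of four digits in the line, or None if there is none."""
    match = re.search(r"\d{4}", year_line)
    return match.group(0) if match else None


# Hypothetical sample of the text node that holds the year (assumed layout).
sample_line = "\n                1994\xa0/\xa0美国\xa0/\xa0犯罪 剧情\n"

print(page_urls()[:2])
print(extract_year(sample_line))
```

Searching for the digits directly does not depend on which whitespace characters surround the year, which is where the slice-based version is most fragile.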

2. Downloading overseas epidemic data from Tencent News

Required packages: selenium, beautifulsoup4, pandas

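Selenium also needs the Chrome driver (chromedriver). After downloading and unzipping it, placing it in the Python root directory is recommended, because that avoids having to change environment variables.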

Download the driver that matches your browser's actual version: https://chromedriver.storage.googleapis.com/index.html

Without the driver, an error like this is raised:

selenium.common.exceptions.WebDriverException: Message: chromedriver executable needs to be in PATH.
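To check for this situation before launching the browser, one option is to look for the driver yourself. The sketch below uses only the standard library; find_chromedriver is a hypothetical helper, and searching the interpreter's own directory is an assumption based on the placement advice above.

```python
import os
import shutil
import sys


def find_chromedriver(extra_dirs=()):
    """Return the full path of chromedriver if it can be found, else None.

    extra_dirs are searched first, followed by the directories on PATH.
    """
    search_path = os.pathsep.join([*extra_dirs, os.environ.get("PATH", "")])
    return shutil.which("chromedriver", path=search_path)


# Assumption: the driver was unzipped next to the Python executable.
python_root = os.path.dirname(sys.executable)
driver_path = find_chromedriver([python_root])
print(driver_path or "chromedriver not found; Selenium would raise WebDriverException")
```

If nothing is found, move the unzipped driver into one of the searched directories or add its folder to PATH.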

 
Downloading overseas epidemic data
from selenium import webdriver
import bs4
import pandas as pd
