Automated Scraping of News Pages

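Some of the images from my last crawl load fine for me but return errors such as 404 or Nginx… for other people, so the content needs to be crawled again. Today's target is People's Daily Online (人民网), which also has some insect-related material.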

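Difficulty: the content on People's Daily Online is loaded dynamically, and as is well known, fetching a dynamically loaded page directly does not return the rendered source.

Solution: use browser automation to simulate a user visit and obtain the rendered source, then scrape, analyze, and organize that source.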

The process breaks down into the following steps.

~ Steps to obtain the page source

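1. Install the third-party library selenium.
2. Download the driver for your browser.
3. Simulate a user and wait for the page to refresh.
4. Get the current page source.
5. Save the source locally.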

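~ Steps to process the (simpler) saved page source

1. Set up the prepared request headers.
2. Browse to a suitable target URL.
3. Analyze the source of the target URL.
4. Scrape it with BeautifulSoup.
5. Organize the result into the required JSON format: [{key-value pairs}, {key-value pairs}]

Install the third-party library selenium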

To get the latest selenium, run pip install selenium directly in the PyCharm terminal.

Download the driver for your browser

https://chromedriver.storage.googleapis.com/index.html

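From the URL above, download the driver that matches your browser version. The win32 build is fine, since it also runs on 64-bit Windows. (This index covers ChromeDriver releases up to version 114; drivers for newer Chrome versions are published through Chrome for Testing.)

After downloading, put the driver in the directory where the current python.exe lives.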
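To find that directory without guessing, a small standard-library sketch like the one below may help. The helper name `driver_locations` is hypothetical, and the executable name `chromedriver` is an assumption based on the Chrome driver used in this post.

```python
import os
import shutil
import sys


def driver_locations(driver_name="chromedriver"):
    """Return (python_dir, found_path) for a driver executable.

    python_dir is the folder holding the running interpreter, which is
    where this post suggests placing the driver. found_path is the
    driver's full path if it can already be found on PATH, else None.
    """
    python_dir = os.path.dirname(os.path.abspath(sys.executable))
    found_path = shutil.which(driver_name)
    return python_dir, found_path


if __name__ == "__main__":
    python_dir, found_path = driver_locations()
    print("Put the driver in:", python_dir)
    print("Driver currently on PATH:", found_path)
```

If the second line prints None after the driver has been copied, the interpreter's folder is probably not on PATH, and the driver location may need to be passed to Selenium explicitly.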

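Simulate a user and wait for the page to refresh

import time
from selenium import webdriver

# use the driver to simulate a user
driver = webdriver.Chrome()
driver.get(url)  # url: the search results page to crawl
driver.maximize_window()
time.sleep(1)  # give the dynamic content a moment to load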
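A fixed sleep can be too short on a slow connection and wasteful on a fast one. One alternative is to poll until the page is actually ready. Selenium ships its own WebDriverWait for this; the sketch below only illustrates the same idea in plain Python, and `wait_until` is a hypothetical helper, not part of Selenium.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without success.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll)  # short pause before checking again
```

With a driver in hand, it could be used as wait_until(lambda: "page-next" in driver.page_source), which waits until the pagination markup has been rendered instead of sleeping for a fixed second.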

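Get the current page source

The total page count is not available up front (it is loaded dynamically as well), so it has to be set by hand. There is a lot of data and I only need a little, so I set it to 10. Once the automation has the source of a page, it waits a moment and then clicks through to the next page, so that the data from these 10 pages gets scraped.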

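from selenium.webdriver.common.by import By

total_pages = 10  # total number of pages to visit
for page in range(1, total_pages + 1):
    page_result = []
    # get the source of the current page
    pageSource = driver.page_source
    time.sleep(1)
    # persist the source here (see "Save the source locally")
    # recent Selenium 4 releases dropped find_element_by_css_selector;
    # find_element(By.CSS_SELECTOR, ...) is the replacement
    driver.find_element(
        By.CSS_SELECTOR,
        "#rmw-search div div:nth-child(2) div.page-container div span.page-next").click()
    time.sleep(2)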

Save the source locally

    # persist the source
    with open("save_html" + str(page) + ".txt", "w", encoding="utf-8") as f:
        f.write(pageSource)
    webSpider(page)  # scrape the persisted data offline
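Since the loop and the save step are shown separately, here is one way the two could fit together. This is a sketch rather than the exact script: the function name `crawl_pages` and its parameters are hypothetical, and it accepts any object that offers `page_source` and `find_element(by, value)`, which a Selenium WebDriver does.

```python
import os
import time


def crawl_pages(driver, total_pages, next_selector, out_dir=".", pause=2.0, on_saved=None):
    """Save the rendered source of `total_pages` consecutive result pages.

    `driver` is any object exposing `page_source` and
    `find_element(by, value)`, such as a Selenium WebDriver.
    Returns the list of file paths written.
    """
    saved = []
    for page in range(1, total_pages + 1):
        # persist the current page before moving on
        path = os.path.join(out_dir, "save_html" + str(page) + ".txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write(driver.page_source)
        saved.append(path)
        if on_saved is not None:
            on_saved(page)  # e.g. webSpider(page) from this post
        if page < total_pages:
            # "css selector" is the string value behind Selenium's By.CSS_SELECTOR
            driver.find_element("css selector", next_selector).click()
            time.sleep(pause)  # give the next page time to render
    return saved
```

Compared with the loop above, this version skips the click after the last page, so the run does not fail if the final page has no "next" button.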

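The simpler steps that follow are not covered in detail here.

When testing, it is best to run the parsing code against a single saved file on its own first; once the result is what you want, move the code into the webSpider method.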

import json
from bs4 import BeautifulSoup

def webSpider(page):
    # read back the source saved for this page
    with open("save_html" + str(page) + ".txt", "r", encoding="utf-8") as f:
        html = f.read()
    # print(html)
    soup = BeautifulSoup(html, "lxml")
    div = soup.find("ul", class_="article")
    div_list = div.find_all("li")
    stt = "["
    for a in div_list:
        # print(a)
        href = a.find("a")["href"]
        # print(href)
        title = a.find("div", class_="ttl").find("a").text
        title = title.replace(u"\xa0", u"")  # strip non-breaking spaces
        if a.find("img") is None:
            img = None
        else:
            img = a.find("img")["src"]
        content = a.find("div", class_="abs").text
        content = content.replace(u"\xa0", u"")
        date = a.find("span", class_="tip-pubtime").text
        fro = a.find("a", class_="tip-source").text
        dic = {"href": href,
               "Title": title,
               "banner": img,
               "Message": content,
               "CreatedAt": date,
               "fro": fro}
        # print(json.dumps(dic, ensure_ascii=False))
        stt += json.dumps(dic, ensure_ascii=False) + ",\n"
    stt = stt[:-2] + "]"  # drop the trailing ",\n" and close the array
    with open("result_insect.txt", "a", encoding="utf-8") as f:
        f.write(stt)
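One thing to watch: the parsing code appends a complete [...] array to result_insect.txt, and it runs once per page, so the file ends up holding several arrays back to back, which a JSON parser will not read as one document. A possible alternative, sketched below with hypothetical helper names `append_records` and `load_records`, is to append one JSON object per line and rebuild the [{...}, {...}] list when reading.

```python
import json


def append_records(path, records):
    """Append each dict in `records` to `path` as one JSON object per line."""
    with open(path, "a", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_records(path):
    """Read the file back into a single list of dicts: [{...}, {...}]."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Inside the parsing loop, each dic would be collected into a list and passed to append_records once per page. Calling json.dumps on the result of load_records then gives the single array the target format asks for.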
