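Crawler Mini-Exercises: a series of personal practice notes, kept so that I don't forget what I've learned.
If this gets in the way of your browsing, I sincerely apologize.
Contents
- 09.24 -
A simple information crawler for Sina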

import os
import time

import lxml.html
import requests

start = time.time()
# Fetch the HTML
url = 'http://blog.sina.com.cn/'
referer = 'http://blog.sina.com.cn/'
headers = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/'
                   '93.0.4577.82 Safari/537.36'),
    'Referer': referer,
    # Raw cookie string copied from the browser. requests' cookies= argument
    # expects one dict entry per cookie name, so the whole string is sent as
    # a Cookie header instead.
    'Cookie': ('SUBP=0033WrSXqPxfM725Ws9jqgMF55529P9D9W56ZWpY2kh4UX'
               '-ack41DJX65JpX5KMhUgL.FoqNe0-EeoeNSKB2dJLoI0qLxKqL1Kn'
               'LB-qLxK-L12qLB-qLxKBLBo.L1K5LxK-LBKBLBKMLxKML12-L12zLxK-L1K2L1K5t;'
               ' SCF=Aos5-_L1QuNw3BUSHgTFrkXfKlHMhdEZ-CrPDb1sctPR1OFTU7L6KvAa7'
               'HyrmVdoM-f5dYjJhkHoL9cI6brUxiM.; SUB=_2A25MVh_YDeRhGeBJ6FcT8'
               'i3LzjiIHXVvInYQrDV8PUNbmtAKLRLykW9NRlrvIh9-J1rDwMXvBihW_yWqk'
               'qJhqliw; ALF=1664328458; SSOLoginState=1632792456'),
}
html = requests.get(url, headers=headers, timeout=10).content.decode()
print(html)
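# Select the link text
tree = lxml.html.fromstring(html)
# NOTE: the attribute test inside [@...] is missing from this listing, so the
# XPath is invalid as written; supply the target div's attribute before running.
name_list = tree.xpath('//div[@]/ul/li/a/text()')
print(len(name_list))
print(name_list)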
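# Create the output directory ('新闻' means "news")
os.makedirs('新闻', exist_ok=True)
# Save the titles, one per line (open the file once instead of once per title)
with open('新闻/1.txt', 'a', encoding='utf-8') as f:
    for name in name_list:
        f.write(name + '\n')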
end = time.time()
print(f'Elapsed: {end - start:.2f} s')
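
The selection step in the script depends on an attribute test inside the XPath, so here is a small offline sketch of what such a test looks like. It is only an illustration under stated assumptions: the sample markup, the class names `news` and `ads`, and the helper `link_texts` are all made up and are not taken from the Sina page. It also swaps in the standard library's `xml.etree.ElementTree`, which supports only a limited XPath subset and needs well-formed markup, so that it runs without network access; real pages are rarely well-formed, which is why the script uses `lxml.html`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page; the class names are invented
# for illustration and do not come from blog.sina.com.cn.
SAMPLE = """
<html><body>
  <div class="news"><ul>
    <li><a href="/a">First post</a></li>
    <li><a href="/b">Second post</a></li>
  </ul></div>
  <div class="ads"><ul>
    <li><a href="/c">Sponsored</a></li>
  </ul></div>
</body></html>
"""


def link_texts(markup, div_class):
    """Return the text of every <a> under div[@class=div_class]/ul/li."""
    root = ET.fromstring(markup.strip())
    # The [@class='...'] predicate keeps only divs whose class attribute
    # equals div_class exactly, so the "ads" block is skipped.
    path = f".//div[@class='{div_class}']/ul/li/a"
    return [a.text for a in root.findall(path)]


titles = link_texts(SAMPLE, 'news')
print(titles)
```

With `lxml`, the same idea would be written as a full XPath expression such as `//div[@class="news"]/ul/li/a/text()`, again with `news` standing in for whatever attribute the real page uses.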