Learning goal: get a basic grasp of web scraping.
Fetch the data behind a site's AJAX requests and write it into an Excel spreadsheet.
```python
import requests
import xlwt
```
## First, find the URL of the AJAX request. Press F12 to open the developer tools, click Network, then refresh the page (or click one of the page numbers at the bottom) to locate the request that carries the data. Under Headers you can see the request URL and method (POST or GET), and at the very bottom of Headers, the request's data.
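With a GET request, those data parameters belong in the URL's query string. A minimal sketch of how `requests` assembles them (the endpoint here is a placeholder, since the real URL was removed from this post); preparing the request shows the final URL without touching the network:

```python
import requests

# Hypothetical endpoint: the real URL is not given in this post.
url = 'https://example.com/api/questions/'
params = {'page': '1', 'search': '', 'ordering': '-frequency'}

# requests builds the query string from params; preparing the request
# reveals the final URL without sending anything over the network.
prepared = requests.Request('GET', url, params=params).prepare()
print(prepared.url)
# https://example.com/api/questions/?page=1&search=&ordering=-frequency
```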
```python
def askurl(url, i):
    head = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
    }
    data = {
        'page': str(i),
        'search': '',
        'ordering': '-frequency'
    }
    # For a GET request the parameters go into the query string (params=),
    # not into the request body (data=).
    response = requests.get(url, headers=head, params=data)
    page_text = response.json()
    # print(page_text)
    return page_text
```
The value in head comes from the User-Agent entry in the request's Request Headers section (Chrome shows these labels in English):
All the values in the data dict come from here as well; since each page uses a different page value, it is made a variable.
In the Response tab you can see the JSON that comes back, and you can pick out whichever fields you need.
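The exact JSON layout is site-specific, but judging from the parsing code further down, each response carries a `list` of items, each holding a `value`, a `time`, and a nested `leetcode` object. A sketch with a made-up response of that shape:

```python
# A made-up response body with the structure the parsing code expects.
page_text = {
    'list': [
        {
            'value': 12,
            'time': '2020-05-01T08:00:00',
            'leetcode': {
                'title': 'Two Sum',
                'level': 1,
                'frontend_question_id': '1',
                'slug_title': 'two-sum',
            },
        },
    ],
}

# Keep only the fields we care about, one row per question.
rows = [
    item['leetcode']['frontend_question_id'] + '. ' + item['leetcode']['title']
    for item in page_text['list']
]
print(rows)  # ['1. Two Sum']
```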
```python
def main():
    baseurl = 'https://'
    datalist = getData(baseurl)
    # print(datalist)
    path = "大厂面试常见算法题.xls"
    saveData(datalist, path)
```
```python
def getData(baseurl):
    datalist = []
    for i in range(1, 41):
        # askurl adds page/search/ordering as query parameters.
        page_text = askurl(baseurl, i)
        for item in page_text['list']:
            data = []
            # print(item)
            title = item['leetcode']['title']
            value = item['value']
            level = item['leetcode']['level']
            question_id = item['leetcode']['frontend_question_id']
            time = item['time'][0:10]
            slug_title = 'https://problems/' + str(item['leetcode']['slug_title'])
            data.append(question_id + '.' + title)
            # print(data)
            if int(level) == 1:
                data.append('简单')
            elif int(level) == 2:
                data.append('中等')
            elif int(level) == 3:
                data.append('困难')
            data.append(value)
            data.append(time)
            data.append(slug_title)
            datalist.append(data)
    return datalist
```
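The if/elif chain that turns level into a difficulty label can also be written as a dictionary lookup, which is a bit easier to extend (the '未知' fallback is my own addition for unexpected values):

```python
# Numeric difficulty levels from the JSON mapped to their labels.
LEVELS = {1: '简单', 2: '中等', 3: '困难'}

def level_name(level):
    # .get() falls back to '未知' (unknown) if the site adds new levels.
    return LEVELS.get(int(level), '未知')

print(level_name(2))  # 中等
```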
Finally, write the collected data into Excel:
```python
def saveData(datalist, path):
    print('正在saving·······')
    book = xlwt.Workbook(encoding='utf-8', style_compression=0)
    sheet = book.add_sheet('大厂常见面试算法题', cell_overwrite_ok=True)
    col = ('题目', '难度', '出现次数', '最新考察时间', 'LeetCode链接')
    for i in range(0, 5):
        sheet.write(0, i, col[i])
    # Iterate over however many rows were actually collected
    # instead of hard-coding 785.
    for i in range(len(datalist)):
        data = datalist[i]
        for j in range(0, 5):
            sheet.write(i + 1, j, data[j])
    book.save(path)


if __name__ == '__main__':
    main()
```
And with that, we have the data we wanted.
PS: the URL of the scraped site has been altered. For the original source code, search for the WeChat official account "一团追梦喵" and reply "python爬虫".


