
Implementing Search Suggestions with a Python Scraper


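Over the May Day holiday, just as I was about to cut loose, I noticed the search box suggesting completions as I typed.

How does the browser know what I am about to type?

Could I use that to do something fun?

So I pressed F12, and the page showed its true form.

There was too much in the developer tools, so I cleared the log and repeated the action.

Hmm, there is something interesting in here.

And it is a GET request, which makes things easy.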

import urllib.request

# The "Request URL" shown under "General" in the developer tools
_360_url = 'https://smart.sug.so.com/suggest?crec=0&pid=webpage&word=%E5%B9%82&srcg=&src=https://www.mshxw.com/skin/sinaskin/image/nopic.gif'

# The 'user-agent' value shown under "Request Headers"
headers = {
    'user-agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36 QIHU 360SE/13.1.5330.0"
}

# Named req rather than re, so it cannot be confused with the standard re module
req = urllib.request.Request(url=_360_url, headers=headers)
res = urllib.request.urlopen(req)
T = res.read().decode('utf-8')

T now holds the data shown under Preview (Response), just without line breaks or indentation.

>>> print(T)

 __jsonp28__({"abv":"b","errno":0,"data":{"query":"幂","errorcode":0,"ext":"nlpv=test_yc_18","version":"revise","result":[{"ci":"1.000000","ctrScore":"0.070790","recallType":"nginx'wangdun'ci'tail'xgb","recallScore":"0.000000","word":"幂的运算","rerankScore":"0.266065","rankScore":"0.266065"},{"ci":"1.000000","ctrScore":"0.064769","recallType":"nginx'wangdun'ci'tail'xgb","recallScore":"0.000000","word":"幂函数","rerankScore":"0.254498","rankScore":"0.254498"},{"ci":"1.000000","ctrScore":"0.014691","recallType":"nginx'wangdun'xgb","recallScore":"0.000000","word":"幂是什么","rerankScore":"0.121207","rankScore":"0.121207"},{"ci":"1.000000","ctrScore":"0.007204","recallType":"nginx'xgb","recallScore":"0.000000","word":"幂怎么读","rerankScore":"0.084879","rankScore":"0.084879"},{"ci":"1.000000","ctrScore":"0.004697","recallType":"nginx'wangdun'tail'xgb","recallScore":"0.000000","word":"幂的乘方与积的乘方","rerankScore":"0.068532","rankScore":"0.068532"},{"ci":"1.000000","ctrScore":"0.004360","recallType":"nginx'wangdun'ci'tail'xgb","recallScore":"0.000000","word":"幂函数图像","rerankScore":"0.066031","rankScore":"0.066031"},{"ci":"1.000000","ctrScore":"0.004268","recallType":"nginx'wangdun'xgb","recallScore":"0.000000","word":"幂函数公式","rerankScore":"0.065332","rankScore":"0.065332"},{"ci":"1.000000","ctrScore":"0.003952","recallType":"nginx'ci'tail'xgb","recallScore":"0.000000","word":"幂学在线","rerankScore":"0.062861","rankScore":"0.062861"},{"ci":"1.000000","ctrScore":"0.002698","recallType":"nginx'wangdun'xgb","recallScore":"0.000000","word":"幂次方","rerankScore":"0.051942","rankScore":"0.051942"},{"ci":"1.000000","ctrScore":"0.001830","recallType":"nginx'wangdun'tail'xgb","recallScore":"0.000000","word":"幂律分布","rerankScore":"0.042779","rankScore":"0.042779"}],"ssid":"cebbce4bdd6848cf8e25f8bdcf3ce3a3","src":"hao_360so_suggest"}});

For data this long, here is how I tinkered with it:

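T = T[13:-3]  # drop the 13 leading and 3 trailing characters of the JSONP wrapper
T = eval(T)   # works here only because the payload contains no true/false/null
result = T['data']['result']
for i in result:
    print(i['word'])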

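A bit unorthodox, perhaps. The gist is:

1. First, slice out the part that looks like a dictionary.

2. Use eval() to turn it into a real dictionary.

3. Inspect the dictionary T and pull out the one useful list.

4. That list is made up of several dictionaries, so loop over it with for and print the useful field from each.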
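The fixed slice `T[13:-3]` depends on the exact length of the callback name, and `eval()` executes whatever the server sends back as Python code. Below is a more defensive sketch of the same four steps, assuming the response always has the form `callback({...})`. The helper name `extract_words` and the shortened sample string are illustrative and not part of the original code.

```python
import json

def extract_words(jsonp_text):
    """Return the suggestion words from a JSONP response string."""
    # Step 1: cut out the JSON text between the first "(" and the last ")".
    start = jsonp_text.index("(") + 1
    end = jsonp_text.rindex(")")
    # Step 2: parse it as JSON instead of evaluating it as Python code.
    payload = json.loads(jsonp_text[start:end])
    # Steps 3 and 4: take the result list and collect each "word" field.
    return [item["word"] for item in payload["data"]["result"]]

# Shortened stand-in for the real response (assumption: same overall shape).
sample = ' __jsonp28__({"errno":0,"data":{"query":"幂","result":[{"word":"幂的运算"},{"word":"幂函数"}]}});'
print(extract_words(sample))  # ['幂的运算', '幂函数']
```

Parsing with `json.loads` also copes with `true`, `false` and `null`, which `eval()` would reject as undefined names.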

OK,

but not quite OK:

how do I make it handle any input, rather than only '幂'?

Simple: look at what the request URL looks like.

https://smart.sug.so.com/suggest?crec=0&pid=webpage&word=%E5%B9%82&srcg=&src=https://www.mshxw.com/skin/sinaskin/image/nopic.gif

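Then look at the "Query String Parameters" panel.

Clearly, the URL is the base address (marked red in the original screenshot) plus the query string (marked blue), and the query string is just the entries under "Query String Parameters" joined with '&'. But the Chinese character '幂' has become '%E5%B9%82': non-ASCII characters in a URL are percent-encoded (URL-encoded) from their UTF-8 bytes. This is encoding, not encryption.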
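As a rough illustration of that structure, the standard library's `urllib.parse.urlencode` can assemble the query string from a dictionary, joining the pairs with '&' and percent-encoding the values. The parameter names are copied from the URL above; the `src` parameter is left out here, and whether the endpoint requires it has not been verified.

```python
from urllib.parse import urlencode

base = "https://smart.sug.so.com/suggest"
# Parameter names taken from the request URL; src is omitted (assumption: optional).
params = {"crec": 0, "pid": "webpage", "word": "幂", "srcg": ""}

# urlencode joins key=value pairs with "&" and percent-encodes each value.
url = base + "?" + urlencode(params)
print(url)
# https://smart.sug.so.com/suggest?crec=0&pid=webpage&word=%E5%B9%82&srcg=
```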

Encode it like this:

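>>> from urllib.parse import quote
>>> # Pass the encoding by keyword: the second positional argument of quote() is `safe`
>>> s = quote('幂', encoding='utf-8')
>>> print(s)
%E5%B9%82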
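To see where '%E5%B9%82' comes from, the short sketch below rebuilds it by hand: each byte of the character's UTF-8 encoding is written as '%' followed by two hexadecimal digits. The variable names are illustrative.

```python
from urllib.parse import quote, unquote

word = "幂"
utf8_bytes = word.encode("utf-8")  # three bytes: 0xE5 0xB9 0x82

# Write every byte as "%" plus two uppercase hex digits.
manual = "".join("%{:02X}".format(b) for b in utf8_bytes)

print(manual)                 # %E5%B9%82
print(manual == quote(word))  # True
print(unquote(manual))        # 幂
```

`unquote` reverses the process, which confirms that nothing is hidden or encrypted.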

Combining this with the code above gives the complete program:

from urllib.parse import quote
import urllib.request

a = input()
s = quote(a, encoding='utf-8')  # percent-encode the user's input

_360_url = 'https://smart.sug.so.com/suggest?crec=0&pid=webpage&word={}&srcg=&src=https://www.mshxw.com/skin/sinaskin/image/nopic.gif'.format(s)
headers = {
    'user-agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36 QIHU 360SE/13.1.5330.0"
}
req = urllib.request.Request(url=_360_url, headers=headers)
res = urllib.request.urlopen(req)
T = res.read().decode('utf-8')
T = T[13:-3]  # drop the 13 leading and 3 trailing characters of the JSONP wrapper
T = eval(T)   # turn the dictionary-like text into a real dictionary
result = T['data']['result']
for i in result:
    print(i['word'])
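One possible way to package the script for reuse is sketched below: a function with a timeout and basic error handling, so a failed lookup returns an empty list instead of a traceback. The names `http_get`, `suggest` and `fake_fetch`, the shortened user-agent string, and the omission of the `src` parameter are all assumptions made for illustration. The demo call uses a stub in place of the network so it can be run offline.

```python
import json
import urllib.request
from urllib.parse import urlencode

BASE_URL = "https://smart.sug.so.com/suggest"
HEADERS = {"user-agent": "Mozilla/5.0"}  # shortened UA (assumption: accepted by the server)

def http_get(url, timeout=5):
    """Fetch a URL and return the body decoded as UTF-8."""
    req = urllib.request.Request(url, headers=HEADERS)
    with urllib.request.urlopen(req, timeout=timeout) as res:
        return res.read().decode("utf-8")

def suggest(word, fetch=http_get):
    """Return the suggestion words for `word`, or [] if the lookup fails."""
    query = urlencode({"crec": 0, "pid": "webpage", "word": word, "srcg": ""})
    try:
        text = fetch(BASE_URL + "?" + query)
        # Strip the JSONP wrapper: keep what lies between "(" and the last ")".
        body = text[text.index("(") + 1:text.rindex(")")]
        return [item["word"] for item in json.loads(body)["data"]["result"]]
    except (OSError, ValueError, KeyError) as exc:
        # OSError covers network errors and timeouts; ValueError covers bad JSON.
        print("lookup failed for {!r}: {}".format(word, exc))
        return []

def fake_fetch(url):
    """Offline stand-in for http_get, returning a canned response."""
    return '__jsonp1__({"data":{"result":[{"word":"幂函数"}]}});'

print(suggest("幂", fetch=fake_fetch))  # ['幂函数']
```

Calling `suggest("幂")` without the `fetch` argument would send the real request.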

I don't ask for much. If you enjoyed this, please give it a like.

Please credit the source when reposting: reposted from www.mshxw.com
Article URL: https://www.mshxw.com/it/850186.html