Web scraping with urlopen in Python

Personally, I would write:

    # Python 2.7
    import urllib
    url = 'http://www.boursorama.com/includes/cours/last_transactions.phtml?symbole=1xEURUS'
    sock = urllib.urlopen(url)
    content = sock.read()
    sock.close()
    print content

And if you speak French: welcome to stackoverflow.com!
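
For readers on Python 3, where urllib.urlopen has moved to urllib.request.urlopen, the first snippet could be sketched roughly as follows. The function name fetch_with_urllib and the 30-second timeout default are illustrative choices, not part of the original answer.

```python
# Python 3 sketch of the urllib snippet above.
# fetch_with_urllib and its timeout default are illustrative, not from the answer.
from urllib.request import urlopen

URL = ('http://www.boursorama.com/includes/cours/'
       'last_transactions.phtml?symbole=1xEURUS')


def fetch_with_urllib(url, timeout=30):
    """Return the raw body of url as bytes."""
    # The with-block closes the connection even if read() raises.
    with urlopen(url, timeout=timeout) as sock:
        return sock.read()

# Usage (performs a network request):
#     content = fetch_with_urllib(URL)
#     print(content[:200])
```

Note that in Python 3 the returned content is a bytes object, not a str, which matters for any regular expression applied to it later.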

Update 1

Actually, I now prefer the following code, because it is faster.

    # Python 2.7
    import httplib
    conn = httplib.HTTPConnection(host='www.boursorama.com', timeout=30)
    req = '/includes/cours/last_transactions.phtml?symbole=1xEURUS'
    try:
        conn.request('GET', req)
    except:
        print 'echec de connexion'  # French for "connection failed"
    content = conn.getresponse().read()
    print content

Changing httplib to http.client in this code (and turning the print statements into print() calls) is enough to adapt it to Python 3.
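
A Python 3 version of that snippet might look like the sketch below. The function name fetch_with_http_client, its port parameter, and the choice to return None on failure are assumptions made for illustration; the original code simply prints a message and carries on.

```python
# Python 3 sketch of the httplib snippet, using http.client.
# fetch_with_http_client, the port parameter and the None return value
# on failure are illustrative choices, not part of the original answer.
import http.client

HOST = 'www.boursorama.com'
PATH = '/includes/cours/last_transactions.phtml?symbole=1xEURUS'


def fetch_with_http_client(host, path, port=80, timeout=30):
    """Return the response body as bytes, or None if the request fails."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request('GET', path)
        return conn.getresponse().read()
    except (OSError, http.client.HTTPException):
        # Covers refused connections, DNS failures, timeouts and
        # malformed responses.
        print('echec de connexion')
        return None
    finally:
        conn.close()

# Usage (performs a network request):
#     content = fetch_with_http_client(HOST, PATH)
```

Catching specific exception types instead of using a bare except keeps unrelated errors, such as a KeyboardInterrupt, from being swallowed, and returning early avoids calling getresponse() on a connection that never sent a request.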

I confirm that both snippets return the page source containing the data you are interested in:

        <td  width="33%" align="center">11:57:44</td>
        <td  width="33%" align="center">1.4486</td>
        <td  width="33%" align="center">0</td>
    </tr>
    <tr>
        <td  width="33%" align="center">11:57:43</td>
        <td  width="33%" align="center">1.4486</td>
        <td  width="33%" align="center">0</td>
    </tr>

Update 2

Adding the following snippet to the code above extracts the data in question:

    for i, line in enumerate(content.splitlines(True)):
        print str(i) + ' ' + repr(line)
    print '\n\n'

    import re
    regx = re.compile('\t\t\t\t\t\t<td  width="33%" align="center">(\d\d:\d\d:\d\d)</td>\r\n'
                      '\t\t\t\t\t\t<td  width="33%" align="center">([\d.]+)</td>\r\n'
                      '\t\t\t\t\t\t<td  width="33%" align="center">(\d+)</td>\r\n')
    print regx.findall(content)

Result (only the end):

    ............................................................
    98 'window.config.graphics = {};\n'
    99 'window.config.accordions = {};\n'
    100 '\n'
    101 "window.addEvent('domready', function(){\n"
    102 '});\n'
    103 '</script>\n'
    104 '<script type="text/javascript">\n'
    105 '\t\t\t\tsas_tmstp = Math.round(Math.random()*10000000000);\n'
    106 '\t\t\t\tsas_pageid = "177/(includes/cours/last_transactions)"; // Page : boursorama.com/smartad_test\n'
    107 '\t\t\t\tvar sas_formatids = "8968";\n'
    108 '\t\t\t\tsas_target = "symb=1xEURUS#"; // TargetingArray\n'
    109 '\t\t\t\tdocument.write("<scr"+"ipt src=\\"http://ads.boursorama.com/call2/pubjall/" + sas_pageid + "/" + sas_formatids + "/" + sas_tmstp + "/" + escape(sas_target) + "?\\"></scr"+"ipt>");\t\t\t\t\n'
    110 '\t\t\t</script><div id="_smart1"><script language="javascript">sas_script(1,8968);</script></div><script type="text/javascript">\r\n'
    111 "\twindow.addEvent('domready', function(){\r\n"
    112 'sas_move(1,8968);\t});\r\n'
    113 '</script>\n'
    114 '<script type="text/javascript">\n'
    115 'var _gaq = _gaq || [];\n'
    116 "_gaq.push(['_setAccount', 'UA-1623710-1']);\n"
    117 "_gaq.push(['_setDomainName', 'www.boursorama.com']);\n"
    118 "_gaq.push(['_setCustomVar', 1, 'segment', 'WEB-VISITOR']);\n"
    119 "_gaq.push(['_setCustomVar', 4, 'version', '18']);\n"
    120 "_gaq.push(['_trackPageLoadTime']);\n"
    121 "_gaq.push(['_trackPageview']);\n"
    122 '(function() {\n'
    123 "var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;\n"
    124 "ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';\n"
    125 "var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);\n"
    126 '})();\n'
    127 '</script>\n'
    128 '</body>\n'
    129 '</html>'

    [('12:25:36', '1.4478', '0'), ('12:25:33', '1.4478', '0'), ('12:25:31', '1.4478', '0'), ('12:25:30', '1.4478', '0'), ('12:25:30', '1.4478', '0'), ('12:25:29', '1.4478', '0')]
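
The tuples printed at the end of that output are all strings. If typed values are more convenient, a conversion along the following lines could be used. This is only a sketch: the Trade name and the "volume" label for the third column are assumptions, since the answer does not say what that column holds. It also assumes str tuples, as produced by the Python 2 code; bytes results from Python 3 would need decoding first.

```python
# Sketch: turn the (time, price, third column) string tuples into typed records.
# The Trade name and the "volume" label for the third column are assumptions.
from collections import namedtuple
from datetime import datetime

Trade = namedtuple('Trade', 'time price volume')

# First three tuples taken from the output shown in the answer.
rows = [('12:25:36', '1.4478', '0'),
        ('12:25:33', '1.4478', '0'),
        ('12:25:31', '1.4478', '0')]


def to_trades(rows):
    """Convert findall() string tuples into Trade records."""
    return [Trade(datetime.strptime(t, '%H:%M:%S').time(), float(p), int(v))
            for t, p, v in rows]


trades = to_trades(rows)
print(trades[0])
```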

I hope you are not planning to "play" at Forex trading: it is one of the best ways to lose money quickly.

Update 3

Sorry! I forgot that you are using Python 3. So I think you have to define the regular expression like this:

    regx = re.compile(b'\t\t\t\t\t......)

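That is, put b before the string. In Python 3 the response's read() returns bytes, and a str pattern cannot be applied to bytes, so otherwise you will get an error like the one in this question.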
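
The difference can be reproduced without a network connection. The sketch below uses a small hand-written bytes sample that imitates the page source (six tabs, two spaces after "<td", CRLF line endings); it is not a live download. The r prefix is an addition to the pattern from the answer, used here to avoid warnings about the \d escape in recent Python 3 versions.

```python
# Sketch: why the pattern needs the b prefix in Python 3.
# The sample bytes imitate the page source; they are not a live download.
import re

CELL = b'\t\t\t\t\t\t<td  width="33%" align="center">'
content = (CELL + b'11:57:44</td>\r\n' +
           CELL + b'1.4486</td>\r\n' +
           CELL + b'0</td>\r\n')

# Same pattern as in the answer, written as raw bytes literals.
BYTES_PATTERN = (
    rb'\t\t\t\t\t\t<td  width="33%" align="center">(\d\d:\d\d:\d\d)</td>\r\n'
    rb'\t\t\t\t\t\t<td  width="33%" align="center">([\d.]+)</td>\r\n'
    rb'\t\t\t\t\t\t<td  width="33%" align="center">(\d+)</td>\r\n')

# A str pattern cannot be applied to bytes: findall() raises TypeError.
try:
    re.compile(BYTES_PATTERN.decode('ascii')).findall(content)
except TypeError as exc:
    print('str pattern failed:', exc)

# A bytes pattern works and returns bytes groups.
regx = re.compile(BYTES_PATTERN)
rows = regx.findall(content)
print(rows)

# Decode the groups if text is more convenient downstream.
decoded = [tuple(field.decode('ascii') for field in row) for row in rows]
print(decoded)
```

An alternative is to decode the downloaded bytes once and keep a str pattern; which is preferable depends on whether the page's encoding is known.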