Example script: using Python to scrape e-book resources from jb51 (脚本之家) and download them to a local folder automatically

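jb51 has a fairly complete collection of resources, so I decided to use Python to collect the book information and download the files automatically.

Python has a rich and powerful set of libraries. With modules such as urllib and re, it is easy to build a web scraper.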
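As a rough illustration of that claim, the sketch below pulls link targets out of an HTML fragment with `re.findall`. It is written for Python 3, and both the sample markup and the `extract_links` helper are made up for this example; they are not taken from jb51.

```python
import re

# Hypothetical sample markup, not copied from any real page.
SAMPLE_HTML = '''
<ul class="list">
  <li><a href="/books/1.html" class="tit">Book One</a></li>
  <li><a href="/books/2.html" class="tit">Book Two</a></li>
  <li><a href="/about.html">About</a></li>
</ul>
'''

def extract_links(html, css_class="tit"):
    """Return the href of every <a> tag whose class attribute follows the href."""
    pattern = r'<a href="([^"]+)" class="%s"' % re.escape(css_class)
    return re.findall(pattern, html)

links = extract_links(SAMPLE_HTML)
print(links)
```

In a live run the HTML string would come from the network (for example `urllib.request.urlopen(url).read()` in Python 3) instead of a constant. A regular expression like this one is tied to the exact attribute order of the page, which is why the patterns in a scraper need to be checked against the real markup.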

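Below is an example script I wrote. It collects every e-book resource in one section of a technical website and saves the files locally.

[Screenshot: the script running]

While the script runs, it prints information to the shell window and also writes a log to a txt file. The log records each page URL that was scraped, the book title, the file size, the download URL on the site's own server, and the Baidu Netdisk (百度网盘) download URL.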
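The idea of reporting to the screen and to a log file at the same time can be sketched as a single helper. This is a minimal Python 3 sketch, not the author's code: the `log` function, the example URL, and the sample values are hypothetical, and `io.StringIO` stands in for a real file opened with `open(path, "w")`.

```python
import io
import sys

def log(message, logfile, stream=sys.stdout):
    """Write one line to the console stream and to the open log file."""
    line = message + "\n"
    stream.write(line)
    logfile.write(line)

# io.StringIO is a stand-in for a real log file here.
logfile = io.StringIO()
log("Page URL: http://example.com/books/1.html", logfile)
log("Book title: Example Book", logfile)
log("File size: 27.2MB", logfile)

# The log now holds the same three lines that were printed.
logged_lines = logfile.getvalue().splitlines()
```

Routing every message through one function keeps the console output and the log file from drifting apart, which is easy to get wrong when each `print` is paired with a separate `write` call by hand.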

The example below scrapes and downloads the e-book resources in the Python section of 考高分网:

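# -*- coding:utf-8 -*-
# Python 2 script (it uses urllib2 and print statements)
import re
import urllib2
import urllib
import sys
import os

# Python 2 only: make utf-8 the default codec for implicit str/unicode conversion
reload(sys)
sys.setdefaultencoding('utf-8')

def getHtml(url):
    request = urllib2.Request(url)
    page = urllib2.urlopen(request)
    htmlcontent = page.read()
    # Avoid garbled Chinese text: the pages are GBK, so convert them to UTF-8
    htmlcontent = htmlcontent.decode('gbk', 'ignore').encode("utf8", 'ignore')
    return htmlcontent

def report(count, blockSize, totalSize):
    # Progress hook for urllib.urlretrieve; "\r" redraws the same console line.
    # The last block can overshoot the total, so cap the value at 100.
    percent = min(100, int(count*blockSize*100/totalSize))
    sys.stdout.write("\r%d%%" % percent + ' complete')
    sys.stdout.flush()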
def getBookInfo(url):
    htmlcontent = getHtml(url)
    #print "htmlcontent=",htmlcontent  # you should see the output html

    # NOTE: the HTML tags inside the regex patterns of this listing were
    # stripped when it was published. The named groups are restored from the
    # .group() calls, and `<a href="` / `</a>` are assumed wherever a link is
    # matched. Check every pattern against the real page before running.
    regex_title = '(?P<title>.+?)'  # enclosing tags lost
    title = re.search(regex_title, htmlcontent)
    if not title:
        # No title means no file name, so record the page and stop here
        file_error.write(url + '\n')
        return
    title = title.group("title")
    print "Book title:", title
    file_object.write('Book title: ' + title + '\n')

    # <li>书籍大小:27.2MB</li>
    filesize = re.search('(?P<filesize>.+?)', htmlcontent)  # enclosing tags lost
    if filesize:
        filesize = filesize.group("filesize")
        print "File size:", filesize
        file_object.write('File size: ' + filesize + '\n')

    # Cover image link; whatever preceded the link in this pattern is lost
    bookimg = re.search('<a href="(?P<bookimg>.+?)" rel="external nofollow" target="_blank"', htmlcontent)
    if bookimg:
        bookimg = bookimg.group("bookimg")
        print "Cover image:", bookimg
        file_object.write('Cover image: ' + bookimg + '\n')

    # <li>酷云中国电信下载</li>  (link text on the page; kept in Chinese so it matches)
    downurl1 = re.search('<li><a href="(?P<downurl1>.+?)" rel="external nofollow" target="_blank">酷云中国电信下载</a></li>', htmlcontent)
    if downurl1:
        downurl1 = downurl1.group("downurl1")
        print "Download URL 1:", downurl1
        file_object.write('Download URL 1: ' + downurl1 + '\n')
        sys.stdout.write('\rFetching ' + title + '...\n')
        # Strip characters that are awkward in a file name
        title = title.replace(' ', '')
        title = title.replace('/', '')
        saveFile = '/Users/superl/Desktop/pythonbook/' + title + '.rar'
        if os.path.exists(saveFile):
            print "This file has already been downloaded!"
        else:
            urllib.urlretrieve(downurl1, saveFile, reporthook=report)
            sys.stdout.write("\rDownload complete, saved as %s" % (saveFile) + '\n\n')
            sys.stdout.flush()
            file_object.write('File downloaded successfully!\n')
    else:
        print "Download URL 1 not found"
        file_error.write(url + '\n')
        file_error.write(title + ": download URL 1 not found, the file was not downloaded automatically!\n")
        file_error.write('\n')

    # <li>百度网盘下载2</li>  (link text on the page; kept in Chinese so it matches)
    downurl2 = re.search('</a></li><li><a href="(?P<downurl2>.+?)" rel="external nofollow" target="_blank">百度网盘下载2</a></li>', htmlcontent)
    if downurl2:
        downurl2 = downurl2.group("downurl2")
        print "Download URL 2:", downurl2
        file_object.write('Download URL 2: ' + downurl2 + '\n')
    else:
        #file_error.write(url+'\n')
        print "Download URL 2 not found"
        file_error.write(title + ": download URL 2 not found\n")
        file_error.write('\n')
    file_object.write('\n')
    print "\n"

def getBooksUrl(url):
    htmlcontent = getHtml(url)
    # <ul class="cur-cat-list">
    # The start of this pattern was lost as well; `<a href="(` is assumed
    urls = re.findall('<a href="(.+?)" rel="external nofollow" class="tit"', htmlcontent)
    for url in urls:
        # urllib2 needs an explicit scheme; the published listing had only "//www.jb51.net"
        url = "http://www.jb51.net" + url
        print url + "\n"
        file_object.write(url + '\n')
        getBookInfo(url)
        #print "url->", url

if __name__ == "__main__":
    file_object = open('/Users/superl/Desktop/python.txt', 'w+')
    file_error = open('/Users/superl/Desktop/pythonerror.txt', 'w+')
    pagenum = 3
    for pagevalue in range(1, pagenum + 1):
        listurl = "http://www.jb51.net/books/list476_%d.html" % pagevalue
        print listurl
        file_object.write(listurl + '\n')
        getBooksUrl(listurl)
    file_object.close()
    file_error.close()

Note: I have replaced some of the URLs in the code above.

Summary

That is the example script for scraping e-book resources from jb51 with Python and downloading them to a local folder automatically. I hope it is useful. If you have any questions, leave me a message and I will reply as soon as I can. Thank you as well for supporting the 考高分网 site!

Reprinted from www.mshxw.com. Article URL: https://www.mshxw.com/it/28461.html
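The listing above targets Python 2 (`urllib2`, `print` statements). For readers on Python 3, the sketch below restates three of its small steps as pure functions: decoding a GBK page, computing the download percentage for a progress hook, and cleaning a title for use as a file name. The function names are invented for this example, and the handling of an unknown total size is an addition that the original script does not have.

```python
def decode_page(raw_bytes):
    """Decode a GBK-encoded page to str, dropping bytes that cannot be decoded."""
    return raw_bytes.decode("gbk", "ignore")

def percent_done(count, block_size, total_size):
    """Percentage downloaded after `count` blocks, clamped to the range 0..100."""
    if total_size <= 0:
        # The server did not report a size, so no percentage can be computed.
        return 0
    return min(100, count * block_size * 100 // total_size)

def safe_filename(title):
    """Remove spaces and slashes, as the original script does before saving."""
    return title.replace(" ", "").replace("/", "")

page = decode_page("书籍大小".encode("gbk"))
print(page)
print(percent_done(1, 8192, 20000))
print(safe_filename("Python / Basics"))
```

In a Python 3 port, `percent_done` could be called from the hook passed to `urllib.request.urlretrieve(url, filename, reporthook=...)`, which receives the same three arguments (block count, block size, total size) as the `report` function in the listing.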