Preface: there was a novel I wanted to read, and after trying every major site I finally found it on an obscure little one! Determined not to let the resource disappear, I decided to download it, and then...
Well, nothing for it, I'd have to do it myself.
The work below breaks down into four parts.
1. Imports
import requests
from lxml import etree
import time
import random
Since we'll be using XPath, we need lxml; if you don't have it yet, install it locally:
pip install lxml
2. Inspecting the page
We can see this is a static page: all of the text is directly visible in the HTML, which makes things easy. Here I used //*[@id="nr1"]/text() to match the chapter body and //*[@id="nr_title"]/text() to match the chapter title. Even if you have never learned XPath, you can find these expressions with the browser's developer tools (F12).
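Before writing the full crawler, it's worth testing the two expressions against a single page. A minimal sketch, where chapter_url is a made-up placeholder; paste in any real chapter-page URL copied out of the browser:

import requests
from lxml import etree

chapter_url = 'http://wap.xyshuk.com/7/7965/xxxx_1.html'  # hypothetical placeholder
headers = {'User-Agent': 'Mozilla/5.0'}

response = requests.get(chapter_url, headers=headers)
response.encoding = 'utf-8'
html = etree.HTML(response.text)

title = html.xpath('//*[@id="nr_title"]/text()')  # chapter title
body = html.xpath('//*[@id="nr1"]/text()')        # body text, one string per text node
print(title[0].strip() if title else 'title not matched')
print(len(body), 'text nodes matched')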
3. Collecting the links
We can now download a single chapter with confidence, so the next problem is batch downloading.
Usually a novel's chapter links follow a regular pattern; unfortunately this one's don't.
So I went straight to the novel's table of contents, scraped the whole directory first, and then extracted the individual chapter links from it. On to the code.
def Directory_url():
    urls = ['http://wap.xyshuk.com/7/7965_%s/' % x for x in range(1, 6)]
    # print(urls)
    return urls

def get_urls():
    urls = Directory_url()
    url_list = []
    result_url_list = []
    for url in urls:
        # print(url)
        response = requests.get(url, headers=headers)
        response.encoding = 'utf-8'
        html = etree.HTML(response.text)
        # list of all chapter URLs on this directory page
        url_list.append(['http://wap.xyshuk.com' + x for x in html.xpath("//div[@class='cover']/ul/li/a/@href")])
    for url1 in url_list:
        for url2 in url1:
            url2 = url2.replace('.html', '')
            url2_list = [url2 + '_%s' % x for x in range(1, 4)]
            print(url2_list, end='\n')
            result_url_list.append(url2_list)
    # print(result_url_list)
    return result_url_list
Some readers will wonder why extracting the URLs is this convoluted. The main reason is that the URLs scraped from the directory only point to the first page of each chapter. Take this example: if chapter one's URL is www.xxx.com/01.html, and the chapter is split across three pages, the page URLs are
www.xxx.com/01_1.html,
www.xxx.com/01_2.html,
www.xxx.com/01_3.html
so the list finally returned looks like [[x_1, x_2, x_3], [y_1, y_2, y_3], ...].
Of course, given my limited ability this was the best I could manage; I'd be glad to get pointers from anyone more skilled! One possible direction is sketched right below.
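For instance, instead of hard-coding three pages per chapter, the pages could be probed one by one. This is only a sketch: it assumes an out-of-range page index returns a non-200 status or a page with no #nr1 body, which I have not verified against this site.

import requests
from lxml import etree

def expand_chapter(base_url, headers, max_pages=20):
    # base_url is a chapter URL with '.html' already stripped, e.g. .../01
    pages = []
    for i in range(1, max_pages + 1):
        page_url = '%s_%s.html' % (base_url, i)
        response = requests.get(page_url, headers=headers)
        response.encoding = 'utf-8'
        # stop as soon as the site stops serving a real page (assumption)
        if response.status_code != 200:
            break
        html = etree.HTML(response.text)
        if not html.xpath('//*[@id="nr1"]/text()'):
            break
        pages.append(page_url)
    return pages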
4. Downloading and saving
def get_text(url):
    response = requests.get(url, headers=headers)
    response.encoding = 'utf-8'
    html = etree.HTML(response.text)
    name = html.xpath('//*[@id="nr_title"]/text()')[0]
    text = html.xpath('//*[@id="nr1"]/text()')
    name = name.replace('\n', '').replace('\t', '').strip()
    text = '\n'.join(text)
    # print(text)
    # print(name)
    # save one file per chapter
    # with open(path + f'{name}.txt', 'w', encoding='utf-8') as f:
    #     f.write(text)
    # append everything into a single file
    with open(path + '琼明神女录(总).txt', 'a', encoding='utf-8') as fp:
        fp.write(name)
        fp.write('\n')
        fp.write(text)
        fp.write('\n')
    if url.endswith('_3.html'):  # each chapter has three pages
        print(f'{name} downloaded')
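If you prefer the commented-out per-chapter option, note that scraped chapter titles may contain characters that are illegal in Windows filenames. A small hypothetical helper (the character class below covers Windows' reserved characters):

import re

# Hypothetical helper for the per-chapter option: replace characters that
# are not allowed in Windows filenames before using a title as a name.
def safe_filename(name):
    return re.sub(r'[\\/:*?"<>|]', '_', name).strip()

# usage: open(path + f'{safe_filename(name)}.txt', 'w', encoding='utf-8')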
5. The result
(screenshot of the downloaded file omitted)
6. Full code
import requests
from lxml import etree
import time
import random
#琼明神女录:http://wap.xyshuk.com/7/7965/
path = '爬取的文件/琼明神女录/'  # output directory -- adjust to your own path
headers = {
    # "Referer": "http://wap.xyshuk.com/7/7965/",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36"
}

def get_urls():
    urls = Directory_url()
    url_list = []
    result_url_list = []
    for url in urls:
        # print(url)
        response = requests.get(url, headers=headers)
        response.encoding = 'utf-8'
        html = etree.HTML(response.text)
        # list of all chapter URLs on this directory page
        url_list.append(['http://wap.xyshuk.com' + x for x in html.xpath("//div[@class='cover']/ul/li/a/@href")])
    for url1 in url_list:
        for url2 in url1:
            url2 = url2.replace('.html', '')
            url2_list = [url2 + '_%s' % x for x in range(1, 4)]
            print(url2_list, end='\n')
            result_url_list.append(url2_list)
    # print(result_url_list)
    return result_url_list

def Directory_url():
    urls = ['http://wap.xyshuk.com/7/7965_%s/' % x for x in range(1, 6)]
    # print(urls)
    return urls

def get_text(url):
    response = requests.get(url, headers=headers)
    response.encoding = 'utf-8'
    html = etree.HTML(response.text)
    name = html.xpath('//*[@id="nr_title"]/text()')[0]
    text = html.xpath('//*[@id="nr1"]/text()')
    name = name.replace('\n', '').replace('\t', '').strip()
    text = '\n'.join(text)
    # print(text)
    # print(name)
    # save one file per chapter
    # with open(path + f'{name}.txt', 'w', encoding='utf-8') as f:
    #     f.write(text)
    # append everything into a single file
    with open(path + '琼明神女录(总).txt', 'a', encoding='utf-8') as fp:
        fp.write(name)
        fp.write('\n')
        fp.write(text)
        fp.write('\n')
    if url.endswith('_3.html'):  # each chapter has three pages
        print(f'{name} downloaded')

def main():
    urls = get_urls()
    for url in urls:
        for url_detail in url:
            url_detail = url_detail + '.html'
            print(url_detail)
            get_text(url_detail)
            time.sleep(random.randint(1, 3))  # don't hammer the server

if __name__ == '__main__':
    main()
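One last hardening step beyond the code above: the crawler fires off hundreds of requests, any of which can fail transiently. A sketch using a shared requests.Session with a retry policy (the Retry parameters here are illustrative, not tuned for this site):

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Reusing one Session keeps the TCP connection alive across requests, and
# the Retry policy re-attempts transient server errors with backoff.
session = requests.Session()
retry = Retry(total=3, backoff_factor=1, status_forcelist=[500, 502, 503, 504])
session.mount('http://', HTTPAdapter(max_retries=retry))
session.mount('https://', HTTPAdapter(max_retries=retry))

# Then replace requests.get(...) in get_urls() and get_text() with session.get(...).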



