Table of Contents
Web scraping:
1. Python 3: requests.get page content appears empty, printing <Response [200]>
2. Error: object of type 'Response' has no len()
Text processing:
3. Removing blank lines from a text file
Web scraping:
1. Python 3: requests.get page content appears empty, printing <Response [200]>
import requests
html = requests.get('https://blog.csdn.net/nokiaguy/category_11190376')
print(html)
This returns:
<Response [200]>
Cause: the request is missing a request header.
Corrected version:
import requests
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36 SLBrowser/7.0.0.12151 SLBChan/103'
}
html = requests.get('https://blog.csdn.net/nokiaguy/category_11190376.html', headers=header)
print(html.text)
Note: some URLs require a request header and others do not. The exact reason is unclear, though many sites reject requests without a browser-like User-Agent as an anti-crawler measure.
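As a side note, `<Response [200]>` appears because `print(html)` prints the Response object's repr, not the page content; the page text lives in the `.text` attribute. A minimal, network-free sketch with a hypothetical stand-in class illustrates the distinction:

```python
# FakeResponse is a hypothetical stand-in for requests.Response,
# defined here only to show repr vs. attribute access without a network call.
class FakeResponse:
    def __init__(self, status_code, text):
        self.status_code = status_code
        self.text = text

    def __repr__(self):
        # requests.Response prints in this same "<Response [code]>" style
        return f"<Response [{self.status_code}]>"

html = FakeResponse(200, "<html>page body</html>")
print(html)       # prints the repr: <Response [200]>
print(html.text)  # prints the actual content: <html>page body</html>
```

So even a successful request looks "empty" if you print the object itself instead of `html.text`.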
2. Error: object of type 'Response' has no len()
Original code:
import requests
from bs4 import BeautifulSoup
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36 SLBrowser/7.0.0.12151 SLBChan/103'
}
html = requests.get('https://blog.csdn.net/nokiaguy/category_11190376.html', headers=header)
soup = BeautifulSoup(html, 'lxml')  # raises TypeError
Error location: the last line
soup = BeautifulSoup(html, 'lxml')
Cause: the first argument to BeautifulSoup() (here, html) must be the HTML text of the page, not the Response object.
Fix: soup = BeautifulSoup(html.text, 'lxml')
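The error message comes from len() itself: BeautifulSoup calls len() on its first argument, and len(x) requires x to define __len__, which requests.Response does not. A minimal reproduction with a hypothetical class (no bs4 or network needed):

```python
# NoLen is a hypothetical object without __len__, standing in for
# requests.Response; its .text attribute is a plain string.
class NoLen:
    text = "<html><p>hello</p></html>"

obj = NoLen()

try:
    len(obj)  # no __len__ defined
except TypeError as e:
    print(e)  # object of type 'NoLen' has no len()

# The string attribute supports len() just fine,
# which is why BeautifulSoup accepts html.text but not html.
print(len(obj.text))
```

Passing `html.text` hands BeautifulSoup a str, which has a length, so the parse succeeds.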
Text processing:
3. Removing blank lines from a text file
with open("mulu.txt", 'r') as f, open("newmulu.txt", 'w') as fn:
    lines = f.readlines()
    for i in lines:
        # split() returns [] for blank lines, which is falsy
        if i.split():
            fn.write(i)
The split() method returns an empty list for a blank (or whitespace-only) line.
Since an empty list is falsy, the if check writes only non-blank lines to the new file, and the blank lines are removed.
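The same filter can be exercised on in-memory text with io.StringIO (the file names mulu.txt / newmulu.txt above are just the author's examples), which makes the whitespace handling easy to verify:

```python
import io

# In-memory stand-ins for the input and output files.
source = io.StringIO("line 1\n\n   \nline 2\n\t\nline 3\n")
result = io.StringIO()

for line in source:
    # split() with no argument splits on any whitespace, so "", "   ",
    # and "\t" all yield [] -- falsy -- and those lines are skipped.
    if line.split():
        result.write(line)

print(result.getvalue())  # line 1, line 2, line 3 -- blanks gone
```

Note that split() with no argument treats lines containing only spaces or tabs as blank too, which is usually what you want here.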



