Two Major Python Libraries for Web Scraping

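Contents

Two Major Python Libraries for Web Scraping
The urllib library
- Using urllib
- urllib.request
- Example
- Simulating request headers
The requests library
- Example: GET requests
- Example: fetching a page
- Example: the response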

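When writing a crawler in Python, you need to issue network requests programmatically. The two libraries most commonly used for this are the third-party requests library and Python's built-in urllib. requests is generally recommended: it offers a higher-level API and is built on urllib3 (a separate third-party package, not the standard-library urllib).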

The urllib library

The urllib package contains the following modules:

urllib.request - opens and reads URLs.
urllib.error - contains the exceptions raised by urllib.request.
urllib.parse - parses URLs.
urllib.robotparser - parses robots.txt files.
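As a small illustration of the urllib.parse module listed above, the sketch below splits a URL into its components and reads its query string. The URL is a made-up example and is not used elsewhere in this article.

```python
from urllib.parse import parse_qs, urlparse

# Hypothetical URL, used only for illustration.
url = "http://www.example.com/search?word=hello&page=2#top"

parts = urlparse(url)
print(parts.scheme)    # http
print(parts.netloc)    # www.example.com
print(parts.path)      # /search
print(parts.query)     # word=hello&page=2
print(parts.fragment)  # top

# parse_qs turns the query string into a dict of lists,
# because a key may appear more than once.
query = parse_qs(parts.query)
print(query)           # {'word': ['hello'], 'page': ['2']}
```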

Using urllib

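With urllib, you typically create a Request object (urllib.request.Request) first and then pass it to request.urlopen, which performs the HTTP request.

urlopen returns an http.client.HTTPResponse object. Its read() method returns the response body (for a web page, the HTML) as bytes. Calling .read().decode() converts those bytes to a str, after which Chinese characters display correctly.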
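The bytes-to-str step described above can be tried without a network connection. The following minimal sketch uses a hand-written byte string as a stand-in for what response.read() would return; the helper name decode_body is an assumption made for this illustration.

```python
def decode_body(raw: bytes, charset: str = "utf-8") -> str:
    """Decode a response body; bytes that are invalid in `charset` are replaced."""
    return raw.decode(charset, errors="replace")

# Stand-in for response.read(): the UTF-8 bytes of a short HTML snippet.
raw = "<title>百度一下</title>".encode("utf-8")

print(type(raw))          # <class 'bytes'>
text = decode_body(raw)
print(type(text))         # <class 'str'>
print(text)               # <title>百度一下</title>
```

For a real response, the charset declared by the server can be read with response.headers.get_content_charset(), which returns None when the server declares none.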

urllib.request

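urllib.request defines functions and classes for opening URLs, covering authentication, redirects, browser cookies, and more.

urllib.request can simulate the way a browser issues a request.

To open a URL, use the urlopen function of urllib.request. Its signature is: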

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

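url: the URL to open (a string or a Request object).
data: additional data to send to the server; defaults to None. When data is supplied, an HTTP request is sent as a POST.
timeout: the timeout, in seconds, for blocking operations such as the connection attempt.
cafile and capath: cafile is the path to a file of CA certificates, and capath is the path to a directory of CA certificate files; both apply to HTTPS. They have been deprecated since Python 3.6 and were removed in Python 3.13; use context instead.
cadefault: deprecated and ignored.
context: an ssl.SSLContext instance that specifies the SSL settings.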
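A minimal sketch of how timeout and context might be combined. The helper name fetch and its default timeout are assumptions made for illustration; ssl.create_default_context() returns a context that verifies certificates and host names.

```python
import ssl
import urllib.request

# A context with the system's default CA certificates and secure defaults.
context = ssl.create_default_context()

def fetch(url: str, timeout: float = 10.0) -> bytes:
    """Open `url` with an explicit timeout and SSL context, returning the raw body."""
    with urllib.request.urlopen(url, timeout=timeout, context=context) as response:
        return response.read()

# Example call (requires network access):
# html = fetch("https://www.runoob.com/").decode("utf-8")
```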

Example:

import urllib.error
import urllib.parse
import urllib.request

# GET request
response = urllib.request.urlopen('http://www.baidu.com')
print(response.read().decode('utf-8'))

# POST request: passing data makes urlopen send a POST
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf-8')
response = urllib.request.urlopen('http://www.baidu.com', data=data)
print(type(response))                # http.client.HTTPResponse
print(response.status)               # HTTP status code
print(response.getheaders())         # all response headers
print(response.getheader('Server'))  # a single header

# A missing page raises HTTPError
try:
    response = urllib.request.urlopen("http://www.baidu.com/no.html")
except urllib.error.HTTPError as e:
    if e.code == 404:
        print(404)   # 404
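The try/except pattern above can be extended to cover connection failures as well as HTTP error statuses. The following is one possible sketch; the helper names describe_error and try_fetch are assumptions made for illustration.

```python
import urllib.error
import urllib.request

def describe_error(exc: urllib.error.URLError) -> str:
    """Return a short description of an exception raised by urlopen."""
    # HTTPError is a subclass of URLError, so it must be checked first.
    if isinstance(exc, urllib.error.HTTPError):
        return f"HTTP {exc.code}: {exc.reason}"
    return f"connection problem: {exc.reason}"

def try_fetch(url: str, timeout: float = 10.0):
    """Return the body as bytes, or None after printing what went wrong."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.URLError as exc:
        print(describe_error(exc))
        return None

# Example call (requires network access):
# try_fetch("http://www.baidu.com/no.html")
```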

Simulating request headers

When fetching web pages, you usually need to simulate the request headers that a browser would send. For this, use the urllib.request.Request class:

class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

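url: the URL.
data: additional data to send to the server; defaults to None.
headers: the HTTP request headers, as a dictionary.
origin_req_host: the host of the original request, as an IP address or domain name.
unverifiable: rarely used; indicates whether the request is unverifiable, that is, one the user had no option to approve. Defaults to False.
method: the request method, such as GET, POST, DELETE, or PUT.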
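To see how these parameters interact, a Request object can be built and inspected without sending anything. In the sketch below the endpoint, form data, and shortened User-Agent value are hypothetical.

```python
import urllib.parse
import urllib.request

# Hypothetical endpoint and form data, for illustration only.
url = "http://www.example.com/api"
form = urllib.parse.urlencode({"word": "hello"}).encode("utf-8")
headers = {"User-Agent": "Mozilla/5.0"}

# With data and no explicit method, the request becomes a POST.
post_req = urllib.request.Request(url, data=form, headers=headers)
print(post_req.get_method())              # POST

# An explicit method overrides that default.
put_req = urllib.request.Request(url, data=form, headers=headers, method="PUT")
print(put_req.get_method())               # PUT

# Without data, the default method is GET.
get_req = urllib.request.Request(url, headers=headers)
print(get_req.get_method())               # GET

# Request stores header names capitalized, here as "User-agent".
print(get_req.get_header("User-agent"))   # Mozilla/5.0
```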

import urllib.parse
from urllib import request

# Request headers
headers = {
    "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36'
}
# wd = {"wd": "hello"}
# url = "http://www.baidu.com/s?"
url = 'https://www.runoob.com/?s='  # runoob.com search page
keyword = 'Python 教程'
key_code = urllib.parse.quote(keyword)  # percent-encode the search keyword
url_all = url + key_code

req = request.Request(url_all, headers=headers)
response = request.urlopen(req)
print(type(response))
print(response)
res = response.read().decode()  # bytes -> str (UTF-8 by default)
print(type(res))
print(res)
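The encoding step in the example above can be examined on its own. This short sketch compares quote, quote_plus, and urlencode from urllib.parse on the same keyword; the query key "s" is taken from the search URL above.

```python
from urllib.parse import quote, quote_plus, unquote, urlencode

keyword = "Python 教程"

# quote encodes a space as %20 and non-ASCII characters as UTF-8 percent escapes.
print(quote(keyword))             # Python%20%E6%95%99%E7%A8%8B

# quote_plus encodes a space as "+", the convention for form data.
print(quote_plus(keyword))        # Python+%E6%95%99%E7%A8%8B

# urlencode builds a whole query string and uses quote_plus by default.
print(urlencode({"s": keyword}))  # s=Python+%E6%95%99%E7%A8%8B

# unquote reverses the percent-encoding.
print(unquote("Python%20%E6%95%99%E7%A8%8B"))  # Python 教程
```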

The requests library

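With requests, you call requests.get with a URL and optional parameters. It returns a Response object; printing that object shows the response status code, for example <Response [200]>.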

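Advantages of requests:
For crawling in Python, requests is the more convenient choice. It can build and send a GET or POST request in a single call, whereas with urllib.request you typically construct a Request object first and then send it with urlopen.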

Example: GET requests

import requests
# 1. Basic GET request
response = requests.get('http://www.baidu.com')
print('response\n', response)
# 2. GET request with the parameters written into the URL
response2 = requests.get('http://www.baidu.com/get?name=germy&age=22')
print('response2\n', response2)
# 3. Passing the parameters through params builds the same kind of query string as in 2
data = {
    'name': 'germy',
    'age': 22
}
response3 = requests.get('http://www.baidu.com', params=data)
print('response3\n', response3)
# 4. Parsing JSON (if the response body is JSON, json() returns the parsed result)
response4 = requests.get('http://httpbin.org/get')
print('response4\n', response4.json())

# 5. Fetching binary data (images, video, ...)
response5 = requests.get('http://github.com/favicon.ico')
with open('icon.ico', 'wb') as f:
    f.write(response5.content)

# 6. Adding headers (pass the headers parameter)
headers = {
    'User-Agent': '...'
}
response6 = requests.get('http://zhihu.com', headers=headers)
print('response6\n', response6)

Example: fetching a page

import requests

url = 'http://httpbin.org/get'
params = {
    'name': 'germey',
    'age': 25
}
r = requests.get(url, params=params)
print(type(r.json()))  # json() parses the JSON body into a dict
print(r.json())
print(r.json().get('args').get('age'))  # httpbin echoes the query parameters under 'args'
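To see what the params argument does to the URL without contacting httpbin.org, the request can be prepared but not sent. This is a minimal sketch using requests.Request and its prepare() method, with the same parameters as the example above.

```python
import requests

params = {"name": "germey", "age": 25}

# Build the request without sending it.
prepared = requests.Request("GET", "http://httpbin.org/get", params=params).prepare()

print(prepared.method)  # GET
print(prepared.url)     # http://httpbin.org/get?name=germey&age=25
```

The parameters end up in the query string in the order given in the dict, and the integer 25 is converted to text.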

Example: the response

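The response is the data the server returns after a request is sent. Besides the body, which is available through the text and content attributes, the Response object exposes other information such as the status code, the response headers, and cookies.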

import requests
# 1. Basic GET request
r = requests.get('http://www.baidu.com')
print(type(r.status_code), r.status_code)
print(type(r.headers), r.headers)
print(type(r.cookies), r.cookies)
print(type(r.url), r.url)
print(type(r.history), r.history)

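In the example above, status_code, headers, cookies, url, and history hold the response's status code, response headers, cookies, final URL, and redirect history, respectively.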

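Note that status_code is the HTTP status code: 200 means the request succeeded, 404 means the resource does not exist, and so on. In crawler code, you can check this code to decide whether a request succeeded and handle each case accordingly.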

import requests

r = requests.get('http://www.baidu.com')
if r.status_code != requests.codes.ok:
    print('Not OK')
else:
    print('Request successful!')
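As an alternative to comparing status_code by hand, requests also offers Response.raise_for_status(), which raises an exception for 4xx and 5xx statuses. The sketch below is one way it might be used; the helper name check is an assumption, and the Response objects are built by hand purely so that the example runs without network access.

```python
import requests

def check(response: requests.Response) -> bool:
    """Return False if the status is 4xx or 5xx, True otherwise."""
    try:
        response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx
    except requests.HTTPError as exc:
        print("Request failed:", exc)
        return False
    return True

# Hand-built Response objects stand in for real server replies.
ok = requests.Response()
ok.status_code = 200
missing = requests.Response()
missing.status_code = 404

print(check(ok))       # True
print(check(missing))  # False
```

In real code the argument would be the object returned by requests.get.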

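Here requests.codes.ok stands for status 200, so you do not have to hard-code numbers such as 200. requests defines many other named status codes as well; the more commonly used ones are listed below for reference: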

# Informational status codes
100: ('continue',),  
101: ('switching_protocols',),  
102: ('processing',),  
103: ('checkpoint',),  
122: ('uri_too_long', 'request_uri_too_long'),  

# Success status codes
200: ('ok', 'okay', 'all_ok', 'all_okay', 'all_good', '\\o/', '✓'),
201: ('created',),  
202: ('accepted',),  
203: ('non_authoritative_info', 'non_authoritative_information'),  
204: ('no_content',),  
205: ('reset_content', 'reset'),  
206: ('partial_content', 'partial'),  
207: ('multi_status', 'multiple_status', 'multi_stati', 'multiple_stati'),  
208: ('already_reported',),  
226: ('im_used',),  

# Redirection status codes
300: ('multiple_choices',),
301: ('moved_permanently', 'moved', '\\o-'),
302: ('found',),  
303: ('see_other', 'other'),  
304: ('not_modified',),  
305: ('use_proxy',),  
306: ('switch_proxy',),  
307: ('temporary_redirect', 'temporary_moved', 'temporary'),  
308: ('permanent_redirect',  
      'resume_incomplete', 'resume',), # These 2 to be removed in 3.0  

# Client error status codes
400: ('bad_request', 'bad'),  
401: ('unauthorized',),  
402: ('payment_required', 'payment'),  
403: ('forbidden',),  
404: ('not_found', '-o-'),  
405: ('method_not_allowed', 'not_allowed'),  
406: ('not_acceptable',),  
407: ('proxy_authentication_required', 'proxy_auth', 'proxy_authentication'),  
408: ('request_timeout', 'timeout'),  
409: ('conflict',),  
410: ('gone',),  
411: ('length_required',),  
412: ('precondition_failed', 'precondition'),  
413: ('request_entity_too_large',),  
414: ('request_uri_too_large',),  
415: ('unsupported_media_type', 'unsupported_media', 'media_type'),  
416: ('requested_range_not_satisfiable', 'requested_range', 'range_not_satisfiable'),  
417: ('expectation_failed',),  
418: ('im_a_teapot', 'teapot', 'i_am_a_teapot'),  
421: ('misdirected_request',),  
422: ('unprocessable_entity', 'unprocessable'),  
423: ('locked',),  
424: ('failed_dependency', 'dependency'),  
425: ('unordered_collection', 'unordered'),  
426: ('upgrade_required', 'upgrade'),  
428: ('precondition_required', 'precondition'),  
429: ('too_many_requests', 'too_many'),  
431: ('header_fields_too_large', 'fields_too_large'),  
444: ('no_response', 'none'),  
449: ('retry_with', 'retry'),  
450: ('blocked_by_windows_parental_controls', 'parental_controls'),  
451: ('unavailable_for_legal_reasons', 'legal_reasons'),  
499: ('client_closed_request',),  

# Server error status codes
500: ('internal_server_error', 'server_error', '/o\\', '✗'),
501: ('not_implemented',),  
502: ('bad_gateway',),  
503: ('service_unavailable', 'unavailable'),  
504: ('gateway_timeout',),  
505: ('http_version_not_supported', 'http_version'),  
506: ('variant_also_negotiates',),  
507: ('insufficient_storage',),  
509: ('bandwidth_limit_exceeded', 'bandwidth'),  
510: ('not_extended',),  
511: ('network_authentication_required', 'network_auth', 'network_authentication')
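Each name in the listing above becomes an attribute of requests.codes. A brief sketch of how the names might be looked up:

```python
import requests

# Attribute access: every alias maps to its numeric status code.
print(requests.codes.ok)          # 200
print(requests.codes.not_found)   # 404
print(requests.codes.teapot)      # 418

# Item access works too and returns None for an unknown name.
print(requests.codes["moved"])    # 301
print(requests.codes["no_such"])  # None
```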
