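Update: It's 2019, so I have rewritten this answer for Python 3 following a confused comment from a programmer trying to use the code. The original Python 2 code is now at the bottom of the answer.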
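There are excellent tools in the Standard Library both for parsing RFC 822 headers and for parsing entire HTTP requests. Here is an example request string (note that Python treats it as one big string, even though we break it across several lines for readability) that we can feed to the examples: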

    request_text = (
        b'GET /who/ken/trust.html HTTP/1.1\r\n'
        b'Host: cm.bell-labs.com\r\n'
        b'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3\r\n'
        b'Accept: text/html;q=0.9,text/plain\r\n'
        b'\r\n'
    )

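As @TryPyPy points out, you can use Python's email library to parse the headers, though we should add that the resulting Message object acts like a dictionary of headers once you have finished creating it: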

    from email.parser import BytesParser

    # Split off the request line, then parse only the header block
    request_line, headers_alone = request_text.split(b'\r\n', 1)
    headers = BytesParser().parsebytes(headers_alone)

    print(len(headers))     # -> "3"
    print(headers.keys())   # -> ['Host', 'Accept-Charset', 'Accept']
    print(headers['Host'])  # -> "cm.bell-labs.com"

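To make the "acts like a dictionary" remark concrete, here is a minimal sketch of a few lookups on the parsed object. The shortened header block and the `Cookie` name are illustrative assumptions, not part of the original example.

```python
from email.parser import BytesParser

# Header block only (no request line); a shortened, illustrative sample.
headers_alone = (
    b'Host: cm.bell-labs.com\r\n'
    b'Accept: text/html;q=0.9,text/plain\r\n'
    b'\r\n'
)
headers = BytesParser().parsebytes(headers_alone)

# Lookups ignore the case of the header name.
print(headers['host'] == headers['HOST'])   # True

# Unlike a real dict, a missing header gives None instead of raising KeyError.
print(headers['Cookie'])                     # None

# get() accepts a default, and items() keeps the headers in their original order.
print(headers.get('Cookie', 'no cookie'))
for name, value in headers.items():
    print(name, '=', value)
```

The one difference from a plain dict worth remembering is the missing-key behaviour, so an `in` test or `get()` is the safer way to check whether a header was sent.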
But this, of course, ignores the request line, or makes you parse it yourself. It turns out that there is a much better solution.
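The Standard Library will parse HTTP for you if you use its BaseHTTPRequestHandler. Though its documentation is a bit obscure (a problem with the whole suite of HTTP and URL tools in the Standard Library), all you have to do to make it parse a string is (a) wrap your string in a BytesIO(), (b) read the raw_requestline so that it stands ready to be parsed, and (c) capture any error codes that occur during parsing instead of letting it try to write them back to the client (since we do not have one!).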
So here is our specialization of the Standard Library class:

    from http.server import BaseHTTPRequestHandler
    from io import BytesIO

    class HTTPRequest(BaseHTTPRequestHandler):
        def __init__(self, request_text):
            self.rfile = BytesIO(request_text)
            self.raw_requestline = self.rfile.readline()
            self.error_code = self.error_message = None
            self.parse_request()

        # parse_request() sometimes passes a third "explain" argument on Python 3
        def send_error(self, code, message=None, explain=None):
            self.error_code = code
            self.error_message = message

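Again, I wish the Standard Library folks had realized that HTTP parsing should be broken out in a way that does not require us to write nine lines of code to call it properly, but what can you do? Here is how you would use this simple class: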

    # Using this new class is really easy!

    request = HTTPRequest(request_text)

    print(request.error_code)       # None  (check this first)
    print(request.command)          # "GET"
    print(request.path)             # "/who/ken/trust.html"
    print(request.request_version)  # "HTTP/1.1"
    print(len(request.headers))     # 3
    print(request.headers.keys())   # ['Host', 'Accept-Charset', 'Accept']
    print(request.headers['host'])  # "cm.bell-labs.com"

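The example request has no body, but the same object can probably be used to get at one. The following is only a sketch: the POST request is made up for illustration, and it assumes the body length is given by a Content-Length header (no chunked encoding). It relies on parse_request() reading no further than the blank line that ends the headers, which leaves the rest of the BytesIO unread.

```python
from http.server import BaseHTTPRequestHandler
from io import BytesIO


class HTTPRequest(BaseHTTPRequestHandler):
    # Same specialization as above, repeated so this sketch runs on its own.
    def __init__(self, request_text):
        self.rfile = BytesIO(request_text)
        self.raw_requestline = self.rfile.readline()
        self.error_code = self.error_message = None
        self.parse_request()

    def send_error(self, code, message=None, explain=None):
        self.error_code = code
        self.error_message = message


# Hypothetical POST request, not from the original answer.
post_text = (
    b'POST /submit HTTP/1.1\r\n'
    b'Host: example.com\r\n'
    b'Content-Length: 11\r\n'
    b'\r\n'
    b'hello=world'
)
request = HTTPRequest(post_text)

# The header parser stopped at the blank line, so what remains in rfile
# is the body; read exactly as many bytes as Content-Length announces.
length = int(request.headers.get('Content-Length', 0))
body = request.rfile.read(length)

print(request.command, body)
```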
If there is an error during parsing, error_code will not be None:
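
    # Parsing can result in an error code and message

    request = HTTPRequest(b'GET\r\nHeader: Value\r\n\r\n')

    print(request.error_code)     # 400 (HTTPStatus.BAD_REQUEST)
    print(request.error_message)  # "Bad request syntax ('GET')"

I prefer using the Standard Library like this because I suspect that its authors have already encountered and resolved any edge cases that might bite me if I tried to re-implement an Internet specification myself with regular expressions.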
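Because error_code has to be checked before the other attributes can be trusted, one option is to wrap the class in a small function that raises instead. This is a sketch only: `parse_or_raise` is a hypothetical helper name, and turning the error into a ValueError is an assumption about what a caller might want.

```python
from http.server import BaseHTTPRequestHandler
from io import BytesIO


class HTTPRequest(BaseHTTPRequestHandler):
    # Same specialization as in the answer, repeated so this runs on its own.
    def __init__(self, request_text):
        self.rfile = BytesIO(request_text)
        self.raw_requestline = self.rfile.readline()
        self.error_code = self.error_message = None
        self.parse_request()

    def send_error(self, code, message=None, explain=None):
        self.error_code = code
        self.error_message = message


def parse_or_raise(request_text):
    """Hypothetical helper: return the parsed request or raise ValueError."""
    request = HTTPRequest(request_text)
    if request.error_code is not None:
        # int() turns the status into a plain number such as 400.
        raise ValueError(f'{int(request.error_code)} {request.error_message}')
    return request


good = parse_or_raise(b'GET /who/ken/trust.html HTTP/1.1\r\nHost: cm.bell-labs.com\r\n\r\n')
print(good.command, good.path)

try:
    parse_or_raise(b'GET\r\nHeader: Value\r\n\r\n')
except ValueError as exc:
    print('rejected:', exc)
```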
Old Python 2 code
Here is the original code for this answer, as it stood when I first wrote it:

    request_text = (
        'GET /who/ken/trust.html HTTP/1.1\r\n'
        'Host: cm.bell-labs.com\r\n'
        'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3\r\n'
        'Accept: text/html;q=0.9,text/plain\r\n'
        '\r\n'
        )

And:

    # Ignore the request line and parse only the headers

    from mimetools import Message
    from StringIO import StringIO
    request_line, headers_alone = request_text.split('\r\n', 1)
    headers = Message(StringIO(headers_alone))

    print len(headers)     # -> "3"
    print headers.keys()   # -> ['accept-charset', 'host', 'accept']
    print headers['Host']  # -> "cm.bell-labs.com"

And:

    from BaseHTTPServer import BaseHTTPRequestHandler
    from StringIO import StringIO

    class HTTPRequest(BaseHTTPRequestHandler):
        def __init__(self, request_text):
            self.rfile = StringIO(request_text)
            self.raw_requestline = self.rfile.readline()
            self.error_code = self.error_message = None
            self.parse_request()

        def send_error(self, code, message):
            self.error_code = code
            self.error_message = message

And:

    # Using this new class is really easy!

    request = HTTPRequest(request_text)

    print request.error_code       # None  (check this first)
    print request.command          # "GET"
    print request.path             # "/who/ken/trust.html"
    print request.request_version  # "HTTP/1.1"
    print len(request.headers)     # 3
    print request.headers.keys()   # ['accept-charset', 'host', 'accept']
    print request.headers['host']  # "cm.bell-labs.com"

And:

    # Parsing can result in an error code and message

    request = HTTPRequest('GET\r\nHeader: Value\r\n\r\n')

    print request.error_code     # 400
    print request.error_message  # "Bad request syntax ('GET')"


