Python Daily Usage Notes

String Encoding

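One of the main differences between Python 3 and Python 2 is character encoding. In Python 2 a string literal is a byte string (bytes) by default, whereas in Python 3 a string literal is a Unicode string by default. The following code illustrates this.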

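# -*- coding: utf-8 -*-
# if using Python 2 (the declaration above is required for non-ASCII literals in a source file)
str_raw = "我们爱编程"
str_bytes = b"我们爱编程"
str_unicode = u"我们爱编程"
print(str_raw)            # prints 我们爱编程
print(str_bytes)          # prints 我们爱编程
print(str_unicode)        # prints 我们爱编程
print(type(str_raw))      # prints <type 'str'>
print(type(str_bytes))    # prints <type 'str'>
print(type(str_unicode))  # prints <type 'unicode'>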

# if using Python 3
str_raw = "我们爱编程"
# str_bytes = b"我们爱编程"  # raises SyntaxError: bytes can only contain ASCII literal characters
str_bytes = b"we love programming"
str_unicode = u"我们爱编程"  # the u prefix is accepted but redundant in Python 3
print(str_raw)            # prints 我们爱编程
print(str_unicode)        # prints 我们爱编程
print(str_bytes)          # prints b'we love programming'
print(type(str_raw))      # prints <class 'str'>
print(type(str_unicode))  # prints <class 'str'>
print(type(str_bytes))    # prints <class 'bytes'>

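This example shows that a string literal in Python 2 is stored as bytes by default. Echoing such a string in the Python 2 interactive interpreter displays its raw bytes (UTF-8 here, assuming a UTF-8 terminal):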

>>> str_raw = "我们爱编程"
>>> str_raw
'\xe6\x88\x91\xe4\xbb\xac\xe7\x88\xb1\xe7\xbc\x96\xe7\xa8\x8b'

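A string literal in Python 3, by contrast, is stored as Unicode by default. Echoing it in the interactive interpreter displays the characters themselves: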

>>> str_raw = "我们爱编程"
>>> str_raw
'我们爱编程'
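The two interpreter sessions above can be tied together in Python 3: encoding the Unicode string as UTF-8 yields the same byte sequence that Python 2 displayed. This is a minimal sketch that assumes the Python 2 session ran in a UTF-8 terminal; the variable names are illustrative only.

```python
# Python 3: the same literal as text (str) and as UTF-8 bytes
text = "我们爱编程"
encoded = text.encode("utf-8")  # str -> bytes

print(repr(text))  # '我们爱编程'
print(encoded)     # b'\xe6\x88\x91\xe4\xbb\xac\xe7\x88\xb1\xe7\xbc\x96\xe7\xa8\x8b'
```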

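One point worth mentioning: Unicode is a character set that assigns a code point to every character, while UTF-8 is one of several encodings of Unicode (UTF-16 is another). Unicode has room for more than one million code points, so storing every code point at a fixed width would be wasteful. For example, the code point of 严 is hexadecimal 4E25, which takes a full 15 bits in binary (100111000100101). Code points are therefore encoded into a byte stream (bytes) for persistent storage or network transmission. For further background on character sets and encodings, see [1].

As Fig 1.1 shows, Python 3 calls the bytes -> unicode direction decoding and the unicode -> bytes direction encoding. In terms of data types, Python 3's <class 'str'> corresponds to Python 2's <type 'unicode'>, and Python 3's <class 'bytes'> corresponds to Python 2's <type 'str'>. In terms of functions, Python 3's bytes() corresponds to Python 2's str(), and Python 3's str() corresponds to Python 2's unicode() (in Python 2, bytes is merely an alias for str). The direction of the two operations is the same in both versions: decode() turns bytes into Unicode text, and encode() turns Unicode text into bytes. What differs is the starting point. A plain literal such as "我们爱编程" is already bytes in Python 2 and must be decoded to obtain text, whereas in Python 3 it is already text and must be encoded to obtain bytes.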
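The relationship between a code point and its encoded bytes can be sketched in Python 3 with the character 严 from the paragraph above. The variable names are illustrative, and the choice of UTF-16 (big-endian, no BOM) as the second encoding is an assumption made only for comparison.

```python
# Python 3: one code point, two different byte encodings
text = "严"

# The code point is an abstract number, independent of any encoding.
code_point = ord(text)
print(hex(code_point))      # 0x4e25
print(bin(code_point)[2:])  # 100111000100101

# Encoding: str -> bytes. Each encoding produces its own byte sequence.
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-be")
print(utf8_bytes.hex())     # e4b8a5
print(utf16_bytes.hex())    # 4e25

# Decoding: bytes -> str. The same encoding must be used to get the text back.
assert utf8_bytes.decode("utf-8") == text
assert utf16_bytes.decode("utf-16-be") == text
```

Decoding with an encoding other than the one used to produce the bytes either raises UnicodeDecodeError or silently yields the wrong characters, which is why the encoding has to be known at every bytes/str boundary.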

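In short, the recommended practice in Python is to convert all text to Unicode before processing it, and to encode it into a byte stream only when it needs to be persisted or transmitted.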
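A minimal sketch of this "decode on input, process as Unicode, encode on output" pattern in Python 3 follows. The helper name process_payload and the choice of UTF-8 as the default encoding are assumptions for illustration, not part of any library.

```python
def process_payload(raw: bytes, encoding: str = "utf-8") -> bytes:
    """Hypothetical helper: decode at the input boundary, work on str, encode at the output."""
    text = raw.decode(encoding)      # bytes -> str (decode)
    cleaned = text.strip()           # all processing happens on Unicode text
    return cleaned.encode(encoding)  # str -> bytes (encode)


incoming = "  我们爱编程\n".encode("utf-8")  # stands in for bytes read from a file or socket
outgoing = process_payload(incoming)

text = outgoing.decode("utf-8")
print(text)                      # 我们爱编程
print(len(text), len(outgoing))  # 5 15
```

The last line hints at why processing on str is preferable: len() on the text counts 5 characters, while len() on the bytes counts 15 UTF-8 bytes, so slicing or counting on raw bytes can cut a character in half.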

Fig 1.1 Encoding and decoding of strings in Python 3.

Reference

[1]. https://zhuanlan.zhihu.com/p/38333902, 字符编码那点事：快速理解ASCII、Unicode、GBK和UTF-8
