What are the upper and lower bounds of Chinese characters in UTF-8?

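According to the Unicode Standard (v6.0, section 12.1):

Han ideographic characters are found in seven main blocks of the Unicode Standard, as shown in Table 12-2.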

Table 12-2. Blocks Containing Han Ideographs

Block                                   | Range       | Comment
----------------------------------------+-------------+-----------------------------------------------------
CJK Unified Ideographs                  | 4E00–9FFF   | Common
CJK Unified Ideographs Extension A      | 3400–4DBF   | Rare
CJK Unified Ideographs Extension B      | 20000–2A6DF | Rare, historic
CJK Unified Ideographs Extension C      | 2A700–2B73F | Rare, historic
CJK Unified Ideographs Extension D      | 2B740–2B81F | Uncommon, some in current use
CJK Compatibility Ideographs            | F900–FAFF   | Duplicates, unifiable variants, corporate characters
CJK Compatibility Ideographs Supplement | 2F800–2FA1F | Unifiable variants

Beyond these blocks, there are also a few small extensions to the URO:

Table 12-3. Small Extensions to the URO

Range     | Version | Comment
----------+---------+-------------------------------------------------
9FA6–9FB3 | 4.1     | Interoperability with HKSCS standard
9FB4–9FBB | 4.1     | Interoperability with GB 18030 standard
9FBC–9FC2 | 5.1     | Interoperability with commercial implementations
9FC3      | 5.1     | Correction of mistaken unification
9FC4–9FC6 | 5.2     | Interoperability with ARIB standard
9FC7–9FCB | 5.2     | Interoperability with HKSCS standard

To build a set of these ordinal values with set operations, you could do the following:

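# range() is half-open, so each stop value is the last code point plus one.
# set().union() accepts any iterables, so this runs on Python 2 and 3
# (range(...) + range(...) only works on Python 2).
chinese = set().union(
    range(0x4E00, 0xA000),    # CJK Unified Ideographs
    range(0x3400, 0x4DC0),    # Extension A
    range(0x20000, 0x2A6E0),  # Extension B
    range(0x2A700, 0x2B740),  # Extension C
    range(0x2B740, 0x2B820),  # Extension D
    range(0xF900, 0xFB00),    # CJK Compatibility Ideographs
    range(0x2F800, 0x2FA20),  # CJK Compatibility Ideographs Supplement
    range(0x9FA6, 0x9FCC),    # small URO extensions (already inside 4E00-9FFF)
)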

Note, however, that this set contains more than 75,000 code points, so it may not be the most compact or efficient data structure.
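If memory matters, one alternative is to keep only the range endpoints and test membership by comparison. The following is a minimal Python 3 sketch, not something taken from the standard. The names `HAN_RANGES` and `is_han` are made up for illustration, and the ranges are the inclusive block boundaries from Table 12-2, which also cover code points that are unassigned in v6.0.

```python
# Inclusive (first, last) code point pairs, copied from Table 12-2.
# HAN_RANGES and is_han are illustrative names, not a standard API.
HAN_RANGES = (
    (0x3400, 0x4DBF),    # Extension A
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs
    (0xF900, 0xFAFF),    # CJK Compatibility Ideographs
    (0x20000, 0x2A6DF),  # Extension B
    (0x2A700, 0x2B73F),  # Extension C
    (0x2B740, 0x2B81F),  # Extension D
    (0x2F800, 0x2FA1F),  # CJK Compatibility Ideographs Supplement
)


def is_han(ch):
    """Return True if the single character ch falls in one of the blocks."""
    cp = ord(ch)
    return any(first <= cp <= last for first, last in HAN_RANGES)


print(is_han("\u4e2d"))  # U+4E2D lies in the main CJK Unified Ideographs block
print(is_han("A"))       # Latin letter, outside every listed range
```

With only seven ranges a linear scan is probably fast enough; a binary search over the sorted start points would be the next step if the list grew.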

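Also, if you insist on using ord() with literal characters, you will need the 32-bit Unicode literal form (\U followed by exactly eight hex digits):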

>>> ord(u'\U0002F800')
194560
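Since the question asks about UTF-8 specifically, keep in mind that the ranges in Table 12-2 are Unicode code points, not byte values. A small Python 3 sketch of how the block boundaries look once encoded is shown here; `utf8_hex` is a hypothetical helper name, and only two of the blocks are printed as examples.

```python
def utf8_hex(cp):
    """Return the UTF-8 encoding of code point cp as a lowercase hex string."""
    return chr(cp).encode("utf-8").hex()


# Boundaries of two blocks from Table 12-2: one in the BMP, one supplementary.
for first, last in ((0x4E00, 0x9FFF), (0x20000, 0x2A6DF)):
    print("U+%04X..U+%04X -> %s..%s" % (first, last, utf8_hex(first), utf8_hex(last)))
```

The main block U+4E00 to U+9FFF encodes to the three-byte sequences e4 b8 80 through e9 bf bf, while the supplementary-plane blocks need four bytes per character.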
