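According to the Unicode Standard (v6.0, section 12.1),
Han ideographic characters are found in seven main blocks of the Unicode Standard, as shown in Table 12-2: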
Table 12-2. Blocks Containing Han Ideographs

    Block                                   | Range       | Comment
    ----------------------------------------+-------------+-----------------------------------------------------
    CJK Unified Ideographs                  | 4E00–9FFF   | Common
    CJK Unified Ideographs Extension A      | 3400–4DBF   | Rare
    CJK Unified Ideographs Extension B      | 20000–2A6DF | Rare, historic
    CJK Unified Ideographs Extension C      | 2A700–2B73F | Rare, historic
    CJK Unified Ideographs Extension D      | 2B740–2B81F | Uncommon, some in current use
    CJK Compatibility Ideographs            | F900–FAFF   | Duplicates, unifiable variants, corporate characters
    CJK Compatibility Ideographs Supplement | 2F800–2FA1F | Unifiable variants
In addition to these blocks, there are a few small extensions:
Table 12-3. Small Extensions to the URO

    Range     | Version | Comment
    ----------+---------+-------------------------------------------------
    9FA6–9FB3 | 4.1     | Interoperability with HKSCS standard
    9FB4–9FBB | 4.1     | Interoperability with GB 18030 standard
    9FBC–9FC2 | 5.1     | Interoperability with commercial implementations
    9FC3      | 5.1     | Correction of mistaken unification
    9FC4–9FC6 | 5.2     | Interoperability with ARIB standard
    9FC7–9FCB | 5.2     | Interoperability with HKSCS standard
To build a set of the ordinal values (code points) for these ranges, you can do the following:
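    from itertools import chain  # range objects cannot be joined with + on Python 3

    chinese = set(chain(
        range(0x4E00, 0xA000), range(0x3400, 0x4DC0),
        range(0x20000, 0x2A6E0), range(0x2A700, 0x2B740), range(0x2B740, 0x2B820),
        range(0xF900, 0xFB00), range(0x2F800, 0x2FA20),
        range(0x9FA6, 0x9FCC),  # already inside the first range; listed for completeness
    ))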
Note, however, that this set contains more than 75,000 code points, so it may not be the most compact or efficient data structure.
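If memory is a concern, one alternative is to keep only the block boundaries and compare a code point against them. The following is a minimal sketch, assuming Python 3 (where `ord()` accepts any single code point); the names `HAN_RANGES` and `is_han` are illustrative and do not come from the standard or from any library.

```python
# Inclusive (first, last) code point pairs copied from Table 12-2.
HAN_RANGES = (
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs
    (0x3400, 0x4DBF),    # Extension A
    (0x20000, 0x2A6DF),  # Extension B
    (0x2A700, 0x2B73F),  # Extension C
    (0x2B740, 0x2B81F),  # Extension D
    (0xF900, 0xFAFF),    # CJK Compatibility Ideographs
    (0x2F800, 0x2FA1F),  # CJK Compatibility Ideographs Supplement
)


def is_han(ch):
    """Return True if the single character ch falls in one of HAN_RANGES."""
    cp = ord(ch)
    return any(first <= cp <= last for first, last in HAN_RANGES)
```

This stores seven pairs instead of tens of thousands of integers, at the cost of a short linear scan per lookup. The ranges from Table 12-3 are not listed separately because they already fall inside 4E00–9FFF.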
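Also, if you insist on using ord() with literal characters, you will need the 32-bit Unicode escape form (\U followed by exactly eight hex digits):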
    >>> ord(u'\U0002F800')
    194560



