
Difference Between NFD, NFC, NFKD, and NFKC Explained with Python Code


The difference between Unicode normalization forms

Photo by Joel Filipe on Unsplash
Recently I have been working on an NLP task in Japanese, and one problem was converting special characters to a normalized form. I did a little research and wrote this post for anyone who has the same need.

Japanese text contains characters in different forms. For example, Latin letters come in two forms: full-width (ＡＢＣ) and half-width (ABC).

The full-width form looks awkward and is also hard to use in downstream processing, so we need to convert it to a normalized form.
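
To make the full-width / half-width distinction concrete, here is a minimal sketch that inspects a few characters with the standard unicodedata module. The sample characters and the helper name describe are my own choices for illustration, not part of the task described above.

```python
import unicodedata

# Sample characters chosen for illustration (an assumption, not from the post).
# They are written as escapes so the exact code points are unambiguous.
samples = {
    "ASCII A": "A",                       # U+0041
    "full-width A": "\uff21",             # Ａ
    "katakana A": "\u30a2",               # ア
    "half-width katakana A": "\uff71",    # ｱ
}

def describe(ch):
    """Return (code point, East Asian width class, NFKC form) for one character."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.east_asian_width(ch),
        unicodedata.normalize("NFKC", ch),
    )

for label, ch in samples.items():
    print(label, describe(ch))
```

The width class is "F" for full-width and "H" for half-width characters. NFKC maps both to the ordinary form of the same letter.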

TL;DR
Use the NFKC form.

from unicodedata import normalize
s = "株式会社ＫＡＤＯＫＡＷＡ Ｆｕｔｕｒｅ Ｐｕｂｌｉｓｈｉｎｇ"
print(normalize("NFKC", s))
# 株式会社KADOKAWA Future Publishing

Unicode normalization forms

(Figure from Wikipedia)
There are four Unicode normalization forms. This article gives a very detailed explanation, but I will explain the difference in a simpler, easier-to-understand way.

First, look at the results below to get an intuitive understanding.

ｱｲｳｴｵ (NFC)> ｱｲｳｴｵ
ｱｲｳｴｵ (NFD)> ｱｲｳｴｵ
ｱｲｳｴｵ (NFKC)> アイウエオ
ｱｲｳｴｵ (NFKD)> アイウエオ
ﾊﾟﾋﾟﾌﾟﾍﾟﾎﾟ (NFC)> ﾊﾟﾋﾟﾌﾟﾍﾟﾎﾟ
ﾊﾟﾋﾟﾌﾟﾍﾟﾎﾟ (NFD)> ﾊﾟﾋﾟﾌﾟﾍﾟﾎﾟ
ﾊﾟﾋﾟﾌﾟﾍﾟﾎﾟ (NFKC)> パピプペポ
ﾊﾟﾋﾟﾌﾟﾍﾟﾎﾟ (NFKD)> パピプペポ
パピプペポ (NFC)> パピプペポ
パピプペポ (NFD)> パピプペポ
パピプペポ (NFKC)> パピプペポ
パピプペポ (NFKD)> パピプペポ
ａｂｃＡＢＣ (NFC)> ａｂｃＡＢＣ
ａｂｃＡＢＣ (NFD)> ａｂｃＡＢＣ
ａｂｃＡＢＣ (NFKC)> abcABC
ａｂｃＡＢＣ (NFKD)> abcABC
１２３ (NFC)> １２３
１２３ (NFD)> １２３
１２３ (NFKC)> 123
１２３ (NFKD)> 123
＋－．～）｝ (NFC)> ＋－．～）｝
＋－．～）｝ (NFD)> ＋－．～）｝
＋－．～）｝ (NFKC)> +-.~)}
＋－．～）｝ (NFKD)> +-.~)}
Note: the NFD and NFKD results for パピプペポ may look the same as the composed form, but they are decomposed. Each kana is stored as a base character followed by the combining mark U+309A.
There are two ways to classify these four forms.

1 original form changed or not
  • A(not changed): NFC & NFD
  • B(changed): NFKC & NFKD
2 length of the original form changed or not
  • A(not changed): NFC & NFKC
  • B(changed): NFD & NFKD

1 Whether the original form is changed or not
ａｂｃＡＢＣ (NFC)> ａｂｃＡＢＣ
ａｂｃＡＢＣ (NFD)> ａｂｃＡＢＣ
ａｂｃＡＢＣ (NFKC)> abcABC
ａｂｃＡＢＣ (NFKD)> abcABC

1 original form changed or not
  • A(not changed): NFC & NFD
  • B(changed): NFKC & NFKD

The first classification is based on whether the original form is changed. More specifically, the forms in group A do not contain K, while the forms in group B do. What does K mean?

D = Decomposition
C = Composition
K = Compatibility
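K means compatibility: the K forms replace compatibility characters, such as full-width letters, with their ordinary equivalents, which is what changes the original form. Because a replacement can consist of several characters, the length can change as well.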

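from unicodedata import normalize
s = "\u2026"  # '…' (HORIZONTAL ELLIPSIS), a single character
normalize("NFKC", s)        # '...'
len(s)                      # 1
len(normalize("NFC", s))    # 1
len(normalize("NFKC", s))   # 3
len(normalize("NFD", s))    # 1
len(normalize("NFKD", s))   # 3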
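
One way to see which characters are affected by K is to look at their decomposition mappings. The sketch below uses unicodedata.decomposition from the standard library. The helper name mapping_kind and the sample characters are my own, added only for illustration.

```python
import unicodedata

def mapping_kind(ch):
    """Classify a character's decomposition mapping as 'none', 'canonical', or 'compatibility'."""
    mapping = unicodedata.decomposition(ch)
    if not mapping:
        return "none"
    # Compatibility mappings start with a tag such as <compat>, <wide>, or <narrow>.
    return "compatibility" if mapping.startswith("<") else "canonical"

# Illustrative samples: ellipsis, full-width A, half-width katakana A, katakana PO, ASCII A.
for ch in ["\u2026", "\uff21", "\uff71", "\u30dd", "A"]:
    print(f"U+{ord(ch):04X}", repr(unicodedata.decomposition(ch)), mapping_kind(ch))
```

Only NFKC and NFKD apply the tagged (compatibility) mappings. All four forms take the untagged (canonical) mappings into account.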

2 Whether the length of the original form is changed or not
パピプペポ (NFC)> パピプペポ
パピプペポ (NFD)> パピプペポ
パピプペポ (NFKC)> パピプペポ
パピプペポ (NFKD)> パピプペポ

2 length of the original form changed or not
  • A(not changed): NFC & NFKC
  • B(changed): NFD & NFKD

The second classification is based on whether the length of the original form is changed. Group A contains C (Composition): for precomposed text like the example above, the length stays the same. Group B contains D (Decomposition), which can increase the length.

You might be wondering why the length changes. Please see the test below.

from unicodedata import normalize
s = "パピプペポ"
len(s)                      # 5
len(normalize("NFC", s))    # 5
len(normalize("NFKC", s))   # 5
len(normalize("NFD", s))    # 10
len(normalize("NFKD", s))   # 10

We can see that decomposition doubles the length of this string.

(Figure from "Unicode正規化とは" ("What is Unicode normalization?"))
This is because NFD and NFKD decompose each of these characters into two code points. For example, ポ (U+30DD) = ホ (U+30DB) + the combining semi-voiced sound mark (U+309A), so the length changes from 5 to 10. NFC and NFKC compose the separated code points back together, so the length is not changed.
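
The decomposition of ポ can be checked directly. This is a small sketch using only the standard library; the helper name code_points is my own.

```python
from unicodedata import name, normalize

def code_points(text):
    """List each character of text as (code point, Unicode name)."""
    return [(f"U+{ord(ch):04X}", name(ch)) for ch in text]

composed = "\u30dd"  # ポ as one precomposed character
decomposed = normalize("NFD", composed)

print(code_points(composed))
print(code_points(decomposed))
# NFC recombines the base character and the combining mark.
print(normalize("NFC", decomposed) == composed)
```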

Python Implementation
You can use the unicodedata module from the standard library to get the different forms.
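
As a rough sketch, a comparison of all four forms for a given string could be printed like this. The helper name normalization_table and the choice of samples are assumptions made for illustration. The inputs are written as escapes so the exact code points are unambiguous.

```python
from unicodedata import normalize

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def normalization_table(text):
    """Return {form: (normalized text, length)} for the four normalization forms."""
    table = {}
    for form in FORMS:
        result = normalize(form, text)
        table[form] = (result, len(result))
    return table

samples = [
    "\uff71\uff72\uff73\uff74\uff75",        # half-width katakana a-i-u-e-o
    "\u30d1\u30d4\u30d7\u30da\u30dd",        # katakana pa-pi-pu-pe-po (precomposed)
    "\uff41\uff42\uff43\uff21\uff22\uff23",  # full-width abcABC
]

for s in samples:
    for form, (result, length) in normalization_table(s).items():
        print(f"{s} ({form})> {result} (length {length})")
```

Printing the length next to each result helps, because composed and decomposed kana can look identical on screen.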


Take Away
Usually, either NFKC or NFKD can be used to get the normalized form. The difference in length causes trouble only if your NLP task is length sensitive. I usually use NFKC.
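
As one possible use, text can be normalized before it is compared, so that full-width and half-width spellings of the same name match. This is only a sketch: the helper name same_after_nfkc is hypothetical, and whether NFKC is appropriate depends on your task.

```python
from unicodedata import normalize

def same_after_nfkc(a, b):
    """Compare two strings after NFKC normalization."""
    return normalize("NFKC", a) == normalize("NFKC", b)

# Full-width spelling of KADOKAWA, written as escapes for clarity.
full_width = "\uff2b\uff21\uff24\uff2f\uff2b\uff21\uff37\uff21"

print(full_width == "KADOKAWA")                 # raw comparison
print(same_after_nfkc(full_width, "KADOKAWA"))  # comparison after NFKC
```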

Check out my other posts on Medium with a categorized view!
GitHub: BrambleXu
LinkedIn: Xu Liang
Blog: BrambleXu

Reference
https://unicode.org/reports/tr15/#Norm_Forms
https://www.wikiwand.com/en/Unicode_equivalence#/Normal_forms
http://nomenclator.la.coocan.jp/unicode/normalization.htm
https://maku77.github.io/js/string/normalize.html
http://tech.albert2005.co.jp/501/

https://towardsdatascience.com/difference-between-nfd-nfc-nfkd-and-nfkc-explained-with-python-code-e2631f96ae6c
