If you already have a gazetteer of multi-word expressions, you can use the
MWETokenizer, for example:
>>> from nltk.tokenize import MWETokenizer
>>> from nltk import sent_tokenize, word_tokenize
>>> s = '''Good muffins cost $3.88\nin New York. Please buy me...
... two of them.\n\nThanks.'''
>>> mwe = MWETokenizer([('New', 'York'), ('Hong', 'Kong')], separator='_')
>>> [mwe.tokenize(word_tokenize(sent)) for sent in sent_tokenize(s)]
[['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New_York', '.'], ['Please', 'buy', 'me', '...', 'two', 'of', 'them', '.'], ['Thanks', '.']]
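Conceptually, the MWETokenizer re-merges tokens that form a known multi-word expression into a single token joined by the separator. A minimal pure-Python sketch of that idea (a hypothetical `merge_mwes` helper, not NLTK's actual implementation) might look like this:

```python
def merge_mwes(tokens, mwes, separator="_"):
    """Greedily merge known multi-word expressions in a token list.

    A simplified sketch of MWE merging: scan left to right, trying
    the longest candidate expression first at each position.
    """
    mwes = {tuple(m) for m in mwes}
    longest = max((len(m) for m in mwes), default=1)
    out, i = [], 0
    while i < len(tokens):
        # Try the longest possible match at position i first.
        for n in range(min(longest, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in mwes:
                out.append(separator.join(tokens[i:i + n]))
                i += n
                break
        else:
            # No expression starts here; keep the token as-is.
            out.append(tokens[i])
            i += 1
    return out

print(merge_mwes(["Good", "muffins", "in", "New", "York", "."],
                 [("New", "York"), ("Hong", "Kong")]))
# → ['Good', 'muffins', 'in', 'New_York', '.']
```

Note that, as in the NLTK example above, merging happens on an already-tokenized sentence, so the quality of the result depends on the word tokenizer applied first.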


