I think PlaintextCorpusReader already uses a Punkt tokenizer to split the input into sentences, at least when your input language is English.
The constructor of PlaintextCorpusReader:
def __init__(self, root, fileids,
             word_tokenizer=WordPunctTokenizer(),
             sent_tokenizer=nltk.data.LazyLoader('tokenizers/punkt/english.pickle'),
             para_block_reader=read_blankline_block,
             encoding='utf8'):
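You can pass both a word tokenizer and a sentence tokenizer to the reader, but the default for the latter is already nltk.data.LazyLoader('tokenizers/punkt/english.pickle').
For a single string, the tokenizer would be used as follows (explained here; for the Punkt tokenizer, see section 5):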
>>> import nltk.data
>>> text = """
... Punkt knows that the periods in Mr. Smith and Johann S. Bach
... do not mark sentence boundaries. And sometimes sentences
... can start with non-capitalized words. i is a good variable
... name.
... """
>>> tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
>>> tokenizer.tokenize(text.strip())
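To make the sent_tokenizer argument of the constructor concrete, here is a minimal sketch of passing your own sentence tokenizer to the reader. It assumes NLTK is installed. The file name demo.txt, the variable names, and the regex-based tokenizer are made up for illustration. The regex tokenizer is only a crude stand-in so the example runs without the Punkt model data; to use Punkt explicitly you would pass a loaded Punkt tokenizer in its place.

```python
import os
import tempfile

from nltk.corpus.reader import PlaintextCorpusReader
from nltk.tokenize import RegexpTokenizer

# Hypothetical stand-in for Punkt: every run of text that ends in '.', '!'
# or '?' counts as one sentence. The reader only needs an object with a
# tokenize(str) method that returns a list of strings.
simple_sent_tokenizer = RegexpTokenizer(r"[^.!?]+[.!?]")

with tempfile.TemporaryDirectory() as root:
    # Write a tiny one-file corpus (demo.txt is an illustrative name).
    with open(os.path.join(root, "demo.txt"), "w", encoding="utf8") as f:
        f.write("This is one sentence. Here is another one!\n")

    reader = PlaintextCorpusReader(
        root,
        r".*\.txt",
        sent_tokenizer=simple_sent_tokenizer,  # overrides the Punkt default
    )
    # sents() is lazy, so materialize it while the files still exist.
    sentences = [list(sent) for sent in reader.sents()]

print(sentences)
```

Each sentence comes back as a list of word tokens, because the reader still applies its word tokenizer (WordPunctTokenizer by default) to every sentence the sentence tokenizer returns.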


