In short:
' '.join([word + '/' + pos for word, pos in tagged_sent])
In long:
I think you're overestimating the cost of using string operations to join the strings; it really isn't that expensive.
import time
from nltk.corpus import brown

tagged_corpus = brown.tagged_sents()
start = time.time()
with open('output.txt', 'w') as fout:
    for i, sent in enumerate(tagged_corpus):
        print(' '.join([word + '/' + pos for word, pos in sent]), end='\n', file=fout)
end = time.time() - start
print(i, end)

On my laptop, it took 2.955 seconds for all 57,339 sentences of the Brown corpus.
[out]:
$ head -n1 output.txt
The/AT Fulton/NP-TL County/NN-TL Grand/JJ-TL Jury/NN-TL said/VBD Friday/NR an/AT investigation/NN of/IN Atlanta's/NP$ recent/JJ primary/NN election/NN produced/VBD ``/`` no/AT evidence/NN ''/'' that/CS any/DTI irregularities/NNS took/VBD place/NN ./.
But joining the word and the POS with a string delimiter will cause trouble later, when you need to read the tagged output back, e.g.
>>> from nltk import pos_tag
>>> tagged_sent = pos_tag('cat / dog'.split())
>>> tagged_sent_str = ' '.join([word + '/' + pos for word, pos in tagged_sent])
>>> tagged_sent_str
'cat/NN //CD dog/NN'
>>> [tuple(wordpos.split('/')) for wordpos in tagged_sent_str.split()]
[('cat', 'NN'), ('', '', 'CD'), ('dog', 'NN')]

If you want to save the tagged output and read it back later, it's better to use pickle to save the tagged output, e.g.
>>> import pickle
>>> tagged_sent = pos_tag('cat / dog'.split())
>>> with open('tagged_sent.pkl', 'wb') as fout:
...     pickle.dump(tagged_sent, fout)
... 
>>> tagged_sent = None
>>> tagged_sent
>>> with open('tagged_sent.pkl', 'rb') as fin:
...     tagged_sent = pickle.load(fin)
... 
>>> tagged_sent
[('cat', 'NN'), ('/', 'CD'), ('dog', 'NN')]
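If you'd rather have a human-readable file than a binary pickle, json also round-trips the data; a minimal sketch with made-up example data (note that JSON has no tuple type, so tuples come back as lists and need converting):

```python
import json

# A tagged sentence as produced by a POS tagger (made-up example data)
tagged_sent = [('cat', 'NN'), ('/', 'CD'), ('dog', 'NN')]

# Tuples are serialized as JSON arrays, so convert them back after loading.
with open('tagged_sent.json', 'w') as fout:
    json.dump(tagged_sent, fout)

with open('tagged_sent.json') as fin:
    restored = [tuple(pair) for pair in json.load(fin)]

assert restored == tagged_sent
```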



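And if you do stick with the word/POS string format, one workaround for the parsing ambiguity shown earlier is to split each token on its last slash only; a plain-Python sketch with made-up example data:

```python
# A tagged sentence like the ones above (made-up example data)
tagged_sent = [('cat', 'NN'), ('/', 'CD'), ('dog', 'NN')]

tagged_sent_str = ' '.join(word + '/' + pos for word, pos in tagged_sent)
# 'cat/NN //CD dog/NN'

# str.split('/') breaks '//CD' into three pieces, but rsplit with
# maxsplit=1 cuts only at the LAST slash, so the '/' token survives.
parsed = [tuple(wordpos.rsplit('/', 1)) for wordpos in tagged_sent_str.split()]

assert parsed == tagged_sent
```

This still assumes no token contains whitespace, which is part of why pickle is the safer choice for anything serious.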