In short:
' '.join([word + '/' + pos for word, pos in tagged_sent])
In long:
I think you're overestimating the cost of using string operations to join the strings; it really isn't that expensive.
import time
from nltk.corpus import brown

tagged_corpus = brown.tagged_sents()
start = time.time()
with open('output.txt', 'w') as fout:
    for i, sent in enumerate(tagged_corpus):
        print(' '.join([word + '/' + pos for word, pos in sent]), end='\n', file=fout)
end = time.time() - start
print(i, end)

On my laptop, it took 2.955 seconds for all 57,339 sentences of the Brown corpus.
[out]:
$ head -n1 output.txt
The/AT Fulton/NP-TL County/NN-TL Grand/JJ-TL Jury/NN-TL said/VBD Friday/NR an/AT investigation/NN of/IN Atlanta's/NP$ recent/JJ primary/NN election/NN produced/VBD ``/`` no/AT evidence/NN ''/'' that/CS any/DTI irregularities/NNS took/VBD place/NN ./.
But joining the word and the POS with a string delimiter will cause trouble later, when you need to read the tagged output back, e.g.
>>> from nltk import pos_tag
>>> tagged_sent = pos_tag('cat / dog'.split())
>>> tagged_sent_str = ' '.join([word + '/' + pos for word, pos in tagged_sent])
>>> tagged_sent_str
'cat/NN //CD dog/NN'
>>> [tuple(wordpos.split('/')) for wordpos in tagged_sent_str.split()]
[('cat', 'NN'), ('', '', 'CD'), ('dog', 'NN')]

If you want to save the tagged output and read it back later, it's better to use pickle to save the tagged output, e.g.
>>> import pickle
>>> tagged_sent = pos_tag('cat / dog'.split())
>>> with open('tagged_sent.pkl', 'wb') as fout:
...     pickle.dump(tagged_sent, fout)
... 
>>> tagged_sent = None
>>> tagged_sent
>>> with open('tagged_sent.pkl', 'rb') as fin:
...     tagged_sent = pickle.load(fin)
... 
>>> tagged_sent
[('cat', 'NN'), ('/', 'CD'), ('dog', 'NN')]
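If you'd rather have a human-readable file than a binary pickle, json also round-trips the data; a minimal sketch with made-up example data (note that JSON has no tuple type, so tuples come back as lists and need converting):

```python
import json

# A tagged sentence as produced by a POS tagger (made-up example data)
tagged_sent = [('cat', 'NN'), ('/', 'CD'), ('dog', 'NN')]

# Tuples are serialized as JSON arrays, so convert them back after loading.
with open('tagged_sent.json', 'w') as fout:
    json.dump(tagged_sent, fout)

with open('tagged_sent.json') as fin:
    restored = [tuple(pair) for pair in json.load(fin)]

assert restored == tagged_sent
```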



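And if you do stick with the word/POS string format, one workaround for the parsing ambiguity shown earlier is to split each token on its last slash only; a plain-Python sketch with made-up example data:

```python
# A tagged sentence like the ones above (made-up example data)
tagged_sent = [('cat', 'NN'), ('/', 'CD'), ('dog', 'NN')]

tagged_sent_str = ' '.join(word + '/' + pos for word, pos in tagged_sent)
# 'cat/NN //CD dog/NN'

# str.split('/') breaks '//CD' into three pieces, but rsplit with
# maxsplit=1 cuts only at the LAST slash, so the '/' token survives.
parsed = [tuple(wordpos.rsplit('/', 1)) for wordpos in tagged_sent_str.split()]

assert parsed == tagged_sent
```

This still assumes no token contains whitespace, which is part of why pickle is the safer choice for anything serious.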