To download a particular dataset or model, use the nltk.download() function. For example, to download the punkt sentence tokenizer, use:
    $ python3
    >>> import nltk
    >>> nltk.download('punkt')

If you're unsure which data or models you need, you can start with the basic list of data and models:
    >>> import nltk
    >>> nltk.download('popular')

This downloads the list of "popular" resources, which includes:
    <collection id="popular" name="Popular packages">
      <item ref="cmudict" />
      <item ref="gazetteers" />
      <item ref="genesis" />
      <item ref="gutenberg" />
      <item ref="inaugural" />
      <item ref="movie_reviews" />
      <item ref="names" />
      <item ref="shakespeare" />
      <item ref="stopwords" />
      <item ref="treebank" />
      <item ref="twitter_samples" />
      <item ref="omw" />
      <item ref="wordnet" />
      <item ref="wordnet_ic" />
      <item ref="words" />
      <item ref="maxent_ne_chunker" />
      <item ref="punkt" />
      <item ref="snowball_data" />
      <item ref="averaged_perceptron_tagger" />
    </collection>
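The collection above is plain XML, so the package ids it references can be read out with the standard library. The following is only a sketch: it assumes the snippet has been pasted into a string, and `POPULAR_XML` and `collection_refs` are names made up for this illustration, not part of NLTK.

```python
import xml.etree.ElementTree as ET

# The "popular" collection as listed above, pasted in as a string
# (an assumption of this sketch; NLTK normally reads it from its own index).
POPULAR_XML = """
<collection id="popular" name="Popular packages">
  <item ref="cmudict" />
  <item ref="gazetteers" />
  <item ref="genesis" />
  <item ref="gutenberg" />
  <item ref="inaugural" />
  <item ref="movie_reviews" />
  <item ref="names" />
  <item ref="shakespeare" />
  <item ref="stopwords" />
  <item ref="treebank" />
  <item ref="twitter_samples" />
  <item ref="omw" />
  <item ref="wordnet" />
  <item ref="wordnet_ic" />
  <item ref="words" />
  <item ref="maxent_ne_chunker" />
  <item ref="punkt" />
  <item ref="snowball_data" />
  <item ref="averaged_perceptron_tagger" />
</collection>
"""


def collection_refs(xml_text):
    """Return the package ids referenced by a <collection> element."""
    root = ET.fromstring(xml_text.strip())
    return [item.get("ref") for item in root.iter("item")]


popular_ids = collection_refs(POPULAR_XML)
print(len(popular_ids), "packages:", ", ".join(popular_ids))
```

Any one of these ids can also be passed to nltk.download() on its own, for example nltk.download('stopwords'), if the whole collection is more than you need.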
EDITED

If downloading one of the larger datasets through nltk fails with an error, you can work around it as follows (from https://stackoverflow.com/a/38135306/610569):
    $ rm /Users/<your_username>/nltk_data/corpora/panlex_lite.zip
    $ rm -r /Users/<your_username>/nltk_data/corpora/panlex_lite
    $ python
    >>> import nltk
    >>> dler = nltk.downloader.Downloader()
    >>> dler._update_index()
    >>> dler._status_cache['panlex_lite'] = 'installed'  # Trick the index into treating panlex_lite as already installed.
    >>> dler.download('popular')

UPDATED
Since v3.2.5, NLTK gives a more informative error message when a resource cannot be found in nltk_data, for example:
    >>> from nltk import word_tokenize
    >>> word_tokenize('x')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Users/l/alvas/git/nltk/nltk/tokenize/__init__.py", line 128, in word_tokenize
        sentences = [text] if preserve_line else sent_tokenize(text, language)
      File "/Users//alvas/git/nltk/nltk/tokenize/__init__.py", line 94, in sent_tokenize
        tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
      File "/Users/alvas/git/nltk/nltk/data.py", line 820, in load
        opened_resource = _open(resource_url)
      File "/Users/alvas/git/nltk/nltk/data.py", line 938, in _open
        return find(path_, path + ['']).open()
      File "/Users/alvas/git/nltk/nltk/data.py", line 659, in find
        raise LookupError(resource_not_found)
    LookupError:
    **********************************************************************
      Resource punkt not found.
      Please use the NLTK Downloader to obtain the resource:

      >>> import nltk
      >>> nltk.download('punkt')

      Searched in:
        - '/Users/alvas/nltk_data'
        - '/usr/share/nltk_data'
        - '/usr/local/share/nltk_data'
        - '/usr/lib/nltk_data'
        - '/usr/local/lib/nltk_data'
        - ''
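Because the missing resource surfaces as a LookupError, one possible pattern is to look the resource up first and download it only on a miss. This is a minimal sketch rather than anything NLTK ships: `ensure_resource` is a hypothetical helper, and its `find` and `download` parameters exist only so the logic can be tried with stand-ins. When they are omitted it falls back to nltk.data.find and nltk.download.

```python
def ensure_resource(resource_path, package_id, find=None, download=None):
    """Return True if the resource is available, downloading it on a miss.

    resource_path: path inside nltk_data, e.g. 'tokenizers/punkt'
    package_id:    id passed to the downloader, e.g. 'punkt'
    find/download: optional stand-ins (hypothetical hooks for this sketch);
                   default to nltk.data.find and nltk.download.
    """
    if find is None or download is None:
        import nltk  # imported lazily so the stand-ins work without NLTK

        find = find or nltk.data.find
        download = download or nltk.download
    try:
        find(resource_path)  # raises LookupError when the resource is absent
        return True
    except LookupError:
        # nltk.download returns True on success and False otherwise
        return bool(download(package_id))
```

With NLTK installed, calling ensure_resource('tokenizers/punkt', 'punkt') before word_tokenize should avoid the traceback above, assuming the machine can reach the download server.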


