
Does the NLTK words corpus not contain "okay"?

TL;DR

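from nltk.corpus import words
from nltk.corpus import wordnet
# wordnet.words() yields an iterator rather than a list, so wrap it in list() before concatenating
manywords = words.words() + list(wordnet.words())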

In long

According to the documentation, nltk.corpus.words is a list of words taken from the Unix words file: http://en.wikipedia.org/wiki/Words_(Unix)

On Unix, you can do the following:

ls /usr/share/dict/

And read the README file:

$ cd /usr/share/dict/
/usr/share/dict$ cat README
#   @(#)README  8.1 (Berkeley) 6/5/93
# $FreeBSD$

WEB ---- (introduction provided by jaw@riacs) -------------------------

Welcome to web2 (Webster's Second International) all 234,936 words worth.
The 1934 copyright has lapsed, according to the supplier.  The
supplemental 'web2a' list contains hyphenated terms as well as assorted
noun and adverbial phrases.  The wordlist makes a dandy 'grep' victim.

     -- James A. Woods    {ihnp4,hplabs}!ames!jaw    (or jaw@riacs)

Country names are stored in the file /usr/share/misc/iso3166.

FreeBSD Maintenance Notes ---------------------------------------------

Note that FreeBSD is not maintaining a historical document, we're
maintaining a list of current [American] English spellings.

A few words have been removed because their spellings have depreciated.
This list of words includes:
    corelation (and its derivatives)    "correlation" is the preferred spelling
    freen                               typographical error in original file
    freend                              archaic spelling no longer in use;
                                        masks common typo in modern text

--

A list of technical terms has been added in the file 'freebsd'.  This
word list contains FreeBSD/Unix lexicon that is used by the system
documentation.  It makes a great ispell(1) personal dictionary to
supplement the standard English language dictionary.

Since it is a fixed list of 234,936 words, there are bound to be words that do not appear in it.
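To check this directly against the system word list, here is a minimal sketch that reads a one-word-per-line file and tests membership. The helper names `load_wordlist` and `is_known` are hypothetical, and the default path `/usr/share/dict/words` is an assumption: the file's location and contents differ between systems.

```python
from pathlib import Path


def load_wordlist(path="/usr/share/dict/words"):
    """Read a one-word-per-line file into a set of lowercased words.

    The default path is an assumption; adjust it for your system.
    """
    text = Path(path).read_text(encoding="utf-8", errors="ignore")
    # Skip blank lines and normalise case so lookups are case-insensitive.
    return {line.strip().lower() for line in text.splitlines() if line.strip()}


def is_known(word, vocabulary):
    """Case-insensitive membership test against a loaded vocabulary."""
    return word.lower() in vocabulary
```

Calling `is_known("okay", load_wordlist())` then reports whether that particular file has the word. The answer depends on the file installed on your machine.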

If you need to extend your word list, you can add the words from WordNet to it using:

nltk.corpus.wordnet.words()
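As a rough sketch of that extension step, the two sources can be merged into a single set, which removes duplicates and makes lookups fast. The function names `merge_vocab` and `load_nltk_vocab` are hypothetical. The second function assumes NLTK is installed and that the `words` and `wordnet` corpora have already been downloaded.

```python
def merge_vocab(*wordlists):
    """Union several iterables of words into one set of lowercased words."""
    merged = set()
    for wordlist in wordlists:
        merged.update(w.lower() for w in wordlist)
    return merged


def load_nltk_vocab():
    """Combine the NLTK `words` and `wordnet` corpora into one vocabulary.

    Assumes both corpora were fetched beforehand, for example with
    nltk.download("words") and nltk.download("wordnet").
    """
    from nltk.corpus import words, wordnet

    # merge_vocab accepts any iterable, so the iterator from wordnet.words() is fine.
    return merge_vocab(words.words(), wordnet.words())
```

WordNet lemma names join multi-word expressions with underscores (for example `ice_cream`), so the merged set will contain such entries alongside single words.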

Most likely, what you need is a sufficiently large text corpus, e.g. a Wikipedia dump, which you then tokenize to extract all the unique words.
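The tokenize-and-collect step might look like the sketch below. It assumes the corpus has already been converted to plain-text lines (extracting text from a Wikipedia dump is a separate step not shown here). It uses a simple regular expression as a stand-in tokenizer, and a real tokenizer such as NLTK's could be substituted. The names `TOKEN_RE` and `unique_words` are hypothetical.

```python
import re
from collections import Counter

# Stand-in tokenizer: runs of ASCII letters, optionally with one internal apostrophe.
TOKEN_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")


def unique_words(lines):
    """Count lowercased tokens across an iterable of text lines.

    The keys of the returned Counter are the unique words; the counts
    make it possible to filter out rare tokens that are likely typos.
    """
    counts = Counter()
    for line in lines:
        counts.update(tok.lower() for tok in TOKEN_RE.findall(line))
    return counts
```

Because `unique_words` consumes lines one at a time, it can be fed an open file object, so a large corpus does not have to fit in memory at once.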


