Python NLTK: lemmatizing the word "further" with WordNet

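WordNetLemmatizer uses the ._morphy function to look up a word's possible lemmas and returns the shortest one, falling back to the word itself when none are found (from http://www.nltk.org/_modules/nltk/stem/wordnet.html):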

    def lemmatize(self, word, pos=NOUN):
        lemmas = wordnet._morphy(word, pos)
        return min(lemmas, key=len) if lemmas else word
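The selection step can be tried in isolation. The following is a minimal sketch, assuming hypothetical candidate lists in place of a real call to wordnet._morphy; pick_lemma is an illustrative name, not part of NLTK.

```python
def pick_lemma(word, candidates):
    """Return the shortest candidate, or the word itself if there are none."""
    return min(candidates, key=len) if candidates else word


# Hypothetical candidate lists, standing in for what _morphy might return.
print(pick_lemma("further", ["further", "far"]))  # far
print(pick_lemma("qwertyish", []))                # qwertyish
```

When two candidates have the same length, min keeps the first one it encounters, so the order of the candidate list matters.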

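The ._morphy function applies rules iteratively to obtain a lemma. The rules keep shortening the word by replacing its suffix according to MORPHOLOGICAL_SUBSTITUTIONS. It then checks whether any of the shortened forms is itself a word known to WordNet: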

    def _morphy(self, form, pos):
        # from jordanbg:
        # Given an original string x
        # 1. Apply rules once to the input to get y1, y2, y3, etc.
        # 2. Return all that are in the database
        # 3. If there are no matches, keep applying rules until you either
        #    find a match or you can't go any further

        exceptions = self._exception_map[pos]
        substitutions = self.MORPHOLOGICAL_SUBSTITUTIONS[pos]

        def apply_rules(forms):
            return [form[:-len(old)] + new
                    for form in forms
                    for old, new in substitutions
                    if form.endswith(old)]

        def filter_forms(forms):
            result = []
            seen = set()
            for form in forms:
                if form in self._lemma_pos_offset_map:
                    if pos in self._lemma_pos_offset_map[form]:
                        if form not in seen:
                            result.append(form)
                            seen.add(form)
            return result

        # 0. Check the exception lists
        if form in exceptions:
            return filter_forms([form] + exceptions[form])

        # 1. Apply rules once to the input to get y1, y2, y3, etc.
        forms = apply_rules([form])

        # 2. Return all that are in the database (and check the original too)
        results = filter_forms([form] + forms)
        if results:
            return results

        # 3. If there are no matches, keep applying rules until we find a match
        while forms:
            forms = apply_rules(forms)
            results = filter_forms(forms)
            if results:
                return results

        # Return an empty list if we can't find anything
        return []
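To see the rule loop at work without installing the WordNet data, here is a standalone sketch of steps 1 to 3 only. LEXICON and RULES are toy stand-ins chosen for illustration, not NLTK's actual tables, and rule_only_morphy is a hypothetical name; the exception lookup (step 0) is deliberately left out.

```python
# Toy stand-ins, not NLTK's data: a tiny lexicon and a few suffix rules.
LEXICON = {"nice", "wry", "big", "further"}
RULES = [("er", ""), ("est", ""), ("er", "e"), ("est", "e")]


def apply_rules(forms):
    """Apply every matching suffix rule once to every form."""
    return [form[:-len(old)] + new
            for form in forms
            for old, new in RULES
            if form.endswith(old)]


def filter_forms(forms):
    """Keep forms found in the lexicon, preserving order and dropping repeats."""
    result, seen = [], set()
    for form in forms:
        if form in LEXICON and form not in seen:
            result.append(form)
            seen.add(form)
    return result


def rule_only_morphy(form):
    """Steps 1 to 3 of the algorithm; the exception lookup (step 0) is left out."""
    forms = apply_rules([form])
    results = filter_forms([form] + forms)
    if results:
        return results
    while forms:
        forms = apply_rules(forms)
        results = filter_forms(forms)
        if results:
            return results
    return []


print(rule_only_morphy("nicest"))   # ['nice']
print(rule_only_morphy("biggest"))  # []
print(rule_only_morphy("further"))  # ['further']
```

With these toy rules, "biggest" only yields "bigg" and "bigge", neither of which is in the lexicon, so suffix stripping alone finds nothing; irregular or consonant-doubling forms need another mechanism.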

However, if the word is in the exception list, the result is a fixed value stored in exceptions rather than one derived by the rules; see _load_exception_map in http://www.nltk.org/_modules/nltk/corpus/reader/wordnet.html:

    def _load_exception_map(self):
        # load the exception file data into memory
        for pos, suffix in self._FILEMAP.items():
            self._exception_map[pos] = {}
            for line in self.open('%s.exc' % suffix):
                terms = line.split()
                self._exception_map[pos][terms[0]] = terms[1:]
        self._exception_map[ADJ_SAT] = self._exception_map[ADJ]
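The parsing step is simple enough to reproduce on its own. This sketch assumes the exception data arrives as an iterable of text lines, each holding an inflected form followed by one or more base forms; parse_exceptions is a hypothetical helper, and SAMPLE holds a few adverb pairs typed in by hand rather than read from disk.

```python
def parse_exceptions(lines):
    """Map each inflected form to its list of base forms, one entry per line."""
    mapping = {}
    for line in lines:
        terms = line.split()
        if terms:  # skip blank lines
            mapping[terms[0]] = terms[1:]
    return mapping


# A few adverb pairs typed in by hand for illustration.
SAMPLE = ["best well", "better well", "further far", "harder hard"]
adv = parse_exceptions(SAMPLE)
print(adv["further"])  # ['far']
```

Because the value is a list, one inflected form can map to several base forms when a line carries more than two terms.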

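Going back to your example, worse -> bad and further -> far cannot be derived from the rules, so they must come from the exception list. Because it is an exception list, some inconsistencies are bound to occur.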
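A quick way to convince yourself of this is to enumerate everything the suffix rules can produce. The sketch below assumes a small illustrative rule table (RULES) and a hand-written EXCEPTIONS dict holding only the two pairs discussed here; neither is loaded from NLTK, and reachable is a hypothetical helper.

```python
RULES = [("er", ""), ("est", ""), ("er", "e"), ("est", "e")]  # illustrative suffix rules
EXCEPTIONS = {"worse": ["bad"], "further": ["far"]}  # the two pairs discussed above


def reachable(word):
    """Collect every form obtainable by applying the suffix rules repeatedly."""
    found, frontier = set(), [word]
    while frontier:
        # Each rule strictly shortens a form, so the frontier eventually empties.
        frontier = [form[:-len(old)] + new
                    for form in frontier
                    for old, new in RULES
                    if form.endswith(old)]
        found.update(frontier)
    return found


for word, bases in EXCEPTIONS.items():
    base = bases[0]
    print(word, "->", base, "| reachable by rules:", base in reachable(word))
```

Under these assumptions both checks come out False: no amount of suffix stripping turns "further" into "far" or "worse" into "bad", which is why a lookup table is needed for such forms.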

The exception lists are kept in ~/nltk_data/corpora/wordnet/adv.exc and ~/nltk_data/corpora/wordnet/adj.exc.

From adv.exc:

    best well
    better well
    deeper deeply
    farther far
    further far
    harder hard
    hardest hard

From adj.exc:

    ...
    worldliest worldly
    wormier wormy
    wormiest wormy
    worse bad
    worst bad
    worthier worthy
    worthiest worthy
    wrier wry
    ...
