Python NLTK: lemmatization of the word "further" with WordNet

WordNetLemmatizer uses the ._morphy function to look up a word's candidate lemmas (see http://www.nltk.org/_modules/nltk/stem/wordnet.html) and returns the shortest of them.

    def lemmatize(self, word, pos=NOUN):
        lemmas = wordnet._morphy(word, pos)
        return min(lemmas, key=len) if lemmas else word
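The selection step can be tried in isolation. The following is a minimal sketch, not NLTK code: `pick_lemma` is a hypothetical helper, and the candidate lists are hand-written stand-ins for what `_morphy` might return.

```python
def pick_lemma(word, candidates):
    """Mirror the return expression of lemmatize:
    the shortest candidate if there is one, otherwise the word unchanged."""
    return min(candidates, key=len) if candidates else word


# Hand-written candidate lists (assumptions, not real WordNet output).
print(pick_lemma("further", ["further", "far"]))  # far
print(pick_lemma("xyzzy", []))                    # xyzzy
```

Note that when two candidates have the same length, `min` keeps the first one it sees, so the order of the candidate list matters.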

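The ._morphy function applies rules iteratively to obtain a lemma. Each rule shortens the word by replacing a suffix according to MORPHOLOGICAL_SUBSTITUTIONS. The function then checks whether any of the shortened forms is itself a known word in WordNet: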

    def _morphy(self, form, pos):
        # from jordanbg:
        # Given an original string x
        # 1. Apply rules once to the input to get y1, y2, y3, etc.
        # 2. Return all that are in the database
        # 3. If there are no matches, keep applying rules until you either
        #    find a match or you can't go any further

        exceptions = self._exception_map[pos]
        substitutions = self.MORPHOLOGICAL_SUBSTITUTIONS[pos]

        def apply_rules(forms):
            return [form[:-len(old)] + new
                    for form in forms
                    for old, new in substitutions
                    if form.endswith(old)]

        def filter_forms(forms):
            result = []
            seen = set()
            for form in forms:
                if form in self._lemma_pos_offset_map:
                    if pos in self._lemma_pos_offset_map[form]:
                        if form not in seen:
                            result.append(form)
                            seen.add(form)
            return result

        # 0. Check the exception lists
        if form in exceptions:
            return filter_forms([form] + exceptions[form])

        # 1. Apply rules once to the input to get y1, y2, y3, etc.
        forms = apply_rules([form])

        # 2. Return all that are in the database (and check the original too)
        results = filter_forms([form] + forms)
        if results:
            return results

        # 3. If there are no matches, keep applying rules until we find a match
        while forms:
            forms = apply_rules(forms)
            results = filter_forms(forms)
            if results:
                return results

        # Return an empty list if we can't find anything
        return []
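To make the control flow easier to follow, here is a self-contained toy version of the same steps. It is only a sketch: `toy_morphy`, the substitution rules, the vocabulary and the exception table are all made-up stand-ins, much smaller than the real WordNet data, and the part-of-speech handling is left out.

```python
# Toy stand-ins for the WordNet data; every value here is an assumption.
SUBSTITUTIONS = [("s", ""), ("ies", "y"), ("er", ""), ("est", "")]
KNOWN_LEMMAS = {"dog", "pony", "hard", "far", "bad"}
EXCEPTIONS = {"further": ["far"], "worse": ["bad"]}


def toy_morphy(form, exceptions=EXCEPTIONS):
    """Return candidate lemmas for form, following the steps of _morphy."""

    def apply_rules(forms):
        # Replace every matching suffix in every form.
        return [f[:-len(old)] + new
                for f in forms
                for old, new in SUBSTITUTIONS
                if f.endswith(old)]

    def filter_forms(forms):
        # Keep known lemmas only, without duplicates, in order.
        seen, result = set(), []
        for f in forms:
            if f in KNOWN_LEMMAS and f not in seen:
                seen.add(f)
                result.append(f)
        return result

    # 0. An exception entry bypasses the suffix rules entirely.
    if form in exceptions:
        return filter_forms([form] + exceptions[form])

    # 1-2. Apply the rules once and keep whatever is a known lemma.
    forms = apply_rules([form])
    results = filter_forms([form] + forms)
    if results:
        return results

    # 3. Keep applying rules until a match appears or nothing is left.
    while forms:
        forms = apply_rules(forms)
        results = filter_forms(forms)
        if results:
            return results
    return []


print(toy_morphy("ponies"))                  # ['pony']
print(toy_morphy("further"))                 # ['far']
print(toy_morphy("further", exceptions={}))  # []
```

With the exception table emptied, "further" is reduced to "furth" by the "er" rule, no further rule applies, and nothing is returned. In this toy setup the suffix rules alone never reach "far".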

However, if the word is in the exception list, ._morphy returns the fixed value stored for it in exceptions; see _load_exception_map in http://www.nltk.org/_modules/nltk/corpus/reader/wordnet.html:

    def _load_exception_map(self):
        # load the exception file data into memory
        for pos, suffix in self._FILEMAP.items():
            self._exception_map[pos] = {}
            for line in self.open('%s.exc' % suffix):
                terms = line.split()
                self._exception_map[pos][terms[0]] = terms[1:]
        self._exception_map[ADJ_SAT] = self._exception_map[ADJ]
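The parsing itself is simple: the first token on each line is the inflected form and the remaining tokens are its lemmas. A minimal sketch of that step follows, reading from an in-memory sample rather than the real file. `load_exceptions` is a hypothetical helper, and the blank-line check is an addition that the NLTK loader shown above does not have.

```python
import io

# A few lines in the .exc format: "<inflected form> <lemma> [<lemma> ...]".
SAMPLE_EXC = """\
best well
better well
farther far
further far
"""


def load_exceptions(lines):
    """Build {inflected form: [lemmas]} from an iterable of .exc lines."""
    mapping = {}
    for line in lines:
        terms = line.split()
        if not terms:  # skip blank lines (not done by the NLTK loader)
            continue
        mapping[terms[0]] = terms[1:]
    return mapping


adv_exceptions = load_exceptions(io.StringIO(SAMPLE_EXC))
print(adv_exceptions["further"])  # ['far']
```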

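Going back to your example, worse -> bad and further -> far cannot be derived from the rules, so they must come from the exception list. And because it is an exception list, inconsistencies are bound to occur.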

The exception lists are kept in ~/nltk_data/corpora/wordnet/adj.exc and ~/nltk_data/corpora/wordnet/adv.exc.

From adv.exc:

    best well
    better well
    deeper deeply
    farther far
    further far
    harder hard
    hardest hard

From adj.exc:

    ...
    worldliest worldly
    wormier wormy
    wormiest wormy
    worse bad
    worst bad
    worthier worthy
    worthiest worthy
    wrier wry
    ...
