`WordNetLemmatizer` uses the `._morphy` function to look up a word's lemmas (source: http://www.nltk.org/_modules/nltk/stem/wordnet.html) and returns the shortest of the possible lemmas:

    def lemmatize(self, word, pos=NOUN):
        lemmas = wordnet._morphy(word, pos)
        return min(lemmas, key=len) if lemmas else word

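The selection step in `lemmatize` can be tried without NLTK installed. The following is a minimal sketch: `pick_shortest` is a hypothetical helper name, and the candidate lists are made-up stand-ins for what `_morphy` might return, not actual WordNet output.

```python
def pick_shortest(candidates, word):
    """Mirror the last line of lemmatize: shortest candidate, else the word itself."""
    return min(candidates, key=len) if candidates else word


# Hypothetical candidate lists, standing in for the return value of _morphy.
print(pick_shortest(["further", "far"], "further"))  # far
print(pick_shortest([], "xyzzy"))                    # xyzzy (no candidates: word is returned unchanged)
print(pick_shortest(["axes", "axis"], "axes"))       # axes (on a tie, min keeps the first item)
```

Note that ties are broken by list order, so the order in which candidates are produced can decide which lemma is returned.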
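The `._morphy` function applies rules iteratively to get a lemma. The rules keep shortening the word by substituting its suffixes according to `MORPHOLOGICAL_SUBSTITUTIONS`. It then checks whether any of the shortened forms exists in the WordNet database: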

    def _morphy(self, form, pos):
        # from jordanbg:
        # Given an original string x
        # 1. Apply rules once to the input to get y1, y2, y3, etc.
        # 2. Return all that are in the database
        # 3. If there are no matches, keep applying rules until you either
        #    find a match or you can't go any further

        exceptions = self._exception_map[pos]
        substitutions = self.MORPHOLOGICAL_SUBSTITUTIONS[pos]

        def apply_rules(forms):
            return [form[:-len(old)] + new
                    for form in forms
                    for old, new in substitutions
                    if form.endswith(old)]

        def filter_forms(forms):
            result = []
            seen = set()
            for form in forms:
                if form in self._lemma_pos_offset_map:
                    if pos in self._lemma_pos_offset_map[form]:
                        if form not in seen:
                            result.append(form)
                            seen.add(form)
            return result

        # 0. Check the exception lists
        if form in exceptions:
            return filter_forms([form] + exceptions[form])

        # 1. Apply rules once to the input to get y1, y2, y3, etc.
        forms = apply_rules([form])

        # 2. Return all that are in the database (and check the original too)
        results = filter_forms([form] + forms)
        if results:
            return results

        # 3. If there are no matches, keep applying rules until we find a match
        while forms:
            forms = apply_rules(forms)
            results = filter_forms(forms)
            if results:
                return results

        # Return an empty list if we can't find anything
        return []

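To see the rule-based path in isolation, here is a simplified, self-contained sketch of steps 1 to 3. The names `SUBSTITUTIONS`, `KNOWN` and `toy_morphy` are hypothetical: the suffix rules are an illustrative assumption, and `KNOWN` is a tiny stand-in for the WordNet lemma database, so the results say nothing about what NLTK itself returns.

```python
# Assumed, illustrative suffix rules as (old suffix, replacement) pairs.
SUBSTITUTIONS = [("er", ""), ("est", ""), ("er", "e"), ("est", "e")]

# Toy stand-in for the WordNet lemma database.
KNOWN = {"hard", "nice", "bad"}


def apply_rules(forms):
    """Apply every matching suffix rule once to every form."""
    return [form[:-len(old)] + new
            for form in forms
            for old, new in SUBSTITUTIONS
            if form.endswith(old)]


def filter_forms(forms):
    """Keep forms found in the toy database, without duplicates, in order."""
    result, seen = [], set()
    for form in forms:
        if form in KNOWN and form not in seen:
            result.append(form)
            seen.add(form)
    return result


def toy_morphy(form):
    """Rule-based lookup only; the exception-list check is left out on purpose."""
    forms = apply_rules([form])
    results = filter_forms([form] + forms)
    if results:
        return results
    while forms:  # every rule shortens the form, so this loop terminates
        forms = apply_rules(forms)
        results = filter_forms(forms)
        if results:
            return results
    return []


print(toy_morphy("harder"))  # ['hard']
print(toy_morphy("nicest"))  # ['nice']
print(toy_morphy("worse"))   # [] (no suffix rule can turn "worse" into "bad")
```

The last call shows why suffix rules alone cannot handle irregular forms: "worse" matches no rule, so nothing is ever generated that could reach "bad".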
However, if the word is in the exception list, `_morphy` returns the fixed values stored for it in `exceptions`. See `_load_exception_map` in http://www.nltk.org/_modules/nltk/corpus/reader/wordnet.html:

    def _load_exception_map(self):
        # load the exception file data into memory
        for pos, suffix in self._FILEMAP.items():
            self._exception_map[pos] = {}
            for line in self.open('%s.exc' % suffix):
                terms = line.split()
                self._exception_map[pos][terms[0]] = terms[1:]
        self._exception_map[ADJ_SAT] = self._exception_map[ADJ]

Going back to your example, `worse` -> `bad` and `further` -> `far` cannot be produced by the rules, so they must come from the exception list. And since it is a hand-maintained list of exceptions, there are bound to be inconsistencies.
The exception lists are kept in `~/nltk_data/corpora/wordnet/adj.exc` and `~/nltk_data/corpora/wordnet/adv.exc`.
From `adv.exc`:

    best well
    better well
    deeper deeply
    farther far
    further far
    harder hard
    hardest hard

From `adj.exc`:

    ...
    worldliest worldly
    wormier wormy
    wormiest wormy
    worse bad
    worst bad
    worthier worthy
    worthiest worthy
    wrier wry
    ...
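The excerpts above can be parsed the same way `_load_exception_map` does it, by splitting each line into an inflected form followed by its base forms. This is a rough sketch over in-memory strings rather than the real files; `load_exception_map`, `exception_candidates` and the `"adv"`/`"adj"` keys are hypothetical names chosen for illustration, and the adjective text is only a few of the lines quoted above.

```python
# A few lines copied from the excerpts above (not the full files).
ADV_EXC = """best well
better well
deeper deeply
farther far
further far
harder hard
hardest hard"""

ADJ_EXC = """worse bad
worst bad
worthier worthy"""


def load_exception_map(text):
    """Map each inflected form (first term) to its list of base forms."""
    mapping = {}
    for line in text.splitlines():
        terms = line.split()
        if terms:  # skip blank lines
            mapping[terms[0]] = terms[1:]
    return mapping


exceptions = {"adv": load_exception_map(ADV_EXC),
              "adj": load_exception_map(ADJ_EXC)}


def exception_candidates(form, pos):
    """Candidates handed to filter_forms in step 0 when form is an exception."""
    table = exceptions[pos]
    return [form] + table[form] if form in table else []


print(exception_candidates("further", "adv"))  # ['further', 'far']
print(exception_candidates("worse", "adj"))    # ['worse', 'bad']
print(exception_candidates("worse", "adv"))    # []
```

The last two calls illustrate that the result depends on the part of speech: in these excerpts "worse" is listed as an adjective exception but not as an adverb one, so the `pos` argument decides whether the exception entry is used at all.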



