Unsurprisingly, NLTK is slow:
>>> tfidf = StemmedTfidfVectorizer(min_df=1, stop_words='english', analyzer='word', ngram_range=(1,1))
>>> %timeit tfidf.fit_transform(X_train)
1 loops, best of 3: 4.89 s per loop
>>> tfidf = TfidfVectorizer(min_df=1, stop_words='english', analyzer='word', ngram_range=(1,1))
>>> %timeit tfidf.fit_transform(X_train)
1 loops, best of 3: 415 ms per loop
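`%timeit` is an IPython magic, so the measurement above only works inside IPython. Outside it, a comparable best-of-3 timing can be sketched with the standard `timeit` module. In the sketch below, `fit_transform_stub` and `best_of` are hypothetical names, and the stub is only a stand-in for `tfidf.fit_transform(X_train)`; swap in the real call to measure the vectorizers.

```python
import timeit


def fit_transform_stub():
    # Hypothetical stand-in for tfidf.fit_transform(X_train).
    return sorted(str(i) for i in range(1000))


def best_of(func, repeat=3, number=1):
    """Return the fastest per-call time in seconds over `repeat` runs."""
    timings = timeit.repeat(func, repeat=repeat, number=number)
    return min(timings) / number


elapsed = best_of(fit_transform_stub)
print("best of 3: %.3f ms per loop" % (elapsed * 1e3))
```

Taking the minimum rather than the mean is the same choice `%timeit` makes when it reports "best of 3": the fastest run is the one least disturbed by other activity on the machine.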
You can speed this up by using a smarter implementation of the Snowball stemmer, such as PyStemmer:
>>> import Stemmer
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> english_stemmer = Stemmer.Stemmer('en')
>>> class StemmedTfidfVectorizer(TfidfVectorizer):
...     def build_analyzer(self):
...         # Reuse the stock analyzer (tokenizing, stop words), then stem each token.
...         analyzer = super(StemmedTfidfVectorizer, self).build_analyzer()
...         return lambda doc: english_stemmer.stemWords(analyzer(doc))
...
>>> tfidf = StemmedTfidfVectorizer(min_df=1, stop_words='english', analyzer='word', ngram_range=(1,1))
>>> %timeit tfidf.fit_transform(X_train)
1 loops, best of 3: 650 ms per loop

NLTK is a teaching toolkit. It is slow by design, because it is optimized for readability.
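To make the `build_analyzer` override easier to follow, here is a dependency-free sketch of the same composition: tokenize first, then stem every token. Everything in it is an assumption for illustration. `basic_analyzer` is a simplified stand-in for the word analyzer (no stop-word filtering), and `toy_stem` is a deliberately crude suffix stripper, not Snowball or PyStemmer.

```python
import re


def basic_analyzer(doc):
    # Simplified stand-in for the word analyzer: lowercase the text,
    # then keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", doc.lower())


def toy_stem(word):
    # Hypothetical, deliberately crude suffix stripper; NOT Snowball.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def stemmed_analyzer(doc):
    # Same shape as the lambda returned by build_analyzer:
    # run the base analyzer, then stem each resulting token.
    return [toy_stem(token) for token in basic_analyzer(doc)]


tokens = stemmed_analyzer("Parsing parsed parsers")
print(tokens)  # ['pars', 'pars', 'parser']
```

Because the stemmer runs once per token, its per-call cost dominates the total time, which is why replacing the stemmer implementation changes the timing so much.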



