I'm afraid the matrix is simply too large. That would be 96582 * 96582 = 9,328,082,724 cells. Try slicing titles_tfidf a bit and check whether the error goes away.
Source: http://scipy-user.10969.n7.nabble.com/SciPy-User-strange-error-when-creating-csr-matrix-td20129.html
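To make "slicing" concrete: one way to avoid materialising the full 96582 x 96582 dense matrix is to compare one chunk of rows at a time. This is only a sketch; `top_matches_in_chunks` and `chunk_size` are names I made up for illustration, and the random matrices stand in for your tf-idf outputs.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.metrics.pairwise import cosine_similarity

def top_matches_in_chunks(a, b, chunk_size=1000):
    # Compare rows of `a` against all rows of `b` one slice at a time,
    # so only a (chunk_size x b.shape[0]) dense block exists in memory
    # at once, instead of the full a.shape[0] x b.shape[0] matrix.
    best = np.empty(a.shape[0], dtype=int)
    for start in range(0, a.shape[0], chunk_size):
        sims = cosine_similarity(a[start:start + chunk_size], b)
        best[start:start + sims.shape[0]] = sims.argmax(axis=1)
    return best

# Small random stand-ins; in your case these would be titles_tfidf
# and plaintexts_tfidf.
a = sparse_random(50, 20, density=0.3, format='csr', random_state=0)
b = sparse_random(40, 20, density=0.3, format='csr', random_state=1)
print(top_matches_in_chunks(a, b, chunk_size=16).shape)  # (50,)
```

Each iteration only densifies a `chunk_size x b.shape[0]` block, so memory stays bounded no matter how many rows `a` has.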
EDIT: If you are on an older SciPy/NumPy version, you may need to update: https://github.com/scipy/scipy/pull/4678
EDIT2: Also, if you are using 32-bit Python, switching to 64-bit might help (I suppose).
EDIT3: To answer your original question: when you use the vocabulary built from plaintexts, new words that appear only in titles are ignored - but they do not affect the tf-idf values of the known words. Hopefully this snippet makes it easier to understand:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

plaintexts = ["They are", "plain texts texts amoersand here"]
titles = ["And here", "titles ", "wolf dog eagle", "But here plain"]

vectorizer = TfidfVectorizer()
plaintexts_tfidf = vectorizer.fit_transform(plaintexts)
vocab = vectorizer.vocabulary_

vectorizer = TfidfVectorizer(vocabulary=vocab)
titles_tfidf = vectorizer.fit_transform(titles)
print('values using vocabulary')
print(titles_tfidf)
print(vectorizer.get_feature_names())

print('Brand new vectorizer')
vectorizer = TfidfVectorizer()
titles_tfidf = vectorizer.fit_transform(titles)
print(titles_tfidf)
print(vectorizer.get_feature_names())

The result is:
values using vocabulary
  (0, 2)	1.0
  (3, 3)	0.78528827571
  (3, 2)	0.61913029649
['amoersand', 'are', 'here', 'plain', 'texts', 'they']
Brand new vectorizer
  (0, 0)	0.78528827571
  (0, 4)	0.61913029649
  (1, 6)	1.0
  (2, 7)	0.57735026919
  (2, 2)	0.57735026919
  (2, 3)	0.57735026919
  (3, 4)	0.486934264074
  (3, 1)	0.617614370976
  (3, 5)	0.617614370976
['and', 'but', 'dog', 'eagle', 'here', 'plain', 'titles', 'wolf']
Note that this is not the same as removing the words that do not occur in plaintexts from titles.
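To see why the two approaches differ, here is a small sketch on the same toy data (the filtering expression is my own illustration, not something from the original code): filtering the titles first and refitting produces a matrix whose columns no longer line up with plaintexts_tfidf.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

plaintexts = ["They are", "plain texts texts amoersand here"]
titles = ["And here", "titles ", "wolf dog eagle", "But here plain"]

# Approach 1: keep the plaintexts vocabulary; unknown title words are
# ignored, but every vocabulary word keeps its column index.
vocab = TfidfVectorizer().fit(plaintexts).vocabulary_
fixed = TfidfVectorizer(vocabulary=vocab).fit_transform(titles)

# Approach 2: physically delete unknown words from the titles, then
# fit a brand-new vectorizer on what remains.
filtered = [" ".join(w for w in t.split() if w.lower() in vocab)
            for t in titles]
fresh = TfidfVectorizer().fit_transform(filtered)

# The fixed-vocabulary matrix keeps all 6 plaintext columns, so it is
# directly comparable to plaintexts_tfidf; the fresh fit only sees the
# 2 surviving words and produces an incompatible column layout.
print(fixed.shape)  # (4, 6)
print(fresh.shape)  # (4, 2)
```

Only the fixed-vocabulary matrix can be fed to cosine_similarity together with plaintexts_tfidf, because the two matrices share the same columns.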



