I'm afraid the matrix is simply too large. That would be 96582 * 96582 = 9,328,082,724 cells. Try slicing titles_tfidf a bit and check whether the error goes away.
Source: http://scipy-user.10969.n7.nabble.com/SciPy-User-strange-error-when-creating-csr-matrix-td20129.html
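To make "slicing" concrete: one way to avoid materialising the full 96582 x 96582 dense matrix is to compare one chunk of rows at a time. This is only a sketch; `top_matches_in_chunks` and `chunk_size` are names I made up for illustration, and the random matrices stand in for your tf-idf outputs.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.metrics.pairwise import cosine_similarity

def top_matches_in_chunks(a, b, chunk_size=1000):
    # Compare rows of `a` against all rows of `b` one slice at a time,
    # so only a (chunk_size x b.shape[0]) dense block exists in memory
    # at once, instead of the full a.shape[0] x b.shape[0] matrix.
    best = np.empty(a.shape[0], dtype=int)
    for start in range(0, a.shape[0], chunk_size):
        sims = cosine_similarity(a[start:start + chunk_size], b)
        best[start:start + sims.shape[0]] = sims.argmax(axis=1)
    return best

# Small random stand-ins; in your case these would be titles_tfidf
# and plaintexts_tfidf.
a = sparse_random(50, 20, density=0.3, format='csr', random_state=0)
b = sparse_random(40, 20, density=0.3, format='csr', random_state=1)
print(top_matches_in_chunks(a, b, chunk_size=16).shape)  # (50,)
```

Each iteration only densifies a `chunk_size x b.shape[0]` block, so memory stays bounded no matter how many rows `a` has.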
EDIT: If you are on an older SciPy/NumPy version, you may need to update: https://github.com/scipy/scipy/pull/4678
EDIT2: Also, if you are using 32-bit Python, switching to 64-bit might help (I suppose).
EDIT3: To answer your original question: when you use the vocabulary built from plaintexts, new words that appear only in titles are ignored - but they do not affect the tf-idf values of the known words. Hopefully this snippet makes it easier to understand:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

plaintexts = ["They are", "plain texts texts amoersand here"]
titles = ["And here", "titles ", "wolf dog eagle", "But here plain"]

vectorizer = TfidfVectorizer()
plaintexts_tfidf = vectorizer.fit_transform(plaintexts)
vocab = vectorizer.vocabulary_

vectorizer = TfidfVectorizer(vocabulary=vocab)
titles_tfidf = vectorizer.fit_transform(titles)
print('values using vocabulary')
print(titles_tfidf)
print(vectorizer.get_feature_names())

print('Brand new vectorizer')
vectorizer = TfidfVectorizer()
titles_tfidf = vectorizer.fit_transform(titles)
print(titles_tfidf)
print(vectorizer.get_feature_names())

The result is:
values using vocabulary
  (0, 2)	1.0
  (3, 3)	0.78528827571
  (3, 2)	0.61913029649
['amoersand', 'are', 'here', 'plain', 'texts', 'they']
Brand new vectorizer
  (0, 0)	0.78528827571
  (0, 4)	0.61913029649
  (1, 6)	1.0
  (2, 7)	0.57735026919
  (2, 2)	0.57735026919
  (2, 3)	0.57735026919
  (3, 4)	0.486934264074
  (3, 1)	0.617614370976
  (3, 5)	0.617614370976
['and', 'but', 'dog', 'eagle', 'here', 'plain', 'titles', 'wolf']
Note that this is not the same as removing the words that do not occur in plaintexts from titles.
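To see why the two approaches differ, here is a small sketch on the same toy data (the filtering expression is my own illustration, not something from the original code): filtering the titles first and refitting produces a matrix whose columns no longer line up with plaintexts_tfidf.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

plaintexts = ["They are", "plain texts texts amoersand here"]
titles = ["And here", "titles ", "wolf dog eagle", "But here plain"]

# Approach 1: keep the plaintexts vocabulary; unknown title words are
# ignored, but every vocabulary word keeps its column index.
vocab = TfidfVectorizer().fit(plaintexts).vocabulary_
fixed = TfidfVectorizer(vocabulary=vocab).fit_transform(titles)

# Approach 2: physically delete unknown words from the titles, then
# fit a brand-new vectorizer on what remains.
filtered = [" ".join(w for w in t.split() if w.lower() in vocab)
            for t in titles]
fresh = TfidfVectorizer().fit_transform(filtered)

# The fixed-vocabulary matrix keeps all 6 plaintext columns, so it is
# directly comparable to plaintexts_tfidf; the fresh fit only sees the
# 2 surviving words and produces an incompatible column layout.
print(fixed.shape)  # (4, 6)
print(fresh.shape)  # (4, 2)
```

Only the fixed-vocabulary matrix can be fed to cosine_similarity together with plaintexts_tfidf, because the two matrices share the same columns.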



