TfidfVectorizer() adds smoothing to the document counts and applies l2 normalization to the resulting tf-idf vectors, as mentioned in the documentation:

(count of the char in the doc) / (number of chars in the doc) * (log((1 + number of docs) / (1 + number of docs containing the char)) + 1)
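As a quick numeric sketch of that formula (the helper `tfidf_score` and the sample numbers here are illustrative, not taken from the article):

```python
import numpy as np

def tfidf_score(char_count, doc_len, n_docs, docs_with_char):
    # tf: count of the char in the doc / number of chars in the doc
    tf = char_count / doc_len
    # smoothed idf: log((1 + #docs) / (1 + #docs containing the char)) + 1
    idf = np.log((1 + n_docs) / (1 + docs_with_char)) + 1
    return tf * idf

# a char appearing 5 times in a 26-char doc, present in all 4 docs:
# idf collapses to log(5/5) + 1 = 1, so the score is just tf = 5/26
print(tfidf_score(5, 26, 4, 4))
```

Note that when a char appears in every document, the smoothed idf is exactly 1, so the raw score is just the term frequency.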
This l2 normalization is applied by default, but you can change or remove this step with the norm parameter. Similarly, the smoothing can be turned off with the smooth_idf parameter.
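For instance (a small sketch with an assumed two-document corpus, not the article's data), passing norm=None keeps the raw tf-idf scores instead of unit-length rows:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['this is the first document.', 'this is the second document.']

# default: each row of the tf-idf matrix is l2-normalized to unit length
tfidf_l2 = TfidfVectorizer(analyzer='char', norm='l2').fit_transform(corpus)
# norm=None: raw tf-idf scores, no row normalization
tfidf_raw = TfidfVectorizer(analyzer='char', norm=None).fit_transform(corpus)

print(np.linalg.norm(tfidf_l2.toarray(), axis=1))   # every row has norm 1.0
print(np.linalg.norm(tfidf_raw.toarray(), axis=1))  # rows keep their raw magnitude
```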
To understand how the exact scores are computed, I will use a CountVectorizer() to get the count of each char in each document.
```python
countVectorizer = CountVectorizer(analyzer='char')
tf = countVectorizer.fit_transform(corpus)
tf_df = pd.DataFrame(tf.toarray(), columns=countVectorizer.get_feature_names())
tf_df
# output:
#    .  ?  _  a  c  d  e  f  h  i  m  n  o  r  s  t  u
# 0  1  0  4  0  1  1  2  1  2  3  1  1  1  1  3  4  1
# 1  1  0  5  0  3  3  4  0  2  2  2  3  3  0  3  4  2
# 2  1  0  5  1  0  2  2  0  3  3  0  2  1  1  2  3  0
# 3  0  1  4  0  1  1  2  1  2  3  1  1  1  1  3  4  1
```
Now let's apply the tf-idf weighting based on the sklearn implementation to the document at index 2!
```python
v = []
doc_id = 2
# number of documents in the corpus + smoothing
n_d = 1 + tf_df.shape[0]
for char in tf_df.columns:
    # calculate tf - count of this char in the doc / total number of chars in the doc
    tf = tf_df.loc[doc_id, char] / tf_df.loc[doc_id, :].sum()
    # number of documents containing this char, with smoothing
    df_d_t = 1 + sum(tf_df.loc[:, char] > 0)
    # now calculate the idf with smoothing
    idf = np.log(n_d / df_d_t) + 1
    # calculate the score
    v.append(tf * idf)

from sklearn.preprocessing import normalize
# normalize the vector with the l2 norm and create a dataframe with feature names
pd.DataFrame(normalize([v], norm='l2'), columns=vectorizer.get_feature_names())
# output:
#           .    ?        _         a    c         d         e    f         h         i    m         n         o         r         s         t    u
# 0  0.140615  0.0  0.57481  0.220301  0.0  0.229924  0.229924  0.0  0.344886  0.344886  0.0  0.229924  0.114962  0.140615  0.229924  0.344886  0.0
```
You will see that the scores of each char match the TfidfVectorizer() output!



