You can do the following:

Load the data into a DataFrame:

    import pandas as pd

    # Split fields on runs of whitespace
    df = pd.read_table("/tmp/test.csv", sep=r"\s+")
    print(df)

Output:

       col1  col2  col3             text
    0     1     1     0  meaningful text
    1     5     9     7            trees
    2     7     8     2             text
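To follow along without the file, the same frame can be built in memory. This is only a sketch: the dict below is an assumed copy of the sample data shown in the output above, not the actual contents of /tmp/test.csv.

```python
import pandas as pd

# Assumed in-memory copy of the sample data printed above,
# used here in place of /tmp/test.csv.
df = pd.DataFrame(
    {
        "col1": [1, 5, 7],
        "col2": [1, 9, 8],
        "col3": [0, 7, 2],
        "text": ["meaningful text", "trees", "text"],
    }
)
print(df)
```

The remaining steps work the same way on this frame as on one loaded from disk.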
Tokenize the 'text' column using sklearn.feature_extraction.text.TfidfVectorizer:

    from sklearn.feature_extraction.text import TfidfVectorizer

    v = TfidfVectorizer()
    x = v.fit_transform(df['text'])
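Convert the tokenized data into a DataFrame:

    # On scikit-learn < 1.0, use v.get_feature_names() instead
    df1 = pd.DataFrame(x.toarray(), columns=v.get_feature_names_out())
    print(df1)

Output:

       meaningful      text  trees
    0    0.795961  0.605349    0.0
    1    0.000000  0.000000    1.0
    2    0.000000  1.000000    0.0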
Concatenate the tokenized DataFrame with the original DataFrame:

    res = pd.concat([df, df1], axis=1)
    print(res)

Output:

       col1  col2  col3             text  meaningful      text  trees
    0     1     1     0  meaningful text    0.795961  0.605349    0.0
    1     5     9     7            trees    0.000000  0.000000    1.0
    2     7     8     2             text    0.000000  1.000000    0.0
If you want to drop the 'text' column, do so before concatenating:

    df.drop('text', axis=1, inplace=True)
    res = pd.concat([df, df1], axis=1)
    print(res)

Output:

       col1  col2  col3  meaningful      text  trees
    0     1     1     0    0.795961  0.605349    0.0
    1     5     9     7    0.000000  0.000000    1.0
    2     7     8     2    0.000000  1.000000    0.0
Here is the complete code:

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Split fields on runs of whitespace
    df = pd.read_table("/tmp/test.csv", sep=r"\s+")

    v = TfidfVectorizer()
    x = v.fit_transform(df['text'])

    # On scikit-learn < 1.0, use v.get_feature_names() instead
    df1 = pd.DataFrame(x.toarray(), columns=v.get_feature_names_out())

    df.drop('text', axis=1, inplace=True)
    res = pd.concat([df, df1], axis=1)


