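- Task03: Text classification with machine learning
- 0. Data preparation
- 1. Extracting text features with TF-IDF
- 2. Training and prediction with TF-IDF features and a linear model
- 3. Training and prediction with TF-IDF features and XGBoost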
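0. Data preparation

This study activity is part of Coogle数据科学: 30天入门数据竞赛 (Coogle Data Science: 30-day introduction to data competitions).
The material comes from 阿里云天池 (Alibaba Cloud Tianchi) - 零基础入门NLP - 新闻文本分类 (Introduction to NLP from scratch: news text classification).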
# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm
import pickle
# Input and output paths
data_path = 'data/'
save_path = 'result/'
# Feature extraction, model, and metric
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn import linear_model
from sklearn.metrics import f1_score
# Read the data files (the training set is tab-separated)
train_df = pd.read_csv(data_path+'train_set.csv', sep='\t')
test_df = pd.read_csv(data_path+'test_a.csv')
train_df.head()
| | label | text |
|---|---|---|
| 0 | 2 | 2967 6758 339 2021 1854 3731 4109 3792 4149 15... |
| 1 | 11 | 4464 486 6352 5619 2465 4802 1452 3137 5778 54... |
| 2 | 3 | 7346 4068 5074 3747 5681 6093 1777 2226 7354 6... |
| 3 | 2 | 7159 948 4866 2109 5520 2490 211 3956 5520 549... |
| 4 | 3 | 3646 3055 3055 2490 4659 6065 3370 5814 2465 5... |
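1. Extracting text features with TF-IDF

Word embedding, in the broad sense of text vectorization, maps variable-length text into a fixed-length vector space. It is the first step in text classification.
Bag of Words, also known as Count Vectors, can be implemented in sklearn directly with CountVectorizer: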
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
'This is the first document.',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?',
]
vectorizer = CountVectorizer()
vectorizer.fit_transform(corpus).toarray()
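To make the conversion concrete, here is a minimal pure-Python sketch of a bag-of-words transform applied to the same corpus. The helper names `tokenize` and `bag_of_words` are hypothetical, and the tokenizer (lowercasing, keeping tokens of two or more word characters) is an assumption chosen to resemble CountVectorizer's default behavior, not its actual implementation.

```python
import re
from collections import Counter

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

def tokenize(text):
    # Assumption: lowercase the text and keep tokens of 2+ word characters
    return re.findall(r"\b\w\w+\b", text.lower())

def bag_of_words(docs):
    # Count tokens per document
    counts = [Counter(tokenize(doc)) for doc in docs]
    # The vocabulary is the sorted set of all tokens seen in any document
    vocabulary = sorted(set().union(*counts))
    # One row per document, one column per vocabulary term
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix

vocabulary, matrix = bag_of_words(corpus)
print(vocabulary)
for row in matrix:
    print(row)
```

Every row has one column per vocabulary term, so documents of any length become vectors of the same size. Word order is discarded: the first and fourth sentences produce identical rows.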
TF-IDF
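A TF-IDF score has two parts: the term frequency (TF) and the inverse document frequency (IDF). The IDF of a term is the logarithm of the total number of documents in the corpus divided by the number of documents that contain the term.
- TF(t) = number of times term t occurs in the current document / total number of terms in the current document
- IDF(t) = log_e(total number of documents / (number of documents containing term t + 1))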
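The two formulas above can be checked by hand on a toy corpus. The sketch below is only an illustration: the corpus and the function names `tf`, `idf`, and `tf_idf` are made up, and the IDF denominator is read as (document count + 1). sklearn's TfidfVectorizer uses a differently smoothed IDF and L2-normalizes each row by default, so its values will not match these numbers.

```python
import math

# Hypothetical toy corpus, already split into tokens
docs = [
    ["apple", "banana", "apple"],
    ["banana", "cherry"],
    ["cherry", "cherry", "date"],
    ["date"],
]

def tf(term, doc):
    # Occurrences of the term divided by the document length
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Natural log of (number of documents / (documents containing the term + 1))
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / (df + 1))

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "apple": TF = 2/3 in the first document, IDF = ln(4 / (1 + 1)) = ln 2
score = tf_idf("apple", docs[0], docs)
print(round(score, 4))  # 0.4621
```

A term that is frequent in one document but rare across the corpus gets a high score. With this variant of the formula, a term that appears in every document gets a negative IDF, which is one reason libraries use smoothed versions.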
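CountVectorizer is a commonly used text feature extraction class. For each training text, it considers only how often each word occurs in that text.
CountVectorizer converts the words in the texts into a term-frequency matrix; its fit_transform method counts the occurrences of each word.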
train_df['label'].values
array([ 2, 11, 3, ..., 11, 2, 3], dtype=int64)
# Bag of words + ridge classifier
# Keep the 3000 most frequent words as features; fit_transform counts their frequencies
vectorizer = CountVectorizer(max_features=3000)
train_test = vectorizer.fit_transform(train_df['text'])
# Train on the first 10,000 texts
clf = RidgeClassifier()
clf.fit(train_test[:10000], train_df['label'].values[:10000])
# Validate on the remaining texts (rows 10,000 to 200,000) and print the macro F1 score
val_pred = clf.predict(train_test[10000:])
print(f1_score(train_df['label'].values[10000:], val_pred, average='macro'))
# 0.74
0.7488127202457993

2. Training and prediction with TF-IDF features and a linear model
The TF-IDF value of a term is the product TF x IDF.
# TF-IDF + ridge classifier: training and validation
# Keep the 3000 most frequent n-grams (1-grams to 3-grams) as features
tfidf = TfidfVectorizer(ngram_range=(1,3), max_features=3000)
# Compute the TF-IDF features of the texts
train_test = tfidf.fit_transform(train_df['text'])
# Train on the first 10,000 texts and validate on the rest
clf = RidgeClassifier()
clf.fit(train_test[:10000], train_df['label'].values[:10000])
val_pred = clf.predict(train_test[10000:])
print(f1_score(train_df['label'].values[10000:], val_pred, average='macro'))
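The validation code scores predictions with `f1_score(..., average='macro')`. As a rough illustration of what macro averaging means, the sketch below computes the F1 score of each class separately and then takes the unweighted mean. The function name `macro_f1` and the toy labels are hypothetical, and this is not sklearn's implementation.

```python
def macro_f1(y_true, y_pred):
    # Consider every label that appears in the truth or in the predictions
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    # Unweighted mean: every class counts equally, regardless of its size
    return sum(scores) / len(scores)

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
score = macro_f1(y_true, y_pred)
print(round(score, 4))  # 0.6556 (per-class F1: 0.5, 0.8, 0.6667)
```

Because each class contributes equally, a model that ignores a rare class is penalized more under macro averaging than under plain accuracy.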
test_data = pd.read_csv(data_path+'test_a.csv')
# Extract TF-IDF features for the test set with the vectorizer already fitted on the training data
test_test = tfidf.transform(test_data['text'])
# Predict the test labels
test_pred_val = clf.predict(test_test)
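Test features have to live in the same column space the classifier was trained on, which is why a vectorizer should be fitted on the training texts only and then reused on the test texts. The sketch below shows the idea with a hypothetical stand-in class, `TinyCountVectorizer`, written only for illustration; the toy documents are made up and merely imitate the anonymized token format of the dataset.

```python
from collections import Counter

class TinyCountVectorizer:
    """Hypothetical minimal vectorizer with a fit/transform interface."""

    def fit(self, docs):
        # Learn the vocabulary (term -> column index) from the given documents
        terms = sorted({tok for doc in docs for tok in doc.split()})
        self.vocabulary_ = {term: idx for idx, term in enumerate(terms)}
        return self

    def transform(self, docs):
        # Count terms using the learned vocabulary; unseen terms are ignored
        rows = []
        for doc in docs:
            row = [0] * len(self.vocabulary_)
            for term, n in Counter(doc.split()).items():
                idx = self.vocabulary_.get(term)
                if idx is not None:
                    row[idx] = n
            rows.append(row)
        return rows

train_docs = ["57 44 66 56", "44 66 900", "56 56 57"]
test_docs = ["66 900 1234", "57 57"]

# Correct: fit on the training texts, then transform both sets
vec = TinyCountVectorizer().fit(train_docs)
X_train = vec.transform(train_docs)
X_test = vec.transform(test_docs)

# Incorrect: refitting on the test texts yields a different vocabulary,
# so the columns no longer mean what the classifier learned
refit = TinyCountVectorizer().fit(test_docs)
print(vec.vocabulary_)
print(refit.vocabulary_)
```

With the refitted vocabulary, both the number of columns and the term assigned to each column change, so predictions made on such features would be meaningless even when the shapes happen to match.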
test_pred_val
array([3, 3, 4, ..., 3, 3, 8], dtype=int64)
test_df = pd.DataFrame(test_pred_val, columns=['label'])
type(test_df['label'][0])
numpy.int64
# Save the predictions
test_df.to_csv(save_path+'submit.csv', index=False, sep=',')
del sample_df

3. Training and prediction with TF-IDF features and XGBoost
# Keep the 3000 most frequent n-grams (1-grams to 3-grams) as features
tfidf = TfidfVectorizer(ngram_range=(1,3), max_features=3000)
# Compute the TF-IDF features of the texts
train_test = tfidf.fit_transform(train_df['text'])
i = 0
for item in train_test:
    for word in item:
        print(word)
        break
    # i += 1
    # if i >= 1:
    #     break
    break
(0, 2861)    0.021319323290812923
(0, 2802)    0.019959736004927054
(0, 2194)    0.021588472538590326
(0, 891)    0.018603301344074175
(0, 1247)    0.020608661311821395
(0, 2038)    0.02240483997783033
(0, 2996)    0.022130915429207997
(0, 2923)    0.044051176856073015
(0, 1183)    0.05097171821741816
:    :
(0, 422)    0.027119239565399852
(0, 197)    0.057173982210201105
(0, 1414)    0.027833712233284878
(0, 1301)    0.09802535247134912
(0, 1387)    0.17425122880337085
(0, 1150)    0.1704436496531144
(0, 328)    0.011162664461440474
(0, 412)    0.032072202918363046
(0, 2600)    0.09225915260965885
(0, 791)    0.08049002269211457
# Extract TF-IDF features for the test set with the vectorizer fitted on the training data.
# test_df was overwritten with predictions earlier, so the texts are read from test_data.
test_test = tfidf.transform(test_data['text'])
from xgboost import XGBClassifier
xgm = XGBClassifier()
xgm.fit(train_test, train_df['label'])
D:\ProgramData\Anaconda3\lib\site-packages\xgboost\sklearn.py:1146: UserWarning: The use of label encoder in XGBClassifier is deprecated and will be removed in a future release. To remove this warning, do the following: 1) Pass option use_label_encoder=False when constructing XGBClassifier object; and 2) Encode your labels (y) as integers starting with 0, i.e. 0, 1, 2, ..., [num_class - 1].
warnings.warn(label_encoder_deprecation_msg, UserWarning)
[18:54:32] WARNING: D:\Build\xgboost\xgboost-1.4.2.git\src\learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'multi:softprob' was changed from 'merror' to 'mlogloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
importance_type='gain', interaction_constraints='',
learning_rate=0.300000012, max_delta_step=0, max_depth=6,
min_child_weight=1, missing=nan, monotone_constraints='()',
n_estimators=100, n_jobs=8, num_parallel_tree=1,
objective='multi:softprob', random_state=0, reg_alpha=0,
reg_lambda=1, scale_pos_weight=None, subsample=1,
tree_method='exact', validate_parameters=1, verbosity=None)
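The first warning in the output asks for labels encoded as integers 0, 1, ..., num_class - 1. If the values in `train_df['label']` already form such a range, nothing needs to change. Otherwise, a mapping along the following lines could be applied before training and inverted after prediction. This is a sketch with a hypothetical helper name, `encode_labels`, and made-up sample labels.

```python
def encode_labels(labels):
    # Sorted distinct labels define the classes; position = encoded value
    classes = sorted(set(labels))
    to_index = {label: idx for idx, label in enumerate(classes)}
    return [to_index[label] for label in labels], classes

raw_labels = [2, 11, 3, 11, 2, 3]
encoded, classes = encode_labels(raw_labels)
print(encoded)  # [0, 2, 1, 2, 0, 1]

# Map predictions (encoded integers) back to the original labels
decoded = [classes[idx] for idx in encoded]
print(decoded)  # [2, 11, 3, 11, 2, 3]
```

Keeping the `classes` list around matters: the model predicts encoded integers, and the submission file needs the original label values.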
fit_pred = xgm.predict(test_test)
test_df = pd.DataFrame(fit_pred, columns=['label'])
# Save the predictions
test_df.to_csv('result/submit_whole.csv',index=False,sep=',')



