Environment:
system: Windows 10, 64-bit
python version: 3.7.10
matplotlib version: 3.4.2
numpy version: 1.20.3
pandas version: 1.2.4
seaborn version: 0.11.1
sklearn version: 0.24.2
imblearn version: 0.8.0
Credit Card Fraud Prediction
Data exploration
- Check the data types and missing values. Class is the label; the other columns are features, all of type float. There are 284807 samples in total.
df.info()
Data columns (total 31 columns):
 #   Column  Non-Null Count   Dtype
---  ------  --------------   -----
 0   Time    284807 non-null  float64
 1   V1      284807 non-null  float64
 2   V2      284807 non-null  float64
 3   V3      284807 non-null  float64
 4   V4      284807 non-null  float64
 5   V5      284807 non-null  float64
 6   V6      284807 non-null  float64
 7   V7      284807 non-null  float64
 8   V8      284807 non-null  float64
 9   V9      284807 non-null  float64
 10  V10     284807 non-null  float64
 11  V11     284807 non-null  float64
 12  V12     284807 non-null  float64
 13  V13     284807 non-null  float64
 14  V14     284807 non-null  float64
 15  V15     284807 non-null  float64
 16  V16     284807 non-null  float64
 17  V17     284807 non-null  float64
 18  V18     284807 non-null  float64
 19  V19     284807 non-null  float64
 20  V20     284807 non-null  float64
 21  V21     284807 non-null  float64
 22  V22     284807 non-null  float64
 23  V23     284807 non-null  float64
 24  V24     284807 non-null  float64
 25  V25     284807 non-null  float64
 26  V26     284807 non-null  float64
 27  V27     284807 non-null  float64
 28  V28     284807 non-null  float64
 29  Amount  284807 non-null  float64
 30  Class   284807 non-null  int64
dtypes: float64(30), int64(1)
memory usage: 67.4 MB
- Look at the means and variances of the data, and at the box plots:
- The labels are severely imbalanced; class balancing is needed during training.
- V1~V28 are widely dispersed. Time is recorded in seconds, too fine-grained a unit for analysis, so it should be binned into larger intervals. Amount is left untouched for now; the distribution of transaction amount versus fraud can be examined later.
df.describe().T (output wraps: the 50%/75%/max columns continue in the second block below)
          count          mean         std         min        25%
Time 284807.0 9.481386e+04 47488.145955 0.000000 54201.500000
V1 284807.0 1.168375e-15 1.958696 -56.407510 -0.920373
V2 284807.0 3.416908e-16 1.651309 -72.715728 -0.598550
V3 284807.0 -1.379537e-15 1.516255 -48.325589 -0.890365
V4 284807.0 2.074095e-15 1.415869 -5.683171 -0.848640
V5 284807.0 9.604066e-16 1.380247 -113.743307 -0.691597
V6 284807.0 1.487313e-15 1.332271 -26.160506 -0.768296
V7 284807.0 -5.556467e-16 1.237094 -43.557242 -0.554076
V8 284807.0 1.213481e-16 1.194353 -73.216718 -0.208630
V9 284807.0 -2.406331e-15 1.098632 -13.434066 -0.643098
V10 284807.0 2.239053e-15 1.088850 -24.588262 -0.535426
V11 284807.0 1.673327e-15 1.020713 -4.797473 -0.762494
V12 284807.0 -1.247012e-15 0.999201 -18.683715 -0.405571
V13 284807.0 8.190001e-16 0.995274 -5.791881 -0.648539
V14 284807.0 1.207294e-15 0.958596 -19.214325 -0.425574
V15 284807.0 4.887456e-15 0.915316 -4.498945 -0.582884
V16 284807.0 1.437716e-15 0.876253 -14.129855 -0.468037
V17 284807.0 -3.772171e-16 0.849337 -25.162799 -0.483748
V18 284807.0 9.564149e-16 0.838176 -9.498746 -0.498850
V19 284807.0 1.039917e-15 0.814041 -7.213527 -0.456299
V20 284807.0 6.406204e-16 0.770925 -54.497720 -0.211721
V21 284807.0 1.654067e-16 0.734524 -34.830382 -0.228395
V22 284807.0 -3.568593e-16 0.725702 -10.933144 -0.542350
V23 284807.0 2.578648e-16 0.624460 -44.807735 -0.161846
V24 284807.0 4.473266e-15 0.605647 -2.836627 -0.354586
V25 284807.0 5.340915e-16 0.521278 -10.295397 -0.317145
V26 284807.0 1.683437e-15 0.482227 -2.604551 -0.326984
V27 284807.0 -3.660091e-16 0.403632 -22.565679 -0.070840
V28 284807.0 -1.227390e-16 0.330083 -15.430084 -0.052960
Amount 284807.0 8.834962e+01 250.120109 0.000000 5.600000
Class 284807.0 1.727486e-03 0.041527 0.000000 0.000000
50% 75% max
Time 84692.000000 139320.500000 172792.000000
V1 0.018109 1.315642 2.454930
V2 0.065486 0.803724 22.057729
V3 0.179846 1.027196 9.382558
V4 -0.019847 0.743341 16.875344
V5 -0.054336 0.611926 34.801666
V6 -0.274187 0.398565 73.301626
V7 0.040103 0.570436 120.589494
V8 0.022358 0.327346 20.007208
V9 -0.051429 0.597139 15.594995
V10 -0.092917 0.453923 23.745136
V11 -0.032757 0.739593 12.018913
V12 0.140033 0.618238 7.848392
V13 -0.013568 0.662505 7.126883
V14 0.050601 0.493150 10.526766
V15 0.048072 0.648821 8.877742
V16 0.066413 0.523296 17.315112
V17 -0.065676 0.399675 9.253526
V18 -0.003636 0.500807 5.041069
V19 0.003735 0.458949 5.591971
V20 -0.062481 0.133041 39.420904
V21 -0.029450 0.186377 27.202839
V22 0.006782 0.528554 10.503090
V23 -0.011193 0.147642 22.528412
V24 0.040976 0.439527 4.584549
V25 0.016594 0.350716 7.519589
V26 -0.052139 0.240952 3.517346
V27 0.001342 0.091045 31.612198
V28 0.011244 0.078280 33.847808
Amount 22.000000 77.165000 25691.160000
Class 0.000000 0.000000 1.000000
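The severe label imbalance noted above can be confirmed directly from the Class column. A minimal sketch on a hypothetical miniature frame (the real creditcard.csv has 284807 rows, 492 of them fraud):

```python
import pandas as pd

# Toy stand-in for creditcard.csv; the 998/2 split is an assumption for illustration
df = pd.DataFrame({'Class': [0] * 998 + [1] * 2})

counts = df['Class'].value_counts()
ratio = counts[1] / len(df)
print(counts.to_dict())             # {0: 998, 1: 2}
print(f'fraud ratio: {ratio:.3%}')  # fraud ratio: 0.200%
```

On the real data the same two lines report 284315 normal versus 492 fraud rows, a ratio of about 0.17%.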
- The conclusions above can also be inspected with box plots:
cols = df.columns
df.plot(kind='box', y=cols[1:], subplots=True, layout=(4, 8), figsize=(15, 10))
plt.savefig('6_ClassificationAlgorithm/task/picture/data_box.jpg')
plt.show()
- Check pairwise feature correlations with a heatmap, showing only cells whose absolute correlation exceeds 0.4 (at a 0.5 threshold nothing remains). No features are dropped for now. (Note: see drawingHeatMap())
- Re-bin Time into half-hour or one-hour intervals to view spending per period (one-hour bins are used here; half-hour bins make the plot too crowded).
From the plot, fraud concentrates around 2 a.m. and around 6 a.m., while normal spending peaks between 9 and 11 a.m.
(Note: see draw_timeConsumFraud() for the code)
So Time needs to be binned:
df['Hour'] = divmod(df['Time'], 3600)[0].astype(np.int16)
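As a quick sanity check of this binning, a sketch with hypothetical Time values (seconds since the first transaction; the real data spans two days, i.e. up to 172792 seconds):

```python
import numpy as np
import pandas as pd

# Hypothetical sample times in seconds (assumption, not rows from the real data)
t = pd.Series([0.0, 1799.0, 3600.0, 7425.0, 172791.0])
hour = divmod(t, 3600)[0].astype(np.int16)  # integer hour index since the start
print(hour.tolist())  # [0, 0, 1, 2, 47]
```

Two days give hour indices 0 through 47; taking `hour % 24` would instead fold both days onto a single clock.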
- Examine how fraud counts vary with transaction amount.
From the plot, most fraud occurs on amounts within 1000. This can also be confirmed by counting anomalies under 1000:
483 of the 492 total fraud cases involve amounts within 1000, so fraud concentrates in small transactions.
draw_fraundAmount_rela(Xfraud, Xnonfraud)
# numeric check: class 0 -> 281384, class 1 -> 483; only 492 frauds in total,
# so fraud concentrates among low-amount transactions
print(df[df['Amount'] <= 1000].groupby(df['Class']).size())
- Preprocessing V1~V28
Standardization is used here: it shrinks the spread of widely dispersed columns and stretches tightly concentrated ones, putting all features on a comparable scale;
drop_cols = ['Time', 'Amount', 'Class', 'Hour']
df['Amount'] = MinMaxScaler().fit_transform(np.array(df['Amount']).reshape(-1, 1))
X, Y = df.drop(drop_cols, axis=1), df['Class']
scaler = StandardScaler()
X_scaler = scaler.fit_transform(X)
X_scaler = pd.DataFrame(X_scaler, columns=X.columns)
data = pd.concat([X_scaler, df['Amount'], df['Hour']], axis=1)
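A minimal sketch of what StandardScaler does to columns with very different spreads (toy numbers, not the real features): after the transform every column has mean 0 and unit variance.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# One widely dispersed column, one tightly concentrated column (toy data)
X = np.array([[100.0, 0.01],
              [200.0, 0.02],
              [300.0, 0.03]])
Xs = StandardScaler().fit_transform(X)
print(Xs.mean(axis=0))  # approximately [0, 0]
print(Xs.std(axis=0))   # approximately [1, 1]
```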
- Split into training and test sets
Because the labels are severely imbalanced, balancing is needed during training. One can oversample, or set class_weight='balanced' when fitting; class_weight was tried here but did not work well, and it is unclear whether the extreme imbalance is the reason.
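For reference, the class_weight route mentioned above can be sketched like this; the synthetic data and parameters are assumptions for illustration, not the experiment's actual setup.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 98/2 imbalanced data standing in for the real features (assumption)
X, y = make_classification(n_samples=2000, weights=[0.98, 0.02], random_state=0)

# 'balanced' reweights each class by n_samples / (n_classes * n_samples_in_class),
# so minority-class errors cost more without generating any synthetic rows
clf = LogisticRegression(class_weight='balanced', max_iter=1000).fit(X, y)
print(clf.score(X, y))
```

Unlike SMOTE, this changes only the loss weighting; the training set itself stays imbalanced, which may explain why its effect differs from oversampling here.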
** Considerations for the test split:
- This experiment uses SMOTE oversampling, but the test set must not contain resampled data, so the train/test split is done before resampling;
- Because the labels are severely imbalanced, even a stratified split leaves far more normal than fraud samples in the test set. Evaluated naively, a model could misclassify every fraud case and still show high accuracy; that is the artifact of severe imbalance, so recall and precision should also be checked to evaluate the model;
Given the above, the test set is rebalanced after the split: from the normal samples in the test set, draw as many samples as there are fraud samples.
x_train, X_test, y_train, y_test = train_test_split(data, df['Class'], test_size=0.3, stratify=df['Class'])
test = pd.concat([X_test, y_test], axis=1)
test = pd.concat([test[y_test==0].sample(n=np.sum(y_test==1), axis=0), test[y_test==1]], axis=0)
# Now the test set contains equal numbers of normal and fraud samples
X_test, y_test = test.drop('Class', axis=1), test['Class']
- Oversample the training set
sm = SMOTE(random_state=11)
x_train, y_train = sm.fit_resample(x_train, y_train)
Model training
Logistic regression, naive Bayes, GBDT, and a voting ensemble are used here.
        train_score  test_score  recall_score  precision_score  predict_time
Logist     0.954186    0.922297      0.922297         0.977099      0.005984
Gaussi     0.919978    0.891892      0.891892         0.953125      0.001994
Gradie     0.967139    0.929054      0.929054         0.984733      0.001995
Voting     0.949543    0.918919      0.918919         0.992063      0.006979
From the output, test_score and recall_score are identical (micro-averaged recall reduces to accuracy), so their lines coincide in the plot. Overall, logistic regression performs quite well on this data.
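The coincidence of test_score and recall_score is expected: with average='micro', recall pools true positives over all classes and collapses to plain accuracy. A small check on toy labels:

```python
import numpy as np
from sklearn.metrics import recall_score, accuracy_score

y_true = np.array([0, 0, 1, 1, 1, 0])  # toy labels, not the experiment's data
y_pred = np.array([0, 1, 1, 1, 0, 0])
micro = recall_score(y_true, y_pred, average='micro')
acc = accuracy_score(y_true, y_pred)
print(micro, acc)  # both 0.666...
```

To see fraud-class recall specifically, drop the average argument (binary default) or pass average=None.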
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
import seaborn as sns
from imblearn.over_sampling import SMOTE
from sklearn.naive_bayes import GaussianNB
from sklearn import ensemble
from time import time
df = pd.read_csv('6_ClassificationAlgorithm/task/creditcard.csv')
# print(df.info())
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
# print(df.describe().T)
# Bin Time: the data covers two days; use one-hour bins
df['Hour'] = divmod(df['Time'], 3600)[0].astype(np.int16)
# print(df['Hour'].value_counts())
heatmapData = df.drop(['Time'], axis=1)
Xfraud = heatmapData.loc[df['Class'] == 1]
Xnonfraud = heatmapData.loc[df['Class'] == 0]
def drawingHeatMap(heatmapData, Xfraud, Xnonfraud):
fig, axes = plt.subplots(figsize=(12,8), nrows=1, ncols=3)
fig.subplots_adjust(left=0.1, right=0.95, bottom=0.15, top=0.9, wspace=0.1, hspace=0.15)
axes = axes.flatten()
corrHeatmapData, corrXfraud, corrXnonfraud = heatmapData.corr(), Xfraud.corr(), Xnonfraud.corr()
# to show only the lower triangle:
# mask = np.zeros_like(heatmapData.corr(), dtype=bool)
# mask[np.triu_indices_from(mask)] = True
# hide cells whose absolute correlation is at or below the threshold
mask1 = np.triu(np.ones_like(corrHeatmapData, dtype=bool))
mask2 = np.abs(corrHeatmapData) <= 0.4
mask = mask1 | mask2
# cmap = sns.diverging_palette(150, 275, s=80, l=40, n=9, center="light", as_cmap=True)
ax1 = sns.heatmap(corrHeatmapData, ax=axes[0], linewidths=0.1, cbar=False, vmin=-1, vmax=1, annot=False, fmt='.1f', annot_kws={'size':8}, mask=mask, cmap='YlGnBu')
ax1.set_title('fraud and nonfraud', color='b')
ax2 = sns.heatmap(corrXfraud, ax=axes[1], yticklabels=False, cbar=False, linewidths=0.1, vmin=-1, vmax=1, annot=False, fmt='.1f', annot_kws={'size':8}, mask=mask, cmap='YlGnBu')
ax2.set_title('fraud', color='b')
ax3 = sns.heatmap(corrXnonfraud, ax=axes[2], yticklabels=False, linewidths=0.1, vmin=-1, vmax=1, annot=False, fmt='.1f', annot_kws={'size':8}, mask=mask, cmap='YlGnBu')
ax3.set_title('nonfraud', color='b')
plt.suptitle('Correlation between variables')
# plt.savefig('6_ClassificationAlgorithm/task/picture/heatmap.jpg', bbox_inches='tight')
plt.show()
# V2 looks likely to correlate with Amount, and V3 with Time
# drawingHeatMap(heatmapData, Xfraud, Xnonfraud)
# relation between fraud, transaction amount, and transaction count
def draw_fraundAmount_rela(fraudData, nonfraudData):
fig, axes = plt.subplots(figsize=(8, 6), nrows=2, ncols=1)
fig.subplots_adjust(left=0.12, right=0.95, bottom=0.15, top=0.9, wspace=0.2, hspace=0.3)
axes = axes.flatten()
ax = fraudData.plot(kind='hist', y='Amount', ax=axes[0], bins=20)
ax.set_title('fraud')
ax = nonfraudData.plot(kind='hist', y='Amount', ax=axes[1], bins=20, logy=True)
ax.set_title('nonfraud')
plt.xlabel('Amount', loc='center')
fig.supylabel('Number of Transactions', color='y')
plt.savefig('6_ClassificationAlgorithm/task/picture/fraudAmount_rela.jpg')
plt.show()
# most fraud appears to involve amounts within 1000
# draw_fraundAmount_rela(Xfraud, Xnonfraud)
# numeric check: class 0: 281384, class 1: 483; only 492 frauds in total, so fraud concentrates among low-amount transactions
# print(df[df['Amount'] <= 1000].groupby(df['Class']).size())
def draw_timeConsumFraud():
plt.rcParams['font.sans-serif']=['SimHei'] # allow CJK glyphs in labels
plt.rcParams['axes.unicode_minus']=False # render minus signs correctly
fig = plt.figure(figsize=(12, 5))
# fig.subplots_adjust(left=0.1, right=0.95, bottom=0.1, top=0.95)
fig.tight_layout()
# df['Hour'].value_counts(sort=False).plot(kind='bar')
ax = df['Hour'].value_counts().sort_index().plot(kind='bar', legend=True, label='transactions per hour')
ax.set_ylabel('consum times')
ax.set_xlabel('Time - Hour')
ax2 = ax.twinx()
ax2.plot(df.groupby(df.Hour)['Class'].sum(), color='r', label='frauds per hour')
ax2.set_ylabel('fraud times')
# print(ax2.get_xticks())
# print(df.groupby(df.Hour)['Class'].sum().values)
for x, y in zip(ax2.get_xticks(), df.groupby(df.Hour)['Class'].sum().values):
ax2.text(x, y, '%d'%y, ha='center', va= 'bottom')
# sns.factorplot(x="Hour", data=df, kind="count", palette="ocean", size=5, aspect=3)
plt.legend(loc='upper left')
plt.savefig('6_ClassificationAlgorithm/task/picture/timeConsumFraud.jpg')
plt.show()
# from the plot, fraud concentrates around 2 a.m. and 6 a.m.; spending peaks between 9 and 11 a.m.
# draw_timeConsumFraud()
# sns.displot(df['V1'], bins=50, alpha=0.5)
# plt.show()
# standardize V1~V28 with StandardScaler()
drop_cols = ['Time', 'Amount', 'Class', 'Hour']
df['Amount'] = MinMaxScaler().fit_transform(np.array(df['Amount']).reshape(-1, 1))
X, Y = df.drop(drop_cols, axis=1), df['Class']
scaler = StandardScaler()
X_scaler = scaler.fit_transform(X)
X_scaler = pd.DataFrame(X_scaler, columns=X.columns)
data = pd.concat([X_scaler, df['Amount'], df['Hour']], axis=1)
x_train, X_test, y_train, y_test = train_test_split(data, df['Class'], test_size=0.3, stratify=df['Class'])
test = pd.concat([X_test, y_test], axis=1)
test = pd.concat([test[y_test==0].sample(n=np.sum(y_test==1), axis=0), test[y_test==1]], axis=0)
X_test, y_test = test.drop('Class', axis=1), test['Class']
# print(test['Class'].value_counts())
sm = SMOTE(random_state=11)
x_train, y_train = sm.fit_resample(x_train, y_train)
LR = LogisticRegression(C=0.01, penalty='l1', solver='saga', n_jobs=-1)
svm = SVC(kernel='rbf', C=1.0, probability=True)
gus = GaussianNB()
GBDT = ensemble.GradientBoostingClassifier(n_estimators=30)
estimators = [LR, gus, GBDT]
voting_estimators = list(zip(['LR', 'gus', 'GBDT'], estimators))
voting = ensemble.VotingClassifier(voting_estimators, voting='hard')
estimators.append(voting)
mode_metrics = pd.DataFrame()
for clf in estimators:
clf.fit(x_train, y_train)
start = time()
y_predict = clf.predict(X_test)
end = time()
mode_name = clf.__class__.__name__.replace('Classifier', '')[:6]
mode_metrics.loc[mode_name, 'train_score'] = np.mean(clf.predict(x_train)==y_train)
mode_metrics.loc[mode_name, 'test_score'] = np.mean(y_predict==y_test)
mode_metrics.loc[mode_name, 'recall_score'] = recall_score(y_test, y_predict, average='micro') # micro recall pools all samples, so it equals accuracy
mode_metrics.loc[mode_name, 'precision_score'] = precision_score(y_test, y_predict)
mode_metrics.loc[mode_name, 'predict_time'] = end - start
print(mode_metrics)
ax = mode_metrics.plot(kind='line', secondary_y=['predict_time'], figsize=(8, 6))
plt.xticks(rotation=90)
ax.set_xlabel('mode name', color='r')
ax.set_ylabel('accuracy', color='r')
ax.right_ax.set_ylabel('predict time', color='r')
# plt.savefig('6_ClassificationAlgorithm/task/picture/result_variants_mode.jpg')
plt.show()