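Data preprocessing workflow for data-analysis problems (outlier detection), with a worked example from the 2022 National Service Outsourcing Competition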

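As I understand it, data preprocessing mainly consists of five steps: reading the files => data overview => missing-value imputation => distribution preview => derived-feature design. This workflow works fine as the preprocessing stage of an outlier-detection task.
We use problem A03 of the 2022 National Service Outsourcing Competition as the running example and implement the whole preprocessing pipeline in code. The main task is to find items with abnormal sales volume and abnormal price. The organisers provide four months of item data (over 17 million rows) and four months of shop data (over 600,000 rows), and the problem emphasises time and space complexity as well as the detection rate and accuracy of outlier identification.
Partial data: https://pan.baidu.com/s/1KatV_6ozYHjPkNjfVGBPmw (extraction code: ee8i)
The overall approach is as follows: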

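Reading the files

Strictly speaking, reading the files is not a preprocessing step. But beginners almost always run into trouble here, mainly with file encodings and with running out of memory.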

import codecs

import pandas as pd


def change_utf_1():  # open each file and save it again as a UTF-8 TSV
    for i in range(6, 7, 1):  # loop so that several monthly files can be processed in one go
        file = "data_20210" + str(i) + ".tsv"
        result = []
        # Read line by line; unlike pd.read_csv(), this did not run out of memory for me
        with codecs.open(file, "r", "gb18030", errors="ignore") as csvfile:
            # If 'gb18030' does not work, try looping over these five encodings;
            # one of them usually solves the problem:
            # ['gbk', 'utf-8', 'utf-8-sig', 'GB2312', 'gb18030']
            for line in csvfile:
                line = line.replace("\r", "")
                line = line.replace("\n", "")
                temp1 = line.split("\t")
                result.append(temp1)
        # print(result)
        data = pd.DataFrame(result)
        new_file = "data_20210" + str(i) + "_new.tsv"
        data.to_csv(new_file, index=False, encoding="utf-8", header=False, sep="\t")
        f = pd.read_csv(new_file, sep="\t", encoding="utf-8")
        # The full data set is too large to experiment with, so save just the head
        new_head = "data_20210" + str(i) + "_head.tsv"
        f.head(50).to_csv(new_head, sep="\t", encoding="utf-8", index=False)
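For the two beginner problems mentioned above (encodings and memory), here is a minimal alternative sketch. It first streams through the file to find an encoding that decodes cleanly, then lets pandas read the file in chunks via the `chunksize` parameter of `pd.read_csv`. The helper names `detect_encoding` and `iter_tsv_chunks`, the shortened encoding list, and the default chunk size are my own assumptions for illustration, not part of the original solution.

```python
import pandas as pd

# Assumed candidate list: 'utf-8-sig' also reads plain UTF-8,
# and 'gb18030' is a superset of GBK and GB2312.
ENCODINGS = ["utf-8-sig", "gb18030"]


def detect_encoding(path, encodings=ENCODINGS):
    """Return the first encoding that decodes the whole file without errors."""
    for enc in encodings:
        try:
            with open(path, "r", encoding=enc) as fh:
                for _ in fh:  # stream line by line, so memory use stays small
                    pass
            return enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode " + str(path))


def iter_tsv_chunks(path, chunksize=100_000):
    """Return an iterator of DataFrames with at most `chunksize` rows each."""
    enc = detect_encoding(path)
    return pd.read_csv(path, sep="\t", encoding=enc, chunksize=chunksize)
```

A typical use would be `for chunk in iter_tsv_chunks("data_202106.tsv"): ...`, aggregating per chunk instead of holding all rows at once. The trade-off is that detection reads the file once per candidate encoding before pandas reads it again.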

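Data overview

This step usually follows directly after reading the files. Its purpose is simply to check for missing values and get a rough picture of each column.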

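import matplotlib.pyplot as plt
import pandas as pd

file = pd.read_csv("data_202106.csv", encoding="utf-8")
# Check for missing values
file.info()
# You can restrict this to the LV (category-level) columns; it does not have to be every column
for i in file.columns:
    print(file[i].value_counts())
# More intuitive: pie charts of total sales amount per category level
for i in range(5):
    level = "CATE_NAME_LV" + str(i + 1)
    sub_cat_val = file.groupby(level)["ITEM_SALES_AMOUNT"].sum()
    plt.figure(figsize=(20, 20))  # one figure per category level
    sub_cat_val.plot(kind="pie", autopct="%1.1f%%", label=level)
    plt.show()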
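For the missing-value check specifically, a compact table can be easier to scan than the output of `info()`. The sketch below is one way to build it; the function name `missing_summary` and the output column names are hypothetical.

```python
import pandas as pd


def missing_summary(df):
    """Return count and share of missing values, only for columns that have any."""
    n_missing = df.isnull().sum()
    summary = pd.DataFrame({
        "n_missing": n_missing,
        "share_missing": n_missing / len(df),
    })
    summary = summary[summary["n_missing"] > 0]
    return summary.sort_values("n_missing", ascending=False)
```

Calling `missing_summary(file)` on the DataFrame loaded above would list the columns that need imputation, worst first.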

The pie charts produced by the plotting loop look like this:

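Missing-value imputation

This is a very important part of the method. Do not fixate on the score alone: it may go up simply because the model overfits, so always look at the imputed values with your own eyes. For example, the random-forest approach shown here is not a good fit for this problem. Other approaches are covered in another post of mine: https://blog.csdn.net/Hjh1906008151/article/details/124338450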

# coding:utf-8
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
import matplotlib.pyplot as plt
import matplotlib as mpl
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
# plt.rcParams and mpl.rcParams are the same object, so each option is set once
mpl.rcParams['font.sans-serif'] = ['SimHei', 'FangSong']  # fonts with Chinese glyphs, tried in order
mpl.rcParams['axes.unicode_minus'] = False  # render the minus sign '-' correctly instead of as a box


def check_data_and_process(need_fill_col, need_fill_file):
    f = pd.read_csv(need_fill_file, sep="\t", encoding="utf-8")
    null_col = []  # columns that contain missing values
    no_need_col = ["DATA_MONTH"]  # columns known to be useless for training
    for c in f.columns:
        n_missing = f[c].isnull().sum()
        if n_missing != 0:
            null_col.append(c)
            print("Column:", c, "missing:", n_missing)

    # Columns are imputed one at a time: drop every other column that still has
    # missing values, plus the columns that are useless for training
    drop_col = list(no_need_col)
    for c in null_col:
        if c != need_fill_col:
            drop_col.append(c)

    data_all = f.drop(columns=drop_col)  # rows with and without the target value, kept together
    # data_need_to_pre = data_all[data_all[need_fill_col].isnull()]  # rows whose target must be imputed
    data_need_to_train = data_all[~data_all[need_fill_col].isnull()]  # complete rows serve as training data

    data = data_need_to_train.drop(columns=[need_fill_col])
    label = data_need_to_train[need_fill_col]

    return data, label, need_fill_col, f, data_all


def train_and_chose_model(model, data, label):

    data = np.array(data)
    label = np.array(label)
    scaler_1 = MinMaxScaler(feature_range=(0, 1))
    scaler_2 = MinMaxScaler(feature_range=(0, 1))

    data = scaler_1.fit_transform(data)  # min-max scaling; two scalers because the label is inverse-transformed later
    label = scaler_2.fit_transform(label.reshape(-1, 1)).ravel()  # 1-D target, as scikit-learn regressors expect

    data_tr, data_te, labels_tr, labels_te = train_test_split(data, label, test_size=0.2, random_state=10)

    model.fit(data_tr, labels_tr)  # train the model
    # For a regressor, score() returns R^2, not accuracy. R^2 is negative
    # when the model predicts worse than always predicting the mean.
    score_test = model.score(data_te, labels_te)
    score_train = model.score(data_tr, labels_tr)
    print(str(model) + " training R^2: " + str(score_train))
    print(str(model) + " test R^2: " + str(score_test))
    y_test_pre = model.predict(data_te)  # predict on the test split
    y_train_pre = model.predict(data_tr)  # predict on the training split
    print("Training MSE:", mean_squared_error(labels_tr, y_train_pre))
    print("Test MSE:", mean_squared_error(labels_te, y_test_pre))

    data_test_pre = scaler_2.inverse_transform(y_test_pre.reshape(-1, 1))  # test predictions in original units
    data_train_pre = scaler_2.inverse_transform(y_train_pre.reshape(-1, 1))  # training predictions in original units
    print(data_train_pre)


def fill_data(model, data, label, need_fill_col, f, data_all):
    missing_mask = data_all[need_fill_col].isnull()
    data_need_to_pre = data_all[missing_mask].drop(columns=[need_fill_col])  # rows to impute, without the target column
    data = np.array(data)
    label = np.array(label)

    scaler_1 = MinMaxScaler(feature_range=(0, 1))
    scaler_2 = MinMaxScaler(feature_range=(0, 1))

    # Fit each scaler once, on the training data, and reuse it: refitting on the
    # rows to impute would scale them differently from the training rows
    data = scaler_1.fit_transform(data)
    data_to_pre = scaler_1.transform(np.array(data_need_to_pre))
    label = scaler_2.fit_transform(label.reshape(-1, 1)).ravel()

    data_tr, data_te, labels_tr, labels_te = train_test_split(data, label, test_size=0.2, random_state=10)

    model.fit(data_tr, labels_tr)  # train the model
    score_test = model.score(data_te, labels_te)
    score_train = model.score(data_tr, labels_tr)
    print(str(model) + " training R^2: " + str(score_train))
    print(str(model) + " test R^2: " + str(score_test))  # negative means worse than predicting the mean
    y_test_pre = model.predict(data_te)  # predict on the test split
    y_train_pre = model.predict(data_tr)  # predict on the training split
    y_need_pred = model.predict(data_to_pre)  # predictions for the missing values
    print("Training MSE:", mean_squared_error(labels_tr, y_train_pre))
    print("Test MSE:", mean_squared_error(labels_te, y_test_pre))

    data_need_pred = scaler_2.inverse_transform(y_need_pred.reshape(-1, 1))

    # Write the predictions back into the rows where the target is missing
    f.loc[f[need_fill_col].isnull(), need_fill_col] = data_need_pred.ravel()
    f.to_csv(need_fill_col + "已补" + "202106_10000_drop.tsv", sep="\t", index=False)
    # print(f[f[need_fill_col].isnull()][need_fill_col])  # check that nothing is left missing


def check(file):
    f = pd.read_csv(file, sep="\t", encoding="utf-8")
    for c in f.columns:
        n_missing = f[c].isnull().sum()
        if n_missing != 0:  # report only columns that still have missing values
            print("Column:", c, "missing:", n_missing)


def main():
    # "已补" in the file names means "imputed": the input already has TOTAL_EVAL_NUM filled in
    need_fill_file = "TOTAL_EVAL_NUM已补202106_10000_drop.tsv"  # file to impute
    need_fill_col = "ITEM_STOCK"  # column to impute
    data, label, need_fill_col, f, data_all = check_data_and_process(need_fill_col, need_fill_file)  # inspect and prepare the data
    train_and_chose_model(RandomForestRegressor(), data, label)  # evaluate the candidate model
    fill_data(RandomForestRegressor(), data, label, need_fill_col, f, data_all)  # impute the missing values
    check("ITEM_STOCK已补202106_10000_drop.tsv")  # verify the file written by fill_data

if __name__ == '__main__':
    main()
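Since the random-forest approach is described above as a poor fit for this problem, it helps to have a very simple baseline to compare the imputed values against. The sketch below fills a column with the median of its group and falls back to the overall median. This is my own illustrative baseline, not the method from the linked post; the function name `fill_with_group_median` is hypothetical, and grouping `ITEM_STOCK` by `CATE_NAME_LV1` is only an assumed example.

```python
import pandas as pd


def fill_with_group_median(df, target_col, group_col):
    """Fill NaNs in `target_col` with the median of the same `group_col` group.

    Groups whose values are all missing fall back to the overall median.
    """
    out = df.copy()
    group_median = out.groupby(group_col)[target_col].transform("median")
    out[target_col] = out[target_col].fillna(group_median)
    out[target_col] = out[target_col].fillna(df[target_col].median())
    return out
```

For example, `fill_with_group_median(f, "ITEM_STOCK", "CATE_NAME_LV1")` returns a copy with the gaps filled, which can then be inspected side by side with the model-based result.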

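Distribution preview

Look at the distribution of each variable: this matters a lot for designing derived features later, and it also gives some guidance on model choice. So preview the combinations of variables you are interested in. Here we visualise a few pairs of features.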

import matplotlib.pyplot as plt


def draw_scatter(df):
    df.plot.scatter(x="ITEM_FAV_NUM", y="ITEM_SALES_VOLUME")    # sales volume vs. number of favorites
    plt.show()
    df.plot.scatter(x="TOTAL_EVAL_NUM", y="ITEM_SALES_VOLUME")  # sales volume vs. number of reviews
    plt.show()
    df.plot.scatter(x="ITEM_PRICE", y="ITEM_SALES_VOLUME")      # sales volume vs. unit price
    plt.show()
    df.plot.scatter(x="ITEM_STOCK", y="ITEM_SALES_VOLUME")      # sales volume vs. stock
    plt.show()

# df_total holds all four months together, so the months can be compared
def draw_bar(df_total):
    month_cols = ['ITEM_SALES_VOLUME_6', 'ITEM_SALES_VOLUME_7', 'ITEM_SALES_VOLUME_8', 'ITEM_SALES_VOLUME_9']
    df_total.groupby('ITEM_NAME')[month_cols].sum().plot.bar()  # select several columns with a list
    plt.show()
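Besides plots, a numeric summary can hint at how a distribution might shape later feature design. The sketch below compares the sample skewness of a column before and after a `log1p` transform, which is a common way to tame heavy right tails such as prices or sales volumes. Treating these columns as heavy-tailed is my assumption, and the function name `skew_report` is hypothetical.

```python
import numpy as np
import pandas as pd


def skew_report(df, cols):
    """Compare the sample skewness of each column before and after log1p."""
    rows = []
    for c in cols:
        s = df[c].dropna()
        s = s[s >= 0]  # log1p is only applied to non-negative values here
        rows.append({
            "column": c,
            "skew_raw": s.skew(),
            "skew_log1p": np.log1p(s).skew(),
        })
    return pd.DataFrame(rows).set_index("column")
```

A call such as `skew_report(df, ["ITEM_PRICE", "ITEM_SALES_VOLUME"])` gives one row per column. A much smaller skewness after the transform suggests working with ratios or logs rather than raw values.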

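Derived-feature design

Derived-feature design here is mostly an exercise in pandas. The following ideas can serve as a reference (using price anomalies as the example):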

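Derived feature	How it is computed
same-CATE_NAME_LV1-mean-ITEM_PRICE-rate	item price / mean price of the same level-1 category in the same month
same-CATE_NAME_LV2-mean-ITEM_PRICE-rate	item price / mean price of the same level-2 category in the same month
same-CATE_NAME_LV3-mean-ITEM_PRICE-rate	item price / mean price of the same level-3 category in the same month
same-CATE_NAME_LV4-mean-ITEM_PRICE-rate	item price / mean price of the same level-4 category in the same month
same-CATE_NAME_LV5-mean-ITEM_PRICE-rate	item price / mean price of the same level-5 category in the same month
same-CATE_NAME_LV2-deliver	item price / mean price of the same level-2 category and same shipping location in the same month
same-CATE_NAME_LV2_price	item price / mean price of the same level-2 category and same place of origin in the same month
month_6_mean_rate	item price / mean item price over the four months
same-USER_ID_ITEM_PRICE	item price / mean price of items in the same shop in the same month
same-MAIN_BUSINESS_price	item price / mean price of all items from shops with the same main business type in the same month
same-BUSINESS_SCOPE_price	item price / mean price of all items from shops with the same business scope in the same month
same-CATE_NAME_LV2_city	item price / mean price of all same-level-2-category items from shops in the same shipping city in the same month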
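The point here is not how to compute these variables, although they can of course be computed. Similar methods are covered in these posts of mine:
For traditional methods, see: https://blog.csdn.net/Hjh1906008151/article/details/124342492
For pyod-based methods, see: https://blog.csdn.net/Hjh1906008151/article/details/124340047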
