Handling outliers in a dataset with the LOF and iForest algorithms


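Table of contents

Preface
I. Predefined functions and data
II. Hands-on practice
- 1. Z-score outlier detection
- 2. Local Outlier Factor
- 3. Isolation Forest outlier detection
Summary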

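Preface

Outlier detection, also called anomaly detection, is the process of finding objects whose behavior differs markedly from what is expected. The objects it finds are called outliers or anomalies. Outlier detection is widely used, for example in:

- Credit card fraud detection
- Industrial damage detection
- Ad click fraud detection
- Fake review and fake order detection
- Promotion abuse detection (users who farm coupons and sign-up bonuses)

An outlier is a data object that differs markedly from the other data objects. As shown in Figure 1, the points inside regions N1 and N2 are normal data, while the points in regions O1, O2 and O3, which lie far from N1 and N2, are outliers.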

Common outlier detection algorithms:

- Z-score test
- Local Outlier Factor
- Isolation Forest

I. Predefined functions and data

import pandas as pd

# Load the cleaned master table and keep only the rows that have a target label
data = pd.read_csv(r'F:教师培训ppd7df_Master_merge_clean.csv',encoding='gb18030')
x = data[data.target.notnull()].drop(columns=['Idx', 'target', 'sample_status', 'ListingInfo'])
y = data[data.target.notnull()]['target']

def get_auc(x, y):
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import roc_auc_score, roc_curve
    import lightgbm as lgb
    
    # Helper that prints AUC and KS for the train and test sets and plots both ROC curves
    def roc_auc_plot(clf,x_train,y_train,x_test, y_test):
        train_auc = roc_auc_score(y_train,clf.predict_proba(x_train)[:,1])
        train_fpr, train_tpr, _ = roc_curve(y_train,clf.predict_proba(x_train)[:,1])
        train_ks = abs(train_fpr-train_tpr).max()
        print('train_ks = ', train_ks)
        print('train_auc = ', train_auc)

        test_auc = roc_auc_score(y_test,clf.predict_proba(x_test)[:,1])
        test_fpr, test_tpr, _ = roc_curve(y_test,clf.predict_proba(x_test)[:,1])
        test_ks = abs(test_fpr-test_tpr).max()
        print('test_ks = ', test_ks)
        print('test_auc = ', test_auc)

        from matplotlib import pyplot as plt
        plt.plot(train_fpr,train_tpr,label = 'train_auc='+str(train_auc))
        plt.plot(test_fpr,test_tpr,label = 'test_auc='+str(test_auc))
        plt.plot([0,1],[0,1],'r--')  # red dashed diagonal = random classifier
        plt.xlabel('False positive rate')
        plt.ylabel('True positive rate')
        plt.title('ROC Curve')
        plt.legend(loc = 'best')
        plt.show()

    x_train,x_test, y_train, y_test = train_test_split(x,y,random_state=2,test_size=0.2)

    lgb_model = lgb.LGBMClassifier(n_estimators=800,
                                   boosting_type='gbdt',
                                   learning_rate=0.04,
                                   min_child_samples=68,
                                   min_child_weight=0.01,
                                   max_depth=4,
                                   num_leaves=16,
                                   colsample_bytree=0.8,
                                   subsample=0.8,
                                   reg_alpha=0.7777777777777778,
                                   reg_lambda=0.3,
                                   objective='binary')

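    # Early stopping is passed as a callback; the early_stopping_rounds
    # argument of fit() was removed in LightGBM 4.0
    clf = lgb_model.fit(x_train, y_train,
                  eval_set=[(x_train, y_train),(x_test,y_test)],
                  eval_metric='auc',
                  callbacks=[lgb.early_stopping(stopping_rounds=100)])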
    roc_auc_plot(clf,x_train,y_train,x_test, y_test)

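II. Hands-on practice

1. Z-score outlier detection

The Z-score test assumes that the samples follow a normal distribution, and measures how far each sample deviates from that distribution.

Estimating $\mu$ and $\sigma$ from the data gives the normal distribution the samples are assumed to come from. We then evaluate the probability density of each sample under that distribution; when it falls below a chosen threshold, we treat the sample as not belonging to the distribution and label it an outlier.
Drawback: the samples must be assumed to be normally distributed, and most real-world scenarios do not satisfy that assumption.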
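The original post gives no code for this step, so here is a minimal sketch under stated assumptions: a one-dimensional list of numbers and the common |z| > 3 cutoff, neither of which comes from the experiment above. For a normal distribution the density depends only on |z| = |x − μ| / σ and shrinks as |z| grows, so thresholding the density is equivalent to thresholding |z|. The names `zscore_outliers` and `samples` are hypothetical.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return (index, value, z) for every point with |z| above the threshold."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population standard deviation
    if sigma == 0:                     # all values identical: nothing can be flagged
        return []
    flagged = []
    for i, v in enumerate(values):
        z = (v - mu) / sigma
        if abs(z) > threshold:
            flagged.append((i, v, z))
    return flagged

# Toy data (an assumption): 20 values close to 10, plus one value far away
samples = [10.0 + 0.1 * ((i % 5) - 2) for i in range(20)] + [25.0]
for index, value, z in zscore_outliers(samples):
    print(f"index={index} value={value} z={z:.2f}")
```

Because the outlier itself inflates both the mean and the standard deviation, a single extreme value in a very small sample may never reach |z| > 3; this is one more reason the method is fragile in practice.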

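2. Local Outlier Factor

LOF is a classic density-based algorithm (Breunig et al., 2000), published at SIGMOD 2000.

Before LOF, most outlier detection algorithms were either statistical or borrowed a clustering algorithm (for example DBSCAN or OPTICS) to identify outliers.
Statistical outlier detection algorithms usually have to assume that the data follow a specific probability distribution, and that assumption often does not hold.
Clustering methods can usually give only a 0/1 verdict (outlier or not) and cannot quantify how anomalous each data point is.
The density-based LOF algorithm is simpler and more intuitive, places few requirements on the data distribution, and can quantify the degree of outlierness of each data point.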

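LOF ≈ 1 ⇒ not an outlier; LOF ≫ 1 ⇒ outlier.
The algorithm builds on ideas from density-based clustering. Its core concepts are the k-distance (the distance to the k-th nearest neighbor), the reachability distance, the local reachability density, and the local outlier factor (LOF) itself.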
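
For reference, these quantities can be written as follows. The notation is an assumption of this note rather than the post's own: $d(p, o)$ is the distance between points $p$ and $o$, $d_k(p)$ is the k-distance of $p$, and $N_k(p)$ is the set of points other than $p$ that lie within distance $d_k(p)$ of $p$. The parameter $k$ plays the role that MinPts plays in the LOF paper.

$$
\begin{aligned}
\text{reach-dist}_k(p, o) &= \max\{\, d_k(o),\ d(p, o) \,\} \\
\text{lrd}_k(p) &= \left( \frac{1}{|N_k(p)|} \sum_{o \in N_k(p)} \text{reach-dist}_k(p, o) \right)^{-1} \\
\text{LOF}_k(p) &= \frac{1}{|N_k(p)|} \sum_{o \in N_k(p)} \frac{\text{lrd}_k(o)}{\text{lrd}_k(p)}
\end{aligned}
$$

LOF is therefore the average ratio between the density around the neighbors of $p$ and the density around $p$ itself.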

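If the LOF score of a data point p is close to 1, the local density of p is about the same as that of its neighbors.
If the LOF score of p is less than 1, p lies in a relatively dense region and does not look like an outlier.
If the LOF score of p is much greater than 1, p is far from the other points and is very likely an outlier.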
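
To make these three cases concrete, here is a small brute-force sketch of the LOF computation in plain Python. It is an illustration under assumptions, not the PyOD implementation: it uses Euclidean distance, takes exactly k neighbors per point (ties at the k-distance are cut arbitrarily), and assumes no duplicated points. The function name `lof_scores` and the toy points are hypothetical.

```python
import math

def lof_scores(points, k):
    """Brute-force LOF score for every point (Euclidean distance, exactly k neighbors)."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # k nearest neighbors of each point (excluding itself) and its k-distance
    neighbors, k_dist = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbors.append(order[:k])
        k_dist.append(dist[i][order[k - 1]])

    # Local reachability density: inverse of the mean reachability distance,
    # where reach-dist(i, j) = max(k-distance of j, distance between i and j)
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], dist[i][j]) for j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: mean density of the neighbors divided by the point's own density
    return [sum(lrd[j] for j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
            for i in range(n)]

# Toy data (an assumption): a 3x3 grid of points plus one point far away
points = [(x, y) for x in range(3) for y in range(3)] + [(10, 10)]
scores = lof_scores(points, k=3)
for p, s in zip(points, scores):
    print(p, round(s, 2))
```

The grid points should score close to 1 (some slightly below, some slightly above), while the isolated point at (10, 10) should score far above 1.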

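# PyOD is a library for detecting outliers in data. It provides access to more than 20 different detection algorithms; the algorithms below are all run through PyOD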

from pyod.models.lof import LOF

data = pd.read_csv(r'F:教师培训ppd7df_Master_merge_clean.csv',encoding='gb18030')
x = data[data.target.notnull()].drop(columns=['Idx', 'target', 'sample_status', 'ListingInfo'])
y = data[data.target.notnull()]['target']

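def my_lof(x,y):
    # Fit LOF on the features only; the label y is never shown to the detector
    lof = LOF(n_neighbors=20,
                algorithm='auto',
                leaf_size=30,
                metric='minkowski',
                p=2,
                metric_params=None,
                contamination=0.1,
                n_jobs=1,
                novelty=False)
    lof.fit(x)
    # Outlier probability in [0, 1], min-max scaled from the raw outlier scores
    out_pred = lof.predict_proba(x, method='linear')[:,1]
    x = pd.concat([y,x], axis=1)
    x['out_pred'] = out_pred
    # Drop roughly the 7% of samples with the highest outlier probability
    q = x['out_pred'].quantile(0.93)
    y = x[x.out_pred < q]['target']
    x = x[x.out_pred < q].drop(columns=['out_pred','target'])

    return x, y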
    
new_x, new_y = my_lof(x, y)
get_auc(new_x, new_y)
The test AUC improves from the original 0.7414488664969987 to 0.7501722664490398.

3. Isolation Forest outlier detection

Paper: https://ieeexplore.ieee.org/document/4781136
A simple example illustrates the basic idea of Isolation Forest.

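Suppose we have a set of one-dimensional data (as shown in the figure below) and want to split it at random so that point A and point B each end up alone.
First pick a random value x between the minimum and the maximum, then split the data into a left group (< x) and a right group (>= x).
Repeat this step within each of the two groups until the data can no longer be split. Point B is far from the rest of the data, so it can probably be separated with very few cuts.
Point A sits among the other data points, so it will probably take more cuts to separate it.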
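
The thought experiment above can be simulated directly. The sketch below is only an illustration of the random-cut idea, not the Isolation Forest implementation in PyOD: it works on one dimension, follows only the side of each cut that contains the point of interest, and averages over many random trials. The data values and the names `cuts_to_isolate` and `average_cuts` are assumptions made for this example.

```python
import random

def cuts_to_isolate(data, target, rng):
    """Count the random cuts made until `target` is alone in its group."""
    group = list(data)
    cuts = 0
    while len(group) > 1:
        lo, hi = min(group), max(group)
        if lo == hi:                  # only duplicates left, cannot split further
            break
        x = rng.uniform(lo, hi)       # random cut between the minimum and maximum
        if target < x:                # keep only the side that contains the target
            group = [v for v in group if v < x]
        else:
            group = [v for v in group if v >= x]
        cuts += 1
    return cuts

def average_cuts(data, target, trials=2000, seed=0):
    """Average number of cuts over many random trials (seeded for repeatability)."""
    rng = random.Random(seed)
    return sum(cuts_to_isolate(data, target, rng) for _ in range(trials)) / trials

# Toy data (an assumption): a tight cluster containing A, and a distant point B
data = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 9.0]
point_a, point_b = 1.3, 9.0
avg_a = average_cuts(data, point_a)
avg_b = average_cuts(data, point_b)
print("average cuts to isolate A:", avg_a)
print("average cuts to isolate B:", avg_b)
```

B should need close to one cut on average and A several. Isolation Forest turns this average path length, taken over many random trees, into an outlier score: the shorter the path, the more anomalous the point.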

# Isolation Forest outlier detection

from pyod.models.iforest import IForest

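def my_Ifo(x,y):
    # random_state=None: the trees differ from run to run, so results are not exactly reproducible
    ifo = IForest(n_estimators=500,
            max_samples='auto',
            contamination=0.1,
            max_features=1.0,
            bootstrap=False,
            n_jobs=1,
            behaviour='new',
            random_state=None,
            verbose=0)
    ifo.fit(x)
    # Outlier probability in [0, 1], min-max scaled from the raw outlier scores
    out_pred = ifo.predict_proba(x, method='linear')[:,1]
    x = pd.concat([y,x], axis=1)
    x['out_pred'] = out_pred
    # Keep only the samples whose outlier probability is below the fixed cutoff 0.7
    y = x[x.out_pred < 0.7]['target']
    x = x[x.out_pred < 0.7].drop(columns=['out_pred','target'])
    return x, y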

    
new_x, new_y = my_Ifo(x, y)
get_auc(new_x, new_y)
The test AUC improves from the original 0.7414488664969987 to 0.7577292212472797.

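Summary

Comparing the two runs, Isolation Forest works better than LOF on this dataset: the test AUC is 0.7577 versus 0.7502, against a baseline of 0.7414.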
