Summary of Methods for Removing Outliers

Prerequisites:

import pandas as pd
import numpy as np
import os

import seaborn as sns
from pyod.models.mad import MAD
from pyod.models.knn import KNN
from pyod.models.lof import LOF
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest

1. IQR

Removing outliers in Python based on the IQR (interquartile range):

df = pd.read_excel('./7.xlsx')

def fit_model(model, data, column='Area'):
    # fit the model and predict it
    df = data.copy()
    data_to_predict = data[column].to_numpy().reshape(-1, 1)
    predictions = model.fit_predict(data_to_predict)
    df['Predictions'] = predictions
    
    return df

def plot_anomalies(df, x='Date', y='Area'):
    # Predictions holds 0 (normal) or 1 (anomaly);
    # each value is used as an index into the colormap
    categories = df['Predictions'].to_numpy()
    colormap = np.array(['g', 'r'])

    plt.figure(figsize=(12/2.54, 6/2.54))  # 12 cm x 6 cm
    plt.scatter(df[x], df[y], c=colormap[categories])
    plt.xlabel(x)
    plt.ylabel(y)
    plt.xticks(rotation=0)
    plt.show()
    
# IQR
def find_anomalies(value, lower_threshold, upper_threshold):
    # Return 1 if the value lies outside the thresholds, otherwise 0
    if value < lower_threshold or value > upper_threshold:
        return 1
    return 0
    
def area_anomaly_detector(data, column='Area', threshold=1.1):
    # threshold is the IQR multiplier (1.5 is the conventional choice)
    df = data.copy()
    quartiles = dict(data[column].quantile([.25, .50, .75]))
    quartile_3, quartile_1 = quartiles[0.75], quartiles[0.25]
    iqr = quartile_3 - quartile_1

    lower_threshold = quartile_1 - (threshold * iqr)
    upper_threshold = quartile_3 + (threshold * iqr)

    print(f"Lower threshold: {lower_threshold}, \nUpper threshold: {upper_threshold}\n")

    df['Predictions'] = data[column].apply(find_anomalies, args=(lower_threshold, upper_threshold))
    return df

area_df = area_anomaly_detector(df)
plot_anomalies(area_df)

(Outliers are shown in red.)
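
The functions above only flag outliers; they do not remove them. As a minimal sketch of the removal step, the example below drops rows that fall outside the IQR fences. The toy DataFrame, the function name remove_iqr_outliers, and the multiplier k=1.5 are assumptions made for illustration and are not part of the original code.

```python
import pandas as pd

# Hypothetical toy data standing in for the 'Area' column of the real dataset.
toy = pd.DataFrame({"Area": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 100.0]})

def remove_iqr_outliers(data, column="Area", k=1.5):
    """Return a copy of `data` without rows outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = data[column].quantile(0.25)
    q3 = data[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    mask = data[column].between(lower, upper)  # bounds are inclusive
    return data.loc[mask].reset_index(drop=True)

cleaned = remove_iqr_outliers(toy)
print(cleaned["Area"].tolist())
```

On this toy data the fences are 6.5 and 20.5, so only the value 100 is dropped.
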
2. Isolation Forest

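Isolation Forest is an unsupervised learning algorithm built on decision trees. It randomly selects a feature from the given feature set and then randomly selects a split value between that feature's minimum and maximum in order to isolate outliers. Because of this random partitioning, anomalous points end up with shorter paths in the trees, which separates them from the rest of the data. Using random forests in this way is an efficient approach to outlier detection on high-dimensional datasets.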
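
To make the "shorter path" idea concrete, here is a small sketch on synthetic data (the data and variable names are assumptions for illustration). It appends one obvious outlier to a cluster of normal values and checks which point receives the lowest score from score_samples, where a lower score means a shorter average path and therefore a more anomalous point.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
# 200 inliers around 50, plus one obvious outlier appended at the end.
values = np.concatenate([rng.normal(loc=50.0, scale=2.0, size=200), [120.0]])
X = values.reshape(-1, 1)

forest = IsolationForest(n_estimators=100, random_state=0).fit(X)

# score_samples returns the negated anomaly score: lower means more anomalous.
scores = forest.score_samples(X)
most_anomalous = int(np.argmin(scores))
print(most_anomalous, values[most_anomalous])
```
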

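Model parameters:
n_estimators: the number of base estimators (trees) in the ensemble, i.e. the number of trees in the isolation forest. It is a tunable integer parameter with a default of 100.
max_samples: the number of samples used to train each base estimator. If max_samples is larger than the number of samples, all samples are used for every tree. The default is 'auto', in which case max_samples=min(256, n_samples).
contamination: the expected proportion of outliers in the dataset, used to define the threshold on the sample scores when fitting. The algorithm is very sensitive to this parameter. The default is 'auto', in which case the threshold is determined as in the original Isolation Forest paper.
max_features: the number of features drawn from the full feature set to train each base estimator (tree), so a base estimator need not be trained on all features. The default is 1.0, which means all features are used.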

model = IsolationForest(n_estimators=50, max_samples='auto', contamination=0.1, max_features=1.0)
model.fit(df[['Area']])
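
Since the result is sensitive to contamination, it can help to see how the number of flagged points changes with it. The sketch below uses synthetic univariate data (an assumption, not the original dataset) and a fixed random_state so that the same trees are built for each setting.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X = rng.normal(size=(500, 1))  # hypothetical univariate data

flagged = {}
for contamination in (0.01, 0.05, 0.10):
    model = IsolationForest(n_estimators=50, contamination=contamination,
                            random_state=42)
    labels = model.fit_predict(X)  # -1 = outlier, 1 = inlier
    flagged[contamination] = int((labels == -1).sum())

print(flagged)
```

The flagged count grows roughly in proportion to contamination, which is why the value should reflect a realistic estimate of the outlier share rather than an arbitrary default.
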

Implementation:
Use the package directly: from sklearn.ensemble import IsolationForest

iso_forest = IsolationForest(n_estimators=125)
iso_df = fit_model(iso_forest, df)
iso_df['Predictions'] = iso_df['Predictions'].map(lambda x: 1 if x==-1 else 0)
plot_anomalies(iso_df)

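3. MAD (Median Absolute Deviation)
The median absolute deviation is the median of the absolute differences between each observation and the median of all observations.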
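
As a worked example of this definition, the sketch below computes the MAD and the modified z-score, 0.6745 * |x - median| / MAD, with plain NumPy. The sample values and the function name modified_z_scores are assumptions for illustration; 3.5 is the threshold commonly used with this score.

```python
import numpy as np

def modified_z_scores(values):
    """Modified z-score: 0.6745 * |x - median| / MAD."""
    x = np.asarray(values, dtype=float)
    median = np.median(x)
    abs_dev = np.abs(x - median)
    mad = np.median(abs_dev)       # median absolute deviation
    return 0.6745 * abs_dev / mad  # assumes mad > 0

data = [10, 11, 12, 13, 14, 15, 16, 100]  # hypothetical sample
scores = modified_z_scores(data)
is_outlier = scores > 3.5
print(np.asarray(data)[is_outlier])
```

Here the median is 13.5 and the MAD is 2.0, so the value 100 gets a score of about 29.2 and is the only point above the threshold.
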

Implementation: use from pyod.models.mad import MAD

# MAD
# threshold : float, optional (default=3.5)
#     The modified z-score to use as a threshold. Observations with
#     a modified z-score (based on the median absolute deviation) greater
#     than this value will be classified as outliers.
mad_model = MAD()
mad_df = fit_model(mad_model, df)
plot_anomalies(mad_df)

[Figure: Z-score distribution (reproduced from Wikipedia)]

The MAD algorithm:

def _mad(self, X):
    """
    Apply the robust median absolute deviation (MAD)
    to measure the distances of data points from the median.

    Returns
    -------
    numpy array containing modified Z-scores of the observations.
    The greater the score, the greater the outlierness.
    """
    obs = np.reshape(X, (-1, 1))
    # `self.median` will be None only before `fit()` is called
    self.median = np.nanmedian(obs) if self.median is None else self.median
    diff = np.abs(obs - self.median)
    self.median_diff = np.median(diff) if self.median_diff is None else self.median_diff
    return np.nan_to_num(np.ravel(0.6745 * diff / self.median_diff))

Implementing MAD in R:

mad(x, center = median(x), constant = 1.4826, na.rm = FALSE,
    low = FALSE, high = FALSE)

Parameter reference: https://www.math.ucla.edu/~anderson/rw1001/library/base/html/mad.html
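
R's default constant = 1.4826 and the factor 0.6745 in the _mad method above are two views of the same scaling, since 1 / 1.4826 is approximately 0.6745. The sketch below is a Python approximation of R's mad(x) with default arguments (the function name r_style_mad and the sample data are assumptions for illustration), and it checks that both forms give the same scores up to rounding of the constants.

```python
import numpy as np

def r_style_mad(values, constant=1.4826):
    """Python sketch of R's mad(x) with the default center and constant."""
    x = np.asarray(values, dtype=float)
    return constant * np.median(np.abs(x - np.median(x)))

x = np.array([10, 11, 12, 13, 14, 15, 16, 100], dtype=float)
abs_dev = np.abs(x - np.median(x))

z_python = 0.6745 * abs_dev / np.median(abs_dev)  # form used in the _mad method
z_r = abs_dev / r_style_mad(x)                    # deviation divided by R's scaled MAD

print(np.allclose(z_python, z_r, rtol=1e-4))
```
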

Summary:
PyOD is a comprehensive, scalable Python toolkit for detecting outliers; its models can be called directly.
PyOD repository: https://github.com/yzhao062/pyod

Usage, for example with the MAD model:

# train the MAD detector
from pyod.models.mad import MAD
clf = MAD()
clf.fit(X_train)

# get outlier scores
y_train_scores = clf.decision_scores_  # raw outlier scores on the train data
y_test_scores = clf.decision_function(X_test)  # predict raw outlier scores on test

References:
A walkthrough of Univariate Anomaly Detection in Python (a very good learning resource): https://www.analyticsvidhya.com/blog/2021/06/univariate-anomaly-detection-a-walkthrough-in-python/

Isolation Forest algorithm: https://blog.csdn.net/ChenVast/article/details/82863750

Summary of outlier detection with the pyod package: https://blog.csdn.net/weixin_43822124/article/details/112523303

The pyod package: https://blog.csdn.net/weixin_41697507/article/details/89408236?spm=1001.2101.3001.6650.2&utm_medium=distribute.pc_relevant.none-task-blog-2%7Edefault%7ECTRLIST%7ERate-2.pc_relevant_default&depth_1-utm_source=distribute.pc_relevant.none-task-blog-2%7Edefault%7ECTRLIST%7ERate-2.pc_relevant_default&utm_relevant_index=2
