Hypothesis function:
Classification requires $0 \leq h_\theta(x) \leq 1$, whereas the linear-regression hypothesis can produce $h_\theta(x) \gg 1$ or $h_\theta(x) \ll 0$.
Converting the linear hypothesis into a classification hypothesis:
linear $h_\theta(x) \stackrel{\text{sigmoid}}{\longrightarrow}$ classification $h_\theta(x)$
Derivation:
sigmoid: $g(z)=\frac{1}{1+e^{-z}} \stackrel{z=h_\theta(x)=\theta^Tx}{\longrightarrow} \frac{1}{1+e^{-\theta^Tx}}$
Old: $h_\theta(x)=z=\theta^Tx=\theta_0+\theta_1x_1+\theta_2x_2+\dots+\theta_nx_n$
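The linear form above is just a dot product plus a bias term; a minimal sketch with made-up numbers (`theta0`, `theta`, and `x` here are arbitrary illustrations, not values from any dataset):

```python
import numpy as np

# hypothetical parameters and input, purely for illustration
theta0 = 1.0                   # bias term theta_0
theta = np.array([2.0, -1.0])  # theta_1, theta_2
x = np.array([3.0, 4.0])       # features x_1, x_2

# h_theta(x) = theta_0 + theta_1*x_1 + theta_2*x_2 = theta^T x + theta_0
z = theta0 + np.dot(theta, x)
print(z)  # 1 + 2*3 + (-1)*4 = 3.0
```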
New: $h_\theta(x)=P(y=1\mid x;\theta)=\frac{1}{1+e^{-z}}$
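The sigmoid mapping is easy to verify numerically; a small sketch (the `sigmoid` helper simply mirrors the formula above):

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^{-z}), squashes any real z into (0, 1)
    return 1 / (1 + np.exp(-z))

print(sigmoid(0))           # 0.5: the midpoint, where the model is undecided
print(sigmoid(10) > 0.99)   # large positive z -> probability near 1
print(sigmoid(-10) < 0.01)  # large negative z -> probability near 0
```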
Decision boundary:
When the new $h_\theta(x)=g(z)\geq 0.5$, predict $y=1 \Rightarrow z\geq 0 \Rightarrow \theta^Tx\geq 0 \Rightarrow$ old $h_\theta(x)\geq 0$.
When the new $h_\theta(x)=g(z)\leq 0.5$, predict $y=0 \Rightarrow z\leq 0 \Rightarrow \theta^Tx\leq 0 \Rightarrow$ old $h_\theta(x)\leq 0$.
The old $h_\theta(x)=0$ is the decision boundary.
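The equivalence between $g(z)\geq 0.5$ and $z\geq 0$ can be checked directly; a minimal sketch:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.array([-2.0, 0.0, 3.0])
probs = sigmoid(z)
# predict y = 1 exactly when g(z) >= 0.5, i.e. exactly when z >= 0
labels = (probs >= 0.5).astype(int)
print(labels)  # [0 1 1]
```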
Cost function:
The original cost function:
The squared-error cost cannot be reused, because
$$J(\theta)=\frac{1}{m}\sum\limits_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)^2$$
with $0\leq h_\theta(x^{(i)})\leq 1$ and $y=0$ or $1$ has too many local optima — it is not a proper convex function.
New cost function, designed by working backwards from the desired behavior:
$$cost(h_\theta(x),y)=\begin{cases}-\log(h_\theta(x)) & y=1\\[2pt] -\log(1-h_\theta(x)) & y=0\end{cases}$$
Explanation:
Left plot, $y=1$: the cost drives $h_\theta(x)\rightarrow 1$; if instead $h_\theta(x)\rightarrow 0$, then $cost\rightarrow+\infty$.
Right plot, $y=0$: the cost drives $h_\theta(x)\rightarrow 0$; if instead $h_\theta(x)\rightarrow 1$, then $cost\rightarrow+\infty$.
Unified cost function:
$$cost(h_\theta(x),y)=-y\log(h_\theta(x))-(1-y)\log(1-h_\theta(x))$$
When $y=1$, only the first term of the formula survives; when $y=0$, only the second term survives.
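The behavior of the unified cost can be sketched directly from the formula (the `cross_entropy` name is my own, not from the code below):

```python
import numpy as np

def cross_entropy(h, y):
    # cost(h, y) = -y*log(h) - (1 - y)*log(1 - h)
    return -y * np.log(h) - (1 - y) * np.log(1 - h)

# y = 1: a confident, correct prediction costs almost nothing
print(cross_entropy(0.99, 1) < 0.02)  # True
# y = 1: a confident, wrong prediction is punished heavily
print(cross_entropy(0.01, 1) > 4)     # True
# y = 0: the mirror image
print(cross_entropy(0.01, 0) < 0.02)  # True
```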
Overall cost function: the average of the per-example costs
$$\begin{aligned} J(\theta) &=\frac{1}{m}\sum\limits_{i=1}^m cost(h_\theta(x^{(i)}),y^{(i)}) \\ &=\frac{1}{m}\sum\limits_{i=1}^m\left[-y^{(i)}\log(h_\theta(x^{(i)}))-(1-y^{(i)})\log(1-h_\theta(x^{(i)}))\right] \end{aligned}$$
Batch gradient descent:
Repeat until convergence {
$\theta_0:=\theta_0-\alpha\frac{\partial J(\theta)}{\partial\theta_0}$
$\theta_j:=\theta_j-\alpha\frac{\partial J(\theta)}{\partial\theta_j}$
}
Taking the partial derivatives, this becomes:
Repeat until convergence {
$\theta_0:=\theta_0-\alpha\frac{1}{m}\sum\limits_{i=1}^m\left[h_\theta(x^{(i)})-y^{(i)}\right]$
$\theta_j:=\theta_j-\alpha\frac{1}{m}\sum\limits_{i=1}^m\left[h_\theta(x^{(i)})-y^{(i)}\right]x_j^{(i)}$
}
Note: batch gradient descent must update all $\theta_j$ simultaneously.
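The update rules above rest on the gradient $\frac{\partial J(\theta)}{\partial\theta_j}=\frac{1}{m}\sum_i\left[h_\theta(x^{(i)})-y^{(i)}\right]x_j^{(i)}$; a sketch that checks this analytic gradient against central finite differences on random data (everything here is synthetic, not the spambase set):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost(theta, X, y):
    # J(theta) = (1/m) * sum of the cross-entropy terms
    h = sigmoid(X @ theta)
    m = len(y)
    return (1 / m) * np.sum(-y * np.log(h) - (1 - y) * np.log(1 - h))

def grad(theta, X, y):
    # analytic gradient: (1/m) * X^T (h - y)
    h = sigmoid(X @ theta)
    m = len(y)
    return (1 / m) * (X.T @ (h - y))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = (rng.random(20) > 0.5).astype(float)
theta = rng.normal(size=3)

# numerical gradient via central differences
eps = 1e-6
numeric = np.zeros_like(theta)
for j in range(len(theta)):
    e = np.zeros_like(theta)
    e[j] = eps
    numeric[j] = (cost(theta + e, X, y) - cost(theta - e, X, y)) / (2 * eps)

print(np.allclose(numeric, grad(theta, X, y), atol=1e-5))  # True
```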
Dataset
Spam classification
Download the spambase.data file. This is a binary classification problem: spam or non-spam. This experiment uses only the first 3 columns as features and the last column as the target: a value of 1 in the last column means spam, 0 means non-spam.
Code
from mpl_toolkits.mplot3d import Axes3D
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

matplotlib.rcParams['font.size'] = 20


class DataSet(object):
    """
    X_train: training samples
    y_train: training targets
    X_test: test samples
    y_test: test targets
    """
    def __init__(self, X_train, y_train, X_test, y_test):
        self.X_train = X_train
        self.y_train = y_train
        self.X_test = X_test
        self.y_test = y_test


class LogisticRegression(object):
    """
    Logistic regression.
    """
    def __init__(self, n_feature):
        self.theta0 = 0
        self.theta = np.zeros((n_feature, 1))

    def sigmoid(self, z):
        return 1 / (1 + np.exp(-z))

    def gradient_descent(self, X, y, alpha=0.001, num_iter=100):
        costs = []
        m, _ = X.shape
        for i in range(num_iter):
            # prediction
            h = self.sigmoid(np.dot(X, self.theta) + self.theta0)
            # cross-entropy cost
            cost = (1 / m) * np.sum(-y * np.log(h) - (1 - y) * (np.log(1 - h)))
            costs.append(cost)
            # gradients
            dJ_dtheta0 = (1 / m) * np.sum(h - y)
            dJ_dtheta = (1 / m) * np.dot((h - y).T, X).T
            # update all theta simultaneously
            self.theta0 = self.theta0 - alpha * dJ_dtheta0
            self.theta = self.theta - alpha * dJ_dtheta
        return costs

    def show_train(self, costs, num_iter):
        """
        Plot the training process.
        """
        fig = plt.figure(figsize=(10, 6))
        plt.plot(np.arange(num_iter), costs)
        plt.title("Cost history")
        plt.xlabel("Iteration")
        plt.ylabel("Cost")
        plt.show()

    def hypothesis(self, X, theta0, theta):
        """
        Hypothesis: probabilities thresholded at 0.5 into class labels.
        """
        h0 = self.sigmoid(theta0 + np.dot(X, theta))
        h = [1 if elem > 0.5 else 0 for elem in h0]
        return np.array(h)[:, np.newaxis]


def read_data():
    """
    Read the data.
    """
    # header=None: the file has no header row
    # sep: delimiter
    # skipinitialspace: skip spaces after the delimiter
    # comment: ignore anything after a tab
    # na_values: treat "?" as NA
    origin_data = pd.read_csv("./data/spambase.data", header=None, sep=",",
                              skipinitialspace=True, comment="\t", na_values="?")
    data = origin_data.copy()
    # tail() prints the last n rows
    print(data.tail())
    return data


def clean_data(data):
    """
    Clean the data: handle missing values.
    """
    # check whether the dataset contains NA values
    # isna() requires pandas 0.22.0+; upgrade with: pip install --upgrade pandas==0.22.0
    print('NA counts per column:', data.isna().sum())
    # drop rows with missing values
    cleaned_data = data.dropna()
    return cleaned_data


def show_data(data):
    """
    Show the class balance of the dataset.
    """
    count_spam = 0
    count_non_spam = 0
    for c in data.iloc[:, -1]:
        if c == 1:
            count_spam += 1
        else:
            count_non_spam += 1
    print("Number of spam emails:", count_spam)
    print("Number of non-spam emails:", count_non_spam)


def split_data(data):
    """
    Split the data into train and test sets: train fits the hypothesis,
    test measures how well the hypothesis generalizes.
    """
    copied_data = data.copy()
    # frac: fraction of rows to sample; random_state: random seed
    train_dataset = copied_data.sample(frac=0.8, random_state=1)
    # the remaining rows form the test set
    test_dataset = copied_data.drop(train_dataset.index)
    X_train = train_dataset.iloc[:, 0:3]
    y_train = train_dataset.iloc[:, -1]
    X_test = test_dataset.iloc[:, 0:3]
    y_test = test_dataset.iloc[:, -1]
    dataset = DataSet(X_train, y_train, X_test, y_test)
    return dataset


def evaluate_model(y_test, h):
    """
    Evaluate the model.
    """
    # MSE: mean squared error (for 0/1 labels this equals the misclassification rate)
    print("MSE: %f" % (np.sum((h - y_test) ** 2) / len(y_test)))
    # RMSE: root mean squared error
    print("RMSE: %f" % (np.sqrt(np.sum((h - y_test) ** 2) / len(y_test))))


def show_result(X_test, y_test, h):
    # figure canvas
    fig = plt.figure(figsize=(16, 8), facecolor='w')
    # adjust subplot spacing
    plt.subplots_adjust(left=0.05, right=0.95, bottom=0.05, top=0.9)
    # 121: nrows=1, ncols=2, index=1
    ax = fig.add_subplot(121, projection='3d')
    ax.set_title("y_test")
    ax.set_xlabel('Feature 1')
    ax.set_ylabel('Feature 2')
    ax.set_zlabel('Feature 3')
    # x, y, z, c (color), marker (shape)
    ax.scatter(X_test[:, 0], X_test[:, 1], X_test[:, 2], c=y_test, marker='o')
    plt.grid(True)
    ax1 = fig.add_subplot(122, projection='3d')
    ax1.set_title("h")
    ax1.set_xlabel('Feature 1')
    ax1.set_ylabel('Feature 2')
    ax1.set_zlabel('Feature 3')
    # x, y, z, c (color), marker (shape)
    ax1.scatter(X_test[:, 0], X_test[:, 1], X_test[:, 2], c=h, marker='*')
    plt.grid(True)
    plt.show()


def main():
    # read the data
    data = read_data()
    # clean the data
    cleaned_data = clean_data(data)
    # mean normalization: the first 3 columns have similar ranges, so it is skipped here
    # show the data
    show_data(cleaned_data)
    # split the data
    dataset = split_data(cleaned_data)
    # build the model
    _, n = dataset.X_train.shape
    logistic_regression = LogisticRegression(n)
    num_iteration = 300
    costs = logistic_regression.gradient_descent(dataset.X_train, dataset.y_train.values[:, np.newaxis],
                                                 alpha=0.5, num_iter=num_iteration)
    # show the training process
    logistic_regression.show_train(costs, num_iteration)
    # evaluate the model
    h = logistic_regression.hypothesis(dataset.X_test, logistic_regression.theta0, logistic_regression.theta)
    evaluate_model(dataset.y_test.values[:, np.newaxis], h)
    # show the result
    show_result(dataset.X_test.values, dataset.y_test.values, h.ravel())


if __name__ == '__main__':
    main()



