
ML From Scratch / Logistic Regression / Binary Classification



Principle

Prediction function:

A classifier needs $0 \leq h_\theta(x) \leq 1$, while the traditional (linear-regression) hypothesis can return $h_\theta(x) \gg 1$ or $h_\theta(x) \ll 0$.

Converting the traditional prediction function into a classification prediction function:

traditional $h_\theta(x)$ $\stackrel{\text{sigmoid}}{\longrightarrow}$ classification $h_\theta(x)$

Process:
sigmoid: $g(z)=\frac{1}{1+e^{-z}} \stackrel{z=h_\theta(x)=\theta^Tx}{\longrightarrow} \frac{1}{1+e^{-\theta^Tx}}$
old: $h_\theta(x)=z=\theta^Tx=\theta_0+\theta_1x_1+\theta_2x_2+\dots+\theta_nx_n$
new: $h_\theta(x)=P(y=1\mid x;\theta)=\frac{1}{1+e^{-z}}$
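The old-to-new conversion above can be checked numerically; a minimal sketch (NumPy, parameter values are illustrative):

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^{-z}) squashes any real z into (0, 1)
    return 1 / (1 + np.exp(-z))

# theta = [theta_0, theta_1]; x carries a leading 1 for the intercept
theta = np.array([-1.0, 2.0])
x = np.array([1.0, 3.0])  # x_0 = 1, x_1 = 3
z = theta @ x             # old hypothesis: theta^T x = 5
p = sigmoid(z)            # new hypothesis: P(y=1 | x; theta), inside (0, 1)
print(z, p)
```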

Decision boundary:

When new $h_\theta(x)=g(z)\geq 0.5$, predict $y=1 \Rightarrow z\geq 0 \Rightarrow \theta^Tx\geq 0 \Rightarrow$ old $h_\theta(x)\geq 0$
When new $h_\theta(x)=g(z)<0.5$, predict $y=0 \Rightarrow z<0 \Rightarrow \theta^Tx<0 \Rightarrow$ old $h_\theta(x)<0$
The set where old $h_\theta(x)=0$ is the decision boundary.
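The chain of implications can be verified on a toy boundary; a short sketch with made-up parameters (the boundary here is $x_1+x_2-3=0$):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# illustrative parameters theta_0, theta_1, theta_2: boundary x1 + x2 - 3 = 0
theta = np.array([-3.0, 1.0, 1.0])

def predict(x1, x2):
    z = theta @ np.array([1.0, x1, x2])   # old h_theta(x) = theta^T x
    return 1 if sigmoid(z) >= 0.5 else 0  # y = 1 exactly when z >= 0

print(predict(2, 2))  # z = 1 >= 0 -> class 1
print(predict(1, 1))  # z = -1 < 0 -> class 0
```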

Cost function:
The old cost function:

The old squared-error cost cannot be reused: in $J(\theta)=\frac{1}{m}\sum\limits_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2$ we now have $0\leq h_\theta(x^{(i)})\leq 1$ and $y=0$ or $1$; substituting the sigmoid makes $J(\theta)$ non-convex, so gradient descent can get stuck in one of its many local optima.

New cost function, built by working backwards from the desired behavior:
$$\mathrm{cost}(h_\theta(x),y)=\begin{cases} -\log(h_\theta(x)) & y=1 \\ -\log(1-h_\theta(x)) & y=0 \end{cases}$$

Explanation

Left plot, $y=1$: as $h_\theta(x)\rightarrow 1$ the cost goes to $0$; forcing $h_\theta(x)\rightarrow 0$ makes $\mathrm{cost}\rightarrow+\infty$.
Right plot, $y=0$: as $h_\theta(x)\rightarrow 0$ the cost goes to $0$; forcing $h_\theta(x)\rightarrow 1$ makes $\mathrm{cost}\rightarrow+\infty$.

Unified cost function:

$$\mathrm{cost}(h_\theta(x),y)=-y\log(h_\theta(x))-(1-y)\log(1-h_\theta(x))$$
When $y=1$ only the first term survives; when $y=0$ only the second term survives.
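A quick numeric check of the unified formula (the probability values are illustrative):

```python
import numpy as np

def cost(h, y):
    # unified per-sample cost: -y*log(h) - (1-y)*log(1-h)
    return -y * np.log(h) - (1 - y) * np.log(1 - h)

# y = 1 keeps only the first term, y = 0 only the second
print(cost(0.9, 1))    # small: prediction agrees with the label
print(cost(0.9, 0))    # larger: prediction disagrees with the label
print(cost(1e-10, 1))  # confident wrong prediction -> cost blows up
```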

Overall cost function (the average of the per-sample costs; this is the cross-entropy cost, not least squares):
$$\begin{aligned} J(\theta) &=\frac{1}{m}\sum\limits_{i=1}^m \mathrm{cost}(h_\theta(x^{(i)}),y^{(i)}) \\ &=\frac{1}{m}\sum\limits_{i=1}^m\left[-y^{(i)}\log(h_\theta(x^{(i)}))-(1-y^{(i)})\log(1-h_\theta(x^{(i)}))\right] \end{aligned}$$

Batch gradient descent:

Repeat until convergence {
$\theta_0:=\theta_0-\alpha\frac{\partial{J(\theta)}}{\partial{\theta_0}}$
$\theta_j:=\theta_j-\alpha\frac{\partial{J(\theta)}}{\partial{\theta_j}}$
}
Expanding the partial derivatives gives:
Repeat until convergence {
$\theta_0:=\theta_0-\alpha\frac{1}{m}\sum\limits_{i=1}^m[h_\theta(x^{(i)})-y^{(i)}]$
$\theta_j:=\theta_j-\alpha\frac{1}{m}\sum\limits_{i=1}^m[h_\theta(x^{(i)})-y^{(i)}]x_j^{(i)}$
}
Note: batch gradient descent must update all $\theta_j$ simultaneously.
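One vectorized gradient step matching the update rules above, on toy data (all names and values are illustrative, not part of the spam experiment):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gradient_step(theta0, theta, X, y, alpha):
    # compute every gradient first, then update all parameters at once
    m = len(y)
    h = sigmoid(X @ theta + theta0)      # predictions, shape (m,)
    d_theta0 = (1 / m) * np.sum(h - y)   # gradient w.r.t. the intercept
    d_theta = (1 / m) * (X.T @ (h - y))  # gradient w.r.t. each theta_j
    return theta0 - alpha * d_theta0, theta - alpha * d_theta

# toy data: 4 samples, 2 features, linearly separable
X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 2.0], [3.0, 1.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta0, theta = 0.0, np.zeros(2)
for _ in range(1000):
    theta0, theta = gradient_step(theta0, theta, X, y, alpha=0.1)
preds = (sigmoid(X @ theta + theta0) >= 0.5).astype(int)
print(preds)
```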

Dataset: distinguishing spam

Address: download the spambase.data file. This is a binary classification problem: spam or non-spam. This experiment uses only the first 3 columns as features and the last column as the target: a value of 1 means spam, 0 means non-spam.

Code

from mpl_toolkits.mplot3d import Axes3D
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

matplotlib.rcParams['font.family'] = 'STSong'
matplotlib.rcParams['font.size'] = 20


class DataSet(object):
    """
    X_train 训练集样本
    y_train 训练集样本值
    X_test 测试集样本
    y_test 测试集样本值
    """

    def __init__(self, X_train, y_train, X_test, y_test):
        self.X_train = X_train
        self.y_train = y_train
        self.X_test = X_test
        self.y_test = y_test


class LogisticRegression(object):
    """
    逻辑回归
    """

    def __init__(self, n_feature):
        self.theta0 = 0
        self.theta = np.zeros((n_feature, 1))

    def sigmoid(self, z):
        return 1 / (1 + np.exp(-z))

    def gradient_descent(self, X, y, alpha=0.001, num_iter=100):
        costs = []
        m, _ = X.shape
        for i in range(num_iter):
            # predicted probabilities
            h = self.sigmoid(np.dot(X, self.theta) + self.theta0)
            # cross-entropy cost
            cost = (1 / m) * np.sum(-y * np.log(h) - (1 - y) * (np.log(1 - h)))
            costs.append(cost)
            # gradients
            dJ_dtheta0 = (1 / m) * np.sum(h - y)
            dJ_dtheta = (1 / m) * np.dot((h - y).T, X).T
            # update all theta simultaneously
            self.theta0 = self.theta0 - alpha * dJ_dtheta0
            self.theta = self.theta - alpha * dJ_dtheta

        return costs

    def show_train(self, costs, num_iter):
        """
        展示训练过程
        """
        fig = plt.figure(figsize=(10, 6))
        plt.plot(np.arange(num_iter), costs)
        plt.title("成本变化")
        plt.xlabel("迭代次数")
        plt.ylabel("成本")
        plt.show()

    def hypothesis(self, X, theta0, theta):
        """
        Prediction function: threshold the sigmoid output at 0.5
        """
        h0 = self.sigmoid(theta0 + np.dot(X, theta))
        h = [1 if elem > 0.5 else 0 for elem in h0]
        return np.array(h)[:, np.newaxis]


def read_data():
    """
    读取数据
    """
    # names:表头
    # sep:分割符
    # skipinitialspace:忽略分割符后的空格
    # comment:忽略t后的注释
    # na_values:使用?替换NA的值
    origin_data = pd.read_csv("./data/spambase.data", sep=",", skipinitialspace=True, comment="t", na_values="?")
    data = origin_data.copy()
    # tail()打印最后n行数据
    print(data.tail())
    return data


def clean_data(data):
    """
    清洗数据:处理异常值
    """
    # dataset是否含有NA数据
    # pandas 0.22.0+才有isna(),升级命令:pip install --upgrade pandas==0.22.0
    print('NA行数:', data.isna().sum())
    # 删除异常行
    cleaned_data = data.dropna()
    return cleaned_data


def show_data(data):
    """
     展示数据情况
    """
    count_spam = 0
    count_non_spam = 0
    for c in data.iloc[:, -1]:
        if c == 1:
            count_spam += 1
        else:
            count_non_spam += 1

    print("垃圾邮件个数:", count_spam)
    print("正常邮件个数:", count_non_spam)


def split_data(data):
    """
    划分数据
    划分为train, test;train用来训练得出预测函数,test用来测试预测函数值的泛化能力
    """
    copied_data = data.copy()
    # frac:抽取行的比例;random_state 随机种子
    train_dataset = copied_data.sample(frac=0.8, random_state=1)
    # 取剩下的测试集
    test_dataset = copied_data.drop(train_dataset.index)

    X_train = train_dataset.iloc[:, 0:3]
    y_train = train_dataset.iloc[:, -1]
    X_test = test_dataset.iloc[:, 0:3]
    y_test = test_dataset.iloc[:, -1]
    dataset = DataSet(X_train, y_train, X_test, y_test)

    return dataset


def evaluate_model(y_test, h):
    """
    评估模型
    """
    # MSE:均方差
    print("MSE: %f" % (np.sum((h - y_test) ** 2) / len(y_test)))
    # RMSE:均方根差
    print("RMSE: %f" % (np.sqrt(np.sum((h - y_test) ** 2) / len(y_test))))

def show_result(X_test, y_test, h):
    # create the figure canvas
    fig = plt.figure(figsize=(16, 8), facecolor='w')
    # adjust the subplot margins
    plt.subplots_adjust(left=0.05, right=0.95, bottom=0.05, top=0.9)

    # 121: nrows=1, ncols=2, index=1
    ax = fig.add_subplot(121, projection='3d')
    ax.set_title("y_test")
    ax.set_xlabel('Feature 1')
    ax.set_ylabel('Feature 2')
    ax.set_zlabel('Feature 3')
    # x, y, z, c (color), marker (point shape)
    ax.scatter(X_test[:, 0], X_test[:, 1], X_test[:, 2], c=y_test, marker='o')
    plt.grid(True)

    ax1 = fig.add_subplot(122, projection='3d')
    ax1.set_title("h")
    ax1.set_xlabel('Feature 1')
    ax1.set_ylabel('Feature 2')
    ax1.set_zlabel('Feature 3')
    # x, y, z, c (color), marker (point shape)
    ax1.scatter(X_test[:, 0], X_test[:, 1], X_test[:, 2], c=h, marker='*')
    plt.grid(True)

    plt.show()

def main():
    # read the data
    data = read_data()
    # clean the data
    cleaned_data = clean_data(data)
    # mean normalization: the first 3 columns have similar ranges, so it is skipped here
    # show the class balance
    show_data(cleaned_data)
    # split the data
    dataset = split_data(cleaned_data)
    # build the model
    _, n = dataset.X_train.shape
    logistic_regression = LogisticRegression(n)
    num_iteration = 300
    costs = logistic_regression.gradient_descent(dataset.X_train, dataset.y_train.values[:, np.newaxis], alpha=0.5,
                                                 num_iter=num_iteration)
    # plot the training process
    logistic_regression.show_train(costs, num_iteration)
    # evaluate the model
    h = logistic_regression.hypothesis(dataset.X_test, logistic_regression.theta0, logistic_regression.theta)
    evaluate_model(dataset.y_test.values[:, np.newaxis], h)
    # show the results
    show_result(dataset.X_test.values, dataset.y_test.values, h.ravel())


if __name__ == '__main__':
    main()
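The script above scores the classifier with MSE/RMSE; for a binary classifier, accuracy is a common complementary metric. A minimal sketch (the `accuracy` helper is hypothetical, not part of the original script, and the labels below are made up):

```python
import numpy as np

def accuracy(y_true, y_pred):
    # fraction of samples where the predicted class matches the label
    return np.mean(y_true == y_pred)

# illustrative labels and predictions: 4 of 5 agree
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0])
print("Accuracy: %f" % accuracy(y_true, y_pred))
```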
