Hypothesis function:
Classification requires $0 \leq h_\theta(x) \leq 1$, whereas the linear-regression hypothesis can produce $h_\theta(x) \gg 1$ or $h_\theta(x) \ll 0$.
Converting the linear hypothesis into a classification hypothesis:
linear $h_\theta(x) \stackrel{\text{sigmoid}}{\longrightarrow}$ classification $h_\theta(x)$
Derivation:
sigmoid: $g(z)=\frac{1}{1+e^{-z}} \stackrel{z=h_\theta(x)=\theta^Tx}{\longrightarrow} \frac{1}{1+e^{-\theta^Tx}}$
Old: $h_\theta(x)=z=\theta^Tx=\theta_0+\theta_1x_1+\theta_2x_2+\dots+\theta_nx_n$
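The linear form above is just a dot product plus a bias term; a minimal sketch with made-up numbers (`theta0`, `theta`, and `x` here are arbitrary illustrations, not values from any dataset):

```python
import numpy as np

# hypothetical parameters and input, purely for illustration
theta0 = 1.0                   # bias term theta_0
theta = np.array([2.0, -1.0])  # theta_1, theta_2
x = np.array([3.0, 4.0])       # features x_1, x_2

# h_theta(x) = theta_0 + theta_1*x_1 + theta_2*x_2 = theta^T x + theta_0
z = theta0 + np.dot(theta, x)
print(z)  # 1 + 2*3 + (-1)*4 = 3.0
```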
New: $h_\theta(x)=P(y=1\mid x;\theta)=\frac{1}{1+e^{-z}}$
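The sigmoid mapping is easy to verify numerically; a small sketch (the `sigmoid` helper simply mirrors the formula above):

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^{-z}), squashes any real z into (0, 1)
    return 1 / (1 + np.exp(-z))

print(sigmoid(0))           # 0.5: the midpoint, where the model is undecided
print(sigmoid(10) > 0.99)   # large positive z -> probability near 1
print(sigmoid(-10) < 0.01)  # large negative z -> probability near 0
```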
Decision boundary:
When the new $h_\theta(x)=g(z)\geq 0.5$, predict $y=1 \Rightarrow z\geq 0 \Rightarrow \theta^Tx\geq 0 \Rightarrow$ old $h_\theta(x)\geq 0$.
When the new $h_\theta(x)=g(z)\leq 0.5$, predict $y=0 \Rightarrow z\leq 0 \Rightarrow \theta^Tx\leq 0 \Rightarrow$ old $h_\theta(x)\leq 0$.
The old $h_\theta(x)=0$ is the decision boundary.
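The equivalence between $g(z)\geq 0.5$ and $z\geq 0$ can be checked directly; a minimal sketch:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.array([-2.0, 0.0, 3.0])
probs = sigmoid(z)
# predict y = 1 exactly when g(z) >= 0.5, i.e. exactly when z >= 0
labels = (probs >= 0.5).astype(int)
print(labels)  # [0 1 1]
```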
Cost function:
The original cost function:
The squared-error cost cannot be reused, because
$$J(\theta)=\frac{1}{m}\sum\limits_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)^2$$
with $0\leq h_\theta(x^{(i)})\leq 1$ and $y=0$ or $1$ has too many local optima — it is not a proper convex function.
New cost function, designed by working backwards from the desired behavior:
$$cost(h_\theta(x),y)=\begin{cases}-\log(h_\theta(x)) & y=1\\[2pt] -\log(1-h_\theta(x)) & y=0\end{cases}$$
Explanation:
Left plot, $y=1$: the cost drives $h_\theta(x)\rightarrow 1$; if instead $h_\theta(x)\rightarrow 0$, then $cost\rightarrow+\infty$.
Right plot, $y=0$: the cost drives $h_\theta(x)\rightarrow 0$; if instead $h_\theta(x)\rightarrow 1$, then $cost\rightarrow+\infty$.
Unified cost function:
$$cost(h_\theta(x),y)=-y\log(h_\theta(x))-(1-y)\log(1-h_\theta(x))$$
When $y=1$, only the first term of the formula survives; when $y=0$, only the second term survives.
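The behavior of the unified cost can be sketched directly from the formula (the `cross_entropy` name is my own, not from the code below):

```python
import numpy as np

def cross_entropy(h, y):
    # cost(h, y) = -y*log(h) - (1 - y)*log(1 - h)
    return -y * np.log(h) - (1 - y) * np.log(1 - h)

# y = 1: a confident, correct prediction costs almost nothing
print(cross_entropy(0.99, 1) < 0.02)  # True
# y = 1: a confident, wrong prediction is punished heavily
print(cross_entropy(0.01, 1) > 4)     # True
# y = 0: the mirror image
print(cross_entropy(0.01, 0) < 0.02)  # True
```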
Overall cost function: the average of the per-example costs
$$\begin{aligned} J(\theta) &=\frac{1}{m}\sum\limits_{i=1}^m cost(h_\theta(x^{(i)}),y^{(i)}) \\ &=\frac{1}{m}\sum\limits_{i=1}^m\left[-y^{(i)}\log(h_\theta(x^{(i)}))-(1-y^{(i)})\log(1-h_\theta(x^{(i)}))\right] \end{aligned}$$
Batch gradient descent:
Repeat until convergence {
$\theta_0:=\theta_0-\alpha\frac{\partial J(\theta)}{\partial\theta_0}$
$\theta_j:=\theta_j-\alpha\frac{\partial J(\theta)}{\partial\theta_j}$
}
Taking the partial derivatives, this becomes:
Repeat until convergence {
$\theta_0:=\theta_0-\alpha\frac{1}{m}\sum\limits_{i=1}^m\left[h_\theta(x^{(i)})-y^{(i)}\right]$
$\theta_j:=\theta_j-\alpha\frac{1}{m}\sum\limits_{i=1}^m\left[h_\theta(x^{(i)})-y^{(i)}\right]x_j^{(i)}$
}
Note: batch gradient descent must update all $\theta_j$ simultaneously.
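The update rules above rest on the gradient $\frac{\partial J(\theta)}{\partial\theta_j}=\frac{1}{m}\sum_i\left[h_\theta(x^{(i)})-y^{(i)}\right]x_j^{(i)}$; a sketch that checks this analytic gradient against central finite differences on random data (everything here is synthetic, not the spambase set):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost(theta, X, y):
    # J(theta) = (1/m) * sum of the cross-entropy terms
    h = sigmoid(X @ theta)
    m = len(y)
    return (1 / m) * np.sum(-y * np.log(h) - (1 - y) * np.log(1 - h))

def grad(theta, X, y):
    # analytic gradient: (1/m) * X^T (h - y)
    h = sigmoid(X @ theta)
    m = len(y)
    return (1 / m) * (X.T @ (h - y))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = (rng.random(20) > 0.5).astype(float)
theta = rng.normal(size=3)

# numerical gradient via central differences
eps = 1e-6
numeric = np.zeros_like(theta)
for j in range(len(theta)):
    e = np.zeros_like(theta)
    e[j] = eps
    numeric[j] = (cost(theta + e, X, y) - cost(theta - e, X, y)) / (2 * eps)

print(np.allclose(numeric, grad(theta, X, y), atol=1e-5))  # True
```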
Dataset
Spam classification
Download the spambase.data file. This is a binary classification problem: spam or non-spam. This experiment uses only the first 3 columns as features and the last column as the target: a value of 1 in the last column means spam, 0 means non-spam.
Code
from mpl_toolkits.mplot3d import Axes3D
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

matplotlib.rcParams['font.size'] = 20


class DataSet(object):
    """
    X_train: training samples
    y_train: training targets
    X_test: test samples
    y_test: test targets
    """
    def __init__(self, X_train, y_train, X_test, y_test):
        self.X_train = X_train
        self.y_train = y_train
        self.X_test = X_test
        self.y_test = y_test


class LogisticRegression(object):
    """
    Logistic regression.
    """
    def __init__(self, n_feature):
        self.theta0 = 0
        self.theta = np.zeros((n_feature, 1))

    def sigmoid(self, z):
        return 1 / (1 + np.exp(-z))

    def gradient_descent(self, X, y, alpha=0.001, num_iter=100):
        costs = []
        m, _ = X.shape
        for i in range(num_iter):
            # prediction
            h = self.sigmoid(np.dot(X, self.theta) + self.theta0)
            # cross-entropy cost
            cost = (1 / m) * np.sum(-y * np.log(h) - (1 - y) * (np.log(1 - h)))
            costs.append(cost)
            # gradients
            dJ_dtheta0 = (1 / m) * np.sum(h - y)
            dJ_dtheta = (1 / m) * np.dot((h - y).T, X).T
            # update all theta simultaneously
            self.theta0 = self.theta0 - alpha * dJ_dtheta0
            self.theta = self.theta - alpha * dJ_dtheta
        return costs

    def show_train(self, costs, num_iter):
        """
        Plot the training process.
        """
        fig = plt.figure(figsize=(10, 6))
        plt.plot(np.arange(num_iter), costs)
        plt.title("Cost history")
        plt.xlabel("Iteration")
        plt.ylabel("Cost")
        plt.show()

    def hypothesis(self, X, theta0, theta):
        """
        Hypothesis: probabilities thresholded at 0.5 into class labels.
        """
        h0 = self.sigmoid(theta0 + np.dot(X, theta))
        h = [1 if elem > 0.5 else 0 for elem in h0]
        return np.array(h)[:, np.newaxis]


def read_data():
    """
    Read the data.
    """
    # header=None: the file has no header row
    # sep: delimiter
    # skipinitialspace: skip spaces after the delimiter
    # comment: ignore anything after a tab
    # na_values: treat "?" as NA
    origin_data = pd.read_csv("./data/spambase.data", header=None, sep=",",
                              skipinitialspace=True, comment="\t", na_values="?")
    data = origin_data.copy()
    # tail() prints the last n rows
    print(data.tail())
    return data


def clean_data(data):
    """
    Clean the data: handle missing values.
    """
    # check whether the dataset contains NA values
    # isna() requires pandas 0.22.0+; upgrade with: pip install --upgrade pandas==0.22.0
    print('NA counts per column:', data.isna().sum())
    # drop rows with missing values
    cleaned_data = data.dropna()
    return cleaned_data


def show_data(data):
    """
    Show the class balance of the dataset.
    """
    count_spam = 0
    count_non_spam = 0
    for c in data.iloc[:, -1]:
        if c == 1:
            count_spam += 1
        else:
            count_non_spam += 1
    print("Number of spam emails:", count_spam)
    print("Number of non-spam emails:", count_non_spam)


def split_data(data):
    """
    Split the data into train and test sets: train fits the hypothesis,
    test measures how well the hypothesis generalizes.
    """
    copied_data = data.copy()
    # frac: fraction of rows to sample; random_state: random seed
    train_dataset = copied_data.sample(frac=0.8, random_state=1)
    # the remaining rows form the test set
    test_dataset = copied_data.drop(train_dataset.index)
    X_train = train_dataset.iloc[:, 0:3]
    y_train = train_dataset.iloc[:, -1]
    X_test = test_dataset.iloc[:, 0:3]
    y_test = test_dataset.iloc[:, -1]
    dataset = DataSet(X_train, y_train, X_test, y_test)
    return dataset


def evaluate_model(y_test, h):
    """
    Evaluate the model.
    """
    # MSE: mean squared error (for 0/1 labels this equals the misclassification rate)
    print("MSE: %f" % (np.sum((h - y_test) ** 2) / len(y_test)))
    # RMSE: root mean squared error
    print("RMSE: %f" % (np.sqrt(np.sum((h - y_test) ** 2) / len(y_test))))


def show_result(X_test, y_test, h):
    # figure canvas
    fig = plt.figure(figsize=(16, 8), facecolor='w')
    # adjust subplot spacing
    plt.subplots_adjust(left=0.05, right=0.95, bottom=0.05, top=0.9)
    # 121: nrows=1, ncols=2, index=1
    ax = fig.add_subplot(121, projection='3d')
    ax.set_title("y_test")
    ax.set_xlabel('Feature 1')
    ax.set_ylabel('Feature 2')
    ax.set_zlabel('Feature 3')
    # x, y, z, c (color), marker (shape)
    ax.scatter(X_test[:, 0], X_test[:, 1], X_test[:, 2], c=y_test, marker='o')
    plt.grid(True)
    ax1 = fig.add_subplot(122, projection='3d')
    ax1.set_title("h")
    ax1.set_xlabel('Feature 1')
    ax1.set_ylabel('Feature 2')
    ax1.set_zlabel('Feature 3')
    # x, y, z, c (color), marker (shape)
    ax1.scatter(X_test[:, 0], X_test[:, 1], X_test[:, 2], c=h, marker='*')
    plt.grid(True)
    plt.show()


def main():
    # read the data
    data = read_data()
    # clean the data
    cleaned_data = clean_data(data)
    # mean normalization: the first 3 columns have similar ranges, so it is skipped here
    # show the data
    show_data(cleaned_data)
    # split the data
    dataset = split_data(cleaned_data)
    # build the model
    _, n = dataset.X_train.shape
    logistic_regression = LogisticRegression(n)
    num_iteration = 300
    costs = logistic_regression.gradient_descent(dataset.X_train, dataset.y_train.values[:, np.newaxis],
                                                 alpha=0.5, num_iter=num_iteration)
    # show the training process
    logistic_regression.show_train(costs, num_iteration)
    # evaluate the model
    h = logistic_regression.hypothesis(dataset.X_test, logistic_regression.theta0, logistic_regression.theta)
    evaluate_model(dataset.y_test.values[:, np.newaxis], h)
    # show the result
    show_result(dataset.X_test.values, dataset.y_test.values, h.ravel())


if __name__ == '__main__':
    main()



