2021-12-11_Python

2021-12-11

Andrew Ng's Machine Learning Course: Exercise 6

Part one uses a linear SVM to classify linearly separable data. The code is as follows.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.io import loadmat
from sklearn.svm import SVC

path = 'C:/Users/ASUS/Desktop/我的资源/data_sets/ex6data1.mat'
data = loadmat(path)
x = data['X']
y = data['y']
x.shape,y.shape #((51, 2), (51, 1))
x_positive = np.array([x[i,:] for i in range(y.shape[0]) if y[i][0] == 1])
x_negative = np.array([x[i,:] for i in range(y.shape[0]) if y[i][0] == 0])
fig,ax = plt.subplots()
ax.scatter(x_positive[:,0],x_positive[:,1],color='r',label='positive')
ax.scatter(x_negative[:,0],x_negative[:,1],color='b',label='negative')
plt.legend()
plt.show()

Plotting the points to classify shows that, apart from one red point that is an outlier, the data are linearly separable.

Next we classify with a linear SVM, taking C=1. The code is as follows.

svc_1 = SVC(C=1,kernel='linear') # instantiate the classifier
svc_1.fit(x,y.flatten())
svc_1.score(x,y.flatten()) # training accuracy: 0.98039215686274506

Then we draw the decision boundary.

# Plot the decision boundary
def plot_boundary(x,y,clf,x_positive,x_negative):
    # Predict on a dense grid that covers the range of the data
    x1 = np.linspace(x[:,0].min(), x[:,0].max(), 1000)
    x2 = np.linspace(x[:,1].min(), x[:,1].max(), 1000)
    xx,yy = np.meshgrid(x1,x2)
    z = clf.predict(np.c_[xx.ravel(),yy.ravel()])
    zz = np.reshape(z,xx.shape)

    # Scatter both classes, then draw the contour where the predicted label changes
    fig,ax = plt.subplots()
    ax.scatter(x_positive[:,0],x_positive[:,1],color='r',label='positive')
    ax.scatter(x_negative[:,0],x_negative[:,1],color='b',label='negative')
    ax.legend(loc=4)
    ax.contour(xx, yy, zz,colors='g')
    plt.show()

plot_boundary(x,y,svc_1,x_positive,x_negative)

The resulting plot shows that every point except the outlier is classified correctly.
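For a linear kernel, the boundary can also be computed directly instead of predicting on a dense grid. The following is a minimal sketch on a small synthetic data set (an assumption; it does not load ex6data1.mat). A fitted linear `SVC` exposes `coef_` and `intercept_`, and the boundary is the line where w·x + b = 0. The names `x_demo`, `y_demo`, `svc_demo` and `boundary_x2` are illustrative only.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic, well-separated toy data (an assumption; not ex6data1.mat)
rng = np.random.default_rng(0)
x_demo = np.vstack([rng.normal([1.0, 1.0], 0.3, size=(20, 2)),
                    rng.normal([3.0, 3.0], 0.3, size=(20, 2))])
y_demo = np.array([0] * 20 + [1] * 20)

svc_demo = SVC(C=1, kernel='linear')
svc_demo.fit(x_demo, y_demo)

# For a linear kernel the decision function is w . x + b
w = svc_demo.coef_[0]
b = svc_demo.intercept_[0]

def boundary_x2(x1):
    """Return x2 on the line w[0]*x1 + w[1]*x2 + b = 0."""
    return -(w[0] * x1 + b) / w[1]

x1_line = np.linspace(x_demo[:, 0].min(), x_demo[:, 0].max(), 50)
x2_line = boundary_x2(x1_line)
# ax.plot(x1_line, x2_line, color='g') would draw this line on an existing axes
```

This only works for `kernel='linear'`; for other kernels `coef_` is not available, so the grid-based approach is still needed.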

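Next we vary C and watch how the decision boundary changes. A large C gives a model with low bias and high variance, while a small C gives a model with high bias and low variance.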

svc_1 = SVC(C=100,kernel='linear') # instantiate the classifier
svc_1.fit(x,y.flatten())
plot_boundary(x,y,svc_1,x_positive,x_negative)
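To compare several values of C side by side, a loop like the one below may help. It is a sketch under stated assumptions: the data are synthetic, with one positive point deliberately placed near the negative cluster to play the role of the outlier, and the values 0.01, 1 and 100 are arbitrary. It records the training accuracy and the number of support vectors for each C.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic toy data with one outlier (an assumption; not ex6data1.mat)
rng = np.random.default_rng(0)
x_neg = rng.normal([1.0, 1.0], 0.4, size=(25, 2))
x_pos = rng.normal([3.0, 3.0], 0.4, size=(25, 2))
x_out = np.array([[1.2, 1.8]])  # a positive point placed near the negative cluster
x_demo = np.vstack([x_neg, x_pos, x_out])
y_demo = np.array([0] * 25 + [1] * 25 + [1])

results = {}
for c in (0.01, 1, 100):
    clf = SVC(C=c, kernel='linear').fit(x_demo, y_demo)
    # n_support_ holds the number of support vectors per class
    results[c] = (clf.score(x_demo, y_demo), int(clf.n_support_.sum()))
    print(f"C={c}: training accuracy={results[c][0]:.3f}, support vectors={results[c][1]}")
```

A smaller C tolerates more margin violations, which usually leaves more points as support vectors. A larger C penalizes violations more heavily, so the boundary tends to move toward fitting the outlier.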

Part two uses an SVM with a Gaussian kernel to classify data that are not linearly separable.

path_1 = 'C:/Users/ASUS/Desktop/我的资源/data_sets/ex6data2.mat'
data_1 = loadmat(path_1)
x_1 = data_1['X']
y_1 = data_1['y']
x_1.shape,y_1.shape #((863, 2), (863, 1))
x_1_positive = np.array([x_1[i,:] for i in range(y_1.shape[0]) if y_1[i][0] == 1])
x_1_negative = np.array([x_1[i,:] for i in range(y_1.shape[0]) if y_1[i][0] == 0])
fig,ax = plt.subplots()
ax.scatter(x_1_positive[:,0],x_1_positive[:,1],color='r',label='positive')
ax.scatter(x_1_negative[:,0],x_1_negative[:,1],color='b',label='negative')
plt.legend()
plt.show()

Plotting the data makes it clear that no linear decision boundary can separate the two classes.

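We classify using the Gaussian kernel. The code is as follows.

svc_2 = SVC(C=1,kernel='rbf',gamma=1) # instantiate the classifier; 'rbf' is the Gaussian kernel
svc_2.fit(x_1,y_1.flatten())
svc_2.score(x_1,y_1.flatten()) # training accuracy: 0.80880648899188878
plot_boundary(x_1,y_1,svc_2,x_1_positive,x_1_negative) # plot the decision boundary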

The decision boundary of this SVM does not fit the data very well. Next we change the Gaussian kernel's parameter to improve the model.
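It may help to recall what `gamma` controls. scikit-learn's RBF kernel is exp(-gamma * ||a - b||^2), while the course writes the Gaussian kernel as exp(-||a - b||^2 / (2 * sigma^2)), so gamma = 1 / (2 * sigma^2). The sketch below checks that the two forms agree. The function names `gaussian_kernel` and `sigma_to_gamma` and the sample vectors are illustrative, not part of the original code.

```python
import numpy as np

def gaussian_kernel(a, b, sigma):
    """Gaussian kernel exp(-||a - b||^2 / (2 * sigma^2))."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

def sigma_to_gamma(sigma):
    """Convert sigma to the gamma of SVC(kernel='rbf'): gamma = 1 / (2 * sigma^2)."""
    return 1.0 / (2.0 * sigma ** 2)

a = np.array([1.0, 2.0, 1.0])
b = np.array([0.0, 4.0, -1.0])

k = gaussian_kernel(a, b, sigma=2.0)             # squared distance is 9, so k = exp(-9/8)
gamma = sigma_to_gamma(2.0)                      # 1 / (2 * 4) = 0.125
k_gamma = np.exp(-gamma * np.sum((a - b) ** 2))  # the same value in the gamma form
print(round(k, 6), gamma)
```

By this relation, gamma=1 corresponds to sigma of about 0.71 and gamma=100 to sigma of about 0.071. A larger gamma means a narrower kernel, so each training point influences a smaller neighborhood.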

svc_3 = SVC(C=1,kernel='rbf',gamma=100) # increase gamma: the larger gamma is, the more complex the model and the lower the bias
svc_3.fit(x_1,y_1.flatten())
plot_boundary(x_1,y_1,svc_3,x_1_positive,x_1_negative)

The resulting decision boundary fits the data well.
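A boundary that fits the training data well can still overfit, so C and gamma are usually chosen on data the model was not fitted on. The following is a hedged sketch of one way to do that with `GridSearchCV`. It uses synthetic data from `make_moons` rather than ex6data2.mat, and the candidate values in `param_grid` are illustrative, not tuned for the course data.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic data that are not linearly separable (an assumption; not ex6data2.mat)
x_demo, y_demo = make_moons(n_samples=400, noise=0.2, random_state=0)
x_tr, x_val, y_tr, y_val = train_test_split(
    x_demo, y_demo, test_size=0.25, random_state=0)

# Candidate values are illustrative only
param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [0.1, 1, 10, 100]}
search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)
search.fit(x_tr, y_tr)

print(search.best_params_)
val_score = search.best_estimator_.score(x_val, y_val)  # accuracy on held-out data
print(val_score)
```

Comparing the held-out accuracy with the training accuracy gives a rough indication of whether a large gamma is fitting noise.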

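Part three uses an SVM to build a spam filter. The emails have already been converted to feature vectors. Below we build the filter with a linear SVM, and the accuracy is high.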

path_2 = 'C:/Users/ASUS/Desktop/我的资源/data_sets/spamTrain.mat'
data_2 = loadmat(path_2)
path_3 = 'C:/Users/ASUS/Desktop/我的资源/data_sets/spamTest.mat'
data_3 = loadmat(path_3)
x_train, y_train = data_2['X'],data_2['y']
x_test, y_test = data_3['Xtest'],data_3['ytest']
svc_4 = SVC(C=1,kernel='linear')
svc_4.fit(x_train,y_train.flatten())
svc_4.score(x_train,y_train.flatten()) # training accuracy: 0.99975000000000003
svc_4.score(x_test,y_test.flatten()) # test accuracy: 0.97799999999999998
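With a linear kernel, the learned weights indicate which features push a message toward the positive class. The sketch below illustrates this on synthetic binary features (an assumption; it does not load spamTrain.mat), where the label is constructed from features 0 and 1 so that they are informative by design. The names `x_demo`, `y_demo`, `svc_demo` and `top` are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic binary features (an assumption; not spamTrain.mat).
# The label is 1 whenever feature 0 or feature 1 is present,
# so those two features are informative by construction.
rng = np.random.default_rng(0)
n_samples, n_features = 200, 10
x_demo = rng.integers(0, 2, size=(n_samples, n_features))
y_demo = ((x_demo[:, 0] + x_demo[:, 1]) >= 1).astype(int)

svc_demo = SVC(C=1, kernel='linear').fit(x_demo, y_demo)

# Indices of the features with the largest positive weights, largest first
weights = svc_demo.coef_[0]
top = np.argsort(weights)[::-1][:2]
print(top, weights[top])
```

Applied to `svc_4`, the same `np.argsort(svc_4.coef_[0])` call would rank the columns of `x_train`. Mapping those column indices back to words requires the vocabulary used to build the features, which is not shown here.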
