2021-09-26_Python

2021-09-26

Andrew Ng Machine Learning: Exercise 2 (Part 1)

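This exercise uses logistic regression to classify a dataset. At its core, logistic regression passes a linear combination of the features through the sigmoid function. Let's look at the dataset first.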

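test1 is the score on the first test, test2 is the score on the second test, and outcome is the admission decision. The task is to predict the admission decision with logistic regression, which amounts to binary classification.
My first idea was to implement this with gradient descent; only later did I find a simpler route. So I will start with the gradient descent version.
As before, the first step is to read and prepare the data.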

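import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

path = 'C:/Users/ASUS/Desktop/2.txt'
df = pd.read_csv(path, header=None, names=['test1', 'test2', 'outcome'])
df.insert(0, 'ones', 1)  # column of ones that multiplies the intercept θ0
positive = df[df['outcome'] == 1].iloc[:, 1:3]  # test1/test2 of admitted rows (.ix no longer exists in pandas)
negative = df[df['outcome'] == 0].iloc[:, 1:3]  # test1/test2 of rejected rows
x = np.asmatrix(df.iloc[:, :-1].to_numpy())  # shape (100, 3); np.mat was removed in NumPy 2.0
y = np.asmatrix(df.iloc[:, -1].to_numpy())   # shape (1, 100)
theta = np.asmatrix(np.zeros(3))             # initialize θ, shape (1, 3)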

Next, define the sigmoid function and the gradient descent update.

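def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gradient(x, y, theta, num, lr):
    """Run num gradient descent steps with learning rate lr; return θ and the log-likelihood history."""
    loss_list = []
    for i in range(num):
        h = sigmoid(x * theta.T)  # predictions, shape (100, 1)
        theta = theta - lr * (h - y.T).T * x  # step along the summed (not averaged) gradient
        h = sigmoid(x * theta.T)
        # log-likelihood; the second term is log(1 - h), because 1 - sigmoid(-z) would just equal sigmoid(z)
        loss = y * np.log(h) + (1 - y) * np.log(1 - h)
        loss_list.append(loss)
    loss_list = [l[0, 0] for l in loss_list if not np.isnan(l[0, 0])]  # drop nan entries
    return theta, loss_list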

Then train. As the iteration count shows, plain gradient descent needs a large number of steps here.

theta_1,loss = gradient(x,y,theta,200000,0.01)
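
One likely reason so many iterations are needed is that the raw scores are unscaled (roughly 30 to 100), which makes the cost surface badly conditioned. The sketch below is not the method used above. It assumes synthetic data in place of the assignment file, plain NumPy arrays, an averaged gradient, and standardized features; the name fit_logistic is made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(features, labels, lr=0.1, num_iters=2000):
    """Batch gradient descent on standardized features (illustrative helper)."""
    mu = features.mean(axis=0)
    sigma = features.std(axis=0)
    scaled = (features - mu) / sigma
    design = np.column_stack([np.ones(len(scaled)), scaled])  # prepend the ones column
    theta = np.zeros(design.shape[1])
    for _ in range(num_iters):
        residual = sigmoid(design @ theta) - labels
        theta -= lr * design.T @ residual / len(labels)  # averaged gradient step
    return theta, mu, sigma

# Synthetic stand-in for the two test scores and the admission outcome (assumption).
rng = np.random.default_rng(0)
scores = rng.uniform(30, 100, size=(200, 2))
admitted = (scores.sum(axis=1) + rng.normal(0, 10, size=200) > 130).astype(float)

theta, mu, sigma = fit_logistic(scores, admitted)
design = np.column_stack([np.ones(200), (scores - mu) / sigma])
accuracy = np.mean((sigmoid(design @ theta) >= 0.5) == (admitted == 1))
print(f"training accuracy: {accuracy:.2f}")
```

Note that θ fitted this way lives in the standardized feature space, so new scores have to be scaled with the same mu and sigma before predicting.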

Finally, plot the result.

x_pre = np.linspace(30, 100, 100)
plt.scatter(positive['test1'],positive['test2'])
plt.scatter(negative['test1'],negative['test2'],color = 'r')
plt.plot(x_pre,-theta_1[0,0]/theta_1[0,2]-theta_1[0,1]/theta_1[0,2]*x_pre,color = 'y')
plt.show()

The result is shown in the figure below.
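
The yellow line in the plotting code comes from setting the model's linear score to zero: sigmoid(z) = 0.5 exactly when z = θ0 + θ1·test1 + θ2·test2 = 0, so test2 = -θ0/θ2 - (θ1/θ2)·test1. Here is a small check of that algebra, using made-up θ values (not fitted ones) and a hypothetical helper name:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def boundary_test2(theta, test1):
    """test2 value on the decision boundary for a given test1 (hypothetical helper)."""
    theta0, theta1, theta2 = theta
    return -theta0 / theta2 - theta1 / theta2 * test1

theta_demo = np.array([-25.0, 0.2, 0.2])  # illustrative values only, not a fitted result
test1 = np.linspace(30, 100, 5)
test2 = boundary_test2(theta_demo, test1)

# Every point on the line should be assigned probability 0.5.
z = theta_demo[0] + theta_demo[1] * test1 + theta_demo[2] * test2
print(sigmoid(z))
```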

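That covers the gradient descent approach. Next is fmin_tnc(), an optimizer that ships with SciPy's optimize subpackage. With fmin_tnc() there is no need to choose an iteration count or step size by hand: given the cost function and its gradient, it runs a truncated Newton method and returns the θ it converged to.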

import scipy.optimize as opt

def cost(theta, x, y):
    h = sigmoid(theta * x.T).T  # predictions, shape (100, 1)
    total = y * np.log(h) + (1 - y) * np.log(1 - h)
    return float(total[0, 0]) / (-x.shape[0])  # return a plain scalar

def gradient(theta, x, y):
    grad = (sigmoid((theta * x.T).T) - y.T).T * x / x.shape[0]
    return np.asarray(grad).ravel()  # return a flat array of length 3

x0 = np.asarray(theta).ravel()  # flat 1-D starting point
result = opt.fmin_tnc(func=cost, x0=x0, fprime=gradient, args=(x, y))
# result is (array([-25.16131853,   0.20623159,   0.20147149]), 36, 0)
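
Before handing a cost and gradient pair to an optimizer, it can help to confirm that the two agree numerically. The sketch below uses scipy.optimize.check_grad on a small random dataset. The data, the plain-array versions of cost and gradient, and the demo θ are all assumptions made for illustration; they are not the matrices used in this post.

```python
import numpy as np
from scipy.optimize import check_grad

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, x, y):
    # Mean negative log-likelihood on plain ndarrays.
    h = sigmoid(x @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def gradient(theta, x, y):
    # Analytic gradient of the cost above.
    return x.T @ (sigmoid(x @ theta) - y) / len(y)

rng = np.random.default_rng(1)
x_demo = np.column_stack([np.ones(20), rng.normal(size=(20, 2))])  # toy design matrix
y_demo = rng.integers(0, 2, size=20).astype(float)                 # toy labels
theta_demo = np.array([0.1, -0.2, 0.3])

# check_grad returns the norm of (analytic gradient - finite-difference gradient).
error = check_grad(cost, gradient, theta_demo, x_demo, y_demo)
print(f"gradient check error: {error:.2e}")
```

A value close to zero suggests the gradient matches the cost; a large value usually points to a sign or shape mistake.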

The first element of result is θ. The plot below shows the outcome.

x_pre = np.linspace(30, 100, 100)
plt.scatter(positive['test1'],positive['test2'])
plt.scatter(negative['test1'],negative['test2'],color = 'r')
plt.plot(x_pre,-result[0][0]/result[0][2]-result[0][1]/result[0][2]*x_pre,color = 'y')
plt.show()

The result is shown in the figure below.
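
Once θ is known, a prediction is just sigmoid(θ0 + θ1·test1 + θ2·test2) thresholded at 0.5. Here is a minimal sketch using the θ values reported in the result comment earlier; the helper names and the example scores (45, 85) are chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(theta, test1, test2):
    """Estimated admission probability for one pair of scores (hypothetical helper)."""
    return sigmoid(theta[0] + theta[1] * test1 + theta[2] * test2)

def predict(theta, test1, test2, threshold=0.5):
    """Return 1 (admitted) or 0 (rejected) by thresholding the probability."""
    return int(predict_proba(theta, test1, test2) >= threshold)

theta_fit = np.array([-25.16131853, 0.20623159, 0.20147149])  # values from the result comment
p = predict_proba(theta_fit, 45, 85)
print(f"admission probability: {p:.3f}, predicted outcome: {predict(theta_fit, 45, 85)}")
```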

I also ran into a small problem while using fmin_tnc() that I still have not figured out.

#x_shape(100,3),y_shape(1,100),theta_shape(1,3)
def cost(theta, x, y):
    cost = y*np.log(sigmoid(x*theta.T))+(1-y)*np.log(1-sigmoid(x*theta.T))
    return cost/(-x.shape[0])
def gradient(theta, x, y):
    return ((sigmoid((theta*x.T).T) - (y.T)).T)*x/(x.shape[0])
import scipy.optimize as opt
result = opt.fmin_tnc(func=cost, x0=theta, fprime=gradient, args=(x, y))

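When cost is defined this way, fmin_tnc() raises "shapes (100,3) and (1,3) not aligned: 3 (dim 1) != 1 (dim 0)", even though θ has clearly been transposed. I wondered whether the transpose was somehow changing the shape of θ afterwards, but .T should not modify the original θ. I have not resolved this so far, and I strongly suspect it has to do with the internals of fmin_tnc().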
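
One plausible explanation, offered as a hypothesis rather than a confirmed diagnosis: fmin_tnc() works on a flat 1-D array, so inside cost the argument theta has shape (3,) and not (1,3). Transposing a 1-D array does nothing, and np.matrix promotes a 1-D right-hand operand to a (1,3) row, so x*theta.T becomes (100,3)·(1,3) and fails, while theta*x.T is (3,)·(3,100) and works. The sketch below reproduces this with a stand-in matrix of ones; the variable names are made up for the demo.

```python
import numpy as np

x_demo = np.asmatrix(np.ones((100, 3)))  # stand-in for the (100, 3) design matrix
theta_flat = np.zeros(3)                 # what the optimizer is assumed to pass in: shape (3,)

# Transposing a 1-D array leaves its shape unchanged.
print(theta_flat.T.shape)

# theta on the left: a (3,) vector times a (3, 100) matrix is well defined.
left_product = theta_flat * x_demo.T
print(left_product.shape)

# x on the left: the 1-D theta is promoted to a (1, 3) row, so the product fails.
try:
    x_demo * theta_flat.T
    failed = False
except ValueError as err:
    failed = True
    print("ValueError:", err)

# One possible fix: rebuild the (1, 3) row inside cost before transposing.
theta_row = np.asmatrix(theta_flat).reshape(1, -1)
fixed_product = x_demo * theta_row.T
print(fixed_product.shape)
```

If this is indeed the cause, reshaping theta at the top of cost, or keeping theta on the left of the product as in the working version, should avoid the error.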
