Machine Learning - Coursera Andrew Ng Machine Learning Course, Week 3 Study Notes

Classification

For example, classifying tumors:
0: benign
1: malignant

A binary classification problem has only two possible outcomes, 0 and 1, sometimes also written as - and +. This is why y⁽ⁱ⁾ is also called the label.

Logistic Regression

Some terminology:

asymptote: a line that a curve approaches but never reaches

The sigmoid function g(z) is used to map the output of the linear function into the interval (0, 1).
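Written out, and assuming the usual convention that the linear part is θᵀx (the notes only say "linear function"), the hypothesis looks like this. The form of g matches the sigmoid.m solution further down.

```latex
h_\theta(x) = g(\theta^{T} x), \qquad g(z) = \frac{1}{1 + e^{-z}}
```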

Plot of g(z): an S-shaped curve that passes through g(0) = 0.5 and has horizontal asymptotes at 0 and 1.

The new h(x) then represents the probability that the output is 1.

In the tumor example, h_θ(x) = 0.7 means that the probability of the output being 1, i.e. of the tumor being malignant, is 70%.

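Correspondingly, the probability of the output being 0, i.e. of the tumor being benign, is 30%, because the two probabilities sum to 1.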

Decision Boundary

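With the sigmoid function in place, the goal reduces to finding a linear function z = θᵀx such that:

when z ≥ 0 (so h_θ(x) ≥ 0.5), predict 1
when z < 0 (so h_θ(x) < 0.5), predict 0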

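With a single feature, the boundary can be drawn as a vertical line.

With two features, it can be a slanted line.

With more complex hypotheses (for example, polynomial features), it can be a curve enclosing a region.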

The decision boundary is the set of points where θᵀx = 0, i.e. where h_θ(x) = 0.5.
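Two small worked examples may make this concrete. The parameter values are hypothetical, chosen only for illustration: the first gives a straight-line boundary, the second (with squared features added) a circular one.

```latex
% Linear boundary: two features
\theta = \begin{bmatrix} -3 & 1 & 1 \end{bmatrix}^{T}: \quad
\theta^{T}x = -3 + x_1 + x_2 \ge 0 \iff x_1 + x_2 \ge 3 \quad (\text{predict } y = 1)

% Circular boundary: features x_1, x_2, x_1^2, x_2^2
\theta = \begin{bmatrix} -1 & 0 & 0 & 1 & 1 \end{bmatrix}^{T}: \quad
\theta^{T}x = -1 + x_1^2 + x_2^2 \ge 0 \iff x_1^2 + x_2^2 \ge 1 \quad (\text{predict } y = 1)
```

In both cases the boundary is a property of θ, not of the training set; the training set is only used to fit θ.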

Cost Function

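With the sigmoid hypothesis, the mean squared error cost function is non-convex (the left plot in the course figure), so gradient descent is not guaranteed to reach the global optimum.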

So logistic regression needs a cost function designed specifically for it.

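When y = 1, the cost behaves as follows.

h = 1: the cost is 0 (a correct prediction).

h → 0: the cost → ∞ (the further off the prediction, the larger the penalty).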

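When y = 0, the cost behaves as follows.

h = 0: the cost is 0 (a correct prediction).

h → 1: the cost → ∞ (the further off the prediction, the larger the penalty).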
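The behavior described in the two cases above is what the standard logarithmic cost produces. As a sketch, assuming this is the cost shown in the lost figures:

```latex
\mathrm{Cost}(h_\theta(x), y) =
\begin{cases}
-\log\left(h_\theta(x)\right)     & \text{if } y = 1 \\
-\log\left(1 - h_\theta(x)\right) & \text{if } y = 0
\end{cases}
```

For y = 1, -log(1) = 0 and -log(h) grows without bound as h → 0; the y = 0 case mirrors it.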

Simplified Cost Function

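The two cases above can be merged into a single expression (shown in blue on the course slide): Cost(h_θ(x), y) = -y log(h_θ(x)) - (1 - y) log(1 - h_θ(x)).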

The overall cost function J is then the average of this cost over the m training examples.
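Spelled out, J takes the following form. It is the same quantity that the costFunction.m solution further down computes in vectorized form.

```latex
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m}
\left[ y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right]
```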

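This cost function follows from the principle of maximum likelihood estimation (not explored here). Gradient descent on it updates every θ_j at each iteration by subtracting α times the partial derivative ∂J(θ)/∂θ_j.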

Although each iteration's update looks just like the one for linear regression, the two are not the same, because h(x) is different.
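A sketch of the update rule, with the two definitions of h side by side to show where the difference lies:

```latex
\text{repeat: } \quad
\theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
\quad (\text{simultaneously for all } j)

\text{linear regression: } h_\theta(x) = \theta^{T} x
\qquad
\text{logistic regression: } h_\theta(x) = \frac{1}{1 + e^{-\theta^{T} x}}
```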

Vectorized Computation

Vectorized computation of the cost function:
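In matrix form, with X the m × (n+1) design matrix and y the m × 1 label vector, the cost can be written as below. This mirrors the expression used in the costFunction.m solution later in these notes.

```latex
h = g(X\theta), \qquad
J(\theta) = \frac{1}{m} \left( -y^{T} \log(h) - (1 - y)^{T} \log(1 - h) \right)
```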

Vectorized gradient descent:
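The vectorized update is θ := θ - (α/m) Xᵀ(g(Xθ) - y). Below is a minimal Octave sketch of the loop; the data set, alpha, and num_iters are made-up illustration values, not taken from the course exercise.

```octave
% Minimal sketch of vectorized gradient descent for logistic regression.
% X, y, alpha and num_iters are hypothetical illustration values.
sigmoid = @(z) 1 ./ (1 + exp(-z));

X = [1 1; 1 2; 1 3; 1 4];   % first column is the intercept term x0 = 1
y = [0; 1; 0; 1];           % labels (deliberately not linearly separable)
m = length(y);

theta = zeros(2, 1);
alpha = 0.1;
num_iters = 1000;
J_history = zeros(num_iters, 1);

for iter = 1:num_iters
  h = sigmoid(X * theta);                                        % m x 1 predictions
  J_history(iter) = (-y' * log(h) - (1 - y)' * log(1 - h)) / m;  % cost before this update
  theta = theta - (alpha / m) * (X' * (h - y));                  % update all theta_j at once
end
```

Recording J_history is optional; plotting it is a quick way to check that the cost decreases and that alpha is not too large.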

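Optimization Algorithms

To find the θ that minimizes the cost, there are optimization algorithms other than gradient descent. They are often more efficient and do not require choosing α manually; the downside is that they are more complex. For example:

Conjugate gradient
BFGS
L-BFGS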

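Like gradient descent, they need two things computed for a given θ: the cost J(θ) and its partial derivatives ∂J(θ)/∂θ_j.

Usually a single function is written to return both: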

function [jVal, gradient] = costFunction(theta)
  jVal = [...code to compute J(theta)...];
  gradient = [...code to compute derivative of J(theta)...];
end

Then fminunc(), together with optimset() for configuring its options, returns the optimal θ:

options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2,1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
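To see the pieces working end to end, here is a self-contained sketch for Octave. The quadratic cost J(θ) = (θ₁ - 5)² + (θ₂ - 5)² is a hypothetical stand-in chosen because its minimum at [5; 5] is obvious; it is not the logistic regression cost, and this one-argument costFunction is unrelated to the costFunction.m in the assignment.

```octave
1;  % Octave: a leading statement marks the file as a script, so functions can be defined inline

% Hypothetical cost J(theta) = (theta1 - 5)^2 + (theta2 - 5)^2, minimized at [5; 5].
function [jVal, gradient] = costFunction(theta)
  jVal = (theta(1) - 5)^2 + (theta(2) - 5)^2;
  gradient = zeros(2, 1);
  gradient(1) = 2 * (theta(1) - 5);   % dJ/dtheta1
  gradient(2) = 2 * (theta(2) - 5);   % dJ/dtheta2
end

% 'GradObj', 'on' tells fminunc that costFunction also returns the gradient.
options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2, 1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
```

A positive exitFlag indicates that fminunc reports convergence. For the assignment's three-argument cost function, the extra arguments can be bound with an anonymous function such as @(t) costFunction(t, X, y).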

As Andrew Ng puts it, you don't have to understand the internals of these algorithms; knowing how to use them is enough.

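Multiclass Classification: one-vs-all

Convert an n-class problem into n binary classification problems and train n classifiers h(x), one per class.

Given an input x, evaluate all n classifiers and assign x to the class whose h(x) is largest.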
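The prediction step can be sketched in a few lines of Octave. Here all_theta is a hypothetical matrix holding one row of already-trained parameters per class; its values and the inputs in X are made up for illustration.

```octave
% Sketch of one-vs-all prediction. all_theta is a hypothetical K x (n+1)
% matrix whose k-th row holds the parameters of the classifier for class k.
sigmoid = @(z) 1 ./ (1 + exp(-z));

all_theta = [ 2 -1 -1;    % class 1 (made-up values)
             -1  1 -1;    % class 2
             -1  0  1];   % class 3

X = [1 0 0;               % each row is [x0 = 1, x1, x2]
     1 3 0;
     1 0 3];

probs = sigmoid(X * all_theta');   % m x K matrix: probs(i, k) = h_k(x_i)
[~, p] = max(probs, [], 2);        % p(i) = index of the largest h for example i
```

Because the sigmoid is monotonic, taking the maximum of X * all_theta' directly would select the same classes.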

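Overfitting and Regularization

Underfitting (figure 1): the hypothesis does not fit the data well and the error is large; high bias. J is fairly large.

Overfitting (figure 3): the hypothesis fits the training set too closely and predicts poorly on new data; high variance. J can be approximately 0 on the training set, but generalization is poor.

There are two ways to address overfitting:

Reduce the number of features: manually, or with a model selection algorithm
Regularization: keep all features but reduce the magnitude of θ; works well when there are many features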

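For features that contribute little, the corresponding parameters can be added to the cost function as penalty terms, so that they are pushed as close to 0 as possible.

But how do we know which parameters to penalize? Simply add all of them to the cost function (except θ₀).

λ is the regularization parameter; it determines how costly large values of θ are.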
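As a sketch, assuming the squared-error cost of linear regression (the notes do not say which model the lost figure used), the regularized cost looks like this. Note that the penalty sum starts at j = 1, so θ₀ is excluded.

```latex
J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
+ \lambda \sum_{j=1}^{n} \theta_j^2 \right]
```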

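What happens if λ is too large? θ₁ through θ_n all tend to 0, h(x) tends to a horizontal line, and the model underfits.

Regularization makes the fitted hypothesis smoother, which mitigates overfitting to some extent.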

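Gradient Descent with Regularization

The update for each θ_j (j ≥ 1) gains an extra term from the penalty.

Since αλ/m is positive and usually small (α is small and m is large), the factor 1 - αλ/m is slightly less than 1. Each step therefore just shrinks θ_j a little on top of the usual update, and gradient descent still works as normal.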
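Written out, the update splits into an unpenalized rule for θ₀ and a penalized rule for the rest. Rearranging the second rule exposes the shrink factor 1 - αλ/m:

```latex
\theta_0 := \theta_0 - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_0^{(i)}

\theta_j := \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} \theta_j \right]
         = \theta_j \left( 1 - \alpha \frac{\lambda}{m} \right) - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)},
\quad j = 1, \dots, n
```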

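Normal Equation with Regularization

The normal equation becomes θ = (X^T X + λL)⁻¹ X^T y, where L is the (n+1) × (n+1) identity matrix with its top-left entry set to 0.

Without regularization, when m ≤ n (the number of examples does not exceed the number of features), X^T X is non-invertible.

With regularization and λ > 0, X^T X + λL becomes invertible, so regularization rescues the normal equation.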
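A small Octave sketch illustrates the point with fewer examples than features. X, y, and lambda are made-up values; the rank checks are there only to show the singular and non-singular cases.

```octave
% Sketch: regularized normal equation with m = 2 examples and n = 3 features.
% X, y and lambda are hypothetical illustration values.
X = [1 2 3 4;
     1 5 6 8];            % first column is the intercept term x0 = 1
y = [1; 2];
lambda = 1;

L = eye(size(X, 2));
L(1, 1) = 0;              % do not regularize theta_0

rank_plain = rank(X' * X);     % at most m = 2, so this 4 x 4 matrix is singular
A = X' * X + lambda * L;
rank_reg = rank(A);            % full rank once lambda * L is added
theta = A \ (X' * y);          % solves (X'X + lambda * L) * theta = X'y
```

The backslash operator is used instead of forming the inverse explicitly, which is the usual way to solve a linear system in Octave.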

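Regularized Logistic Regression

The original logistic regression cost function is the unregularized J from the cost function section above.

After regularization, J gains a penalty term (note that the term to the right of the plus sign excludes θ₀).

Gradient descent changes accordingly: every θ_j except θ₀ gets the extra (λ/m) θ_j term in its gradient.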
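The regularized cost and its gradient can be written as follows. These are the same expressions that the costFunctionReg.m solution below evaluates in vectorized form.

```latex
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m}
\left[ y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right]
+ \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2

\frac{\partial J}{\partial \theta_0} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_0^{(i)}

\frac{\partial J}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} \theta_j,
\quad j = 1, \dots, n
```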

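Summary

Andrew Ng says that at this point, having learned linear regression, logistic regression, advanced optimization algorithms, and regularization, you already know more about machine learning than many Silicon Valley engineers... Congratulations!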

Assignments

plotData.m

function plotData(X, y)
%PLOTDATA Plots the data points X and y into a new figure 
%   PLOTDATA(x,y) plots the data points with + for the positive examples
%   and o for the negative examples. X is assumed to be a Mx2 matrix.

% Create New Figure
figure; hold on;

% ====================== YOUR CODE HERE ======================
% Instructions: Plot the positive and negative examples on a
%               2D plot, using the option 'k+' for the positive
%               examples and 'ko' for the negative examples.
%


% Find Indices of Positive and Negative Examples
pos = find(y==1); neg = find(y == 0);
% Plot Examples
plot(X(pos, 1), X(pos, 2), 'k+','LineWidth', 2, ...
     'MarkerSize', 7);
plot(X(neg, 1), X(neg, 2), 'ko', 'MarkerFaceColor', 'y', ...
     'MarkerSize', 7);


% =========================================================================



hold off;

end

sigmoid.m

function g = sigmoid(z)
%SIGMOID Compute sigmoid function
%   g = SIGMOID(z) computes the sigmoid of z.

% You need to return the following variables correctly 
%g = zeros(size(z));

% ====================== YOUR CODE HERE ======================
% Instructions: Compute the sigmoid of each value of z (z can be a matrix,
%               vector or scalar).


g = 1 ./ (1 + exp(-z));   % element-wise, so z can be a scalar, vector or matrix

% =============================================================

end

costFunction.m

function [J, grad] = costFunction(theta, X, y)
%COSTFUNCTION Compute cost and gradient for logistic regression
%   J = COSTFUNCTION(theta, X, y) computes the cost of using theta as the
%   parameter for logistic regression and the gradient of the cost
%   w.r.t. to the parameters.

% Initialize some useful values
m = length(y); % number of training examples

% You need to return the following variables correctly 
% J = 0;
% grad = zeros(size(theta));

% ====================== YOUR CODE HERE ======================
% Instructions: Compute the cost of a particular choice of theta.
%               You should set J to the cost.
%               Compute the partial derivatives and set grad to the partial
%               derivatives of the cost w.r.t. each parameter in theta
%
% Note: grad should have the same dimensions as theta
%
hx = sigmoid(X * theta);                             % m x 1 vector of predictions
J = (-y' * log(hx) - (1 - y)' * log(1 - hx)) / m;    % vectorized cost
grad = X' * (hx - y) / m;                            % same dimensions as theta

% =============================================================

end

predict.m

function p = predict(theta, X)
%PREDICT Predict whether the label is 0 or 1 using learned logistic 
%regression parameters theta
%   p = PREDICT(theta, X) computes the predictions for X using a 
%   threshold at 0.5 (i.e., if sigmoid(theta'*x) >= 0.5, predict 1)

m = size(X, 1); % Number of training examples

% You need to return the following variables correctly
% p = zeros(m, 1);

% ====================== YOUR CODE HERE ======================
% Instructions: Complete the following code to make predictions using
%               your learned logistic regression parameters. 
%               You should set p to a vector of 0's and 1's
%

p = (X * theta >= 0);   % sigmoid(X * theta) >= 0.5 exactly when X * theta >= 0
% =========================================================================

end

costFunctionReg.m

function [J, grad] = costFunctionReg(theta, X, y, lambda)
%COSTFUNCTIONREG Compute cost and gradient for logistic regression with regularization
%   J = COSTFUNCTIONREG(theta, X, y, lambda) computes the cost of using
%   theta as the parameter for regularized logistic regression and the
%   gradient of the cost w.r.t. to the parameters. 

% Initialize some useful values
m = length(y); % number of training examples

% You need to return the following variables correctly 
%J = 0;
%grad = zeros(size(theta));

% ====================== YOUR CODE HERE ======================
% Instructions: Compute the cost of a particular choice of theta.
%               You should set J to the cost.
%               Compute the partial derivatives and set grad to the partial
%               derivatives of the cost w.r.t. each parameter in theta
n = length(theta);
hx = sigmoid(X * theta);
% The regularization term skips theta(1), which corresponds to theta_0
J = (-y' * log(hx) - (1 - y)' * log(1 - hx)) / m + lambda * (theta(2:n)' * theta(2:n)) / (2 * m);
grad0 = X(:, 1)' * (hx - y) / m;                                % gradient for theta_0 (no penalty)
grad1 = X(:, 2:n)' * (hx - y) / m + lambda * theta(2:n) / m;    % gradients for theta_1..theta_n
grad = [grad0; grad1];
% =============================================================

end

Classification Results

The decision boundary produced by ex2.m:

The decision boundary produced by ex2_reg.m:
