Andrew Ng's "Machine Learning" Notes: Chapter 4

4. Linear Regression with Multiple Variables

4.1 Multiple features/variables
4.2 Gradient descent for multiple variables
4.3 Gradient descent in practice I: Feature Scaling
4.4 Gradient descent in practice II: Learning rate
4.5 Features and polynomial regression
4.6 Normal equation

4.1 Multiple features/variables

Notation:
$n$: number of features/variables
$x^{(i)}$: input (features) of the $i^{th}$ training example
$x_j^{(i)}$: value of feature $j$ in the $i^{th}$ training example

The hypothesis is $h_\theta(x)=\theta_0+\theta_1x_1+\theta_2x_2+\dots+\theta_nx_n$. For convenience, define $x_0=1$, $x=(x_0,x_1,x_2,\dots,x_n)^T$ and $\theta=(\theta_0,\theta_1,\theta_2,\dots,\theta_n)^T$; then $h_\theta(x)=\theta^Tx$. This model is called multivariate linear regression.
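The vectorized form $h_\theta(x)=\theta^Tx$ maps directly onto a matrix-vector product. Here is a minimal NumPy sketch; the helper names `add_intercept` and `hypothesis` and the numbers are made up for illustration and do not come from the course.

```python
import numpy as np

def add_intercept(features):
    """Prepend the constant column x_0 = 1 to an (m, n) feature matrix."""
    features = np.asarray(features, dtype=float)
    ones = np.ones((features.shape[0], 1))
    return np.hstack([ones, features])

def hypothesis(theta, X):
    """Return h_theta(x) = theta^T x for every row x of X."""
    return X @ theta

# Made-up data: two examples with n = 2 features each.
X = add_intercept([[2.0, 3.0], [4.0, 5.0]])
theta = np.array([1.0, 0.5, -1.0])  # theta_0, theta_1, theta_2
predictions = hypothesis(theta, X)
```

Stacking the examples as rows lets a single product compute the prediction for the whole training set at once.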

4.2 Gradient descent for multiple variables

Hypothesis: $h_\theta(x)=\theta^Tx=\theta_0+\theta_1x_1+\theta_2x_2+\dots+\theta_nx_n$

Parameters: $\theta_0,\theta_1,\theta_2,\dots,\theta_n$

Cost function: $J(\theta)=J(\theta_0,\theta_1,\theta_2,\dots,\theta_n)=\frac{1}{2m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2$

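As in univariate linear regression, we differentiate the cost function, only now with respect to several parameters. You can differentiate with respect to each $\theta_j$ one at a time if you don't mind the tedium; the simpler route is to differentiate with respect to the vector $\theta$. (The derivation is straightforward, so it is not written out here.)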
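For reference, differentiating the cost gives $\frac{\partial J}{\partial\theta_j}=\frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}$, so the simultaneous update is $\theta\leftarrow\theta-\frac{\alpha}{m}X^T(X\theta-y)$, where $X$ is the matrix whose rows are the training examples (each with its leading $x_0=1$). Here is a rough sketch of batch gradient descent under that formula; the function names, the default `alpha` and `num_iters`, and the toy data are assumptions chosen for illustration.

```python
import numpy as np

def cost(theta, X, y):
    """J(theta) = 1/(2m) * sum of squared residuals."""
    m = len(y)
    residual = X @ theta - y
    return residual @ residual / (2 * m)

def gradient_descent(X, y, alpha=0.1, num_iters=2000):
    """Batch gradient descent, updating every theta_j simultaneously."""
    m, num_params = X.shape
    theta = np.zeros(num_params)
    for _ in range(num_iters):
        gradient = X.T @ (X @ theta - y) / m  # vector of dJ/dtheta_j
        theta = theta - alpha * gradient
    return theta

# Made-up data generated by y = 1 + 2*x1 + 3*x2 (first column is x_0 = 1).
X = np.array([[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([1, 3, 4, 6], dtype=float)
theta = gradient_descent(X, y)
```

On this small, well-scaled problem the result should approach the generating parameters; real data usually needs `alpha` and `num_iters` tuned.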

4.3 Gradient descent in practice I: Feature Scaling

The goal of this section is to speed up the convergence of gradient descent.

Feature scaling. Idea: make sure features are on a similar scale, getting every feature into approximately the range $-1\leq x_i\leq1$. Method: divide each feature by its maximum value in the training set.

Mean normalization. Replace $x_i$ with $x_i-\mu_i$ so that features have approximately zero mean (do not apply this to $x_0=1$), where $\mu_i$ is the mean of feature $x_i$ over the training set, and then divide by the range of $x_i$. In symbols: $x_i\leftarrow\frac{x_i-\mu_i}{s_i}$, where $s_i$ is the range of $x_i$, meaning its maximum minus its minimum; $s_i$ can also be set to the standard deviation of $x_i$.

Neither scaling needs to be exact; the only aim is to make gradient descent run a bit faster.
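Mean normalization as described above can be sketched as follows. The function name `mean_normalize`, the `use_std` switch, and the housing-style numbers are illustrative assumptions; the input is assumed to exclude the $x_0=1$ column.

```python
import numpy as np

def mean_normalize(features, use_std=False):
    """Apply x_i <- (x_i - mu_i) / s_i column by column.

    s_i is the range (max - min) by default, or the standard deviation
    when use_std is True. A constant column would give s_i = 0, which
    this sketch does not handle.
    """
    features = np.asarray(features, dtype=float)
    mu = features.mean(axis=0)
    if use_std:
        s = features.std(axis=0)
    else:
        s = features.max(axis=0) - features.min(axis=0)
    return (features - mu) / s, mu, s

# Made-up data: house size in square feet, number of bedrooms.
raw = np.array([[2104.0, 5.0], [1416.0, 3.0], [1534.0, 3.0], [852.0, 2.0]])
scaled, mu, s = mean_normalize(raw)
```

The function returns `mu` and `s` so that the same shift and scale can be applied to new inputs at prediction time.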

4.4 Gradient descent in practice II: Learning rate

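Tip: print the value of the cost function at every iteration.

If gradient descent is not working properly (the cost increases, or it bounces up and down), try a smaller learning rate $\alpha$.

For sufficiently small $\alpha$, the cost decreases on every iteration; but if $\alpha$ is too small, convergence is slow.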
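One way to follow this advice is to record $J(\theta)$ at every iteration and compare a few candidate learning rates. The sketch below does that on a made-up dataset; the three `alpha` values and the helper name `cost_history` are assumptions picked only to show the too-small, reasonable, and too-large cases.

```python
import numpy as np

def cost_history(X, y, alpha, num_iters=50):
    """Run gradient descent and return J(theta) before each update."""
    m, num_params = X.shape
    theta = np.zeros(num_params)
    history = []
    for _ in range(num_iters):
        residual = X @ theta - y
        history.append(residual @ residual / (2 * m))
        theta = theta - alpha * (X.T @ residual) / m
    return np.array(history)

# Made-up data generated by y = 1 + 2*x1 + 3*x2 (first column is x_0 = 1).
X = np.array([[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([1, 3, 4, 6], dtype=float)

for alpha in (0.001, 0.1, 1.5):
    history = cost_history(X, y, alpha)
    decreasing = bool(np.all(np.diff(history) < 0))
    print(f"alpha={alpha}: first J={history[0]:.3f}, "
          f"last J={history[-1]:.3g}, decreasing={decreasing}")
```

A cost that keeps falling but is still far from its minimum suggests that $\alpha$ is too small; a cost that grows suggests that $\alpha$ is too large.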

4.5 Features and polynomial regression

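This section only briefly mentions building new features by arithmetic on existing ones (addition, subtraction, multiplication, division) and polynomial regression; there is not much more to say.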
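As a small illustration of both ideas, the sketch below builds a product feature and a set of polynomial features; linear regression is then run on the new columns as usual. The variable names (`frontage`, `depth`, `size`) and the helper `polynomial_features` are hypothetical.

```python
import numpy as np

def polynomial_features(x, degree):
    """Return columns x, x^2, ..., x^degree for a 1-D feature x."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([x ** p for p in range(1, degree + 1)])

# A new feature built by arithmetic on existing ones.
frontage = np.array([10.0, 20.0, 15.0])
depth = np.array([30.0, 25.0, 40.0])
area = frontage * depth

# Polynomial regression: fit theta_0 + theta_1*x + theta_2*x^2 + theta_3*x^3
# by treating the powers of x as three separate features.
size = np.array([1.0, 2.0, 3.0, 4.0])
features = polynomial_features(size, degree=3)
```

Powers of a feature span very different ranges, so feature scaling matters more when these columns are fed to gradient descent.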

4.6 Normal equation

This section solves for the $\theta$ of linear regression analytically; in other words, it covers the least-squares method.

Normal equation: a method to solve for $\theta$ analytically.

There are $m$ training examples $(x^{(1)},y^{(1)}),\dots,(x^{(m)},y^{(m)})$ and $n$ features.
Let $x^{(i)}=(x_0^{(i)},x_1^{(i)},x_2^{(i)},\dots,x_n^{(i)})^T\in\mathbb{R}^{n+1}$ and stack the examples as rows: $X=\begin{bmatrix}(x^{(1)})^T\\(x^{(2)})^T\\\vdots\\(x^{(m)})^T\end{bmatrix}$, $y=\begin{bmatrix}y^{(1)}\\y^{(2)}\\\vdots\\y^{(m)}\end{bmatrix}$. By least squares, or equivalently by setting the derivative of the cost function to zero, we get $\theta=(X^TX)^{-1}X^Ty$. Note: feature scaling is not needed when using the normal equation.
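The closed form $\theta=(X^TX)^{-1}X^Ty$ can be sketched in NumPy as below. Instead of forming the inverse explicitly, this version solves the linear system $(X^TX)\theta=X^Ty$ with `np.linalg.solve`, which is a substitution on my part and gives the same $\theta$ when $X^TX$ is invertible. The data are made up and deliberately left unscaled.

```python
import numpy as np

def normal_equation(X, y):
    """Solve (X^T X) theta = X^T y for theta; assumes X^T X is invertible."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Made-up, unscaled data generated by y = 1 + 0.02*x1 + 3*x2.
# The first column is x_0 = 1.
X = np.array([[1, 100, 1],
              [1, 200, 1],
              [1, 100, 2],
              [1, 200, 3]], dtype=float)
y = np.array([6, 8, 9, 14], dtype=float)
theta = normal_equation(X, y)
```

No learning rate or iteration count appears anywhere, and the features differ in scale by two orders of magnitude without causing trouble here.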

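The formula above involves a matrix inverse, but in real tasks $X^TX$ is often not full rank. For example, when the number of features (variables) far exceeds the number of examples, $X$ has more columns than rows and $X^TX$ is clearly not full rank. In that case there are multiple solutions for $\theta$, all of which minimize the cost function. Which one is returned is determined by the inductive bias of the learning algorithm; a common approach is to add a regularization term.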
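Two common ways to handle a rank-deficient $X^TX$ are sketched below: the pseudoinverse, which picks the minimum-norm solution among all minimizers, and a ridge-style regularized normal equation. Neither is prescribed by the notes; the function name `ridge_normal_equation`, the value `lam=1e-3`, the choice not to penalize $\theta_0$, and the data are all assumptions made for illustration.

```python
import numpy as np

# Made-up data with more parameters (3) than examples (2); the third
# column is twice the second, so X^T X is singular.
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 4.0]])
y = np.array([3.0, 5.0])

rank = np.linalg.matrix_rank(X.T @ X)  # smaller than 3, so no inverse exists

# Option 1: the pseudoinverse returns the minimum-norm least-squares solution.
theta_pinv = np.linalg.pinv(X) @ y

def ridge_normal_equation(X, y, lam):
    """Solve (X^T X + lam * D) theta = X^T y, where D is the identity
    with a zero in the top-left entry so that theta_0 is not penalized."""
    penalty = lam * np.eye(X.shape[1])
    penalty[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)

# Option 2: regularization makes the system uniquely solvable.
theta_ridge = ridge_normal_equation(X, y, lam=1e-3)
```

Both options fit the training data here, but they return different $\theta$, which is the inductive bias mentioned above showing up in practice.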

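Pros and cons of gradient descent versus the normal equation. Gradient descent: needs a choice of learning rate $\alpha$; needs many iterations; works well (runs fast) even when the number of features $n$ is large. Normal equation: no learning rate $\alpha$ to choose; no iterations; but slow when the number of features is large ($n\geq10000$), because it has to invert the $(n+1)\times(n+1)$ matrix $X^TX$, which costs roughly $O(n^3)$.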
