Suppose the model is
\hat{y} = f(x) = X\beta
We can write a base class that holds the parameters every linear regression variant needs,
class LinearRegressionbase:
    def __init__(self):
        # beta: shape (p, 1)
        self.beta = None

    def fit(self, X, y):
        # X: shape (n, p), y: shape (n, 1)
        pass

    def predict(self, X):
        # X: shape (n, p)
        return X @ self.beta
Let's generate some test data,
import numpy as np

n = 100
p = 5
noise = np.random.randn(n) / 10
X = np.random.randn(n, p) * 10
weights = np.ones(p)
weights[[1, 2, 3]] = [2, 5, 3]
y = (X @ weights + noise).reshape(-1, 1)
X.shape, y.shape, X.mean(), y.mean()
The ground truth for the data generated above is:
y = 1\cdot[x]_0 + 2\cdot[x]_1 + 5\cdot[x]_2 + 3\cdot[x]_3 + 1\cdot[x]_4
The closed-form solution of linear regression
Using Mean Square Error as the loss function,
loss = \frac{1}{N}||\hat{y} - y||_2^2 = \frac{1}{N}||X\beta - y||_2^2
Taking the derivative with respect to β,

\frac{\partial loss}{\partial\beta} = \frac{1}{N}(2X^TX\beta - 2X^Ty)

Setting this derivative to 0, the closed-form solution is:
\hat\beta = (X^TX)^{-1}X^Ty \quad (\text{assuming } X^TX \text{ is invertible})
Let's implement it in Python,
class LinearRegression(LinearRegressionbase):
    def fit(self, X, y):
        super().fit(X, y)
        self.beta = (np.linalg.inv(X.T @ X) @ X.T) @ y
Computing the result, we can see it is quite close to the ground truth [1, 2, 5, 3, 1],
lr = LinearRegression()
lr.fit(X, y)
pred = lr.predict(X)
print(f'beta {lr.beta.flatten()}')
print(f'error {np.linalg.norm(pred-y)}')
# beta [0.99903924 1.99955203 5.00101573 3.00037858 1.00036293]
# error 0.9101951355773127
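As a quick sanity check (my addition, not from the original post), the same coefficients can be recovered with NumPy's built-in least-squares solver, which avoids explicitly inverting X^TX:

# Sanity check: np.linalg.lstsq solves the same least-squares problem directly.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_lstsq, lr.beta))  # True (up to floating-point precision)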
Optimizing with SGD
This is not hard either. We use PyTorch as the automatic differentiation tool, so we don't have to write the gradients ourselves,
import torch

X = torch.normal(0, 0.5, (100, 2))
Y = X * torch.Tensor([3, 1])
# torch.autograd.Variable is deprecated; a plain tensor with requires_grad works the same way
weights = torch.tensor([1.0, 4.0], requires_grad=True)
optimizer = torch.optim.SGD([weights], lr=0.1, momentum=0.9)
print(weights)
for e in range(100):
    pred = X * weights
    loss = torch.mean((pred - Y) ** 2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if e % 10 == 0:
        print(weights, loss.cpu().item())
Pretty simple. Let's look at the output:
tensor([3.0024, 0.9876], requires_grad=True)
tensor([3.0037, 0.9900], requires_grad=True) 2.2653073756373487e-05
tensor([3.0058, 1.0071], requires_grad=True) 1.056142536981497e-05
tensor([2.9994, 1.0027], requires_grad=True) 1.8875620071412413e-06
tensor([2.9979, 0.9972], requires_grad=True) 1.558545136504108e-06
tensor([2.9999, 0.9993], requires_grad=True) 1.4644724899426365e-07
tensor([3.0007, 1.0010], requires_grad=True) 2.129867482381087e-07
tensor([3.0001, 1.0001], requires_grad=True) 1.1658719323293099e-08
tensor([2.9998, 0.9996], requires_grad=True) 2.724585179691985e-08
tensor([2.9999, 1.0000], requires_grad=True) 1.2194050214020535e-09
tensor([3.0001, 1.0001], requires_grad=True) 3.2819149620166854e-09
(Note that this example does not use our linear regression base class.)
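If we wanted to fit gradient-based training into the base-class interface from the first section, a minimal sketch could look like this (my own addition; it uses plain full-batch gradient descent in NumPy rather than PyTorch, and the class name and hyperparameters are just illustrative):

class GDLinearRegression(LinearRegressionbase):
    def fit(self, X, y, lr=1e-4, epochs=1000):
        # Full-batch gradient descent on the MSE loss derived above.
        n, p = X.shape
        self.beta = np.zeros((p, 1))
        for _ in range(epochs):
            grad = 2 / n * X.T @ (X @ self.beta - y)  # d(loss)/d(beta)
            self.beta -= lr * grad
        return self

# Using the NumPy X, y from the data-generation step (not the torch tensors above):
gd = GDLinearRegression()
gd.fit(X, y)
print(gd.beta.flatten())  # should again be close to [1, 2, 5, 3, 1]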
The closed-form solution of ridge regression
Ridge Regression is the same as the first part, except that we add an L2 regularization term to the loss; this gives ridge regression,
loss = \frac{1}{N}||X\beta - y||_2^2 + \lambda||\beta||_2^2
As in the previous section, we take the derivative with respect to β,
\frac{\partial loss}{\partial\beta} = \frac{1}{N}(2X^TX\beta - 2X^Ty) + 2\lambda\beta
Again, setting this derivative to 0 (and absorbing the constant factor N into λ), the β that minimizes the loss is

\hat\beta = (X^TX + \lambda I)^{-1}X^Ty
(X^TX + \lambda I) is always invertible (for λ > 0), so we no longer need the invertibility assumption from the previous section.
A quick proof:
Every vector is an eigenvector of \lambda I, because \lambda I v = \lambda v, and every eigenvalue of \lambda I equals \lambda.
X^TX is a symmetric matrix and positive semi-definite, so all of its eigenvalues are greater than or equal to 0.
So for every eigenvalue/eigenvector pair of X^TX, denoted a_i, v_i,
(X^TX + \lambda I)v_i = X^TXv_i + \lambda Iv_i = a_iv_i + \lambda v_i = (a_i + \lambda)v_i
So every eigenvalue of (X^TX + \lambda I) is a_i + \lambda > 0 (since a_i \ge 0 and \lambda > 0).
Therefore (X^TX + \lambda I) is always invertible.
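We can also check this numerically (my addition, reusing the NumPy X from the data-generation step): every eigenvalue of X^TX + λI should be at least λ.

# Numerical check of the invertibility argument; l plays the role of lambda.
l = 0.1
eigvals = np.linalg.eigvalsh(X.T @ X + l * np.identity(X.shape[1]))  # symmetric matrix
print(eigvals.min())              # smallest eigenvalue, expected to be >= l
print(eigvals.min() >= l - 1e-8)  # True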
Python implementation:
class RidgeRegression(LinearRegressionbase):
    def fit(self, X, y, l):
        # l is the lambda in the formula above
        super().fit(X, y)
        self.beta = (np.linalg.inv(X.T @ X + l * np.identity(X.shape[1])) @ X.T) @ y
Running it with a few different values of λ:
l = 10:  beta [0.99547771 1.99660303 4.99462785 2.99652915 0.99752248] error 1.1770827499055974
l = 1:   beta [0.99868234 1.99925674 5.00037611 2.99999319 1.00007827] error 0.9132585574988228
l = 0.1: beta [0.99900354 1.9995225  5.00095176 3.00034004 1.00033446] error 0.9102258292688261
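The numbers above can be reproduced with a small driver loop like this (a sketch; the original post does not show the exact code):

ridge = RidgeRegression()
for l in [10, 1, 0.1]:
    ridge.fit(X, y, l)
    pred = ridge.predict(X)
    print(f'l = {l}: beta {ridge.beta.flatten()} error {np.linalg.norm(pred - y)}')

As expected, a larger λ shrinks the coefficients toward zero and slightly increases the training error, while small λ values approach the plain least-squares solution.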



