1 Linear Model
  1.1 Parameter estimation
  1.2 Cost Function
  1.3 Normal equation to get $\hat\theta$
  1.4 Gradient Descent Process to get $\hat\theta$
    1.4.1 Algorithm
    1.4.2 Learning Rate
    1.4.3 Stopping: Convergence
2 Logistic Regression
3 Avoid Overfitting
  3.1 Penalizing parameters
  3.2 Regression model with penalty function
    3.2.1 Lasso regression
    3.2.2 Ridge regression
    3.2.3 Elastic net
4 Learning curve
1 Linear Model
$$y_i = \theta_1 x_{i1} + \theta_2 x_{i2} + \dots + \theta_{n_p} x_{i n_p} + \varepsilon_i, \qquad i = 1, 2, \dots, n_d.$$
or,
$$y = X\theta + \varepsilon$$
where $\varepsilon$ is the noise and $\theta$ is the vector of unknown parameters.
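As a quick illustration, the model above can be simulated in NumPy (the sizes, coefficients, and noise level below are made up for the sketch):

```python
import numpy as np

# Hypothetical sizes and coefficients, chosen only for illustration.
rng = np.random.default_rng(0)
n_d, n_p = 100, 3                       # n_d observations, n_p covariates (n_d > n_p)
X = rng.normal(size=(n_d, n_p))
X = np.column_stack([np.ones(n_d), X])  # column of 1s for an intercept theta_0
theta = np.array([1.0, 2.0, -3.0, 0.5])
eps = rng.normal(0.0, 0.1, size=n_d)    # iid N(0, sigma^2) noise with sigma = 0.1
y = X @ theta + eps
print(X.shape, y.shape)                 # (100, 4) (100,) -- n_d x (n_p + 1)
```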
- The linear model is parametric with $n_p$ parameters.
- If an intercept $\theta_0$ is added, it corresponds to a column of 1s in the design matrix $X$.
- Dimension of the design matrix: $X: n_d \times n_p$, or $X: n_d \times (n_p + 1)$ when $\theta_0$ is added; $n_d > n_p$.
- The traditional assumption on the noise $\varepsilon$ (iid): $\varepsilon_i \sim \mathcal{N}(0, \sigma^2),\ \ i = 1, 2, \dots, n_d$

1.1 Parameter estimation
For parameter estimation of the linear model there are two routes: the cost function or the **maximum likelihood principle**. The two approaches coincide at the same optimal solution for $\theta$; the difference is that the cost function reduces the discrepancy between predictions and data, while MLE maximizes the likelihood of observing the data given the parameters.
1.2 Cost Function
The goal: reduce the discrepancy between predictions and data $\Longrightarrow$ reduce the MSE of the predictions
$$MSE(\hat{y}) = \frac{1}{n_d}\|y - \hat{y}\|^2 = \frac{1}{n_d}\sum^{n_d}_{i=1}(y_i - \hat{y}_i)^2 = \frac{1}{n_d}\|y - X\hat\theta\|^2 = \frac{1}{n_d}\sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\hat\theta)^2$$
Let $J(\theta)$ be the cost function; find the estimate $\hat\theta$ of $\theta$ that minimizes $J(\theta)$, where
$$J(\theta) = \|y - X\theta\|^2 = \sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\theta)^2$$
so $\hat\theta = \underset{\theta}{\arg\min}\, J(\theta)$. This is the same as the ordinary least squares (OLS) estimator, because OLS minimizes the squared noise norm
$$\|\varepsilon(\theta)\|^2 = \sum^{n_d}_{i=1}\varepsilon_i^2 = \sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\theta)^2,$$
so the OLS estimator of $\hat\theta$ is
$$\hat\theta = \underset{\theta}{\arg\min}\, \|\varepsilon(\theta)\|^2 = \underset{\theta}{\arg\min}\, \sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\theta)^2$$
Cost Function:
$$J(\theta) = \sum^{n_d}_{i=1}(y_i - h_\theta(x_i))^2,$$
where $h_\theta(x)$ is called the hypothesis and $h_\theta(x_i) = x_i^{\mathsf{T}}\theta$ for linear models. So the estimator obtained by minimizing the cost function with $h_\theta(x_i) = x_i^{\mathsf{T}}\theta$ and the OLS estimator obtained by minimizing the squared noise norm $\|\varepsilon(\theta)\|^2$ coincide.
Get $\hat\theta$:
- gradient descent approximates $\hat\theta$ when $J(\theta)$ is convex
- other numerical optimization schemes approximate $\hat\theta$ when $J(\theta)$ is not convex
1.3 Normal equation to get $\hat\theta$
For invertible $X^{\mathsf{T}}X$,
$$\hat\theta = (X^{\mathsf{T}}X)^{-1}X^{\mathsf{T}}y$$
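A minimal NumPy sketch of the normal equation on made-up toy data (using `np.linalg.solve` instead of forming the inverse explicitly, which is numerically preferable):

```python
import numpy as np

def normal_equation(X, y):
    # Solves (X^T X) theta = X^T y rather than inverting X^T X explicitly.
    return np.linalg.solve(X.T @ X, X.T @ y)

# hypothetical noiseless data: y = 2x, intercept 0
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])   # first column of 1s is the intercept
y = np.array([2.0, 4.0, 6.0])
theta_hat = normal_equation(X, y)
print(theta_hat)             # approximately [0., 2.]
```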
Gauss-Markov assumptions:
- $E(\varepsilon_i) = 0$
- $Var(\varepsilon_i) = \sigma^2$
- $Cov(\varepsilon_i, \varepsilon_j) = 0,\ i \neq j$
Under these assumptions, $X$ should have linearly independent columns (full column rank) so that $X^{\mathsf{T}}X$ is non-singular.
$$E(\hat\theta) = \theta$$
$$Var(\hat\theta) = \sigma^2(X^{\mathsf{T}}X)^{-1}$$
$$\hat\theta \sim \mathcal{N}\left(\theta, \sigma^2(X^{\mathsf{T}}X)^{-1}\right)$$
(under the assumption $\varepsilon_i \sim \mathcal{N}(0, \sigma^2),\ i = 1, 2, \dots, n_d$). The maximum likelihood is
$$\mathcal{L}(y, X \mid \theta, \sigma^2) = \prod^{n_d}_{i=1}(2\pi\sigma^2)^{-1/2}\exp\left(-\frac{(y_i - x_i^{\mathsf{T}}\theta)^2}{2\sigma^2}\right) = (2\pi\sigma^2)^{-n_d/2}\exp\left(-\frac{1}{2\sigma^2}\sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\theta)^2\right)$$
The associated MLE:
$$\hat\theta = (X^{\mathsf{T}}X)^{-1}X^{\mathsf{T}}y,$$
which coincides with OLS for normal noise, i.e. it equals
$$\underset{\theta}{\arg\min}\, \sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\theta)^2$$
Limitations of the normal equation with large $n_p$:
- computational cost of $(X^{\mathsf{T}}X)^{-1}$
- singularity: with large $n_p$, the columns of $X$ may become highly correlated

1.4 Gradient Descent Process to get $\hat\theta$
The logic: start with $\theta = (\theta_0, \theta_1)$, then keep changing $(\theta_0, \theta_1)$ to reduce $J(\theta_0, \theta_1)$ until reaching $\underset{\theta_0,\theta_1}{\min}\, J(\theta_0, \theta_1)$.
1.4.1 Algorithm
1. Initialize $(\theta_1, \theta_2, \dots, \theta_n)$
2. Store the current values: $(\tilde\theta_1, \tilde\theta_2, \dots, \tilde\theta_n) = (\theta_1, \theta_2, \dots, \theta_n)$
3. Update each $\theta_i$ with $\theta_i = \tilde\theta_i - a \cdot \frac{\partial}{\partial\theta_i}J(\tilde\theta_1, \tilde\theta_2, \dots, \tilde\theta_n)$, where $a$ is the learning rate
4. Repeat 2, 3 until $\theta_i$ converges
1.4.2 Learning Rate
The learning rate $a$ used in the algorithm should be a positive constant. The learning rate affects both convergence and its speed: too small a learning rate leads to slow convergence, while too large a learning rate can lead to non-convergence or divergence. It can be selected using a validation set or cross-validation.
1.4.3 Stopping: Convergence
Check that the error tolerance is close to 0:
- Absolute error tolerance: $\varepsilon_{abs} = \left|J(\theta^{(k+1)}) - J(\theta^{(k)})\right|$
- Relative error tolerance: $\varepsilon_{rel} = \left|\dfrac{J(\theta^{(k+1)}) - J(\theta^{(k)})}{J(\theta^{(k+1)})}\right|$
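The algorithm plus the absolute-tolerance stopping rule can be sketched in NumPy for the linear-model cost $J(\theta) = \|y - X\theta\|^2$ (the toy data, learning rate, and tolerance are made up):

```python
import numpy as np

def gradient_descent(X, y, a=0.01, tol=1e-10, max_iter=100000):
    theta = np.zeros(X.shape[1])                 # step 1: initialize theta
    J = lambda t: np.sum((y - X @ t) ** 2)       # cost J(theta) = ||y - X theta||^2
    for _ in range(max_iter):
        grad = -2.0 * X.T @ (y - X @ theta)      # gradient of J at the stored theta
        theta_new = theta - a * grad             # simultaneous update, learning rate a
        if abs(J(theta_new) - J(theta)) < tol:   # absolute error tolerance stopping rule
            return theta_new
        theta = theta_new
    return theta

# hypothetical toy data: y = 2x exactly, so GD should approach the OLS solution
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
theta_gd = gradient_descent(X, y)
print(theta_gd)   # approaches [0., 2.]
```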
2 Logistic Regression
Output: $y \in \{0, 1\}$
Decision boundary: $x_i^{\mathsf{T}}\theta = 0$
Hypothesis:
$$h_\theta(x_i) = g(x_i^{\mathsf{T}}\theta) = \frac{1}{1 + \exp(-x_i^{\mathsf{T}}\theta)}$$
Cost function (single sample):
$$J(\theta_0, \theta_1) = \begin{cases} -\log\left(h_{(\theta_0,\theta_1)}(x)\right), & \text{if } y = 1 \\ -\log\left(1 - h_{(\theta_0,\theta_1)}(x)\right), & \text{if } y = 0 \end{cases}$$
General form of the cost function for $n$ samples:
$$J(\theta) = -\frac{1}{n}\sum^{n}_{i=1}\left(y_i \log(h_\theta(x_i)) + (1 - y_i)\log(1 - h_\theta(x_i))\right)$$
Classification:
- $x_i^{\mathsf{T}}\theta \ge 0 \Longrightarrow h_\theta(x_i) \ge 0.5 \Longrightarrow \hat{y} = 1$
  - if $y = 1$, then $J(\theta) \rightarrow 0$
  - if $y = 0$, then $J(\theta) \rightarrow \infty$
- $x_i^{\mathsf{T}}\theta < 0 \Longrightarrow h_\theta(x_i) < 0.5 \Longrightarrow \hat{y} = 0$
  - if $y = 1$, then $J(\theta) \rightarrow \infty$
  - if $y = 0$, then $J(\theta) \rightarrow 0$
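The hypothesis, cost, and classification rule above can be sketched directly in NumPy (the four data points and $\theta$ are made up; $\theta$ is fixed rather than fitted):

```python
import numpy as np

def h(X, theta):
    # hypothesis: sigmoid of the linear score x_i^T theta
    return 1.0 / (1.0 + np.exp(-X @ theta))

def J(theta, X, y):
    # cross-entropy cost averaged over the n samples
    p = h(X, theta)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# hypothetical 1-D example: intercept column + one covariate
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = np.array([0.0, 3.0])                # decision boundary at x = 0
y_hat = (h(X, theta) >= 0.5).astype(float)  # classification rule
print(y_hat)          # [0. 0. 1. 1.] -- all four points on the correct side
print(J(theta, X, y))
```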
3 Avoid Overfitting
- more overfitting $\Longrightarrow$ higher variance of the predictions
- more underfitting $\Longrightarrow$ higher bias of the predictions
3.1 Penalizing parameters
$$\sum^{n_d}_{i=1}(y_i - h_\theta(x_i))^2 + \lambda r(\theta),$$
where $\lambda r(\theta)$ is the penalty, $r(\theta)$ is the parameter penalty function, and $\lambda$ is the regularization parameter.

Regularization parameter $\lambda$:
- $\lambda \rightarrow 0$: the regularized regression estimates tend to the linear regression estimates
- $\lambda \rightarrow \infty$: the parameters are penalized more, so there is less over-fitting to the data
- select $\lambda$ via cross-validation
Parameter penalty functions $r(\theta)$. Some typical choices (cf. distances):
- $d$-variation function: $r(\theta) = \left(\sum^{n_p}_{i=1}|\theta_i|^d\right)^{1/d}$
- $L_1$ norm: $r(\theta) = \sum^{n_p}_{i=1}|\theta_i|$
- squared $L_2$ norm: $r(\theta) = \sum^{n_p}_{i=1}|\theta_i|^2$

3.2 Regression model with penalty function
3.2.1 Lasso regression
Lasso regression uses the $L_1$ norm as penalty function. Lasso "zeroes out" coefficients, so it performs variable selection and, to some extent, parameter shrinkage.
Cost function:
$$J_L(\theta) = \|y - X\theta\|^2 + \lambda\|\theta\|_1 = \sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\theta)^2 + \lambda\sum^{n_p}_{j=1}|\theta_j|$$
Lasso estimate:
$$\hat\theta_L = \underset{\theta}{\arg\min}\left(\sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\theta)^2 + \lambda\sum^{n_p}_{j=1}|\theta_j|\right)$$
There is no closed-form solution, but $J_L(\theta)$ is convex, so it can be solved numerically, e.g. with the least angle regression algorithm.
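Besides least angle regression, cyclic coordinate descent is another standard solver for this convex problem; a short sketch (data and $\lambda$ are made up) shows the "zeroing out" behaviour, which comes from the soft-thresholding step:

```python
import numpy as np

def soft_threshold(a, b):
    # S(a, b) = sign(a) * max(|a| - b, 0): this is what sets coefficients to exactly 0
    return np.sign(a) * np.maximum(np.abs(a) - b, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    # Cyclic coordinate descent for J_L(theta) = ||y - X theta||^2 + lam * ||theta||_1;
    # each single-coordinate minimization has a closed form via soft-thresholding.
    theta = np.zeros(X.shape[1])
    z = np.sum(X ** 2, axis=0)                       # x_j^T x_j for each column
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            r = y - X @ theta + X[:, j] * theta[j]   # partial residual excluding j
            rho = X[:, j] @ r
            theta[j] = soft_threshold(rho, lam / 2.0) / z[j]
    return theta

# hypothetical data: only coefficients 0 and 3 are truly nonzero
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([3.0, 0.0, 0.0, -2.0, 0.0]) + rng.normal(0.0, 0.1, size=50)
theta_L = lasso_cd(X, y, lam=20.0)
print(theta_L)   # the three irrelevant coefficients are typically driven to exactly 0
```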
3.2.2 Ridge regression
Ridge regression uses the squared $L_2$ norm as penalty function. It does not "zero out" coefficients, i.e. it cannot perform variable selection, but rather shrinks the parameter values.
Cost function:
$$J_R(\theta) = \|y - X\theta\|^2 + \lambda\|\theta\|_2^2 = \sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\theta)^2 + \lambda\sum^{n_p}_{j=1}\theta_j^2$$
Ridge estimate:
$$\hat\theta_R = (X^{\mathsf{T}}X + \lambda I)^{-1}X^{\mathsf{T}}y$$
This is a closed-form solution.
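The closed-form ridge estimate in NumPy (the toy data and $\lambda$ are made up; $\lambda = 0$ recovers OLS as a sanity check):

```python
import numpy as np

def ridge(X, y, lam):
    # hat{theta}_R = (X^T X + lam I)^{-1} X^T y, computed via a linear solve
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
theta_ols = ridge(X, y, lam=0.0)   # lam = 0 recovers the OLS estimate
theta_R = ridge(X, y, lam=10.0)    # a larger lam shrinks the coefficients
print(theta_ols, theta_R)
```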
The ridge estimate is biased because:
$$\begin{aligned} E(\hat\theta_R) &= E\left((X^{\mathsf{T}}X + \lambda I)^{-1}X^{\mathsf{T}}y\right) \\ &= (X^{\mathsf{T}}X + \lambda I)^{-1}X^{\mathsf{T}}E(y) \\ &= (X^{\mathsf{T}}X + \lambda I)^{-1}X^{\mathsf{T}}X\theta \\ &= (X^{\mathsf{T}}X + \lambda I)^{-1}\left((X^{\mathsf{T}}X)^{-1}\right)^{-1}\theta \\ &= \left[(X^{\mathsf{T}}X)^{-1}(X^{\mathsf{T}}X + \lambda I)\right]^{-1}\theta \\ &= \left[I + \lambda(X^{\mathsf{T}}X)^{-1}\right]^{-1}\theta \end{aligned}$$
$X^{\mathsf{T}}X$ is positive definite and $\lambda > 0$ by definition, so $E(\hat\theta_R) \neq \theta$.
Tips:
Although ridge regression does not perform variable selection, it performs grouped selection. So if one variable amongst a group of correlated ones is selected, ridge regression automatically includes the whole group. Ridge regression can resolve near multicollinearity.
3.2.3 Elastic net
The elastic net penalty is a compromise between Lasso and ridge.
Cost function:
$$J_E(\theta) = \|y - X\theta\|^2 + \lambda_1\|\theta\|_1 + \lambda_2\|\theta\|_2^2$$
Elastic net estimate:
$$\hat\theta_E = \underset{\theta}{\arg\min}\left(\|y - X\theta\|^2 + \lambda_1\|\theta\|_1 + \lambda_2\|\theta\|_2^2\right)$$
There is no closed-form solution for $\hat\theta_E$, but $J_E(\theta)$ is convex, so it has a unique minimum.
4 Learning curve
Ways to increase prediction accuracy:
- variable selection: increase / reduce the number of covariates
- add polynomial features, e.g. $\{x^2, x_1 x_2\}$
- regularized regression (Lasso, ridge regression and elastic nets)
- collect more data

Learning curve:
- a learning curve plots a metric of prediction accuracy, like the cost function $J(\theta)$ or another error metric, as a function of a parameter that affects the metric
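A minimal learning-curve sketch in NumPy, using training-set size as the parameter that affects the error metric (the simulated data, sizes, and noise level are made up):

```python
import numpy as np

def mse(theta, X, y):
    return np.mean((y - X @ theta) ** 2)

# hypothetical simulated data split into a training and a validation part
rng = np.random.default_rng(1)
n_tr, n_val = 200, 100
X_all = np.column_stack([np.ones(n_tr + n_val), rng.normal(size=n_tr + n_val)])
y_all = X_all @ np.array([1.0, 2.0]) + rng.normal(0.0, 0.5, size=n_tr + n_val)
X_tr, y_tr = X_all[:n_tr], y_all[:n_tr]
X_val, y_val = X_all[n_tr:], y_all[n_tr:]

# the learning curve: training / validation MSE as a function of the
# training-set size m; typically the gap between them narrows as m grows
for m in (5, 20, 50, 200):
    theta = np.linalg.solve(X_tr[:m].T @ X_tr[:m], X_tr[:m].T @ y_tr[:m])
    print(m, mse(theta, X_tr[:m], y_tr[:m]), mse(theta, X_val, y_val))
```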