1 Linear Model
  1.1 Parameter estimation
  1.2 Cost Function
  1.3 Normal equation to get $\hat\theta$
  1.4 Gradient Descent Process to get $\hat\theta$
    1.4.1 Algorithm
    1.4.2 Learning Rate
    1.4.3 Stopping: Convergence
2 Logistic Regression
3 Avoid Overfitting
  3.1 Penalizing parameters
  3.2 Regression model with penalty function
    3.2.1 Lasso regression
    3.2.2 Ridge regression
    3.2.3 Elastic net
4 Learning curve
1 Linear Model
$$y_i = \theta_1 x_{i1} + \theta_2 x_{i2} + \dots + \theta_{n_p} x_{i n_p} + \varepsilon_i, \qquad i = 1, 2, \dots, n_d.$$
or,
$$y = X\theta + \varepsilon$$
where $\varepsilon$ is the noise and $\theta$ is the vector of unknown parameters.
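As a quick illustration, the model above can be simulated in NumPy (the sizes, coefficients, and noise level below are made up for the sketch):

```python
import numpy as np

# Hypothetical sizes and coefficients, chosen only for illustration.
rng = np.random.default_rng(0)
n_d, n_p = 100, 3                       # n_d observations, n_p covariates (n_d > n_p)
X = rng.normal(size=(n_d, n_p))
X = np.column_stack([np.ones(n_d), X])  # column of 1s for an intercept theta_0
theta = np.array([1.0, 2.0, -3.0, 0.5])
eps = rng.normal(0.0, 0.1, size=n_d)    # iid N(0, sigma^2) noise with sigma = 0.1
y = X @ theta + eps
print(X.shape, y.shape)                 # (100, 4) (100,) -- n_d x (n_p + 1)
```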
- The linear model is parametric with $n_p$ parameters.
- If an intercept $\theta_0$ is added, it corresponds to a column of 1s in the design matrix $X$.
- Dimension of the design matrix: $X: n_d \times n_p$, or $X: n_d \times (n_p + 1)$ when $\theta_0$ is added; $n_d > n_p$.
- The traditional assumption on the noise $\varepsilon$ (iid): $\varepsilon_i \sim \mathcal{N}(0, \sigma^2),\ \ i = 1, 2, \dots, n_d$

1.1 Parameter estimation
For parameter estimation of the linear model there are two routes: the cost function or the **maximum likelihood principle**. The two approaches coincide at the same optimal solution for $\theta$; the difference is that the cost function reduces the discrepancy between predictions and data, while MLE maximizes the likelihood of observing the data given the parameters.
1.2 Cost Function
The goal: reduce the discrepancy between predictions and data $\Longrightarrow$ reduce the MSE of the predictions
$$MSE(\hat{y}) = \frac{1}{n_d}\|y - \hat{y}\|^2 = \frac{1}{n_d}\sum^{n_d}_{i=1}(y_i - \hat{y}_i)^2 = \frac{1}{n_d}\|y - X\hat\theta\|^2 = \frac{1}{n_d}\sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\hat\theta)^2$$
Let $J(\theta)$ be the cost function; find the estimate $\hat\theta$ of $\theta$ that minimizes $J(\theta)$, where
$$J(\theta) = \|y - X\theta\|^2 = \sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\theta)^2$$
so $\hat\theta = \underset{\theta}{\arg\min}\, J(\theta)$. This is the same as the ordinary least squares (OLS) estimator, because OLS minimizes the squared noise norm
$$\|\varepsilon(\theta)\|^2 = \sum^{n_d}_{i=1}\varepsilon_i^2 = \sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\theta)^2,$$
so the OLS estimator of $\hat\theta$ is
$$\hat\theta = \underset{\theta}{\arg\min}\, \|\varepsilon(\theta)\|^2 = \underset{\theta}{\arg\min}\, \sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\theta)^2$$
Cost Function:
$$J(\theta) = \sum^{n_d}_{i=1}(y_i - h_\theta(x_i))^2,$$
where $h_\theta(x)$ is called the hypothesis and $h_\theta(x_i) = x_i^{\mathsf{T}}\theta$ for linear models. So the estimator obtained by minimizing the cost function with $h_\theta(x_i) = x_i^{\mathsf{T}}\theta$ and the OLS estimator obtained by minimizing the squared noise norm $\|\varepsilon(\theta)\|^2$ coincide.
Get $\hat\theta$:
- gradient descent approximates $\hat\theta$ when $J(\theta)$ is convex
- other numerical optimization schemes approximate $\hat\theta$ when $J(\theta)$ is not convex
1.3 Normal equation to get $\hat\theta$
For invertible $X^{\mathsf{T}}X$,
$$\hat\theta = (X^{\mathsf{T}}X)^{-1}X^{\mathsf{T}}y$$
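A minimal NumPy sketch of the normal equation on made-up toy data (using `np.linalg.solve` instead of forming the inverse explicitly, which is numerically preferable):

```python
import numpy as np

def normal_equation(X, y):
    # Solves (X^T X) theta = X^T y rather than inverting X^T X explicitly.
    return np.linalg.solve(X.T @ X, X.T @ y)

# hypothetical noiseless data: y = 2x, intercept 0
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])   # first column of 1s is the intercept
y = np.array([2.0, 4.0, 6.0])
theta_hat = normal_equation(X, y)
print(theta_hat)             # approximately [0., 2.]
```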
Gauss-Markov assumptions:
- $E(\varepsilon_i) = 0$
- $Var(\varepsilon_i) = \sigma^2$
- $Cov(\varepsilon_i, \varepsilon_j) = 0,\ i \neq j$
Under these assumptions, $X$ should have linearly independent columns (full column rank) so that $X^{\mathsf{T}}X$ is non-singular.
$$E(\hat\theta) = \theta$$
$$Var(\hat\theta) = \sigma^2(X^{\mathsf{T}}X)^{-1}$$
$$\hat\theta \sim \mathcal{N}\left(\theta, \sigma^2(X^{\mathsf{T}}X)^{-1}\right)$$
(under the assumption $\varepsilon_i \sim \mathcal{N}(0, \sigma^2),\ i = 1, 2, \dots, n_d$). The maximum likelihood is
$$\mathcal{L}(y, X \mid \theta, \sigma^2) = \prod^{n_d}_{i=1}(2\pi\sigma^2)^{-1/2}\exp\left(-\frac{(y_i - x_i^{\mathsf{T}}\theta)^2}{2\sigma^2}\right) = (2\pi\sigma^2)^{-n_d/2}\exp\left(-\frac{1}{2\sigma^2}\sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\theta)^2\right)$$
The associated MLE:
$$\hat\theta = (X^{\mathsf{T}}X)^{-1}X^{\mathsf{T}}y,$$
which coincides with OLS for normal noise, i.e. it equals
$$\underset{\theta}{\arg\min}\, \sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\theta)^2$$
Limitations of the normal equation with large $n_p$:
- computational cost of $(X^{\mathsf{T}}X)^{-1}$
- singularity: with large $n_p$, the columns of $X$ may become highly correlated

1.4 Gradient Descent Process to get $\hat\theta$
The logic: start with $\theta = (\theta_0, \theta_1)$, then keep changing $(\theta_0, \theta_1)$ to reduce $J(\theta_0, \theta_1)$ until reaching $\underset{\theta_0,\theta_1}{\min}\, J(\theta_0, \theta_1)$.
1.4.1 Algorithm
1. Initialize $(\theta_1, \theta_2, \dots, \theta_n)$
2. Store the current values: $(\tilde\theta_1, \tilde\theta_2, \dots, \tilde\theta_n) = (\theta_1, \theta_2, \dots, \theta_n)$
3. Update each $\theta_i$ with $\theta_i = \tilde\theta_i - a \cdot \frac{\partial}{\partial\theta_i}J(\tilde\theta_1, \tilde\theta_2, \dots, \tilde\theta_n)$, where $a$ is the learning rate
4. Repeat 2, 3 until $\theta_i$ converges
1.4.2 Learning Rate
The learning rate $a$ used in the algorithm should be a positive constant. The learning rate affects both convergence and its speed: too small a learning rate leads to slow convergence, while too large a learning rate can lead to non-convergence or divergence. It can be selected using a validation set or cross-validation.
1.4.3 Stopping: Convergence
Check that the error tolerance is close to 0:
- Absolute error tolerance: $\varepsilon_{abs} = \left|J(\theta^{(k+1)}) - J(\theta^{(k)})\right|$
- Relative error tolerance: $\varepsilon_{rel} = \left|\dfrac{J(\theta^{(k+1)}) - J(\theta^{(k)})}{J(\theta^{(k+1)})}\right|$
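The algorithm plus the absolute-tolerance stopping rule can be sketched in NumPy for the linear-model cost $J(\theta) = \|y - X\theta\|^2$ (the toy data, learning rate, and tolerance are made up):

```python
import numpy as np

def gradient_descent(X, y, a=0.01, tol=1e-10, max_iter=100000):
    theta = np.zeros(X.shape[1])                 # step 1: initialize theta
    J = lambda t: np.sum((y - X @ t) ** 2)       # cost J(theta) = ||y - X theta||^2
    for _ in range(max_iter):
        grad = -2.0 * X.T @ (y - X @ theta)      # gradient of J at the stored theta
        theta_new = theta - a * grad             # simultaneous update, learning rate a
        if abs(J(theta_new) - J(theta)) < tol:   # absolute error tolerance stopping rule
            return theta_new
        theta = theta_new
    return theta

# hypothetical toy data: y = 2x exactly, so GD should approach the OLS solution
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
theta_gd = gradient_descent(X, y)
print(theta_gd)   # approaches [0., 2.]
```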
2 Logistic Regression
Output: $y \in \{0, 1\}$
Decision boundary: $x_i^{\mathsf{T}}\theta = 0$
Hypothesis:
$$h_\theta(x_i) = g(x_i^{\mathsf{T}}\theta) = \frac{1}{1 + \exp(-x_i^{\mathsf{T}}\theta)}$$
Cost function (single sample):
$$J(\theta_0, \theta_1) = \begin{cases} -\log\left(h_{(\theta_0,\theta_1)}(x)\right), & \text{if } y = 1 \\ -\log\left(1 - h_{(\theta_0,\theta_1)}(x)\right), & \text{if } y = 0 \end{cases}$$
General form of the cost function for $n$ samples:
$$J(\theta) = -\frac{1}{n}\sum^{n}_{i=1}\left(y_i \log(h_\theta(x_i)) + (1 - y_i)\log(1 - h_\theta(x_i))\right)$$
Classification:
- $x_i^{\mathsf{T}}\theta \ge 0 \Longrightarrow h_\theta(x_i) \ge 0.5 \Longrightarrow \hat{y} = 1$
  - if $y = 1$, then $J(\theta) \rightarrow 0$
  - if $y = 0$, then $J(\theta) \rightarrow \infty$
- $x_i^{\mathsf{T}}\theta < 0 \Longrightarrow h_\theta(x_i) < 0.5 \Longrightarrow \hat{y} = 0$
  - if $y = 1$, then $J(\theta) \rightarrow \infty$
  - if $y = 0$, then $J(\theta) \rightarrow 0$
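The hypothesis, cost, and classification rule above can be sketched directly in NumPy (the four data points and $\theta$ are made up; $\theta$ is fixed rather than fitted):

```python
import numpy as np

def h(X, theta):
    # hypothesis: sigmoid of the linear score x_i^T theta
    return 1.0 / (1.0 + np.exp(-X @ theta))

def J(theta, X, y):
    # cross-entropy cost averaged over the n samples
    p = h(X, theta)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# hypothetical 1-D example: intercept column + one covariate
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = np.array([0.0, 3.0])                # decision boundary at x = 0
y_hat = (h(X, theta) >= 0.5).astype(float)  # classification rule
print(y_hat)          # [0. 0. 1. 1.] -- all four points on the correct side
print(J(theta, X, y))
```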
3 Avoid Overfitting
- more overfitting $\Longrightarrow$ higher variance of the predictions
- more underfitting $\Longrightarrow$ higher bias of the predictions
3.1 Penalizing parameters
$$\sum^{n_d}_{i=1}(y_i - h_\theta(x_i))^2 + \lambda r(\theta),$$
where $\lambda r(\theta)$ is the penalty, $r(\theta)$ is the parameter penalty function, and $\lambda$ is the regularization parameter.

Regularization parameter $\lambda$:
- $\lambda \rightarrow 0$: the regularized regression estimates tend to the linear regression estimates
- $\lambda \rightarrow \infty$: the parameters are penalized more, so there is less over-fitting to the data
- select $\lambda$ via cross-validation
Parameter penalty functions $r(\theta)$. Some typical choices (cf. distances):
- $d$-variation function: $r(\theta) = \left(\sum^{n_p}_{i=1}|\theta_i|^d\right)^{1/d}$
- $L_1$ norm: $r(\theta) = \sum^{n_p}_{i=1}|\theta_i|$
- squared $L_2$ norm: $r(\theta) = \sum^{n_p}_{i=1}|\theta_i|^2$

3.2 Regression model with penalty function
3.2.1 Lasso regression
Lasso regression uses the $L_1$ norm as penalty function. Lasso "zeroes out" coefficients, so it performs variable selection and, to some extent, parameter shrinkage.
Cost function:
$$J_L(\theta) = \|y - X\theta\|^2 + \lambda\|\theta\|_1 = \sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\theta)^2 + \lambda\sum^{n_p}_{j=1}|\theta_j|$$
Lasso estimate:
$$\hat\theta_L = \underset{\theta}{\arg\min}\left(\sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\theta)^2 + \lambda\sum^{n_p}_{j=1}|\theta_j|\right)$$
There is no closed-form solution, but $J_L(\theta)$ is convex, so it can be solved numerically, e.g. with the least angle regression algorithm.
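Besides least angle regression, cyclic coordinate descent is another standard solver for this convex problem; a short sketch (data and $\lambda$ are made up) shows the "zeroing out" behaviour, which comes from the soft-thresholding step:

```python
import numpy as np

def soft_threshold(a, b):
    # S(a, b) = sign(a) * max(|a| - b, 0): this is what sets coefficients to exactly 0
    return np.sign(a) * np.maximum(np.abs(a) - b, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    # Cyclic coordinate descent for J_L(theta) = ||y - X theta||^2 + lam * ||theta||_1;
    # each single-coordinate minimization has a closed form via soft-thresholding.
    theta = np.zeros(X.shape[1])
    z = np.sum(X ** 2, axis=0)                       # x_j^T x_j for each column
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            r = y - X @ theta + X[:, j] * theta[j]   # partial residual excluding j
            rho = X[:, j] @ r
            theta[j] = soft_threshold(rho, lam / 2.0) / z[j]
    return theta

# hypothetical data: only coefficients 0 and 3 are truly nonzero
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([3.0, 0.0, 0.0, -2.0, 0.0]) + rng.normal(0.0, 0.1, size=50)
theta_L = lasso_cd(X, y, lam=20.0)
print(theta_L)   # the three irrelevant coefficients are typically driven to exactly 0
```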
3.2.2 Ridge regression
Ridge regression uses the squared $L_2$ norm as penalty function. It does not "zero out" coefficients, i.e. it cannot perform variable selection, but rather shrinks the parameter values.
Cost function:
$$J_R(\theta) = \|y - X\theta\|^2 + \lambda\|\theta\|_2^2 = \sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\theta)^2 + \lambda\sum^{n_p}_{j=1}\theta_j^2$$
Ridge estimate:
$$\hat\theta_R = (X^{\mathsf{T}}X + \lambda I)^{-1}X^{\mathsf{T}}y$$
This is a closed-form solution.
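The closed-form ridge estimate in NumPy (the toy data and $\lambda$ are made up; $\lambda = 0$ recovers OLS as a sanity check):

```python
import numpy as np

def ridge(X, y, lam):
    # hat{theta}_R = (X^T X + lam I)^{-1} X^T y, computed via a linear solve
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
theta_ols = ridge(X, y, lam=0.0)   # lam = 0 recovers the OLS estimate
theta_R = ridge(X, y, lam=10.0)    # a larger lam shrinks the coefficients
print(theta_ols, theta_R)
```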
The ridge estimate is biased because:
$$\begin{aligned} E(\hat\theta_R) &= E\left((X^{\mathsf{T}}X + \lambda I)^{-1}X^{\mathsf{T}}y\right) \\ &= (X^{\mathsf{T}}X + \lambda I)^{-1}X^{\mathsf{T}}E(y) \\ &= (X^{\mathsf{T}}X + \lambda I)^{-1}X^{\mathsf{T}}X\theta \\ &= (X^{\mathsf{T}}X + \lambda I)^{-1}\left((X^{\mathsf{T}}X)^{-1}\right)^{-1}\theta \\ &= \left[(X^{\mathsf{T}}X)^{-1}(X^{\mathsf{T}}X + \lambda I)\right]^{-1}\theta \\ &= \left[I + \lambda(X^{\mathsf{T}}X)^{-1}\right]^{-1}\theta \end{aligned}$$
$X^{\mathsf{T}}X$ is positive definite and $\lambda > 0$ by definition, so $E(\hat\theta_R) \neq \theta$.
Tips:
Although ridge regression does not perform variable selection, it performs grouped selection. So if one variable amongst a group of correlated ones is selected, ridge regression automatically includes the whole group. Ridge regression can resolve near multicollinearity.
3.2.3 Elastic net
The elastic net penalty is a compromise between Lasso and ridge.
Cost function:
$$J_E(\theta) = \|y - X\theta\|^2 + \lambda_1\|\theta\|_1 + \lambda_2\|\theta\|_2^2$$
Elastic net estimate:
$$\hat\theta_E = \underset{\theta}{\arg\min}\left(\|y - X\theta\|^2 + \lambda_1\|\theta\|_1 + \lambda_2\|\theta\|_2^2\right)$$
There is no closed-form solution for $\hat\theta_E$, but $J_E(\theta)$ is convex, so it has a unique minimum.
4 Learning curve
Ways to increase prediction accuracy:
- variable selection: increase / reduce the number of covariates
- add polynomial features, e.g. $\{x^2, x_1 x_2\}$
- regularized regression (Lasso, ridge regression and elastic nets)
- collect more data

Learning curve:
- a learning curve plots a metric of prediction accuracy, like the cost function $J(\theta)$ or another error metric, as a function of a parameter that affects the metric
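A minimal learning-curve sketch in NumPy, using training-set size as the parameter that affects the error metric (the simulated data, sizes, and noise level are made up):

```python
import numpy as np

def mse(theta, X, y):
    return np.mean((y - X @ theta) ** 2)

# hypothetical simulated data split into a training and a validation part
rng = np.random.default_rng(1)
n_tr, n_val = 200, 100
X_all = np.column_stack([np.ones(n_tr + n_val), rng.normal(size=n_tr + n_val)])
y_all = X_all @ np.array([1.0, 2.0]) + rng.normal(0.0, 0.5, size=n_tr + n_val)
X_tr, y_tr = X_all[:n_tr], y_all[:n_tr]
X_val, y_val = X_all[n_tr:], y_all[n_tr:]

# the learning curve: training / validation MSE as a function of the
# training-set size m; typically the gap between them narrows as m grows
for m in (5, 20, 50, 200):
    theta = np.linalg.solve(X_tr[:m].T @ X_tr[:m], X_tr[:m].T @ y_tr[:m])
    print(m, mse(theta, X_tr[:m], y_tr[:m]), mse(theta, X_val, y_val))
```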