
Big Data Analytics Notes 2

Content

1 Linear Model
  1.1 Parameter Estimation
  1.2 Cost Function
  1.3 Normal Equation to Get $\hat{\theta}$
  1.4 Gradient Descent to Get $\hat{\theta}$
    1.4.1 Algorithm
    1.4.2 Learning Rate
    1.4.3 Stopping Criteria
2 Logistic Regression
3 Avoiding Overfitting
  3.1 Penalizing parameters
  3.2 Regression models with penalty functions
    3.2.1 Lasso regression
    3.2.2 Ridge regression
    3.2.3 Elastic net
4 Learning curve

1 Linear Model

$$y_i = \theta_1 x_{i1} + \theta_2 x_{i2} + \cdots + \theta_{n_p} x_{i n_p} + \varepsilon_i, \qquad i = 1, 2, \ldots, n_d,$$
or, in matrix form,
$$y = X\theta + \varepsilon,$$
where $\varepsilon$ is the noise vector and $\theta$ is the vector of unknown parameters.

- The linear model is parametric, with $n_p$ parameters.
- If an intercept $\theta_0$ is added, the design matrix $X$ gains a column of 1s.
- The dimension of the design matrix is $X:\ n_d \times n_p$, or $X:\ n_d \times (n_p + 1)$ when $\theta_0$ is added, with $n_d > n_p$.
- The traditional assumption on the noise $\varepsilon$ (i.i.d. noise): $\varepsilon_i \sim \mathcal{N}(0, \sigma^2),\ \ i = 1, 2, \ldots, n_d$.
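For concreteness, a minimal sketch (Python/NumPy; all names and sizes are illustrative) that simulates data from this model with an intercept column of 1s and i.i.d. Gaussian noise:

```python
import numpy as np

rng = np.random.default_rng(0)

n_d, n_p = 100, 3                               # n_d observations, n_p covariates
X = np.hstack([np.ones((n_d, 1)),               # column of 1s for the intercept theta_0
               rng.normal(size=(n_d, n_p))])
theta_true = np.array([1.0, 2.0, -0.5, 0.3])    # (theta_0, theta_1, ..., theta_{n_p})
sigma = 0.5

eps = rng.normal(0.0, sigma, size=n_d)          # epsilon_i ~ N(0, sigma^2), i.i.d.
y = X @ theta_true + eps                        # y = X theta + epsilon
```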

1.1 Parameter Estimation

Parameters of the linear model can be estimated via the cost function or via the maximum likelihood principle. The two approaches lead to the same optimal solution for $\theta$; the difference is that the cost function reduces the discrepancy between predictions and data, while maximum likelihood estimation (MLE) maximizes the likelihood of observing the data given the parameters.

1.2 Cost Function

The goal: reduce the discrepancy between predictions and data
$\Longrightarrow$ reduce the MSE of the predictions.

$$MSE(\hat{y}) = \frac{1}{n_d}\|y - \hat{y}\|^2 = \frac{1}{n_d}\sum_{i=1}^{n_d}(y_i - \hat{y}_i)^2 = \frac{1}{n_d}\|y - X\hat{\theta}\|^2 = \frac{1}{n_d}\sum_{i=1}^{n_d}(y_i - x_i^{\mathsf{T}}\hat{\theta})^2$$
Let $J(\theta)$ be the cost function. Find the estimate $\hat{\theta}$ of $\theta$ that minimizes $J(\theta)$, where
$$J(\theta) = \|y - X\theta\|^2 = \sum_{i=1}^{n_d}(y_i - x_i^{\mathsf{T}}\theta)^2,$$
so $\hat{\theta} = \underset{\theta}{\arg\min}\ J(\theta)$. This is the same as the ordinary least squares (OLS) estimator, because the OLS objective for $\theta$ is $\|\varepsilon\|^2 = \sum_{i=1}^{n_d}\varepsilon_i^2 = \sum_{i=1}^{n_d}(y_i - x_i^{\mathsf{T}}\theta)^2$, so the OLS estimator of $\theta$ is
$$\hat{\theta} = \underset{\theta}{\arg\min}\ \|\varepsilon(\theta)\|^2 = \underset{\theta}{\arg\min}\ \sum_{i=1}^{n_d}(y_i - x_i^{\mathsf{T}}\theta)^2.$$

Cost function:
$$J(\theta) = \sum_{i=1}^{n_d}(y_i - h_\theta(x_i))^2,$$
where $h_\theta(x)$ is called the hypothesis and $h_\theta(x_i) = x_i^{\mathsf{T}}\theta$ for linear models. So the estimator obtained by minimizing the cost function with $h_\theta(x_i) = x_i^{\mathsf{T}}\theta$ and the OLS estimator obtained by minimizing the squared noise norm $\|\varepsilon(\theta)\|^2$ coincide.
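A minimal NumPy sketch of this cost function (a purely illustrative helper, usable with the simulated `X` and `y` above):

```python
import numpy as np

def J(theta, X, y):
    """Sum-of-squares cost J(theta) = ||y - X theta||^2."""
    residual = y - X @ theta
    return residual @ residual

# e.g. J(np.zeros(X.shape[1]), X, y) evaluates the cost at theta = 0
```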

To get $\hat{\theta}$:

$$\left\{\begin{array}{ll} \text{gradient descent approximates } \hat{\theta}, & J(\theta) \text{ is convex} \\ \text{other numerical optimization schemes approximate } \hat{\theta}, & J(\theta) \text{ is not convex} \end{array}\right.$$

1.3 Normal Equation to Get $\hat{\theta}$

When $X^{\mathsf{T}}X$ is invertible,
$$\hat{\theta} = (X^{\mathsf{T}}X)^{-1}X^{\mathsf{T}}y$$
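A minimal NumPy sketch of the normal equation (data as in the earlier simulation; solving the linear system is preferable numerically to forming the inverse explicitly):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 3))])   # design matrix with intercept
y = X @ np.array([1.0, 2.0, -0.5, 0.3]) + rng.normal(0.0, 0.5, size=100)

# Normal equation: theta_hat = (X^T X)^{-1} X^T y,
# computed by solving (X^T X) theta = X^T y rather than inverting X^T X.
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# For reference, np.linalg.lstsq solves the same least-squares problem directly:
# theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```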
Gauss–Markov assumptions:

- $E(\varepsilon_i) = 0$
- $Var(\varepsilon_i) = \sigma^2$
- $Cov(\varepsilon_i, \varepsilon_j) = 0$ for $i \neq j$

Under these assumptions:

- $X$ should have full column rank (linearly independent columns), so that $X^{\mathsf{T}}X$ is non-singular.
- $E(\hat{\theta}) = \theta$
- $Var(\hat{\theta}) = \sigma^2(X^{\mathsf{T}}X)^{-1}$
- $\hat{\theta} \sim \mathcal{N}\left(\theta, \sigma^2(X^{\mathsf{T}}X)^{-1}\right)$ (under the normality assumption $\varepsilon_i \sim \mathcal{N}(0, \sigma^2),\ i = 1, 2, \ldots, n_d$)

The likelihood is
$$\mathcal{L}(y, X \mid \theta, \sigma^2) = (2\pi\sigma^2)^{-n_d/2}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n_d}(y_i - x_i^{\mathsf{T}}\theta)^2\right)$$
and the associated MLE is
$$\hat{\theta} = (X^{\mathsf{T}}X)^{-1}X^{\mathsf{T}}y,$$
which coincides with the OLS estimator for normal noise and equals $\underset{\theta}{\arg\min}\sum_{i=1}^{n_d}(y_i - x_i^{\mathsf{T}}\theta)^2$.
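A small Monte Carlo check of $E(\hat{\theta}) = \theta$ and $Var(\hat{\theta}) = \sigma^2(X^{\mathsf{T}}X)^{-1}$, on illustrative simulated data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 3))])
theta = np.array([1.0, 2.0, -0.5, 0.3])
sigma = 0.5

# Re-estimate theta on many simulated datasets sharing the same design matrix X.
estimates = []
for _ in range(2000):
    y = X @ theta + rng.normal(0.0, sigma, size=X.shape[0])
    estimates.append(np.linalg.solve(X.T @ X, X.T @ y))
estimates = np.array(estimates)

print(estimates.mean(axis=0))              # ~ theta (unbiasedness)
print(np.cov(estimates.T))                 # ~ sigma^2 (X^T X)^{-1}
print(sigma**2 * np.linalg.inv(X.T @ X))
```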

Limitations of the normal equation for large $n_p$:

- the computational cost of $(X^{\mathsf{T}}X)^{-1}$
- singularity: with large $n_p$, the columns of $X$ may become highly correlated

1.4 Gradient Descent to Get $\hat{\theta}$

The logic: start with $\theta = (\theta_0, \theta_1)$, then keep changing $(\theta_0, \theta_1)$ to reduce $J(\theta_0, \theta_1)$ until reaching $\underset{\theta_0, \theta_1}{\min} J(\theta_0, \theta_1)$.

1.4.1 Algorithm
1. Initialize $(\theta_1, \theta_2, \ldots, \theta_n)$.
2. Set $(\tilde{\theta}_1, \tilde{\theta}_2, \ldots, \tilde{\theta}_n) = (\theta_1, \theta_2, \ldots, \theta_n)$.
3. Update each $\theta_i$ with $\theta_i = \tilde{\theta}_i - a \cdot \frac{\partial}{\partial \theta_i} J(\tilde{\theta}_1, \tilde{\theta}_2, \ldots, \tilde{\theta}_n)$, where $a$ is the learning rate.
4. Repeat steps 2 and 3 until the $\theta_i$ converge.

A combined sketch of these steps, together with the learning rate and stopping criterion, is given after Section 1.4.3.
1.4.2 Learning Rate

The learning rate $a$ used in the algorithm should be a positive constant ($a > 0$).

The learning rate affects whether and how fast the algorithm converges:

- Too small a learning rate can lead to slow convergence, while too large a learning rate can lead to non-convergence or divergence.
- The learning rate can be selected using a validation set or cross-validation.

1.4.3 Stopping Criteria

Check that the error tolerance is close to 0:

- Absolute error tolerance:
$$\varepsilon_{abs} = \left|J(\theta^{(k+1)}) - J(\theta^{(k)})\right|$$
- Relative error tolerance:
$$\varepsilon_{rel} = \left|\frac{J(\theta^{(k+1)}) - J(\theta^{(k)})}{J(\theta^{(k+1)})}\right|$$
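A minimal sketch of batch gradient descent for the linear-model cost, using a constant learning rate and the absolute error tolerance above as the stopping check (Python/NumPy; all names, the learning rate, and the data are illustrative):

```python
import numpy as np

def gradient_descent(X, y, a=0.001, tol_abs=1e-8, max_iter=10_000):
    """Minimize J(theta) = ||y - X theta||^2 by batch gradient descent."""
    theta = np.zeros(X.shape[1])               # step 1: initialize
    J_old = np.inf
    for _ in range(max_iter):
        # Steps 2-3: the gradient is evaluated at the current theta (the "tilde" copy),
        # then all components are updated simultaneously with learning rate a.
        grad = -2.0 * X.T @ (y - X @ theta)    # gradient of ||y - X theta||^2
        theta = theta - a * grad
        J_new = np.sum((y - X @ theta) ** 2)
        if abs(J_new - J_old) < tol_abs:       # absolute error tolerance (Section 1.4.3)
            break
        J_old = J_new
    return theta

# Example on simulated data (same setup as the earlier sketches);
# a much larger learning rate would make this iteration diverge (Section 1.4.2).
rng = np.random.default_rng(0)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 3))])
y = X @ np.array([1.0, 2.0, -0.5, 0.3]) + rng.normal(0.0, 0.5, size=100)
print(gradient_descent(X, y))   # close to the normal-equation solution
```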

2 Logistic Regression

Output: $\{0, 1\}$
Decision boundary: $x_i^{\mathsf{T}}\theta = 0$
Hypothesis: $h_\theta(x_i) = g(x_i^{\mathsf{T}}\theta) = \frac{1}{1 + \exp(-x_i^{\mathsf{T}}\theta)}$

Cost function:
$$J(\theta_0, \theta_1) = \left\{\begin{array}{ll} -\log\left(h_{(\theta_0,\theta_1)}(x)\right), & \text{if } y = 1 \\ -\log\left(1 - h_{(\theta_0,\theta_1)}(x)\right), & \text{if } y = 0 \end{array}\right.$$

General form of the cost function for $n$ samples:
$$J(\theta) = -\frac{1}{n}\sum_{i=1}^{n}\left(y_i \log(h_\theta(x_i)) + (1 - y_i)\log(1 - h_\theta(x_i))\right)$$

Classification:

- $x_i^{\mathsf{T}}\theta \ge 0 \ \Longrightarrow\ h_\theta(x_i) \ge 0.5 \ \Longrightarrow\ \hat{y} = 1$
  - if $y = 1$, then $J(\theta) \rightarrow 0$
  - if $y = 0$, then $J(\theta) \rightarrow \infty$
- $x_i^{\mathsf{T}}\theta < 0 \ \Longrightarrow\ h_\theta(x_i) < 0.5 \ \Longrightarrow\ \hat{y} = 0$
  - if $y = 1$, then $J(\theta) \rightarrow \infty$
  - if $y = 0$, then $J(\theta) \rightarrow 0$
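A minimal sketch of the logistic hypothesis, the cross-entropy cost, and the 0.5-threshold classification rule (Python/NumPy; the helper names are illustrative):

```python
import numpy as np

def h(theta, X):
    """Logistic hypothesis h_theta(x) = 1 / (1 + exp(-x^T theta))."""
    return 1.0 / (1.0 + np.exp(-X @ theta))

def J(theta, X, y):
    """Cross-entropy cost: -(1/n) sum [ y log h + (1 - y) log(1 - h) ]."""
    p = h(theta, X)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def predict(theta, X):
    """y_hat = 1 when x^T theta >= 0, i.e. when h_theta(x) >= 0.5."""
    return (X @ theta >= 0).astype(int)
```

This cost is convex in $\theta$, so it can be minimized with gradient descent exactly as in Section 1.4.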

3 Avoiding Overfitting

- more overfitting $\Longrightarrow$ higher variance of the predictions
- more underfitting $\Longrightarrow$ higher bias of the predictions

3.1 Penalizing parameters

$$\sum_{i=1}^{n_d}(y_i - h_\theta(x_i))^2 + \lambda r(\theta),$$
where $\lambda r(\theta)$ is the penalty term, $r(\theta)$ is the parameter penalty function, and $\lambda$ is the regularization parameter.

The regularization parameter $\lambda$:

- as $\lambda \rightarrow 0$, the regularized regression estimates tend to the (unregularized) linear regression estimates
- as $\lambda \rightarrow \infty$, the parameters are penalized more heavily, so the model over-fits the data less
- $\lambda$ is selected via cross-validation (or a validation set), as in the sketch below
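A minimal sketch of selecting $\lambda$ by 5-fold cross-validation, using scikit-learn's grid search over ridge regression (introduced in Section 3.2.2) as the penalized model; scikit-learn names the regularization parameter `alpha`, and the data, grid, and fold count here are purely illustrative:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, 2.0, -0.5, 0.0, 0.0]) + rng.normal(0.0, 0.5, size=200)

# 5-fold cross-validation over a grid of regularization parameters.
search = GridSearchCV(Ridge(), {"alpha": np.logspace(-3, 3, 13)}, cv=5)
search.fit(X, y)
print(search.best_params_)   # the lambda (alpha) value with the best CV score
```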

Parameter penalty functions $r(\theta)$. Some typical penalty functions (compare with distance metrics):

- $d$-variation function: $r(\theta) = \left(\sum_{i=1}^{n_p}|\theta_i|^d\right)^{1/d}$
- $L_1$ norm: $r(\theta) = \sum_{i=1}^{n_p}|\theta_i|$
- squared $L_2$ norm: $r(\theta) = \sum_{i=1}^{n_p}|\theta_i|^2$
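These penalties translate directly into code; a minimal illustrative sketch:

```python
import numpy as np

def penalty_d(theta, d):
    """d-variation penalty: (sum |theta_i|^d)^(1/d)."""
    return np.sum(np.abs(theta) ** d) ** (1.0 / d)

def penalty_l1(theta):
    """L1 norm penalty: sum |theta_i|."""
    return np.sum(np.abs(theta))

def penalty_l2_squared(theta):
    """Squared L2 norm penalty: sum theta_i^2."""
    return np.sum(theta ** 2)
```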

3.2 Regression models with penalty functions

3.2.1 Lasso regression

Lasso regression uses the $L_1$ norm as the penalty function. Lasso regression "zeroes out" coefficients, so it performs variable selection and, to some extent, parameter shrinkage.

Cost function:
$$J_L(\theta) = \|y - X\theta\|^2 + \lambda\|\theta\|_1 = \sum_{i=1}^{n_d}(y_i - x_i^{\mathsf{T}}\theta)^2 + \lambda\sum_{j=1}^{n_p}|\theta_j|$$

Lasso estimate:
$$\hat{\theta}_L = \underset{\theta}{\arg\min}\left(\sum_{i=1}^{n_d}(y_i - x_i^{\mathsf{T}}\theta)^2 + \lambda\sum_{j=1}^{n_p}|\theta_j|\right)$$
There is no closed-form solution, but $J_L(\theta)$ is convex, so it can be solved numerically, e.g. with the least angle regression (LARS) algorithm.
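A minimal sketch using scikit-learn (assuming it is available): `Lasso` solves the problem by coordinate descent and `LassoLars` by the least angle regression algorithm mentioned above; note that scikit-learn's `alpha` scales the squared-error term by $1/(2 n_d)$, so it corresponds to $\lambda$ only up to that factor. The data here are illustrative:

```python
import numpy as np
from sklearn.linear_model import Lasso, LassoLars

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = X @ np.array([2.0, -1.0, 0.0, 0.0, 0.5, 0.0, 0.0, 0.0]) + rng.normal(0.0, 0.5, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)       # coordinate descent solver
lars = LassoLars(alpha=0.1).fit(X, y)    # least angle regression solver
print(lasso.coef_)   # several coefficients are exactly 0 ("zeroed out")
print(lars.coef_)
```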

3.2.2 Ridge regression

Ridge regression uses the squared $L_2$ norm as the penalty function, so it does not "zero out" coefficients, i.e. it cannot perform variable selection, but rather shrinks the parameter values.

Cost function:
$$J_R(\theta) = \|y - X\theta\|^2 + \lambda\|\theta\|_2^2 = \sum_{i=1}^{n_d}(y_i - x_i^{\mathsf{T}}\theta)^2 + \lambda\sum_{j=1}^{n_p}\theta_j^2$$

Ridge estimate:
$$\hat{\theta}_R = (X^{\mathsf{T}}X + \lambda I)^{-1}X^{\mathsf{T}}y$$
This is a closed-form solution.
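A minimal NumPy sketch of the closed-form ridge estimate (data and $\lambda$ are illustrative); as with the normal equation, the linear system is solved directly rather than forming the inverse:

```python
import numpy as np

def ridge_estimate(X, y, lam):
    """Closed-form ridge estimate: (X^T X + lambda I)^{-1} X^T y."""
    n_p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_p), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 0.0]) + rng.normal(0.0, 0.5, size=100)
print(ridge_estimate(X, y, lam=1.0))   # shrunk towards 0 relative to OLS, but not exactly 0
```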

The ridge estimate is biased because:

$$\begin{aligned} E(\hat{\theta}_R) &= E\left((X^{\mathsf{T}}X + \lambda I)^{-1}X^{\mathsf{T}}y\right) \\ &= (X^{\mathsf{T}}X + \lambda I)^{-1}X^{\mathsf{T}}E(y) \\ &= (X^{\mathsf{T}}X + \lambda I)^{-1}X^{\mathsf{T}}X\theta \\ &= (X^{\mathsf{T}}X + \lambda I)^{-1}\left((X^{\mathsf{T}}X)^{-1}\right)^{-1}\theta \\ &= \left[(X^{\mathsf{T}}X)^{-1}(X^{\mathsf{T}}X + \lambda I)\right]^{-1}\theta \\ &= \left[I + \lambda(X^{\mathsf{T}}X)^{-1}\right]^{-1}\theta \end{aligned}$$

$X^{\mathsf{T}}X$ is positive definite and $\lambda > 0$ by definition, so $E(\hat{\theta}_R) \neq \theta$.
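A small numerical check of the last identity, comparing $(X^{\mathsf{T}}X + \lambda I)^{-1}X^{\mathsf{T}}X\theta$ against $\left[I + \lambda(X^{\mathsf{T}}X)^{-1}\right]^{-1}\theta$ on illustrative data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
theta = np.array([1.0, -2.0, 0.5, 0.3])
lam = 2.0

A = X.T @ X
lhs = np.linalg.solve(A + lam * np.eye(4), A @ theta)             # (X^T X + lam I)^{-1} X^T X theta
rhs = np.linalg.solve(np.eye(4) + lam * np.linalg.inv(A), theta)  # [I + lam (X^T X)^{-1}]^{-1} theta
print(np.allclose(lhs, rhs))    # True: the two expressions agree
print(np.allclose(lhs, theta))  # False: the ridge expectation is biased away from theta
```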

Tips:
Although ridge regression does not perform variable selection, it performs grouped selection. So if one variable amongst a group of correlated ones is selected, ridge regression automatically includes the whole group. Ridge regression can resolve near multicollinearity.

3.2.3 Elastic net

The elastic net penalty is a compromise between Lasso and ridge.

Cost function:
$$J_E(\theta) = \|y - X\theta\|^2 + \lambda_1\|\theta\|_1 + \lambda_2\|\theta\|_2^2$$

Elastic net estimate:
$$\hat{\theta}_E = \underset{\theta}{\arg\min}\left(\|y - X\theta\|^2 + \lambda_1\|\theta\|_1 + \lambda_2\|\theta\|_2^2\right)$$

There is no closed-form solution for $\hat{\theta}_E$, but $J_E(\theta)$ is convex, so it has a unique minimum.
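A minimal sketch using scikit-learn's `ElasticNet` (assuming it is available); scikit-learn parametrizes the two penalties through an overall strength `alpha` and a mixing weight `l1_ratio` rather than separate $\lambda_1$ and $\lambda_2$, and the data here are illustrative:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = X @ np.array([2.0, -1.0, 0.0, 0.0, 0.5, 0.0, 0.0, 0.0]) + rng.normal(0.0, 0.5, size=200)

# alpha controls the overall penalty strength and l1_ratio the L1/L2 mix,
# so (alpha, l1_ratio) plays the role of (lambda_1, lambda_2) up to scaling.
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(enet.coef_)   # some coefficients zeroed (L1 part), others shrunk (L2 part)
```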

4 Learning curve

Ways to increase prediction accuracy:

- variable selection: increase / reduce the number of covariates
- add polynomial features, e.g. $\{x^2, x_1 x_2\}$
- regularized regression (Lasso, ridge regression and elastic nets)
- collect more data

Learning curve:

- A learning curve plots a metric of prediction accuracy, such as the cost function $J(\theta)$ or another error metric, as a function of a quantity that affects the metric (e.g. the training set size).
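A minimal sketch of a learning curve: training and validation MSE of OLS as a function of the training set size (Python/NumPy with matplotlib; the data and sizes are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = np.hstack([np.ones((300, 1)), rng.normal(size=(300, 5))])
y = X @ np.array([1.0, 2.0, -0.5, 0.3, 0.0, 0.0]) + rng.normal(0.0, 0.5, size=300)
X_val, y_val = X[200:], y[200:]            # held-out validation set

sizes = list(range(20, 201, 20))
train_mse, val_mse = [], []
for m in sizes:
    theta = np.linalg.lstsq(X[:m], y[:m], rcond=None)[0]   # OLS on the first m samples
    train_mse.append(np.mean((y[:m] - X[:m] @ theta) ** 2))
    val_mse.append(np.mean((y_val - X_val @ theta) ** 2))

plt.plot(sizes, train_mse, label="training MSE")
plt.plot(sizes, val_mse, label="validation MSE")
plt.xlabel("training set size"); plt.ylabel("MSE"); plt.legend(); plt.show()
```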
