Word2Vec Word Vectors

Contents

- Linear Regression
- Logistic Regression
- Neural Network
- References

Linear Regression

$$y = h_{\theta}(x_1, x_2, \ldots, x_n) + e$$

where $e$ is the modeling error, which the cost function measures. With a single feature, the hypothesis is

$$h_{\theta}(x) = \theta_0 + \theta_1 x = \begin{bmatrix} \theta_0 & \theta_1 \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \end{bmatrix}, \qquad x_0 = 1,\; x_1 = x$$

Linear regression looks for the straight line that best fits the data. The gap between the data and the line is called the modeling error, and it indicates how well the line fits. Because the modeling error is determined by the parameters $\theta_0$ and $\theta_1$, it can be written as a function of them, the cost function $J(\theta_0, \theta_1)$, where $m$ is the number of samples:

$$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_{\theta}(x^{(i)}) - y^{(i)} \right)^{2}$$

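import numpy as np

# Cost function J(θ) = 1/(2m) * Σ (h(x) - y)^2
def computeCost(x, y, coef):
    return np.power((x * coef.T) - y, 2).sum() / (2 * len(x))

# x: (m, 1) array of feature values, y: m target values (both assumed to be loaded already)
x = np.hstack((np.ones(len(x)).reshape((-1, 1)), x))   # prepend the bias column x_0 = 1
x = np.matrix(x)
y = np.matrix(y).reshape(-1, 1)                        # column vector, so x * coef.T - y does not broadcast
coef = np.matrix(np.array([0, 0]))                     # θ_0 = 0, θ_1 = 0

computeCost(x, y, coef)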

Optimization objective

$$\underset{\theta_0, \theta_1}{\text{minimize}} \; J(\theta_0, \theta_1)$$

Gradient descent

Repeat the following update for every $j$ simultaneously, where $\alpha$ is the learning rate:

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) = \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_{\theta}(x^{(i)}) - y^{(i)} \right) x^{(i)}_{j}$$

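## Gradient descent (x and y prepared as above; alpha and iters are example values, not from the original)
alpha, iters = 0.01, 1000
parameters = x.shape[1]                                    # number of θ_j
coef = np.matrix(np.zeros(parameters))                     # θ_0 = 0, θ_1 = 0
for _ in range(iters):
    temp = np.matrix(np.zeros(coef.shape))
    error = (x * coef.T) - y                               # h(x) - y, shape (m, 1)
    for j in range(parameters):
        term = np.multiply(error, x[:, j])                 # (h(x) - y) * x_j
        temp[0, j] = coef[0, j] - (alpha / len(x)) * np.sum(term)
    coef = temp                                            # update all θ_j simultaneously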
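
The per-parameter loop above can also be written in vectorized form. The following is a minimal sketch rather than the author's code: it assumes plain NumPy arrays instead of `np.matrix`, and the function name `gradient_descent`, the learning rate, the iteration count, and the noise-free synthetic data generated from y = 1 + 2x are all illustrative choices.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.5, iters=2000):
    """Batch gradient descent for linear regression.

    X: (m, n) design matrix whose first column is all ones (x_0 = 1).
    y: (m,) target vector.
    """
    m, n = X.shape
    theta = np.zeros(n)
    history = []
    for _ in range(iters):
        error = X @ theta - y                        # h_theta(x) - y for every sample
        history.append(error @ error / (2 * m))      # cost J before this update
        theta = theta - (alpha / m) * (X.T @ error)  # simultaneous update of all theta_j
    return theta, history

# Synthetic, noise-free data generated from y = 1 + 2x (made up for illustration)
x_raw = np.linspace(0.0, 1.0, 20)
X_demo = np.column_stack((np.ones_like(x_raw), x_raw))
y_demo = 1.0 + 2.0 * x_raw

theta_gd, cost_history = gradient_descent(X_demo, y_demo)
print(theta_gd)                             # approaches [1, 2]
print(cost_history[0], cost_history[-1])    # the cost shrinks toward 0
```

Recording the cost at every iteration is a cheap way to check the learning rate: if the recorded values ever increase, alpha is probably too large.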

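Normal equation

The normal equation gives the minimizer of the cost function in closed form, $\theta = (X^{T}X)^{-1}X^{T}y$.

# Normal equation: theta = (X^T X)^{-1} X^T y
import numpy as np

x = np.c_[np.ones(len(x)), x]      # x is the raw feature array here; prepend the bias column x_0 = 1
y = np.c_[y]                       # make y a column vector
theta = np.linalg.inv(x.T.dot(x)).dot(x.T.dot(y))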
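
As a sanity check on the closed-form solution, the sketch below solves the same linear system with `np.linalg.solve` (which avoids forming the inverse explicitly) and compares the result with NumPy's least-squares solver. The helper name `normal_equation` and the synthetic data drawn from y = 3 - 0.5x plus noise are assumptions made for illustration.

```python
import numpy as np

def normal_equation(X, y):
    """Solve (X^T X) theta = X^T y without forming an explicit inverse."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Synthetic data from y = 3 - 0.5x plus small Gaussian noise (made up for illustration)
rng = np.random.default_rng(0)
x_raw = rng.uniform(0.0, 10.0, size=50)
X_ne = np.column_stack((np.ones_like(x_raw), x_raw))   # bias column x_0 = 1
y_ne = 3.0 - 0.5 * x_raw + rng.normal(0.0, 0.1, size=50)

theta_ne = normal_equation(X_ne, y_ne)
theta_lstsq, *_ = np.linalg.lstsq(X_ne, y_ne, rcond=None)
print(theta_ne)       # close to [3, -0.5]
print(theta_lstsq)    # should agree with theta_ne
```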

Logistic Regression

Binary classification tasks are common in everyday life: detecting whether an email carries a virus, deciding whether a message is a scam, or determining whether a tumor is benign or malignant. Take tumor classification as the example: the tumor size is $x$ and the tumor type is $y$, where $0$ means benign and $1$ means malignant. With the linear regression hypothesis $h_{\theta}(\mathbf{x}) = \theta^{T}\mathbf{x}$ and a threshold of $0.5$, the decision rule is

$$y = \begin{cases} 1 & h_{\theta}(\mathbf{x}) \ge 0.5 \\ 0 & h_{\theta}(\mathbf{x}) < 0.5 \end{cases}$$

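To classify with linear regression, we compare $h_{\theta}(\mathbf{x})$ with the threshold $0.5$ and output $y = 0$ or $y = 1$. This sometimes works poorly: in the tumor example, points on the right-hand side end up classified as benign.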
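
One common way this failure shows up can be reproduced with a few numbers. The sketch below is an illustration under stated assumptions, not the author's experiment: the tumor sizes and labels are invented, and `fit_line` and `classify` are hypothetical helper names. A single very large malignant tumor flattens the fitted line, which pushes the 0.5 crossing to the right and flips a malignant sample to benign.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares fit of h(x) = theta_0 + theta_1 * x."""
    X = np.column_stack((np.ones_like(x), x))
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta

def classify(theta, x, threshold=0.5):
    """Predict 1 (malignant) when h(x) >= threshold, else 0 (benign)."""
    return (theta[0] + theta[1] * x >= threshold).astype(int)

# Made-up tumor sizes: 1 to 4 are benign (0), 5 to 8 are malignant (1)
sizes = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)

theta_clean = fit_line(sizes, labels)
pred_clean = classify(theta_clean, sizes)

# Add one very large malignant tumor and refit
sizes_out = np.append(sizes, 40.0)
labels_out = np.append(labels, 1.0)
theta_out = fit_line(sizes_out, labels_out)
pred_out = classify(theta_out, sizes)

print(pred_clean)   # matches the labels
print(pred_out)     # the size-5 tumor is now predicted benign
```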

In addition, linear regression can output any real number, so $h_{\theta}(\mathbf{x}) < 0$ and $h_{\theta}(\mathbf{x}) > 1$ both occur, although all we need is to compare $h_{\theta}(\mathbf{x})$ with $0.5$. We can therefore use the sigmoid function

$$\delta(z) = \frac{1}{1 + e^{-z}}$$

to map the linear regression output into the interval $(0, 1)$:

$$\delta(\theta^{T}\mathbf{x}) = \frac{1}{1 + e^{-\theta^{T}\mathbf{x}}}$$

To keep the notation uniform, we continue to write the hypothesis as $h_{\theta}(\mathbf{x})$:

$$h_{\theta}(\mathbf{x}) = \frac{1}{1 + e^{-\theta^{T}\mathbf{x}}}$$
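
In code, the direct formula overflows inside `exp` for large negative inputs. The sketch below shows one common numerically stable way to write the sigmoid and the resulting hypothesis; the function names `sigmoid` and `hypothesis` are illustrative, and the stability trick is an addition that is not discussed in the text above.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), written so that exp never overflows."""
    z = np.asarray(z, dtype=float)
    e = np.exp(-np.abs(z))                 # always in (0, 1], so no overflow
    # z >= 0: 1 / (1 + exp(-z));  z < 0: exp(z) / (1 + exp(z))
    return np.where(z >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

def hypothesis(theta, X):
    """h_theta(x) = sigmoid(theta^T x) for every row of X."""
    return sigmoid(X @ theta)

print(sigmoid([-1000.0, 0.0, 1000.0]))     # [0.  0.5 1. ] without overflow warnings
```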

To judge how well the hypothesis classifies, we measure it with a cost function $J(\theta)$:

$$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_{\theta}(x^{(i)}) - y^{(i)}\right)^{2} = \frac{1}{2m}\lVert h_{\theta}(\mathbf{x}) - \mathbf{y}\rVert^{2}$$

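Plotting this cost for $h_{\theta}(\mathbf{x}) = \theta^{T}\mathbf{x}$ and for $h_{\theta}(\mathbf{x}) = 1/(1+e^{-\theta^{T}\mathbf{x}})$ gives two different cost curves.

With the linear hypothesis, the squared-error cost is convex, so gradient descent reaches the global optimum. With the nonlinear logistic hypothesis, the same cost is non-convex, so gradient descent is not guaranteed to reach the global optimum. We therefore need a convex cost function for logistic regression.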

Return to the tumor classification problem. In essence, given a tumor size $x$, we want the probability that the tumor is malignant, that is, the posterior probability $p(y=1 \mid x)$. By Bayes' theorem,

$$p(y=1 \mid x) = \frac{p(y=1, x)}{p(x)} = \frac{p(x \mid y=1)\,p(y=1)}{p(x \mid y=1)\,p(y=1) + p(x \mid y=0)\,p(y=0)} = \frac{1}{1 + \frac{p(x \mid y=0)\,p(y=0)}{p(x \mid y=1)\,p(y=1)}}$$

Let $z = \ln \frac{p(x \mid y=1)\,p(y=1)}{p(x \mid y=0)\,p(y=0)}$. Then

$$p(y=1 \mid x) = \frac{1}{1 + e^{-z}} = \delta(z)$$
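
The identity can be checked numerically. The sketch below assumes Gaussian class-conditional densities with invented means, a shared standard deviation, and invented priors. None of these values come from the text; they only serve to compare the posterior computed directly from Bayes' theorem with the sigmoid of the log-odds $z$.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution N(mean, std^2) at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical priors and class-conditional densities (illustrative values only)
PRIOR_MALIGNANT, PRIOR_BENIGN = 0.3, 0.7

def likelihood_malignant(x):
    return gaussian_pdf(x, mean=6.0, std=1.5)

def likelihood_benign(x):
    return gaussian_pdf(x, mean=3.0, std=1.5)

def posterior_bayes(x):
    """p(y=1 | x) computed directly from Bayes' theorem."""
    numerator = likelihood_malignant(x) * PRIOR_MALIGNANT
    return numerator / (numerator + likelihood_benign(x) * PRIOR_BENIGN)

def posterior_sigmoid(x):
    """p(y=1 | x) computed as the sigmoid of the log-odds z."""
    z = math.log(likelihood_malignant(x) * PRIOR_MALIGNANT
                 / (likelihood_benign(x) * PRIOR_BENIGN))
    return sigmoid(z)

for size in (2.0, 4.5, 7.0):
    print(size, posterior_bayes(size), posterior_sigmoid(size))   # the two columns agree
```

With equal-variance Gaussians the log-odds $z$ is a linear function of $x$, which is one setting in which modeling $z$ as $\theta^{T}\mathbf{x}$ is exact.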

So using the sigmoid function to map the regression output into $(0, 1)$ for classification is no accident: the posterior probability of a binary classification problem is itself a sigmoid of the log-odds $z$. Modeling $z$ with the linear function $\theta^{T}\mathbf{x}$ gives

$$p(y=1 \mid x; \theta) = \frac{1}{1 + e^{-\theta^{T}\mathbf{x}}} = h_{\theta}(\mathbf{x}), \qquad p(y=0 \mid x; \theta) = 1 - h_{\theta}(\mathbf{x})$$

The probability mass function of the Bernoulli (0-1) distribution combines the two cases into one expression:

$$p(y \mid x; \theta) = h_{\theta}(\mathbf{x})^{y}\,\left(1 - h_{\theta}(\mathbf{x})\right)^{1-y}$$

Maximum likelihood estimation uses the observed samples to infer the parameter values that are most likely (have the highest probability) to have produced them. The likelihood function is

$$L(\theta) = \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}; \theta) = \prod_{i=1}^{m} \left(h_{\theta}(\mathbf{x}^{(i)})\right)^{y^{(i)}} \left(1 - h_{\theta}(\mathbf{x}^{(i)})\right)^{1-y^{(i)}}$$

Taking the logarithm of both sides simplifies the computation:

$$l(\theta) = \ln L(\theta) = \sum_{i=1}^{m} \left[ y^{(i)} \ln h_{\theta}(\mathbf{x}^{(i)}) + (1 - y^{(i)}) \ln\left(1 - h_{\theta}(\mathbf{x}^{(i)})\right) \right]$$

Maximum likelihood estimation looks for the parameters that fit the data as well as possible, while minimizing a cost function looks for the parameters with the smallest error between model and data. The two goals are the same and only the sign differs, so the cost function to minimize is

$$J(\theta) = -\frac{1}{m} l(\theta) = -\frac{1}{m} \ln L(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \ln h_{\theta}(\mathbf{x}^{(i)}) + (1 - y^{(i)}) \ln\left(1 - h_{\theta}(\mathbf{x}^{(i)})\right) \right]$$
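
This cost function translates almost line for line into NumPy. The sketch below is an illustration under assumptions: the function name `cross_entropy_cost`, the clipping constant `eps`, and the four-sample data set are invented, and the clipping is an added safeguard that is not part of the formula.

```python
import numpy as np

def cross_entropy_cost(theta, X, y, eps=1e-12):
    """J(theta) = -(1/m) * sum(y*ln(h) + (1-y)*ln(1-h)) with h = sigmoid(X @ theta)."""
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    h = np.clip(h, eps, 1.0 - eps)          # keep ln() finite when h saturates at 0 or 1
    return -np.mean(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))

# Tiny made-up data set: bias column plus one feature
X_ce = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]])
y_ce = np.array([0.0, 0.0, 1.0, 1.0])

cost_zero = cross_entropy_cost(np.zeros(2), X_ce, y_ce)
cost_separating = cross_entropy_cost(np.array([-4.5, 2.0]), X_ce, y_ce)
print(cost_zero)          # ln 2, because h = 0.5 for every sample when theta = 0
print(cost_separating)    # smaller, because this theta separates the two classes
```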

Derivative of the sigmoid function

$$\delta'(x) = \frac{e^{-x}}{(1 + e^{-x})^{2}} = \delta(x)\,[1 - \delta(x)]$$

It follows that the derivatives of $\ln \delta(x)$ and $\ln(1 - \delta(x))$ are

$$[\ln \delta(x)]' = \frac{\delta'(x)}{\delta(x)} = 1 - \delta(x), \qquad [\ln(1 - \delta(x))]' = \frac{[1 - \delta(x)]'}{1 - \delta(x)} = -\delta(x)$$

This cost function is convex, so gradient descent can find the optimum. Differentiating it and substituting the sigmoid derivative formulas gives

$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left(h_{\theta}(\mathbf{x}^{(i)}) - y^{(i)}\right) x^{(i)}_{j}$$
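
A quick way to gain confidence in this derivative is to compare it with finite differences of the cost. The sketch below does that on a small invented data set. It uses the gradient with the $1/m$ factor that follows from the definition of $J(\theta)$, and all function and variable names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """Cross-entropy cost J(theta)."""
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))

def analytic_gradient(theta, X, y):
    """dJ/dtheta_j = (1/m) * sum_i (h(x_i) - y_i) * x_ij."""
    return X.T @ (sigmoid(X @ theta) - y) / len(y)

def numeric_gradient(theta, X, y, step=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        bump = np.zeros_like(theta)
        bump[j] = step
        grad[j] = (cost(theta + bump, X, y) - cost(theta - bump, X, y)) / (2 * step)
    return grad

# Small made-up data set and an arbitrary theta
X_gc = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]])
y_gc = np.array([0.0, 1.0, 0.0, 1.0])
theta_gc = np.array([0.2, -0.3])

grad_analytic = analytic_gradient(theta_gc, X_gc, y_gc)
grad_numeric = numeric_gradient(theta_gc, X_gc, y_gc)
print(grad_analytic)
print(grad_numeric)    # should match the analytic gradient to several decimal places
```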

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \ln h_{\theta}(\mathbf{x}^{(i)}) + (1 - y^{(i)}) \ln\left(1 - h_{\theta}(\mathbf{x}^{(i)})\right) \right] \tag{1}$$

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_{\theta}(x^{(i)}) - y^{(i)} \right)^{2} \tag{2}$$

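Equation (1) is the cross-entropy loss function and equation (2) is the squared error loss function. For the classification hypothesis, the leftmost plot shows the loss surfaces, and the plots to its right show cross-sections of the two losses along one particular direction. The cross-sections give two curves. The cross-entropy curve has large gradients on both sides, so gradient descent reaches the optimum quickly. The squared error curve is flat on both sides, so gradient descent makes little progress there and is poorly suited to finding the optimum. Different hypotheses therefore call for different loss functions.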
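
The difference in slope can be seen for a single sample by differentiating each loss with respect to $z = \theta^{T}\mathbf{x}$. The sketch below is a simplified one-sample illustration with hypothetical function names. It evaluates both gradients for a positive example ($y = 1$) that the model scores increasingly wrongly.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_cross_entropy(z, y):
    """d/dz of -[y*ln(h) + (1-y)*ln(1-h)] with h = sigmoid(z)."""
    return sigmoid(z) - y

def grad_squared_error(z, y):
    """d/dz of 0.5*(h - y)^2 with h = sigmoid(z)."""
    h = sigmoid(z)
    return (h - y) * h * (1.0 - h)

# A positive example (y = 1) scored at increasingly wrong values of z
for z in (0.0, -2.0, -6.0):
    print(z, grad_cross_entropy(z, 1), grad_squared_error(z, 1))
```

As $z$ becomes more negative, the cross-entropy gradient approaches $-1$ while the squared error gradient shrinks toward $0$, because the extra factor $h(1-h)$ vanishes when the sigmoid saturates.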

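# Gradient descent (x and y prepared as above; alpha and iters are example values, not from the original)
alpha, iters = 0.01, 1000
parameters = x.shape[1]                                        # number of θ_j
coef = np.matrix(np.zeros(parameters))                         # θ_0 = 0, θ_1 = 0
for _ in range(iters):
    temp = np.matrix(np.zeros(coef.shape))
    error = 1 / (1 + np.exp(-(x * coef.T))) - y                # h(x) - y with h = sigmoid(x θ^T)
    for j in range(parameters):
        term = np.multiply(error, x[:, j])                     # (h(x) - y) * x_j
        temp[0, j] = coef[0, j] - (alpha / len(x)) * np.sum(term)   # 1/m factor from J(θ)
    coef = temp                                                # update all θ_j simultaneously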
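
For completeness, here is a self-contained, vectorized version of the same training loop. It is a minimal sketch under stated assumptions: the data are synthetic (two overlapping Gaussian clusters standing in for benign and malignant sizes), the feature is standardized before training to speed up convergence, and the names `train_logistic` and `predict` as well as the hyperparameters are illustrative rather than the author's.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, alpha=0.1, iters=5000):
    """Batch gradient descent on the cross-entropy cost."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        gradient = X.T @ (sigmoid(X @ theta) - y) / len(y)
        theta -= alpha * gradient
    return theta

def predict(theta, X, threshold=0.5):
    """Predict 1 when h_theta(x) >= threshold, else 0."""
    return (sigmoid(X @ theta) >= threshold).astype(int)

# Made-up, overlapping 1-D data: benign sizes around 3, malignant sizes around 6
rng = np.random.default_rng(42)
sizes = np.concatenate((rng.normal(3.0, 1.0, size=100), rng.normal(6.0, 1.0, size=100)))
y_lr = np.concatenate((np.zeros(100), np.ones(100)))

# Standardize the feature (an added step, not in the original code)
sizes_std = (sizes - sizes.mean()) / sizes.std()
X_lr = np.column_stack((np.ones_like(sizes_std), sizes_std))

theta_lr = train_logistic(X_lr, y_lr)
accuracy = np.mean(predict(theta_lr, X_lr) == y_lr)
print(theta_lr, accuracy)
```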

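With only one feature, the mean diameter, classification accuracy is poor: most tumors are misclassified as benign. We can try to improve accuracy by adding more features.

With 30 features, accuracy improves greatly, but the running time jumps from 0.46 s to 6.01 s. As the number of features grows, the matrix of feature interaction terms grows exponentially, so we need a new algorithm that balances the number of features against running time.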

Neural Network

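The right half of the network simply applies logistic regression to $a_0, a_1, a_2, a_3$ to output $h(x)$.

In fact, a neural network is much like logistic regression, except that the input vector $[x_1, x_2, x_3]$ of logistic regression is replaced by the hidden-layer activations $[a^{(2)}_{1}, a^{(2)}_{2}, a^{(2)}_{3}]$, that is,

$$h(x) = g\left(\theta^{(2)}_{0} a^{(2)}_{0} + \theta^{(2)}_{1} a^{(2)}_{1} + \theta^{(2)}_{2} a^{(2)}_{2} + \theta^{(2)}_{3} a^{(2)}_{3}\right)$$

where $g$ is the sigmoid function.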

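We can think of $a_0, a_1, a_2, a_3$ as higher-level features, evolved versions of $x_0, x_1, x_2, x_3$. They are determined by $x$ and $\theta$, and because $\theta$ is updated by gradient descent, the $a$ values change during training and become increasingly expressive. These learned features are far more powerful than simply raising $x$ to higher powers, and they also predict new data better.

This is the advantage of neural networks over logistic regression and linear regression.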
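
The idea that the output layer is logistic regression on learned features can be made concrete with a forward pass. The sketch below assumes a network with three inputs, three hidden units, and one output, with bias units $x_0 = 1$ and $a_0 = 1$. The weight shapes, the random weights, and the function name `forward` are assumptions chosen for illustration, not details given in the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, theta1, theta2):
    """Forward pass of a 3-input, 3-hidden-unit, 1-output network.

    theta1: (3, 4) weights from [x_0, x_1, x_2, x_3] to the hidden layer.
    theta2: (4,)   weights from [a_0, a_1, a_2, a_3] to the output.
    """
    x_with_bias = np.concatenate(([1.0], x))        # x_0 = 1
    hidden = sigmoid(theta1 @ x_with_bias)          # a_1, a_2, a_3 of layer 2
    a_with_bias = np.concatenate(([1.0], hidden))   # a_0 = 1
    output = sigmoid(theta2 @ a_with_bias)          # logistic regression on the a values
    return output, hidden

# Arbitrary weights and input, chosen only to exercise the code
rng = np.random.default_rng(0)
theta1_demo = rng.normal(size=(3, 4))
theta2_demo = rng.normal(size=4)
x_demo = np.array([0.5, -1.0, 2.0])

output, hidden_units = forward(x_demo, theta1_demo, theta2_demo)
print(hidden_units, output)
```

The last line of `forward` has exactly the form of the logistic regression hypothesis, with the hidden activations in place of the raw inputs.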

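References

Andrew Ng, Machine Learning course series
Hung-yi Lee, Machine Learning and Deep Learning 2020 (complete edition, Mandarin)
The essence of logistic regression: maximum likelihood estimation
1. What is LR (logistic regression)? What are its formula and derivative? Why can the sigmoid function be treated as a probability?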
