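Reading Notes: "Deep Learning for Computer Vision with Python", Volume 1

Volume 1, Chapter 9: Optimization Methods and Regularization

Nearly all of deep learning is powered by one very important algorithm: stochastic gradient descent (SGD).

At this point we have a solid understanding of parameterized learning. Over the past few chapters, we discussed the concept of parameterized learning and how this type of learning lets us define a scoring function that maps input data to output class labels.

This scoring function is defined in terms of two important parameters: the weight matrix W and the bias vector b. The scoring function takes these parameters as inputs and returns a prediction for each input data point x_i.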
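As a rough illustration, here is a minimal sketch of such a scoring function. It assumes the linear form f(x_i, W, b) = W x_i + b; the shapes, the random values, and the names `score` and `predicted_class` are hypothetical and chosen only for this example.

```python
import numpy as np

def score(x, W, b):
    # linear scoring function: returns one score per class for input x
    return W.dot(x) + b

# hypothetical toy setup: 3 classes, 4-dimensional input
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))   # weight matrix, one row per class
b = rng.standard_normal(3)        # bias vector, one entry per class
x_i = rng.standard_normal(4)      # a single input data point

scores = score(x_i, W, b)
predicted_class = int(np.argmax(scores))  # class with the highest score
print(scores.shape, predicted_class)
```

The prediction is simply the class whose score is largest, so changing W or b changes which class wins.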

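We also discussed two common loss functions: multi-class SVM loss and cross-entropy loss. At the most basic level, a loss function quantifies how "good" or "bad" a given predictor (that is, a set of parameters) is at classifying the input data points in our dataset.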
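The two losses can be sketched for a single data point as follows. This is an illustrative sketch using the standard textbook definitions, not code from the book; the score values and the `margin` default are assumptions.

```python
import numpy as np

def multiclass_svm_loss(scores, y, margin=1.0):
    # hinge loss: penalize every wrong class whose score comes within
    # `margin` of the correct class score
    margins = np.maximum(0.0, scores - scores[y] + margin)
    margins[y] = 0.0  # the correct class contributes no loss
    return float(margins.sum())

def cross_entropy_loss(scores, y):
    # softmax over the scores (shifted by the max for numerical
    # stability), then the negative log-probability of the true class
    shifted = scores - np.max(scores)
    probs = np.exp(shifted) / np.sum(np.exp(shifted))
    return float(-np.log(probs[y]))

scores = np.array([2.0, 1.0, 0.1])  # hypothetical class scores
for y in (0, 2):
    print(y, multiclass_svm_loss(scores, y), cross_entropy_loss(scores, y))
```

With these scores, both losses are small when the true class is 0 (the highest-scoring class) and much larger when the true class is 2.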

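Given these building blocks, we can now turn to the most important aspect of machine learning, neural networks, and deep learning: optimization. Optimization algorithms are the engines that power neural networks and enable them to learn patterns from data. Throughout this discussion, we have learned that obtaining a high-accuracy classifier depends on finding a set of weights W and b such that our data points are correctly classified.

But how do we go about finding a weight matrix W and bias vector b that yield high classification accuracy? Do we randomly initialize them, evaluate, and repeat over and over, hoping that at some point we land on a set of parameters that gives reasonable classification? We could, but given that modern deep learning networks have parameters numbering in the tens of millions, it may take a very long time to blindly stumble upon a reasonable set of parameters.

Instead of relying on pure randomness, we need to define an optimization algorithm that allows us to literally improve W and b. In this chapter, we look at the most common algorithm used to train neural networks and deep learning models: gradient descent. Gradient descent has many variants (which we will also touch on), but in each case the idea is the same: iteratively evaluate your parameters, compute your loss, then take a small step in the direction that minimizes that loss.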

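1. Gradient Descent

The gradient descent algorithm comes in two main flavors:

1. The standard "vanilla" implementation (batch gradient descent).

2. The optimized "stochastic" version, which is more commonly used.

In this section, we review the basic vanilla implementation to form a baseline for our understanding. Once we understand the basics of gradient descent, we move on to the stochastic version. We then review some of the extensions to gradient descent, including momentum and Nesterov acceleration.

1.1 The Loss Landscape and Optimization Surface

Gradient descent is an iterative optimization algorithm that operates over a loss landscape (also called an optimization surface). The canonical gradient descent example visualizes our weights along the x-axis and the loss for a given set of weights along the y-axis.

Given a set of parameters W (the weight matrix) and b (the bias vector), each position along the surface of the bowl corresponds to a particular loss value. Our goal is to try different values of W and b, evaluate their loss, and then take a step toward more optimal values that (ideally) have lower loss.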

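1.2 The "Gradient" in Gradient Descent

Left: our robot, Chad. Right: Chad's job is to navigate our loss landscape and descend to the bottom of the basin. Unfortunately, the only sensor Chad can use to control his navigation is a special function called the loss function L. This function must guide him to an area of lower loss.

Chad's job is to navigate to the bottom of the basin (where the loss is lowest). Seems easy, right? All Chad has to do is orient himself so that he faces "downhill" and ride the slope until he reaches the bottom of the bowl.

But here is the problem: Chad is not a very smart robot. Chad has only one sensor, which lets him take his parameters W and b and compute a loss function L. Chad can therefore compute his relative position on the loss landscape, but he has absolutely no idea in which direction he should take a step to move closer to the bottom of the basin.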

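What is Chad to do? The answer is to apply gradient descent. All Chad needs to do is follow the slope of the gradient with respect to W. We can compute the gradient across all dimensions using the following equation:

$$\frac{df(x)}{dx} = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$

In more than one dimension, our gradient becomes a vector of partial derivatives. The problem with this equation is that:

1. It is only an approximation of the gradient.

2. It is painfully slow.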
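Both drawbacks show up in a small numerical-gradient sketch. The code below is illustrative only: the helper `numerical_gradient`, the step `h`, and the toy loss L(W) = sum(W^2) are assumptions made for this example. The result is approximate because `h` is finite, and it is slow because it needs one extra loss evaluation per parameter.

```python
import numpy as np

def numerical_gradient(f, W, h=1e-5):
    # approximate dL/dW one dimension at a time using
    # (f(W + h) - f(W)) / h
    grad = np.zeros_like(W)
    base = f(W)
    for idx in np.ndindex(W.shape):
        old = W[idx]
        W[idx] = old + h          # nudge a single parameter
        grad[idx] = (f(W) - base) / h
        W[idx] = old              # restore it before moving on
    return grad

# hypothetical loss with a known gradient: L(W) = sum(W^2), dL/dW = 2W
loss = lambda W: np.sum(W ** 2)
W = np.array([[1.0, -2.0], [0.5, 3.0]])
approx = numerical_gradient(loss, W)
print(np.max(np.abs(approx - 2 * W)))  # small, but not exactly zero
```

For a network with millions of parameters, this loop would require millions of full loss evaluations per update, which is why practical implementations use the analytic gradient instead.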

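1.3 Treat It Like a Convex Problem (Even if It Isn't)

Using the bowl in the figure as a visualization of the loss landscape also lets us draw an important conclusion about modern neural networks: we treat the loss landscape as a convex problem, even when it is not. If a function F is convex, then all local minima are also global minima. This idea fits the bowl visualization nicely. Our optimization algorithm simply has to strap on a pair of skis at the top of the bowl and slowly ride down the gradient until it reaches the bottom.

The problem is that nearly all problems we apply neural networks and deep learning algorithms to are not neat convex functions. Instead, inside this bowl we find spike-like peaks, valleys that look more like canyons, steep drop-offs, and even slots where the loss drops dramatically only to rise sharply again.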
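To make the consequence of non-convexity concrete, here is a small sketch that runs plain gradient descent on a one-dimensional non-convex function from two different starting points. The function `f`, the learning rate, and the step count are hypothetical choices for illustration.

```python
def f(x):
    # hypothetical non-convex 1-D "loss" with two valleys
    return x ** 4 - 3 * x ** 2 + x

def df(x):
    # analytic derivative of f
    return 4 * x ** 3 - 6 * x + 1

def descend(x, alpha=0.01, steps=500):
    # plain gradient descent: repeatedly step against the slope
    for _ in range(steps):
        x -= alpha * df(x)
    return x

left = descend(-2.0)    # starts on the left side of the landscape
right = descend(2.0)    # starts on the right side of the landscape
print(left, f(left))
print(right, f(right))
```

The two runs settle in different valleys, and the one reached from the right has a higher loss than the one reached from the left. On a convex bowl, both starting points would have ended at the same minimum.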

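Given the non-convex nature of our loss landscapes, why do we apply gradient descent at all? The answer is simple: because it does a good enough job. To quote Goodfellow et al. [10]:

"[An] optimization algorithm may not be guaranteed to arrive at even a local minimum in a reasonable amount of time, but it often finds a very low value of the [loss] function quickly enough to be useful."

When training a deep learning network, we can set high expectations of finding a local or global minimum, but this expectation rarely matches reality. Instead, we end up finding a region of low loss. This region may not even be a local minimum, but in practice it turns out to be good enough.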

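1.4 The Bias Trick

Left: normally we treat the weight matrix and the bias vector as two separate parameters. Right: however, we can actually embed the bias vector into the weight matrix, making it a trainable parameter directly inside the weight matrix by initializing the weight matrix with an extra column.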
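The equivalence behind the bias trick can be checked numerically. This sketch follows the row-vector convention used by the scripts in these notes (a column of ones appended to X, so the bias becomes an extra row of W rather than an extra column); the random values and the names `X_aug` and `W_aug` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.standard_normal((5, 2))   # 5 data points, 2 features each
W = rng.standard_normal((2, 1))   # weights
b = rng.standard_normal((1, 1))   # bias

# separate parameters: scores = XW + b
separate = X.dot(W) + b

# bias trick: append a column of ones to X and stack b under W
X_aug = np.c_[X, np.ones((X.shape[0]))]
W_aug = np.vstack([W, b])
combined = X_aug.dot(W_aug)

print(np.allclose(separate, combined))
```

Because both forms produce the same scores, a single gradient update on the augmented matrix trains the weights and the bias together.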

1.5 Pseudocode for Gradient Descent
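The core loop of gradient descent can be sketched in runnable form as follows. This is a minimal sketch under stated assumptions, not the book's listing: the helper `evaluate_gradient`, the mean-squared-error loss on a linear model, the toy data, and the values of `alpha` and the epoch count are all hypothetical.

```python
import numpy as np

def evaluate_gradient(X, y, W):
    # hypothetical helper: gradient of the mean squared error of a
    # linear model X.dot(W) with respect to W
    error = X.dot(W) - y
    return 2.0 * X.T.dot(error) / X.shape[0]

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 3))
true_W = np.array([[1.0], [-2.0], [0.5]])
y = X.dot(true_W)                    # noiseless toy targets

W = rng.standard_normal((3, 1))      # random initialization
alpha = 0.1                          # learning rate (step size)
for epoch in range(200):
    Wgradient = evaluate_gradient(X, y, W)
    W += -alpha * Wgradient          # step against the gradient

print(np.round(W.ravel(), 3))
```

Each iteration evaluates the gradient on the whole dataset and moves W a small step in the opposite direction; the learning rate `alpha` controls the size of that step.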

1.6 Implementing Basic Gradient Descent in Python

gradient_descent.py, code as follows:

# import the necessary packages
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
import numpy as np
import argparse
def sigmoid_activation(x):
	# compute the sigmoid activation value for a given input
	return 1.0 / (1 + np.exp(-x))
def sigmoid_deriv(x):
	# compute the derivative of the sigmoid function ASSUMING
	# that the input `x` has already been passed through the sigmoid
	# activation function
	return x * (1 - x)
def predict(X, W):
	# take the dot product between our features and weight matrix
	preds = sigmoid_activation(X.dot(W))
	# apply a step function to threshold the outputs to binary
	# class labels
	preds[preds <= 0.5] = 0
	preds[preds > 0] = 1
	# return the predictions
	return preds

# construct the argument parse and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-e", "--epochs", type=int, default=100,
	help="# of epochs")
ap.add_argument("-a", "--alpha", type=float, default=0.01,
	help="learning rate")
args = vars(ap.parse_args())

# generate a 2-class classification problem with 1,000 data points,
# where each data point is a 2D feature vector
(X, y) = make_blobs(n_samples=1000, n_features=2, centers=2,
	cluster_std=1.5, random_state=1)
y = y.reshape((y.shape[0], 1))
# insert a column of 1's as the last entry in the feature
# matrix -- this little trick allows us to treat the bias
# as a trainable parameter within the weight matrix
X = np.c_[X, np.ones((X.shape[0]))]
# partition the data into training and testing splits using 50% of
# the data for training and the remaining 50% for testing
(trainX, testX, trainY, testY) = train_test_split(X, y, test_size=0.5, random_state=42)

# initialize our weight matrix and list of losses
print("[INFO] training...")
W = np.random.randn(X.shape[1], 1)
losses = []

# loop over the desired number of epochs
for epoch in np.arange(0, args["epochs"]):
	# take the dot product between our features "X" and the weight
	# matrix "W", then pass this value through our sigmoid activation
	# function, thereby giving us our predictions on the dataset
	preds = sigmoid_activation(trainX.dot(W))
	# now that we have our predictions, we need to determine the
	# "error", which is the difference between our predictions and
	# the true values
	error = preds - trainY
	loss = np.sum(error ** 2)
	losses.append(loss)
	# the gradient descent update is the dot product between our
	# (1) features and (2) the error of the sigmoid derivative of
	# our predictions
	d = error * sigmoid_deriv(preds)
	gradient = trainX.T.dot(d)
	# in the update stage, all we need to do is "nudge" the weight
	# matrix in the negative direction of the gradient (hence the
	# term "gradient descent") by taking a small step towards a set
	# of "more optimal" parameters
	W += -args["alpha"] * gradient
	# check to see if an update should be displayed
	if epoch == 0 or (epoch + 1) % 5 == 0:
		print("[INFO] epoch={}, loss={:.7f}".format(int(epoch + 1), loss))

# evaluate our model
print("[INFO] evaluating...")
preds = predict(testX, W)
print(classification_report(testY, preds))

# plot the (testing) classification data
plt.style.use("ggplot")
plt.figure()
plt.title("Data")
plt.scatter(testX[:, 0], testX[:, 1], marker="o", c=testY[:, 0], s=30)
# construct a figure that plots the loss over time
plt.style.use("ggplot")
plt.figure()
plt.plot(np.arange(0, args["epochs"]), losses)
plt.title("Training Loss")
plt.xlabel("Epoch #")
plt.ylabel("Loss")
plt.show()

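1.7 Simple Gradient Descent Results

2. Stochastic Gradient Descent (SGD)

In the previous section, we discussed gradient descent, a first-order optimization algorithm that can be used to learn a set of classifier weights for parameterized learning. However, this "vanilla" implementation of gradient descent can be prohibitively slow on large datasets; in fact, it can even be considered computationally wasteful.

Instead, we should apply stochastic gradient descent (SGD), a simple modification to the standard gradient descent algorithm that computes the gradient and updates the weight matrix W on small batches of training data rather than on the entire training set. Although this modification leads to "noisier" updates, it also lets us take more steps along the gradient (one step per batch rather than one step per epoch), ultimately leading to faster convergence with no negative effect on loss or classification accuracy.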
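A quick back-of-the-envelope calculation shows how many more weight updates the mini-batch version performs. The numbers below are taken from the example scripts in these notes (1,000 points with a 50% training split, 100 epochs, and a mini-batch size of 32); the variable names are illustrative.

```python
import math

n_train = 500      # training points after the 50% split
batch_size = 32    # default mini-batch size in sgd.py
epochs = 100       # default number of epochs

# vanilla gradient descent: one update per epoch
updates_batch_gd = epochs * 1
# mini-batch SGD: one update per batch, including the last partial batch
updates_sgd = epochs * math.ceil(n_train / batch_size)
print(updates_batch_gd, updates_sgd)  # prints: 100 1600
```

For the same number of passes over the data, mini-batch SGD takes 16 times as many steps here, each one cheaper and noisier than a full-batch step.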

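SGD is arguably the most important algorithm when it comes to training deep neural networks. Even though the original incarnation of SGD was introduced over 57 years ago [85], it is still the engine that enables us to train large networks to learn patterns from data points. Above all other algorithms covered in this book, take the time to understand SGD.

2.1 Mini-batch SGD

2.2 Implementing Mini-batch SGD

sgd.py, code as follows: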

# import the necessary packages
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
import numpy as np
import argparse
def sigmoid_activation(x):
	# compute the sigmoid activation value for a given input
	return 1.0 / (1 + np.exp(-x))
def sigmoid_deriv(x):
	# compute the derivative of the sigmoid function ASSUMING
	# that the input "x" has already been passed through the sigmoid
	# activation function
	return x * (1 - x)
def predict(X, W):
	# take the dot product between our features and weight matrix
	preds = sigmoid_activation(X.dot(W))
	# apply a step function to threshold the outputs to binary
	# class labels
	preds[preds <= 0.5] = 0
	preds[preds > 0] = 1
	# return the predictions
	return preds
def next_batch(X, y, batchSize):
	# loop over our dataset "X" in mini-batches, yielding a tuple of
	# the current batched data and labels
	for i in np.arange(0, X.shape[0], batchSize):
		yield (X[i:i + batchSize], y[i:i + batchSize])

# construct the argument parse and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-e", "--epochs", type=int, default=100,
	help="# of epochs")
ap.add_argument("-a", "--alpha", type=float, default=0.01,
	help="learning rate")
ap.add_argument("-b", "--batch-size", type=int, default=32,
	help="size of SGD mini-batches")
args = vars(ap.parse_args())

# generate a 2-class classification problem with 1,000 data points,
# where each data point is a 2D feature vector
(X, y) = make_blobs(n_samples=1000, n_features=2, centers=2,
	cluster_std=1.5, random_state=1)
y = y.reshape((y.shape[0], 1))
# insert a column of 1's as the last entry in the feature
# matrix -- this little trick allows us to treat the bias
# as a trainable parameter within the weight matrix
X = np.c_[X, np.ones((X.shape[0]))]
# partition the data into training and testing splits using 50% of
# the data for training and the remaining 50% for testing
(trainX, testX, trainY, testY) = train_test_split(X, y,
	test_size=0.5, random_state=42)

# initialize our weight matrix and list of losses
print("[INFO] training...")
W = np.random.randn(X.shape[1], 1)
losses = []

# loop over the desired number of epochs
for epoch in np.arange(0, args["epochs"]):
	# initialize the total loss for the epoch
	epochLoss = []
	# loop over our data in batches
	for (batchX, batchY) in next_batch(trainX, trainY, args["batch_size"]):
		# take the dot product between our current batch of features
		# and the weight matrix, then pass this value through our
		# activation function
		preds = sigmoid_activation(batchX.dot(W))
		# now that we have our predictions, we need to determine the
		# "error", which is the difference between our predictions
		# and the true values
		error = preds - batchY
		epochLoss.append(np.sum(error ** 2))
		# the gradient descent update is the dot product between our
		# (1) current batch and (2) the error of the sigmoid
		# derivative of our predictions
		d = error * sigmoid_deriv(preds)
		gradient = batchX.T.dot(d)
		# in the update stage, all we need to do is "nudge" the
		# weight matrix in the negative direction of the gradient
		# (hence the term "gradient descent") by taking a small step
		# towards a set of "more optimal" parameters
		W += -args["alpha"] * gradient
	# update our loss history by taking the average loss across all
	# batches
	loss = np.average(epochLoss)
	losses.append(loss)
	# check to see if an update should be displayed
	if epoch == 0 or (epoch + 1) % 5 == 0:
		print("[INFO] epoch={}, loss={:.7f}".format(int(epoch + 1), loss))

# evaluate our model
print("[INFO] evaluating...")
preds = predict(testX, W)
print(classification_report(testY, preds))

# plot the (testing) classification data
plt.style.use("ggplot")
plt.figure()
plt.title("Data")
plt.scatter(testX[:, 0], testX[:, 1], marker="o", c=testY[:, 0], s=30)
# construct a figure that plots the loss over time
plt.style.use("ggplot")
plt.figure()
plt.plot(np.arange(0, args["epochs"]), losses)
plt.title("Training Loss")
plt.xlabel("Epoch #")
plt.ylabel("Loss")
plt.show()

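2.3 SGD Results

3. Regularization

3.1 What Is Regularization and Why Do We Need It?

Regularization helps us control model capacity, ensuring that our models are better at making (correct) classifications on data points they were not trained on, which we call the ability to generalize. If we do not apply regularization, our classifiers can easily become too complex and overfit our training data, in which case we lose the ability to generalize to our testing data (and to data points outside the testing set as well, such as new images).

However, too much regularization can be a bad thing. We run the risk of underfitting, in which case our model performs poorly on the training data and cannot model the relationship between the input data and the output class labels (because we limited model capacity too much).

3.2 Types of Regularization Techniques

In general, you will see three common types of regularization that are applied directly to the loss function. The first, which we reviewed earlier, is L2 regularization (also known as "weight decay"), which penalizes the squared values of the weights.

We also have L1 regularization, which takes the absolute value of the weights rather than the square.

Elastic Net [98] regularization seeks to combine both L1 and L2 regularization.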
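The three penalties can be sketched as follows, using their standard definitions: the L2 penalty is the sum of squared weights, the L1 penalty is the sum of absolute weights, and Elastic Net is a weighted combination of the two. The mixing weight `beta`, the strength `lam`, and the helper `regularized_loss` are assumptions made for this illustration, not values from the notes.

```python
import numpy as np

def l2_penalty(W):
    # sum of squared weights
    return float(np.sum(W ** 2))

def l1_penalty(W):
    # sum of absolute weights
    return float(np.sum(np.abs(W)))

def elastic_net_penalty(W, beta=0.5):
    # weighted L2 term plus L1 term; `beta` is an assumed mixing weight
    return beta * l2_penalty(W) + l1_penalty(W)

def regularized_loss(data_loss, W, lam=0.01, penalty=l2_penalty):
    # total loss = data loss + lambda * R(W)
    return data_loss + lam * penalty(W)

W = np.array([[0.5, -1.0], [2.0, 0.0]])  # hypothetical weight matrix
print(l2_penalty(W), l1_penalty(W), elastic_net_penalty(W))
print(regularized_loss(1.25, W))
```

Because the penalty is added to the data loss, gradient descent is pushed toward smaller weights; a larger `lam` means stronger regularization and a higher risk of underfitting.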
