从零开始的Nesterov动量梯度下降

2024-03-19 21:47:34

【翻译自： Gradient Descent With Nesterov Momentum From Scratch】

【说明：Jason Brownlee PhD大神的文章个人很喜欢，所以闲暇时间里会做一点翻译和学习实践的工作，这里是相应工作的实践记录，希望能帮到有需要的人！】

梯度下降是一种优化算法，遵循目标函数的负梯度以定位函数的最小值。

梯度下降的一个限制是，如果目标函数返回嘈杂的梯度，则它可能会卡在平坦的区域中或反弹。动量是一种加速搜索过程的方法，可在平坦区域上浏览并平滑弹力梯度。

在某些情况下，动量的加速度会导致搜索丢失或超出盆地或山谷底部的最小值。 Nesterov动量是动量的扩展，涉及计算搜索空间中投影位置的梯度的递减移动平均值，而不是计算实际位置本身。

这具有利用动量加速收益的作用，同时在接近最优值时允许搜索减慢速度，并减少了丢失或超调的可能性。

在本教程中，您将发现如何从头开始使用Nesterov Momentum开发梯度下降优化算法。完成本教程后，您将知道：

梯度下降是一种优化算法，它使用目标函数的梯度来导航搜索空间。
通过扩展算法并添加Nesterov动量，可以加快梯度下降优化算法的收敛速度。
如何从头开始实现Nesterov动量优化算法，并将其应用于目标函数并评估结果。

教程概述

本教程分为三个部分：他们是：

梯度下降
内斯特罗夫动量
Nesterov动量的梯度下降
    二维测试问题
    Nesterov动量的梯度下降优化
    内斯特罗夫动量的可视化

梯度下降

梯度下降是一种优化算法。它在技术上称为一阶优化算法，因为它明确利用了目标目标函数的一阶导数。

一阶导数，或简称为“导数”，是目标函数在特定点（例如，点）上的变化率或斜率。用于特定输入。

如果目标函数采用多个输入变量，则将其称为多元函数，并且可以将输入变量视为向量。反过来，多元目标函数的导数也可以视为向量，通常称为“梯度”。

梯度：多元目标函数的一阶导数。
对于特定输入，导数或梯度指向目标函数最陡峭的上升方向。梯度下降是指一种最小化优化算法，该算法遵循目标函数的下坡梯度负值来定位函数的最小值。梯度下降算法需要一个正在优化的目标函数和该目标函数的导数函数。目标函数f（）返回给定输入集合的分数，导数函数f'（）给出给定输入集合的目标函数的导数。梯度下降算法需要问题中的起点（x），例如输入空间中的随机选择点。假设我们正在最小化目标函数，然后计算导数并在输入空间中采取一步，这将导致目标函数下坡运动。首先通过计算输入空间中要移动多远的距离来进行下坡运动，计算方法是将步长（称为alpha或学习率）乘以梯度。然后从当前点减去该值，以确保我们逆梯度移动或向下移动目标函数。

x（t + 1）= x（t）–step* f'（x（t））

在给定点的目标函数越陡峭，梯度的大小越大，反过来，在搜索空间中采取的步伐也越大。使用步长超参数来缩放步长的大小。

步长（alpha）：超参数，控制算法每次迭代时相对于梯度在搜索空间中移动多远。
如果步长太小，则搜索空间中的移动将很小，并且搜索将花费很长时间。如果步长太大，则搜索可能会在搜索空间附近反弹并跳过最优值。现在我们已经熟悉了梯度下降优化算法，下面我们来看看Nesterov动量。

内斯特罗夫动量

Nesterov动量是梯度下降优化算法的扩展。这种方法由Yurii Nesterov在1983年发表的题为“一种用收敛速度为O（1 / k ^ 2）解决凸规划问题的方法”描述（并以此名称命名）。Ilya Sutskever等。负责在他们的2013年论文“深度学习中初始化和动量的重要性”中描述了Nesterov动量在具有随机梯度下降的神经网络训练中的广泛应用。他们将这种方法称为“ Nesterov的加速梯度”，简称NAG。Nesterov动量就像更传统的动量一样，除了使用投影更新的偏导数而不是导数当前变量值执行更新。

然后，该变量的最后更新或最后更改将添加到由“动量”超参数缩放的变量中，该超参数控制要添加的最后更改的数量，例如 0.9为90％。更容易从两个步骤来考虑此更新，例如，使用偏导数计算变量的变化，然后计算变量的新值。

change（t + 1）=（momentum * change（t））–（step_size * f'（x（t）））
x（t + 1）= x（t）+change（t + 1）

我们可以从下坡滚球的角度来思考动量，即使在有小山丘的情况下，它也会加速并继续朝着相同的方向前进。

动量的问题在于，加速度有时会导致搜索超出盆地或山谷底部的最小值。内斯特罗夫动量可以被认为是对动量的一种修正，可以克服这个过小问题。它涉及首先使用从上次迭代开始的更改来计算变量的投影位置，并在计算变量的新位置时使用投影位置的导数。计算投影位置的坡度就像已累积加速度的校正因子一样。

Nesterov Momentum可以从以下四个步骤中轻松考虑这一点：

1.投影解决方案的位置。
2.计算投影的梯度。
3.使用偏导数计算变量的变化。
4.更新变量。

让我们更详细地完成这些步骤。首先，使用在算法的最后一次迭代中计算出的变化来计算整个解决方案的投影位置。

projection(t+1) = x(t) + (momentum * change(t))

然后，我们可以计算该新位置的梯度。

gradient(t+1) = f'(projection(t+1))

现在，我们可以使用投影的梯度来计算每个变量的新位置，首先通过计算每个变量的变化。

change(t+1) = (momentum * change(t)) – (step_size * gradient(t+1))

最后，使用计算出的变化为每个变量计算新值。

x(t+1) = x(t) + change(t+1)

更一般地，在凸优化领域中，已知Nesterov Momentum可以提高优化算法的收敛速度（例如，减少找到解所需的迭代次数）。

现在，我们已经熟悉Nesterov Momentum算法，接下来，我们将探讨如何实现它并评估其性能。

Nesterov动量的梯度下降

在本节中，我们将探索如何使用Nesterov动量实现梯度下降优化算法。

二维测试问题

首先，让我们定义一个优化函数。我们将使用一个简单的二维函数，该函数将每个维的输入平方，并定义有效输入的范围（从-1.0到1.0）。下面的Objective（）函数实现了此功能。

# objective function
def objective(x, y):
	return x**2.0 + y**2.0

我们可以创建数据集的三维图，以了解响应面的曲率。下面列出了绘制目标函数的完整示例。

# 3d plot of the test function
from numpy import arange
from numpy import meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
	return x**2.0 + y**2.0

# define range for input
r_min, r_max = -1.0, 1.0
# sample input range uniformly at 0.1 increments
xaxis = arange(r_min, r_max, 0.1)
yaxis = arange(r_min, r_max, 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a surface plot with the jet color scheme
figure = pyplot.figure()
axis = figure.gca(projection='3d')
axis.plot_surface(x, y, results, cmap='jet')
# show the plot
pyplot.show()

运行示例将创建目标函数的三维表面图。我们可以看到全局最小值为f（0，0）= 0的熟悉的碗形状。

我们还可以创建函数的二维图。这在以后要绘制搜索进度时会很有帮助。下面的示例创建目标函数的轮廓图。

# contour plot of the test function
from numpy import asarray
from numpy import arange
from numpy import meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
	return x**2.0 + y**2.0

# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')
# show the plot
pyplot.show()

运行示例将创建目标函数的二维轮廓图。我们可以看到碗的形状被压缩为以颜色渐变显示的轮廓。我们将使用该图来绘制在搜索过程中探索的特定点。

现在我们有了一个测试目标函数，让我们看一下如何实现Nesterov动量优化算法。

Nesterov动量的梯度下降优化

我们可以将Nesterov动量的梯度下降应用于测试问题。首先，我们需要一个函数来计算此函数的导数。x ^ 2的导数在每个维度上均为x * 2，并且derivative（）函数在下面实现此功能。

# derivative of objective function
def derivative(x, y):
	return asarray([x * 2.0, y * 2.0])

接下来，我们可以实现梯度下降优化。

首先，我们可以选择问题范围内的随机点作为搜索的起点。

假定我们有一个数组，该数组定义搜索范围，每个维度一行，并且第一列定义最小值，第二列定义维度的最大值。

# generate an initial point
solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])

接下来，我们需要从当前位置计算投影点并计算其导数。

# calculate the projected solution
projected = [solution[i] + momentum * change[i] for i in range(solution.shape[0])]
# calculate the gradient for the projection
gradient = derivative(projected[0], projected[1])

然后，我们可以创建一个新的解决方案，一次创建一个变量。

首先，使用偏导数和学习率以及变量最后一次变化的动量来计算变量的变化。存储此更改以供算法的下一次迭代。然后，更改将用于计算变量的新值。

# build a solution one variable at a time
new_solution = list()
for i in range(solution.shape[0]):
	# calculate the change
	change[i] = (momentum * change[i]) - step_size * gradient[i]
	# calculate the new position in this variable
	value = solution[i] + change[i]
	# store this variable
	new_solution.append(value)

对目标函数的每个变量重复此操作，然后对算法的每个迭代重复此操作。然后可以使用Objective（）函数评估该新解决方案，并可以报告搜索的性能。

# evaluate candidate point
solution = asarray(new_solution)
solution_eval = objective(solution[0], solution[1])
# report progress
print('>%d f(%s) = %.5f' % (it, solution, solution_eval))

就是这样。我们可以将所有这些绑定到一个名为nesterov（）的函数中，该函数采用目标函数和导数函数的名称，该数组具有域边界和超参数值，用于算法迭代的总数，学习率，和动量，并返回最终解决方案及其评估。下面列出了完整的功能。

# gradient descent algorithm with nesterov momentum
def nesterov(objective, derivative, bounds, n_iter, step_size, momentum):
	# generate an initial point
	solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
	# list of changes made to each variable
	change = [0.0 for _ in range(bounds.shape[0])]
	# run the gradient descent
	for it in range(n_iter):
		# calculate the projected solution
		projected = [solution[i] + momentum * change[i] for i in range(solution.shape[0])]
		# calculate the gradient for the projection
		gradient = derivative(projected[0], projected[1])
		# build a solution one variable at a time
		new_solution = list()
		for i in range(solution.shape[0]):
			# calculate the change
			change[i] = (momentum * change[i]) - step_size * gradient[i]
			# calculate the new position in this variable
			value = solution[i] + change[i]
			# store this variable
			new_solution.append(value)
		# evaluate candidate point
		solution = asarray(new_solution)
		solution_eval = objective(solution[0], solution[1])
		# report progress
		print('>%d f(%s) = %.5f' % (it, solution, solution_eval))
	return [solution, solution_eval]

请注意，为了提高可读性，我们有意使用列表和命令式编码样式，而不是矢量化操作。随意将实现改编为带有NumPy数组的矢量化实现，以实现更好的性能。

然后，我们可以定义我们的超参数并调用nesterov（）函数来优化我们的测试目标函数。

在这种情况下，我们将使用算法的30次迭代，学习速率为0.1，动量为0.3。经过一些反复试验后，发现了这些超参数值。

# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 30
# define the step size
step_size = 0.1
# define momentum
momentum = 0.3
# perform the gradient descent search with nesterov momentum
best, score = nesterov(objective, derivative, bounds, n_iter, step_size, momentum)
print('Done!')
print('f(%s) = %f' % (best, score))

综合所有这些，下面列出了使用Nesterov Momentum进行梯度下降优化的完整示例。

# gradient descent optimization with nesterov momentum for a two-dimensional test function
from math import sqrt
from numpy import asarray
from numpy.random import rand
from numpy.random import seed

# objective function
def objective(x, y):
	return x**2.0 + y**2.0

# derivative of objective function
def derivative(x, y):
	return asarray([x * 2.0, y * 2.0])

# gradient descent algorithm with nesterov momentum
def nesterov(objective, derivative, bounds, n_iter, step_size, momentum):
	# generate an initial point
	solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
	# list of changes made to each variable
	change = [0.0 for _ in range(bounds.shape[0])]
	# run the gradient descent
	for it in range(n_iter):
		# calculate the projected solution
		projected = [solution[i] + momentum * change[i] for i in range(solution.shape[0])]
		# calculate the gradient for the projection
		gradient = derivative(projected[0], projected[1])
		# build a solution one variable at a time
		new_solution = list()
		for i in range(solution.shape[0]):
			# calculate the change
			change[i] = (momentum * change[i]) - step_size * gradient[i]
			# calculate the new position in this variable
			value = solution[i] + change[i]
			# store this variable
			new_solution.append(value)
		# evaluate candidate point
		solution = asarray(new_solution)
		solution_eval = objective(solution[0], solution[1])
		# report progress
		print('>%d f(%s) = %.5f' % (it, solution, solution_eval))
	return [solution, solution_eval]

# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 30
# define the step size
step_size = 0.1
# define momentum
momentum = 0.3
# perform the gradient descent search with nesterov momentum
best, score = nesterov(objective, derivative, bounds, n_iter, step_size, momentum)
print('Done!')
print('f(%s) = %f' % (best, score))

运行示例将优化算法与Nesterov Momentum一起应用于我们的测试问题，并报告算法每次迭代的搜索性能。

注意：由于算法或评估程序的随机性，或者数值精度的差异，您的结果可能会有所不同。考虑运行该示例几次并比较平均结果。

在这种情况下，我们可以看到，在大约15次搜索迭代后找到了接近最佳的解决方案，输入值接近0.0和0.0，评估为0.0。

>0 f([-0.13276479 0.35251919]) = 0.14190
>1 f([-0.09824595 0.2608642 ]) = 0.07770
>2 f([-0.07031223 0.18669416]) = 0.03980
>3 f([-0.0495457 0.13155452]) = 0.01976
>4 f([-0.03465259 0.0920101 ]) = 0.00967
>5 f([-0.02414772 0.06411742]) = 0.00469
>6 f([-0.01679701 0.04459969]) = 0.00227
>7 f([-0.01167344 0.0309955 ]) = 0.00110
>8 f([-0.00810909 0.02153139]) = 0.00053
>9 f([-0.00563183 0.01495373]) = 0.00026
>10 f([-0.00391092 0.01038434]) = 0.00012
>11 f([-0.00271572 0.00721082]) = 0.00006
>12 f([-0.00188573 0.00500701]) = 0.00003
>13 f([-0.00130938 0.0034767 ]) = 0.00001
>14 f([-0.00090918 0.00241408]) = 0.00001
>15 f([-0.0006313 0.00167624]) = 0.00000
>16 f([-0.00043835 0.00116391]) = 0.00000
>17 f([-0.00030437 0.00080817]) = 0.00000
>18 f([-0.00021134 0.00056116]) = 0.00000
>19 f([-0.00014675 0.00038964]) = 0.00000
>20 f([-0.00010189 0.00027055]) = 0.00000
>21 f([-7.07505806e-05 1.87858067e-04]) = 0.00000
>22 f([-4.91260884e-05 1.30440372e-04]) = 0.00000
>23 f([-3.41109926e-05 9.05720503e-05]) = 0.00000
>24 f([-2.36851711e-05 6.28892431e-05]) = 0.00000
>25 f([-1.64459397e-05 4.36675208e-05]) = 0.00000
>26 f([-1.14193362e-05 3.03208033e-05]) = 0.00000
>27 f([-7.92908415e-06 2.10534304e-05]) = 0.00000
>28 f([-5.50560682e-06 1.46185748e-05]) = 0.00000
>29 f([-3.82285090e-06 1.01504945e-05]) = 0.00000
Done!
f([-3.82285090e-06 1.01504945e-05]) = 0.000000

内斯特罗夫动量的可视化

我们可以在域的轮廓图上绘制Nesterov动量搜索的进度。这可以为算法迭代过程中的搜索进度提供直观的认识。我们必须更新nesterov（）函数以维护在搜索过程中找到的所有解决方案的列表，然后在搜索结束时返回此列表。下面列出了具有这些更改的功能的更新版本。

# gradient descent algorithm with nesterov momentum
def nesterov(objective, derivative, bounds, n_iter, step_size, momentum):
	# track all solutions
	solutions = list()
	# generate an initial point
	solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
	# list of changes made to each variable
	change = [0.0 for _ in range(bounds.shape[0])]
	# run the gradient descent
	for it in range(n_iter):
		# calculate the projected solution
		projected = [solution[i] + momentum * change[i] for i in range(solution.shape[0])]
		# calculate the gradient for the projection
		gradient = derivative(projected[0], projected[1])
		# build a solution one variable at a time
		new_solution = list()
		for i in range(solution.shape[0]):
			# calculate the change
			change[i] = (momentum * change[i]) - step_size * gradient[i]
			# calculate the new position in this variable
			value = solution[i] + change[i]
			# store this variable
			new_solution.append(value)
		# store the new solution
		solution = asarray(new_solution)
		solutions.append(solution)
		# evaluate candidate point
		solution_eval = objective(solution[0], solution[1])
		# report progress
		print('>%d f(%s) = %.5f' % (it, solution, solution_eval))
	return solutions

然后，我们可以像以前一样执行搜索，这一次将检索解决方案列表，而不是最佳的最终解决方案。

# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 50
# define the step size
step_size = 0.01
# define momentum
momentum = 0.8
# perform the gradient descent search with nesterov momentum
solutions = nesterov(objective, derivative, bounds, n_iter, step_size, momentum)

然后，我们可以像以前一样创建目标函数的轮廓图。

# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')

最后，我们可以将在搜索过程中找到的每个解决方案绘制成一条由一条线连接的白点。

# plot the sample as black circles
solutions = asarray(solutions)
pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w')

综上所述，下面列出了对测试问题执行Nesterov动量优化并将结果绘制在轮廓图上的完整示例。

# example of plotting the nesterov momentum search on a contour plot of the test function
from math import sqrt
from numpy import asarray
from numpy import arange
from numpy.random import rand
from numpy.random import seed
from numpy import meshgrid
from matplotlib import pyplot
from mpl_toolkits.mplot3d import Axes3D

# objective function
def objective(x, y):
	return x**2.0 + y**2.0

# derivative of objective function
def derivative(x, y):
	return asarray([x * 2.0, y * 2.0])

# gradient descent algorithm with nesterov momentum
def nesterov(objective, derivative, bounds, n_iter, step_size, momentum):
	# track all solutions
	solutions = list()
	# generate an initial point
	solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
	# list of changes made to each variable
	change = [0.0 for _ in range(bounds.shape[0])]
	# run the gradient descent
	for it in range(n_iter):
		# calculate the projected solution
		projected = [solution[i] + momentum * change[i] for i in range(solution.shape[0])]
		# calculate the gradient for the projection
		gradient = derivative(projected[0], projected[1])
		# build a solution one variable at a time
		new_solution = list()
		for i in range(solution.shape[0]):
			# calculate the change
			change[i] = (momentum * change[i]) - step_size * gradient[i]
			# calculate the new position in this variable
			value = solution[i] + change[i]
			# store this variable
			new_solution.append(value)
		# store the new solution
		solution = asarray(new_solution)
		solutions.append(solution)
		# evaluate candidate point
		solution_eval = objective(solution[0], solution[1])
		# report progress
		print('>%d f(%s) = %.5f' % (it, solution, solution_eval))
	return solutions

# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 50
# define the step size
step_size = 0.01
# define momentum
momentum = 0.8
# perform the gradient descent search with nesterov momentum
solutions = nesterov(objective, derivative, bounds, n_iter, step_size, momentum)
# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')
# plot the sample as black circles
solutions = asarray(solutions)
pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w')
# show the plot
pyplot.show()

运行示例将像以前一样执行搜索，但是在这种情况下，将创建目标函数的轮廓图。

在这种情况下，我们可以看到在搜索过程中找到的每个解决方案都显示一个白点，从最优点开始，逐渐靠近图中心的最优点。

码农公寓