Gradient Descent and AdaGrad: Algorithm and Implementation

2023-02-14 07:49:56

The AdaGrad algorithm

In ordinary gradient-based optimization, every component of the objective function's argument is updated with the same learning rate.

\[w = w-\eta\frac{\partial f}{\partial w},\\ b = b-\eta\frac{\partial f}{\partial b} \]

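AdaGrad instead adjusts the learning rate of each dimension according to the size of the gradients seen in that dimension, which avoids the problem that a single learning rate rarely suits every dimension.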

Optimization algorithms with adaptive learning rates (from the link above)

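The idea of the algorithm is to adapt the learning rate of each model parameter independently: parameters with large partial derivatives see their learning rate fall quickly, while parameters with small partial derivatives see a comparatively small decrease.
Concretely, each parameter's learning rate is scaled in inverse proportion to the square root of the sum of its historical squared gradients.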
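
The per-parameter scaling described above can be sketched in a few lines of NumPy. This is only an illustration: the helper `adagrad_step`, the rate `lr=0.1`, and the gradient values are all assumptions made up for the example, not part of the implementation later in this post.

```python
import numpy as np

def adagrad_step(params, grads, hist, lr=0.1, eps=1e-8):
    """One AdaGrad update (hypothetical helper, for illustration only)."""
    hist = hist + grads ** 2                            # per-parameter sum of squared gradients
    params = params - lr * grads / np.sqrt(hist + eps)  # each step scaled by 1/sqrt(history)
    return params, hist

params = np.array([0.0, 0.0])
hist = np.zeros(2)
grads = np.array([10.0, 0.1])  # made-up: one steep and one shallow direction

for _ in range(3):
    params, hist = adagrad_step(params, grads, hist)

effective_lr = 0.1 / np.sqrt(hist + 1e-8)
print(effective_lr)  # the steep direction ends up with the smaller effective rate
print(params)        # yet both coordinates have moved by nearly the same amount
```

Because each step is divided by the root of that parameter's own gradient history, the steep and the shallow direction make similar progress even though their raw gradients differ by a factor of 100.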
AdaGrad algorithm description
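
As a sketch that matches the code later in this post, write \(g_w = \partial f/\partial w\), let \(h_w\) be the running sum of squared gradients (starting from 0), \(\eta\) the global learning rate, and \(\epsilon\) a small constant for numerical stability. The names \(g_w\), \(h_w\) and \(\epsilon\) are assumptions chosen to mirror the variable names in the code. One AdaGrad step for \(w\) is then:

\[h_w = h_w + g_w^2,\\ w = w-\frac{\eta}{\sqrt{h_w+\epsilon}}\,g_w \]

The same update is applied to \(b\) with its own accumulator \(h_b\).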
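Problems with AdaGrad
- The effective learning rate decreases monotonically; late in training it can become so small that progress stalls or training effectively stops early.
- A global initial learning rate still has to be chosen by hand.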
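
The first problem is easy to see numerically. The sketch below assumes a made-up constant gradient of 1.0 and a hypothetical initial rate `lr = 0.5`, and records the effective step size `lr / sqrt(h)` at every iteration.

```python
import math

lr = 0.5    # hypothetical global initial learning rate
grad = 1.0  # made-up constant gradient, for illustration only
h = 0.0     # running sum of squared gradients

effective = []
for t in range(1, 10001):
    h += grad ** 2
    effective.append(lr / math.sqrt(h))

for t in (1, 100, 10000):
    print(t, effective[t - 1])
```

With a constant gradient the effective rate is `lr / sqrt(t)`, so it never recovers once it has shrunk. Real gradients vary, but the accumulator only ever grows, so the same trend holds.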

Code

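import math
import matplotlib.pyplot as plt
import numpy as np

# Initial data: ten (x, y) samples stored as column vectors
xdata = np.array([8., 3., 9., 7., 16., 5., 3., 10., 4., 6.]).reshape(-1, 1)
ydata = np.array([30., 21., 35., 27., 42., 24., 10., 38., 22., 25.]).reshape(-1, 1)
m = xdata.shape[0]  # number of samples

# Starting point, the same for gradient descent (_g) and AdaGrad (_a)
w_g = 8
b_g = 90
w_a = 8
b_a = 90

# AdaGrad accumulators of squared gradients and the stabilising constant
h_w = 0
h_b = 0
eps = 1e-5

# Parameter values visited by each method, recorded for plotting
tw_g = []
tb_g = []
tw_a = []
tb_a = []
iter = 10000  # number of iterations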

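1. Cost function

def cost(w, b):
    # Half the mean squared error of the line y = w * x + b over the m samples.
    # Indexing with [i, 0] yields scalars, so scalar (w, b) give a scalar cost,
    # while array-valued (w, b), such as a meshgrid, give an array of the same shape.
    tsum = 0
    for i in range(m):
        tsum += (w * xdata[i, 0] + b - ydata[i, 0]) ** 2
    return tsum / (2 * m)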

2. Gradient

def grad(w, b):
    # Partial derivatives of the cost with respect to w and b
    dw = (1 / m) * ((w * xdata + b - ydata) * xdata).sum()
    db = (1 / m) * (w * xdata + b - ydata).sum()
    return dw, db
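
Before using an analytic gradient in an optimizer, it can be worth checking it against finite differences. The sketch below is self-contained: it repeats the data from this post as flat arrays, uses vectorized equivalents of `cost` and `grad`, and adds a hypothetical helper `numeric_grad` (with an assumed step of `1e-6`) that is not part of the original code.

```python
import numpy as np

xdata = np.array([8., 3., 9., 7., 16., 5., 3., 10., 4., 6.])
ydata = np.array([30., 21., 35., 27., 42., 24., 10., 38., 22., 25.])
m = xdata.shape[0]

def cost(w, b):
    # Half the mean squared error, vectorized over the samples
    return ((w * xdata + b - ydata) ** 2).sum() / (2 * m)

def grad(w, b):
    # Analytic partial derivatives of cost with respect to w and b
    r = w * xdata + b - ydata
    return (r * xdata).sum() / m, r.sum() / m

def numeric_grad(w, b, step=1e-6):
    """Central finite differences (hypothetical helper, for checking only)."""
    dw = (cost(w + step, b) - cost(w - step, b)) / (2 * step)
    db = (cost(w, b + step) - cost(w, b - step)) / (2 * step)
    return dw, db

analytic = grad(8.0, 90.0)          # evaluated at the post's starting point
numeric = numeric_grad(8.0, 90.0)
print(analytic)
print(numeric)
```

If the two pairs agree to several significant digits, the analytic gradient is very likely correct.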

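3. Gradient descent

def Cal_Gradient(alpha, w, b):
    # One gradient-descent step with a single learning rate alpha for both parameters
    dw, db = grad(w, b)
    w = w - alpha * dw
    b = b - alpha * db
    return w, b


def graddescent(alpha, w, b, iter):
    Cost = np.zeros(iter)
    for i in range(iter):
        # Record the current point so the path can be plotted later
        tw_g.append(w)
        tb_g.append(b)
        w, b = Cal_Gradient(alpha, w, b)
        Cost[i] = cost(w, b)
    return w, b, Cost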

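4. AdaGrad implementation

def Cal_Adagrad(lr, w, b, h_w, h_b, eps):
    dw, db = grad(w, b)
    # Accumulate the squared gradient of each parameter separately
    h_w += dw ** 2
    h_b += db ** 2
    # Scale each step by the inverse root of that parameter's accumulated history
    w = w - lr * (1 / math.sqrt(h_w + eps)) * dw
    b = b - lr * (1 / math.sqrt(h_b + eps)) * db
    # Return the accumulators too; floats are immutable, so the caller
    # would otherwise never see the updated history
    return w, b, h_w, h_b


def Adagrad(lr, w, b, iter):
    Cost = np.zeros(iter)
    h_w, h_b = 0.0, 0.0  # gradient history, carried across iterations
    for i in range(iter):
        tw_a.append(w)
        tb_a.append(b)
        w, b, h_w, h_b = Cal_Adagrad(lr, w, b, h_w, h_b, eps)
        Cost[i] = cost(w, b)
    return w, b, Cost

5. Plots

w_g, b_g, Cost_g = graddescent(0.01, w_g, b_g, iter)
# AdaGrad divides each step by the root of the accumulated squared gradients,
# which are large at the starting point (8, 90), so it needs a much larger
# nominal rate than plain gradient descent. 10 is an assumed, untuned value.
w_a, b_a, Cost_a = Adagrad(10, w_a, b_a, iter)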
x = np.linspace(3, 16, 100)
y = w_a * x + b_a
cx = np.arange(0, iter)

plt.figure()

# Top left: the data and the line fitted by AdaGrad
plt.subplot(221)
plt.plot(x, y, 'orange')
plt.scatter(xdata, ydata)

# Top right: cost contours over (b, w) with the path taken by each method
plt.subplot(222)
w = np.linspace(-10, 10, 10)
b = np.linspace(-100, 100, 10)
w_, b_ = np.meshgrid(w, b)
Z = cost(w_, b_)
plt.contourf(b_, w_, Z, 30)
plt.contour(b_, w_, Z)
plt.plot(tb_g, tw_g, marker='o', c='black', linewidth=2)
plt.plot(tb_a, tw_a, marker='.', c='green', linewidth=1)
plt.scatter(b_a, w_a, s=200, c='yellow', marker='x')  # final AdaGrad parameters

# Bottom left: cost per iteration for both methods
plt.subplot(223)
plt.plot(cx, Cost_g, label='BGD', marker='*', color='black', linewidth=2)
plt.plot(cx, Cost_a, label='Adagrad', marker='.', c='green', linewidth=1)
plt.legend()
plt.show()
