Deep Learning Gradient Descent Methods — Comprehensive Tests and Analysis

Sudharsan Asaithambi
6 min read · May 3, 2021

Optimization has been a major topic of study in mathematics since the early 20th century and has vast real-life applications. Problems ranging from retail supply chains to optimal routing of e-commerce deliveries depend on constrained optimization techniques.

While some of these methods, such as those based on first-order derivatives, are quite old, the field is still growing, with continued progress in Convex Optimization and Gradient Descent methods.

In this article, we will look at the performance of various Gradient Descent methods on test functions with different and often extreme curvature, and infer their properties.

What is Optimization?

Optimization in mathematics refers to selecting the best available choice from a domain of possible choices. Here, "best" means the choice that maximizes or minimizes the objective value.

Gradient Descent

Gradient Descent is one such empirical optimization technique: at each iteration we move in the negative gradient direction, bringing us closer to a local minimum. It is one of the most widely used optimization techniques in Machine Learning, as it lets us pursue empirical risk minimization without explicit knowledge of the Bayes Prediction Function.

General Sketch of Gradient Descent Optimization
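In code, the core update is simply a repeated step against the gradient. The toy objective, starting point, learning rate, and iteration count below are my own illustrative choices, not taken from the article:

```python
# Gradient descent on a toy objective f(x, y) = x**2 + 10 * y**2 (hypothetical example).
def grad(x, y):
    # Analytic gradient of the toy objective: (2x, 20y).
    return 2 * x, 20 * y

x, y = 4.0, -3.0   # arbitrary starting point
lr = 0.05          # learning rate (step size)

for _ in range(500):
    gx, gy = grad(x, y)
    x, y = x - lr * gx, y - lr * gy   # step in the negative gradient direction

print(x, y)  # both coordinates end up very close to the minimum at (0, 0)
```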

Although Gradient Descent is used ubiquitously in Machine Learning, achieving convergence in a short amount of time is still considered an art.

A few of the challenges are:

  1. In Machine Learning we do not know the underlying data-generating distribution or the Bayes Prediction Function, and hence cannot confirm whether we have reached the Global Minimum.
  2. Gradient Descent is a greedy algorithm and converges to a local minimum.
  3. The rate of convergence is mainly determined by the learning rate parameter, and a suitable value has to be found: too high a learning rate can make the optimization diverge, while too low a learning rate is costly in both time and computational resources (see the short sketch after this list).
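To make the learning-rate trade-off concrete, here is a small sketch on the one-dimensional objective f(x) = x^2 (a hypothetical example of my own, not from the article):

```python
def run_gd(lr, steps=50, x0=1.0):
    """Plain gradient descent on f(x) = x**2, whose gradient is 2*x."""
    x = x0
    for _ in range(steps):
        x = x - lr * 2 * x
    return x

print(run_gd(lr=1.5))    # too high: the iterates blow up (divergence)
print(run_gd(lr=1e-4))   # too low: barely moves away from x0 = 1.0
print(run_gd(lr=0.1))    # reasonable: converges close to the minimum at x = 0
```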

To alleviate these problems, several variations of gradient descent have been proposed.

Based on the amount of data used per iteration, Gradient Descent is divided into the following (a rough sketch of the difference follows this list):

  1. Full Batch Gradient Descent
  2. Mini Batch Gradient Descent
  3. Stochastic Gradient Descent
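A rough sketch of the distinction, assuming a dataset and a per-sample gradient function (the names here are hypothetical, for illustration only):

```python
import random

def gradient_estimate(data, grad_of_sample, batch_size=None):
    """Average per-sample gradients over a full batch, a mini-batch, or one sample."""
    if batch_size is None:           # full-batch gradient descent: use every sample
        batch = data
    else:                            # mini-batch; batch_size == 1 gives stochastic GD
        batch = random.sample(data, batch_size)
    grads = [grad_of_sample(sample) for sample in batch]
    return sum(grads) / len(grads)

# Toy usage: fit a scalar w to the data by minimising the mean of (w - sample)**2.
data = [1.0, 2.0, 3.0, 4.0]
w = 0.0
full  = gradient_estimate(data, lambda s: 2 * (w - s))                 # full batch
mini  = gradient_estimate(data, lambda s: 2 * (w - s), batch_size=2)   # mini-batch
stoch = gradient_estimate(data, lambda s: 2 * (w - s), batch_size=1)   # stochastic
```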

Variations of Gradient Descent have also been introduced to achieve faster convergence. Some of these are:

  1. Gradient Descent
  2. Gradient Descent with Momentum
  3. AdaGrad
  4. AdaDelta
  5. RMSProp, etc.

Optimization Test Functions

In mathematics, various artificial functions have been created to test the robustness of the optimization methods against different function topologies.
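The article does not show its test harness, but a comparison like the ones below can be set up with PyTorch's built-in optimizers (presumably torch.optim.SGD, Adam, AdamW, RMSprop, Adagrad and Adadelta, given the names in the results), by treating the coordinates (x, y) as trainable parameters and the test-function value as the loss. A minimal sketch, with the learning rate as a placeholder and the test functions defined in the sections below:

```python
import torch

def run_optimizer(test_fn, start_xy, opt_cls, lr=1e-3, steps=20_000):
    """Minimise test_fn(x, y) with a torch optimizer, starting from start_xy."""
    p = torch.tensor(start_xy, requires_grad=True)
    opt = opt_cls([p], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = test_fn(p[0], p[1])
        loss.backward()
        opt.step()
    return test_fn(p[0], p[1]).item(), p.detach().tolist()

# Example call (three_hump_camel is sketched in the next section):
# value, point = run_optimizer(three_hump_camel, [-3.9, 4.1], torch.optim.Adam)
```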

Three-hump camel function

Function Properties

Domain : -5 < x,y < 5

Minima : 0 at x = 0, y = 0

Initial value : 349.04 at (-3.9, 4.1)

Contour Lines

We observe that the gradient is almost zero near the origin and is very high farther from the origin.
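The standard three-hump camel definition, which the article does not reproduce, makes this behaviour plausible: the x^6 term dominates far from the origin, while every term is small near it. A quick reference implementation, assuming the usual definition from the optimization literature:

```python
def three_hump_camel(x, y):
    # Standard three-hump camel definition: the x**6 term gives steep gradients far
    # from the origin, while the surface is nearly flat close to the minimum at (0, 0).
    return 2 * x**2 - 1.05 * x**4 + x**6 / 6 + x * y + y**2
```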

Performance of Gradient Descent Variations

SGD, RMSProp, Adam and AdamW converged to the global minimum within the first 20,000 iterations, while AdaDelta and AdaGrad converged slowly. AdaGrad had the worst convergence rate.

Adam : 0 at [0,0]

SGD : 0 at [0,0]

RMSProp : 0 at [0,0]

AdaDelta : 0.94 at [-0.1606, -0.8654]

AdaGrad : 3.54 at [-0.6817, -1.3736]

AdamW : 0 at [0,0]

Easom Function

Function Properties

Domain : -100 < x,y < 100

Minima : -1 at x = pi, y = pi

Initial value : 0 at (-8.89, -61.51)

Contour Lines

Performance of Gradient Descent Variations

Adam : 0 at [-8.89, -61.51]

SGD : 0 at [-8.89, -61.51]

RMSProp : 0 at [-8.89, -61.51]

AdaDelta : 0 at [-8.89, -61.51]

AdaGrad : 0 at [-8.89, -61.51]

AdamW : 0 at [-8.05, -55.66]

Note that the Easom function is almost flat: it takes values near 0 and has an almost-zero gradient everywhere except in a small region around (pi, pi). Since the starting point lies in this flat region, none of the gradient descent methods learn anything, even though the minimum of -1 exists at (pi, pi).
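For reference, the usual Easom definition (assumed here, since the article does not give the formula) shows why: the exponential factor is essentially zero away from (pi, pi), flattening both the function and its gradient around the starting point.

```python
import math
import torch

def easom(x, y):
    # Standard Easom definition: the exponential factor vanishes away from (pi, pi),
    # so both the value and the gradient are essentially zero there.
    return -torch.cos(x) * torch.cos(y) * torch.exp(-((x - math.pi)**2 + (y - math.pi)**2))
```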

Egg-holder Function

Function Properties

Domain : -512 < x,y < 512

Minima : -959.6407 at x = 512, y = 404.2319

Initial value : 88.96 at (362.43, 183.13)
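For reference, a sketch of the standard egg-holder definition (assumed; the article does not give the formula), which is highly multimodal with many deep local minima:

```python
import torch

def eggholder(x, y):
    # Standard egg-holder definition: a rugged surface with many deep local minima.
    return (-(y + 47) * torch.sin(torch.sqrt(torch.abs(x / 2 + (y + 47))))
            - x * torch.sin(torch.sqrt(torch.abs(x - (y + 47)))))
```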

Contour Lines

Performance of Gradient Descent Variations

Adam : -131.43 at [372.5047, 173.0551]

SGD : -623.60 at [411.73, 164.92]

RMSProp : -129.98 at [372.45, 173.11]

AdaDelta : 79.43 at [363.12, 182.44]

AdaGrad : 88.13 at [362.50, 183.06]

AdamW : 194.22 at [337.49, 156.25]

McCormick Function

Function Properties

Domain : -512 < x,y < 512

Minima : 0 at x = 0, y = 0

Initial value : 30552 at (89.34,-86.44)

Contour Lines

Performance of Gradient Descent Variations

Adam : 23967.93 at [ 79.35, -76.46]

SGD : 1.23 at [2.59, 1.59]

RMSProp : 23953.68 at [ 79.33, -76.43]

AdaDelta : 30087.91 at [-2.45, 41.94]

AdaGrad : 30507.98 at [-3.05, 42.54]

AdamW : 567.67 at [ 6.63, 29.10]

Booth Function

Function Properties

Domain : -10 < x,y < 10

Minima : 0 at x = 1, y = 3

Initial value : 116.16 at (7.85, 0.04)
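For reference, a sketch of the standard Booth definition (assumed; the article does not give the formula). It is a convex quadratic bowl, and its value at the stated starting point (7.85, 0.04) comes out to roughly 116, consistent with the initial value above:

```python
def booth(x, y):
    # Standard Booth definition: a convex quadratic bowl with its minimum at (1, 3).
    return (x + 2 * y - 7)**2 + (2 * x + y - 5)**2
```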

Contour Lines

Performance of Gradient Descent Variations

Adam : 0 at [ 1, 3]

SGD : 0 at [ 1, 3]

RMSProp : 0 at [ 1, 3]

AdaDelta : 80.16 at [7.23,-0.54]

AdaGrad : 11.84 at [7.79, -0.02]

AdamW : 0 at [ 1, 3]

Schaffer function

Function Properties

Domain : -100 < x,y < 100

Minima : 0 at x = 0, y = 0

Initial value : 0.49 at (-16.30,-86.83)
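For reference, a sketch of what is presumably Schaffer function N.2 (an assumption; the article only says "Schaffer function", but N.2 matches the stated domain and minimum):

```python
import torch

def schaffer_n2(x, y):
    # Schaffer function N.2 (assumed): nearly flat at about 0.5 away from the origin,
    # which is consistent with the 0.49 values reported in this section.
    return 0.5 + (torch.sin(x**2 - y**2)**2 - 0.5) / (1 + 0.001 * (x**2 + y**2))**2
```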

Contour Lines

Performance of Gradient Descent Variations

Adam : 0.49 at [-14.7121, -86.5404]

SGD : 0.49 at [-16.2977, -86.8240]

RMSProp : 0.49 at [-16.2053, -86.8066]

AdaDelta : 0.49 at [-16.2967, -86.8238]

AdaGrad : 0.49 at [-16.2966, -86.8238]

AdamW : 0.49 at [-7.1560, -85.5803]

Conclusion

With these toy functions, we were able to understand the inner mechanics of the Gradient Descent variations and see where each of them performs well and where it struggles.
