← Back to homepage

Extensions to Gradient Descent: from momentum to AdaBound

November 3, 2019 by Chris

Today, optimizing neural networks is often performed with what is known as gradient descent: analogous to walking down a mountain, an algorithm attempts to find a minimum in a neural network's loss landscape.

Traditionally, one of the variants of gradient descent - batch gradient descent, stochastic gradient descent and minibatch gradient descent - were used for this purpose.

However, over many years of usage, various shortcomings of traditional methods were found to exist. In this blog post, I'll cover these challenges based on the available literature, and introduce new optimizers that have flourished since then. Even today's standard optimizers, such as Adam, are covered here.

As you'll see, it's going to be somewhat of a chronology - many of the optimizers covered in this post will be improvements of each other. Funnily, the usage of adaptive optimizers has caused renewed interest in traditional gradient descent as of recently. This is due to the fact that adaptive optimizers were found to perform worse than traditional ones in terms of generalization - i.e., on the test set. We'll therefore also cover more state-of-the-art optimizers here, such as AdaBound, which aims to combine the best of both worlds.

If you have questions - feel free to leave a message in my comments box below! 👇 I'll happily answer any question you have. Thanks and enjoy the post! 😄

After reading this article, you will understand...

Update 01/Mar/2021: ensure that article is still relevant in 2021.

Traditional gradient descent & challenges

When considering the high-level machine learning process for supervised learning, you'll see that each forward pass generates a loss value that can be used for optimization.

Although backpropagation generates the actual gradients in order to perform the optimization, the optimizer algorithm used determines how optimization is performed, i.e., where to apply what change in the weights of your neural network in order to improve loss during the next iteration.


Source: Christopher Gondek

Gradient descent is the traditional class of algorithms used for this purpose. In another blog post detailing three of the traditional variants, we introduced these optimizers that can also be used today:

Various challenges were identified with usage of traditional gradient descent over the years:

Adaptive optimizers

Over the years, given those challenges, many new optimizers have been created, most of which belong to the class of adaptive optimizers. Contrary to gradient descent optimizers, which adapt the weights in a static way with a fixed learning rate across all parameters, adaptive optimizers have more flexibility built-in. We'll now cover many of these optimizers, starting with momentum.


Remember that gradient descent was like walking down a mountain?

Now suppose that you're a ball instead, and roll down that same mountain.

What would happen when rolling down? Indeed, you would go faster and faster, and if you were to change direction, that would be more difficult when you're already rolling.

Traditional gradient descent techniques don't have this built in: they just change direction and keep setting small steps, even though it has been moving downward for some time already.

This is problematic when your loss landscape contains many local minima (Ruder, 2016). In fact, what happens, is that your optimizer will move towards the bottom of the local minimum with small baby steps. Eventually, it could get stuck in the local minimum, not being able to escape from it. This leaves you a suboptimal model.

Now, what if we could model gradient descent after the ball-like way of descending a mountain, overshooting the minimum because it already built up some momentum during descent?

This is exactly what momentum based gradient descent does for you (Qian, 1999; Ruder, 2016). To each gradient update, it adds a little bit of the previous update; that is, its current speed (= direction + velocity) is added to the update. This ensures two things:

Take a look at this video which we found at Towards AIMLPY, which shows how momentum works:


Source: Towards AIMLPY

Nesterov accelerated gradient

In his work, Ruder (2016) asked himself: what if we can get an [even] smarter ball?

One that can essentially look ahead, to see what's coming, merged with whatever happened in the past?

Enter Nesterov accelerated gradient (Nesterov, 1983), which is based on traditional momentum.

If we're at some position and know our momentum, we know where we'll move towards with some margin of error (caused by the gradient update we didn't take into account yet).

This means than rather computing the gradient update given our current position plus our momentum, we can compute the gradient update given our current update + momentum, then adding momentum again. This is as if we're looking one step ahead of time, given our current position and where we're going to. Ruder (2016) argues that this "results in increased responsiveness" of our model. That's nice 😊


While the previous adaptive optimizers were able to allow neural networks to better navigate the data's loss landscapes by adapting their weights altogether, another step forward could be to adapt the individual weights only.

In response to this desire, a subgradient method called Adagrad was developed about ten years ago (Duchi et al., 2011). Contrary to the learning rate of the previous methods, which is set and fixed, Adagrad uses dynamic learning rates for every parameter (Ruder, 2016).

This learning rate, for which a basic rate can be set a priori, adapts itself as it is divided by the square root of the past gradients computed for the parameter, plus some error rate (Ruder, 2016). The square root is computed by means of an efficient matrix operation. If the gradient updates in the past were large, the learning rate gets smaller and smaller, while if they were small, the learning rate remains relatively large.

The result: parameters that didn't update as much as the ones that did, will have larger updates later on in the model's training process. This has multiple benefits:

Unfortunately, Adagrad also comes with a big drawback: the speed of learning will decrease with each time step (Milman (2), n.d.). What's more, at some point in time, the model will effectively stop learning, since the self-adaptive learning rate converges to zero.

Why this happens is simple: it's due to the square root of the past gradients by which the fixed learning rate is divided.

If you have an infinite amount of steps, and hence an infinitely large history of gradients, the squared root will also be infinite. Dividing anything by an infinitely large number yields a value that is approximately zero - and this also applies to the adaptive learning rate. While no practical training process will remotely approach an infinite amount of epochs, you get what happens with many iterations of your training process.

Fortunately, Adadelta comes to the rescue.


We recall that Adagrad benefited from dynamic learning rates, that they were computed by dividing the preconfigured learning rate by the squared root of the parameter's previous gradients, and that precisely this computation yields vanishing gradients due to the increasing nature of the denominator (Ruder, 2016; Milman (2), n.d.).

Adadelta provides a fix for this problem (Zeiler, 2012). It does so by following an interesting philosophy. Suppose that you're currently in the 101st epoch, or iteration. For adapting the parameters, how interesting is the gradient update between the 4th and 5th epoch? Isn't it less interesting than the update between the 30th and 31st epoch, which itself is less interesting than the 67th and 68th one, 80th and 81st, and so on?

Yep, according to Zeiler (2012): the older the gradient update, the less interesting it is.

Adadelta, by consequence, uses a decaying average of some \(w\) previous gradients. If \(w = 5\), for example, only the last 5 gradient updates would be used in the computation, of which the last is most relevant, followed by the second last, and so on (Ruder, 2016). What's more, given the maths behind Adadelta, it's no longer necessary to provide a learning rate a priori (Zeiler, 2012).


An optimizer that was developed in parallel to Adadelta, and actually resembles it quite much, is RMSprop (Ruder, 2016; Hinton, n.d.). In line with Adadelta, RMSprop also divides the learning rate by an exponentially decaying average of some previous gradients. Contrary to Adadelta, however, it is still necessary to configure an initial learning rate when using RMSprop (Hinton, n.d.).


Now what if we could add both momentum and adapt our neuron's weights locally?

This is what the Adam optimizer does, which combines the best of both worlds - i.e., it performs similar to RMSprop and Adadelta, as well as momentum (Ruder, 2016; Kingma & Ba, 2014).

By consequence, it is one of the most widely used optimizers today - and it is the reason that we use it in many of our blog posts, such as How to create a CNN classifier with Keras?


In their same work, Kingma & Ba (2014) introduce the AdaMax optimizer. It is a variant of the Adam optimizer that uses the infinity norm, while the Adam optimizer itself uses the L2 norm for optimization (Kingma & Ba, 2014).

What is a norm, you might now think - that's what I did, at least, when I was confronted with these mathematical terms. Since I'm not a mathematician, I'll try to explain norms intuitively, based on sources on the internet.

Essentially, a norm is a function that maps some input to some output. In the case of norms, it assigns a positive length to any vector in some vector space (Wikipedia, 2004). Essentially, norms therefore tell us something about the length of a vector in a certain mathematical space (Karthick, n.d.). Beyond this, norms must also satisfy other properties which ensure that norms do return particular vector lengths given their space (Wikipedia, 2004):

Multiple norms exist, such as the L0 norm (which essentially tells you something about the number of non-zero elements in a vector; Vishnu, n.d.), the L1 norm (which in any space produces the taxicab norm or a block-style length for the vector), the L2 norm (which produces the shortest possible distance or Euclidian distance), and so on.

This can be generalized to the p-norm, which essentially computes the L-norm for some \(p\) (p = 2 is the Euclidian norm, and so on).

L1 norm (red, blue and yellow) versus L2 norm (green) / public domain.

When you let p approach infinity, you'll get what is known as the max norm or infinity norm. Given some vector \(\textbf{x} = \{ x_1, x_2, ..., x_n \}\), the infinity norm gives you the maximum element in the vector.

Now this is exactly the difference between Adam and the Adamax optimizer, which is essentially a generalization of the L2 norm into the L-infinity norm.

The Adam optimizer updates the gradients inversely proportional to the L2 norm of the "past gradients (...) and current gradient" (Ruder, 2016). In plain English, what gets computed is a vector that is composed of the past gradients and the current gradient (I would like to refer to the Ruder (2016) paper if you wish to get the maths as well!), which by consequence is the smallest vector possible to represent the distance between the two vector edges, and hence represents the value output by the L2 norm.

When generalizing Adam to the L-infinity norm, and hence Adamax, you'll find that the gradient update is the maximum between the past gradients and current gradient. That is, if large weight swings (by virtue of the current gradient) are required, this is possible, but only if they are really significant (given the influence of the past gradients).

In practice, this means that data that is traditionally noisy in terms of gradient updates (e.g., datasets with many outliers) can benefit from Adamax over Adam; this obviously for the reason that previous stable gradient updates can counteract one significant gradient swing. Across the internet, you'll see that many examples illustrate perceived benefits of Adamax when using word embeddings (Ghosh, n.d.). Those representations are traditionally sparse and indeed, Adamax could then be of value. However, as illustrated by Ghosh (n.d.), it's always best to just try at first to see whether Adamax is actually better than Adam or even traditional SGD 😄


Now this one is simple: Nadam = Nesterov acceleration + Adam (Dozat, 2016).

Adam benefits from momentum and individual parameter updates, but does so based on the current gradient.

Momentum based SGD also computes the gradient update based on the current gradient, and we can recall from above that Nesterov acceleration ensures that SGD can essentially look one step ahead by computing the estimated position given current momentum.

Well, Dozat (2016) thought, why can't we incorporate this into Adam?

The result is Nadam: Nesterov momentum based Adam. Now that we've covered Adamax with some abstract norm stuff, life gets easier with Nadam 😊 ...but we aren't there yet 😄

Challenges with adaptive optimizers & new ones

Even though adaptive optimizers have improved traditional SGD in terms of momentum and/or individual parameter optimization, they have their own challenges as well 😄

Those challenges are as follows:

In an attempt to resolve these problems, a range of new optimizers was developed to improve Adam and AdaMax. AMSGrad was one of the first ones, but Luo et al. (2019) argue that it still doesn't fix both issues, and propose AdaBound and AMSBound to fix either Adam or AmsGrad.

The premise behind those two methods is easy: use an adaptive optimizer during the first epochs, when it performs better than traditional methods, and switch to traditional gradient descent later (Luo et al., 2019). So far, results are promising: it seems that at least parts of the generalization gap can be closed.


In this blog post, we've covered many of the optimizers that are in use in today's neural networks. We covered gradient descent as well as adaptive optimizers, and took a look at some of their extensions to overcome problems with convergence and generalization.

I hope you've learnt something from this post 😊 If you miss something, or if you have any questions or other remarks, please feel free to leave a comment below! I'll happily answer and/or adapt my post. Thanks! 😄


Ruder, S. (2016). An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747.

O'Reilly. (n.d.). Fundamentals of Deep Learning. Retrieved from https://www.oreilly.com/library/view/fundamentals-of-deep/9781491925607/ch04.html

Dabbura, I. (2019, September 3). Gradient Descent Algorithm and Its Variants. Retrieved from https://towardsdatascience.com/gradient-descent-algorithm-and-its-variants-10f652806a3

Shaikh, F. (2019, June 24). Introduction to Gradient Descent Algorithm (along with variants) in Machine Learning. Retrieved from https://www.analyticsvidhya.com/blog/2017/03/introduction-to-gradient-descent-algorithm-along-its-variants/

Qian, N. (1999). On the momentum term in gradient descent learning algorithms. Neural networks, 12(1), 145-151.

Nesterov, Y. (1983). A method for unconstrained convex minimization problem with the rate of convergence O (1/k^ 2). In Doklady AN USSR (Vol. 269, pp. 543-547).

Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul), 2121-2159.

Milman, O., & Dali (pseudonym), S. (n.d.). Gradient Descent vs Adagrad vs Momentum in TensorFlow. Retrieved from https://stackoverflow.com/a/44225502

Gupta, S. (n.d.). Shashank Gupta's answer to What is the purpose of AdaGrad for stochastic gradient decent neural network training? Retrieved from https://www.quora.com/What-is-the-purpose-of-AdaGrad-for-stochastic-gradient-decent-neural-network-training/answer/Shashank-Gupta-75

Mallinar, N. (n.d.). Neil Mallinar's answer to What is the purpose of AdaGrad for stochastic gradient decent neural network training? Retrieved from https://www.quora.com/What-is-the-purpose-of-AdaGrad-for-stochastic-gradient-decent-neural-network-training/answer/Neil-Mallinar

Zeiler, M. D. (2012). ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701.

Milman (2), O. (n.d.). Understanding the mathematics of AdaGrad and AdaDelta. Retrieved from https://datascience.stackexchange.com/a/38319

Hinton, G. (n.d.). Overview of mini-batch gradient descent [PDF]. Retrieved from http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf

Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Vishnu. (n.d.). Vishnu's answer to What is a norm of vector (intuitive definition)? Retrieved from https://www.quora.com/What-is-a-norm-of-vector-intuitive-definition/answer/Vishnu-55

Karthick, N. G. (n.d.). Karthick N.G.'s answer to What is a norm of vector (intuitive definition)? Retrieved from https://www.quora.com/What-is-a-norm-of-vector-intuitive-definition/answer/Karthick-N-G

Wikipedia. (2004, September 16). Norm (mathematics). Retrieved from https://en.wikipedia.org/wiki/Norm_(mathematics)

Ghosh, T. (n.d.). Tapa Ghosh's answer to When would you use Adamax over Adam? Retrieved from https://www.quora.com/When-would-you-use-Adamax-over-Adam/answer/Tapa-Ghosh

Dozat, T. (2016). Incorporating nesterov momentum into Adam.

Keskar, N. S., & Socher, R. (2017). Improving generalization performance by switching from adam to sgd. arXiv preprint arXiv:1712.07628.

Luo, L., Xiong, Y., Liu, Y., & Sun, X. (2019). Adaptive gradient methods with dynamic bound of learning rate. arXiv preprint arXiv:1902.09843.

Reddi, S. J., Kale, S., & Kumar, S. (2019). On the convergence of adam and beyond. arXiv preprint arXiv:1904.09237.

Hi, I'm Chris!

I know a thing or two about AI and machine learning. Welcome to MachineCurve.com, where machine learning is explained in gentle terms.