Optimization Algorithms

Basic gradient descent has a significant limitation: the same learning rate applies to every weight, at every iteration. In reality, some weights should be updated quickly (those tied to rare, informative features) and others slowly (those tied to common, redundant features). And because we fix the learning rate before training begins, we cannot adapt it to what the optimizer actually encounters; we are essentially navigating blindly.

Adaptive optimization algorithms address this by adjusting the learning rate dynamically — differently for each weight — based on what has happened during training.

AdaGrad

\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \odot g_t

Key Idea

Accumulates the sum of all past squared gradients (G) separately for each parameter. Parameters with large historical gradients get smaller updates; parameters associated with rare features, which accumulate little gradient history, get larger ones.

When to Use

Sparse data with many rare features — e.g., NLP with large vocabularies, recommendation systems. Works well when infrequent features carry high signal.

Challenges

The accumulated sum G grows monotonically and never shrinks. Over long training runs the effective learning rate collapses toward zero, causing the model to stop learning entirely.

Key Hyperparams

η (learning rate, typically 0.01); ε (small constant ~1e-8 for numerical stability)
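
To make the accumulation concrete, here is a minimal NumPy sketch of the AdaGrad update above. The names (`theta`, `G`, `adagrad_step`) and the toy objective are illustrative, not taken from any particular library.

```python
import numpy as np

def adagrad_step(theta, G, grad, lr=0.01, eps=1e-8):
    """One AdaGrad update: G accumulates squared gradients per parameter."""
    G = G + grad ** 2                             # monotonically growing history
    theta = theta - lr / np.sqrt(G + eps) * grad  # per-parameter effective learning rate
    return theta, G

# Toy usage: 100 AdaGrad steps on f(theta) = 0.5 * ||theta||^2, where grad = theta.
theta = np.array([1.0, -2.0])
G = np.zeros_like(theta)
for _ in range(100):
    theta, G = adagrad_step(theta, G, grad=theta)
```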

RMSProp

w_j^{t+1} = w_j^t - \frac{\eta}{\sqrt{E[(g_j^t)^2] + \epsilon}}\, g_j^t \qquad E[(g_j^t)^2] = 0.9\, E[(g_j^{t-1})^2] + 0.1\,(g_j^t)^2

Key Idea

Replaces AdaGrad's cumulative sum with an exponentially decaying average of squared gradients. Recent gradients matter more than old ones, so the learning rate stabilizes rather than collapsing.

When to Use

Non-stationary objectives and recurrent networks. A good upgrade from AdaGrad when training stalls. Also works well for online and mini-batch settings.

Challenges

No bias correction for early training steps. Can still be sensitive to the choice of β and η. Does not use momentum, so convergence can be noisier than with Adam.

Key Hyperparams

η (learning rate, ~0.001); β (decay rate, default 0.9); ε (~1e-8)
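
A corresponding NumPy sketch of the RMSProp rule, with the decay written as `beta` (default 0.9, matching the formula above); the names are again illustrative.

```python
import numpy as np

def rmsprop_step(w, avg_sq, grad, lr=0.001, beta=0.9, eps=1e-8):
    """One RMSProp update: exponentially decaying average of squared gradients."""
    avg_sq = beta * avg_sq + (1 - beta) * grad ** 2   # recent gradients dominate the average
    w = w - lr / np.sqrt(avg_sq + eps) * grad
    return w, avg_sq
```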

Adam

w_j^{t+1} = w_j^t - \frac{\eta}{\sqrt{\hat{v}_j^t} + \epsilon}\, \hat{m}_j^t \qquad \begin{aligned} m_j^t &= \beta_1\, m_j^{t-1} + (1-\beta_1)\, g_j^t, & \hat{m}_j^t &= \frac{m_j^t}{1 - \beta_1^t} \\ v_j^t &= \beta_2\, v_j^{t-1} + (1-\beta_2)\,(g_j^t)^2, & \hat{v}_j^t &= \frac{v_j^t}{1 - \beta_2^t} \end{aligned}

Key Idea

Combines RMSProp (adaptive per-parameter learning rates via v̂) with momentum (m̂). Bias correction terms ensure accurate estimates in early training when the running averages are cold-started at zero.

When to Use

The default choice for most deep learning tasks. Robust across architectures (CNNs, Transformers, MLPs). Requires minimal tuning — defaults work remarkably often.

Challenges

Can converge to sharp minima that generalize worse than SGD on some vision tasks. AdamW (Adam + decoupled weight decay) is preferred when L2 regularization matters. Not always ideal for fine-tuning large pretrained models.

Key Hyperparams

η (learning rate, ~0.001); β₁ (momentum decay, 0.9); β₂ (RMS decay, 0.999); ε (~1e-8)
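
Putting both running averages and the bias correction together, here is a minimal NumPy sketch of the Adam update with the default hyperparameters listed above (function and variable names are illustrative):

```python
import numpy as np

def adam_step(w, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad        # momentum: decaying average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # RMSProp: decaying average of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction for the zero-initialized averages
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```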


Adam — The Current Best All-Around Optimizer

Adam (Adaptive Moment Estimation) is what you should reach for first in most situations. It combines:

  • The idea behind RMSProp: an exponentially decaying average of squared gradients
  • The idea of momentum: an exponentially decaying average of gradients themselves

The algorithm maintains two running averages and uses them together to compute an adaptive learning rate for each weight. It includes bias correction terms that are important during early training, when both averages are still close to their zero initialization.

Adam is robust across a wide range of architectures and problems, requires minimal hyperparameter tuning (the default values work remarkably often), and typically converges faster than SGD with a fixed learning rate. It became the default optimizer for most deep learning work in the mid-2010s and has maintained that status.

Newer variants — AdamW, Nadam, RAdam — address various edge cases. But when starting a new project, Adam with default parameters is a reasonable first move.
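
In practice you rarely implement these updates by hand. As one example, assuming PyTorch, the framework defaults mirror the values above; the model here is just a placeholder.

```python
import torch

model = torch.nn.Linear(10, 1)  # placeholder model

# Adam with the standard defaults: lr=1e-3, betas=(0.9, 0.999), eps=1e-8.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# AdamW decouples weight decay from the adaptive update; often preferred
# when L2-style regularization matters.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
```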

Interactive · Optimizer Race
[Interactive widget: SGD, RMSProp, and Adam traverse the loss surface from the same start point toward the minimum, with a live readout of steps, loss, and distance to the minimum for each optimizer, plus a learning-rate slider.]

Loss surface: a narrow curved valley — the same learning rate is used for all three optimizers. Watch how momentum (Adam) and adaptive scaling (RMSProp, Adam) handle the tight curves differently from plain SGD.

All three optimizers start from the same point on the same loss surface. Adjust the learning rate to see how each handles the narrow curved valley differently.

Checkpoint · Multiple Choice

Your dataset contains text descriptions of products, where most products use very common words but a small fraction use highly specialized, rare terminology. Which optimizer might be especially well-suited to this problem, and why?

Real-World Application

If you are using AdaGrad and your model stops learning mid-training, this is likely the accumulation problem. Switch to RMSProp or Adam. If you are training a large language model on a very sparse vocabulary, AdaGrad may actually outperform Adam for certain embedding layers. In practice: start with Adam, check if other optimizers perform better for your specific problem, and document what you try. The choice of optimizer interacts with learning rate, batch size, and architecture in ways that are problem-specific.
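
One way to check whether another optimizer performs better for your problem is to run the same short training loop under each candidate and compare the resulting loss. The sketch below assumes PyTorch; the helper name, the tiny model, and the synthetic data are all illustrative stand-ins for your own setup.

```python
import torch

def train_once(optimizer_cls, steps=200, **opt_kwargs):
    """Train a tiny model on synthetic data and return the final training loss."""
    torch.manual_seed(0)
    X, y = torch.randn(256, 20), torch.randn(256, 1)
    model = torch.nn.Sequential(
        torch.nn.Linear(20, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1)
    )
    opt = optimizer_cls(model.parameters(), **opt_kwargs)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(model(X), y)
        loss.backward()
        opt.step()
    return loss.item()

# Compare a few candidates under the learning rates discussed above.
for name, cls, kwargs in [
    ("SGD", torch.optim.SGD, {"lr": 0.01}),
    ("RMSprop", torch.optim.RMSprop, {"lr": 0.001}),
    ("Adam", torch.optim.Adam, {"lr": 0.001}),
]:
    print(name, train_once(cls, **kwargs))
```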