Optimization Algorithms
Basic gradient descent has a significant limitation: the same learning rate applies to every weight at every iteration. In practice, some weights benefit from larger updates (those tied to rare, informative features) and others from smaller ones (those tied to common, redundant features). And because the learning rate is fixed before training begins, it cannot respond to anything the optimizer encounters along the way.
Adaptive optimization algorithms address this by adjusting the learning rate dynamically, and differently for each weight, based on the gradients observed so far during training.
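The limitation of a single global learning rate can be seen in a small sketch. The two-weight quadratic loss, the curvature values, and the helper name run_gd below are illustrative assumptions, not part of any particular library: one weight sits in a steep direction and the other in a shallow one, and plain gradient descent has to use the same step size for both.

```python
import numpy as np

# Hypothetical toy loss: f(w) = 0.5 * sum(curvature * w**2).
# The first weight lies in a steep direction, the second in a shallow one.
curvature = np.array([100.0, 1.0])

def run_gd(lr, steps=100):
    """Plain gradient descent with one learning rate shared by all weights."""
    w = np.array([1.0, 1.0])
    for _ in range(steps):
        grad = curvature * w      # gradient of the toy loss
        w = w - lr * grad         # same lr for every weight
    return w

# Small enough to keep the steep weight stable (needs lr < 2 / 100),
# but the shallow weight is still far from zero after 100 steps.
w_safe = run_gd(lr=0.015)

# Slightly too large for the steep weight, which now diverges.
w_unstable = run_gd(lr=0.021)

print("lr=0.015:", w_safe)
print("lr=0.021:", w_unstable)
```

In this toy setup the steep direction caps the usable learning rate, and the shallow direction pays for it with slow progress. A per-weight step size removes that coupling.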
Adam — The Current Best All-Around Optimizer
Adam (Adaptive Moment Estimation) is what you should reach for first in most situations. It combines:
- The idea behind RMSProp: an exponentially decaying average of squared gradients
- The idea of momentum: an exponentially decaying average of gradients themselves
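The algorithm maintains both running averages for every weight and combines them to compute an adaptive step for each one. It also applies bias-correction terms, which matter early in training: both averages are initialized at zero and would otherwise underestimate the true values during the first steps.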
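The update described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a production implementation: the function name adam_step and the toy quadratic loss are assumptions made for the example, and the default values shown (beta1=0.9, beta2=0.999, eps=1e-8) are the commonly used ones.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad        # decaying average of gradients (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2   # decaying average of squared gradients (RMSProp idea)
    m_hat = m / (1 - beta1 ** t)              # bias correction: m and v start at zero
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-weight adaptive step
    return w, m, v

# Hypothetical toy loss: f(w) = 0.5 * sum(curvature * w**2)
curvature = np.array([100.0, 1.0])
w = np.array([1.0, 1.0])
m = np.zeros_like(w)
v = np.zeros_like(w)

history = []
for t in range(1, 2001):
    grad = curvature * w
    w, m, v = adam_step(w, grad, m, v, t, lr=0.01)
    history.append(w.copy())

print("after step 1:   ", history[0])
print("after step 2000:", history[-1])
```

Because of bias correction, the very first step moves each weight by roughly lr, even though the two gradients differ by a factor of 100. Without the correction, the first steps would be scaled by the zero-initialized averages instead of the observed gradients.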
Adam is robust across a wide range of architectures and problems, requires minimal hyperparameter tuning (the default values work remarkably often), and typically converges faster than SGD with a fixed learning rate. It became the default optimizer for most deep learning work in the mid-2010s and has maintained that status.
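Newer variants each target a specific weakness: AdamW decouples weight decay from the gradient-based update, Nadam adds Nesterov-style momentum, and RAdam rectifies the high variance of the adaptive learning rate during the first steps of training. But when starting a new project, Adam with default parameters is a reasonable first move.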
[Interactive figure: SGD, RMSProp, and Adam start from the same point on a loss surface shaped like a narrow curved valley, all using the same adjustable learning rate. Momentum (Adam) and adaptive scaling (RMSProp, Adam) handle the tight curves differently from plain SGD.]
Your dataset contains text descriptions of products, where most products use very common words but a small fraction use highly specialized, rare terminology. Which optimizer might be especially well-suited to this problem, and why?
Real-World Application
If you are using AdaGrad and your model stops learning mid-training, the likely cause is the accumulation problem: AdaGrad's running sum of squared gradients only grows, so the effective learning rate keeps shrinking toward zero. Switch to RMSProp or Adam. If you are training a large language model on a very sparse vocabulary, AdaGrad may actually outperform Adam for certain embedding layers. In practice: start with Adam, check whether other optimizers perform better on your specific problem, and document what you try. The choice of optimizer interacts with learning rate, batch size, and architecture in ways that are problem-specific.
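The AdaGrad accumulation problem is easy to reproduce in isolation. The sketch below is a simplified illustration with assumed names and values (effective_lr_history, a constant gradient of 1, a base rate of 0.1): it tracks only the effective per-weight step size lr / sqrt(accumulator) for a single weight, comparing AdaGrad's ever-growing sum with RMSProp's decaying average.

```python
import numpy as np

def effective_lr_history(grads, lr=0.1, decay=None, eps=1e-8):
    """Effective step size lr / (sqrt(acc) + eps) at each step for one weight.

    decay=None  -> AdaGrad: acc is the running SUM of squared gradients.
    decay=float -> RMSProp: acc is an exponentially decaying AVERAGE.
    """
    acc = 0.0
    history = []
    for g in grads:
        if decay is None:
            acc = acc + g ** 2
        else:
            acc = decay * acc + (1 - decay) * g ** 2
        history.append(lr / (np.sqrt(acc) + eps))
    return np.array(history)

# Assumed scenario: the gradient keeps the same magnitude for 10,000 steps.
grads = np.ones(10_000)

adagrad_lr = effective_lr_history(grads)             # sum grows without bound
rmsprop_lr = effective_lr_history(grads, decay=0.9)  # average levels off

print("AdaGrad first / last:", adagrad_lr[0], adagrad_lr[-1])
print("RMSProp first / last:", rmsprop_lr[0], rmsprop_lr[-1])
```

With a constant gradient, AdaGrad's effective rate falls as 1 / sqrt(step), so after 10,000 steps it is about a hundredth of the base rate, while RMSProp's settles near the base rate. The same mechanism is why AdaGrad can still suit sparse features: a weight that rarely receives a gradient accumulates slowly and keeps a larger step.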