The Learning Rate
Interactive demo: try different learning rates and notice how a large value causes the optimizer to overshoot, while a small one converges slowly.
The learning rate η controls the size of your steps down the mountain. It is the single most important hyperparameter in training a neural network.
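Concretely, each step of gradient descent updates the parameters as θ ← θ − η ∇L(θ): the gradient of the loss L picks the direction, and η decides how far you move along it.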
- Too large: You overshoot the minimum, bouncing back and forth across the valley. The loss oscillates wildly instead of decreasing.
- Too small: You take minuscule steps and training takes an eternity.
- Well-tuned: You converge efficiently, neither overshooting nor crawling.
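Here is a minimal sketch of these three regimes, assuming plain gradient descent on the one-dimensional toy loss L(θ) = θ² (the specific learning rates and step count are illustrative choices):

```python
# Gradient descent on the toy loss L(theta) = theta**2, whose gradient is 2*theta.
# The minimum is at theta = 0; watch how each learning rate behaves.

def descend(eta, theta=1.0, steps=10):
    """Return the trajectory of theta under updates theta <- theta - eta * grad."""
    trajectory = [theta]
    for _ in range(steps):
        theta = theta - eta * 2 * theta  # gradient of theta**2 is 2*theta
        trajectory.append(theta)
    return trajectory

for eta in (1.1, 0.01, 0.3):  # too large, too small, well-tuned
    print(f"eta={eta:<4} final theta={descend(eta)[-1]:+.4f}")

# eta=1.1 : each step multiplies theta by -1.2, so it overshoots and diverges
# eta=0.01: each step multiplies theta by 0.98, so progress is glacial
# eta=0.3 : each step multiplies theta by 0.4, so theta decays quickly to 0
```

Because the gradient of θ² is 2θ, each update multiplies θ by (1 − 2η), which makes the regimes easy to read off: when |1 − 2η| > 1 the iterates grow and the loss diverges, when it is barely below 1 progress is glacial, and when it is near 0 convergence is fast.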
Imagine you are lost in the mountains at dusk and need to reach the bottom before dark. You have a device that measures the slope beneath your feet, but each reading is only valid where you stand. Walk a few inches between readings (a tiny learning rate) and you shuffle along forever, never reaching the trailhead. Walk a mile between readings (a large learning rate) and you charge off in what was the right direction a mile ago, veer off course, and have to backtrack.

Typical starting values are between 0.001 and 0.01, but this varies by model and dataset. Expect to tune it, or use an adaptive optimizer such as Adam, which adjusts the effective step size as training progresses.
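What tuning looks like in practice, sketched on the same toy quadratic: sweep a grid of candidate learning rates and keep the one with the lowest final loss (the log-spaced grid and step count here are illustrative assumptions):

```python
# Sweep a log-spaced grid of learning rates on L(theta) = theta**2
# and keep the one that reaches the lowest final loss.

def final_loss(eta, theta=1.0, steps=50):
    """Final loss after `steps` gradient descent updates from theta."""
    for _ in range(steps):
        theta -= eta * 2 * theta  # gradient of theta**2 is 2*theta
    return theta ** 2

candidates = [10.0 ** e for e in range(-4, 1)]  # 1e-4, 1e-3, ..., 1e0
best = min(candidates, key=final_loss)
print(f"best eta on this grid: {best}")  # prints 0.1 for this toy problem
```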
During training, you notice that your loss fluctuates wildly, swinging up and down without converging. What is the most likely cause, and what would you change?