The Learning Rate
Interactive demo: try different learning rates and notice how a large value causes the optimizer to overshoot, while a small one converges slowly.
The learning rate η controls the size of your steps down the mountain. It is the single most important hyperparameter in training a neural network.
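Concretely, each step of gradient descent updates the parameters as θ ← θ − η ∇L(θ): the gradient of the loss L picks the direction, and η decides how far you move along it.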
- Too large: You overshoot the minimum, bouncing back and forth across the valley. The loss oscillates wildly instead of decreasing.
- Too small: You take minuscule steps and training takes an eternity.
- Well-tuned: You converge efficiently, neither overshooting nor crawling.
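Here is a minimal sketch of these three regimes, assuming plain gradient descent on the one-dimensional toy loss L(θ) = θ² (the specific learning rates and step count are illustrative choices):

```python
# Gradient descent on the toy loss L(theta) = theta**2, whose gradient is 2*theta.
# The minimum is at theta = 0; watch how each learning rate behaves.

def descend(eta, theta=1.0, steps=10):
    """Return the trajectory of theta under updates theta <- theta - eta * grad."""
    trajectory = [theta]
    for _ in range(steps):
        theta = theta - eta * 2 * theta  # gradient of theta**2 is 2*theta
        trajectory.append(theta)
    return trajectory

for eta in (1.1, 0.01, 0.3):  # too large, too small, well-tuned
    print(f"eta={eta:<4} final theta={descend(eta)[-1]:+.4f}")

# eta=1.1 : each step multiplies theta by -1.2, so it overshoots and diverges
# eta=0.01: each step multiplies theta by 0.98, so progress is glacial
# eta=0.3 : each step multiplies theta by 0.4, so theta decays quickly to 0
```

Because the gradient of θ² is 2θ, each update multiplies θ by (1 − 2η), which makes the regimes easy to read off: when |1 − 2η| > 1 the iterates grow and the loss diverges, when it is barely below 1 progress is glacial, and when it is near 0 convergence is fast.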
Imagine you are lost in the mountains at dusk and need to reach the bottom before dark. You have a device that measures the slope beneath your feet, but each reading is only valid where you stand. Walk a few inches between readings (a tiny learning rate) and you shuffle along forever, never reaching the trailhead. Walk a mile between readings (a large learning rate) and you charge off in what was the right direction a mile ago, veer off course, and have to backtrack.

Typical starting values are between 0.001 and 0.01, but this varies by model and dataset. Expect to tune it, or use an adaptive optimizer such as Adam, which adjusts the effective step size as training progresses.
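What tuning looks like in practice, sketched on the same toy quadratic: sweep a grid of candidate learning rates and keep the one with the lowest final loss (the log-spaced grid and step count here are illustrative assumptions):

```python
# Sweep a log-spaced grid of learning rates on L(theta) = theta**2
# and keep the one that reaches the lowest final loss.

def final_loss(eta, theta=1.0, steps=50):
    """Final loss after `steps` gradient descent updates from theta."""
    for _ in range(steps):
        theta -= eta * 2 * theta  # gradient of theta**2 is 2*theta
    return theta ** 2

candidates = [10.0 ** e for e in range(-4, 1)]  # 1e-4, 1e-3, ..., 1e0
best = min(candidates, key=final_loss)
print(f"best eta on this grid: {best}")  # prints 0.1 for this toy problem
```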
During training, you notice that your loss fluctuates wildly, swinging up and down without converging. What is the most likely cause, and what would you change?