Gradient Descent

Imagine you are on the Amazing Race and you need to reach the finish line at the bottom of the mountain before dark. You (the parameters) are dropped at a random spot on the mountain. You need to minimize the loss function (the distance to the finish line) by taking steps downhill. Backpropagation tells you which direction to go; gradient descent takes a step in that direction. The size of each step is the learning rate.

Now that we can measure how wrong we are, we need a way to systematically become less wrong. That mechanism is gradient descent.

Imagine your loss function as a landscape of hills and valleys. Your weights determine your position in this landscape. You want to reach the lowest valley — the minimum loss. Gradient descent is how you navigate there.

At each step, you compute the gradient of the loss with respect to the weights — essentially, "which direction is uphill from here?" — and take a step in the opposite direction, because you want to go down.

wⱼ(t+1) = wⱼ(t) − η · ∂J/∂wⱼ
  • wⱼ(t) is the current weight
  • η (eta) is the learning rate
  • ∂J/∂wⱼ is the gradient of the loss with respect to that weight
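
To make the update rule concrete, here is a minimal sketch in plain Python. The one-weight quadratic objective and the names grad, eta, and w are illustrative choices, not taken from any particular library:

```python
# Toy objective: J(w) = (w - 3)^2, whose gradient is dJ/dw = 2(w - 3).
def grad(w):
    return 2 * (w - 3.0)

w = 10.0    # arbitrary starting weight
eta = 0.1   # learning rate (step size)

for step in range(50):
    w = w - eta * grad(w)   # w(t+1) = w(t) - eta * dJ/dw

print(w)    # approaches 3.0, the minimum of J
```

Each iteration moves w a small step against the gradient. With a reasonable learning rate the weight slides toward the minimum; with one that is too large it can overshoot and diverge.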

Gradient Descent Variants

  • Stochastic Gradient Descent (SGD): Update weights after every single training example. Slow in practice because updates cannot be vectorized across examples, but the noise introduced by individual samples has a regularizing effect.
  • Batch Gradient Descent: Compute the gradient over the entire training set before updating. Mathematically clean, but each update requires a full pass over the data, making it memory-intensive and slow to converge on large datasets.
  • Mini-Batch Gradient Descent: The method of choice in practice. Divide the training data into subsets (batches) of, say, 32 or 128 examples. Update weights after each batch. Gets the best of both worlds: vectorized operations for efficiency and sufficient noise for regularization. When someone says they are "training with gradient descent," they almost certainly mean mini-batch.
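
As a sketch of what a mini-batch loop looks like in practice, here is a small linear-regression example in Python with NumPy. The synthetic data, the mean-squared-error loss, and the batch size of 32 are illustrative assumptions, not a prescription:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                    # 1000 examples, 5 features
true_w = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ true_w + 0.1 * rng.normal(size=1000)      # noisy linear targets

w = np.zeros(5)
eta, batch_size, epochs = 0.05, 32, 20

for epoch in range(epochs):
    perm = rng.permutation(len(X))                # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        # Gradient of mean squared error computed over this mini-batch only
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)
        w -= eta * grad                           # one update per batch

print(w)  # should land close to true_w
```

Each update sees only 32 examples, so the gradient is a noisy estimate of the full-batch gradient; that noise is exactly the mild regularization mentioned above.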

Real-World Application: Batch Size

Batch size is a hyperparameter that affects both training speed and model quality in subtle ways. Larger batches train faster per epoch (more parallelism on GPUs) but can generalize worse. Smaller batches are slower but often produce models that generalize better, partly because the noise in gradient estimates acts as a regularizer. Common batch sizes are 32, 64, 128, or 256. In production environments, batch size is often tuned carefully alongside learning rate.

💭 Reflection

Explain the difference between stochastic, batch, and mini-batch gradient descent. Under what circumstances would you choose each?

Momentum

Momentum is an extension to gradient descent that adds inertia. It accumulates a fraction of the previous weight update and adds it to the current update. This smooths out the training trajectory, and helps the optimizer roll through shallow local minima rather than getting stuck. Think of a ball rolling downhill — it has momentum that carries it through small bumps in the terrain.
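
Here is a minimal sketch of the momentum update, reusing the toy quadratic from earlier; the momentum coefficient of 0.9 is a common default rather than something prescribed by the text:

```python
# Toy objective: J(w) = (w - 3)^2, gradient dJ/dw = 2(w - 3).
def grad(w):
    return 2 * (w - 3.0)

w = 10.0
velocity = 0.0
eta, beta = 0.1, 0.9   # learning rate and momentum coefficient

for step in range(200):
    velocity = beta * velocity + eta * grad(w)  # keep a fraction of the previous update
    w = w - velocity                            # step using the accumulated velocity

print(w)  # converges to 3.0, carried along by the accumulated velocity
```

The velocity term is a running, decaying sum of past gradients, which is what lets the optimizer coast over small bumps in the loss surface instead of reacting to every local wiggle.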