Backpropagation Through Time and the Vanishing Gradient

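RNN training is ordinary backpropagation, applied to the network unrolled over time. You sum the loss over all time steps, then use the chain rule to propagate gradients backward through the unrolled chain, from the last step to the first. Because the same weights are reused at every step, each weight's gradient is the sum of its contributions from all time steps. This is Backpropagation Through Time (BPTT).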
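To make the chain rule over the unrolled network concrete, here is a sketch for one common case. It assumes a vanilla RNN with a tanh activation; the symbols (W_hh, W_xh, b, h_t, L_t) are illustrative names chosen here, not notation taken from the text.

```latex
% Assumed vanilla RNN; symbol names are illustrative.
h_t = \tanh\left(W_{hh} h_{t-1} + W_{xh} x_t + b\right), \qquad
L = \sum_{t=1}^{T} L_t

% Gradient of the step-t loss with respect to an earlier hidden state h_k (k < t).
% The matrix factors are ordered from i = t down to i = k+1, left to right,
% and h_i^2 denotes the elementwise square.
\frac{\partial L_t}{\partial h_k}
  = \frac{\partial L_t}{\partial h_t}
    \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}},
\qquad
\frac{\partial h_i}{\partial h_{i-1}}
  = \operatorname{diag}\left(1 - h_i^{2}\right) W_{hh}
```

The product has t - k factors, one per time step separating the two hidden states, so its length grows with the distance the gradient has to travel.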

It works. But for long sequences, the chain of multiplied gradient factors gets very long. Each factor is the derivative of one hidden state with respect to the previous one, and with saturating activations such as tanh or sigmoid its magnitude is often less than 1. Multiply enough of them together, and the gradient shrinks exponentially toward zero. This is the vanishing gradient problem, and it means plain RNNs struggle to learn long-range dependencies. An input from twenty time steps ago has almost no influence on the weight update for the current loss, because its gradient signal has all but vanished.
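The shrinking product is easy to see numerically. The sketch below is a toy, not a real training setup: it assumes a scalar RNN of the form h_t = tanh(w * h_{t-1} + x_t), and the function name, the weight 0.9, and the constant input 0.5 are all illustrative choices made here.

```python
import math

def gradient_through_time(w, inputs, h0=0.0):
    """Return |d h_t / d h_0| after each step of a scalar tanh RNN.

    Assumed toy model: h_t = tanh(w * h_{t-1} + x_t).
    """
    h = h0
    grad = 1.0  # running product of the per-step derivatives
    magnitudes = []
    for x in inputs:
        h = math.tanh(w * h + x)
        # Local derivative d h_t / d h_{t-1} = w * (1 - h_t^2)
        grad *= w * (1.0 - h * h)
        magnitudes.append(abs(grad))
    return magnitudes

# Illustrative values: 50 steps, recurrent weight slightly below 1.
mags = gradient_through_time(w=0.9, inputs=[0.5] * 50)
for t in (1, 10, 20, 50):
    print(f"step {t:2d}: |dh_t/dh_0| = {mags[t - 1]:.3e}")
```

With |w| below 1, every factor is below 1 in magnitude (tanh's derivative never exceeds 1), so the printed values can only decrease as t grows.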

Vanishing vs. Exploding Gradients

The vanishing gradient problem is the more common failure in practice. But the opposite, exploding gradients, also occurs: when the per-step factors are larger than 1 in magnitude, the gradient grows exponentially with sequence length instead of shrinking. The standard fix for exploding gradients is gradient clipping: if the gradient norm exceeds a threshold, scale the gradient down so that its norm equals the threshold, leaving its direction unchanged. The threshold is a standard hyperparameter in RNN training.
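Clipping by norm takes only a few lines. The following is a minimal sketch, assuming the gradients have been flattened into a single list of floats; the function name and the threshold value are hypothetical. In practice you would normally use your framework's built-in utility, for example `torch.nn.utils.clip_grad_norm_` in PyTorch.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale grads so their L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        # Already within the threshold: leave the gradient untouched.
        return list(grads), total_norm
    scale = max_norm / total_norm  # same factor for every component
    return [g * scale for g in grads], total_norm

# Illustrative gradient with L2 norm 5, clipped to a threshold of 1.
grads = [3.0, 4.0]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
print(norm_before, clipped)
```

Because every component is multiplied by the same factor, the update keeps its direction and only its step size is capped.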