Backpropagation — How the Gradient Reaches Every Weight
Gradient descent tells us how to update weights, but how do we compute the gradient for every weight in a large network with millions of parameters? The answer is backpropagation, an elegant application of the chain rule from calculus.
Once you have computed the error at the output layer, you propagate it backward through the network, using the chain rule to work out how much each weight contributed to that error. Consider a weight w₁,₁,₂ connecting the first input to the first node of the hidden layer (layer 2). Writing z₁,₂ and a₁,₂ for that hidden node's pre-activation and activation, z₃ and a for the output node's pre-activation and activation, and L for the loss, the chain rule gives:

∂L/∂w₁,₁,₂ = (∂L/∂a) · (∂a/∂z₃) · (∂z₃/∂a₁,₂) · (∂a₁,₂/∂z₁,₂) · (∂z₁,₂/∂w₁,₁,₂)
Each factor describes how the error flows backward through one computational step. Because all the weights feeding a given layer share the same upstream factors, the gradients for every weight in the network can be computed in a single backward pass that reuses those intermediate products, no matter how many layers there are.
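As a sanity check on that formula, here is a minimal numeric sketch of my own: a 2-input, 2-hidden, 1-output network with made-up weights, sigmoid activations, and a squared-error loss (none of which are specified above). It computes ∂L/∂w₁,₁,₂ as the product of the five factors and compares the result against a finite-difference estimate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up numbers for a 2-input, 2-hidden, 1-output network (no biases).
x  = np.array([0.5, -1.0])           # inputs (layer 1)
W2 = np.array([[0.1, 0.4],           # W2[j, i]: weight from input i to hidden node j (layer 2)
               [-0.3, 0.2]])
W3 = np.array([0.7, -0.5])           # weight from hidden node j to the single output (layer 3)
y  = 1.0                             # target

def forward(W2, W3):
    z2 = W2 @ x                      # hidden pre-activations
    a2 = sigmoid(z2)                 # hidden activations
    z3 = W3 @ a2                     # output pre-activation z3
    a  = sigmoid(z3)                 # output activation a
    L  = 0.5 * (a - y) ** 2          # squared-error loss
    return z2, a2, z3, a, L

z2, a2, z3, a, L = forward(W2, W3)

# The five chain-rule factors for w_{1,1,2} (input 1 -> hidden node 1), i.e. W2[0, 0]:
dL_da     = a - y                    # ∂L/∂a
da_dz3    = a * (1 - a)              # ∂a/∂z3   (sigmoid derivative)
dz3_da12  = W3[0]                    # ∂z3/∂a_{1,2}
da12_dz12 = a2[0] * (1 - a2[0])      # ∂a_{1,2}/∂z_{1,2}
dz12_dw   = x[0]                     # ∂z_{1,2}/∂w_{1,1,2}

grad = dL_da * da_dz3 * dz3_da12 * da12_dz12 * dz12_dw

# Finite-difference check: nudge the same weight and rerun the forward pass.
eps = 1e-6
W2_nudged = W2.copy()
W2_nudged[0, 0] += eps
grad_numeric = (forward(W2_nudged, W3)[-1] - L) / eps

print(grad, grad_numeric)            # the two estimates should agree closely
```

This kind of finite-difference comparison (gradient checking) is a standard way to convince yourself that a hand-derived gradient is correct.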
[Interactive figure: the network, the quantity we want to find, and the chain-rule decomposition of ∂L/∂w₁.]

∂L/∂w₁ = (∂L/∂ŷ) · (∂ŷ/∂z₂) · (∂z₂/∂a₁) · (∂a₁/∂z₁) · (∂z₁/∂w₁)

The chain rule says we can decompose a compound derivative into a product of simpler ones. Each fraction above cancels with the next: notice that ∂ŷ, ∂z₂, ∂a₁, and ∂z₁ each appear once in a numerator and once in a denominator of adjacent terms, so the product telescopes to ∂L/∂w₁.
Derive it
We want to know how the loss L changes with each weight. For w₁ (input→hidden):
w₁ isn't directly connected to L — it affects z₁, which affects a₁, which affects z₂, which affects ŷ, which affects L. The chain rule lets us multiply those individual sensitivities together.
We'll derive each term one by one, working backward from the loss.
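Each of those terms can also be derived and checked symbolically. The sketch below uses SymPy under assumptions the text has not fixed: sigmoid activations at both nodes, a squared-error loss, no biases, and my own names w₂ and v for the other two weights. It derives each factor on its own and confirms that their product equals the derivative SymPy computes directly.

```python
import sympy as sp

# Symbols for the 2→1→1 network: inputs x1, x2, target y,
# input->hidden weights w1, w2, and hidden->output weight v
# (w2 and v are names of my own; the text only names w1).
x1, x2, y, w1, w2, v = sp.symbols('x1 x2 y w1 w2 v')

sigma = lambda t: 1 / (1 + sp.exp(-t))     # assumed activation: sigmoid

z1   = w1 * x1 + w2 * x2                   # hidden pre-activation
a1   = sigma(z1)                           # hidden activation
z2   = v * a1                              # output pre-activation
yhat = sigma(z2)                           # prediction ŷ
L    = sp.Rational(1, 2) * (yhat - y)**2   # assumed loss: squared error

# Derive each factor separately, working backward from the loss.
t = sp.symbols('t')
dL_dyhat  = sp.diff(sp.Rational(1, 2) * (t - y)**2, t).subs(t, yhat)  # ∂L/∂ŷ  = ŷ - y
dyhat_dz2 = sp.diff(sigma(t), t).subs(t, z2)                          # ∂ŷ/∂z2 = σ'(z2)
dz2_da1   = v                                                         # ∂z2/∂a1
da1_dz1   = sp.diff(sigma(t), t).subs(t, z1)                          # ∂a1/∂z1 = σ'(z1)
dz1_dw1   = x1                                                        # ∂z1/∂w1

product = dL_dyhat * dyhat_dz2 * dz2_da1 * da1_dz1 * dz1_dw1
direct  = sp.diff(L, w1)                   # SymPy applies the chain rule end to end

print(sp.simplify(product - direct))       # prints 0: the factored form matches
```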
Step through a complete forward and backward pass on a 2→1→1 network. Each step shows the chain-rule formula being applied with real numbers, and a running panel tracks every value computed so far.
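For readers following along offline, here is a plain-Python sketch of one such pass on the 2→1→1 network, with made-up input, weight, and target values and the same assumptions as above (sigmoid activations, squared-error loss, no biases). It prints every value the running panel would track, ending with ∂L/∂w₁.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Made-up values for the 2→1→1 example (no biases, for brevity).
x1, x2 = 0.5, -1.0     # the two inputs
w1, w2 = 0.8, 0.2      # input -> hidden weights (w1 is the one we differentiate)
v      = 1.5           # hidden -> output weight
y      = 1.0           # target

# Forward pass.
z1   = w1 * x1 + w2 * x2        # hidden pre-activation
a1   = sigmoid(z1)              # hidden activation
z2   = v * a1                   # output pre-activation
yhat = sigmoid(z2)              # prediction ŷ
L    = 0.5 * (yhat - y) ** 2    # squared-error loss

# Backward pass: each chain-rule factor, working backward from the loss.
dL_dyhat  = yhat - y            # ∂L/∂ŷ
dyhat_dz2 = yhat * (1 - yhat)   # ∂ŷ/∂z2  (sigmoid derivative)
dz2_da1   = v                   # ∂z2/∂a1
da1_dz1   = a1 * (1 - a1)       # ∂a1/∂z1
dz1_dw1   = x1                  # ∂z1/∂w1

dL_dw1 = dL_dyhat * dyhat_dz2 * dz2_da1 * da1_dz1 * dz1_dw1

for name, value in [("z1", z1), ("a1", a1), ("z2", z2), ("ŷ", yhat), ("L", L),
                    ("∂L/∂ŷ", dL_dyhat), ("∂ŷ/∂z2", dyhat_dz2), ("∂z2/∂a1", dz2_da1),
                    ("∂a1/∂z1", da1_dz1), ("∂z1/∂w1", dz1_dw1), ("∂L/∂w1", dL_dw1)]:
    print(f"{name:>8} = {value:+.4f}")

# One gradient-descent step on this weight would then be:
#   w1 = w1 - learning_rate * dL_dw1
```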