Backpropagation

Backpropagation — How the Gradient Reaches Every Weight

Gradient descent tells us how to update weights, but how do we compute the gradient for every weight in a large network with millions of parameters? The answer is backpropagation, an elegant application of the chain rule from calculus.

Once you have computed the error at the output layer, you propagate that error backward through the network, computing the contribution of each weight to the final error using the chain rule. For a weight w₁,₁,₂ connecting the first input to the first node in the hidden layer:

∂J/∂w₁,₁,₂ = (∂J/∂a) · (∂a/∂z₃) · (∂z₃/∂a₁₂) · (∂a₁₂/∂z₁₂) · (∂z₁₂/∂w₁,₁,₂)

Each term represents how the error flows backward through one computational step. This formulation allows us to compute gradients efficiently for every weight in the network in a single backward pass — no matter how many layers.
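In code, the backward pass is exactly this chain-rule bookkeeping applied layer by layer. Below is a minimal sketch (not the article's exact notation; all function and variable names are illustrative) for a network with one sigmoid hidden layer, a sigmoid output, and squared-error loss J = ½(ŷ − y)²:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward_backward(x, y, w_hidden, w_out):
    # Forward pass: cache every intermediate value for reuse going backward.
    z_h = [sum(w * xi for w, xi in zip(ws, x)) for ws in w_hidden]
    a_h = [sigmoid(z) for z in z_h]
    z_o = sum(w * a for w, a in zip(w_out, a_h))
    y_hat = sigmoid(z_o)
    loss = 0.5 * (y_hat - y) ** 2

    # Backward pass: a single sweep yields the gradient for every weight.
    d_z_o = (y_hat - y) * y_hat * (1 - y_hat)            # dJ/dz_o
    grad_out = [d_z_o * a for a in a_h]                  # hidden->output weights
    grad_hidden = [[d_z_o * w_out[j] * a_h[j] * (1 - a_h[j]) * xi
                    for xi in x]
                   for j in range(len(a_h))]             # input->hidden weights
    return loss, grad_hidden, grad_out
```

Note that the forward pass caches each zᵢ and aᵢ precisely so the backward sweep can reuse them; a quick finite-difference check (perturb one weight, re-run the forward pass) confirms the analytic gradients.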

Backpropagation — Symbolic Derivation

Chain rule — ∂L/∂w₁

∂L/∂w₁ = (∂L/∂ŷ) · (∂ŷ/∂z₂) · (∂z₂/∂a₁) · (∂a₁/∂z₁) · (∂z₁/∂w₁)

[Network diagram: inputs x₁ and x₂ feed the hidden node a₁ = σ(z₁) through weights w₁ and w₂; weight w₃ connects a₁ to the output ŷ = σ(z₂).]

The chain rule says we can decompose a compound derivative into a product of simpler ones. Each fraction above cancels with the next — notice ∂ŷ, ∂z₂, ∂a₁, ∂z₁ all appear in both numerator and denominator of adjacent terms.

Derive it

We want to know how the loss L changes with each weight. For w₁ (input→hidden):

∂L/∂w₁ = ?

w₁ isn't directly connected to L — it affects z₁, which affects a₁, which affects z₂, which affects ŷ, which affects L. The chain rule lets us multiply those individual sensitivities together.

∂L/∂w₁ = (∂L/∂ŷ) · (∂ŷ/∂z₂) · (∂z₂/∂a₁) · (∂a₁/∂z₁) · (∂z₁/∂w₁)

We'll derive each term one by one, working backward from the loss.
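Once all five factors are in hand, their product can be checked numerically: it should match a finite-difference estimate of ∂L/∂w₁. A sketch with assumed values (x₁ = 1.0, x₂ = 0.5, weights 0.2, 0.4, 0.7, target y = 1.0, sigmoid activations, squared-error loss — none of these numbers come from the article):

```python
import math

sig = lambda z: 1.0 / (1.0 + math.exp(-z))

def loss(w1, w2=0.4, w3=0.7, x1=1.0, x2=0.5, y=1.0):
    a1 = sig(w1 * x1 + w2 * x2)        # hidden activation
    y_hat = sig(w3 * a1)               # output
    return 0.5 * (y_hat - y) ** 2

w1, x1 = 0.2, 1.0
a1 = sig(w1 * x1 + 0.4 * 0.5)
y_hat = sig(0.7 * a1)

# The five chain-rule factors, one per step backward from the loss:
dL_dyhat  = y_hat - 1.0                # dL/d y_hat for squared error
dyhat_dz2 = y_hat * (1 - y_hat)        # sigmoid derivative at the output
dz2_da1   = 0.7                        # = w3
da1_dz1   = a1 * (1 - a1)              # sigmoid derivative at the hidden node
dz1_dw1   = x1
analytic = dL_dyhat * dyhat_dz2 * dz2_da1 * da1_dz1 * dz1_dw1

# Central-difference estimate of dL/dw1 for comparison.
eps = 1e-6
numeric = (loss(w1 + eps) - loss(w1 - eps)) / (2 * eps)
```

The two values agree to within finite-difference error, which is the whole point of the decomposition: each factor is trivial to compute locally, and their product is the gradient we want.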

Step through a complete forward and backward pass on a 2→1→1 network. Each step shows the chain-rule formula being applied with real numbers, and a running panel tracks every value computed so far.
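The same loop the walkthrough describes — forward pass, backward pass, then a gradient-descent update — fits in a few lines for the 2→1→1 network. This is a sketch with assumed numbers (inputs, target, initial weights, and learning rate are all illustrative):

```python
import math

sig = lambda z: 1.0 / (1.0 + math.exp(-z))

def step(w1, w2, w3, x1=1.0, x2=0.5, y=1.0, lr=0.5):
    # Forward pass through the 2->1->1 network.
    z1 = w1 * x1 + w2 * x2
    a1 = sig(z1)
    z2 = w3 * a1
    y_hat = sig(z2)
    L = 0.5 * (y_hat - y) ** 2
    # Backward pass: chain rule, working back from the loss.
    d_z2 = (y_hat - y) * y_hat * (1 - y_hat)   # dL/dz2
    d_w3 = d_z2 * a1
    d_z1 = d_z2 * w3 * a1 * (1 - a1)           # dL/dz1
    d_w1, d_w2 = d_z1 * x1, d_z1 * x2
    # Gradient-descent update on all three weights.
    return L, (w1 - lr * d_w1, w2 - lr * d_w2, w3 - lr * d_w3)

w = (0.2, 0.4, 0.7)
L0, w = step(*w)
L1, w = step(*w)   # loss shrinks after each update
```

Running a few steps shows the loss decreasing monotonically here, which is gradient descent doing its job with the gradients backpropagation supplies.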