Convolutional Neural Network
Putting It All Together
A complete CNN for image classification has three sections:
- Feature Extraction — A stack of alternating convolutional layers and pooling layers. This part of the network learns to detect features at progressively higher levels of abstraction: edges → textures → shapes → object parts → objects.
- Flattening — The 3D output of the last pooling layer (height × width × channels) is reshaped into a 1D vector. This bridges the spatial feature maps to the classifier.
- Classification Head — One or more fully connected (dense) layers, ending with a softmax activation that outputs a probability distribution over the classes.
Input Image
Raw pixels enter the network
8 × 8 grayscale — dark = pixel on, light = pixel off
The network receives an 8 × 8 grayscale image — 64 pixel values, each between 0 and 1. Here it's a simplified digit "1" — the goal is to classify it as a 0 or a 1. In a real network this might be 224 × 224 × 3 (height × width × RGB channels).
Trace a single image through a small CNN layer by layer. Each step shows the feature maps produced and how dimensions change at each conv and pooling layer.
Translation Invariance: Why CNNs "Get" Objects
Here is the payoff for shared weights: because the same filter is applied at every position in the image, the network can detect the same feature regardless of where it appears. A basketball in the top-right corner activates the same filter as a basketball in the bottom-left corner. This property is called translation invariance, and it is exactly what you want in an object recognition system.
Real World: Self-Driving Cars
A stop sign is a stop sign whether it appears at the left edge of the camera frame or the center. A pedestrian detection model needs to fire regardless of where the pedestrian is in the image. CNNs deliver this naturally because of their shared weight structure.
Backpropagation in Convolutional Layers
Training a CNN uses the same backpropagation algorithm as any other neural network. The only difference is that instead of gradient computations involving simple multiplication, they involve convolution. The chain rule still applies. Gradients still flow backward through the network. The kernel weights are still updated to minimize the loss.
This is a reassuring continuity: once you understand backpropagation in a fully connected network, you already understand it in principle for CNNs.
In what way is a CNN's learned feature hierarchy similar to the hand-crafted features of traditional computer vision, and in what way is it fundamentally different?