Convolutional Layers

What Convolutional Layers Learn

In a CNN, convolutional layers replace the fully connected layers of an MLP for the feature-extraction portion of the network. The key differences:

Local connectivity — Each unit is connected only to a local region of the previous layer, the patch covered by the kernel.
Shared weights — The same kernel is applied everywhere across the input. The number of learnable parameters is determined by the kernel size, not the image size. A 3 × 3 × 3 kernel has only 27 weights, regardless of whether you apply it to a 100 × 100 image or a 1000 × 1000 image.
Multiple filters — We typically use many kernels per layer, each learning to detect a different pattern. If we use 64 kernels, we produce 64 feature maps.
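To make the weight-sharing point concrete, here is a minimal sketch that counts learnable parameters. The helper names `conv_layer_params` and `dense_layer_params` are made up for this illustration, and the sketch assumes square kernels and one bias per filter or unit.

```python
def conv_layer_params(kernel_size: int, in_channels: int,
                      num_filters: int, bias: bool = True) -> int:
    """Learnable parameters in a conv layer with square kernels.

    The image height and width never appear: the same kernel is reused
    at every spatial position.
    """
    weights_per_filter = kernel_size * kernel_size * in_channels
    biases = num_filters if bias else 0
    return weights_per_filter * num_filters + biases


def dense_layer_params(height: int, width: int, channels: int,
                       num_units: int, bias: bool = True) -> int:
    """Learnable parameters in a fully connected layer on the flattened image."""
    inputs = height * width * channels
    return inputs * num_units + (num_units if bias else 0)


# One 3x3x3 kernel: 27 weights, whatever the image size.
single_kernel = conv_layer_params(3, 3, 1, bias=False)

# 64 such kernels, each with its own bias: 27 * 64 + 64 parameters.
layer_64 = conv_layer_params(3, 3, 64)

# For contrast, a dense layer with 64 units on a 100x100 RGB image.
dense_64 = dense_layer_params(100, 100, 3, 64)

print(single_kernel, layer_64, dense_64)  # 27 1792 1920064
```

The convolutional count stays the same for a 1000 × 1000 image, while the dense count grows with every added pixel.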

During training, each filter learns to recognize a specific pattern through backpropagation. Early layers tend to learn simple patterns: edges, oriented lines, color gradients. Deeper layers combine those patterns into more complex representations: corners, textures, object parts, eventually whole objects.


CNNs as Automated Feature Engineering

You can think of the entire stack of convolutional layers in a CNN as an automated feature engineering pipeline. The difference from traditional computer vision is that the features are learned from data, not hand-designed by humans. This is the core reason CNNs outperform traditional approaches on large, complex datasets.

Key Parameters: Filter Size, Stride, and Padding

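Filter Size — Filter dimensions are expressed as F × F × D, where D is the depth (number of channels) of the input. Odd filter sizes (3 × 3, 5 × 5, 7 × 7) are standard because they give the filter a well-defined center pixel, so padding can be applied evenly on all sides. The modern consensus favors stacking many small 3 × 3 filters over using fewer large ones: for the same channel count, two stacked 3 × 3 layers cover the same 5 × 5 region as a single 5 × 5 filter, with fewer weights and an extra nonlinearity in between.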
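One way to check the parameter claim for small filters is to compare stacked 3 × 3 layers against a single larger filter covering the same region. This is a rough sketch under stated assumptions: stride 1 throughout, biases ignored, and every layer mapping 64 channels to 64 channels (an arbitrary choice for illustration). Both function names are hypothetical.

```python
def stacked_receptive_field(kernel_size: int, num_layers: int) -> int:
    """Input region seen by one output unit after stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel_size - 1)


def stacked_weights(kernel_size: int, channels: int, num_layers: int = 1) -> int:
    """Weights (biases ignored) when every layer maps `channels` -> `channels`."""
    return num_layers * kernel_size * kernel_size * channels * channels


channels = 64  # assumed channel count, chosen only for illustration

# Two 3x3 layers see a 5x5 region; three see a 7x7 region.
rf_two = stacked_receptive_field(3, 2)    # 5
rf_three = stacked_receptive_field(3, 3)  # 7

two_3x3 = stacked_weights(3, channels, 2)    # 2 * 9 * 64 * 64
one_5x5 = stacked_weights(5, channels)       # 25 * 64 * 64
three_3x3 = stacked_weights(3, channels, 3)  # 3 * 9 * 64 * 64
one_7x7 = stacked_weights(7, channels)       # 49 * 64 * 64

print(rf_two, two_3x3, one_5x5)      # 5 73728 102400
print(rf_three, three_3x3, one_7x7)  # 7 110592 200704
```

The stacked version also applies a nonlinearity after each layer, which a single large filter does not.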

Stride — How many positions the filter moves at each step. A stride of 1 gives maximum overlap between neighboring patches and the highest output resolution. A stride of 2 moves the filter two positions at a time, so there is less overlap, the output is roughly half the size in each spatial dimension, and less computation is needed.

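Padding — When a filter slides across an image, pixels near the edges are covered fewer times than pixels in the center, and without padding the output is smaller than the input. Padding adds extra pixels (typically zeros) around the edges to compensate. "Same" padding is chosen so that, at stride 1, the output feature map has the same spatial dimensions as the input; for an odd filter size F this means padding by (F − 1) / 2 on each side.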
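The effect of stride and padding is easiest to see by sliding a filter by hand. The following is a minimal pure-Python sketch for a single-channel input, written for clarity rather than speed; `pad2d`, `conv2d`, and `shape` are illustrative helpers, not functions from any library. It computes the sliding dot product (cross-correlation) that deep-learning frameworks call convolution.

```python
def pad2d(image, padding):
    """Surround a 2D list with `padding` rows and columns of zeros."""
    width = len(image[0]) + 2 * padding
    padded = [[0.0] * width for _ in range(padding)]
    for row in image:
        padded.append([0.0] * padding + list(row) + [0.0] * padding)
    padded.extend([0.0] * width for _ in range(padding))
    return padded


def conv2d(image, kernel, stride=1, padding=0):
    """Slide a square kernel over a single-channel image."""
    image = pad2d(image, padding)
    f = len(kernel)
    output = []
    # Each (top, left) is the upper-left corner of one patch.
    for top in range(0, len(image) - f + 1, stride):
        row = []
        for left in range(0, len(image[0]) - f + 1, stride):
            total = 0.0
            for di in range(f):
                for dj in range(f):
                    total += image[top + di][left + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output


def shape(grid):
    return (len(grid), len(grid[0]))


# A 7x7 ramp image and a 3x3 kernel of ones (arbitrary example values).
image = [[float(7 * r + c) for c in range(7)] for r in range(7)]
kernel = [[1.0] * 3 for _ in range(3)]

valid = conv2d(image, kernel)                         # no padding
same = conv2d(image, kernel, padding=1)               # "same" padding for F = 3
strided = conv2d(image, kernel, stride=2, padding=1)  # stride 2

print(shape(valid), shape(same), shape(strided))  # (5, 5) (7, 7) (4, 4)
```

With padding 1 the 7 × 7 input stays 7 × 7, and moving to stride 2 roughly halves each dimension.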


Output Dimension Formula

The output size of a convolutional or pooling layer can be computed with:

\text{Output Dimension} = \left\lfloor \dfrac{\text{Input} - \text{Filter Size} + 2 \times \text{Padding}}{\text{Stride}} \right\rfloor + 1

Example: for an input of 28 × 28 with filter size 3, padding 0, and stride 1, each spatial dimension is ⌊(28 − 3 + 2×0) / 1⌋ + 1 = 26, so the output is 26 × 26.

Getting comfortable with this formula is essential. As you build and debug CNN architectures, you will use it to verify that your dimensions work out at each layer.
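The formula translates directly into a small helper for checking dimensions layer by layer. This is a sketch, with `conv_output_size` a hypothetical name; it assumes square inputs and filters, so one call covers both spatial dimensions. The layer stack traced at the end is an invented example, not an architecture from this course.

```python
def conv_output_size(input_size: int, filter_size: int,
                     padding: int = 0, stride: int = 1) -> int:
    """Output size along one spatial dimension of a conv or pooling layer."""
    if stride < 1:
        raise ValueError("stride must be at least 1")
    if filter_size > input_size + 2 * padding:
        raise ValueError("filter is larger than the padded input")
    # Integer division implements the floor in the formula.
    return (input_size - filter_size + 2 * padding) // stride + 1


# The worked example: 28x28 input, 3x3 filter, no padding, stride 1.
print(conv_output_size(28, 3))  # 26

# Tracing a small invented stack: (filter_size, padding, stride) per layer.
layers = [
    (3, 0, 1),  # 3x3 convolution
    (2, 0, 2),  # 2x2 pooling, stride 2
    (3, 0, 1),  # 3x3 convolution
    (2, 0, 2),  # 2x2 pooling, stride 2
]
size = 28
sizes = [size]
for filter_size, padding, stride in layers:
    size = conv_output_size(size, filter_size, padding, stride)
    sizes.append(size)

print(sizes)  # [28, 26, 13, 11, 5]
```

Note the last step: (11 − 2) / 2 is not a whole number, so the floor drops the final row and column of the input.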

Convolution Parameter Explorer

[Interactive figure: adjust kernel size, padding, and stride to see how output dimensions change. With the default settings (7 × 7 input, 3 × 3 kernel, padding 0, stride 1), the output size is ⌊(7 − 3 + 2×0) / 1⌋ + 1 = 5, giving a 5 × 5 output.]

Checkpoint

A convolutional layer receives an input of shape [32, 32, 3]. It applies 16 filters of size 3×3 with padding=1 and stride=1. What is the output shape?
