The Problem
Neural networks — multilayer perceptrons with fully connected layers — are powerful general-purpose function approximators. So what happens when we just throw raw image pixels at one?
Consider a modest image: 228 × 228 pixels with three color channels. That gives us 228 × 228 × 3 = 155,952 input values. A single fully connected layer with 1,000 neurons would require 155,952 × 1,000 = 155,952,000 weights, or roughly 156 million, just for the first layer!
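The arithmetic above is easy to check and to vary. The following is a minimal sketch in plain Python. The helper name `dense_param_count` is made up for this illustration, and the optional bias term (one per neuron) is an assumption, since the figure in the text counts weights only.

```python
def dense_param_count(height, width, channels, neurons, bias=True):
    """Count the parameters of one fully connected layer fed a flattened image.

    Hypothetical helper for illustration; not taken from any library.
    """
    inputs = height * width * channels   # every pixel value is its own input
    weights = inputs * neurons           # one weight per (input, neuron) pair
    biases = neurons if bias else 0      # assumption: one bias per neuron
    return weights + biases


# The example from the text: 228 x 228 RGB image, 1,000 neurons, weights only.
print(dense_param_count(228, 228, 3, 1000, bias=False))  # 155952000

# Vary the image size and layer width to see how the count scales.
for side in (32, 64, 128, 228):
    for neurons in (100, 1000):
        total = dense_param_count(side, side, 3, neurons)
        print(f"{side}x{side}x3 -> {neurons} neurons: {total:,} parameters")
```

The count grows with the product of the number of input values and the layer width, so doubling the side length of the image quadruples the number of first-layer weights.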
A first layer this large is a disaster for several reasons. Training that many parameters requires enormous amounts of data and compute. The model is highly prone to overfitting. And there is a deeper problem: a fully connected layer treats every pixel as an independent input, with no concept of spatial structure. It does not know that neighboring pixels are likely to be related. It does not know that the same pattern appearing in the top-left corner and the bottom-right corner of an image should probably be recognized as the same thing.
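One way to see that a fully connected layer has no built-in notion of spatial structure is to shuffle the pixels. The sketch below uses plain Python with small, arbitrarily chosen integer weights (toy data, not from the text). It applies the same fixed permutation to the input pixels and to each neuron's weights, and the outputs come out identical. The names `dense_forward` and `permute` are hypothetical helpers.

```python
import random


def dense_forward(weights, pixels):
    """Outputs of a bias-free fully connected layer: one dot product per neuron."""
    return [sum(w * x for w, x in zip(row, pixels)) for row in weights]


def permute(values, order):
    """Reorder a list so that position i holds values[order[i]]."""
    return [values[j] for j in order]


rng = random.Random(0)        # fixed seed; the data is arbitrary toy data
n_pixels, n_neurons = 16, 3   # e.g. a flattened 4 x 4 grayscale image
pixels = [rng.randint(0, 255) for _ in range(n_pixels)]
weights = [[rng.randint(-5, 5) for _ in range(n_pixels)] for _ in range(n_neurons)]

order = list(range(n_pixels))
rng.shuffle(order)            # scramble the pixel positions

original = dense_forward(weights, pixels)
scrambled = dense_forward([permute(row, order) for row in weights],
                          permute(pixels, order))

print(original == scrambled)  # True: the layer cannot tell the two layouts apart
```

Because any scrambling of pixel positions can be absorbed into the weights, nothing in the architecture favors neighboring pixels over distant ones. Any spatial regularity has to be learned from the data.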
The Parameter Count Contrast
A CNN layer for the same 228 × 228 × 3 image can get away with as few as 280 parameters, more than five orders of magnitude fewer than the fully connected layer's 155,952,000 weights. How? Through two ideas: local connectivity and shared weights. Both emerge naturally from the operation of convolution.
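The figure of 280 is consistent with, for example, ten 3 × 3 filters that span all three color channels and carry one bias each. That configuration is an assumption here, since the text does not spell it out. A minimal sketch with a hypothetical helper, `conv_param_count`:

```python
def conv_param_count(kernel_h, kernel_w, in_channels, filters, bias=True):
    """Count the parameters of one 2D convolutional layer.

    Hypothetical helper for illustration; not taken from any library.
    """
    per_filter = kernel_h * kernel_w * in_channels  # local connectivity: a small window
    if bias:
        per_filter += 1                             # one bias per filter
    return per_filter * filters                     # shared weights: no image-size term


conv = conv_param_count(3, 3, 3, 10)  # assumed configuration: ten 3 x 3 filters, RGB input
dense = 228 * 228 * 3 * 1000          # fully connected weights from the text

print(conv)           # 280
print(dense // conv)  # 556971
```

The image height and width never appear in `conv_param_count`. Because the same small filter is reused at every position, the layer needs the same 280 parameters whether the image is 228 × 228 or far larger.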