CNNs as Feature Extractors

Here is something remarkable about the convolutional layers of a CNN trained on a large, diverse dataset: the features they learn are not dataset-specific. A model trained on ImageNet (a dataset of everyday objects) learns to detect edges in its early layers, textures in its middle layers, and object parts in its later layers. Because these building blocks (especially the early ones) are important to any visual task, CNNs trained on ImageNet or other large datasets can be applied across tasks and domains.

This means that a CNN trained on ImageNet has already done most of the visual feature engineering for you — for almost any visual task you can imagine. All that remains is adapting the final classification head to your specific problem.

This is the core idea behind transfer learning.

Visualization of CNN feature hierarchy: early layers detect edges, middle layers detect textures and parts, late layers detect whole objects
Learned CNN features generalize across datasets: early layers detect universal visual primitives (edges, gradients), while later layers encode progressively more task-specific concepts.[Source]

Foundation Models

The concept of pre-trained models that serve as a starting point for downstream tasks has recently been rebranded as "foundation models" in the ML community. If you hear that term, it means the same thing: a large model trained on a large dataset that serves as the starting point for downstream tasks. The vocabulary changed; the idea did not.