U-Net
U-Net was originally designed for biomedical image segmentation — specifically, to track individual cells under a microscope. It has since become one of the most widely used segmentation architectures across domains.
The architecture gets its name from the shape of its standard diagram, which traces out a U:
- Left arm (Contracting Path / Encoder) — A series of convolutional and max pooling blocks that successively reduce spatial dimensions while increasing the number of feature channels. This part of the network learns "what is in the image at a high level" — it captures context.
- Bottom of the U — The bottleneck, where spatial dimensions are smallest and feature channels are most numerous. This is the most abstract representation.
- Right arm (Expanding Path / Decoder) — A series of upsampling operations followed by convolutions, progressively restoring spatial resolution while reducing channel count. This part of the network asks "where exactly does each semantic class appear?"
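
To make the three parts of the U concrete, the sketch below traces how spatial size and channel count could change from stage to stage. It is an illustration under stated assumptions, not the published configuration: the function name `unet_shape_trace`, the 256 × 256 input, the base width of 64 channels, the depth of 4, and the use of size-preserving ("same"-padded) convolutions are all choices made here for simplicity.

```python
def unet_shape_trace(height, width, base_channels=64, depth=4):
    """Return (stage, height, width, channels) for each stage of a U-shaped network.

    Assumptions (illustrative, not taken from the text above): convolutions
    preserve spatial size, each pooling step halves height and width, each
    upsampling step doubles them, and channels double per encoder stage.
    """
    if height % (2 ** depth) or width % (2 ** depth):
        raise ValueError("height and width must be divisible by 2**depth")

    trace = []
    h, w, c = height, width, base_channels

    # Left arm: spatial size shrinks, channel count grows.
    for i in range(depth):
        trace.append((f"encoder {i + 1}", h, w, c))
        h, w, c = h // 2, w // 2, c * 2

    # Bottom of the U: smallest spatial size, most channels.
    trace.append(("bottleneck", h, w, c))

    # Right arm: spatial size grows back, channel count shrinks.
    for i in range(depth, 0, -1):
        h, w, c = h * 2, w * 2, c // 2
        trace.append((f"decoder {i}", h, w, c))

    return trace


for stage, h, w, c in unet_shape_trace(256, 256):
    print(f"{stage:<11} {h:>3} x {w:>3} x {c}")
```

Under these assumptions, decoder stage i ends up with the same spatial size and channel count as encoder stage i, which is what makes the two arms symmetric.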

U-Net Architecture
U-Net pairs a contracting encoder path with a symmetric expanding decoder path. Skip connections bridge the two halves, giving the decoder direct access to the encoder's high-resolution feature maps.
The design targets biomedical image segmentation, where annotated training data is scarce and precise boundary delineation is critical. The encoder–decoder pattern with skip connections has since become foundational for segmentation in many other domains.
Image source: Ronneberger et al., 2015
What Skip Connections Enable
The encoder naturally loses spatial precision as it compresses the image. To restore that precision in the decoder, U-Net uses skip connections: direct connections from each encoder stage to the corresponding decoder stage. The encoder's feature maps are concatenated with the decoder's upsampled maps, giving the decoder access to the fine-grained spatial detail that would otherwise be lost.
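
The concatenation step can be sketched with NumPy arrays standing in for feature maps. This is a simplified illustration, not the original implementation: nearest-neighbour upsampling stands in for whatever learned upsampling a real network uses, and the helper names (`upsample_nearest_2x`, `merge_skip`), the channels-first `(C, H, W)` layout, and the shapes are assumptions chosen here.

```python
import numpy as np


def upsample_nearest_2x(x):
    """Double the height and width of a (C, H, W) array by repeating each pixel."""
    return x.repeat(2, axis=1).repeat(2, axis=2)


def merge_skip(decoder_features, encoder_features):
    """Upsample the decoder features, then concatenate the encoder's along channels."""
    upsampled = upsample_nearest_2x(decoder_features)
    if upsampled.shape[1:] != encoder_features.shape[1:]:
        raise ValueError("spatial sizes must match before concatenation")
    # Channels from both sources sit side by side; the convolutions that
    # follow can draw on the encoder's fine detail and the decoder's context.
    return np.concatenate([encoder_features, upsampled], axis=0)


rng = np.random.default_rng(0)
encoder_features = rng.standard_normal((64, 32, 32))  # high resolution, saved from the encoder
decoder_features = rng.standard_normal((64, 16, 16))  # low resolution, coming up the decoder

merged = merge_skip(decoder_features, encoder_features)
print(merged.shape)  # (128, 32, 32)
```

Concatenation keeps the encoder's features intact next to the upsampled ones rather than blending them, so the following convolutions are free to learn how to combine the two.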
The final layer applies a 1 × 1 convolution that maps each pixel's feature vector to one score per output class, followed by a pixel-wise softmax that converts those scores into class probabilities, independently for each pixel.
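
A 1 × 1 convolution is the same linear map applied at every pixel, so the output layer can be sketched in a few lines of NumPy. The function name `classify_pixels`, the random weights, the 64 input channels, and the two output classes are illustrative assumptions; in a trained network the weights would be learned.

```python
import numpy as np


def classify_pixels(features, weights, bias):
    """Apply a 1x1 convolution, then a softmax over classes at each pixel.

    features: (C, H, W) feature map
    weights:  (K, C) one weight row per output class
    bias:     (K,)
    returns:  (K, H, W) class probabilities
    """
    # 1x1 convolution: mix channels at each pixel, never across neighbouring pixels.
    logits = np.einsum("kc,chw->khw", weights, features) + bias[:, None, None]
    # Subtract the per-pixel maximum for numerical stability.
    logits = logits - logits.max(axis=0, keepdims=True)
    exp = np.exp(logits)
    return exp / exp.sum(axis=0, keepdims=True)


rng = np.random.default_rng(0)
features = rng.standard_normal((64, 8, 8))
weights = rng.standard_normal((2, 64)) * 0.1  # hypothetical weights for 2 classes
bias = np.zeros(2)

probs = classify_pixels(features, weights, bias)
mask = probs.argmax(axis=0)  # most likely class at each pixel
print(probs.shape, mask.shape)  # (2, 8, 8) (8, 8)
```

Taking the argmax over the class axis turns the probability maps into a segmentation mask with one class label per pixel.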
Real World: Cell Tracking Challenge
U-Net originated in the context of tracking individual cells through a sequence of microscope images, which means identifying which pixels belong to which cell at each time step. The task requires both semantic understanding (this is a cell) and spatial precision (exactly these pixels lie within this cell's boundary). The encoder supplies the first and the skip connections preserve the second, so U-Net delivers both.

Why are skip connections necessary in U-Net? What problem do they solve?