Semantic vs. Instance Segmentation
Semantic segmentation is the most granular computer vision task in the standard toolkit. Rather than predicting a label for a bounding box or for the whole image, semantic segmentation predicts a label for every single pixel. The output is a mask, an array with the same spatial dimensions as the input image, where each value is a class index.
The distinction between semantic and instance segmentation matters:
- Semantic segmentation: two people in an image both receive the label "person." Their pixels are indistinguishable by class.
- Instance segmentation: Person A gets a separate mask from Person B. Each object instance is individually labeled.
Instance segmentation is harder: it requires detecting not just class membership but individual object identity. Mask R-CNN is the canonical architecture for this task.
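The difference is easiest to see in the data structures. The sketch below is a toy illustration in NumPy, not the output format of any particular model: the class ids (0 for background, 1 for person), the 4 × 6 scene, and all variable names are assumptions made for the example.

```python
import numpy as np

# Semantic segmentation: one mask, one class index per pixel.
# Assumed class ids for this toy scene: 0 = background, 1 = person.
semantic_mask = np.array([
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 1],
    [0, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
])

# Instance segmentation: one boolean mask per object, plus a class per object.
person_a = np.zeros(semantic_mask.shape, dtype=bool)
person_a[1:3, 1:3] = True
person_b = np.zeros(semantic_mask.shape, dtype=bool)
person_b[1:3, 4:6] = True

instance_masks = np.stack([person_a, person_b])  # shape (N, H, W)
instance_classes = np.array([1, 1])              # both instances are "person"

# Collapsing the instances recovers the semantic mask, but not vice versa:
# the semantic mask alone cannot say where Person A ends and Person B begins.
collapsed = np.zeros_like(semantic_mask)
for mask, cls in zip(instance_masks, instance_classes):
    collapsed[mask] = cls

print(np.array_equal(collapsed, semantic_mask))
```

Going from instances to a semantic mask only discards information, which is one way to see why instance segmentation is the harder task.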
Data flow
In semantic segmentation, every pixel is assigned a class label. The output is a single mask the same size as the input, where each value is an integer class index. Two people in the same scene get identical labels, so their pixels are indistinguishable.
Input shape: [H × W × 3]
Output shape: [H × W × 1]
Output fields: one class label per pixel, stored as an integer from 0 to C–1, where C is the number of classes
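The shapes above can be checked with a small NumPy sketch. It assumes the model produces a per-pixel score for each class and that the mask is taken as the highest-scoring class at each pixel; the random `scores` array is a stand-in for real model output, and `H`, `W`, and `C` are arbitrary values chosen for illustration.

```python
import numpy as np

H, W, C = 4, 5, 3  # hypothetical image height, width, and number of classes
rng = np.random.default_rng(0)

# Input: an RGB image of shape [H x W x 3].
image = rng.integers(0, 256, size=(H, W, 3), dtype=np.uint8)

# Stand-in for model output: one score per class at every pixel, [H x W x C].
scores = rng.random((H, W, C))

# Pick the best class per pixel, then keep a trailing axis of size 1
# so the mask has shape [H x W x 1].
mask = scores.argmax(axis=-1)[..., np.newaxis]

print(image.shape, mask.shape)
```

Every value in `mask` is an integer class index between 0 and C–1, and the mask's height and width match the input image.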
From the Case Files: Segmentation Is Harder Than It Looks
For a computer vision project in the interior design space, I needed to segment furniture from room images as a processing step before downstream tasks. What looked simple in testing (just draw a mask around the sofa) became a real challenge in deployment. The model would leave behind sofa legs, miss partially occluded cushions, and struggle with edge cases. Segmentation is challenging!