Semantic vs. Instance Segmentation

Semantic segmentation is the most granular computer vision task in the standard toolkit. Rather than predicting a label for a bounding box or for the whole image, semantic segmentation predicts a label for every single pixel. The output is a mask, an array with the same spatial dimensions as the input image, where each value is a class index.

The distinction between semantic and instance segmentation matters:

  • Semantic segmentation: two people in an image both receive the label "person." Their pixels are indistinguishable by class.
  • Instance segmentation: Person A gets a separate mask from Person B. Each object instance is individually labeled.

Instance segmentation is harder: it requires detecting not just class membership but individual object identity. Mask R-CNN is the canonical architecture for this task.
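
The difference is easiest to see in the data structures. The sketch below is illustrative only: it assumes NumPy, a hypothetical class index of 1 for "person", and an instance output stored as a stack of per-object binary masks (one common convention, not the only one). It collapses two instance masks into a single semantic mask, which shows how the identity of Person A and Person B disappears.

```python
import numpy as np

PERSON = 1  # hypothetical class index; 0 is treated as background

# Instance-style output: one binary mask per detected object, shape [N, H, W].
instance_masks = np.zeros((2, 4, 6), dtype=bool)
instance_masks[0, 1:4, 0:2] = True  # Person A occupies the left columns
instance_masks[1, 1:4, 4:6] = True  # Person B occupies the right columns
instance_classes = np.array([PERSON, PERSON])


def to_semantic(masks, classes, height, width):
    """Collapse per-instance masks into one class-index mask of shape [H, W]."""
    semantic = np.zeros((height, width), dtype=np.int64)
    for mask, cls in zip(masks, classes):
        semantic[mask] = cls  # every instance of a class writes the same value
    return semantic


semantic_mask = to_semantic(instance_masks, instance_classes, 4, 6)
print(semantic_mask)
```

Going the other way is not possible in general: once both people share the value 1, nothing in the semantic mask records which pixels belonged to which person.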

Segmentation Output Formats

Data flow (semantic segmentation)

Every pixel is assigned a class label. The output is a single mask the same size as the input, where each value is an integer class index. Two people in the same scene get identical labels, so their pixels are indistinguishable.

Input shape: [H × W × 3]

Output shape: [H × W × 1]

Output fields

  • One class label per pixel
  • Integer values: 0 to C–1
  • C = number of classes

Trade-off: Simple and fast, but object identity is lost. You know which pixels are "person", not which person is which.
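
As a rough illustration of these shapes, the sketch below assumes NumPy and a model that emits one score per class for every pixel. That intermediate [H × W × C] score array is an assumption about a typical pipeline rather than something stated above, and the random values simply stand in for real model output. Taking the highest-scoring class at each pixel yields the [H × W × 1] mask of integer class indices.

```python
import numpy as np

H, W, C = 4, 5, 3  # tiny illustrative sizes; C is the number of classes

rng = np.random.default_rng(0)

# Input: an RGB image of shape [H, W, 3].
image = rng.integers(0, 256, size=(H, W, 3), dtype=np.uint8)

# Stand-in for model output: one score per class per pixel, shape [H, W, C].
scores = rng.random((H, W, C))

# Pick the best class per pixel, then keep a trailing axis to get [H, W, 1].
mask = scores.argmax(axis=-1)[..., np.newaxis]

print(image.shape, mask.shape)
```

Every value in `mask` falls in the range 0 to C–1, and the spatial dimensions match those of the input image.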

From the Case Files: Segmentation Is Harder Than It Looks

In the interior design space, I was working on a computer vision project that required segmenting furniture from room images as a processing step before downstream tasks. What looked simple in testing (just draw a mask around the sofa) became a real challenge in deployment. The model would leave behind sofa legs, miss partially occluded cushions, and struggle with edge cases. Segmentation is challenging!