Introduction

Classification answers one question: "What is in this image?" Object detection answers a richer question: "What is in this image, where exactly is it, and are there multiple instances of it?"

This distinction matters for two practical reasons. First, explainability (not to be confused with explainable AI): when a model draws a bounding box around an object, a human can visually verify whether it is looking at the right thing. A classification label offers no comparable visual check, so in this respect classification models are more opaque than detection models. Second, autonomy: many real-world systems need to know where objects are, not just that they exist. A robot cannot pick up a cup unless it knows the cup's spatial location.

Detection Output Formats

Data flow

Bounding boxes are the most common detection output. For each detected object, the model emits a bounding rectangle (top-left corner plus width and height, all normalized to [0, 1]), a class label, and a confidence score.

Input shape

[416 × 416 × 3]

Output shape

List of Detections

Output fields

  • bbox: [x, y, w, h], each value ∈ [0, 1]
  • class_id: integer
  • conf: float ∈ [0, 1]

Trade-off: fast and compact, but a box is only a rough approximation of an object's true shape.
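
The output fields can be sketched as a small Python record, together with the conversion from normalized coordinates back to pixels. This is a minimal illustration, not any particular library's API: the names Detection and to_pixel_corners are hypothetical, the example values are made up, and the 416 × 416 default simply mirrors the input shape shown here.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Detection:
    """One detected object; field names mirror the output fields listed in the text."""
    bbox: Tuple[float, float, float, float]  # (x, y, w, h), top-left origin, normalized to [0, 1]
    class_id: int                            # integer category index
    conf: float                              # confidence score in [0, 1]


def to_pixel_corners(det: Detection, img_w: int = 416, img_h: int = 416) -> Tuple[int, int, int, int]:
    """Convert a normalized (x, y, w, h) box to pixel corners (x1, y1, x2, y2)."""
    x, y, w, h = det.bbox
    x1 = round(x * img_w)
    y1 = round(y * img_h)
    x2 = round((x + w) * img_w)
    y2 = round((y + h) * img_h)
    return x1, y1, x2, y2


# Illustrative values only, not real model output.
example = Detection(bbox=(0.25, 0.5, 0.5, 0.25), class_id=3, conf=0.9)
corners = to_pixel_corners(example)
print(corners)  # (104, 208, 312, 312)
```

Keeping coordinates normalized until the last step means the same detection can be drawn on a resized copy of the image by passing different width and height values.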

What Object Detection Outputs

For each detected object, a bounding box model outputs:

  • Bounding box coordinates — Typically expressed as (x, y, width, height) where (x, y) is the top-left corner, normalized to [0, 1].
  • Class ID — An integer identifying the category of the detected object.
  • Confidence score — A probability reflecting how certain the model is about this detection.

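The (x, y, width, height) layout with a top-left origin follows the COCO annotation convention, named after the Common Objects in Context dataset, the most widely used benchmark in object detection research. COCO annotation files store these values in absolute pixels, so the normalized form above is the pixel box divided by the image width and height. COCO contains over 200,000 labeled images with 1.5 million annotated object instances across 80 categories.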
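
COCO annotation files store each box as [x, y, width, height] in pixels, alongside the image's width and height. The following is a minimal sketch of scaling such a box to the [0, 1] range. The helper name coco_bbox_to_normalized is hypothetical, and the two records are made-up values shaped like entries in a COCO JSON file.

```python
from typing import List, Tuple


def coco_bbox_to_normalized(bbox: List[float], img_w: int, img_h: int) -> Tuple[float, float, float, float]:
    """Scale a COCO-style pixel box [x, y, width, height] to the [0, 1] range."""
    x, y, w, h = bbox
    # Horizontal values are divided by image width, vertical values by image height.
    return x / img_w, y / img_h, w / img_w, h / img_h


# Hypothetical records; real files hold many of these under "images" and "annotations".
image = {"id": 1, "width": 640, "height": 480}
annotation = {"image_id": 1, "category_id": 18, "bbox": [160.0, 120.0, 320.0, 240.0]}

normalized = coco_bbox_to_normalized(annotation["bbox"], image["width"], image["height"])
print(normalized)  # (0.25, 0.25, 0.5, 0.5)
```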

Keypoint Detection as an Extension

Object detection is a family of tasks, and bounding boxes are only the most common output format. One important variant is keypoint detection, which outputs a set of (x, y) coordinates corresponding to specific structural points on an object. For a person, these are body landmarks such as the nose, left eye, right elbow, and left knee. Keypoint detection is the foundation of pose estimation.
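
A keypoint result can be pictured as a mapping from landmark names to (x, y) coordinates. The sketch below assumes normalized coordinates and uses made-up names and values. Both helper functions are hypothetical, and real models typically fix the landmark order and often attach a per-point confidence or visibility value, which is omitted here.

```python
from typing import Dict, Tuple

Point = Tuple[float, float]

# Hypothetical normalized (x, y) keypoints for one person; values are illustrative only.
keypoints: Dict[str, Point] = {
    "nose": (0.5, 0.25),
    "left_eye": (0.53125, 0.21875),
    "right_elbow": (0.375, 0.5),
    "left_knee": (0.5625, 0.75),
}


def keypoints_to_pixels(points: Dict[str, Point], img_w: int, img_h: int) -> Dict[str, Tuple[int, int]]:
    """Scale normalized keypoints to integer pixel positions."""
    return {name: (round(x * img_w), round(y * img_h)) for name, (x, y) in points.items()}


def enclosing_box(points: Dict[str, Point]) -> Tuple[float, float, float, float]:
    """Smallest (x, y, w, h) box, top-left origin, that contains every keypoint."""
    xs = [x for x, _ in points.values()]
    ys = [y for _, y in points.values()]
    x_min, y_min = min(xs), min(ys)
    return x_min, y_min, max(xs) - x_min, max(ys) - y_min


pixels = keypoints_to_pixels(keypoints, 416, 416)
box = enclosing_box(keypoints)
print(pixels["nose"])  # (208, 104)
print(box)             # (0.375, 0.21875, 0.1875, 0.53125)
```

The enclosing box shows how the two formats relate: a box can be derived from keypoints, but a box alone cannot recover where the individual landmarks are.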