Evaluating Image Classification
Here is a scenario: a colleague tells you their cat-vs-dog classifier achieves 57% accuracy. What do you actually know?
Almost nothing. You do not know whether the model is failing more on cats or dogs. You do not know whether it is consistently wrong in a particular way. You do not know whether the test set is imbalanced: if 57% of the test images are dogs, a model that always answers "dog" scores exactly 57%. You do not know whether it performs differently on small cats versus large cats, or on cats in dim lighting versus bright lighting.
Meaningful model evaluation requires digging deeper, and for computer vision the appropriate metrics depend on the task type.
Evaluating Image Classification
For classification, the standard toolkit from other ML contexts applies directly:
- Accuracy: Correct predictions / total predictions. Useful as a headline number but misleading when classes are imbalanced.
- Precision: True positives / (true positives + false positives). How many of the "cat" predictions were actually cats?
- Recall: True positives / (true positives + false negatives). Of all the actual cats, how many did we catch?
- F1 Score: Harmonic mean of precision and recall, 2 × precision × recall / (precision + recall). Balances both concerns in a single number and is high only when both are high.
- AUC-ROC: Area under the ROC curve. Summarizes how well the model's scores rank positives above negatives across all decision thresholds. Less sensitive to class imbalance than accuracy, though it can still look optimistic when positives are very rare.
- Confusion matrix: The most informative view for multi-class problems. It tabulates ground-truth class against predicted class, so every kind of mistake (for example, cats labeled as dogs) gets its own cell.
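To make the definitions above concrete, here is a minimal sketch in plain Python that builds a confusion matrix and derives accuracy, per-class precision, recall, and F1 from it, plus AUC-ROC computed as a pairwise ranking statistic. The function names, the toy labels, and the scores are illustrative assumptions; in practice a library such as scikit-learn would normally do this work.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    """Rows are ground-truth classes, columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]

def accuracy(matrix):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

def per_class_metrics(matrix, labels):
    """Derive precision, recall, and F1 for each class from the matrix."""
    metrics = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        fp = sum(row[i] for row in matrix) - tp   # predicted as label, but wrong
        fn = sum(matrix[i]) - tp                  # truly label, but missed
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics

def roc_auc(y_true, scores, positive):
    """AUC as the chance a random positive outscores a random negative.

    Ties count as half a win. This O(n^2) form is for illustration only.
    """
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: 4 cats and 6 dogs, with the model's "cat" score.
labels = ["cat", "dog"]
y_true = ["cat"] * 4 + ["dog"] * 6
cat_scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.55, 0.3, 0.2, 0.1, 0.05]
y_pred = ["cat" if s >= 0.5 else "dog" for s in cat_scores]

matrix = confusion_matrix(y_true, y_pred, labels)   # [[3, 1], [2, 4]]
overall = accuracy(matrix)                          # 0.7
report = per_class_metrics(matrix, labels)          # cat: P=0.6, R=0.75
auc = roc_auc(y_true, cat_scores, positive="cat")   # 22 / 24
```

On this toy data the headline accuracy is 0.7, but the matrix shows the errors are uneven: one cat is missed while two dogs are mislabeled as cats, which is why cat precision (0.6) is lower than cat recall (0.75).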
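Best practices: use stratified sampling for train/validation/test splits so that each split preserves the overall class proportions and rare classes stay represented. Report per-class metrics for multi-class problems. Use top-K accuracy, which counts a prediction as correct when the true class is among the model's K highest-scoring classes, when the class list is very long (e.g., ImageNet's 1,000 categories).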
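Two of these practices, stratified splitting and top-K accuracy, can be sketched in a few lines of plain Python. This is an illustrative sketch under assumptions, not a reference implementation: the helper names `stratified_split` and `top_k_accuracy`, the 90/10 toy dataset, and the score rows are all hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), splitting each class at the same ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

def top_k_accuracy(y_true, score_rows, k):
    """Fraction of samples whose true class index is among the k top scores."""
    hits = 0
    for true_idx, scores in zip(y_true, score_rows):
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        hits += true_idx in ranked[:k]
    return hits / len(y_true)

# Hypothetical imbalanced dataset: 90 cats, 10 dogs.
labels = ["cat"] * 90 + ["dog"] * 10
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
test_dogs = sum(labels[i] == "dog" for i in test_idx)  # 2 of the 20 test samples

# Hypothetical scores for a 3-class problem; y_true holds class indices.
y_true = [0, 1, 2]
score_rows = [
    [0.6, 0.3, 0.1],  # true class 0 is ranked first
    [0.5, 0.4, 0.1],  # true class 1 is ranked second
    [0.5, 0.3, 0.2],  # true class 2 is ranked third
]
top1 = top_k_accuracy(y_true, score_rows, k=1)  # 1 of 3 correct
top2 = top_k_accuracy(y_true, score_rows, k=2)  # 2 of 3 correct
```

Because the split is done per class, the test set keeps the original 9:1 ratio. A purely random 20% split would also average 2 dogs, but a given draw could contain far fewer or none, which would leave per-class metrics for dogs unreliable or undefined.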