Error Analysis and Computational Efficiency
Error Analysis
The most valuable thing you can do after computing metrics is look at your model's mistakes.
A rigorous error analysis process:
- Generate the confusion matrix. For multi-class problems, this shows you which classes are most commonly confused with each other.
- Compute per-class accuracy. A model that is 90% accurate overall but 30% accurate on one specific class has a problem that aggregate accuracy hides.
- Sample failure cases. Look at actual images where the model was wrong. Do you see a pattern? Are the failures concentrated in one type of image — low lighting, unusual angles, certain backgrounds?
- Test under different conditions. Evaluate separately on images with occlusion, with challenging lighting, at different scales. Real-world deployment conditions should be represented in your evaluation.
- Check for bias. Failures can concentrate in a subgroup inside a class. A model that fails on tabby cats while handling other cats well has a bias that per-class metrics will not reveal unless you stratify below the class level.
This kind of analysis is also what makes model evaluation a matter of judgment and creativity rather than a purely mechanical procedure.
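The first steps of this process can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names, the toy labels, and the `groups` condition tags (for example, a lighting annotation per image) are all hypothetical. In practice a library routine such as scikit-learn's `confusion_matrix` would usually replace the hand-written version.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Count predictions; rows are true classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def per_class_accuracy(matrix):
    """Fraction of each true class that was predicted correctly."""
    result = []
    for i, row in enumerate(matrix):
        total = sum(row)
        result.append(row[i] / total if total else float("nan"))
    return result


def most_confused_pairs(matrix, top_k=3):
    """Largest off-diagonal cells as (true_class, predicted_class, count)."""
    pairs = [
        (t, p, n)
        for t, row in enumerate(matrix)
        for p, n in enumerate(row)
        if t != p and n > 0
    ]
    return sorted(pairs, key=lambda item: -item[2])[:top_k]


def accuracy_by_group(y_true, y_pred, groups):
    """Accuracy computed separately for each condition tag (hypothetical metadata)."""
    correct, total = {}, {}
    for t, p, g in zip(y_true, y_pred, groups):
        total[g] = total.get(g, 0) + 1
        correct[g] = correct.get(g, 0) + (1 if t == p else 0)
    return {g: correct[g] / total[g] for g in total}


# Toy data (made up for illustration): 3 classes, 4 samples each.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 2, 1, 1, 0]
groups = ["normal"] * 9 + ["low_light"] * 3

cm = confusion_matrix(y_true, y_pred, num_classes=3)
overall = sum(cm[i][i] for i in range(3)) / len(y_true)
acc = per_class_accuracy(cm)
pairs = most_confused_pairs(cm)
by_group = accuracy_by_group(y_true, y_pred, groups)

print("overall accuracy:", overall)
print("per-class accuracy:", acc)
print("most confused (true, predicted, count):", pairs)
print("accuracy by condition:", by_group)
```

On this toy data the overall accuracy is 0.75, yet class 2 is correct only a quarter of the time and every error falls in the `low_light` group, which is the kind of pattern an aggregate number hides.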
Computational Efficiency as an Evaluation Dimension
A model that achieves 95% mAP but requires 10 seconds per image is not useful for real-time applications. Whenever real-world deployment is in view, report computational metrics alongside accuracy metrics.
Computational metrics to report:
- FLOPs (floating point operations) — A measure of computational complexity independent of hardware.
- Parameter count — Model size, relevant to memory requirements and deployment constraints.
- Inference latency — Measured on the target hardware. (A GPU latency number is not useful for a deployment that runs on a mobile CPU...)
- Memory footprint — Total GPU/CPU memory required during inference.
When comparing architectures, report both sides of this trade-off. A model with 5% lower mAP that is 10× faster may be the better choice for your application.
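As a rough sketch of how some of these numbers can be obtained, the example below counts parameters and approximate FLOPs for a small fully connected network and times a callable with warm-up runs. Everything here is an assumption made for illustration: the helper names, the layer sizes, the `dummy_model` stand-in, and the convention of counting one multiply and one add as 2 FLOPs (some reports count multiply-accumulates instead, which halves the figure). Real measurements should time the actual model on the target hardware.

```python
import statistics
import time


def count_dense_parameters(layer_sizes):
    """Weights plus biases for a fully connected network with the given layer widths."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


def count_dense_flops(layer_sizes):
    """Approximate FLOPs per input: 2 * n_in * n_out per layer (multiply + add).

    Biases and activation functions are ignored in this estimate.
    """
    return sum(
        2 * n_in * n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


def measure_latency(fn, example, warmup=5, runs=30):
    """Time fn(example) after warm-up; return median and a rough 95th percentile in ms."""
    for _ in range(warmup):
        fn(example)  # warm-up runs are not recorded
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(example)
        timings.append((time.perf_counter() - start) * 1000.0)
    timings.sort()
    p95_index = int(0.95 * (len(timings) - 1))  # simple index-based approximation
    return {
        "median_ms": statistics.median(timings),
        "p95_ms": timings[p95_index],
    }


def dummy_model(x):
    """Hypothetical stand-in for a model's forward pass."""
    return sum(v * v for v in x)


layer_sizes = [784, 128, 10]  # hypothetical architecture
params = count_dense_parameters(layer_sizes)
flops = count_dense_flops(layer_sizes)
stats = measure_latency(dummy_model, list(range(10_000)))

print("parameters:", params)
print("approx. FLOPs per input:", flops)
print("latency:", stats)
```

Reporting a median together with a tail percentile, rather than a single run, makes the latency figure less sensitive to one-off stalls on the machine.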
The Evaluation Mindset
The best practitioners treat evaluation as a form of investigation, not just a final grade. Before shipping any model, ask: what kinds of failure would be most costly in production? Design your evaluation to stress-test those scenarios specifically. A model that passes a generic benchmark may still fail badly on the specific distribution it will encounter in deployment.
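One way to make "most costly" concrete is to weight each cell of a confusion matrix by a cost and compare models on expected cost per prediction rather than on accuracy alone. The sketch below is illustrative only: the two confusion matrices and the cost values are invented, and choosing real costs is a product decision, not something the code can supply.

```python
def accuracy(confusion):
    """Share of predictions on the diagonal of a confusion matrix."""
    total = sum(sum(row) for row in confusion)
    return sum(confusion[i][i] for i in range(len(confusion))) / total


def expected_cost(confusion, cost):
    """Average cost per prediction; rows are true classes, columns are predictions."""
    total = sum(sum(row) for row in confusion)
    weighted = sum(
        confusion[t][p] * cost[t][p]
        for t in range(len(confusion))
        for p in range(len(confusion[t]))
    )
    return weighted / total


# Hypothetical two-class setting where class 1 is the high-risk class.
# Missing a true class-1 case is assumed to cost 20x a false alarm.
cost = [
    [0, 1],   # true 0: correct, false alarm
    [20, 0],  # true 1: missed case, correct
]

model_a = [[90, 10], [10, 90]]  # made-up counts
model_b = [[80, 20], [2, 98]]   # made-up counts

acc_a, acc_b = accuracy(model_a), accuracy(model_b)
cost_a, cost_b = expected_cost(model_a, cost), expected_cost(model_b, cost)

print("model A: accuracy", acc_a, "expected cost", cost_a)
print("model B: accuracy", acc_b, "expected cost", cost_b)
```

With these invented numbers, model B is slightly less accurate than model A (0.89 versus 0.90) but far cheaper in expectation (0.30 versus 1.05), because it rarely misses the high-cost class.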
A skin lesion classification model achieves 94% overall accuracy across 5 classes on a balanced test set. The product team declares it ready for deployment. What evaluation questions would you ask before agreeing?