Error Analysis and Computational Efficiency

Error Analysis

The most valuable thing you can do after computing metrics is look at your model's mistakes.

A rigorous error analysis process:

  • Generate the confusion matrix. For multi-class problems, this shows you which classes are most commonly confused with each other.
  • Compute per-class accuracy. A model that is 90% accurate overall but 30% accurate on one specific class has a problem that aggregate accuracy hides.
  • Sample failure cases. Look at actual images where the model was wrong. Do you see a pattern? Are the failures concentrated in one type of image — low lighting, unusual angles, certain backgrounds?
  • Test under different conditions. Evaluate separately on images with occlusion, with challenging lighting, at different scales. Real-world deployment conditions should be represented in your evaluation.
  • Check for bias. Per-class metrics can still hide failures on a subgroup within a class. A model that does well on cats overall but fails on tabby cats has a bias you will only see if you stratify further, by subgroup.
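
The first steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a prescribed implementation: the function names, the three-class labels, and the "normal" / "low_light" tags are all hypothetical. The data was chosen so that overall accuracy is 90% while one class sits at 30%.

```python
from collections import defaultdict


def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def per_class_accuracy(matrix):
    """Fraction of each true class that was predicted correctly."""
    return [row[i] / sum(row) if sum(row) else float("nan")
            for i, row in enumerate(matrix)]


def top_confusions(matrix, k=3):
    """Most frequent off-diagonal (true, predicted, count) triples."""
    pairs = [(t, p, n) for t, row in enumerate(matrix)
             for p, n in enumerate(row) if t != p and n > 0]
    return sorted(pairs, key=lambda x: x[2], reverse=True)[:k]


def accuracy_by_slice(y_true, y_pred, slice_tags):
    """Accuracy computed separately for each condition tag."""
    totals, correct = defaultdict(int), defaultdict(int)
    for t, p, tag in zip(y_true, y_pred, slice_tags):
        totals[tag] += 1
        correct[tag] += int(t == p)
    return {tag: correct[tag] / totals[tag] for tag in totals}


# Hypothetical labels: 50 / 40 / 10 examples of classes 0 / 1 / 2.
y_true = [0] * 50 + [1] * 40 + [2] * 10
y_pred = ([0] * 48 + [1] * 2      # class 0: 48 of 50 correct
          + [1] * 39 + [0] * 1    # class 1: 39 of 40 correct
          + [2] * 3 + [1] * 7)    # class 2: 3 of 10 correct
# Hypothetical condition tags: the last 20 images were shot in low light.
tags = ["normal"] * 80 + ["low_light"] * 20

cm = confusion_matrix(y_true, y_pred, num_classes=3)
overall = sum(cm[i][i] for i in range(3)) / len(y_true)   # 0.90
per_class = per_class_accuracy(cm)                        # class 2 is 0.30
worst = top_confusions(cm, k=1)                           # [(2, 1, 7)]
by_slice = accuracy_by_slice(y_true, y_pred, tags)

print(f"overall accuracy: {overall:.2f}")
print("per-class accuracy:", [round(a, 3) for a in per_class])
print("most common confusion (true, predicted, count):", worst)
print("accuracy by condition:", by_slice)
```

The indices of the misclassified examples are the natural starting point for sampling failure cases: pull the images behind the largest off-diagonal cell and look at them directly.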

This kind of analysis is what makes model evaluation a matter of judgment and creativity rather than a purely mechanical computation.

Computational Efficiency as an Evaluation Dimension

A model that achieves 95% mAP but requires 10 seconds per image is not useful for real-time applications. Computational metrics should be reported alongside accuracy metrics whenever real-world deployment is being considered.

Metrics to report alongside accuracy metrics:

  • FLOPs (floating point operations) — A measure of computational complexity independent of hardware.
  • Parameter count — Model size, relevant to memory requirements and deployment constraints.
  • Inference latency — Measured on the target hardware. (A GPU latency number is not useful for a deployment that runs on a mobile CPU...)
  • Memory footprint — Total GPU/CPU memory required during inference.
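
As a rough illustration of how some of these numbers are obtained, the sketch below counts parameters and FLOPs for a single standard 2D convolution and times an arbitrary callable. Several things here are assumptions rather than anything prescribed above: the helper names `conv2d_cost` and `measure_latency`, the convention that one multiply-accumulate counts as two FLOPs (some reports count it as one), and the stand-in "model", which is just a Python function.

```python
import statistics
import time


def conv2d_cost(c_in, c_out, kernel, h_out, w_out, bias=True):
    """Parameter count and approximate FLOPs for one standard 2D convolution."""
    params = kernel * kernel * c_in * c_out + (c_out if bias else 0)
    macs = kernel * kernel * c_in * c_out * h_out * w_out
    return params, 2 * macs  # assumption: 1 multiply-accumulate = 2 FLOPs


def measure_latency(fn, example, warmup=10, runs=50):
    """Median and approximate 95th-percentile latency of fn(example), in ms."""
    for _ in range(warmup):          # warm-up runs are discarded
        fn(example)
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(example)
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    p95 = samples[min(len(samples) - 1, int(0.95 * len(samples)))]
    return statistics.median(samples), p95


# A 3x3 convolution, 64 -> 128 channels, producing a 56x56 output map.
params, flops = conv2d_cost(c_in=64, c_out=128, kernel=3, h_out=56, w_out=56)
print(f"parameters: {params:,}")   # 73,856
print(f"FLOPs:      {flops:,}")    # 462,422,016


# Stand-in for a real model's forward pass (hypothetical).
def fake_model(x):
    return sum(v * v for v in x)


median_ms, p95_ms = measure_latency(fake_model, list(range(10_000)))
print(f"latency: median {median_ms:.3f} ms, p95 {p95_ms:.3f} ms")
```

Reporting a median and a tail percentile rather than a single run guards against one-off timing noise. On an accelerator that executes asynchronously, the timer would also need to wait for the device to finish before each reading, which this CPU-only sketch does not do.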

When comparing architectures, report both sides of this trade-off. A model with 5% lower mAP that is 10× faster may be the correct choice for your application.
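
One way to make this trade-off explicit is to list the candidates that are not beaten on both axes at once, then pick the most accurate one that fits a latency budget. The sketch below assumes hypothetical model names and numbers; `Candidate`, `pareto_front`, and `best_within_budget` are illustrative helpers, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    map_score: float    # mAP in [0, 1]
    latency_ms: float   # measured on the target hardware


def pareto_front(candidates):
    """Candidates that no other candidate beats on both mAP and latency."""
    front = []
    for c in candidates:
        dominated = any(
            o.map_score >= c.map_score and o.latency_ms <= c.latency_ms
            and (o.map_score > c.map_score or o.latency_ms < c.latency_ms)
            for o in candidates
        )
        if not dominated:
            front.append(c)
    return sorted(front, key=lambda c: c.latency_ms)


def best_within_budget(candidates, max_latency_ms):
    """Highest-mAP candidate that meets the latency budget, or None."""
    feasible = [c for c in candidates if c.latency_ms <= max_latency_ms]
    return max(feasible, key=lambda c: c.map_score) if feasible else None


# Hypothetical measurements for four architectures.
candidates = [
    Candidate("large", 0.60, 400.0),
    Candidate("medium", 0.55, 40.0),   # 5 points lower mAP, 10x faster
    Candidate("small", 0.45, 15.0),
    Candidate("legacy", 0.40, 50.0),   # slower and less accurate than "medium"
]

front = pareto_front(candidates)
choice = best_within_budget(candidates, max_latency_ms=50.0)

print("Pareto front:", [c.name for c in front])   # small, medium, large
print("best under 50 ms:", choice.name)           # medium
```

With a 50 ms budget the "large" model is never a candidate, however good its mAP, which is the point of reporting both numbers side by side.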

The Evaluation Mindset

The best practitioners treat evaluation as a form of investigation, not just a final grade. Before shipping any model, ask: what kinds of failure would be most costly in production? Design your evaluation to stress-test those scenarios specifically. A model that passes a generic benchmark may still fail badly on the specific distribution it will encounter in deployment.
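
One simple way to bring failure cost into the evaluation is to weight each cell of the confusion matrix by how expensive that mistake is in production. The sketch below is only an illustration under assumed numbers: the two classes ("ok" and "defect"), both confusion matrices, and the cost values are hypothetical, and `average_error_cost` is an illustrative helper.

```python
def average_error_cost(confusion, cost):
    """Mean cost per example; both matrices are indexed [true][predicted]."""
    total = sum(sum(row) for row in confusion)
    weighted = sum(confusion[t][p] * cost[t][p]
                   for t in range(len(confusion))
                   for p in range(len(confusion[t])))
    return weighted / total


def accuracy(confusion):
    """Overall accuracy: diagonal count divided by total count."""
    total = sum(sum(row) for row in confusion)
    return sum(confusion[i][i] for i in range(len(confusion))) / total


# Class 0 = "ok", class 1 = "defect"; 90 ok and 10 defect examples each.
model_a = [[85, 5],
           [5, 5]]    # misses 5 of 10 defects
model_b = [[81, 9],
           [1, 9]]    # misses 1 of 10 defects, raises more false alarms

# Assumed costs: a false alarm costs 1 unit, a missed defect costs 20.
cost = [[0, 1],
        [20, 0]]

acc_a, acc_b = accuracy(model_a), accuracy(model_b)   # both 0.90
cost_a = average_error_cost(model_a, cost)            # 1.05
cost_b = average_error_cost(model_b, cost)            # 0.29

print(f"model A: accuracy {acc_a:.2f}, average cost {cost_a:.2f}")
print(f"model B: accuracy {acc_b:.2f}, average cost {cost_b:.2f}")
```

Both models score the same on accuracy, yet under these assumed costs one is several times more expensive to deploy. The cost values themselves are a product decision, not something the evaluation can supply.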

💭 Reflection

A skin lesion classification model achieves 94% overall accuracy across 5 classes on a balanced test set. The product team declares it ready for deployment. What evaluation questions would you ask before agreeing?