Unit 2

Computer Vision

Learn how machines process and interpret visual information, from classical feature extraction through CNNs, detection, segmentation, and evaluation, to the modern architectures defining the current state of the art.

Chapter 1

Introduction to Computer Vision

Computer vision gives machines the ability to interpret visual information — yet seeing is far harder than it looks. This chapter maps the full landscape of CV tasks (classification, detection, segmentation, keypoints, tracking), shows how images become tensors of numbers, and identifies the core challenges — occlusion, lighting, scale, and inference speed — that make the field genuinely difficult.

Chapter 2

Traditional Computer Vision

Before deep learning, computer vision meant engineering features manually — color histograms, texture descriptors, edge detectors — and feeding them to classical models. This chapter builds a working vocabulary of those features, explains which models they pair with, and assesses why this approach has a ceiling.

Chapter 3

Convolutional Neural Networks

This chapter introduces convolutional neural networks from first principles — the sliding window operation of convolution, the parameter efficiency of shared weights, pooling as strategic forgetting, and the full architecture that transforms raw pixels into class probabilities. We also derive the output dimension formula you will use constantly when debugging CNN architectures.
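As a preview of that formula, here is a minimal sketch of the standard output-size calculation for one spatial dimension, assuming a square kernel, symmetric zero padding, and no dilation. The function name `conv_output_size` and its parameter names are illustrative, not taken from the chapter.

```python
def conv_output_size(n: int, kernel: int, padding: int = 0, stride: int = 1) -> int:
    """Output size along one spatial dimension of a convolution or pooling layer.

    Standard formula: floor((n + 2 * padding - kernel) / stride) + 1.
    Assumes symmetric zero padding and no dilation.
    """
    if n + 2 * padding < kernel:
        raise ValueError("kernel is larger than the padded input")
    return (n + 2 * padding - kernel) // stride + 1


# A 32x32 input through a 5x5 convolution with no padding and stride 1.
after_conv = conv_output_size(32, kernel=5)                      # 28
# Followed by 2x2 max pooling with stride 2.
after_pool = conv_output_size(after_conv, kernel=2, stride=2)    # 14
# A 224x224 input through a 7x7 convolution with padding 3 and stride 2.
stem = conv_output_size(224, kernel=7, padding=3, stride=2)      # 112
print(after_conv, after_pool, stem)
```

Chaining the function layer by layer like this is a quick way to find the point where a tensor shape stops matching what the next layer expects.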

Chapter 4

Image Classification

The history of CNN architectures is a series of solved problems: LeNet proved CNNs work, AlexNet proved scale matters, VGGNet revealed that accuracy saturates as plain networks get deeper, ResNet addressed that saturation with residual connections, Inception showed that parallel convolutions at multiple scales can outperform single-filter designs, and DenseNet and SqueezeNet pushed parameter efficiency much further. Each breakthrough reflects a fundamental insight about what makes deep networks trainable and deployable.

Chapter 5

CNN Implementation

You rarely need to train a computer vision model from scratch. Pre-trained CNNs have already learned the visual features of the world — your job is to adapt them. This chapter covers the mechanics of transfer learning (freeze, replace head, fine-tune), practical fine-tuning trade-offs, and data augmentation strategies that make small datasets punch well above their weight.

Chapter 6

Object Detection

Object detection outputs a bounding box, class label, and confidence score for each object found in an image. This chapter covers the COCO annotation format, the anchor box mechanism that handles objects of varied sizes and aspect ratios, Non-Maximum Suppression, and two dominant detection architectures: YOLO (single-stage, optimized for real-time speed) and Faster R-CNN (two-stage, optimized for accuracy).
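To make Non-Maximum Suppression concrete, the sketch below shows the common greedy variant in plain Python. It assumes boxes are given as corner coordinates `(x1, y1, x2, y2)` and belong to a single class. The names `iou` and `nms` and the 0.5 threshold are illustrative choices, not an API from any particular detection library.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed corner format


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: Sequence[Box], scores: Sequence[float],
        iou_threshold: float = 0.5) -> List[int]:
    """Greedy NMS: visit boxes by descending score and keep a box only if it
    does not overlap an already kept box by more than iou_threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
# The second box overlaps the first with IoU 81/119 (about 0.68), so it is dropped.
print(nms(boxes, scores))  # [0, 2]
```

In a multi-class detector this procedure is typically run separately per class, so that overlapping boxes with different labels do not suppress each other.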

Chapter 7

Segmentation

Segmentation assigns a class label to every pixel in an image — the most spatially precise of the standard CV tasks. This chapter distinguishes semantic from instance segmentation, explains the encoder-decoder design of U-Net and why skip connections are essential for spatial precision, and surveys the broader segmentation landscape including Mask R-CNN, DeepLabV3, and PSPNet.

Chapter 8

Evaluation

A single accuracy number hides as much as it reveals. This chapter covers the full evaluation toolkit for CV: standard classification metrics (precision, recall, F1, AUC), mean Average Precision (mAP) for detection, mean IoU for segmentation, and the discipline of error analysis. It closes with computational efficiency, the often-neglected second axis every deployment decision must account for.
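A small sketch of why accuracy alone can mislead, assuming binary labels where 1 marks the positive class. The function name `classification_metrics` and the toy data are made up for illustration.

```python
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy imbalanced dataset: 5 positives, 95 negatives.
y_true = [1] * 5 + [0] * 95
# A degenerate model that always predicts "negative".
y_pred = [0] * 100
print(classification_metrics(y_true, y_pred))
# accuracy is 0.95, yet precision, recall, and F1 are all 0.0
```

The always-negative model scores 95% accuracy while finding none of the positives, which is the kind of failure that precision, recall, and error analysis are meant to expose.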

Chapter 9

SOTA: Vision Transformers, Diffusion, and Zero-Shot Detection

This chapter introduces Vision Transformers (patches as tokens, global attention from the first layer), zero-shot detection with Grounding DINO (vision-language fusion), Meta's Segment Anything Model (interactive general-purpose segmentation), and the basic mechanics of diffusion models, which now power both image generation and synthetic training data pipelines.
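The "patches as tokens" idea can be sketched in a few lines of plain Python. This assumes an image stored as a nested list of shape height x width x channels and non-overlapping square patches. The helper name `patchify` is hypothetical, and a real Vision Transformer would follow this step with a learned linear projection and position embeddings, both omitted here.

```python
from typing import List

Image = List[List[List[float]]]  # height x width x channels, assumed layout


def patchify(image: Image, patch: int) -> List[List[float]]:
    """Split an image into non-overlapping patch x patch squares and flatten
    each square into one token (a flat list of pixel values)."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            token: List[float] = []
            for row in range(top, top + patch):
                for col in range(left, left + patch):
                    token.extend(image[row][col])
            tokens.append(token)
    return tokens


# A 4x4 single-channel image whose pixel values are 0..15 in row-major order.
tiny = [[[float(r * 4 + c)] for c in range(4)] for r in range(4)]
tokens = patchify(tiny, patch=2)
print(len(tokens), tokens[0])  # 4 [0.0, 1.0, 4.0, 5.0]

# At the sizes commonly used for ViT (224x224 RGB, 16x16 patches), the same
# procedure gives 14 * 14 = 196 tokens of 16 * 16 * 3 = 768 values each.
```

Because every token can attend to every other token, even the first layer sees the whole image, in contrast to the local receptive field of an early convolution.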