Chapter 9

SOTA: Vision Transformers, Diffusion, and Zero-Shot Detection

This chapter introduces Vision Transformers (patches as tokens, global attention from the first layer), zero-shot detection with Grounding DINO (vision-language fusion), Meta's Segment Anything Model (interactive, general-purpose segmentation), and the basic mechanics of diffusion models, which now power both image generation and synthetic training-data pipelines.
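To make "patches as tokens" and "global attention from the first layer" concrete before the details, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not a full Vision Transformer: the function names `patchify` and `attention_weights`, the 224x224 image, the 16-pixel patch size, the 64-dimensional embedding, and the random weights are all chosen for this example, and the class token, positional embeddings, and multiple heads are left out.

```python
import numpy as np


def patchify(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C).
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, c) -> (gh, gw, p, p, c) -> (gh * gw, p * p * c)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)


def attention_weights(tokens: np.ndarray, w_q: np.ndarray, w_k: np.ndarray) -> np.ndarray:
    """Single-head scaled dot-product attention weights, shape (N, N)."""
    q = tokens @ w_q
    k = tokens @ w_k
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Subtract the row-wise maximum for a numerically stable softmax.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)


# Hypothetical inputs: a random "image" and random (untrained) weights.
rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))

patches = patchify(image, patch_size=16)      # (196, 768): a 14 x 14 grid of patches
embed_dim = 64
w_embed = rng.normal(scale=0.02, size=(patches.shape[1], embed_dim))
tokens = patches @ w_embed                    # (196, 64): one token per patch

w_q = rng.normal(scale=embed_dim ** -0.5, size=(embed_dim, embed_dim))
w_k = rng.normal(scale=embed_dim ** -0.5, size=(embed_dim, embed_dim))
attn = attention_weights(tokens, w_q, w_k)    # (196, 196)

print(patches.shape, tokens.shape, attn.shape)
```

Each row of `attn` holds one patch's weights over every patch in the image, including those on the far side of the grid. That is the sense in which attention is global from the first layer, in contrast to a convolution, whose receptive field starts small and grows with depth.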