Chapter 9

Multimodal Models

The transformer's sequence-of-tokens insight extends far beyond language. This chapter covers the Vision Transformer (ViT), which splits an image into patches and feeds them as tokens to a standard transformer encoder; CLIP, which maps text and images into a shared embedding space via contrastive learning; and Mixture of Experts (MoE), which scales model capacity without a proportional increase in compute by routing each token to a small subset of specialized expert networks.

1. Beyond Language — Multimodal Models
2. The Vision Transformer (ViT)
3. CLIP: Connecting Text and Images
4. Mixture of Experts