Chapter 9
Multimodal Models
The transformer's sequence-of-tokens insight extends far beyond language. This chapter covers the Vision Transformer (ViT), which tokenizes images as patches to apply a standard transformer encoder; CLIP, which places text and images in a shared embedding space via contrastive learning; and Mixture of Experts (MoE), which scales model capacity without proportional compute cost by routing each token to a small subset of specialized expert networks.