Chapter 4

Transformers in Generative Systems

Transformers weren't designed exclusively for generation, but they've become its backbone, from decoder-only LLMs (GPT, Claude, Gemini) to the text encoders inside image diffusion models. This chapter revisits the three properties that make transformers so effective (attention, parallelism, transfer learning), reviews causal masking and autoregressive decoding, and previews how the transformer text encoder serves as a key component of the diffusion model architecture covered in the next chapter.

1. Transformers as Generative Models