Transformers as Generative Models

We covered transformers exhaustively in the NLP chapters, so we won't rebuild them here. But it would be strange to talk about generative deep learning without acknowledging that transformers are, in fact, generative models. When you give a transformer a prompt like "What is the capital of Germany?", the model emits tokens one at a time — "The," "capital," "of," "Germany," "is," "Berlin" — feeding each new token back into the model to predict the next one. That's generation.

Most large language models you've ever interacted with — GPT, Claude, Gemini — are transformer-based models doing exactly this. The transformer isn't a special architecture for generative tasks; it's a general-purpose sequence model that happens to be extremely good at generation.
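The token-by-token loop described above can be sketched in a few lines. This is a minimal illustration, not a real model: the lookup table `NEXT_TOKEN_SCORES`, the function `next_token_scores`, and the `<prompt>` and `<eos>` markers are all hypothetical stand-ins for a trained transformer and its tokenizer. Only the surrounding loop, where each new token is appended and fed back in, reflects how generation actually works.

```python
# Toy stand-in for a trained transformer (an assumption for illustration):
# a hand-written table mapping the most recent token to scores for the next one.
# A real model would condition on the whole sequence, not just the last token.
NEXT_TOKEN_SCORES = {
    "<prompt>": {"The": 1.0},
    "The": {"capital": 0.9, "city": 0.1},
    "capital": {"of": 1.0},
    "of": {"Germany": 0.8, "France": 0.2},
    "Germany": {"is": 1.0},
    "is": {"Berlin": 0.9, "Paris": 0.1},
    "Berlin": {"<eos>": 1.0},
}


def next_token_scores(tokens):
    """Hypothetical model call: score every candidate next token."""
    return NEXT_TOKEN_SCORES[tokens[-1]]


def generate(prompt_tokens, max_new_tokens=10):
    """Autoregressive generation with greedy decoding."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        scores = next_token_scores(tokens)
        best = max(scores, key=scores.get)  # greedy: pick the top-scoring token
        if best == "<eos>":                 # stop at the end-of-sequence marker
            break
        tokens.append(best)                 # feed the new token back as input
    return tokens[len(prompt_tokens):]      # return only the generated part


generated = generate(["<prompt>"])
print(" ".join(generated))
```

Greedy decoding is used here only to keep the sketch deterministic. Deployed models usually sample from the predicted distribution instead of always taking the top-scoring token.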

Three properties that make transformers dominant in generative work

  1. Attention mechanism. Self-attention captures long-range dependencies because the output at any position can attend directly to any input position, instead of relying on information passed along one step at a time.
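  2. Parallelism. Transformers process every position of a sequence in parallel during training, which makes them dramatically faster to train at scale than models that must step through a sequence in order. Generation itself still proceeds one token at a time.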
  3. Transfer learning. A transformer pretrained on huge amounts of general data can be fine-tuned with a comparatively small amount of task-specific data and still perform well on the new task. This is the foundation of how essentially every modern LLM gets deployed.
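To make the first two properties concrete, here is a rough sketch of scaled dot-product self-attention in plain Python. It makes some simplifying assumptions: the toy embeddings in `x` are made up, and they are used directly as queries, keys, and values, whereas a real transformer would first apply learned projection matrices. The function names are illustrative and not taken from any library.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, weights = [], []
    # Each iteration depends only on its own query, so in practice all
    # positions are computed at once as a single matrix multiplication.
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        w = softmax(scores)  # how strongly this position attends to each other one
        out = [
            sum(wj * v[c] for wj, v in zip(w, values))
            for c in range(len(values[0]))
        ]
        weights.append(w)
        outputs.append(out)
    return outputs, weights


# Three toy token embeddings (made-up values, for illustration only).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(x, x, x)
for row in weights:
    print([round(w, 3) for w in row])
```

Every row of `weights` has a nonzero entry for every position, which is the direct access described in the first property. Because no row depends on another, the whole computation parallelizes across the sequence, which is the second property.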

Transformers Beyond Text

The transformer's contribution to the generative AI landscape extends well beyond language. In the next section on diffusion models, you'll see transformers appear as the text encoder that bridges language and images. The same attention mechanism that learns word context in BERT also learns to align the semantic content of a text prompt with the visual features of a generated image.

Transformers are not perfect. The authors of "Attention Is All You Need" who spoke at NVIDIA GTC in March 2024 agreed that transformers are not the end of the line. There are architectures yet to be developed that will outperform them. Maybe one of you will design that architecture. For now, transformers remain the workhorse of generative text, and a key building block of generative images too.

💭Reflection

Transformers are the current dominant architecture for generative AI. What limitations of transformers might future architectures address? Think about what you know about how transformers work and where they struggle.

Checkpoint

A decoder-only transformer generates text by sampling one token at a time, feeding each generated token back as input. This is called autoregressive generation. What is the key structural feature that prevents the model from 'cheating' by looking at future tokens during training?