Chapter 6

Attention and Transformers

Static word embeddings give every word one vector regardless of context, so they cannot distinguish 'bank of the river' from 'deposited money at the bank.' Self-attention addresses this by computing each word's representation as a weighted blend of every word in the sequence, the word itself included. This chapter builds self-attention from scratch (dot product → softmax → weighted sum), introduces the Q/K/V parameterization that makes it trainable, and extends it to multi-head and cross-attention. It then assembles the full transformer architecture from the 2017 paper 'Attention Is All You Need': encoder, decoder, positional encodings, and masking. It closes with three popular transformer families: encoder-only (BERT), decoder-only (GPT), and encoder-decoder (T5).

1. The Problem We Haven't Solved
2. Self-Attention as Reweighting
3. Queries, Keys, Values, and Multi-Head Attention
4. The Transformer Architecture
5. Transformers in Practice
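
As a preview of the dot product → softmax → weighted sum pipeline named in the summary, here is a minimal sketch of parameter-free self-attention in plain Python. The function names (`dot`, `softmax`, `self_attention`) and the toy two-dimensional vectors are illustrative assumptions, not the chapter's own code, and the sketch deliberately omits the Q/K/V projections and scaling that later sections introduce.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors (the similarity score)."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    largest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Parameter-free self-attention over a list of word vectors.

    For each word: score it against every word (itself included),
    normalize the scores with softmax, then blend all vectors
    using those weights.
    """
    outputs, all_weights = [], []
    for query in vectors:
        scores = [dot(query, other) for other in vectors]
        weights = softmax(scores)
        blended = [
            sum(w * vec[i] for w, vec in zip(weights, vectors))
            for i in range(len(query))
        ]
        outputs.append(blended)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical toy "embeddings" for a three-word sequence.
toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contextual, weights = self_attention(toy)
```

Each row of `weights` sums to 1, and each entry of `contextual` is a context-dependent mixture of the original vectors: the same input vector would receive a different output if its neighbors changed, which is the property static embeddings lack.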