Chapter 6
Attention and Transformers
Static word embeddings assign every word a single vector regardless of context, so they cannot distinguish the 'bank' in 'bank of the river' from the 'bank' in 'deposited money at the bank.' Self-attention addresses this by computing each word's representation as a weighted blend of the representations of all words in the sequence, the word itself included. This chapter builds self-attention from scratch (dot product → softmax → weighted sum), introduces the Q/K/V parameterization that makes it trainable, and extends it to multi-head and cross-attention. It then assembles the full transformer architecture from the 2017 paper 'Attention Is All You Need': encoder, decoder, positional encodings, and masking, followed by three popular types of transformers (BERT, GPT, T5).
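As a preview of the dot product → softmax → weighted sum pipeline, here is a minimal sketch in plain Python. It assumes the simplest possible setting: no learned Q/K/V weights, no scaling of the scores, and a single sequence. The function names (`softmax`, `dot`, `self_attention`) and the toy embedding values are illustrative choices, not taken from the paper or from any library.

```python
import math

def softmax(scores):
    # Subtract the max before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    # Plain dot product of two equal-length vectors.
    return sum(a * b for a, b in zip(u, v))

def self_attention(vectors):
    """Unparameterized self-attention over a list of equal-length vectors."""
    outputs = []
    all_weights = []
    for query in vectors:
        # 1. Dot product: similarity of this token to every token, itself included.
        scores = [dot(query, other) for other in vectors]
        # 2. Softmax: turn the scores into weights that sum to 1.
        weights = softmax(scores)
        # 3. Weighted sum: blend all token vectors using those weights.
        blended = [
            sum(w * vec[i] for w, vec in zip(weights, vectors))
            for i in range(len(query))
        ]
        outputs.append(blended)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical toy embeddings for a 3-token sequence (made-up values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contextual, attn = self_attention(tokens)

for row in attn:
    print([round(w, 3) for w in row])
```

Each row of `attn` holds one token's weights over the whole sequence, and each vector in `contextual` is the corresponding blend. Because nothing here is learned, the weights depend only on the raw embeddings, which is the limitation the Q/K/V parameterization is meant to remove.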