Self-Attention as Reweighting
Think of self-attention as a procedure for improving each word's embedding by mixing in information from the other words in its sequence.
Token embeddings
Each word maps to a fixed vector

We start with the sentence "bank of the river". An embedding table maps each word to a fixed-length vector — v₁, v₂, v₃, v₄. These are static: the same vector for "bank" regardless of whether it means a riverbank or a financial institution. The goal of self-attention is to improve these embeddings by mixing in context from the surrounding words.
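To make "static" concrete, here is a minimal sketch of the lookup. The table `EMBEDDINGS`, the helper `embed`, and the 3-dimensional values are all made up for illustration; real embeddings are learned and much larger.

```python
# Hypothetical embedding table; the numbers are invented for illustration.
EMBEDDINGS = {
    "bank":  [0.9, 0.1, 0.3],
    "of":    [0.1, 0.2, 0.1],
    "the":   [0.2, 0.1, 0.0],
    "river": [0.8, 0.0, 0.6],
}

def embed(sentence):
    """Look up each word's fixed vector. No context is consulted."""
    return [EMBEDDINGS[word] for word in sentence.split()]

v1, v2, v3, v4 = embed("bank of the river")

# "bank" gets the same vector in a different sentence.
same_bank = embed("the bank")[1] == v1
```

Here `same_bank` is true by construction: the lookup never sees the neighboring words, which is the limitation self-attention is meant to address.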
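Attention weights ≠ trainable weights
The "weights" computed by attention are not trainable parameters. They're computed live, every forward pass, from the data. The word is the same one used for the weights in a linear layer, but the two are different things: linear-layer weights are learned during training and then held fixed, while attention weights are recomputed for every input.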
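The sketch below shows one way this could look in code. It assumes the simplest form of self-attention: dot-product similarity between the raw embeddings, a softmax to turn scores into weights, then a weighted sum. The function names and the example vectors are made up for illustration. Note that `self_attention` has no parameters to train; everything is derived from its input.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    # Subtract the max before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Return (weights, outputs) for parameter-free dot-product self-attention."""
    # One row of weights per word: how much it attends to every word.
    weights = [softmax([dot(q, k) for k in vectors]) for q in vectors]
    dim = len(vectors[0])
    # Each output is a weighted sum of all the input vectors.
    outputs = [
        [sum(w * v[d] for w, v in zip(row, vectors)) for d in range(dim)]
        for row in weights
    ]
    return weights, outputs

# Two hypothetical 4-word inputs that differ only in the last vector.
sentence_a = [[0.9, 0.1, 0.3], [0.1, 0.2, 0.1], [0.2, 0.1, 0.0], [0.8, 0.0, 0.6]]
sentence_b = [[0.9, 0.1, 0.3], [0.1, 0.2, 0.1], [0.2, 0.1, 0.0], [0.0, 0.9, 0.2]]

weights_a, outputs_a = self_attention(sentence_a)
weights_b, outputs_b = self_attention(sentence_b)
```

The first word's weights differ between `weights_a` and `weights_b` even though its own embedding is identical in both inputs. Nothing was trained; only the data changed.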
What we just built has some interesting properties:
- Order doesn't matter. Whether "river" is the second word or the fortieth, the same procedure runs.
- Proximity doesn't matter. Distance in the sequence has no effect on whether two words can attend to each other.
- Sequence length doesn't matter. Short or long — same operation.
For long-range dependencies, exactly the case where RNNs choked, this is a big win: two words can attend to each other directly, no matter how far apart they sit. The flip side: order really does matter for language, and we've thrown it away. We'll patch this with positional encodings in the transformer section.
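A quick way to check these properties is to shuffle the input and compare outputs. The sketch below assumes parameter-free dot-product attention with a softmax; the helper names and the vectors are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(vectors):
    """Parameter-free dot-product self-attention; returns the output vectors."""
    dim = len(vectors[0])
    outputs = []
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) for k in vectors]
        row = softmax(scores)
        outputs.append([sum(w * v[d] for w, v in zip(row, vectors)) for d in range(dim)])
    return outputs

def close(a, b):
    # Compare with a tolerance: summing in a different order can change the last bits.
    return all(math.isclose(x, y, abs_tol=1e-12) for x, y in zip(a, b))

# Made-up vectors standing in for a 4-word sentence.
vectors = [[0.9, 0.1, 0.3], [0.1, 0.2, 0.1], [0.2, 0.1, 0.0], [0.8, 0.0, 0.6]]

out = attend(vectors)
out_reversed = attend(vectors[::-1])

# Each word gets the same output vector whether the sentence runs forward or backward.
n = len(vectors)
order_is_ignored = all(close(out[i], out_reversed[n - 1 - i]) for i in range(n))

# The same function runs unchanged on a shorter sequence.
short_out = attend(vectors[:2])
```

`order_is_ignored` comes out true: reversing the words only reverses the outputs. That is exactly the information positional encodings have to put back.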
Attention matrix: each row shows one word's attention distribution.
Note: These weights are computed from simplified embeddings for illustration. A real transformer learns query, key, and value (Q/K/V) projections, which let each attention head specialize in a different kind of relationship between words.
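For comparison, here is a rough sketch of a single head with Q/K/V projections, using the standard scaled dot-product form. The matrices `W_q`, `W_k`, and `W_v` are filled with random numbers as stand-ins for learned parameters, and the sizes (3-dimensional embeddings, a 2-dimensional head) are arbitrary assumptions.

```python
import math
import random

def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(vectors, W_q, W_k, W_v):
    """One head of scaled dot-product attention with Q/K/V projections."""
    queries = [matvec(W_q, x) for x in vectors]
    keys = [matvec(W_k, x) for x in vectors]
    values = [matvec(W_v, x) for x in vectors]
    scale = math.sqrt(len(keys[0]))  # divide scores by sqrt of the key dimension
    weights = [
        softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        for q in queries
    ]
    dim = len(values[0])
    outputs = [
        [sum(w * v[d] for w, v in zip(row, values)) for d in range(dim)]
        for row in weights
    ]
    return weights, outputs

# Random stand-ins for learned projections (hypothetical sizes: 3 -> 2).
rng = random.Random(0)
d_model, d_head = 3, 2
W_q, W_k, W_v = (
    [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_head)]
    for _ in range(3)
)

vectors = [[0.9, 0.1, 0.3], [0.1, 0.2, 0.1], [0.2, 0.1, 0.0], [0.8, 0.0, 0.6]]
head_weights, head_outputs = attention_head(vectors, W_q, W_k, W_v)
```

The attention weights are still computed live from the input. What training adjusts is the three projection matrices, which shape what counts as "similar" for this head.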