Self-Attention as Reweighting

Think of self-attention as a procedure for improving each word's embedding by mixing in information from the other words in its sequence.

Self-Attention: Step-by-Step (step 1 of 8)

Token embeddings: each word maps to a fixed vector

We start with the sentence "bank of the river". An embedding table maps each word to a fixed-length vector — v₁, v₂, v₃, v₄. These are static: the same vector for "bank" regardless of whether it means a riverbank or a financial institution. The goal of self-attention is to improve these embeddings by mixing in context from the surrounding words.
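A minimal sketch of that lookup, for concreteness. The table name, the `embed` helper, the three-dimensional size, and every number in it are made up for illustration; a real embedding table is learned and much larger.

```python
# Hypothetical 3-dimensional embedding table; all values are invented.
EMBEDDING_TABLE = {
    "bank":  [0.9, 0.1, 0.3],
    "of":    [0.0, 0.8, 0.1],
    "the":   [0.1, 0.2, 0.9],
    "river": [0.8, 0.0, 0.4],
}

def embed(sentence):
    """Look up one fixed vector per word; the surrounding words are never consulted."""
    return [EMBEDDING_TABLE[word] for word in sentence.split()]

v1, v2, v3, v4 = embed("bank of the river")

# Static lookup: "bank" gets the same vector in a different sentence.
same_vector = embed("river bank")[1] == v1
```

Because the lookup ignores context, `same_vector` is true no matter which words surround "bank". That is the limitation self-attention is meant to address.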

Step through the full self-attention computation — from raw embeddings to contextualized output vectors — one stage at a time.

Attention weights ≠ trainable weights

The "weights" computed by attention are not trainable parameters. They're computed live, every forward pass, from the data. They share a name with the weights of a linear layer, but they are a different kind of object: a linear layer's weights are stored and updated by training, while attention weights are recomputed from each input.
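One way to see this is to write the reweighting as a plain function of its input. The sketch below assumes dot products as the similarity score and a softmax to normalize each row, which is the usual simplified formulation; the function names and the toy vectors are illustrative only.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_weights(vectors):
    """Row i holds the weights word i assigns to every word in the sequence."""
    return [softmax([dot(vi, vj) for vj in vectors]) for vi in vectors]

def self_attention(vectors):
    """Replace each vector with a weighted average of all vectors."""
    dim = len(vectors[0])
    return [
        [sum(w * v[d] for w, v in zip(row, vectors)) for d in range(dim)]
        for row in attention_weights(vectors)
    ]

# Made-up embeddings for two different inputs.
sentence_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sentence_b = [[2.0, 0.0], [0.0, 0.5], [1.0, 1.0]]

weights_a = attention_weights(sentence_a)
weights_b = attention_weights(sentence_b)
# Nothing was stored or trained, yet the two weight matrices differ,
# because they are derived entirely from the input vectors.
```

The functions hold no state, so there is nothing for an optimizer to update in this simplified version.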

What we just built has some interesting properties:

  • Order doesn't matter. Whether river is the second word or the fortieth, the same procedure runs, and given the same set of words it produces the same output vector for river.
  • Proximity doesn't matter. Distance in the sequence has no effect on how strongly two words attend to each other.
  • Sequence length doesn't matter. Short or long, the operation is the same.
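These three properties can be checked directly on a parameter-free version of the procedure. The sketch below assumes dot-product scores and softmax normalization; the `self_attention` and `close` helpers and the toy vectors are illustrative, not taken from the visualizer.

```python
import math

def self_attention(vectors):
    """Parameter-free self-attention: softmax of dot products, then a weighted average."""
    outputs = []
    for vi in vectors:
        scores = [sum(a * b for a, b in zip(vi, vj)) for vj in vectors]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append(
            [sum(w * vj[d] for w, vj in zip(weights, vectors)) for d in range(len(vi))]
        )
    return outputs

def close(a, b, tol=1e-9):
    """Compare two vectors up to floating-point rounding."""
    return all(abs(x - y) < tol for x, y in zip(a, b))

vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
original = self_attention(vectors)

# Order: reversing the input just reverses the output rows.
reversed_out = self_attention(vectors[::-1])
order_ignored = all(close(r, o) for r, o in zip(reversed_out, original[::-1]))

# Length: the same function handles 2 words or 40 words.
short_out = self_attention(vectors[:2])
long_out = self_attention(vectors * 10)

# Proximity: identical words at positions 0 and 36 get identical outputs.
distance_ignored = close(long_out[0], long_out[36])
```

Both `order_ignored` and `distance_ignored` come out true, which is the formal version of "we've thrown position away".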

For long-range dependencies, exactly the case where RNNs choked, this is a big win: the first and last words of a sequence interact as directly as two neighbors do. The flip side: order really does matter for language, and we've thrown it away. We'll patch this with positional encodings in the transformer section.

Self-Attention Visualizer

Attention matrix: each row is one word's attention distribution over the sentence

         bank    of      the     river
bank     0.38    0.11    0.11    0.40
of       0.11    0.56    0.16    0.17
the      0.07    0.09    0.77    0.07
river    0.35    0.14    0.10    0.41
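To read the matrix programmatically, a short sketch like the following may help. It copies the values shown above; the variable names are illustrative.

```python
words = ["bank", "of", "the", "river"]
attention = [
    [0.38, 0.11, 0.11, 0.40],  # bank
    [0.11, 0.56, 0.16, 0.17],  # of
    [0.07, 0.09, 0.77, 0.07],  # the
    [0.35, 0.14, 0.10, 0.41],  # river
]

# Each row is a distribution, so it should sum to 1 up to display rounding.
row_sums = [sum(row) for row in attention]

# For every word, find the word it attends to most strongly.
strongest = {
    word: words[row.index(max(row))]
    for word, row in zip(words, attention)
}
```

In this matrix only bank puts its largest weight on a different word (river, 0.40); the other three words attend most strongly to themselves.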

Note: These weights are computed from simplified embeddings for illustration. A real transformer learns Q/K/V projections that specialize each head for different semantic relationships.
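For contrast with the simplified version, here is a rough sketch of a single head with Q/K/V projections. It assumes the standard scaled dot-product form, which this section does not spell out. The matrices are random stand-ins for learned parameters, and the dimensions, helper names, and embedding values are all hypothetical.

```python
import math
import random

def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols, rng):
    """Stand-in for a learned projection; real values would come from training."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def attention_head(vectors, w_q, w_k, w_v):
    """Project each vector to a query, key, and value, then mix the values."""
    queries = [matvec(w_q, v) for v in vectors]
    keys = [matvec(w_k, v) for v in vectors]
    values = [matvec(w_v, v) for v in vectors]
    scale = math.sqrt(len(keys[0]))  # scale scores by sqrt(key dimension)
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)  # still computed live from the data
        outputs.append(
            [sum(w * val[d] for w, val in zip(weights, values))
             for d in range(len(values[0]))]
        )
    return outputs

rng = random.Random(0)
d_model, d_head = 3, 2  # hypothetical sizes
w_q, w_k, w_v = (random_matrix(d_head, d_model, rng) for _ in range(3))

# Made-up embeddings for "bank of the river".
vectors = [[0.9, 0.1, 0.3], [0.0, 0.8, 0.1], [0.1, 0.2, 0.9], [0.8, 0.0, 0.4]]
contextualized = attention_head(vectors, w_q, w_k, w_v)
```

Here the trainable parameters are `w_q`, `w_k`, and `w_v`. The attention weights themselves are still recomputed for every input.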

Enter a short sentence and see the attention weights computed between every word pair. Hover over a word to see what it attends to.