Self-Attention as Reweighting

Think of self-attention as a procedure for improving each word's embedding by mixing in information from the other words in its sequence.

Self-Attention: Step-by-Step (1 / 8)

Token embeddings

Each word maps to a fixed vector

We start with the sentence "bank of the river". An embedding table maps each word to a fixed-length vector — v₁, v₂, v₃, v₄. These are static: the same vector for "bank" regardless of whether it means a riverbank or a financial institution. The goal of self-attention is to improve these embeddings by mixing in context from the surrounding words.
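A static embedding lookup can be sketched in a few lines. The table name `EMBEDDINGS`, the helper `embed`, and the 3-dimensional values are all made up for illustration; a real embedding table is learned and has far more dimensions.

```python
# Hypothetical toy embedding table: the numbers are invented for illustration.
EMBEDDINGS = {
    "bank":  [1.0, 0.2, 0.0],
    "of":    [0.0, 0.1, 0.3],
    "the":   [0.1, 0.0, 0.2],
    "river": [0.8, 0.9, 0.1],
}

def embed(sentence):
    """Look up one fixed vector per word. The surrounding words are ignored."""
    return [EMBEDDINGS[word] for word in sentence.split()]

v1, v2, v3, v4 = embed("bank of the river")

# "bank" gets exactly the same vector in a different sentence,
# which is the limitation self-attention is meant to address.
bank_elsewhere = embed("the bank")[1]
is_static = (v1 == bank_elsewhere)
```

Because the lookup never sees the neighboring words, `is_static` is true no matter which sentence "bank" appears in.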

Step through the full self-attention computation — from raw embeddings to contextualized output vectors — one stage at a time.
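Here is a minimal sketch of the whole computation in plain Python, with no libraries so the arithmetic stays visible. It assumes the usual simplified recipe: score each pair of embeddings with a dot product, turn each row of scores into weights with a softmax, then form each output as a weighted average of all the embeddings. The function names and the toy vectors are illustrative assumptions, not taken from the interactive walkthrough.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Return (weights, outputs) for a list of embedding vectors.

    weights[i][j] is how much word i attends to word j.
    outputs[i] is the contextualized vector for word i.
    """
    weights, outputs = [], []
    for v_i in vectors:
        scores = [dot(v_i, v_j) for v_j in vectors]   # compare word i with every word
        w_i = softmax(scores)                         # normalize the scores into weights
        y_i = [sum(w * v_j[d] for w, v_j in zip(w_i, vectors))
               for d in range(len(v_i))]              # weighted average of all embeddings
        weights.append(w_i)
        outputs.append(y_i)
    return weights, outputs

# Made-up 3-dimensional embeddings for "bank of the river".
vectors = [
    [1.0, 0.2, 0.0],  # bank
    [0.0, 0.1, 0.3],  # of
    [0.1, 0.0, 0.2],  # the
    [0.8, 0.9, 0.1],  # river
]
weights, outputs = self_attention(vectors)
```

Each row of `weights` sums to 1, and each entry of `outputs` has the same length as the input embeddings, so the result can stand in for the original vectors.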

⚠ weights ≠ trainable weights

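The "weights" computed by attention are not trainable parameters. They're computed live, on every forward pass, from the data. The word is the same one used for the weights of a linear layer, but the two are different things: a linear layer's weights are learned in training and then stay fixed, while attention weights are recomputed for every input sequence.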
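To make the distinction concrete, here is a small sketch, assuming dot-product scores followed by a softmax. The function `attention_weights` stores nothing and learns nothing; the 2-dimensional inputs are made up for illustration.

```python
import math

def attention_weights(vectors):
    """Row i holds the softmax of v_i's dot products with every vector."""
    rows = []
    for v_i in vectors:
        scores = [sum(a * b for a, b in zip(v_i, v_j)) for v_j in vectors]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows

# The same parameter-free function applied to two different inputs.
weights_a = attention_weights([[1.0, 0.0], [0.0, 1.0]])  # two unrelated vectors
weights_b = attention_weights([[1.0, 0.0], [1.0, 0.0]])  # two identical vectors
```

The two inputs give different weight matrices: identical vectors split the attention evenly, while unrelated ones favor themselves. Nothing in the function was trained or updated between the two calls.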

What we just built has some interesting properties:

Order doesn't matter. Whether river is the second word or the fortieth, the same procedure runs.
Proximity doesn't matter. Distance in the sequence has no effect on whether two words can attend to each other.
Sequence length doesn't matter. Short or long — same operation.
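These properties can be checked directly. The sketch below again assumes dot-product scores with a softmax, and uses made-up 2-dimensional vectors: shuffling the input only shuffles the outputs in the same way, and the same function runs unchanged on a shorter sequence.

```python
import math

def self_attention(vectors):
    """Contextualized outputs: each one is a weighted average of all inputs."""
    outputs = []
    for v_i in vectors:
        scores = [sum(a * b for a, b in zip(v_i, v_j)) for v_j in vectors]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([sum(w * v_j[d] for w, v_j in zip(weights, vectors))
                        for d in range(len(v_i))])
    return outputs

def allclose(a, b):
    """Compare two lists of vectors up to floating-point rounding."""
    return all(math.isclose(x, y, abs_tol=1e-12)
               for row_a, row_b in zip(a, b)
               for x, y in zip(row_a, row_b))

vectors = [[1.0, 0.2], [0.0, 0.3], [0.8, 0.9]]    # made-up embeddings
shuffled = [vectors[2], vectors[0], vectors[1]]   # same words, different order

out = self_attention(vectors)
out_shuffled = self_attention(shuffled)

# Each word gets the same output vector wherever it sits in the sequence.
order_blind = allclose(out_shuffled, [out[2], out[0], out[1]])

# The same function handles a shorter sequence without any changes.
out_short = self_attention(vectors[:2])
```

The comparison uses a small tolerance because summing the same terms in a different order can change the last few bits of a floating-point result.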

For long-range dependencies, exactly the case where RNNs choked, this is a big advantage: the first and last words of a long sequence interact as directly as two neighbors. The flip side: word order really does matter for language, and we've thrown it away. We'll patch this with positional encodings in the transformer section.

Self-Attention Visualizer

Input sentence

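Attention matrix (hover a row to highlight a word's attention distribution; each row lists one word's weights over the sentence)

Column labels: bank, the, river
bank: 0.38, 0.11, 0.40, 0.11
(row label missing): 0.56, 0.16, 0.17
the: 0.07, 0.09, 0.77, 0.07
river: 0.35, 0.14, 0.10, 0.41

Color scale: low attention → high attention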

Note: These weights are computed from simplified embeddings for illustration. A real transformer learns Q/K/V projections that specialize each head for different semantic relationships.

Enter a short sentence and see the attention weights computed between every word pair. Hover over a word to see what it attends to.
