Transformers in Practice

Three Transformer Types

Here are three popular transformer types you may encounter:

  • Encoder-only (BERT). Produces contextualized representations of input text. Doesn't generate — it interprets. Used for classification, NER, similarity, information extraction.
  • Decoder-only (GPT). Generates text autoregressively, one token at a time, each conditioned on all previous tokens. Used for text generation and completion.
  • Encoder-decoder (T5, BART). The full architecture. Used for machine translation, summarization, and any "text-to-text" task where input and output are both variable-length sequences.
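
One way to see the difference between these types is the attention mask each one uses. The sketch below is illustrative only: the function names are hypothetical, and real models build these masks as tensors inside the attention layers. An encoder lets every position see every other position, while a decoder lets position i see only positions 0 through i.

```python
def bidirectional_mask(n):
    """Encoder-style mask: every position may attend to every position."""
    return [[True] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style mask: position i may attend only to positions 0..i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def show(mask):
    """Render a mask as rows of 1s (visible) and 0s (hidden)."""
    return "\n".join(" ".join("1" if cell else "0" for cell in row) for row in mask)


if __name__ == "__main__":
    print("Encoder-only (bidirectional):")
    print(show(bidirectional_mask(4)))
    print("Decoder-only (causal):")
    print(show(causal_mask(4)))
```

An encoder-decoder model combines the two: the encoder side uses the bidirectional pattern, and the decoder side uses the causal pattern plus cross-attention over the encoder's output.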

Transformer Architectures, Side by Side

BERT · T5 · GPT

  • Encoder-only (BERT). Bidirectional encoder for understanding tasks. Input → contextual representations. Encoder block × 12 (base) or × 24 (large).
  • Encoder-decoder (T5). Reads with the encoder, writes with the decoder. Every task is text-in, text-out. Encoder block × N layers, then decoder block × N layers.
  • Decoder-only (GPT). A causal stack: predict the next token, then feed it back in. Decoder block × 12, scaling to × 96 or more in larger models.

BERT input: token + segment + position embeddings. Three embeddings are summed per token: the WordPiece token embedding, a segment embedding marking sentence A vs. B (for sentence-pair tasks), and a learned absolute position embedding capped at 512 positions.

BERT, T5, and GPT share the same fundamental pieces (embeddings, attention, and feed-forward layers) but handle them differently.
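
The BERT input layer described above (token, segment, and position embeddings summed per token) can be sketched in a few lines. This is a toy illustration, not BERT's actual code: the tables hold random values rather than learned weights, the hidden size and vocabulary size are made up, and the layer normalization and dropout that the real model applies after the sum are omitted.

```python
import random

HIDDEN = 4            # toy size (assumption); BERT-base uses 768
VOCAB_SIZE = 10       # toy vocabulary size (assumption)
MAX_POSITIONS = 512   # the position cap described in the text

rng = random.Random(0)


def make_table(rows, cols):
    """Stand-in for a learned embedding table, filled with random values."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


token_table = make_table(VOCAB_SIZE, HIDDEN)
segment_table = make_table(2, HIDDEN)          # sentence A = 0, sentence B = 1
position_table = make_table(MAX_POSITIONS, HIDDEN)


def embed(token_ids, segment_ids):
    """Return one summed embedding vector per input token."""
    if len(token_ids) != len(segment_ids):
        raise ValueError("token_ids and segment_ids must have the same length")
    if len(token_ids) > MAX_POSITIONS:
        raise ValueError("sequence exceeds the 512-position limit")
    vectors = []
    for pos, (tok, seg) in enumerate(zip(token_ids, segment_ids)):
        vectors.append([
            t + s + p
            for t, s, p in zip(token_table[tok], segment_table[seg], position_table[pos])
        ])
    return vectors


if __name__ == "__main__":
    # The same token ID (1) appears at position 0 in segment A and position 2 in segment B.
    for vec in embed([1, 2, 1], [0, 0, 1]):
        print([round(x, 3) for x in vec])
```

Because the position table is a fixed-size lookup, a sequence longer than 512 tokens has no row to read from, which is why the cap is a hard limit in this design.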

Inference, End-to-End

When the transformer encounters "orange" in "I ate an orange," attention blends in context about food and eating, so the embedding lands near other fruits. In "The sunset is orange," the same word blends with color context and lands near other color words. Same word. Different embeddings. Determined by context. This is how contextual embeddings resolve the classic ambiguity that static word embeddings could not: "bank" in "bank of the river" versus "deposited money at the bank."
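
A toy calculation makes the "same word, different embeddings" point concrete. The sketch below is heavily simplified: the 2-D vectors are invented for illustration (axis 0 loosely means "food", axis 1 loosely means "color"), and the attention step uses the inputs directly as queries, keys, and values, with none of the learned projections or multiple heads of a real transformer.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def self_attention(vectors):
    """Scaled dot-product self-attention with Q = K = V = the input vectors."""
    d = len(vectors[0])
    outputs = []
    for q in vectors:
        scores = [dot(q, k) / math.sqrt(d) for k in vectors]
        weights = softmax(scores)
        outputs.append([sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(d)])
    return outputs


# Hypothetical static embeddings: one fixed vector per word, regardless of context.
STATIC = {
    "ate": [2.0, 0.0],      # mostly "food"
    "sunset": [0.0, 2.0],   # mostly "color"
    "orange": [1.0, 1.0],   # ambiguous on its own
}

# Contextual vector for "orange" (the last token) in two different contexts.
food_ctx = self_attention([STATIC["ate"], STATIC["orange"]])[-1]
color_ctx = self_attention([STATIC["sunset"], STATIC["orange"]])[-1]

if __name__ == "__main__":
    print("orange after 'ate':   ", food_ctx)
    print("orange after 'sunset':", color_ctx)
```

The static vector for "orange" is identical in both runs. Only the neighboring token changes, and that alone pulls the output toward the food axis in one case and the color axis in the other.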

Transformer Inference: End-to-End (7 steps)
1. Tokenize
2. Embed
3. Pos Enc
4. Encode
5. Decode
6. Softmax
7. Output

Step 1 — Tokenize

Split raw text into discrete token IDs

"I ate an orange" → tokenize → I | ate | an | orange

The model never sees raw text. A tokenizer splits the input string into subword units (tokens) and maps each one to an integer ID from a fixed vocabulary. For "I ate an orange" a subword tokenizer such as BPE might produce four tokens, one per word. Longer or rarer words split into multiple subword pieces (e.g., un + ##likely in WordPiece notation, where ## marks a piece that continues a word). The vocabulary size is typically 30K–100K, though some recent models use larger vocabularies.
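
The splitting step can be sketched with a greedy longest-match tokenizer in the WordPiece style. Everything here is an assumption for illustration: the seven-entry vocabulary is hand-written, whereas real tokenizers learn tens of thousands of entries from data, and the function names are hypothetical rather than any library's API.

```python
# Hypothetical toy vocabulary; "##" marks a piece that continues a word.
VOCAB = {
    "[UNK]": 0,
    "i": 1,
    "ate": 2,
    "an": 3,
    "orange": 4,
    "un": 5,
    "##likely": 6,
}


def tokenize_word(word):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in VOCAB:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def tokenize(text):
    """Lowercase, split on whitespace, then split each word into pieces."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(tokenize_word(word))
    return tokens


def encode(text):
    """Map text to the integer IDs the model actually receives."""
    return [VOCAB[token] for token in tokenize(text)]


if __name__ == "__main__":
    print(tokenize("I ate an orange"), encode("I ate an orange"))
    print(tokenize("unlikely"), encode("unlikely"))
```

With this vocabulary, the four common words each map to a single ID, while "unlikely" has no entry of its own and falls back to two pieces.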


Why Transformers Have Token Limits

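Self-attention has O(n²) time and memory complexity in sequence length n, because every token attends to every other token. Double the context, quadruple the memory and compute for attention. Mitigations exist: sparse attention (Longformer, BigBird) reduces the number of token pairs that are scored, while gradient checkpointing and mixed-precision training lower the memory cost without changing the quadratic scaling. Context windows have grown from ~2K tokens in 2020 to 1M+ in 2025. But the fundamental constraint hasn't gone away.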
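
The quadratic growth is easy to check with back-of-the-envelope arithmetic. The sketch below assumes a naive implementation that stores the full n × n score matrix for every head in one layer; the head count (12) and 2 bytes per value (16-bit floats) are illustrative defaults, not figures from any particular model. Fused attention kernels avoid storing the full matrix, though the amount of computation still grows quadratically.

```python
def attention_score_bytes(seq_len, num_heads=12, bytes_per_value=2):
    """Bytes needed to hold one layer's full attention score matrices.

    Assumes a naive implementation: one seq_len x seq_len matrix per head.
    """
    return num_heads * seq_len * seq_len * bytes_per_value


if __name__ == "__main__":
    for n in (2048, 4096, 8192):
        gib = attention_score_bytes(n) / 2**30
        print(f"{n:>6} tokens -> {gib:.3f} GiB of attention scores per layer")
```

Each doubling of `seq_len` multiplies the result by four, which is the "double the context, quadruple the cost" relationship in concrete terms.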

Checkpoint

You are building a model that takes a customer support email and produces a one-sentence summary, potentially using words not in the original. Which transformer type is most appropriate?