Transformers in Practice
Three Transformer Types
Here are three popular transformer types you may encounter:
- Encoder-only (BERT). Produces contextualized representations of input text. Doesn't generate — it interprets. Used for classification, NER, similarity, information extraction.
- Decoder-only (GPT). Generates text autoregressively, one token at a time, each conditioned on all previous tokens. Used for text generation and completion.
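- Encoder-decoder (T5, BART). The full original architecture: an encoder reads the input, and a decoder generates the output while attending to the encoder's representations. Used for machine translation, summarization, and any "text-to-text" task where input and output are both variable-length sequences.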
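One way to make the difference between the three types concrete is to look at which positions each token is allowed to attend to. The sketch below is a simplified illustration, not how any particular library builds its masks; the function name `attention_mask` and the sequence lengths are assumptions chosen for the example.

```python
def attention_mask(query_len, key_len, causal):
    """Return a query_len x key_len matrix: 1 = may attend, 0 = blocked."""
    return [
        [1 if (not causal or k <= q) else 0 for k in range(key_len)]
        for q in range(query_len)
    ]

# Encoder-only (BERT-style): every token sees every other token.
encoder_mask = attention_mask(4, 4, causal=False)

# Decoder-only (GPT-style): token i sees only tokens 0..i.
decoder_mask = attention_mask(4, 4, causal=True)

# Encoder-decoder (T5/BART-style): the decoder keeps a causal mask over its
# own output and adds cross-attention, where each of 3 output positions
# sees all 4 encoder positions.
cross_mask = attention_mask(3, 4, causal=False)

for row in decoder_mask:
    print(row)
```

The lower-triangular pattern printed for `decoder_mask` is what "conditioned on all previous tokens" looks like in practice: no position can look ahead.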
Transformer Architectures, Side by Side
[Interactive figure: BERT, T5, and GPT compared layer by layer, showing how each handles the same fundamental pieces (embeddings, attention, and feed-forward) differently.]
Inference, End-to-End
When the transformer encounters "orange" in "I ate an orange," attention blends in context about food and eating, and the resulting embedding lands near other fruits. In "The sunset is orange," the same word blends with color context and lands near other color words. Same word. Different embeddings. Determined by context. This is what finally resolves the classic ambiguity between "bank of the river" and "deposited money at the bank."
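The blending described above can be sketched with a toy single-head self-attention step. Everything here is an assumption made for illustration: the 2-D embeddings in `EMBED` are hand-picked (axis 0 loosely means "food", axis 1 "color"), and the query, key, and value projections are the identity, whereas a real model learns high-dimensional embeddings and projection matrices.

```python
import math

# Hypothetical 2-D embeddings: axis 0 ~ "food", axis 1 ~ "color".
EMBED = {
    "i": [0.0, 0.0], "an": [0.0, 0.0], "the": [0.0, 0.0], "is": [0.0, 0.0],
    "ate": [2.0, 0.0],
    "sunset": [0.0, 2.0],
    "orange": [1.0, 1.0],  # ambiguous until context is mixed in
}

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def contextual_vector(tokens, index):
    """Self-attention output for tokens[index], with identity Q/K/V projections."""
    vecs = [EMBED[t] for t in tokens]
    query = vecs[index]
    scale = math.sqrt(len(query))
    # Scaled dot-product score between the query and every token in the sentence.
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in vecs]
    weights = softmax(scores)
    # Weighted average of the value vectors.
    return [sum(w * v[d] for w, v in zip(weights, vecs)) for d in range(len(query))]

food = contextual_vector(["i", "ate", "an", "orange"], 3)
color = contextual_vector(["the", "sunset", "is", "orange"], 3)
print("orange after 'ate':   ", food)
print("orange after 'sunset':", color)
```

In the first sentence the output vector for "orange" leans toward the food axis, and in the second toward the color axis, even though the input embedding for "orange" was identical in both.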
Step 1 — Tokenize
Split raw text into discrete token IDs
The model never sees raw text. A tokenizer splits the input string into subword units, called tokens, and maps each one to an integer ID from a fixed vocabulary. For "I ate an orange" a subword tokenizer (BPE or WordPiece, for example) might produce four tokens, one per word. Longer or rarer words split into multiple subword pieces (e.g., un + ##likely in WordPiece notation, where ## marks a piece that continues a word). The vocabulary size is typically 30K–100K.
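As a rough sketch of the idea, the snippet below splits words by greedy longest-match against a vocabulary, which is the WordPiece-style approach that the `##` continuation marker comes from (BPE reaches a similar result by applying learned merge rules instead). The tiny `VOCAB` table and its IDs are made up for this example; a real tokenizer's vocabulary is learned from data.

```python
# Hypothetical toy vocabulary; real vocabularies hold tens of thousands of entries.
VOCAB = {"[UNK]": 0, "i": 1, "ate": 2, "an": 3, "orange": 4, "un": 5, "##likely": 6}

def tokenize_word(word):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in VOCAB:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matches: treat the whole word as unknown
        pieces.append(piece)
        start = end
    return pieces

def encode(text):
    """Lowercase, split on whitespace, and map each subword piece to its ID."""
    tokens = [p for w in text.lower().split() for p in tokenize_word(w)]
    return tokens, [VOCAB[t] for t in tokens]

print(encode("I ate an orange"))
print(encode("unlikely"))
```

With this vocabulary, the sentence yields four tokens, one per word, while "unlikely" splits into the two pieces `un` and `##likely`.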
[Interactive walkthrough: the remaining stages of the inference pipeline, from raw text to generated output.]
Why Transformers Have Token Limits
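Standard self-attention has O(n²) time and memory cost in the sequence length n, because every token is scored against every other token. Double the context, quadruple the memory and compute. Mitigations exist: sparse attention (Longformer, BigBird) reduces the number of token pairs that are scored, while gradient checkpointing and mixed-precision training lower the memory footprint without changing the quadratic scaling. Context windows have grown from ~2K tokens in 2020 to 1M+ in 2025. But the fundamental constraint hasn't gone away.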
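A back-of-the-envelope calculation shows the quadratic growth. This is a sketch under stated assumptions: the attention score matrix is fully materialized for one sequence in one layer, with 12 heads and 2 bytes per value (16-bit floats). The function name and default parameters are illustrative and not taken from any specific model.

```python
def attention_score_bytes(seq_len, num_heads=12, bytes_per_value=2):
    """Memory for one layer's full attention score matrices (one sequence)."""
    # Each head stores a seq_len x seq_len matrix of scores.
    return num_heads * seq_len * seq_len * bytes_per_value

for n in (2_048, 4_096, 8_192):
    mib = attention_score_bytes(n) / 2**20
    print(f"{n:>6} tokens -> {mib:,.0f} MiB per layer")
```

Under these assumptions, 2,048 tokens need 96 MiB per layer, and each doubling of the context multiplies that figure by four.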
You are building a model that takes a customer support email and produces a one-sentence summary, potentially using words not in the original. Which transformer type is most appropriate?