4:["$","div",null,{"className":"relative flex h-screen overflow-hidden","children":[["$","$L10",null,{"unit":{"id":"nlp","number":3,"title":"Natural Language Processing","description":"Explore how machines process and generate language, from bag-of-words representations through attention and transformers, to practical implementation, LLMs, RAG, and multimodal models.","chapters":[{"id":"welcome-to-nlp","number":1,"title":"Introduction to NLP","overview":"You are likely using NLP systems every day without even realizing it: search, spam filters, translation, sentiment analysis, and text generation are all NLP. This chapter maps the full landscape of what NLP does, identifies the fundamental challenges that make language so much harder to represent than images, and lays out the historical arc from bag-of-words to large language models that the rest of the unit follows.","sections":[{"number":2,"id":"why-text-is-harder","title":"Why Text Is Harder Than It Looks","blocks":[{"type":"text","html":"$11"},{"type":"text","html":"

So the goal of this entire unit is to take the word bank in a specific context and turn it into a vector of numbers that actually captures what it means there, in that sentence. And as you'll see, we don't solve this in one shot. We chip away at it for decades.

"},{"type":"nlp-challenge-slideshow"},{"type":"checkpoint","id":"nlp-q1-challenges","kind":"mc","question":"A sentiment classifier flags the tweet \"That concert was lowkey fire 🔥\" as neutral or negative. Which NLP challenge best explains this failure?","options":[{"label":"Homonymy","explanation":"Homonymy involves words with distinct dictionary meanings (like 'bank'). 'Fire' and 'lowkey' do have alternate meanings, but the core issue is that these informal usages simply don't appear in formal training data."},{"label":"Slang and colloquialisms","correct":true,"explanation":"\"Lowkey\" (subtly / understated) and \"fire\" (excellent) are informal terms unlikely to appear in formal training corpora with these meanings. The model applies their literal senses and misclassifies the sentiment."},{"label":"Dependence on history","explanation":"Dependence on history means meaning requires prior context. Here the tweet is self-contained — no prior sentences are needed to interpret it."},{"label":"Variable length","explanation":"Variable length refers to inputs of different sizes. The brevity of the tweet isn't the issue; it's that the vocabulary is informal."}]},{"type":"callout","variant":"warning","title":"The Eight Challenges of Representing Text","html":"

Homonyms — \"bank\" means different things in different contexts.
Synonyms — \"sneakers\" and \"running shoes\" mean the same thing.
Dependence on history — meaning depends on prior sentences or context.
Semantic ambiguity — \"I saw the boy on the beach with my binoculars.\" Are you using binoculars? Is he?
Slang and colloquialisms — \"That snowboarder did a sick jump\" doesn't mean the snowboarder is unwell.
Acronyms — even the degree program you might be in (Master of Engineering in AI) is one.
Variable length — a tweet is six words; a contract is sixty thousand. Both are valid inputs.
Sarcasm and humor — good luck.

"}]},{"number":3,"id":"roadmap","title":"The Roadmap","blocks":[{"type":"text","html":"

The roadmap of this unit mirrors how the field itself unfolded. We start with the most naive representations (bag-of-words), graduate to static word embeddings (Word2Vec) that capture some meaning, suffer through the era of recurrent neural networks, and then arrive at attention and transformers — which finally crack the \"bank of the river\" problem by letting context shape the representation on the fly.

"},{"type":"timeline","events":[{"year":"Pre-2013","title":"Bag of Words & TF-IDF","body":"Text represented as sparse count vectors over a vocabulary. No semantics or order, but shockingly effective for classification and search. Still in production today."},{"year":"2013","title":"Word2Vec","body":"Dense, low-dimensional word embeddings trained on co-occurrence. Semantic arithmetic becomes possible: king − man + woman ≈ queen. But every word still has one static vector regardless of context."},{"year":"2014–2018","title":"RNNs, LSTMs, GRUs","body":"Recurrent architectures carry information forward through sequences, allowing the model to use context. Long-range dependencies remain difficult, and training is slow because computation is sequential."},{"year":"2017","title":"Attention Is All You Need","body":"The transformer paper from Google DeepMind proposes replacing recurrence with self-attention. Every word can directly attend to every other word. The field pivots."},{"year":"2018–2019","title":"BERT, GPT-1, GPT-2","body":"Large-scale pretrained transformers. Fine-tune once, deploy everywhere. NLP becomes transfer learning."},{"year":"2020–present","title":"LLMs and Multimodality","body":"Models with billions to trillions of parameters. Text generation, code, reasoning. The same architecture expanding beyond language into images, audio, and video."}]},{"type":"callout","variant":"info","title":"Real-World Hook","html":"

Every model we talk about in this unit is currently powering something you use. Bag-of-words is still inside production spam filters. Word2Vec is in production recommendation systems. Transformers are in your phone's autocomplete.

"}]}]},{"id":"text-preprocessing","number":2,"title":"Text Preprocessing","overview":"Before text reaches a model, it must be transformed into a structured form the model can work with. This chapter walks through the standard preprocessing pipeline (tokenization, stop word removal, and stemming vs. lemmatization) and explains when each step matters (and when modern transformer-based models let you skip it).","sections":[{"number":1,"id":"preprocessing-pipeline","title":"The Preprocessing Pipeline","blocks":[{"type":"text","html":"

Take raw text — anything from a scraped webpage to a transcript of a lecture — and you'll typically do three things before it hits a traditional model:

Tokenize the text into pieces.
Remove stop words and punctuation that don't carry meaning.
Lemmatize or stem to collapse word variants to a root.

A modern transformer-based model needs much less of this — it can learn directly from raw token sequences. But for traditional models, and for understanding what your tokenizer is doing under the hood, you need to know each step.

"},{"type":"callout","variant":"info","title":"Preprocessing as Architecture-Dependent","html":"

The more powerful the model, the less preprocessing you need. BERT and GPT were trained on raw text with minimal preprocessing. A TF-IDF logistic regression classifier, on the other hand, benefits significantly from stop word removal and lemmatization. Match your preprocessing choices to your model.

"}]},{"number":2,"id":"tokenization","title":"Tokenization","blocks":[{"type":"text","html":"

Tokenization splits a string into substrings. The default is to split on whitespace and punctuation:

\"Which class is the best class at Duke? Deep Learning Applications.\"

becomes

['Which', 'class', 'is', 'the', 'best', 'class', 'at', 'Duke', '?', 'Deep', 'Learning', 'Applications', '.']


You can also tokenize by sentence (useful for long documents you want to summarize one sentence at a time), by subword (the modern default — tokenization → ['token', 'ization']), or by character (rarely useful, but possible).
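
As a rough sketch of the default word-level split, here is one way to reproduce the token list from the Duke example using only the standard library. It assumes plain ASCII English text, and the function names word_tokenize and char_tokenize are made up for this illustration; production tokenizers such as those in NLTK or spaCy handle contractions, Unicode, and many other edge cases that this ignores.

```python
import re

def word_tokenize(text):
    # A run of ASCII letters or digits is one token;
    # any other non-space character (punctuation) is its own token.
    return re.findall('[A-Za-z0-9]+|[^A-Za-z0-9 ]', text)

def char_tokenize(text):
    # Character-level tokenization: every character, spaces included.
    return list(text)

sentence = 'Which class is the best class at Duke? Deep Learning Applications.'
tokens = word_tokenize(sentence)
print(tokens)
print(char_tokenize('Duke'))
```

Subword tokenization is left out on purpose: the split points are learned from a corpus, so there is no honest three-line version of it.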

"},{"type":"interactive","component":"TokenizerPlayground","caption":"Type any sentence and compare word-level, subword, and character-level tokenization side by side.","props":{}},{"type":"callout","variant":"warning","title":"A Real-World Rule That Trips People Up","html":"

When you fine-tune a pretrained model like BERT or GPT, you must use that model's specific tokenizer. BERT learned its embeddings against BERT's tokens.

"},{"type":"checkpoint","id":"nlp-q2-tokenization","kind":"mc","question":"You are fine-tuning BERT for a sentiment classification task. You decide to save time by tokenizing your text with spaCy's word tokenizer before passing it to BERT. What problem does this cause?","options":[{"label":"Your inputs will be too long for BERT's context window","explanation":"Token length is a separate concern. The fundamental problem here is a mismatch between the tokenizer used at pretraining and the one you're using now."},{"label":"BERT's embedding layer expects its own specific token IDs, which spaCy's tokenizer won't produce","correct":true,"explanation":"BERT was pretrained using WordPiece tokenization. Its embedding matrix maps specific token IDs to learned vectors. A different tokenizer produces different token boundaries and different IDs — BERT's weights are meaningless for those inputs."},{"label":"spaCy's tokenizer will remove stop words, which BERT needs","explanation":"spaCy's tokenizer doesn't remove stop words by default. And even if it did, that's a separate concern from the tokenizer mismatch issue."},{"label":"Word-level tokenization produces more tokens than subword tokenization","explanation":"Word tokenization generally produces fewer tokens, not more. And again, the key problem is not token count but token ID mismatch."}]}]},{"number":3,"id":"stop-words","title":"Stop Words","blocks":[{"type":"text","html":"

Stop Word Removal

Many common words — the, of, and, is — appear so frequently that they swamp the signal in your features. Stop word removal drops them so the model can focus on what carries meaning.

NLTK ships with a default English stop word list, but you can absolutely add to it. If you're classifying product reviews, the word \"product\" is technically informative but in practice useless, since it appears in every document. Add it.
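
Here is a minimal sketch of stop word removal with a custom addition. The stop word set below is a tiny hand-picked sample for illustration only, not NLTK's actual list, and the helper name remove_stop_words is hypothetical. The sketch also drops punctuation-only tokens, since the pipeline treats both as noise.

```python
# Illustrative subset only; NLTK's real English list is much longer.
STOP_WORDS = {'the', 'of', 'and', 'is', 'at', 'which', 'a', 'this'}

def remove_stop_words(tokens, extra=()):
    # Merge the base list with any domain-specific additions.
    blocked = STOP_WORDS | {w.lower() for w in extra}
    # Keep a token only if it is not blocked and contains a letter or digit.
    return [t for t in tokens
            if t.lower() not in blocked and any(c.isalnum() for c in t)]

tokens = ['Which', 'class', 'is', 'the', 'best', 'class', 'at', 'Duke', '?',
          'Deep', 'Learning', 'Applications', '.']
filtered = remove_stop_words(tokens)
print(filtered)

review = ['this', 'product', 'is', 'great']
filtered_review = remove_stop_words(review, extra=['product'])
print(filtered_review)
```

Passing the extra words as an argument keeps the base list untouched, so the same function can serve several corpora with different domain vocabularies.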

Apply stop word removal to our example tokens and watch what gets stripped:

"},{"type":"interactive","component":"StopWordVisualizer","caption":"Tokens struck through in red are NLTK stop words. The filtered list keeps only content-bearing words.","props":{}},{"type":"callout","variant":"example","title":"Real World: Search, Sentiment, Legal Archives","html":"

Search engines and traditional sentiment classifiers still rely on this exact pipeline. When Grammarly checks for plagiarism, when a hospital tags chart notes, when a law firm searches its case archive — these three preprocessing steps are the first thing that happens to your text.

"},{"type":"image","src":"/nlp/stopwords.png","alt":"NLTK stop word list","caption":"NLTK's default stop word list"}]},{"number":4,"id":"stemming-lemmatization","title":"Stemming and Lemmatization","blocks":[{"type":"text","html":"

Stemming vs. Lemmatization

The words branch, branches, branching, branched all refer to roughly the same concept. We'd like to collapse them.

Stemming chops off suffixes mechanically. changes, changed, changing → chang. Not a real word. Doesn't matter — it's a feature, not a noun. Fast, crude.
Lemmatization uses a dictionary to map each form to a canonical root. is, am, were → be. changes → change. Slower, but the output is always a real word.

If you're throwing together a quick keyword classifier on millions of documents, stem. If you care about interpretability or accuracy, lemmatize.
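
To see the difference in miniature, here is a toy sketch. The suffix stripper is deliberately cruder than the Porter stemmer, and the lemma table is a hand-written stand-in for a real dictionary such as WordNet; both, along with the names crude_stem and lemmatize, are assumptions made for this example.

```python
# Suffixes are tried in this order; longer ones come first.
SUFFIXES = ('ing', 'ed', 'es', 's')

# Tiny hand-written lookup standing in for a real lemma dictionary.
LEMMAS = {
    'is': 'be', 'am': 'be', 'were': 'be',
    'changes': 'change', 'changed': 'change', 'changing': 'change',
}

def crude_stem(word):
    # Strip the first matching suffix, as long as at least 3 characters remain.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def lemmatize(word):
    # Dictionary lookup; unknown words pass through unchanged.
    return LEMMAS.get(word, word)

words = ['changes', 'changed', 'changing']
stems = [crude_stem(w) for w in words]
lemmas = [lemmatize(w) for w in words]
print(stems)
print(lemmas)
print([lemmatize(w) for w in ['is', 'am', 'were']])
```

The stemmer needs no data at all, which is why it is fast; the lemmatizer is only as good as its dictionary, which is why it is slower and more accurate.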

"},{"type":"interactive","component":"StemmingLemmatizationDemo","caption":"Enter a word or sentence and compare the output of stemming (Porter stemmer) and lemmatization (WordNet).","props":{}}]}]},{"id":"traditional-toolkit","number":3,"title":"Traditional Approaches to NLP","overview":"Before neural networks took over NLP, the field ran on bag-of-words, TF-IDF, N-grams, and Hidden Markov Models. This chapter builds intuition for each — how they work, where they fail, and where they still belong. A TF-IDF logistic regression baseline is still the right first move for most text classification tasks, and understanding why sets up the contrast that motivates everything that comes next.","sections":[{"number":1,"id":"bag-of-words","title":"Bag of Words","blocks":[{"type":"text","html":"$12"},{"type":"interactive","component":"BagOfWordsBuilder","caption":"Enter a small corpus and watch it get converted to bag-of-words vectors.","props":{}}]},{"number":2,"id":"tfidf-ngrams","title":"TF-IDF and N-grams","blocks":[{"type":"text","html":"$13"},{"type":"interactive","component":"TFIDFCalculator","caption":"Enter a small corpus and see TF-IDF scores computed in real time. Compare which words rise and which fall relative to raw counts.","props":{}},{"type":"text","html":"

N-grams: Sneaking Word Order Back In

Bag-of-words throws away word order. \"Terrible acting but great plot\" and \"Great acting but terrible plot\" become the same vector.

N-grams patch this. Instead of counting single words (unigrams), count adjacent pairs (bigrams) or triples (trigrams):

Unigrams: great, movie, amazing, plot
Bigrams: great movie, movie amazing, amazing plot
Trigrams: great movie amazing, movie amazing plot

Bigrams help with negation handling: \"not good\" becomes a single bigram token that the model can learn to associate with negative sentiment. A unigram bag-of-words model will count good as positive; a bigram model can recognize the construction.
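
A short sketch makes the point about word order concrete. The helper name ngrams is made up for this example; it reproduces the unigram, bigram, and trigram lists above and then checks the two movie reviews, which collide as unigram counts but not as bigram counts.

```python
from collections import Counter

def ngrams(tokens, n):
    # Slide a window of n tokens across the list and join each window.
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ['great', 'movie', 'amazing', 'plot']
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)
print(trigrams)

review_a = 'terrible acting but great plot'.split()
review_b = 'great acting but terrible plot'.split()
same_unigrams = Counter(review_a) == Counter(review_b)
same_bigrams = Counter(ngrams(review_a, 2)) == Counter(ngrams(review_b, 2))
print(same_unigrams, same_bigrams)
```

The price is vocabulary size: every extra value of n adds a new, much larger set of features, so in practice people rarely go past trigrams.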

"},{"type":"callout","variant":"example","title":"Real World: Elasticsearch Search Relevance","html":"

TF-IDF still powers production search systems. Elasticsearch's default relevance scoring, BM25, is at its core TF-IDF with refinements for term-frequency saturation and document-length normalization. When you build a RAG system later in this unit, you'll see TF-IDF retrieval as a perfectly viable competitor to dense vector search in many domains.

"},{"type":"checkpoint","id":"nlp-q3-tfidf","kind":"mc","question":"In a corpus of 1,000 product reviews, the word 'product' appears in every single review. What would its IDF score be, and what does that mean for its usefulness as a feature?","options":[{"label":"IDF = log(1000/1000) = 0; the word is useless as a discriminating feature","correct":true,"explanation":"IDF = log(N / df) = log(1000/1000) = log(1) = 0. Multiplied by any TF score, TF-IDF = 0. The word contributes nothing to distinguishing documents from each other."},{"label":"IDF = log(1000) ≈ 3; it's the most important word in every document","explanation":"IDF = log(N / df). If 'product' appears in all 1,000 documents, df = 1000 and IDF = log(1000/1000) = 0, not log(1000)."},{"label":"IDF = 1/1000 = 0.001; it's slightly informative because it's rare per document","explanation":"IDF is a logarithmic formula, not a simple ratio. And document frequency (how many documents contain the word) is what matters, not total occurrences."},{"label":"IDF cannot be computed because the word appears in every document","explanation":"IDF is perfectly computable here — it just evaluates to zero, which is the correct and interpretable result: this word carries no distinguishing power."}]}]},{"number":3,"id":"hidden-markov-models","title":"Hidden Markov Models","blocks":[{"type":"text","html":"$14"},{"type":"interactive","component":"HMMWeatherDemo","caption":"Explore the transition and emission tables, then switch to Decode to build an observation sequence and watch the Viterbi algorithm infer the most likely hidden weather states.","props":{}},{"type":"text","html":"

In NLP, the hidden states are often parts of speech (noun, verb, adjective), and the observations are the words themselves. HMMs were used in speech recognition and part-of-speech tagging for many years.
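
To make the tagging setup concrete, here is a minimal sketch of Viterbi decoding for a two-tag toy tagger. Every probability below is invented for illustration, and the function and variable names are hypothetical. A real tagger would estimate these tables from a labeled corpus and work with log probabilities to avoid underflow on long sentences.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    # best[t][s] = (probability of the best path ending in state s at step t,
    #               the state that path came from)
    best = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for word in obs[1:]:
        prev = best[-1]
        column = {}
        for s in states:
            # Pick the predecessor that maximizes path probability into s.
            p, back = max((prev[r][0] * trans_p[r][s], r) for r in states)
            column[s] = (p * emit_p[s][word], back)
        best.append(column)
    # Trace back from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]

# Toy tables: all numbers are made up for this example.
states = ['NOUN', 'VERB']
start_p = {'NOUN': 0.8, 'VERB': 0.2}
trans_p = {'NOUN': {'NOUN': 0.3, 'VERB': 0.7},
           'VERB': {'NOUN': 0.8, 'VERB': 0.2}}
emit_p = {'NOUN': {'dogs': 0.9, 'bark': 0.1},
          'VERB': {'dogs': 0.1, 'bark': 0.9}}

tags = viterbi(['dogs', 'bark'], states, start_p, trans_p, emit_p)
print(tags)
```

Notice that each column only ever looks at the column before it. That is the Markov assumption showing up directly in the code.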

A great way to feel the limits of these models is to train an HMM to generate text and read what it spits out. Output trained on Jane Austen's Pride and Prejudice is grammatical-looking but utterly nonsensical — full of dangling references and characters appearing where they shouldn't. It's actually quite funny if you've read the book. It's also a perfect demonstration of why HMMs are not good at text generation.

"},{"type":"callout","variant":"example","title":"Example: HMM Text Generation Trained on Pride and Prejudice","html":"$15"},{"type":"callout","variant":"info","title":"Why HMMs Eventually Fall Short","html":"

HMMs inherit a fundamental constraint from the Markov assumption: the next state depends only on the current state. As a result, they can't model long-range dependencies: a dependency that spans 20 words is invisible to an HMM. This is a hint of the challenge that RNNs will address next, and that transformers will finally solve.

"}]},{"number":4,"id":"when-to-use-traditional","title":"When to Still Use Traditional ML for NLP","blocks":[{"type":"text","html":"

These models share three serious limitations: they don't carry context across long distances, they can't handle homonyms or synonyms, and their representations are sparse and inefficient.

However, they're computationally cheap, easy to implement, and they don't require a GPU. For text classification or clustering where context doesn't matter much, a TF-IDF + logistic regression baseline often gets you 90% of the way there in 10% of the engineering time.

"},{"type":"callout","variant":"tip","title":"Always Build a Stupid Baseline First","html":"

A TF-IDF logistic regression tells you whether your fancy model is actually doing anything. If your transformer can't beat a bag-of-words baseline by a meaningful margin, you should ask whether the task actually requires context. Surprisingly many tasks don't.

"},{"type":"checkpoint","id":"nlp-q4-baselines","kind":"reflective","question":"You're building a customer support ticket classifier for a company with 12 labeled intent categories and ~50,000 tickets of training data. Describe the baseline you would build first and justify your choice.","sampleAnswer":"A TF-IDF + logistic regression or Naive Bayes classifier makes an excellent baseline here. It requires no GPU, is fast to train, and interpretable. With 50,000 labeled examples across 12 categories, the class sizes are substantial enough for a linear model to pick up the most important vocabulary signals. The baseline tells you immediately whether the problem has strong lexical signal (in which case you may not need a transformer at all) or whether context and semantics are critical (in which case fine-tuning a small pretrained model is justified)."}]}]},{"id":"word-embeddings","number":4,"title":"Word Embeddings","overview":"Bag-of-words is sparse, huge, and semantically meaningless. Word embeddings replace it with dense, low-dimensional vectors where geometric distance reflects semantic similarity. This chapter walks through Word2Vec's deceptively simple architecture, explains why extracting the weight matrix as embeddings works, demonstrates vector arithmetic (king − man + woman ≈ queen), and introduces Doc2Vec and GloVe.","sections":[{"number":1,"id":"why-better-representation","title":"Why We Need a Better Representation","blocks":[{"type":"text","html":"$16"},{"type":"callout","variant":"info","title":"Sparse vs. Dense Representations","html":"

Sparse representation: a vector of vocabulary size where almost all values are zero. It's easy to construct, but hard to use because each dimension is just a count or flag for one word: nothing in the vector says that two different words are related.

Dense representation: a short vector (typically 50–1024 dimensions) where almost all values are non-zero. Much harder to construct, but dimensions can capture semantic relationships.

"}]},{"number":2,"id":"word2vec","title":"Word2Vec","blocks":[{"type":"text","html":"$17"},{"type":"interactive","component":"Word2VecArchitectureDiagram","caption":"Step through how Skip-gram and CBOW process a target word. Toggle between modes and click any word in the sentence to change the target.","props":{}},{"type":"text","html":"

Skip-gram vs. CBOW

Word2Vec comes in two flavors depending on which way you orient the prediction:

Skip-gram: given a center word, predict its context. Better for rare words and small datasets.
CBOW (Continuous Bag of Words): given the context, predict the center word. Faster to train, better for common words.

In CBOW, the embeddings of the context words get averaged (via a lambda layer) before going into the softmax layer. This averaging assumes all context words contribute equally (order doesn't matter inside the window) and makes the model more robust to small variations in word arrangement.
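
Here is a small sketch of that averaging step on its own. The two-dimensional embeddings are invented numbers, and cbow_input is a hypothetical helper, not part of any library; the point is only that shuffling the context words leaves the averaged vector unchanged.

```python
# Made-up 2D embeddings for illustration; real ones are learned during training.
EMBED = {
    'the': [0.1, 0.3],
    'quick': [0.7, 0.2],
    'fox': [0.4, 0.9],
    'jumps': [0.6, 0.5],
}

def cbow_input(context):
    # Look up each context word, then average the vectors dimension by dimension.
    vecs = [EMBED[w] for w in context]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

in_order = cbow_input(['the', 'quick', 'fox', 'jumps'])
shuffled = cbow_input(['jumps', 'fox', 'quick', 'the'])
print(in_order)
print(shuffled)
```

Up to floating-point rounding, the two printed vectors match, which is exactly the order-insensitivity described above.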

"},{"type":"callout","variant":"info","title":"Creating a Dataset","html":"

To generate training data for Word2Vec, slide a context window of size N across your corpus. For each position, the center word is the target and every other word within N positions on either side is a context word.

Each (target, context) pair becomes one training example:

Window size N = 2, sentence: \"the quick brown fox jumps\"
Target: brown → context pairs: (brown, the), (brown, quick), (brown, fox), (brown, jumps)

Do this for every word in the corpus and you have your dataset — millions of (target word, context word) pairs ready for training.
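
The sliding window is only a few lines of code. This is a sketch with a hypothetical helper name, skipgram_pairs, that treats the window size as the number of words on each side of the target and clips the window at the ends of the sentence.

```python
def skipgram_pairs(tokens, window):
    pairs = []
    for i, target in enumerate(tokens):
        # Clip the window so it never runs past either end of the sentence.
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        pairs.extend((target, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

tokens = 'the quick brown fox jumps'.split()
pairs = skipgram_pairs(tokens, window=2)
brown_pairs = [p for p in pairs if p[0] == 'brown']
print(brown_pairs)
print(len(pairs))
```

Words near the edges of a sentence get fewer pairs than words in the middle, which is one reason a five-word sentence here yields 14 pairs rather than 20.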

"},{"type":"interactive","component":"Word2VecVisualizer","caption":"Explore a pretrained Word2Vec embedding space. Search for a word and see its nearest neighbors. Try words with multiple meanings.","props":{}}]},{"number":3,"id":"vector-arithmetic","title":"The Most Famous Equation in NLP","blocks":[{"type":"text","html":"

The wild thing about Word2Vec embeddings is that you can do arithmetic with them:

king − man + woman ≈ queen

This is evidence that the model has learned to separate the dimensions of meaning. Subtracting man from king removes the gender direction. Adding woman puts it back, oriented the other way. The result lands closest to queen.

You can do similar things with Paris − France + Italy ≈ Rome. Geography, gender, verb tense, plurality — all of these end up as approximate directions in the embedding space.
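
Here is a toy version of the analogy lookup. The two-dimensional vectors are invented so that one axis behaves like royalty and the other like gender; real embeddings have hundreds of dimensions and the directions are only approximate. The function name analogy is hypothetical. As is common practice, the query words themselves are excluded from the candidates.

```python
import math

# Made-up 2D vectors: first axis is roughly royalty, second is roughly gender.
VECTORS = {
    'king': (1.0, 1.0), 'queen': (1.0, -1.0),
    'man': (0.0, 1.0), 'woman': (0.0, -1.0),
    'prince': (0.8, 0.9), 'apple': (-1.0, 0.1),
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(a, b, c):
    # Compute a - b + c, then return the nearest remaining word by cosine similarity.
    target = tuple(x - y + z for x, y, z in zip(VECTORS[a], VECTORS[b], VECTORS[c]))
    candidates = [w for w in VECTORS if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, VECTORS[w]))

result = analogy('king', 'man', 'woman')
print(result)
```

With real embeddings the result is the nearest neighbor of the computed point, not an exact hit, which is why the equation is written with an approximately-equals sign.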

"},{"type":"interactive","component":"VectorArithmeticExplorer","caption":"Each word is a point in 2D embedding space. The red arrows show the two geometric steps of the analogy. Switch between presets to see how the same pattern holds across different word pairs.","props":{}},{"type":"text","html":"

Doc2Vec and GloVe

Two related embedding methods you should know:

Doc2Vec extends Word2Vec by adding a paragraph vector to the input, so the model captures topic-level context, not just local word context. Useful for document-level tasks like classification or similarity.
GloVe (2014) is a matrix factorization approach. Rather than training a neural net, it factorizes a co-occurrence matrix using log-bilinear regression. GloVe embeddings are often used as inputs to downstream neural networks.

"}]},{"number":4,"id":"the-catch","title":"The Limitation of Word Embeddings","blocks":[{"type":"text","html":"

Word2Vec embeddings are static. Each word has exactly one vector. So bank has one vector, regardless of whether you're talking about a river or a financial institution. The vector ends up somewhere in between, a kind of weighted compromise that satisfies neither meaning fully.

We haven't fixed bank of the river versus deposited money at the bank. We've made it less bad, but the contextual problem is still very much open.

"},{"type":"callout","variant":"info","title":"What's Still Missing","html":"

Static embeddings have no concept of context within a sentence. The embedding for orange in \"I ate an orange\" and in \"The sunset is orange\" is the same vector — a compromise between the fruit and the color. The word needs a dynamic, context-dependent representation.

"},{"type":"callout","variant":"example","title":"Real World: Recommendation and Classification at Scale","html":"

Pre-trained Word2Vec and GloVe embeddings are still production-grade features for many search, recommendation, and classification pipelines where the computational budget for transformers isn't justified. When recommending products based on product names, or routing customer service tickets to teams, Word2Vec embeddings + logistic regression is often the right answer.

"},{"type":"checkpoint","id":"nlp-q5-embeddings","kind":"mc","question":"A production search system uses pre-trained Word2Vec embeddings to represent product titles and customer queries. A user searches for 'running shoes' but the most relevant results are labeled 'athletic footwear.' The system returns poor results. Why?","options":[{"label":"Word2Vec embeddings are too high-dimensional for similarity search","explanation":"Word2Vec embeddings are typically 100–300 dimensions — quite manageable for similarity search. Dimensionality isn't the issue."},{"label":"Word2Vec cannot represent multi-word phrases like 'running shoes'","explanation":"You can average word embeddings to get phrase representations, which is a common approach. This is a limitation, but not the primary reason these two phrases would fail to match."},{"label":"'Running shoes' and 'athletic footwear' are synonymous but may have dissimilar Word2Vec vectors if they rarely appeared in the same context in the training corpus","correct":true,"explanation":"Word2Vec learns from co-occurrence. If 'running shoes' and 'athletic footwear' never appeared near the same words in the training corpus, their embeddings won't be close — even though they mean the same thing to a human."},{"label":"Word2Vec is only suitable for classification, not search","explanation":"Word2Vec embeddings are commonly used for similarity search via cosine distance. The architecture is not inherently limited to classification."}]}]}]},{"id":"rnns","number":5,"title":"Recurrent Neural Networks","overview":"Feed-forward networks are stateless — they treat every observation independently. Language requires memory. 
This chapter introduces RNNs as the first architecture that carries information forward through time, maps out the four sequence architecture types (seq-to-seq, seq-to-vector, vector-to-seq, encoder-decoder), explains the vanishing gradient problem that limits plain RNNs, and shows how LSTMs and GRUs solve it through gated memory.","sections":[{"number":1,"id":"feed-forward-problem","title":"Feed-Forward Networks Are Forgetful","blocks":[{"type":"text","html":"

Everything we've seen so far in deep learning (fully-connected networks and CNNs) is feed-forward. Information flows once, in one direction, from input to output. Observations are treated as independent.

That's problematic for language. The meaning of \"it\" in \"The dog ate the bone. It tasted good.\" comes entirely from the previous sentence. A feed-forward network has no mechanism to remember.

We need an architecture that carries information forward through time.

Enter: Recurrent Neural Networks

"}]},{"number":2,"id":"rnn-architecture","title":"Recurrent Neural Network (RNN)","blocks":[{"type":"text","html":"

A recurrent neural network has a simple twist: the output of a layer at one time step is fed back into the same layer together with the next input. You can draw this two ways:

Unrolled — the same block copied left-to-right across time steps, with arrows showing how the hidden state passes forward. This is the more common way to draw an RNN.
Rolled — a single block with a loop arrow on it, simple and abstract.
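
In code, the unrolled picture is just a loop that reuses the same weights at every step. The sketch below uses the standard tanh recurrence with a single scalar hidden unit; the weight values are arbitrary choices for illustration, the name rnn_forward is hypothetical, and the notation may differ from the equation used elsewhere in this chapter.

```python
import math

def rnn_forward(inputs, w_x=0.5, w_h=0.8, b=0.0):
    # One scalar hidden unit; the same weights are reused at every time step.
    h = 0.0
    states = []
    for x in inputs:
        # New hidden state mixes the current input with the previous hidden state.
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states

# Only the first input is non-zero, yet later states are still non-zero:
# the hidden state carries that first input forward.
states = rnn_forward([1.0, 0.0, 0.0])
print(states)
```

The loop is the rolled drawing; writing out each iteration side by side gives the unrolled one.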

"},{"type":"image","src":"/nlp/rnn.png","alt":"Unrolled and rolled RNN diagram","caption":"An RNN in rolled form (right) and unrolled form (left)."},{"type":"text","html":"$18"},{"type":"interactive","component":"RNNCellDiagram","caption":"Hover any part of the RNN cell — the equation highlights the corresponding term and explains its role.","props":{}},{"type":"text","html":"

RNN Types

The same RNN components handle a surprising range of tasks depending on how you wire up inputs and outputs:

Sequence-to-sequence: input at every step, output at every step. Used for stock price forecasting, where each new day produces a new prediction.
Sequence-to-vector: many inputs, one output. Used for spam classification — read the whole email, output a single yes/no.
Vector-to-sequence: one input, many outputs. Used for image captioning — encode an image once, then generate a caption word by word.
Encoder-decoder: one full sequence in (the encoder), one full sequence out (the decoder). Used for machine translation — read the whole French sentence before generating the English.

"},{"type":"interactive","component":"RNNFlavorExplorer","caption":"Select an RNN architecture type to see how inputs and outputs are wired, with a real-world task example for each.","props":{}},{"type":"checkpoint","id":"nlp-q6-rnn-types","kind":"mc","question":"You are building a machine translation system that reads a full English sentence and produces a full French sentence. The output length is different from the input length. Which RNN architecture type is most appropriate?","options":[{"label":"Sequence-to-sequence (one output per input step)","explanation":"Standard seq-to-seq produces one output for every input time step, which forces input and output to be the same length. Translation requires variable-length outputs."},{"label":"Encoder-decoder","correct":true,"explanation":"An encoder reads the full source sentence and produces a fixed-size context representation. A decoder generates the target sentence one token at a time, conditioned on that context — handling variable-length input and output naturally."},{"label":"Sequence-to-vector","explanation":"Sequence-to-vector produces a single output from the full input — useful for classification, not for generating a translated sentence."},{"label":"Vector-to-sequence","explanation":"Vector-to-sequence takes a single input and generates a sequence — useful for tasks like image captioning, but can't process a variable-length text input."}]}]},{"number":3,"id":"vanishing-gradient","title":"Backpropagation Through Time and the Vanishing Gradient","blocks":[{"type":"text","html":"

RNN training is backpropagation, but unrolled. You compute the loss across all time steps and propagate gradients back through the unrolled chain using the chain rule. This is Backpropagation Through Time (BPTT).

It works. But for long sequences, the chain of multiplied gradients gets very long, and each link is typically less than 1 in magnitude. Multiply enough of them together, and the gradient goes to zero. This is the vanishing gradient problem, and it means RNNs cannot effectively learn long-range dependencies. Information from twenty time steps ago can't influence the current loss because its gradient signal vanishes.
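
A back-of-the-envelope sketch shows how fast this happens. It makes a strong simplifying assumption: every link in the chain is the same scalar factor. In a real RNN each link is a Jacobian matrix and the factors vary from step to step. The function name gradient_scale is made up for this example.

```python
def gradient_scale(per_step_factor, steps):
    # Product of `steps` identical factors: how much the gradient is scaled
    # after flowing back through that many time steps.
    return per_step_factor ** steps

for factor in (0.5, 0.9, 1.1):
    scales = [gradient_scale(factor, t) for t in (1, 10, 20, 50)]
    print(factor, scales)
```

Even a factor of 0.9 per step shrinks the signal to about an eighth of its size after 20 steps, and a factor slightly above 1 grows without bound, which is the exploding case.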

"},{"type":"callout","variant":"warning","title":"Vanishing vs. Exploding Gradients","html":"

The vanishing gradient problem is the more common failure in practice. But the opposite, exploding gradients, also occurs when gradient magnitudes grow exponentially. The fix for exploding gradients is gradient clipping: if the gradient norm exceeds a threshold, scale the gradient down so its norm equals the threshold. That threshold is a standard hyperparameter in RNN training.
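
Clipping by norm is short enough to write by hand. This sketch works on a flat list of gradient values, and the name clip_by_norm is hypothetical; deep learning frameworks ship their own versions that operate on all parameter tensors at once.

```python
import math

def clip_by_norm(grads, max_norm):
    # Euclidean norm of the whole gradient vector.
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    # Rescale every component by the same factor so the direction is preserved.
    scale = max_norm / norm
    return [g * scale for g in grads]

clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)
untouched = clip_by_norm([0.3, 0.4], max_norm=1.0)
print(clipped)
print(untouched)
```

Because every component is scaled by the same factor, the update still points in the same direction; only the step size is capped.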

"}]},{"number":4,"id":"lstm","title":"Long Short-Term Memory (LSTM)","blocks":[{"type":"text","html":"$19"},{"type":"interactive","component":"LSTMCellDiagram","caption":"Click any component of the LSTM cell to see its equation and learn what it does. The cell state (green highway) is the key to understanding why LSTMs defeat the vanishing gradient.","props":{}},{"type":"callout","variant":"tip","title":"Why Does the Additive Update Fix Vanishing Gradients?","html":"$1a"},{"type":"checkpoint","id":"nlp-q7-lstm-gate","kind":"mc","question":"In an LSTM, which gate is primarily responsible for allowing gradient flow across many time steps without vanishing?","options":[{"label":"The forget gate","explanation":"The forget gate does affect gradient flow — when f_t ≈ 1, it passes gradients through. But the forget gate alone is not the reason; it's the additive structure of the cell state update that matters."},{"label":"The input gate","explanation":"The input gate controls what new information to write, but it doesn't directly determine how gradients flow backward through the architecture."},{"label":"The cell state update equation, which is additive","correct":true,"explanation":"C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t. The + node has a gradient of 1 — it passes gradients backward without multiplication. This is the key structural reason LSTMs solve the vanishing gradient problem."},{"label":"The output gate","explanation":"The output gate determines what h_t exposes, but it doesn't affect how gradients flow through the cell state highway backward through time."}]}]},{"number":5,"id":"gru","title":"Gated Recurrent Unit (GRU)","blocks":[{"type":"text","html":"$1b"},{"type":"interactive","component":"GRUCellDiagram","caption":"Step through each gate to see the highlighted diagram, its equation, and how it fits into the full computation.","props":{}},{"type":"callout","variant":"tip","title":"LSTM vs. GRU: When to Use Which","html":"

GRU tends to be the better starting point when: your dataset is small, sequences are short-to-medium, or you need faster training. Fewer parameters means less overfitting risk and quicker iteration.

LSTM tends to edge ahead when: sequences are very long, the task requires fine-grained control over what to remember and forget at each step, or you have enough data to train the extra parameters.
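
To put a rough number on the parameter gap, here is a sketch that counts weights for a single layer. It assumes the textbook formulation (one weight matrix over the concatenated input and previous hidden state, plus one bias, per gate-like block); specific libraries may count biases differently, and the sizes 128 and 256 are arbitrary.

```python
def gated_rnn_params(input_size, hidden_size, num_blocks):
    # Each gate-like block: weights over [x_t, h_{t-1}] plus a bias vector.
    per_block = hidden_size * (input_size + hidden_size) + hidden_size
    return num_blocks * per_block

# LSTM: forget gate, input gate, output gate, candidate cell state.
lstm = gated_rnn_params(128, 256, num_blocks=4)
# GRU: update gate, reset gate, candidate hidden state.
gru = gated_rnn_params(128, 256, num_blocks=3)
ratio = gru / lstm   # 0.75 under these assumptions
```

Under this counting a GRU layer carries three quarters of the parameters of an LSTM layer of the same width.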

In practice, try both. The difference is often small enough that dataset quality and hyperparameter tuning matter more than which gated architecture you choose.

"},{"type":"checkpoint","id":"nlp-q8-gru-vs-lstm","kind":"mc","question":"In a GRU, what does the update gate z_t do when its value is close to 1?","options":[{"label":"It causes the network to forget most of the previous hidden state","explanation":"Partly true, but not the best description. When z_t ≈ 1, the term (1 − z_t) ⊙ h_{t-1} ≈ 0, so the direct copy of the previous state is dropped. The candidate h̃_t is itself computed from h_{t-1}, though, so past information is not necessarily forgotten; what z_t ≈ 1 really does is favor the new candidate."},{"label":"It causes the new hidden state to strongly favor the candidate state h̃_t","correct":true,"explanation":"h_t = (1 − z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t. When z_t ≈ 1, the second term dominates and the new hidden state is approximately h̃_t — the network is essentially rewriting its state with the new candidate."},{"label":"It forces the candidate state to be computed without any previous hidden state","explanation":"That is the reset gate r_t, not the update gate. The reset gate controls how much h_{t-1} influences the candidate h̃_t."},{"label":"It passes the cell state directly to the output without modification","explanation":"GRUs do not have a separate cell state — that is an LSTM concept. The GRU has a single hidden state h_t."}]}]},{"number":6,"id":"rnn-limitations","title":"Limitations and Opportunities","blocks":[{"type":"text","html":"

Even with LSTMs and GRUs, practical RNN training has serious problems:

They can't be parallelized across time steps. The hidden state at time t depends on the hidden state at time t-1, so you have to compute sequentially.
They're slow. A direct consequence of sequential computation.
Long-range dependencies are still hard, even for LSTMs and GRUs. Just less hard than for vanilla RNNs.
Hyperparameter tuning is painful. Learning rate, batch size, hidden size, dropout, gradient clipping threshold, sequence length: they're all interlinked, and small changes can destabilize training.

"},{"type":"callout","variant":"example","title":"Real World: When RNNs Are Still the Right Call","html":"

RNNs and LSTMs are still production architectures for time series forecasting (energy demand, financial prices), wearable signal processing, and streaming/online inference where you don't have the full sequence in memory and can't afford a transformer's quadratic memory cost. For batch NLP work, transformers have eaten their lunch.

"},{"type":"checkpoint","id":"nlp-q9-rnn-limits","kind":"mc","question":"A model needs to predict the next word in a streaming audio transcript in real-time, processing tokens as they arrive without buffering the full sequence. Which architecture is best suited for this constraint?","options":[{"label":"A transformer encoder-only model","explanation":"Transformer encoders process the full sequence at once using self-attention — they require the complete input before producing outputs, which doesn't work for streaming."},{"label":"An LSTM","correct":true,"explanation":"LSTMs process tokens one at a time, maintaining a hidden state that's updated with each new input. This is exactly the streaming/online inference pattern RNNs are built for."},{"label":"A BERT-style bidirectional transformer","explanation":"Bidirectional models look at future tokens to compute representations for current tokens. That's impossible in a streaming context — you don't have the future yet."},{"label":"A TF-IDF + logistic regression pipeline","explanation":"TF-IDF is a document-level representation, not a sequential architecture. It has no mechanism for predicting the next word in a sequence."}]}]}]},{"id":"attention-transformer","number":6,"title":"Attention and Transformers","overview":"Static word embeddings give every word one vector regardless of context — fundamentally unable to distinguish 'bank of the river' from 'deposited money at the bank.' Self-attention solves this by computing each word's representation as a weighted blend of all other words in the sequence. 
This chapter builds self-attention from scratch (dot product → softmax → weighted sum), introduces the Q/K/V parameterization that makes it trainable, extends to multi-head and cross-attention, then assembles the full transformer architecture from the 2017 'Attention Is All You Need' paper — encoder, decoder, positional encodings, masking, and three popular types of transformers (BERT, GPT, T5).","sections":[{"number":1,"id":"problem-static-embeddings","title":"The Problem We Haven't Solved","blocks":[{"type":"text","html":"

Word2Vec was a real leap. It gave us dense, semantically meaningful embeddings, and even let us do arithmetic on them. But each word has one vector forever. bank is bank. Static.

What we want is for bank in \"bank of the river\" to end up close to embeddings about rivers and shores, while bank in \"deposited money at the bank\" ends up close to embeddings about finance. Same word, different vectors, dictated by context.

"},{"type":"callout","variant":"info","title":"Part II Begins Here","html":"

Everything in Part I gave us tools to represent language that are better than nothing. But they all share one fundamental limitation: representations don't depend on context. This chapter introduces the mechanism that finally solves that — and in doing so, redirected the entire field.

"}]},{"number":2,"id":"self-attention","title":"Self-Attention as Reweighting","blocks":[{"type":"text","html":"

Think of self-attention as a procedure for improving each word's embedding by mixing in information from the other words in its sequence.
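
A minimal sketch of that mixing step in plain Python, with no trainable parameters: score every pair of embeddings with a dot product, turn each row of scores into weights with a softmax, then take the weighted sum. The two-dimensional embeddings are made-up toy values, not real word vectors.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability; the result is unchanged.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(embeddings):
    outputs = []
    for query in embeddings:
        # 1. Scores: how strongly this word relates to every word (itself included).
        scores = [dot(query, other) for other in embeddings]
        # 2. Weights: scores normalized to sum to 1.
        weights = softmax(scores)
        # 3. Output: weighted blend of all embeddings in the sequence.
        mixed = [sum(w * vec[d] for w, vec in zip(weights, embeddings))
                 for d in range(len(query))]
        outputs.append(mixed)
    return outputs

tokens = ['river', 'bank']
embeddings = [[1.0, 0.0], [0.5, 0.5]]   # toy values for illustration
contextual = self_attention(embeddings)
```

Here the vector for bank is pulled toward the vector for river, because half of its attention weight lands on that neighbor.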

"},{"type":"interactive","component":"SelfAttentionWalkthrough","caption":"Step through the full self-attention computation — from raw embeddings to contextualized output vectors — one stage at a time.","props":{}},{"type":"callout","variant":"warning","title":"weights =/= trainable weights","html":"

The \"weights\" computed by attention are not trainable parameters. They're computed live, every forward pass, from the data. The terminology is the same as for weights in a linear layer, but the things are completely different.

"},{"type":"text","html":"

What we just built has some interesting properties:

Order doesn't matter. Whether river is the second word or the fortieth, the same procedure runs.
Proximity doesn't matter. Distance in the sequence has no effect on whether two words can attend to each other.
Sequence length doesn't matter. Short or long — same operation.

For long-range dependencies — exactly the case where RNNs choked — this is enormous. The flip side: order really does matter for language, and we've thrown it away. We'll patch this with positional encodings in the transformer section.
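
As a preview of that patch, here is a sketch of the sinusoidal positional encoding from Attention Is All You Need: sine on even dimensions, cosine on odd ones, with the paper's base of 10000. The helper names and the toy embedding are illustrative assumptions.

```python
import math

def sinusoidal_encoding(position, d_model):
    encoding = []
    for i in range(d_model):
        # Dimensions 2k and 2k+1 share one frequency.
        angle = position / (10000 ** ((i - i % 2) / d_model))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding

def add_position(embedding, position):
    # Element-wise sum of the token embedding and its position vector.
    pe = sinusoidal_encoding(position, len(embedding))
    return [e + p for e, p in zip(embedding, pe)]

same_word = [0.2, 0.7, -0.1, 0.4]       # toy embedding
at_start = add_position(same_word, 0)
later = add_position(same_word, 5)
```

The same word now gets a different input vector at position 0 than at position 5, which is what gives attention a handle on order.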

"},{"type":"interactive","component":"SelfAttentionVisualizer","caption":"Enter a short sentence and see the attention weights computed between every word pair. Hover over a word to see what it attends to.","props":{}}]},{"number":3,"id":"qkv-multihead","title":"Queries, Keys, Values, and Multi-Head Attention","blocks":[{"type":"text","html":"$1c"},{"type":"interactive","component":"QKVWalkthrough","caption":"Step through how trainable Query, Key, and Value matrices are introduced into the self-attention pipeline.","props":{}},{"type":"text","html":"

Multi-Head Attention

"},{"type":"interactive","component":"MultiHeadWalkthrough","caption":"Step through how multiple attention heads run in parallel to capture different relationships, then get combined into a single output.","props":{}},{"type":"text","html":"

Self-Attention vs. Cross-Attention

Self-attention operates within a single sequence — every token attends to every other token in the same input.

Cross-attention operates between two different sequences. For each element in one sequence (the query sequence), cross-attention computes attention scores based on its relationship with every element in a separate sequence (the key-value sequence). This lets the model selectively focus on the most relevant parts of the other sequence when generating each output.

Machine translation is the classic example: the decoder generates the target language word by word, and at each step it uses cross-attention to decide which parts of the encoded source sentence to draw from.
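
A sketch of one cross-attention step, leaving out the learned projections to keep it short: the query comes from the decoder, while the keys and values both come from the encoder output. All vectors are made-up toy values.

```python
import math

def attend(query, keys, values):
    # Score the query against every key, normalize, then blend the values.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    context = [sum(w * v[d] for w, v in zip(weights, values))
               for d in range(len(values[0]))]
    return weights, context

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three source tokens
decoder_state = [0.0, 2.0]                               # current target position
weights, context = attend(decoder_state, encoder_states, encoder_states)
```

Passing the same sequence as query source, keys, and values would turn this back into self-attention; the only difference is where the three inputs come from.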

"},{"type":"image","src":"/nlp/crossattention.png","alt":"Cross attention visual of translation","caption":"Cross Attention is helpful for tasks that involve understanding how elements from different sources relate to one another, like machine translation"}]},{"number":4,"id":"transformer-architecture","title":"The Transformer Architecture","blocks":[{"type":"text","html":"

In 2017, a team of researchers, most of them at Google Brain and Google Research, published Attention Is All You Need. At the time, nobody outside the research community paid much attention (pun aggressively intended). RNNs were the establishment; this paper was an upstart. We know now that this paper redirected the field, kicked off the LLM era, and, if you're reading this, quite possibly inspired you to study AI!

"},{"type":"slideshow","slides":[{"src":"/nlp/preview.png","alt":"The Transformer Architecture","caption":"You are more familiar with the architecture of the transformer than you think. A combination of feed forward layers, residual connections, and the new attention mechanism come together to form the transformer (see what I did there?)."},{"src":"/nlp/t1.png","alt":"Scaled dot-product attention formula and diagram","caption":"This should look pretty familiar to the attention module. We have our vectors, compute a dot product (--> scores) and a softmax (--> weights). The only new piece is the scaling. The scaling by √d_k prevents dot products from growing so large that the softmax saturates and gradients vanish."},{"src":"/nlp/t2.png","alt":"Multi-head attention architecture diagram","caption":"This module is exactly what we closed the attention module with! Our multi-head attention module runs h parallel attention heads, each with its own linear projections for keys, queries, and values. The outputs of all heads are concatenated and passed through a final dense layer, letting the model attend to different representation subspaces simultaneously."},{"src":"/nlp/t3.png","alt":"Multi-head attention in the transformer","caption":"In the transformer architecture, multi-head attention is the orange block, the one component you were unfamiliar with before starting the NLP Unit (assuming you are going through this book sequentially). "},{"src":"/nlp/t4.png","alt":"Encoder and decoder side by side with labels","caption":"Zooming out: the encoder processes the entire input sequence at once and produces contextualized representations. 
The decoder generates output tokens one at a time, attending to both its own prior outputs (masked) and the full encoder output via cross-attention."},{"src":"/nlp/t5.png","alt":"Encoder block with hyperparameter callouts for N and number of heads","caption":"Two key hyperparameters shape the encoder: N (the number of stacked blocks) and h (the number of attention heads). The original paper used N=6 and h=8; modern large models scale these dramatically."},{"src":"/nlp/t6.png","alt":"Encoder block highlighting the residual (skip) connection around multi-head attention","caption":"The curved arrow is a residual connection: the output of multi-head attention is added back to its own input before being normalized. This is the same skip-connection idea from ResNets — it keeps gradients flowing through deep stacks."},{"src":"/nlp/t7.png","alt":"Encoder block with both Add & Norm residual connections labeled","caption":"Both sub-layers (multi-head attention and feed-forward) use the same Add & Norm pattern: compute the sub-layer output, add it to the original input (residual), then apply layer normalization. This directly addresses the vanishing gradient problem in deep networks."},{"src":"/nlp/t8.png","alt":"Positional encoding diagram showing element-wise addition to embeddings","caption":"Attention is permutation-invariant — it has no built-in sense of order. Positional encodings fix this by adding a position vector p_i element-wise to each token embedding v_i before it enters the encoder, producing v*_i that carries both meaning and position."},{"src":"/nlp/t9.png","alt":"Sinusoidal positional encoding formulas","caption":"The original paper defined positional encodings using sine for even dimensions and cosine for odd dimensions, with wavelengths forming a geometric progression across the embedding dimensions. 
The paper reports that learned and sinusoidal encodings produced nearly identical results; the authors chose the sinusoidal version because it may allow the model to extrapolate to sequence lengths longer than those seen during training."},{"src":"/nlp/t10.png","alt":"Decoder architecture with masking, linear layer, and softmax output","caption":"The decoder generates one token at a time. Masked multi-head attention prevents any position from attending to future positions by setting those scores to −∞ before softmax, which gives them exactly zero weight. After the decoder stack, a linear layer projects to vocabulary size and a softmax turns the result into a probability distribution over the next token; the chosen token (the most likely one, under greedy decoding) is fed back as the next input."},{"src":"/nlp/t11.png","alt":"Full encoder-decoder with cross-attention pathway highlighted","caption":"Cross-attention is the bridge between encoder and decoder: the decoder's queries come from the decoder stack, but the keys and values come from the encoder output. This lets every decoder position directly attend to any position in the input sequence and pull the relevant encoded context into generation."}],"caption":"Deeper look: encoder-decoder mechanics, cross-attention, and training dynamics."},{"type":"callout","variant":"tip","title":"Go deeper: The Annotated Transformer","html":"

For a line-by-line walkthrough of the original paper with working PyTorch code, check out The Annotated Transformer. It's one of the best resources available for understanding how the architecture translates into actual implementation.

"}]},{"number":5,"id":"transformer-types-inference","title":"Transformers in Practice","blocks":[{"type":"text","html":"

Three Transformer Types

Here are three popular transformer types you may encounter:

Encoder-only (BERT). Produces contextualized representations of input text. Doesn't generate — it interprets. Used for classification, NER, similarity, information extraction.
Decoder-only (GPT). Generates text autoregressively, one token at a time, each conditioned on all previous tokens. Used for text generation and completion.
Encoder-decoder (T5, BART). The full architecture. Used for machine translation, summarization, and any \"text-to-text\" task where input and output are both variable-length sequences.

"},{"type":"interactive","component":"TransformerArchCompare","caption":"Click any layer block to see exactly what it does. Compare how BERT, T5, and GPT handle the same fundamental pieces — embeddings, attention, and feed-forward — differently.","props":{}},{"type":"text","html":"

Inference, End-to-End

When the transformer encounters orange in \"I ate an orange,\" attention blends in context about food and eating — the embedding lands near other fruits. In \"The sunset is orange,\" the same word blends with color context and lands near other color words. Same word. Different embeddings. Determined by context. We finally solved bank of the river versus deposited money at the bank.

"},{"type":"interactive","component":"InferenceWalkthrough","caption":"Step through each stage of the inference pipeline — from raw text to generated output — and see exactly what the transformer does at every layer.","props":{}},{"type":"callout","variant":"warning","title":"Why Transformers Have Token Limits","html":"

Attention has O(n²) complexity in sequence length. Double the context, quadruple the memory and compute. Mitigations exist — sparse attention (Longformer, BigBird), gradient checkpointing, mixed-precision training — and context windows have grown from ~2K tokens in 2020 to 1M+ in 2025. But the fundamental constraint hasn't gone away.
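
A back-of-the-envelope sketch of the quadratic growth. It assumes the attention scores are stored naively as one n-by-n matrix per head, with 8 heads and 2 bytes per value; real implementations often avoid materializing this matrix, so treat the numbers as illustrative only.

```python
def attention_matrix_bytes(seq_len, num_heads=8, bytes_per_value=2):
    # One seq_len x seq_len score matrix per head, for a single layer and example.
    return num_heads * seq_len * seq_len * bytes_per_value

for n in (2048, 4096, 8192):
    mib = attention_matrix_bytes(n) / 2**20
    print(n, 'tokens ->', mib, 'MiB per layer')
```

Under these assumptions the score matrices take 64 MiB at 2,048 tokens, 256 MiB at 4,096, and 1,024 MiB at 8,192: each doubling of context multiplies the cost by four.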

"},{"type":"checkpoint","id":"nlp-q10-transformer-types","kind":"mc","question":"You are building a model that takes a customer support email and produces a one-sentence summary, potentially using words not in the original. Which transformer type is most appropriate?","options":[{"label":"Encoder-only (e.g., BERT)","explanation":"Encoder-only models produce contextualized representations but don't generate text. You can't produce a summary from an encoder alone."},{"label":"Decoder-only (e.g., GPT)","explanation":"Decoder-only models can generate text, but for a task with a distinct input document and output summary, an encoder-decoder is more natural."},{"label":"Encoder-decoder (e.g., T5, BART)","correct":true,"explanation":"Encoder-decoder models process the full input in the encoder then generate a new sequence in the decoder — the standard architecture for abstractive summarization."},{"label":"Any architecture works equally well","explanation":"While large decoder-only LLMs can be prompted to summarize, the architectures aren't equivalent for training a specialized summarizer. Encoder-decoder naturally fits the input-to-output structure."}]}]}]},{"id":"nlp-implementation","number":7,"title":"NLP Implementation","overview":"The gap between a working NLP model in a notebook and a working NLP system in production is filled with implementation details and applied technique. This chapter covers the practical mechanics — padding and masks for variable-length sequences, text augmentation, imbalanced data, data-splitting pitfalls, and transfer learning strategies — then moves to the applied NLP tasks that put these skills to work: text similarity, summarization (extractive vs. abstractive), and topic modeling.","sections":[{"number":1,"id":"variable-length-sequences","title":"Handling Variable-Length Sequences","blocks":[{"type":"text","html":"

Natural text comes in every possible length. Your model expects fixed-size inputs. Two fixes:

Padding. Add zeros (or <PAD> tokens) to the end of shorter sequences until every sequence in the batch is the same length. Then provide a padding mask so the model knows to ignore those positions. This mask can be used in the loss calculation, in RNN hidden states, or in attention scores. Without a mask, your model will attend to padding tokens as if they were meaningful, and the loss signal will be polluted.

Packed Sequences (PyTorch). An optimization for RNNs that only processes actual tokens, not padding. At each time step, the batch size shrinks as shorter sequences finish. Same correctness, less wasted compute.
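
A sketch of padding plus a mask in plain Python. The pad id of 0, the token ids, and the per-token losses are all made up for illustration; a real pipeline would take the pad id from its tokenizer.

```python
PAD_ID = 0   # assumed id for the [PAD] token

def pad_batch(sequences, pad_id=PAD_ID):
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding.
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, mask

def masked_mean(values, mask_row):
    # Average only over positions the mask marks as real.
    kept = [v for v, m in zip(values, mask_row) if m == 1]
    return sum(kept) / len(kept)

batch = [[5, 8, 2], [7], [4, 9]]
padded, mask = pad_batch(batch)

# Made-up per-token losses for the second sequence; the last two sit on padding.
per_token_loss = [0.5, 9.9, 9.9]
loss = masked_mean(per_token_loss, mask[1])   # the padding losses are ignored
```

Without the mask, the two padding positions would dominate the average and the model would be trained on noise.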

"},{"type":"interactive","component":"PaddingPackingVisualizer","caption":"Toggle between padding and packing to see how the same four sentences are handled differently. Use the mask toggle on the padding view to see how [PAD] tokens get excluded from computation.","props":{}}]},{"number":2,"id":"data-augmentation","title":"Text Data Augmentation","blocks":[{"type":"text","html":"

Computer vision has a deep playbook for augmentation — rotate, flip, crop, color jitter. Text is harder because every transformation risks changing meaning. The most common approaches:

Back-translation: translate to French, translate back to English. Often produces a paraphrase that preserves meaning.
Synonym replacement: swap words with synonyms (NLTK's WordNet synsets are great for this).
Random insertion / deletion / swap / substitution: aggressive, useful for robustness, dangerous for meaning.
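
A sketch of two of the random operations, using a seeded generator so runs are repeatable. The function names, the deletion probability, and the example sentence are illustrative assumptions.

```python
import random

def random_deletion(words, p, rng):
    # Drop each word with probability p, but never return an empty sentence.
    kept = [w for w in words if rng.random() > p]
    return kept if kept else [rng.choice(words)]

def random_swap(words, rng):
    # Exchange two randomly chosen positions.
    if len(words) < 2:
        return list(words)
    i, j = rng.sample(range(len(words)), 2)
    swapped = list(words)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return swapped

rng = random.Random(0)
sentence = 'the battery life on this phone is great'.split()
deleted = random_deletion(sentence, p=0.2, rng=rng)
swapped = random_swap(sentence, rng)
```

Neither operation knows anything about meaning: deleting a word like not would flip the sentiment, which is why inspecting the output by hand matters.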

"},{"type":"callout","variant":"warning","title":"Recommendations for Text Augmentation","html":"

Don't change the label. If you're doing sentiment analysis and your augmentation flips \"good\" to \"bad,\" you've created poisoned training data.
Augment equally across classes. Asymmetric augmentation creates an artificial imbalance.
Manually inspect samples. Always. Augmentation pipelines silently corrupt data, and you'll only notice when your validation metrics inexplicably tank.
Combine methods for diversity, but layer them carefully: stacking too many aggressive operations quickly produces nonsense.

"}]},{"number":3,"id":"imbalanced-data","title":"Handling Class Imbalance","blocks":[{"type":"text","html":"$1d"},{"type":"callout","variant":"tip","title":"Pick Your Evaluation Metric First","html":"

Accuracy is misleading on imbalanced datasets: a model that always predicts the majority class gets 99% accuracy on a 99/1 split while never identifying a single minority-class example. Use precision, recall, F1, or AUC-ROC depending on the relative cost of false positives vs. false negatives for your application. Decide this before you train.
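
The 99/1 case worked through in code, as a sketch. The helper classification_metrics is written out by hand for illustration; in practice a library such as scikit-learn provides these metrics.

```python
def classification_metrics(y_true, y_pred, positive=1):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    # Guard against division by zero when nothing is predicted positive.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {'accuracy': accuracy, 'precision': precision,
            'recall': recall, 'f1': f1}

y_true = [1] + [0] * 99   # one positive example in a hundred
y_pred = [0] * 100        # a model that always predicts the majority class
metrics = classification_metrics(y_true, y_pred)
```

Accuracy comes out at 0.99 while recall and F1 are both 0.0, which is the whole argument for choosing the metric first.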

"}]},{"number":4,"id":"data-splitting","title":"Data Splitting Best Practices","blocks":[{"type":"text","html":"$1e"},{"type":"callout","variant":"warning","title":"The Most Common Leakage Pattern","html":"

Near-duplicate documents are more common than you'd think — especially in any dataset scraped from the web. A news article published by Reuters may appear in 50 downstream publications. If one copy ends up in training and another in test, your evaluation numbers are optimistic by an unknown amount. Always deduplicate before splitting.
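
A sketch of deduplicating before the split. It only catches copies that become identical after lowercasing and stripping punctuation; true near-duplicates with edited wording need fuzzier techniques (MinHash is a common choice). The documents and helper names are made up.

```python
import hashlib
import re

def fingerprint(text):
    # Lowercase and collapse anything that is not a letter or digit.
    normalized = re.sub(r'[^a-z0-9]+', ' ', text.lower()).strip()
    return hashlib.sha256(normalized.encode('utf-8')).hexdigest()

def deduplicate(docs):
    seen = set()
    unique = []
    for doc in docs:
        key = fingerprint(doc)
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

docs = [
    'Rates rise again, Fed says.',
    'RATES RISE AGAIN - Fed says',      # same story, different formatting
    'Local team wins final.',
]
unique = deduplicate(docs)

# Split only after deduplication, so no copy can straddle train and test.
train, test = unique[:1], unique[1:]
```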

"}]},{"number":5,"id":"transfer-learning","title":"Transfer Learning and Fine-Tuning Strategies","blocks":[{"type":"text","html":"$1f"},{"type":"interactive","component":"FineTuningStrategyComparison","caption":"Compare fine-tuning strategies across different dataset sizes and task-pretraining similarity. Note this is for example purposes only, and you should always consider your specific dataset and task when choosing a fine-tuning strategy.","props":{}},{"type":"text","html":"

Libraries to Know

NLTK: the classic toolkit for traditional NLP. Tokenization, POS tagging, NER, parsing, sentiment lexicons. Used heavily for teaching and older production pipelines.
spaCy: industrial-strength tokenization and parsing in Python. Faster than NLTK, with pretrained statistical models and word vectors built in.
Hugging Face Transformers: Pretrained BERT, GPT, T5, and hundreds of others. Three building blocks — tokenizer, model architecture, and a task-specific head — snap together. Supports PyTorch and TensorFlow. Used well beyond NLP at this point.

"},{"type":"callout","variant":"example","title":"Real World: Hugging Face in Production","html":"

Almost every NLP production system you build will end up touching at least one Hugging Face model. Get fluent with the library. The documentation is excellent and the community model hub is one of the great resources in machine learning.

"},{"type":"checkpoint","id":"nlp-q8-finetuning","kind":"mc","question":"You have a small labeled dataset (500 examples) for a specialized medical NLP task. You are fine-tuning a large pretrained language model. Which strategy is most likely to produce a well-generalizing model?","options":[{"label":"Full fine-tuning — update all parameters to maximize flexibility","explanation":"Full fine-tuning with 500 examples is likely to overfit. The model has billions of parameters and far fewer training examples — most parameters will just memorize the training set."},{"label":"Frozen backbone — only train the classification head","explanation":"This is conservative but reasonable for a small dataset. The risk is that if the medical domain is far from the pretraining distribution (e.g., the model was pretrained on web text, not medical literature), the frozen representations may not be well-suited to the task."},{"label":"Gradual unfreezing — start from the top and unfreeze layers iteratively","correct":true,"explanation":"Gradual unfreezing gives the new task-specific head time to stabilize before lower pretrained layers are modified. This reduces catastrophic forgetting and overfitting risk on small datasets — the recommended approach in the original ULMFiT paper."},{"label":"Train from scratch — pretrained weights may carry biases from general web text","explanation":"Training from scratch requires vastly more data to achieve competitive performance. With 500 examples, you won't get close to the representations a pretrained model already has. Transfer learning is almost always better."}]}]},{"number":6,"id":"text-similarity","title":"NLP Applications: Text Similarity","blocks":[{"type":"text","html":"

How similar are two documents? Two flavors of the question:

Lexical similarity: how much vocabulary do they share?
Semantic similarity: how close are they in meaning, regardless of whether they share words?

Modern semantic similarity is built on embeddings. Encode both documents using a pretrained model (averaged Word2Vec vectors for cheap, a transformer for serious), then compute cosine similarity on the resulting vectors. The closer the cosine is to 1, the more similar the documents.
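
A sketch contrasting the two flavors. Jaccard word overlap stands in for lexical similarity, and the three-dimensional vectors are made-up stand-ins for what a pretrained encoder would return; no real model is called here.

```python
import math

def jaccard(a, b):
    # Lexical: shared words divided by all distinct words.
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    return len(set_a & set_b) / len(set_a | set_b)

def cosine(u, v):
    # Semantic: angle between embedding vectors, ignoring their length.
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

doc_a = 'the film was excellent'
doc_b = 'a truly great movie'
lexical = jaccard(doc_a, doc_b)     # 0.0: the sentences share no words

vec_a = [0.9, 0.1, 0.3]             # hypothetical embedding of doc_a
vec_b = [0.8, 0.2, 0.4]             # hypothetical embedding of doc_b
semantic = cosine(vec_a, vec_b)     # close to 1 for these toy vectors
```

The pair scores zero lexically yet would score high semantically with any reasonable encoder, which is the gap embeddings are meant to close.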

"},{"type":"callout","variant":"example","title":"Real World: Plagiarism Detection, Deduplication, RAG","html":"

Plagiarism detection (Grammarly, Turnitin): compare student submissions against each other and the web.
Duplicate question detection (Stack Overflow, customer support): merge or deduplicate questions that ask the same thing differently.
Search relevance: every search bar you've ever used is doing some form of similarity computation.
Retrieval in RAG systems: find the most relevant chunks of your knowledge base for a given query.

"}]},{"number":7,"id":"summarization","title":"NLP Applications: Text Summarization","blocks":[{"type":"text","html":"

There are two fundamentally different approaches to text summarization:

Extractive summarization selects a subset of sentences directly from the original document. Every word in the summary appeared in the source. Conservative, safe, sometimes choppy.

Abstractive summarization generates a new summary that captures the key points but may use language not in the original. Risky (can hallucinate), but more readable.

Both are in the wild. Amazon's review summaries are abstractive — they synthesize across many reviews into a paragraph that reads like a human wrote it. Email summaries on mobile devices are abstractive. Scientific article summaries can be either, depending on the tool.

"},{"type":"text","html":"

Extractive: TextRank

TextRank is an elegant unsupervised method. Treat each sentence as a node in a graph. Draw an edge between two sentences if they're similar enough. Now you have a sentence-similarity graph.

Run the PageRank algorithm on this graph. PageRank identifies the most \"central\" nodes — the ones well-connected to other well-connected nodes. The most central sentences are, intuitively, the most representative. Extract them.
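
A compact sketch of the idea in plain Python. It uses word overlap as the similarity measure, an unweighted graph, a threshold of 0.15, and a damping factor of 0.85; those are simplifying assumptions for illustration, and the sentences are made up.

```python
def overlap(a, b):
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def textrank(sentences, threshold=0.15, damping=0.85, iterations=50):
    n = len(sentences)
    # Undirected graph: connect two sentences if they are similar enough.
    neighbors = [[j for j in range(n)
                  if j != i and overlap(sentences[i], sentences[j]) >= threshold]
                 for i in range(n)]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbor shares its score equally among its own neighbors.
            incoming = sum(scores[j] / len(neighbors[j]) for j in neighbors[i])
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores

sentences = [
    'the battery lasts all day',
    'battery life is great and the screen is bright',
    'the screen is bright and sharp',
    'shipping was slow',
]
scores = textrank(sentences)
best = max(range(len(sentences)), key=lambda i: scores[i])
summary = sentences[best]
```

The second sentence wins because it is linked to both the battery sentence and the screen sentence, while the shipping sentence is connected to nothing and ranks last.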

"},{"type":"text","html":"

Abstractive: Pretrained Transformers

Fine-tune (or use off the shelf) a pretrained sequence-to-sequence transformer (e.g., BART or T5), ideally a checkpoint that has already been fine-tuned on a summarization dataset. Pipeline:

Preprocess and tokenize the document.
Break the document into chunks if it exceeds the model's max input length.
Pass each chunk through the model.
Stitch the outputs together (sometimes via a second-pass summary of summaries).

For very long documents — a research paper, say — you might summarize each section separately and then summarize the summaries. A multi-step approach that respects the model's context window.
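
The chunk-then-summarize flow can be sketched as below. The summarize argument is a placeholder for whatever model you plug in, first_words is a deliberately trivial stand-in rather than a real summarizer, and whitespace splitting stands in for a real tokenizer.

```python
def chunk_tokens(tokens, max_len):
    # Fixed-size, non-overlapping chunks that fit the model's input limit.
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]

def summarize_long(text, summarize, max_len):
    tokens = text.split()
    if len(tokens) <= max_len:
        return summarize(text)
    # First pass: summarize each chunk independently.
    partials = [summarize(' '.join(chunk))
                for chunk in chunk_tokens(tokens, max_len)]
    # Second pass: summarize the stitched-together partial summaries.
    return summarize(' '.join(partials))

def first_words(text, n=5):
    # Placeholder model: keeps the first n words. Swap in a real summarizer.
    return ' '.join(text.split()[:n])

document = ' '.join('w' + str(i) for i in range(100))   # 100 dummy tokens
result = summarize_long(document, first_words, max_len=20)
```

One caveat with this sketch: the second pass assumes the combined partial summaries fit within the model's limit, which a real pipeline would need to check.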

"},{"type":"callout","variant":"example","title":"Real World: Amazon Review Summaries","html":"

Amazon uses abstractive summarization to synthesize thousands of customer reviews into a paragraph highlighting what buyers most frequently mention about a product. This is exactly the multi-document abstractive summarization pipeline described above: ingest many short inputs, generate one coherent output.

"},{"type":"image","src":"/nlp/amazon_summary.png","alt":"Amazon review summary example","caption":"Amazon uses abstractive summarization to synthesize thousands of customer reviews into a paragraph highlighting what buyers most frequently mention about a product. This is the multi-document abstractive summarization pipeline described above, where you ingest many short inputs and generate one coherent output."}]},{"number":8,"id":"topic-modeling","title":"NLP Applications: Topic Modeling","blocks":[{"type":"text","html":"$20"},{"type":"callout","variant":"example","title":"Real World: Topic Modeling Is Everywhere","html":"

Every customer support system that routes tickets to teams is doing topic modeling. Every news app that auto-tags articles is doing topic modeling. Every product review system that highlights \"what people are saying about size, fit, comfort\" is doing topic modeling on attributes. This is one of the most common \"behind the scenes\" NLP tasks in industry.

"},{"type":"checkpoint","id":"nlp-q11-summarization","kind":"mc","question":"A legal tech company wants to summarize each page of a 200-page contract as a bullet list of key obligations. They want the summaries to use the exact language from the contract to avoid misrepresentation. Which summarization approach is more appropriate?","options":[{"label":"Abstractive summarization using a T5 model","explanation":"Abstractive models generate new text that may differ from the source — a significant risk in legal contexts where exact wording matters."},{"label":"Extractive summarization","correct":true,"explanation":"Extractive methods select sentences directly from the source document, guaranteeing that every word in the summary appeared in the original. This is the right choice when faithfulness to the source text is critical."},{"label":"LDA topic modeling","explanation":"LDA identifies topics (word distributions) but doesn't produce readable summaries — it outputs lists of keywords associated with each topic."},{"label":"TF-IDF keyword extraction","explanation":"TF-IDF extracts important keywords but doesn't produce sentence-level summaries. You'd get a list of important terms, not a readable summary of obligations."}]}]}]},{"id":"llms-beyond","number":8,"title":"LLMs + RAG","overview":"Large language models have transformed what's possible in NLP — and building with them is harder than it looks. This chapter covers the LLM landscape and application patterns (fine-tuning, prompt engineering, RAG, agents), then digs into the practical challenges that dominate real deployments: embedding space visualization and similarity metrics as tools for understanding and debugging, RAG's cascading design decisions, and the Curse of Evaluation.","sections":[{"number":1,"id":"what-is-an-llm","title":"What Is a Large Language Model?","blocks":[{"type":"checkpoint","id":"define-llm","kind":"reflective","question":"What is a Large Language Model? 
Write your own definition before reading on.","sampleAnswer":"A large language model is a neural network trained on massive amounts of text data with enough parameters to learn general-purpose language representations. 'Large' is relative — it refers to scale in both parameters and training data."},{"type":"text","html":"\n

The honest answer is that the definition is moving. GPT-1 (2018) is usually considered the first \"large\" language model. It had 117 million parameters. By today's standards, that's barely a small language model.

GPT-2 (2019): when OpenAI released it, they were so worried about misuse that they delayed the full model and released a smaller version with a technical paper instead. Reading that press release today is surreal. GPT-2 by modern standards is not capable of fooling anyone. Yet at the time the discourse was about responsible disclosure of dangerous AI. That tells you something about how fast the field has moved — and about how perceptions of \"dangerous capability\" shift relative to the frontier.

"},{"type":"timeline","events":[{"year":"2018","title":"GPT-1 — 117M parameters","body":"The first 'large' language model by the field's standards at the time. Demonstrated that unsupervised pretraining followed by fine-tuning could transfer across NLP tasks."},{"year":"2019","title":"GPT-2 — 1.5B parameters","body":"Caused a responsible-disclosure controversy despite producing outputs that look obviously AI-generated today. Demonstrates how fast the definition of 'dangerous capability' moves relative to the frontier."},{"year":"2020","title":"GPT-3 — 175B parameters","body":"Few-shot prompting. The model could perform tasks from a handful of examples in the prompt with no fine-tuning. A conceptual shift: the same model, different prompts."},{"year":"2022–2023","title":"ChatGPT, Claude, Gemini","body":"Instruction-tuned models with reinforcement learning from human feedback (RLHF). Suddenly, these models could be used in conversation by non-technical users. The public inflection point."},{"year":"2024–present","title":"Multimodal, Mixture of Experts, Long Context","body":"Models handle images, audio, and video. Context windows grow from 2K to 1M+ tokens. Mixture of Experts enables larger capacity at lower inference cost."}]},{"type":"text","html":"$21"}]},{"number":1,"id":"dimensionality-reduction","title":"Visualizing Embedding Spaces","blocks":[{"type":"image","src":"/nlp/laion.jpeg","alt":"UMAP of LAION-Aesthetics","caption":"All 12M captions from LAION-Aesthetics with score > 6, embedded with CLIP and UMAP'ed to 2d. Color is the domain of the image URL. Source"},{"type":"text","html":"

Modern embeddings have hundreds or thousands of dimensions. You can't visualize that. So we reduce.

The standard approaches:

PCA (Principal Component Analysis): linear. Captures global linear relationships. Fast. Use it when you want to find the major axes of variation in your data.
t-SNE: nonlinear. Constructs a low-dimensional representation where locally similar points stay close together. Good for revealing clusters. It does not preserve global structure: the gaps between clusters and their apparent sizes are not meaningful, so two clusters that look far apart in the plot may be close together in the original space.
UMAP: nonlinear, based on manifold learning. Similar goals to t-SNE but typically faster and better at preserving some global structure too.

For large embedding spaces where you want to see clusters and local relationships, prefer t-SNE or UMAP over PCA.
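
To make the PCA option concrete, here is a minimal sketch in plain Python so it runs without dependencies. The function names and the toy 4-dimensional vectors are illustrative assumptions, and the eigenvectors are found with simple power iteration rather than a library solver. In practice you would likely reach for a library implementation, and t-SNE or UMAP need dedicated packages.

```python
import math
import random

def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def top_eigenvector(matrix, iters=300, seed=0):
    # Power iteration: repeated multiplication converges toward the dominant eigenvector.
    rng = random.Random(seed)
    v = [rng.random() + 0.1 for _ in matrix]
    for _ in range(iters):
        w = matvec(matrix, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def pca_2d(rows):
    n, dims = len(rows), len(rows[0])
    means = [sum(r[d] for r in rows) / n for d in range(dims)]
    centered = [[r[d] - means[d] for d in range(dims)] for r in rows]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dims)]
           for i in range(dims)]
    pc1 = top_eigenvector(cov)
    # Deflation removes the variance along pc1 so the next run finds the second axis.
    lam = sum(a * b for a, b in zip(pc1, matvec(cov, pc1)))
    deflated = [[cov[i][j] - lam * pc1[i] * pc1[j] for j in range(dims)]
                for i in range(dims)]
    pc2 = top_eigenvector(deflated, seed=1)
    return [[sum(a * b for a, b in zip(r, pc1)),
             sum(a * b for a, b in zip(r, pc2))] for r in centered]

# Toy embeddings (made up): most of the variation lies along the first dimension.
embeddings = [
    [0.0, 0.1, 0.0, 0.0],
    [1.0, 0.0, 0.1, 0.0],
    [2.0, 0.2, 0.0, 0.1],
    [3.0, 0.1, 0.2, 0.0],
    [4.0, 0.0, 0.1, 0.1],
]
points = pca_2d(embeddings)
for p in points:
    print([round(x, 2) for x in p])
```

Each row of the output is a 2D coordinate you could scatter-plot. The first coordinate carries the major axis of variation, which is exactly what PCA is good at finding.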

"},{"type":"interactive","component":"ArticleEmbed","caption":"","props":{"href":"https://projector.tensorflow.org/","imageSrc":"/nlp/tf-projector.png","imageAlt":"TensorFlow Embedding Projector","publisher":"TensorFlow","title":"Embedding Projector","excerpt":"Visualize high-dimensional data — explore word embeddings in 2D and 3D using PCA, t-SNE, and UMAP.","ctaLabel":"Open tool"}}]},{"number":3,"id":"similarity-metrics","title":"Similarity Metrics and the Limits of Cosine","blocks":[{"type":"text","html":"$22"},{"type":"callout","variant":"warning","title":"The Limits of Cosine Similarity","html":"

No concept of proximity. Cosine cares only about the angle between vectors, not their position or length. Two vectors that are far apart in Euclidean distance (for example, one with ten times the magnitude of the other) still have a cosine similarity of 1 if they point in the same direction.
Assumes linear relationships. If meaning bends across the space nonlinearly, cosine misses it.
Struggles with sparse vectors. When most dimensions are zero, the score is driven by the few dimensions two vectors happen to share, so small overlaps can produce noisy or misleading similarities.
What's a \"good\" score? 0.85 is great in some embedding spaces, mediocre in others. There's no universal threshold — evaluate relative to your specific model and task.
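
A small sketch of the first point, using made-up 2D vectors. The helper names are illustrative, not from any particular library.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

small = [1.0, 2.0]
large = [10.0, 20.0]    # same direction as small, ten times the magnitude
neighbor = [2.0, 1.0]   # close to small in position, different direction

# Cosine calls small and large a perfect match even though they are far apart.
print(round(cosine_similarity(small, large), 3), round(euclidean_distance(small, large), 3))
# The nearby vector scores lower on cosine despite being much closer in distance.
print(round(cosine_similarity(small, neighbor), 3), round(euclidean_distance(small, neighbor), 3))
```

Whether that behavior is a feature or a bug depends on whether magnitude means anything in your embedding space.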

"},{"type":"callout","variant":"example","title":"Real World: Choosing a Similarity Metric for RAG","html":"

When you build a RAG system, the choice of similarity metric is one of several decisions to evaluate empirically. Some embedding models produce vectors where magnitude carries information; for those, dot product may outperform cosine. For sparse vectors (like TF-IDF), Jaccard similarity or BM25-style scoring often wins. Don't reach for cosine because it's the default. Reach for it because you tested it and it worked.
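
Here is a hedged sketch of what testing it can look like at the smallest scale: rank the same documents under two metrics and see whether the order changes. The vectors and document names are invented for illustration; a real comparison would use your own embeddings and labeled queries.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def rank(query, docs, metric):
    # Return document names ordered from most to least similar under the given metric.
    return sorted(docs, key=lambda name: metric(query, docs[name]), reverse=True)

query = [1.0, 0.0]
docs = {
    'aligned_small': [0.9, 0.1],    # nearly the same direction as the query, short vector
    'off_axis_large': [3.0, 3.0],   # 45 degrees off, but a much longer vector
}

print('cosine:', rank(query, docs, cosine))
print('dot:   ', rank(query, docs, dot))
```

The two metrics disagree on the top result here. Only labeled relevance judgments for your task can tell you which ordering is the right one.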

"},{"type":"checkpoint","id":"nlp-q12-similarity","kind":"mc","question":"You are using cosine similarity to find the most relevant documents for a query in a RAG system. A document with a cosine similarity of 0.91 is returned, but when you read it, it's clearly not relevant to the query. Which property of cosine similarity best explains this failure?","options":[{"label":"Cosine similarity doesn't account for document length","explanation":"Cosine similarity is magnitude-invariant, so document length (which affects magnitude) doesn't directly cause this failure — that's actually one of its strengths."},{"label":"Cosine similarity measures angle, not proximity — two vectors can point in the same direction but encode different content depending on the embedding model and task distribution","correct":true,"explanation":"High cosine similarity means the vectors are pointing in the same direction, but whether that direction corresponds to meaningful relevance depends entirely on the quality and distribution of the embedding model. A misleading high score is a known failure mode when the embedding model isn't well-calibrated for your domain."},{"label":"0.91 is too low a threshold for relevant documents","explanation":"There's no universal threshold for 'relevant' — and the question describes a failure at 0.91, not a threshold problem. The issue is the metric or embedding model, not the cutoff."},{"label":"Cosine similarity only works for short documents","explanation":"Cosine similarity is commonly used for documents of any length after sentence-level or chunk-level embedding. Document length is not the issue here."}]}]},{"number":3,"id":"rag","title":"Retrieval-Augmented Generation (RAG)","blocks":[{"type":"text","html":"

The pattern: instead of relying on the LLM's training data alone, retrieve relevant documents from a vector database at query time and include them in the prompt. The LLM then generates an answer grounded in the retrieved content.

End-to-end pipeline:

User query comes in.
Embedding model converts the query into a vector.
Similarity algorithm searches the vector database for the closest matches.
Closest matches are inserted into the prompt.
LLM generates a response grounded in the retrieved context.
Response returned to the user.
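
The steps above can be sketched end to end in a few lines. This is a toy under loud assumptions: the embedding is a bag-of-words count vector standing in for a real embedding model, the vector database is a Python list, and the final LLM call is left out entirely. All names are hypothetical.

```python
import math
from collections import Counter

def embed(text):
    # Stand-in for a real embedding model: a sparse bag-of-words count vector.
    return Counter(text.lower().split())

def cosine(a, b):
    shared = set(a) & set(b)
    num = sum(a[t] * b[t] for t in shared)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def retrieve(query, index, k=2):
    # Similarity search: score every stored chunk against the query vector.
    q = embed(query)
    scored = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in scored[:k]]

def build_prompt(query, chunks):
    # Insert the retrieved chunks into the prompt ahead of the question.
    context = ' '.join('[' + str(i + 1) + '] ' + c for i, c in enumerate(chunks))
    return 'Answer using only the context. Context: ' + context + ' Question: ' + query

chunks = [
    'vacation requests must be submitted two weeks in advance',
    'the office kitchen is cleaned every friday',
    'expense reports are due on the first monday of each month',
]
index = [(c, embed(c)) for c in chunks]   # the toy vector database

query = 'when are expense reports due'
top = retrieve(query, index, k=1)
prompt = build_prompt(query, top)
print(prompt)
# A real system would now send this prompt to an LLM and return its response.
```

Even in this toy, every function hides a decision: how to chunk, how to embed, which metric, how many results, and how to phrase the prompt.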

"},{"type":"image","src":"/nlp/simple-rag.png","alt":"RAG pipeline diagram: user query → embedding model → vector database similarity search → retrieved chunks inserted into LLM prompt → generated response","caption":"A simple RAG pipeline. Each step hides significant design decisions that interact with every other decision.","width":"max-w-3xl"},{"type":"image","src":"/nlp/rag-db.png","alt":"RAG pipeline diagram - building the vector database","caption":"Building a vector database for RAG is a major design decision. You need to decide on the chunking strategy and the embedding model.","width":"max-w-3xl"},{"type":"callout","variant":"warning","title":"RAG Is Not the Easy NLP Project","html":"

There's a story going around that RAG is the \"easy NLP project.\" It is not. It is a large can of worms. Here are the design decisions you have to make — and every single one of them interacts with every other one.

"},{"type":"text","html":"$23"},{"type":"interactive","component":"RAGPipelineBuilder","caption":"Configure a RAG pipeline step by step. See how each design decision affects the others and explore tradeoffs for different application types.","props":{}}]},{"number":4,"id":"curse-of-evaluation","title":"The Curse of Evaluation","blocks":[{"type":"text","html":"$24"},{"type":"image","src":"/nlp/curse-of-evaluation.png","alt":"curse of evaluation","caption":"The curse of evaluation ensures any RAG project is harder than it first appears..."},{"type":"callout","variant":"warning","title":"Answer Evaluation Questions Before Writing Code","html":"

The most important questions in any LLM-based system: How will you evaluate it? What will you need to evaluate it? If you can't answer these before writing any code, you'll spend three months building something you can't tell is working. Answer evaluation questions first. Build second.
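
As one concrete starting point, here is a sketch of scoring retrieval against a small golden dataset. The chunk ids, queries, and system output are made up; precision at k and recall at k are standard retrieval metrics, but which metrics matter for your system is a decision you have to make yourself.

```python
def precision_at_k(retrieved, relevant, k):
    # Fraction of the top-k retrieved chunks that are actually relevant.
    return sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant) / k

def recall_at_k(retrieved, relevant, k):
    # Fraction of the relevant chunks that show up in the top k.
    return sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant) / len(relevant)

# Hypothetical golden set: for each query, the chunk ids an expert marked as relevant.
golden = {
    'q1': {'c1', 'c4'},
    'q2': {'c2'},
}
# Hypothetical retriever output: ranked chunk ids per query.
system_output = {
    'q1': ['c1', 'c3', 'c4'],
    'q2': ['c5', 'c2', 'c7'],
}

k = 2
mean_precision = sum(precision_at_k(system_output[q], golden[q], k) for q in golden) / len(golden)
mean_recall = sum(recall_at_k(system_output[q], golden[q], k) for q in golden) / len(golden)
print(mean_precision, mean_recall)  # 0.5 0.75
```

Notice that none of this can be computed until someone has built the golden set, which is the point of answering the evaluation questions first.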

"},{"type":"checkpoint","id":"nlp-q13-rag","kind":"reflective","question":"You are building a RAG system for a company's internal documentation. You've chosen paragraph-level chunking, a specific open-source embedding model, cosine similarity, and GPT-4 as your LLM. Describe how you would evaluate the system, including what you would measure and how you would build your evaluation dataset.","sampleAnswer":"First, I would build a golden evaluation dataset of 100–200 (query, ideal answer) pairs by having subject-matter experts write representative questions and correct answers from the documentation. I'd then measure: (1) Retrieval precision — are the right chunks being retrieved? (2) Answer faithfulness — is the LLM's answer grounded in the retrieved context, or is it hallucinating? (3) Answer correctness — does the answer match the golden answer semantically? I'd use a combination of human review for correctness/faithfulness (on a sample), LLM-as-judge for scalable faithfulness scoring, and embedding similarity between generated and golden answers as a proxy metric. System metrics (latency, cost per query) are also tracked. I would then prioritize: if retrieval precision is low, revisit chunking and embedding model before touching the LLM."}]}]},{"id":"multimodal","number":9,"title":"Multimodal Models","overview":"The transformer's sequence-of-tokens insight extends far beyond language. This chapter covers the Vision Transformer (ViT), which tokenizes images as patches to apply a standard transformer encoder; CLIP, which places text and images in a shared embedding space via contrastive learning; and Mixture of Experts (MoE), which scales model capacity without proportional compute cost by routing each token to a small subset of specialized expert networks.","sections":[{"number":1,"id":"multimodal-intro","title":"Beyond Language — Multimodal Models","blocks":[{"type":"text","html":"

The deepest insight of the last few years of NLP is that the transformer isn't really a language architecture. It's a sequence-of-tokens architecture. If you can chop your data into tokens, you can feed it to a transformer.

Examples already in production:

Images → ViT (tokens are 16×16 pixel patches)
Audio / spectrograms → AST (tokens are time-frequency patches)
Time series → Timer, Informer (tokens are time windows)
Video → CogVideoX (tokens are spatial-temporal patches)

"},{"type":"callout","variant":"example","title":"Real World: Multimodal Systems Are Everywhere","html":"

The product analytics tool that ingests screenshots and writes summaries. The retail platform that lets you upload a photo and find similar items. The medical imaging system that reads scans and writes draft radiology reports. The customer support bot that ingests screenshots from frustrated users. All of this is transformer-based, much of it CLIP-style.

"}]},{"number":2,"id":"vision-transformer","title":"The Vision Transformer (ViT)","blocks":[{"type":"text","html":"

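For years, researchers tried to apply attention at the pixel level for images. But a 224×224 image has 50,176 pixels, and self-attention compares every token with every other token. That is roughly 2.5 billion pairwise scores per attention head, per layer.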

ViT, introduced in the paper An Image Is Worth 16x16 Words, solved this by chopping the image into 16×16 pixel patches and treating each patch as a token. A 224×224 image becomes a sequence of (224 / 16) × (224 / 16) = 14 × 14 = 196 patches. Each patch is flattened and linearly projected into an embedding, positional embeddings are added, and the sequence runs through a standard transformer encoder.
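
A minimal sketch of the patching step in plain Python, using a made-up single-channel image stored as nested lists. The patchify helper is hypothetical; real implementations work on tensors and follow this with a learned linear projection, which is omitted here.

```python
def patchify(image, patch):
    # image: list of rows of pixel values. Returns flattened patches in row-major order.
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError('image size must be divisible by the patch size')
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches

# Toy 224x224 grayscale image with arbitrary pixel values.
image = [[(r * 224 + c) % 256 for c in range(224)] for r in range(224)]
tokens = patchify(image, 16)
print(len(tokens), len(tokens[0]))  # 196 256

# Attention cost grows with the square of the sequence length.
pixel_pairs = (224 * 224) ** 2
patch_pairs = len(tokens) ** 2
print(pixel_pairs // patch_pairs)
```

With three color channels each flattened patch would hold 16 × 16 × 3 = 768 values instead of 256.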

This worked spectacularly. ViT and its descendants now lead many computer vision benchmarks. The connection to NLP is direct: a ViT encoder is architecturally almost identical to a BERT encoder. The main differences are the input tokenization and where layer normalization is applied inside each block.

"},{"type":"image","src":"/nlp/vit.png","alt":"An image divided into 16x16 pixel patches, each patch flattened and embedded as a token, fed into a transformer encoder","caption":"ViT: the same transformer encoder you know from NLP, with images tokenized into 16×16 patches instead of words. [Source]","width":"max-w-2xl"},{"type":"text","html":"

Why Patches Work

The patch-as-token trick works because local image regions carry coherent semantic content — a 16×16 patch of an eye, a wheel, or a leaf is already meaningful. The transformer then uses self-attention to relate patches across the whole image, capturing long-range structure that convolutional networks could only approximate through stacking layers.

ViT requires large training sets to outperform CNNs. On ImageNet alone, a ViT trained from scratch underperforms a ResNet of comparable size. But pretrained on JFT-300M or similar large corpora, ViT dominates. This data-hungry property mirrors BERT: inductive biases (locality, translation equivariance) are helpful when data is scarce, but given enough data, the general-purpose transformer can learn them implicitly.

"},{"type":"checkpoint","id":"nlp-q-vit-patch","kind":"mc","question":"A 224×224 image is processed by ViT with 16×16 patches. How many patch tokens does the encoder receive (ignoring any CLS token)?","options":[{"label":"50,176","explanation":"That's the number of individual pixels — ViT avoids this by grouping pixels into patches."},{"label":"196","correct":true,"explanation":"(224 / 16) × (224 / 16) = 14 × 14 = 196 patches. Each becomes one token, making the sequence length tractable."},{"label":"768","explanation":"768 is a common embedding dimension, not the sequence length."},{"label":"14","explanation":"14 is the number of patches along one axis; the grid is 14×14 = 196 total."}]}]},{"number":3,"id":"clip","title":"CLIP: Connecting Text and Images","blocks":[{"type":"text","html":"$25"},{"type":"image","src":"/nlp/clip.png","alt":"CLIP training","caption":"Training CLIP [Source]","width":"max-w-2xl"},{"type":"interactive","component":"CLIPTrainingWalkthrough","caption":"Step through CLIP's three-phase training loop: encoding both modalities, building the contrastive similarity matrix, and updating both encoders with the symmetric cross-entropy loss."},{"type":"text","html":"

Zero-Shot Classification

To classify an image with CLIP, encode the image and a set of text prompts like \"a photo of a dog\", \"a photo of a cat\". The class whose text embedding has the highest cosine similarity to the image embedding wins — no fine-tuning required. On ImageNet, CLIP achieves ~76% top-1 accuracy zero-shot, competitive with supervised ResNet-50.

This is a qualitative shift: instead of a fixed label set baked into the model's final layer, CLIP's classifier is defined at inference time by whatever text you provide. New classes cost a sentence, not a training run.
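
Here is a sketch of the zero-shot decision rule with invented 3-dimensional embeddings. In a real system the vectors would come from CLIP's image and text encoders; the temperature value is also an assumption chosen for illustration.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def zero_shot_classify(image_embedding, text_embeddings, temperature=0.01):
    img = normalize(image_embedding)
    # Cosine similarity between the image and each prompt embedding.
    sims = {label: sum(a * b for a, b in zip(img, normalize(vec)))
            for label, vec in text_embeddings.items()}
    # Softmax over scaled similarities turns them into class probabilities.
    best = max(sims.values())
    exps = {label: math.exp((s - best) / temperature) for label, s in sims.items()}
    total = sum(exps.values())
    probs = {label: e / total for label, e in exps.items()}
    return max(probs, key=probs.get), probs

# Made-up embeddings standing in for encoder outputs.
image_embedding = [0.9, 0.1, 0.2]
text_embeddings = {
    'a photo of a dog': [1.0, 0.0, 0.1],
    'a photo of a cat': [0.1, 1.0, 0.0],
}

label, probs = zero_shot_classify(image_embedding, text_embeddings)
print(label)
```

Adding a new class means adding one more entry to the prompt dictionary. Nothing about the model changes.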

"},{"type":"callout","variant":"info","title":"CLIP as Infrastructure","html":"

At this point, CLIP and its derivatives are foundational infrastructure for multimodal AI: Stable Diffusion uses a CLIP text encoder to condition image generation. Many reverse image search systems rely on CLIP-style embeddings. Content moderation pipelines use CLIP to detect images matching text descriptions of prohibited content.

"},{"type":"checkpoint","id":"nlp-q14-clip","kind":"mc","question":"CLIP is trained with a contrastive loss on (image, caption) pairs. How does the loss function encourage useful cross-modal embeddings?","options":[{"label":"It minimizes the cosine similarity between all image-text pairs in the batch","explanation":"Minimizing all similarities would give you random or anti-correlated embeddings."},{"label":"It maximizes similarity for every image-text pair in the batch equally","explanation":"This would collapse all embeddings to the same point."},{"label":"It maximizes similarity for matched (image, caption) pairs and minimizes it for all other pairs in the batch","correct":true,"explanation":"This is contrastive learning. The batch provides natural negative examples. The model learns to pull matched pairs together and push non-matched pairs apart in the joint embedding space."},{"label":"It fine-tunes the image encoder but keeps the text encoder frozen","explanation":"CLIP trains both encoders end-to-end from scratch."}]}]},{"number":4,"id":"mixture-of-experts","title":"Mixture of Experts","blocks":[{"type":"text","html":"$26"},{"type":"image","src":"/nlp/moe.jpg","alt":"Mixture of Experts compared to standard transformer","caption":"Mixture of Experts compared to a standard transformer. [Source]"},{"type":"text","html":"

Load Balancing

The naive routing problem: the router might always prefer a few popular experts and ignore the rest, wasting capacity. To prevent this, MoE training adds an auxiliary load-balancing loss that encourages each expert to receive roughly equal token traffic. Without it, expert collapse is common.
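
A sketch of top-2 routing with one common form of the auxiliary loss: the number of experts times the sum, over experts, of the fraction of routed tokens each expert receives multiplied by its mean router probability. This follows the style popularized by Switch Transformer, adapted here to top-2 routing as an assumption; exact formulations vary between models. The router logits are invented.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(logits, k=2):
    # Pick the k most probable experts and renormalize their gate weights.
    probs = softmax(logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}, probs

def load_balancing_loss(all_logits, k=2):
    num_experts = len(all_logits[0])
    counts = [0] * num_experts
    prob_sums = [0.0] * num_experts
    for logits in all_logits:
        gates, probs = top_k_route(logits, k)
        for i in gates:
            counts[i] += 1
        for i, p in enumerate(probs):
            prob_sums[i] += p
    frac_tokens = [c / (len(all_logits) * k) for c in counts]   # share of routed traffic
    mean_prob = [s / len(all_logits) for s in prob_sums]        # mean router probability
    return num_experts * sum(f * p for f, p in zip(frac_tokens, mean_prob))

# Invented router logits for 4 tokens and 4 experts.
balanced = [[5, 5, 0, 0], [0, 0, 5, 5], [5, 5, 0, 0], [0, 0, 5, 5]]
collapsed = [[5, 5, 0, 0]] * 4   # every token prefers the same two experts

print(round(load_balancing_loss(balanced), 3))
print(round(load_balancing_loss(collapsed), 3))
```

The loss sits at its minimum of 1 when traffic is spread evenly and grows as routing collapses onto a few experts, which gives the optimizer a reason to keep every expert in use.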

MoE in Production

MoE is now common among the largest production language models. Mixtral 8x7B routes each token through 2 of 8 experts per layer, so only about 13B of its roughly 47B parameters are active for any given token. That gives it the inference compute of a ~13B dense model with the capacity of a much larger one. GPT-4 is widely believed to be MoE, although OpenAI has not confirmed its architecture. The approach is also spreading to multimodal models, where different modalities can be routed to specialized experts.

"},{"type":"callout","variant":"info","title":"MoE Trade-offs","html":"

MoE models are harder to serve than dense models. All experts must be loaded into memory (or distributed across devices) even though only a few fire per token. Mixtral 8x7B needs roughly 90GB of VRAM for its weights alone at 16-bit precision, far more than a dense 70B model quantized to 4-bit (about 35GB). The compute savings are real, but the memory footprint is a practical constraint you'll hit in deployment.
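
The arithmetic behind that comparison, as a back-of-the-envelope sketch. It counts weights only and ignores activations, KV cache, and framework overhead. The parameter counts are the published figures for Mixtral 8x7B as best I know them, so treat them as assumptions.

```python
def weight_memory_gb(num_params, bits_per_param):
    # Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes).
    return num_params * bits_per_param / 8 / 1e9

mixtral_total_params = 46.7e9    # assumption: all experts plus shared layers
mixtral_active_params = 12.9e9   # assumption: parameters used per token with top-2 routing
dense_70b_params = 70e9

mixtral_16bit = weight_memory_gb(mixtral_total_params, 16)
dense_70b_4bit = weight_memory_gb(dense_70b_params, 4)

print(round(mixtral_16bit, 1), 'GB for Mixtral 8x7B weights at 16 bits')
print(round(dense_70b_4bit, 1), 'GB for a dense 70B model at 4 bits')
print(round(mixtral_active_params / mixtral_total_params, 2), 'of parameters active per token')
```

Memory scales with the total parameter count while compute scales with the active count, which is why the two numbers pull apart.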

"},{"type":"checkpoint","id":"nlp-q-moe-routing","kind":"mc","question":"A MoE transformer has 8 experts per layer and routes each token to the top-2. Compared to a dense transformer with the same per-token FLOPs, what does the MoE model gain?","options":[{"label":"Lower memory footprint at inference time","explanation":"MoE actually increases memory footprint — all experts must be loaded even though only 2 of 8 activate per token."},{"label":"Greater total parameter count and model capacity without proportional increase in per-token compute","correct":true,"explanation":"This is the core MoE trade-off: you can scale parameters (capacity) much faster than you scale FLOPs, because most parameters are dormant for any given token."},{"label":"Fewer total parameters than an equivalent dense model","explanation":"MoE has more total parameters — that's the point. It achieves high capacity with selective activation."},{"label":"Simpler training dynamics due to independent expert gradients","explanation":"MoE training is actually more complex — it requires auxiliary load-balancing losses to prevent expert collapse."}]}]}]}]}}],"$L27"]}]