Unit 3
Natural Language Processing
Explore how machines process and generate language, from bag-of-words representations through attention and transformers, to practical implementation, LLMs, RAG, and multimodal models.
Chapter 1
Introduction to NLP
You are likely using NLP systems every day without even realizing it: search, spam filters, translation, sentiment analysis, and text generation are all NLP. This chapter maps the full landscape of what NLP does, identifies the fundamental challenges that make language so much harder to represent than images, and lays out the historical arc from bag-of-words to large language models that the rest of the unit follows.
Chapter 2
Text Preprocessing
Before text reaches a model, it must be transformed into a structured form the model can work with. This chapter walks through the standard preprocessing pipeline (tokenization, stop word removal, and stemming vs. lemmatization) and explains when each step matters and when modern transformer-based models let you skip it.
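To make the pipeline concrete, here is a minimal sketch using only the Python standard library. The stop-word list and the `crude_stem` helper are hypothetical stand-ins for illustration; a real project would more likely use a maintained stop-word list and an established stemmer or lemmatizer.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "are", "on", "of", "and", "to", "in"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def crude_stem(token):
    """Strip a few common suffixes. A toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Run the three steps in order: tokenize, filter, stem."""
    return [crude_stem(t) for t in remove_stop_words(tokenize(text))]

tokens = preprocess("The cats are sitting on the mats")
print(tokens)
```

Note that the toy stemmer turns "sitting" into "sitt", which is not a word. That is the typical trade-off of stemming: it is fast but crude, whereas lemmatization maps to dictionary forms at higher cost.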
Chapter 3
Traditional Approaches to NLP
Before neural networks took over NLP, the field ran on bag-of-words, TF-IDF, N-grams, and Hidden Markov Models. This chapter builds intuition for each: how they work, where they fail, and where they still belong. A baseline of TF-IDF features fed to logistic regression is still the right first move for most text classification tasks, and understanding why sets up the contrast that motivates everything that comes next.
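As a rough illustration of what TF-IDF computes, the sketch below weights each term by its frequency within a document times the log of its inverse document frequency. It assumes the plain, unsmoothed formula; libraries typically apply smoothed or normalized variants, so the numbers will not match theirs exactly. The three toy documents are made up for the example.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            # Term frequency times unsmoothed inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy corpus, already tokenized by whitespace.
docs = [
    "the movie was great".split(),
    "the movie was terrible".split(),
    "the plot was great fun".split(),
]
weights = tf_idf(docs)
print(weights[1])
```

Words that appear in every document ("the", "was") receive a weight of zero, while a word unique to one document ("terrible") receives the highest weight in that document. These weights are the features a downstream classifier would consume.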
Chapter 4
Word Embeddings
Bag-of-words vectors are sparse, high-dimensional, and carry no notion of similarity between words. Word embeddings replace them with dense, low-dimensional vectors where geometric distance reflects semantic similarity. This chapter walks through Word2Vec's deceptively simple architecture, explains why extracting the weight matrix as embeddings works, demonstrates vector arithmetic (king − man + woman ≈ queen), and introduces Doc2Vec and GloVe.
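The vector arithmetic can be sketched with a handful of hand-written vectors. These 3-dimensional embeddings are hypothetical and were chosen by hand so the analogy works; trained Word2Vec vectors have hundreds of dimensions and are learned from text, not assigned.

```python
import math

# Hypothetical hand-written vectors; axes loosely mean (royalty, male, female).
embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c):
    """Return the word closest to a - b + c, excluding the three query words."""
    target = [x - y + z for x, y, z in
              zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = [w for w in embeddings if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(target, embeddings[w]))

result = analogy("king", "man", "woman")
print(result)
```

Excluding the query words from the candidates is a common convention in analogy evaluation; without it, the nearest neighbor of the target is often one of the inputs.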
Chapter 5
Recurrent Neural Networks
Feed-forward networks are stateless: they treat every observation independently. Language requires memory. This chapter introduces RNNs as the first architecture that carries information forward through time, maps out the four sequence architecture types (seq-to-seq, seq-to-vector, vector-to-seq, encoder-decoder), explains the vanishing gradient problem that limits plain RNNs, and shows how LSTMs and GRUs mitigate it through gated memory.
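A scalar toy RNN is enough to see both ideas: the hidden state carries information forward, and the gradient that flows back through many steps shrinks toward zero. The weights `w_x` and `w_h` below are arbitrary assumed values, and a real RNN uses weight matrices in place of scalars.

```python
import math

def rnn_forward(xs, w_x=0.5, w_h=0.5, h0=0.0):
    """Scalar RNN: h_t = tanh(w_x * x_t + w_h * h_{t-1}). Returns all hidden states."""
    hidden_states, h = [], h0
    for x in xs:
        h = math.tanh(w_x * x + w_h * h)
        hidden_states.append(h)
    return hidden_states

def grad_last_wrt_first(hidden_states, w_h=0.5):
    """d h_T / d h_1 by the chain rule: product of w_h * (1 - h_t^2) for t = 2..T."""
    grad = 1.0
    for h in hidden_states[1:]:
        grad *= w_h * (1.0 - h * h)
    return grad

short_grad = grad_last_wrt_first(rnn_forward([1.0] * 5))
long_grad = grad_last_wrt_first(rnn_forward([1.0] * 50))
print(short_grad, long_grad)
```

Each factor in the product is at most `w_h` in magnitude, so with `w_h = 0.5` the gradient over 50 steps is vanishingly small compared with the gradient over 5 steps. Early inputs therefore contribute almost nothing to the weight updates.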
Chapter 6
Attention and Transformers
Static word embeddings give every word one vector regardless of context, so they cannot distinguish 'bank of the river' from 'deposited money at the bank.' Self-attention addresses this by computing each word's representation as a weighted blend of every word in the sequence, itself included. This chapter builds self-attention from scratch (dot product → softmax → weighted sum), introduces the Q/K/V parameterization that makes it trainable, extends to multi-head and cross-attention, then assembles the full transformer architecture from the 2017 'Attention Is All You Need' paper: encoder, decoder, positional encodings, and masking. It closes with three popular transformer families, represented by BERT (encoder-only), GPT (decoder-only), and T5 (encoder-decoder).
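The three-step recipe (dot product → softmax → weighted sum) can be sketched in plain Python. This is the unparameterized form only: it has no Q/K/V projections, no scaling by the square root of the dimension, and no multiple heads. The 2-dimensional token vectors are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Unparameterized self-attention over a list of equal-length vectors."""
    outputs = []
    for query in vectors:
        # Step 1: dot product of this token with every token, itself included.
        scores = [sum(q * k for q, k in zip(query, key)) for key in vectors]
        # Step 2: softmax converts the scores into attention weights.
        weights = softmax(scores)
        # Step 3: the output is the weighted sum of all token vectors.
        blended = [sum(w * v[i] for w, v in zip(weights, vectors))
                   for i in range(len(query))]
        outputs.append(blended)
    return outputs

# Hypothetical 2-d embeddings for a three-token sequence.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contextual = self_attention(tokens)
print(contextual)
```

Each output vector has the same shape as its input but now mixes in information from the other tokens, weighted by how similar they are to it.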
Chapter 7
NLP Implementation
The gap between a working NLP model in a notebook and a working NLP system in production is filled with implementation details and applied technique. This chapter covers the practical mechanics — padding and masks for variable-length sequences, text augmentation, imbalanced data, data-splitting pitfalls, and transfer learning strategies — then moves to the applied NLP tasks that put these skills to work: text similarity, summarization (extractive vs. abstractive), and topic modeling.
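As a small sketch of the padding-and-mask mechanics, the function below pads a batch of token-id sequences to a common length and builds a matching mask. The padding id of 0 and the token ids are assumptions for illustration; real tokenizers define their own padding id.

```python
PAD_ID = 0  # hypothetical id reserved for padding

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the longest length and build a 0/1 mask."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks a padding position the model should ignore.
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, mask

# Hypothetical batch of token-id sequences with different lengths.
batch = [[5, 8, 2], [7], [4, 9]]
padded, mask = pad_batch(batch)
print(padded)
print(mask)
```

The padded batch is rectangular, so it can be stacked into a single tensor, while the mask lets pooling and attention layers skip the padding positions.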
Chapter 8
LLMs + RAG
Large language models have transformed what's possible in NLP — and building with them is harder than it looks. This chapter covers the LLM landscape and application patterns (fine-tuning, prompt engineering, RAG, agents), then digs into the practical challenges that dominate real deployments: embedding space visualization and similarity metrics as tools for understanding and debugging, RAG's cascading design decisions, and the Curse of Evaluation.
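One reason similarity metrics deserve attention is that they can disagree about which document is closest to a query. The sketch below compares dot product, cosine similarity, and Euclidean distance on two hypothetical 2-dimensional document embeddings, one short but well aligned with the query and one long but off-axis.

```python
import math

def dot(u, v):
    """Dot product: sensitive to both direction and vector length."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity: direction only, vector length is normalized away."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def euclidean(u, v):
    """Euclidean distance: smaller means closer."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical embeddings chosen to make the metrics disagree.
query = [1.0, 0.0]
docs = {"aligned_short": [0.9, 0.1], "off_axis_long": [3.0, 3.0]}

best_by_dot = max(docs, key=lambda name: dot(query, docs[name]))
best_by_cosine = max(docs, key=lambda name: cosine(query, docs[name]))
best_by_euclidean = min(docs, key=lambda name: euclidean(query, docs[name]))
print(best_by_dot, best_by_cosine, best_by_euclidean)
```

Dot product rewards the longer vector, while cosine and Euclidean distance favor the aligned one. In a retrieval pipeline, the metric should match the one the embedding model was trained with, or the embeddings should be normalized first.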
Chapter 9
Multimodal Models
The transformer's sequence-of-tokens insight extends far beyond language. This chapter covers the Vision Transformer (ViT), which tokenizes images as patches to apply a standard transformer encoder; CLIP, which places text and images in a shared embedding space via contrastive learning; and Mixture of Experts (MoE), which scales model capacity without proportional compute cost by routing each token to a small subset of specialized expert networks.
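The routing idea behind Mixture of Experts can be sketched as follows: a gate scores every expert, only the top k are run, and their outputs are combined using renormalized gate weights. The experts here are hypothetical scalar functions standing in for feed-forward networks, and the gate scores are fixed by hand where a real model would learn them. Renormalizing with a softmax over only the selected scores is one of several conventions in use.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(gate_scores, k=2):
    """Pick the k highest-scoring experts and renormalize their gate weights."""
    ranked = sorted(range(len(gate_scores)),
                    key=lambda i: gate_scores[i], reverse=True)
    top = ranked[:k]
    weights = softmax([gate_scores[i] for i in top])
    return list(zip(top, weights))

def moe_layer(x, gate_scores, experts, k=2):
    """Weighted sum of the outputs of only the selected experts."""
    return sum(w * experts[i](x) for i, w in route_top_k(gate_scores, k))

# Hypothetical experts: plain scalar functions standing in for networks.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: -x, lambda x: x * x]
gate_scores = [0.1, 2.0, -1.0, 2.0]  # would come from a learned router

routing = route_top_k(gate_scores, k=2)
output = moe_layer(3.0, gate_scores, experts, k=2)
print(routing, output)
```

Only two of the four experts run for this input, which is how capacity grows with the number of experts while the per-token compute stays roughly tied to k.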