The Preprocessing Pipeline

Take raw text — anything from a scraped webpage to a transcript of a lecture — and you'll typically do three things before it hits a traditional model:

1. Tokenize the text into pieces.
2. Remove stop words and punctuation that don't carry meaning.
3. Lemmatize or stem to collapse word variants to a root.
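
The three steps above can be sketched in plain Python. This is a minimal illustration, not a production pipeline: the `STOP_WORDS` set, the regex tokenizer, and the `crude_stem` suffix rules are all assumptions made for this example. Real projects usually hand these steps to a library such as NLTK or spaCy.

```python
import re

# Tiny illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "were",
              "and", "or", "of", "to", "in", "on"}

def tokenize(text):
    """Lowercase the text and keep only runs of letters, which also drops punctuation."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    """Filter out tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def crude_stem(token):
    """Strip a few common English suffixes. A toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Run the three steps in order: tokenize, filter, stem."""
    tokens = tokenize(text)
    tokens = remove_stop_words(tokens)
    return [crude_stem(t) for t in tokens]

print(preprocess("The cats were chasing the mice, and the dog barked."))
# → ['cat', 'chas', 'mice', 'dog', 'bark']
```

Note that "chasing" becomes the non-word "chas" and "mice" is left untouched. That is typical of stemming, which only trims suffixes; a lemmatizer would be expected to return "chase" and "mouse" instead.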

A modern transformer-based model needs much less of this, because it learns directly from sequences of subword tokens. But for traditional models, and for understanding what your tokenizer is doing under the hood, you need to know each step.

Preprocessing Is Architecture-Dependent

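As a rule of thumb, the more a model can learn from context, the less manual preprocessing it needs. BERT and GPT were trained on largely raw text: a learned subword tokenizer (WordPiece for BERT, byte-pair encoding for GPT) does the splitting, with no stop word removal or lemmatization. A TF-IDF logistic regression classifier, on the other hand, often benefits from both steps, because they merge word variants into shared features and shrink the vocabulary. Match your preprocessing choices to your model.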
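
To make the effect on a bag-of-words model concrete, the sketch below counts distinct features before and after normalization. The two sample sentences, the `STOP_WORDS` set, and the `LEMMAS` lookup table are hypothetical stand-ins for a real corpus and a real lemmatizer.

```python
import re

# Hypothetical toy resources; a real pipeline would use a library lemmatizer.
STOP_WORDS = {"the", "a", "is", "are", "was", "in", "on", "and"}
LEMMAS = {"running": "run", "ran": "run", "runs": "run",
          "dogs": "dog", "parks": "park"}

DOCS = [
    "The dog is running in the park.",
    "Dogs ran in the parks, and a dog runs on the road.",
]

def raw_vocabulary(docs):
    """Every distinct lowercase word becomes its own feature."""
    return {tok for doc in docs for tok in re.findall(r"[a-z]+", doc.lower())}

def normalized_vocabulary(docs):
    """Drop stop words, then map each remaining word to its lemma."""
    vocab = set()
    for doc in docs:
        for tok in re.findall(r"[a-z]+", doc.lower()):
            if tok not in STOP_WORDS:
                vocab.add(LEMMAS.get(tok, tok))
    return vocab

print(len(raw_vocabulary(DOCS)))         # → 14
print(sorted(normalized_vocabulary(DOCS)))  # → ['dog', 'park', 'road', 'run']
```

Fourteen features collapse to four, so the counts for "running", "ran", and "runs" are pooled into a single "run" feature instead of being spread across three sparse columns.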
