The Preprocessing Pipeline
Take raw text — anything from a scraped webpage to a transcript of a lecture — and you'll typically do three things before it hits a traditional model:
- Tokenize the text into pieces.
- Remove stop words (very frequent words such as "the" and "of") and punctuation, which carry little meaning on their own.
- Lemmatize or stem to collapse word variants to a root.
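The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production pipeline: `STOP_WORDS` is a deliberately tiny, hypothetical list, and `crude_stem` is a toy suffix stripper standing in for a real stemmer or lemmatizer (libraries such as NLTK and spaCy provide those). All function names here are made up for the example.

```python
import re

# Hypothetical, deliberately tiny stop word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to", "were"}


def tokenize(text):
    """Lowercase the text and keep runs of letters, digits, and apostrophes.

    Punctuation is dropped as a side effect, because it never matches the pattern.
    """
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def crude_stem(token):
    """Strip one common English suffix, keeping at least three characters.

    A toy stand-in for a real stemmer; it knows nothing about irregular forms.
    """
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Run the three steps in order: tokenize, filter, stem."""
    tokens = tokenize(text)
    tokens = remove_stop_words(tokens)
    return [crude_stem(t) for t in tokens]


print(preprocess("The cats were chasing the mice, and the dogs barked!"))
# ['cat', 'chas', 'mice', 'dog', 'bark']
```

Note the two imperfect results: `chas` is not a word, and `mice` never becomes `mouse`. Stemming only chops suffixes; mapping irregular forms to a dictionary root is what a lemmatizer is for.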
A modern transformer-based model needs much less of this: it learns directly from the token sequences its own tokenizer produces. But for traditional models, and for understanding what your tokenizer is doing under the hood, you need to know each step.
Preprocessing as Architecture-Dependent
As a rule of thumb, the more powerful the model, the less preprocessing you need. BERT and GPT were trained on raw text with minimal preprocessing beyond their own subword tokenization. A TF-IDF logistic regression classifier, on the other hand, often benefits from stop word removal and lemmatization. Match your preprocessing choices to your model.
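One reason bag-of-words models such as TF-IDF are sensitive to preprocessing is that every distinct surface form becomes its own feature. The sketch below counts vocabulary size on three toy sentences with and without normalization. The documents, the stop word list, and the plural-stripping rule are all invented for illustration, so treat the numbers as a demonstration of the effect rather than a benchmark.

```python
import re

# Hypothetical tiny stop word list, for illustration only.
STOP_WORDS = {"a", "the", "is", "are", "was", "on", "in"}


def raw_tokens(text):
    """Whitespace split only: case and attached punctuation are preserved."""
    return text.split()


def normalized_tokens(text):
    """Lowercase, drop punctuation and stop words, strip a trailing plural 's'."""
    words = re.findall(r"[a-z]+", text.lower())
    return [
        w[:-1] if w.endswith("s") and len(w) > 3 else w
        for w in words
        if w not in STOP_WORDS
    ]


docs = ["The cat sat on the mat.", "Cats sit on mats!", "A cat is on a mat"]

raw_vocab = {t for d in docs for t in raw_tokens(d)}
norm_vocab = {t for d in docs for t in normalized_tokens(d)}

print(len(raw_vocab), len(norm_vocab))  # 13 4
print(sorted(norm_vocab))               # ['cat', 'mat', 'sat', 'sit']
```

Without normalization, `mat`, `mat.`, and `mats!` are three unrelated features, so a linear model has to learn a weight for each from scratch. A model with a learned subword tokenizer handles that variation internally, which is why the same cleanup matters far less there.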