Stemming and Lemmatization
Stemming vs. Lemmatization
The words branch, branches, branching, branched all refer to roughly the same concept. We'd like to collapse them into a single form, so that a model sees one feature instead of four.
- Stemming chops off suffixes mechanically. changes, changed, changing → chang. Not a real word. Doesn't matter; it's a feature, not a noun. Fast, crude.
- Lemmatization uses a dictionary to map each form to a canonical root. is, am, were → be. changes → change. Slower, but the output is always a real word.
If you're throwing together a quick keyword classifier on millions of documents, stem. If you care about interpretability or accuracy, lemmatize.
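To make the contrast concrete, here is a minimal sketch in plain Python. It is not the Porter algorithm and not WordNet: `toy_stem`, `toy_lemmatize`, the suffix list, and the tiny lemma table are all hypothetical stand-ins, chosen only to reproduce the examples above.

```python
from collections import Counter

# Hypothetical suffix list for illustration; a real stemmer such as Porter's
# applies ordered rules with extra conditions on what remains.
SUFFIXES = ("ing", "ed", "es", "s")

# Hypothetical hand-written lemma table standing in for a real dictionary.
LEMMAS = {
    "changes": "change", "changed": "change", "changing": "change",
    "is": "be", "am": "be", "were": "be",
}


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word: str) -> str:
    """Look the word up; unknown words pass through unchanged."""
    word = word.lower()
    return LEMMAS.get(word, word)


words = ["changes", "changed", "changing", "is", "am", "were"]
stem_counts = Counter(toy_stem(w) for w in words)
lemma_counts = Counter(toy_lemmatize(w) for w in words)

print(dict(stem_counts))   # chang: 3, is: 1, am: 1, were: 1
print(dict(lemma_counts))  # change: 3, be: 3
```

The suffix stripper collapses the three forms of change into chang but has no way to connect is, am, and were, because nothing about their spelling links them. The lookup handles both groups, at the cost of needing an entry for every form it should recognize.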
Example: "The dogs were running and jumping quickly"
| Original | Porter Stem | WordNet Lemma |
|---|---|---|
| The | the | the |
| dogs | dog | dog |
| were | were | be |
| running | run | run |
| and | and | and |
| jumping | jump | jump |
| quickly | quickli | quickly |
Porter Stemmer
Mechanically chops suffixes using a fixed set of rules. Fast and needs no dictionary, but the rules are written for English. Output may not be a real word: chang for changing. Fine for keyword matching.
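For a feel of what "mechanically" means, the sketch below implements only step 1a of Porter's published algorithm, the step that deals with plural endings. The function name `porter_step_1a` is made up for this example, and the remaining steps (which handle -ed, -ing, and longer suffixes) are deliberately left out, so this is not a usable stemmer on its own.

```python
def porter_step_1a(word: str) -> str:
    """Apply step 1a of Porter's algorithm (plural suffixes only)."""
    # Rules are tried in order; the first suffix that matches wins.
    if word.endswith("sses"):
        return word[:-2]  # sses -> ss, e.g. caresses -> caress
    if word.endswith("ies"):
        return word[:-2]  # ies -> i, e.g. ponies -> poni
    if word.endswith("ss"):
        return word       # ss -> ss, e.g. caress stays caress
    if word.endswith("s"):
        return word[:-1]  # s -> "", e.g. cats -> cat
    return word


for w in ["caresses", "ponies", "caress", "cats", "changes"]:
    print(w, "->", porter_step_1a(w))
```

Note that ponies becomes poni, which is already not a real word after a single step. changes comes out as change here; it takes the later steps of the full algorithm to trim it down to chang.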
WordNet Lemmatizer
Uses a dictionary to return canonical root words, so the output is always a real word: be for was, is, are. Better for interpretability.
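One practical wrinkle with dictionary lookup is that the right lemma can depend on the part of speech. The sketch below uses a hypothetical mini-dictionary keyed by (word, part of speech); `LEMMA_TABLE` and `lookup_lemma` are illustrative names, not WordNet's API. The noun default mirrors how NLTK's `WordNetLemmatizer.lemmatize` treats its `pos` argument, which is why verb forms often need an explicit tag there.

```python
# Hypothetical mini-dictionary keyed by (word, part of speech).
# "n" = noun, "v" = verb; the entries are illustrative only.
LEMMA_TABLE = {
    ("were", "v"): "be",
    ("is", "v"): "be",
    ("running", "v"): "run",
    ("dogs", "n"): "dog",
    ("saw", "v"): "see",
    ("saw", "n"): "saw",
}


def lookup_lemma(word: str, pos: str = "n") -> str:
    """Return the lemma for (word, pos); unknown pairs pass through unchanged."""
    key = word.lower()
    return LEMMA_TABLE.get((key, pos), key)


print(lookup_lemma("were"))       # treated as a noun, no entry: stays "were"
print(lookup_lemma("were", "v"))  # treated as a verb: "be"
print(lookup_lemma("saw", "v"))   # verb reading: "see"
print(lookup_lemma("saw", "n"))   # noun reading: "saw"
```

This is part of why lemmatization is slower in practice: to get were → be reliably, something upstream usually has to tag each token's part of speech first.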