Stemming and Lemmatization

Stemming vs. Lemmatization

The words branch, branches, branching, branched all refer to roughly the same concept. We'd like to collapse them.

  • Stemming chops off suffixes mechanically. changes, changed, changing → chang. Not a real word. Doesn't matter: it's a feature, not a noun. Fast, crude.
  • Lemmatization uses a dictionary to map each form to a canonical root. is, am, were → be. changes → change. Slower, but the output is always a real word.

If you're throwing together a quick keyword classifier on millions of documents, stem. If you care about interpretability or accuracy, lemmatize.
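
To make the contrast concrete, here is a minimal sketch in plain Python. It is a toy, not the Porter algorithm and not WordNet: the suffix list, the three-character minimum, and the hand-written lemma table are all assumptions made up for illustration.

```python
# Toy illustration only: NOT the Porter algorithm and NOT WordNet.
# Hypothetical suffix rules, tried in this order.
SUFFIXES = ("ing", "ed", "es", "e", "s")

def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical hand-written lexicon standing in for a real dictionary.
LEMMAS = {
    "is": "be", "am": "be", "were": "be",
    "changes": "change", "changed": "change", "changing": "change",
}

def toy_lemmatize(word: str) -> str:
    """Look the word up; unknown words are returned unchanged."""
    word = word.lower()
    return LEMMAS.get(word, word)

forms = ["change", "changes", "changed", "changing"]
print({toy_stem(w) for w in forms})       # {'chang'}
print({toy_lemmatize(w) for w in forms})  # {'change'}
print(toy_stem("is"), toy_lemmatize("is"))  # is be
```

The stemmer collapses the regular forms without knowing any words, but it cannot relate is to be. The lookup can, at the cost of needing an entry for every form.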

Stemming vs. Lemmatization
Original    Porter Stem (removed suffix)    WordNet Lemma
The         the                             the
dogs        dog (-s)                        dog
were        were                            be
running     run (-ning)                     run
and         and                             and
jumping     jump (-ing)                     jump
quickly     quickli                         quickly

Porter Stemmer

Mechanically chops suffixes using a fixed set of rules. Fast, with no dictionary lookup, though the rules are written for English. Output may not be a real word: chang for changing. Fine for keyword matching.
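
As a rough sketch of what "mechanically" means, the function below implements only step 1a of Porter's original 1980 algorithm, the step that handles plural endings. The full algorithm has several more steps with extra conditions, and library implementations may differ in detail, so treat this as an illustration rather than a replacement for a real stemmer. The function name is made up for this example.

```python
def porter_step_1a(word: str) -> str:
    """Apply step 1a of Porter's 1980 algorithm (plural endings only)."""
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress -> caress (unchanged)
    if word.endswith("s"):
        return word[:-1]   # dogs -> dog
    return word

for w in ["caresses", "ponies", "caress", "dogs"]:
    print(w, "->", porter_step_1a(w))
```

Even this one step shows why stems need not be real words: the rule for -ies yields poni, and nothing checks the result against a vocabulary.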

WordNet Lemmatizer

Uses a dictionary to return canonical root words. Always a real word. be for was, is, are. Better for interpretability.
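
Dictionary lookup usually depends on the part of speech. The sketch below mimics that with a hypothetical mini-lexicon keyed by (word, part of speech); the table, the function name, and the tag letters are assumptions for illustration, not WordNet itself. NLTK's WordNetLemmatizer.lemmatize takes a similar pos argument that defaults to noun.

```python
# Hypothetical mini-lexicon keyed by (word, part of speech); a real
# lemmatizer would consult a full dictionary such as WordNet instead.
LEXICON = {
    ("was", "v"): "be",
    ("is", "v"): "be",
    ("are", "v"): "be",
    ("were", "v"): "be",
    ("running", "v"): "run",
    ("dogs", "n"): "dog",
}

def lookup_lemma(word: str, pos: str = "n") -> str:
    """Return the lemma for (word, pos), or the word itself if unknown."""
    key = (word.lower(), pos)
    return LEXICON.get(key, word)

print(lookup_lemma("were", pos="v"))  # be
print(lookup_lemma("were"))           # were (no noun entry, so unchanged)
print(lookup_lemma("dogs"))           # dog
```

Without the right part of speech, a verb form such as were falls through unchanged, which is a common reason a lemmatizer appears to do nothing.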
