TF-IDF and N-grams

The Problem with Common Words, and What TF-IDF Does About It

Bag of words has a problem: in a corpus of movie reviews, the word movie shows up in nearly every document. It tells you the topic, not what distinguishes one review from another. To focus on the most meaningful words, we want to downweight words that are common across the corpus and upweight words that are distinctive to a specific document.

That's TF-IDF: Term Frequency × Inverse Document Frequency.

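  • TF is just the count from bag-of-words: how many times the word appears in this document.
  • IDF is the log of (total documents / documents containing this word). The worked examples below use a base-10 log. A word in 1 of 6 documents gets a high IDF (log(6/1) ≈ 0.78). A word in 6 of 6 gets an IDF of zero, because log(1) = 0.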

So for amazing (appears twice in our review, in 1 of 6 documents):

TF-IDF = 2 × log(6/1) ≈ 1.56

For movie (appears once, in 2 of 6 documents):

TF-IDF = 1 × log(6/2) ≈ 0.48

amazing is now correctly identified as the more informative word in this review. TF-IDF is one of the most useful tools in NLP, and it's been powering search ranking algorithms for decades.
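The two calculations above can be reproduced in a few lines of Python. This is a minimal sketch, assuming raw counts for TF and a base-10 log for IDF (the base that matches the 1.56 and 0.48 figures). The function names idf and tf_idf are illustrative, not taken from any library.

```python
import math

def idf(n_docs: int, doc_freq: int) -> float:
    """Inverse document frequency with a base-10 log (assumed base)."""
    return math.log10(n_docs / doc_freq)

def tf_idf(count: int, n_docs: int, doc_freq: int) -> float:
    """Raw term count times IDF, as in the worked example."""
    return count * idf(n_docs, doc_freq)

# "amazing": appears twice in the review, in 1 of 6 documents
amazing_score = tf_idf(count=2, n_docs=6, doc_freq=1)
# "movie": appears once in the review, in 2 of 6 documents
movie_score = tf_idf(count=1, n_docs=6, doc_freq=2)

print(f"amazing: {amazing_score:.2f}")       # amazing: 1.56
print(f"movie: {movie_score:.2f}")           # movie: 0.48
print(f"word in all docs: {idf(6, 6):.2f}")  # word in all docs: 0.00
```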

TF-IDF Calculator

IDF values (corpus-wide)

Higher IDF = more distinctive. Words in all docs get IDF = 0.

Document: "great movie amazing plot"

  • amazing: count 1, TF-IDF 0.448
  • great: count 1, TF-IDF 0.275
  • movie: count 1, TF-IDF 0.275
  • plot: count 1, TF-IDF 0.275

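Formula: TF-IDF = (count / doc length) × log(N / df). Note that the calculator divides the count by the document length and, judging by the scores it shows, uses the natural log, so its numbers are not directly comparable to the raw-count, base-10 example above. Words unique to one document score highest. Words appearing in every document score 0.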

Enter a small corpus and see TF-IDF scores computed in real time. Compare which words rise and which fall relative to raw counts.
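For readers without the interactive calculator, here is a rough Python sketch of the same formula. The six-document corpus is hypothetical: it was chosen so that amazing appears in 1 of 6 documents and great, movie, and plot each appear in 2 of 6, which is consistent with the scores shown above if the calculator uses the natural log. The function name tf_idf_scores is also just an illustration.

```python
import math
from collections import Counter

# Hypothetical corpus; only the first document comes from the calculator.
corpus = [
    "great movie amazing plot",
    "great acting boring story",
    "terrible movie weak script",
    "slow plot bad ending",
    "loved the soundtrack",
    "would not recommend",
]

def tf_idf_scores(doc_index, docs):
    """TF-IDF = (count / doc length) * ln(N / df) for one document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents containing each word.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    tokens = tokenized[doc_index]
    counts = Counter(tokens)
    return {
        word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
        for word, count in counts.items()
    }

scores = tf_idf_scores(0, corpus)
# Sort by score (highest first), then alphabetically for ties.
for word, score in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{word}: {score:.3f}")
# amazing: 0.448
# great: 0.275
# movie: 0.275
# plot: 0.275
```

Dividing by document length keeps long documents from scoring higher simply because they contain more words.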

N-grams: Sneaking Word Order Back In

Bag-of-words throws away word order. "Terrible acting but great plot" and "Great acting but terrible plot" become the same vector.
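A quick way to see this is to count the words in both reviews and compare the results. The sketch below assumes lowercasing and whitespace tokenization; bag_of_words is an illustrative helper, not a library function.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase, split on whitespace, and count each word."""
    return Counter(text.lower().split())

review_a = bag_of_words("Terrible acting but great plot")
review_b = bag_of_words("Great acting but terrible plot")

print(review_a == review_b)  # True: the word counts are identical
```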

N-grams patch this. Instead of counting single words (unigrams), count adjacent pairs (bigrams) or triples (trigrams):

  • Unigrams: great, movie, amazing, plot
  • Bigrams: great movie, movie amazing, amazing plot
  • Trigrams: great movie amazing, movie amazing plot
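The lists above can be generated with a sliding window over the tokens. This is a minimal sketch; the helper name ngrams is an assumption made for illustration.

```python
def ngrams(tokens, n):
    """Return every run of n adjacent tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "great movie amazing plot".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(unigrams)  # ['great', 'movie', 'amazing', 'plot']
print(bigrams)   # ['great movie', 'movie amazing', 'amazing plot']
print(trigrams)  # ['great movie amazing', 'movie amazing plot']
```

A document with k tokens yields k - n + 1 n-grams, so the vocabulary grows quickly as n increases.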

Bigrams help with negation handling: "not good" becomes a single bigram token that the model can learn to associate with negative sentiment. A unigram bag-of-words model sees good on its own and counts it as positive evidence; a bigram model can recognize the construction.
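To make this concrete, the sketch below counts bigrams for the two reordered reviews from earlier in this section and checks for the "not good" pair. The sentence "the plot was not good" is a made-up example, and bigram_counts is an illustrative helper.

```python
from collections import Counter

def bigram_counts(text):
    """Count adjacent word pairs after lowercasing and splitting on whitespace."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))

review_a = bigram_counts("Terrible acting but great plot")
review_b = bigram_counts("Great acting but terrible plot")
print(review_a == review_b)  # False: bigrams tell the two reviews apart

negated = bigram_counts("the plot was not good")  # hypothetical review
print(("not", "good") in negated)  # True: the negation is one feature
```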

Real World: Elasticsearch Search Relevance

TF-IDF-style scoring still powers production search systems. Elasticsearch's default relevance scoring is BM25, which is, at its core, TF-IDF with refinements such as term-frequency saturation and document-length normalization. When you build a RAG system later in this unit, you'll see TF-IDF retrieval as a perfectly viable competitor to dense vector search in many domains.

Checkpoint

In a corpus of 1,000 product reviews, the word 'product' appears in every single review. What would its IDF score be, and what does that mean for its usefulness as a feature?