TF-IDF and N-grams
The Problem with Common Words, and What TF-IDF Does About It
Bag-of-words has a problem: in a corpus of movie reviews, the word "movie" shows up in nearly every document. It's not informative; it's just the topic. To focus on the most meaningful words, we want to downweight words that are common across the corpus and upweight words that are distinctive to a specific document.
That's TF-IDF: Term Frequency × Inverse Document Frequency.
- TF is just the count from bag-of-words.
- IDF is the log of (total documents / documents containing this word); the examples here use base 10. A word in 1 of 6 documents gets a high IDF. A word in 6 of 6 gets an IDF of zero, because log(6/6) = log(1) = 0.
So for amazing (appears twice in our review, in 1 of 6 documents):
TF-IDF = 2 × log(6/1) ≈ 1.56
For movie (appears once, in 2 of 6 documents):
TF-IDF = 1 × log(6/2) ≈ 0.48
amazing is now correctly identified as the more informative word in this review. TF-IDF is one of the most useful tools in NLP, and it's been powering search ranking algorithms for decades.
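The arithmetic above can be checked in a few lines. This is a minimal sketch, assuming base-10 logarithms (which is what the worked numbers imply) and raw counts for TF. The function name `tf_idf` and the third example (a word found in all 6 documents) are illustrative, not from any library or from the original review.

```python
import math

def tf_idf(count, n_docs, doc_freq):
    """Raw-count TF times base-10 IDF (an assumption that matches the worked numbers)."""
    return count * math.log10(n_docs / doc_freq)

amazing = tf_idf(count=2, n_docs=6, doc_freq=1)     # twice in the review, in 1 of 6 docs
movie = tf_idf(count=1, n_docs=6, doc_freq=2)       # once in the review, in 2 of 6 docs
everywhere = tf_idf(count=3, n_docs=6, doc_freq=6)  # hypothetical word found in every doc

print(round(amazing, 2))  # 1.56
print(round(movie, 2))    # 0.48
print(everywhere)         # 0.0
```

Note that the word in every document scores zero no matter how often it occurs, because its IDF factor is log(1) = 0.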
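A common variant divides the raw count by document length, so that long documents don't score higher just for being long: TF-IDF = (count / doc length) × log(N / df), where N is the total number of documents and df is the number of documents containing the word. Words unique to one document get the highest IDF. Words appearing in every document score 0.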
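The length-normalized formula can be applied to a whole corpus with only the standard library. The sketch below assumes lowercase whitespace tokenization and base-10 logs; the function name `tf_idf_scores` and the three-document corpus are made up for illustration.

```python
import math
from collections import Counter

def tf_idf_scores(corpus):
    """Return one {word: score} dict per document.

    Assumes whitespace tokenization, lowercased text, length-normalized TF,
    and base-10 IDF. All of these are illustrative choices.
    """
    docs = [text.lower().split() for text in corpus]
    n_docs = len(docs)
    # Document frequency: in how many documents does each word appear?
    df = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            word: (count / len(doc)) * math.log10(n_docs / df[word])
            for word, count in counts.items()
        })
    return scores

# Hypothetical toy corpus: "movie" appears in every document.
corpus = [
    "great movie amazing plot amazing cast",
    "boring movie weak plot",
    "movie with a slow start",
]
scores = tf_idf_scores(corpus)

print(scores[0]["movie"])                 # 0.0
print(max(scores[0], key=scores[0].get))  # amazing
```

Comparing `scores[0]` with the raw counts shows the effect described above: "movie" drops to zero, while "amazing", which is both repeated and unique to the first document, rises to the top.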
N-grams: Sneaking Word Order Back In
Bag-of-words throws away word order. "Terrible acting but great plot" and "Great acting but terrible plot" become the same vector.
N-grams patch this. Instead of counting single words (unigrams), count adjacent pairs (bigrams) or triples (trigrams):
- Unigrams: great, movie, amazing, plot
- Bigrams: great movie, movie amazing, amazing plot
- Trigrams: great movie amazing, movie amazing plot
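The lists above come from sliding a window of size n across the token sequence. A minimal sketch, assuming the text has already been tokenized; the helper name `ngrams` is illustrative.

```python
def ngrams(tokens, n):
    """Return every run of n adjacent tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["great", "movie", "amazing", "plot"]

print(ngrams(tokens, 1))  # ['great', 'movie', 'amazing', 'plot']
print(ngrams(tokens, 2))  # ['great movie', 'movie amazing', 'amazing plot']
print(ngrams(tokens, 3))  # ['great movie amazing', 'movie amazing plot']
```

A sequence of k tokens yields k - n + 1 n-grams, so asking for an n larger than the sequence returns an empty list.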
Bigrams help with negation handling: "not good" becomes a single bigram token that the model can learn to associate with negative sentiment. A unigram bag-of-words model counts "good" as positive, while a bigram model can recognize the construction.
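To see the difference concretely, the two reviews from the start of this section can be compared as unigram and bigram bags. This is a sketch under simple assumptions (lowercasing, whitespace tokenization); the helper name `bag` and the sentence "the plot was not good" are made up for illustration.

```python
from collections import Counter

def bag(text, n):
    """Count the n-grams in a lowercased, whitespace-split sentence."""
    tokens = text.lower().split()
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

a = "Terrible acting but great plot"
b = "Great acting but terrible plot"

print(bag(a, 1) == bag(b, 1))  # True: same words, same counts
print(bag(a, 2) == bag(b, 2))  # False: the adjacent pairs differ
print("not good" in bag("the plot was not good", 2))  # True
```

The cost is vocabulary size: every distinct adjacent pair becomes its own feature, so bigram and trigram vectors are much larger and sparser than unigram ones.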
Real World: Elasticsearch Search Relevance
TF-IDF still powers production search systems. Elasticsearch's default relevance scoring, BM25, is at its core TF-IDF with refinements: it dampens the effect of repeated terms and normalizes for document length. When you build a RAG system later in this unit, you'll see TF-IDF retrieval as a perfectly viable competitor to dense vector search in many domains.
In a corpus of 1,000 product reviews, the word 'product' appears in every single review. What would its IDF score be, and what does that mean for its usefulness as a feature?