Evaluation

Evaluating a recommender system is very challenging. The metric we care about most, user satisfaction, is expensive to measure, while the metrics we can measure cheaply may not correlate with satisfaction.

Offline vs. Online Evaluation

Offline evaluation uses historical interaction data. We hold out a test set of interactions that the model never sees during training, ask it to predict them, and measure prediction quality. It's cheap and fast: you can iterate quickly on model design. But it has important limitations:

  • It doesn't capture what the user would have done had they been shown a different set of items (counterfactual problem)
  • Historical data reflects past system recommendations, creating a feedback loop: a model that mimics the old system will score well offline even if it's no better
  • Offline metrics may not correlate with engagement or business outcomes
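
Despite these caveats, offline evaluation is where most iteration happens. Below is a minimal sketch of the holdout protocol described above, assuming an interaction log with user_id and timestamp columns (hypothetical field names, not a specific dataset):

```python
import pandas as pd

def leave_last_n_out(interactions: pd.DataFrame, n: int = 1):
    """Hold out each user's n most recent interactions as the test set."""
    interactions = interactions.sort_values(["user_id", "timestamp"])
    test = interactions.groupby("user_id").tail(n)   # most recent n rows per user
    train = interactions.drop(test.index)            # everything else is training data
    return train, test
```

The model is trained on the train split, asked to produce a top-K list per user, and scored against the held-out items using the metrics below.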

Online evaluation (A/B testing, multi-armed bandits) directly measures the impact of the recommender on user behavior by exposing different user groups to different models. It's the gold standard but requires significant traffic, careful experimental design to avoid biasing effects, and in some cases raises ethical questions about differential treatment of users.

Offline Metrics: Precision, Recall, F1

For top-K recommendation, we typically evaluate whether the recommended items appear in the user's actual held-out interactions.

  • Precision@K: Of the K items we recommended, what fraction were actually relevant?
  • Recall@K: Of all relevant items for this user, what fraction did we include in the top K?
  • F1@K: Harmonic mean of precision and recall at K
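
A minimal sketch of these three metrics for a single user, assuming binary relevance; the recommended list and relevant set are illustrative inputs, not a library API:

```python
def precision_recall_f1_at_k(recommended, relevant, k):
    """Compute Precision@K, Recall@K, and F1@K for one user."""
    top_k = recommended[:k]
    hits = len(set(top_k) & set(relevant))            # relevant items we actually recommended
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) > 0 else 0.0)
    return precision, recall, f1

# Example: 5 recommendations, 3 of which the user actually interacted with.
recommended = ["a", "b", "c", "d", "e"]
relevant = {"a", "c", "e"}
print(precision_recall_f1_at_k(recommended, relevant, k=5))  # (0.6, 1.0, 0.75)
```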

Mean Average Precision (mAP)

mAP provides a single-number summary of ranking quality across the recommendation list. For each user, it computes Average Precision (AP): precision is evaluated at each position where a relevant item appears, and those values are averaged; mAP is then the mean of AP across all users. Because positions near the top of the list yield higher precision values, mAP rewards recommenders that surface relevant items earlier in the list.
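
The sketch below computes AP for one user and mAP across users, assuming binary relevance; the function and variable names are illustrative. Note that this version divides by the total number of relevant items, whereas some implementations divide by min(K, number of relevant items) when the list is truncated at K.

```python
def average_precision(ranked, relevant, k=None):
    """Average of precision values at each rank where a relevant item appears."""
    ranked = ranked[:k] if k else ranked
    hits, precisions = 0, []
    for i, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precisions.append(hits / i)   # precision at this relevant position
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(ranked_lists, relevant_sets, k=None):
    """mAP: the mean of per-user AP scores."""
    aps = [average_precision(r, rel, k) for r, rel in zip(ranked_lists, relevant_sets)]
    return sum(aps) / len(aps)

# Ranking relevant items earlier yields a higher AP:
print(average_precision(["a", "x", "b"], {"a", "b"}))  # (1/1 + 2/3) / 2 ≈ 0.83
print(average_precision(["x", "a", "b"], {"a", "b"}))  # (1/2 + 2/3) / 2 ≈ 0.58
```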


Normalized Discounted Cumulative Gain (NDCG)

NDCG is the most commonly used evaluation metric in industrial recommendation systems. It addresses a key limitation of Precision/Recall: those metrics treat all relevant items as equally valuable, regardless of where they appear in the ranking. NDCG explicitly rewards placing the most relevant items at the top of the list.

Building up from first principles:

  1. Relevance scores: Assign a relevance score to each item (binary: 0/1, or graded: 0/1/2/3)
  2. Cumulative Gain (CG): Sum the relevance scores of items in the ranked list. Simple, but order-agnostic.
  3. Discounted CG (DCG): Apply a logarithmic discount to each item's relevance score based on its position, so items ranked lower contribute less. Formally: DCG@K = Σ_{i=1..K} rel_i / log₂(i+1), where rel_i is the relevance of the item at rank i
  4. Ideal DCG (IDCG): Compute DCG for the perfect ranking (most relevant items first)
  5. NDCG: Normalize DCG by IDCG: NDCG@K = DCG@K / IDCG@K. This produces a value between 0 and 1, where 1 is a perfect ranking.
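
Putting steps 2–5 together, here is a minimal NDCG@K sketch; the relevance lists passed in are illustrative graded scores, not a library API:

```python
import math

def dcg_at_k(rels, k):
    """DCG@K = sum of rel_i / log2(i + 1) over the top K positions."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(rels[:k], start=1))

def ndcg_at_k(rels, all_rels, k):
    """Normalize DCG by the DCG of the ideal (descending-relevance) ranking."""
    ideal = sorted(all_rels, reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(rels, k) / idcg if idcg > 0 else 0.0

# Graded relevance: the same items ranked well vs. poorly.
print(ndcg_at_k([3, 2, 1, 0], all_rels=[3, 2, 1], k=4))  # 1.0 (ideal order)
print(ndcg_at_k([0, 1, 2, 3], all_rels=[3, 2, 1], k=4))  # ≈ 0.61
```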

Advantages of NDCG: it handles graded relevance (not just binary), it is normalized so scores are comparable across tasks and list lengths, and it strongly rewards placing high-relevance items near the top.

Disadvantage: harder to interpret intuitively compared to precision/recall.

Checkpoint: Multiple Choice

Two recommenders produce the following top-5 lists for a user who has 3 relevant items. Recommender A: [✓, ✗, ✓, ✗, ✓]. Recommender B: [✗, ✓, ✗, ✓, ✓]. Both have the same Precision@5 = 3/5. Which would have higher NDCG@5?

Online Evaluation Methods

A/B Testing randomly splits users into a control group (existing system) and a treatment group (new system). Comparing engagement metrics (CTR, watch time, purchase rate) between groups gives a causal estimate of the new system's effect. Good experimental design requires sufficient sample sizes to detect small effects.
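
As a rough illustration of why sample size matters, here is a hedged sketch of a two-proportion z-test comparing CTR between control and treatment; the click and user counts are made up:

```python
from math import sqrt
from statistics import NormalDist

def ctr_z_test(clicks_a, users_a, clicks_b, users_b):
    """Two-sided z-test for a difference in click-through rate."""
    p_a, p_b = clicks_a / users_a, clicks_b / users_b
    p_pool = (clicks_a + clicks_b) / (users_a + users_b)       # pooled CTR under the null
    se = sqrt(p_pool * (1 - p_pool) * (1 / users_a + 1 / users_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Example: 2.0% vs 2.1% CTR with 100k users per arm is not yet significant (p ≈ 0.11),
# which is why detecting small effects requires large samples.
print(ctr_z_test(clicks_a=2000, users_a=100_000, clicks_b=2100, users_b=100_000))
```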

Multi-Armed Bandit (MAB) Testing takes a more dynamic approach. Rather than committing to a fixed traffic split upfront, MAB algorithms continuously shift traffic toward whichever variant is performing better, while still maintaining enough exploration to detect improvements. This reduces the cost of testing inferior models compared to fixed A/B tests.
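
One common MAB approach is Thompson sampling with a Beta prior over each variant's click-through rate. The sketch below is illustrative, not a description of any particular production system:

```python
import random

class ThompsonSampler:
    def __init__(self, n_variants):
        self.successes = [1] * n_variants   # Beta(1, 1) uniform priors
        self.failures = [1] * n_variants

    def choose(self):
        """Sample a CTR estimate per variant; serve the variant with the highest draw."""
        draws = [random.betavariate(s, f)
                 for s, f in zip(self.successes, self.failures)]
        return max(range(len(draws)), key=draws.__getitem__)

    def update(self, variant, clicked):
        """Record whether the served recommendation was clicked."""
        if clicked:
            self.successes[variant] += 1
        else:
            self.failures[variant] += 1

# Over time, traffic shifts toward the variant with the higher observed click rate,
# while occasional exploration keeps updating the estimate for the weaker variant.
```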

Checkpoint: Reflective Question

A team trains a new recommendation model that achieves significantly higher NDCG@10 in offline evaluation but shows no improvement in A/B test click-through rate. What are two plausible explanations for this discrepancy?