Data Splitting Best Practices

The standard data splitting advice is wrong for almost every real text dataset:

  • News articles? Time leakage. A model trained on articles from 2024 and tested on articles from 2023 has seen the future. Split by date.
  • Conversation data? Split by conversation thread, not by individual message, or your test set will contain replies to context the model already saw in training.
  • Multi-author corpora? Split by author. Otherwise your model learns writing style, not the task.
  • Similar or duplicate documents? Check for near-duplicates across splits — they contaminate evaluation.
  • Domain shift? Account for it explicitly. A model trained on news and tested on tweets is a different problem.

The principle underneath all of these: whatever distribution you'll see at deployment is the distribution your test set should reflect. If you'll deploy to new conversations from unseen users, your test set should be new conversations from unseen users.

The Most Common Leakage Pattern

Near-duplicate documents are more common than you'd think — especially in any dataset scraped from the web. A news article published by Reuters may appear in 50 downstream publications. If one copy ends up in training and another in test, your evaluation numbers are optimistic by an unknown amount. Always deduplicate before splitting.