Text Data Augmentation

Computer vision has a deep playbook for augmentation — rotate, flip, crop, color jitter. Text is harder because every transformation risks changing meaning. The most common approaches:

  • Back-translation: translate the text into a pivot language such as French, then translate it back to English. Often produces a paraphrase that preserves meaning, though translation errors can still shift it.
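  • Synonym replacement: swap words with synonyms. NLTK's WordNet synsets are a convenient source, but a word can belong to several synsets, so a synonym drawn from the wrong sense can change the meaning.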
  • Random insertion / deletion / swap / substitution: aggressive, useful for robustness, dangerous for meaning.
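
To make the random operations concrete, here is a minimal sketch of two of them, deletion and swap, using only the Python standard library. The function names, the whitespace tokenization, and the parameters `p` and `n_swaps` are assumptions made for illustration, not taken from any particular library.

```python
import random


def random_deletion(tokens, p, rng):
    """Drop each token independently with probability p, keeping at least one."""
    if not tokens:
        return []
    kept = [tok for tok in tokens if rng.random() >= p]
    # Never return an empty sentence: fall back to one randomly chosen token.
    return kept if kept else [rng.choice(tokens)]


def random_swap(tokens, n_swaps, rng):
    """Swap two randomly chosen positions, repeated n_swaps times."""
    out = list(tokens)  # copy so the caller's list is not modified
    if len(out) < 2:
        return out
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


rng = random.Random(0)  # fixed seed so runs are repeatable
tokens = "the service was slow but the food was great".split()
print(" ".join(random_deletion(tokens, p=0.2, rng=rng)))
print(" ".join(random_swap(tokens, n_swaps=1, rng=rng)))
```

Even this small example shows the risk: deleting "but" or swapping "slow" with "great" can change what the sentence says, which is why small values of `p` and `n_swaps` are the safer starting point.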

Recommendations for Text Augmentation

  1. Don't change the label. If you're doing sentiment analysis and your augmentation flips "good" to "bad," you've created poisoned training data.
  2. Augment equally across classes. Augmenting one class more than another shifts the class distribution the model sees and creates an artificial imbalance.
  3. Manually inspect samples. Always. Augmentation pipelines can silently corrupt data, and often the first sign is validation metrics that tank for no obvious reason.
  4. Combine methods for diversity, but layer them carefully: stacking too many aggressive operations quickly produces nonsense.
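
Recommendations 1 to 3 can be partly built into the pipeline itself. The sketch below is one possible arrangement, not a standard recipe: `augment_balanced`, `sample_for_review`, and the placeholder `swap_two_words` are hypothetical names. It gives every example the same number of augmented copies (so class proportions stay fixed), carries each label over unchanged, and pulls a random sample of the new rows for a human to read.

```python
import random
from collections import Counter


def augment_balanced(examples, augment_fn, copies_per_example, rng):
    """Add the same number of augmented copies for every example.

    examples: list of (text, label) pairs.
    augment_fn: hypothetical callable (text, rng) -> new text.
    """
    augmented = []
    for text, label in examples:
        for _ in range(copies_per_example):
            # The label is copied as-is, so augment_fn must preserve meaning.
            augmented.append((augment_fn(text, rng), label))
    return list(examples) + augmented


def sample_for_review(original, combined, k, rng):
    """Pick up to k augmented rows to read by hand."""
    new_rows = combined[len(original):]
    return rng.sample(new_rows, min(k, len(new_rows)))


def swap_two_words(text, rng):
    """Placeholder augmentation: swap two random word positions."""
    words = text.split()
    if len(words) < 2:
        return text
    i, j = rng.sample(range(len(words)), 2)
    words[i], words[j] = words[j], words[i]
    return " ".join(words)


rng = random.Random(0)
data = [
    ("great food", "pos"),
    ("loved it a lot", "pos"),
    ("slow and rude service", "neg"),
]
combined = augment_balanced(data, swap_two_words, copies_per_example=2, rng=rng)
print(Counter(label for _, label in data))      # class counts before
print(Counter(label for _, label in combined))  # same ratio after
for text, label in sample_for_review(data, combined, k=3, rng=rng):
    print(label, "|", text)
```

The code cannot check whether an augmented sentence still deserves its label; that is what the printed review sample is for.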