Text Data Augmentation
Computer vision has a deep playbook for augmentation — rotate, flip, crop, color jitter. Text is harder because every transformation risks changing meaning. The most common approaches:
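- Back-translation: translate the text into a pivot language (French, for example), then translate it back to English. The round trip often produces a paraphrase that preserves meaning.
- Synonym replacement: swap words with synonyms. NLTK's WordNet synsets are a common source, but a word's synsets cover all of its senses, so picking one without regard to context can pull in a synonym for the wrong sense.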
- Random insertion / deletion / swap / substitution: aggressive, useful for robustness, dangerous for meaning.
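The random operations in the last bullet are simple enough to sketch with the standard library alone. The following is a minimal illustration of random deletion and random swap, assuming whitespace tokenization; the function names, the deletion probability `p`, and the example sentence are hypothetical choices made for this sketch, not a reference implementation.

```python
import random

def random_deletion(tokens, p, rng):
    """Drop each token independently with probability p; never return an empty list."""
    if len(tokens) <= 1:
        return list(tokens)
    kept = [t for t in tokens if rng.random() >= p]
    # Fall back to one randomly chosen token so the example is never empty.
    return kept if kept else [rng.choice(tokens)]

def random_swap(tokens, n_swaps, rng):
    """Swap the tokens at two distinct random positions, n_swaps times."""
    out = list(tokens)
    if len(out) < 2:
        return out
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

rng = random.Random(0)  # seeded so a run can be reproduced
tokens = "the movie was not good at all".split()
print(" ".join(random_deletion(tokens, p=0.2, rng=rng)))
print(" ".join(random_swap(tokens, n_swaps=1, rng=rng)))
```

The example sentence is chosen on purpose: if random deletion happens to drop "not", the text reads as positive while the label still says negative, which is exactly the kind of meaning change these operations risk.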
Recommendations for Text Augmentation
- Don't change the label. If you're doing sentiment analysis and your augmentation flips "good" to "bad," you've created poisoned training data.
- Augment equally across classes. Asymmetric augmentation creates an artificial imbalance.
- Manually inspect samples. Always. Augmentation pipelines can silently corrupt data, and you may only notice when your validation metrics inexplicably tank.
- Combine methods for diversity, but layer them carefully: stacking too many aggressive operations quickly produces nonsense.
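One way to follow the first three recommendations together is to give every example the same number of augmented copies, carry the label over untouched, and keep each original next to its augmented version so a random sample can be read side by side. The sketch below assumes the dataset is a list of `(text, label)` pairs; `augment_dataset`, `sample_for_review`, and the toy `swap_first_two_words` augmenter are hypothetical names introduced for illustration only.

```python
import random
from collections import Counter

def augment_dataset(examples, augment_fn, copies_per_example):
    """Return (original, augmented, label) records.

    Every example gets the same number of copies, so class proportions
    stay as they were. augment_fn only sees the text, never the label.
    """
    records = []
    for text, label in examples:
        for _ in range(copies_per_example):
            records.append((text, augment_fn(text), label))
    return records

def sample_for_review(records, k, rng):
    """Pick up to k records at random for manual inspection."""
    return rng.sample(records, min(k, len(records)))

def swap_first_two_words(text):
    """Toy augmenter (placeholder for a real one): swap the first two words."""
    words = text.split()
    if len(words) >= 2:
        words[0], words[1] = words[1], words[0]
    return " ".join(words)

examples = [
    ("great acting and a sharp script", "pos"),
    ("loved every minute of it", "pos"),
    ("dull plot and flat characters", "neg"),
]
records = augment_dataset(examples, swap_first_two_words, copies_per_example=2)

# Label counts after augmentation keep the original 2:1 ratio.
print(Counter(label for _, _, label in records))

# Read originals and augmented versions side by side before training.
for original, augmented, label in sample_for_review(records, k=3, rng=random.Random(0)):
    print(f"[{label}] {original!r} -> {augmented!r}")
```

Keeping the original text in each record is a deliberate choice here: it makes the manual check a matter of reading pairs, rather than guessing what an augmented sentence used to say.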