Tokenization
Tokenization splits a string into smaller pieces called tokens. The simplest approach is to split on whitespace and punctuation:
"Which class is the best class at Duke? Deep Learning Applications."
becomes
['Which', 'class', 'is', 'the', 'best', 'class', 'at', 'Duke', '?', 'Deep', 'Learning', 'Applications', '.']
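The split above can be reproduced with a short regular expression. This is a minimal sketch, not the method of any particular library; the function name word_tokenize is made up for the illustration.

```python
import re

def word_tokenize(text):
    # \w+ matches a run of letters, digits, or underscores (one word);
    # [^\w\s] matches a single character that is neither a word character
    # nor whitespace, so each punctuation mark becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "Which class is the best class at Duke? Deep Learning Applications."
print(word_tokenize(sentence))
```

A rule this simple has rough edges: a contraction such as "don't" comes out as three tokens ('don', "'", 't'), which is one reason real word tokenizers carry extra rules.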
You can also tokenize by sentence (useful for long documents you want to summarize one sentence at a time), by subword (the modern default: 'tokenization' → ['token', 'ization']), or by character (rarely useful, but possible).
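Sentence-level and character-level splitting are both easy to sketch with the standard library. The helper names below are hypothetical, and the sentence splitter is deliberately naive: it assumes every '.', '!' or '?' followed by whitespace ends a sentence, so abbreviations such as "Dr. Smith" would be split incorrectly.

```python
import re

def sentence_tokenize(text):
    # Split at whitespace that directly follows '.', '!' or '?'.
    # The lookbehind keeps the punctuation attached to its sentence.
    return re.split(r"(?<=[.!?])\s+", text.strip())

def char_tokenize(text):
    # Every character, including spaces, becomes its own token.
    return list(text)

text = "Which class is the best class at Duke? Deep Learning Applications."
print(sentence_tokenize(text))
print(char_tokenize("Duke"))
```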
Word-level: Split on whitespace and punctuation. Simple and fast, but every out-of-vocabulary word is replaced by the same <UNK> token, so its identity is lost.
Subword: Split words into meaningful sub-pieces. Handles new words gracefully: 'tokenization' → ['token', '##ization'], where '##' is WordPiece's marker for a piece that continues a word. Used by BERT (WordPiece), GPT (byte-pair encoding), and most modern models.
Character-level: Every character is its own token. No unknown tokens ever, but sequences are very long and the model must learn to combine characters into meaning.
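The three strategies can be compared on the same input with a small sketch. Everything here is an assumption made for illustration: the two vocabularies are invented toy sets, and the subword splitter uses a greedy longest-match-first rule in the style of WordPiece rather than any real model's trained tokenizer.

```python
import re

# Hypothetical toy vocabularies, invented for this illustration only.
WORD_VOCAB = {"the", "model", "reads", "text"}
SUBWORD_VOCAB = {"the", "model", "reads", "text", "token", "##ization", "##s"}

def split_words(text):
    return re.findall(r"\w+|[^\w\s]", text.lower())

def word_level(text):
    # Any word missing from the vocabulary collapses to <UNK>.
    return [w if w in WORD_VOCAB else "<UNK>" for w in split_words(text)]

def subword_split(word, vocab):
    # Greedy longest-match-first: take the longest vocabulary piece that
    # starts at the current position, then continue from where it ended.
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["<UNK>"]  # no piece fits, so give up on the whole word
        pieces.append(piece)
        start = end
    return pieces

def subword_level(text):
    return [p for w in split_words(text) for p in subword_split(w, SUBWORD_VOCAB)]

def char_level(text):
    return list(text)

sentence = "The model reads tokenization"
print(word_level(sentence))
print(subword_level(sentence))
print(len(char_level(sentence)), "character tokens")
```

With these toy vocabularies the word-level output loses 'tokenization' to <UNK>, the subword output recovers it as two known pieces, and the character-level output is by far the longest sequence.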
A Real-World Rule That Trips People Up
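When you fine-tune a pretrained model like BERT or GPT, you must use that model's own tokenizer. BERT learned its embeddings against BERT's tokens: each row of the embedding table corresponds to one entry in BERT's vocabulary, so text split any other way maps to the wrong rows or to the unknown token.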
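A toy lookup makes the mismatch concrete. The vocabulary below is a hypothetical five-entry stand-in (a real BERT vocabulary has roughly 30,000 entries), and both token lists are written by hand as assumptions about what each tokenizer would produce.

```python
# Hypothetical miniature vocabulary: token -> row index in the embedding table.
VOCAB = {"[UNK]": 0, "token": 1, "##ization": 2, "is": 3, "fun": 4}

def to_ids(tokens):
    # Tokens the model never saw fall back to the [UNK] row.
    return [VOCAB.get(t, VOCAB["[UNK]"]) for t in tokens]

# Assumed output of the model's own subword tokenizer.
model_tokens = ["token", "##ization", "is", "fun"]
# Assumed output of a generic word-level tokenizer on the same text.
word_tokens = ["tokenization", "is", "fun"]

print(to_ids(model_tokens))  # every token has its own learned row
print(to_ids(word_tokens))   # 'tokenization' falls back to [UNK]
```

In practice, with the Hugging Face transformers library, the usual way to avoid this is to load the tokenizer from the same checkpoint name as the model, for example through AutoTokenizer.from_pretrained.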
You are fine-tuning BERT for a sentiment classification task. You decide to save time by tokenizing your text with spaCy's word tokenizer before passing it to BERT. What problem does this cause?