Tokenization

Tokenization splits a string into substrings. The default is to split on whitespace and punctuation:

"Which class is the best class at Duke? Deep Learning Applications."

becomes

['Which', 'class', 'is', 'the', 'best', 'class', 'at', 'Duke', '?', 'Deep', 'Learning', 'Applications', '.']

You can also tokenize by sentence (useful for long documents you want to summarize one sentence at a time), by subword (the modern default — tokenization → ['token', 'ization']), or by character (rarely useful, but possible).

Tokenizer Playground

Input sentence

Word-level8 tokens

Split on whitespace and punctuation. Simple and fast — but out-of-vocabulary words get a single <UNK> token.

Deeplearningistransformingnaturallanguageprocessing.

Subword (BPE-like)14 tokens

Split words into meaningful sub-pieces. Handles new words gracefully: 'tokenization' → ['token', '##ization']. Used by BERT, GPT, and most modern models.

Deeplearn##ingistrans####form####ingnatur##allang##uageprocess##ing.

Character-level58 tokens

Every character is its own token. No unknown tokens ever, but sequences are very long and the model must learn to combine characters into meaning.

Deep·learning·is·transforming·natural·language·processing.

Word-level

Subword (BPE-like)

Character-level

Type any sentence and compare word-level, subword, and character-level tokenization side by side.

⚠

A Real-World Rule That Trips People Up

When you fine-tune a pretrained model like BERT or GPT, you must use that model's specific tokenizer. BERT learned its embeddings against BERT's tokens.

Checkpoint

You are fine-tuning BERT for a sentiment classification task. You decide to save time by tokenizing your text with spaCy's word tokenizer before passing it to BERT. What problem does this cause?

←PreviousThe Preprocessing PipelineText Preprocessing Next→Stop WordsText Preprocessing