Word2Vec

In 2013, a team at Google released Word2Vec. The model is a shallow neural network: an input layer, one hidden layer, and an output layer. That's it. Yet the embeddings it learns capture semantic relationships between words remarkably well.

The architecture:

  • Input layer: one-hot encoded vector of vocabulary size (huge, sparse).
  • Hidden layer: dense linear projection of dimension equal to the desired embedding size (say, 300). No nonlinear activation.
  • Output layer: softmax over the entire vocabulary.
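The three layers listed above can be sketched as a single forward pass. This is a minimal illustration in plain Python, not a reference implementation: the five-word vocabulary, the embedding size of 3, the random initial weights, and the names `forward`, `w_in`, and `w_out` are all assumptions made for the example.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def forward(word_index, w_in, w_out):
    """One-hot input -> linear hidden layer -> softmax over the vocabulary."""
    vocab_size = len(w_in)
    dim = len(w_in[0])
    one_hot = [1.0 if i == word_index else 0.0 for i in range(vocab_size)]
    # Hidden layer: one_hot @ w_in, with no activation function.
    hidden = [sum(one_hot[i] * w_in[i][d] for i in range(vocab_size))
              for d in range(dim)]
    # Output layer: hidden @ w_out, followed by softmax.
    scores = [sum(hidden[d] * w_out[d][j] for d in range(dim))
              for j in range(vocab_size)]
    return hidden, softmax(scores)

random.seed(0)
vocab = ["the", "quick", "brown", "fox", "jumps"]  # toy vocabulary
dim = 3  # toy embedding size; the text suggests something like 300
w_in = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]    # V x D
w_out = [[random.uniform(-0.5, 0.5) for _ in vocab] for _ in range(dim)]   # D x V

hidden, probs = forward(vocab.index("brown"), w_in, w_out)
print(len(hidden), len(probs))  # hidden has D entries, probs has V entries
```

With untrained weights, `probs` is close to uniform; training is what shapes it toward the words that actually appear nearby.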

You train this on (target word, context word) pairs produced by sliding a window across your corpus. Once trained, you throw away the output layer and keep only the weights between the input and hidden layer. Those weights are your word embeddings.
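As a rough sketch of that training procedure, the loop below runs plain stochastic gradient descent with a full softmax over a toy vocabulary, then returns only the input-to-hidden weights. The function name `train_skipgram`, the hyperparameters, and the index-encoded pairs are assumptions for illustration; a full softmax like this is only practical at toy scale.

```python
import math
import random

def train_skipgram(pairs, vocab_size, dim=8, lr=0.1, epochs=50, seed=0):
    """Train on (target, context) index pairs; return (embeddings, losses)."""
    rng = random.Random(seed)
    w_in = [[rng.uniform(-0.5, 0.5) for _ in range(dim)]
            for _ in range(vocab_size)]                     # V x D
    w_out = [[rng.uniform(-0.5, 0.5) for _ in range(vocab_size)]
             for _ in range(dim)]                           # D x V
    losses = []
    for _ in range(epochs):
        total = 0.0
        for target, context in pairs:
            hidden = w_in[target][:]  # linear hidden layer: a row of w_in
            scores = [sum(hidden[d] * w_out[d][j] for d in range(dim))
                      for j in range(vocab_size)]
            m = max(scores)
            exps = [math.exp(s - m) for s in scores]
            z = sum(exps)
            probs = [e / z for e in exps]
            total -= math.log(probs[context])  # cross-entropy loss
            # Gradient of the loss w.r.t. the scores: probs - one_hot(context).
            grad_scores = probs[:]
            grad_scores[context] -= 1.0
            # Backpropagate into the hidden layer before touching w_out.
            grad_hidden = [sum(w_out[d][j] * grad_scores[j]
                               for j in range(vocab_size))
                           for d in range(dim)]
            for d in range(dim):
                for j in range(vocab_size):
                    w_out[d][j] -= lr * hidden[d] * grad_scores[j]
                w_in[target][d] -= lr * grad_hidden[d]
        losses.append(total / len(pairs))
    # The output weights are discarded; w_in holds the embeddings.
    return w_in, losses

# Toy pairs as vocabulary indices: target 2 with contexts 0, 1, 3, 4.
pairs = [(2, 0), (2, 1), (2, 3), (2, 4)]
embeddings, losses = train_skipgram(pairs, vocab_size=5)
print(losses[0] > losses[-1])  # average loss should fall during training
```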

Why does this work? Precisely because there's no nonlinearity in the path. The hidden layer is a linear projection, a learned matrix where the i-th row is the embedding of vocabulary word i. Multiplying a one-hot input vector by that matrix simply selects row i, so the hidden-layer activation for a word is exactly its embedding. The softmax at the output is what forces the model to learn useful projections (vectors that predict the surrounding words), but the embeddings themselves come straight from the weight matrix.
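To make the "i-th row is the embedding" point concrete, here is a small check that multiplying a one-hot vector by the weight matrix gives the same result as reading off a row. The 4 x 3 matrix and the helper name `matvec_one_hot` are made up for the example.

```python
def matvec_one_hot(one_hot, weights):
    """Compute one_hot @ weights for a V x D weight matrix."""
    dim = len(weights[0])
    return [sum(x * row[d] for x, row in zip(one_hot, weights))
            for d in range(dim)]

# Hypothetical 4-word vocabulary with 3-dimensional embeddings.
weights = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]
i = 2
one_hot = [1.0 if k == i else 0.0 for k in range(len(weights))]

projected = matvec_one_hot(one_hot, weights)  # full matrix product
looked_up = weights[i]                        # plain row lookup
print(projected == looked_up)  # True: the product just selects row i
```

This is why practical implementations usually skip the one-hot multiplication and index into the matrix directly.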

[Interactive figure: Word2Vec architecture, shown in Skip-gram mode (target → predict context) with window = 2. The target word "brown" is the input, it passes through the projection (embedding) layer, and the output layer predicts the context words "the", "quick", "fox", and "jumps". The figure also has a CBOW mode.]

Skip-gram vs. CBOW

Word2Vec comes in two flavors depending on which way you orient the prediction:

  • Skip-gram: given a center word, predict its context. Better for rare words and small datasets.
  • CBOW (Continuous Bag of Words): given the context, predict the center word. Faster to train, better for common words.

In CBOW, the embeddings of the context words get averaged (in a Keras-style implementation, via a Lambda layer) before going into the softmax. This averaging treats all context words as contributing equally, so order inside the window doesn't matter, which makes the model more robust to small variations in word arrangement.
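The averaging step can be sketched in a few lines. The embedding values below are invented, and `cbow_hidden` is a hypothetical helper name; the point is only that the hidden vector is the mean of the context embeddings and does not change when the context is reordered.

```python
def cbow_hidden(context_indices, w_in):
    """Average the embeddings of the context words (order is ignored)."""
    dim = len(w_in[0])
    n = len(context_indices)
    return [sum(w_in[i][d] for i in context_indices) / n for d in range(dim)]

# Made-up 2-dimensional embeddings, one row per vocabulary word.
w_in = [
    [1.0, 0.0],  # the
    [0.0, 1.0],  # quick
    [0.5, 0.5],  # brown
    [2.0, 2.0],  # fox
    [1.0, 3.0],  # jumps
]

hidden = cbow_hidden([0, 1, 3, 4], w_in)    # context of "brown"
shuffled = cbow_hidden([4, 0, 3, 1], w_in)  # same words, different order
print(hidden)              # [1.0, 1.5]
print(hidden == shuffled)  # True
```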

Creating a Dataset

To generate training data for Word2Vec, slide a context window of size N across your corpus. For each position, the center word is the target and every other word within N positions on either side is a context word.

Each (target, context) pair becomes one training example:

  • Window size N = 2, sentence: "the quick brown fox jumps"
  • Target: brown → context pairs: (brown, the), (brown, quick), (brown, fox), (brown, jumps)

Do this for every word in the corpus and you have your dataset — millions of (target word, context word) pairs ready for training.
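The windowing procedure described above can be sketched as follows. The function name `skipgram_pairs` and the whitespace tokenization are assumptions for the example; a real pipeline would typically also lowercase, handle punctuation, and map words to integer indices.

```python
def skipgram_pairs(tokens, window):
    """Return (target, context) pairs for every position in tokens."""
    pairs = []
    for i, target in enumerate(tokens):
        # Clip the window at the start and end of the sequence.
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # the target is not its own context
                pairs.append((target, tokens[j]))
    return pairs

tokens = "the quick brown fox jumps".split()
pairs = skipgram_pairs(tokens, window=2)

brown_pairs = [p for p in pairs if p[0] == "brown"]
print(brown_pairs)
# [('brown', 'the'), ('brown', 'quick'), ('brown', 'fox'), ('brown', 'jumps')]
print(len(pairs))  # 14
```

Words near the edges of the sentence get fewer pairs because the window is clipped there, which is why five words yield 14 pairs rather than 20.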

[Interactive figure: Word2Vec Visualizer, a 2D projection of a pretrained Word2Vec embedding space. Words shown: king, prince, lord, duke, man, boy, jump, fly, swim. Legend categories: royalty, people, countries/cities, animals, food, technology.]

Nearest neighbors to king (similarity = cosine similarity in the 2D projection): lord 1.000, man 1.000, jump 1.000, swim 0.998, prince 0.997, duke 0.997, boy 0.996, fly 0.990.
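A nearest-neighbor lookup of this kind comes down to ranking words by cosine similarity. The sketch below uses a handful of invented 3-dimensional vectors, not values from any pretrained model, and the helper names `cosine` and `nearest` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(word, embeddings, k=3):
    """Return the k words most similar to `word`, best first."""
    query = embeddings[word]
    scored = [(other, cosine(query, vec))
              for other, vec in embeddings.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Invented toy vectors, for illustration only.
embeddings = {
    "king":   [0.9, 0.8, 0.1],
    "prince": [0.8, 0.9, 0.2],
    "man":    [0.6, 0.3, 0.1],
    "swim":   [0.1, 0.2, 0.9],
}

neighbors = nearest("king", embeddings, k=2)
print([word for word, _ in neighbors])  # ['prince', 'man']
```

Cosine similarity compares direction only, so vectors of very different lengths can still score close to 1.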