Why We Need a Better Representation

Bag of words on a real corpus is brutally inefficient. The standard English dictionary has roughly 470,000 entries. If every dimension of your vector corresponds to a word, you're working in 470,000-dimensional space, and almost every dimension is zero almost all the time. Sparse, huge, and semantically uninformative. The dimensions for cat and dog are no more related than the dimensions for cat and pneumonia.

What we want instead is to take that word bank and map it into a much smaller space, say 300 or 512 dimensions, where the dimensions themselves capture something about meaning. Where cat and dog are close together. Where bank (financial) is near money and loan, and bank (river) is near shore and stream.

This is what neural networks are great at. We can train a network to produce a dense, low-dimensional encoding for each word that captures the word's semantic role.

ℹ

Sparse vs. Dense Representations

Sparse representation: a vector of vocabulary size where almost all values are zero. It's easy to construct, but hard to use because the dimensions don't carry meaning.

Dense representation: a short vector (typically 50–1024 dimensions) where almost all values are non-zero. Much harder to construct, but dimensions can capture semantic relationships.

←PreviousWhen to Still Use Traditional ML for NLPTraditional Approaches to NLP Next→Word2VecWord Embeddings