Bag of Words

If you were to tell a 10 year old to come up with a way to encode words into numbers, they would come up with bag of words. No shade at bag of words, it's a perfectly valid way to represent text. Here's how it works: take every unique word in your corpus, build a vocabulary, and represent each document as a vector counting how many times each vocabulary word appears. It is essentially one-hot-encoding each word in your vocabulary

Suppose your corpus is six movie reviews — three positive, three negative. Your vocabulary becomes something like:

[acting, amazing, bad, best, ever, fantastic, film, great, horrible, it, loved, made, movie, plot, terrible, time, waste, worst, year]

And the review "Great movie, amazing, amazing plot" becomes:

[0, 2, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0]

amazing appears twice. great, movie, plot once each. Everything else is zero.

Stack these vectors, attach binary labels (1 for positive sentiment, 0 for negative sentiment), and now you can train a Naive Bayes classifier or a logistic regression on this.

Bag of Words Builder

Corpus — one document per line

Vocabulary size: 10

2 (very sparse)30 (fuller vocab)

Vocabulary (10 terms, sorted by frequency)

BoW Matrix

Doc	great	movie	plot	acting	bad	film	amazing	terrible	loved	best
1	1	1	1	0	0	0	1	0	0	0
2	0	0	0	1	1	1	0	1	0	0
3	0	0	0	0	0	0	0	0	1	1
4	0	1	0	1	0	0	0	0	0	0
5	1	0	0	0	0	1	0	0	0	0
6	0	0	1	0	1	0	0	0	0	0

Enter a small corpus and watch it get converted to bag-of-words vectors.

←PreviousStemming and LemmatizationText Preprocessing Next→TF-IDF and N-gramsTraditional Approaches to NLP

Doc	great	movie	plot	acting	bad	film	amazing	terrible	loved	best
1	1	1	1	0	0	0	1	0	0	0
2	0	0	0	1	1	1	0	1	0	0
3	0	0	0	0	0	0	0	0	1	1
4	0	1	0	1	0	0	0	0	0	0
5	1	0	0	0	0	1	0	0	0	0
6	0	0	1	0	1	0	0	0	0	0

Doc	great	movie	plot	acting	bad	film	amazing	terrible	loved	best
1	1	1	1	0	0	0	1	0	0	0
2	0	0	0	1	1	1	0	1	0	0
3	0	0	0	0	0	0	0	0	1	1
4	0	1	0	1	0	0	0	0	0	0
5	1	0	0	0	0	1	0	0	0	0
6	0	0	1	0	1	0	0	0	0	0

Doc	great	movie	plot	acting	bad	film	amazing	terrible	loved	best
1	1	1	1	0	0	0	1	0	0	0
2	0	0	0	1	1	1	0	1	0	0
3	0	0	0	0	0	0	0	0	1	1
4	0	1	0	1	0	0	0	0	0	0
5	1	0	0	0	0	1	0	0	0	0
6	0	0	1	0	1	0	0	0	0	0