Bag of Words

If you asked a 10-year-old to come up with a way to encode words as numbers, they would come up with bag of words. No shade at bag of words; it's a perfectly valid way to represent text. Here's how it works: take every unique word in your corpus, build a vocabulary, and represent each document as a vector counting how many times each vocabulary word appears. It is essentially one-hot encoding every word in the document and summing the results, so word order is thrown away and only the counts remain.

Suppose your corpus is six movie reviews — three positive, three negative. Your vocabulary becomes something like:

[acting, amazing, bad, best, ever, fantastic, film, great, horrible, it, loved, made, movie, plot, terrible, time, waste, worst, year]

And the review "Great movie, amazing, amazing plot" becomes:

[0, 2, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0]

Reading the vector in vocabulary order: "amazing" appears twice; "great", "movie", and "plot" appear once each. Everything else is zero.
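Here is a minimal Python sketch of that counting step, reusing the vocabulary listed above. The `tokenize` and `bow_vector` helpers are names made up for this illustration, and the tokenizer (lowercase, keep runs of letters) is an assumption, not a claim about how any particular library does it.

```python
import re
from collections import Counter

# Vocabulary copied from the example above (alphabetical order).
VOCAB = ["acting", "amazing", "bad", "best", "ever", "fantastic", "film",
         "great", "horrible", "it", "loved", "made", "movie", "plot",
         "terrible", "time", "waste", "worst", "year"]

def tokenize(text):
    """Lowercase the text and keep runs of letters (an assumed, simplistic tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())

def bow_vector(text, vocab):
    """Count how often each vocabulary word occurs; out-of-vocabulary words are ignored."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocab]

vector = bow_vector("Great movie, amazing, amazing plot", VOCAB)
print(vector)
# [0, 2, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0]
```

Shuffling the words in the review gives the same vector, which is exactly the "bag" part: the representation keeps counts and forgets order.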

Stack these vectors, attach binary labels (1 for positive sentiment, 0 for negative sentiment), and now you can train a Naive Bayes classifier or a logistic regression on this.
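To make that concrete, here is a rough end-to-end sketch in plain Python: build the vocabulary, stack the vectors, and fit a small multinomial Naive Bayes with Laplace smoothing. The six reviews are made up for this illustration (so the vocabulary differs slightly from the list above), and `train_nb` and `predict` are hypothetical helper names. In practice you would more likely reach for a library implementation.

```python
import math
import re

# Hypothetical six-review corpus, made up for this sketch: 1 = positive, 0 = negative.
REVIEWS = [
    ("Great movie, amazing plot", 1),
    ("Loved it, fantastic acting", 1),
    ("Best film ever made", 1),
    ("Terrible movie, bad acting", 0),
    ("Horrible plot, waste of time", 0),
    ("Worst film of the year", 0),
]

def tokenize(text):
    """Lowercase the text and keep runs of letters (an assumed, simplistic tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())

# Vocabulary: every unique word in the corpus, sorted alphabetically.
vocab = sorted({tok for text, _ in REVIEWS for tok in tokenize(text)})
index = {word: i for i, word in enumerate(vocab)}

def bow_vector(text):
    """Turn a text into a count vector over the vocabulary."""
    vec = [0] * len(vocab)
    for tok in tokenize(text):
        if tok in index:  # ignore out-of-vocabulary words
            vec[index[tok]] += 1
    return vec

# Stack the vectors into a matrix X and attach the binary labels y.
X = [bow_vector(text) for text, _ in REVIEWS]
y = [label for _, label in REVIEWS]

def train_nb(X, y, alpha=1.0):
    """Multinomial Naive Bayes with add-alpha (Laplace) smoothing."""
    model = {}
    for c in set(y):
        rows = [x for x, label in zip(X, y) if label == c]
        totals = [sum(col) for col in zip(*rows)]  # per-word counts in class c
        denom = sum(totals) + alpha * len(totals)
        log_prior = math.log(len(rows) / len(y))
        log_likelihood = [math.log((t + alpha) / denom) for t in totals]
        model[c] = (log_prior, log_likelihood)
    return model

def predict(model, x):
    """Return the class with the highest log posterior for count vector x."""
    scores = {
        c: log_prior + sum(n * lw for n, lw in zip(x, log_likelihood))
        for c, (log_prior, log_likelihood) in model.items()
    }
    return max(scores, key=scores.get)

model = train_nb(X, y)
print(predict(model, bow_vector("amazing acting, loved it")))  # 1
print(predict(model, bow_vector("bad plot, terrible film")))   # 0
```

The smoothing term `alpha` matters even in a toy like this: without it, any word that never appeared in one class would get probability zero there and the log would blow up.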

Bag of Words Builder: example BoW matrix (vocabulary of 10 terms, sorted by frequency)

Doc  great  movie  plot  acting  bad  film  amazing  terrible  loved  best
1    1      1      1     0       0    0     1        0         0      0
2    0      0      0     1       1    1     0        1         0      0
3    0      0      0     0       0    0     0        0         1      1
4    0      1      0     1       0    0     0        0         0      0
5    1      0      0     0       0    1     0        0         0      0
6    0      0      1     0       1    0     0        0         0      0