Bag of Words
If you asked a 10-year-old to come up with a way to encode words as numbers, they would come up with bag of words. No shade at bag of words; it's a perfectly valid way to represent text. Here's how it works: take every unique word in your corpus, build a vocabulary, and represent each document as a vector counting how many times each vocabulary word appears. It is essentially one-hot encoding every word in a document and then summing those vectors.
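One way to see the one-hot connection: a document's bag-of-words vector is just the sum of the one-hot vectors of its words. Here is a minimal sketch in plain Python. The two-document corpus and the helper names (`one_hot`, `bow`) are made up for illustration, and splitting on whitespace is a simplifying assumption.

```python
# Hypothetical two-document corpus, only for illustration.
corpus = ["good movie", "bad movie bad plot"]

# Vocabulary: every unique word, sorted so the column order is stable.
vocab = sorted({word for doc in corpus for word in doc.split()})
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Vector with a single 1 at the word's vocabulary position."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

def bow(doc):
    """Sum the one-hot vectors of every word in the document."""
    total = [0] * len(vocab)
    for word in doc.split():
        total = [t + o for t, o in zip(total, one_hot(word))]
    return total

vectors = [bow(doc) for doc in corpus]
print(vocab)    # ['bad', 'good', 'movie', 'plot']
print(vectors)  # [[0, 1, 1, 0], [2, 0, 1, 1]]
```

Nobody builds the one-hot vectors explicitly in practice; counting words directly gives the same result with far less work.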
Suppose your corpus is six movie reviews: three positive, three negative. Your vocabulary becomes something like:
[acting, amazing, bad, best, ever, fantastic, film, great, horrible, it, loved, made, movie, plot, terrible, time, waste, worst, year]
And the review "Great movie, amazing, amazing plot" becomes:
[0, 2, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0]
"amazing" appears twice; "great", "movie", and "plot" appear once each. Everything else is zero.
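To reproduce that vector yourself, here is a small sketch that counts words against the vocabulary listed above. The tokenizer (lowercase, keep only runs of letters) is an assumption, since the example doesn't say how punctuation and casing are handled, and the function names are hypothetical.

```python
import re
from collections import Counter

# Vocabulary copied from the example above (alphabetical order).
VOCAB = ["acting", "amazing", "bad", "best", "ever", "fantastic", "film",
         "great", "horrible", "it", "loved", "made", "movie", "plot",
         "terrible", "time", "waste", "worst", "year"]

def tokenize(text):
    """Assumed tokenizer: lowercase, then keep runs of letters only."""
    return re.findall(r"[a-z]+", text.lower())

def bow_vector(text, vocab=VOCAB):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    # Counter returns 0 for words that never occur in the text.
    return [counts[word] for word in vocab]

vector = bow_vector("Great movie, amazing, amazing plot")
print(vector)
# [0, 2, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0]
```

Words that are not in the vocabulary are silently dropped, which is the usual behavior when you vectorize text the model has never seen.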
Stack these vectors, attach binary labels (1 for positive sentiment, 0 for negative sentiment), and now you can train a Naive Bayes classifier or a logistic regression on this.
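As a rough sketch of that last step, here is a tiny multinomial Naive Bayes with Laplace smoothing in plain Python. The six reviews are invented for illustration (they are not the actual corpus behind the vocabulary above, so the vocabulary differs slightly), and all function names are hypothetical. In real code you would more likely reach for a library such as scikit-learn.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus: 1 = positive sentiment, 0 = negative sentiment.
REVIEWS = [
    ("Great movie, amazing, amazing plot", 1),
    ("Loved it, fantastic acting", 1),
    ("Best film ever made", 1),
    ("Terrible movie, waste of time", 0),
    ("Worst film of the year", 0),
    ("Horrible acting, bad plot", 0),
]

def tokenize(text):
    """Assumed tokenizer: lowercase, then keep runs of letters only."""
    return re.findall(r"[a-z]+", text.lower())

vocab = sorted({w for text, _ in REVIEWS for w in tokenize(text)})

def bow_vector(text):
    counts = Counter(tokenize(text))
    return [counts[w] for w in vocab]

# Stack the vectors and attach the labels.
X = [bow_vector(text) for text, _ in REVIEWS]
y = [label for _, label in REVIEWS]

def train_naive_bayes(X, y, alpha=1.0):
    """Fit per-class log priors and smoothed per-word log likelihoods."""
    model = {}
    for label in set(y):
        rows = [row for row, lab in zip(X, y) if lab == label]
        word_totals = [sum(col) for col in zip(*rows)]
        total = sum(word_totals)
        model[label] = {
            "log_prior": math.log(len(rows) / len(y)),
            "log_likelihood": [
                math.log((count + alpha) / (total + alpha * len(word_totals)))
                for count in word_totals
            ],
        }
    return model

def predict(model, vector):
    """Return the label with the highest log posterior score."""
    scores = {
        label: params["log_prior"]
        + sum(c * ll for c, ll in zip(vector, params["log_likelihood"]))
        for label, params in model.items()
    }
    return max(scores, key=scores.get)

model = train_naive_bayes(X, y)
pred_pos = predict(model, bow_vector("amazing acting, great plot"))
pred_neg = predict(model, bow_vector("terrible waste of time"))
print(pred_pos, pred_neg)  # 1 0
```

With six training documents this proves nothing about accuracy; it only shows that count vectors plus labels are all a classifier needs as input.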
BoW matrix (vocabulary of 10 terms, sorted by frequency):

| Doc | great | movie | plot | acting | bad | film | amazing | terrible | loved | best |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 4 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |