Why Text Is Harder Than It Looks
In computer vision, every pixel of a color image converts cleanly into three numbers: red, green, and blue. Language is harder to convert into numbers.
Consider the word bank.
- "bank of the river": the sloping land alongside a river.
- "deposited money at the bank": a financial institution.
Same string of four letters, completely different meaning. Ideally we want to represent these as different numbers, because they really are different things. This is the homonym problem, and it's everywhere in language.
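To make the problem concrete, here is a minimal sketch of the simplest possible encoding: a lookup table that assigns one integer to each distinct word. The helpers `build_vocab` and `encode` are hypothetical names introduced for illustration, not functions from any library. Because the table is keyed on the string alone, both senses of bank end up with the same number.

```python
def build_vocab(sentences):
    """Assign each distinct lowercase word an integer ID, in order of first appearance."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(sentence, vocab):
    """Replace each word with its ID from the lookup table."""
    return [vocab[word] for word in sentence.lower().split()]


sentences = ["bank of the river", "deposited money at the bank"]
vocab = build_vocab(sentences)

river_ids = encode(sentences[0], vocab)
money_ids = encode(sentences[1], vocab)

# "bank" is the first word of one sentence and the last word of the other.
river_bank_id = river_ids[0]
money_bank_id = money_ids[-1]

print(river_bank_id == money_bank_id)  # True: the encoding cannot tell the senses apart
```

Any scheme that maps a string to a fixed number, regardless of its neighbors, has this limitation; telling the two senses apart requires looking at the surrounding words.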
Now flip it. "Sneakers," "running shoes," and "tennis shoes" all refer to the same physical object in everyday speech. This is the synonym problem: how do we encode them in a way that captures that they mean the same thing, without manually building a thesaurus?
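One way to see the synonym problem is with one-hot vectors, where each term gets its own dimension. The sketch below is an illustration under simple assumptions: each phrase is treated as a single vocabulary entry, and `one_hot` and `dot` are hypothetical helper names. Under this encoding, "sneakers" is exactly as dissimilar to "running shoes" as it is to "bank".

```python
def one_hot(index, size):
    """Return a vector of zeros with a single 1.0 at the given position."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def dot(a, b):
    """Dot product, used here as a simple similarity score."""
    return sum(x * y for x, y in zip(a, b))


# Assumption: each phrase is one entry in a tiny illustrative vocabulary.
terms = ["sneakers", "running shoes", "tennis shoes", "bank"]
vectors = {term: one_hot(i, len(terms)) for i, term in enumerate(terms)}

sim_synonyms = dot(vectors["sneakers"], vectors["running shoes"])
sim_unrelated = dot(vectors["sneakers"], vectors["bank"])

print(sim_synonyms, sim_unrelated)  # 0.0 0.0
```

Every pair of distinct one-hot vectors is orthogonal, so the encoding carries no information about which terms are related. A useful representation would place the three shoe terms close together and "bank" far away.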
And then there's the fact that observations are not independent. "The dog ate the bone. It tasted good." What is "it"? You can only answer because you read the previous sentence. History matters.
So the goal of this entire unit is to take the word bank in a specific context and turn it into a vector of numbers that actually captures what it means there, in that sentence. And as you'll see, this doesn't get solved in one shot. The field has chipped away at it for decades.
A sentiment classifier flags the tweet "That concert was lowkey fire 🔥" as neutral or negative. Which NLP challenge best explains this failure?
The Eight Challenges of Representing Text
- Homonyms: "bank" means different things in different contexts.
- Synonyms: "sneakers" and "running shoes" mean the same thing.
- Dependence on history: meaning depends on prior sentences or context.
- Semantic ambiguity: "I saw the boy on the beach with my binoculars." Did you use the binoculars to see him, or is he the one holding them?
- Slang and colloquialisms: "That snowboarder did a sick jump" doesn't mean the snowboarder is unwell.
- Acronyms: even the name of the degree program you might be in, Master of Engineering in AI, contains one.
- Variable length: a tweet might be six words; a contract might be sixty thousand. Both are valid inputs.
- Sarcasm and humor: good luck.
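Of the challenges above, variable length is the most mechanical, and one common workaround is easy to sketch: pad every sequence in a batch to the length of the longest one and keep a mask recording which positions are real. The code below is a minimal illustration, not a prescribed method; `PAD_ID`, `pad_batch`, and the token IDs are all assumptions made up for the example.

```python
PAD_ID = 0  # assumption: ID 0 is reserved for padding and never used for a real word


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad each sequence to the longest length in the batch.

    Returns the padded sequences and a mask with 1 for real tokens, 0 for padding.
    """
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, mask


# Two already-encoded inputs of different lengths (IDs are made up).
batch = [[5, 9, 2], [7, 3, 8, 4, 6, 1]]
padded, mask = pad_batch(batch)

print(padded)  # [[5, 9, 2, 0, 0, 0], [7, 3, 8, 4, 6, 1]]
print(mask)    # [[1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 1, 1]]
```

Padding makes the batch rectangular, but it does not scale gracefully: padding a tweet to the length of a contract would waste almost all of the space, which is part of why variable length counts as a real challenge rather than a formatting detail.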