NLP Applications: Text Similarity
How similar are two documents? Two flavors of the question:
- Lexical similarity: how much vocabulary do they share?
- Semantic similarity: how close are they in meaning, regardless of whether they share words?
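One simple way to put a number on lexical similarity is Jaccard similarity over the two documents' word sets: shared words divided by total distinct words. The sketch below is illustrative only. The regex tokenizer and the function names `tokenize` and `jaccard_similarity` are assumptions made for this example, and Jaccard is just one of several lexical measures.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of distinct word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard_similarity(doc_a: str, doc_b: str) -> float:
    """Fraction of the combined vocabulary that appears in both documents."""
    vocab_a, vocab_b = tokenize(doc_a), tokenize(doc_b)
    if not vocab_a and not vocab_b:
        return 1.0  # convention chosen here: two empty documents count as identical
    return len(vocab_a & vocab_b) / len(vocab_a | vocab_b)


# Shares 3 of 7 distinct words.
print(jaccard_similarity("The cat sat on the mat", "The cat lay on the rug"))

# Same intent, almost no shared vocabulary: lexical similarity stays low.
print(jaccard_similarity("How do I reset my password?", "I forgot my login credentials"))
```

The second pair shows the limit of the lexical view: the two questions mean nearly the same thing but overlap only on "I" and "my".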
Modern semantic similarity is built on embeddings. Encode both documents with a pretrained model (averaged Word2Vec word vectors if you need something cheap, a transformer-based encoder for serious work), then compute the cosine similarity of the resulting vectors. Cosine similarity ranges from -1 to 1: the closer it is to 1, the more similar the documents.
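The encode-then-compare recipe can be sketched in a few lines. To stay self-contained, this example does not load a real model: `TOY_VECTORS` is a hand-made, hypothetical 3-dimensional vocabulary standing in for pretrained word vectors, and `embed` mean-pools them into a document vector. In practice the vectors would come from a pretrained model.

```python
import math

# Hypothetical word vectors, invented for illustration (not from any real model).
TOY_VECTORS = {
    "reset":    [0.90, 0.10, 0.00],
    "password": [0.80, 0.30, 0.10],
    "forgot":   [0.85, 0.20, 0.05],
    "login":    [0.70, 0.40, 0.10],
    "pizza":    [0.00, 0.10, 0.90],
    "recipe":   [0.10, 0.00, 0.80],
}


def embed(text):
    """Mean-pool the vectors of known words into one document vector."""
    vectors = [TOY_VECTORS[w] for w in text.lower().split() if w in TOY_VECTORS]
    if not vectors:
        raise ValueError("no known words in text")
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (assumes neither is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


query = embed("reset password")
paraphrase = embed("forgot login")   # no words in common with the query
unrelated = embed("pizza recipe")

print(cosine_similarity(query, paraphrase))
print(cosine_similarity(query, unrelated))
```

With these toy vectors the paraphrase scores far higher than the unrelated text even though it shares no words with the query, which is the behavior semantic similarity is meant to capture.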
Real World: Plagiarism Detection, Deduplication, RAG
- Plagiarism detection (Grammarly, Turnitin): compare student submissions against each other and the web.
- Duplicate question detection (Stack Overflow, customer support): merge or deduplicate questions that ask the same thing differently.
- Search relevance: nearly every search bar you've ever used ranks results with some form of similarity computation between the query and the documents.
- Retrieval in RAG systems: find the most relevant chunks of your knowledge base for a given query.
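The retrieval step in the last bullet comes down to ranking stored chunks by similarity to the query and keeping the top few. Here is a minimal sketch, assuming the chunks have already been embedded. The chunk IDs, the 3-dimensional vectors, and the `top_k` helper are hypothetical, and real systems usually use an approximate nearest-neighbor index rather than scoring every chunk.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (assumes neither is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def top_k(query_vec, chunks, k=2):
    """Return the k (score, chunk_id) pairs most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), chunk_id)
              for chunk_id, vec in chunks.items()]
    return sorted(scored, reverse=True)[:k]


# Hypothetical pre-computed chunk embeddings for a tiny knowledge base.
CHUNKS = {
    "password-reset-guide": [0.9, 0.2, 0.0],
    "billing-faq":          [0.2, 0.9, 0.1],
    "office-lunch-menu":    [0.0, 0.1, 0.9],
}

query_vec = [0.8, 0.3, 0.1]  # stand-in for the embedded user query
results = top_k(query_vec, CHUNKS, k=2)
for score, chunk_id in results:
    print(f"{chunk_id}: {score:.3f}")
```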