NLP Applications: Text Similarity

How similar are two documents? Two flavors of the question:

  • Lexical similarity: how much vocabulary do they share?
  • Semantic similarity: how close are they in meaning, regardless of whether they share words?
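
To make the lexical flavor concrete, here is a minimal sketch using Jaccard similarity over word sets. Jaccard is one common lexical measure, chosen here for illustration (the notes do not prescribe a specific one), and the regex tokenizer and sample sentences are assumptions.

```python
import re


def tokenize(text: str) -> set:
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard_similarity(doc_a: str, doc_b: str) -> float:
    """Shared vocabulary divided by total vocabulary (0.0 to 1.0)."""
    tokens_a, tokens_b = tokenize(doc_a), tokenize(doc_b)
    if not tokens_a and not tokens_b:
        return 1.0  # two empty documents are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Illustrative sentences (not from the original notes).
a = "The cat sat on the mat"
b = "The cat lay on the rug"
c = "A feline rested on a carpet"

print(jaccard_similarity(a, b))  # shares "the", "cat", "on"
print(jaccard_similarity(a, c))  # shares only "on"
```

Sentences a and c mean nearly the same thing but share almost no vocabulary, so a lexical score rates them as dissimilar. That gap is what semantic similarity is meant to close.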

Modern semantic similarity is built on embeddings. Encode both documents with a pretrained model (averaged Word2Vec word vectors for cheap, a transformer encoder for serious), then compute the cosine similarity of the resulting vectors. The closer the score is to 1, the more similar the documents.
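
A rough sketch of the embed-then-compare recipe, assuming tiny hand-made 3-dimensional word vectors in place of a real pretrained model. TOY_VECTORS and embed are hypothetical stand-ins; in practice the vectors would come from Word2Vec or a transformer encoder.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for zero vectors, which have no direction
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Average a list of vectors component by component."""
    return [sum(component) / len(vectors) for component in zip(*vectors)]


# Hypothetical toy word vectors, made up for illustration only.
TOY_VECTORS = {
    "cat": [0.9, 0.1, 0.0],
    "feline": [0.8, 0.2, 0.1],
    "sat": [0.1, 0.9, 0.0],
    "rested": [0.2, 0.8, 0.1],
    "stock": [0.0, 0.1, 0.9],
    "fell": [0.1, 0.2, 0.8],
}


def embed(tokens):
    """Document embedding as the mean of its word vectors."""
    return mean_vector([TOY_VECTORS[t] for t in tokens])


doc_a = embed(["cat", "sat"])
doc_b = embed(["feline", "rested"])  # same meaning as doc_a, no shared words
doc_c = embed(["stock", "fell"])     # unrelated topic

print(cosine_similarity(doc_a, doc_b))
print(cosine_similarity(doc_a, doc_c))
```

With these toy vectors the first score comes out much higher than the second, even though doc_a and doc_b share no words. Averaging word vectors ignores word order, which is one reason a transformer encoder is the "serious" option.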

Real World: Plagiarism Detection, Deduplication, RAG

  • Plagiarism detection (Grammarly, Turnitin): compare student submissions against each other and the web.
  • Duplicate question detection (Stack Overflow, customer support): merge or deduplicate questions that ask the same thing differently.
  • Search relevance: nearly every search bar you've ever used ranks results by some form of query-document similarity.
  • Retrieval in RAG systems: find the most relevant chunks of your knowledge base for a given query.
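
The retrieval case from the list above can be sketched as a top-k search over chunk embeddings. This is a minimal illustration, not a production retriever: the chunk IDs, the 3-dimensional vectors, and the top_k helper are all hypothetical, and a real system would get its vectors from an embedding model and usually store them in a vector index.

```python
import heapq
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k(query_vec, chunk_vecs, k=2):
    """Return the k (score, chunk_id) pairs most similar to the query."""
    scored = (
        (cosine_similarity(query_vec, vec), chunk_id)
        for chunk_id, vec in chunk_vecs.items()
    )
    return heapq.nlargest(k, scored)


# Hypothetical precomputed chunk embeddings for a small knowledge base.
CHUNKS = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.1],
    "returns-howto": [0.8, 0.3, 0.1],
    "careers": [0.0, 0.1, 0.9],
}

# Hypothetical embedding of a query such as "how do I get my money back?"
query = [1.0, 0.2, 0.0]

for score, chunk_id in top_k(query, CHUNKS, k=2):
    print(f"{chunk_id}: {score:.3f}")
```

The same scoring loop covers the deduplication case: instead of keeping the top k, flag any pair whose score exceeds a threshold tuned on labeled examples.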