NLP Applications: Text Summarization
There exist two fundamentally different approaches to text summarization:
Extractive summarization selects a subset of sentences directly from the original document. Every word in the summary appeared in the source. Conservative, safe, sometimes choppy.
Abstractive summarization generates a new summary that captures the key points but may use language not in the original. Risky (can hallucinate), but more readable.
Both are used in production. Amazon's review summaries are abstractive: they synthesize many reviews into a paragraph that reads as if a human wrote it. Email summaries on mobile devices are abstractive. Scientific article summaries can be either, depending on the tool.
Extractive: TextRank
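TextRank is an elegant unsupervised method. Treat each sentence as a node in a graph. Draw an edge between two sentences if they're similar enough, where similarity is typically measured by word overlap or by cosine similarity between sentence vectors, and the similarity score can serve as the edge weight. Now you have a sentence-similarity graph.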
Run the PageRank algorithm on this graph. PageRank identifies the most "central" nodes — the ones well-connected to other well-connected nodes. The most central sentences are, intuitively, the most representative. Extract them.
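The two steps above (build the similarity graph, then rank its nodes) can be sketched in plain Python. This is a minimal illustration, not a reference implementation: it assumes bag-of-words cosine similarity as the similarity measure, a naive regex sentence splitter, and illustrative defaults (threshold 0.1, damping 0.85, 50 iterations). The function names are hypothetical.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters.
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def textrank(sentences, threshold=0.1, damping=0.85, iterations=50):
    # Returns one centrality score per sentence.
    n = len(sentences)
    bags = [Counter(re.findall(r"\w+", s.lower())) for s in sentences]

    # Weighted, undirected graph: keep an edge only if the two
    # sentences are "similar enough" (similarity >= threshold).
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            sim = cosine(bags[i], bags[j])
            if sim >= threshold:
                weights[i][j] = weights[j][i] = sim

    # PageRank by power iteration over the weighted graph.
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(
                weights[j][i] * scores[j] / out_sum[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


def summarize(text, k=2):
    # Extract the k most central sentences, kept in document order.
    sentences = split_sentences(text)
    if not sentences:
        return []
    scores = textrank(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


text = (
    "Cats sleep a lot. Cats sleep on the sofa. "
    "Cats sleep on the bed at night. Bananas are yellow."
)
summary = summarize(text, k=2)
```

In this toy input the banana sentence shares no words with the others, so it has no edges and receives only the baseline score; the two extracted sentences come from the connected cat cluster. Returning the picks in document order is a design choice that tends to reduce the choppiness of extractive output.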
Abstractive: Pretrained Transformers
Start from a pretrained sequence-to-sequence transformer (e.g., BART or T5). Either fine-tune it on a summarization dataset or use an off-the-shelf checkpoint that has already been fine-tuned for summarization. Pipeline:
- Preprocess and tokenize the document.
- Break the document into chunks if it exceeds the model's max input length.
- Pass each chunk through the model.
- Stitch the outputs together (sometimes via a second-pass summary of summaries).
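The pipeline above can be sketched with the model call left pluggable. This is a hedged illustration under stated assumptions: whitespace-separated words stand in for the model's real tokenizer, `summarize_chunk` is a hypothetical callable that would wrap a BART or T5 call in practice, and `max_len=512` is an illustrative limit rather than any particular model's.

```python
from typing import Callable, List


def chunk_tokens(tokens: List[str], max_len: int) -> List[List[str]]:
    # Split the token list into consecutive windows of at most max_len tokens.
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]


def summarize_long(
    text: str,
    summarize_chunk: Callable[[str], str],
    max_len: int = 512,
    second_pass: bool = True,
) -> str:
    # 1. Tokenize. Assumption: whitespace words approximate model tokens.
    tokens = text.split()

    # 2. Break the document into chunks that fit the input limit.
    chunks = chunk_tokens(tokens, max_len)

    # 3. Pass each chunk through the (pluggable) summarizer.
    partials = [summarize_chunk(" ".join(chunk)) for chunk in chunks]

    # 4. Stitch the outputs together.
    stitched = " ".join(partials)

    # Optional second pass: summarize the summaries, but only when there
    # was more than one chunk and the stitched text itself fits the limit.
    if second_pass and len(partials) > 1 and len(stitched.split()) <= max_len:
        return summarize_chunk(stitched)
    return stitched


def first_two_words(chunk: str) -> str:
    # Hypothetical stand-in for a model: keeps the first two words of its input.
    return " ".join(chunk.split()[:2])


document = " ".join(f"w{i}" for i in range(25))
stitched_only = summarize_long(document, first_two_words, max_len=10, second_pass=False)
two_pass = summarize_long(document, first_two_words, max_len=10)
```

Fixed-size windows can cut a sentence in half; splitting on sentence or section boundaries, or overlapping adjacent chunks, are common refinements that this sketch omits.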
For very long documents, such as a research paper, you might summarize each section separately and then summarize the summaries. This multi-step approach keeps every input within the model's context window.
Real World: Amazon Review Summaries
Amazon uses abstractive summarization to synthesize thousands of customer reviews into a paragraph highlighting what buyers most frequently mention about a product. This is the multi-document counterpart of the abstractive pipeline described above: instead of splitting one long document into chunks, it ingests many short inputs and generates one coherent output.
