
A Survey on Automatic Text/Speech Summarization



  1. A Survey on Automatic Text/Speech Summarization. Shih-Hsiang Lin (林士翔), Department of Computer Science & Information Engineering, National Taiwan Normal University. References: D. Das and A. F. T. Martins, A Survey on Automatic Text Summarization, 2007; Y. T. Chen et al., A probabilistic generative framework for extractive broadcast news speech summarization, IEEE Trans. on ASLP, 2009; E. Hovy, Automated Text Summarization Tutorial, COLING/ACL 1998; D. Radev, Text Summarization Tutorial, SIGIR 2004; Berlin Chen, A Brief Review of Extractive Summarization Research, lecture notes, 2008

  2. NLP Related Technologies

  3. Outline • Introduction • Single-Document Summarization • Early Work • Supervised Methods • Unsupervised Methods • Multi-Document Summarization • Not available yet… • Evaluation • ROUGE • Information-Theoretic Method

  4. Introduction • The subfield of summarization has been investigated by the NLP community for nearly half a century • "A text that is produced from one or more texts, that conveys important information in the original text(s), and that is no longer than half of the original text(s) and usually significantly less than that" – (Radev, 2000) • Summaries may be produced from a single document or from multiple documents • Summaries should preserve important information • Summaries should be short • Terminology used in summarization • Extraction: identify and lift important sections of the text • Abstraction: reproduce the important material in a new way • Fusion: combine extracted parts coherently • Compression: throw out unimportant sections of the text • Indicative vs. Informative vs. Critical • Generic vs. Query-oriented • Single-Document Summarization vs. Multi-Document Summarization

  5. Introduction (cont.) • Input (Sparck Jones, 1997) • Subject type: domain • Genre: newspaper articles, editorials, letters, reports... • Form: regular text structure; free-form • Source size: single doc; multiple docs (few; many) • Purpose • Situation: embedded in a larger system (MT, IR) or not? • Audience: focused or general • Usage: IR, sorting, skimming... • Output • Completeness: include all aspects, or focus on some? • Format: paragraph, table, etc. • Style: informative, indicative, critical... *This slide was adapted from Prof. Hovy's presentation

  6. Introduction (cont.) • A Summarization Machine *This slide was adapted from Prof. Hovy's presentation

  7. Introduction (cont.) • A brief history of summarization

  8. Speech Summarization • Fundamental problems with speech summarization • Disfluencies, hesitations, repetitions, repairs, … • Difficulty of sentence segmentation • More spontaneous kinds of speech (e.g., interviews in broadcast news) are less amenable to standard text summarization • Speech recognition errors • Speech summarization output forms • Speech-to-text summarization • The documents can be easily looked through • The parts of the documents that interest users can be easily extracted • Information extraction and retrieval techniques can be readily applied to the documents • Speech-to-speech summarization • Misleading information caused by speech recognition errors can be avoided • Prosodic information, such as the emotion of speakers, that is conveyed only by speech can be preserved *This slide was adapted from Prof. Furui's presentation

  9. Single-Document Summarization: Early Work • The most cited paper on summarization is that of (Luhn, 1958) • The frequency of a particular word in an article provides a useful measure of its significance • Several key ideas put forward in this paper have assumed importance in later work on summarization • Words were stemmed to their root forms, and stop words were deleted • A list of content words sorted by decreasing frequency was compiled, the index providing a significance measure for each word • A significance factor was derived for each sentence, reflecting the number of occurrences of significant words within the sentence • All sentences are ranked in order of their significance factor, and the top-ranking sentences are selected to form the auto-abstract • Baxendale suggested that sentence position is also helpful in finding salient parts of documents (Baxendale, 1958) • He examined 200 paragraphs and found that in 85% of them the topic sentence was the first sentence, and in 7% it was the last
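A minimal sketch of Luhn's frequency-based idea in Python. The stop-word list, frequency threshold, and significance formula are simplified stand-ins (Luhn additionally stemmed words and restricted the significance window to clusters bracketed by significant words):

```python
# Luhn-style extraction sketch: significant words = frequent non-stop words;
# a sentence's significance factor = (significant hits)^2 / span containing them.
from collections import Counter
import re

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "that", "it"}

def luhn_summary(text, top_k=3, freq_threshold=2):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    freq = Counter(words)
    significant = {w for w, c in freq.items() if c >= freq_threshold}

    def significance(sentence):
        tokens = re.findall(r"[a-z]+", sentence.lower())
        hits = [i for i, w in enumerate(tokens) if w in significant]
        if not hits:
            return 0.0
        span = hits[-1] - hits[0] + 1          # window bracketed by significant words
        return len(hits) ** 2 / span           # simplified significance factor

    return sorted(sentences, key=significance, reverse=True)[:top_k]
```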

  10. Single-Document Summarization: Early Work (cont.) • Edmundson (1969) describes a system that produces document extracts • His primary contribution was the development of a typical structure for an extractive summarization experiment (400 technical documents) • Four kinds of features were used • Word frequency and positional features • Cue words: presence of words like "significant" or "hardly" • The skeleton of the document: whether the sentence is a title or heading • Weights were attached to each of these features manually to score each sentence • About 44% of the auto-extracts matched the manual extracts
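A minimal sketch of Edmundson-style scoring as a hand-weighted linear combination of the four feature scores; the weights and feature values below are purely illustrative (Edmundson tuned his weights against manual extracts):

```python
# Edmundson-style sentence scoring: weighted sum of cue, key (frequency),
# title/heading, and location scores. All numbers are illustrative placeholders.
def edmundson_score(cue, key, title, location, weights=(1.0, 1.0, 1.0, 1.0)):
    w_cue, w_key, w_title, w_loc = weights
    return w_cue * cue + w_key * key + w_title * title + w_loc * location

# Example: a sentence with a cue word, a moderate keyword score, no title overlap,
# appearing first in its paragraph.
print(edmundson_score(cue=1.0, key=0.6, title=0.0, location=1.0))
```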

  11. Single-Document Summarization: Supervised Methods • In the 1990s, with the advent of machine learning techniques in NLP, a series of seminal publications appeared that employed statistical techniques to produce document extracts • Kupiec et al. (1995) used a naive-Bayes classifier to categorize each sentence as worthy of extraction or not • Let $s$ be a particular sentence, $S$ the set of sentences that make up the summary, and $F_1, \ldots, F_k$ the features • Assuming independence of the features, $P(s \in S \mid F_1, \ldots, F_k) = \frac{P(s \in S) \prod_{j=1}^{k} P(F_j \mid s \in S)}{\prod_{j=1}^{k} P(F_j)}$ • Two additional features were used: sentence length and the presence of uppercase words • Feature analysis revealed that a system using only the position and cue features, along with sentence length, performed best
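A minimal sketch of that naive-Bayes posterior under the feature-independence assumption, with binary features; the probability tables below are made up for illustration, not estimated from a real training corpus:

```python
# Kupiec-style naive-Bayes scorer: unnormalized posterior of "sentence in summary".
def naive_bayes_posterior(features, p_f_given_summary, p_f, prior_summary):
    """P(s in S | F1..Fk) proportional to P(s in S) * prod_j P(Fj | s in S) / P(Fj)."""
    score = prior_summary
    for f, p1_s, p1 in zip(features, p_f_given_summary, p_f):
        if f:                                   # binary feature fires for this sentence
            score *= p1_s / p1
        else:                                   # feature does not fire
            score *= (1.0 - p1_s) / (1.0 - p1)
    return score

# Example: lead-position, cue-word, and long-sentence features (values fabricated).
print(naive_bayes_posterior([1, 0, 1], [0.6, 0.3, 0.7], [0.2, 0.1, 0.5], 0.25))
```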

  12. Single-Document Summarization: Supervised Methods (cont.) • Aone et al. (1999) also used a naive-Bayes classifier, but with richer features • Signature words: derived from term frequency (TF) and inverse document frequency (IDF) • A named-entity tagger • Shallow discourse analysis • Synonyms and morphological variants were also merged (accomplished using WordNet) • Lin and Hovy (1997) studied the importance of the sentence-position feature • Since discourse structure varies significantly across domains, they made an important contribution by investigating techniques for tailoring the position method toward optimality over a genre • They measured the yield of each sentence position against the topic keywords • They then ranked the sentence positions by their average yield to produce the Optimal Position Policy (OPP) for topic positions for the genre
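A minimal sketch of deriving an OPP-style ranking of sentence positions; the corpus format and keyword set are assumptions made only for illustration:

```python
# OPP sketch: rank sentence positions by the average number of topic keywords
# that sentences at each position contain across a corpus.
from collections import defaultdict

def optimal_position_policy(documents, topic_keywords):
    """documents: list of docs, each a list of sentences (lists of lowercase tokens)."""
    yield_sum, count = defaultdict(float), defaultdict(int)
    for doc in documents:
        for pos, sentence in enumerate(doc):
            yield_sum[pos] += sum(1 for w in sentence if w in topic_keywords)
            count[pos] += 1
    avg_yield = {pos: yield_sum[pos] / count[pos] for pos in yield_sum}
    return sorted(avg_yield, key=avg_yield.get, reverse=True)   # best positions first
```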

  13. Single-Document Summarization: Supervised Methods (cont.) • Lin (1999) broke away from the assumption that features are independent of each other • He modeled the problem of sentence extraction using decision trees instead of a naive-Bayes classifier • Some novel features were introduced in his paper • Query signature: normalized score given to a sentence depending on the number of query words it contains • IR signature: score given to a sentence depending on the number and scores of IR signature words it includes (the m most salient words in the corpus) • Average lexical connectivity: the number of terms shared with other sentences divided by the total number of sentences in the text • Numerical data: value 1 when the sentence contains a number • Proper name, pronoun or adjective, weekday or month, quotation (defined similarly to the previous feature) • Sentence length, sentence order • Feature analysis suggested that the IR signature was a valuable feature, corroborating the early findings of Luhn (1958)
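A minimal sketch of fitting a decision tree over features of this kind with scikit-learn (assumed to be installed); the tiny feature matrix and labels below are fabricated purely for illustration:

```python
# Decision-tree sentence extraction sketch.
from sklearn.tree import DecisionTreeClassifier

# Each row: [query_signature, ir_signature, lexical_connectivity, has_number, sentence_length]
X_train = [[0.5, 0.8, 0.10, 1, 25],
           [0.0, 0.2, 0.05, 0, 12],
           [0.9, 0.7, 0.20, 0, 30],
           [0.1, 0.1, 0.02, 1, 8]]
y_train = [1, 0, 1, 0]                      # 1 = in summary, 0 = not

clf = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)
print(clf.predict([[0.6, 0.9, 0.15, 0, 22]]))
```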

  14. Single-Document Summarization: Supervised Methods (cont.) • Conroy and O'Leary (2001) modeled the problem of extracting sentences from a document using a hidden Markov model (HMM) • The HMM was structured as follows • A chain of states alternating between summary states and non-summary states • "Hesitation" was allowed only in non-summary states and "skipping the next state" only in summary states • The transition matrix was estimated from a training corpus • Its (i, j) element is the empirical probability of transitioning from state i to state j • Associated with each state i was an output function • The features are assumed to be multivariate normally distributed • The training data is used to compute the maximum-likelihood estimate of each state's mean and of a covariance matrix shared across states • Three features were used: position of the sentence, number of terms in the sentence, and likelihood of the sentence terms given the document terms
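A minimal sketch of the Gaussian state-output score assumed above, i.e., a multivariate normal log-density over a sentence's feature vector; the feature values and parameters are placeholders:

```python
# Log of a multivariate normal density, as used for the HMM state-output probabilities.
import numpy as np

def log_emission(x, mean, shared_cov):
    """Log N(x; mean, shared_cov) for feature vector x in a given state."""
    d = len(x)
    diff = np.asarray(x, dtype=float) - np.asarray(mean, dtype=float)
    inv = np.linalg.inv(shared_cov)
    logdet = np.linalg.slogdet(shared_cov)[1]
    return -0.5 * (d * np.log(2 * np.pi) + logdet + diff @ inv @ diff)

# Example: [position, number of terms, term likelihood score] (values fabricated).
print(log_emission([0.1, 20.0, -3.2], [0.0, 18.0, -3.0], np.eye(3)))
```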

  15. Single-Document Summarization: Supervised Methods (cont.) • Osborne (2002) used log-linear models to obviate the assumption of feature independence • Let $c$ be a label, $s$ the item we are interested in labeling, $f_i$ the i-th feature, and $\lambda_i$ the corresponding feature weight • The conditional log-linear model can be stated as $P(c \mid s) = \frac{1}{Z(s)} \exp\!\big(\textstyle\sum_i \lambda_i f_i(c, s)\big)$, where $Z(s)$ is a normalizing constant • A non-uniform prior was added to the model, on the grounds that a log-linear model otherwise tends to reject too many sentences for inclusion in a summary • The features included word pairs, sentence length, sentence position, and naive discourse features such as "inside introduction" or "inside conclusion"
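A minimal sketch of such a conditional log-linear (maximum-entropy) scorer; the two feature functions and the weights are hypothetical, just to show the normalized exponential form:

```python
# Conditional log-linear model sketch: P(c | s) = exp(sum_i w_i * f_i(c, s)) / Z(s).
import math

def loglinear_prob(label, item, feature_fns, weights, labels=(0, 1)):
    def score(c):
        return math.exp(sum(w * f(c, item) for w, f in zip(weights, feature_fns)))
    return score(label) / sum(score(c) for c in labels)

# Hypothetical feature functions pairing sentence length and lead position with the
# "summary" label (c = 1).
features = [lambda c, s: c * len(s["tokens"]) / 40.0,
            lambda c, s: c * (1.0 if s["position"] == 0 else 0.0)]
sentence = {"tokens": ["the", "results", "show", "improvement"], "position": 0}
print(loglinear_prob(1, sentence, features, weights=[0.7, 1.5]))
```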

  16. Single-Document Summarization: Supervised Methods (cont.) • Svore et al. (2007) proposed an algorithm based on neural nets and on third-party datasets to perform extractive summarization • They trained a model that could infer the proper ranking of sentences • The ranking was accomplished using RankNet, a neural-network-based ranking algorithm • For the training set, they used ROUGE-1 to score the similarity between a human-written highlight and each sentence in the document • These similarity scores were used as "soft labels" during training, contrasting with other approaches where sentences are "hard-labeled" as selected or not • Another novelty of the framework lay in the use of features derived from the query logs of Microsoft's news search engine and from Wikipedia entries (third-party datasets) • They conjectured that if a document sentence contains keywords used in the news search engine, or entities found in Wikipedia articles, then there is a greater chance of that sentence appearing in the highlight • They generated 10 features for each sentence in each document • Is first sentence, sentence position, SumBasic score (unigram), SumBasic bigram score, title similarity score, average news query term score, news query term sum score, relative news query term score, average Wikipedia entity score, Wikipedia entity sum score

  17. Single-Document Summarization: Supervised Methods (cont.) • Other kinds of supervised summarizers include • Support vector machines (SVM) (Hirao et al. 2002) • Gaussian mixture models (GMM) (Murray et al. 2005) • Conditional random fields (CRFs) (Shen et al. 2007) • In general, extractive summarization can be treated as a two-class (summary/non-summary) classification problem (Lin et al. 2009) • Each sentence is described by a set of representative features • To summarize documents at different summarization ratios, the important sentences of a document can be selected (or ranked) based on the posterior probability of a sentence being included in the summary given its feature set

  18. Single-Document Summarization: Unsupervised Methods • Gong (2001) proposed using the vector space model (VSM) • Sentences and the document to be summarized are represented as vectors using statistical term weighting such as TF-IDF • Sentences are ranked based on their similarity to the document vector • Maximum Marginal Relevance (MMR) (Murray et al. 2005) can be applied so that the summary covers the more important concepts of a document while avoiding redundant ones
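A minimal sketch of VSM ranking combined with an MMR-style selection step; term weighting is simplified to raw term frequency and the trade-off parameter lam is a placeholder:

```python
# VSM + MMR sketch: pick sentences similar to the document vector but dissimilar
# to sentences already selected.
import numpy as np
from collections import Counter

def vectorize(tokens, vocab):
    counts = Counter(tokens)
    return np.array([counts[w] for w in vocab], dtype=float)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def mmr_select(sentences, doc_tokens, k=2, lam=0.7):
    """sentences: list of token lists; returns indices of the selected sentences."""
    vocab = sorted(set(doc_tokens))
    doc_vec = vectorize(doc_tokens, vocab)
    vecs = [vectorize(s, vocab) for s in sentences]
    selected = []
    while len(selected) < min(k, len(sentences)):
        def mmr(i):
            redundancy = max((cosine(vecs[i], vecs[j]) for j in selected), default=0.0)
            return lam * cosine(vecs[i], doc_vec) - (1 - lam) * redundancy
        best = max((i for i in range(len(sentences)) if i not in selected), key=mmr)
        selected.append(best)
    return selected
```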

  19. Single-Document Summarization: Unsupervised Methods (cont.) • Latent Semantic Analysis (LSA) (Gong 2001) • Construct a "term-sentence" matrix for a given document • Perform Singular Value Decomposition (SVD) on the "term-sentence" matrix • The right singular vectors with larger singular values represent the dimensions of the more important latent semantic concepts in the document • Represent each sentence of the document as a semantic vector in the reduced space
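A minimal sketch of the Gong-style LSA selection: build the term-sentence matrix, take its SVD, and for each leading latent concept pick the sentence with the largest component along that concept (the tokenization and concept count are assumptions):

```python
# LSA-based sentence selection sketch.
import numpy as np

def lsa_select(sentences, num_concepts=2):
    """sentences: list of token lists; returns indices of selected sentences."""
    vocab = sorted({w for s in sentences for w in s})
    A = np.array([[s.count(w) for s in sentences] for w in vocab], dtype=float)
    U, S, Vt = np.linalg.svd(A, full_matrices=False)   # rows of Vt span the latent concepts
    selected = []
    for concept in range(min(num_concepts, Vt.shape[0])):
        idx = int(np.argmax(np.abs(Vt[concept])))      # sentence most aligned with concept
        if idx not in selected:
            selected.append(idx)
    return selected
```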

  20. Single-Document Summarization: Unsupervised Methods (cont.) • Probabilistic generative framework (Chen et al. 2009) • Criterion: maximum a posteriori (MAP) • Sentence generative model • Each sentence of the document is treated as a probabilistic generative model • The Language Model (LM), Sentence Topic Model (STM), and Word Topic Model (WTM) are investigated • Sentence prior model • The sentence prior is simply set to uniform here • It may instead reflect duration/position, correctness of sentence boundaries, confidence scores, prosodic information, etc.

  21. Single-Document Summarization: Unsupervised Methods (cont.) • Language Model (LM) approach (literal term matching) • Each sentence S is scored by the likelihood of it generating the document D, with the sentence model smoothed by the collection model: $P(D \mid S) = \prod_{w \in D} \big[\lambda\, P(w \mid S) + (1 - \lambda)\, P(w \mid C)\big]^{c(w, D)}$, where $P(w \mid S)$ is the sentence model, $P(w \mid C)$ is the collection model, and $\lambda$ is a weighting parameter • Sentence Topic Model (STM) approach (concept matching) • Word Topic Model (WTM) approach (concept matching)
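A minimal sketch of that LM scoring idea, using maximum-likelihood unigram estimates and a placeholder smoothing weight; log-probabilities are used to avoid underflow:

```python
# LM-based sentence ranking sketch: score a sentence by the log-likelihood of it
# generating the document, interpolating the sentence model with a collection model.
import math
from collections import Counter

def lm_log_score(sentence_tokens, doc_tokens, collection_tokens, lam=0.7):
    sent = Counter(sentence_tokens)
    coll = Counter(collection_tokens)
    log_score = 0.0
    for w, c in Counter(doc_tokens).items():
        p_sent = sent[w] / max(len(sentence_tokens), 1)
        p_coll = coll[w] / max(len(collection_tokens), 1)
        p = lam * p_sent + (1 - lam) * p_coll
        log_score += c * math.log(p if p > 0 else 1e-12)   # guard completely unseen words
    return log_score
```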

  22. Multi-Document Summarization • Task characteristics • Input: a set of documents on the same topic • Retrieved during an IR search • Clustered by a news browser • Problem: same topic or same event? • Output: a paragraph-length summary • Salient information across documents • Similarities between topics? • Redundancy removal is critical • Application-oriented task • News portals presenting articles from different sources • Corporate e-mails organized by subject • Medical reports about a patient

  23. Evaluation • Recall-Oriented Understudy for Gisting Evaluation (ROUGE) (Lin 2004) • Let $R$ be a set of reference summaries and let $s$ be a summary generated automatically by a system. Let $\tau_n(d)$ be a binary vector representing the n-grams contained in a document $d$ • The metric ROUGE-N is an n-gram recall-based statistic: $\mathrm{ROUGE\text{-}N}(s) = \frac{\sum_{r \in R} \langle \tau_n(r), \tau_n(s) \rangle}{\sum_{r \in R} \langle \tau_n(r), \tau_n(r) \rangle}$, where $\langle \cdot, \cdot \rangle$ denotes the usual inner product of vectors • The various versions of ROUGE were evaluated by computing the correlation coefficient between ROUGE scores and human judgment scores • ROUGE-2 performed the best among the ROUGE-N variants • Example (word-segmented Chinese): reference 「昨天 馬英九 訪問 中國大陸」 ("yesterday / Ma Ying-jeou / visited / mainland China") vs. system summary 「昨天 馬英九 結束 訪問 回國」 ("yesterday / Ma Ying-jeou / concluded / visit / returned home")
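A minimal sketch of ROUGE-N computed as n-gram recall with clipped counts (the common formulation, equivalent in spirit to the inner-product notation above), applied to the bilingual example from the slide:

```python
# ROUGE-N sketch: fraction of reference n-grams that also appear in the candidate.
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, references, n=2):
    cand = ngrams(candidate, n)
    matched = total = 0
    for ref in references:
        ref_ngrams = ngrams(ref, n)
        matched += sum(min(c, cand[g]) for g, c in ref_ngrams.items())
        total += sum(ref_ngrams.values())
    return matched / total if total else 0.0

reference = ["昨天", "馬英九", "訪問", "中國大陸"]
candidate = ["昨天", "馬英九", "結束", "訪問", "回國"]
print(rouge_n(candidate, [reference], n=1))   # ROUGE-1 = 3/4: 昨天, 馬英九, 訪問 match
```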

  24. Evaluation (cont.) • Lin et al. (2006) also proposed an information-theoretic method for the automatic evaluation of summaries • The central idea is to use a divergence measure (i.e., the Jensen-Shannon divergence) between a pair of probability distributions • The first distribution is derived from an automatic summary and the second from a set of reference summaries • Let $D$ be the set of documents to summarize; a distribution $P_R$ (with parameters $\theta_R$) is assumed to generate the reference summaries, while the summarization system is governed by some distribution $P_A$ • We may define a good summarizer as one for which $P_A$ is close to $P_R$ • One information-theoretic measure between distributions that is adequate for this is the KL divergence, $\mathrm{KL}(P_A \parallel P_R) = \sum_{w} P_A(w) \log \frac{P_A(w)}{P_R(w)}$ • However, the KL divergence is unbounded: it goes to infinity wherever $P_R$ vanishes and $P_A$ does not • Another problem is that the KL divergence is not symmetric

  25. Evaluation (cont.) • Hence, they propose to use the Jensen-Shannon divergence, which is bounded and symmetric: $\mathrm{JS}(P \parallel Q) = \tfrac{1}{2}\,\mathrm{KL}(P \parallel M) + \tfrac{1}{2}\,\mathrm{KL}(Q \parallel M)$, where $M = \tfrac{1}{2}(P + Q)$ • To evaluate a summary $s$ against a reference summary $r$, the negative JS divergence $-\mathrm{JS}(P_s \parallel P_r)$ between their distributions can be used for the purpose
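A minimal sketch of scoring a summary by the negative Jensen-Shannon divergence between its unigram distribution and a reference's; the toy sentences are illustrative only:

```python
# JS-divergence-based summary scoring sketch over unigram distributions.
import math
from collections import Counter

def distribution(tokens):
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def kl(p, q):
    return sum(pw * math.log(pw / q[w]) for w, pw in p.items() if pw > 0)

def js_divergence(p, q):
    vocab = set(p) | set(q)
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

system = distribution("the economy grew last quarter".split())
reference = distribution("the economy expanded in the last quarter".split())
print(-js_divergence(system, reference))    # closer to zero means a better summary
```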
