
A Survey on Automatic Text/Speech Summarization



  1. A Survey on Automatic Text/Speech Summarization. Shih-Hsiang Lin (林士翔), Department of Computer Science & Information Engineering, National Taiwan Normal University. References: D. Das and A. F. T. Martins, A Survey on Automatic Text Summarization, 2007; Y. T. Chen et al., A probabilistic generative framework for extractive broadcast news speech summarization, IEEE Trans. on ASLP, 2009; E. Hovy, Automated Text Summarization Tutorial, COLING/ACL 1998; D. Radev, Text Summarization Tutorial, SIGIR 2004; Berlin Chen, A Brief Review of Extractive Summarization Research, lecture notes, 2008

  2. NLP Related Technologies

  3. Outline • Introduction • Single-Document Summarization • Early Work • Supervised Methods • Unsupervised Methods • Multi-Document Summarization • Not available yet… • Evaluation • ROUGE • Information-Theoretic Method

  4. Introduction • The subfield of summarization has been investigated by the NLP community for nearly half a century • "A text that is produced from one or more texts, that conveys important information in the original text(s), and that is no longer than half of the original text(s) and usually significantly less than that" – (Radev, 2000) • Summaries may be produced from a single document or from multiple documents • Summaries should preserve important information • Summaries should be short • Terminology used in summarization • Extraction: identify and lift important sections of the text • Abstraction: reproduce the important material in a new way • Fusion: combine extracted parts coherently • Compression: throw out unimportant sections of the text • Indicative vs. Informative vs. Critical • Generic vs. Query-oriented • Single-Document Summarization vs. Multi-Document Summarization

  5. Introduction (cont.) • Input (Sparck Jones, 1997) • Subject type: domain • Genre: newspaper articles, editorials, letters, reports... • Form: regular text structure; free-form • Source size: single doc; multiple docs (few; many) • Purpose • Situation: embedded in a larger system (MT, IR) or not? • Audience: focused or general • Usage: IR, sorting, skimming... • Output • Completeness: include all aspects, or focus on some? • Format: paragraph, table, etc. • Style: informative, indicative, critical... *This slide was adapted from Prof. Hovy's presentation

  6. Introduction (cont.) • A Summarization Machine *This slide was adapted from Prof. Hovy's presentation

  7. Introduction (cont.) • A brief history of summarization

  8. Speech Summarization • Fundamental problems with speech summarization • Disfluencies, hesitations, repetitions, repairs, … • Difficulty of sentence segmentation • More spontaneous kinds of speech (e.g., interviews in broadcast news) are less amenable to standard text summarization • Speech recognition errors • Speech summarization output forms • Speech-to-text summarization • The documents can be easily looked through • The parts of the documents that interest users can be easily extracted • Information extraction and retrieval techniques can be readily applied to the documents • Speech-to-speech summarization • Misleading information caused by speech recognition errors can be avoided • Prosodic information, such as the emotion of speakers, that is conveyed only by speech can be preserved *This slide was adapted from Prof. Furui's presentation

  9. Single-Document Summarization: Early Work • The most cited paper on summarization is that of (Luhn, 1958) • The frequency of a particular word in an article provides a useful measure of its significance • Several key ideas put forward in this paper have assumed importance in later work on summarization • Words were stemmed to their root forms, and stop words were deleted • A list of content words sorted by decreasing frequency was compiled, the index providing a significance measure for each word • A significance factor was derived for each sentence, reflecting the number of occurrences of significant words within the sentence • All sentences are ranked in order of their significance factor, and the top-ranking sentences are selected to form the auto-abstract • Baxendale suggested that sentence position is also helpful in finding salient parts of documents (Baxendale, 1958) • He examined 200 paragraphs and found that in 85% of them the topic sentence was the first sentence, and in 7% it was the last
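A minimal sketch of Luhn's frequency-based idea in Python. The stop-word list, frequency threshold, and significance formula are simplified stand-ins (Luhn additionally stemmed words and restricted the significance window to clusters bracketed by significant words):

```python
# Luhn-style extraction sketch: significant words = frequent non-stop words;
# a sentence's significance factor = (significant hits)^2 / span containing them.
from collections import Counter
import re

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "that", "it"}

def luhn_summary(text, top_k=3, freq_threshold=2):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    freq = Counter(words)
    significant = {w for w, c in freq.items() if c >= freq_threshold}

    def significance(sentence):
        tokens = re.findall(r"[a-z]+", sentence.lower())
        hits = [i for i, w in enumerate(tokens) if w in significant]
        if not hits:
            return 0.0
        span = hits[-1] - hits[0] + 1          # window bracketed by significant words
        return len(hits) ** 2 / span           # simplified significance factor

    return sorted(sentences, key=significance, reverse=True)[:top_k]
```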

  10. Single-Document Summarization: Early Work (cont.) • Edmundson (1969) describes a system that produces document extracts • His primary contribution was the development of a typical structure for an extractive summarization experiment (400 technical documents) • Four kinds of features were used • Word frequency and positional features • Cue words: presence of words like "significant" or "hardly" • The skeleton of the document: whether the sentence is a title or heading • Weights were attached to each of these features manually to score each sentence • About 44% of the auto-extracts matched the manual extracts
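A minimal sketch of Edmundson-style scoring as a hand-weighted linear combination of the four feature scores; the weights and feature values below are purely illustrative (Edmundson tuned his weights against manual extracts):

```python
# Edmundson-style sentence scoring: weighted sum of cue, key (frequency),
# title/heading, and location scores. All numbers are illustrative placeholders.
def edmundson_score(cue, key, title, location, weights=(1.0, 1.0, 1.0, 1.0)):
    w_cue, w_key, w_title, w_loc = weights
    return w_cue * cue + w_key * key + w_title * title + w_loc * location

# Example: a sentence with a cue word, a moderate keyword score, no title overlap,
# appearing first in its paragraph.
print(edmundson_score(cue=1.0, key=0.6, title=0.0, location=1.0))
```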

  11. Single-Document Summarization: Supervised Methods • In the 1990s, with the advent of machine learning techniques in NLP, a series of seminal publications appeared that employed statistical techniques to produce document extracts • Kupiec et al. (1995) used a naive-Bayes classifier to categorize each sentence as worthy of extraction or not • Let $s$ be a particular sentence, $S$ the set of sentences that make up the summary, and $F_1, \ldots, F_k$ the features • Assuming independence of the features, $P(s \in S \mid F_1, \ldots, F_k) = \frac{P(s \in S) \prod_{j=1}^{k} P(F_j \mid s \in S)}{\prod_{j=1}^{k} P(F_j)}$ • Two additional features were used: sentence length and the presence of uppercase words • Feature analysis revealed that a system using only the position and cue features, along with sentence length, performed best
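A minimal sketch of that naive-Bayes posterior under the feature-independence assumption, with binary features; the probability tables below are made up for illustration, not estimated from a real training corpus:

```python
# Kupiec-style naive-Bayes scorer: unnormalized posterior of "sentence in summary".
def naive_bayes_posterior(features, p_f_given_summary, p_f, prior_summary):
    """P(s in S | F1..Fk) proportional to P(s in S) * prod_j P(Fj | s in S) / P(Fj)."""
    score = prior_summary
    for f, p1_s, p1 in zip(features, p_f_given_summary, p_f):
        if f:                                   # binary feature fires for this sentence
            score *= p1_s / p1
        else:                                   # feature does not fire
            score *= (1.0 - p1_s) / (1.0 - p1)
    return score

# Example: lead-position, cue-word, and long-sentence features (values fabricated).
print(naive_bayes_posterior([1, 0, 1], [0.6, 0.3, 0.7], [0.2, 0.1, 0.5], 0.25))
```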

  12. Single-Document Summarization: Supervised Methods (cont.) • Aone et al. (1999) also used a naive-Bayes classifier, but with richer features • Signature words: derived from term frequency (TF) and inverse document frequency (IDF) • A named-entity tagger • Shallow discourse analysis • Synonyms and morphological variants were also merged (accomplished using WordNet) • Lin and Hovy (1997) studied the importance of the sentence-position feature • Since discourse structure varies significantly across domains, they made an important contribution by investigating techniques for tailoring the position method toward optimality over a genre • They measured the yield of each sentence position against the topic keywords • They then ranked the sentence positions by their average yield to produce the Optimal Position Policy (OPP) for topic positions for the genre
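A minimal sketch of deriving an OPP-style ranking of sentence positions; the corpus format and keyword set are assumptions made only for illustration:

```python
# OPP sketch: rank sentence positions by the average number of topic keywords
# that sentences at each position contain across a corpus.
from collections import defaultdict

def optimal_position_policy(documents, topic_keywords):
    """documents: list of docs, each a list of sentences (lists of lowercase tokens)."""
    yield_sum, count = defaultdict(float), defaultdict(int)
    for doc in documents:
        for pos, sentence in enumerate(doc):
            yield_sum[pos] += sum(1 for w in sentence if w in topic_keywords)
            count[pos] += 1
    avg_yield = {pos: yield_sum[pos] / count[pos] for pos in yield_sum}
    return sorted(avg_yield, key=avg_yield.get, reverse=True)   # best positions first
```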

  13. Single-Document Summarization: Supervised Methods (cont.) • Lin (1999) broke away from the assumption that features are independent of each other • He modeled the problem of sentence extraction using decision trees instead of a naive-Bayes classifier • Some novel features were introduced in his paper • Query signature: normalized score given to a sentence depending on the number of query words it contains • IR signature: score given to a sentence depending on the number and scores of IR signature words it includes (the m most salient words in the corpus) • Average lexical connectivity: the number of terms shared with other sentences divided by the total number of sentences in the text • Numerical data: value 1 when the sentence contains a number • Proper name, pronoun or adjective, weekday or month, quotation (defined similarly to the previous feature) • Sentence length, sentence order • Feature analysis suggested that the IR signature was a valuable feature, corroborating the early findings of Luhn (1958)
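A minimal sketch of fitting a decision tree over features of this kind with scikit-learn (assumed to be installed); the tiny feature matrix and labels below are fabricated purely for illustration:

```python
# Decision-tree sentence extraction sketch.
from sklearn.tree import DecisionTreeClassifier

# Each row: [query_signature, ir_signature, lexical_connectivity, has_number, sentence_length]
X_train = [[0.5, 0.8, 0.10, 1, 25],
           [0.0, 0.2, 0.05, 0, 12],
           [0.9, 0.7, 0.20, 0, 30],
           [0.1, 0.1, 0.02, 1, 8]]
y_train = [1, 0, 1, 0]                      # 1 = in summary, 0 = not

clf = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)
print(clf.predict([[0.6, 0.9, 0.15, 0, 22]]))
```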

  14. Single-Document Summarization: Supervised Methods (cont.) • Conroy and O'Leary (2001) modeled the problem of extracting sentences from a document using a hidden Markov model (HMM) • The HMM was structured as follows • A chain of states alternating between summary states and non-summary states • "Hesitation" was allowed only in non-summary states and "skipping the next state" only in summary states • The transition matrix was estimated from a training corpus • Its (i, j) element is the empirical probability of transitioning from state i to state j • Associated with each state i was an output function • The features are assumed to be multivariate normally distributed • The training data is used to compute the maximum-likelihood estimate of each state's mean and of a covariance matrix shared across states • Three features were used: position of the sentence, number of terms in the sentence, and likelihood of the sentence terms given the document terms
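A minimal sketch of the Gaussian state-output score assumed above, i.e., a multivariate normal log-density over a sentence's feature vector; the feature values and parameters are placeholders:

```python
# Log of a multivariate normal density, as used for the HMM state-output probabilities.
import numpy as np

def log_emission(x, mean, shared_cov):
    """Log N(x; mean, shared_cov) for feature vector x in a given state."""
    d = len(x)
    diff = np.asarray(x, dtype=float) - np.asarray(mean, dtype=float)
    inv = np.linalg.inv(shared_cov)
    logdet = np.linalg.slogdet(shared_cov)[1]
    return -0.5 * (d * np.log(2 * np.pi) + logdet + diff @ inv @ diff)

# Example: [position, number of terms, term likelihood score] (values fabricated).
print(log_emission([0.1, 20.0, -3.2], [0.0, 18.0, -3.0], np.eye(3)))
```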

  15. Single-Document Summarization: Supervised Methods (cont.) • Osborne (2002) used log-linear models to obviate the assumption of feature independence • Let $c$ be a label, $s$ the item we are interested in labeling, $f_i$ the i-th feature, and $\lambda_i$ the corresponding feature weight • The conditional log-linear model can be stated as $P(c \mid s) = \frac{1}{Z(s)} \exp\!\big(\textstyle\sum_i \lambda_i f_i(c, s)\big)$, where $Z(s)$ is a normalizing constant • A non-uniform prior was added to the model, on the grounds that a log-linear model otherwise tends to reject too many sentences for inclusion in a summary • The features included word pairs, sentence length, sentence position, and naive discourse features such as "inside introduction" or "inside conclusion"
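A minimal sketch of such a conditional log-linear (maximum-entropy) scorer; the two feature functions and the weights are hypothetical, just to show the normalized exponential form:

```python
# Conditional log-linear model sketch: P(c | s) = exp(sum_i w_i * f_i(c, s)) / Z(s).
import math

def loglinear_prob(label, item, feature_fns, weights, labels=(0, 1)):
    def score(c):
        return math.exp(sum(w * f(c, item) for w, f in zip(weights, feature_fns)))
    return score(label) / sum(score(c) for c in labels)

# Hypothetical feature functions pairing sentence length and lead position with the
# "summary" label (c = 1).
features = [lambda c, s: c * len(s["tokens"]) / 40.0,
            lambda c, s: c * (1.0 if s["position"] == 0 else 0.0)]
sentence = {"tokens": ["the", "results", "show", "improvement"], "position": 0}
print(loglinear_prob(1, sentence, features, weights=[0.7, 1.5]))
```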

  16. Single-Document Summarization: Supervised Methods (cont.) • Svore et al. (2007) proposed an algorithm based on neural nets and on third-party datasets to perform extractive summarization • They trained a model that could infer the proper ranking of sentences • The ranking was accomplished using RankNet, a neural-network-based ranking algorithm • For the training set, they used ROUGE-1 to score the similarity between a human-written highlight and each sentence in the document • These similarity scores were used as "soft labels" during training, contrasting with other approaches where sentences are "hard-labeled" as selected or not • Another novelty of the framework lay in the use of features derived from the query logs of Microsoft's news search engine and from Wikipedia entries (third-party datasets) • They conjectured that if a document sentence contains keywords used in the news search engine, or entities found in Wikipedia articles, then there is a greater chance of that sentence appearing in the highlight • They generated 10 features for each sentence in each document • Is first sentence, sentence position, SumBasic score (unigram), SumBasic bigram score, title similarity score, average news query term score, news query term sum score, relative news query term score, average Wikipedia entity score, Wikipedia entity sum score

  17. Single-Document Summarization: Supervised Methods (cont.) • Other kinds of supervised summarizers include • Support vector machines (SVM) (Hirao et al. 2002) • Gaussian mixture models (GMM) (Murray et al. 2005) • Conditional random fields (CRFs) (Shen et al. 2007) • In general, extractive summarization can be treated as a two-class (summary/non-summary) classification problem (Lin et al. 2009) • Each sentence is described by a set of representative features • To summarize documents at different summarization ratios, the important sentences of a document can be selected (or ranked) based on the posterior probability of a sentence being included in the summary given its feature set

  18. Single-Document Summarization: Unsupervised Methods • Gong (2001) proposed using the vector space model (VSM) • Sentences and the document to be summarized are represented as vectors using statistical term weighting such as TF-IDF • Sentences are ranked based on their similarity to the document vector • Maximum Marginal Relevance (MMR) (Murray et al. 2005) can be applied so that the summary covers the more important concepts of a document while avoiding redundant ones
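A minimal sketch of VSM ranking combined with an MMR-style selection step; term weighting is simplified to raw term frequency and the trade-off parameter lam is a placeholder:

```python
# VSM + MMR sketch: pick sentences similar to the document vector but dissimilar
# to sentences already selected.
import numpy as np
from collections import Counter

def vectorize(tokens, vocab):
    counts = Counter(tokens)
    return np.array([counts[w] for w in vocab], dtype=float)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def mmr_select(sentences, doc_tokens, k=2, lam=0.7):
    """sentences: list of token lists; returns indices of the selected sentences."""
    vocab = sorted(set(doc_tokens))
    doc_vec = vectorize(doc_tokens, vocab)
    vecs = [vectorize(s, vocab) for s in sentences]
    selected = []
    while len(selected) < min(k, len(sentences)):
        def mmr(i):
            redundancy = max((cosine(vecs[i], vecs[j]) for j in selected), default=0.0)
            return lam * cosine(vecs[i], doc_vec) - (1 - lam) * redundancy
        best = max((i for i in range(len(sentences)) if i not in selected), key=mmr)
        selected.append(best)
    return selected
```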

  19. Single-Document Summarization: Unsupervised Methods (cont.) • Latent Semantic Analysis (LSA) (Gong 2001) • Construct a "term-sentence" matrix for a given document • Perform Singular Value Decomposition (SVD) on the "term-sentence" matrix • The right singular vectors with larger singular values represent the dimensions of the more important latent semantic concepts in the document • Represent each sentence of the document as a semantic vector in the reduced space
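A minimal sketch of the Gong-style LSA selection: build the term-sentence matrix, take its SVD, and for each leading latent concept pick the sentence with the largest component along that concept (the tokenization and concept count are assumptions):

```python
# LSA-based sentence selection sketch.
import numpy as np

def lsa_select(sentences, num_concepts=2):
    """sentences: list of token lists; returns indices of selected sentences."""
    vocab = sorted({w for s in sentences for w in s})
    A = np.array([[s.count(w) for s in sentences] for w in vocab], dtype=float)
    U, S, Vt = np.linalg.svd(A, full_matrices=False)   # rows of Vt span the latent concepts
    selected = []
    for concept in range(min(num_concepts, Vt.shape[0])):
        idx = int(np.argmax(np.abs(Vt[concept])))      # sentence most aligned with concept
        if idx not in selected:
            selected.append(idx)
    return selected
```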

  20. Single-Document Summarization: Unsupervised Methods (cont.) • Probabilistic generative framework (Chen et al. 2009) • Criterion: maximum a posteriori (MAP) • Sentence generative model • Each sentence of the document is treated as a probabilistic generative model • The Language Model (LM), Sentence Topic Model (STM), and Word Topic Model (WTM) are investigated • Sentence prior model • The sentence prior is simply set to uniform here • It may instead reflect duration/position, correctness of sentence boundaries, confidence scores, prosodic information, etc.

  21. Single-Document Summarization: Unsupervised Methods (cont.) • Language Model (LM) approach (literal term matching) • Each sentence S is scored by the likelihood of it generating the document D, with the sentence model smoothed by the collection model: $P(D \mid S) = \prod_{w \in D} \big[\lambda\, P(w \mid S) + (1 - \lambda)\, P(w \mid C)\big]^{c(w, D)}$, where $P(w \mid S)$ is the sentence model, $P(w \mid C)$ is the collection model, and $\lambda$ is a weighting parameter • Sentence Topic Model (STM) approach (concept matching) • Word Topic Model (WTM) approach (concept matching)
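A minimal sketch of that LM scoring idea, using maximum-likelihood unigram estimates and a placeholder smoothing weight; log-probabilities are used to avoid underflow:

```python
# LM-based sentence ranking sketch: score a sentence by the log-likelihood of it
# generating the document, interpolating the sentence model with a collection model.
import math
from collections import Counter

def lm_log_score(sentence_tokens, doc_tokens, collection_tokens, lam=0.7):
    sent = Counter(sentence_tokens)
    coll = Counter(collection_tokens)
    log_score = 0.0
    for w, c in Counter(doc_tokens).items():
        p_sent = sent[w] / max(len(sentence_tokens), 1)
        p_coll = coll[w] / max(len(collection_tokens), 1)
        p = lam * p_sent + (1 - lam) * p_coll
        log_score += c * math.log(p if p > 0 else 1e-12)   # guard completely unseen words
    return log_score
```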

  22. Multi-Document Summarization • Task characteristics • Input: a set of documents on the same topic • Retrieved during an IR search • Clustered by a news browser • Problem: same topic or same event? • Output: a paragraph-length summary • Salient information across documents • Similarities between topics? • Redundancy removal is critical • Application-oriented task • News portals presenting articles from different sources • Corporate e-mails organized by subject • Medical reports about a patient

  23. Evaluation • Recall-Oriented Understudy for Gisting Evaluation (ROUGE) (Lin 2004) • Let $R$ be a set of reference summaries and let $s$ be a summary generated automatically by a system. Let $\tau_n(d)$ be a binary vector representing the n-grams contained in a document $d$ • The metric ROUGE-N is an n-gram recall-based statistic: $\mathrm{ROUGE\text{-}N}(s) = \frac{\sum_{r \in R} \langle \tau_n(r), \tau_n(s) \rangle}{\sum_{r \in R} \langle \tau_n(r), \tau_n(r) \rangle}$, where $\langle \cdot, \cdot \rangle$ denotes the usual inner product of vectors • The various versions of ROUGE were evaluated by computing the correlation coefficient between ROUGE scores and human judgment scores • ROUGE-2 performed the best among the ROUGE-N variants • Example (word-segmented Chinese): reference 「昨天 馬英九 訪問 中國大陸」 ("yesterday / Ma Ying-jeou / visited / mainland China") vs. system summary 「昨天 馬英九 結束 訪問 回國」 ("yesterday / Ma Ying-jeou / concluded / visit / returned home")
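A minimal sketch of ROUGE-N computed as n-gram recall with clipped counts (the common formulation, equivalent in spirit to the inner-product notation above), applied to the bilingual example from the slide:

```python
# ROUGE-N sketch: fraction of reference n-grams that also appear in the candidate.
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, references, n=2):
    cand = ngrams(candidate, n)
    matched = total = 0
    for ref in references:
        ref_ngrams = ngrams(ref, n)
        matched += sum(min(c, cand[g]) for g, c in ref_ngrams.items())
        total += sum(ref_ngrams.values())
    return matched / total if total else 0.0

reference = ["昨天", "馬英九", "訪問", "中國大陸"]
candidate = ["昨天", "馬英九", "結束", "訪問", "回國"]
print(rouge_n(candidate, [reference], n=1))   # ROUGE-1 = 3/4: 昨天, 馬英九, 訪問 match
```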

  24. Evaluation (cont.) • Lin et al. (2006) also proposed an information-theoretic method for the automatic evaluation of summaries • The central idea is to use a divergence measure (i.e., the Jensen-Shannon divergence) between a pair of probability distributions • The first distribution is derived from an automatic summary and the second from a set of reference summaries • Let $D$ be the set of documents to summarize; a distribution $P_R$ (with parameters $\theta_R$) is assumed to generate the reference summaries, while the summarization system is governed by some distribution $P_A$ • We may define a good summarizer as one for which $P_A$ is close to $P_R$ • One information-theoretic measure between distributions that is adequate for this is the KL divergence, $\mathrm{KL}(P_A \parallel P_R) = \sum_{w} P_A(w) \log \frac{P_A(w)}{P_R(w)}$ • However, the KL divergence is unbounded: it goes to infinity wherever $P_R$ vanishes and $P_A$ does not • Another problem is that the KL divergence is not symmetric

  25. Evaluation (cont.) • Hence, they propose to use the Jensen-Shannon divergence, which is bounded and symmetric: $\mathrm{JS}(P \parallel Q) = \tfrac{1}{2}\,\mathrm{KL}(P \parallel M) + \tfrac{1}{2}\,\mathrm{KL}(Q \parallel M)$, where $M = \tfrac{1}{2}(P + Q)$ • To evaluate a summary $s$ against a reference summary $r$, the negative JS divergence $-\mathrm{JS}(P_s \parallel P_r)$ between their distributions can be used for the purpose
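A minimal sketch of scoring a summary by the negative Jensen-Shannon divergence between its unigram distribution and a reference's; the toy sentences are illustrative only:

```python
# JS-divergence-based summary scoring sketch over unigram distributions.
import math
from collections import Counter

def distribution(tokens):
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def kl(p, q):
    return sum(pw * math.log(pw / q[w]) for w, pw in p.items() if pw > 0)

def js_divergence(p, q):
    vocab = set(p) | set(q)
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

system = distribution("the economy grew last quarter".split())
reference = distribution("the economy expanded in the last quarter".split())
print(-js_divergence(system, reference))    # closer to zero means a better summary
```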
