
Document Summarization



  1. Document Summarization Vinayak Gagrani, Neeraj Toshniwal, Abhishek Kabra. Guide: Pushpak Bhattacharya

  2. Outline • Introduction • Single Document Summarization • Multiple Document Summarization • Application • Evaluation • Conclusion

  3. Introduction • What is a summary? • A text produced from one or more texts • It conveys the important information of the original texts and is no longer than half their length • Three important aspects of a summary: • Summaries should be short • Summaries should preserve important information • Summaries may be produced from a single document or from multiple documents

  4. Common terms in the summarization literature • Extraction • Identifying important sections of the text and reproducing them verbatim • Abstraction • Aims to express the important material in new words • Fusion • Combining extracted parts coherently • Compression • Aims at discarding unimportant sections of the text

  5. Single Document Summarization • Early Works • Machine Learning Methods • Naïve-Bayes Methods • Rich Features and Decision Trees • Deep Natural Language Analysis Methods • Lexical Chaining • Rhetorical Structure Theory (RST)

  6. Early Works • Luhn, 1958 • Summarization based on measuring the significance of words according to their frequency • Deriving a significance factor for each sentence based on the number of significant words in it • Edmundson, 1969 • Word frequency and positional importance were incorporated • The presence of cue words and the skeleton of the document were also incorporated

  7. Naïve Bayes Method • A classifier based on applying Bayes' theorem with a strong independence assumption • s: a particular sentence; S: the set of sentences that make up the summary; F1, …, Fk: the features • Assuming independence of the features: P(s ∈ S | F1, …, Fk) = [ Π_j P(Fj | s ∈ S) · P(s ∈ S) ] / Π_j P(Fj) • Evaluation is done by analyzing the match with a human-extracted summary of the document (a scoring sketch follows)
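
  A minimal Python sketch of this scoring rule. The probability tables are assumed to have been estimated beforehand from a corpus with human-extracted summaries; all names here are illustrative, not part of any published system:

  def nb_score(features, p_f_given_s, p_s, p_f):
      # P(s in S | F1..Fk) under the feature-independence assumption:
      # product of P(Fj | s in S), times P(s in S), divided by product of P(Fj)
      score = p_s
      for f in features:
          score *= p_f_given_s[f] / p_f[f]
      return score

  # Sentences are then ranked by this score and the top n form the extract:
  # ranked = sorted(sentences, key=lambda s: nb_score(feats[s], pf_s, ps, pf), reverse=True)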

  8. Naïve Bayes Method • Term frequency–inverse document frequency (TF-IDF) • Increases proportionally to the number of times a word appears in the document • Offset by the frequency of the word in the corpus • Takes into account that certain words are more common than others, e.g. “the”, “is” • idf(t, D) = log( |D| / |{d ∈ D : t ∈ d}| ) • |D|: total number of documents in the corpus • |{d ∈ D : t ∈ d}|: number of documents in which the term t appears, i.e. tf(t, d) ≠ 0 (a sketch follows)
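
  A small self-contained sketch of the formula on a toy corpus (the corpus and the whitespace tokenization are illustrative):

  import math
  from collections import Counter

  def tf_idf(term, doc, corpus):
      tf = Counter(doc)[term]                   # raw count of the term in this document
      df = sum(1 for d in corpus if term in d)  # number of documents containing the term
      return tf * math.log(len(corpus) / df) if df else 0.0

  corpus = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]
  print(tf_idf("cat", corpus[0], corpus))  # informative word: positive score
  print(tf_idf("the", corpus[0], corpus))  # appears everywhere: idf = log(1) = 0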

  9. Rich Features and Decision Trees • Weighting sentences based on their position • Arises from the idea that texts generally follow a predictable discourse structure • The yield of each sentence position was calculated against the topic keywords • Sentence positions were then ranked by average yield to produce an Optimal Position Policy for the genre (a sketch follows) • Later, the sentence extraction problem was modeled using decision trees, dropping the assumption that the features are independent
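
  A sketch of how an Optimal Position Policy might be applied once learned. The ranked position list below is invented for illustration; in the original work it is estimated per genre from topic-keyword yields:

  # Ranked list of (paragraph, sentence) positions, best first; illustrative only.
  OPP = [(0, 0), (1, 0), (2, 0), (1, 1)]

  def position_score(paragraph, sentence, opp=OPP):
      # Higher score for positions ranked earlier in the policy, 0 otherwise.
      loc = (paragraph, sentence)
      return 1.0 - opp.index(loc) / len(opp) if loc in opp else 0.0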

  10. Deep Natural Language Analysis Methods • Techniques aimed at modeling the text’s discourse structure • Use of heuristics to create document extracts • Lexical Chaining • independent of the grammatical structure of the text • list of words that captures a portion of the cohesive structure of the text • sequence of related words in the text, spanning short or long distances • technique used to identify the central theme of a document

  11. Forms of Cohesion • Ellipsis • Words are omitted where a phrase would otherwise be repeated • Example: • A: Where are you going? • B: To town. • Substitution • A word is not omitted but replaced by another • Example: • A: Which ice-cream would you like? • B: I would like the pink one.

  12. Forms of Cohesion • Conjunction • A relationship between two clauses • A few of them are: “and”, “then”, “however” • Repetition • Mentioning the same word again • Reference • Anaphoric reference • Refers to someone/something that has been previously identified • Cataphoric reference • Forward referencing. Example: Here he comes… It’s Brad Pitt

  13. Lexical chaining • Example: John had mud pie for dessert. Mud pie is made of chocolate. John really enjoyed it. • Steps involved in lexical chaining: a) selecting a set of candidate words b) for each candidate word, finding an appropriate chain relying on a relatedness criterion among members of the chain c) if one is found, inserting the word in the chain and updating the chain accordingly

  14. Lexical Chaining • Relatedness measure: WordNet distance • Weights are assigned to chains based on their length and homogeneity • The strength of a lexical chain is determined by considering the distribution of its elements throughout the text • The strength corresponds to the significance of the textual context the chain embodies • Provides a basis for identifying the topical units in a document, which are of great importance in document summarization (a simplified sketch follows)
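
  A simplified greedy chaining sketch using WordNet path similarity as the relatedness criterion. It requires NLTK with the WordNet data; the threshold and the noun-only restriction are illustrative simplifications of the published algorithms:

  from nltk.corpus import wordnet as wn  # run nltk.download('wordnet') first

  def related(w1, w2, threshold=0.2):
      # Any pair of noun senses close enough in WordNet counts as related.
      return any((s1.path_similarity(s2) or 0) >= threshold
                 for s1 in wn.synsets(w1, pos=wn.NOUN)
                 for s2 in wn.synsets(w2, pos=wn.NOUN))

  def build_chains(candidates):
      chains = []
      for word in candidates:
          for chain in chains:
              if any(related(word, member) for member in chain):
                  chain.append(word)  # insert into the first related chain
                  break
          else:
              chains.append([word])   # no related chain found: start a new one
      return chains

  print(build_chains(["pie", "dessert", "chocolate", "john"]))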

  15. Rhetorical Structure Theory (RST) • Relates two non-overlapping text spans: the nucleus and the satellite • The nucleus expresses what is more essential to the writer's purpose than the satellite • Example: a claim followed by evidence for the claim; RST posits an "Evidence" relation between the two spans • The claim is more essential to the text than the particular evidence • The claim span is the nucleus and the evidence span the satellite • The nucleus is independent of the satellite, but not vice versa

  16. Rhetorical Structure Theory (RST)

  17. Multiple Document Summarization • Need and encouragement • Extraction of a single summary from multiple documents started in the mid 1990s • Most applications are in news articles • Google News (news.google.com) • Columbia Newsblaster (newsblaster.cs.columbia.edu) • News in Essence (NewsInEssence.com) • Multiple sources of information may be: • supplementary to each other • overlapping in content • even contradictory at times

  18. Early Work • Extended template-driven message understanding systems • Abstractive systems rely heavily on internal NLP tools • Summarization was earlier treated as a problem of • language interpretation • generation • Extractive techniques have been applied: • similarity measures between sentences • identifying common themes through clustering • selecting one sentence to represent each cluster, or • generating a composite sentence from each cluster • Summarization systems differ in their final goal • MEAD: based on extraction techniques over general domains • SUMMONS: builds a briefing highlighting differences and updates across news reports

  19. Abstraction and Information Fusion • SUMMONS is the first example of a multi-document summarization system • Considers events in a narrow domain, e.g. news articles about terrorism • It produces a briefing merging relevant information about the event and its evolution over time • It reads a database built by a template-based message understanding system • A concatenation of two components: a Content Planner and a Linguistic Generator

  20. SUMMONS – processing the text (Content Planner) • Content Planner: selects the information to include in the summary through combination of the input templates • It uses summary operators, a set of heuristics that perform operations such as: • change of perspective, contradiction, refinement • Linguistic Generator: selects the right words to express the information as grammatical and coherent text • Uses connective phrases to synthesize the summary, adapting language generation tools such as FUF/SURGE

  21. Theme-based approach – McKeown et al., Barzilay et al. • Themes: sets of similar text units (paragraphs); finding them is a clustering problem • Text is mapped to a vector of features including single words weighted by their TF-IDF scores, nouns, pronouns, and semantic classes of verbs • For each pair of paragraphs, a vector is computed that represents matches on the different features • Decision rules learnt from data classify each pair as similar or dissimilar; an algorithm then places the most related paragraphs in the same theme (a simplified sketch follows) • Information fusion then decides which sentences of each theme should be included in the final summary
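
  As a stand-in for the learned decision rules, the sketch below groups paragraphs whose TF-IDF cosine similarity exceeds a threshold. This is a deliberate simplification of the feature-vector approach described above; the threshold is illustrative:

  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.metrics.pairwise import cosine_similarity

  def themes(paragraphs, threshold=0.3):
      sim = cosine_similarity(TfidfVectorizer().fit_transform(paragraphs))
      groups = []                              # each group of indices is one theme
      for i in range(len(paragraphs)):
          for g in groups:
              if any(sim[i, j] >= threshold for j in g):
                  g.append(i)                  # join the first sufficiently similar theme
                  break
          else:
              groups.append([i])               # otherwise start a new theme
      return [[paragraphs[i] for i in g] for g in groups]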

  22. Information Fusion • The algorithm compares and intersects the predicate-argument structures of the phrases within each theme to find those repeated often enough to be included in the summary • Sentences are parsed using Collins' statistical parser and converted into dependency trees, which capture predicate-argument structure and identify functional roles • The comparison algorithm traverses the trees recursively, adding identical nodes to an output tree (a toy sketch follows) • Once full phrases are found, they are marked for inclusion in the summary • Once the summary content is decided, grammatical text is generated using the FUF/SURGE language generation system
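
  A toy version of the recursive intersection step, using plain (word, children) tuples in place of full dependency parses; real systems align much richer predicate-argument structures:

  def intersect(t1, t2):
      # Keep a node only if the same word heads both trees; recurse on children.
      if t1 is None or t2 is None or t1[0] != t2[0]:
          return None
      shared = []
      for c1 in t1[1]:
          for c2 in t2[1]:
              m = intersect(c1, c2)
              if m is not None:
                  shared.append(m)
      return (t1[0], shared)

  a = ("charged", [("McVeigh", []), ("bombing", [("the", [])])])
  b = ("charged", [("McVeigh", [("27", [])]), ("bombing", [])])
  print(intersect(a, b))  # ('charged', [('McVeigh', []), ('bombing', [])])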

  23. Dependency Tree: “McVeigh, 27, was charged with the bombing”

  24. Topic-Driven Summarization • MMR: Maximal Marginal Relevance, introduced by Carbonell and Goldstein • Rewards relevant sentences and penalizes redundant ones by considering a linear combination of two similarity measures • Q: the query or user profile; R: a ranked list of documents; S: the already selected documents • Documents are selected one at a time and added to S • For each document Di in R\S: MR(Di) = a · Sim1(Di, Q) − (1 − a) · max over Dj ∈ S of Sim2(Di, Dj), where a ∈ [0, 1] • The document with maximum MR(Di) is selected, until a maximum number of documents or a threshold is reached (a sketch follows) • a controls the relative importance of relevance versus redundancy • Sim1 and Sim2 are similarity measures (e.g. cosine similarity)
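
  A minimal MMR selection loop. In practice sim1 and sim2 would typically both be cosine similarity over TF-IDF vectors; the parameter values here are illustrative:

  def mmr_select(query, ranked, sim1, sim2, a=0.7, n=5):
      selected, candidates = [], list(ranked)
      while candidates and len(selected) < n:
          def mr(d):
              # Redundancy is the maximum similarity to anything already selected.
              redundancy = max((sim2(d, s) for s in selected), default=0.0)
              return a * sim1(d, query) - (1 - a) * redundancy
          best = max(candidates, key=mr)   # greedily take the highest marginal relevance
          candidates.remove(best)
          selected.append(best)
      return selected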

  25. Graph Spreading Activation • Content is represented as a graph, with entities as nodes and relations as edges • Rather than extracting sentences, salient regions of the graph are detected • Topic driven: the topic is denoted by entry nodes in the graph • The graph: • each node is a single occurrence of a word • different kinds of links: adjacency, SAME, ALPHA, phrase, name, and coreference links

  26. Graph Spreading Activation • Topic nodes are identified through stem comparison and marked as entry nodes • Spreading activation: the search for semantically related text propagates from these to the other nodes of the graph • The weight of a neighbouring node depends on the links traversed and is an exponentially decaying function of the distance (a sketch follows) • For a pair of document graphs: identify common nodes and difference nodes, and highlight sentences with high common and difference scores • The user can specify a maximal number of sentences to control the output
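
  A sketch of the propagation step, with weights decaying exponentially with distance from the entry nodes; the decay constant and cutoff are illustrative:

  from collections import deque

  def spread(graph, entry_nodes, decay=0.5, cutoff=0.01):
      # graph: dict mapping each node to its list of neighbours
      weights = {n: 1.0 for n in entry_nodes}
      frontier = deque(entry_nodes)
      while frontier:
          node = frontier.popleft()
          for nb in graph.get(node, []):
              w = weights[node] * decay  # one more link: one more decay factor
              if w >= cutoff and w > weights.get(nb, 0.0):
                  weights[nb] = w        # keep the strongest activation found so far
                  frontier.append(nb)
      return weights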

  27. Centroid-based Summarization • Does not use any language generation module; easily scalable and domain-independent • Topic detection: group together news articles that describe the same event • An agglomerative clustering algorithm is used; it operates on TF-IDF vector representations, successively adding documents to clusters and recomputing the centroids according to cj = (1 / |Cj|) · Σ over d ∈ Cj of d • cj is the centroid of the j-th cluster, Cj the set of documents that belong to that cluster • Centroids can thus be considered pseudo-documents that include those words whose TF-IDF scores are above a threshold in their cluster

  28. Centroid-based Summarization • Second stage: identify sentences that are central to the topic of the entire cluster • Two metrics similar to MMR (but not query dependent) are defined by Radev et al., 2000 • Cluster-based relative utility (CBRU): how relevant a particular sentence is to the general topic of the cluster • Cross-sentence informational subsumption (CSIS): a measure of redundancy among sentences • Given a cluster segmented into n sentences and a compression rate R, nR sentences are selected in order of appearance in the chronologically arranged documents • The final score of each sentence is the sum of the three scores below, minus a redundancy penalty (Rs) for sentences that overlap highly ranked sentences (a sketch follows) • Centroid value (Ci): the sum of the centroid values of all the words in the sentence • Positional value (Pi): makes leading sentences more important • First-sentence overlap (Fi): the inner product of the word occurrence vector of sentence i and that of the first sentence of the document
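
  A sketch of the per-sentence score. The combination here is unweighted and the redundancy penalty is left as a comment; Radev et al., 2000 weight and tune these components:

  def sentence_score(words, centroid, index, n_sentences, first_sentence):
      c = sum(centroid.get(w, 0.0) for w in words)          # centroid value Ci
      p = (n_sentences - index + 1) / n_sentences * c       # positional value Pi (simplified)
      f = sum(words.count(w) * first_sentence.count(w)      # first-sentence overlap Fi
              for w in set(words))
      return c + p + f  # a penalty Rs is subtracted for overlap with higher-ranked sentences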

  29. Application • Google News: • a news aggregator, selecting the most up-to-date (within the past 30 days) information from thousands of publications by an automatic aggregation algorithm • different versions are available for more than 60 regions in 28 languages • Ultimate Research Assistant: • performs text mining on Internet search results • makes it easier for the user to perform online research by organizing the output • type the name of a topic and it will search the web for highly relevant resources and organize the search results

  30. Application • Shablast: • a universal search engine • produces multi-document summaries from the top 50 results returned by Microsoft's Bing search engine for a set of keywords • iResearch Reporter: • a commercial text extraction and text summarization system • produces categorized, easily readable natural-language summary reports covering multiple documents retrieved by running the user's query through the Google search engine

  31. Application

  32. Evaluation • A difficult task • The absence of a standard human or automatic evaluation metric • makes it difficult to compare different systems and to establish a baseline • Manual evaluation is not feasible at scale • Need for an evaluation metric with a high correlation with human scores • Human and automatic evaluation: • comparison of automatically generated summaries with manually written "ideal" summaries, after decomposition of the text into sentences • a rating between 1 and 4 is given to each system unit (SU) that shares content with a model unit (MU) of the ideal summaries

  33. Evaluation • ROUGE • based only on content overlap (a sketch follows) • can determine whether the same general concepts are discussed in an automatic summary and a reference summary • cannot determine whether the result is coherent or the sentences flow together in a sensible manner • works better for single-document summarization • Information-theoretic evaluation of summaries • the central idea is to use a divergence measure between a pair of probability distributions • the first distribution is derived from the automatic summary • the second from a set of reference summaries • suits both the single-document and multi-document summarization scenarios
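
  A minimal sketch of ROUGE-1 recall, the simplest instance of this content-overlap idea; real ROUGE also handles multiple references, longer n-grams, and stemming:

  from collections import Counter

  def rouge_1_recall(system_tokens, reference_tokens):
      sys_counts = Counter(system_tokens)
      ref_counts = Counter(reference_tokens)
      # Count reference tokens that the system summary also covers.
      overlap = sum(min(c, sys_counts[w]) for w, c in ref_counts.items())
      return overlap / sum(ref_counts.values())

  print(rouge_1_recall("the cat sat on the mat".split(),
                       "the cat was on the mat".split()))  # 5 of 6 reference tokens matched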

  34. Conclusion • Need to develop efficient and accurate summarization systems, given the enormous rate of information growth • A lot of research is still going on in this field, especially in evaluation techniques • Multi-document summarization is more widely used than single-document summarization • Extractive techniques are usually employed rather than abstractive ones, as they are easier to implement and have produced satisfactory results

  35. References • A Survey on Automatic Summarization – Dipanjan Das and Andre F. T. Martins (http://www.cs.cmu.edu/~afm/Home_files/Das_Martins_survey_summarization.pdf) • Wikipedia • Relevance of Cluster Size in MMR Based Summarizer (http://www.cs.cmu.edu/~madhavi/publications/Ganapathiraju_11-742Report.pdf)
