Summary Generation Keith Trnka

The approach • Apply Marcu's basic summarizer (1999) to perform content selection • Re-generate the selected content so that it's more natural

RST Refresher • A text is composed of elementary discourse units (EDUs) • What constitutes an EDU varies from author to author • Common consensus that they are no larger than sentences • Text spans • An EDU is a text span • A sequence of adjacent text spans in some rhetorical relation is a text span

RST Refresher (cont'd) • A rhetorical relation is the relationship between text spans • Some relations have the notion of nuclearity: one sub-span (the nucleus) is the one to which all other sub-spans (the satellites) relate • These relations are called mononuclear • Example: [When I got home,] circumstance-for [I was tired] • Other relations are called multinuclear • There is no most-important sub-span • Example: [Cats scratch] contrast-with [, but dogs bite.]

RST Discourse Treebank • RST analyses of 385 WSJ articles from Penn Treebank • Available from LDC (http://www.ldc.upenn.edu) • Overview can be found in (Carlson et al. 2001) • Annotation manual is (Carlson, Marcu 2001) • Thanks to the department for buying it

RST Discourse Treebank (cont'd) • Notes about the annotation • EDUs are clause-like • Mononuclear relations were forced to be binary • Relative clauses and appositives can be embedded relations

RST Discourse Treebank (cont'd) • Statistical analysis of 335 training documents • 98% of spans are binary (two children) • For binary mononuclear relations: • Nucleus-satellite order can be predicted with 87% accuracy, given the relation, using predict-majority
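The predict-majority baseline mentioned above can be sketched as follows. This is a minimal illustration, not the code behind the 87% figure: the function names and the toy data are invented here, and the order labels "NS" (nucleus first) and "SN" (satellite first) are an assumed encoding.

```python
from collections import Counter, defaultdict

def train_majority(examples):
    """Map each relation to its most frequent nucleus-satellite order.

    examples: iterable of (relation, order) pairs, order being "NS" or "SN".
    """
    counts = defaultdict(Counter)
    for relation, order in examples:
        counts[relation][order] += 1
    return {rel: c.most_common(1)[0][0] for rel, c in counts.items()}

def accuracy(model, examples):
    """Fraction of examples whose order matches the majority prediction."""
    examples = list(examples)
    correct = sum(1 for rel, order in examples if model.get(rel) == order)
    return correct / len(examples)

# Toy data, invented for illustration (not taken from the treebank).
toy = (
    [("elaboration", "NS")] * 3
    + [("elaboration", "SN")]
    + [("circumstance", "SN")] * 2
    + [("circumstance", "NS")]
)
model = train_majority(toy)
toy_accuracy = accuracy(model, toy)
```

On the toy data the model predicts "NS" for elaboration and "SN" for circumstance, getting 5 of the 7 examples right.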

Marcu's Content Selection Algorithm • Described in (Marcu 1999) • Promotion sets • The promotion set of each span is the union of all promotion sets of nuclear sub-spans • The promotion set of an EDU is the EDU itself
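The recursive definition of promotion sets above can be sketched in a few lines. The `Span` class and its field names are assumptions made for this illustration; they are not the treebank's format or the presenter's data structures.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

@dataclass
class Span:
    """A node in an RST tree: a leaf EDU or an internal span with children."""
    edu: Optional[int] = None                      # EDU index for leaves, else None
    children: List["Span"] = field(default_factory=list)
    nuclear: bool = True                           # role of this span within its parent

def promotion_set(span: Span) -> Set[int]:
    """Promotion set of an EDU is the EDU itself; for a larger span it is
    the union of the promotion sets of its nuclear sub-spans."""
    if span.edu is not None:
        return {span.edu}
    promoted: Set[int] = set()
    for child in span.children:
        if child.nuclear:
            promoted |= promotion_set(child)
    return promoted

# Mononuclear example: EDU 1 is the nucleus, EDU 2 its satellite.
mono = Span(children=[Span(edu=1), Span(edu=2, nuclear=False)])
# Multinuclear example: both EDUs are nuclei.
multi = Span(children=[Span(edu=1), Span(edu=2)])
```

Here `promotion_set(mono)` contains only EDU 1, while `promotion_set(multi)` contains both EDUs.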

Marcu's Content Selection Algorithm (cont'd) • Build a partial ordering of EDUs* • For each EDU, find the topmost span in which it's in the promotion set. Let d be the tree depth of this span. • The rank of each EDU is • If the EDU is in an embedded relation, d + 1 • Otherwise, d • Example of the partial ordering *re-worded from Marcu's description
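A sketch of the ranking step as re-worded above. It assumes the root span has depth 0 and that embedded EDUs are marked with a hypothetical `embedded` flag; both choices, and all names, are assumptions of this example rather than details from Marcu's description.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional, Set

@dataclass
class Span:
    edu: Optional[int] = None                      # EDU index for leaves, else None
    children: List["Span"] = field(default_factory=list)
    nuclear: bool = True
    embedded: bool = False                         # hypothetical flag for embedded relations

def promotion_set(span: Span) -> Set[int]:
    if span.edu is not None:
        return {span.edu}
    promoted: Set[int] = set()
    for child in span.children:
        if child.nuclear:
            promoted |= promotion_set(child)
    return promoted

def leaves(span: Span) -> Iterator[Span]:
    if span.edu is not None:
        yield span
    for child in span.children:
        yield from leaves(child)

def rank_edus(root: Span) -> Dict[int, int]:
    """Rank = depth of the topmost span promoting the EDU (+1 if embedded)."""
    ranks: Dict[int, int] = {}

    def visit(span: Span, depth: int) -> None:
        # Pre-order walk: ancestors are seen before descendants, so the
        # first depth recorded for an EDU belongs to the topmost span.
        for edu in promotion_set(span):
            ranks.setdefault(edu, depth)
        for child in span.children:
            visit(child, depth + 1)

    visit(root, 0)
    for leaf in leaves(root):
        if leaf.embedded:
            ranks[leaf.edu] += 1
    return ranks

tree = Span(children=[
    Span(children=[Span(edu=1), Span(edu=2, nuclear=False, embedded=True)]),
    Span(edu=3, nuclear=False),
])
ranks = rank_edus(tree)
```

In this toy tree EDU 1 is promoted to the root (rank 0), EDU 3 is a satellite of the root (rank 1), and EDU 2 sits at depth 2 in an embedded relation (rank 3). Lower ranks are more salient, and EDUs with equal rank form one group in the partial ordering.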

Marcu's Content Selection Algorithm (cont'd) • Given a summary length requirement • Select the topmost EDU groups until it isn't possible to select more and honor the length requirement • Effect: can't always generate a summary as close to desired length as possible
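The group-wise selection can be sketched as below, assuming ranks have already been computed (lower is more salient) and that length is measured in words. The function name, the stop-at-first-overflow reading, and the toy numbers are assumptions of this example.

```python
from itertools import groupby
from typing import Dict, List

def select_edus(ranks: Dict[int, int], lengths: Dict[int, int],
                max_length: int) -> List[int]:
    """Add whole rank groups, most salient first, until the next group
    would push the summary past max_length."""
    selected: List[int] = []
    total = 0
    ordered = sorted(ranks, key=lambda edu: ranks[edu])
    for _, group in groupby(ordered, key=lambda edu: ranks[edu]):
        members = list(group)
        cost = sum(lengths[edu] for edu in members)
        if total + cost > max_length:
            break                                  # groups are all-or-nothing
        selected.extend(members)
        total += cost
    return sorted(selected)                        # restore document order

# Toy values, invented for illustration.
toy_ranks = {1: 0, 2: 2, 3: 1, 4: 1}
toy_lengths = {1: 5, 2: 4, 3: 6, 4: 7}
short_summary = select_edus(toy_ranks, toy_lengths, max_length=12)
longer_summary = select_edus(toy_ranks, toy_lengths, max_length=18)
```

With a 12-word limit only EDU 1 (5 words) is selected, because the rank-1 group costs 13 more words. That is the effect noted above: the summary can fall well short of the requested length, or be empty if even the first group does not fit.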

Generation desiderata • Removal of problems • Dangling references • Dangling discourse markers • Introduction of coherence • Generate smaller referring expressions • Generate discourse markers when appropriate

Example Claude Bebear, chairman and chief executive officer, of Axa-Midi Assurances, pledged to retain employees and management of Farmers Group Inc.. Mr. Bebear made his remarks at a breakfast meeting with reporters here yesterday as part of a tour. Farmers was quick yesterday to point out the many negative aspects. For one, Axa plans to do away with certain tax credits.

The theoretical approach • Content selection • Marcu's summarization algorithm • Paragraph generation • Organize sentences into paragraphs • Sentence generation • Construct complete sentences from EDUs

The theoretical approach (cont'd) • Discourse marker generation • Remove discourse markers that refer to removed text spans • Generate discourse markers when none exists and one is appropriate • Referring expression generation • Generate the best unambiguous referring expressions • Shorter is better • Faster to interpret is better

The implemented approach • Content selection • Marcu's algorithm as stated • Paragraph generation • Not implemented

Implementation: Sentence “generation” • If a selected group of EDUs is an entire text span • select them all as-is, uppercase the front and make sure it ends with punctuation • If a selected group of EDUs is an entire text span, except for some embedded relations • Remove punctuation associated with embeddings, add sentence terminators from embeddings • If a selected group of EDUs is a sentence • Select as-is • If a selected EDU isn't part of such a group • uppercase the front and end with punctuation
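The "uppercase the front and end with punctuation" step could look like the sketch below. The function name and the exact set of trailing characters that get stripped are assumptions; handling of quotation marks and of punctuation tied to embedded relations is left out.

```python
def finish_sentence(text: str) -> str:
    """Capitalize the first character and make sure the text ends with a
    sentence terminator."""
    text = text.strip()
    if not text:
        return text
    text = text[0].upper() + text[1:]
    if text[-1] not in ".!?":
        # Drop dangling clause punctuation before adding the period.
        text = text.rstrip(",;: ") + "."
    return text

fixed_clause = finish_sentence("when I got home,")
unchanged = finish_sentence("I was tired.")
```

Applied to the EDU "when I got home," this yields "When I got home.", while an EDU that is already a complete sentence passes through unchanged.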

Implementation: Discourse marker generation • Train to see which discourse markers go with which relations • In generation, select discourse markers with a probability > 80%
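One way to read the two bullets above is as a relative-frequency estimate of P(marker | relation) followed by a threshold test. The sketch below follows that reading; the function names, the use of `None` for "no marker", and the toy counts are all assumptions, not the presenter's implementation.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Optional, Tuple

def train_marker_probs(
    observations: Iterable[Tuple[str, Optional[str]]]
) -> Dict[str, Dict[str, float]]:
    """Estimate P(marker | relation) from (relation, marker or None) pairs."""
    counts = defaultdict(Counter)
    for relation, marker in observations:
        counts[relation][marker] += 1
    probs: Dict[str, Dict[str, float]] = {}
    for relation, counter in counts.items():
        total = sum(counter.values())
        probs[relation] = {
            m: n / total for m, n in counter.items() if m is not None
        }
    return probs

def choose_marker(probs: Dict[str, Dict[str, float]], relation: str,
                  threshold: float = 0.8) -> Optional[str]:
    """Return the most likely marker only if its probability exceeds the threshold."""
    candidates = probs.get(relation, {})
    if not candidates:
        return None
    marker, p = max(candidates.items(), key=lambda kv: kv[1])
    return marker if p > threshold else None

# Toy counts, invented for illustration.
toy_obs = (
    [("contrast", "but")] * 9 + [("contrast", None)]
    + [("elaboration", None)] * 8 + [("elaboration", "also")] * 2
)
toy_probs = train_marker_probs(toy_obs)
```

With these toy counts a marker is chosen for contrast ("but", 0.9) but not for elaboration (0.2). A threshold this high means most relations never trigger generation.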

Training on discourse markers • Discourse markers identified by string matching at beginning and ending of each EDU • List of markers taken from (Knott and Dale 1994)
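String matching at the edges of an EDU can be sketched as follows. The marker list here is a tiny illustrative stand-in, not the list used in the talk, and the tokenization (lowercased alphabetic words, punctuation ignored) is an assumption.

```python
import re
from typing import List, Sequence, Tuple

# Illustrative stand-in only; the talk used a much longer published list.
MARKERS = ["although", "because", "but", "however", "when"]

def find_markers(edu: str,
                 markers: Sequence[str] = MARKERS) -> List[Tuple[str, str]]:
    """Return (marker, position) pairs for markers found at the start or
    end of the EDU, ignoring case and surrounding punctuation."""
    words = re.findall(r"[a-z']+", edu.lower())
    found = []
    for marker in markers:
        tokens = marker.split()                    # supports multi-word markers
        if words[:len(tokens)] == tokens:
            found.append((marker, "start"))
        elif words[-len(tokens):] == tokens:
            found.append((marker, "end"))
    return found

contrast_hit = find_markers(", but dogs bite.")
circumstance_hit = find_markers("When I got home,")
no_hit = find_markers("I was tired")
```

Matching whole tokens rather than raw prefixes avoids false hits such as "butter" for "but". It still cannot tell a discourse use of a word from a non-discourse use.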

Training on discourse markers (cont'd) • Three statistics trained on binary, atomic spans with zero or one markers • Inclusion • Usage • Position

Rough evaluation • Sentence “generation” differs little from leaving the text unchanged • Except for embedded relation removal • Out of 347 summaries, a discourse marker was generated only once: • Ms. Johnson is awed by the earthquake's destructive force. "It really brings you down to a human level," Though "It's hard to accept all the suffering but you have to.

Desired approach: Content selection • Marcu's algorithm can only select groups of EDUs • Sometimes produces overly short summaries or nothing at all • If a preferential ordering could be defined within each group of equally ranked EDUs, summaries could meet the desired length better • EDUs tied to more salient EDUs have their score boosted

Desired approach: Paragraph generation • Paragraphs in the source document are marked • Leave paragraph boundaries intact if they form large enough paragraphs • A shallow method, but has potential • Correlate paragraph boundaries with something • RS-tree structure • Co-reference chain beginnings/endings • Topical text segments, by an extension of Hearst's text segmentation algorithm (Hearst 1994)
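The shallow method (keep a source paragraph boundary only when the paragraph before it is large enough) might be sketched like this. The minimum size, the merge-into-previous policy, and all names are assumptions made for illustration; the slide leaves them open.

```python
from typing import List, Sequence, Tuple

def build_paragraphs(sentences: Sequence[Tuple[int, str]],
                     min_sentences: int = 2) -> List[List[str]]:
    """Group summary sentences into paragraphs.

    sentences: (source_paragraph_id, text) pairs in document order.
    A source boundary is honored only if the paragraph built so far has
    at least min_sentences sentences; otherwise the text is merged in.
    """
    paragraphs: List[List[str]] = []
    last_id = None
    for para_id, text in sentences:
        at_boundary = para_id != last_id
        previous_big_enough = (
            not paragraphs or len(paragraphs[-1]) >= min_sentences
        )
        if at_boundary and previous_big_enough:
            paragraphs.append([text])
        else:
            paragraphs[-1].append(text)
        last_id = para_id
    return paragraphs

toy_summary = [(0, "A."), (1, "B."), (1, "C."), (2, "D."), (2, "E.")]
toy_paragraphs = build_paragraphs(toy_summary)
```

In the toy input the lone sentence from source paragraph 0 is merged with paragraph 1, while the boundary before paragraph 2 survives. The final paragraph can still end up short; a fuller version would need a policy for that case.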

Desired approach: Sentence generation • Apply shallow parsing to understand the rough syntactic structure of an EDU • Relative clauses can be attached and full sentences generated like (Siddharthan 2004)

Desired approach: Discourse marker generation • The probabilities computed in DM training aren't the best • Need to attach discourse markers and recompute, repeat until stable • The attachment algorithm involves a constraint-satisfaction problem • DM attachment is needed to perform DM removal • A DM generator should understand syntax better • When should commas be included and where?

Desired approach: Referring expression generation • Requires good co-reference resolution • A reference resolver requires (at least) a base noun phrase chunker • EDUs might be used in conjunction with a shallow parse to approximate Hobbs' naïve approach • Mitkov (2002) describes Hobbs' naïve approach • Generation algorithm only adds the creation of a list of referring expressions, ordered by preference

Conclusions • Document length is poorly defined • Quite a bit of variation between EDU length, word length, and character length • Attaching discourse markers to the relation they realize is tough • Representing natural language in programs can be tough • Summarization of quotations requires special treatment

References • Lynn Carlson, Daniel Marcu, and Mary Ellen Okurowski (2001). Building a Discourse-Tagged Corpus in the Framework of Rhetorical Structure Theory. Proceedings of the 2nd SIGDIAL Workshop on Discourse and Dialogue, Eurospeech 2001, Denmark, September 2001. • Lynn Carlson and Daniel Marcu. (2001). Discourse Tagging Manual. ISI Tech Report ISI-TR-545. July 2001. • Marti Hearst (1994). Multi-Paragraph Segmentation of Expository Text. Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, Las Cruces, NM, June 1994. • Alistair Knott and Robert Dale (1994). Using Linguistic Phenomena to Motivate a Set of Coherence Relations. Discourse Processes 18(1): 35-62. • William Mann and Sandra Thompson (1988). Rhetorical Structure Theory: Toward a functional theory of text organization. Text 8(3): 243-281.

References (cont'd) • Daniel Marcu (1999). Discourse trees are good indicators of importance in text. In I. Mani and M. Maybury, editors, Advances in Automatic Text Summarization, pages 123-136, The MIT Press. • I think this is a cleanup of his earlier work from 1997. • Ruslan Mitkov (2002). Anaphora Resolution. Pearson Education. • Advaith Siddharthan (2004). Syntactic Simplification and Text Cohesion. To appear in the Journal of Language and Computation, Kluwer Academic Publishers, the Netherlands.
