Summary Generation Keith Trnka

Presentation Transcript


  1. Summary Generation Keith Trnka

  2. The approach • Apply Marcu's basic summarizer (1999) to perform content selection • Re-generate the selected content so that it's more natural

  3. RST Refresher • A text is composed of elementary discourse units (EDUs) • What constitutes an EDU varies from author to author • Common consensus that they are no larger than sentences • Text spans • An EDU is a text span • A sequence of adjacent text spans in some rhetorical relation is a text span

  4. RST Refresher (cont'd) • A rhetorical relation is the relationship between text spans • Some relations have the notion of nuclearity: one sub-span (nucleus) is the one to which all other sub-spans (satellites) relate • These relations are called mononuclear • Example: [When I got home,] circumstance-for [I was tired] • Other relations are called multinuclear • There is no most-important sub-span • Example: [Cats scratch] contrast-with [, but dogs bite.]

  5. RST Discourse Treebank • RST analyses of 385 WSJ articles from Penn Treebank • Available from LDC (http://www.ldc.upenn.edu) • Overview can be found in (Carlson et al. 2001) • Annotation manual is (Carlson, Marcu 2001) • Thanks to the department for buying it

  6. RST Discourse Treebank (cont'd) • Notes about the annotation • EDUs are clause-like • Mono-nuclear relations were forced to be binary • Relative clauses and appositives can be embedded relations

  7. RST Discourse Treebank (cont'd) • Statistical analysis of 335 training documents • 98% of spans are binary (two children) • For binary mononuclear relations: • Nucleus-satellite order can be predicted with 87% accuracy, given the relation, using predict-majority
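The predict-majority baseline mentioned above can be sketched as follows. This is a minimal illustration on made-up toy data, not the treebank statistics; the function names and the "NS"/"SN" order labels are assumptions introduced here.

```python
from collections import Counter, defaultdict


def train_majority_order(examples):
    """Learn the most frequent nucleus-satellite order for each relation.

    examples: iterable of (relation, order) pairs, where order is
    "NS" (nucleus first) or "SN" (satellite first). Hypothetical format.
    """
    counts = defaultdict(Counter)
    for relation, order in examples:
        counts[relation][order] += 1
    return {rel: c.most_common(1)[0][0] for rel, c in counts.items()}


def accuracy(model, examples):
    """Fraction of examples whose order matches the majority prediction."""
    examples = list(examples)
    correct = sum(1 for rel, order in examples if model.get(rel) == order)
    return correct / len(examples)


# Toy data, invented for illustration only.
toy = [
    ("circumstance", "SN"), ("circumstance", "SN"), ("circumstance", "NS"),
    ("elaboration", "NS"), ("elaboration", "NS"),
]
model = train_majority_order(toy)
print(model)
print(accuracy(model, toy))
```

On the toy data the model predicts satellite-first for circumstance and nucleus-first for elaboration, and gets four of the five examples right.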

  8. Marcu's Content Selection Algorithm • Described in (Marcu 1999) • Promotion sets • The promotion set of each span is the union of all promotion sets of nuclear sub-spans • The promotion set of an EDU is the EDU itself
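The promotion-set definition above is recursive, which a short sketch may make concrete. The `Span` class and its `role` field are assumptions made for this illustration, not a format taken from the treebank or from Marcu's paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    """A node in a rhetorical structure tree (hypothetical representation)."""
    role: str                      # "N" (nucleus) or "S" (satellite) w.r.t. the parent
    edu: Optional[int] = None      # EDU id for leaves, None for internal spans
    children: List["Span"] = field(default_factory=list)


def promotion_set(span):
    """An EDU promotes itself; a span promotes what its nuclear children promote."""
    if span.edu is not None:
        return {span.edu}
    promoted = set()
    for child in span.children:
        if child.role == "N":
            promoted |= promotion_set(child)
    return promoted


# Mononuclear example: EDU 2 is the nucleus of the inner span,
# and the inner span is the nucleus of the root.
inner = Span("N", children=[Span("S", edu=1), Span("N", edu=2)])
root = Span("N", children=[inner, Span("S", edu=3)])

# Multinuclear example: both children are nuclei, so both are promoted.
multi = Span("N", children=[Span("N", edu=4), Span("N", edu=5)])

print(promotion_set(root))   # {2}
print(promotion_set(multi))  # {4, 5}
```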

  9. Marcu's Content Selection Algorithm (cont'd) • Build a partial ordering of EDUs* • For each EDU, find the topmost span in which it's in the promotion set. Let d be the tree depth of this span. • The rank of each EDU is • If the EDU is in an embedded relation, d + 1 • Otherwise, d • Example of the partial ordering *re-worded from Marcu's description
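The ranking step above can be sketched as a top-down walk that records, for each EDU, the depth of the first span whose promotion set contains it. Several things here are assumptions for illustration: the `Span` class, counting the root as depth 0, and passing embedded EDUs in as a plain set of ids.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    """A node in a rhetorical structure tree (hypothetical representation)."""
    role: str                      # "N" or "S" with respect to the parent
    edu: Optional[int] = None      # EDU id for leaves
    children: List["Span"] = field(default_factory=list)


def promotion_set(span):
    if span.edu is not None:
        return {span.edu}
    promoted = set()
    for child in span.children:
        if child.role == "N":
            promoted |= promotion_set(child)
    return promoted


def rank_edus(root, embedded=frozenset()):
    """Map each EDU id to its rank; a lower rank means more important."""
    depth_of = {}

    def visit(span, depth):
        # Parents are visited before children, so the first depth recorded
        # for an EDU belongs to the topmost span that promotes it.
        for edu in promotion_set(span):
            depth_of.setdefault(edu, depth)
        for child in span.children:
            visit(child, depth + 1)

    visit(root, 0)  # assumption: the root span has depth 0
    return {edu: d + (1 if edu in embedded else 0) for edu, d in depth_of.items()}


inner = Span("N", children=[Span("S", edu=1), Span("N", edu=2)])
root = Span("N", children=[inner, Span("S", edu=3)])

print(rank_edus(root))                 # {2: 0, 1: 2, 3: 1}
print(rank_edus(root, embedded={3}))   # {2: 0, 1: 2, 3: 2}
```

EDUs that share a rank form one group, which is why the result is only a partial ordering.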

  10. Marcu's Content Selection Algorithm (cont'd) • Given a summary length requirement • Select the topmost EDU groups until it isn't possible to select more and honor the length requirement • Effect: can't always generate a summary as close to desired length as possible
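The group-wise selection above, and the reason it can undershoot the target length, can be sketched as follows. The rank and length dictionaries are invented inputs, and measuring length in words is an assumption; the slides do not fix a unit.

```python
from itertools import groupby


def select_groups(ranks, lengths, max_length):
    """Take whole rank groups, best rank first, while the budget allows.

    ranks:   {edu_id: rank}, lower rank is more important
    lengths: {edu_id: length in words} (assumed unit)
    """
    selected = []
    total = 0
    ordered = sorted(ranks, key=lambda edu: ranks[edu])
    for _, group in groupby(ordered, key=lambda edu: ranks[edu]):
        group = list(group)
        cost = sum(lengths[edu] for edu in group)
        if total + cost > max_length:
            break  # a group is all-or-nothing, so selection stops here
        selected.extend(group)
        total += cost
    return sorted(selected)  # restore document order


ranks = {1: 2, 2: 0, 3: 1, 4: 1}
lengths = {1: 5, 2: 6, 3: 4, 4: 7}

print(select_groups(ranks, lengths, 12))  # [2]
print(select_groups(ranks, lengths, 17))  # [2, 3, 4]
```

With a budget of 12 words only EDU 2 (6 words) is selected, because the next group costs 11 words as a unit even though EDU 3 alone would have fit.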

  11. Generation desiderata • Removal of problems • Dangling references • Dangling discourse markers • Introduction of coherence • Generate smaller referring expressions • Generate discourse markers when appropriate

  12. Example Claude Bebear, chairman and chief executive officer, of Axa-Midi Assurances, pledged to retain employees and management of Farmers Group Inc.. Mr. Bebear made his remarks at a breakfast meeting with reporters here yesterday as part of a tour. Farmers was quick yesterday to point out the many negative aspects. For one, Axa plans to do away with certain tax credits.

  13. The theoretical approach • Content selection • Marcu's summarization algorithm • Paragraph generation • Organize sentences into paragraphs • Sentence generation • Construct complete sentences from EDUs

  14. The theoretical approach (cont'd) • Discourse marker generation • Remove discourse markers that refer to removed text spans • Generate discourse markers when none exists and one is appropriate • Referring expression generation • Generate the best unambiguous referring expressions • Shorter is better • Faster to interpret is better

  15. The implemented approach • Content selection • Marcu's algorithm as stated • Paragraph generation • Not implemented

  16. Implementation: Sentence “generation” • If a selected group of EDUs is an entire text span • select them all as-is, uppercase the front and make sure it ends with punctuation • If a selected group of EDUs is an entire text span, except for some embedded relations • Remove punctuation associated with embeddings, add sentence terminators from embeddings • If a selected group of EDUs is a sentence • Select as-is • If a selected EDU isn't part of such a group • uppercase the front and end with punctuation
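The last case above (uppercase the front and end with punctuation) might look like the sketch below. Dropping a trailing comma, semicolon, or colon before adding a period is an assumption made here; the slides do not say how leftover clause punctuation was handled.

```python
def finish_sentence(text, terminators=".!?"):
    """Turn a bare EDU string into something that reads as a sentence."""
    # Assumption: trailing clause punctuation is dropped before terminating.
    text = text.strip().rstrip(",;:").rstrip()
    if not text:
        return text
    text = text[0].upper() + text[1:]
    if text[-1] not in terminators:
        text += "."
    return text


print(finish_sentence("when I got home,"))  # When I got home.
print(finish_sentence("I was tired."))      # I was tired.
```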

  17. Implementation: Discourse marker generation • Train to see which discourse markers go with which relations • In generation, select discourse markers with a probability > 80%
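One way to read the train-then-threshold idea above is as a conditional probability of a marker given a relation. The sketch below uses a single estimate, P(marker | relation), on invented counts; it is not the inclusion, usage, and position statistics of the actual system, and all names are hypothetical.

```python
from collections import Counter


def train_marker_probs(examples):
    """Estimate P(marker | relation) from (relation, marker-or-None) pairs."""
    relation_counts = Counter()
    pair_counts = Counter()
    for relation, marker in examples:
        relation_counts[relation] += 1
        if marker is not None:
            pair_counts[(relation, marker)] += 1
    return {
        (rel, marker): n / relation_counts[rel]
        for (rel, marker), n in pair_counts.items()
    }


def choose_marker(probs, relation, threshold=0.8):
    """Return the most probable marker above the threshold, else None."""
    candidates = [
        (p, marker)
        for (rel, marker), p in probs.items()
        if rel == relation and p > threshold
    ]
    return max(candidates)[1] if candidates else None


# Toy counts, invented for illustration only.
toy = (
    [("contrast", "but")] * 9 + [("contrast", None)]
    + [("elaboration", "also")] + [("elaboration", None)] * 3
)
probs = train_marker_probs(toy)

print(choose_marker(probs, "contrast"))     # but
print(choose_marker(probs, "elaboration"))  # None
```

A threshold this high means a marker is generated only for relations that are almost always marked in training.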

  18. Training on discourse markers • Discourse markers identified by string matching at beginning and ending of each EDU • List of markers taken from (Knott and Dale 1994)
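Matching markers at the edges of an EDU can be sketched as below. The marker list is a small illustrative stand-in rather than the published list, and the whole-word matching with longest-marker-first ordering is an assumption about how such matching could be done.

```python
# Illustrative markers only; the real list is much longer.
MARKERS = ["but", "when", "although", "because", "for example"]


def find_marker(edu, markers=MARKERS):
    """Return (marker, "start" or "end") if the EDU begins or ends with one."""
    # Strip surrounding punctuation so ", but dogs bite." still matches "but".
    words = edu.lower().strip(" \t\n.,;:!?\"'").split()
    # Try longer markers first so multi-word markers win over their prefixes.
    for marker in sorted(markers, key=len, reverse=True):
        parts = marker.split()
        if words[:len(parts)] == parts:
            return marker, "start"
        if words[-len(parts):] == parts:
            return marker, "end"
    return None


print(find_marker("When I got home,"))   # ('when', 'start')
print(find_marker(", but dogs bite."))   # ('but', 'start')
print(find_marker("I was tired"))        # None
```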

  19. Training on discourse markers (cont'd) • Three statistics trained on binary, atomic spans with zero or one markers • Inclusion • Usage • Position

  20. Rough evaluation • Sentence “generation” isn't much different from not changing it at all • Except embedded relation removal • Out of 347 summaries, a discourse marker was only generated once • Ms. Johnson is awed by the earthquake's destructive force. "It really brings you down to a human level," Though "It's hard to accept all the suffering but you have to.

  21. Desired approach: Content selection • Marcu's algorithm can only select groups of EDUs • Sometimes produces overly short summaries or nothing at all • If a preferential ordering could be defined within equivalence, summaries could meet the desired length better • EDUs tied to more salient EDUs have their score boosted

  22. Desired approach: Paragraph generation • Paragraphs in the source document are marked • Leave paragraph boundaries intact if they form large enough paragraphs • A shallow method, but has potential • Correlate paragraph boundaries with something • RS-tree structure • Co-reference chain beginnings/endings • Topical text segments, by an extension of Hearst's text segmentation algorithm (Hearst 1994)

  23. Desired approach: Sentence generation • Apply shallow parsing to understand the rough syntactic structure of an EDU • Relative clauses can be attached and full sentences generated like (Siddharthan 2004)

  24. Desired approach: Discourse marker generation • The probabilities computed in DM training aren't the best • Need to attach discourse markers and recompute, repeat until stable • The attachment algorithm involves a constraint-satisfaction problem • DM attachment needed to perform DM removal • A DM generator should understand syntax better • When should commas be included and where?

  25. Desired approach: Referring expression generation • Requires good co-reference resolution • A reference resolver requires (at least) a base noun phrase chunker • EDUs might be used in conjunction with a shallow parse to approximate Hobbs' naïve approach • Mitkov (2002) describes Hobbs' naïve approach • Generation algorithm only adds the creation of a list of referring expressions, ordered by preference

  26. Conclusions • Document length is poorly defined • Quite a bit of variation between EDU length, word length, and character length • Attaching discourse markers to the relation they realize is tough • Representing natural language in programs can be tough • Summarization of quotations requires special treatment

  27. References • Lynn Carlson, Daniel Marcu, and Mary Ellen Okurowski (2001). Building a Discourse-Tagged Corpus in the Framework of Rhetorical Structure Theory. Proceedings of the 2nd SIGDIAL Workshop on Discourse and Dialogue, Eurospeech 2001, Denmark, September 2001. • Lynn Carlson and Daniel Marcu. (2001). Discourse Tagging Manual. ISI Tech Report ISI-TR-545. July 2001. • Marti Hearst (1994). Multi-Paragraph Segmentation of Expository Text. Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, Las Cruces, NM, June 1994. • Alistair Knott and Robert Dale (1994). Using Linguistic Phenomena to Motivate a Set of Coherence Relations. Discourse Processes 18(1): 35-62. • William Mann and Sandra Thompson (1988). Rhetorical Structure Theory: Toward a functional theory of text organization. Text 8(3): 243-281.

  28. References (cont'd) • Daniel Marcu (1999). Discourse trees are good indicators of importance in text. In I. Mani and M. Maybury editors, Advances in Automatic Text Summarization, pages 123-136, The MIT Press. • I think this is a cleanup of his earlier work from 1997. • Ruslan Mitkov (2002). Anaphora Resolution. Pearson Education. • Advaith Siddharthan (2004). Syntactic Simplification and Text Cohesion. To appear in the Journal of Language and Computation, Kluwer Academic Publishers, the Netherlands.
