
Multi-document Summarization and Evaluation




  1. Multi-document Summarization and Evaluation

  2. Task Characteristics • Input: a set of documents on the same topic • Retrieved during an IR search • Clustered by a news browser • Problem: same topic or same event? • Output: a paragraph-length summary • Salient information across documents • Similarities between topics? • Redundancy removal is critical
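Redundancy removal (the last bullet above) is usually enforced greedily: drop any candidate sentence that is too similar to something already selected. Below is a minimal sketch using bag-of-words cosine similarity; the 0.5 threshold and the function names are illustrative assumptions, not a prescribed setting.

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(count * b[word] for word, count in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def remove_redundancy(candidate_sentences, threshold=0.5):
    """Greedily keep a sentence only if it is not too similar to any sentence
    already kept (simple redundancy filter; threshold is illustrative)."""
    kept, kept_bows = [], []
    for tokens in candidate_sentences:
        bow = Counter(tokens)
        if all(cosine(bow, prev) < threshold for prev in kept_bows):
            kept.append(tokens)
            kept_bows.append(bow)
    return kept
```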

  3. Some Standard Approaches • Salient information = similarities • Pairwise similarity between all sentences • Cluster sentences using similarity score (Themes) • Generate one sentence for each theme • Sentence extraction (one sentence/cluster) • Sentence fusion: intersect sentences within a theme and choose the repeated phrases. Generate sentence from phrases • Salient information = important words • Important words are simply the most frequent in the document set • SumBasic simply chooses sentences with the most frequent words. Conroy expands on this • Daume and Marcu have been the renegades
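A minimal sketch of the frequency-based idea behind SumBasic, under the assumption that sentences are already tokenized, lowercased, and stopword-filtered; this is illustrative code, not the original implementation:

```python
from collections import Counter

def frequency_summary(sentences, max_words=100):
    """SumBasic-style greedy extraction: score each sentence by the average
    probability of its words, pick the best, then down-weight used words."""
    tokens = [t for s in sentences for t in s]          # assumes non-empty input
    prob = {w: c / len(tokens) for w, c in Counter(tokens).items()}

    summary, length = [], 0
    remaining = list(sentences)
    while remaining and length < max_words:
        best = max(remaining, key=lambda s: sum(prob[w] for w in s) / max(len(s), 1))
        summary.append(best)
        length += len(best)
        remaining.remove(best)
        for w in best:                                  # redundancy control: p(w) <- p(w)^2
            prob[w] = prob[w] ** 2
    return summary
```

The theme-based approaches replace the scoring step with pairwise sentence similarity and clustering, then extract or fuse one sentence per theme; the down-weighting step above plays the same redundancy-control role.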

  4. Some Variations on Task • Focused summarization: given a topic/query, generate a summary • Update summaries: given an event over time, tell us what’s new • Multilingual summarization: generate an English summary of multiple documents in different languages

  5. DUC – Document Understanding Conference • Established and funded by DARPA TIDES • Run by independent evaluator NIST • Open to summarization community • Annual evaluations on common datasets • 2001-present • Tasks • Single document summarization • Headline summarization • Multi-document summarization • Multi-lingual summarization • Focused summarization

  6. DUC Evaluation • Gold Standard • Human summaries written by NIST • From 2 to 9 summaries per input set • Multiple metrics • Manual • Coverage (early years) • Pyramids (later years) • Responsiveness (later years) • Quality questions • Automatic • ROUGE (-1, -2, skip-bigrams, LCS, BE) • Granularity • Manual: sub-sentential elements • Automatic: sentences

  7. Considerations Across Evaluations • Independent evaluator • Not always as knowledgeable as researchers • Impartial determination of approach • Extensive collection of resources • Determination of task • Appealing to a broad cross-section of community • Changes over time • DUC 2001-2002 Single and multi-document • DUC 2003: headlines, multi-document • DUC 2004: headlines, multilingual and multi-document, focused • DUC 2005: focused summarization • DUC 2006: focused and a new task, up for discussion • How long do participants have to prepare? • When is a task dropped? • Scoring of text at the sub-sentential level

  8. Potential Problems

  9. Comparing Text Against Text • Which human summary makes a good gold standard? Many summaries are good • At what granularity is the comparison made? • When can we say that two pieces of text match?

  10. Variation impacts evaluation • Comparing content is hard • All kinds of judgment calls • Paraphrases • VP vs. NP • Ministers have been exchanged • Reciprocal ministerial visits • Length and constituent type • Robotics assists doctors in the medical operating theater • Surgeons started using robotic assistants

  11. Nightmare: only one gold standard • A system may have chosen an equally good sentence that is not in the one gold standard • Pinochet arrested in London on Oct 16 at a Spanish judge’s request for atrocities against Spaniards in Chile. • Former Chilean dictator Augusto Pinochet has been arrested in London at the request of the Spanish government • In DUC 2001 (one gold standard), the human model had a significant impact on scores (McKeown et al.) • Five human summaries needed to avoid changes in rank (Nenkova and Passonneau) • DUC 2003 data • 3 topic sets, 1 highest scoring and 2 lowest scoring • 10 model summaries

  12. Scoring • Two main approaches used in DUC • ROUGE (Lin and Hovy) • Pyramids (Nenkova and Passonneau) • Problems: • Are the results stable? • How difficult is it to do the scoring?

  13. ROUGE: Recall-Oriented Understudy for Gisting Evaluation • ROUGE: n-gram co-occurrence metrics measuring content overlap • ROUGE-N = (count of n-gram matches between the candidate and the model summaries) / (total n-grams in the model summaries)
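A minimal sketch of that ratio, assuming pre-tokenized candidate and model summaries; this is illustrative code, not the official ROUGE toolkit:

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, models, n=2):
    """ROUGE-N: clipped n-gram matches against the model summaries,
    divided by the total number of n-grams in the model summaries."""
    cand = ngrams(candidate, n)
    matches = total = 0
    for model in models:
        ref = ngrams(model, n)
        matches += sum(min(count, cand[gram]) for gram, count in ref.items())
        total += sum(ref.values())
    return matches / total if total else 0.0
```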

  14. ROUGE • Experimentation with different units of comparison: unigrams, bigrams, longest common subsequence, skip-bigrams, basic elements • Automatic and thus easy to apply • Important to consider confidence intervals when determining differences between systems • Scores falling within the same interval are not significantly different • ROUGE scores place systems into large groups: it can be hard to definitively say one system is better than another • Sometimes results are unintuitive: • Multilingual scores as high as English scores • Use in speech summarization shows no discrimination • Good for training regardless of intervals: can see trends
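The confidence intervals mentioned above are commonly obtained by bootstrap resampling over the input sets; a rough sketch, assuming we already have one per-topic ROUGE score for the system:

```python
import random

def bootstrap_ci(per_topic_scores, iters=1000, alpha=0.05):
    """Percentile-bootstrap confidence interval for a system's mean score."""
    n = len(per_topic_scores)
    means = []
    for _ in range(iters):
        sample = [random.choice(per_topic_scores) for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    return means[int(alpha / 2 * iters)], means[int((1 - alpha / 2) * iters) - 1]
```

Two systems whose intervals overlap are not declared significantly different, which is one reason ROUGE tends to sort systems into large, indistinguishable groups.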

  15. Comparison of Scoring Methods in DUC05 • Comparisons between Pyramid (original, modified), responsiveness, and ROUGE-SU4 • Pyramid score computed from multiple human summaries • Responsiveness is just one human’s judgment • ROUGE-SU4 equivalent to ROUGE-2
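ROUGE-SU4 counts skip-bigrams (ordered word pairs with a bounded gap) together with unigrams. A small sketch of the counting units; the exact windowing convention varies between implementations, so treat the gap handling here as an assumption:

```python
from collections import Counter

def su_units(tokens, max_gap=4):
    """Skip-bigrams with at most `max_gap` intervening words, plus unigrams
    (the counting units behind ROUGE-SU4, up to implementation details)."""
    units = Counter((t,) for t in tokens)                 # unigrams as 1-tuples
    for i, left in enumerate(tokens):
        for right in tokens[i + 1: i + 2 + max_gap]:      # 0..max_gap words in between
            units[(left, right)] += 1
    return units
```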

  16. Creation of pyramids • Done at Columbia for each of 20 out of 50 sets • Primary annotator, secondary checker • Held round-table discussions of problematic constructions that occurred in this data set • Comma-separated lists • Extractive reserves have been formed for managed harvesting of timber, rubber, Brazil nuts, and medicinal plants without deforestation. • General vs. specific • Eastern Europe vs. Hungary, Poland, Lithuania, and Turkey

  17. Characteristics of the Responses • Proportion of SCUs of Weight 1 is large • 44% (D324) to 81% (D695) • Mean SCU weight: 1.9 • Agreement among human responders is quite low
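The pyramid score itself is not spelled out on these slides; the sketch below follows my reading of Nenkova and Passonneau's definition (total weight of the SCUs expressed in the peer summary, normalized by the best total achievable for a summary of a given SCU count), so treat the details as an assumption:

```python
def pyramid_score(peer_scu_weights, pyramid_scu_weights, avg_model_scus=None):
    """Sketch of the pyramid score.

    peer_scu_weights:    weights of the SCUs expressed in the peer summary.
    pyramid_scu_weights: weights of all SCUs in the pyramid (from the models).
    avg_model_scus:      if given, normalize by an average-sized model summary
                         (modified score); otherwise by the peer's own SCU
                         count (original score).
    """
    d = sum(peer_scu_weights)
    size = int(round(avg_model_scus)) if avg_model_scus else len(peer_scu_weights)
    max_d = sum(sorted(pyramid_scu_weights, reverse=True)[:size])
    return d / max_d if max_d else 0.0
```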

  18. Number of SCUs at each weight [chart omitted: distribution of SCU counts over SCU weights]

  19. Human performance / Best system (letters = human summarizers, numbers = system IDs)

                    Pyramid      Modified     Resp       ROUGE-SU4
      Humans        B: 0.5472    B: 0.4814    A: 4.895   A: 0.1722
                    A: 0.4969    A: 0.4617    B: 4.526   B: 0.1552
      Best system   14: 0.2587   10: 0.2052   4: 2.85    15: 0.139

      • Best system ~50% of human performance on manual metrics
      • Best system ~80% of human performance on ROUGE

  20.-23. Per-system scores under each metric (system ID: score)

      Pyramid (original)   Modified pyramid   Responsiveness   ROUGE-SU4
      14: 0.2587           10: 0.2052         4:  2.85         15: 0.139
      17: 0.2492           17: 0.1972         14: 2.8          4:  0.134
      15: 0.2423           14: 0.1908         10: 2.65         17: 0.1346
      10: 0.2379           7:  0.1852         15: 2.6          19: 0.1275
      4:  0.2321           15: 0.1808         17: 2.55         11: 0.1259
      7:  0.2297           4:  0.177          11: 2.5          10: 0.1278
      16: 0.2265           16: 0.1722         28: 2.45         6:  0.1239
      6:  0.2197           11: 0.1703         21: 2.45         7:  0.1213
      32: 0.2145           6:  0.1671         6:  2.4          14: 0.1264
      21: 0.2127           12: 0.1664         24: 2.4          25: 0.1188
      12: 0.2126           19: 0.1636         19: 2.4          21: 0.1183
      11: 0.2116           21: 0.1613         6:  2.4          16: 0.1218
      26: 0.2106           32: 0.1601         27: 2.35         24: 0.118
      19: 0.2072           26: 0.1464         12: 2.35         12: 0.116
      28: 0.2048           3:  0.145          7:  2.3          3:  0.1198
      13: 0.1983           28: 0.1427         25: 2.2          28: 0.1203
      3:  0.1949           13: 0.1424         32: 2.15         27: 0.110
      1:  0.1747           25: 0.1406         3:  2.1          13: 0.1097
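One quick way to quantify how similarly two of these metrics order the systems is a rank correlation such as Spearman's rho; the sketch and the example call below are purely illustrative (made-up rankings, not values from the table):

```python
def spearman_rho(ranking_a, ranking_b):
    """Spearman rank correlation between two rankings of the same
    system IDs (best first, no ties)."""
    pos_b = {sys_id: i for i, sys_id in enumerate(ranking_b)}
    n = len(ranking_a)
    d2 = sum((i - pos_b[sys_id]) ** 2 for i, sys_id in enumerate(ranking_a))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical example: identical rankings give rho = 1.0.
print(spearman_rho([14, 17, 15, 10], [14, 17, 15, 10]))
```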


  24. Questions • Brotzman: In "Topic-Focused Multi-document Summarization Using an Approximate Oracle Score" and "Bayesian Query-Focused Summarization" we read of two methods of document summarization that rely on a surface-level representation of written language. They both beg the question (and Nenkova hints at the issue by characterizing the DUC's "coverage" as "not addressing issues such as readability and other text qualities"), how useful or relevant is a surface-level representation of language, in general? The experiments these papers conduct achieve promising results - but is this merely because the kinds of texts they consider are very "plain" or fundamentally "surface-level" anyway? How do you think the methods described could be extended to apply to less straightforward text? • Sparck Jones: In order to develop effective procedures it is necessary to identify and respond to the context factors, i.e. input, purpose and output factors, that bear on summarising and its evaluation. (p. 1)
