Do Summaries Help? A Task-Based Evaluation of Multi-Document Summarization



  1. Do Summaries Help? A Task-Based Evaluation of Multi-Document Summarization. Kathleen McKeown, Rebecca Passonneau, David Elson, Ani Nenkova, Julia Hirschberg. Department of Computer Science, Columbia University

  2. Status of Multi-Document Summarization • Robust • Many existing systems (e.g., DUC 2004) • http://newsblaster.cs.columbia.edu • http://www.newsinessence.com • Extensive quantitative evaluation (intrinsic) • DUC 2001 – DUC 2005 • Comparison of system summary content against human models • Do system-generated summaries help end-users make better use of the news?

  3. Extrinsic Evaluation • Task-based evaluation of single-document summarization using IR • TIPSTER-II, Brandow et al., Mani et al., Mochizuki & Okumura • Other factors can determine the result (Jing et al.) • Evaluation of evaluation metrics using a task similar to ours • Amigo et al.

  4. Task Evaluation • Hypothesis: multi-document summaries enable users to find information efficiently • Task: fact gathering given a topic and questions • Resembles an intelligence analyst's task • Compared 4 parallel news browsing systems • Level 1: Source documents only • Level 2: One-sentence multi-document summaries (e.g., Google News) linked to documents • Level 3: Newsblaster multi-document summaries linked to documents • Level 4: Human-written multi-document summaries linked to documents

  5. Results Preview • Quality of facts gathered was significantly better • Newsblaster vs. documents alone • User satisfaction was higher • Newsblaster and human summaries vs. documents and 1-sentence summaries • Summaries contributed important facts • Newsblaster and human summaries vs. 1-sentence summaries • Full multi-document summarization is more powerful than documents alone or single-sentence summarization

  6. Outline • Study design and execution • Scoring • Results

  7. Evaluation Goals • Do summaries help users find the information needed to perform a fact-gathering task? • Do users use information from the summary in gathering their facts? • Do summaries increase user satisfaction with the online news system? • Do users create better fact sets with an online news system that includes summaries than with one without? • How does the type of summary (i.e., 1-sentence, system-generated, human-generated) affect the quality of task output and user satisfaction?

  8. Experimental Design • Subjects performed four 30-minute fact-gathering scenarios • Prompt: topic description plus three questions • Given a web page as sole resource • Space in which to compose response • Instructed to cut and paste from summary or article • Four event clusters per page • Two centrally relevant, two less relevant • 10 documents per cluster on average • Complete survey after each scenario

  9. Prompt • The conflict between Israel and the Palestinians has been difficult for government negotiators to settle. Most recently, implementation of the "road map for peace," a diplomatic effort sponsored by the United States, Russia, the E.U. and the U.N., has suffered setbacks. However, unofficial negotiators have developed a plan known as the Geneva Accord for finding a permanent solution to the conflict. • Who participated in the negotiations that produced the Geneva Accord? • Apart from direct participants, who supported the Geneva Accord preparations and how? • What has the response been to the Geneva Accord by the Palestinians and Israelis?

  10. Experimental Design • Subjects performed four 30-minute fact-gathering scenarios • Prompt: topic description plus three questions • Produced a report containing a list of facts • Given a web page as sole resource • Space in which to compose response • Instructed to cut and paste from summary or article and make citation • Four event clusters per page • Two centrally relevant, two less relevant • 10 documents per cluster on average • Complete survey after each scenario

  11. Level 1: Documents only, no summary

  12. Level 2: 1-sentence summary for each event cluster, 1-sentence summary for each article

  13. Full multi-document summaries (neither humans nor systems had access to the prompt) • Level 3: Generated by Newsblaster for each event cluster • Level 4: Human-written summary for each event cluster • Summary writers hired to write summaries • English or Journalism students with high verbal SAT scores

  14. Levels 3 and 4: full summary for each event cluster

  15. Experimental Design (repeated; same content as slide 10)

  16. Study Execution • 45 subjects with varied backgrounds • 73% students (BS, BA, journalism, law) • Native speakers of English • Paid, with the promise of a monetary prize for the best report • 3 studies, controlling for scenario and level order; ~11 subjects per scenario per level (one possible counterbalancing is sketched below)
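
The order control mentioned on this slide can be made concrete with a small rotation scheme. This is a hypothetical sketch, not the authors' actual assignment procedure: it assumes four scenarios and four summary levels per subject, and the rotation is chosen only for illustration so that each subject sees every scenario and every level exactly once and each scenario-level pairing occurs equally often across a group of four subjects.

```python
# Hypothetical counterbalancing sketch; scenario names other than the
# Geneva Accord prompt are placeholders, not the study's real clusters.
SCENARIOS = ["Geneva Accord", "Scenario B", "Scenario C", "Scenario D"]
LEVELS = ["1: documents only", "2: 1-sentence", "3: Newsblaster", "4: human"]

def session_plan(subject_id: int):
    """(scenario, level) pairs for one subject's four 30-minute sessions.

    For subject s and session position i, scenario index = (s + i) % 4 and
    level index = (2*s + i) % 4, so within a subject both are permutations,
    and across any block of 4 subjects every scenario-level pair occurs once.
    """
    return [
        (SCENARIOS[(subject_id + i) % 4], LEVELS[(2 * subject_id + i) % 4])
        for i in range(4)
    ]

if __name__ == "__main__":
    for sid in range(4):
        print(sid, session_plan(sid))
```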

  17. Results – What was Measured • Report content across summary conditions: levels 1-4 • User satisfaction per summary condition based on user surveys • Source of report content (summary or article) by counting fact citations

  18. Scoring Report Content • Compare subject reports against a gold standard • Used the Pyramid method [HLT 2004] • Avoids postulating an ideal exhaustive report • Predicts multiple equally good reports • Provides a metric for comparison • Gold standard for report x = pyramid of facts constructed from all reports except x • Relative importance of facts determined by the report writers • 34 reports per pyramid on average, so pyramids are very stable

  19. Pyramid representation • Tiers of differentially weighted facts • Top: few facts, high weight • Bottom: many facts, low weight • Report facts that don't appear in the pyramid have weight 0 • Duplicate report facts get weight 0 • (Diagram: tiers labeled W=34, W=33, …, W=1)
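
A minimal sketch of how such a pyramid could be represented in code, assuming each report has already been annotated as a set of fact identifiers. The leave-one-out construction follows slide 18 (the pyramid for report x is built from all other reports); function and variable names are illustrative, not from the authors' implementation.

```python
from collections import Counter

def build_pyramid(annotated_reports, held_out_id):
    """Weight of each fact = number of other reports that include it.

    annotated_reports: dict mapping report id -> set of fact ids.
    held_out_id: the report being scored; it is excluded (leave-one-out),
    so with ~35 annotated reports each pyramid rests on ~34 of them.
    Returns a dict of fact id -> weight; the weights are the tiers.
    """
    counts = Counter()
    for report_id, facts in annotated_reports.items():
        if report_id == held_out_id:
            continue
        counts.update(facts)  # a set, so each report counts at most once per fact
    return dict(counts)
```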

  20. Ideally informative report (slides 20-25 repeat this definition over an animated build of the pyramid) • Does not include a fact from a lower tier unless all facts from higher tiers are included as well
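
Given a pyramid, the notion of an ideally informative report suggests the normalization: a report's raw score is the total weight of its pyramid facts, divided by the weight an ideal report of the same size would achieve by always taking higher-tier facts first. The sketch below follows that idea; how duplicates count toward report size is an assumption on my part, since the slides only say they get weight 0.

```python
def pyramid_score(report_facts, pyramid):
    """Observed weight divided by the weight of an ideally informative
    report of the same size (one that never takes a lower-tier fact while
    a higher-tier fact is still available).

    report_facts: list of fact ids cited in a subject's report; duplicates
                  and facts absent from the pyramid contribute weight 0.
    pyramid:      dict of fact id -> weight, e.g. from build_pyramid().
    """
    unique_facts = set(report_facts)                       # duplicates get weight 0
    observed = sum(pyramid.get(f, 0) for f in unique_facts)

    # Ideal report of the same size: take the heaviest pyramid facts first.
    # Assumption: duplicates still count toward the report's size, so padding
    # a report with repeated facts lowers its score.
    n = len(report_facts)
    ideal = sum(sorted(pyramid.values(), reverse=True)[:n])
    return observed / ideal if ideal else 0.0
```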

  26. Report Length • Wide variation in length impacts scores • We restricted report length to less than 1 standard deviation above the mean by truncating question answers (a sketch of this cap follows)
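
A minimal sketch of the length restriction, assuming length is measured in tokens; the slides do not say whether words or characters were counted, and the actual study truncated individual question answers rather than cutting at an arbitrary token boundary.

```python
import statistics

def truncate_reports(reports, tokenize=str.split):
    """Cap every report at mean + 1 standard deviation of report length.

    reports: list of report strings. Token counts stand in for whatever
    length unit the study actually used (an assumption).
    """
    lengths = [len(tokenize(r)) for r in reports]
    cap = int(statistics.mean(lengths) + statistics.stdev(lengths))
    return [" ".join(tokenize(r)[:cap]) for r in reports]
```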

  27. Results - Content • Report quality improves from level 1 to level 3 • (One scenario was dropped from the results as it was problematic for subjects)

  28. Statistical Analysis • ANOVA shows summary is a marginally significant factor • Bonferroni method applied to determine differences between summary levels • Difference between Newsblaster and documents-only is significant (p = .05) • Differences between Newsblaster and 1-sentence or human summaries are not significant • ANOVA shows that scenario, question and subject are also significant factors (see the sketch below)
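
The slide does not name the software used; the following sketch only reproduces the shape of the analysis with SciPy, using a one-way ANOVA over summary level followed by pairwise t-tests with a Bonferroni adjustment. The study's actual model also included scenario, question and subject as factors, which this simplified sketch omits.

```python
from itertools import combinations
from scipy import stats

def compare_summary_levels(scores_by_level):
    """scores_by_level: dict mapping level name -> list of pyramid scores."""
    groups = list(scores_by_level.values())
    f_stat, p_anova = stats.f_oneway(*groups)          # one-way ANOVA
    print(f"ANOVA: F = {f_stat:.2f}, p = {p_anova:.3f}")

    pairs = list(combinations(scores_by_level, 2))
    for a, b in pairs:
        t, p = stats.ttest_ind(scores_by_level[a], scores_by_level[b])
        p_adj = min(1.0, p * len(pairs))               # Bonferroni adjustment
        print(f"{a} vs {b}: adjusted p = {p_adj:.3f}")
```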

  29. Results - User Satisfaction • 6 questions in the exit survey required a response on a 1-5 scale • Average response increases with summary type

  30. With full summaries, users read less

  31. With summaries, it was easier to write the report, and subjects tended to have more time

  32. Usefulness improves with summary quality; human summaries help most with time

  33. Multiple Choice Survey Questions

  34. Citation Patterns • Report writers were significantly more likely to extract facts from summaries under the Newsblaster and human-summary conditions (one way to tally this is sketched below)
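
One simple way to tally the citation patterns behind this result, assuming each cited fact has been labeled with the summary level of its session and whether it was copied from a summary or a full article. The data layout is hypothetical; the slides describe only the outcome of the count.

```python
from collections import Counter

def summary_citation_rates(citations):
    """citations: iterable of (summary_level, source) pairs, one per cited
    fact, where source is either "summary" or "article".

    Returns, per summary level, the fraction of cited facts that were
    drawn from a summary rather than from a full article.
    """
    totals, from_summary = Counter(), Counter()
    for level, source in citations:
        totals[level] += 1
        if source == "summary":
            from_summary[level] += 1
    return {lvl: from_summary[lvl] / totals[lvl] for lvl in totals}
```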

  35. What we Learned • With summaries, a significant increase in report quality • We hypothesized that summaries would reduce reading time • As summary quality increases, users draw facts from the summary significantly more often, with no decrease in report quality • Users report reading fewer full documents with level 3 and 4 summaries • Full multi-document summarization is better than 1-sentence summaries • The proportion of subjects who said summaries were helpful was almost 5 times higher with Newsblaster summaries than with 1-sentence summaries

  36. Need for Follow-on Studies • Why no significant increase in report quality from level 2 to level 3? • Interface differences • Level 2 had summary for each article, level 3 did not • Level 3 required extra clicks to see list of articles • Studies to investigate controlling report length • Studies to investigate impact of scenario and question


  38. Conclusions • Do summaries help? • Yes • Our task-based, extrinsic evaluation yielded significant conclusions • Full multi-document summarization (Newsblaster, human summaries) helps users perform better at fact-gathering than documents only • Users are more satisfied with full multi-document summarization than Google News style 1-sentence summaries
