
Kathleen McKeown Department of Computer Science Columbia University


Presentation Transcript


  1. Query-Focused Summarization Using Text to Text Generation: When Information Comes from Multilingual Sources Kathleen McKeown Department of Computer Science Columbia University

  2. Query-Focused Summarization: Open-Ended Answers LIST FACTS ABOUT The Trial of Saddam Hussein Provide a biography of Laurent Kabila Describe the Relationship of Andrew Fastow to Kenneth Lay How did Serbia react to Bill Clinton’s visit to Pristina?

  3. Key Requirements • Handle input documents from unrestricted domains robustly • Operate without full semantic interpretation • Transform text input directly to text output: text-to-text generation

  4. Typical Approach to Query-Focused Summarization • Select “key” sentences from input document • Use statistical measures of salience to determine “key” • Word frequency • Sentence position • Clue words • Use matches of query terms against sentence terms • Key sentences are strung together to form answer
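The extractive baseline on the slide above can be sketched roughly as follows. This is a minimal illustration, not the system described in the talk: the function names, the weights, and the way the three signals (word frequency, sentence position, query-term overlap) are combined are all assumptions, and clue words are left out.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercased word tokens; apostrophes stay inside a token.
    return re.findall(r"[a-z0-9']+", text.lower())

def rank_sentences(sentences, query, w_freq=1.0, w_pos=0.5, w_query=2.0):
    """Score sentences by word frequency, position, and query overlap.

    The weights are hypothetical; a real system would tune them.
    """
    doc_counts = Counter(tok for s in sentences for tok in tokenize(s))
    total = sum(doc_counts.values()) or 1
    query_terms = set(tokenize(query))
    scored = []
    for i, sent in enumerate(sentences):
        toks = tokenize(sent)
        if not toks:
            continue
        freq = sum(doc_counts[t] for t in toks) / (len(toks) * total)
        pos = 1.0 / (i + 1)  # earlier sentences score higher
        overlap = len(query_terms & set(toks)) / (len(query_terms) or 1)
        scored.append((w_freq * freq + w_pos * pos + w_query * overlap, i, sent))
    # Highest score first; ties broken by original order.
    return sorted(scored, key=lambda x: (-x[0], x[1]))

def summarize(sentences, query, k=2):
    """String the top-k sentences together in their original order."""
    top = rank_sentences(sentences, query)[:k]
    return " ".join(s for _, _, s in sorted(top, key=lambda x: x[1]))
```

Because `summarize` only concatenates whole sentences, it inherits the problems listed on the following slides: irrelevant material inside a selected sentence, lost context, and MT errors are all passed through unchanged.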

  5. Problems • A selected sentence may contain relevant along with irrelevant information • Sentences placed out of context may create misconceptions • Translated text can be errorful

  6. Problems • A selected sentence may contain relevant along with irrelevant information • Iran's foreign minister, Manouchehr Mottaki, said Saturday that Javier Solana, the European Union's foreign policy chief who will deliver the new package of incentives for Iran, is expected to arrive in Tehran within the next two days, IRNA news agency reported.

  8. Problems • A selected sentence may contain relevant along with irrelevant information • Sentences placed out of context may create misconceptions • Miers held a news conference to justify her selection. • Translated text can be errorful

  10. Problems • Translated text can be errorful • MT: After several rounds of reminded, I was a little bit. • Ref: After several hints, it began to come back to me.

  11. Our Focus • Generate new sentences through fusion of selected phrases • Edit selected sentences • Correct infelicitous references • Remove extraneous and redundant material • Make fluent sentences from disfluent sentences Walking a fine line: it’s easy to make a good sentence bad

  12. Newsblaster • Generating sentences from phrases (Regina Barzilay, MIT) • Input cluster: • A fourth round scheduled for September was canceled when North Korea refused to attend, citing what it called a "hostile" US policy. • A fourth round of six-nation talks planned for September was canceled after North Korea refused to attend. • Fusion sentence: • A fourth round planned for September was canceled after North Korea refused to attend.

  13. Newsblaster • Editing references to people (Ani Nenkova, UPenn) • Cognitive Status • Major vs. minor character (George Bush vs. John Doe, a neighbor of the victim) • Hearer old vs. hearer new (Bush vs. Miers) • Discourse given vs. discourse new

  14. Talk Outline • Multilingual Query-focused summarization • Overview • Correcting names (Kristen Parton, PhD student, Columbia) • Editing references (Advaith Siddharthan, Cambridge) • Correcting verb deletions (Weiyun Ma, PhD student, Columbia) • Correcting content word deletions (Kristen Parton) • Sentence Fusion for Machine Translation • Using Tree Adjoining Grammar to fuse subtrees for system combination (Weiyun Ma)

  15. Multilingual Query Focused Summarization: Overview DARPA GALE: Global Autonomous Language Exploitation • Three large teams: BBN, IBM, SRI • Generate responses to open-ended questions • 17 templates: definitions, biographies, events, relationships, reactions, etc. • Using English, Chinese, and Arabic • Text and Speech • Formal and Informal (newswire vs blog)

  16. Initially: SRI distillation Now: IBM distillation

  17. Multilingual Query Focused Summarization: Overview Utility Evaluation

  18. Multilingual Query Focused Summarization: Overview

  19. Multilingual Query Focused Summarization: Overview NIGHTINGALE: Information Delivery [Architecture diagram. Components: query analyzer, search engine, query-specific response generators, backup response generators (on error/empty), resolve references, identify redundant & identical sentences. Data: XML query, annotated data, response XML, citation XML.]

  20. Multilingual Query Focused Summarization: Overview Describe the relationship of the Islamic Jihad to Hamas. • The militant Islamic Jihad group rejects the idea of a long-term truce with Israel and will not join a Hamas-led government, a leader of the group said Wednesday [2/8/2006]. Some Hamas leaders have indicated they are interested in such an extended cease-fire with Israel. Hamas has largely observed an informal truce for the past year, while the smaller, more hardline Islamic Jihad has carried out six suicide attacks against Israelis during that period. Islamic Jihad and Hamas have similar ideologies, including a call for the destruction of Israel, but are also fierce competitors. ….. From English only

  21. Multilingual Query Focused Summarization: Overview LIST FACTS ABOUT [The Trial of Saddam Hussein] • The judge , however, that all people should have heard voices, the order of a court to solve technical problems. (Chi) • A trial without Saddam could be an embarrassment for the U.S. government, which has worked hard to help create a tribunal that would be perceived by Iraqis as independent and fair. (Eng) • As the trial got under way, a former secret police officer testified that he had not received any orders from Saddam during the investigations that followed an assassination attempt against him in Dujail in 1982 . (Eng) From mixed Chinese and English

  22. Multilingual Query Focused Summarization: Overview DESCRIBE THE ACTIONS OF [Hu Jintao] DURING [04/24/2006] TO [4/30/2006] The 26 Hu Jintao arrived in the Nigerian capital Abuja to Nigeria. Hu Jintao President of Nigeria Olusegun Obasanjo held talks. 28 published by Qiushi magazine will be made by Hu Jintao the article is to firmly establish the socialist and Kwun From Chinese only

  23. Multilingual Query Focused Summarization: Overview Error Analysis of Year 2 Evaluation • 12 / 66 queries with zero response • 7 Chinese - 3 broadcast news, 2 broadcast conversation,  2 newswire • 2 Arabic - 1 broadcast news, 1 blogs • 1 mixed Arabic/English – broadcast news, newswire • 2 English - 1 broadcast conversation, 1 blogs • >40% Chinese-only queries had zero response Name in English query did not even appear in translated documents

  24. Multilingual Query Focused Summarization: Overview Distillation over Multilingual Sources • “Minor” MT problem = Major distillation problem • Not all n-grams are equally important • Deletion of verbs (especially for event triggers) • Deletion of named entities • Non-translation of OOV (named entities, inflected words) • Sentence-by-sentence vs. whole document • IR over MT is poor, especially for names • Garbled sentences are hard to process • Unreadable snippets are irrelevant

  25. Multilingual Query Focused Summarization: Correcting Names Finding and Correcting Names

  26. Multilingual Query Focused Summarization: Correcting Names It should be mentioned that $wArznjr is also a nasseer of the Olympic Movement , which provide mentally handicapped the opportunity to participate in social life through training and sports competitions . It also included also many media figures and general such important outstanding information opera singer wynfrY , besides the star and the governor of the state of California Arnold Schwarznegger . Nightingale System (Year 2) • IR over translated documents • MT output passed to response generators Query: Schwarzenegger يذكر ان شوارزنجر هو ايضا نصير للحركة الأوليمبية الخاصة , التى توفر للمعاقين ذهنيا فرصة المشاركة فى الحياة الأجتماعية من خلال التدريبات والمسابقات الرياضية . وانما ضمت أيضا العديد من الشخصيات الاعلامية والعامة المهمة مثل الاعلامية اللامعة أوبرا وينفرى , الى جانب النجم وحاكم ولاية كاليفورنيا ارنولد شوارزنيجر .

  27. Multilingual Query Focused Summarization: Correcting Names Solution: Simultaneous Multilingual IR • Index “pseudo-parallel” documents, with source and MT • Create multilingual queries to search both • Each document has two “chances” to match query: source and translation شفارتزنيغرشوارزنجرشوارزينيجرSchwarzenegger Query: It should be mentioned that $wArznjr is also a nasseer of the Olympic Movement , which provide mentally handicapped the opportunity to participate in social life through training and sports competitions . It also included also many media figures and general such important outstanding information opera singer wynfrY , besides the star and the governor of the state of California Arnold Schwarznegger. يذكر ان شوارزنجر هو ايضا نصير للحركة الأوليمبية الخاصة , التى توفر للمعاقين ذهنيا فرصة المشاركة فى الحياة الأجتماعية من خلال التدريبات والمسابقات الرياضية . وانما ضمت أيضا العديد من الشخصيات الاعلامية والعامة المهمة مثل الاعلامية اللامعة أوبرا وينفرى , الى جانب النجم وحاكم ولاية كاليفورنيا ارنولد شوارزنيجر .
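The "two chances to match" idea above can be illustrated with a toy in-memory search. This is only a sketch under stated assumptions: `PseudoParallelDoc`, `build_multilingual_query`, and `search` are hypothetical names, matching is plain substring lookup rather than a real IR engine, and the Arabic variant is taken from the example on the slide.

```python
from dataclasses import dataclass

@dataclass
class PseudoParallelDoc:
    doc_id: str
    source: str  # original-language text
    mt: str      # machine translation of the same text

def build_multilingual_query(english_term, translations):
    """Combine the English term with source-language variants of it."""
    return [english_term] + list(translations)

def search(docs, query_terms):
    """Return (doc_id, field) for documents where any term matches.

    Each document is checked twice: once against its MT text and once
    against its source text.
    """
    hits = []
    for doc in docs:
        mt_lower = doc.mt.lower()
        matched_mt = any(t.lower() in mt_lower for t in query_terms)
        matched_src = any(t in doc.source for t in query_terms)
        if matched_mt or matched_src:
            hits.append((doc.doc_id, "mt" if matched_mt else "source"))
    return hits

# The MT garbles the name as "$wArznjr", so an English-only query misses
# the document; the Arabic variant still matches the source side.
docs = [
    PseudoParallelDoc(
        "d1",
        "يذكر ان شوارزنجر هو ايضا",
        "It should be mentioned that $wArznjr is also",
    )
]
query = build_multilingual_query("Schwarzenegger", ["شوارزنجر"])
hits = search(docs, query)
```

The source-language variants would come from a translation dictionary; the next slide proposes Wikipedia as one source of such variants.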

  28. Multilingual Query Focused Summarization: Correcting Names User-Generated Translations • Use Wikipedia to translate, expand query • Advantages • Generated by humans, “edited” by humans • Good for names, named entities, titles of books/plays/etc. • Contains slang, name variations, common misspellings – useful for blogs • Free, easy to acquire 49 English synonyms: ahnuld governator arnold swartzeneger arnold swartzenegger arnold swartzenneger arnold swartzennegger arnold swartznegger arnold swarzenager arnold swarzenneger …

  29. Multilingual Query Focused Summarization: Correcting Names Experimental Setting • 145 GALE queries, 8,785 Chinese documents judged • What gets translated • Queries only • Documents only • Both • How queries are translated • Using Wikipedia • Using Statistical MT (SMT) dictionary

  30. Multilingual Query Focused Summarization: Correcting Names IR Evaluation • Overall, SMLIR does better than just query translation or document translation alone • On named entities, Wikipedia translation dictionary outperforms SMT dictionary • For non-names, doing just query translation with SMT dictionary is better than any other setting *other systems performed better

  31. Multilingual Query Focused Summarization: Correcting Names SMT Post-Editing for Names • Use SMLIR to identify incorrect name translations • Use query translation + word alignments to rewrite for response generator • Language independent English query: Schwarzenegger Pseudo-parallel document with word alignments يذكر ان شوارزنجر هو ايضا ... It should be mentioned that $wArznjr is also … Dictionary translation: شوارزنيجر It should be mentioned that Schwarzenegger is also … Edited translation 除了阿迪达斯,本届世界杯的官方赞助商还有可口可乐,麦当劳,百威啤酒,雅虎以及万事达信用卡等。 In addition Adidas , at this World Cup official sponsor of the Coca-Cola , McDonalds , 100 beerBudweiser , Yahoo and MasterCard .
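A minimal sketch of the rewriting step, assuming word alignments are available as (source index, MT index) pairs and that a name dictionary maps source-language names to the English form used in the query. The function name and data layout are hypothetical, and the one-to-one alignment is a simplification.

```python
def post_edit_names(src_tokens, mt_tokens, alignment, name_dict):
    """Rewrite MT tokens that are aligned to a known source-language name.

    alignment: list of (src_index, mt_index) pairs.
    name_dict: source-language name -> preferred English rendering.
    """
    edited = list(mt_tokens)
    for src_i, mt_i in alignment:
        english = name_dict.get(src_tokens[src_i])
        if english is not None and edited[mt_i] != english:
            edited[mt_i] = english
    return edited

src = ["يذكر", "ان", "شوارزنجر", "هو", "ايضا"]
mt = "It should be mentioned that $wArznjr is also".split()
edited = post_edit_names(src, mt, [(2, 5)], {"شوارزنجر": "Schwarzenegger"})
```

Because the edit is driven only by alignments and a dictionary, nothing in it is specific to Arabic, which is the sense in which the slide calls the method language independent. A name aligned to several MT tokens would need extra handling that this sketch omits.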

  32. Multilingual Query Focused Summarization: References Rewriting References • The representative of Iraq in the United Nations Nizar Hamdoon

  33. Multilingual Query Focused Summarization: References Rewrite proper and common nouns to remove MT errors (Siddharthan and McKeown 05) • Use redundancy in input to QA and multiple translations to build attribute value matrices (AVMs) • Record country, role, description for all people • Record name variants • Use generation grammar with semantic categories (role, organization, location) to re-order phrases for fluent output

  34. Multilingual Query Focused Summarization: References

  35. Multilingual Query Focused Summarization: Correcting Verbs Correcting Missing Verbs

  36. Multilingual Query Focused Summarization: Correcting Verbs SMT Post-Editing for Chinese Verbs Wei Yun Ma, Kathy McKeown • For Chinese, 4-7% of our MT sentences/clauses were missing a main verb • MT: People of classical music loving every year. • REF : People’s love for classical music reduced every year. • Chinese: 民众对古典音乐的热爱逐年减退。 “arrested” 12月/NN 13日/NN 萨达姆/NR 被/SB 捕/VV。 On December 13 Saddam .

  37. Multilingual Query Focused Summarization: Correcting Verbs SMT Post-Editing for Chinese Verbs Wei Yun Ma, Kathy McKeown • Detect missing verbs using POS • Use related documents returned by IR as DB of examples • Correct using context, freq to select verb translation from DB
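The detection step can be sketched as a comparison of part-of-speech tags on the two sides. The sketch assumes the source is tagged with Chinese Treebank tags (as in the `萨达姆/NR 被/SB 捕/VV` example) and the MT output with Penn Treebank tags; the helper names are hypothetical, and `pick_verb` uses frequency only, ignoring the context signal mentioned on the slide.

```python
from collections import Counter

SOURCE_VERB_TAGS = {"VV", "VC", "VE"}  # Chinese Treebank verb tags

def parse_tagged(text):
    """Split 'word/TAG word/TAG ...' into (word, tag) pairs."""
    pairs = []
    for item in text.split():
        word, _, tag = item.rpartition("/")
        pairs.append((word, tag))
    return pairs

def missing_verb(source_tagged, mt_tagged):
    """True if the source has a verb but the MT output has none."""
    src_has_verb = any(tag in SOURCE_VERB_TAGS for _, tag in source_tagged)
    mt_has_verb = any(tag.startswith("VB") or tag == "MD" for _, tag in mt_tagged)
    return src_has_verb and not mt_has_verb

def pick_verb(candidates):
    """Choose the most frequent translation seen in related documents."""
    return Counter(candidates).most_common(1)[0][0] if candidates else None

src = parse_tagged("12月/NN 13日/NN 萨达姆/NR 被/SB 捕/VV 。/PU")
bad_mt = parse_tagged("On/IN December/NNP 13/CD Saddam/NNP ./.")
good_mt = parse_tagged("On/IN December/NNP 13/CD Saddam/NNP was/VBD arrested/VBN ./.")
```

Here `missing_verb(src, bad_mt)` flags the sentence, while the corrected translation passes. The candidate list for `pick_verb` would be drawn from the related documents returned by IR.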

  38. Multilingual Query Focused Summarization: Correcting Verbs SMT Post-Editing for Chinese Verbs • Chinese speakers judged original versus modified • 79% of the modified sentences were better than the original MT • 7% irrelevant sentences become relevant 12月13日萨达姆被捕。 On December 13 Saddam . On December 13 Saddam was arrested. 印度25日组织全国几百万儿童服用脊髓灰质炎疫苗。 India 25 th National millions children polio vaccine . India 25 th National millions children received polio vaccine .

  39. Detecting Content Word Deletion (in Arabic) • Beyond verbs • Named entities, content words -> function words, out of vocabulary • Use alignments, part of speech, TF*IDF • Detection accuracy, judged by 2 people • 82% of the deletions, both judges agreed • 89% of those were correct • Current work: correcting difficult due to word order differences
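One way the three signals on this slide could be combined is sketched below. The slide does not say how they are combined, so everything here is an assumption: the coarse tag set, the smoothed IDF formula, the threshold, and the function names are illustrative only.

```python
import math

CONTENT_TAGS = {"NOUN", "VERB", "ADJ", "PROPN"}  # hypothetical coarse tag set

def idf(term, doc_freq, n_docs):
    # Smoothed inverse document frequency; rare terms score higher.
    return math.log((1 + n_docs) / (1 + doc_freq.get(term, 0))) + 1.0

def detect_deletions(src_tagged, aligned_src_indices, doc_freq, n_docs, min_score=1.5):
    """Flag source content words that have no aligned MT token.

    src_tagged: list of (word, tag) pairs for the source sentence.
    aligned_src_indices: set of source positions linked to some MT token.
    """
    counts = {}
    for word, _ in src_tagged:
        counts[word] = counts.get(word, 0) + 1
    flagged = []
    for i, (word, tag) in enumerate(src_tagged):
        if tag not in CONTENT_TAGS or i in aligned_src_indices:
            continue
        score = counts[word] * idf(word, doc_freq, n_docs)  # TF * IDF
        if score >= min_score:
            flagged.append((i, word, score))
    return flagged
```

Unaligned function words are ignored by the tag filter, and unaligned but very common content words fall below the threshold, so only informative deletions are reported.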

  40. Fusion for Machine Translation • STAGES (Statistical Translation and Generation using Semantics) • Funded by NSF • Univ of Colorado (Martha Palmer) • ISI (Kevin Knight) • Rochester Univ (Dan Gildea) • Brandeis Univ (Nianwen Xue)

  41. Columbia’s Role • Generate an English sentence using conceptual alignments produced by Colorado as input • Fuse translation trees provided by ISI’s and Rochester’s SMT components • A new form of system combination using generation grammar • Research: • What are the constraints on grammar controlled combination? • What constraints can we use for syntactic and lexical choice? • How can we recover/exploit information from the source?

  43. Overall Architecture

  45. Approach • The use of XTAG grammar • Feature-based lexicalized tree adjoining grammar • Simultaneously detect multiple ungrammatical types and words • Correct detected ungrammatical errors using substitution and adjunction

  46. Sent 521 foreign:让 我们 看一个 实际 的 例子 . dev.0: let us look at a concrete example . dev.1: let us look at a concrete example . dev.2: let us look at a concrete example . dev.3: let us look at a concrete example . Combine constituents Rochester ISI

  47. Combine constituents • Adopt Rochester’s output as the basis tree [Diagram: substitution into the tree for “look … at”, with a slot labeled Role:arg1]

  48. Current Focus • Detection of agreement errors • Number agreement • Verb mode • Use feature structures attached to nodes • Unification failures indicate a grammatical error • Augmented to record failure points

  49. Many young student play basketball pl
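The idea of treating a unification failure as an error signal, and recording where it failed, can be sketched with plain dictionaries as feature structures. This is a simplified illustration, not the XTAG implementation: the `unify` function and the `agr`/`num` feature names are assumptions.

```python
def unify(a, b, path=()):
    """Unify two nested feature structures represented as dicts.

    Returns (result, failures). Instead of stopping at the first clash,
    each clash is recorded as (feature_path, value_in_a, value_in_b).
    """
    result, failures = dict(a), []
    for key, b_val in b.items():
        if key not in result:
            result[key] = b_val
            continue
        a_val = result[key]
        if isinstance(a_val, dict) and isinstance(b_val, dict):
            sub, sub_fail = unify(a_val, b_val, path + (key,))
            result[key] = sub
            failures.extend(sub_fail)
        elif a_val != b_val:
            failures.append((path + (key,), a_val, b_val))
    return result, failures

# "Many" requires a plural noun; "student" is singular.
many = {"agr": {"num": "pl"}}
student = {"cat": "N", "agr": {"num": "sg"}}
_, failures = unify(many, student)
```

For the example sentence, the recorded failure points at the `agr.num` feature, which identifies both the error type (number agreement) and the words involved. Replacing `student` with a plural entry makes the failure list empty.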

  50. Experiments • 422 translated sentences • 6 systems from GALE 2007 evaluation
