Enhancing Multi-Document Summarization: Strategies and Components

UNC-CH at DUC 2007: Query Expansion, Lexical Simplification, and Sentence Selection Strategies for Multi-Document Summarization
Catherine Blake, Julia Kampov, Andreas Orphanides, David West, Cory Lown

Goals in 2007 • Get a system up and running • Components • Query Expansion • WordNet • Lexical Compression • Linguistically motivated pruning • Sentence Selection • Clustering

System Architecture

Query Expansion - Approach • Goal: Increase responsiveness • Approach • A – Weak Baseline • any term in topic or query • B – Baseline • remove stop words inc. small set of tailored terms • C – Weak WordNet • WordNet synsets from terms in B • D – WordNet • Synsets from C + synonyms
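The A and B term sets, plus a generic synonym expansion step, can be sketched as follows. This is a minimal illustration, not the system's code: the stop word list, the tokenizer, and the `TOY_SYNONYMS` table (a hand-made stand-in for a WordNet lookup) are all assumptions, and the slides do not give enough detail to reproduce the exact difference between strategies C and D.

```python
# Assumed stop word list, including a couple of "tailored" query words.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "describe", "discuss"}

# Hypothetical stand-in for a WordNet lookup: term -> set of synonyms.
TOY_SYNONYMS = {
    "car": {"auto", "automobile"},
    "crash": {"collision"},
}


def terms(text):
    """Lowercase whitespace tokens with surrounding punctuation stripped."""
    cleaned = (tok.strip(".,;:!?\"'()").lower() for tok in text.split())
    return {tok for tok in cleaned if tok}


def strategy_a(topic, query):
    """A (weak baseline): any term in the topic or the query."""
    return terms(topic) | terms(query)


def strategy_b(topic, query, stop_words=STOP_WORDS):
    """B (baseline): strategy A with stop words removed."""
    return strategy_a(topic, query) - stop_words


def expand(base_terms, synonyms=TOY_SYNONYMS):
    """Add every known synonym of each base term (toy version of C/D)."""
    expanded = set(base_terms)
    for term in base_terms:
        expanded |= synonyms.get(term, set())
    return expanded


topic = "Car crash"
query = "Describe the causes of the car crash."
a = strategy_a(topic, query)
b = strategy_b(topic, query)
c = expand(b)
```

With these toy inputs, `b` keeps only the content terms and `c` adds their synonyms. A real run would replace `TOY_SYNONYMS` with synset lookups from WordNet.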

Query Expansion - Evaluation Annotators didn’t know how the system would summarize text, but knew that the task was going to be automated • Query selection • Rank 2006 queries by overall responsiveness • Relevance • 3 annotators identified sentences with “information pertinent to the topic” for 9 topics • For evaluation, a sentence was identified when a term from A, B, C, or D appeared in a gold standard sentence • Inter-rater reliability • Topics 6 and 34 had fair to moderate agreement • Annotators reached consensus for topics 6 and 34 • Annotators then reworked other topics

Query Selection

Query Expansion – Evaluation

Decision: No WordNet Query Expansion

Lexical Simplification

Lexical Simplification • Goal • Increase linguistic quality • Approach • Representation • Typed Dependency Tree (de Marneffe et al., 2006) • Stanford Parser Version 1.5 (Klein & Manning, 2002; 2003) • Identify short, stand-alone sentences • Prune both original and short sentences using • Parser tags • Cue phrases identified in previous DUC submissions

Sub-Sentences Short Stand-Alone Sentences

Pruning • Noun Appositive • Participial Modifier • Examples • “For nearly a decade, Queen Latifah, the first lady of hip-hop, has been bobbing and weaving questions about …” • “Indeed, some people reading this report could get the impression that Amnesty believes violence can be a legitimate instrument, the statement said”

Pruning • Lead Adverbials • 15 cue phrases from previous DUCs • Attribution • Parser tags • Cue phrases: said, according • Example • “Separately, the report said that the murder rate by Indians in 1996 was 4 per 100,000, below the national average …”
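The cue-phrase side of this pruning can be sketched with regular expressions. This is a rough illustration only: the system also relied on parser tags, which are not modeled here, and the `LEAD_CUES` list is hypothetical because the slides do not list the 15 cue phrases. Only the cue word "said" comes from the slide.

```python
import re

# Hypothetical cue phrases; the 15 phrases used by the system are not listed.
LEAD_CUES = ["separately", "indeed", "however", "meanwhile"]

# A cue phrase at the start of the sentence, followed by a comma.
_LEAD_RE = re.compile(
    r"^(?:%s),\s+" % "|".join(map(re.escape, LEAD_CUES)), re.IGNORECASE
)

# A final comma-delimited clause ending in "said", e.g. ", the statement said."
_TRAILING_ATTR_RE = re.compile(r",\s+[^,]*\bsaid(\.?)$", re.IGNORECASE)


def prune_lead_adverbial(sentence):
    """Drop a leading cue phrase and re-capitalize the remainder."""
    pruned = _LEAD_RE.sub("", sentence, count=1)
    if pruned != sentence and pruned:
        pruned = pruned[0].upper() + pruned[1:]
    return pruned


def prune_trailing_attribution(sentence):
    """Drop a trailing ', X said' clause, keeping any final period."""
    return _TRAILING_ATTR_RE.sub(r"\1", sentence, count=1)


def prune(sentence):
    """Apply both cue-phrase rules in sequence."""
    return prune_trailing_attribution(prune_lead_adverbial(sentence))
```

Pattern matching like this misses attributions in the middle of a sentence (such as "the report said that ..."), which is one reason a parser-based approach is the more robust choice.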

Lexical Simplification

Sentence Selection - Settings • No WordNet query expansion • original + base form • Percentage of Topic/Query Terms • (num stemmed terms in query) / (num stemmed terms in sentence) • Percentage of Unique Terms • (num stemmed terms in new sentence that are not in selected sentences) / (num stemmed terms in sentence) • Weighted Term Frequency * IDF
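The two percentage features can be sketched as simple ratios. This is one plausible reading of the slide, not a confirmed implementation: the numerator of the first feature is taken to be the sentence's stemmed terms that also occur in the topic/query, and the inputs are assumed to be stemmed already (no stemmer is included here).

```python
def pct_query_terms(sentence_stems, query_stems):
    """Share of the sentence's stemmed terms that occur in the topic/query."""
    if not sentence_stems:
        return 0.0
    query = set(query_stems)
    hits = sum(1 for stem in sentence_stems if stem in query)
    return hits / len(sentence_stems)


def pct_unique_terms(sentence_stems, selected_stems):
    """Share of the sentence's stemmed terms not yet in the selected sentences."""
    if not sentence_stems:
        return 0.0
    seen = set(selected_stems)
    new = sum(1 for stem in sentence_stems if stem not in seen)
    return new / len(sentence_stems)


sentence = ["murder", "rate", "fall", "report"]
query_share = pct_query_terms(sentence, ["murder", "rate"])
novel_share = pct_unique_terms(sentence, ["report"])
```

The first feature rewards sentences that stay on topic, while the second penalizes sentences that repeat what the summary already says.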

Sentence Selection - Settings • Weighted Term Frequency (tottf)

Feature                      Weight
Stopword or punctuation      0
Topic/Query ∧ ¬Summary       1
Topic/Query ∧ Summary        0.5
¬Topic/Query ∧ ¬Summary      0.01
¬Topic/Query ∧ Summary       0.001
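The weight table can be read as a per-term lookup that is summed over a sentence. The sketch below assumes that reading; the function names and the way punctuation is detected are illustrative choices, and only the five weights come from the slide.

```python
def term_weight(term, query_terms, summary_terms, stop_words):
    """Weight of one term, following the feature/weight table."""
    # Stop words and pure punctuation tokens contribute nothing.
    if term in stop_words or not any(ch.isalnum() for ch in term):
        return 0.0
    in_query = term in query_terms
    in_summary = term in summary_terms
    if in_query:
        return 0.5 if in_summary else 1.0
    return 0.001 if in_summary else 0.01


def weighted_tf(sentence_terms, query_terms, summary_terms, stop_words):
    """Sum of term weights over every token in the sentence."""
    return sum(
        term_weight(term, query_terms, summary_terms, stop_words)
        for term in sentence_terms
    )


total = weighted_tf(
    ["the", "murder", "rate", "fell", ","],
    query_terms={"murder", "rate"},
    summary_terms={"rate", "fell"},
    stop_words={"the"},
)
```

Here "murder" scores 1, "rate" scores 0.5 because it is already in the summary, and "fell" scores 0.001, so topic terms that are new to the summary dominate the total.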

Sentence Selection • Clustering • Oracle clustering tool • K-means • 1000 iterations • removed determiners, prepositions, etc. • Favor Sentences from • Different clusters • Popular clusters, i.e., lots of sentences • How representative the sentence is of the cluster

Sentence Selection – Evaluation • [Chart: ROUGE-1 score]

Sentence Selection – Evaluation

Official DUC 2007 Evaluation • UNC-CH = System 22 • Automatic Evaluation • ROUGE-2 score 0.10329 (13th) • Manual Evaluation • Responsiveness = 2.956 (7th) • Linguistic Quality = 2.987 (24th)

What we have learned so far • Sentence selection • Optimal Strategy: weighted term frequency / sentence length * cluster weight • Clustering really helps • Lexical simplification • Rework sub-sentences • Pronoun resolution • Query expansion had negligible effect
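The "optimal strategy" formula can be sketched as a ranking function. This assumes the expression is evaluated left to right, that is (weighted term frequency / sentence length) * cluster weight. The `Candidate` type and the sample values are hypothetical, and how the cluster weight itself is derived is not specified in the slides.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    """A candidate sentence with precomputed features (hypothetical type)."""
    text: str
    weighted_tf: float
    length: int            # sentence length in terms
    cluster_weight: float  # assumed to be supplied by the clustering step


def score(candidate):
    """(weighted term frequency / sentence length) * cluster weight."""
    if candidate.length == 0:
        return 0.0
    return candidate.weighted_tf / candidate.length * candidate.cluster_weight


def rank(candidates):
    """Candidates ordered from highest to lowest score."""
    return sorted(candidates, key=score, reverse=True)


ranked = rank([
    Candidate("A", weighted_tf=3.0, length=10, cluster_weight=1.0),
    Candidate("B", weighted_tf=2.0, length=5, cluster_weight=0.5),
    Candidate("C", weighted_tf=4.0, length=20, cluster_weight=2.0),
])
```

Dividing by length keeps long sentences from winning on raw term counts alone, and the cluster weight lets a sentence from a favored cluster outrank a denser sentence from a less important one.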

Next Steps • Alternative Query Expansion • Error analysis of medical questions underway • Concept representation • Unified Medical Language System (UMLS) • Tune sentence selection strategy • Lexical simplification • Rework sub-sentences • Add basic pronoun resolution • Sentence Re-Ordering • Combine with lexical simplification

Acknowledgements • The organizers for running this conference and providing manual summaries • Previous DUC paper authors for making their system designs explicit • Monica Sanchez and Stephanie Haas for earlier discussions • Thom Hailey, Scott Krauss and Toshiba Burns-Johnson for annotating queries
