
Predicting sentence specificity, with applications to news summarization

Ani Nenkova, joint work with Annie Louis

University of Pennsylvania

Motivation
  • A well-written text is a mix of general statements and sentences providing details
  • In information retrieval: find relevant and well-written documents
  • Writing support: visualize general and specific areas
Supervised sentence-level classifier for general/specific
  • Training data
    • Used existing annotations for discourse relations from PDTB
  • Features
    • Lexical, language model, syntax, etc
  • Testing data
    • Annotators judged more sentences
  • Applications to analysis of summarization output
    • Automatic summaries are too specific, and are worse for it
Training data
  • Penn Discourse Treebank (PDTB)
Penn Discourse Treebank (PDTB)
  • Largest annotated corpus of explicit and implicit discourse relations
  • 1 million words of Wall Street Journal
  • Arguments – spans linked by a relation (Arg1, Arg2)
  • Sense – semantics of the relation (3 level hierarchy)

I love ice-cream but I hate chocolates. (explicit relation, signaled by a discourse connective)

I came late. I missed the train. (implicit relation between adjacent sentences in the same paragraph)

distribution of relations between adjacent sentences
Distribution of relations between adjacent sentences

(Adjacent sentences linked only by a shared entity are not considered to hold a true discourse relation.)

Training data from PDTB Expansions

[Hierarchy of PDTB Expansion relations, with example connectives:]

  • Conjunction [Also, Further]
  • Restatement [Specifically, Overall], with subtypes Specification, Generalization, Equivalence
  • Instantiation [For example]
  • List [And]
  • Alternative [Or, Instead], with subtypes Conjunctive, Disjunctive, Chosen alternative
  • Exception [except]

Instantiation example

The 40-year-old Mr. Murakami is a publishing sensation in Japan.

A more recent novel, “Norwegian wood”, has sold more than forty million copies since Kodansha published it in 1987.

Examples of general/specific sentences
  • Despite recent declines in yields, investors continue to pour cash into money funds.

Assets of the 400 taxable funds grew by $1.5 billion during the latest week, to $352 billion. [Instantiation]

  • By most measures, the nation’s industrial sector is now growing very slowly—if at all.

Factory payrolls fell in September. [Specification]

Experimental setup—Two classifiers
  • Instantiations-based
    • Arg1: General, Arg2: specific
    • 1403 examples
  • Specification-based (Restatement subtype)
    • Arg1: General, Arg2: specific
    • 2370 examples
  • Implicit relations only
  • 50% baseline accuracy; 10-fold cross-validation; logistic regression
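
A minimal sketch of this setup, assuming scikit-learn: logistic regression evaluated with 10-fold cross-validation against the 50% majority baseline. The feature extractor and the data loading are placeholders for illustration, not the authors' exact pipeline.

```python
# Minimal sketch of the setup above: logistic regression with 10-fold
# cross-validation on a balanced general/specific training set.
# extract_features is a placeholder for the features described on the
# following slides.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def evaluate_classifier(sentences, labels, extract_features):
    """sentences: list of str; labels: 0 = general, 1 = specific."""
    X = np.array([extract_features(s) for s in sentences])
    y = np.array(labels)
    clf = LogisticRegression(max_iter=1000)
    scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
    return scores.mean()  # compare against the 50% majority baseline
```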
Features
  • Developed from a small development set
    • 10 pairs of specification
    • 10 pairs of instantiation
Features for general vs specific
  • Sentence length: no. of tokens, no. of nouns
    • Expected general sentences to be shorter
  • Polarity: no. of positive/negative/polarity words, also normalized by length
    • General Inquirer
    • MPQA subjectivity lexicon
    • In dev set, sentences with strong opinion are general
  • Language models: unigram/bigram/trigram probability & perplexity
    • Trained on one year of New York Times news
    • In dev set, general sentences contained unexpected, catchy phrases
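
A rough sketch of the length and polarity features listed above (token count, noun count, counts of polarity words, and a length-normalized variant), assuming NLTK for tokenization and POS tagging and plain word sets standing in for the General Inquirer / MPQA lexicons.

```python
# Sketch of the surface and polarity features above. NLTK tokenization/POS
# tagging and simple word-set lexicons are assumptions for illustration.
import nltk

def surface_polarity_features(sentence, positive_words, negative_words):
    tokens = nltk.word_tokenize(sentence)
    tags = nltk.pos_tag(tokens)
    n_tokens = len(tokens)
    n_nouns = sum(1 for _, tag in tags if tag.startswith("NN"))
    n_pos = sum(1 for w in tokens if w.lower() in positive_words)
    n_neg = sum(1 for w in tokens if w.lower() in negative_words)
    return {
        "n_tokens": n_tokens,                                 # sentence length
        "n_nouns": n_nouns,                                   # no. of nouns
        "n_polarity": n_pos + n_neg,                          # polarity word count
        "polarity_norm": (n_pos + n_neg) / max(n_tokens, 1),  # normalized by length
    }
```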
Features for general vs specific (continued)
  • Specificity
    • min/ max/ avg IDF
    • WordNet: hypernym distance to root for nouns and verbs—min/ max/ avg
  • Syntax: no. of adjectives, adverbs, ADJPs, ADVPs, verb phrases; avg VP length

  • Entities: Numbers, proper names, $ sign, plural nouns
  • Words: count of each word in the sentence
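
A sketch of the word-specificity features above (min/max/avg IDF and WordNet hypernym distance to the root), using NLTK's WordNet interface. Taking the first noun synset and its minimum depth is a simplifying assumption.

```python
# Sketch of the IDF and WordNet hypernym-distance features above.
# Using the first noun synset and its minimum depth to the root is a
# simplifying assumption, not necessarily the authors' exact procedure.
import math
from nltk.corpus import wordnet as wn

def idf_stats(tokens, doc_freq, n_docs):
    """min/max/avg IDF of the tokens, given corpus document frequencies."""
    idfs = [math.log(n_docs / (1 + doc_freq.get(t.lower(), 0))) for t in tokens]
    if not idfs:
        return 0.0, 0.0, 0.0
    return min(idfs), max(idfs), sum(idfs) / len(idfs)

def hypernym_depths(nouns):
    """Hypernym distance to the WordNet root for each noun."""
    depths = []
    for noun in nouns:
        synsets = wn.synsets(noun, pos=wn.NOUN)
        if synsets:
            depths.append(synsets[0].min_depth())
    return depths
```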
Instantiation-based classifier gave better performance
  • Best individual feature set: words (74.8%)
  • Non-lexical features are equally good: 74.1%
  • Little improvement from combining all features: 75.8%
Feature analysis
  • Words with highest weight [Instantiation-based]
    • General: number, but, also, however, officials, some, what, lot, prices, business, were…
    • Specific: one, a, to, co, I, called, we, could, get…
  • General sentences are characterized by
    • Plural nouns
    • Dollar sign
    • Lower probability
    • More polarity words and more adjectives and adverbs
  • Specific sentences are characterized by
    • Numbers and names
More testing data
  • Direct judgments of WSJ and AP sentences on Amazon Mechanical Turk
  • ~ 600 sentences
  • 5 judgments per sentence

In WSJ, more sentences are general (55%)

In AP, more sentences are specific (60%)

Why the difference between Instantiation and Specification?
  • Some of the annotations were on sentences from our initial training data
  • Instantiation has more detectable properties associated with Arg1 and Arg2

Accuracy of classifier on new data

Non-lexical features work better on this data

Performance is almost the same as in cross validation

Classifier is more accurate on examples where people agree

Classifier confidence correlates with annotator agreement

Application of our classifier to full articles


  • Distribution of general/specific sentences in news documents
  • Can the classifier detect differences between general and specific summaries written by people?
  • Do summaries have more general/specific content compared to input? How does it impact summary quality?
  • Compare different types of summaries
    • Human abstracts: written from scratch
    • Human extracts: select sentences as a whole from inputs
    • System summaries: all extracts
Example general and specific predictions
  • Seismologists said the volcano had plenty of built-up magma and even more severe eruptions could come later. [general]
  • The volcano's activity -- measured by seismometers detecting slight earthquakes in its molten rock plumbing system -- is increasing in a way that suggests a large eruption is imminent, Lipman said. [specific]

Example predictions

The novel, a story of a Scottish low-life narrated largely in Glaswegian dialect, is unlikely to prove a popular choice with booksellers, who have damned all six books shortlisted for the prize as boring, elitist and, worst of all, unsaleable. [specific]

The Booker prize has, in its 26-year history, always provoked controversy. [general]

Computing specificity for a text
  • Sentences in summary are of varying length, so we compute a score on word level
  • “Average specificity of words in the text”

[Figure: each sentence's classifier confidence for the specific class is assigned to all of its tokens, e.g. S1 (w11 w12 w13) = 0.68, S2 (w21 w22 w23) = 0.23, S3 (w31 w32 w33) = 0.81; the specificity score of the text is the average score over all tokens.]
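
A small sketch of this word-level score: each token inherits its sentence's confidence for the specific class, and the text score is the average over all tokens, i.e. a length-weighted average of sentence confidences. The sentence/token values are the hypothetical ones from the figure.

```python
# Word-level specificity score from the figure above: every token gets its
# sentence's confidence for the "specific" class, and the text score is the
# average over all tokens (a length-weighted average of sentence confidences).
def text_specificity(sentences, specific_confidences):
    """sentences: list of token lists; specific_confidences: one value per sentence."""
    token_scores = [conf
                    for tokens, conf in zip(sentences, specific_confidences)
                    for _ in tokens]
    return sum(token_scores) / max(len(token_scores), 1)

# With the values from the figure (three 3-token sentences scored
# 0.68, 0.23 and 0.81), the text-level score is about 0.57.
score = text_specificity([["w11", "w12", "w13"],
                          ["w21", "w22", "w23"],
                          ["w31", "w32", "w33"]],
                         [0.68, 0.23, 0.81])
```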

50 specific and general human summaries

No significant differences in specificity of the input

Significant differences in specificity of summaries in the two categories

Our classifier is able to detect the differences

Data: DUC 2002
  • Generic multidocument summarization task
  • 59 input sets
    • 5 to 15 news documents
  • 3 types of summaries
    • 200 words
    • Manually assigned content and linguistic quality scores

    • 1. Human abstracts (2 assessors × 59 input sets)
    • 2. Human extracts (2 assessors × 59 input sets)
    • 3. System extracts (9 systems × 59 input sets)

Specificity analysis of summaries
  • More general content is preferred in abstracts
  • Simply the process of extraction makes summaries more specific
  • System summaries are overly specific

[Figure: average specificity. Human abstracts (0.62) < inputs (0.65) < human extracts (0.72) < system extracts (0.74)]

Histogram of specificity scores
  • Human summaries are more general
  • Is this aspect related to summary quality?
Analysis of ‘system summaries’: specificity and quality
  • Content quality
    • Importance of content included in the summary
  • Linguistic quality
    • How well-written the summary is perceived to be
  • Quality of general/specific summaries
    • When a summary is intended to be general or specific
Relationship to content selection scores
  • Coverage score: closeness to human summary
    • Clause level comparison
  • For system summaries
    • Correlation between coverage score and average specificity
      • -0.16*, p-value = 0.0006
    • Less specific ~ better content
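
How such a correlation could be checked, as a sketch: SciPy's Pearson correlation is an assumed choice of test, and the input lists are hypothetical per-summary scores.

```python
# Sketch of checking the coverage-vs-specificity relationship reported above;
# scipy's Pearson correlation is an assumed choice of test.
from scipy.stats import pearsonr

def coverage_specificity_correlation(coverage_scores, avg_specificities):
    r, p = pearsonr(coverage_scores, avg_specificities)
    return r, p  # the slides report r = -0.16, p = 0.0006
```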
But the correlation is not very high
  • Specificity is related to realization of content
    • Different from importance of the content
  • Content quality = content importance + appropriate specificity level
  • Content importance: ROUGE scores
    • N-gram overlap of system summary and human summary
    • Standard evaluation of automatic summaries
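
A simplified sketch of ROUGE-2-style bigram recall, for intuition only; the official ROUGE toolkit additionally handles stemming, multiple references and jackknifing.

```python
# Simplified bigram-recall sketch of ROUGE-2; the official toolkit also
# handles stemming, multiple references and jackknifing.
from collections import Counter

def bigrams(tokens):
    return Counter(zip(tokens, tokens[1:]))

def rouge2_recall(system_tokens, reference_tokens):
    sys_bg, ref_bg = bigrams(system_tokens), bigrams(reference_tokens)
    overlap = sum(min(count, sys_bg[bg]) for bg, count in ref_bg.items())
    return overlap / max(sum(ref_bg.values()), 1)
```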
Specificity as one of the predictors
  • Coverage score ~ ROUGE-2 (bigrams) + specificity
  • Linear regression
  • Weights for predictors in the regression model

Is the combination a better predictor than ROUGE alone?
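
A sketch of the regression above (coverage ~ ROUGE-2 + specificity), assuming scikit-learn; standardizing the predictors before comparing their weights is an assumption, not something stated on the slide.

```python
# Sketch of the linear regression above: coverage ~ ROUGE-2 + specificity.
# Standardizing the predictors before comparing weights is an assumption.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

def fit_coverage_model(rouge2, specificity, coverage):
    X = StandardScaler().fit_transform(np.column_stack([rouge2, specificity]))
    model = LinearRegression().fit(X, coverage)
    return model.coef_, model.intercept_  # weights for ROUGE-2 and specificity
```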

2. Specificity and linguistic quality
  • Used different data: TAC 2009
    • DUC 2002 only reported the number of errors
    • These were also specified as a range: 1-5 errors
  • TAC 2009 linguistic quality score
    • Manually judged: scale 1 – 10
    • Combines different aspects
      • coherence, referential clarity, grammaticality, redundancy
What is the avg specificity in different score categories?
  • More general ~ lower score!
    • General content is useful but need proper context!
  • If a summary starts as follows:
    • "We are quite a ways from that, actually."
    • As ice and snow at the poles melt, …
  • Specificity = low, but linguistic quality = 1

Data for analysing generalization operation
  • Aligned pairs of abstract and source sentences conveying the same content
    • Traditional data used for compression experiments
  • Ziff-Davis tree alignment corpus
    • 15964 sentence pairs
    • Any number of deletions, up to 7 substitutions
  • Only 25% of abstract sentences are mapped
    • But beneficial to observe the trends

[Galley & McKeown (2007)]

Generalization operation in human abstracts

One-third of all transformations are specific to general

  • Human abstracts involve a lot of generalization
How do specific sentences get converted to general ones?

Choose long sentences and compress heavily!

  • A measure of generality would be useful to guide compression
    • Currently only importance and grammaticality are used
Use of general sentences in human extracts
  • Details of Maxwell’s death were sketchy.
  • Folksy was an understatement.
  • “Long live democracy!”
  • Instead it sank like the Bismarck.
  • Example use of a general sentence in a summary

With Tower’s qualifications for the job, the nominations should have sailed through with flying colors. [Specific]

Instead it sank like the Bismarck. [General]

Future: can we learn to generate and select general sentences to include in automatic summaries?

Conclusions
  • Built a classifier for general and specific sentences
  • Used existing annotations to do that
  • But tested on new data and task-based evaluation
  • The confidence of the classifier is highly correlated with human agreement
  • Analyzed human and machine summaries
    • Machine summaries are too specific
    • But adding general sentences is difficult because the context has to be right
Further details in
  • Annie Louis and Ani Nenkova. Automatic identification of general and specific sentences by leveraging discourse annotations. Proceedings of IJCNLP, 2011 (to appear).
  • Annie Louis and Ani Nenkova. Text specificity and impact on quality of news summaries. Proceedings of the ACL-HLT Workshop on Monolingual Text-to-Text Generation, 2011.
  • Annie Louis and Ani Nenkova. Creating Local Coherence: An Empirical Assessment. Proceedings of NAACL-HLT, 2010.
Two types of local coherence—Entity & Rhetorical
  • Local coherence: Adjacent sentences in a text flow from one to another
  • Entity – same topic
    • John was hungry. He went to a restaurant.
  • But only 42% of sentence pairs are entity-linked [previous corpus studies]
  • Will core discourse relations connect the non-entity-sharing sentence pairs?
    • Popular hypothesis in prior work
Investigations into text quality
  • The mix of discourse relations in a text is highly predictive of the perceived quality of the text
  • Both implicit and explicit relations are needed to predict text quality
  • Predicting the sense of implicit discourse relations is a very difficult task; most predicted to be “expansion”
  • How is local coherence created?
Joint analysis by combining PDTB and OntoNotes annotations
  • 590 articles
  • Noun phrase coreference from OntoNotes
  • 40 to 50% of sentence pairs do not share entities in articles of different lengths
Example instantiations and list relations
  • Instantiation

The economy is showing signs of weakness, particularly among manufacturers.

Exports, which played a key role in fueling growth over the last two years, seem to have stalled.

  • List

Many of Nasdaq's biggest technology stocks were in the forefront of the rally.

- Microsoft added 2 1/8 to 81 3/4 and Oracle Systems rose 1 1/2 to 23 1/4.

- Intel was up 1 3/8 to 33 3/4.

Overall distribution of sentence pairs among the two coherence devices
  • 30% sentence pairs have no coreference and are in a weak discourse relation (expansion/entrel)
  • We must explore elaboration relations more closely to identify how they create coherence