
Modeling Novelty and feature combination using Support Vector Regression for Update Summarization



Presentation Transcript


  1. Modeling Novelty and Feature Combination using Support Vector Regression for Update Summarization
  Praveen Bysani, Vijay Yaram, Vasudeva Varma
  Search and Information Extraction Lab, LTRC, IIIT Hyderabad

  2. Outline
  • Text Summarization
  • Update Summarization
  • Support Vector Regression
  • Sentence Scoring Features
  • Novelty Factor (NF)
  • Experimental Results

  3. Text Summarization
  • Condensing a piece of text while retaining its important information
  • Types of summarization:
    • Extractive vs. Abstractive
    • Single-Document vs. Multi-Document
    • Query-Focused vs. Query-Independent
    • Personalized vs. Generic
  • Focus here: extractive multi-document summarization
  • Dynamic summarization, update summarization

  4. Update Summarization
  • An emerging area in summarization
  • Summarization with a sense of prior knowledge
  • Challenging: detect information that is both relevant and novel
  • Practical use: monitoring changes in temporally evolving topics, especially newswire

  5. An Example
  Topic: YSR Chopper Missing
  • A helicopter carrying Andhra Pradesh chief minister Y S Rajashekhar Reddy, two of his staff and two pilots went missing in pouring rain Wednesday morning over the Naxal- and tiger-infested Nalamalla forests, and with no contact until early Thursday, experts and officials feared the worst. Multiple agencies of the state launched a massive hunt for possible wreckage in the desolate terrain. Apart from Reddy, the chopper was carrying principal secretary to the CM S Subrahmanyam and YSR's chief security officer A S C Wesley.
  • Andhra Pradesh chief minister Y S Rajasekhara Reddy has died in an air crash. The bodies of 60-year-old Reddy, his special secretary P Subramanyam, chief security officer A S C Wesley, pilot Group Captain S K Bhatia and co-pilot M S Reddy were found on Rudrakonda Hill, 40 nautical miles east of here, beside the mangled remains of the helicopter. The central leadership of the Congress is understood to have cleared the name of Andhra Pradesh finance minister K Rosaiah as the caretaker CM of the state.

  6. Sentence Scorers
  [Pipeline diagram: documents (doc 1 … doc n) → Pre-Processing → sentences → sentence scorers (feature 1 … feature n) → Ranker → ranked set of sentences → Summary Generator → summary]

  7. Sentence Ranking
  • 4 stages of sentence-extractive summarization:
    • Pre-Processing
    • Sentence Scoring
    • Sentence Ranking
    • Summary Generation
  • Scores from features are usually weighted manually to compute sentence rank
  • Instead, use a machine learning algorithm to estimate rank from features

  8. Support Vector Regression (SVR)
  • Regression analysis: modeling values of a dependent variable from one or more independent variables
  • Regression is a function Y = f(X, β)
    • X: the independent variables
    • Y: the dependent variable
    • β: the unknown parameters
  • Regression using support vectors is Support Vector Regression (SVR)
  • Here: estimate sentence rank (dependent variable) from scoring features (independent variables) through SVR
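The regression setup above can be sketched in a few lines. To keep the example dependency-free, ordinary least squares (fit by stochastic gradient descent) stands in for SVR; in practice a library implementation such as scikit-learn's `sklearn.svm.SVR` would replace `fit_linear`. The feature values and targets below are invented for illustration.

```python
# Learn weights beta so that sentence rank y ≈ X·beta, where each row of X
# holds a sentence's scoring-feature values. Least squares stands in for SVR.

def fit_linear(X, y, lr=0.01, epochs=5000):
    """Stochastic-gradient least squares over feature rows X and targets y."""
    beta = [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = sum(b * x for b, x in zip(beta, xi)) - yi
            for j, xj in enumerate(xi):
                beta[j] -= lr * err * xj
    return beta

def predict(beta, x):
    return sum(b * xi for b, xi in zip(beta, x))

# Invented training rows: [SL1, SFS, NF] feature scores → sentence rank.
X_train = [[0.997, 0.6, 0.8], [0.004, 0.2, 0.1],
           [0.998, 0.5, 0.7], [0.005, 0.1, 0.2]]
y_train = [0.9, 0.1, 0.8, 0.15]

beta = fit_linear(X_train, y_train)
# Rank sentences by predicted score, highest first.
ranked = sorted(X_train, key=lambda x: predict(beta, x), reverse=True)
```

The point of replacing manual weighting is exactly this: `beta` is estimated from data rather than tuned by hand.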

  9. Estimating Sentence Rank
  • ROUGE: a recall-oriented metric that evaluates a summary by its word overlap with model summaries
  • ROUGE-2 and ROUGE-SU4 correlate highly with human evaluation
  • The rank i_s of a sentence s is its ROUGE-2 score against the model summaries, where |Bigram_m ∩ Bigram_s| is the number of bigrams shared by model m and sentence s

  10. Oracle Summaries
  • Each oracle summary is the best summary that can be generated by any sentence-extractive summarization system
  • Sentences are ranked using the ROUGE-2-based score described above
  • Oracle summaries depict the gap and the scope for improvement

  11. Sentence Scoring Features
  Sentence position is a popular and well-studied feature.
  • Sentence Location 1 (SL1)
    • The first 3 sentences of a document contain the most informative content (also borne out by analysis of the oracle summaries)
    • The score of a sentence s at position n in document d is
      Score(s_nd) = 1 − n/1000 if n ≤ 3, and n/1000 otherwise
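The SL1 scoring rule on this slide is directly executable: the first three sentences get a high score that decays slightly with position, and later sentences get a small score that grows with position.

```python
# SL1 positional feature from the slide, for a sentence at 1-based position n.

def sl1_score(n):
    return 1 - n / 1000 if n <= 3 else n / 1000

print([sl1_score(n) for n in (1, 2, 3, 4, 10)])
```

Positions 1-3 score near 1.0 (0.999, 0.998, 0.997), while positions 4 and beyond score near 0 (0.004, 0.01, …), which is the top-of-document bias the next slide contrasts with SL2.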

  12. Sentence Location 2 (SL2)
  • Uses the positional index of the sentence itself as the value: Score(s_nd) = n
  • The model learns the optimum sentence position for the genre
  • Unlike SL1, not biased toward the top or bottom of the document
  • Sentence Frequency Score (SFS)
    • Ratio of the number of sentences in which a word occurs to the total number of sentences in the cluster
    • The SFS of a word w is SFS(w) = |{s ∈ S : w ∈ s}| / |S|, where S is the set of sentences in the cluster
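A sketch of the Sentence Frequency Score: for each word, the fraction of sentences in the cluster that contain it. Scoring a sentence by the average SFS of its words is an assumption here, mirroring the averaging the DFS slide describes.

```python
# SFS(w) = (# sentences containing w) / (total # sentences in the cluster).

def sfs_table(sentences):
    tokenized = [set(s.lower().split()) for s in sentences]
    n = len(tokenized)
    vocab = set().union(*tokenized)
    return {w: sum(w in s for s in tokenized) / n for w in vocab}

def sentence_sfs(sentence, table):
    words = sentence.lower().split()
    return sum(table.get(w, 0.0) for w in words) / len(words)

cluster = ["the helicopter went missing",
           "the helicopter crashed",
           "rescue teams searched the forest"]
table = sfs_table(cluster)
print(table["the"], table["helicopter"])  # "the" is in all 3, "helicopter" in 2 of 3
```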

  13. TF–IDF
  • A popular measure of document relevance in IR
  • The same analogy is used to find the relevance of a sentence
  • Term frequency TF_ij of term t_i in document d_j (standard definition): TF_ij = n_ij / Σ_k n_kj, where n_ij is the number of occurrences of t_i in d_j
  • Inverse document frequency of term t_i (standard definition): IDF_i = log(|D| / |{d : t_i ∈ d}|)
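The slide does not show which TF-IDF variant the system uses, so this sketch uses the common raw-frequency TF and logarithmic IDF stated above.

```python
import math

# Standard TF-IDF over a small document cluster.

def tf(term, doc):
    words = doc.lower().split()
    return words.count(term) / len(words)

def idf(term, docs):
    df = sum(term in d.lower().split() for d in docs)
    return math.log(len(docs) / df) if df else 0.0

docs = ["the helicopter went missing",
        "the helicopter crashed",
        "rescue teams searched the forest"]
# "helicopter" is 1 of 4 words in docs[0], and appears in 2 of 3 documents.
print(tf("helicopter", docs[0]), idf("helicopter", docs))
```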

  14. Document Frequency Score (DFS)
  • The inverse of IDF: the ratio of the number of documents in which a term occurs to the total number of documents, DFS(w) = |{d : w ∈ d}| / |D|
  • The average DFS of the words in a sentence is its score
  • Probabilistic Hyperspace Analogue to Language (PHAL) and Kullback-Leibler divergence (KL) are also available as features
  • The baseline summarizer generates a summary by picking the first 100 words of the last document
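DFS is the document-level complement of IDF, and it is small enough to sketch directly from the slide's description: the fraction of cluster documents containing a word, averaged over a sentence's words.

```python
# DFS(w) = (# documents containing w) / (total # documents in the cluster);
# a sentence's score is the average DFS of its words.

def dfs(word, docs):
    return sum(word in d.lower().split() for d in docs) / len(docs)

def sentence_dfs(sentence, docs):
    words = sentence.lower().split()
    return sum(dfs(w, docs) for w in words) / len(words)

docs = ["the helicopter went missing",
        "the helicopter crashed",
        "rescue teams searched the forest"]
# "the" is in 3/3 docs, "helicopter" in 2/3, so the sentence averages 5/6.
print(sentence_dfs("the helicopter", docs))
```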

  15. Novelty Factor
  • An ad-hoc feature for update summarization
  • Consider a stream of articles published on a topic over a time period T
  • All articles published from time 0 to t are considered previously read (prior knowledge)
  • Articles published from t to T are new and contain the new information
  • Let t_d denote the chronological time stamp of document d

  16. Novelty Factor
  The NF of a word w is NF(w) = |nd_t| / (|pd_t| + |D|), where
  • nd_t = { d : w ∈ d and t_d > t }
  • pd_t = { d : w ∈ d and t_d < t }
  • D = { d : t_d > t }
  • |nd_t| signifies the importance of the term in the new cluster
  • |pd_t| penalizes any term that occurs frequently in previous clusters
  • |D| is for smoothing
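The Novelty Factor can be sketched directly from the slide's description. The formula used here, NF(w) = |nd_t| / (|pd_t| + |D|), is a reconstruction from that description (new-cluster occurrences rewarded, prior-cluster occurrences penalized, smoothed by the new-cluster size), so treat it as illustrative.

```python
# Novelty Factor over an old (previously read) and a new document cluster.

def novelty_factor(word, old_docs, new_docs):
    nd = sum(word in d.lower().split() for d in new_docs)  # |nd_t|
    pd = sum(word in d.lower().split() for d in old_docs)  # |pd_t|
    return nd / (pd + len(new_docs))                       # |D| smooths

old = ["the helicopter went missing", "search teams were deployed"]
new = ["the chief minister died in the crash", "the crash site was found"]
# "crash" appears only in the new cluster; "helicopter" only in the old one.
print(novelty_factor("crash", old, new), novelty_factor("helicopter", old, new))
# → 1.0 0.0
```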

  17. Training Data
  • DUC 2007 Main task data for training
    • 45 topics
    • Each topic has 25 documents and a query
    • 4 associated model summaries of 250 words each
  • DUC 2007 Update task data for training the update-specific features
    • 10 topics
    • Each topic divided into clusters A, B, C in chronological order, with 10, 8 and 7 documents respectively
    • 4 associated model summaries of 100 words each

  18. Test Dataset
  • TAC 2008 Update Summarization data for testing
  • 48 topics
  • Each topic divided into clusters A and B with 10 documents each
  • The summary for cluster A is a normal summary; the summary for cluster B is an update summary

  19. Experiments and Results
  • For individual features

  20. Experiments and Results
  • For combinations of features
  • Non-complementing features for KL

  21. At Cluster Level
  • For Cluster A

  22. At Cluster Level
  • For Cluster B

  23. Discussion
  • The quality of the training data for NF is poor compared to the other features, which explains its lower performance relative to DFS
  • There is a huge gap between the oracle summaries and the best peers

  24. Contributions
  • An ad-hoc feature (Novelty Factor) for modeling novelty along with relevance
  • Analysis of the effect of various feature combinations on summary quality using Support Vector Regression
  • Depiction of the scope for improvement in summarization

  25. Thank You
