
Modeling Novelty and feature combination using Support Vector Regression for Update Summarization



Presentation Transcript


  1. Modeling Novelty and Feature Combination using Support Vector Regression for Update Summarization
  Praveen Bysani, Vijay Yaram, Vasudeva Varma
  Search and Information Extraction Lab, LTRC, IIIT Hyderabad

  2. Outline
  • Text Summarization
  • Update Summarization
  • Support Vector Regression
  • Sentence Scoring Features
  • Novelty Factor (NF)
  • Experimental Results

  3. Text Summarization
  • Condensing a piece of text while retaining its important information
  • Types of summarization:
    • Extractive vs. Abstractive
    • Single-Document vs. Multi-Document
    • Query-Focused vs. Query-Independent
    • Personalized vs. Generic
  • Focus here: extractive multi-document summarization
  • Dynamic summarization, update summarization

  4. Update Summarization
  • An emerging area in summarization
  • Summarization with a sense of prior knowledge
  • Challenging: detect information that is both relevant and novel
  • Practical use: monitoring changes in temporally evolving topics, especially newswire

  5. An Example
  Topic: YSR Chopper Missing
  • A helicopter carrying Andhra Pradesh chief minister Y S Rajashekhar Reddy, two of his staff and two pilots went missing in pouring rain Wednesday morning over the Naxal- and tiger-infested Nalamalla forests, and with no contact until early Thursday, experts and officials feared the worst. Multiple agencies of the state launched a massive hunt for possible wreckage in the desolate terrain. Apart from Reddy, the chopper was carrying principal secretary to the CM S Subrahmanyam and YSR's chief security officer A S C Wesley.
  • Andhra Pradesh chief minister Y S Rajasekhara Reddy has died in an air crash. The bodies of 60-year-old Reddy, his special secretary P Subramanyam, chief security officer A S C Wesley, pilot Group Captain S K Bhatia and co-pilot M S Reddy were found on Rudrakonda Hill, 40 nautical miles east of here, beside the mangled remains of the helicopter. The central leadership of the Congress is understood to have cleared the name of Andhra Pradesh finance minister K Rosaiah as the caretaker CM of the state.

  6. Sentence Scorers
  [Pipeline diagram: documents (doc 1 … doc n) → Pre-Processing → sentences → sentence scorers (feature 1 … feature n) → Ranker → ranked set of sentences → Summary Generator → summary]

  7. Sentence Ranking
  • 4 stages of sentence-extractive summarization:
    • Pre-Processing
    • Sentence Scoring
    • Sentence Ranking
    • Summary Generation
  • Scores from features are usually weighted manually to compute sentence rank
  • Instead, use a machine learning algorithm to estimate rank from features

  8. Support Vector Regression (SVR)
  • Regression analysis: modeling values of a dependent variable from one or more independent variables
  • Regression is a function Y = f(X, β)
    • X: the independent variables
    • Y: the dependent variable
    • β: the unknown parameters
  • Regression using support vectors is Support Vector Regression (SVR)
  • Here: estimate sentence rank (dependent variable) from scoring features (independent variables) through SVR
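The regression setup above can be sketched in a few lines. To keep the example dependency-free, ordinary least squares (fit by stochastic gradient descent) stands in for SVR; in practice a library implementation such as scikit-learn's `sklearn.svm.SVR` would replace `fit_linear`. The feature values and targets below are invented for illustration.

```python
# Learn weights beta so that sentence rank y ≈ X·beta, where each row of X
# holds a sentence's scoring-feature values. Least squares stands in for SVR.

def fit_linear(X, y, lr=0.01, epochs=5000):
    """Stochastic-gradient least squares over feature rows X and targets y."""
    beta = [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = sum(b * x for b, x in zip(beta, xi)) - yi
            for j, xj in enumerate(xi):
                beta[j] -= lr * err * xj
    return beta

def predict(beta, x):
    return sum(b * xi for b, xi in zip(beta, x))

# Invented training rows: [SL1, SFS, NF] feature scores → sentence rank.
X_train = [[0.997, 0.6, 0.8], [0.004, 0.2, 0.1],
           [0.998, 0.5, 0.7], [0.005, 0.1, 0.2]]
y_train = [0.9, 0.1, 0.8, 0.15]

beta = fit_linear(X_train, y_train)
# Rank sentences by predicted score, highest first.
ranked = sorted(X_train, key=lambda x: predict(beta, x), reverse=True)
```

The point of replacing manual weighting is exactly this: `beta` is estimated from data rather than tuned by hand.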

  9. Estimating Sentence Rank
  • ROUGE: a recall-oriented metric that evaluates a summary by its word overlap with model summaries
  • ROUGE-2 and ROUGE-SU4 correlate highly with human evaluation
  • The rank i_s of a sentence s is its ROUGE-2 score against the model summaries, where |Bigram_m ∩ Bigram_s| is the number of bigrams shared by model m and sentence s

  10. Oracle Summaries
  • Each oracle summary is the best summary that can be generated by any sentence-extractive summarization system
  • Sentences are ranked using the ROUGE-2-based score described above
  • Oracle summaries depict the gap and the scope for improvement

  11. Sentence Scoring Features
  Sentence position is a popular and well-studied feature.
  • Sentence Location 1 (SL1)
    • The first 3 sentences of a document contain the most informative content (also borne out by analysis of the oracle summaries)
    • The score of a sentence s at position n in document d is
      Score(s_nd) = 1 − n/1000 if n ≤ 3, and n/1000 otherwise
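The SL1 scoring rule on this slide is directly executable: the first three sentences get a high score that decays slightly with position, and later sentences get a small score that grows with position.

```python
# SL1 positional feature from the slide, for a sentence at 1-based position n.

def sl1_score(n):
    return 1 - n / 1000 if n <= 3 else n / 1000

print([sl1_score(n) for n in (1, 2, 3, 4, 10)])
```

Positions 1-3 score near 1.0 (0.999, 0.998, 0.997), while positions 4 and beyond score near 0 (0.004, 0.01, …), which is the top-of-document bias the next slide contrasts with SL2.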

  12. Sentence Location 2 (SL2)
  • Uses the positional index of the sentence itself as the value: Score(s_nd) = n
  • The model learns the optimum sentence position for the genre
  • Unlike SL1, not biased toward the top or bottom of the document
  • Sentence Frequency Score (SFS)
    • Ratio of the number of sentences in which a word occurs to the total number of sentences in the cluster
    • The SFS of a word w is SFS(w) = |{s ∈ S : w ∈ s}| / |S|, where S is the set of sentences in the cluster
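A sketch of the Sentence Frequency Score: for each word, the fraction of sentences in the cluster that contain it. Scoring a sentence by the average SFS of its words is an assumption here, mirroring the averaging the DFS slide describes.

```python
# SFS(w) = (# sentences containing w) / (total # sentences in the cluster).

def sfs_table(sentences):
    tokenized = [set(s.lower().split()) for s in sentences]
    n = len(tokenized)
    vocab = set().union(*tokenized)
    return {w: sum(w in s for s in tokenized) / n for w in vocab}

def sentence_sfs(sentence, table):
    words = sentence.lower().split()
    return sum(table.get(w, 0.0) for w in words) / len(words)

cluster = ["the helicopter went missing",
           "the helicopter crashed",
           "rescue teams searched the forest"]
table = sfs_table(cluster)
print(table["the"], table["helicopter"])  # "the" is in all 3, "helicopter" in 2 of 3
```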

  13. TF–IDF
  • A popular measure of document relevance in IR
  • The same analogy is used to find the relevance of a sentence
  • Term frequency TF_ij of term t_i in document d_j (standard definition): TF_ij = n_ij / Σ_k n_kj, where n_ij is the number of occurrences of t_i in d_j
  • Inverse document frequency of term t_i (standard definition): IDF_i = log(|D| / |{d : t_i ∈ d}|)
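The slide does not show which TF-IDF variant the system uses, so this sketch uses the common raw-frequency TF and logarithmic IDF stated above.

```python
import math

# Standard TF-IDF over a small document cluster.

def tf(term, doc):
    words = doc.lower().split()
    return words.count(term) / len(words)

def idf(term, docs):
    df = sum(term in d.lower().split() for d in docs)
    return math.log(len(docs) / df) if df else 0.0

docs = ["the helicopter went missing",
        "the helicopter crashed",
        "rescue teams searched the forest"]
# "helicopter" is 1 of 4 words in docs[0], and appears in 2 of 3 documents.
print(tf("helicopter", docs[0]), idf("helicopter", docs))
```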

  14. Document Frequency Score (DFS)
  • The inverse of IDF: the ratio of the number of documents in which a term occurs to the total number of documents, DFS(w) = |{d : w ∈ d}| / |D|
  • The average DFS of the words in a sentence is its score
  • Probabilistic Hyperspace Analogue to Language (PHAL) and Kullback-Leibler divergence (KL) are also available as features
  • The baseline summarizer generates a summary by picking the first 100 words of the last document
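DFS is the document-level complement of IDF, and it is small enough to sketch directly from the slide's description: the fraction of cluster documents containing a word, averaged over a sentence's words.

```python
# DFS(w) = (# documents containing w) / (total # documents in the cluster);
# a sentence's score is the average DFS of its words.

def dfs(word, docs):
    return sum(word in d.lower().split() for d in docs) / len(docs)

def sentence_dfs(sentence, docs):
    words = sentence.lower().split()
    return sum(dfs(w, docs) for w in words) / len(words)

docs = ["the helicopter went missing",
        "the helicopter crashed",
        "rescue teams searched the forest"]
# "the" is in 3/3 docs, "helicopter" in 2/3, so the sentence averages 5/6.
print(sentence_dfs("the helicopter", docs))
```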

  15. Novelty Factor
  • An ad-hoc feature for update summarization
  • Consider a stream of articles published on a topic over a time period T
  • All articles published from time 0 to t are considered previously read (prior knowledge)
  • Articles published from t to T are new and contain the new information
  • Let t_d denote the chronological time stamp of document d

  16. Novelty Factor
  The NF of a word w is NF(w) = |nd_t| / (|pd_t| + |D|), where
  • nd_t = { d : w ∈ d and t_d > t }
  • pd_t = { d : w ∈ d and t_d < t }
  • D = { d : t_d > t }
  • |nd_t| signifies the importance of the term in the new cluster
  • |pd_t| penalizes any term that occurs frequently in previous clusters
  • |D| is for smoothing
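The Novelty Factor can be sketched directly from the slide's description. The formula used here, NF(w) = |nd_t| / (|pd_t| + |D|), is a reconstruction from that description (new-cluster occurrences rewarded, prior-cluster occurrences penalized, smoothed by the new-cluster size), so treat it as illustrative.

```python
# Novelty Factor over an old (previously read) and a new document cluster.

def novelty_factor(word, old_docs, new_docs):
    nd = sum(word in d.lower().split() for d in new_docs)  # |nd_t|
    pd = sum(word in d.lower().split() for d in old_docs)  # |pd_t|
    return nd / (pd + len(new_docs))                       # |D| smooths

old = ["the helicopter went missing", "search teams were deployed"]
new = ["the chief minister died in the crash", "the crash site was found"]
# "crash" appears only in the new cluster; "helicopter" only in the old one.
print(novelty_factor("crash", old, new), novelty_factor("helicopter", old, new))
# → 1.0 0.0
```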

  17. Training Data
  • DUC 2007 Main task data for training
    • 45 topics
    • Each topic has 25 documents and a query
    • 4 associated model summaries of 250 words each
  • DUC 2007 Update task data for training the update-specific features
    • 10 topics
    • Each topic divided into clusters A, B, C in chronological order, with 10, 8 and 7 documents respectively
    • 4 associated model summaries of 100 words each

  18. Test Dataset
  • TAC 2008 Update Summarization data for testing
  • 48 topics
  • Each topic divided into clusters A and B with 10 documents each
  • The summary for cluster A is a normal summary; the summary for cluster B is an update summary

  19. Experiments and Results
  • For individual features

  20. Experiments and Results
  • For combinations of features
  • Non-complementing features for KL

  21. At Cluster Level
  • For Cluster A

  22. At Cluster Level
  • For Cluster B

  23. Discussion
  • The quality of the training data for NF is poor compared to the other features, which explains its lower performance relative to DFS
  • There is a huge gap between the oracle summaries and the best peers

  24. Contributions
  • An ad-hoc feature (Novelty Factor) for modeling novelty along with relevance
  • Analysis of the effect of various feature combinations on summary quality using Support Vector Regression
  • Depiction of the scope for improvement in summarization

  25. Thank You
