
Concept based Multi-Document Text Summarization


Presentation Transcript


  1. Concept based Multi-Document Text Summarization By: Asef Poormasoomi Supervisor: Dr. Kahani Autumn 2010 Ferdowsi University of Mashhad

  2. Introduction • summary: a brief but accurate representation of the contents of a document

  3. Motivation • Is this the best we can do? • Abstracts for scientific and other articles • News summarization (mostly multi-document summarization) • Classification of articles and other written data • Web pages for search engines • Web access from PDAs and cell phones • Question answering and data gathering

  4. Genres • Extract vs. abstract • lists fragments of text vs. re-phrases content coherently • example: "He ate a banana, an orange, and an apple" => "He ate fruit" • Generic vs. query-oriented • provides the author's view vs. reflects the user's interest • example: question answering systems • Personal vs. general • tailored to the reader's prior knowledge vs. written for a general audience • Single-document vs. multi-document source • based on one text vs. fuses together many texts • Input • text, video, image, map

  5. Methods • Statistical scoring methods (pseudo) • Higher semantic/syntactic structures • Network (graph) based methods • Semantic based methods (LSA, ontology, WordNet) • Other methods (rhetorical analysis, lexical chains, co-reference chains) • AI methods

  6. Statistical scoring (pseudo) • General method: • score each entity (sentence, word); • combine scores; • choose the best sentence(s) (a minimal sketch follows below) • Scoring techniques: • Word frequencies throughout the text (Luhn 58) • Position in the text (Edmundson 69, Lin & Hovy 97) • Title method (Edmundson 69) • Cue phrases in sentences (Edmundson 69) • Bayesian classifier (Kupiec et al. 95)
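
A minimal sketch of the frequency-scoring idea above, in the spirit of Luhn 58 but not his exact formulation; the stop-word list and the choice of returning two sentences are illustrative assumptions:

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "was", "it"}  # toy list

def frequency_summary(text, n_sentences=2):
    """Score each sentence by the average frequency of its content words,
    then return the top-ranked sentences in their original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(w for w in re.findall(r"[a-z']+", text.lower())
                   if w not in STOP_WORDS)

    def score(sentence):
        tokens = [w for w in re.findall(r"[a-z']+", sentence.lower())
                  if w not in STOP_WORDS]
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    top = set(sorted(sentences, key=score, reverse=True)[:n_sentences])
    return [s for s in sentences if s in top]
```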

  7. Methods • Statistical scoring methods • problems: • Synonymy: one concept can be expressed by different words • example: cycle and bicycle refer to the same kind of vehicle • Polysemy: one word can have several meanings • example: cycle could mean life cycle or bicycle (see the WordNet illustration below) • Phrases: a phrase may have a meaning different from that of the words in it • example: an alleged murderer is not a murderer (Lin and Hovy 1997) • Higher semantic/syntactic structures • Network (graph) based methods • Other methods (rhetorical analysis, lexical chains, co-reference chains) • AI methods
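
Both problems can be seen directly in WordNet; a small illustration, assuming NLTK with its wordnet corpus downloaded:

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

# Polysemy: "cycle" maps to several unrelated senses.
for synset in wn.synsets("cycle"):
    print(synset.name(), "-", synset.definition())

# Synonymy: in WordNet 3.0, "bicycle" and "cycle" share the synset bicycle.n.01,
# so a semantic method can match them even though the surface words differ.
print(wn.synsets("bicycle")[0] in wn.synsets("cycle"))  # True
```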

  8. LSI based summarization (Gong, 2001) • Build a term-sentence matrix • Apply SVD to the term-sentence matrix (a sketch of the selection step follows below) • Problem • TF-ISF cannot capture context and word relations correctly
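
A minimal sketch of the selection step of Gong & Liu [3], assuming scikit-learn's TfidfVectorizer as a stand-in for the TF-ISF weighting; after SVD, one sentence is picked per top right-singular vector:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def lsi_summary(sentences, k=3):
    """Build a term-sentence matrix, apply SVD, and for each of the top-k
    right singular vectors (latent concepts) pick the highest-weighted
    sentence, as in Gong & Liu's selection scheme."""
    A = TfidfVectorizer(stop_words="english").fit_transform(sentences).T
    _, _, Vt = np.linalg.svd(A.toarray(), full_matrices=False)
    chosen = []
    for concept in Vt[:k]:               # one row per latent concept
        idx = int(np.argmax(concept))    # sentence with the largest weight
        if idx not in chosen:
            chosen.append(idx)
    return [sentences[i] for i in sorted(chosen)]
```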

  9. Proposed Approach • Preprocessing (tokenizing, stop-word removal, stemming) • Extract context (LSA on the term-document matrix) • Extract perspective (SRL and WordNet) • Summary generation

  10. Proposed Approach • Preprocessing • Tokenize and remove stop words • Stem and build the term-document matrix A • Extract Context • Apply SVD to A and use the matrix U (term-concept) • Calculate the cosine distance between concepts and documents • Calculate the cosine distance between sentences and the concept of each topic • Rank sentences (a rough sketch follows below)
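
A rough sketch of this context-extraction step under the assumptions above: term_doc is the term-document matrix A, sentence_vectors holds sentences expressed in the same term space, and n_concepts is an illustrative parameter:

```python
import numpy as np

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v) / denom if denom else 0.0

def rank_by_concepts(term_doc, sentence_vectors, n_concepts=3):
    """Apply SVD to the term-document matrix; the leading columns of U are
    concept directions in term space. Sentences are then ranked by cosine
    similarity to each concept."""
    U, _, _ = np.linalg.svd(term_doc, full_matrices=False)
    ranking = {}
    for c in range(n_concepts):
        concept = U[:, c]
        ranking[c] = sorted(range(len(sentence_vectors)),
                            key=lambda i: cosine(sentence_vectors[i], concept),
                            reverse=True)
    return ranking  # per concept: sentence indices, best first
```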

  11. Proposed Approach • Extract Perspective • Use SRL and WordNet for sentence similarity • Cosine Distance Problem • S1 = United States Army successfully tested an anti-missile defense system. • S2 = U.S. military projectile interceptor streaked into space and hit the target. • S3 = Iran's weekend test of a long-range missile underscored the need for a U.S. national missile defense system. • (S2 paraphrases S1 with almost no shared words, while S3 shares many words with S1 but describes a different event) • Semantic Similarity • S1 = [United States Army]subject [successfully]AM-MNR [tested]verb [an anti-missile defense system]object • Summary Generation • Remove redundancy and rank sentences (a toy WordNet sketch follows below)
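
A toy sketch of WordNet-based sentence similarity in the spirit of this step (the SRL part, which aligns roles such as subject and object before comparing them, is omitted); assumes NLTK with the wordnet corpus:

```python
from nltk.corpus import wordnet as wn

def word_sim(w1, w2):
    """Best path similarity over all synset pairs of two words (0 if unknown)."""
    s1, s2 = wn.synsets(w1), wn.synsets(w2)
    if not s1 or not s2:
        return 0.0
    return max((a.path_similarity(b) or 0.0) for a in s1 for b in s2)

def sentence_sim(tokens1, tokens2):
    """Average best-match similarity of each word in tokens1 against tokens2."""
    if not tokens1 or not tokens2:
        return 0.0
    return sum(max(word_sim(t, u) for u in tokens2) for t in tokens1) / len(tokens1)

# S1 and S2 above share almost no surface words, so cosine similarity is near
# zero, yet WordNet links their vocabulary (e.g., "army" vs. "military").
print(sentence_sim(["army", "tested", "missile"],
                   ["military", "interceptor", "target"]))
```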

  12. Evaluation Tools & Summarization Systems • ROUGE: Recall-Oriented Understudy for Gisting Evaluation • Types: ROUGE-N, ROUGE-L, ROUGE-W, ROUGE-S, ROUGE-SU (a minimal ROUGE-N sketch follows below) • MEAD • http://www.summarization.com/mead • Chinese, English, Japanese, Dutch • DMSumm • http://www.icmc.usp.br/~taspardo/DMSumm.htm • Portuguese, English • SweSum (Martin Hassel) • http://swesum.nada.kth.se/index-eng.html • English, German, Italian, Spanish, Greek, ... • FarsiSum (Nima Mazdak, Martin Hassel) • http://swesum.nada.kth.se/index-eng.html • SUMMARIST • PERSIVAL • GLEANS • SumUM • RIPTIDES • NTT • GISTSumm • GISTexter • DiaSumm • NeATS
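
ROUGE-N is essentially n-gram recall of a candidate summary against reference summaries; a minimal sketch of the idea (the real tool adds stemming, stop-word handling, and averaging over multiple references):

```python
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n=2):
    """Clipped n-gram overlap divided by the total reference n-gram count."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0

print(rouge_n_recall("the cat sat on the mat",
                     "the cat was on the mat"))  # ROUGE-2 recall = 0.6
```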

  13. References
[1] I. Mani. Automatic Summarization. John Benjamins Publishing Company, 2001.
[2] J. Y. Yeh, H. R. Ke, W. P. Yang, and I. H. Meng. Text summarization using a trainable summarizer and latent semantic analysis. Information Processing and Management, 41, 75-95, 2005.
[3] Y. Gong and X. Liu. Generic text summarization using relevance measure and latent semantic analysis. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR'01, New Orleans, 2001.
[4] J. Steinberger, M. A. Kabadjov, M. Poesio, and O. Sanchez-Graillet. Improving LSA-based summarization with anaphora resolution. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, 2005.
[5] H. Yu. News summarization based on semantic similarity measure. Ninth International Conference on Hybrid Intelligent Systems, vol. 1, pp. 180-183, 2009.
[6] C. H. Papadimitriou, P. Raghavan, H. Tamaki, and S. Vempala. Latent semantic indexing: A probabilistic analysis. J. Comput. Syst. Sci., 61(2):217-235, 2000.
[7] C.-Y. Lin and E. Hovy. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of HLT-NAACL, 2003.
[8] T. Nomoto and Y. Matsumoto. A new approach to unsupervised text summarization. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR'01, New Orleans, Louisiana, United States, 2001.
[9] J. Lee, S. Park, C. Ahn, and D. Kim. Automatic generic document summarization based on non-negative matrix factorization. Information Processing and Management, 2008.
[10] J. Steinberger, M. Poesio, M. A. Kabadjov, and K. Jezek. Two uses of anaphora resolution in summarization. Information Processing and Management, vol. 43, November 2007.
[11] D. Wang, T. Li, S. Zhu, and C. Ding. Multi-document summarization via sentence-level semantic analysis and symmetric matrix factorization. SIGIR'08, July 2008, Singapore.
[12] V. Gupta and G. S. Lehal. A survey of text summarization extractive techniques. Journal of Emerging Technologies in Web Intelligence, August 2010.

  14. Thanks

  15. DUC 2007 • Document Understanding Conferences • AQUAINT corpus • Dataset specifications • Associated Press and New York Times (1998-2000) & Xinhua News Agency (1996-2000) • Ten NIST assessors wrote summaries for the 45 topics in the DUC 2007 main task • 45 topics, 25 documents per topic (1,125 documents in total) • 531,174 terms; 262,225 terms without stop words; 20,057 terms after stemming and stop-word removal • 32 system summarizers • Each topic has 4 human summaries • Metrics: ROUGE-2 and ROUGE-SU4

  16. Experimental Results • Average recall on 3 topics, ROUGE-2 [results chart]

  17. Experimental Results • Average recall on 3 topics, ROUGE-SU4 [results chart]

  18. The Best …
