
Concept based Multi-Document Text Summarization


Presentation Transcript


  1. Concept based Multi-Document Text Summarization By: Asef Poormasoomi Supervisor: Dr. Kahani Autumn 2010 Ferdowsi University of Mashhad

  2. Introduction • summary: a brief but accurate representation of the contents of a document

  3. Motivation • Is this the best we can do? • Abstracts for scientific and other articles • News summarization (mostly multi-document summarization) • Classification of articles and other written data • Web pages for search engines • Web access from PDAs and cell phones • Question answering and data gathering

  4. Genres • Extract vs. abstract • lists fragments of text vs. re-phrases content coherently • example: "He ate a banana, an orange, and an apple" => "He ate fruit" • Generic vs. query-oriented • provides the author's view vs. reflects the user's interest • example: question answering systems • Personal vs. general • tailored to the reader's prior knowledge vs. written for a general audience • Single-document vs. multi-document source • based on one text vs. fuses together many texts • Input • text, video, image, map

  5. Methods • Statistical scoring methods (pseudo) • Higher semantic/syntactic structures • Network (graph) based methods • Semantic based methods (LSA, ontology, WordNet) • Other methods (rhetorical analysis, lexical chains, co-reference chains) • AI methods

  6. Statistical scoring (pseudo) • General method: • score each entity (sentence, word); • combine scores; • choose the best sentence(s) (a minimal sketch follows below) • Scoring techniques: • Word frequencies throughout the text (Luhn 58) • Position in the text (Edmundson 69, Lin & Hovy 97) • Title method (Edmundson 69) • Cue phrases in sentences (Edmundson 69) • Bayesian classifier (Kupiec et al. 95)
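
A minimal sketch of the frequency-scoring idea above, in the spirit of Luhn 58 but not his exact formulation; the stop-word list and the choice of returning two sentences are illustrative assumptions:

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "was", "it"}  # toy list

def frequency_summary(text, n_sentences=2):
    """Score each sentence by the average frequency of its content words,
    then return the top-ranked sentences in their original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(w for w in re.findall(r"[a-z']+", text.lower())
                   if w not in STOP_WORDS)

    def score(sentence):
        tokens = [w for w in re.findall(r"[a-z']+", sentence.lower())
                  if w not in STOP_WORDS]
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    top = set(sorted(sentences, key=score, reverse=True)[:n_sentences])
    return [s for s in sentences if s in top]
```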

  7. Methods • Statistical scoring methods • problems: • Synonymy: one concept can be expressed by different words • example: cycle and bicycle refer to the same kind of vehicle • Polysemy: one word can have several meanings • example: cycle could mean life cycle or bicycle (see the WordNet illustration below) • Phrases: a phrase may have a meaning different from that of the words in it • example: an alleged murderer is not a murderer (Lin and Hovy 1997) • Higher semantic/syntactic structures • Network (graph) based methods • Other methods (rhetorical analysis, lexical chains, co-reference chains) • AI methods
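
Both problems can be seen directly in WordNet; a small illustration, assuming NLTK with its wordnet corpus downloaded:

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

# Polysemy: "cycle" maps to several unrelated senses.
for synset in wn.synsets("cycle"):
    print(synset.name(), "-", synset.definition())

# Synonymy: in WordNet 3.0, "bicycle" and "cycle" share the synset bicycle.n.01,
# so a semantic method can match them even though the surface words differ.
print(wn.synsets("bicycle")[0] in wn.synsets("cycle"))  # True
```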

  8. LSI based summarization (Gong, 2001) • Build a term-sentence matrix • Apply SVD to the term-sentence matrix (a sketch of the selection step follows below) • Problem • TF-ISF cannot capture context and word relations correctly
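
A minimal sketch of the selection step of Gong & Liu [3], assuming scikit-learn's TfidfVectorizer as a stand-in for the TF-ISF weighting; after SVD, one sentence is picked per top right-singular vector:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def lsi_summary(sentences, k=3):
    """Build a term-sentence matrix, apply SVD, and for each of the top-k
    right singular vectors (latent concepts) pick the highest-weighted
    sentence, as in Gong & Liu's selection scheme."""
    A = TfidfVectorizer(stop_words="english").fit_transform(sentences).T
    _, _, Vt = np.linalg.svd(A.toarray(), full_matrices=False)
    chosen = []
    for concept in Vt[:k]:               # one row per latent concept
        idx = int(np.argmax(concept))    # sentence with the largest weight
        if idx not in chosen:
            chosen.append(idx)
    return [sentences[i] for i in sorted(chosen)]
```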

  9. Proposed Approach • Preprocessing (tokenizing, stop-word removal, stemming) • Extract context (LSA on the term-document matrix) • Extract perspective (SRL and WordNet) • Summary generation

  10. Proposed Approach • Preprocessing • Tokenize and remove stop words • Stem and build the term-document matrix A • Extract Context • Apply SVD to A and use the matrix U (term-concept) • Calculate the cosine distance between concepts and documents • Calculate the cosine distance between sentences and the concept of each topic • Rank sentences (a rough sketch follows below)
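
A rough sketch of this context-extraction step under the assumptions above: term_doc is the term-document matrix A, sentence_vectors holds sentences expressed in the same term space, and n_concepts is an illustrative parameter:

```python
import numpy as np

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v) / denom if denom else 0.0

def rank_by_concepts(term_doc, sentence_vectors, n_concepts=3):
    """Apply SVD to the term-document matrix; the leading columns of U are
    concept directions in term space. Sentences are then ranked by cosine
    similarity to each concept."""
    U, _, _ = np.linalg.svd(term_doc, full_matrices=False)
    ranking = {}
    for c in range(n_concepts):
        concept = U[:, c]
        ranking[c] = sorted(range(len(sentence_vectors)),
                            key=lambda i: cosine(sentence_vectors[i], concept),
                            reverse=True)
    return ranking  # per concept: sentence indices, best first
```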

  11. Proposed Approach • Extract Perspective • Use SRL and WordNet for sentence similarity • Cosine Distance Problem • S1 = United States Army successfully tested an anti-missile defense system. • S2 = U.S. military projectile interceptor streaked into space and hit the target. • S3 = Iran's weekend test of a long-range missile underscored the need for a U.S. national missile defense system. • (S2 paraphrases S1 with almost no shared words, while S3 shares many words with S1 but describes a different event) • Semantic Similarity • S1 = [United States Army]subject [successfully]AM-MNR [tested]verb [an anti-missile defense system]object • Summary Generation • Remove redundancy and rank sentences (a toy WordNet sketch follows below)
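
A toy sketch of WordNet-based sentence similarity in the spirit of this step (the SRL part, which aligns roles such as subject and object before comparing them, is omitted); assumes NLTK with the wordnet corpus:

```python
from nltk.corpus import wordnet as wn

def word_sim(w1, w2):
    """Best path similarity over all synset pairs of two words (0 if unknown)."""
    s1, s2 = wn.synsets(w1), wn.synsets(w2)
    if not s1 or not s2:
        return 0.0
    return max((a.path_similarity(b) or 0.0) for a in s1 for b in s2)

def sentence_sim(tokens1, tokens2):
    """Average best-match similarity of each word in tokens1 against tokens2."""
    if not tokens1 or not tokens2:
        return 0.0
    return sum(max(word_sim(t, u) for u in tokens2) for t in tokens1) / len(tokens1)

# S1 and S2 above share almost no surface words, so cosine similarity is near
# zero, yet WordNet links their vocabulary (e.g., "army" vs. "military").
print(sentence_sim(["army", "tested", "missile"],
                   ["military", "interceptor", "target"]))
```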

  12. Evaluation Tools & Summarization Systems • ROUGE: Recall-Oriented Understudy for Gisting Evaluation • Types: ROUGE-N, ROUGE-L, ROUGE-W, ROUGE-S, ROUGE-SU (a minimal ROUGE-N sketch follows below) • MEAD • http://www.summarization.com/mead • Chinese, English, Japanese, Dutch • DMSumm • http://www.icmc.usp.br/~taspardo/DMSumm.htm • Portuguese, English • SweSum (Martin Hassel) • http://swesum.nada.kth.se/index-eng.html • English, German, Italian, Spanish, Greek, ... • FarsiSum (Nima Mazdak, Martin Hassel) • http://swesum.nada.kth.se/index-eng.html • SUMMARIST • PERSIVAL • GLEANS • SumUM • RIPTIDES • NTT • GISTSumm • GISTexter • DiaSumm • NeATS
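
ROUGE-N is essentially n-gram recall of a candidate summary against reference summaries; a minimal sketch of the idea (the real tool adds stemming, stop-word handling, and averaging over multiple references):

```python
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n=2):
    """Clipped n-gram overlap divided by the total reference n-gram count."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0

print(rouge_n_recall("the cat sat on the mat",
                     "the cat was on the mat"))  # ROUGE-2 recall = 0.6
```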

  13. References
[1] I. Mani. Automatic Summarization. John Benjamins Publishing Company, 2001.
[2] J. Y. Yeh, H. R. Ke, W. P. Yang, and I. H. Meng. Text summarization using a trainable summarizer and latent semantic analysis. Information Processing and Management, 41, 75-95, 2005.
[3] Y. Gong and X. Liu. Generic text summarization using relevance measure and latent semantic analysis. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR'01, New Orleans, 2001.
[4] J. Steinberger, M. A. Kabadjov, M. Poesio, and O. Sanchez-Graillet. Improving LSA-based summarization with anaphora resolution. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, 2005.
[5] H. Yu. News summarization based on semantic similarity measure. Ninth International Conference on Hybrid Intelligent Systems, vol. 1, pp. 180-183, 2009.
[6] C. H. Papadimitriou, P. Raghavan, H. Tamaki, and S. Vempala. Latent semantic indexing: A probabilistic analysis. J. Comput. Syst. Sci., 61(2):217-235, 2000.
[7] C.-Y. Lin and E. Hovy. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of HLT-NAACL, 2003.
[8] T. Nomoto and Y. Matsumoto. A new approach to unsupervised text summarization. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR'01, New Orleans, Louisiana, United States, 2001.
[9] J. Lee, S. Park, C. Ahn, and D. Kim. Automatic generic document summarization based on non-negative matrix factorization. Information Processing and Management, 2008.
[10] J. Steinberger, M. Poesio, M. A. Kabadjov, and K. Jezek. Two uses of anaphora resolution in summarization. Information Processing and Management, vol. 43, November 2007.
[11] D. Wang, T. Li, S. Zhu, and C. Ding. Multi-document summarization via sentence-level semantic analysis and symmetric matrix factorization. SIGIR'08, July 2008, Singapore.
[12] V. Gupta and G. S. Lehal. A survey of text summarization extractive techniques. Journal of Emerging Technologies in Web Intelligence, August 2010.

  14. Thanks

  15. DUC 2007 • Document Understanding Conferences • AQUAINT corpus • Dataset specifications • Associated Press and New York Times (1998-2000) & Xinhua News Agency (1996-2000) • Ten NIST assessors wrote summaries for the 45 topics in the DUC 2007 main task • 45 topics, 25 documents per topic (1,125 documents in total) • 531,174 terms; 262,225 terms without stop words; 20,057 terms after stemming and stop-word removal • 32 system summarizers • Each topic has 4 human summaries • Metrics: ROUGE-2 and ROUGE-SU4

  16. Experimental Results • Average recall on 3 topics, ROUGE-2 [results chart]

  17. Experimental Results • Average recall on 3 topics, ROUGE-SU4 [results chart]

  18. The Best …
