
Summarization Techniques



1. Summarization Techniques
A. Bellaachia
Computer Science Department, School of Engineering and Applied Sciences
George Washington University, Washington, DC 20052

2. Research Team
Abdelghani Bellaachia, Simon Berkovich, Avinash Kanal: Computer Science Department, George Washington University, Washington, DC
Anandpal Mahajan: Web Methods, Virginia
Abdel-Hamid Gooda: IBM Consulting, Washington, DC

3. Motivation
• Decide whether a document is relevant or not.
• What is the first thing you read in a novel?
• Get the summary of a book: often the summary is all that is read.
• Provide summaries of retrieved web pages related to a user query.
• Automatic abstracts of technical papers: human-generated summaries are expensive.

4. Motivation (Cont’d)
• Think about your last-minute REQUIRED ABSTRACT!!!
• Document length: 3073 words (no references and no title)
• Summary length: 135 words
• Extracted-sentences length: 81 words (60% of the summary)

5. What is a Summary?
• Informative summary
  • Purpose: replace the original document.
  • Example: an executive summary.
• Indicative summary
  • Purpose: support a decision: do I want to read the original document, yes or no?
  • Examples: a headline, a scientific abstract.
• Evaluative summary
  • Purpose: express the point of view of the author on a given topic.
  • Example: “I think this document focuses more on …”

6. What Type of Summary?
• Two types of summary: abstract and extract.
• Abstract: a set of manually generated sentences.
• Extract: a set of sentences extracted from the document.
• Extract vs. abstract? An extracted summary remains closer to the original document, limiting the bias that might otherwise appear in a summary.

7. What Type of Summary? (Cont’d)
• Text summaries can also be categorized into two types: query-relevant and generic.
• Query-relevant summaries:
  • The summary is created based on the terms in the input query.
  • Because they are “query-biased”, they do not provide an overall sense of the document content.
• Generic summaries:
  • A generic summary provides an overall sense of the document’s contents and determines which category it belongs to.
  • A good generic summary should contain the main topics of the document while keeping redundancy to a minimum.
  • This is a challenging task: it is generally hard to develop a high-quality generic summarization method.

8. Summarization Goals
• The goals of text summarizers can be categorized by their intent, focus, and coverage:
  • Intent
  • Focus
  • Coverage

9. Summarization Goals (Cont’d)
• Intent: the potential use of the summary. Firmin and Chrzanowski divide a summary’s intent into three main categories:
  • Indicative: indicative summaries give an indication of the central topic of the original text, or enough information to judge the text’s relevancy.
  • Informative: informative summaries can serve as substitutes for the full documents.
  • Evaluative: evaluative summaries express the point of view of the author on a given topic.
• Focus: is the summary generic or query-relevant?
• Coverage: the number of sentences that contribute to the summary.

10. Proposed Summarizers
• Three generic text summarization methods are presented.
• They create text summaries by ranking and extracting sentences from the original documents.
• Prior work:
  • SUMMARIZER 1: uses standard Information Retrieval (IR) methods to rank sentence relevance. [Yihong Gong and Xin Liu, SIGIR 2001]

11. Proposed Summarizers (Cont’d)
• Proposed solutions:
  • SUMMARIZER 2: uses the IR TF*IDF weighting scheme to rank sentences and selects the top sentences to form a summary.
  • SUMMARIZER 3: uses the popular k-means clustering algorithm, where k is the number of sentences in the desired summary, and selects the sentence with the highest TF*IDF weight (the sum of the weights of all terms in the sentence) from each cluster.
  • SUMMARIZER 4: uses the popular k-means clustering algorithm and generates a summary using a new k-NN-based classification algorithm.

12. Summarization Approach
• Each summarizer tries to select sentences that cover the main topics of the document; this section introduces four generic text summarization techniques.
• The summarization process follows a particular procedure, described in the steps below (a sketch of these two steps follows the list):
  • Segmentation: decompose the document into individual sentences and use these sentences to form the candidate sentence set S.
  • Vectorization: create (1) the weighted term-frequency vector Si for each sentence i ∈ S, and (2) the weighted term-frequency vector D for the whole document.
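A minimal Python sketch of the segmentation and vectorization steps. The regex sentence splitter, the lower-cased tokenizer, and the raw term-frequency weighting are illustrative assumptions; the slides do not specify these details.

```python
import re
from collections import Counter

def segment(document):
    # Segmentation: naive split on sentence-ending punctuation.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]

def vectorize(sentences):
    # Vectorization: a term-frequency vector Si per sentence, plus D for the document.
    sentence_vectors = [Counter(re.findall(r"\w+", s.lower())) for s in sentences]
    document_vector = Counter()
    for vec in sentence_vectors:
        document_vector.update(vec)
    return sentence_vectors, document_vector

S = segment("Summaries save time. Extracts reuse sentences. Abstracts are written by hand.")
Si, D = vectorize(S)
```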

13. Vectorization: An IR Model
• Get the set of all terms in the whole document and let n be the cardinality of this set.
• Each term represents a dimension in an n-dimensional space.
• Each sentence, and the document itself, is a vector in this space:
  • D = (d1, d2, d3, d4, ..., dn)
  • Si = (si1, si2, si3, si4, ..., sin)

14. Vectorization: An IR Model (Cont’d)
• Possible similarity measure: cosine similarity, sim(Si, D) = (Si · D) / (|Si| |D|).
• Other measures: Euclidean distance.
[Figure: the vectors Si and D plotted in a three-dimensional term space with axes T1, T2, and T3.]
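Both measures, sketched over the sparse Counter vectors built above; this is a straightforward rendering of the standard formulas, not code from the original work.

```python
import math

def cosine(u, v):
    # sim(u, v) = (u . v) / (|u| |v|); terms missing from a Counter count as zero.
    dot = sum(u[t] * v[t] for t in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def euclidean(u, v):
    # Distance taken over the union of the two vectors' terms.
    return math.sqrt(sum((u[t] - v[t]) ** 2 for t in set(u) | set(v)))
```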

15. SUMMARIZER 1
• The main steps of SUMMARIZER 1 are:
  1. For each sentence i ∈ S, compute the relevance measure between Si and D: inner product, cosine similarity, or Jaccard coefficient.
  2. Select the sentence Sk that has the highest relevance score and add it to the summary.
  3. Delete Sk from S, and eliminate all the terms contained in Sk from the document vector and the S vectors; re-compute the weighted term-frequency vectors (D and all Si).
  4. If the number of sentences in the summary reaches the predefined value, terminate the operation; otherwise go to step 1.
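A compact sketch of this loop, reusing the cosine function from the earlier sketch; the in-place term elimination is one reading of step 3, not verified against the cited paper.

```python
def summarizer1(sentences, vectors, doc_vector, size):
    # vectors: one Counter per sentence; doc_vector: Counter for the whole document.
    summary, candidates = [], list(range(len(sentences)))
    while candidates and len(summary) < size:
        # Steps 1-2: pick the sentence most similar to the document vector.
        best = max(candidates, key=lambda i: cosine(vectors[i], doc_vector))
        summary.append(sentences[best])
        candidates.remove(best)
        # Step 3: eliminate the chosen sentence's terms everywhere, so the
        # next pick is forced to cover different content.
        for term in list(vectors[best]):
            doc_vector.pop(term, None)
            for i in candidates:
                vectors[i].pop(term, None)
    return summary
```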

16. SUMMARIZER 2
• This summarizer is the simplest of the proposed techniques.
• It uses the TF*IDF weighting scheme to select sentences.
• It works as follows:
  1. Create the weighted term-frequency vector Si for each sentence i ∈ S using TF*IDF (term frequency * inverse document frequency).
  2. Sum up the TF*IDF scores for each sentence and rank the sentences.
  3. Select the predefined number of sentences for the summary from S.
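A sketch of this ranking, under the assumption that IDF is computed over the document's own sentences; the slides do not say whether a background corpus is used instead.

```python
import math
import re
from collections import Counter

def summarizer2(sentences, size):
    tokenized = [re.findall(r"\w+", s.lower()) for s in sentences]
    n = len(tokenized)
    # Sentence-level document frequency for the IDF term (an assumption).
    df = Counter(t for toks in tokenized for t in set(toks))
    def tfidf_score(toks):
        tf = Counter(toks)
        return sum(freq * math.log(n / df[t]) for t, freq in tf.items())
    ranked = sorted(range(n), key=lambda i: tfidf_score(tokenized[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:size])]  # keep original sentence order
```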

17. SUMMARIZER 3
• This summarizer uses the popular k-means clustering algorithm, where k is the size of the summary.
• K-means:
  • Start with random positions for the k centroids.
  • Iterate until the centroids are stable:
    • Assign points to the nearest centroid.
    • Move each centroid to the center of its assigned points.
[Figure: k-means at Iteration = 0.]

18.–20. SUMMARIZER 3 (Cont’d)
[Figures: the same k-means illustration repeated at Iteration = 1, 2, and 3, showing points re-assigned to the nearest centroid and centroids moving to the center of their assigned points until stable.]

21. SUMMARIZER 3 (Cont’d)
• This summarizer works as follows:
  1. Create the weighted term-frequency vector Ai for each sentence Si using TF*IDF.
  2. Form a sentences-by-terms matrix and feed it to the k-means clustering algorithm to generate k clusters.
  3. Sum up the TF*IDF scores for each sentence in each cluster.
  4. Pick the sentence with the highest TF*IDF score from within each cluster and add it to the summary.
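A sketch of these four steps using scikit-learn's TfidfVectorizer and KMeans; the library choice is mine, and the paper's exact weighting and centroid initialization may differ.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def summarizer3(sentences, k):
    # Steps 1-2: sentences-by-terms TF*IDF matrix, clustered into k groups.
    X = TfidfVectorizer().fit_transform(sentences)
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
    # Step 3: TF*IDF score per sentence (sum of its term weights).
    scores = np.asarray(X.sum(axis=1)).ravel()
    # Step 4: highest-scoring sentence from each cluster.
    picks = [max((i for i in range(len(sentences)) if labels[i] == c),
                 key=lambda i: scores[i]) for c in range(k)]
    return [sentences[i] for i in sorted(picks)]
```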

22. SUMMARIZER 4
• This summarizer uses the Potential Attractor Class (PAC) technique to generate a summary.
• PAC is a k-nearest-neighbor (k-NN) based technique.
• How does k-NN work?
  • The training set includes a set of classes Ci; run k-means to generate the initial classes.
  • For each new item:
    • Examine the k items from the training classes that are nearest to this item.
    • Apply a decision rule to select the class to which the new item will belong.
  • k is determined empirically.

23. SUMMARIZER 4 (Cont’d)
[Figure: a new item “?” positioned among items from classes C1, C2, and C3.]

24. SUMMARIZER 4 (Cont’d)
• Decision rule: identify the class membership of the new item, i.e., what is its label?
  • Voting: the new item is assigned to the class that has the largest number of items among the k closest neighbors (sketched below).
  • Distance-weighted: an enhanced version of voting (next slide).
  • PAC: uses laws of physics to determine the membership of a new item (next slide).
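A minimal voting k-NN sketch over 2-D points, with a toy training set of my own invention:

```python
import math
from collections import Counter

def knn_vote(train, q, k=3):
    # train: list of (point, label) pairs; q: query point.
    neighbors = sorted(train, key=lambda pl: math.dist(pl[0], q))[:k]
    # Majority vote among the k nearest neighbors.
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

train = [((0, 0), "C1"), ((0, 1), "C1"), ((5, 5), "C2"), ((5, 6), "C2")]
print(knn_vote(train, (1, 1), k=3))  # two of the three nearest are C1 -> "C1"
```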

25. SUMMARIZER 4 (Cont’d)
• Distance-weighted rule: the class of the new item is the one that has the largest weight:

  Weighted Count(Ci) = Σ_{j ∈ Ci} (dk − dj) / (dk − d0)

  where dj is the distance to neighbor j of class Ci, d0 is the distance to the first (nearest) neighbor, and dk is the distance to the k-th nearest neighbor.
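The same rule as code, under the assumption that these weights are summed per class: the nearest neighbor contributes a weight of 1 and the k-th contributes 0.

```python
import math
from collections import defaultdict

def knn_distance_weighted(train, q, k=3):
    ranked = sorted(train, key=lambda pl: math.dist(pl[0], q))[:k]
    d = [math.dist(p, q) for p, _ in ranked]   # d[0] = d0 (nearest) ... d[-1] = dk
    span = (d[-1] - d[0]) or 1.0               # guard against all-equal distances
    weights = defaultdict(float)
    for (_, label), dj in zip(ranked, d):
        # Neighbor j contributes (dk - dj) / (dk - d0) to its class.
        weights[label] += (d[-1] - dj) / span
    return max(weights, key=weights.get)
```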

26. SUMMARIZER 4 (Cont’d)
• PAC:
  • Step 1: get the k nearest neighbors.
  • Step 2: calculate the distance di between the new item q and the center of the nearest neighbors, within the k nearest neighbors, from each class.
  • Step 3: calculate the mass mi of the nearest neighbors from each class; this mass, per class, is equal to the number of nearest neighbors, among the k nearest neighbors, from that class.
  • Step 4: calculate the Class Force CF(Ci), from mi and di, that attracts sample q to each class.
  • Step 5: assign q to the class that has the highest CF (the PAC decision rule).
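The slides appeal to "laws of physics" but the CF formula itself is not recoverable from this transcript; the sketch below assumes a gravitational form, CF(Ci) = mi / di², which is labeled as a guess.

```python
import math
from collections import defaultdict

def pac_classify(train, q, k=5):
    ranked = sorted(train, key=lambda pl: math.dist(pl[0], q))[:k]
    groups = defaultdict(list)
    for point, label in ranked:      # group the k nearest neighbors by class
        groups[label].append(point)
    force = {}
    for label, pts in groups.items():
        # Center of this class's neighbors (di) and its "mass" mi (neighbor count).
        center = tuple(sum(c) / len(pts) for c in zip(*pts))
        di = math.dist(center, q) or 1e-9    # guard: q exactly on the center
        force[label] = len(pts) / di ** 2    # assumed gravitational form mi / di^2
    return max(force, key=force.get)
```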


28. Performance Evaluation
• Document Understanding Conferences (DUC) datasets from the National Institute of Standards and Technology (NIST).
• The dataset includes three sets of documents from each independent human evaluator/selector. Each set has between 3 and 20 documents. Each selector builds summaries (abstracts) for each document in the set, with an approximate length of 100 words.
• A sample of the DUC data was chosen for our evaluation: 2 sets of documents (one set from each of 2 selectors).
• The set from Selector 1 consists of 5 documents, whereas the set from Selector 2 contains 4 documents.
• For each document, we have a summary (abstract) from each selector.

29. Questions … Thanks
