Summarization Techniques

A. Bellaachia

Computer Science Department

School of Engineering and Applied Sciences

George Washington University

Washington, DC 20052


Research Team

Abdelghani Bellaachia, Simon Berkovich, Avinash Kanal
Computer Science Department, George Washington University, Washington, DC

Anandpal Mahajan, Web Methods, Virginia

Abdel-Hamid Gooda, IBM Consulting, Washington, DC

Motivation
  • Decide whether a document is relevant or not.
  • What is the first thing you read in a novel?
    • The summary of the book.
  • Often the summary is all that is read.
  • Provide summaries of retrieved web pages related to a user query.
  • Automatic abstracts of technical papers.
  • Human-generated summaries are expensive.

Motivation (Cont’d)
  • Think about your last-minute REQUIRED ABSTRACT!!!
    • Document length: 3073 words (no references and no title)
    • Summary length: 135 words
    • Extracted-sentences length: 81 words (60% of the summary)

What is a Summary?
  • Informative summary
    • Purpose: replace the original document
    • Example: executive summary
  • Indicative summary
    • Purpose: support a decision: do I want to read the original document, yes or no?
    • Example: headline, scientific abstract
  • Evaluative summary
    • Purpose: express the point of view of the author on a given topic
    • Example: “I think this document focuses more on …”

What Type of Summary?
  • Two types of summary:
    • Abstract
    • Extract
  • Abstract:
    • A set of manually generated sentences.
  • Extract:
    • A set of sentences extracted from the document.
  • Extract vs. Abstract?
    • An extracted summary remains closer to the original document, limiting the bias that might otherwise appear in a summary.

What Type of Summary? (Cont’d)
  • Text summaries can also be categorized into two types:
    • Query-relevant summaries:
      • The summary is created based on the terms in the input query.
      • As they are “query-biased”, they do not provide an overall sense of the document content.
    • Generic summaries:
      • A generic summary provides an overall sense of the document’s contents and determines which category it belongs to.
      • A good generic summary should contain the main topics of the document while keeping redundancy to a minimum.
      • It is a challenging task: It is generally hard to develop a high-quality generic summarization method.


Summarization Goals
  • The goals of text summarizers can be categorized by their intent, focus and coverage:
    • Intent
    • Focus
    • Coverage


Summarization Goals (Cont’d)
  • Intent:
    • Intent refers to the potential use of the summary.
    • Firmin and Chrzanowski divide a summary’s intent into three main categories:
      • Indicative: Indicative summaries give an indication of the central topic of the original text, or enough information to judge the text’s relevancy.
      • Informative: Informative summaries can serve as substitutes for the full documents.
      • Evaluative: Evaluative summaries express the point of view of the author on a given topic.
  • Focus: Is the summary generic or query-relevant?
  • Coverage: Refers to the number of sentences that contribute to the summary.

Proposed Summarizers
  • Three generic text summarization methods are presented.
  • They create text summaries by ranking and extracting sentences from the original documents.
  • Prior Work:
    • SUMMARIZER 1: uses standard Information Retrieval (IR) methods to rank sentence relevance. [Yihong Gong and Xin Liu, SIGIR 2001]


Proposed Summarizers (Cont’d)
  • Proposed Solutions:
    • SUMMARIZER 2: uses the IR TF*IDF weighting scheme to rank sentences and selects the top sentences to form a summary.
    • SUMMARIZER 3: uses the popular k-means clustering algorithm, where k is the number of sentences in the desired summary, and selects the sentence with the highest TF*IDF weight (the sum of the weights of all terms in the sentence) from each cluster.
    • SUMMARIZER 4: uses the popular k-means clustering algorithm and generates a summary using a new k-NN based classification algorithm.

Summarization Approach
  • This section introduces four generic text summarization techniques.
  • Each summarizer tries to select sentences that cover the main topics of the document.
  • The summarization process follows a particular procedure, described in the steps below (a minimal sketch follows the list):
    • Segmentation: Decompose the document into individual sentences and use these sentences to form the candidate sentence set S.
    • Vectorization: Create (1) the weighted term-frequency vector Si for each sentence i ∈ S, and (2) the weighted term-frequency vector D for the whole document.
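As a rough illustration, here is a minimal Python sketch of these two steps. The whitespace tokenizer and the punctuation-based sentence splitter are simplifying assumptions; the slides do not specify the exact preprocessing used.

    import re
    from collections import Counter

    def segment(document):
        # Naive sentence segmentation on ., !, ? boundaries (an assumption).
        return [s.strip()
                for s in re.split(r'(?<=[.!?])\s+', document) if s.strip()]

    def vectorize(sentences):
        # One term-frequency vector per sentence, plus one for the document.
        sentence_vectors = [Counter(s.lower().split()) for s in sentences]
        document_vector = Counter()
        for v in sentence_vectors:
            document_vector.update(v)
        return sentence_vectors, document_vector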


Vectorization: An IR Model
  • Get the set of all terms in the whole document and let n be the cardinality of this set.
  • Each term represents a dimension in an n-dimensional space.
  • Each sentence, and the document itself, is a vector in this space:
    • D = (d1, d2, d3, d4, ..., dn)
    • Si = (si1, si2, si3, si4, ..., sin)


Vectorization: An IR Model (Cont’d)
  • Possible similarity measure: e.g., the cosine of the angle between Si and D,

      sim(Si, D) = (Si · D) / (|Si| |D|)

  • Other measures:
    • Euclidean distance

[Figure: the document vector D and a sentence vector Si plotted in a three-dimensional term space with axes T1, T2, and T3]

SUMMARIZER 1
  • The main steps of SUMMARIZER 1 are (sketched in code below):
    • Step 1: For each sentence Si ∈ S, compute the relevance measure between Si and D: inner product, cosine similarity, or Jaccard coefficient.
    • Step 2: Select the sentence Sk with the highest relevance score and add it to the summary.
    • Step 3: Delete Sk from S, and eliminate all the terms contained in Sk from the document vector and the remaining sentence vectors; re-compute the weighted term-frequency vectors (D and all Si).
    • Step 4: If the number of sentences in the summary reaches the predefined value, terminate; otherwise go to Step 1.
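A minimal sketch of this loop, reusing the vectorize and cosine helpers above. Modeling Step 3 as simply deleting the selected sentence's terms from every vector is an assumption about how the re-computation works:

    from collections import Counter

    def summarizer1(sentence_vectors, document_vector, length):
        # Repeatedly pick the sentence most similar to the document, then
        # eliminate its terms so later picks cover new content (Step 3).
        candidates = {i: Counter(v) for i, v in enumerate(sentence_vectors)}
        D = dict(document_vector)
        summary = []
        while len(summary) < length and candidates:
            k = max(candidates, key=lambda i: cosine(candidates[i], D))
            summary.append(k)
            for term in list(candidates[k]):   # eliminate covered terms
                D.pop(term, None)
                for v in candidates.values():
                    v.pop(term, None)
            del candidates[k]
        return sorted(summary)  # selected sentence indices, in document order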


SUMMARIZER 2
  • This summarizer is the simplest among all the proposed techniques.
  • It uses the TF*IDF weighting scheme to select sentences.
  • It works as follows (see the sketch below):
    • Create the weighted term-frequency vector Si for each sentence i ∈ S using TF*IDF (term frequency * inverse document frequency).
    • Sum up the TF*IDF scores for each sentence and rank the sentences.
    • Select the predefined number of top-ranked sentences from S to form the summary.
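A compact sketch, under one notable assumption: IDF is computed over the document's own sentences (each sentence treated as a "document"), since the slides do not say which collection the IDF statistics come from:

    import math
    from collections import Counter

    def summarizer2(sentences, length):
        # Score each sentence by the sum of its terms' TF*IDF weights.
        tokenized = [s.lower().split() for s in sentences]
        n = len(tokenized)
        df = Counter(t for toks in tokenized for t in set(toks))

        def score(toks):
            tf = Counter(toks)
            return sum(f * math.log(n / df[t]) for t, f in tf.items())

        ranked = sorted(range(n), key=lambda i: score(tokenized[i]),
                        reverse=True)
        return sorted(ranked[:length])  # top sentences, in document order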


SUMMARIZER 3
  • This summarizer uses the popular k-means clustering algorithm, where k is the size of the summary.
  • K-means (sketched below):
    • Start with random positions for the k centroids.
    • Iterate until the centroids are stable:
      • Assign each point to its nearest centroid.
      • Move each centroid to the center of its assigned points.

[Figure: the k-means animation from the original slides, showing iterations 0 through 3 converging]
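A plain-NumPy sketch of that loop; a fixed iteration cap stands in for the stability test:

    import numpy as np

    def kmeans(points, k, iterations=20, seed=0):
        points = np.asarray(points, dtype=float)
        rng = np.random.default_rng(seed)
        # Start with k centroids at random data points.
        centroids = points[rng.choice(len(points), size=k, replace=False)]
        for _ in range(iterations):
            # Assign each point to its nearest centroid.
            dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :],
                                   axis=2)
            labels = dists.argmin(axis=1)
            # Move each centroid to the center of its assigned points.
            for j in range(k):
                if np.any(labels == j):
                    centroids[j] = points[labels == j].mean(axis=0)
        return labels, centroids

SUMMARIZER 3 runs this kind of clustering on the sentence vectors, with k equal to the number of sentences in the desired summary.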


SUMMARIZER 3 (Cont’d)
  • This summarizer works as follows (see the sketch after this list):
    • Create the weighted term-frequency vector Ai for each sentence Si using TF*IDF.
    • Form a sentences-by-terms matrix and feed it to the k-means clustering algorithm to generate k clusters.
    • Sum up the TF*IDF scores for each sentence in each cluster.
    • Pick the sentence with the highest TF*IDF score from within each cluster and add it to the summary.
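A sketch of the whole pipeline using scikit-learn's TfidfVectorizer and KMeans; the choice of library, and computing IDF over the sentences themselves, are assumptions:

    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    def summarizer3(sentences, k):
        # Sentences-by-terms TF*IDF matrix (IDF over the sentences).
        X = TfidfVectorizer().fit_transform(sentences)
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=0).fit_predict(X)
        scores = X.sum(axis=1).A1  # total TF*IDF weight of each sentence
        summary = []
        for c in range(k):
            members = [i for i in range(len(sentences)) if labels[i] == c]
            if members:  # guard against an empty cluster
                summary.append(max(members, key=lambda i: scores[i]))
        return sorted(summary)  # one top-scoring sentence per cluster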


SUMMARIZER 4
  • This summarizer uses the Potential Attractor Class (PAC) technique to generate a summary.
  • PAC is a k-nearest-neighbor (k-NN) based technique.
  • How does k-NN work? (A plain-voting sketch follows this list.)
    • The training set includes a set of classes Ci.
    • Run k-means to generate the initial classes.
    • For each new item:
      • Examine the k items from the training classes that are nearest to this item.
      • Apply a decision rule to select the class to which the new item will belong.
    • k is determined empirically.
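For reference, a minimal sketch of plain k-NN with the simple voting rule; the decision rules on the following slides refine this last step:

    import numpy as np
    from collections import Counter

    def knn_vote(train_points, train_labels, query, k):
        # Majority vote among the k nearest training items (Euclidean).
        distances = np.linalg.norm(np.asarray(train_points)
                                   - np.asarray(query), axis=1)
        nearest = np.argsort(distances)[:k]
        return Counter(train_labels[i] for i in nearest).most_common(1)[0][0]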


SUMMARIZER 4 (Cont’d)

[Figure: a new, unlabeled item “?” surrounded by points from classes C1, C2, and C3]

SUMMARIZER 4 (Cont’d)
  • Decision rule:
    • Identify the class membership of the new item: what is the label of the new item?
    • Voting: the new item is assigned to the class that has the largest number of items among the k closest neighbors.
    • Distance weighted: an enhanced version of voting (next slide).
    • PAC: uses laws of physics to determine the membership of a new item (next slide).


SUMMARIZER 4 (Cont’d)
  • Distance weighted:
    • The class of the new item is the one that has the largest weight:

      Weighted Count(Ci) = Σ over the neighbors j of class Ci of (dk – dj) / (dk – d0)

      where dk – dj is the difference between the distance of the kth nearest neighbor and that of neighbor j of class Ci, and dk – d0 is the difference between the distance of the kth nearest neighbor and that of the first (closest) neighbor.
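A direct transcription of that rule as a sketch, assuming the k neighbor distances arrive sorted in increasing order:

    def distance_weighted_vote(distances, labels, k):
        # distances and labels describe the k nearest neighbors,
        # sorted by increasing distance (d0 = closest, dk = farthest).
        d0, dk = distances[0], distances[k - 1]
        weights = {}
        for dj, c in zip(distances[:k], labels[:k]):
            w = (dk - dj) / (dk - d0) if dk != d0 else 1.0
            weights[c] = weights.get(c, 0.0) + w
        return max(weights, key=weights.get)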


SUMMARIZER 4 (Cont’d)
  • PAC (sketched below):
    • Step 1: Get the k nearest neighbors.
    • Step 2: Calculate the distance di between the new item q and the center of the nearest neighbors, among the k nearest neighbors, from each class.
    • Step 3: Calculate the mass mi of the nearest neighbors from each class; this mass, per class, is equal to the number of nearest neighbors, among the k nearest neighbors, from that class.
    • Step 4: Calculate the class force CF(Ci) that attracts sample q to each class.
    • Step 5: Assign q to the class with the highest CF (the PAC decision rule).
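A sketch of the PAC rule, under one loudly flagged assumption: the class-force formula on the original slide was not preserved, so a gravitational-style CF(Ci) = mi / di^2 (mass over squared distance) is used here as a plausible reading of the "laws of physics" analogy:

    import numpy as np

    def pac_classify(neighbor_points, neighbor_labels, query):
        # neighbor_points / neighbor_labels: the k nearest neighbors of q.
        neighbor_points = np.asarray(neighbor_points, dtype=float)
        neighbor_labels = np.asarray(neighbor_labels)
        query = np.asarray(query, dtype=float)
        forces = {}
        for c in set(neighbor_labels.tolist()):
            members = neighbor_points[neighbor_labels == c]
            m = len(members)                                  # mass (Step 3)
            d = np.linalg.norm(members.mean(axis=0) - query)  # dist (Step 2)
            # Assumed gravitational-style force: mass / distance^2.
            forces[c] = m / (d * d) if d > 0 else float('inf')
        return max(forces, key=forces.get)                    # Step 5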


Performance Evaluation
  • Document Understanding Conferences (DUC) datasets from the National Institute of Standards and Technology (NIST).
  • The dataset includes three sets of documents from each independent human evaluator/selector. Each set has between 3 and 20 documents. Each selector builds summaries (abstracts) for each document in the set with an approximate length of 100 words.
  • A sample of the DUC data was chosen for our evaluation: 2 sets of documents (one set from each of 2 selectors).
  • The set from Selector 1 consists of 5 documents, whereas the set from Selector 2 contains 4 documents.
  • For each document, we have a summary (abstract) from each selector.
