
Topic Themes for Multi-Document Summarization

This paper discusses different topic representation methods for multi-document summarization and explores the use of themes to improve the informativeness and coherence of the summaries.


Presentation Transcript


  1. Topic Themes for Multi-Document Summarization Sanda Harabagiu and Finley Lacatusu, Language Computer Corporation Presented by Yi-Ting, NTNU Speech Lab

  2. References • Sanda Harabagiu and Finley Lacatusu, “Topic Themes for Multi-Document Summarization”, SIGIR 2005. • Sanda Harabagiu, “Incremental Topic Representations”, Proceedings of the 20th COLING Conference, Geneva, Switzerland, 2004. NTNU Speech Lab

  3. Outline • Introduction • Topic representation • Theme representation • Using Topic and Theme representations for MDS • Evaluating MDS • Conclusions NTNU Speech Lab

  4. Introduction • One aspect of the data-overload problem we face today is that many documents cover the same topic. • Multi-document summaries need to be both informative and coherent. • Much previous work in summarization has dealt with these problems separately. • This paper presents an approach that represents topics as a structure of themes. • The theme structure dictates both (a) the information content to be included in an MDS and (b) the order of the selected themes. NTNU Speech Lab

  5. Topic representation(1/12) • Five different topic representations (TRs): • (TR1) representing topics via topic signatures (TS1) • (TR2) representing topics via enhanced topic signatures (TS2) • (TR3) representing topics via thematic signatures (TS3) • (TR4) representing topics by modeling the content structure of documents • (TR5) representing topics as templates, implemented as frames with slots and fillers. NTNU Speech Lab

  6. Topic representation(2/12) • TR1. Topic Representation 1: • The topic signature is represented as TS1 = {topic, <(t1,w1), …, (tn,wn)>}, where the terms ti are highly correlated with the topic, with association weight wi. • Term selection and weight association are determined by the likelihood ratio λ (a sketch follows slide 8). • With the likelihood-ratio method, the topic signature terms are found by (a) looking up the χ2 distribution table for the desired confidence level, (b) using that confidence level to select an appropriate cutoff weight c, and (c) selecting the terms whose association weights exceed c. NTNU Speech Lab

  7. Topic representation(3/12) • TR1. Topic Representation 1: • A set of documents is preclassified into (a) topic-relevant texts R and (b) topic-nonrelevant texts. • Two hypotheses: • Hypothesis 1 (H1): P(R | ti) = p = P(R | ¬ti), i.e., the relevance of a document is independent of the term ti. • Hypothesis 2 (H2): P(R | ti) = p1 ≠ p2 = P(R | ¬ti), i.e., ti is a strong indicator of relevance. • The likelihood ratio is λ = L(H1) / L(H2), and −2 log λ is χ2-distributed, which is what makes the confidence cutoff of slide 6 possible. NTNU Speech Lab

  8. Topic representation(4/12) • TR1. Topic Representation 1: NTNU Speech Lab
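
The likelihood-ratio computation itself appears only as a figure on slide 8. As a rough sketch (all function and variable names here are mine, not the paper's), building TS1 with Dunning-style log-likelihood weights over pre-tokenized relevant and non-relevant documents might look like this:

```python
import math
from collections import Counter

def _log_l(k, n, p):
    # Binomial log-likelihood log L(k; n, p), clamping p away from 0 and 1.
    p = min(max(p, 1e-12), 1 - 1e-12)
    return k * math.log(p) + (n - k) * math.log(1 - p)

def topic_signature(relevant_docs, nonrelevant_docs, cutoff=10.83):
    """Rank terms by -2 log(lambda) and keep those above the cutoff.

    10.83 is the chi-square (1 dof) critical value at 99.9% confidence;
    it plays the role of the cutoff weight c on slide 6.
    """
    rel = Counter(t for d in relevant_docs for t in d)
    non = Counter(t for d in nonrelevant_docs for t in d)
    n_rel, n_non = sum(rel.values()), sum(non.values())
    signature = []
    for term, k1 in rel.items():
        k2 = non.get(term, 0)
        p = (k1 + k2) / (n_rel + n_non)   # H1: a single binomial parameter
        p1, p2 = k1 / n_rel, k2 / n_non   # H2: separate parameters
        ll_h1 = _log_l(k1, n_rel, p) + _log_l(k2, n_non, p)
        ll_h2 = _log_l(k1, n_rel, p1) + _log_l(k2, n_non, p2)
        weight = -2 * (ll_h1 - ll_h2)     # -2 log(lambda)
        if weight > cutoff and p1 > p2:   # keep terms over-represented in relevant texts
            signature.append((term, weight))
    return sorted(signature, key=lambda tw: -tw[1])
```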

  9. Topic representation(5/12) • TR2. Topic Representation 2: • Topics can be represented by identifying the relevant relations that exist between topic signature terms: TS2 = {topic, <(r1,w1), …, (rm,wm)>}, where ri is a binary relation between two topic concepts. • Two forms of topic relations are considered: (1) syntax-based relations between a VP and its subject, object, or prepositional attachments; and (2) C-relations between events and entities that cannot be identified by syntactic constraints but belong to the same context. • The topic relations are discovered by starting with the topic terms uncovered in TS1 and selecting a seed syntactic relation between them. • Only nouns and verbs from TS1 are considered. NTNU Speech Lab

  10. Topic representation(6/12) • TR2. Topic Representation 2: • The iterative process of discovering topic relations has four steps (a sketch of this loop follows slide 11): • Step 1: generate candidate relations. • Step 2: rank the candidate topic relations by their Relevance-Rate and Frequency, where Relevance-Rate = Frequency / Count. • Step 3: select a new topic relation based on the ranking from Step 2. • Step 4: restart the discovery, using the latest discovered relation to classify relevant documents. NTNU Speech Lab

  11. Topic representation(7/12) • TR2. Topic Representation 2: NTNU Speech Lab
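
To make the four-step loop on slide 10 concrete, here is a minimal sketch; it assumes each document has already been parsed into a set of relation tuples (the relation extraction itself is not shown, and all names are illustrative only):

```python
from collections import Counter

def discover_topic_relations(relevant, nonrelevant, seed, n_iters=5):
    """Bootstrap topic relations from a seed syntactic relation."""
    all_docs = relevant + nonrelevant
    discovered = {seed}
    for _ in range(n_iters):
        # Step 4 of the previous pass: a document counts as relevant if it
        # contains any relation discovered so far.
        rel_docs = [d for d in all_docs if d & discovered]
        # Step 1: candidates are relations seen in the relevant documents.
        freq = Counter(r for d in rel_docs for r in d)   # Frequency (relevant docs)
        count = Counter(r for d in all_docs for r in d)  # Count (whole collection)
        # Step 2: rank by Relevance-Rate = Frequency / Count, then Frequency.
        ranked = sorted(
            (r for r in freq if r not in discovered),
            key=lambda r: (freq[r] / count[r], freq[r]),
            reverse=True,
        )
        if not ranked:
            break
        discovered.add(ranked[0])  # Step 3: adopt the top-ranked relation
    return discovered
```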

  12. Topic representation(8/12) • TR3. Topic Representation 3: • A third topic representation is based on the concept of themes: TS3 = {topic, <(Th1,r1), …, (Ths,rs)>}, where Thi is one of the themes associated with the topic and ri is its rank. • The discovery of themes is based on (1) a segmentation of documents produced by the TextTiling algorithm and (2) a method of (i) assigning labels to themes and (ii) ranking them. • Four cases for theme labeling: Case 1: a single topic-relevant relation is identified in the segment. Case 2: several topic relations are recognized in the segment. Case 3: relations from multiple topics are recognized in the segment. Case 4: the theme contains topic-relevant terms, but no topic relation. NTNU Speech Lab
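
As an illustration of the segmentation step, NLTK ships a TextTiling implementation; the labeling logic below is only a loose paraphrase of the four cases, with `topic_relations` and `topic_terms` standing in for components the paper builds elsewhere:

```python
from nltk.tokenize import TextTilingTokenizer  # requires the NLTK 'stopwords' corpus

def candidate_themes(document_text, topic_relations, topic_terms):
    """Segment a document with TextTiling, then label each segment.

    `topic_relations(seg)` is assumed to return the list of topic
    relations found in a segment; `topic_terms` is the TS1 term set.
    """
    segments = TextTilingTokenizer().tokenize(document_text)
    themes = []
    for seg in segments:
        rels = topic_relations(seg)
        if len(rels) == 1:
            themes.append((seg, rels[0]))                    # Case 1: single relation
        elif rels:
            themes.append((seg, max(rels, key=rels.count)))  # Cases 2-3: dominant relation
        elif any(t in seg for t in topic_terms):             # naive substring matching
            themes.append((seg, None))                       # Case 4: terms but no relation
    return themes
```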

  13. Topic representation(9/12) • TR4. Topic Representation 4: (Topics Represented as Content Models) • The content model is a Hidden Markov Model (HMM) in which states correspond to topic themes and state transitions capture either (1) theme orderings within the domain or (2) the probability of changing from one given topic theme to another (a transition sketch follows slide 14). • Step 1: initial topic induction via complete-link clustering. • Step 2: the model states and the emission/transition probabilities are determined. • Step 3: Viterbi re-estimation. • The resulting clusters constitute topic representation TR4. NTNU Speech Lab

  14. Topic representation(10/12) • TR4. Topic Representation 4: NTNU Speech Lab
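
The transition-estimation part of the content model (step 2) can be sketched as follows; `cluster_of` stands in for the complete-link clustering of step 1, and the smoothing constant is an assumption of this sketch, not a value from the paper:

```python
from collections import Counter, defaultdict
from itertools import pairwise  # Python 3.10+

def content_model_transitions(documents, cluster_of, smoothing=0.1):
    """Estimate HMM state transitions from theme-cluster orderings.

    Each document is an ordered list of sentences; `cluster_of` maps a
    sentence to its theme-cluster id.
    """
    counts = defaultdict(Counter)
    states = set()
    for doc in documents:
        labels = [cluster_of(s) for s in doc]
        states.update(labels)
        for a, b in pairwise(labels):  # adjacent sentences = observed orderings
            counts[a][b] += 1
    # Smoothed maximum-likelihood transition probabilities per state.
    trans = {}
    for a in states:
        total = sum(counts[a].values()) + smoothing * len(states)
        trans[a] = {b: (counts[a][b] + smoothing) / total for b in states}
    return trans
```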

  15. Topic representation(11/12) • TR5. Topic Representation 5: (Topics Represented as Extraction Templates) • Topics can be represented as a set of inter-related concepts, implemented as a frame having slots and fillers. NTNU Speech Lab

  16. Topic representation(12/12) • TR5. Topic Representation 5: • It is important to be able to generate scripts automatically from corpora. • The IS-A and GLOSS lexical relations found in the WordNet lexical database are used to mine topic relations for topic-relevant terms. • The IS-A and GLOSS relations are combined to generate the topical relations. • An ad-hoc, five-step template generation algorithm builds the templates. NTNU Speech Lab
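
A minimal sketch of mining IS-A and GLOSS neighbors with NLTK's WordNet interface (the crude gloss tokenization and length filter are my simplifications, not the paper's five-step algorithm):

```python
from nltk.corpus import wordnet as wn  # requires the NLTK 'wordnet' corpus

def wordnet_neighbors(term):
    """Collect terms linked to `term` by IS-A links and by gloss words."""
    related = set()
    for synset in wn.synsets(term):
        for hyper in synset.hypernyms():   # IS-A, upward
            related.update(l.name() for l in hyper.lemmas())
        for hypo in synset.hyponyms():     # IS-A, downward
            related.update(l.name() for l in hypo.lemmas())
        # GLOSS relation: words of the definition, crudely filtered by length.
        related.update(w for w in synset.definition().split() if len(w) > 3)
    related.discard(term)
    return related
```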

  17. Theme representation(1/4) • In order to produce exhaustive summaries, MDS systems must be able to identify information that is (1) common to multiple documents in the collection, (2) unique to a single document in the collection, and (3) contradictory to information presented in other documents in the collection. • Extracting all similar sentences would produce a verbose and repetitive summary. • These observations motivate the method of representing themes. • Current semantic parsers are able to recognize all verbal predicates and their arguments. • In the example (next slide), the recognized predicates are underlined. NTNU Speech Lab

  18. Theme representation(2/4) NTNU Speech Lab

  19. Theme representation(3/4) • The theme representation is generated through the following six steps (a clustering sketch follows slide 20): • For every sentence in each document from the collection, the predicate-argument structures are identified (this involves the recognition of paraphrases such as synonyms or idioms). • All sentences having at least one common predicate with a common argument are clustered together; the semantic consistency of the other arguments is also checked. • Conceptual representations for each cluster are generated. • Candidate themes are selected by mapping the clusters into (1) the topic representation TR3 and (2) the topic representation TR4. • Meaningful relations between the themes (cohesion relations and discourse relations) are recognized by naïve Bayes classifiers. • The themes are structured into a graph. NTNU Speech Lab

  20. Theme representation(4/4) NTNU Speech Lab
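
The clustering sketch promised after slide 19 covers only steps 1-2 and assumes a semantic parser has already produced (predicate, arguments) structures for each sentence:

```python
from collections import defaultdict

def cluster_by_predicate_argument(sentences):
    """Group sentences that share a predicate together with one argument.

    `sentences` is a list of (text, structures) pairs, where `structures`
    is a list of (predicate, arguments) tuples from a semantic parser.
    """
    clusters = defaultdict(list)
    for text, structures in sentences:
        for predicate, arguments in structures:
            for arg in arguments:
                # A shared (predicate, argument) pair puts sentences in the
                # same cluster; consistency of the remaining arguments would
                # be checked in a later pass.
                clusters[(predicate, arg)].append(text)
    return {key: texts for key, texts in clusters.items() if len(texts) > 1}
```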

  21. Using Topic and Theme representations for MDS • Multi-document summarization is performed by (1) extracting sentences that contain the most salient information, (2) compressing the sentences to retain the most important pieces of information, and (3) ordering the extracted sentences into the final summary. • The authors implement four extraction methods, two ordering methods, and a separate MDS method (an extraction sketch follows this slide). • Extraction methods: EM1 (TR1), EM2 (TR2), EM3 (TR3), EM4 (TR5). • Ordering methods: OM1, OM2. NTNU Speech Lab
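
As a hedged illustration of the extraction stage, the sketch below scores sentences by their topic-signature weights, roughly in the spirit of EM1; the paper's actual extraction and ordering methods are more elaborate, and compression is omitted entirely:

```python
def summarize(documents, signature_weights, max_sentences=10):
    """Extract the highest-scoring sentences and restore document order.

    `documents` is a list of documents, each an ordered list of token
    lists; `signature_weights` maps terms to TS1 weights.
    """
    scored = []
    for doc_idx, doc in enumerate(documents):
        for sent_idx, sent in enumerate(doc):
            score = sum(signature_weights.get(t, 0.0) for t in set(sent))
            scored.append((score, doc_idx, sent_idx, sent))
    top = sorted(scored, key=lambda x: -x[0])[:max_sentences]
    # A crude stand-in for the ordering methods: original positions.
    top.sort(key=lambda x: (x[1], x[2]))
    return [" ".join(sent) for _, _, _, sent in top]
```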

  22. Evaluating MDS(1/2) NTNU Speech Lab

  23. Evaluating MDS(2/2) NTNU Speech Lab

  24. Conclusions • In this paper, the authors investigated five topic representations previously used in MDS and proposed a new representation based on topic themes. • Additionally, representing themes in a graph-like structure improves the quality of information ordering for MDS. NTNU Speech Lab
