
Towards Automated Related Work Summarization



  1. Towards Automated Related Work Summarization • Cong Duy Vu Hoang • July 2010

  2. Outline • Introduction • Previous Studies • Data • Manual Analysis • Proposed System • Experiments & Results • Future Work • Conclusion

  3. Introduction • Scenario: scholars relate a research topic/problem of interest to prior community knowledge, and the search returns a very long list of related works.

  4. Introduction • Example: for the topic "multi-document summarization", search engines may return a long list of hits. Reading through all of them is tedious and time-consuming.

  5. Introduction • Motivation: given the very long list of related works returned for a research topic/problem of interest, how can this list be re-organized into a compact related work summary?

  6. Introduction • Motivation • I envision an NLP application that assists in creating a related work summary. • I propose the task of related work summarization • a topic-biased, multi-document summarization problem • Input: a target research problem • Output: a drafted related work summary • This work examines the feasibility of the proposed task.

  7. Introduction • Related work summarization is significantly different from traditional summarization • It is limited to the domain of scientific discourse • The output summary follows the specific structure of example related work sections • Evaluation is non-trivial and requires special evaluation metrics

  8. Previous Studies • There are no existing studies on this specific task! • Single-document scientific article summarization • (Luhn, 1958; Baxendale, 1958; Edmundson, 1969) → surface features, extracts of technical documents • (Schwartz and Hearst, 2006) → citation texts, key concepts of bioscience texts • (Mei and Zhai, 2008; Qazvinian and Radev, 2008) → citation texts, computational linguistics • The iOPENER project works towards automated creation of technical surveys given a research topic. • (Mohammad et al., 2009) → structure of technical surveys using citation texts, multiple article summarization • (Qazvinian and Radev, 2010) → background information for generating technical surveys

  9. Previous Studies • Technical book summarization • (Mihalcea and Ceylan, 2007) → novel features based on text segmentation for summarization • Rhetorical analysis of scientific texts • (Teufel, 1999; Teufel and Moens, 2002) → argumentative zoning (AZ), computational linguistics • (Teufel et al., 2009) → AZ, chemistry domain • (Angrosh et al., 2010) → rhetorical classification scheme for related work sections • Literature review generation • (Jaidka et al., 2010) → discourse analysis of literature reviews, decomposition

  10. Data • Data Construction • To create a data set for analysis & evaluation • Randomly collected 20 articles from major conference proceedings in NLP & text processing • ACL (6), EMNLP (1), NAACL (5), COLING (4), and SIGIR (4) • Extracted the related work sections and their referenced articles • Pre-processing (PDF-to-TXT conversion, sentence boundary detection) & manual error correction (see the sketch below) • The resulting data set is named RWSData
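A minimal sketch of the data-construction step, assuming the `pdftotext` command-line tool and NLTK's Punkt sentence tokenizer; the slides do not name the tools actually used, and manual error correction still follows this automatic step.

```python
# A minimal sketch of the data-construction pipeline (PDF-to-TXT conversion and
# sentence boundary detection). The `pdftotext` tool and NLTK's Punkt tokenizer
# are assumptions, not the tools named in the original work.
import subprocess
from nltk.tokenize import sent_tokenize  # requires: pip install nltk; nltk.download("punkt")

def pdf_to_sentences(pdf_path: str, txt_path: str) -> list:
    """Convert one article PDF to plain text and split it into sentences."""
    subprocess.run(["pdftotext", pdf_path, txt_path], check=True)
    with open(txt_path, encoding="utf-8", errors="ignore") as f:
        text = " ".join(f.read().split())  # collapse hard line breaks from the PDF layout
    return sent_tokenize(text)

# Hypothetical usage:
# sentences = pdf_to_sentences("article.pdf", "article.txt")
```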

  11. Data • Data Statistics • [Table] No, RW, RA, SbL, and WbL stand for Number of, Related Works, Referenced Articles, Sentence-based Length, and Word-based Length, respectively.

  12. Manual Analysis • Objectives: • To study characteristics of related work summaries • To deconstruct actual related work summaries • To gain deeper insight into how they are structured and authored, at the rhetorical and content levels as well as at the surface, lexical level • Towards efficient strategies for summarization and generation

  13. Manual Analysis • Formal definition: a related work summary (RWS) is a text summary that • covers previous works relevant to the current work • indicates particular aspects of interest (e.g. evaluation, results, experiments, ...) • mentions similarities and dissimilarities among previous works with respect to particular aspects of the topic

  14. Manual Analysis • Position: two possible positions • within the Introduction section, or as a section of its own at the beginning of the article immediately after the Introduction section • gives a strong overview of previous work • before the Conclusion section • a relatively short outline of previous studies with adequate comparisons between the technical content of the current study and previous studies

  15. Manual Analysis • [Figure] A topical structure of a related work summary: the pattern is repeated for other topics; some parts are possible to generate automatically, while others are extremely hard to generate automatically.

  16. [Figure] An illustrative example of the structure of a related work summary: a general topic claim; Topic 1 and Topic 2, each with descriptions & results; and a proposed statement.

  17. Related Work Summary - Structure • An RWS is a topic-biased summary following a topic hierarchy tree • [Figure] A topic hierarchy tree for the previous example: paraphrase evaluation branches into subjective manual evaluation and evaluation through improving performance of particular tasks (e.g. question answering, machine translation)
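Since the topic hierarchy tree recurs throughout the proposal, a minimal sketch of such a tree as a data structure may help; the field names and the example hierarchy below are illustrative assumptions, not the annotation format actually used for RWSData.

```python
# A hypothetical minimal data structure for a topic hierarchy tree; field names
# and the example hierarchy are illustrative, not the RWSData annotation format.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TopicNode:
    label: str                                        # e.g. "paraphrase evaluation"
    keywords: List[str] = field(default_factory=list)
    children: List["TopicNode"] = field(default_factory=list)
    parent: Optional["TopicNode"] = None

    def add_child(self, child: "TopicNode") -> "TopicNode":
        child.parent = self
        self.children.append(child)
        return child

    def is_leaf(self) -> bool:
        return not self.children

# Rebuilding the example hierarchy from this slide:
root = TopicNode("paraphrase evaluation")
root.add_child(TopicNode("subjective manual evaluation"))
root.add_child(TopicNode("task-based evaluation",
                         keywords=["question answering", "machine translation"]))
```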

  18. Manual Analysis • Annotation of topical information for RWSData • Each related work summary is associated with a topic hierarchy tree • Statistics [Table] (TS: tree depth; TD: tree size)

  19. Manual Analysis - The Decomposition • Decomposition is a way to understand the human process of creating a related work summary. • It helps answer motivating questions: • Is a summary created by human cut-and-paste operations? • Which components of the summary come from the original articles, and where in the original documents do they come from? • At what levels: words, phrases, clauses, or even sentences? • How are such components constructed? With what revisions?

  20. Manual Analysis - The Decomposition • Previous studies proposed automatic decomposition algorithms • (Jing et al., 1999) for news articles • (Ceylan et al., 2009) for books • Decomposition becomes non-trivial for multi-document summarization • Initially, I take a manual decomposition approach for RW summarization

  21. Manual Analysis - The Decomposition • The Alignment • Human Revisions in Related Work Creation

  22. Manual Analysis - The Alignment • [Figure] Example of the alignment process

  23. Manual Analysis - The Alignment • I observed four categories of RWS sentences: • RWS1: (XX, 2000) ... - a summary of an aspect mentioned in a referenced article with respect to a specific topic. • (Barzilay and McKeown 2001) evaluated their paraphrases by asking judges … • RWS2: Topic (XX, 2000) ... - a summary of a topic. • Supervised approaches such as (Black et al. 1998) have used clustering … • RWS3: Fact or Opinion (XX, 2000) ... - an evidence-based reference. • Co-training (Riloff and Jones, 1999; Collins and Singer, 1999) begins with ... • RWST: a template-based summary, focusing mainly on artifacts such as survey papers, datasets, metrics, tools, and so on. • Sebastiani’s survey paper [23] provides an overview ...

  24. Manual Analysis - The Alignment • 5 sets were chosen for the alignment • An RWS • does not need to say everything about the referenced articles • but refers only to some specific aspects (e.g. of methods, results, evaluation ...)

  25. Manual Analysis - The Alignment • Relevant information can appear at various positions in original documents • Title and Abstract, Introduction, Body (usually Experiments and Results), Conclusion

  26. Manual Analysis - Revisions • Sentence Reduction • Text fragment 1: ... substituted each set of candidate paraphrases into between 2-10 sentences which contained the original phrase. • RWS sentence: (Bannard and Callison-Burch 2005) replaced phrases with paraphrases in a number of sentences ...

  27. Manual Analysis - Revisions • Sentence Combination • Text fragment 1: ... substituted each set of candidate paraphrases into between 2-10 sentences which contained the original phrase. • Text fragment 2: ... had two native English speakers produce judgments as to whether the new sentences preserved the meaning of the original phrase and as to whether they remained grammatical. • RWS sentence: (Bannard and Callison-Burch 2005) replaced phrases with paraphrases in a number of sentences and asked judges whether the substitutions “preserved meaning and remained grammatical”.

  28. Manual Analysis - Revisions • Sentence Combination • Text fragment 1: ... to preserve both meaning and grammaticality. • RWS sentence: ... “preserved meaning and remained grammatical”.

  29. Manual Analysis - Revisions • Lexical Paraphrasing • Text fragment 1: ... substituted each set of candidate paraphrases into between 2-10 sentences which contained the original phrase. • RWS sentence: (Bannard and Callison-Burch 2005) replaced phrases with paraphrases in a number of sentences ...

  30. Manual Analysis - Revisions • Generalization/Specification • Text fragment 1: We present an unsupervised learning algorithm that mines large text corpora for patterns that express implicit semantic relations. • RWS sentence: (Turney 2006a) presents an unsupervised algorithm for mining the Web for patterns expressing implicit semantic relations.

  31. Manual Analysis - Revisions • All of the above revisions are generally not used alone but are usually combined to construct sentences in an RWS • Dealing with all of the above revisions for RWS summarization is very hard, especially for two of them: • lexical paraphrasing • generalization/specification • I therefore consider the remaining revisions only!

  32. Manual Analysis – Related Work Representation • To examine how to generate and represent a complete RWS • Used another data set (RWSData-Sub) of 30 articles for this analysis • Two main factors reflect related work summary representation • Topic transition • Local coherence

  33. Topic transition • Observation of the chosen data set reveals two types of topic representation for related work summaries • Type 1: using transition sentences to connect topic nodes (23/30 - 77%) • Type 2: representing topic nodes as topic titles (7/30 - 23%) • Type 2 representation is sometimes used when a combination of different research problems is relevant to a specific research topic (see the examples on the following slides)

  34. [Figure] Topic transition - Type 1: topic nodes connected by transition sentences.

  35. [Figure] Topic transition - Type 2: topic nodes represented as topic titles.

  36. Topic Transition • For "Type 1" representation: • makes the related work section natural, but • is non-trivial for automatic generation because of the lack of topic discourse information (e.g. "contrast", "elaboration") between topic nodes • For "Type 2" representation: • not as natural as Type 1, but • seems easy for automatic generation

  37. Local Coherence • Local coherence: the syntactic realization of discourse entities and the transitions between focused entities • News summaries: entity = mention of a person • RWS: entity = mention of a citation • My analysis reveals 14 patterns for mentions of citations in RWSes

  38. [Table] Details on the 14 main patterns explored in the analysis

  39. Local Coherence • Statistics 1 • [Table] Statistics for the 14 patterns over the RWSData-Sub data set.

  40. Local Coherence • Statistics 2 • [Table] Statistics for the 14 patterns that appear in each type of topic transition representation over the RWSData-Sub data set.

  41. Task Formulation • Input: a set of related articles and a desired summary length, provided by the user • Output: a related work (RW) summary, produced by the RW Summarizer • Assumption: a topic hierarchy tree is given
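To make the inputs and output of the formulation concrete, here is a hedged sketch of the task interface only; the function name, types, and the naive greedy filler are illustrative placeholders, not the proposed ReWoS method described on the following slides.

```python
# A hedged sketch of the task interface: a set of articles, a topic hierarchy
# tree (assumed given), and a desired length go in; an extractive RW summary
# comes out. The greedy filler below is a placeholder, not the actual system.
from typing import List

def summarize_related_work(articles: List[List[str]],   # each article as a list of sentences
                           topic_tree: "TopicNode",      # topic hierarchy tree assumption
                           max_words: int) -> List[str]:
    """Return an extractive related work summary within the word budget."""
    summary: List[str] = []
    budget = max_words
    for article in articles:
        for sentence in article:
            cost = len(sentence.split())
            if cost <= budget:                # placeholder selection criterion
                summary.append(sentence)
                budget -= cost
    return summary
```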

  42. A Motivating Example • [Figure] A related work section extracted from “Bilingual Topic Aspect Classification with A Few Training Examples” (Wu et al., 2008)

  43. The Proposed Approach • [Figure] The ReWoS architecture, with separate processing for internal nodes and for leaf nodes. Decision edges are labeled as (T)rue, (F)alse or (R)elevant.

  44. The Proposed Approach • Pre-Processing • Based on heuristic rules over sentence length and lexical clues (see the sketch below) • Sentences whose token-based length is too short (<7) or too long (>80) are removed • Sentences in the future tense are removed • Sentences containing obviously redundant clues, such as “in the section ...”, “figure XXX shows ...”, “for instance” ..., are removed
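A minimal sketch of these pre-processing heuristics, assuming whitespace tokenization; the clue list and the future-tense pattern are illustrative and much shorter than whatever the actual system uses.

```python
import re

# A sketch of the pre-processing heuristics on this slide, assuming whitespace
# tokenization; the clue list and future-tense pattern are illustrative.
REDUNDANT_CLUES = ("in the section", "figure", "for instance")
FUTURE_TENSE = re.compile(r"\bwill\b|\bgoing to\b", re.IGNORECASE)

def keep_sentence(sentence: str) -> bool:
    tokens = sentence.split()
    if len(tokens) < 7 or len(tokens) > 80:               # too short / too long
        return False
    if FUTURE_TENSE.search(sentence):                     # future tense
        return False
    lowered = sentence.lower()
    if any(clue in lowered for clue in REDUNDANT_CLUES):  # obviously redundant clues
        return False
    return True

sentences = ["Figure 3 shows the overall architecture of our system in detail.",
             "Our method clusters citation sentences by topic keywords before extraction."]
print([s for s in sentences if keep_sentence(s)])  # the first sentence is filtered out
```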

  45. The Proposed Approach • Agent-based rule • Attempts to distinguish whether a sentence describes an author’s own work or not • Based on the presence of tokens that signal work done by the author, such as “we”, “our”, “us”, “this approach”, and “this method” … • If a sentence does not satisfy this rule, it is routed to GCSum; otherwise, it is routed to SCSum (see the sketch below)
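A sketch of the agent-based routing rule, assuming a naive phrase match on the lowercased sentence; the cue list is the one quoted on the slide, but the matching details are an assumption of this sketch.

```python
# A sketch of the agent-based rule: route to SCSum when the sentence appears to
# describe the cited authors' own work, otherwise to GCSum.
AGENTIVE_CUES = ("we", "our", "us", "this approach", "this method")

def route_sentence(sentence: str) -> str:
    padded = " " + sentence.lower() + " "
    if any(f" {cue} " in padded for cue in AGENTIVE_CUES):
        return "SCSum"   # agentive: work done by the (cited) authors themselves
    return "GCSum"       # non-agentive: general / background content

print(route_sentence("We present an unsupervised learning algorithm."))    # SCSum
print(route_sentence("Text classification assigns pre-defined labels."))   # GCSum
```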

  46. General Content Summarization (GCSum) • The objective of GCSum is to extract sentences containing useful background information on the topics of the internal node in focus.

  47. General Content Summarization (GCSum) • General content: informative vs. indicative sentences • Informative examples: • Text classification is a task that assigns a certain number of pre-defined labels for a given text. • Statistical machine translation (SMT) seeks to develop mathematical models of the translation process whose parameters can be automatically estimated from a parallel corpus. • Indicative examples: • Many previous studies have approached the problem of mono-lingual text classification. • This paper refers to the problem of sentiment analysis.

  48. General Content Summarization (GCSum) • Informative sentences • give details on a specific aspect of the problem, e.g. definitions, purpose or applications of the topic • Indicative sentences • simpler, inserted to make the topic transition explicit and rhetorically sound • Summarization strategy, given a topic: • for indicative sentences, use pre-defined templates (a sketch follows below) • for informative sentences, extract from the input articles
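For the indicative case, template-based generation could look like the following sketch; the single template is adapted from the example sentence on the previous slide and is purely illustrative, since the actual templates are not listed.

```python
# A sketch of template-based generation for indicative sentences; the single
# template below is adapted from the example on the previous slide.
INDICATIVE_TEMPLATE = "Many previous studies have approached the problem of {topic}."

def indicative_sentence(topic: str) -> str:
    return INDICATIVE_TEMPLATE.format(topic=topic)

print(indicative_sentence("mono-lingual text classification"))
```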

  49. General Content Summarization (GCSum) • GCSum first checks the subject of each candidate sentence, filtering out those whose subjects do not contain at least one topic keyword (subject-based rule). • Otherwise, GCSum checks whether stock verb phrases (i.e., “based on”, “make use of” and 23 other patterns) are used as the main verb (verb-based rule). • Otherwise, GCSum checks for the presence of at least one citation, since general sentences may list a set of citations as examples (citation-based rule). • Importantly, if no informative sentences can be found in the input articles, indicative sentences are generated instead! (A sketch of these rules follows below.)
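A rough sketch of the three GCSum filters, assuming the grammatical subject is approximated externally (e.g. by a parser) and passed in; the stock-verb list and citation regex are illustrative, and OR-ing the rules is an assumption about how they interact.

```python
import re

# A rough sketch of the subject-, verb-, and citation-based filters for
# identifying informative general-content sentences; all lists and patterns
# here are illustrative assumptions.
STOCK_VERBS = ("based on", "make use of", "makes use of", "rely on", "relies on")
CITATION = re.compile(r"\([A-Z][A-Za-z-]+(?: et al\.)?,? \d{4}[a-z]?\)|\[\d+\]")

def is_informative(sentence: str, topic_keywords: set, subject: str) -> bool:
    subject_ok = any(k.lower() in subject.lower() for k in topic_keywords)  # subject-based rule
    verb_ok = any(v in sentence.lower() for v in STOCK_VERBS)               # verb-based rule
    citation_ok = CITATION.search(sentence) is not None                     # citation-based rule
    return subject_ok or verb_ok or citation_ok

sent = "Statistical machine translation is based on mathematical models of translation."
print(is_informative(sent, {"machine translation"}, "Statistical machine translation"))  # True
```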

  50. General Content Summarization (GCSum) • Topic relevance computation • ranks sentences based on keyword content • assumes that the topic of an internal node is affected by its surrounding nodes: ancestors, descendants and others • scoreS is the final relevance score; scoreSQA, scoreSQ, and scoreSQR denote the component relevance scores of sentence S with respect to the ancestor, current, and other remaining nodes, respectively (a hedged sketch follows below)
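Since the slide names the component scores but not how they are combined, the following is only a hedged sketch of keyword-overlap relevance with an assumed weighted combination; the weights and the overlap measure are illustrative, not the original scoring function.

```python
import re

# A hedged sketch of keyword-overlap topic relevance. The slide defines
# component scores w.r.t. the ancestor (QA), current (Q), and remaining (QR)
# nodes but does not show the combining formula, so the weights and the simple
# overlap measure below are assumptions.
def overlap_score(sentence: str, keywords: set) -> float:
    tokens = set(re.findall(r"\w+", sentence.lower()))
    return len(tokens & {k.lower() for k in keywords}) / (len(keywords) or 1)

def relevance(sentence: str, q: set, qa: set, qr: set,
              w_q: float = 1.0, w_qa: float = 0.5, w_qr: float = 0.5) -> float:
    # Reward overlap with the current node and its ancestor; penalize overlap
    # with the other remaining nodes.
    return (w_q * overlap_score(sentence, q)
            + w_qa * overlap_score(sentence, qa)
            - w_qr * overlap_score(sentence, qr))

print(relevance("Many studies approach the problem of monolingual text classification.",
                q={"text", "classification"}, qa={"nlp"}, qr={"translation"}))
```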
