Towards Automated Related Work Summarization
Cong Duy Vu Hoang
July 2010
Outline
  • Introduction
  • Previous Studies
  • Data
  • Manual Analysis
  • Proposed System
  • Experiments & Results
  • Future Work
  • Conclusion
Introduction
  • Scenario:

[Diagram: Scholars relate prior community knowledge to a research topic/problem of interest; the search returns a very long list of related works.]

Introduction
  • Example: using the topic “multi-document summarization”, search engines may return a long list of hits

Reading through all of them is tedious and time-consuming.

Introduction
  • Motivation

[Diagram: the same scenario; the question now is how to re-organize this very long list of related works into a compact related work summary.]

Introduction
  • Motivation
    • I envision an NLP application that assists in creating a related work summary.
    • I propose the task of related work summarization
      • topic-biased, multi-document summarization problem
        • Input: a target research problem
        • Output: a drafted related work summary
    • This work examines the feasibility of the proposed task.
Introduction
  • Related work summarization is significantly different from traditional summarization
    • Limited to the domain of scientific discourse
    • The output summary follows a specific structure of example related work sections
    • Evaluation is non-trivial and requires special evaluation metrics
Previous Studies
  • There are no existing studies on this specific task!
  • Single-document scientific article summarization
    • (Luhn, 1958; Baxendale, 1958; Edmundson, 1969) → surface features, extracts of technical documents
    • (Schwartz and Hearst, 2006) → citation texts, key concepts of bioscience texts
    • (Mei and Zhai, 2008; Qazvinian and Radev, 2008) → citation texts, computational linguistics
  • The iOPENER project works towards automated creation of technical surveys given a research topic.
    • (Mohammad et al., 2009) → structure of technical surveys using citation texts, multiple article summarization
    • (Qazvinian and Radev, 2010) → background information for generating technical surveys
Previous Studies
  • Technical book summarization
    • (Mihalcea and Ceylan, 2007) → novel features based on text segmentation for summarization
  • Rhetorical analysis of scientific texts
    • (Teufel, 1999; Teufel and Moens, 2002) → argumentative zoning (AZ), computational linguistics
    • (Teufel et al., 2009) → AZ, chemistry domain
    • (Angrosh et al., 2010) → rhetorical classification scheme for related work sections
  • Literature Review Generation
    • (Jaidka et al., 2010) → discourse analysis of literature reviews, decomposition
Data
  • Data Construction
    • To create a data set for analysis & evaluation
    • Randomly collected 20 articles from major conference proceedings in NLP & text processing
      • ACL (6), EMNLP (1), NAACL (5), COLING (4), and SIGIR (4)
    • Extracted related work sections and their referenced articles
    • Pre-processing (PDF-to-TXT conversion, sentence boundary) & manual error correction
    • Named RWSData
Data
  • Data Statistics

[Table: data statistics. No = Number of; RW = Related Works; RA = Referenced Articles; SbL = Sentence-based Length; WbL = Word-based Length.]

Manual Analysis
  • Objective:
    • To study characteristics of related work summaries
    • To deconstruct actual related work summaries
      • To gain deeper insight into how they are structured and authored, at the rhetorical and content levels as well as at the surface (lexical) level.
      • Towards efficient strategies for summarization and generation.
Manual Analysis
  • Formal definition
    • a related work summary (RWS) is a text summary that
      • covers aspects of previous works that are relevant to the current work
        • particularly indicating aspects of interest (e.g., evaluation, results, experiments, ...)
      • mentions similarities and dissimilarities among previous works with respect to particular aspects of a topic
Manual Analysis
  • Position: two possible positions
    • within the Introduction section, or as a standalone section at the beginning of the article immediately after the Introduction
      • gives a strong overview of previous work
    • before the Conclusion section
      • a relatively short outline of previous studies and adequate comparisons between the technical content of the current study and previous studies
Manual Analysis

[Figure: a topical structure of a related work summary. Some parts are repeated for other topics; some are possible to generate automatically, while others are extremely hard to generate automatically.]

[Figure: an illustrative example of the structure of a related work summary. Node labels: general topic, claim, Topic 1 (description & results, other descriptions & results), Topic 2 (proposed statement, description & results). Example topics include paraphrase evaluation, subjective manual evaluation, and evaluation through improving the performance of particular tasks (e.g., question answering, machine translation).]
Related Work Summary - Structure
  • An RWS is a topic-biased summary following a topic hierarchy tree

[Figure: a topic hierarchy tree for the previous example.]

Manual Analysis
  • Annotation of topical information for RWSData
    • Each related work summary is associated with a topic hierarchy tree
    • Statistics

[Table: annotation statistics. TS = tree size; TD = tree depth.]

Manual Analysis - The Decomposition
  • Decomposition is a way to understand the human process in creating a related work summary.
  • Helps answer motivating questions:
    • Is a summary created by human cut-and-paste operations?
    • Which components in the summary come from the original articles, and where in the original documents do they come from?
      • Levels: words, phrases, clauses, or even sentences
    • How are such components constructed? What revisions are applied?
Manual Analysis - The Decomposition
  • Previous studies proposed automatic decomposition algorithms
    • (Jing et al., 1999) for news articles
    • (Ceylan et al., 2009) for books
  • Decomposition becomes non-trivial for multi-document summarization
    • Initially, I take a manual decomposition approach for RW summarization
Manual Analysis - The Decomposition
  • The Alignment
  • Human Revisions in Related Work Creation
Manual Analysis - The Alignment

[Figure: example of the alignment process.]

Manual Analysis - The Alignment
  • I observed four categories of RWS sentences:
    • RWS1: (XX, 2000) ... - a summary of an aspect mentioned in a referenced article with respect to a specific topic.
      • (Barzilay and McKeown 2001) evaluated their paraphrases by asking judges …
    • RWS2: Topic (XX, 2000) ... - summary of a topic.
      • Supervised approaches such as (Black et al. 1998) have used clustering …
    • RWS3: Fact or Opinion (XX, 2000) ... - evidence-based reference.
      • Co-training (Riloff and Jones, 1999; Collins and Singer, 1999) begins with ...
    • RWST: a template-based summary, focusing mainly on a survey paper, dataset, metric, tool, and so on.
      • Sebastiani’s survey paper [23] provides an overview ...
Manual Analysis - The Alignment
  • 5 sets were chosen for the alignment
  • An RWS
    • does not need to say everything about the referenced articles
    • but rather refers to some specific aspects (e.g., methods, results, evaluation, ...)
Manual Analysis - The Alignment
  • Relevant information can appear at various positions in the original documents
    • Title and Abstract, Introduction, Body (usually Experiments and Results), Conclusion
Manual Analysis - Revisions
  • Sentence Reduction
    • Text fragment 1: ... substituted each set of candidate paraphrases into between 2-10 sentences which contained the original phrase.
    • RWS sentence: (Bannard and Callison-Burch 2005) replaced phrases with paraphrases in a number of sentences ...
Manual Analysis - Revisions
  • Sentence Combination
    • Text fragment 1: ... substituted each set of candidate paraphrases into between 2-10 sentences which contained the original phrase.
    • Text fragment 2: ... had two native English speakers produce judgments as to whether the new sentences preserved the meaning of the original phrase and as to whether they remained grammatical.
    • RWS sentence: (Bannard and Callison-Burch 2005) replaced phrases with paraphrases in a number of sentences and asked judges whether the substitutions “preserved meaning and remained grammatical”.
Manual Analysis - Revisions
  • Sentence Combination
    • Text fragment 1: ... to preserve both meaning and grammaticality.
    • RWS sentence: ... “preserved meaning and remained grammatical”.
Manual Analysis - Revisions
  • Lexical Paraphrasing
    • Text fragment 1: ... substituted each set of candidate paraphrases into between 2-10 sentences which contained the original phrase.
    • RWS sentence: (Bannard and Callison-Burch 2005) replaced phrases with paraphrases in a number of sentences ...
Manual Analysis - Revisions
  • Generalization/Specification
    • Text fragment 1: We present an unsupervised learning algorithm that mines large text corpora for patterns that express implicit semantic relations.
    • RWS sentence: (Turney 2006a) presents an unsupervised algorithm for mining the Web for patterns expressing implicit semantic relations.
Manual Analysis - Revisions
  • The above revisions are generally not used alone but are usually combined to construct sentences in an RWS
  • Dealing with all the above revisions for RWS summarization is very hard, especially these two:
    • lexical paraphrasing
    • generalization/specification
  • Consider the remaining revisions only!
Manual Analysis – Related Work Representation
  • To examine how to generate and represent a complete RWS
  • Used another data set (RWSData-Sub) consisting of 30 articles for this analysis.
  • There are two main factors that reflect related work summary representation
    • Topic transition
    • Local coherence
Topic transition
  • Observation of the chosen data set reveals two types of topic representation for related work summaries
    • Type 1: using transition sentences to connect topic nodes
      • (23/30, 77%)
    • Type 2: representing topic nodes as topic titles
      • (7/30, 23%)
  • Type 2 representation is sometimes used when a combination of different research problems is relevant to a specific research topic (see the examples below)
[Figure: Type 1 topic transition: transition sentences connect topic nodes (0-4) in traversal order.]

[Figure: Type 2 topic transition: topic nodes (0-3) are represented as topic titles.]

Topic Transition
  • For “Type 1” representation:
    • Makes related work natural but:
      • Non-trivial for automatic generation because of the lack of topic discourse information, e.g., “contrast” or “elaboration” between topic nodes
  • For “Type 2” representation:
    • Not as natural as Type 1 but:
      • Seems to be easy for automatic generation
Local Coherence
  • Local coherence
    • The syntactic realization of discourse entities and transitions between focused entities
    • News summaries: entity = mentions of people
      • RWS: entity = mentions of citations
    • My analysis reveals 14 patterns for mentions of citations in RWSes
Local Coherence
  • Statistics 1

[Table: statistics for the 14 patterns over the RWSData-Sub data set.]

Local Coherence
  • Statistics 2

[Table: statistics for the 14 patterns appearing in each type of topic transition representation over the RWSData-Sub data set.]

Task Formulation

[Diagram: the RW Summarizer takes as input from the user a set of articles, a desired length, and a topic hierarchy tree (assumed to be given in this work), and returns an RW summary.]

A Motivating Example

[Figure: a related work section extracted from “Bilingual Topic Aspect Classification with A Few Training Examples” (Wu et al., 2008).]

The Proposed Approach

[Figure: the ReWoS architecture. SCSum serves leaf nodes; GCSum serves internal nodes. Decision edges are labeled as (T)rue, (F)alse or (R)elevant.]

The Proposed Approach
  • Pre-Processing
    • Based on heuristic rules using sentence length and lexical clues, removing (see the sketch below):
      • Sentences whose token-based length is too short (<7) or too long (>80)
      • Sentences written in the future tense (e.g., future work)
      • Sentences containing obviously redundant clues such as: “in the section ...”, “figure XXX shows ...”, “for instance”, ...
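A minimal sketch of these pre-processing filters in Python; the clue list is abbreviated from the slide, and the future-tense test is a crude stand-in for whatever cue ReWoS actually uses:

    import re

    # Partial clue list from the slide; the real system likely has more entries.
    REDUNDANT_CLUES = ("in the section", "figure", "for instance")

    def keep_sentence(sentence: str) -> bool:
        """Return True if the sentence survives the ReWoS-style pre-processing rules."""
        tokens = sentence.split()
        if len(tokens) < 7 or len(tokens) > 80:        # length rule
            return False
        if re.search(r"\bwill\b", sentence.lower()):   # crude future-tense cue (assumption)
            return False
        lowered = sentence.lower()
        if any(clue in lowered for clue in REDUNDANT_CLUES):  # redundant-clue rule
            return False
        return True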
The Proposed Approach
  • Agent-based rule
    • Attempts to distinguish whether a sentence describes the authors’ own work
    • Based on the presence of tokens that signal work done by the author, such as “we”, “our”, “us”, “this approach”, and “this method”
    • If a sentence satisfies this rule, it is routed to SCSum; otherwise, to GCSum (see the sketch below)
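As a rough illustration of this routing, a sketch in Python; the token list comes from the slide, while the matching strategy is an assumption (the real rule may inspect the grammatical subject):

    AGENTIVE_TOKENS = {"we", "our", "us"}
    AGENTIVE_PHRASES = ("this approach", "this method")

    def route(sentence: str) -> str:
        """Route agentive sentences to SCSum and all other sentences to GCSum."""
        text = sentence.lower()
        agentive = bool(AGENTIVE_TOKENS & set(text.split())) or \
                   any(p in text for p in AGENTIVE_PHRASES)
        return "SCSum" if agentive else "GCSum"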
General Content Summarization (GCSum)
  • The objective of GCSum is to extract sentences containing useful background information on the topics of the internal node in focus.
General Content Summarization (GCSum)

General content: informative vs. indicative sentences

  • Informative examples:
    • Text classification is a task that assigns a certain number of pre-defined labels to a given text.
    • Statistical machine translation (SMT) seeks to develop mathematical models of the translation process whose parameters can be automatically estimated from a parallel corpus.
  • Indicative examples:
    • Many previous studies have approached the problem of mono-lingual text classification.
    • This paper refers to the problem of sentiment analysis.
General Content Summarization (GCSum)
  • Informative sentences
    • Give details on a specific aspect of the problem, e.g., definitions, purpose, or applications of the topic
  • Indicative sentences
    • Simpler; inserted to make the topic transition explicit and rhetorically sound
  • Summarization issue
    • Given a topic:
      • For indicative sentences, use pre-defined templates
      • For informative sentences, extract from the input articles
General Content Summarization (GCSum)

GCSum first checks the subject of each candidate sentence, filtering ones whose subjects do not contain at least one topic keyword. (Subject-based rule)

Or GCSum checks whether stock verb phrases (i.e., “based on”, “make use of” and 23 other patterns) are used as the main verb. (Verb-based rule)

Or GCSum checks for the presence of at least one citation – general sentences may list a set of citations as examples. (Citation-based rule)

Importantly note that if cannot find out any informative sentences from input articles, generate indicative sentences instead!
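A minimal sketch of the cascade, reading the three rules as a disjunction; the subject string, keyword set, and citation regex are simplifications (the slide does not specify how subjects are extracted):

    import re

    STOCK_VERBS = ("based on", "make use of")  # 2 of the 25 patterns mentioned on the slide
    CITATION_RE = re.compile(r"\(\s*[A-Z][\w-]+(?: et al\.)?,?\s*\d{4}\s*\)|\[\d+\]")

    def is_informative(sentence: str, subject: str, topic_keywords: set) -> bool:
        """Apply the subject-based, verb-based, and citation-based rules in turn."""
        if set(subject.lower().split()) & {k.lower() for k in topic_keywords}:
            return True                                    # subject-based rule
        if any(v in sentence.lower() for v in STOCK_VERBS):
            return True                                    # verb-based rule (no real parsing)
        if CITATION_RE.search(sentence):
            return True                                    # citation-based rule
        return False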

General Content Summarization (GCSum)
  • Topic relevance computation (GCSum)
    • Ranks sentences based on keyword content
    • The topic of an internal node is affected by its surrounding nodes: ancestors, descendants, and others

scoreS = scoreSQ + scoreSQA − scoreSQR

where scoreS is the final relevance score, and scoreSQ, scoreSQA, and scoreSQR are the component relevance scores of sentence S with respect to the current, ancestor, and other remaining nodes, respectively.

General Content Summarization (GCSum)
  • Topic relevance computation (GCSum)

[Figure: a topic hierarchy tree with nodes 1-7. For internal node 4, node 1 is its ancestor, node 5 a descendant, and nodes 2, 3, 6, 7 the others. The linear combination: S'(itself) = S(itself) + S(ancestors) − S(others). The maximum number of sentences for each intermediate node is 2-3.]

General Content Summarization (GCSum)
  • To obtain each component relevance score, we employ TF×ISF relevance computation
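A minimal sketch of a TF×ISF scorer (term frequency × inverse sentence frequency), assuming whitespace tokenization; this illustrates the idea rather than reproducing the exact ReWoS computation:

    import math
    from collections import Counter

    def tf_isf_score(sentence: str, keywords: set, all_sentences: list) -> float:
        """Score a sentence against topic keywords with TF x ISF."""
        n = len(all_sentences)
        tf = Counter(sentence.lower().split())
        score = 0.0
        for kw in (k.lower() for k in keywords):
            if tf[kw] == 0:
                continue
            # ISF: down-weight terms that appear in many sentences
            sf = sum(1 for s in all_sentences if kw in s.lower().split())
            score += tf[kw] * math.log(n / (1 + sf))
        return score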
Specific Content Summarization (SCSum)
  • Sentences that are marked with author-as-agent are input to the Specific Content Summarization (SCSum) module.
  • SCSum aims to extract sentences that contain detailed information about a specific author’s work that is relevant to the input leaf nodes’ topic.
Specific Content Summarization (SCSum)
  • Topic relevance computation (SCSum)

[Figure: a topic hierarchy tree with a leaf node in focus, its siblings, and its ancestors. The relevance score is computed with a formula similar to GCSum’s above, but sibling nodes are penalized instead of “other” nodes: S'(itself) = S(itself) + S(ancestors) − S(siblings). Initially, the number of sentences for each leaf node is assigned equally.]
Specific Content Summarization (SCSum)
  • Context modeling
    • Motivation: single sentences occasionally do not contain enough context to clearly express the idea mentioned in the original articles
    • Use the contexts to increase the confidence of agent-based sentences

final_score(sentence) = score(sentence) + score(contexts), with both scores computed with respect to the topic

SCSum - Context modeling

Example extracted from (Bannard and Callison-Burch 2005): an agent-based sentence and its adjacent sentences (the context window), followed by the resulting summary sentence.

  • We evaluated the accuracy of each of the paraphrases that was extracted from the manually aligned data, as well as the top ranked paraphrases from the experimental conditions detailed below in Section 3.3.
  • Because the accuracy of paraphrases can vary depending on context, we substituted each set of candidate paraphrases into between 2-10 sentences which contained the original phrase.
  • Figure 4 shows the paraphrases for under control substituted into one of the sentences in which it occurred.
  • We created a total of 289 such evaluation sets, with a total of 1366 unique sentences created through substitution.
  • We had two native English speakers produce judgments as to whether the new sentences preserved the meaning of the original phrase and as to whether they remained grammatical.
  • Paraphrases that were judged to preserve both meaning and grammaticality were considered to be correct, and examples which failed on either judgment were considered to be incorrect.

Summary sentence:
  • (Bannard and Callison-Burch 2005) replaced phrases with paraphrases in a number of sentences and asked judges whether the substitutions “preserved meaning and remained grammatical.”

Specific Content Summarization (SCSum)
  • Context modeling
    • Choose nearby sentences within a contextual window (size 5) after the agent-based sentence to provide additional support for the given topic (see the sketch below).
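A sketch of scoring the context window, reusing the tf_isf_score helper from the TF×ISF sketch above; the window placement (strictly after the agent-based sentence) follows the slide, everything else is an assumption:

    def context_score(sentences: list, agent_idx: int, keywords: set,
                      window: int = 5) -> float:
        """Sum the topic relevance of the sentences following an agent-based sentence."""
        context = sentences[agent_idx + 1 : agent_idx + 1 + window]
        return sum(tf_isf_score(s, keywords, sentences) for s in context)

    # final_score(sentence) = score(sentence) + score(contexts), per the earlier slide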
Specific Content Summarization (SCSum)
  • Weighting
    • Observation: the presence of keywords from the current, ancestor, and sibling nodes may affect the final score
    • Add a weighting coefficient to the score from the topic relevance computation (SCSum); the coefficient takes on differing values based on the presence of keywords in the sentence:
      • If the sentence contains no keywords from siblings:
        • keywords from both ancestors & itself → 1
        • keywords from itself only → 0.5
        • keywords from ancestors only → 0.25
      • If the sentence contains keywords from siblings → 0.1 (penalty)
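The case analysis translates directly into a small function; the zero fallback for a sentence with no keywords at all is an assumption, since the slide does not cover that case:

    def weight(has_itself: bool, has_ancestor: bool, has_sibling: bool) -> float:
        """Weighting coefficient from the keyword-presence cases on the slide."""
        if has_sibling:
            return 0.1        # penalty for sibling-topic keywords
        if has_itself and has_ancestor:
            return 1.0
        if has_itself:
            return 0.5
        if has_ancestor:
            return 0.25
        return 0.0            # assumption: no keywords at all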

Specific Content Summarization (SCSum)
  • Ranking & Re-ranking
    • Sentences are ranked in descending order of relevance score
    • Then a simplified MMR re-ranking (SimRank) is performed (sketched below):
      • A sentence X is removed if its maximum cosine similarity with any sentence Y already chosen in previous steps of SimRank exceeds a pre-defined threshold (0.75).
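A minimal sketch of SimRank as described; `sim` stands in for any cosine-similarity function over sentence vectors (not specified on the slide):

    def sim_rank(ranked: list, sim, threshold: float = 0.75) -> list:
        """Greedily keep sentences whose similarity to all chosen ones stays below threshold."""
        chosen = []
        for x in ranked:                 # ranked: descending relevance order
            if all(sim(x, y) <= threshold for y in chosen):
                chosen.append(x)
        return chosen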
Post-Processing
  • Two steps:
    • First, replace agentive forms (e.g., “we”, “our”, “this study”, ...) with a citation to the article
      • topic transition uses Type 2, together with the P1, P2, and C1 patterns for representing mentions of citations
    • Second, resolve abbreviations found in the extracted sentences (a sketch of both steps follows)
      • E.g., SMT → Statistical Machine Translation
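A rough sketch of both steps; the agentive-form list is abbreviated, and a real system would need grammatical repair after substitution rather than this naive string replacement:

    import re

    AGENTIVE_RE = re.compile(r"\b(we|our|us|this study)\b", re.IGNORECASE)

    def post_process(sentence: str, citation: str, abbreviations: dict) -> str:
        """Replace agentive forms with a citation, then expand known abbreviations."""
        out = AGENTIVE_RE.sub(citation, sentence)
        for abbr, full in abbreviations.items():  # e.g. {"SMT": "Statistical Machine Translation"}
            out = re.sub(rf"\b{re.escape(abbr)}\b", full, out, count=1)
        return out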
Generation
  • In this work, we generate related work summaries only by using
    • depth-first traversals to form the ordering of topic nodes in a topic tree (see the sketch below)

Node ordering: 1 − 4 − 2 − 3 − 5 − 6 − 7
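A sketch of the traversal; the tree below is hypothetical, chosen so that depth-first order reproduces the node ordering on the slide:

    def dfs_order(tree: dict, root: int) -> list:
        """Depth-first traversal ordering of topic nodes."""
        order = [root]
        for child in tree.get(root, []):
            order.extend(dfs_order(tree, child))
        return order

    tree = {1: [4, 5], 4: [2, 3], 5: [6, 7]}   # hypothetical example tree
    print(dfs_order(tree, 1))                  # [1, 4, 2, 3, 5, 6, 7]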

Experiments & Results
  • Dataset
    • Use RWSData for evaluation, including 20 sets
      • 10 out of 20 sets were evaluated automatically and manually.
  • Baselines
    • LEAD (title+abstract-based RW)
    • MEAD (centroid + cosine similarity): topic-based summarization
  • Proposed systems
    • ReWoS-WCM (ReWoS without context modeling)
    • ReWoS-CM (ReWoS with context modeling)
Experiments & Results
  • Automatic evaluation
    • Use ROUGE variants (ROUGE-1, ROUGE-2, ROUGE-S4, ROUGE-SU4)
  • Manual evaluation (measured on a 5-point scale from 1 (very poor) to 5 (very good)):
    • Correctness: Is the summary content actually relevant to the hierarchical topics given?
    • Novelty: Does the summary introduce novel information that is significant in comparison with the human created summary?
    • Fluency: Does the summary’s exposition flow well, in terms of syntax as well as discourse?
    • Usefulness: Is the summary acceptable in terms of its usefulness in helping researchers quickly grasp the related works relevant to the given hierarchical topics?
  • Summary length: 1% of the original relevant articles, measured in sentences
Experiments & Results
  • ROUGE evaluation behaves unreliably on verbose summaries, which MEAD often produces.
  • Related work summaries are multi-topic summaries of multi-article references, which may cause miscalculation from overlapping n-grams occurring across multiple topics or references.
Experiments & Results
  • The table shows that both ReWoS-WCM and ReWoS-CM perform significantly better than the baselines in terms of correctness, novelty, and usefulness.
  • The comparison with LEAD shows that the necessary information is located not only in titles and abstracts, but also in relevant portions of the article body.
  • ReWoS-CM (with context modeling) performed comparably to ReWoS-WCM (without it) in terms of correctness and usefulness.
  • For novelty, ReWoS-CM is better than ReWoS-WCM, which suggests that the proposed context modeling component is useful in providing new information.
Future work
  • An envisioned fully automated related work summarization pipeline

[Figure: the envisioned pipeline, distinguishing the current focus from future directions.]

Future work
  • Within ReWoS:
    • Context modeling
      • Fusion of contextual sentences
    • More in-depth related work representation
      • Type 1 topic transition with other patterns
  • Relax the assumption that the topic hierarchy tree and the set of relevant papers are given as input
    • by automating the topic understanding and paper retrieval components
Future Work
  • A feasible algorithm for automatic decomposition of related work summaries
  • A robust automatic evaluation for related work summarization task
  • Go towards practical applications that benefit from automated related work summarization research
Conclusion
  • Three main contributions of this work:
    • Constructed a new data set (namely RWSData) specific to the task of related work summarization
    • Conducted a deep manual analysis of various aspects of related work summaries, including:
      • Characteristics of RW summaries covered (definition, position, and topical structure)
      • The decomposition and alignment of RW summaries, RW representation, revisions, evaluation metrics.
    • Developed my initial prototype Related Work Summarization system, namely ReWoS:
      • Heuristics-based system
      • Utilizes the structure of a topic hierarchy tree
      • Implements novel strategies using both general and specific content summarization