
Automatic summarization

Dragomir R. Radev

University of Michigan

radev@umich.edu

Outline
  • What is summarization
  • Genres of summarization (Single-doc, Multi-doc, Query-based, etc.)
  • Extractive vs. non-extractive summarization
  • Evaluation metrics
  • Current systems
    • Marcu/Knight
    • MEAD/Lemur
    • NewsInEssence/NewsBlaster
  • What is possible and what is not
Goal of summarization
  • Preserve the “most important information” in a document.
  • Make use of redundancy in text
  • Maximize information density

Compression Ratio = |S| / |D|

Retention Ratio = i(S) / i(D)

Goal:  i(S) / i(D)  >  |S| / |D|

(S is the summary, D the source document, |·| length, and i(·) information content: a good summary retains proportionally more information than text.)
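A minimal sketch of the two ratios in Python, assuming length is measured in words and approximating i(·) by a hand-labeled set of important terms (both simplifications are assumptions, not from the slides):

```python
def compression_ratio(summary, document):
    """|S| / |D|: fraction of the document kept, measured in words."""
    return len(summary.split()) / len(document.split())

def retention_ratio(summary, document, important):
    """i(S) / i(D), approximating i(.) by counting hand-labeled important words."""
    kept = sum(1 for w in summary.split() if w in important)
    total = sum(1 for w in document.split() if w in important)
    return kept / total if total else 0.0
```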

Sentence-extraction based (SE) summarization
  • Can be cast as a classification problem: decide, for each sentence, whether it belongs in the summary
  • An approximation: an extract approximates the ideal (abstractive) summary
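As an illustration of the classification framing, a toy sketch using scikit-learn; the feature values and labels below are invented placeholders (real systems use the features listed on the next slide):

```python
from sklearn.linear_model import LogisticRegression

# Each row: [relative position, sentence length, query overlap] -- illustrative
X_train = [[0.0, 25, 0.4], [0.5, 12, 0.1], [0.9, 18, 0.0], [0.1, 30, 0.6]]
y_train = [1, 0, 0, 1]   # 1 = sentence appears in the reference summary

clf = LogisticRegression().fit(X_train, y_train)
p_in_summary = clf.predict_proba([[0.2, 20, 0.3]])[:, 1]  # P(in summary | features)
```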
Typical approaches to SE summarization
  • Manually-selected features: position, overlap with query, cue words, structure information, overlap with centroid
  • Reranking: maximal marginal relevance [Carbonell/Goldstein98]
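A minimal sketch of MMR reranking, assuming `sim` is any similarity function in [0, 1] (e.g., cosine over term vectors): each pick trades query relevance against redundancy with sentences already selected.

```python
def mmr_select(candidates, query, sim, k=3, lam=0.7):
    """Greedy MMR: lam * relevance - (1 - lam) * redundancy with picks so far."""
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        def mmr_score(s):
            redundancy = max((sim(s, t) for t in selected), default=0.0)
            return lam * sim(s, query) - (1 - lam) * redundancy
        best = max(pool, key=mmr_score)
        selected.append(best)
        pool.remove(best)
    return selected
```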
Non-SE summarization
  • Discourse-based [Marcu97]
  • Lexical chains [Barzilay&Elhadad97]
  • Template-based [Radev&McKeown98]
Evaluation metrics
  • Intrinsic measures
    • Precision, recall
    • Kappa
    • Relative utility [Radev&al.00]
    • Similarity measures (cosine, overlap, BLEU)
  • Extrinsic measures
    • Classification accuracy
    • Informativeness for question answering
    • Relevance correlation
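Two of the intrinsic measures above are easy to sketch: precision/recall over extracted sentence IDs, and cosine similarity over bag-of-words vectors (a simplified illustration, not a full evaluation suite):

```python
from collections import Counter
import math

def precision_recall(system_ids, reference_ids):
    """Agreement between system and reference extracts, by sentence ID."""
    system, ref = set(system_ids), set(reference_ids)
    overlap = len(system & ref)
    precision = overlap / len(system) if system else 0.0
    recall = overlap / len(ref) if ref else 0.0
    return precision, recall

def cosine(a, b):
    """Cosine similarity between two texts as bag-of-words vectors."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0
```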
Web resources

  • http://www.summarization.com
  • http://duc.nist.gov
  • http://www.newsinessence.com
  • http://www.clsp.jhu.edu/ws2001/groups/asmd/
  • http://www.cs.columbia.edu/~jing/summarization.html
  • http://www.dcs.shef.ac.uk/~gael/alphalist.html
  • http://www.csi.uottawa.ca/tanka/ts.html
  • http://www.ics.mq.edu.au/~swan/summarization/

Summarization architecture
  • What do human summarizers do?
    • A: Start from scratch: analyze, transform, synthesize (top down)
    • B: Select material and revise: “cut and paste summarization” (Jing & McKeown, 1999)
  • Automatic systems:
    • Extraction: selection of material
    • Revision: reduction, combination, syntactic transformation, paraphrasing, generalization, sentence reordering

[Slide figure: extracts vs. abstracts arranged along a complexity axis]
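A toy rendering of the extract-then-revise (“cut and paste”) pipeline: the scoring function and the parenthetical-dropping revision below are stand-ins for the operations listed above, not the published method.

```python
import re

def extract(sentences, score, k=2):
    """Selection of material: keep the k highest-scoring sentences."""
    return sorted(sentences, key=score, reverse=True)[:k]

def revise(sentence):
    """Toy revision: drop parentheticals as a stand-in for sentence reduction."""
    return re.sub(r"\s*\([^)]*\)", "", sentence)

def summarize(sentences, score, k=2):
    return " ".join(revise(s) for s in extract(sentences, score, k))
```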

Examples of generative models in summarization systems
  • Sentence selection
  • Sentence / document reduction
  • Headline generation
Ex. 1: Sentence selection
  • Conroy et al. (DUC 2001):
      • HMM at the sentence level; each state has an associated feature vector (position, length, number of content terms)
      • Compute the probability that a sentence belongs in the summary
  • Kraaij et al. (DUC 2001):
      • Rank sentences by their posterior probability under a mixture model
  • Grammaticality is OK
  • Lacks aggregation, generalization, and multi-document summarization (MDS)
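A simplified stand-in for the mixture-model ranking idea: score each sentence by its log-likelihood ratio under a “summary” unigram model versus a “background” model. The add-alpha smoothing and the two-model setup are assumptions for illustration, not the DUC systems themselves.

```python
import math

def log_prob(sentence, model, vocab_size, alpha=1.0):
    """Add-alpha smoothed unigram log-likelihood; `model` maps word -> count."""
    total = sum(model.values())
    return sum(math.log((model.get(w, 0) + alpha) / (total + alpha * vocab_size))
               for w in sentence.split())

def rank_sentences(sentences, summary_model, background_model, vocab_size):
    """Sort by log-likelihood ratio: summary model vs. background model."""
    def score(s):
        return (log_prob(s, summary_model, vocab_size)
                - log_prob(s, background_model, vocab_size))
    return sorted(sentences, key=score, reverse=True)
```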
Knight & Marcu (AAAI 2000)
  • Compression: delete substrings in an informed way (based on parse tree)
    • Required: PCFG parser, tree aligned training corpus
    • Channel model: probabilistic model for expansion of a parse tree
    • Results: much better than NP baseline
  • Tight control on grammaticality
  • Mimics revision operations by humans
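The noisy-channel objective is argmax over short strings s of P(s) · P(long | s). The real system scores parse-tree expansions; the sketch below only enumerates single-span word deletions and leaves the two scoring functions abstract, so it is an illustration of the objective, not of the model.

```python
from itertools import combinations

def candidates(words):
    """All compressions obtained by deleting one contiguous span of words."""
    for i, j in combinations(range(len(words) + 1), 2):
        kept = words[:i] + words[j:]
        if kept:
            yield kept

def compress(sentence, source_logprob, channel_logprob):
    """argmax over candidates of log P(short) + log P(long | short)."""
    words = sentence.split()
    best = max(candidates(words),
               key=lambda short: source_logprob(short) + channel_logprob(words, short))
    return " ".join(best)
```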
Daumé & Marcu (ACL 2002)
  • Document compression, noisy channel
    • Based on syntactic structure and discourse structure (extension of Knight & Marcu model)
    • Required: Discourse & syntactic parsers
    • Training corpus where EDUs (elementary discourse units) in the summaries are aligned with the documents
  • Cannot handle interesting document lengths (due to complexity)
Berger & Mittal (SIGIR 2000)
  • Input: web pages (often not running text)
    • Trigram language model
    • IBM Model 1-like channel model:
      • Choose length, draw word from source model and replace with similar word, independence assumption
    • Trained on Open Directory
  • Non-extractive
  • Grammaticality and coherence are disappointing: the output is indicative rather than informative
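A toy rendering of the gisting channel described above: choose a length, draw words from the source unigram model, and swap each for a “similar” word. The similarity table is an invented placeholder, and each draw is independent, matching the independence assumption on the slide.

```python
import random

def gist(source_words, similar, length=8):
    """Choose a length, draw from the source model, replace with similar words."""
    drawn = random.choices(source_words, k=length)     # independent unigram draws
    return [random.choice(similar.get(w, [w])) for w in drawn]
```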
Zajic, Dorr & Schwartz (DUC 2002)
  • Headline generation from a full story: P(S|H)P(H)
  • Channel model: an HMM combining a bigram model of headline words with a unigram model of story words; P(H) is a bigram language model
  • Decoding parameters are crucial to produce good results (length, position, strings)
  • Good results in fluency and accuracy
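A brute-force sketch of the decoder’s job: among ordered subsets of the story’s early words, pick the sequence a headline bigram model scores highest. Exhaustive search over a small window stands in for the real HMM decoder, and `bigram_logprob` is an assumed scoring function.

```python
from itertools import combinations

def best_headline(story_words, bigram_logprob, length=6, window=25):
    """Exhaustively score ordered length-`length` subsets of the early story words."""
    pool = story_words[:window]          # position constraint, as in the decoder
    def score(seq):
        return sum(bigram_logprob(prev, cur)
                   for prev, cur in zip(("<s>",) + seq, seq))
    return max(combinations(pool, length), key=score)
```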
Conclusions
  • Fluent headlines within reach of simple generative models
  • High quality summaries (coverage, grammaticality, coherence) require higher level symbolic representations
  • Cut & paste metaphor divides the work into manageable sub-problems
  • Noisy channel method effective, but not always efficient
Open issues
  • Audience (user model)
  • Types of source documents
  • Dealing with redundancy
  • Information ordering (e.g., temporal)
  • Coherent text
  • Cross-lingual summarization (Norbert Fuhr)
  • Use summaries to improve IR (or CLIR) - relevance correlation
  • LM for text generation
  • Possibly not well-defined problem (low interjudge agreement)
  • Develop models with more linguistic structure
  • Develop integrated models, e.g. by using priors (Rosenfeld)
  • Build efficient implementations
  • Evaluation: Define a manageable task