Automatic Summary Evaluation

Automatic Summary Evaluation Ross Greenwood

Recap • Automatically evaluate summaries of text documents • Evaluate content coverage • Compare against one or more ideal summaries

Pyramid Evaluation • Manually annotate texts for phrases expressing similar ideas (summary content units) • Judge content coverage by number of overlapping summary content units

ROUGE: Four Summary Evaluation Measures • ROUGE-N: N-gram Co-Occurrence • Number of matching N-word substrings • ROUGE-L: Longest Common Subsequence • Allows for skipping words • Ex.: “a b d f” is a subsequence of “a b c d e f” • ROUGE-W: Weighted LCS • Weights consecutive matches higher • ROUGE-S: Skip-bigram • Number of matching ordered word pairs, with arbitrary gaps allowed between the two words
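As a rough illustration of what three of these measures count, here is a minimal Python sketch. It assumes whitespace-tokenized input and a single model summary. The function names are illustrative and are not taken from the ROUGE package, and the ROUGE-W weighting is left out.

```python
from collections import Counter
from itertools import combinations


def ngrams(tokens, n):
    """All contiguous n-word substrings of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_matches(peer, model, n):
    """ROUGE-N style count: n-grams shared by peer and model (clipped)."""
    peer_counts = Counter(ngrams(peer, n))
    model_counts = Counter(ngrams(model, n))
    return sum((peer_counts & model_counts).values())


def lcs_length(a, b):
    """ROUGE-L style count: length of the longest common subsequence."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def skip_bigram_matches(peer, model):
    """ROUGE-S style count: in-order word pairs shared, with any gap."""
    peer_pairs = Counter(combinations(peer, 2))
    model_pairs = Counter(combinations(model, 2))
    return sum((peer_pairs & model_pairs).values())


# The subsequence example from the slide.
model = "a b c d e f".split()
peer = "a b d f".split()
print(ngram_matches(peer, model, 2))     # 1  (only "a b" is contiguous in both)
print(lcs_length(peer, model))           # 4  (the whole peer is a subsequence)
print(skip_bigram_matches(peer, model))  # 6  (every peer pair occurs in order)
```

The same pair of sequences scores very differently under each count, which is why the choice of measure matters.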

Precision, Recall, and F-Measure • Precision = matches/num_words_peer • Recall = matches/num_words_models • F = 2/(1/P + 1/R)
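The three formulas can be written out directly. This is a small sketch, assuming the match count has already been computed by one of the ROUGE measures; the function name and the zero-handling are choices made here, not part of the slide.

```python
def precision_recall_f(matches, num_words_peer, num_words_models):
    """Return (P, R, F) using the definitions on the slide."""
    precision = matches / num_words_peer if num_words_peer else 0.0
    recall = matches / num_words_models if num_words_models else 0.0
    if precision == 0.0 or recall == 0.0:
        # The harmonic mean is undefined here; report 0 by convention.
        return precision, recall, 0.0
    f_measure = 2 / (1 / precision + 1 / recall)
    return precision, recall, f_measure


# Example: 4 matches, a 4-word peer summary, a 6-word model summary.
p, r, f = precision_recall_f(matches=4, num_words_peer=4, num_words_models=6)
print(round(p, 3), round(r, 3), round(f, 3))  # 1.0 0.667 0.8
```

A short peer summary can reach perfect precision while still missing content, which is what recall captures.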

Problems with ROUGE-N: False Positives • Homographs, ex: Model: … robbed the bank … Peer: … sat on the river bank …

Problems with ROUGE-N: False Negatives • Synonyms, ex: Model: … held up the financial institution … Peer: … robbed the bank …
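Both problems show up with plain word matching on the fragments from these two slides. The sketch below only compares the visible fragments (the elided context is not reconstructed), and `shared_words` is a hypothetical helper, not part of ROUGE.

```python
def shared_words(model, peer):
    """Surface-level unigram overlap, ignoring case."""
    return set(model.lower().split()) & set(peer.lower().split())


# False positive: "bank" matches although the two senses differ.
false_positive = shared_words("robbed the bank", "sat on the river bank")
print(sorted(false_positive))  # ['bank', 'the']

# False negative: same event, but no content word matches.
false_negative = shared_words("held up the financial institution",
                              "robbed the bank")
print(sorted(false_negative))  # ['the']
```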

Solution: WordNet • Lexical database • Synsets: groups of synonymous words, organized by concept • Method: • Tag words with part of speech (POS) • Tag words with word sense (senseLearner) • Look up each word's synset in WordNet

Architecture of Solution • [Diagram: Data → POS tagger → senseLearner → WordNet lookup → ROUGE → Results. Lookup example: querySense(“run#v#3”, “syns”) returns {go#v#7, pass#v#6, lead#v#6, extend#v#2}]
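One way to picture the matching step is to replace each sense-tagged word with an identifier for its synset before counting overlaps. The sketch below uses a tiny hand-built table as a stand-in for WordNet. The first group comes from the querySense example; the `bank` and `financial_institution` sense numbers are hypothetical, and `TOY_SYNSETS`, `to_concepts`, and `concept_matches` are names invented for this illustration.

```python
from collections import Counter

# Toy stand-in for WordNet: each set is one synset of sense-tagged words.
TOY_SYNSETS = [
    # From the querySense("run#v#3", "syns") example.
    {"run#v#3", "go#v#7", "pass#v#6", "lead#v#6", "extend#v#2"},
    # Hypothetical sense numbers, for illustration only.
    {"bank#n#2", "financial_institution#n#1"},  # the institution
    {"bank#n#1"},                               # the river bank
]
SENSE_TO_SYNSET = {
    sense: index
    for index, group in enumerate(TOY_SYNSETS)
    for sense in group
}


def to_concepts(tagged_tokens):
    """Map each sense-tagged token to its synset id (or itself if unknown)."""
    return [SENSE_TO_SYNSET.get(token, token) for token in tagged_tokens]


def concept_matches(peer, model):
    """Unigram overlap counted on synset ids instead of surface words."""
    overlap = Counter(to_concepts(peer)) & Counter(to_concepts(model))
    return sum(overlap.values())


# Synonyms now match; homographs with different senses no longer do.
print(concept_matches(["bank#n#2"], ["financial_institution#n#1"]))  # 1
print(concept_matches(["bank#n#1"], ["bank#n#2"]))                   # 0
print(concept_matches(["go#v#7"], ["run#v#3"]))                      # 1
```

In a real pipeline the sense tags would come from the POS tagger and sense tagger, so the quality of the match depends on how accurate those taggers are.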

Evaluating the Evaluator • Correlation with human evaluation scores (ROUGE, Basic Elements) • Success at reducing errors (i.e. number of false negatives/positives avoided vs. original ROUGE)
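The correlation check could look like the following sketch. The slide does not say which coefficient is used; Pearson's r is assumed here, and a rank correlation would be a reasonable alternative. The scores are made-up placeholders that only exercise the function and are not results.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two score lists of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# Placeholder scores for five summaries (not real evaluation data).
metric_scores = [0.2, 0.4, 0.5, 0.7, 0.9]
human_scores = [1.0, 2.0, 2.5, 3.5, 4.5]
print(round(pearson(metric_scores, human_scores), 3))
```

A value near 1 would mean the automatic metric ranks summaries much as the human judges do.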

References • Lin, C.-Y. (2004). ROUGE: A package for automatic evaluation of summaries. Workshop on Text Summarization Branches Out. • Fellbaum, C. (Ed.). (1998). WordNet: An electronic lexical database. Cambridge, MA: MIT Press.

Questions?
