
The PASCAL Recognizing Textual Entailment Challenges - RTE-1,2,3

Ido Dagan Bar-Ilan University, Israel

with …

Recognizing Textual Entailment: PASCAL NOE Challenge 2004-5

Ido Dagan, Oren Glickman Bar-Ilan University, Israel

Bernardo Magnini ITC-irst, Trento, Italy


The Second PASCAL Recognising Textual Entailment Challenge

Roy Bar-Haim, Ido Dagan, Bill Dolan, Lisa Ferro, Danilo Giampiccolo, Bernardo Magnini, Idan Szpektor

Bar-Ilan, CELCT, ITC-irst, Microsoft Research, MITRE

The Third PASCAL Recognising Textual Entailment Challenge


Danilo Giampiccolo (CELCT) and Bernardo Magnini (FBK-ITC)

With Ido Dagan (Bar-Ilan) and Bill Dolan (Microsoft Research)

Patrick Pantel (USC-ISI), for Resources Pool

Hoa Dang and Ellen Voorhees (NIST), for Extended Task

RTE Motivation

  • Text applications require semantic inference

  • A common framework for addressing applied inference as a whole is needed, but still missing

    • Global inference is typically application dependent

    • Application-independent approaches and resources exist for some semantic sub-problems

  • Textual entailment may provide such common application-independent semantic framework

Framework Desiderata

A framework for modeling a target level of language processing should provide:

  • Generic module for applications

    • A common underlying task, unified interface (cf. parsing)

  • Unified paradigm for investigating sub-phenomena


  • The textual entailment task – what and why?

  • Evaluation dataset & methodology

  • Participating systems and approaches

  • Potential for machine learning

  • Framework for investigating semantics




Natural Language and Meaning



Variability of Semantic Expression

Examples, all entailing Stock market hits a record high:

  • The Dow Jones Industrial Average closed up 255

  • Dow ends up

  • Dow gains 255 points

  • Dow climbs 255

Model variability as relations between text expressions:

  • Equivalence: text1 ⇔ text2 (paraphrasing)

  • Entailment: text1 ⇒ text2 – the general case

Typical Application Inference

Question: Who bought Overture?  >>  Expected answer form: X bought Overture

text: Overture’s acquisition by Yahoo

hypothesized answer: Yahoo bought Overture

The text should entail the hypothesized answer.

  • Similar for IE: X buy Y

  • “Semantic” IR: t: Overture was bought …

  • Summarization (multi-document) – identify redundant info

  • MT evaluation (and recent ideas for MT)

  • Educational applications, …

KRAQ’05 Workshop: Knowledge and Reasoning for Answering Questions (IJCAI-05)


  • Reasoning aspects:

    • information fusion

    • search criteria expansion models

    • summarization and intensional answers

    • reasoning under uncertainty or with incomplete knowledge

  • Knowledge representation and integration:

    • levels of knowledge involved (e.g. ontologies, domain knowledge)

    • knowledge extraction models and techniques to optimize response accuracy

… but similar needs arise in other applications – can entailment provide a common empirical task?

Classical Entailment Definition

  • Chierchia & McConnell-Ginet (2001): A text t entails a hypothesis h if h is true in every circumstance (possible world) in which t is true

  • Strict entailment - doesn't account for some uncertainty allowed in applications

“Almost certain” Entailments

t:The technological triumph known as GPS … was incubated in the mind of Ivan Getting.

h: Ivan Getting invented the GPS.

Applied Textual Entailment

  • Directional relation between two text fragments: Text (t) and Hypothesis (h):

  • Operational (applied) definition:

    • Human gold standard - as in NLP applications

    • Assuming common background knowledge – which is indeed expected from applications

Evaluation Dataset

Generic Dataset by Application Use

  • 7 application settings in RTE-1, 4 in RTE-2/3

    • QA

    • IE

    • “Semantic” IR

    • Comparable documents / multi-doc summarization

    • MT evaluation

    • Reading comprehension

    • Paraphrase acquisition

  • Most data created from actual applications output

  • ~800 examples in development and test sets

  • 50-50% YES/NO split

Some Examples

Final Dataset (RTE-2)

  • Average pairwise inter-judge agreement: 89.2%

    • Average Kappa 0.78 – substantial agreement

    • Better than RTE-1

  • Removed 18.2% of pairs due to disagreement (3-4 judges)

  • Disagreement example:

    • (t) Women are under-represented at all political levels ... (h) Women are poorly represented in parliament.

  • Additional review removed 25.5% of pairs

    • too difficult / vague / redundant
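The agreement figures above follow the standard Cohen's-kappa computation, which compares observed agreement against the agreement expected by chance. A minimal sketch on toy YES/NO judgments (the labels below are illustrative, not actual RTE annotations):

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators judging the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators label identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: sum over labels of the product of marginal frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[l] * freq_b[l] for l in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Toy YES/NO entailment judgments (illustrative only).
a = ["YES", "YES", "NO", "NO", "YES", "NO"]
b = ["YES", "NO",  "NO", "NO", "YES", "NO"]
print(round(cohen_kappa(a, b), 3))  # 0.667: observed 5/6, chance 0.5
```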

Final Dataset (RTE-3)

  • Each pair judged by three annotators

  • Pairs on which the annotators disagreed were filtered-out.

  • Average pairwise annotator agreement: 87.8% (Kappa level of 0.75)

  • Filtered-out pairs:

    • 19.2 % due to disagreement

    • 9.4 % as controversial, too difficult, or too similar to other pairs

Progress from RTE-1 to RTE-3

  • More realistic application data:

    • RTE-1: some partly synthetic examples

    • RTE-2&3 mostly:

      • Input from common benchmarks for the different applications

      • Output from real systems

    • Test entailment potential across applications

  • Text length:

    • RTE-1&2: one-two sentences

    • RTE-3: 25% full paragraphs, requires discourse modeling/anaphora

  • Improve data collection and annotation

    • Revised and expanded guidelines

    • Most pairs triply annotated, some across organizers sites

  • Provide linguistic pre-processing, RTE Resources Pool

  • RTE-3 pilot task by NIST: 3-way judgments; explanations

Suggested Perspective

RE the Arthur Bernstein competition:

“… Competition, even a piano competition, is legitimate … as long as it is just an anecdotal side effect of the musical culture scene, and doesn’t threaten to overtake the center stage”

Haaretz (Israeli newspaper), Culture Section, April 1, 2005

Participating Systems

Participation

  • Popular challenges, world wide:

    • RTE-1 – 17 groups

    • RTE-2 – 23 groups

    • RTE-3 – 26 groups

      • 14 Europe, 12 US

      • 11 newcomers (~40 groups so far)

      • 79 dev-set downloads (44 planned, 26 maybe)

      • 42 test-set downloads

      • Joint ACL-07/PASCAL workshop (~70 participants)

Methods and Approaches

  • Estimate similarity match between t and h (coverage of h by t):

    • Lexical overlap (unigram, N-gram, subsequence)

    • Lexical substitution (WordNet, statistical)

    • Lexical-syntactic variations (“paraphrases”)

    • Syntactic matching/edit-distance/transformations

    • Semantic role labeling and matching

    • Global similarity parameters (e.g. negation, modality)

    • Anaphora resolution

  • Probabilistic tree-transformations

  • Cross-pair similarity

  • Detect mismatch (for non-entailment)

  • Logical interpretation and inference
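The first family of methods above, lexical overlap, reduces to a coverage score of h by t plus a decision threshold. A minimal sketch (the tokenization and the 0.75 threshold are illustrative assumptions, not any system's actual settings):

```python
def lexical_overlap(text, hypothesis):
    """Fraction of hypothesis tokens covered by the text (coverage of h by t)."""
    t_tokens = set(text.lower().split())
    h_tokens = hypothesis.lower().split()
    covered = sum(tok in t_tokens for tok in h_tokens)
    return covered / len(h_tokens)

def entails(text, hypothesis, threshold=0.75):
    # Threshold is illustrative; real systems tune it on the development set.
    return lexical_overlap(text, hypothesis) >= threshold

t = "Yahoo announced that it bought Overture last week"
h = "Yahoo bought Overture"
print(lexical_overlap(t, h))  # 1.0: every hypothesis token appears in t
print(entails(t, h))          # True
```

Such baselines perform surprisingly well on the RTE datasets, which is one reason the challenge reports also track how far systems exceed them.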

Dominant Approach: Supervised Learning

  • Features model various aspects of similarity and mismatch

  • Classifier determines relative weights of information sources

  • Train on development set and auxiliary t-h corpora

Similarity features (lexical, n-gram, syntactic, semantic, global) are combined into a feature vector for the classifier.
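This supervised setup can be sketched end-to-end with a toy feature extractor and a perceptron standing in for the classifier. The features, training pairs, and weights below are all illustrative, not those of any participating system:

```python
def features(text, hypothesis):
    """Toy feature vector: unigram coverage, bigram coverage, length ratio, bias."""
    t, h = text.lower().split(), hypothesis.lower().split()
    uni = len(set(t) & set(h)) / len(set(h))
    t_bi, h_bi = set(zip(t, t[1:])), set(zip(h, h[1:]))
    bi = len(t_bi & h_bi) / len(h_bi) if h_bi else 0.0
    ratio = min(len(h) / len(t), 1.0)
    return [uni, bi, ratio, 1.0]  # last element is a bias term

def train_perceptron(pairs, labels, epochs=20):
    """Fit weights on (t, h) pairs labeled 1 (entails) / 0 (does not)."""
    w = [0.0] * 4
    for _ in range(epochs):
        for (t, h), y in zip(pairs, labels):
            x = features(t, h)
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0
            if pred != y:  # standard perceptron update on mistakes
                w = [wi + (y - pred) * xi for wi, xi in zip(w, x)]
    return w

# Toy training set (illustrative).
pairs = [
    ("yahoo bought overture last week", "yahoo bought overture"),
    ("dow gains 255 points", "dow climbs 255"),
    ("it rained in trento", "yahoo bought overture"),
    ("mary left early", "stock market hits a record high"),
]
labels = [1, 1, 0, 0]
w = train_perceptron(pairs, labels)
x = features("yahoo acquired overture", "yahoo bought overture")
print(1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0)  # 1
```

Real RTE systems swap in richer features (syntactic match, semantic roles, negation/modality flags) and a stronger learner, but the pipeline shape is the same.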

Parse-based Proof Systems

Example derivation: It rained when John and Mary left  ⇒  It rained when Mary left  ⇒  Mary left

(Bar-Haim et al., RTE-3)

Resources

  • WordNet, Extended WordNet, distributional similarity

    • Britain → UK

    • steal → take

  • DIRT (paraphrase rules)

    • X file a lawsuit against Y → X accuse Y (world knowledge)

    • X confirm Y → X approve Y (linguistic knowledge)

  • FrameNet, PropBank, VerbNet

    • For semantic role labeling

  • Entailment pairs corpora

    • Automatically acquired training

  • No dedicated resources for entailment yet

Accuracy Results (RTE-1)

Results (RTE-2)

Average: 60%

Median: 59%

Results (RTE-3)

Two systems above 70%

Most systems (65%) scored in the 60-70% range; at RTE-2 only 30% of systems did

Current Limitations

  • Simple methods perform quite well, but not best

  • System reports point to:

    • Lack of knowledge (syntactic transformation rules, paraphrases, lexical relations, etc.)

    • Lack of training data

  • It seems that systems that coped better with these issues performed best:

    • Hickl et al. - acquisition of large entailment corpora for training

    • Tatu et al. – large knowledge bases (linguistic and world knowledge)


  • High interest in the research community

    • Papers, conference sessions and areas, PhD theses, funded projects

    • Special issue - Journal of Natural Language Engineering

    • ACL-07 tutorial

  • Initial contribution to specific applications

    • QA – Harabagiu & Hickl, ACL-06; CLEF-06/07

    • RE – Romano et al., EACL-06

  • RTE-4 – by NIST, with CELCT

    • Within TAC, a new semantic evaluation conference (with QA and summarization, subsuming DUC)

New Potentials for Machine Learning

Classical Approach = Interpretation

Stipulated Meaning Representation (by scholar)

Language (by nature)

  • Logical forms, word senses, semantic roles, named entity types, … - scattered tasks

  • Feasible/suitable framework for applied semantics?

Textual Entailment = Text Mapping

Assumed Meaning (by humans)


Language (by nature)

General Case – Inference





Textual Entailment

  • Entailment mapping is the actual applied goal - and also a touchstone for understanding!

  • Interpretation becomes a possible means

Machine Learning Perspectives

  • Issues with interpretation approach:

    • Hard to agree on target representations

    • Costly to annotate semantic representations for training

    • Has it been a barrier?

  • Language-level entailment mapping refers to texts

    • Texts are semantic-theory neutral

    • Amenable to unsupervised/semi-supervised learning

  • It would be interesting to explore (many do)

    • language-based representations of meaning, inference knowledge, and ontology,

    • for which learning and inference methods may be easier to develop.

    • Artificial intelligence through natural language?

Major Learning Directions

  • Learning entailment knowledge (!!!)

    • Learning entailment relations between words/expressions

    • Integrating with manual resources and knowledge

  • Inference methods

    • Principled frameworks for probabilistic inference

      • Estimate likelihood of deriving hypothesis from text

      • Fusing information levels

    • More than bags of features

  • Relational learning relevant for both

  • How can we increase ML researchers’ involvement?

Learning Entailment Knowledge

  • Entailing “topical” terms from words/texts

    • E.g. medicine, law, cars, computer security, …

    • An unsupervised version of text categorization

  • Learning entailment graph for terms/expressions

    • Partial knowledge: statistical, lexical resources, Wikipedia, …

    • Estimate link likelihood in context

Meeting the knowledge challenge – by a coordinated effort?

  • A vast amount of “entailment rules” needed

  • Speculation: can we have a joint community effort for knowledge acquisition?

    • Uniform representations

    • Mostly automatic acquisition (millions of rules)

    • Human Genome Project analogy

  • Preliminary: RTE-3 Resources Pool at ACL Wiki (set up by Patrick Pantel)

Textual Entailment ≈ Human Reading Comprehension

  • From a children’s English learning book (Sela and Greenberg):

    Reference Text:“…The Bermuda Triangle lies in the Atlantic Ocean, off the coast of Florida. …”

    Hypothesis (True/False?): The Bermuda Triangle is near the United States


Where are we (from RTE-1)?

Cautious Optimism

  • Textual entailment provides a unified framework for applied semantics

    • Towards generic inference “engines” for applications

  • Potential for:

    • Scalable knowledge acquisition, boosted by (mostly unsupervised) learning

    • Learning-based inference methods

Thank you!

Summary: Textual Entailment as Goal

  • The essence of our proposal:

    • Base applied inference on entailment “engines” and KBs

    • Formulate various semantic problems as entailment tasks

  • Interpretations and “mapping” methods may compete/complement

  • Open question: which inferences

    • can be represented at language level?

    • require logical or specialized representation and inference? (temporal, spatial, mathematical, …)

Collecting QA Pairs

  • Motivation: a passage containing the answer slot filler should entail the corresponding answer statement.

    • E.g., for Who invented the telephone? and the answer Bell, the text should entail Bell invented the telephone

  • QA systems were given TREC and CLEF questions.

  • Hypothesis generated by “plugging” the system answer term into the affirmative form of the question

  • Texts correspond to the candidate answer passages
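The "plugging" step above can be sketched for the simplest question form. The pattern below handles only Who ...? questions and is an illustrative simplification, not the organizers' actual generation procedure over TREC/CLEF data:

```python
import re

def make_hypothesis(question, answer):
    """Plug the answer term into the affirmative form of a 'Who ...?' question.
    Only one question pattern is handled; real RTE data covered many forms."""
    m = re.match(r"^Who (.+)\?$", question)
    if not m:
        raise ValueError("unsupported question form")
    return f"{answer} {m.group(1)}"

print(make_hypothesis("Who invented the telephone?", "Bell"))
# Bell invented the telephone
print(make_hypothesis("Who bought Overture?", "Yahoo"))
# Yahoo bought Overture
```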

Collecting IE Pairs

  • Motivation: a sentence containing a target relation instance should entail an instantiated template of the relation

    • E.g.: X is located in Y

  • Pairs were generated in several ways

    • Outputs of IE systems:

      • for ACE-2004 and MUC-4 relations

    • Manually:

      • for ACE-2004 and MUC-4 relations

      • for additional relations in news domain

Collecting IR Pairs

  • Motivation: relevant documents should entail a given “propositional” query.

  • Hypotheses are propositional IR queries, adapted and simplified from TREC and CLEF

    • drug legalization benefits → drug legalization has benefits

  • Texts selected from documents retrieved by different search engines

Collecting SUM (MDS) Pairs

  • Motivation: identifying redundant statements (particularly in multi-document summaries)

  • Using web document clusters and system summary

  • Picking for hypotheses sentences having high lexical overlap with summary

  • In final pairs:

    • Texts are original sentences (usually from summary)

    • Hypotheses:

      • Positive pairs: simplify h until entailed by t

      • Negative pairs: simplify h similarly

  • In RTE-3: using Pyramid benchmark data
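The overlap-based candidate selection can be sketched as follows. Jaccard word overlap and the 0.5 threshold are illustrative assumptions, not the organizers' actual criteria:

```python
def overlap(a, b):
    """Jaccard word overlap between two sentences (a simplification)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def candidate_hypotheses(document_sentences, summary_sentences, threshold=0.5):
    """Keep document sentences with high lexical overlap with any summary sentence."""
    return [s for s in document_sentences
            if any(overlap(s, ref) >= threshold for ref in summary_sentences)]

doc = ["Dow gains 255 points on strong earnings",
       "Rain is expected in Trento tomorrow"]
summary = ["Dow gains 255 points"]
print(candidate_hypotheses(doc, summary))
# ['Dow gains 255 points on strong earnings']
```

The selected sentences are then paired with summary sentences and simplified as described above to produce positive and negative pairs.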