
A Phrase-Based Model of Alignment for Natural Language Inference

Bill MacCartney, Michel Galley, and Christopher D. Manning

Stanford University

26 October 2008

Outline: Introduction • The MANLI Aligner • Evaluation on MSR Data • Predicting RTE Answers • Conclusion

Natural language inference (NLI) (aka RTE)
  • Does premise P justify an inference to hypothesis H?
    • An informal notion of inference; variability of linguistic expression

P: Gazprom today confirmed a two-fold increase in its gas price for Georgia, beginning next Monday.

H: Gazprom will double Georgia’s gas bill.  →  yes

  • Like MT, NLI depends on a facility for alignment
    • I.e., linking corresponding words/phrases in two related sentences

Alignment example

[Figure: gold alignment between P (premise) and H (hypothesis) for the Gazprom example]

  • unaligned content: “deletions” from P
  • approximate match: price ~ bill
  • phrase alignment: two-fold increase ~ double

Approaches to NLI alignment
  • Alignment addressed variously by current NLI systems
  • In some approaches to NLI, alignments are implicit:
    • NLI via lexical overlap [Glickman et al. 05, Jijkoun & de Rijke 05]
    • NLI as proof search [Tatu & Moldovan 07, Bar-Haim et al. 07]
  • Other NLI systems make alignment step explicit:
    • Align first, then determine inferential validity [Marsi & Kramer 05, MacCartney et al. 06]
  • What about using an MT aligner?
    • Alignment is familiar in MT, with extensive literature [Brown et al. 93, Vogel et al. 96, Och & Ney 03, Marcu & Wong 02, DeNero et al. 06, Birch et al. 06, DeNero & Klein 08]
    • Can tools & techniques of MT alignment transfer to NLI?

NLI alignment vs. MT alignment

Doubtful — NLI alignment differs in several respects:

  • Monolingual: can exploit resources like WordNet
  • Asymmetric: P often longer & has content unrelated to H
  • Cannot assume semantic equivalence
    • NLI aligner must accommodate frequent unaligned content
  • Little training data available
    • MT aligners use unsupervised training on huge amounts of bitext
    • NLI aligners must rely on supervised training & much less data

Contributions of this paper

In this paper, we:

  • Undertake the first systematic study of alignment for NLI
    • Existing NLI aligners use idiosyncratic methods, are poorly documented, use proprietary data
  • Examine the relation between alignment in NLI and MT
    • How do existing MT aligners perform on NLI alignment task?
  • Propose a new model of alignment for NLI: MANLI
    • Outperforms existing MT & NLI aligners on NLI alignment task

The MANLI aligner

A model of alignment for NLI consisting of four components:

  1. Phrase-based representation
  2. Feature-based scoring function
  3. Decoding using simulated annealing
  4. Perceptron learning


Phrase-based alignment representation

Represent alignments by a sequence of phrase edits: EQ, SUB, DEL, INS

  EQ(Gazprom₁, Gazprom₁)
  INS(will₂)
  DEL(today₂)
  DEL(confirmed₃)
  DEL(a₄)
  SUB(two-fold₅ increase₆, double₃)
  DEL(in₇)
  DEL(its₈)
  …

  • One-to-one at phrase level (but many-to-many at token level)
  • Avoids arbitrary alignment choices; can use phrase-based resources
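
A minimal Python sketch of this edit-based representation (the class name and the 0-based token spans are illustrative assumptions, not the authors' code):

```python
from dataclasses import dataclass

@dataclass
class PhraseEdit:
    """One phrase edit linking a span of P to a span of H."""
    op: str                  # "EQ", "SUB", "DEL" (P side only), or "INS" (H side only)
    p_span: tuple[int, int]  # half-open token span in P; (i, i) means empty, for INS
    h_span: tuple[int, int]  # half-open token span in H; (j, j) means empty, for DEL

# Start of the example alignment above (0-based spans, unlike the
# 1-based subscripts on the slide):
alignment = [
    PhraseEdit("EQ",  (0, 1), (0, 1)),  # Gazprom ~ Gazprom
    PhraseEdit("INS", (0, 0), (1, 2)),  # will
    PhraseEdit("DEL", (1, 2), (2, 2)),  # today
    PhraseEdit("SUB", (4, 6), (2, 3)),  # two-fold increase ~ double
]
```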

A feature-based scoring function
  • Score each edit as a linear combination of features, then sum over edits: score(E) = Σ_{e∈E} w · Φ(e)
  • Edit type features: EQ, SUB, DEL, INS
  • Phrase features: phrase sizes, non-constituents
  • Lexical similarity feature: max over similarity scores
    • WordNet: synonymy, hyponymy, antonymy, Jiang-Conrath
    • Distributional similarity à la Dekang Lin
    • Various measures of string/lemma similarity
  • Contextual features: distortion, matching neighbors
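
A sketch of this scoring scheme, assuming featurize(edit) returns the feature vector (edit type, phrase, lexical-similarity, and contextual features) for one edit:

```python
import numpy as np

def score_alignment(edits, w, featurize):
    """score(E) = sum over edits e of w . phi(e): each edit is scored as a
    linear combination of its features, and the alignment score is the sum."""
    return sum(float(np.dot(w, featurize(e))) for e in edits)
```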
Decoding using simulated annealing

  1. Start with an initial alignment
  2. Generate successor alignments
  3. Score each successor
  4. Smooth/sharpen the resulting distribution: P(A) ← P(A)^(1/T)
  5. Sample the next alignment from this distribution
  6. Lower the temperature: T ← 0.9 · T
  7. Repeat from step 2 (… 100 times)
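
A runnable sketch of this loop (the initial temperature and the successor generator are assumptions; only the sharpening, sampling, and 0.9 cooling factor come from the slide):

```python
import math
import random

def decode(initial, successors, score, iters=100, temp=1.0):
    """Simulated-annealing decoding: score the successors of the current
    alignment, sharpen scores into a distribution ~ P(A)^(1/T), sample
    the next alignment, and cool the temperature."""
    alignment = initial
    for _ in range(iters):
        cands = successors(alignment)
        scores = [score(a) for a in cands]
        top = max(scores)
        # exp((s - top) / T) sharpens as T falls; subtracting top avoids overflow
        weights = [math.exp((s - top) / temp) for s in scores]
        alignment = random.choices(cands, weights=weights, k=1)[0]
        temp *= 0.9  # lower the temperature each iteration
    return alignment
```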


Perceptron learning of feature weights

We use a variant of averaged perceptron [Collins 2002]

Initialize weight vector w = 0, learning rate R₀ = 1

For training epoch i = 1 to 50:
  For each problem ⟨Pⱼ, Hⱼ⟩ with gold alignment Eⱼ:
    Set Êⱼ = ALIGN(Pⱼ, Hⱼ, w)
    Set w = w + Rᵢ (Φ(Eⱼ) − Φ(Êⱼ))
  Set w = w / ‖w‖₂ (L2 normalization)
  Set w⁽ⁱ⁾ = w (store weight vector for this epoch)
  Set Rᵢ = 0.8 Rᵢ₋₁ (reduce learning rate)

Throw away weight vectors from first 20% of epochs

Return average weight vector

Training runs require about 20 hours (on 800 RTE problems)
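
The same loop in Python, as a sketch (align is a decoder like the one above, phi sums edit feature vectors over an alignment, and dim is the assumed feature count):

```python
import numpy as np

def train(problems, align, phi, dim, epochs=50, rate=1.0):
    """Averaged perceptron as on the slide: update on the feature difference
    between the gold and predicted alignments, L2-normalize each epoch,
    decay the learning rate, then average the post-burn-in weight vectors."""
    w = np.zeros(dim)
    stored = []
    for _ in range(epochs):
        for p, h, gold in problems:
            guess = align(p, h, w)
            w = w + rate * (phi(gold) - phi(guess))  # perceptron update
        norm = np.linalg.norm(w)
        if norm > 0:
            w = w / norm               # L2 normalization
        stored.append(w.copy())        # store this epoch's weight vector
        rate *= 0.8                    # reduce learning rate
    keep = stored[len(stored) // 5:]   # throw away first 20% of epochs
    return np.mean(keep, axis=0)       # return average weight vector
```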


The MSR RTE2 alignment data
  • Previously, little supervised data
  • Now, MSR gold alignments for RTE2
    • [Brockett 2007]
    • dev & test sets, 800 problems each
  • Token-based, but many-to-many
    • allows implicit alignment of phrases
  • 3 independent annotators
    • 3 of 3 agreed on 70% of proposed links
    • 2 of 3 agreed on 99.7% of proposed links
    • merged using majority rule

Evaluation on MSR data
  • We evaluate several systems on MSR data
    • A simple baseline aligner
    • MT aligners: GIZA++ & Cross-EM
    • NLI aligners: Stanford RTE, MANLI
  • How well do they recover gold-standard alignments?
    • We report per-link precision, recall, and F1
    • We also report exact match rate for complete alignments
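
For concreteness, the per-link metrics can be computed as below, treating an alignment as a set of aligned token-index pairs (that framing and the function name are illustrative assumptions):

```python
def link_metrics(predicted, gold):
    """Per-link precision, recall, and F1 over sets of (p_index, h_index) pairs."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```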

Baseline: bag-of-words aligner

Match each H token to the most similar P token [cf. Glickman et al. 2005]

  • Surprisingly good recall, despite extreme simplicity
  • But very mediocre precision, F1, & exact match rate
  • Main problem: aligns every token in H
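
A minimal sketch of this baseline, assuming some token-similarity function sim(p_token, h_token):

```python
def bag_of_words_align(p_tokens, h_tokens, sim):
    """Link every H token to its most similar P token; since no H token
    is ever left unaligned, precision suffers."""
    return [(max(range(len(p_tokens)), key=lambda i: sim(p_tokens[i], h)), j)
            for j, h in enumerate(h_tokens)]
```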


MT aligners: GIZA++ & Cross-EM
  • Can we show that MT aligners aren’t suitable for NLI?
  • Run GIZA++ via Moses, with default parameters
    • Train on dev set, evaluate on dev & test sets
    • Asymmetric alignments in both directions
    • Then symmetrize using INTERSECTION heuristic
  • Initial results are very poor: 56% F1
    • Doesn’t even align equal words
  • Remedy: add lexicon of equal words as extra training data
  • Do similar experiments with Berkeley Cross-EM aligner
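
The INTERSECTION heuristic itself is simple; a sketch, assuming both asymmetric runs are expressed as sets of (p_index, h_index) links in the same orientation:

```python
def intersection(p_to_h_links, h_to_p_links):
    """Keep only the links proposed by both asymmetric alignment runs;
    this favors precision at the expense of recall."""
    return set(p_to_h_links) & set(h_to_p_links)
```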

Results: MT aligners

Similar F1, but GIZA++ wins on precision, Cross-EM on recall

  • Both do best with lexicon & INTERSECTION heuristic
    • Also tried UNION, GROW, GROW-DIAG, GROW-DIAG-FINAL, GROW-DIAG-FINAL-AND, and asymmetric alignments
    • All achieve better recall, but much worse precision & F1
  • Problem: too little data for unsupervised learning
    • Need to compensate by exploiting external lexical resources

The Stanford RTE aligner
  • Token-based alignments: map from H tokens to P tokens
    • Phrase alignments not directly representable
    • (But, named entities & collocations collapsed in pre-processing)
  • Exploits external lexical resources
    • WordNet, LSA, distributional similarity, string sim, …
  • Syntax-based features to promote aligning corresponding predicate-argument structures
  • Decoding & learning similar to MANLI

Results: Stanford RTE aligner
  • Better F1 than MT aligners — but recall lags precision
  • Stanford does poor job aligning function words
    • 13% of links in gold are prepositions & articles
    • Stanford misses 67% of these (MANLI only 10%)
  • Also, Stanford fails to align multi-word phrases: peace activists ~ protestors, hackers ~ non-authorized personnel

[Results table not shown; * includes (generous) correction for missed punctuation]


Results: MANLI aligner
  • MANLI outperforms all others on every measure
    • F1: 10.5% higher than GIZA++, 6.2% higher than Stanford
  • Good balance of precision & recall
  • Matched >20% exactly

MANLI results: discussion
  • Three factors contribute to success:
    • Lexical resources: jail ~ prison, prevent ~ stop, injured ~ wounded
    • Contextual features enable matching function words
    • Phrases: death penalty ~ capital punishment, abdicate ~ give up
  • But phrases help less than expected!
    • If we set max phrase size = 1, we lose just 0.2% in F1
  • Recall errors: room to improve
    • 40%: need better lexical resources: conservation ~ protecting, organization ~ agencies, bone fragility ~ osteoporosis
  • Precision errors harder to reduce
    • equal function words (49%), forms of be (21%), punctuation (7%)

Can aligners predict RTE answers?
  • We’ve been evaluating against gold-standard alignments
  • But alignment is just one component of an NLI system
  • Does a good alignment indicate a valid inference?
    • Not necessarily: negations, modals, non-factives & implicatives, …
    • But alignment score can be strongly predictive
    • And many NLI systems rely solely on alignment
  • Using alignment score to predict RTE answers:
    • Predict YES if score > threshold
    • Tune threshold on development data
    • Evaluate on test data
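
A sketch of this thresholding step, assuming the development data is a list of (alignment_score, gold_yes) pairs (the helper names are illustrative):

```python
def tune_threshold(dev):
    """Pick the threshold maximizing accuracy of 'yes iff score > t' on dev."""
    best_t, best_acc = 0.0, -1.0
    for t in sorted(score for score, _ in dev):
        acc = sum((score > t) == is_yes for score, is_yes in dev) / len(dev)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

def predict(score, threshold):
    return "yes" if score > threshold else "no"
```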

Results: predicting RTE answers
  • No NLI aligner rivals best complete RTE system
    • (Most) complete systems do a lot more than just alignment!
  • But, Stanford & MANLI beat average entry for RTE2
  • Many NLI systems could benefit from better alignments!

Conclusion
  • MT aligners not directly applicable to NLI
    • They rely on unsupervised learning from massive amounts of bitext
    • They assume semantic equivalence of P & H
  • MANLI succeeds by:
    • Exploiting (manually & automatically constructed) lexical resources
    • Accommodating frequent unaligned phrases
  • Phrase-based representation shows potential
    • But not yet proven: need better phrase-based lexical resources
Thanks! Questions?

Related work
  • Lots of past work on phrase-based MT
  • But most systems extract phrases from word-aligned data
    • Despite the assumption that many translations are non-compositional
  • Recent work jointly aligns & weights phrases [Marcu & Wong 02, DeNero et al. 06, Birch et al. 06, DeNero & Klein 08]
  • However, this is of limited applicability to the NLI task
    • MANLI uses phrases only when words aren’t appropriate
    • MT uses longer phrases to realize more dependencies (e.g. word order, agreement, subcategorization)
    • MT systems don’t model word insertions & deletions