Robust Local Textual Inference

Christopher Manning, Stanford University Bill MacCartney, Marie-Catherine de Marneffe (U. C. de Louvain), Teg Grenager, Daniel Cer (U. Colorado), Rajat Raina, Christopher Cox, Anna Rafferty, Roger Grosse, Josh Ainslie, Aria Haghighi, Jenny Finkel, Jeff Michels, Kristina Toutanova, and Andrew Y. Ng

The backdrop • There is a long, sometimes successful history of hand-written systems that understand more deeply • Using limited vocabulary and syntax in a literal way over limited domains. The TAUM-METEO system. • Recently, statistical/machine learning computational linguistics has provided tools for disambiguating natural language. • Parsers and annotators for any text from any domain • E.g., Named Entity Recognition: Person? Company? • In August 2004, Charles Schwab came to Arizona and opened a temporary location on 92nd Street

An external perspective on NLP • NLP has many successful tools with all sorts of uses • Part of speech tagging, named entity recognition, syntactic parsing, semantic role parsing, coreference determination • But they concentrate on structure, not meaning • By and large, non-NLP people want systems for more holistic semantic tasks • Text categorization • Information retrieval/web search • The state of the art in these areas is (slightly extended) bag-of-words models

The problem for NLP • The problem for NLP: Search engines actually work pretty well for people • But people would like to get more from text-processing applications • Information gathering is not just surface expression • Answers to many questions are “a bit below” the surface • This interpretation is the difference between data and knowledge • Challenge: a tool that works robustly on any text and understands a useful, greater amount of sentence meaning

Talk Outline • The NLP Challenge: Beyond the bag of words • The Pascal task of robust local textual inference • Deep logical approaches to NLP • Answering GRE analytic section logic puzzles [Lev, MacCartney, Levy, and Manning 2004] • Some first attempts • [Raina, Ng, and Manning 2005; Haghighi, Ng, and Manning 2005] • A second attempt • [MacCartney, de Marneffe, Grenager, Cer, and Manning 2006]

2. The PASCAL Textual Inference Task [Dagan, Glickman, and Magnini 2005] • The task: Can systems correctly perform ‘local textual inferences’ [individual inference steps]? • On the assumption that some piece of text (T) is true, does this imply the truth of some other hypothesis text (H)? • Sydney was the host city of the 2000 Olympics → • The Olympics have been held in Sydney TRUE • The format could be used for evaluating extended inferential chains or knowledge • But, in practice, fairly direct, local stuff

The PASCAL Textual Inference Task • The task focuses on the variability of semantic expression in language • The reverse task of disambiguation • The Dow Jones Industrial Average closed up 255 • The Dow climbed 255 points today • The Dow Jones Industrial Average gained over 250 points • An abstraction from any particular application, but directly applicable to applications

Natural Examples: Reading Comprehension • (CNN Student News) -- January 24, 2006 • Answer the following questions about today's featured news stories. Write your answers in the space provided. • 1. Where is the country of Somalia located? What ocean borders this country? * * • 2. Why did crew members from the USS Winston S. Churchill recently stop a small vessel off the coast of Somalia? What action did the crew of the Churchill take? * *

Real Uses • Semantic search • Find documents about lobbyists attempting to bribe U.S. senators • ~(lobbyist attempted to bribe U.S. senator) • Question answering: • Who acquired Overture? • Use to score candidate answers based on passage retrieval and named entity recognition • Customer email response • My Squeezebox regularly skips during music playback • ➜ Sender can hear music through Squeezebox • Relation extraction (database building) • Document summarization

Verification of terms [Dan Roth] • Non-disclosure Agreement WHEREAS Recipient is desirous of obtaining said confidential information for purposes of evaluation thereof and as a basis for further discussions with Owner regarding assistance with development of the confidential information for the benefit of Owner or for the mutual benefit of Owner and Recipient; THEREFORE, Recipient hereby agrees to receive the information in confidence and to treat it as confidential for all purposes. Recipient will not divulge or use in any manner any of said confidential information unless by written consent from Owner, and Recipient will use at least the same efforts it regularly employs for its own confidential information to avoid disclosure to others. Provided, however, that this obligation to treat information confidentially will not apply to any information already in Recipient's possession or to any information that is generally available to the public or becomes generally available through no act or influence of Recipient. Recipient will inform Owner of the public nature or Recipient's possession of the information without delay after Owner's disclosure thereof or will be stopped from asserting such as defense to remedy under this agreement. Each party acknowledges that all of the disclosing party's Confidential Information is owned solely by the disclosing party (or its licensors and/or other vendors) and that the unauthorized disclosure or use of such Confidential Information would cause irreparable harm and significant injury, the degree of which may be difficult to ascertain. Accordingly, each party agrees that the disclosing party will have the right to obtain an immediate injunction enjoining any breach of this Agreement, as well as the right to pursue any and all other rights and remedies available at law or in equity for such a breach.
Recipient will exercise its best efforts to conduct its evaluation within a reasonable time after Owner's disclosure and will provide Owner with its assessment thereof without delay. Recipient will return all information, including all copies thereof, to Owner upon request. This agreement shall remain in effect for ten years after the date of its execution, and it shall be construed under the laws of the State of Texas. • Conditions I care about: • All information discussed is freely shareable unless other party indicates in advance that it is confidential • TRUE? FALSE?

PASCAL RTE Examples Should be easy… T: iTunes software has seen strong sales in Europe. H: Strong sales for iTunes in Europe. TRUE T: The anti-terrorist court found two men guilty of murdering Shapour Bakhtiar and his secretary Sorush Katibeh, who were found with their throats cut in August 1991. H: Shapour Bakhtiar died in 1991. TRUE T: Like the United States, U.N. officials are also dismayed that Aristide killed a conference called by Prime Minister Robert Malval in Port-au-Prince in hopes of bringing all the feuding parties together. H: Aristide had Prime Minister Robert Malval murdered in Port-au-Prince. FALSE Note: not entailed! They’re allowed to try to trick you

Evaluation • The notion of inference is “as would typically be interpreted by people,” assuming common human understanding of language and common background knowledge. • Not entailment according to some linguistic theory • High agreement on this data: human accuracy is about 95% • Accuracy: you correctly say whether the hypothesis does or does not follow from the text • Confidence weighted score or average accuracy: • Rank all n pairs by system-supplied confidence • Use ranking to define a weighted average • Tests whether you know what you know
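The confidence weighted score can be sketched in a few lines. This is a minimal sketch, assuming the RTE1 definition in which pairs are ranked by decreasing confidence and the score averages, over every rank i, the fraction of correct judgments among the top i pairs. The function names and the toy judgments are hypothetical.

```python
def accuracy(judgments):
    """Fraction of (confidence, is_correct) judgments that are correct."""
    return sum(int(correct) for _, correct in judgments) / len(judgments)


def confidence_weighted_score(judgments):
    """Average, over ranks i, of the precision among the i most confident pairs."""
    ranked = sorted(judgments, key=lambda j: j[0], reverse=True)
    correct_so_far = 0
    total = 0.0
    for rank, (_, is_correct) in enumerate(ranked, start=1):
        correct_so_far += int(is_correct)
        total += correct_so_far / rank
    return total / len(ranked)


# Two hypothetical systems with the same accuracy but different calibration.
knows_what_it_knows = [(0.9, True), (0.8, True), (0.3, False), (0.2, False)]
overconfident = [(0.9, False), (0.8, False), (0.3, True), (0.2, True)]
```

Both toy systems have the same accuracy, but the first one scores higher because its confident judgments are the correct ones, which is the sense in which the measure "tests whether you know what you know".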

3. Logics mapping from NL to Reasoning: GRE/LSAT logic puzzles Six sculptures—C, D, E, F, G, and H—are to be exhibited in rooms 1, 2, and 3 of an art gallery. Sculptures C and E may not be exhibited in the same room. Sculptures D and G must be exhibited in the same room. If sculptures E and F are exhibited in the same room, no other sculpture may be exhibited in that room. At least one sculpture must be exhibited in each room, and no more than three sculptures may be exhibited in any room. 4. If sculpture D is exhibited in room 1 and sculptures E and F are exhibited in room 2, which of the following must be true? (A) Sculpture C must be exhibited in room 1. (B) Sculpture H must be exhibited in room 3. (C) Sculpture G must be exhibited in room 1. (D) Sculpture H must be exhibited in room 2. (E) Sculptures C and H must be exhibited in the same room.

The GRE logic puzzles domain • An English description of a constraint satisfaction problem, followed by questions about satisfying assignments • Answers cannot be found in the text by surface “question answering” methods (e.g., TREC QA) • Formalization and logical inference are necessary • Obtaining proper formalization requires: • Accurate syntactic parsing • Resolving semantic ambiguities (scope, co-reference) • Discourse analysis • Easy to test (“found test material”) • If the formalization is right, the reasoning is easy • No ambiguity or subjectivity about the correct answer
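To illustrate the claim that the reasoning is easy once the formalization is right, here is a minimal sketch that hand-formalizes the sculptures puzzle and answers question 4 by brute-force enumeration. It is not the system described in the talk: the constraints are written by hand, and the per-room readings of "at least one" and "no more than three" are assumed. All names are hypothetical.

```python
from itertools import product

SCULPTURES = "CDEFGH"
ROOMS = (1, 2, 3)


def satisfies_rules(a):
    """a maps each sculpture to its room; check the puzzle's stated rules."""
    if a["C"] == a["E"]:                      # C and E may not share a room
        return False
    if a["D"] != a["G"]:                      # D and G must share a room
        return False
    if a["E"] == a["F"]:                      # E and F together: nothing else there
        if any(a[s] == a["E"] for s in SCULPTURES if s not in "EF"):
            return False
    counts = [sum(1 for s in SCULPTURES if a[s] == r) for r in ROOMS]
    return all(1 <= c <= 3 for c in counts)   # per-room reading of both bounds


def models(extra=lambda a: True):
    """Yield every assignment that satisfies the rules and the extra condition."""
    for rooms in product(ROOMS, repeat=len(SCULPTURES)):
        a = dict(zip(SCULPTURES, rooms))
        if satisfies_rules(a) and extra(a):
            yield a


def question_4(a):
    return a["D"] == 1 and a["E"] == 2 and a["F"] == 2


CHOICES = {
    "A": lambda a: a["C"] == 1,
    "B": lambda a: a["H"] == 3,
    "C": lambda a: a["G"] == 1,
    "D": lambda a: a["H"] == 2,
    "E": lambda a: a["C"] == a["H"],
}

q4_models = list(models(question_4))
# A choice "must be true" if it holds in every model of the question's premise.
must_be_true = [k for k, holds in CHOICES.items() if all(holds(a) for a in q4_models)]
```

The hard part, as the slide says, is getting from the English text to something like `satisfies_rules` automatically; the search itself is trivial at this size.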

Challenges • For most puzzles, the puzzle type, the variables, and values for assignments are not obvious Mrs. Green wishes to renovate her cottage. She hires the services of a plumber, a carpenter, a painter, an electrician, and an interior decorator. The renovation is to be completed in a period of one working week i.e. Monday to Friday. Every worker will be taking one complete day to do his job. Mrs. Green will allow just one person to work per day. The painter will do his work only after the plumber and the carpenter have completed their jobs. The interior decorator has to complete his job before that of the electrician. • The type of this puzzle is a constrained linear ordering of things (here, contractors)

Scope Needs to be Resolved! At least one sculpture must be exhibited in each room. The same sculpture in each room? No more than three sculptures may be exhibited in any room. Reading 1: For every room, there are no more than three sculptures exhibited in it. Reading 2: Only three or less sculptures are exhibited (the rest are not shown). Reading 3: Only a certain set of three or less sculptures may be exhibited in any room (for the other sculptures there are restrictions in allowable rooms). • Some readings will be ruled out by being uninformative or by contradicting other statements • Otherwise we must be content with probability distributions over scope-resolved semantic forms

System overview [Lev, MacCartney, Levy, and Manning 2004] • [Diagram not recovered] • Modules: statistical parser, reference resolution, FOL translator, combinatorial semantics, plurality disambiguation, background theory, scope resolution, lexical semantics, reasoning module • Representations labeled in the diagram: English text, parse trees, SL formulas, FOL formulas, URs, DL formulas, correct answer

Semantic logic (SL) • Our goal is a translation to First Order Logic (FOL) • But FOL is ungainly, and far from NL • NL has events, plurals, modalities, complex quantifiers • Intermediate representation: semantic logic (SL) • Event and group variables • Modal operators: □ (necessarily) and ◊ (possibly) • Generalized quantifiers: Q(type, var, restrictor, body) • Our example becomes: • □ Q(∀, x1, room(x1), Q(≥1, x2, sculpture(x2), ∃e exhibit(e) ∧ patient(e, x2) ∧ in(e, x1))) • ¬◊ Q(∃, y, room(y), Q(>3, g, sculpture(g), ∃e exhibit(e) ∧ patient(e, g) ∧ in(e, y))) • More compact, more natural

Combinatorial semantics • Aim is to assign a semantic representation (roughly, a lambda expression) to each semantic unit • The hope is to use a small lexicon for semantically potent words and to synthesize semantics for open class words • every dog barks (S): ∀x.(dog(x) → bark(x)) • every dog (NP): λQ.∀x.(dog(x) → Q@x) • barks (VP): λx.bark(x) • every (Det): λP.λQ.∀x.(P@x → Q@x) • dog (Noun): λx.dog(x)
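The composition of "every dog barks" can be mimicked with ordinary Python lambdas evaluated over a small finite model. This is only an illustrative sketch: the domain and the sets of dogs, barkers, and meowers are invented, and function application stands in for the @ operator.

```python
# A tiny hypothetical model of the world.
DOMAIN = {"fido", "rex", "tom"}
DOGS = {"fido", "rex"}
BARKERS = {"fido", "rex"}
MEOWERS = {"tom"}

# Lexical entries: open-class words denote predicates over individuals.
dog = lambda x: x in DOGS          # Noun: λx.dog(x)
barks = lambda x: x in BARKERS     # VP:   λx.bark(x)
meows = lambda x: x in MEOWERS     # VP:   λx.meow(x)

# Det: λP.λQ.∀x.(P@x → Q@x), with the implication written as (not P) or Q.
every = lambda P: lambda Q: all((not P(x)) or Q(x) for x in DOMAIN)

every_dog = every(dog)                 # NP: λQ.∀x.(dog(x) → Q@x)
every_dog_barks = every_dog(barks)     # S:  ∀x.(dog(x) → bark(x))
every_dog_meows = every_dog(meows)
```

Only `every` needs a hand-written entry here; the open-class words all follow the same one-place-predicate template, which is the division of labor the slide hopes for.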

FOL Reasoning module • Complementary reasoning engines • A theorem prover (TP) is used to show that a set of formulas is inconsistent (proof by contradiction) • A model builder (MB) is used to show that a set of formulas is consistent (proof by example) • Idea: harness TP and MB in tandem • “Could” questions: examine each answer choice • MB says choice consistent ⇒ choice is correct • TP says choice inconsistent ⇒ choice is incorrect • “Must” questions: examine negation of each choice • MB says negation consistent ⇒ choice is incorrect • TP says negation inconsistent ⇒ choice is correct • Just a theorem prover is not enough • Can’t handle “could be true” questions properly • Despite finite domain, some proofs too deep to find
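The decision table for running the two engines in tandem can be written down directly. The sketch below assumes each engine simply reports success or failure (for example after a timeout) on the puzzle plus the formula under test; the names `Verdict` and `judge_choice` are hypothetical, and no real prover or model builder is called.

```python
from enum import Enum


class Verdict(Enum):
    CORRECT = "correct"
    INCORRECT = "incorrect"
    UNDECIDED = "undecided"


def judge_choice(question_type, model_found, contradiction_proved):
    """Combine model builder (MB) and theorem prover (TP) outcomes.

    For a "could" question the formula under test is the answer choice;
    for a "must" question it is the negation of the choice.
    model_found: MB found a model of puzzle + formula (consistent).
    contradiction_proved: TP refuted puzzle + formula (inconsistent).
    """
    if question_type not in ("could", "must"):
        raise ValueError("question_type must be 'could' or 'must'")
    if model_found and contradiction_proved:
        raise ValueError("a formula cannot be both consistent and inconsistent")
    if question_type == "could":
        if model_found:
            return Verdict.CORRECT
        if contradiction_proved:
            return Verdict.INCORRECT
    else:
        if model_found:
            return Verdict.INCORRECT
        if contradiction_proved:
            return Verdict.CORRECT
    # Neither engine finished, e.g. the proof was too deep to find.
    return Verdict.UNDECIDED
```

The `UNDECIDED` branch is why one engine alone is not enough: a theorem prover that fails to find a refutation says nothing about whether a "could be true" choice is actually satisfiable.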

How far did we get? • Worked to be able to handle the sculptures example (set of 6 questions) completely • Worked to be able to do a second problem • What about new puzzle texts? • Statistical parse is “correct” (fully usable) in about 60% of cases • Main problem is unhandled semantic phenomena, e.g., ‘different’, ‘except’, ‘a complete list’, VP ellipsis, …, … • Only 1 out of 21 questions actually doable start to end!

Pascal RTE comparison • Bos and Markert (2005) used a similar theorem prover/model builder combination as part of a two-strategy entry in RTE1. • Indeed, our logic puzzles approach was strongly influenced by Bos’ work • Coverage/correctness of approach • Found proof/contradiction for 30 pairs (3.75%–7.5%) • Of these, 23 were correct (77%) • Example error: • T: Microsoft was established in Italy in 1985. ⇏ • H: Microsoft was established in 1985.

How real world textual inference differs from logical semantics • Modals • Text: Researchers at the Harvard School of Public Health say that people who drink coffee may be doing a lot more than keeping themselves awake - this kind of consumption apparently also can help reduce the risk of diseases. • ⊨Hypothesis: Coffee drinking has health benefits. (RTE1 ID: 19) • May is a discourse hedge, not a possible worlds modal • Reported views/speech • Text: According to the Encyclopedia Britannica, Indonesia is the largest archipelagic nation in the world, consisting of 13,670 islands. • ⊨ Hypothesis: 13,670 islands make up Indonesia. (RTE1 ID: 605) • Source for information, not to suggest truth unknown

Speaker meaning • The Pascal RTE task can be taken as an applied test of human notions of speaker meaning • It clearly goes beyond the literal meaning of the text • Recanati (2004: 19) proposes regarding what is said as “what a normal interpreter would understand as being said, in the context at hand.” • Pascal RTE could be viewed as operationalizing such a criterion.

4. Tackling robust textual inference: Weighted Abductive inference • Idea: [Raina, Ng, and Manning 2005] • Represent text and hypothesis as logical formulae. • A hypothesis can be inferred from the text if and only if the hypothesis logic formula can be proved from the text logical formula (at some cost). • Toy example: Prove? Weighted abduction: Allow assumptions at various “costs” “released(p, q, r) + $2 => freed(s, r)” (Hobbs et al., 1993)

Representation • Example: Bill’s mother walked to the grocery store • Dependency graph: subj(walked, mother), poss(mother, Bill), to(walked, store), nn(store, grocery) • Node annotations: walked: VBD; Bill: PERSON; store: ARGM-LOC • Logical formula: mother(A) Bill(B) poss(B, A) grocery(C) store(C) walked(E, A, C) • Can and do make this representation richer • “walked” is a verb (VBD) • “Bill” is a PERSON (named entity) • Add semantic roles: “store” is the location/destination of “walked” (ARGM-LOC) • …

Linguistic preprocessing • High performance Named Entity Recognizers [Finkel et al. 2005] • Canonicalization of quantity, date, and money expressions • Normalized dates and relational expressions of amount “> 200”: T: Kessler's team conducted 60,643 face-to-face interviews with adults in 14 countries. H: Kessler's team interviewed more than 60,000 adults in 14 countries. • Statistical parser • Update data a little for 2005: Al Qaeda = [Aa]l[ -]Qa’?[ie]da • Collocations if appearing in WordNet (Bill hung_up the phone) • Semantic Role identification: Propbank roles [Toutanova et al. 2005] • Coreference T: Since its formation in 1948, Israel … H: Israel was established in 1948. • Heuristics to find event nouns (the murder of police commander) • Hand-built: acronyms, country and nationality, factive verbs
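The spelling-variant pattern on this slide can be used as is with Python's `re` module. The sketch below shows one way such a pattern might feed a canonicalization step; the `canonicalize` helper and the choice of "Al Qaeda" as the canonical form are assumptions made for illustration.

```python
import re

# The pattern from the slide; the optional character is a typographic apostrophe.
AL_QAEDA = re.compile(r"[Aa]l[ -]Qa’?[ie]da")


def canonicalize(text):
    """Rewrite every spelling variant matched by the pattern to one form."""
    return AL_QAEDA.sub("Al Qaeda", text)


variants = ["Al Qaeda", "al-Qaida", "Al Qa’ida", "al Qaeda"]
normalized = canonicalize("Officials blamed al-Qaida; Al Qa’ida denied it.")
```

After this step a later matcher only has to compare one surface form, rather than treating each spelling as a different entity.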

How can we model abductive assumption costs in proof? • Consider assumptions that unify pairs of terms. • Need to assign cost C(A) to all assumptions A of the form: P(p1, p2, …, pm) unified with Q(q1, q2, …, qn) • Possible considerations: (each of these gives a feature function!)

Abductive assumptions • Compute features f(A) = (f1(A), f2(A), …, fD(A)) of A. • Given feature weights w = (w1, w2, …, wD), define the cost as the weighted sum: C(A) = w1 f1(A) + w2 f2(A) + … + wD fD(A) • Each such assumption provides a potential proof step. • Can find a minimum cost complete proof by uniform cost search. • Output TRUE iff this proof has cost < a threshold wD+1. • Weak proof theory!!

Can we learn the assumption costs? • The abductive cost depends on w. • Intuition: Given a data set, find assumptions that are used in the proofs for TRUE examples, and lower their costs. • The minimum cost proof Pmin consists of a sequence of assumptions A1, A2, …, AN. • Construct a feature vector for the proof Pmin: f(Pmin) = f(A1) + f(A2) + … + f(AN) • If Pmin is given, the final cost for an example is linear in w. • However, the overall feature vector is computed by abductive theorem proving, which uses w internally! • Solve by an iterative procedure guaranteed to converge to a local maximum of the (nonconvex) likelihood function.
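The sense in which the cost of a given proof is linear in w can be sketched as follows. The sketch assumes that an assumption's cost is the weighted sum of its features and that a proof's feature vector is the sum of its assumptions' feature vectors. The three feature names, the weights, and the threshold are invented for illustration, and no proof search is performed.

```python
def assumption_cost(w, f):
    """Cost of one assumption: weighted sum of its features (assumed linear)."""
    return sum(wi * fi for wi, fi in zip(w, f))


def proof_features(assumption_feature_vectors):
    """Feature vector of a proof: component-wise sum over its assumptions."""
    return [sum(column) for column in zip(*assumption_feature_vectors)]


def proof_cost(w, assumption_feature_vectors):
    """Total cost of a proof: sum of the costs of its assumptions."""
    return sum(assumption_cost(w, f) for f in assumption_feature_vectors)


def entails(w, threshold, assumption_feature_vectors):
    """Output TRUE iff the (given) minimum cost proof is cheaper than the threshold."""
    return proof_cost(w, assumption_feature_vectors) < threshold


# Hypothetical features: [wordnet_synonym, lsa_similarity_only, argument_mismatch]
w = [0.5, 2.0, 3.0]
threshold = 3.0
cheap_proof = [[1, 0, 0], [0, 1, 0]]     # one synonym step, one vague LSA step
costly_proof = [[0, 1, 0], [0, 0, 1]]    # one LSA step, one argument mismatch
```

Because `proof_cost(w, proof)` equals `assumption_cost(w, proof_features(proof))`, a fixed proof behaves like a single example for a linear classifier. The circularity noted on the slide is that which proof is cheapest itself depends on w.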

Example features • Zero cost to match same item with same arguments • Low cost to unify things listed in WordNet as synonyms • Higher cost to match something with vague LSA similarity • Higher cost if arguments of verb mismatch • Antonyms/Negation: High cost for unifying verbs, if they are antonyms or one is negated and the other not. • T: Stocks fell. H: Stocks rose. FALSE • T: Clinton’s book was not a hit H: Clinton’s book was a hit. FALSE • Non-factive verbs: High cost for unifying verbs, if only one is modified by a non-factive verb. • T: John was charged for doing X. H: John did X. FALSE

Results • Evaluate on PASCAL RTE1 dataset. • Development set of 567 examples. • Test set of 800 examples. • Divided into 7 “tasks” (motivating applications) • Balanced number of true/false examples. • Output: TRUE/FALSE, confidence value. • Empirically found to be tough. • Baselines: TF, TFIDF • Standard information retrieval algorithms. • Ignore natural language syntax.

RTE1 Results [Raina et al. 2005] • Difficult! Best other results: Accuracy=58.6%, CWS=0.617

Partial coverage accuracy results • Both know something, but task-specific optimization is better! • [Plot not recovered; legend: ByTask, General, Random]

5. New System Architecture [MacCartney, Grenager, de Marneffe, Cer, & Manning, HLT-NAACL 2006] • [Diagram: An inference ➔ Linguistic Preprocessing ➔ Aligner ➔ Inferer ➔ Answer: ℝ ➔ {yes, no}]

Why the old approach was broken! • P: DeLay bought Enron stock and Clinton sold Enron stock • H: DeLay sold Enron stock Probably, yes No Yes

Why we need sloppy matching • Passage: Today's best estimate of giant panda numbers in the wild is about 1,100 individuals living in up to 32 separate populations mostly in China's Sichuan Province, but also in Shaanxi and Gansu provinces. • Hypothesis 1: There are 32 pandas in the wild in China. (FALSE) • Hypothesis 2: There are about 1,100 pandas in the wild in China. (TRUE) • We’d like to get this right, but we just don’t have the technology to fully understand from best estimate of giant panda numbers in the wild is about 1,100 to there are about 1,100 pandas in the wild

A solution: Align, then evaluate • P: DeLay bought Enron stock and Clinton sold Enron stock • H: DeLay sold Enron stock

Things we aim to fix [MacCartney, Grenager, de Marneffe, Cer, & Manning, HLT-NAACL 2006] • Confounding of alignment and entailment • Assumption on monotonicity • Matching/embedding methods assume upward monotonicity • Sue saw Les Miserables in London ➜ • Sue saw Les Miserables • But • Fedex began business in Zimbabwe in 2003 ⊭ • Fedex began business in 2003 • Assumption/requirement of locality

Whether an alignment is good depends on non-local factors #1 P: Some students came to school by car. Q: Did any students come to school? A: Yes #2 P: No students came to school by car. Q: Did any students come to school? A: Don’t know Context of monotonicity: Whether it is okay to have “by car” as “extra” material in the hypothesis depends on subject quantifier #3 P: It is not the case that Bin Laden was seen in Tora Bora. Q: Was Bin Laden seen in Tora Bora? A: no It’s difficult to see non-factive context when aligning seen → seen

Representation/alignment example T: Mitsubishi Motors Corp.’s new vehicle sales in the US fell 46 percent in June. H: Mitsubishi sales rose 46 percent. Answer: not entailed • Alignment from hypothesis to text: [diagram not recovered] • Features: • Aligned antonyms in pos/pos context (–) • Structure: main predicate good match (+) • Numeric quantity match (+) • Date: text date deleted in hypothesis (–) • Alignment: good (+) • Inference score: -5.42 ➜ FALSE

Modal Inferer • Identify aligned “roots” • Determine modality of each root • Using linguistic features e.g. “can”, “perhaps”, “might” ➜ POSSIBLE • Six canonical modalities POSSIBLE, NOT_POSSIBLE, ACTUAL, NOT_ACTUAL, NECESSARY, NOT_NECESSARY • Look up judgment for modality pair (POSSIBLE, ACTUAL) ➜ dontknow (NECESSARY, NOT_ACTUAL) ➜ no (ACTUAL, POSSIBLE) ➜ yes P: “The Scud C has a range of 500 kilometers and is manufactured in Syria with know-how from North Korea.” H: “A Scud C can fly 500 kilometers.” (ACTUAL, POSSIBLE) ➜ yes
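The lookup step of the modal inferer can be sketched as a small table keyed by (text modality, hypothesis modality). Only the three pairs and three cue words shown on the slide are filled in; defaulting to ACTUAL when no cue is present and returning `None` for pairs the slide does not list are assumptions of this sketch.

```python
# Cue words listed on the slide; a real system would use more linguistic features.
MODAL_CUES = {"can": "POSSIBLE", "perhaps": "POSSIBLE", "might": "POSSIBLE"}

# Judgments for (text modality, hypothesis modality); only the slide's examples.
MODALITY_JUDGMENT = {
    ("POSSIBLE", "ACTUAL"): "dontknow",
    ("NECESSARY", "NOT_ACTUAL"): "no",
    ("ACTUAL", "POSSIBLE"): "yes",
}


def modality(sentence):
    """Return the modality of a sentence from its cue words (assumed default: ACTUAL)."""
    for token in sentence.lower().split():
        token = token.strip(".,;:!?\"“”")
        if token in MODAL_CUES:
            return MODAL_CUES[token]
    return "ACTUAL"


def modal_judgment(text, hypothesis):
    """Look up the judgment for the pair, or None if the table has no entry."""
    return MODALITY_JUDGMENT.get((modality(text), modality(hypothesis)))


P = ("The Scud C has a range of 500 kilometers and is manufactured "
     "in Syria with know-how from North Korea.")
H = "A Scud C can fly 500 kilometers."
scud_judgment = modal_judgment(P, H)
```

With six canonical modalities the full table has 36 cells, so it is small enough to fill in by hand, which is presumably why a lookup suffices for this step.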

Factives & other implicatives T: Libya has tried, with limited success, to develop its own indigenous missile, and to extend the range of its aging SCUD force for many years under the Al Fatah and other missile programs. H: Libya has developed its own domestic missile program. Answer: not entailed. [Tried to X does not entail X.] • Evaluate governing verbs for implicativity • Unknown: say, tell, suspect, try, … • Fact: know, wonderful, … • True: manage to, … • False: doubtful, misbelieve, … • Need to check for negative context

Numeric Mismatches • Check alignment of number, date, money nodes T: The Pew Internet Life survey interviewed people in 26 countries. H: The Pew Internet Life study interviewed people in more than 20 countries T: BioPort Corp. of Lansing, Michigan is the sole U.S. manufacturer of an anthrax vaccine. H: There are three U.S. manufacturers of anthrax vaccine. • three is aligned to NO_WORD here [sole ➜ 1 ?]
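A numeric compatibility check of this kind might look like the sketch below. It assumes the text quantity is an exact number and the hypothesis quantity is a number with an optional comparative phrase; the list of phrases and the function names are hypothetical, and mapping words such as "sole" or "three" to numbers is left out.

```python
import operator
import re

# Hypothetical comparative phrases and the relation each one asserts.
COMPARATORS = {
    "more than": operator.gt,
    "over": operator.gt,
    "at least": operator.ge,
    "less than": operator.lt,
    "fewer than": operator.lt,
    "at most": operator.le,
}

QUANTITY = re.compile(
    r"\s*(?:(more than|over|at least|less than|fewer than|at most)\s+)?"
    r"(\d[\d,]*(?:\.\d+)?)\s*"
)


def parse_quantity(expression):
    """Return (relation, value); a bare number means equality."""
    match = QUANTITY.fullmatch(expression)
    if match is None:
        raise ValueError(f"cannot parse quantity: {expression!r}")
    relation = COMPARATORS.get(match.group(1), operator.eq)
    value = float(match.group(2).replace(",", ""))
    return relation, value


def quantities_compatible(text_expression, hypothesis_expression):
    """True if the exact text number satisfies the hypothesis's quantity claim."""
    _, text_value = parse_quantity(text_expression)   # text assumed exact
    relation, hypothesis_value = parse_quantity(hypothesis_expression)
    return relation(text_value, hypothesis_value)
```

For the examples on the slides, `quantities_compatible("26", "more than 20")` accepts the Pew pair, and once "sole" is mapped to 1 and "three" to 3 by some other component, `quantities_compatible("1", "3")` rejects the BioPort pair.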

Restrictive adjuncts • We can check whether adding/dropping restrictive adjuncts is licensed relative to upward and downward entailing contexts • In all, Zerich bought $422 million worth of oil from Iraq, according to the Volcker committee • ⊭ Zerich bought oil from Iraq during the embargo • Zerich didn’t buy any oil from Iraq, according to the Volcker committee • ⊨ Zerich didn’t buy oil from Iraq during the embargo
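The licensing check can be sketched as a two-by-two rule: dropping a restrictive adjunct is licensed in an upward entailing context, and adding one is licensed in a downward entailing context. The function below encodes only that rule. The quantifier-to-context map is a hypothetical stand-in for real polarity detection, and the rule deliberately ignores cases such as "Fedex began business in Zimbabwe in 2003", where dropping an adjunct fails even though the context looks upward entailing.

```python
# Hypothetical, tiny map from a subject quantifier or negation marker to context.
CONTEXT_OF = {"some": "upward", "no": "downward", "not": "downward"}


def adjunct_edit_licensed(edit, context):
    """edit: what the hypothesis does to a restrictive adjunct of the text.

    'drop' is licensed in upward entailing contexts,
    'add' is licensed in downward entailing contexts.
    """
    if edit not in ("drop", "add"):
        raise ValueError("edit must be 'drop' or 'add'")
    if context not in ("upward", "downward"):
        raise ValueError("context must be 'upward' or 'downward'")
    return (edit == "drop") == (context == "upward")


# Zerich bought oil ... => adding "during the embargo" in a positive context
bought_then_add = adjunct_edit_licensed("add", "upward")
# Zerich didn't buy any oil ... => adding "during the embargo" under negation
didnt_buy_then_add = adjunct_edit_licensed("add", CONTEXT_OF["not"])
# No students came to school by car => dropping "by car" under "no"
no_students_then_drop = adjunct_edit_licensed("drop", CONTEXT_OF["no"])
# Some students came to school by car => dropping "by car" under "some"
some_students_then_drop = adjunct_edit_licensed("drop", CONTEXT_OF["some"])
```

An unlicensed edit does not prove non-entailment on its own; in the students example the slide's answer is "don't know", so a feature like this is better treated as evidence for the inferer than as a hard decision.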

What do we have? • Not “full, deep” semantics • But it still isn’t possible to do logical inference for open domain robust textual inference (with “real” data) • We do inference-pattern matching • On semantic dependency graphs … not surface patterns • Calculate rich semantic features • Adjunct_deletion_licensed_relative_to_universal • Related to the notion of “natural logic”

Natural logic • A logic whose vehicle of inference is natural language (syntactic structures) • No translation into conventional logical notation • Aristotle’s syllogisms ➜ Leibniz (who coined term) ➜ Lakoff ➜ van Benthem ➜ Sánchez Valencia • Natural logic lets us sidestep having to fully translate sentences into an accurate semantic representation • Exercise: “accurately” translate into FOL: According to Ruiz, police may have been reluctant to enter the building before they were convinced that most of the weapons had been found. 

Our RTE2 Results

Most useful features Positive • Structural match • Good alignment score • Modal: yes • Polarity: text and hypothesis both negative polarity Negative • Date inserted/mismatched • Structure: clear mismatch • Quantifier mismatch • Bad alignment score • Different polarity • Modal: no/don’t know

Things that it’s hard to do • Non-entailment is easier than entailment • Good at finding knock-out features • Hard to be certain that we’ve considered everything • Dealing with dropping/adding modifiers vs. upward/downward entailing contexts is hard • Need to know which are restrictive/not/discourse items • Maurice was subsequently killed in Angola. • Multiword “lexical” semantics/world knowledge • We’re pretty good at synonyms, hyponyms, antonyms • But we can’t resolve a lot of multi-word equivalences • T: David McCool took the money and decided to start Muzzy Lane in 2002 • H: David McCool is the founder of Muzzy Lane
