Constrained Conditional Models Learning and Inference for Natural Language Understanding

Constrained Conditional Models Learning and Inference for Natural Language Understanding
Dan Roth, Department of Computer Science, University of Illinois at Urbana-Champaign
January 2010, Saarland University, Germany.
• With thanks to:
• Collaborators: Ming-Wei Chang, Dan Goldwasser, Vasin Punyakanok, Lev Ratinov, Nick Rizzolo, Mark Sammons, Ivan Titov, Scott Yih, Dav Zimak
• Funding: ARDA, under the AQUAINT program
• NSF: ITR IIS-0085836, ITR IIS-0428472, ITR IIS-0085980, SoD-HCER-0613885
• A DOI grant under the Reflex program; DHS; DARPA-Bootstrap Learning Program
• DASH Optimization (Xpress-MP)

Nice to Meet You

Learning and Inference • Global decisions in which several local decisions play a role but there are mutual dependencies on their outcome. • E.g. Structured Output Problems – multiple dependent output variables • (Learned) models/classifiers for different sub-problems • In some cases, not all local models can be learned simultaneously • Key examples in NLP are Textual Entailment and QA • In these cases, constraints may appear only at evaluation time • Incorporate models’ information, along with prior knowledge/constraints, in making coherent decisions • decisions that respect the local models as well as domain & context specific knowledge/constraints.

A process that maintains and updates a collection of propositions about the state of affairs. Comprehension (ENGLAND, June, 1989) - Christopher Robin is alive and well. He lives in England. He is the same person that you read about in the book, Winnie the Pooh. As a boy, Chris lived in a pretty home called Cotchfield Farm. When Chris was three years old, his father wrote a poem about him. The poem was printed in a magazine for others to read. Mr. Robin then wrote a book. He made up a fairy tale land where Chris lived. His friends were animals. There was a bear called Winnie the Pooh. There was also an owl and a young pig, called a piglet. All the animals were stuffed toys that Chris owned. Mr. Robin made them come to life with his words. The places in the story were all near Cotchfield Farm. Winnie the Pooh was written in 1925. Children still love to read about Christopher Robin and his animal friends. Most people don't know he is a real person who is grown now. He has written two books of his own. They tell what it is like to be famous. 1. Christopher Robin was born in England. 2. Winnie the Pooh is a title of a book. 3. Christopher Robin’s dad was a magician. 4. Christopher Robin must be at least 65 now. This is an Inference Problem

Constrained Conditional Models (CCMs) Informally: global decisions with learned models, in the presence of constraints. Issues to attend to: • While we formulate the problem as an ILP problem, inference can be done in multiple ways • Search; sampling; dynamic programming; SAT; ILP • The focus is on joint global inference • Learning may or may not be joint. • Decomposing models is often beneficial • Why constraints? • An effective way to inject expressive prior knowledge into models. • We propose mechanisms for injecting knowledge and using it to • improve decision making • guide learning (e.g., semi-supervised learning) • simplify the models we need to learn • Study learning of models that can effectively support this. • Has been shown useful in the context of many NLP problems • SRL; Summarization; Co-reference; Information Extraction; Transliteration [Roth&Yih04,07; Punyakanok et al. 05,08; Chang et al. 07,08; Clarke&Lapata06,07; Denis&Baldridge07; Goldwasser&Roth’08; Martins,Smith&Xing’09] [See tutorial on my web page and ILPNLP workshop]

Outline • Constrained Conditional Models • Motivation • Examples • Training Paradigms: Investigate ways for training models and combining constraints • Joint Learning and Inference vs. decoupling Learning & Inference • Training with Hard and Soft Constraints • Guiding Semi-Supervised Learning with Constraints • Training with latent structure • Examples • Semantic Parsing • Information Extraction • Pipeline processes • Transliteration

Pipeline
[Figure: Raw Data feeding a pipeline of stages: POS Tagging, Phrases, Semantic Entities, Relations, Parsing, WSD, Semantic Role Labeling]
• Most problems are not single classification problems
• Conceptually, pipelining is a crude approximation
• Interactions occur across levels, and downstream decisions often interact with previous decisions.
• Leads to propagation of errors
• Occasionally, later-stage problems are easier but cannot correct earlier errors.
• But, there are good reasons to use pipelines
• Putting everything in one basket may not be right
• How about choosing some stages and thinking about them jointly?

Inference with General Constraint Structure [Roth&Yih’04]: Recognizing Entities and Relations
Dole’s wife, Elizabeth, is a native of N.C. (entities E1, E2, E3; relations R12, R23)
Improvement over no inference: 2-5%
$x^* = \arg\max_x \sum c(x=v)\,[x=v] = \arg\max_x\; c_{\{E_1=\text{per}\}} \cdot x_{\{E_1=\text{per}\}} + c_{\{E_1=\text{loc}\}} \cdot x_{\{E_1=\text{loc}\}} + \dots + c_{\{R_{12}=\text{spouse-of}\}} \cdot x_{\{R_{12}=\text{spouse-of}\}} + \dots + c_{\{R_{12}=\emptyset\}} \cdot x_{\{R_{12}=\emptyset\}}$
subject to constraints. The structure is non-sequential.
• Key components:
• Write down an objective function (linear).
• Write down constraints as linear inequalities.
Some questions: How to guide the global inference? Why not learn jointly? Models could be learned separately; constraints may come up only at decision time.
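The entity and relation objective above can be made concrete with a small sketch. It is not the ILP formulation from [Roth&Yih’04]; it simply enumerates every assignment and keeps the best legal one, which gives the same answer on a problem this small. The scores, the label sets, the mapping of E1, E2, E3 to Dole, Elizabeth and N.C., and the two type constraints are illustrative assumptions.

```python
from itertools import product

# Hypothetical local classifier scores c{variable = label}; not from the slides.
SCORES = {
    "E1": {"per": 0.85, "loc": 0.10, "other": 0.05},   # assumed: Dole
    "E2": {"per": 0.60, "loc": 0.30, "other": 0.10},   # assumed: Elizabeth
    "E3": {"per": 0.50, "loc": 0.45, "other": 0.05},   # assumed: N.C.
    "R12": {"spouse-of": 0.45, "born-in": 0.50, "none": 0.05},
    "R23": {"spouse-of": 0.05, "born-in": 0.85, "none": 0.10},
}
# Which entities each relation variable connects.
RELATION_ARGS = {"R12": ("E1", "E2"), "R23": ("E2", "E3")}


def satisfies(assignment):
    """Assumed type constraints: spouse-of needs (per, per), born-in needs (per, loc)."""
    for rel, (arg1, arg2) in RELATION_ARGS.items():
        types = (assignment[arg1], assignment[arg2])
        if assignment[rel] == "spouse-of" and types != ("per", "per"):
            return False
        if assignment[rel] == "born-in" and types != ("per", "loc"):
            return False
    return True


def local_argmax(scores):
    """Each variable picks its own best label, ignoring all constraints."""
    return {name: max(dist, key=dist.get) for name, dist in scores.items()}


def global_argmax(scores):
    """Maximize the summed score over all assignments that satisfy the constraints."""
    names = list(scores)
    best, best_total = None, float("-inf")
    for labels in product(*(scores[n] for n in names)):
        assignment = dict(zip(names, labels))
        if not satisfies(assignment):
            continue
        total = sum(scores[n][assignment[n]] for n in names)
        if total > best_total:
            best, best_total = assignment, total
    return best


local = local_argmax(SCORES)
joint = global_argmax(SCORES)
print("local:", local, "legal:", satisfies(local))
print("joint:", joint)
```

With these assumed scores the independent predictions are mutually inconsistent, while joint inference changes two low-margin decisions to obtain a coherent labeling. An ILP solver would replace the enumeration on realistic problem sizes.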

Problem Setting
[Figure: output variables y1, ..., y8 over the observations, with constraints C(y1,y4) and C(y2,y3,y6,y7,y8)]
• Random variables Y
• Conditional distributions P (learned by models/classifiers)
• Constraints C: any Boolean function defined over partial assignments (possibly with weights W)
• Goal: find the “best” assignment
• The assignment that achieves the highest global performance.
• This is an Integer Programming problem
$Y^* = \arg\max_Y P \cdot Y$ subject to constraints C (+ WC)

Formal Model
$y^* = \arg\max_y \sum_i w_i \phi_i(x, y) - \sum_i \rho_i\, d_{C_i}(x, y)$, subject to constraints
• First term: $w$ is the weight vector for the “local” models, which are a collection of classifiers, log-linear models (HMM, CRF), or a combination
• Second term: the (soft) constraints component, where $\rho_i$ is the penalty for violating constraint $C_i$ and $d_{C_i}(x, y)$ measures how far y is from a “legal” assignment
How to solve? This is an Integer Linear Program. Solving with ILP packages gives an exact solution; search techniques are also possible.
How to train? How to decompose the global objective function? Should we incorporate constraints in the learning process?

Example: Semantic Role Labeling
Who did what to whom, when, where, why,…
I left my pearls to my daughter in my will .
[I]A0 left [my pearls]A1 [to my daughter]A2 [in my will]AM-LOC .
• A0 Leaver • A1 Things left • A2 Benefactor • AM-LOC Location
Constraints: • No overlapping arguments • If A2 is present, A1 must also be present.
• Special case (structured output problem): here, all the data is available at one time; in general, classifiers might be learned from different sources, at different times, in different contexts.
• Implications for training paradigms

Semantic Role Labeling (2/2) • PropBank [Palmer et al. 05] provides a large human-annotated corpus of semantic verb-argument relations. • It adds a layer of generic semantic labels to Penn Treebank II. • (Almost) all the labels are on the constituents of the parse trees. • Core arguments: A0-A5 and AA • different semantics for each verb • specified in the PropBank Frame files • 13 types of adjuncts labeled as AM-arg • where arg specifies the adjunct type

Algorithmic Approach
Example sentence: I left my nice pearls to her
[Figure: bracketings of the sentence showing the candidate arguments at each stage]
• Identify argument candidates (identify the vocabulary)
• Pruning [Xue&Palmer, EMNLP’04]
• Argument Identifier
• Binary classification (SNoW)
• Classify argument candidates
• Argument Classifier
• Multi-class classification (SNoW)
• Inference (over the old and new vocabulary)
• Use the estimated probability distribution given by the argument classifier
• Use structural and linguistic constraints
• Infer the optimal global output

Semantic Role Labeling (SRL) I left my pearls to my daughter in my will .

Semantic Role Labeling (SRL) I left my pearls to my daughter in my will . One inference problem for each verb predicate.

Integer Linear Programming Inference
• For each argument $a_i$, set up a Boolean variable $a_{i,t}$ indicating whether $a_i$ is classified as $t$
• The goal is to maximize $\sum_{i,t} \text{score}(a_i = t)\, a_{i,t}$ subject to the (linear) constraints
• If $\text{score}(a_i = t) = P(a_i = t)$, the objective is to find the assignment that maximizes the expected number of arguments that are correct and satisfies the constraints.
The Constrained Conditional Model is completely decomposed during training.
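A minimal sketch of this 0/1 formulation follows, written without an ILP package: it enumerates all settings of the Boolean indicators and keeps the feasible one with the highest objective. The probabilities, the three candidate phrases and the reduced label set are assumptions made for illustration; a real system would hand the same objective and inequalities to an ILP solver.

```python
from itertools import product

LABELS = ["A0", "A1", "A2", "null"]
CORE = ["A0", "A1", "A2"]

# Hypothetical P(a_i = t) from an argument classifier, one row per candidate.
P = [
    [0.60, 0.30, 0.05, 0.05],  # candidate "I"
    [0.50, 0.40, 0.05, 0.05],  # candidate "my pearls"
    [0.10, 0.20, 0.60, 0.10],  # candidate "to my daughter"
]


def solve(probs):
    """Exhaustive search over the Boolean indicators a[i][t]."""
    n, num_labels = len(probs), len(LABELS)
    best_x, best_val = None, float("-inf")
    for bits in product((0, 1), repeat=n * num_labels):
        x = [bits[i * num_labels:(i + 1) * num_labels] for i in range(n)]
        # Linear constraint: each candidate receives exactly one label.
        if any(sum(row) != 1 for row in x):
            continue
        # Linear constraint: each core label is used at most once.
        if any(sum(x[i][LABELS.index(t)] for i in range(n)) > 1 for t in CORE):
            continue
        # Linear objective: sum of score(a_i = t) * a[i][t].
        val = sum(probs[i][t] * x[i][t] for i in range(n) for t in range(num_labels))
        if val > best_val:
            best_x, best_val = x, val
    return [LABELS[row.index(1)] for row in best_x]


local = [LABELS[row.index(max(row))] for row in P]
print("local argmax:", local)
print("constrained :", solve(P))
```

Here the independent argmax assigns A0 to two candidates; the constrained solution gives up a little local probability to remove the duplicate.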

Constraints
Any Boolean rule can be encoded as a linear constraint.
• No duplicate argument classes: $\sum_{a \in \text{POTARG}} x_{\{a = \text{A0}\}} \le 1$
• R-ARG (if there is an R-ARG phrase, there is an ARG phrase): $\forall a_2 \in \text{POTARG},\ \sum_{a \in \text{POTARG}} x_{\{a = \text{A0}\}} \ge x_{\{a_2 = \text{R-A0}\}}$
• C-ARG (if there is a C-ARG phrase, there is an ARG before it): $\forall a_2 \in \text{POTARG},\ \sum_{a \in \text{POTARG},\ a \text{ is before } a_2} x_{\{a = \text{A0}\}} \ge x_{\{a_2 = \text{C-A0}\}}$
These are universally quantified rules.
• Many other possible constraints:
• Unique labels
• No overlapping or embedding
• Relations between number of arguments; order constraints
• If verb is of type A, no argument of type B
LBJ: allows a developer to encode constraints in FOL; these are compiled into linear inequalities automatically.
Joint inference can also be used to combine different (SRL) systems.
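One way to convince yourself that these linear inequalities say the same thing as the Boolean rules is to compare them on every 0/1 assignment. The sketch below does that for the R-ARG and C-ARG rules over a small number of candidates. The function names are made up for this example, and this is only a truth-table check, not how LBJ compiles constraints.

```python
from itertools import product


def rule_r_arg(a0, r_a0):
    """Logical form: if some candidate is R-A0, then some candidate is A0."""
    return (not any(r_a0)) or any(a0)


def linear_r_arg(a0, r_a0):
    """Linear form: for every a2, sum_a x[a = A0] >= x[a2 = R-A0]."""
    return all(sum(a0) >= r for r in r_a0)


def rule_c_arg(a0, c_a0):
    """Logical form: every C-A0 candidate has an A0 candidate before it."""
    return all(any(a0[:j]) for j, c in enumerate(c_a0) if c)


def linear_c_arg(a0, c_a0):
    """Linear form: for every a2, sum over a before a2 of x[a = A0] >= x[a2 = C-A0]."""
    return all(sum(a0[:j]) >= c for j, c in enumerate(c_a0))


def equivalent(rule, linear, n):
    """Compare both encodings on all 0/1 assignments for n candidates."""
    assignments = list(product((0, 1), repeat=n))
    return all(rule(a0, other) == linear(a0, other)
               for a0 in assignments for other in assignments)


print(equivalent(rule_r_arg, linear_r_arg, 4))
print(equivalent(rule_c_arg, linear_c_arg, 4))
```

The check treats the A0 indicators and the R-A0 or C-A0 indicators as independent variables, so the equivalence holds whether or not a unique-label constraint is also imposed.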

Learning Based Java (LBJ): http://L2R.cs.uiuc.edu/~cogcomp/software.php A modeling language for Constrained Conditional Models • Supports programming along with building learned models, high level specification of constraints and inference with constraints • Learning operator: • Functions defined in terms of data • Learning happens at “compile time” • Integrated constraint language: • Declarative, FOL-like syntax defines constraints in terms of your Java objects • Compositionality: • Use any function as feature extractor • Easily combine existing model specifications /learned models with each other

Example: Semantic Role Labeling LBJ site provides example code for NER, POS tagger etc. Declarative, FOL-style constraints written in terms of functions applied to Java objects [Rizzolo, Roth’07] Inference produces new functions that respect the constraints

Semantic Role Labeling Screen shot from a CCG demo http://L2R.cs.uiuc.edu/~cogcomp Semantic parsing reveals several relations in the sentence along with their arguments. This approach produces a very good semantic parser. F1~90% Easy and fast: ~7 Sent/Sec (using Xpress-MP) Top ranked system in CoNLL’05 shared task Key difference is the Inference

Features Versus Constraints
• $\phi_i : X \times Y \to \mathbb{R}$; $C_i : X \times Y \to \{0,1\}$; $d : X \times Y \to \mathbb{R}$
• Mathematically, soft constraints are features
• If $\phi(x,y) = \phi(x)$, constraints provide an easy way to introduce dependence on y
• In principle, constraints and features can encode the same properties
• In practice, they are very different
• Features
• Local, short-distance properties, to support tractable inference
• Propositional (grounded)
• E.g. true if “the followed by a Noun occurs in the sentence”
• Constraints
• Global properties
• Quantified, first-order logic expressions
• E.g. true iff “all $y_i$’s in the sequence y are assigned different values.”

Constraints As a Way To Encode Prior Knowledge An effective way to inject knowledge We can use constraints as a way to replace training data Allows one to learn simpler models • Consider encoding the knowledge that: • Entities of type A and B cannot occur simultaneously in a sentence • The “Feature” Way • Requires larger models • Needs more training data • The Constraints Way • Keeps the model simple; add expressive constraints directly • A small set of constraints • Allows for decision-time incorporation of constraints

Information Extraction without Prior Knowledge
Citation: Lars Ole Andersen . Program analysis and specialization for the C Programming language . PhD thesis . DIKU , University of Copenhagen , May 1994 .
Fields: [AUTHOR] [TITLE] [EDITOR] [BOOKTITLE] [TECH-REPORT] [INSTITUTION] [DATE]
[Figure: prediction result of a trained HMM, showing the citation segmented into the fields above]
Violates lots of natural constraints!

Examples of Constraints
• Each field must be a consecutive list of words and can appear at most once in a citation.
• State transitions must occur on punctuation marks.
• The citation can only start with AUTHOR or EDITOR.
• The words pp., pages correspond to PAGE.
• Four digits starting with 20xx and 19xx are DATE.
• Quotations can appear only in TITLE
• …….
Easy to express pieces of “knowledge”. Non-propositional; may use quantifiers.
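Several of these citation constraints are simple to state as Boolean functions over a labeled token sequence. The sketch below writes three of them as plain Python predicates and reports which ones a given labeling violates. The tokenization, the punctuation set and the example labelings are assumptions chosen for illustration, not output from the system described in the slides.

```python
PUNCTUATION = {".", ",", ";", ":"}  # assumed set of punctuation tokens


def fields_contiguous_and_unique(labels):
    """Each field is one consecutive run of tokens and appears at most once."""
    seen, previous = set(), None
    for label in labels:
        if label != previous:
            if label in seen:
                return False
            seen.add(label)
            previous = label
    return True


def transitions_on_punctuation(tokens, labels):
    """A label may change only right after a punctuation token."""
    return all(labels[i] == labels[i - 1] or tokens[i - 1] in PUNCTUATION
               for i in range(1, len(labels)))


def starts_with_author_or_editor(labels):
    """The citation can only start with AUTHOR or EDITOR."""
    return bool(labels) and labels[0] in {"AUTHOR", "EDITOR"}


def violations(tokens, labels):
    """Names of the constraints that this labeling violates."""
    checks = {
        "contiguous_unique": fields_contiguous_and_unique(labels),
        "punctuation": transitions_on_punctuation(tokens, labels),
        "start": starts_with_author_or_editor(labels),
    }
    return [name for name, ok in checks.items() if not ok]


tokens = ["Lars", "Ole", "Andersen", ".", "Program", "analysis", ".", "May", "1994", "."]
good = ["AUTHOR"] * 4 + ["TITLE"] * 3 + ["DATE"] * 3
bad = ["AUTHOR", "AUTHOR", "TITLE", "TITLE", "TITLE", "TITLE", "TITLE",
       "AUTHOR", "DATE", "DATE"]
print(violations(tokens, good))
print(violations(tokens, bad))
```

Counting violated constraints in this way is also one simple choice for the distance from a “legal” assignment when the constraints are treated as soft.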

Information Extraction with Constraints
• Adding constraints, we get correct results!
• Without changing the model
• [AUTHOR] Lars Ole Andersen . [TITLE] Program analysis and specialization for the C Programming language . [TECH-REPORT] PhD thesis . [INSTITUTION] DIKU , University of Copenhagen , [DATE] May, 1994 .

Value of Constraints in Semi-Supervised Learning
[Figure: factored model; performance against the # of available labeled examples, comparing learning w/o constraints (300 examples) with learning with 10 constraints]
Constraints are used to bootstrap a semi-supervised learner: a poor model plus constraints is used to annotate unlabeled data, which in turn is used to keep training the model.

Is it true that…? (Textual Entailment)
Eyeing the huge market potential, currently led by Google, Yahoo took over search company Overture Services Inc. last year
⇒ Yahoo acquired Overture • Overture is a search company • Google is a search company • Google owns Overture • ……….
Textual Entailment: Inference for Entailment [Braz et al.’05, Sammons et al. 07,09] • Semantic Role Labeling [Punyakanok et al.’05,08] • Phrasal verb paraphrasing [Connor&Roth’07] • Entity matching [Li et al., AAAI’04, NAACL’04]

Training Paradigms that Support Global Inference • Coupling vs. Decoupling Training and Inference. • Incorporating global constraints is important but • Should it be done only at evaluation time or also at training time? • How to decompose the objective function and train in parts? • Issues related to: • Modularity, efficiency and performance, availability of training data • Problem specific considerations

Training in the presence of Constraints • General Training Paradigm: • First Term: Learning from data (could be further decomposed) • Second Term: Guiding the model by constraints • Can choose if constraints’ weights trained, when and how, or taken into account only in evaluation. Decompose Model (SRL case) Decompose Model from constraints

Comparing Training Methods • Option 1: Learning + Inference (with Constraints) • Ignore constraints during training • Option 2: Inference (with Constraints) Based Training • Consider constraints during training • In both cases: Global Decision Making with Constraints • Question: Isn’t Option 2 always better? • Not so simple… • Next, the “Local model story”

Training Methods
[Figure: local models f1(x), ..., f5(x) over inputs X (x1, ..., x7) predicting outputs Y (y1, ..., y5)]
Each model can be more complex and may have a view on a set of output variables.
• Learning + Inference (L+I): learn models independently
• Inference Based Training (IBT): learn all models together!
Intuition: learning with constraints may make learning more difficult

Training with Constraints
Example: Perceptron-based Global Learning
[Figure: local models f1(x), ..., f5(x) map inputs X (x1, ..., x7) to outputs Y]
• True global labeling Y: -1 1 -1 -1 1
• Local predictions Y’: -1 1 -1 1 1
• After applying constraints, Y’: -1 1 1 1 1
Which one is better? When and why?
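The difference between the two training schemes shows up in which components receive a perceptron update. The sketch below reproduces the three labelings from this example with one-dimensional features. The weights and the constraint (the third and fourth outputs must agree) are invented so that the numbers come out as on the slide, so read it as an illustration of the update rule only.

```python
from itertools import product


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def local_predict(weights, features):
    """Each component decides on its own: the sign of w_j . x_j."""
    return [1 if dot(w, x) >= 0 else -1 for w, x in zip(weights, features)]


def constrained_predict(weights, features, constraint):
    """Best labeling, by sum_j y_j (w_j . x_j), among those satisfying the constraint."""
    scores = [dot(w, x) for w, x in zip(weights, features)]
    legal = (y for y in product((-1, 1), repeat=len(scores)) if constraint(y))
    return list(max(legal, key=lambda y: sum(yj * s for yj, s in zip(y, scores))))


def perceptron_update(weights, features, gold, predicted):
    """Update every component whose prediction disagrees with gold; return their indices."""
    updated = []
    for j, (y, y_hat) in enumerate(zip(gold, predicted)):
        if y != y_hat:
            weights[j] = [w + y * f for w, f in zip(weights[j], features[j])]
            updated.append(j)
    return updated


gold = [-1, 1, -1, -1, 1]
features = [[1.0] for _ in gold]                   # hypothetical 1-d inputs
weights = [[-1.0], [1.0], [-0.2], [1.0], [1.0]]    # hypothetical current models


def agree(y):
    """Assumed constraint for this toy example: y3 and y4 take the same value."""
    return y[2] == y[3]


local = local_predict(weights, features)
joint = constrained_predict(weights, features, agree)

# L+I: train each model on its own local mistakes.
li_updates = perceptron_update([w[:] for w in weights], features, gold, local)
# IBT: train on the mistakes of the constrained, global prediction.
ibt_updates = perceptron_update([w[:] for w in weights], features, gold, joint)
print(local, joint, li_updates, ibt_updates)
```

Under IBT the third component is updated even though its own local prediction was correct, which is one way to see why training with constraints can make the local learning problems harder.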

Claims [Punyakanok et al., IJCAI 2005; Rajhans, Roth, Titov,’10] • When the local models are “easy” to learn, L+I outperforms IBT. • In many applications, the components are identifiable and easy to learn (e.g., argument, open-close, PER). • Only when the local problems become difficult to solve in isolation does IBT outperform L+I, and it needs a larger number of training examples. • L+I: cheaper computationally; modular • IBT is better in the limit, and in other extreme cases. • Other training paradigms are possible • Pipeline-like Sequential Models: [Roth, Small, Titov: AI&Stat’09] • Identify a preferred ordering among components • Learn the k-th model jointly with previously learned models

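Bounds
• Local: $\epsilon \le \epsilon_{opt} + \left( (d \log m + \log 1/\delta) / m \right)^{1/2}$
• Global: $\epsilon \le 0 + \left( (cd \log m + c^2 d + \log 1/\delta) / m \right)^{1/2}$
$\epsilon_{opt}$ is an indication of the hardness of the problem.
[Figure: bound prediction and simulated data, with panels for $\epsilon_{opt} = 0$, $\epsilon_{opt} = 0.1$ and $\epsilon_{opt} = 0.2$]
L+I vs. IBT: the more identifiable individual problems are, the better overall performance is with L+I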
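Plugging numbers into the two bounds shows the tradeoff they predict. The sketch below evaluates both expressions, reading the symbols as the error of the best local model (eps_opt), a dimension d, the number of components c, the number of examples m, and a confidence parameter delta. That reading, and the particular values used, are assumptions made for illustration.

```python
import math


def local_bound(eps_opt, d, m, delta):
    """L+I bound: eps_opt + sqrt((d log m + log(1/delta)) / m)."""
    return eps_opt + math.sqrt((d * math.log(m) + math.log(1 / delta)) / m)


def global_bound(c, d, m, delta):
    """IBT bound: 0 + sqrt((c d log m + c^2 d + log(1/delta)) / m)."""
    return math.sqrt((c * d * math.log(m) + c ** 2 * d + math.log(1 / delta)) / m)


# Hypothetical setting: 5 components, dimension 10, 95% confidence, eps_opt = 0.1.
for m in (1_000, 1_000_000):
    li = local_bound(eps_opt=0.1, d=10, m=m, delta=0.05)
    ibt = global_bound(c=5, d=10, m=m, delta=0.05)
    print(f"m={m}: local bound={li:.3f}, global bound={ibt:.3f}")
```

With few examples the local bound is the smaller one. With many examples the global bound keeps shrinking towards zero while the local bound levels off at eps_opt, which matches the claim that IBT is better in the limit when the local problems are hard.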

Relative Merits: SRL
[Figure: performance of the training methods against the difficulty of the learning problem (# features), on an axis running from hard to easy]
L+I is better. When the problem is artificially made harder, the tradeoff is clearer.

Comparing Training Methods (Cont.)
Decompose Model (SRL case) • Decompose Model from constraints
• Local Models (train independently) vs. Structured Models
• In many cases, structured models might be better due to expressivity
• But, what if we use constraints?
• Local Models + Constraints vs. Structured Models + Constraints
• Hard to tell: constraints are expressive
• For tractability reasons, structured models have less expressivity than the use of constraints (and are harder to learn than local models)

Example: CRFs are CCMs
[Figure: sequence model with outputs y1, ..., y5 over inputs x1, ..., x5, drawn as a lattice from s to t with the values A, B, C at each position]
• Consider a common model for sequential inference: HMM/CRF
• Inference in this model is done via the Viterbi algorithm.
• Viterbi is a special case of Linear Programming based inference.
• Viterbi is a shortest path problem, which is an LP with a canonical matrix that is totally unimodular. Therefore, you get integrality constraints for free.
But, you can do better
• One can now incorporate non-sequential/expressive/declarative constraints by modifying this canonical matrix
• No value can appear twice; a specific value must appear at least once; AB
• And run the inference as an ILP inference.
Learn a rather simple model; make decisions with a more expressive model
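The point about Viterbi can be illustrated on a tiny lattice. The sketch below runs standard Viterbi, checks that it agrees with exhaustive search when there are no extra constraints, and then adds the declarative constraint “no value can appear twice”, which Viterbi's first-order recursion cannot express. Exhaustive search stands in for the ILP solver here, and all scores are made-up values.

```python
from itertools import product

STATES = ["A", "B", "C"]


def path_score(path, emit, trans):
    """Sum of emission and transition scores along one path."""
    score = emit[0][path[0]]
    for t in range(1, len(path)):
        score += trans[path[t - 1]][path[t]] + emit[t][path[t]]
    return score


def viterbi(emit, trans):
    """Dynamic program over the lattice; returns the highest-scoring path."""
    best = {s: (emit[0][s], [s]) for s in STATES}
    for t in range(1, len(emit)):
        new = {}
        for s in STATES:
            prev = max(STATES, key=lambda p: best[p][0] + trans[p][s])
            new[s] = (best[prev][0] + trans[prev][s] + emit[t][s], best[prev][1] + [s])
        best = new
    return max(best.values(), key=lambda entry: entry[0])[1]


def constrained_argmax(emit, trans, constraint):
    """Exhaustive stand-in for ILP inference with an arbitrary Boolean constraint."""
    legal = (list(p) for p in product(STATES, repeat=len(emit)) if constraint(p))
    return max(legal, key=lambda p: path_score(p, emit, trans))


# Hypothetical scores: staying in the same state earns a bonus of 0.5.
emit = [
    {"A": 2.0, "B": 0.5, "C": 0.0},
    {"A": 1.5, "B": 1.0, "C": 0.0},
    {"A": 0.0, "B": 0.3, "C": 1.0},
]
trans = {p: {s: 0.5 if p == s else 0.0 for s in STATES} for p in STATES}

unconstrained = viterbi(emit, trans)
no_repeats = constrained_argmax(emit, trans, lambda p: len(set(p)) == len(p))
print(unconstrained, no_repeats)
```

The model and its scores are unchanged between the two calls; only the inference step becomes more expressive.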

Experiment: [CRF vs. perceptrons] + Constraints
• Experiments on SRL: [Roth and Yih, ICML 2005]
• Story: inject constraints into conditional random field models
[Figure: results comparing Sequential Models and Local Models under L+I and IBT training]
• Sequential Models are better than Local Models! (No constraints)
• Local Models are now better than Sequential Models! (With constraints)

Summary: Training Methods Learn a rather simple model; make decisions with a more expressive model • Many choices for training a CCM • Learning + Inference (Training without constraints) • Inference based Learning (Training with constraints) • Model Decomposition • Advantages of L+I • Require fewer training examples • More efficient; most of the time, better performance • Modularity; easier to incorporate already learned models. • Advantages of IBT • Better in the limit • Better when there are strong interactions among y’s

Training CCMs with Soft Constraints
[Formal model annotations: (soft) constraints component; constraint violation penalty; how far y is from a “legal” assignment]
• Soft constraints: if all solutions violate constraints, we still want to rank solutions based on the level of constraint violation.
• Training: need to figure out the penalty as well…
• Option 1: Learning + Inference (with Constraints)
• Learn the weights and penalties separately
• Penalty(c) = -log{P(C is violated)}
• Option 2: Inference (with Constraints) Based Training
• Learn the weights and penalties together
The tradeoff between L+I and IBT is similar to earlier.
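For Option 1, the penalty rule and the resulting ranking can be sketched in a few lines. The violation probabilities, the candidate model scores and the use of a violation count as the distance from a legal assignment are assumptions made for this example.

```python
import math


def penalty(p_violated):
    """Penalty(c) = -log P(c is violated): rarely violated constraints cost more."""
    return -math.log(p_violated)


def soft_score(model_score, violation_counts, penalties):
    """Model score minus the penalty-weighted violation counts."""
    return model_score - sum(rho * d for rho, d in zip(penalties, violation_counts))


# Hypothetical estimates of how often each constraint is violated in labeled data.
penalties = [penalty(0.01), penalty(0.2)]

# Hypothetical candidates: (model score, violations of constraint 1 and constraint 2).
candidates = {
    "y1": (5.0, (1, 0)),
    "y2": (4.0, (0, 1)),
    "y3": (4.5, (0, 2)),
}

ranking = sorted(candidates,
                 key=lambda name: soft_score(*candidates[name], penalties),
                 reverse=True)
print(ranking)
```

Every candidate violates some constraint, so a hard-constraint formulation would have no feasible solution; the soft version still prefers the candidate whose violations are cheapest relative to its model score.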

Textual Entailment as a CCM
Text: Former military specialist Carpenter took the helm at FictitiousCom Inc. after five years as press official at the United States embassy in the United Kingdom.
Hypothesis: Jim Carpenter worked for the US Government.
[Figure: alignment between the nodes of the hypothesis (x1, ..., x4) and the nodes of the text (x1, ..., x7)]
• Entailment requires alignment
• But only positive entailments are expected to align
• Given an alignment, learn a decision: Entail/Does not Entail

Constraints in a Hidden Layer
[Figure: inputs X (x1, ..., x7) mapped to a single output Y (y1)]
• Single output problem
• Hard to find constraints!
• Good decisions depend on a good intermediate representation
Intuition: introduce structured hidden variables

Adding Constraints Through Hidden Variables
[Figure: inputs X (x1, ..., x7), a hidden layer f1, ..., f5, and a single output Y (y1)]
• Single output problem with hidden variables
• Use constraints to capture the dependencies.
• Better hidden layer, better output

Learning Intermediate Representations A general learning framework that allows learning to select the “best” intermediate representation Key idea: Jointly learn to select the intermediate representation and classify instances A framework that allows injecting knowledge & optimizing intermediate representations easily, using ILP inference Excellent results on Transliteration, Paraphrasing, Textual Entailment

Learning Good Feature Representation for Discriminative Transliteration [NAACL’09; in Submission]
Example: (איטליה, Italy) → Yes/No
[Figure: the characters of the two words as graph nodes; candidate features are the edges between them]
• Learning feature representation is a structured learning problem
• Features are graph edges; the problem is choosing the optimal subset of edges
• Many constraints on the legitimacy of the active feature representation → formalize the problem as a constrained optimization problem
• Subject to: • One-to-one mapping • Non-crossing • Length difference restriction • Language-specific constraints
• The alignment itself isn’t important.
• The hidden structure is used as a feature representation for learning the binary classification task → find the feature representation that optimizes classification over the training data
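Choosing the active edges under these constraints can be sketched as a small constrained optimization. The version below scores each candidate edge between a source character position and a target character position, then searches all subsets for the best one that is one-to-one and non-crossing. The edge weights are invented, and exhaustive subset search stands in for the ILP or dynamic programming inference mentioned in the talk.

```python
from itertools import combinations


def is_legal(edges):
    """One-to-one mapping and no crossing edges."""
    sources = [i for i, _ in edges]
    targets = [j for _, j in edges]
    if len(set(sources)) != len(sources) or len(set(targets)) != len(targets):
        return False
    ordered = sorted(edges)
    # With distinct sources in increasing order, targets must also increase.
    return all(a[1] < b[1] for a, b in zip(ordered, ordered[1:]))


def best_alignment(weights):
    """Highest-weight legal subset of edges (the empty set scores 0)."""
    edges = list(weights)
    best, best_val = (), 0.0
    for k in range(1, len(edges) + 1):
        for subset in combinations(edges, k):
            if not is_legal(subset):
                continue
            val = sum(weights[e] for e in subset)
            if val > best_val:
                best, best_val = subset, val
    return sorted(best)


# Hypothetical edge weights: (source index, target index) -> current model score.
weights = {
    (0, 0): 1.0,
    (0, 1): -0.5,
    (1, 2): 0.8,
    (2, 1): 0.9,
    (2, 3): 0.4,
    (3, 3): 0.7,
}
print(best_alignment(weights))
```

The selected edges would then serve as the active features for the Yes/No classifier, which is why the talk notes that the alignment itself is not the goal.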

Iterative Objective Function Learning
• Formalized as structured SVM + (constrained) hidden structure
• Transliteration: inference can be done via dynamic programming (not for TE)
[Flow chart with the steps: initial objective function (Romanization table); inference: generate features; prediction: predict labels for all word pairs; training: update weight vector (possibly supervised)]

Summary: Constrained Conditional Models
[Figure: a conditional Markov random field over y1, ..., y8 next to a constraints network over the same variables]
$y^* = \arg\max_y \sum_i w_i \phi_i(x, y) - \sum_i \rho_i\, d_{C_i}(x, y)$
• Conditional Markov random field (first term): linear objective functions
• Typically $\phi(x,y)$ will be local functions, or $\phi(x,y) = \phi(x)$
• Constraints network (second term): expressive constraints over output variables
• Soft, weighted constraints
• Specified declaratively as FOL formulae
• Clearly, there is a joint probability distribution that represents this mixed model.
• We would like to:
• Learn a simple model or several simple models
• Make decisions with respect to a complex model
Key difference from MLNs, which provide a concise definition of a model, but the whole joint one.
