
Global Inference in Learning for Natural Language Processing


Presentation Transcript


  1. Global Inference in Learning for Natural Language Processing

  2. Comprehension (ENGLAND, June, 1989) - Christopher Robin is alive and well. He lives in England. He is the same person that you read about in the book, Winnie the Pooh. As a boy, Chris lived in a pretty home called Cotchfield Farm. When Chris was three years old, his father wrote a poem about him. The poem was printed in a magazine for others to read. Mr. Robin then wrote a book. He made up a fairy tale land where Chris lived. His friends were animals. There was a bear called Winnie the Pooh. There was also an owl and a young pig, called a piglet. All the animals were stuffed toys that Chris owned. Mr. Robin made them come to life with his words. The places in the story were all near Cotchfield Farm. Winnie the Pooh was written in 1925. Children still love to read about Christopher Robin and his animal friends. Most people don't know he is a real person who is grown now. He has written two books of his own. They tell what it is like to be famous. 1. Who is Christopher Robin? 2. When was Winnie the Pooh written? 3. What did Mr. Robin do when Chris was three years old? 4. Where did young Chris live? 5. Why did Chris write two books of his own?

  3. Inference for Entailment [AAAI’05; TE’07]: Is it true that…? (Textual Entailment) • Given: Q: Who acquired Overture? • Determine: A: Eyeing the huge market potential, currently led by Google, Yahoo took over search company Overture Services Inc. last year. • Candidate hypotheses: Yahoo acquired Overture; Overture is a search company; Google is a search company; Google owns Overture; … • Component problems: phrasal verb paraphrasing [Connor & Roth ’07], entity matching [Li et al., AAAI’04, NAACL’04], semantic role labeling.

  4. What we Know: Stand-Alone Ambiguity Resolution. Examples: Illinois’ bored of education → board; …Nissan Car and truck plant is…; …divide life into plant and animal kingdom; (This Art) (can N) (will MD) (rust V) vs. V, N, N; The dog bit the kid. He was taken to [a veterinarian / a hospital]. Learn a function f: X → Y that maps observations in a domain to one of several categories or …

  5. Classification is Well Understood • Theoretically: generalization bounds. • How many examples does one need to see in order to guarantee good behavior on previously unobserved examples? • Algorithmically: good learning algorithms for linear representations. • Can deal with very high dimensionality (10^6 features). • Very efficient in terms of computation and # of examples. On-line. • Key issues remaining: • Learning protocols: how to minimize interaction (supervision); how to map domain/task information to supervision; semi-supervised learning; active learning; ranking; adaptation. • What are the features? No good theoretical understanding here. • How to decompose problems and learn tractable models. • Modeling/programming systems that have multiple classifiers.

  6. A process that maintains and updates a collection of propositions about the state of affairs. Comprehension (the Christopher Robin passage from Slide 2 is repeated here, followed by statements rather than questions): 1. Christopher Robin was born in England. 2. Winnie the Pooh is a title of a book. 3. Christopher Robin’s dad was a magician. 4. Christopher Robin must be at least 65 now. This is an Inference Problem.

  7. Learning and Inference • Global decisions in which several local decisions play a role but there are mutual dependencies on their outcome. • Learned classifiers for different sub-problems • Incorporate classifiers’ information, along with constraints, in making coherent decisions – decisions that respect the local classifiers as well as domain & context specific constraints. • Global inference for the best assignment to all variables of interest.

  8. Special case I: Structured Output • Classifiers • Recognizing “the beginning of NP” • Recognizing “the end of NP” (or: word-based classifiers: BIO representation). Also for other kinds of phrases… • Some Constraints • Phrases do not overlap • Order of phrases (e.g., Prob(NP → VP)) • Length of phrases • Non-sequential and declarative: if PP then NP in the sentence • Inference: use the classifiers to infer a coherent set of phrases. Example: He reckons the current account deficit will narrow to only # 1.8 billion in September → [NP He] [VP reckons] [NP the current account deficit] [VP will narrow] [PP to] [NP only # 1.8 billion] [PP in] [NP September]
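
The "phrases do not overlap" constraint above is easy to state programmatically. A minimal sketch (the spans and data are illustrative, not from any of the systems discussed):

```python
# Checking the "phrases do not overlap" constraint on candidate spans.
# Spans are (start, end) token offsets, end exclusive; the data is made up.
def no_overlap(spans):
    spans = sorted(spans)
    # each phrase must end before (or exactly where) the next one starts
    return all(a_end <= b_start
               for (_, a_end), (b_start, _) in zip(spans, spans[1:]))

print(no_overlap([(0, 1), (1, 4), (4, 6)]))  # adjacent phrases are legal
print(no_overlap([(0, 3), (2, 5)]))          # overlapping phrases are not
```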

  9. Sequential Constraints Structure (allows for dynamic-programming-based inference). Three models for sequential inference with classifiers [Punyakanok & Roth NIPS’01]: • HMM; HMM with classifiers: sufficient for easy problems. • Conditional Models (PMM): allow direct modeling of states as a function of the input; the classifiers may vary: SNoW (Winnow; Perceptron), MEMM (MaxEnt), SVM-based. • Constraint Satisfaction Models: the inference problem is modeled as weighted 2-SAT; with sequential constraints, shown to have an efficient solution. Recent work views this as multi-class classification, with emphasis on global training [Collins’02, CRFs, M3Ns]. What if the structure of the problem/constraints is not sequential?
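
A minimal sketch of dynamic-programming inference over sequential constraints: a Viterbi-style pass that combines per-token classifier scores with a hard transition constraint (a toy BIO example with made-up scores, not the talk's actual models; the constraint forbids an I tag directly after O):

```python
# Viterbi-style inference: per-token classifier scores plus hard
# transition constraints (illustrative BIO example with made-up scores).
def viterbi(scores, allowed):
    """scores: list of {label: score} per token;
    allowed(prev, cur) -> bool encodes the hard sequential constraints."""
    labels = list(scores[0])
    best = [dict(scores[0])]   # best[t][y]: best prefix score ending in y
    back = [{}]
    for t in range(1, len(scores)):
        best.append({})
        back.append({})
        for y in labels:
            cands = [(best[t - 1][p] + scores[t][y], p)
                     for p in labels if allowed(p, y)]
            best[t][y], back[t][y] = max(cands)
    # backtrack to recover the highest-scoring legal label sequence
    y = max(best[-1], key=best[-1].get)
    seq = [y]
    for t in range(len(scores) - 1, 0, -1):
        y = back[t][y]
        seq.append(y)
    return seq[::-1]

# BIO constraint: an I tag may not follow O (a phrase must open with B)
ok = lambda p, c: not (p == "O" and c == "I")
scores = [{"B": 1.0, "I": 0.2, "O": 0.5},
          {"B": 0.1, "I": 0.4, "O": 0.9},   # locally prefers O
          {"B": 0.2, "I": 1.0, "O": 0.1}]   # locally prefers I
print(viterbi(scores, ok))                  # ['B', 'I', 'I']
```

Note that the per-token argmax would be B, O, I, which violates the constraint; the DP returns the best *legal* sequence instead.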

  10. Why Special Cases? (The Christopher Robin passage and questions from Slide 2 are repeated here.) Classifiers might be learned from different sources, at different times, in different contexts. In general, we cannot assume that all the data is available at one time. Global training is not an option.

  11. Pipeline: Raw Data → POS Tagging → Phrases → Semantic Entities → Relations (with Parsing, WSD and Semantic Role Labeling along the way). • Most problems are not single classification problems. • Pipelining is a crude approximation; interactions occur across levels, and downstream decisions often interact with previous decisions. • Leads to propagation of errors. • Occasionally, later-stage problems are easier, but upstream mistakes will not be corrected. • Looking for: global inference over the outcomes of different local predictors as a way to break away from this paradigm [between pipeline & fully global], and a flexible way to incorporate linguistic and structural constraints.

  12. Special Case II: General Constraints Structure. J.V. Oswald was murdered at JFK after his assassin, K. F. Johns… Identify: Kill(X, Y), with entities labeled person (J.V. Oswald, K. F. Johns) and location (JFK). • Identify named entities. • Identify relations between entities. • Exploit the mutual dependencies between named entities and relations to yield a coherent global detection. [Roth & Yih, COLING’02; CoNLL’04] Some knowledge (classifiers) may be known in advance; some constraints may be available only at decision time.

  13. Inference with General Constraint Structure [Roth & Yih ’04, ’07]. Example: Dole’s wife, Elizabeth, is a native of N.C., with entities E1, E2, E3 and relations R12, R23. Improvement over no inference: 2-5%.

  14. Global Inference over Local Models/Classifiers + Expressive Constraints • Constrained Conditional Models • Generality of the framework • Training Paradigms • Global training, Decomposition and Local training • Examples • Semantic Parsing • Information Extraction • Pipeline processes

  15. Issues • Incorporating general constraints (Algorithmic Approach) • Allow both statistical and expressive declarative constraints • Allow non-sequential constraints (generally difficult) • The value of using more constraints • Coupling vs. Decoupling Training and Inference. • Incorporating global constraints is important but • Should it be done only at evaluation time or also in training time? • Issues related to modularity, efficiency and performance

  16. Problem Setting • Random variables Y: y1, …, y8 (over the observations). • Conditional distributions P (learned by models/classifiers). • Constraints C: any Boolean function defined on partial assignments, e.g., C(y1, y4) and C(y2, y3, y6, y7, y8), possibly with weights W. • Goal: find the “best” assignment, the one that achieves the highest global performance: Y* = argmax_Y P(Y) subject to constraints C. • This is an Integer Programming Problem. Other, more general ways to incorporate soft constraints exist [ACL’07].
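
For a tiny instance, this problem setting can be sketched by brute force (real systems use integer programming rather than enumeration; the local probabilities and the constraint below are made up):

```python
from itertools import product

# Brute-force version of Y* = argmax_Y P(Y) subject to Boolean constraints C
# (illustrative only; enumeration replaces the ILP solver).
def constrained_argmax(local_probs, constraints):
    """local_probs: one {value: prob} dict per variable (independent locals);
    constraints: Boolean functions over a full assignment tuple."""
    best, best_score = None, float("-inf")
    for y in product(*[p.keys() for p in local_probs]):
        if not all(c(y) for c in constraints):
            continue  # skip assignments that violate any hard constraint
        score = 1.0
        for p, v in zip(local_probs, y):
            score *= p[v]
        if score > best_score:
            best, best_score = y, score
    return best

probs = [{"A": 0.6, "B": 0.4}, {"A": 0.7, "B": 0.3}]
# constraint C(y1, y2): the two variables may not both take value "A"
no_double_A = lambda y: not (y[0] == "A" and y[1] == "A")
print(constrained_argmax(probs, [no_double_A]))   # ('B', 'A')
```

Without the constraint the locally-best assignment is ("A", "A"); the constraint forces the globally best *legal* assignment, ("B", "A").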

  17. Constrained Conditional Models (a Conditional Markov Random Field plus a Constraints Network): y* = argmax_y Σ_i w_i φ_i(x, y) − Σ_i ρ_i C_i(x, y) • The model score Σ_i w_i φ_i(x, y) is typically linear or log-linear; typically the φ_i(x, y) are local functions, or φ_i(x, y) = φ_i(x). • The term Σ_i ρ_i C_i(x, y) optimizes for general constraints: constraints may have weights ρ_i, may be soft, and are specified declaratively as FOL formulae. • Clearly, there is a joint probability distribution that represents this mixed model. • We would like to make decisions with respect to the mixed model, but not necessarily learn this complex model.

  18. A General Inference Setting • Linear objective function: essentially all complex models studied today can be viewed as optimizing a linear objective function: HMMs/CRFs [Roth’99; Collins’02; Lafferty et al. ’02]. • Linear objective functions can be derived from a probabilistic perspective. • The probabilistic perspective supports finding the most likely assignment; that is not necessarily what we want. • Integer linear programming (ILP) formulation: allows the incorporation of more general cost functions; handles general (non-sequential) constraint structure; better exploitation (computationally) of hard constraints; can find the optimal solution if desired.

  19. Formal Model • The objective combines a weight vector for the “local” models (a collection of classifiers; log-linear models such as HMMs or CRFs, or a combination) with a (soft) constraints component: a penalty for violating each constraint, measuring how far y is from a “legal” assignment. • How to solve? This is an Integer Linear Program; solving with ILP packages gives an exact solution, and search techniques are also possible. • How to train? How to decompose the global objective function? Should we incorporate constraints in the learning process?

  20. Example: Semantic Role Labeling. Who did what to whom, when, where, why,… I left my pearls to my daughter in my will . → [I]_A0 [left] [my pearls]_A1 [to my daughter]_A2 [in my will]_AM-LOC . • A0: Leaver • A1: Things left • A2: Benefactor • AM-LOC: Location • Constraints apply: overlapping arguments are an issue, and if A2 is present, A1 must also be present; score scaling issues may need to be addressed. • Special case (structured output problem): here, all the data is available at one time; in general, classifiers might be learned from different sources, at different times, in different contexts, with implications for training paradigms.

  21. Semantic Role Labeling (1/2) • For each verb in a sentence: identify all constituents that fill a semantic role, and determine their roles. • Core arguments, e.g., Agent, Patient or Instrument; their adjuncts, e.g., Locative, Temporal or Manner. Examples: I [A0: leaver] left my pearls [A1: thing left] to my daughter-in-law [A2: benefactor] in my will [AM-LOC]. The pearls [A1: thing left] which [R-A1] I [A0: leaver] left to my daughter-in-law [A2: benefactor] are fake. The pearls [A1: utterance], I [A0: sayer] said, were left to my daughter-in-law [C-A1: utterance].

  22. Semantic Role Labeling (2/2) • PropBank [Palmer et al. ’05] provides a large human-annotated corpus of semantic verb-argument relations. • It adds a layer of generic semantic labels to Penn Treebank II. • (Almost) all the labels are on the constituents of the parse trees. • Core arguments: A0-A5 and AA, with different semantics for each verb, specified in the PropBank frame files. • 13 types of adjuncts, labeled AM-arg, where arg specifies the adjunct type.

  23. Algorithmic Approach (example sentence: I left my nice pearls to her) • Identify argument candidates: pruning [Xue & Palmer, EMNLP’04]; argument identifier: binary classification. This step identifies the vocabulary of candidate arguments, and is relatively easy. • Classify argument candidates: argument classifier, multi-class classification. • Inference: use the estimated probability distribution given by the argument classifier, use structural and linguistic constraints, and infer the optimal global output over the (old and new) vocabulary.

  24. Argument Identification & Classification (over candidate phrases of “I left my nice pearls to her”) • Both the argument identifier and the argument classifier are trained phrase-based classifiers. • Features (some examples): voice, phrase type, head word, path, chunk, chunk pattern, etc. [some make use of a full syntactic parse] • Learning algorithm: SNoW, a sparse network of linear functions whose weights are learned by a regularized Winnow multiplicative update rule. • Probability conversion is done via softmax: p_i = exp{act_i} / Σ_j exp{act_j}.
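
The softmax conversion on this slide is straightforward to sketch (the activation values below are made up):

```python
import math

# Softmax conversion of raw classifier activations into probabilities,
# as in  p_i = exp(act_i) / sum_j exp(act_j).
def softmax(acts):
    m = max(acts)                      # subtract the max for numerical stability
    exps = [math.exp(a - m) for a in acts]
    z = sum(exps)
    return [e / z for e in exps]

acts = [2.0, 1.0, 0.1]                 # e.g., activations for three labels
probs = softmax(acts)
print([round(p, 3) for p in probs])    # sums to 1 and preserves the ranking
```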

  25. Inference (example sentence: I left my nice pearls to her) • The output of the argument classifier often violates some constraints, especially when the sentence is long. • Finding the best legitimate output is formalized as an optimization problem and solved via Integer Linear Programming. [Punyakanok et al. ’04; Roth & Yih ’04, ’05] • Input: the probability estimates (from the argument classifier) and the structural and linguistic constraints. • Allows incorporating expressive (non-sequential) constraints on the variables (the argument types).

  26. Integer Linear Programming Inference • For each argument a_i, set up a Boolean variable a_{i,t} indicating whether a_i is classified as t. • Goal: maximize Σ_i Σ_t score(a_i = t) · a_{i,t}, subject to the (linear) constraints. • If score(a_i = t) = P(a_i = t), the objective is to find the assignment that maximizes the expected number of arguments that are correct while satisfying the constraints. The Constrained Conditional Model is completely decomposed during training. (Note, though, that these parts can still be inconsistent in some more complex way.)
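
The objective with indicator variables can be illustrated on a tiny instance, solved here by enumeration rather than an ILP solver (the scores and the single no-duplicate constraint are illustrative):

```python
from itertools import product

# The objective max sum_i score(a_i = t) * a_{i,t}, solved by enumeration
# for a toy two-argument instance (real systems call an ILP solver).
def best_labeling(scores, labels):
    def legal(assign):
        # at most one argument may be labeled A0 ("no duplicate A0")
        return sum(t == "A0" for t in assign) <= 1
    legal_assigns = (a for a in product(labels, repeat=len(scores)) if legal(a))
    return max(legal_assigns,
               key=lambda a: sum(scores[i][t] for i, t in enumerate(a)))

labels = ["A0", "A1", "NONE"]
scores = [{"A0": 0.9, "A1": 0.3, "NONE": 0.1},   # both candidates locally
          {"A0": 0.8, "A1": 0.6, "NONE": 0.2}]   # prefer A0 ...
print(best_labeling(scores, labels))              # ('A0', 'A1')
```

Unconstrained, both arguments would take A0 (total 1.7); the constraint yields the best consistent labeling ("A0", "A1") instead.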

  27. Inference • Maximize the expected number of correct arguments: T* = argmax_T Σ_i P(a_i = t_i). • Subject to some constraints: structural and linguistic (e.g., R-A1 ⇒ A1). • Solved with Integer Linear Programming. (Example sentence: I left my nice pearls to her.)

  28. Constraints • No duplicate argument classes: Σ_{a ∈ POTARG} x_{a = A0} ≤ 1. • R-ARG (if there is an R-ARG phrase, there is an ARG phrase): ∀ a2, Σ_{a ∈ POTARG} x_{a = A0} ≥ x_{a2 = R-A0}. • C-ARG (if there is a C-ARG phrase, there is an ARG phrase before it): ∀ a2, Σ_{a before a2} x_{a = A0} ≥ x_{a2 = C-A0}. • Many other possible constraints: unique labels; no overlapping or embedding; relations between the numbers of arguments; if the verb is of type A, no argument of type B. Joint inference can also be used to combine different SRL systems.

  29. Results • We built two SRL systems based on two full parsers: Collins’s (the parser we discussed on Wednesday) and Charniak’s (“A Maximum-Entropy-Inspired Parser”, 2000). • Results (PropBank, on the Penn Treebank corpus; CoNLL evaluation, section 23): top-ranked system in the CoNLL’05 shared task; the key difference is the inference. • See also AISTATS’09 (new theoretical results), IJCAI’05, CL’08 (analysis; ablation study), CoNLL’05 (more parsers). • Easy and fast: 7-8 sentences/second (using Xpress-MP). • A lot of room for improvement (additional constraints). • Demo available: http://L2R.cs.uiuc.edu/~cogcomp

  30. A General Inference Setting • An Integer linear programming (ILP) formulation • General: works on non-sequential constraint structure • Expressive: can represent many types of declarative constraints • Optimal: finds the optimal solution • Fast: commercial packages are able to quickly solve very large problems (hundreds of variables and constraints; sparsity is important)

  31. Issues • Incorporating general constraints (Algorithmic Approach) • Allow both statistical and expressive declarative constraints • Allow non-sequential constraints (generally difficult) • The value of using more constraints • Coupling vs. Decoupling Training and Inference. • Incorporating global constraints is important but • Should it be done only at evaluation time or also in training time? • Issues related to modularity, efficiency and performance

  32. Some Properties of ILP Inference • Allows expressive constraints: any Boolean rule can be represented by a set of linear (in)equalities. • Combines acquired (statistical) constraints with declarative constraints: start with the shortest-path matrix and constraints, and add the new constraints to the basic integer linear program. • Solved using off-the-shelf packages, for example Xpress-MP or CPLEX. • If the additional constraints don’t change the solution, LP is enough; otherwise the computation time depends on sparsity, and is fast in practice.
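
The claim that any Boolean rule maps to linear (in)equalities over 0/1 variables can be checked directly for two common rules (a small illustrative example):

```python
from itertools import product

# "Any Boolean rule can be written as linear (in)equalities over 0/1
# variables": the implication A -> B becomes  x_A <= x_B, and
# "at least one of A, B" becomes  x_A + x_B >= 1.
implies_bool = lambda a, b: (not a) or b
implies_lin  = lambda a, b: a <= b            # on 0/1 integers

at_least_one_bool = lambda a, b: a or b
at_least_one_lin  = lambda a, b: a + b >= 1

for xa, xb in product([0, 1], repeat=2):
    # the linear encodings accept exactly the same assignments
    assert implies_lin(xa, xb) == implies_bool(bool(xa), bool(xb))
    assert at_least_one_lin(xa, xb) == bool(at_least_one_bool(xa, xb))
print("linear encodings match on all 0/1 assignments")
```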

  33. Experiments: Semantic Role Labeling • For each verb in a sentence: identify all constituents that fill a semantic role, and determine their roles, such as Agent, Patient or Instrument. • No two arguments share the same label. • Use the IO representation.

  34. Constraints • No duplicate argument labels (no dup): e.g., no two discontinuous segments are both A0. • Specific token sequences share the same label (cand): derive the argument candidates from the parse tree. • At least one argument in a sentence (argument): not all of the tokens are labeled O. • Given the verb position (verb pos): the label of the verb should be O. • Disallow some arguments (disallow): derived from the frame files in the PropBank corpus.

  35. Results: Contribution of Expressive Constraints • Learning with statistical constraints only; additional constraints added at evaluation time (efficiency)

  36. Issues • Incorporating general constraints (Algorithmic Approach) • Allow both statistical and expressive declarative constraints • Allow non-sequential constraints (generally difficult) • The value of using more constraints • Coupling vs. Decoupling Training and Inference. • Incorporating global constraints is important but • Should it be done only at evaluation time or also in training time? • Issues related to modularity, efficiency and performance May not be relevant in some problems.

  37. Phrase Identification Problem • Use the classifiers’ outcomes to identify phrases. • The final outcome is determined by optimizing the classifiers’ outcomes and the constraints. Input: o1 o2 o3 o4 o5 o6 o7 o8 o9 o10; Classifier 1 proposes phrase openings and Classifier 2 proposes phrase closings (possibly inconsistently); inference produces the output s1 s2 s3 s4 s5 s6 s7 s8 s9 s10. Did this classifier make a mistake? How do we train it?

  38. Training in the Presence of Constraints • General training paradigm: the first term is learning from data (and could be further decomposed); the second term guides the model by constraints. • One can choose whether, when and how the constraints’ weights are trained, or whether the constraints are taken into account only at evaluation time.

  39. Learning the Components Together! Testing: inference with constraints. Two training regimes: L+I (Learning plus Inference): training without constraints; IBT (Inference-Based Training): training with the inference in the loop. Which one is better? When and why?

  40. Learning Local and Global Classifiers • Learning + Inference: No inference used during learning. • For each example (x, y) ∈ D, the learning algorithm attempts to ensure that each component of y produces the correct output. • Global constraints are enforced only at evaluation time. • Inference based Training (IBT): Train for correct global output. • Feedback from the inference process determines which classifiers to provide feedback to; together, the classifiers and the inference yield the desired result. • At each step a subset of the classifiers are updated according to inference feedback. • We study the tradeoff in an online setting (perceptron) L+I: cheaper computationally; modular But intuitively, IBT should be better in the limit
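
The difference between the two regimes can be sketched with toy perceptron updates (the data, the constraint, and the inference procedure below are all illustrative assumptions, not the talk's actual setup):

```python
import copy

# Toy structured-perceptron sketch contrasting L+I and IBT updates
# (binary component labels in {-1, +1}; made-up data).
def dot(w, x): return sum(a * b for a, b in zip(w, x))
def sign(s): return 1 if s >= 0 else -1

def local_update(w, x, y, lr=1.0):
    """L+I step: each component classifier is corrected on its own mistake;
    constraints are ignored until evaluation time."""
    for i, wi in enumerate(w):
        if sign(dot(wi, x)) != y[i]:
            w[i] = [wj + lr * y[i] * xj for wj, xj in zip(wi, x)]

def ibt_update(w, x, y, infer, lr=1.0):
    """IBT step: constrained inference runs first; only the components the
    global prediction got wrong receive feedback."""
    y_hat = infer([dot(wi, x) for wi in w])
    for i, wi in enumerate(w):
        if y_hat[i] != y[i]:
            w[i] = [wj + lr * y[i] * xj for wj, xj in zip(wi, x)]

def infer(scores):
    # constraint: at most one component may be positive
    best = max(range(len(scores)), key=lambda i: scores[i])
    return [1 if (i == best and scores[i] >= 0) else -1
            for i in range(len(scores))]

w0 = [[0.0, 0.0], [0.0, 0.0]]
x, y = [1.0, 2.0], [1, -1]        # gold: only component 0 is positive

w_li = copy.deepcopy(w0)
local_update(w_li, x, y)          # component 1 misfires on its own: updated
w_ibt = copy.deepcopy(w0)
ibt_update(w_ibt, x, y, infer)    # constrained prediction already correct
print(w_li, w_ibt)
```

Here L+I updates component 1 (wrong in isolation), while IBT leaves both weight vectors untouched because the constrained global prediction already matches the gold labeling: the inference feedback decides who gets updated.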

  41. Perceptron-Based Global Learning. Example: true global labeling Y = (-1, 1, -1, -1, 1); local predictions Y’ = (-1, 1, 1, 1, 1); after applying the constraints, Y’ = (-1, 1, -1, 1, 1).

  42. Claims • When the local models are “easy” to learn, L+I outperforms IBT. • In many applications, the components are identifiable and easy to learn (e.g., argument, open-close, PER). • Only when the local problems become difficult to solve in isolation does IBT outperform L+I, and it needs a larger number of training examples. L+I: cheaper computationally; modular. IBT is better in the limit.

  43. L+I vs. IBT: the more identifiable individual problems are, the better overall performance is with L+I Simulated data Generalization bounds can be derived which suggest a similar behavior [AISTATS 09, IJCAI 05]

  44. L+I vs. IBT: SRL Experiment (Accuracy and Training Efficiency) • CRF (global): Learning with all constraints discriminatively • VP: no edge features • When more expressive and informative constraints are available, simple L+I strategy may be better. • For more discussion, see (Punyakanok, Roth, Yih, Zimak, Learning and Inference over Constrained Output, IJCAI-05) and (Roth, Small, Titov, Sequential Learning of Classifiers for Structured Prediction Problems; AISTATS-09)

  45. Information Extraction with Background Knowledge (Constraints). Citation: Lars Ole Andersen . Program analysis and specialization for the C Programming language . PhD thesis . DIKU , University of Copenhagen , May 1994 . Prediction result of a trained HMM: the tokens are split across the fields [AUTHOR] [TITLE] [EDITOR] [BOOKTITLE] [TECH-REPORT] [INSTITUTION] [DATE] with the field boundaries in the wrong places. Violates lots of constraints!

  46. Examples of Constraints • Each field must be a consecutive list of words, and can appear at most once in a citation. • State transitions must occur on punctuation marks. • The citation can only start with AUTHOR or EDITOR. • The words pp., pages correspond to PAGE. • Four digits starting with 20xx or 19xx are a DATE. • Quotations can appear only in TITLE. • …
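
Two of these constraints are easy to express as checks on a token-level field labeling (a minimal sketch; the example labelings are made up):

```python
# Checking a candidate citation labeling against two of the constraints
# above (one field label per token; illustrative data).
def consecutive_and_unique(labels):
    """Each field must label one consecutive block of words and can
    appear at most once in the citation."""
    seen = set()
    for i, f in enumerate(labels):
        if i == 0 or labels[i - 1] != f:   # a new field block opens here
            if f in seen:
                return False               # field re-opened: violation
            seen.add(f)
    return True

def legal_start(labels):
    """The citation can only start with AUTHOR or EDITOR."""
    return labels[0] in ("AUTHOR", "EDITOR")

good = ["AUTHOR", "AUTHOR", "TITLE", "TITLE", "DATE"]
bad  = ["AUTHOR", "TITLE", "AUTHOR", "DATE"]   # AUTHOR appears twice
print(consecutive_and_unique(good), legal_start(good))  # True True
print(consecutive_and_unique(bad))                      # False
```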

  47. Information Extraction with Constraints • Adding constraints, we get correct results! • [AUTHOR] Lars Ole Andersen . [TITLE] Program analysis and specialization for the C Programming language . [TECH-REPORT] PhD thesis . [INSTITUTION] DIKU , University of Copenhagen , [DATE] May, 1994 . • If incorporated into semi-supervised training, better results mean better feedback!

  48. Semi-Supervised Learning with Constraints • In traditional semi-supervised learning, the model can drift away from the correct one. • Constraints can be used: at decision time, to bias the objective function towards favoring constraint satisfaction; at training time, to improve the labeling of the unlabeled data (and thus improve the model).

  49. Constraint-Driven Learning (CODL) [Chang, Ratinov, Roth, ACL’07]
  Supervised learning algorithm parameterized by λ: λ = learn(T)
  For N iterations do:
    T = ∅
    For each x in the unlabeled dataset:
      y ← Inference(x, C, λ)    (inference with constraints)
      T = T ∪ {(x, y)}
  Inference-based augmentation of the training set (feedback).

  50. Constraint-Driven Learning (CODL) [Chang, Ratinov, Roth, ACL’07]
  Supervised learning algorithm parameterized by λ: λ = learn(T)
  For N iterations do:
    T = ∅
    For each x in the unlabeled dataset:
      {y1, …, yK} ← Top-K-Inference(x, C, λ)    (inference with constraints)
      T = T ∪ {(x, yi)}, i = 1…K
    λ = γλ + (1 − γ)·learn(T)
  Inference-based augmentation of the training set (feedback); learn from the new training data; weight the supervised and the newly learned model.
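
A schematic version of the CODL loop, with a deliberately tiny stand-in model: learn(), inference(), the one-parameter threshold model, and the data below are all toy assumptions, not the paper's actual components.

```python
# Schematic CODL loop: constraint-aware inference labels the unlabeled
# data, and the model is interpolated with the newly learned one.
def codl(labeled, unlabeled, learn, inference, constraints,
         iterations=3, gamma=0.9):
    model = learn(labeled)                      # supervised initialization
    for _ in range(iterations):
        # label the unlabeled data with constraint-aware inference
        pseudo = [(x, inference(x, constraints, model)) for x in unlabeled]
        new = learn(labeled + pseudo)           # learn from the augmented set
        # weight the supervised model against the newly learned one
        model = {k: gamma * model[k] + (1 - gamma) * new[k] for k in model}
    return model

# Toy model: label(x) = 1 iff x >= theta; learning places theta midway
# between the classes; the constraint forces any x >= 5 to be labeled 1.
def learn(data):
    lo = [x for x, y in data if y == 0]
    hi = [x for x, y in data if y == 1]
    return {"theta": (max(lo) + min(hi)) / 2}

def inference(x, constraints, model):
    y = 1 if x >= model["theta"] else 0
    for c in constraints:
        y = c(x, y)
    return y

force_high = lambda x, y: 1 if x >= 5 else y
model = codl([(1, 0), (11, 1)], [5.5], learn, inference, [force_high])
print(model["theta"] < 6.0)   # True: the constraint pulled the boundary down
```

Starting from theta = 6, the plain model would label the unlabeled point 5.5 as 0; the constraint corrects that label during the loop, and the interpolated model drifts toward a boundary that satisfies the constraint on its own.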
