Learning and Inference in Natural Language: From Stand-Alone Learning Tasks to Structured Representations
Dan Roth, Department of Computer Science, University of Illinois at Urbana-Champaign

Presentation Transcript


  1. Learning and Inference in Natural Language: From Stand-Alone Learning Tasks to Structured Representations. Dan Roth, Department of Computer Science, University of Illinois at Urbana-Champaign. Joint work with my students: Vasin Punyakanok, Wen-tau Yih, Dav Zimak. Biologically Inspired Computing, Sendai, Japan, Nov. 2004.

  2. Cognitive Computation Group. The group develops theories and systems pertaining to intelligent behavior using a unified methodology. At the heart of the approach is the idea that learning has a central role in intelligence. We have concentrated on developing the theoretical basis within which to address some of the obstacles, and on developing an experimental paradigm so that realistic experiments can be performed to validate the theoretical basis. The emphasis is on large-scale real-world problems in natural language understanding and visual recognition.

  3. Cognitive Computation Group
  • Foundations
    • Learning Theory: classification; multi-class classification; ranking
    • Knowledge Representation: relational representations, relational kernels
    • Inference approaches: structural mappings
  • Intelligent Information Access
    • Information extraction
    • Named entities and relations
    • Matching entity mentions within and across documents and databases
  • Natural Language Processing
    • Semantic role labeling
    • Question answering
    • Semantics
  • Software
    • Basic tools development: SNoW, FEX, shallow parser, POS tagger, semantic parser, …
  This talk covers some of our work on understanding the role of learning in supporting reasoning in the natural language domain.

  4. Comprehension
  (ENGLAND, June, 1989) - Christopher Robin is alive and well. He lives in England. He is the same person that you read about in the book, Winnie the Pooh. As a boy, Chris lived in a pretty home called Cotchfield Farm. When Chris was three years old, his father wrote a poem about him. The poem was printed in a magazine for others to read. Mr. Robin then wrote a book. He made up a fairy tale land where Chris lived. His friends were animals. There was a bear called Winnie the Pooh. There was also an owl and a young pig, called a piglet. All the animals were stuffed toys that Chris owned. Mr. Robin made them come to life with his words. The places in the story were all near Cotchfield Farm. Winnie the Pooh was written in 1925. Children still love to read about Christopher Robin and his animal friends. Most people don't know he is a real person who is grown now. He has written two books of his own. They tell what it is like to be famous.
  1. Who is Christopher Robin?
  2. When was Winnie the Pooh written?
  3. What did Mr. Robin do when Chris was three years old?
  4. Where did young Chris live?
  5. Why did Chris write two books of his own?

  5. What we Know: Stand-Alone Ambiguity Resolution
  • Illinois' bored of education → board
  • ...Nissan Car and truck plant is … / …divide life into plant and animal kingdom
  • (This Art) (can N) (will MD) (rust V) → V, N, N
  • The dog bit the kid. He was taken to a veterinarian / a hospital
  • Tiger was in Washington for the PGA Tour

  6. Comprehension
  (ENGLAND, June, 1989) - Christopher Robin is alive and well. He lives in England. He is the same person that you read about in the book, Winnie the Pooh. As a boy, Chris lived in a pretty home called Cotchfield Farm. When Chris was three years old, his father wrote a poem about him. The poem was printed in a magazine for others to read. Mr. Robin then wrote a book. He made up a fairy tale land where Chris lived. His friends were animals. There was a bear called Winnie the Pooh. There was also an owl and a young pig, called a piglet. All the animals were stuffed toys that Chris owned. Mr. Robin made them come to life with his words. The places in the story were all near Cotchfield Farm. Winnie the Pooh was written in 1925. Children still love to read about Christopher Robin and his animal friends. Most people don't know he is a real person who is grown now. He has written two books of his own. They tell what it is like to be famous.
  1. Who is Christopher Robin?
  2. When was Winnie the Pooh written?
  3. What did Mr. Robin do when Chris was three years old?
  4. Where did young Chris live?
  5. Why did Chris write two books of his own?

  7. Inference

  8. Inference with Classifiers
  • Global decisions in which several local decisions play a role, but there are mutual dependencies on their outcome.
  • Learned classifiers for different sub-problems.
  • Incorporate the classifiers' information, along with constraints, in making coherent decisions: decisions that respect the local classifiers as well as domain- and context-specific constraints.
  • Global inference for the best assignment to all variables of interest.

  9. Overview
  • Stand-Alone Learning
    • Modeling
    • Representational issues
    • Computational issues
  • Inference
    • Making decisions under general constraints
    • Semantic role labeling
  • How to Train Components of Global Decisions
    • A tradeoff that depends on how easy the components are to learn.
    • Feedback to learning is (indirectly) given by the reasoning stage.
    • There may not be a need (or even a possibility) to learn exactly, but only to the extent that it supports reasoning.

  10. Structured Output: Learning of Multi-Valued Outputs
  [Diagram: Structured Input → Feature Mapping (non-linear feature functions over primitive features) → Learning → Structured Output]

  11. Stand-Alone Ambiguity Resolution
  • Illinois' bored of education → board
  • ...Nissan Car and truck plant is … / …divide life into plant and animal kingdom
  • (This Art) (can N) (will MD) (rust V) → V, N, N
  • The dog bit the kid. He was taken to a veterinarian / a hospital
  • Tiger was in Washington for the PGA Tour

  12. Disambiguation Problems
  Middle Eastern ____ are known for their sweetness.
  Task: Decide which of {deserts, desserts} is more likely in the given context.
  Ambiguity is modeled as confusion sets (class labels C):
  • C = {deserts, desserts}
  • C = {Noun, Adj., Verb, …}
  • C = {topic=Finance, topic=Computing}
  • C = {NE=Person, NE=Location}

  13. Learning to Disambiguate
  • Given:
    • a confusion set C = {deserts, desserts}
    • a sentence s: Middle Eastern ____ are known for their sweetness
  • Map the sentence into a feature-based representation: Φ : S → {φ1(s), φ2(s), …}
  • Learn a function FC that determines which member of C = {deserts, desserts} is more likely in a given context: FC(x) = w · Φ(x)
  • Evaluate the function on future sentences

  14. Example: Representation
  S = I don't know whether to laugh or cry
  Consider words, POS tags, and relative location within a window around the target word.
  Generate binary features representing the presence of:
  • a word/POS within a window around the target word: don't within ±3; know within ±3; Verb at -1; to within ±3; laugh within ±3; to at +1
  • conjunctions of size 2, within a window of size 3: words: know__to; __to laugh; POS+words: Verb__to; __to Verb
  The sentence is represented as the set of its active features:
  S = (don't at -2, know within ±3, …, __to Verb, ...)
  Hope: S = I don't care whether to laugh or cry has almost the same representation.
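
To make the feature mapping concrete, here is a minimal Python sketch of window-based feature extraction in the spirit of this slide. The function name, feature-name format, and simplified conjunction handling are illustrative assumptions, not the group's actual FEX tool.

```python
# Illustrative window-based feature extraction (not the actual FEX tool).
def extract_features(words, tags, target, window=3):
    """Return the set of active binary features around position `target`."""
    feats = set()
    lo, hi = max(0, target - window), min(len(words), target + window + 1)
    for i in range(lo, hi):
        if i == target:
            continue
        feats.add(f"word={words[i]}@win")         # word within +/- window
        feats.add(f"pos={tags[i]}@{i - target}")  # POS at a relative position
    ctx = [words[i] for i in range(lo, hi) if i != target]
    for a, b in zip(ctx, ctx[1:]):                # conjunctions of size 2
        feats.add(f"conj={a}__{b}")
    return feats

words = "I don't know whether to laugh or cry".split()
tags = ["PRP", "VBP", "VB", "IN", "TO", "VB", "CC", "VB"]
print(sorted(extract_features(words, tags, target=3)))  # features around "whether"
```

As the slide hopes, running this on "I don't care whether to laugh or cry" yields almost the same active set; only the features mentioning know/care change.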

  15. Structured Input: Features Can Be Complex
  [Diagram: a graph over S = John will join the board as a director, with nodes carrying attributes such as Word=, POS=, IS-A=, …]
  Feature extraction can be an involved process; it builds on previous learners; it is computationally hard; some algorithms (perceptron) support an implicit mapping.
  [NP Which type] [PP of] [NP submarine] [VP was bought] [ADVP recently] [PP by] [NP South Korea] (. ?)

  16. Notes on Representation
  • A feature is a function over sentences, which maps a sentence to a set of properties of the sentence: φ : S → {0,1} or [0,1]
  • There is a huge number of potential features (~10^5); of these, only a small number is actually active in each example.
  • Representation: list only the features that are active (non-zero) in the example.
  • When the number of features is fixed, the collection of examples is {φ1(s), φ2(s), …, φn(s)} ∈ {0,1}^n. There is no need to fix the number of features (on-line algorithms): with an infinite attribute domain, {φ1(s), φ2(s), …} ∈ {0,1}^∞.
  • Some algorithms can make use of variable-size input.
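
The "list only active features" idea is easy to sketch: if an example is a set of active feature names and the weight vector is a dictionary, the dot product w · x never needs a fixed dimension, which is exactly what the infinite attribute domain requires. The feature names below are hypothetical.

```python
# Sparse binary examples: w . x is a sum over the active features only.
def score(weights: dict, active: set) -> float:
    return sum(weights.get(f, 0.0) for f in active)  # unseen features cost 0

w = {"word=know@win": 0.7, "conj=to__laugh": 1.2}
x = {"word=know@win", "conj=to__laugh", "pos=TO@1"}   # includes a never-seen feature
print(score(w, x))  # 1.9
```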

  17. Embedding
  [Figure: whether/weather examples embedded in a higher-dimensional feature space; the new discriminator is functionally simpler.]

  18. Natural Language: Domain Characteristics
  • The number of potential features is very large.
  • The instance space is sparse.
  • Decisions depend on a small set of features (sparse).
  • We want to learn from a number of examples that is small relative to the dimensionality.

  19. Algorithm Descriptions
  • Focus: two families of on-line algorithms
  • Examples x ∈ {0,1}^n; hypothesis w ∈ R^n; prediction: sgn(w · x - θ)
  • Additive weight-update algorithm (Perceptron; Rosenblatt, 1958; variations exist)
  • Multiplicative weight-update algorithm (Winnow; Littlestone, 1988; variations exist)
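
A hedged sketch of the two update rules over sparse examples; the learning rate, promotion factor, and thresholds are illustrative defaults, not tuned values from the talk.

```python
from collections import defaultdict

def perceptron_update(w, active, y, theta=0.0, lr=1.0):
    """Additive rule: on a mistake, add lr*y to every active weight."""
    yhat = 1 if sum(w[f] for f in active) > theta else -1
    if yhat != y:
        for f in active:
            w[f] += lr * y
    return yhat

def winnow_update(w, active, y, theta, alpha=2.0):
    """Multiplicative rule: on a mistake, scale active weights by alpha."""
    yhat = 1 if sum(w[f] for f in active) > theta else -1
    if yhat != y:
        factor = alpha if y == 1 else 1.0 / alpha
        for f in active:
            w[f] *= factor
    return yhat

w_add = defaultdict(float)         # Perceptron weights start at 0
w_mul = defaultdict(lambda: 1.0)   # Winnow weights start at 1
perceptron_update(w_add, {"f1", "f2"}, y=1)
winnow_update(w_mul, {"f1", "f2"}, y=1, theta=4.0)
```

Both rules are mistake-driven; the difference is only whether a mistake adds to or multiplies the active weights, which is what drives the contrasting bounds on the next slide.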

  20. Generalization
  • Dominated by the sparseness of the function space
    • Most features are irrelevant → advantage to multiplicative algorithms.
    • The number of examples required by multiplicative algorithms depends mostly on the number of relevant features.
    • Generalization bounds depend on ||w||.
  • Lesser issue: sparseness of the feature space
    • Very few active features → advantage to additive algorithms.
    • Generalization bounds depend on ||x|| [Kivinen & Warmuth '95].

  21. Mistake Bounds for "10 of 100 of n"
  Function: at least 10 out of a fixed 100 variables are active; the dimensionality is n.
  [Plot: number of mistakes to convergence vs. n, the total number of variables (dimensionality). The Perceptron/SVM curve grows with n; the Winnow curve stays nearly flat.]

  22. Multiclass Classification in NLP
  • Named Entity Recognition: label people, locations, and organizations in a sentence
    • [PER Sam Houston], [born in] [LOC Virginia], [was a member of the] [ORG US Congress].
  • Decompose into sub-problems:
    • Sam Houston, born in Virginia... → (PER, LOC, ORG, ?) → PER (1)
    • Sam Houston, born in Virginia... → (PER, LOC, ORG, ?) → None (0)
    • Sam Houston, born in Virginia... → (PER, LOC, ORG, ?) → LOC (2)
  • Input: {0,1}^d or R^d
  • Output: {0,1,2,3,...,k}

  23. Solving Multi-Class via Binary Learning
  • Decompose; use Winner-Take-All:
    • y = argmax_i (w_i · x + t_i)
    • w_i, x ∈ R^n, t_i ∈ R
  • (Pairwise classification is also possible.)
  • Key issue: how to train the binary classifiers w_i
    • Via the Kessler construction (comparative training), which allows learning Voronoi diagrams; equivalently, learn in nk dimensions.
    • 1-vs-all is not expressive enough.
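
Winner-take-all prediction is one line once the per-class weight vectors are stacked; the matrix below is a made-up stand-in for trained classifiers (training, e.g. via the Kessler construction, is omitted).

```python
import numpy as np

def wta_predict(W, t, x):
    """y = argmax_i (w_i . x + t_i), with W of shape (k, n)."""
    return int(np.argmax(W @ x + t))

W = np.array([[0.5, -0.2, 0.0],    # k = 3 classes, n = 3 features
              [0.1,  0.8, -0.3],
              [0.0,  0.0,  0.6]])
t = np.array([0.0, -0.1, 0.2])
print(wta_predict(W, t, np.array([1.0, 0.0, 1.0])))  # -> 2
```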

  24. Detour - Basic Classifier: SNoW
  • A learning architecture that supports several linear update rules (Winnow, Perceptron, naive Bayes)
  • Allows regularization; pruning; many options
  • True multi-class classification [Har-Peled, Roth & Zimak, NIPS 2003]
  • Variable-size examples; very good support for large-scale domains like NLP, both in terms of the number of examples and the number of features
  • Very efficient (1-2 orders of magnitude faster than SVMs)
  • Integrated with an expressive Feature EXtraction Language (FEX)
  [Download from: http://L2R.cs.uiuc.edu/~cogcomp]

  25. Summary: Stand-Alone Classification
  • The theory is well understood
    • Generalization bounds
    • Practical issues
  • Essentially all work is done with linear representations
    • Features: generated explicitly or implicitly (kernels)
    • The tradeoff here is relatively well understood
  • Success on a large number of large-scale classification problems
  • Key issues:
    • Features: how to decide which features are good; how to compute/extract features (intermediate representations)
    • Supervision: the learning protocol

  26. Overview
  • Stand-Alone Learning
    • Modeling
    • Representational issues
    • Computational issues
  • Inference
    • Making decisions under general constraints
    • Semantic role labeling
  • How to Train Components of Global Decisions
    • A tradeoff that depends on how easy the components are to learn.
    • Feedback to learning is (indirectly) given by the reasoning stage.
    • There may not be a need (or even a possibility) to learn exactly, but only to the extent that it supports reasoning.

  27. Comprehension
  (ENGLAND, June, 1989) - Christopher Robin is alive and well. He lives in England. He is the same person that you read about in the book, Winnie the Pooh. As a boy, Chris lived in a pretty home called Cotchfield Farm. When Chris was three years old, his father wrote a poem about him. The poem was printed in a magazine for others to read. Mr. Robin then wrote a book. He made up a fairy tale land where Chris lived. His friends were animals. There was a bear called Winnie the Pooh. There was also an owl and a young pig, called a piglet. All the animals were stuffed toys that Chris owned. Mr. Robin made them come to life with his words. The places in the story were all near Cotchfield Farm. Winnie the Pooh was written in 1925. Children still love to read about Christopher Robin and his animal friends. Most people don't know he is a real person who is grown now. He has written two books of his own. They tell what it is like to be famous.
  1. Who is Christopher Robin?
  2. When was Winnie the Pooh written?
  3. What did Mr. Robin do when Chris was three years old?
  4. Where did young Chris live?
  5. Why did Chris write two books of his own?

  28. Identifying Phrase Structure
  • Classifiers
    • Recognizing "the beginning of an NP"
    • Recognizing "the end of an NP"
    • (or: word-based classifiers, the BIO representation)
    • Also for other kinds of phrases…
  • Some constraints
    • Phrases do not overlap
    • Order of phrases
    • Length of phrases
  • Use the classifiers to infer a coherent set of phrases:
  He reckons the current account deficit will narrow to only # 1.8 billion in September
  [NP He] [VP reckons] [NP the current account deficit] [VP will narrow] [PP to] [NP only # 1.8 billion] [PP in] [NP September]

  29. Sequential Constraint Structure Allows for Dynamic-Programming-Based Inference
  [Diagrams: two state-observation models over states s1…s6 and observations o1…o6]
  • Sequential constraints
    • Three models for sequential inference with classifiers [Punyakanok & Roth, NIPS'01; JMLR '05]:
      • HMM; HMM with classifiers
      • Conditional models
      • Constraint satisfaction models (CSCL: more general constraints)
    • Other models have been proposed that can deal with sequential structures: conditional models (other classifiers), CRFs, structured Perceptron [later]
    • Many applications: shallow parsing, named entities, biological sequences
    • A minimal dynamic-programming sketch follows below.
  • General constraint structure
    • An integer/linear programming formulation [Roth & Yih '02, '03, '04]. No dynamic programming.
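
For the sequential case, here is a minimal Viterbi-style sketch of inference over per-word classifier scores under one hard BIO constraint (an I cannot follow an O, and a sequence cannot start with I). The scores are made-up stand-ins for trained local classifiers; none of the three models above is reproduced exactly.

```python
LABELS = ["B", "I", "O"]
FORBIDDEN = {("O", "I")}           # hard constraint on adjacent labels

def viterbi(scores):
    """scores[t][label]: local (log-)score for `label` at position t."""
    best = [{lab: s for lab, s in scores[0].items() if lab != "I"}]
    back = [{}]
    for t in range(1, len(scores)):
        best.append({}); back.append({})
        for lab in LABELS:
            cands = [(best[t - 1][p] + scores[t][lab], p)
                     for p in best[t - 1] if (p, lab) not in FORBIDDEN]
            best[t][lab], back[t][lab] = max(cands)
    lab = max(best[-1], key=best[-1].get)      # best final label
    seq = [lab]
    for t in range(len(scores) - 1, 0, -1):    # follow back-pointers
        lab = back[t][lab]
        seq.append(lab)
    return seq[::-1]

scores = [{"B": 0.0, "I": -2.0, "O": -0.5},
          {"B": -1.0, "I": -0.2, "O": -0.9},
          {"B": -1.5, "I": -1.0, "O": -0.1}]
print(viterbi(scores))  # ['B', 'I', 'O']
```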

  30. Identifying Entities and Relations
  J.V. Oswald was murdered at JFK after his assassin, K. F. Johns…
  Identify: Kill(X, Y)
  [Diagram: the sentence annotated with the entity labels person, person, location]
  • Identify named entities
  • Identify relations between entities
  • Exploit mutual dependencies between named entities and relations to yield a coherent global detection
  • Some knowledge (classifiers) may be known in advance; some constraints may be available only at decision time

  31. Inference with Classifiers
  • Scenario: global decisions in which several local decisions/components play a role, but there are mutual dependencies on their outcome.
  • Assume: learned classifiers for different sub-problems; constraints on the classifiers' labels (known during training, or only at evaluation time).
  • Goal: incorporate the classifiers' predictions, along with the constraints, in making coherent decisions that respect the classifiers as well as domain/context-specific constraints.
  • Formally: global inference for the best assignment to all variables of interest.

  32. Setting
  • Inference with classifiers is not a new idea
    • On sequential constraint structure: HMM, PMM [Punyakanok & Roth], CRF [Lafferty et al.], CSCL [Punyakanok & Roth]
    • On general structure: heuristic search
    • Attempts to use Bayesian networks [Roth & Yih '02] have problems
  • The proposed integer linear programming (ILP) formulation is:
    • General: works on non-sequential constraint structure
    • Expressive: can represent many types of constraints
    • Optimal: finds the optimal solution
    • Fast: commercial packages are able to quickly solve very large problems (hundreds of variables and constraints)

  33. Problem Setting (1/2): Everything is Linear
  [Diagram: variables x1…x8 with constraints C(x1, x4) and C(x2, x3, x6, x7, x8) (+ weights WC)]
  • Random variables X
  • Conditional distributions P (learned by classifiers)
  • Constraints C: any Boolean function defined on partial assignments (with possible weights W on constraints)
  • Goal: find the "best" assignment, the one that achieves the highest global accuracy
  • This is an integer programming problem: X* = argmax_X P(X) subject to constraints C

  34. Integer Linear Programming
  • A set of binary variables x = (x1, …, xd)
  • A cost vector p ∈ R^d
  • Constraint matrices C1 ∈ R^{t×d} and C2 ∈ R^{r×d}, where t and r are the numbers of inequality and equality constraints, and d is the number of variables
  • The ILP solution x* is the vector that maximizes the cost function:
    x* = argmax_{x ∈ {0,1}^d} p · x
    subject to C1 x ≥ b1 and C2 x = b2, with b1 ∈ R^t and b2 ∈ R^r
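
A toy instance of this formulation, solved with the open-source PuLP package rather than the commercial solvers the talk mentions; the cost vector and constraints are made up for illustration.

```python
from pulp import LpProblem, LpVariable, LpMaximize, lpSum, value

p = [0.3, 0.6, 0.5, 0.4]                        # cost vector (made up)
x = [LpVariable(f"x{i}", cat="Binary") for i in range(4)]

prob = LpProblem("toy_ilp", LpMaximize)
prob += lpSum(pi * xi for pi, xi in zip(p, x))  # objective: maximize p . x
prob += x[0] + x[1] <= 1                        # one inequality constraint
prob += x[2] + x[3] == 1                        # one equality constraint
prob.solve()
print([int(value(xi)) for xi in x])             # -> [0, 1, 1, 0]
```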

  35. Problem Setting (2/2)
  • A very general formalism, with connections to a large number of well-studied optimization problems and a variety of applications
  • Justification:
    • A direct argument for the appropriate "best assignment"
    • Relations to Markov random fields (but better computationally)
  • Significant modeling and computational advantages

  36. Semantic Role Labeling
  • For each verb in a sentence:
    • identify all constituents that fill a semantic role
    • determine their roles: Agent, Patient, or Instrument, …, and their adjuncts, e.g., Locative, Temporal, or Manner
  • The PropBank project [Kingsbury & Palmer '02] provides a large human-annotated corpus of semantic verb-argument relations.
  • Experiment: the CoNLL-2004 shared task [Carreras & Marquez '04], with no parsed data in the input.

  37. Example
  (For the verb leave, as in "I left my nice pearls to her":)
  • A0 represents the leaver
  • A1 represents the thing left
  • A2 represents the benefactor
  • AM-LOC is an adjunct indicating the location of the action
  • V marks the verb

  38. Argument Types
  • A0-A5 and AA have different semantics for each verb, as specified in the PropBank frame files.
  • There are 13 types of adjuncts, labeled AM-XXX, where XXX specifies the adjunct type.
  • C-ARG is used to specify the continuity of the argument ARG.
  • In some cases the actual agent is labeled as the appropriate argument type, ARG, while the relative pronoun is instead labeled R-ARG.

  39. Examples
  • C-ARG
  • R-ARG

  40. Algorithm
  • I. Find potential argument candidates (filtering)
  • II. Classify arguments to types
  • III. Inference for argument structure
    • Cost function
    • Constraints
    • Integer linear programming (ILP)

  41. I. Find Potential Arguments
  I left my nice pearls to her
  [Diagram: candidate bracketings over the sentence]
  • An argument can be any set of consecutive words
  • Restrict potential arguments:
    • Classify BEGIN(word): BEGIN(word) = 1 means "word begins an argument"
    • Classify END(word): END(word) = 1 means "word ends an argument"
    • (wi, ..., wj) is a potential argument iff BEGIN(wi) = 1 and END(wj) = 1
  • This reduces the set of potential arguments (PotArg); a small sketch follows below.
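
Step I in code: given (hypothetical) hard BEGIN/END predictions, the candidate arguments are just the consistent spans. The classifier outputs below are hard-coded stand-ins for learned predictions.

```python
def potential_arguments(begin, end, max_len=10):
    """All spans (i, j) with BEGIN(w_i) = 1 and END(w_j) = 1, i <= j."""
    return [(i, j)
            for i, b in enumerate(begin) if b
            for j, e in enumerate(end) if e and i <= j < i + max_len]

words = "I left my nice pearls to her".split()
begin = [1, 1, 1, 0, 0, 1, 1]   # pretend BEGIN classifier outputs
end   = [1, 1, 0, 0, 1, 0, 1]   # pretend END classifier outputs
for i, j in potential_arguments(begin, end):
    print(" ".join(words[i:j + 1]))
```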

  42. II. Argument Type Likelihood
  I left my nice pearls to her
  • Assign a type likelihood: how likely is it that argument a is of type t?
  • For all a ∈ PotArg, t ∈ T, estimate P(argument a = type t)
  [Table: e.g., over the types (A0, C-A1, A1, Ø), one candidate gets probabilities (0.3, 0.2, 0.2, 0.3) and another (0.6, 0.0, 0.0, 0.4)]

  43. Details - Phrase-Level Classifier
  • Learn a classifier (SNoW), ARGTYPE(arg):
    • Φ(arg) → {A0, A1, ..., C-A0, ..., AM-LOC, ...}
    • argmax_{t ∈ {A0, A1, ..., C-A0, ..., AM-LOC, ...}} w_t · Φ(arg)
  • Estimate probabilities with a softmax over the SNoW activations:
    P(a = t) = exp(w_t · Φ(a)) / Z
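
The softmax step is small enough to show directly; the activation values w_t · Φ(a) below are invented for illustration.

```python
import math

def softmax(activations: dict) -> dict:
    """Turn raw activations w_t . Phi(a) into probabilities P(a = t)."""
    z = sum(math.exp(v) for v in activations.values())
    return {t: math.exp(v) / z for t, v in activations.items()}

print(softmax({"A0": 1.2, "A1": 0.3, "C-A1": -0.5, "null": 0.0}))
```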

  44. What is a Good Assignment?
  • Likelihood of being correct: P(arg a = type t), if t is the correct type for argument a
  • For a set of arguments a1, a2, ..., an, the expected number of arguments that are correct is Σ_i P(a_i = t_i)
  • The solution is the assignment with the maximum expected number of correct arguments.

  45. Inference
  I left my nice pearls to her
  [Table: P(a = t) for four candidate spans over the types (A0, C-A1, A1, Ø):
    a1: 0.3, 0.2, 0.2, 0.3 | a2: 0.6, 0.0, 0.0, 0.4 | a3: 0.1, 0.3, 0.5, 0.1 | a4: 0.1, 0.2, 0.3, 0.4]
  • Maximize the expected number correct: T* = argmax_T Σ_i P(a_i = t_i)
  • Subject to some constraints: structural and linguistic (e.g., R-A1 → A1)
  • Independent max: cost = 0.3 + 0.6 + 0.5 + 0.4 = 1.8
  • Non-overlapping: cost = 0.3 + 0.4 + 0.5 + 0.4 = 1.6
  • Non-overlapping and linguistic constraints: cost = 0.3 + 0.4 + 0.3 + 0.4 = 1.4

  46. LP Formulation - Linear Cost
  • Cost function: Σ_{a ∈ PotArg, t ∈ T} P(a = t) · x_{a = t}; this corresponds to maximizing the expected number of correct phrases
  • Indicator variables x_{a1 = A0}, x_{a1 = A1}, …, x_{a4 = AM-LOC}, x_{a4 = Ø} ∈ {0,1}
  • Total cost = p(a1 = A0) · x(a1 = A0) + p(a1 = Ø) · x(a1 = Ø) + … + p(a4 = Ø) · x(a4 = Ø)

  47. Linear Constraints (1/2)
  • Binary values: ∀ a ∈ PotArg, t ∈ T: x_{a = t} ∈ {0,1}
  • Unique labels: ∀ a ∈ PotArg: Σ_{t ∈ T} x_{a = t} = 1
  • No overlapping or embedding: if a1 and a2 overlap, then x_{a1 = Ø} + x_{a2 = Ø} ≥ 1

  48. Linear Constraints (2/2)
  Any Boolean rule can be encoded as a linear constraint.
  • No duplicate argument classes: Σ_{a ∈ PotArg} x_{a = A0} ≤ 1
  • R-ARG (if there is an R-ARG phrase, there is an ARG phrase):
    ∀ a2 ∈ PotArg: Σ_{a ∈ PotArg} x_{a = A0} ≥ x_{a2 = R-A0}
  • C-ARG (if there is a C-ARG phrase, there is an ARG phrase before it):
    ∀ a2 ∈ PotArg: Σ_{a ∈ PotArg, a before a2} x_{a = A0} ≥ x_{a2 = C-A0}
  • Many other constraints are possible:
    • Exactly one argument of type Z
    • If the verb is of type A, no argument of type B
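
Putting slides 45-48 together, here is a hedged PuLP sketch of the inference step: maximize the expected number of correct arguments subject to unique labels, non-overlap, and no duplicate A0. The probability table reuses the toy numbers of slide 45, but the overlap structure is invented, so the optimum here need not match the slide's.

```python
from pulp import LpProblem, LpVariable, LpMaximize, lpSum, value

TYPES = ["A0", "C-A1", "A1", "null"]
P = {  # P(a = t) per candidate span, toy numbers from slide 45
    "a1": [0.3, 0.2, 0.2, 0.3],
    "a2": [0.6, 0.0, 0.0, 0.4],
    "a3": [0.1, 0.3, 0.5, 0.1],
    "a4": [0.1, 0.2, 0.3, 0.4],
}
OVERLAPS = [("a1", "a2")]  # invented: pretend a1 and a2 share words

x = {(a, t): LpVariable(f"x_{a}_{t.replace('-', '_')}", cat="Binary")
     for a in P for t in TYPES}
prob = LpProblem("srl_inference", LpMaximize)
prob += lpSum(P[a][i] * x[a, t] for a in P for i, t in enumerate(TYPES))
for a in P:                                   # unique label per candidate
    prob += lpSum(x[a, t] for t in TYPES) == 1
for a1, a2 in OVERLAPS:                       # overlap: at least one is null
    prob += x[a1, "null"] + x[a2, "null"] >= 1
prob += lpSum(x[a, "A0"] for a in P) <= 1     # no duplicate A0
prob.solve()
print({a: next(t for t in TYPES if value(x[a, t]) > 0.5) for a in P})
```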

  49. Discussion
  • The inference approach was also used for simultaneous named entity and relation identification (CoNLL'04)
  • A few other problems are in progress
  • Global inference helps!
    • All constraints vs. only non-overlapping constraints: error reduction > 5%; > 1% absolute F1
    • A lot of room for improvement (additional constraints)
  • Easy and fast: 70-80 sentences/second (using Xpress-MP)
  • Modeling and implementation details:
    • http://L2R.cs.uiuc.edu/~cogcomp
    • http://www.scottyih.org/html/publications.html#ILP

  50. Overview
  • Stand-Alone Learning
    • Modeling
    • Representational issues
    • Computational issues
  • Inference
    • Semantic role labeling
    • Making decisions under general constraints
  • How to Train Components of Global Decisions
    • A tradeoff that depends on how easy the components are to learn.
    • Feedback to learning is (indirectly) given by the reasoning stage.
    • There may not be a need (or even a possibility) to learn exactly, but only to the extent that it supports reasoning.
