  1. Natural Language Processing COMPSCI 423/723 Rohit Kate

  2. Conditional Random Fields (CRFs) for Sequence Labeling Some of the slides have been adapted from Raymond Mooney’s NLP course at UT Austin.

  3. Graphical Models • If no assumption of independence is made, then an exponential number of parameters must be estimated, and no realistic amount of training data is sufficient to estimate so many parameters. • If a blanket assumption of conditional independence is made, efficient training and inference are possible, but such a strong assumption is rarely warranted. • Graphical models use directed or undirected graphs over a set of random variables to explicitly specify variable dependencies, allowing less restrictive independence assumptions while limiting the number of parameters that must be estimated. • Bayesian Networks: directed acyclic graphs that indicate causal structure. • Markov Networks: undirected graphs that capture general dependencies.

  4. Bayesian Networks • Directed Acyclic Graph (DAG) • Nodes are random variables • Edges indicate causal influences • (Diagram: the burglary network with nodes Burglary, Earthquake, Alarm, JohnCalls, MaryCalls)

  5. Conditional Probability Tables • Each node has a conditional probability table (CPT) that gives the probability of each of its values given every possible combination of values for its parents (conditioning case). • Roots (sources) of the DAG that have no parents are given prior probabilities. • (Diagram: the Burglary/Earthquake/Alarm/JohnCalls/MaryCalls network with a CPT attached to each node)

  6. Joint Distributions for Bayes Nets • A Bayesian Network implicitly defines (factors) a joint distribution: P(X1, …, Xn) = Πi P(Xi | Parents(Xi)) • Example (burglary network): P(B, E, A, J, M) = P(B) P(E) P(A | B, E) P(J | A) P(M | A)
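A minimal Python sketch of how this factorization turns CPTs into joint probabilities for the burglary network on the previous slides; the CPT numbers below are the usual textbook values and are assumptions for illustration only:

```python
# Sketch: joint probability via the Bayes-net factorization
# P(B, E, A, J, M) = P(B) P(E) P(A|B,E) P(J|A) P(M|A).
# CPT values are illustrative textbook numbers, not taken from the slides.
P_B = {True: 0.001, False: 0.999}                 # P(Burglary)
P_E = {True: 0.002, False: 0.998}                 # P(Earthquake)
P_A = {(True, True): 0.95, (True, False): 0.94,   # P(Alarm=True | B, E)
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}                   # P(JohnCalls=True | Alarm)
P_M = {True: 0.70, False: 0.01}                   # P(MaryCalls=True | Alarm)

def joint(b, e, a, j, m):
    """P(B=b, E=e, A=a, J=j, M=m) as a product of local conditionals."""
    p_a = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    p_j = P_J[a] if j else 1 - P_J[a]
    p_m = P_M[a] if m else 1 - P_M[a]
    return P_B[b] * P_E[e] * p_a * p_j * p_m

# Both neighbors call, the alarm rang, but no burglary and no earthquake:
print(joint(False, False, True, True, True))   # ~0.00063
```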

  7. Naïve Bayes as a Bayes Net • Naïve Bayes is a simple Bayes Net (Diagram: class node Y with directed edges to features X1, X2, …, Xn) • Priors P(Y) and conditionals P(Xi|Y) for Naïve Bayes provide CPTs for the network.

  8. HMMs as Bayesian Network • The directed probabilistic graphical model for the random variables w1 … wn and t1 … tn, with tag-transition distributions P(t1), P(t2|t1), …, P(tn|tn−1) and emission distributions P(w1|t1), …, P(wn|tn) (Diagram: tag chain t1 → t2 → … → tn, with each ti emitting wi) • Under these independence assumptions the joint factors as P(w1..wn, t1..tn) = P(t1) P(w1|t1) Πi=2..n P(ti|ti−1) P(wi|ti)

  9. Drawbacks of HMMs • HMMs are generative models and are not directly designed to maximize the performance of sequence labeling. They model the joint distribution P(O,Q) and thus only indirectly model P(Q|O), which is what is needed for the sequence labeling task (O: observation sequence, Q: label sequence). • Can’t use arbitrary features related to the words (e.g. capitalization, prefixes, etc. that can help POS tagging) unless these are explicitly modeled as part of the observations.

  10. Undirected Graphical Model • Also called a Markov Network or Random Field • An undirected graph over a set of random variables, where an edge represents a dependency • The Markov blanket of a node X in a Markov net is the set of its neighbors in the graph (nodes that have an edge connecting to X) • Every node in a Markov net is conditionally independent of every other node given its Markov blanket

  11. Sample Markov Network • (Diagram: the Burglary, Earthquake, Alarm, JohnCalls, MaryCalls network redrawn with undirected edges)

  12. Distribution for a Markov Network • The distribution of a Markov net is most compactly described in terms of a set of potential functions (a.k.a. factors, compatibility functions), φk, one for each clique k in the graph. • For each joint assignment of values to the variables in clique k, φk assigns a non-negative real value that represents the compatibility of these values. • The joint distribution of a Markov network is then defined by P(x) = (1/Z) Πk φk(x{k}), where x{k} represents the joint assignment of the variables in clique k, and Z = Σx Πk φk(x{k}) is a normalizing constant that makes the joint distribution sum to 1.
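To make the potentials and the normalizer Z concrete, here is a minimal Python sketch over a toy two-clique network with three binary variables; the clique structure and potential values are illustrative assumptions, not from the slides:

```python
from itertools import product

# Toy Markov network over binary variables A, B, C with cliques {A, B} and {B, C}.
# Potential values express compatibility, not probability; they need not sum to 1.
phi_AB = {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 5.0}
phi_BC = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 4.0}

def unnormalized(a, b, c):
    """Product of clique potentials for one joint assignment."""
    return phi_AB[(a, b)] * phi_BC[(b, c)]

# Z sums the unnormalized score over every joint assignment.
Z = sum(unnormalized(a, b, c) for a, b, c in product([0, 1], repeat=3))

def prob(a, b, c):
    """P(A=a, B=b, C=c) = (1/Z) * product over cliques of phi_k(x_{k})."""
    return unnormalized(a, b, c) / Z

assert abs(sum(prob(a, b, c) for a, b, c in product([0, 1], repeat=3)) - 1.0) < 1e-9
print(prob(1, 1, 1))   # 20/42, the most compatible assignment
```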

  13. Sample Markov Network • (Diagram: the Burglary, Earthquake, Alarm, JohnCalls, MaryCalls Markov network again, now with potential functions on its cliques)

  14. Discriminative Markov Network or Conditional Random Field Directly models P(Y|X) The potential functions could be based on arbitrary features of X and Y and they are expressed as exponentials

  15. Random Field (Undirected Graphical Model) vs. Conditional Random Field (CRF) • (Diagrams: a generic random field over variables v1, v2, …, vn; and a CRF with output variables Y1, Y2, …, Yn conditioned on inputs X1, X2, …, Xn) • A CRF has two types of variables, X and Y, and there is no factor with only X variables.

  16. Linear-Chain Conditional Random Field (CRF) • The Ys are connected in a linear chain: Y1 — Y2 — … — Yn, conditioned on the inputs X1, X2, …, Xn (Diagram)

  17. Logistic Regression as the Simplest CRF • Logistic regression is a simple CRF with only one output variable Y (Diagram: Y connected to X1, X2, …, Xn) • Models the conditional distribution P(Y | X) and not the full joint P(X,Y)

  18. Simplification Assumption for MaxEnt • The probability P(Y|X1..Xn) can be factored as a log-linear (exponential) model over weighted feature functions: P(Y | X1..Xn) = (1/Z(X)) exp(Σk λk fk(X, Y)), where Z(X) = ΣY′ exp(Σk λk fk(X, Y′)) normalizes over the candidate labels.
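A minimal Python sketch of this log-linear form for a single output variable: score each candidate label by a weighted sum of feature functions and normalize with a softmax. The particular features and λ weights are made-up assumptions for illustration:

```python
import math

LABELS = ["NOUN", "VERB"]   # illustrative label set

def features(x, y):
    """Binary feature functions f_k(X, Y) for one token x and a candidate label y."""
    return {
        f"word={x.lower()}|label={y}": 1.0,
        f"capitalized|label={y}": 1.0 if x[0].isupper() else 0.0,
        f"suffix=s|label={y}": 1.0 if x.endswith("s") else 0.0,
    }

weights = {                         # lambda_k values; learned from data in practice
    "word=dogs|label=NOUN": 2.0,
    "suffix=s|label=NOUN": 0.5,
    "suffix=s|label=VERB": 0.3,
}

def p_y_given_x(x):
    """P(Y|X) = exp(sum_k lambda_k f_k(X, Y)) / sum_Y' exp(sum_k lambda_k f_k(X, Y'))."""
    scores = {y: sum(weights.get(k, 0.0) * v for k, v in features(x, y).items())
              for y in LABELS}
    z = sum(math.exp(s) for s in scores.values())
    return {y: math.exp(s) / z for y, s in scores.items()}

print(p_y_given_x("dogs"))   # NOUN gets most of the probability mass
```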

  19. Generative vs. Discriminative Sequence Labeling Models • HMMs are generative models and are not directly designed to maximize the performance of sequence labeling; they model the joint distribution P(O,Q). • HMMs are trained to have an accurate probabilistic model of the underlying language, and not all aspects of this model benefit the sequence labeling task. • Conditional Random Fields (CRFs) are specifically designed and trained to maximize performance of sequence labeling; they model the conditional distribution P(Q | O).

  20. Classification • Generative: Naïve Bayes (Diagram: Y with arrows to X1, X2, …, Xn) • Discriminative/Conditional: Logistic Regression (Diagram: Y conditioned on X1, X2, …, Xn)

  21. Sequence Labeling • Generative: HMM (Diagram: label chain Y1 → Y2 → … → YT, each Yt emitting Xt) • Discriminative/Conditional: Linear-chain CRF (Diagram: label chain Y1 — Y2 — … — YT conditioned on X1, X2, …, XT)

  22. Simple Linear Chain CRF Features • Modeling the conditional distribution is similar to that used in multinomial logistic regression. • Create feature functions fk(Yt, Yt−1, Xt) • Feature for each state transition pair i, j: fi,j(Yt, Yt−1, Xt) = 1 if Yt = i and Yt−1 = j, and 0 otherwise • Feature for each state observation pair i, o: fi,o(Yt, Yt−1, Xt) = 1 if Yt = i and Xt = o, and 0 otherwise • Note: the number of features grows quadratically in the number of states (i.e. tags).
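A minimal Python sketch of these two indicator-feature families (state-transition and state-observation features), using a small illustrative tag set and vocabulary:

```python
TAGS = ["DT", "NN", "VB"]          # illustrative tag set
VOCAB = ["the", "dog", "barks"]    # illustrative observation vocabulary

def f_transition(i, j):
    """f_{i,j}(Y_t, Y_{t-1}, X_t) = 1 if Y_t = i and Y_{t-1} = j, else 0."""
    return lambda y_t, y_prev, x_t: 1.0 if (y_t == i and y_prev == j) else 0.0

def f_observation(i, o):
    """f_{i,o}(Y_t, Y_{t-1}, X_t) = 1 if Y_t = i and X_t = o, else 0."""
    return lambda y_t, y_prev, x_t: 1.0 if (y_t == i and x_t == o) else 0.0

feature_functions = (
    [f_transition(i, j) for i in TAGS for j in TAGS] +    # |TAGS|^2 transition features
    [f_observation(i, o) for i in TAGS for o in VOCAB]    # |TAGS| * |VOCAB| observation features
)
print(len(feature_functions))   # 3*3 + 3*3 = 18 for this toy setting
```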

  23. Conditional Distribution for Linear Chain CRF • Using these feature functions for a simple linear chain CRF, we can define: P(Y | X) = (1/Z(X)) exp( Σt Σk λk fk(Yt, Yt−1, Xt) ), where Z(X) = ΣY′ exp( Σt Σk λk fk(Y′t, Y′t−1, Xt) ) normalizes over all possible label sequences Y′.
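To make Z(X) concrete, this sketch scores every possible tag sequence for a three-word sentence by brute force (feasible only for toy examples; the forward algorithm computes Z(X) efficiently). The weights, tag set, and features are illustrative assumptions:

```python
import math
from itertools import product

TAGS = ["DT", "NN", "VB"]
# Illustrative lambda weights on transition and observation features.
w_trans = {("DT", "NN"): 1.5, ("NN", "VB"): 1.2}   # weight for tag bigram (Y_{t-1}, Y_t)
w_obs = {("the", "DT"): 2.0, ("dog", "NN"): 2.0, ("barks", "VB"): 2.0}

def score(tags, words):
    """sum_t sum_k lambda_k f_k(Y_t, Y_{t-1}, X_t) for one candidate tag sequence."""
    s = 0.0
    for t, (word, tag) in enumerate(zip(words, tags)):
        if t > 0:
            s += w_trans.get((tags[t - 1], tag), 0.0)
        s += w_obs.get((word, tag), 0.0)
    return s

def p_tags_given_words(tags, words):
    """P(Y|X) = exp(score(Y, X)) / Z(X), with Z(X) summed over all tag sequences."""
    z = sum(math.exp(score(cand, words)) for cand in product(TAGS, repeat=len(words)))
    return math.exp(score(tags, words)) / z

words = ["the", "dog", "barks"]
print(p_tags_given_words(("DT", "NN", "VB"), words))   # dominates all 27 alternatives
```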

  24. Adding Token Features to a CRF • (Diagram: each label Yt is connected to a vector of token features Xt,1 … Xt,m) • Can add token features Xi,j • Can add additional feature functions for each token feature to model the conditional distribution.

  25. Features in POS Tagging • For POS tagging, use lexicographic features of tokens: • Capitalized? • Starts with a numeral? • Ends in a given suffix (e.g. “s”, “ed”, “ly”)?
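A minimal sketch of a token-feature extractor along these lines; the specific feature names are just illustrative conventions:

```python
def token_features(word):
    """Lexical/orthographic features of a single token for POS tagging."""
    return {
        "capitalized": word[0].isupper(),
        "starts_with_digit": word[0].isdigit(),
        "suffix_s": word.endswith("s"),
        "suffix_ed": word.endswith("ed"),
        "suffix_ly": word.endswith("ly"),
        "lower": word.lower(),   # the lowercased word form itself is also a useful feature
    }

print(token_features("Quickly"))
# {'capitalized': True, 'starts_with_digit': False, 'suffix_s': False,
#  'suffix_ed': False, 'suffix_ly': True, 'lower': 'quickly'}
```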

  26. Enhanced Linear Chain CRF (standard approach) • (Diagram: same structure as slide 24, with labels Y1 … YT and token features Xt,1 … Xt,m) • Can also condition the transition on the current token features. • Add feature functions: fi,j,k(Yt, Yt−1, X) = 1 if Yt = i and Yt−1 = j and Xt−1,k = 1, and 0 otherwise

  27. Supervised Learning (Parameter Estimation) • As in logistic regression, use the L-BFGS optimization procedure to set the λ weights so as to maximize the conditional log-likelihood (CLL) of the supervised training data.
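In practice this optimization is rarely coded by hand. A sketch using the third-party sklearn-crfsuite package (an assumption: it must be installed separately, and the tiny training set below is purely illustrative) might look like the following; the library's "lbfgs" algorithm maximizes the regularized conditional log-likelihood:

```python
# pip install sklearn-crfsuite   (third-party wrapper around CRFsuite)
import sklearn_crfsuite

# Assumed input format: one feature dict per token, one tag list per sentence.
X_train = [[{"lower": "the"}, {"lower": "dog"}, {"lower": "barks"}]]
y_train = [["DT", "NN", "VB"]]

crf = sklearn_crfsuite.CRF(
    algorithm="lbfgs",       # L-BFGS optimization of the conditional log-likelihood
    c1=0.1, c2=0.1,          # L1 / L2 regularization strengths
    max_iterations=100,
)
crf.fit(X_train, y_train)
print(crf.predict(X_train))  # most probable tag sequence for each training sentence
```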

  28. Sequence Tagging (Inference) • A variant of the dynamic programming (Viterbi) algorithm can be used to efficiently, in O(TN²) time, determine the globally most probable label sequence for a given token sequence, using a given log-linear model of the conditional probability P(Y | X).
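A minimal Viterbi sketch over additive (log-space) scores; in a trained CRF the observation and transition scores would come from the learned λ weights, whereas the numbers below are made up for illustration:

```python
def viterbi(words, tags, score_obs, score_trans):
    """Most probable tag sequence under additive scores, in O(T * N^2) time."""
    best = [{t: score_obs(words[0], t) for t in tags}]   # best score of a path ending in t
    back = [{}]                                          # backpointers
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev, s = max(((p, best[i - 1][p] + score_trans(p, t)) for p in tags),
                          key=lambda pair: pair[1])
            best[i][t] = s + score_obs(words[i], t)
            back[i][t] = prev
    # Trace back from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

# Illustrative scores (assumptions, not learned weights):
obs = {("the", "DT"): 2.0, ("dog", "NN"): 2.0, ("barks", "VB"): 2.0}
trans = {("DT", "NN"): 1.0, ("NN", "VB"): 1.0}
print(viterbi(["the", "dog", "barks"], ["DT", "NN", "VB"],
              lambda w, t: obs.get((w, t), 0.0),
              lambda p, t: trans.get((p, t), 0.0)))   # ['DT', 'NN', 'VB']
```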

  29. Skip-Chain CRFs • Can model some long-distance dependencies (i.e. the same word appearing in different parts of the text) by including long-distance edges in the Markov model. • (Diagram: labels Y1 … Y3 for “Michael Dell said” linked by a skip edge to Y100, Y101 for “Dell bought”, the two occurrences of “Dell”) • Additional links make exact inference intractable, so must resort to approximate inference to try to find the most probable labeling.

  30. CRF Results • Experimental results verify that they have superior accuracy on various sequence labeling tasks: part-of-speech tagging, noun phrase chunking, named entity recognition, semantic role labeling. • However, CRFs are much slower to train and do not scale as well to large amounts of training data; training for POS on the full Penn Treebank (~1M words) currently takes “over a week.” • Skip-chain CRFs improve results on IE (information extraction).

  31. CRF Summary • CRFs are a discriminative approach to sequence labeling, whereas HMMs are generative. • Discriminative methods are usually more accurate since they are trained for a specific performance task. • CRFs also easily allow adding additional token features without making additional independence assumptions. • Training time is increased since a complex optimization procedure is needed to fit the supervised training data. • CRFs are a state-of-the-art method for sequence labeling.

  32. Phrase Structure • Most languages have a word order. • Words are organized into phrases: groups of words that act as a single unit, or constituent. • [The dog] [chased] [the cat]. • [The fat dog] [chased] [the thin cat]. • [The fat dog with red collar] [chased] [the thin old cat]. • [The fat dog with red collar named Tom] [suddenly chased] [the thin old white cat].

  33. Phrases • Noun phrase: a syntactic unit of a sentence which acts like a noun and in which a noun, called its head, is usually embedded. Typically an optional determiner followed by zero or more adjectives, a noun head, and zero or more prepositional phrases. • Prepositional phrase: headed by a preposition; expresses spatial, temporal, or other attributes. • Verb phrase: the part of the sentence that depends on the verb; headed by the verb. • Adjective phrase: acts like an adjective.

  34. Phrase Chunking • Find all non-recursive noun phrases (NPs) and verb phrases (VPs) in a sentence. • [NP I] [VP ate] [NP the spaghetti] [PP with] [NP meatballs]. • [NP He] [VP reckons] [NP the current account deficit] [VP will narrow] [PP to] [NP only # 1.8 billion] [PP in] [NP September] • Some applications need all the noun phrases in a sentence.

  35. Phrase Chunking as Sequence Labeling • Tag individual words with one of 3 tags • B (Begin): word starts a new target phrase • I (Inside): word is part of a target phrase but not the first word • O (Other): word is not part of a target phrase • Sample for NP chunking: He reckons the current account deficit will narrow to only # 1.8 billion in September.
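A minimal sketch of the BIO encoding for this sample sentence, with the NP spans taken from the bracketing on slide 34:

```python
def bio_encode(chunks):
    """chunks: list of (tokens, is_target_phrase); returns (token, BIO tag) pairs."""
    tagged = []
    for tokens, in_phrase in chunks:
        for pos, tok in enumerate(tokens):
            if not in_phrase:
                tagged.append((tok, "O"))
            else:
                tagged.append((tok, "B" if pos == 0 else "I"))
    return tagged

sentence = [
    (["He"], True), (["reckons"], False),
    (["the", "current", "account", "deficit"], True),
    (["will", "narrow"], False), (["to"], False),
    (["only", "#", "1.8", "billion"], True),
    (["in"], False), (["September"], True),
]
print(bio_encode(sentence))
# [('He', 'B'), ('reckons', 'O'), ('the', 'B'), ('current', 'I'), ('account', 'I'),
#  ('deficit', 'I'), ('will', 'O'), ('narrow', 'O'), ('to', 'O'), ('only', 'B'),
#  ('#', 'I'), ('1.8', 'I'), ('billion', 'I'), ('in', 'O'), ('September', 'B')]
```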

  36. Evaluating Chunking • Per-token accuracy does not evaluate finding correct full chunks. Instead use: Precision = (number of correct chunks found) / (total number of chunks found), and Recall = (number of correct chunks found) / (total number of actual chunks). • Take the harmonic mean of precision and recall to produce a single evaluation metric called F measure: F1 = 2PR / (P + R).
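A minimal sketch of chunk-level precision, recall, and F measure, with chunks represented as (start, end, type) spans and exact matching required (partial overlaps get no credit); the spans below are made up for illustration:

```python
def chunk_prf(gold_chunks, predicted_chunks):
    """Chunk-level precision, recall, and F1 with exact span matching."""
    gold, pred = set(gold_chunks), set(predicted_chunks)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = {(0, 1, "NP"), (2, 6, "NP"), (9, 13, "NP")}   # (start, end, type) spans
pred = {(0, 1, "NP"), (2, 6, "NP"), (9, 12, "NP")}   # last span has a boundary error
print(chunk_prf(gold, pred))   # ~ (0.667, 0.667, 0.667)
```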

  37. Current Chunking Results • Best system for NP chunking: F1 = 96% • Typical results for finding the full range of chunk types (CoNLL 2000 shared task: NP, VP, PP, ADV, SBAR, ADJP) are F1 = 92−94%
