## Graphical models for part of speech tagging


**Different Models for POS tagging**

- HMM
- Maximum Entropy Markov Models
- Conditional Random Fields

**POS tagging: A Sequence Labeling Problem**

- Input and output
  - Input sequence $x = x_1 x_2 \ldots x_n$
  - Output sequence $y = y_1 y_2 \ldots y_m$
    - Labels of the input sequence
    - Semantic representation of the input
- Other applications
  - Automatic speech recognition
  - Text processing, e.g., tagging, named entity recognition, summarization by exploiting the layout structure of text, etc.

**Hidden Markov Models**

- Doubly stochastic models
- Efficient dynamic programming algorithms exist for
  - Finding Pr(S)
  - Finding the highest-probability path P that maximizes Pr(S, P) (Viterbi)
  - Training the model (Baum-Welch algorithm)

[Figure: a four-state HMM (S1 to S4) with transition probabilities on the arcs (0.5, 0.9, 0.5, 0.1, 0.8, 0.2) and one emission table per state over the symbols A and C: (0.6, 0.4), (0.9, 0.1), (0.5, 0.5), (0.3, 0.7).]

**Hidden Markov Model (HMM): Generative Modeling**

- Source model $P(Y)$, e.g., a 1st-order Markov chain, produces $y$
- Noisy channel $P(X \mid Y)$ turns $y$ into $x$
- Parameter estimation: maximize the joint likelihood of the training examples

**Disadvantage of HMMs (1)**

- No rich feature information
  - Rich information is required
    - When $x_k$ is complex
    - When data for $x_k$ is sparse
- Example: POS tagging
  - How to evaluate $P(w_k \mid t_k)$ for unknown words $w_k$?
  - Useful features
    - Suffix, e.g., -ed, -tion, -ing, etc.
    - Capitalization

**Disadvantage of HMMs (2)**

- Generative model
  - Parameter estimation: maximize the joint likelihood of the training examples
- Better approach
  - Discriminative model that models $P(y \mid x)$ directly
  - Maximize the conditional likelihood of the training examples

**Maximum Entropy Markov Model**

- Discriminative sub-models
  - Unify the two parameters of the generative model into one conditional model
    - Two parameters in the generative model: the parameter in the source model and the parameter in the noisy channel
    - Unified conditional model
  - Employ the maximum entropy principle
- Maximum Entropy Markov Model

**General Maximum Entropy Model**

- Model
  - Model the distribution $P(Y \mid X)$ with a set of features $\{f_1, f_2, \ldots, f_l\}$ defined on $X$ and $Y$
- Idea
  - Collect information about the features from training data
  - Assume nothing about the distribution $P(Y \mid X)$ other than the collected information
  - Maximize the entropy as a criterion

**Features**

- Features
  - 0-1 indicator functions
    - 1 if $(x, y)$ satisfies a predefined condition
    - 0 if not
- Example: POS tagging

**Constraints**

- Empirical information
  - Statistics from training data $T$
- Expected value
  - From the distribution $P(Y \mid X)$ we want to model
- Constraints

**Maximum Entropy: Objective**

- Entropy
- Maximization problem

**Dual Problem**

- Dual problem
  - Conditional model
  - Maximum likelihood of conditional data
- Solution
  - Improved iterative scaling (IIS) (Berger et al. 1996)
  - Generalized iterative scaling (GIS) (McCallum et al. 2000)

**Maximum Entropy Markov Model**

- Use the maximum entropy approach to model
  - 1st order
- Features
  - Basic features (like the parameters in an HMM)
    - Bigram (1st order) or trigram (2nd order) in the source model
    - State-output pair feature $(X_k = x_k, Y_k = y_k)$
  - Advantage: can incorporate other advanced features on $(x_k, y_k)$

**HMM vs MEMM (1st order)**

[Figure: graphical models of a 1st-order HMM and a 1st-order Maximum Entropy Markov Model (MEMM), side by side.]

**Performance in POS Tagging**

- POS tagging
  - Data set: WSJ
  - Features: HMM features, spelling features (like -ed, -tion, -s, -ing, etc.)
- Results (Lafferty et al. 2001)
  - 1st-order HMM: 94.31% accuracy, 54.01% OOV accuracy
  - 1st-order MEMM: 95.19% accuracy, 73.01% OOV accuracy

**Disadvantage of MEMMs (1)**

- Complex algorithm for the maximum entropy solution
  - Both IIS and GIS are difficult to implement
  - Require many tricks in implementation
- Slow training
  - Time-consuming when the data set is large
  - Especially for MEMMs

**Disadvantage of MEMMs (2)**

- Maximum Entropy Markov Model
  - Maximum entropy model as a sub-model
  - Entropy is optimized on the sub-models, not on the global model
- Label bias problem
  - Conditional models with per-state normalization
  - Effects of observations are weakened for states with fewer outgoing transitions

**Label Bias Problem**

- Training data (X:Y): rib:123 (three times), rob:456 (twice)
- New input: rob

[Figure: the model and its parameters, drawn as a state graph with two paths: r, i, b through states 1, 2, 3 and r, o, b through states 4, 5, 6.]

**Solution**

- Global optimization
  - Optimize the parameters in a global model simultaneously, not in sub-models separately
- Alternatives
  - Conditional random fields
  - Application of the perceptron algorithm

**Conditional Random Field (CRF) (1)**

- Let $G = (V, E)$ be a graph such that $Y = (Y_v)_{v \in V}$ is indexed by the vertices of $G$
- Then $(X, Y)$ is a conditional random field if, when conditioned globally on $X$, the variables $Y_v$ obey the Markov property with respect to $G$: $P(Y_v \mid X, Y_w, w \neq v) = P(Y_v \mid X, Y_w, w \sim v)$

**Conditional Random Field (CRF) (2)**

- Exponential model
  - $G$: a tree (or more specifically, a chain) with cliques as edges and vertices
  - Edge terms are determined by state transitions; vertex terms are determined by the state
- Parameter estimation
  - Maximize the conditional likelihood of the training examples
  - IIS or GIS

**MEMM vs CRF**

- Similarities
  - Both employ the maximum entropy principle
  - Both incorporate rich feature information
- Differences
  - Conditional random fields are always globally conditioned on $X$, resulting in a globally optimized model

**Performance in POS Tagging**

- POS tagging
  - Data set: WSJ
  - Features: HMM features, spelling features (like -ed, -tion, -s, -ing, etc.)
- Results (Lafferty et al. 2001)
  - 1st-order MEMM: 95.19% accuracy, 73.01% OOV accuracy
  - Conditional random fields: 95.73% accuracy, 76.24% OOV accuracy

**Comparison of the three approaches to POS Tagging**

- Results (Lafferty et al. 2001)
  - 1st-order HMM: 94.31% accuracy, 54.01% OOV accuracy
  - 1st-order MEMM: 95.19% accuracy, 73.01% OOV accuracy
  - Conditional random fields: 95.73% accuracy, 76.24% OOV accuracy

**References**

- A. Berger, S. Della Pietra, and V. Della Pietra (1996). A Maximum Entropy Approach to Natural Language Processing. Computational Linguistics, 22(1), 39-71.
- J. Lafferty, A. McCallum, and F. Pereira (2001). Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proc. ICML-2001, 282-289.
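
**Model forms (sketch)**

The formulas on the HMM, MEMM, and CRF slides are not preserved in this text. As a reading aid, the three models are commonly written as below for the first-order chain case. The symbols $\lambda_i$ (feature weights), $f_i$ (feature functions), and $Z$ (normalizers) are assumed notation, not a transcription of the slides.

```latex
% HMM: generative, joint likelihood of labels y and observations x
P(x, y) = \prod_{k=1}^{n} P(y_k \mid y_{k-1}) \, P(x_k \mid y_k)

% MEMM: a product of locally (per-state) normalized maximum entropy models
P(y \mid x) = \prod_{k=1}^{n} P(y_k \mid y_{k-1}, x_k), \qquad
P(y_k \mid y_{k-1}, x_k) =
  \frac{\exp\left( \sum_i \lambda_i f_i(x_k, y_k, y_{k-1}) \right)}{Z(x_k, y_{k-1})}

% Chain CRF: a single exponential model, normalized once over whole label sequences
P(y \mid x) = \frac{1}{Z(x)}
  \exp\left( \sum_{k=1}^{n} \sum_i \lambda_i f_i(y_{k-1}, y_k, x, k) \right)
```

The difference between the last two lines is where the normalizer sits: $Z(x_k, y_{k-1})$ is computed per state in the MEMM, while $Z(x)$ sums over all label sequences in the CRF.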
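
**Viterbi decoding (sketch)**

The Hidden Markov Models slide mentions that the highest-probability path can be found by dynamic programming (Viterbi). A minimal sketch of that search is given here, assuming a first-order HMM stored in plain dictionaries. The three-tag tagger and all of its probabilities are hypothetical values chosen for illustration; they do not come from the slides.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (best_path, log_prob) for a first-order HMM."""

    def log(p):
        # Work in log space; a zero probability becomes -inf.
        return math.log(p) if p > 0 else float("-inf")

    # best[k][s]: log-probability of the best path that ends in state s at step k
    best = [{s: log(start_p[s]) + log(emit_p[s][observations[0]]) for s in states}]
    back = [{}]  # back[k][s]: predecessor of s on that best path

    for k in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((r, best[k - 1][r] + log(trans_p[r][s])) for r in states),
                key=lambda pair: pair[1],
            )
            best[k][s] = score + log(emit_p[s][observations[k]])
            back[k][s] = prev

    # Follow the back-pointers from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for k in range(len(observations) - 1, 0, -1):
        path.append(back[k][path[-1]])
    path.reverse()
    return path, best[-1][last]


# Hypothetical toy tagger (all numbers are made up for this example).
STATES = ["DET", "NOUN", "VERB"]
START = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
TRANS = {
    "DET": {"DET": 0.0, "NOUN": 0.9, "VERB": 0.1},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
EMIT = {
    "DET": {"the": 0.9, "dog": 0.0, "barks": 0.1},
    "NOUN": {"the": 0.0, "dog": 0.8, "barks": 0.2},
    "VERB": {"the": 0.0, "dog": 0.1, "barks": 0.9},
}

tags, log_prob = viterbi(["the", "dog", "barks"], STATES, START, TRANS, EMIT)
print(tags, math.exp(log_prob))
```

The table `best` has one entry per position and state, so the search costs time proportional to the sequence length times the square of the number of states, instead of enumerating every label sequence.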
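
**Label bias on the rib/rob data (sketch)**

The rib/rob example from the Label Bias Problem slide can be reproduced numerically. The sketch below is a simplification: it replaces the trained maximum entropy sub-models with smoothed counts, but it keeps the property the slide criticizes, namely that each next-state distribution is normalized over the successors of the current state only. The start state `START` and the smoothing constant `eps` are assumptions introduced for this example.

```python
from collections import Counter, defaultdict

# Training data from the slide: rib:123 three times, rob:456 twice.
TRAIN = [("rib", (1, 2, 3))] * 3 + [("rob", (4, 5, 6))] * 2
START = 0  # hypothetical start state, not on the slide


def train(data):
    """Count (previous state, observation, next state) events and record successors."""
    counts = Counter()
    successors = defaultdict(set)
    for word, labels in data:
        prev = START
        for ch, label in zip(word, labels):
            counts[(prev, ch, label)] += 1
            successors[prev].add(label)
            prev = label
    return counts, successors


def local_prob(counts, successors, prev, ch, nxt, eps=1e-6):
    """P(nxt | prev, ch), normalized over the successors of prev only."""
    scores = {s: counts[(prev, ch, s)] + eps for s in successors[prev]}
    return scores[nxt] / sum(scores.values())


def path_prob(counts, successors, word, labels):
    """Product of the per-state normalized probabilities along one label path."""
    prob, prev = 1.0, START
    for ch, label in zip(word, labels):
        prob *= local_prob(counts, successors, prev, ch, label)
        prev = label
    return prob


counts, successors = train(TRAIN)
p_123 = path_prob(counts, successors, "rob", (1, 2, 3))
p_456 = path_prob(counts, successors, "rob", (4, 5, 6))
print(p_123, p_456)
```

State 1 has a single outgoing transition, so after normalization it passes all of its probability mass to state 2 even though it never saw the observation `o` in training. The path 1, 2, 3 therefore keeps the prior advantage it gained at the first step, and the input `rob` is labeled 123 in this sketch although every training instance of `rob` is labeled 456.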