
Part-of-speech tagging and chunking with log-linear models



  1. Part-of-speech tagging and chunking with log-linear models. Yoshimasa Tsuruoka, National Centre for Text Mining (NaCTeM), University of Manchester

  2. Outline • POS tagging and Chunking for English • Conditional Markov Models (CMMs) • Dependency Networks • Bidirectional CMMs • Maximum entropy learning • Conditional Random Fields (CRFs) • Domain adaptation of a tagger

  3. Part-of-speech tagging The/DT peri-kappa/NN B/NN site/NN mediates/VBZ human/JJ immunodeficiency/NN virus/NN type/NN 2/CD enhancer/NN activation/NN in/IN monocytes/NNS … • The tagger assigns a part-of-speech tag to each word in the sentence.

  4. Algorithms for part-of-speech tagging • Tagging speed and accuracy on WSJ (* evaluated on a different portion of WSJ)

  5. Chunking (shallow parsing) [NP He] [VP reckons] [NP the current account deficit] [VP will narrow] [PP to] [NP only # 1.8 billion] [PP in] [NP September] . • A chunker (shallow parser) segments a sentence into non-recursive phrases.

  6. Chunking (shallow parsing) He/B-NP reckons/B-VP the/B-NP current/I-NP account/I-NP deficit/I-NP will/B-VP narrow/I-VP to/B-PP only/B-NP #/I-NP 1.8/I-NP billion/I-NP in/B-PP September/B-NP ./O • Chunking tasks can be converted into a standard tagging task (a conversion sketch follows below) • Different approaches: • Sliding window • Semi-Markov CRF • …
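The conversion mentioned above can be made concrete with a short sketch. The helper name, the (start, end, label) span representation, and the example chunks are assumptions for illustration, not taken from the slides.

    # Minimal sketch: converting chunk spans into B/I/O tags (IOB2 encoding).
    def chunks_to_bio(tokens, chunks):
        """One B-/I-/O tag per token for non-overlapping chunk spans."""
        tags = ["O"] * len(tokens)
        for start, end, label in chunks:        # end is exclusive
            tags[start] = "B-" + label
            for i in range(start + 1, end):
                tags[i] = "I-" + label
        return tags

    tokens = ["He", "reckons", "the", "current", "account", "deficit"]
    chunks = [(0, 1, "NP"), (1, 2, "VP"), (2, 6, "NP")]
    print(list(zip(tokens, chunks_to_bio(tokens, chunks))))
    # [('He', 'B-NP'), ('reckons', 'B-VP'), ('the', 'B-NP'), ('current', 'I-NP'), ...]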

  7. Algorithms for chunking • Chunking speed and accuracy on Penn Treebank

  8. Conditional Markov Models (CMMs) • Left-to-right decomposition (with the first-order Markov assumption): P(t1 … tn | o) = Πi p(ti | ti-1, o), i.e. the probability of the whole tag sequence is factored into local probabilities, each conditioned on the preceding tag and the observation o.

  9. POS tagging with CMMs [Ratnaparkhi 1996; etc.] • Left-to-right decomposition • The local classifier uses the information on the preceding tag. • Example: He/PRP runs/VBZ fast/RB, with the tags decided one word at a time from left to right (a decoding sketch follows below).
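A minimal sketch of that left-to-right decoding, assuming a local classifier local_prob(word, prev_tag); the probability table is invented so the example runs, and a real tagger such as Ratnaparkhi's keeps a beam of partial sequences rather than committing greedily.

    # Sketch of left-to-right CMM decoding with a stand-in local classifier.
    TAGS = ["PRP", "VBZ", "RB"]

    def local_prob(word, prev_tag):
        # Toy p(tag | word, previous tag); illustrative numbers, and the
        # previous tag is ignored here purely for brevity.
        table = {
            "He":   {"PRP": 0.9, "VBZ": 0.05, "RB": 0.05},
            "runs": {"PRP": 0.1, "VBZ": 0.8,  "RB": 0.1},
            "fast": {"PRP": 0.1, "VBZ": 0.1,  "RB": 0.8},
        }
        return table[word]

    def tag_left_to_right(words):
        prev, output = "<BOS>", []
        for w in words:
            probs = local_prob(w, prev)
            best = max(probs, key=probs.get)    # greedy choice at each position
            output.append(best)
            prev = best                         # the decision feeds the next step
        return output

    print(tag_left_to_right(["He", "runs", "fast"]))   # ['PRP', 'VBZ', 'RB']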

  10. Examples of the features for local classification • Example: He/PRP runs/? fast, where the classifier is deciding the tag of “runs” and “He” has already been tagged PRP; typical feature templates are sketched below.
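The exact feature templates shown on the original slide are not recoverable from this transcript; the sketch below uses templates that are typical of maximum entropy taggers (current and surrounding words, previous tag, affixes) and should be read as an assumption.

    # Hypothetical feature extractor for the local classifier.
    def local_features(words, i, prev_tag):
        w = words[i]
        return [
            "w0=" + w,
            "w-1=" + (words[i - 1] if i > 0 else "<BOS>"),
            "w+1=" + (words[i + 1] if i + 1 < len(words) else "<EOS>"),
            "t-1=" + prev_tag,                 # information on the preceding tag
            "prefix3=" + w[:3],
            "suffix3=" + w[-3:],
        ]

    print(local_features(["He", "runs", "fast"], 1, "PRP"))
    # ['w0=runs', 'w-1=He', 'w+1=fast', 't-1=PRP', 'prefix3=run', 'suffix3=uns']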

  11. POS tagging with Dependency Network [Toutanova et al. 2003] • Use the information on the following tag as well • The following tag can be used as a feature in the local classification model • The product of these local scores is no longer a probability

  12. POS tagging with a Cyclic Dependency Network [Toutanova et al. 2003] • Training cost is small – almost equal to CMMs. • Decoding can be performed with dynamic programming, but it is still expensive. • Collusion – the model can lock onto conditionally consistent but jointly unlikely sequences.

  13. Bidirectional CMMs [Tsuruoka and Tsujii, 2005] • Possible decomposition structures: (a)–(d), the four ways of directing the conditional dependencies among t1, t2, t3 • Bidirectional CMMs • We can find the “best” structure and tag sequences in polynomial time (a related greedy decoding heuristic from the same paper is sketched below)
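As a companion to the polynomial-time search above, here is a sketch of the easiest-first decoding heuristic described in the same paper (Tsuruoka and Tsujii, 2005): the position where the local classifier is most confident is tagged first, so decided tags on either side become context for the remaining positions. The classifier toy_local_prob and its numbers are assumptions.

    # Easiest-first tagging: repeatedly commit the most confident local decision.
    def easiest_first(words, local_prob):
        tags = [None] * len(words)
        while None in tags:
            best = None
            for i, t in enumerate(tags):
                if t is not None:
                    continue
                probs = local_prob(words, i, tags)             # dict: tag -> prob
                tag, p = max(probs.items(), key=lambda kv: kv[1])
                if best is None or p > best[2]:
                    best = (i, tag, p)
            i, tag, _ = best
            tags[i] = tag                                      # commit the easiest one
        return tags

    def toy_local_prob(words, i, tags):
        # Illustrative numbers; a real model would also look at the tags
        # already decided on the left and right.
        table = {"He": {"PRP": 0.9, "VBZ": 0.1},
                 "runs": {"VBZ": 0.7, "PRP": 0.3},
                 "fast": {"RB": 0.8, "VBZ": 0.2}}
        return table[words[i]]

    print(easiest_first(["He", "runs", "fast"], toy_local_prob))   # ['PRP', 'VBZ', 'RB']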

  14. Maximum entropy learning • Log-linear modeling: p(y | x) = exp( Σi λi fi(x, y) ) / Z(x), where fi is a feature function, λi is its feature weight, and Z(x) normalizes over all classes.
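A minimal sketch of that formula in code; the feature template and the weights are toy stand-ins, not values from the talk.

    import math

    def p_of_y_given_x(x, y, classes, features, weights):
        # p(y | x) = exp(sum_i lambda_i * f_i(x, y)) / Z(x)
        def score(cls):
            return sum(weights.get(f, 0.0) for f in features(x, cls))
        z = sum(math.exp(score(c)) for c in classes)    # partition function Z(x)
        return math.exp(score(y)) / z

    features = lambda x, y: ["word=%s&tag=%s" % (x, y)]
    weights = {"word=runs&tag=Verb": 1.5, "word=runs&tag=Noun": 0.2}
    print(p_of_y_given_x("runs", "Verb", ["Noun", "Verb"], features, weights))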

  15. Maximum entropy learning • Maximum likelihood estimation • Find the parameters that maximize the (log-)likelihood of the training data • Smoothing • Gaussian prior [Berger et al, 1996] • Inequality constraints [Kazama and Tsujii, 2005]

  16. Parameter estimation • Algorithms for maximum entropy • GIS [Darroch and Ratcliff, 1972], IIS [Della Pietra et al., 1997] • General-purpose algorithms for numerical optimization • BFGS [Nocedal and Wright, 1999], LMVM [Benson and Moré, 2001] • You need to provide the objective function and its gradient: • the likelihood of the training samples • the model expectation of each feature (the gradient of each weight is the empirical feature count minus this expectation)

  17. Computing likelihood and model expectation • Example • Two possible tags: “Noun” and “Verb” • Two types of features: “word” and “suffix” • Training sentence: He/Noun opened/Verb it/Noun, with the local probabilities evaluated for tag = noun and tag = verb at each word (a worked sketch follows below)
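The sketch below puts slides 15 to 17 together for this toy example: an L2 (Gaussian-prior) penalized log-likelihood, its gradient as empirical counts minus model expectations, and an off-the-shelf quasi-Newton optimizer (scipy's L-BFGS-B) standing in for the BFGS/LMVM methods mentioned on slide 16. The feature templates and the prior strength are assumptions.

    import math
    import numpy as np
    from scipy.optimize import minimize

    TAGS = ["Noun", "Verb"]
    DATA = [("He", "Noun"), ("opened", "Verb"), ("it", "Noun")]

    def features(word, tag):
        # "word" and "suffix" feature types, as in the slide's example.
        return ["word=%s&tag=%s" % (word, tag),
                "suffix2=%s&tag=%s" % (word[-2:], tag)]

    FEATS = sorted({f for w, _ in DATA for t in TAGS for f in features(w, t)})
    FIDX = {f: i for i, f in enumerate(FEATS)}

    def vec(word, tag):
        v = np.zeros(len(FEATS))
        for f in features(word, tag):
            v[FIDX[f]] = 1.0
        return v

    def objective(lam, sigma2=10.0):
        # Returns (negative penalized log-likelihood, its gradient).
        ll, grad = 0.0, np.zeros_like(lam)
        for word, gold in DATA:
            scores = np.array([vec(word, t) @ lam for t in TAGS])
            probs = np.exp(scores - scores.max())
            probs /= probs.sum()                       # p(tag | word)
            ll += math.log(probs[TAGS.index(gold)])
            grad += vec(word, gold)                    # empirical feature count
            for t, p in zip(TAGS, probs):
                grad -= p * vec(word, t)               # model expectation
        ll -= (lam @ lam) / (2 * sigma2)               # Gaussian prior (smoothing)
        grad -= lam / sigma2
        return -ll, -grad

    result = minimize(objective, np.zeros(len(FEATS)), jac=True, method="L-BFGS-B")
    print("converged:", result.success)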

  18. Conditional Random Fields (CRFs) • A single log-linear model over the whole sentence • One can use exactly the same techniques as in maximum entropy learning to estimate the parameters. • However, the number of classes is huge: every possible tag sequence of the sentence is a class, so it grows exponentially with the sentence length, and a naive computation is impractical.

  19. Conditional Random Fields (CRFs) • Solution • Let’s restrict the types of features • Then, you can use a dynamic programming algorithm that drastically reduces the amount of computation • Features you can use (in first-order CRFs) • Features defined on the tag • Features defined on the adjacent pair of tags

  20. Features • Feature weights are associated with states and edges • State feature example: W0=He & Tag = Noun • Edge feature example: Tagleft = Noun & Tagright = Noun • (figure: a lattice over “He has opened it” with a Noun and a Verb state at each position)

  21. A naive way of calculating Z(x): enumerate every tag sequence and sum the scores
  Noun Noun Noun Noun = 7.2    Verb Noun Noun Noun = 4.1
  Noun Noun Noun Verb = 1.3    Verb Noun Noun Verb = 0.8
  Noun Noun Verb Noun = 4.5    Verb Noun Verb Noun = 9.7
  Noun Noun Verb Verb = 0.9    Verb Noun Verb Verb = 5.5
  Noun Verb Noun Noun = 2.3    Verb Verb Noun Noun = 5.7
  Noun Verb Noun Verb = 11.2   Verb Verb Noun Verb = 4.3
  Noun Verb Verb Noun = 3.4    Verb Verb Verb Noun = 2.2
  Noun Verb Verb Verb = 2.5    Verb Verb Verb Verb = 1.9
  Sum = 67.5
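The enumeration can be written down directly. The state and edge weights below are invented so the example runs; they are not the ones behind the figures on the slide.

    import math
    from itertools import product

    WORDS = ["He", "has", "opened", "it"]
    TAGS = ["Noun", "Verb"]

    def state_score(word, tag):
        # Toy state weights (e.g. fired by features like W0=He & Tag=Noun).
        return 0.5 if (word, tag) in {("He", "Noun"), ("opened", "Verb")} else 0.0

    def edge_score(prev_tag, tag):
        # Toy edge weights (e.g. fired by Tagleft/Tagright features).
        return 0.3 if prev_tag != tag else 0.0

    def sequence_score(tags):
        s = sum(state_score(w, t) for w, t in zip(WORDS, tags))
        s += sum(edge_score(a, b) for a, b in zip(tags, tags[1:]))
        return math.exp(s)

    z = sum(sequence_score(seq) for seq in product(TAGS, repeat=len(WORDS)))
    print("Z(x) by enumeration over 2^4 = 16 sequences:", z)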

  22. Dynamic programming • Results of intermediate computation can be reused. • (figure: the lattice over “He has opened it” with Noun and Verb states; partial sums are shared between paths)
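Continuing the toy weights from the previous sketch, the same Z(x) can be computed with the forward recursion implied by this slide: alpha[t] accumulates the total score of all partial sequences ending in tag t, so each intermediate sum is computed once and reused.

    # Forward algorithm: O(n * |T|^2) work instead of O(|T|^n) enumeration.
    # Reuses WORDS, TAGS, state_score, edge_score from the sketch above.
    def forward_z(words, tags):
        alpha = {t: math.exp(state_score(words[0], t)) for t in tags}
        for word in words[1:]:
            alpha = {t: sum(alpha[p] * math.exp(edge_score(p, t)) for p in tags)
                        * math.exp(state_score(word, t))
                     for t in tags}
        return sum(alpha.values())

    print("Z(x) by dynamic programming:", forward_z(WORDS, TAGS))
    # Matches the brute-force value computed above.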

  23. Maximum entropy learning and Conditional Random Fields • Maximum entropy learning • Log-linear modeling + MLE • Parameter estimation • Likelihood of each sample • Model expectation of each feature • Conditional Random Fields • Log-linear modeling on the whole sentence • Features are defined on states and edges • Dynamic programming

  24. Named Entity Recognition We have shown that [interleukin-1]protein ([IL-1]protein) and [IL-2]protein control [IL-2 receptor alpha (IL-2R alpha) gene]DNA transcription in [CD4-CD8- murine T lymphocyte precursors]cell_line.

  25. Algorithms for Biomedical Named Entity Recognition • Shared task data for Coling 2004 BioNLP workshop

  26. Domain adaptation • Large training data sets are available for general domains (e.g. Penn Treebank WSJ) • NLP tools trained on general-domain data are less accurate on biomedical text • Developing domain-specific data requires considerable human effort

  27. Tagging errors made by a tagger trained on WSJ • … and/CC membrane/NN potential/NN after/IN mitogen/NN binding/JJ . • … two/CD factors/NNS , which/WDT bind/NN to/TO the/DT same/JJ kappa/NN B/NN enhancers/NNS … • … by/IN analysing/VBG the/DT Ag/VBG amino/JJ acid/NN sequence/NN . • … to/TO contain/VB more/RBR T-cell/JJ determinants/NNS than/IN … • Stimulation/NN of/IN interferon/JJ beta/JJ gene/NN transcription/NN in/IN vitro/NN by/IN … • Accuracy of the tagger on the GENIA POS corpus: 84.4%

  28. Re-training of maximum entropy models • Taggers trained as maximum entropy models • Maximum entropy models are adapted to target domains by re-training them with domain-specific data • In the log-linear model, the feature functions are given by the developer, while the model parameters (feature weights) are re-estimated from the data

  29. Methods for domain adaptation • Combined training data: a model is trained from scratch on the original data together with the domain-specific data • Reference distribution: the original model is used as the reference probability distribution of a domain-specific model (a sketch follows below)
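A minimal sketch of the reference-distribution idea, assuming the original general-domain model is available as a fixed conditional distribution q0(y|x) and only a log-linear correction is trained on the new domain; the function names and all numbers are illustrative.

    import math

    def adapted_prob(x, y, classes, q0, features, weights):
        # p(y | x) proportional to q0(y | x) * exp(sum_i lambda_i * f_i(x, y)),
        # where q0 is the fixed original-domain model.
        def unnorm(cls):
            corr = sum(weights.get(f, 0.0) for f in features(x, cls))
            return q0(x, cls) * math.exp(corr)
        z = sum(unnorm(c) for c in classes)
        return unnorm(y) / z

    q0 = lambda x, y: {"NN": 0.3, "JJ": 0.7}[y]         # stand-in WSJ-trained model
    features = lambda x, y: ["word=%s&tag=%s" % (x, y)]
    weights = {"word=binding&tag=NN": 1.2}              # correction learned on biomedical data
    print(adapted_prob("binding", "NN", ["NN", "JJ"], q0, features, weights))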

  30. Adaptation of the part-of-speech tagger • Combinations of training and test data from the following corpora are evaluated • WSJ: Penn Treebank WSJ • GENIA: GENIA POS corpus [Kim et al., 2003] • 2,000 MEDLINE abstracts selected by the MeSH terms Human, Blood cells, and Transcription factors • PennBioIE: Penn BioIE corpus [Kulick et al., 2004] • 1,100 MEDLINE abstracts about inhibition of the cytochrome P450 family of enzymes • 1,157 MEDLINE abstracts about the molecular genetics of cancer • Fly: 200 MEDLINE abstracts on Drosophila melanogaster

  31. Training and test sets • Training sets • Test sets

  32. Experimental results

  33. Corpus size vs. accuracy (combined training data)

  34. Corpus size vs. accuracy (reference distribution)

  35. Summary • POS tagging • MEMM-like approaches achieve good performance at reasonable computational cost. CRFs seem too computationally expensive at present. • Chunking • CRFs yield good performance for NP chunking. Semi-Markov CRFs are promising, but their computational cost needs to be reduced. • Domain adaptation • The original-domain model can easily be used as the reference distribution when adapting to a new domain.

  36. References • A. L. Berger, S. A. Della Pietra, and V. J. Della Pietra. (1996). A maximum entropy approach to natural language processing. Computational Linguistics. • Adwait Ratnaparkhi. (1996). A Maximum Entropy Part-Of-Speech Tagger. Proceedings of EMNLP. • Thorsten Brants. (2000). TnT: A Statistical Part-Of-Speech Tagger. Proceedings of ANLP. • Taku Kudo and Yuji Matsumoto. (2001). Chunking with Support Vector Machines. Proceedings of NAACL. • John Lafferty, Andrew McCallum, and Fernando Pereira. (2001). Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. Proceedings of ICML. • Michael Collins. (2002). Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. Proceedings of EMNLP. • Fei Sha and Fernando Pereira. (2003). Shallow Parsing with Conditional Random Fields. Proceedings of HLT-NAACL. • K. Toutanova, D. Klein, C. Manning, and Y. Singer. (2003). Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. Proceedings of HLT-NAACL.

  37. References • Xavier Carreras and Lluís Màrquez. (2003). Phrase recognition by filtering and ranking with perceptrons. Proceedings of RANLP. • Jesús Giménez and Lluís Màrquez. (2004). SVMTool: A general POS tagger generator based on Support Vector Machines. Proceedings of LREC. • Sunita Sarawagi and William W. Cohen. (2004). Semi-Markov conditional random fields for information extraction. Proceedings of NIPS 2004. • Yoshimasa Tsuruoka and Jun'ichi Tsujii. (2005). Bidirectional Inference with the Easiest-First Strategy for Tagging Sequence Data. Proceedings of HLT/EMNLP. • Yuka Tateisi, Yoshimasa Tsuruoka and Jun'ichi Tsujii. (2006). Subdomain adaptation of a POS tagger with a small corpus. Proceedings of the HLT-NAACL BioNLP Workshop. • Daisuke Okanohara, Yusuke Miyao, Yoshimasa Tsuruoka, and Jun'ichi Tsujii. (2006). Improving the Scalability of Semi-Markov Conditional Random Fields for Named Entity Recognition. Proceedings of COLING/ACL 2006.
