
Part-of-speech tagging and chunking with log-linear models



  1. Part-of-speech tagging and chunking with log-linear models. Yoshimasa Tsuruoka, National Centre for Text Mining (NaCTeM), University of Manchester

  2. Outline • POS tagging and Chunking for English • Conditional Markov Models (CMMs) • Dependency Networks • Bidirectional CMMs • Maximum entropy learning • Conditional Random Fields (CRFs) • Domain adaptation of a tagger

  3. Part-of-speech tagging The/DT peri-kappa/NN B/NN site/NN mediates/VBZ human/JJ immunodeficiency/NN virus/NN type/NN 2/CD enhancer/NN activation/NN in/IN monocytes/NNS … • The tagger assigns a part-of-speech tag to each word in the sentence.

  4. Algorithms for part-of-speech tagging • Tagging speed and accuracy on WSJ (* evaluated on a different portion of WSJ)

  5. Chunking (shallow parsing) [NP He] [VP reckons] [NP the current account deficit] [VP will narrow] [PP to] [NP only # 1.8 billion] [PP in] [NP September] . • A chunker (shallow parser) segments a sentence into non-recursive phrases.

  6. Chunking (shallow parsing) He/B-NP reckons/B-VP the/B-NP current/I-NP account/I-NP deficit/I-NP will/B-VP narrow/I-VP to/B-PP only/B-NP #/I-NP 1.8/I-NP billion/I-NP in/B-PP September/B-NP ./O • Chunking tasks can be converted into a standard tagging task (a conversion sketch follows below) • Different approaches: • Sliding window • Semi-Markov CRF • …
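The conversion mentioned above can be made concrete with a short sketch. The helper name, the (start, end, label) span representation, and the example chunks are assumptions for illustration, not taken from the slides.

    # Minimal sketch: converting chunk spans into B/I/O tags (IOB2 encoding).
    def chunks_to_bio(tokens, chunks):
        """One B-/I-/O tag per token for non-overlapping chunk spans."""
        tags = ["O"] * len(tokens)
        for start, end, label in chunks:        # end is exclusive
            tags[start] = "B-" + label
            for i in range(start + 1, end):
                tags[i] = "I-" + label
        return tags

    tokens = ["He", "reckons", "the", "current", "account", "deficit"]
    chunks = [(0, 1, "NP"), (1, 2, "VP"), (2, 6, "NP")]
    print(list(zip(tokens, chunks_to_bio(tokens, chunks))))
    # [('He', 'B-NP'), ('reckons', 'B-VP'), ('the', 'B-NP'), ('current', 'I-NP'), ...]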

  7. Algorithms for chunking • Chunking speed and accuracy on Penn Treebank

  8. Conditional Markov Models (CMMs) • Left-to-right decomposition (with the first-order Markov assumption): P(t1 … tn | o) = Πi p(ti | ti-1, o), i.e. the probability of the whole tag sequence is factored into local probabilities, each conditioned on the preceding tag and the observation o.

  9. POS tagging with CMMs [Ratnaparkhi 1996; etc.] • Left-to-right decomposition • The local classifier uses the information on the preceding tag. • Example: He/PRP runs/VBZ fast/RB, with the tags decided one word at a time from left to right (a decoding sketch follows below).
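A minimal sketch of that left-to-right decoding, assuming a local classifier local_prob(word, prev_tag); the probability table is invented so the example runs, and a real tagger such as Ratnaparkhi's keeps a beam of partial sequences rather than committing greedily.

    # Sketch of left-to-right CMM decoding with a stand-in local classifier.
    TAGS = ["PRP", "VBZ", "RB"]

    def local_prob(word, prev_tag):
        # Toy p(tag | word, previous tag); illustrative numbers, and the
        # previous tag is ignored here purely for brevity.
        table = {
            "He":   {"PRP": 0.9, "VBZ": 0.05, "RB": 0.05},
            "runs": {"PRP": 0.1, "VBZ": 0.8,  "RB": 0.1},
            "fast": {"PRP": 0.1, "VBZ": 0.1,  "RB": 0.8},
        }
        return table[word]

    def tag_left_to_right(words):
        prev, output = "<BOS>", []
        for w in words:
            probs = local_prob(w, prev)
            best = max(probs, key=probs.get)    # greedy choice at each position
            output.append(best)
            prev = best                         # the decision feeds the next step
        return output

    print(tag_left_to_right(["He", "runs", "fast"]))   # ['PRP', 'VBZ', 'RB']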

  10. Examples of the features for local classification • Example: He/PRP runs/? fast, where the classifier is deciding the tag of “runs” and “He” has already been tagged PRP; typical feature templates are sketched below.
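The exact feature templates shown on the original slide are not recoverable from this transcript; the sketch below uses templates that are typical of maximum entropy taggers (current and surrounding words, previous tag, affixes) and should be read as an assumption.

    # Hypothetical feature extractor for the local classifier.
    def local_features(words, i, prev_tag):
        w = words[i]
        return [
            "w0=" + w,
            "w-1=" + (words[i - 1] if i > 0 else "<BOS>"),
            "w+1=" + (words[i + 1] if i + 1 < len(words) else "<EOS>"),
            "t-1=" + prev_tag,                 # information on the preceding tag
            "prefix3=" + w[:3],
            "suffix3=" + w[-3:],
        ]

    print(local_features(["He", "runs", "fast"], 1, "PRP"))
    # ['w0=runs', 'w-1=He', 'w+1=fast', 't-1=PRP', 'prefix3=run', 'suffix3=uns']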

  11. POS tagging with Dependency Network [Toutanova et al. 2003] • Use the information on the following tag as well • The following tag can be used as a feature in the local classification model • The product of these local scores is no longer a probability

  12. POS tagging with a Cyclic Dependency Network [Toutanova et al. 2003] • Training cost is small – almost equal to CMMs. • Decoding can be performed with dynamic programming, but it is still expensive. • Collusion – the model can lock onto conditionally consistent but jointly unlikely sequences.

  13. Bidirectional CMMs [Tsuruoka and Tsujii, 2005] • Possible decomposition structures: (a)–(d), the four ways of directing the conditional dependencies among t1, t2, t3 • Bidirectional CMMs • We can find the “best” structure and tag sequences in polynomial time (a related greedy decoding heuristic from the same paper is sketched below)
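As a companion to the polynomial-time search above, here is a sketch of the easiest-first decoding heuristic described in the same paper (Tsuruoka and Tsujii, 2005): the position where the local classifier is most confident is tagged first, so decided tags on either side become context for the remaining positions. The classifier toy_local_prob and its numbers are assumptions.

    # Easiest-first tagging: repeatedly commit the most confident local decision.
    def easiest_first(words, local_prob):
        tags = [None] * len(words)
        while None in tags:
            best = None
            for i, t in enumerate(tags):
                if t is not None:
                    continue
                probs = local_prob(words, i, tags)             # dict: tag -> prob
                tag, p = max(probs.items(), key=lambda kv: kv[1])
                if best is None or p > best[2]:
                    best = (i, tag, p)
            i, tag, _ = best
            tags[i] = tag                                      # commit the easiest one
        return tags

    def toy_local_prob(words, i, tags):
        # Illustrative numbers; a real model would also look at the tags
        # already decided on the left and right.
        table = {"He": {"PRP": 0.9, "VBZ": 0.1},
                 "runs": {"VBZ": 0.7, "PRP": 0.3},
                 "fast": {"RB": 0.8, "VBZ": 0.2}}
        return table[words[i]]

    print(easiest_first(["He", "runs", "fast"], toy_local_prob))   # ['PRP', 'VBZ', 'RB']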

  14. Maximum entropy learning • Log-linear modeling: p(y | x) = exp( Σi λi fi(x, y) ) / Z(x), where fi is a feature function, λi is its feature weight, and Z(x) normalizes over all classes.
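A minimal sketch of that formula in code; the feature template and the weights are toy stand-ins, not values from the talk.

    import math

    def p_of_y_given_x(x, y, classes, features, weights):
        # p(y | x) = exp(sum_i lambda_i * f_i(x, y)) / Z(x)
        def score(cls):
            return sum(weights.get(f, 0.0) for f in features(x, cls))
        z = sum(math.exp(score(c)) for c in classes)    # partition function Z(x)
        return math.exp(score(y)) / z

    features = lambda x, y: ["word=%s&tag=%s" % (x, y)]
    weights = {"word=runs&tag=Verb": 1.5, "word=runs&tag=Noun": 0.2}
    print(p_of_y_given_x("runs", "Verb", ["Noun", "Verb"], features, weights))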

  15. Maximum entropy learning • Maximum likelihood estimation • Find the parameters that maximize the (log-)likelihood of the training data • Smoothing • Gaussian prior [Berger et al, 1996] • Inequality constraints [Kazama and Tsujii, 2005]

  16. Parameter estimation • Algorithms for maximum entropy • GIS [Darroch and Ratcliff, 1972], IIS [Della Pietra et al., 1997] • General-purpose algorithms for numerical optimization • BFGS [Nocedal and Wright, 1999], LMVM [Benson and Moré, 2001] • You need to provide the objective function and its gradient: • the likelihood of the training samples • the model expectation of each feature (the gradient of each weight is the empirical feature count minus this expectation)

  17. Computing likelihood and model expectation • Example • Two possible tags: “Noun” and “Verb” • Two types of features: “word” and “suffix” • Training sentence: He/Noun opened/Verb it/Noun, with the local probabilities evaluated for tag = noun and tag = verb at each word (a worked sketch follows below)
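The sketch below puts slides 15 to 17 together for this toy example: an L2 (Gaussian-prior) penalized log-likelihood, its gradient as empirical counts minus model expectations, and an off-the-shelf quasi-Newton optimizer (scipy's L-BFGS-B) standing in for the BFGS/LMVM methods mentioned on slide 16. The feature templates and the prior strength are assumptions.

    import math
    import numpy as np
    from scipy.optimize import minimize

    TAGS = ["Noun", "Verb"]
    DATA = [("He", "Noun"), ("opened", "Verb"), ("it", "Noun")]

    def features(word, tag):
        # "word" and "suffix" feature types, as in the slide's example.
        return ["word=%s&tag=%s" % (word, tag),
                "suffix2=%s&tag=%s" % (word[-2:], tag)]

    FEATS = sorted({f for w, _ in DATA for t in TAGS for f in features(w, t)})
    FIDX = {f: i for i, f in enumerate(FEATS)}

    def vec(word, tag):
        v = np.zeros(len(FEATS))
        for f in features(word, tag):
            v[FIDX[f]] = 1.0
        return v

    def objective(lam, sigma2=10.0):
        # Returns (negative penalized log-likelihood, its gradient).
        ll, grad = 0.0, np.zeros_like(lam)
        for word, gold in DATA:
            scores = np.array([vec(word, t) @ lam for t in TAGS])
            probs = np.exp(scores - scores.max())
            probs /= probs.sum()                       # p(tag | word)
            ll += math.log(probs[TAGS.index(gold)])
            grad += vec(word, gold)                    # empirical feature count
            for t, p in zip(TAGS, probs):
                grad -= p * vec(word, t)               # model expectation
        ll -= (lam @ lam) / (2 * sigma2)               # Gaussian prior (smoothing)
        grad -= lam / sigma2
        return -ll, -grad

    result = minimize(objective, np.zeros(len(FEATS)), jac=True, method="L-BFGS-B")
    print("converged:", result.success)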

  18. Conditional Random Fields (CRFs) • A single log-linear model over the whole sentence • One can use exactly the same techniques as in maximum entropy learning to estimate the parameters. • However, the number of classes is huge: every possible tag sequence of the sentence is a class, so it grows exponentially with the sentence length, and a naive computation is impractical.

  19. Conditional Random Fields (CRFs) • Solution • Let’s restrict the types of features • Then, you can use a dynamic programming algorithm that drastically reduces the amount of computation • Features you can use (in first-order CRFs) • Features defined on the tag • Features defined on the adjacent pair of tags

  20. Features • Feature weights are associated with states and edges • State feature example: W0=He & Tag = Noun • Edge feature example: Tagleft = Noun & Tagright = Noun • (figure: a lattice over “He has opened it” with a Noun and a Verb state at each position)

  21. A naive way of calculating Z(x): enumerate every tag sequence and sum the scores
  Noun Noun Noun Noun = 7.2    Verb Noun Noun Noun = 4.1
  Noun Noun Noun Verb = 1.3    Verb Noun Noun Verb = 0.8
  Noun Noun Verb Noun = 4.5    Verb Noun Verb Noun = 9.7
  Noun Noun Verb Verb = 0.9    Verb Noun Verb Verb = 5.5
  Noun Verb Noun Noun = 2.3    Verb Verb Noun Noun = 5.7
  Noun Verb Noun Verb = 11.2   Verb Verb Noun Verb = 4.3
  Noun Verb Verb Noun = 3.4    Verb Verb Verb Noun = 2.2
  Noun Verb Verb Verb = 2.5    Verb Verb Verb Verb = 1.9
  Sum = 67.5
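The enumeration can be written down directly. The state and edge weights below are invented so the example runs; they are not the ones behind the figures on the slide.

    import math
    from itertools import product

    WORDS = ["He", "has", "opened", "it"]
    TAGS = ["Noun", "Verb"]

    def state_score(word, tag):
        # Toy state weights (e.g. fired by features like W0=He & Tag=Noun).
        return 0.5 if (word, tag) in {("He", "Noun"), ("opened", "Verb")} else 0.0

    def edge_score(prev_tag, tag):
        # Toy edge weights (e.g. fired by Tagleft/Tagright features).
        return 0.3 if prev_tag != tag else 0.0

    def sequence_score(tags):
        s = sum(state_score(w, t) for w, t in zip(WORDS, tags))
        s += sum(edge_score(a, b) for a, b in zip(tags, tags[1:]))
        return math.exp(s)

    z = sum(sequence_score(seq) for seq in product(TAGS, repeat=len(WORDS)))
    print("Z(x) by enumeration over 2^4 = 16 sequences:", z)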

  22. Dynamic programming • Results of intermediate computation can be reused. • (figure: the lattice over “He has opened it” with Noun and Verb states; partial sums are shared between paths)
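Continuing the toy weights from the previous sketch, the same Z(x) can be computed with the forward recursion implied by this slide: alpha[t] accumulates the total score of all partial sequences ending in tag t, so each intermediate sum is computed once and reused.

    # Forward algorithm: O(n * |T|^2) work instead of O(|T|^n) enumeration.
    # Reuses WORDS, TAGS, state_score, edge_score from the sketch above.
    def forward_z(words, tags):
        alpha = {t: math.exp(state_score(words[0], t)) for t in tags}
        for word in words[1:]:
            alpha = {t: sum(alpha[p] * math.exp(edge_score(p, t)) for p in tags)
                        * math.exp(state_score(word, t))
                     for t in tags}
        return sum(alpha.values())

    print("Z(x) by dynamic programming:", forward_z(WORDS, TAGS))
    # Matches the brute-force value computed above.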

  23. Maximum entropy learning and Conditional Random Fields • Maximum entropy learning • Log-linear modeling + MLE • Parameter estimation • Likelihood of each sample • Model expectation of each feature • Conditional Random Fields • Log-linear modeling on the whole sentence • Features are defined on states and edges • Dynamic programming

  24. Named Entity Recognition We have shown that [interleukin-1]protein ([IL-1]protein) and [IL-2]protein control [IL-2 receptor alpha (IL-2R alpha) gene]DNA transcription in [CD4-CD8- murine T lymphocyte precursors]cell_line.

  25. Algorithms for Biomedical Named Entity Recognition • Shared task data for Coling 2004 BioNLP workshop

  26. Domain adaptation • Large training data sets are available for general domains (e.g. Penn Treebank WSJ) • NLP tools trained on general-domain data are less accurate on biomedical text • Developing domain-specific data requires considerable human effort

  27. Tagging errors made by a tagger trained on WSJ • … and/CC membrane/NN potential/NN after/IN mitogen/NN binding/JJ . • … two/CD factors/NNS , which/WDT bind/NN to/TO the/DT same/JJ kappa/NN B/NN enhancers/NNS … • … by/IN analysing/VBG the/DT Ag/VBG amino/JJ acid/NN sequence/NN . • … to/TO contain/VB more/RBR T-cell/JJ determinants/NNS than/IN … • Stimulation/NN of/IN interferon/JJ beta/JJ gene/NN transcription/NN in/IN vitro/NN by/IN … • Accuracy of the tagger on the GENIA POS corpus: 84.4%

  28. Re-training of maximum entropy models • Taggers trained as maximum entropy models • Maximum entropy models are adapted to target domains by re-training them with domain-specific data • In the log-linear model, the feature functions are given by the developer, while the model parameters (feature weights) are re-estimated from the data

  29. Methods for domain adaptation • Combined training data: a model is trained from scratch on the original data together with the domain-specific data • Reference distribution: the original model is used as the reference probability distribution of a domain-specific model (a sketch follows below)
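A minimal sketch of the reference-distribution idea, assuming the original general-domain model is available as a fixed conditional distribution q0(y|x) and only a log-linear correction is trained on the new domain; the function names and all numbers are illustrative.

    import math

    def adapted_prob(x, y, classes, q0, features, weights):
        # p(y | x) proportional to q0(y | x) * exp(sum_i lambda_i * f_i(x, y)),
        # where q0 is the fixed original-domain model.
        def unnorm(cls):
            corr = sum(weights.get(f, 0.0) for f in features(x, cls))
            return q0(x, cls) * math.exp(corr)
        z = sum(unnorm(c) for c in classes)
        return unnorm(y) / z

    q0 = lambda x, y: {"NN": 0.3, "JJ": 0.7}[y]         # stand-in WSJ-trained model
    features = lambda x, y: ["word=%s&tag=%s" % (x, y)]
    weights = {"word=binding&tag=NN": 1.2}              # correction learned on biomedical data
    print(adapted_prob("binding", "NN", ["NN", "JJ"], q0, features, weights))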

  30. Adaptation of the part-of-speech tagger • Combinations of training and test data from the following corpora are evaluated • WSJ: Penn Treebank WSJ • GENIA: GENIA POS corpus [Kim et al., 2003] • 2,000 MEDLINE abstracts selected by the MeSH terms Human, Blood cells, and Transcription factors • PennBioIE: Penn BioIE corpus [Kulick et al., 2004] • 1,100 MEDLINE abstracts about inhibition of the cytochrome P450 family of enzymes • 1,157 MEDLINE abstracts about the molecular genetics of cancer • Fly: 200 MEDLINE abstracts on Drosophila melanogaster

  31. Training and test sets • Training sets • Test sets

  32. Experimental results

  33. Corpus size vs. accuracy (combined training data)

  34. Corpus size vs. accuracy (reference distribution)

  35. Summary • POS tagging • MEMM-like approaches achieve good performance at reasonable computational cost. CRFs seem too computationally expensive at present. • Chunking • CRFs yield good performance for NP chunking. Semi-Markov CRFs are promising, but their computational cost needs to be reduced. • Domain adaptation • The original-domain model can easily be used as the reference distribution when adapting to a new domain.

  36. References • A. L. Berger, S. A. Della Pietra, and V. J. Della Pietra. (1996). A maximum entropy approach to natural language processing. Computational Linguistics. • Adwait Ratnaparkhi. (1996). A Maximum Entropy Part-Of-Speech Tagger. Proceedings of EMNLP. • Thorsten Brants. (2000). TnT: A Statistical Part-Of-Speech Tagger. Proceedings of ANLP. • Taku Kudo and Yuji Matsumoto. (2001). Chunking with Support Vector Machines. Proceedings of NAACL. • John Lafferty, Andrew McCallum, and Fernando Pereira. (2001). Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. Proceedings of ICML. • Michael Collins. (2002). Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. Proceedings of EMNLP. • Fei Sha and Fernando Pereira. (2003). Shallow Parsing with Conditional Random Fields. Proceedings of HLT-NAACL. • K. Toutanova, D. Klein, C. Manning, and Y. Singer. (2003). Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. Proceedings of HLT-NAACL.

  37. References • Xavier Carreras and Lluís Màrquez. (2003). Phrase recognition by filtering and ranking with perceptrons. Proceedings of RANLP. • Jesús Giménez and Lluís Màrquez. (2004). SVMTool: A general POS tagger generator based on Support Vector Machines. Proceedings of LREC. • Sunita Sarawagi and William W. Cohen. (2004). Semi-Markov conditional random fields for information extraction. Proceedings of NIPS 2004. • Yoshimasa Tsuruoka and Jun'ichi Tsujii. (2005). Bidirectional Inference with the Easiest-First Strategy for Tagging Sequence Data. Proceedings of HLT/EMNLP. • Yuka Tateisi, Yoshimasa Tsuruoka and Jun'ichi Tsujii. (2006). Subdomain adaptation of a POS tagger with a small corpus. Proceedings of the HLT-NAACL BioNLP Workshop. • Daisuke Okanohara, Yusuke Miyao, Yoshimasa Tsuruoka, and Jun'ichi Tsujii. (2006). Improving the Scalability of Semi-Markov Conditional Random Fields for Named Entity Recognition. Proceedings of COLING/ACL 2006.
