
TAGPRO A system for ITALIAN POS TAGGING based on SVM

TAGPRO A system for ITALIAN POS TAGGING based on SVM. EVALITA 2007 Frascati, September 10th 2007. Emanuele Pianta and Roberto Zanoli FBK-irst, Trento. TextPro. A suite of modular NLP tools developed at FBK-irst TokenPro: tokenization MorphoPro: morphological analysis


Presentation Transcript


  1. TAGPRO: A system for ITALIAN POS TAGGING based on SVM. EVALITA 2007, Frascati, September 10th 2007. Emanuele Pianta and Roberto Zanoli, FBK-irst, Trento

  2. TextPro
     • A suite of modular NLP tools developed at FBK-irst
       • TokenPro: tokenization
       • MorphoPro: morphological analysis
       • TagPro: Part-of-Speech tagging
       • LemmaPro: lemmatization
       • EntityPro: Named Entity recognition
       • ChunkPro: phrase chunking
       • SentencePro: sentence splitting
     • Architecture designed to be efficient, scalable and robust
     • Cross-platform: Unix / Linux / Windows / MacOS X
     • Multi-lingual models
     • All modules integrated and accessible through a unified command line interface

  3. TagPro's architecture
     • To build TagPro we used YamCha, an SVM-based machine learning environment.
     • TagPro can exploit a rich set of linguistic features, such as morphological analysis, prefixes and suffixes.
     [Diagram: a training path (training data, feature extraction, feature selection, learning) and a test path (test data, feature extraction, feature selection, classification). Feature extraction covers ortho, prefix, suffix, dictionary and morpho analysis. Other labeled components: YamCha, dictionary, MorphoPro, models, Controller.]

  4. YamCha
     • Created as a generic, customizable, open source text chunker
     • Can be adapted to many other tag-oriented NLP tasks
     • Uses a state-of-the-art machine learning algorithm (SVM)
     • Can redefine
       • context (window-size)
       • parsing-direction (forward/backward)
       • algorithms for the multi-class problem (pairwise / one vs rest)
     • Practical chunking time (1 or 2 sec./sentence)
     • Available as a C/C++ library

  5. Support Vector Machines
     Based on the Structural Risk Minimization principle (Vladimir N. Vapnik, 1995)
     • SVMs map input vectors to a higher-dimensional space, where a maximum-margin separating hyperplane is constructed.
     • Two parallel hyperplanes are constructed, one on each side of the hyperplane that separates the data.
     • The separating hyperplane is the one that maximizes the distance between the two parallel hyperplanes.
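The geometric description on this slide corresponds to the standard hard-margin SVM formulation, sketched below. The symbols are assumptions introduced here, not taken from the slides: w and b define the separating hyperplane, φ is the mapping to the higher-dimensional space, and y_i ∈ {−1, +1} is the label of training vector x_i.

```latex
% Hard-margin SVM (textbook form; notation assumed, not from the slides)
\min_{\mathbf{w},\,b}\ \tfrac{1}{2}\lVert \mathbf{w} \rVert^{2}
\quad \text{subject to} \quad
y_i \bigl( \mathbf{w} \cdot \phi(\mathbf{x}_i) + b \bigr) \ge 1,
\qquad i = 1, \dots, n

% The two parallel hyperplanes are  w . phi(x) + b = +1  and  w . phi(x) + b = -1.
% Their distance is the margin:
\text{margin} = \frac{2}{\lVert \mathbf{w} \rVert}
```

Minimizing the norm of w is therefore the same as maximizing the distance between the two parallel hyperplanes.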

  6. YamCha: Setting Window Size
     • Default setting is "F:-2..2:0.. T:-2..-1".
     • The window setting can be customized.
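To make the window string easier to read, here is a minimal Python sketch that expands it into explicit templates. It assumes the usual reading of the YamCha notation: `F:rows:columns` selects static feature columns at the given token offsets, and `T:rows` selects previously assigned tags. The helper names `parse_range` and `expand_window` and the `n_columns` parameter are hypothetical, introduced only for this illustration.

```python
def parse_range(text, open_end):
    """Parse 'lo..hi' (or open-ended 'lo..') into an inclusive range."""
    lo, _, hi = text.partition("..")
    return range(int(lo), (int(hi) if hi else open_end) + 1)


def expand_window(spec, n_columns):
    """Expand a window spec into static (offset, column) pairs and tag offsets.

    n_columns is the number of feature columns per token; it resolves
    an open-ended column range such as '0..'.
    """
    static, dynamic = [], []
    for part in spec.split():
        fields = part.split(":")
        rows = parse_range(fields[1], open_end=0)
        if fields[0] == "F":      # static features: token offsets x columns
            cols = parse_range(fields[2], open_end=n_columns - 1)
            static += [(r, c) for r in rows for c in cols]
        elif fields[0] == "T":    # dynamic features: tags of earlier tokens
            dynamic += list(rows)
    return static, dynamic


static, dynamic = expand_window("F:-2..2:0.. T:-2..-1", n_columns=2)
```

With two feature columns, the default setting yields ten static templates (five token positions times two columns) and two dynamic ones (the tags at offsets -2 and -1).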

  7. Training and Tuning Set • The Evalita development set was randomly split into 2 parts • Training: 89,170 tokens • Tuning: 44,586 tokens
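The slide only states that the split was random; the token counts correspond to roughly two thirds for training. A minimal sketch of such a split follows, assuming it is done at the sentence level with a fixed seed. The function name `split_sentences`, the `train_fraction` default, and the seed are assumptions for illustration, not the authors' procedure.

```python
import random


def split_sentences(sentences, train_fraction=2 / 3, seed=0):
    """Randomly split a list of sentences into training and tuning parts."""
    rng = random.Random(seed)      # fixed seed keeps the split reproducible
    shuffled = list(sentences)     # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Toy usage: each sentence is a list of (token, tag) pairs.
corpus = [[("l'", "ART"), ("ex", "ADJ")], [("leader", "NN")], [("Craxi", "NN_P")]]
train, tune = split_sentences(corpus)
```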

  8. FEATURES
     • For each running word a rich set of features is extracted
     • WORD: the word itself (both unchanged and lower-cased)
       • e.g. Autore autore
     • MORPHO: the morphological analysis (produced by MorphoPro)
       • e.g. Autore autore+n+m+sing
       • Calcio calcio calcio+n+m+sing calciare+v+indic+pres+nil+1+sing
     • AFFIX: prefixes/suffixes (2, 3, 4 or 5 chars. at the start/end of the word)
       • e.g. libro {li,lib,libr,libro,ro,bro,ibro,libro}
     • ORTHOgraphic information (e.g. capitalization, hyphenation)
       • e.g. Oggi C (capitalized)
       • oggi L (lowercased)
     • GAZETTeers of proper nouns (154,000 proper names, 12,000 cities, 5,000 organizations and 3,200 locations)
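The WORD, AFFIX and ORTHO features can be sketched in a few lines of Python. This is an illustration, not TagPro's code: the function names are hypothetical, the `__nil__` placeholder for words shorter than the affix length follows the feature-extraction example in the slides, and only the two ORTHO codes named on this slide (C and L) are modeled. MORPHO and GAZET features are omitted because they need MorphoPro and the gazetteers.

```python
NIL = "__nil__"  # placeholder used when a word is too short for an affix


def affixes(word, sizes=(2, 3, 4, 5)):
    """Return prefixes then suffixes of the lower-cased word, one per size."""
    w = word.lower()
    prefixes = [w[:n] if len(w) >= n else NIL for n in sizes]
    suffixes = [w[-n:] if len(w) >= n else NIL for n in sizes]
    return prefixes + suffixes


def ortho(word):
    """C for a capitalized word, L otherwise (other codes are not modeled)."""
    return "C" if word[:1].isupper() else "L"


def static_features(word):
    """WORD (unchanged and lower-cased), AFFIX and ORTHO features of a word."""
    return [word, word.lower()] + affixes(word) + [ortho(word)]


print(affixes("libro"))
# ['li', 'lib', 'libr', 'libro', 'ro', 'bro', 'ibro', 'libro']
```

For a five-letter word such as "libro", the longest prefix and the longest suffix are both the whole word, which is why it appears twice in the slide's example.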

  9. Static vs Dynamic Features
     • STATIC FEATURES
       • extracted for the current, previous and following word
       • WORD, MORPHO, AFFIXes, ORTHO, GAZET
     • DYNAMIC FEATURES
       • decided dynamically during tagging
       • tags of the two tokens preceding the current token
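The difference between the two feature types shows up in the tagging loop: static features can be computed from the input alone, while dynamic features are only available once earlier tokens have been tagged. The following is a minimal sketch of a left-to-right tagger under that assumption. The `classify` callback stands in for the trained SVM, only the WORD feature is used as a static feature, and all names and feature keys are hypothetical.

```python
PAD = "__nil__"  # value used for positions outside the sentence


def tag_sentence(tokens, classify):
    """Tag tokens left to right, feeding the two previous tags back in."""
    tags = []
    for i, word in enumerate(tokens):
        # Static features: known before tagging starts (window -1..+1).
        features = {
            "w-1": tokens[i - 1] if i > 0 else PAD,
            "w0": word,
            "w+1": tokens[i + 1] if i + 1 < len(tokens) else PAD,
        }
        # Dynamic features: tags already assigned to the two previous tokens.
        features["t-2"] = tags[i - 2] if i >= 2 else PAD
        features["t-1"] = tags[i - 1] if i >= 1 else PAD
        tags.append(classify(features))
    return tags


def toy_classify(features):
    """Hypothetical stand-in for the SVM: capitalized words become NN_P."""
    return "NN_P" if features["w0"][:1].isupper() else "X"


tags = tag_sentence(["l'", "ex", "leader", "Bettino", "Craxi"], toy_classify)
```

Because each decision depends on earlier predictions, an error on one token can influence the tags that follow it; that is the trade-off of using dynamic features.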

  10. An Example of Feature Extraction
      Input (token, tag): l' ART, ex ADJ, leader NN, socialista ADJ, Bettino NN_P, Craxi NN_P
      Extracted features, one token per row (the last column is the PoS tag):
      l' l' l' __nil__ __nil__ __nil__ l' __nil__ __nil__ __nil__ L A N N N N N N N N N N N Y N N N N N N N N Y N O O O O ART
      ex ex ex __nil__ __nil__ __nil__ ex __nil__ __nil__ __nil__ L N N N N N N N N N N Y 2 N N N Y N N N N N N N O O O O ADJ
      leader leader le lea lead leade er der ader eader L N N N N N N N N N N Y N N Y 0 N N N N N N N N O O O O NN
      socialista socialista so soc soci socia ta sta ista lista L N N N N N N N N N N Y 2 N Y 0 N N N N N N N N O O O O ADJ
      Bettino bettino be bet bett betti no ino tino ttino C N N N N N N N N N N N N N N N N N N N N Y N N O O O B-NAM NN_P
      Craxi craxi cr cra crax craxi xi axi raxi craxi C N N N N N N N N N N N N N N N N N N N N Y N N O O O B-SUR NN_P

  11. Finding the best features
      • Baseline: WORD (both unchanged and lower-cased), window-size: +1,-1

  12. Finding the best window-size Given the best set of features (F1=97.42) we tried to improve Accuracy by changing the window-size

  13. Multi-class problem: pairwise / one vs rest
      • one vs rest: fewer, bigger classifiers
      • pairwise:
        • a classifier for each possible pair of classes
        • choose the classifier with best confidence
        • many relatively small classifiers
        • faster, less memory
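The size difference between the two schemes is easy to quantify: with K classes, one vs rest trains K classifiers and pairwise trains K(K-1)/2. The sketch below counts them and combines pairwise decisions by majority voting. Voting is a common combination scheme and is used here as an assumption; it may differ from the confidence-based choice mentioned on the slide. The `decide` callback is a hypothetical stand-in for a trained binary classifier.

```python
from itertools import combinations


def n_classifiers(n_classes):
    """Number of binary classifiers each multi-class scheme has to train."""
    return {
        "one_vs_rest": n_classes,
        "pairwise": n_classes * (n_classes - 1) // 2,
    }


def pairwise_predict(classes, decide):
    """Run every pairwise classifier and return the class with most votes.

    decide(a, b) must return either a or b for the item being classified.
    Ties are broken in favor of the class listed first.
    """
    votes = {c: 0 for c in classes}
    for a, b in combinations(classes, 2):
        votes[decide(a, b)] += 1
    return max(classes, key=lambda c: votes[c])


tagset = ["ART", "ADJ", "NN", "NN_P"]   # four tags from the slides' example
counts = n_classifiers(len(tagset))     # 4 one-vs-rest, 6 pairwise
```

Each pairwise classifier is trained only on the examples of its two classes, which is why the individual classifiers stay relatively small even though there are more of them.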

  14. Evaluating the best algorithm: PKI vs. PKE
      • YamCha uses two implementations of SVMs: PKI and PKE.
      • Both are faster than the original SVMs.
      • PKI (3-12x faster) produces the same accuracy as the original SVMs.
      • PKE (10-300x faster) approximates the original SVM: slightly less accurate but much faster.

  15. Results on the development set

  16. Test Results

  17. Conclusions
      • A statistical approach to PoS-tagging for Italian based on YamCha / SVMs.
      • Results confirm that SVMs can deal with a large number of features without overfitting.
      • We used the same best configuration for both tagsets.
      • No specific method was applied for classifying unknown words.
      • Features:
        • AFFIX+ORTHO: +8.56 over baseline
        • MORPHO: +2.13 over AFFIX+ORTHO
        • GAZETteers do not contribute any further significant improvement
      • Features for unknown words:
        • AFFIX+ORTHO: +25.56
        • MORPHO: a further +7.62
      • No benefit from a larger context (e.g. window-size +2,-2 and more)

  18. TagPro
      • TagPro is a system for PoS-tagging based on YamCha.
      • YamCha (Yet Another Multipurpose Chunk Annotator, by Taku Kudo)
        • is a generic, customizable, and open source text chunker
        • is based on Support Vector Machines (SVMs)
      • TagPro exploits a rich set of linguistic features, such as morphological analysis, prefixes and suffixes.
      • The system is part of TextPro, a suite of NLP tools developed at FBK-irst.

  19. Confusion matrix
