TAGPRO: A system for ITALIAN POS TAGGING based on SVM



TAGPRO A system for ITALIAN POS TAGGING based on SVM. EVALITA 2007 Frascati, September 10th 2007. Emanuele Pianta and Roberto Zanoli FBK-irst, Trento. TextPro. A suite of modular NLP tools developed at FBK-irst TokenPro: tokenization MorphoPro: morphological analysis

Presentation Transcript
TAGPRO: A system for ITALIAN POS TAGGING based on SVM

EVALITA 2007

Frascati, September 10th 2007

Emanuele Pianta and Roberto Zanoli

FBK-irst, Trento


TextPro

  • A suite of modular NLP tools developed at FBK-irst

    • TokenPro: tokenization

    • MorphoPro: morphological analysis

    • TagPro: Part-of-Speech tagging

    • LemmaPro: lemmatization

    • EntityPro: Named Entity recognition

    • ChunkPro: phrase chunking

    • SentencePro: sentence splitting

  • Architecture designed to be efficient, scalable and robust.

  • Cross-platform: Unix / Linux / Windows / MacOS X

  • Multi-lingual models

  • All modules integrated and accessible through unified command line interface



TagPro

To build TagPro we used YamCha, an SVM-based machine learning environment. TagPro can exploit a rich set of linguistic features, such as morphological analysis, prefixes and suffixes.

[Figure: TagPro’s architecture. Training path: training data, feature extraction (ortho, prefix, suffix, dictionary, morpho analysis), feature selection, learning, models. Test path: test data, feature extraction (same feature types), feature selection, classification. Other components shown: YamCha, dictionary, MorphoPro, Controller.]


YamCha

  • Created as generic, customizable, open source text chunker

  • Can be adapted to many other tag-oriented NLP tasks

  • Uses state-of-the-art machine learning algorithm (SVM)

  • Can redefine

    • Context (window-size)

    • parsing-direction (forward/backward)

    • algorithms for multi-class problem (pair wise/one vs rest)

  • Practical chunking time (1 or 2 sec./sentence.)

  • Available as C/C++ library



Support Vector Machines

Based on the Structural Risk Minimization principle (Vladimir N. Vapnik, 1995)

  • SVMs map input vectors to a higher dimensional space where a maximal separating hyperplane is constructed.

  • Two parallel hyperplanes are constructed on each side of the hyperplane that separates the data.

  • The separating hyperplane is the hyperplane that maximizes the distance between the two parallel hyperplanes.
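
For the linearly separable (hard-margin) case, the construction described in the bullets above is usually written as follows. The symbols w, b, x_i and y_i (with labels y_i in {-1, +1}) are standard textbook notation and are not taken from the slides.

```latex
\[
\begin{aligned}
&\text{separating hyperplane:} && \mathbf{w}\cdot\mathbf{x} + b = 0 \\
&\text{parallel hyperplanes:} && \mathbf{w}\cdot\mathbf{x} + b = \pm 1 \\
&\text{distance between them:} && \frac{2}{\lVert \mathbf{w} \rVert} \\
&\text{training problem:} && \min_{\mathbf{w},\,b}\ \tfrac{1}{2}\lVert \mathbf{w} \rVert^{2}
\quad \text{s.t.}\quad y_i\,(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1,\quad i = 1,\dots,n
\end{aligned}
\]
```

Minimizing the norm of w is the same as maximizing the distance 2/||w|| between the two parallel hyperplanes.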


YamCha: Setting Window Size

Default setting is "F:-2..2:0.. T:-2..-1".

The window setting can be customized
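
To make the template string concrete, here is a minimal Python sketch that expands it into explicit positions. It assumes the usual reading of the syntax: `F:rows:columns` selects static feature columns at relative token positions, and `T:rows` selects previously assigned tags. The function names and the `n_columns` parameter are hypothetical; this is not YamCha's own parser.

```python
def parse_range(spec, max_value=None):
    """Expand 'a..b' (or a single 'a') into a list of ints.

    An open-ended range such as '0..' is closed with max_value.
    """
    if ".." not in spec:
        return [int(spec)]
    lo, hi = spec.split("..")
    if hi == "":
        if max_value is None:
            raise ValueError("open-ended range needs max_value")
        hi = max_value
    return list(range(int(lo), int(hi) + 1))


def expand_template(template, n_columns):
    """Return (static, dynamic) positions described by a window template.

    static  : list of (relative_row, column) pairs
    dynamic : list of relative rows whose predicted tag is used
    """
    static, dynamic = [], []
    for part in template.split():
        fields = part.split(":")
        if fields[0] == "F":
            rows = parse_range(fields[1])
            cols = parse_range(fields[2], max_value=n_columns - 1)
            static += [(r, c) for r in rows for c in cols]
        elif fields[0] == "T":
            dynamic += parse_range(fields[1])
    return static, dynamic


# Default template, assuming a hypothetical file with 3 feature columns.
static, dynamic = expand_template("F:-2..2:0.. T:-2..-1", n_columns=3)
print(len(static), dynamic)
```

Under this reading, the default uses every feature column of the two preceding tokens, the current token and the two following tokens, plus the tags already assigned to the two preceding tokens.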



Training and Tuning Set

  • The Evalita development set was randomly split into 2 parts

    • Training: 89,170 tokens

    • Tuning: 44,586 tokens
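
The token counts above correspond to roughly two thirds for training and one third for tuning (89,170 of 133,756 tokens). A random split of this kind could be sketched as follows. The slides do not say whether the split was made at sentence or token level, nor which seed or tool was used, so the sentence-level unit, the `train_fraction` value and the function name are all assumptions.

```python
import random


def split_sentences(sentences, train_fraction=2 / 3, seed=0):
    """Shuffle sentences and split them into training and tuning parts."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Toy usage with 9 placeholder "sentences".
train, tune = split_sentences(range(9))
print(len(train), len(tune))
```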


FEATURES

  • For each running word a rich set of features is extracted

  • WORD: the word itself (both unchanged and lower-cased)

    • e.g. Autore autore

  • MORPHO: the morphological analysis (produced by MorphoPro)

    • e.g. Autore autore+n+m+sing

    • Calcio calcio calcio+n+m+sing calciare+v+indic+pres+nil+1+sing

  • AFFIX: prefixes/suffixes (2, 3, 4 or 5 chars. at the start/end of the word)

    • e.g. libro {li,lib,libr,libro,ro,bro,ibro,libro}

  • ORTHOgraphic information (e.g. capitalization, hyphenation)

    • e.g. Oggi C (capitalized)

    • oggi L (lowercased)

  • GAZETTeers of proper nouns (154,000 proper names, 12,000 cities, 5,000 organizations and 3,200 locations)
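
The WORD, AFFIX and ORTHO features can be illustrated with a short Python sketch. It assumes affixes are taken from the lower-cased word and padded with `__nil__` when the word is too short, which matches the sample rows in the feature extraction example. Only the C/L capitalization flag is covered; MORPHO and GAZET are left out because they need MorphoPro and the gazetteers. All function names are hypothetical and this is not TagPro's code.

```python
NIL = "__nil__"


def affixes(word, lengths=(2, 3, 4, 5)):
    """Prefixes then suffixes of the lower-cased word, NIL if the word is too short."""
    w = word.lower()
    prefixes = [w[:n] if len(w) >= n else NIL for n in lengths]
    suffixes = [w[-n:] if len(w) >= n else NIL for n in lengths]
    return prefixes + suffixes


def ortho(word):
    """'C' for a capitalized word, 'L' otherwise (only these two cases are sketched)."""
    return "C" if word[:1].isupper() else "L"


def static_features(word):
    """WORD (unchanged, lower-cased) + AFFIX + ORTHO capitalization flag."""
    return [word, word.lower()] + affixes(word) + [ortho(word)]


print(" ".join(static_features("Craxi")))
print(" ".join(static_features("ex")))
```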


Static vs Dynamic Features

  • STATIC FEATURES

    • extracted for the current, previous and following word

    • WORD, MORPHO, AFFIXes, ORTHO, GAZET

  • DYNAMIC FEATURES

    • decided dynamically during tagging

    • tag of the two tokens preceding the current token.
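
The difference between the two feature types shows up in the tagging loop: static features can be computed in advance, while dynamic features depend on tags the classifier has just produced. The sketch below assumes left-to-right tagging and a `__nil__` padding value at the sentence start. `classify` stands in for the trained SVM model, and `toy_classify` is a made-up rule set used only to make the example runnable.

```python
def tag_sentence(tokens, classify, start_tag="__nil__"):
    """Tag left to right, feeding the two previous predictions back in."""
    tags = []
    for i, token in enumerate(tokens):
        features = {
            "word": token.lower(),                          # static feature
            "tag-2": tags[i - 2] if i >= 2 else start_tag,  # dynamic feature
            "tag-1": tags[i - 1] if i >= 1 else start_tag,  # dynamic feature
        }
        tags.append(classify(features))
    return tags


def toy_classify(features):
    """Hypothetical stand-in for the model: three hand-written rules."""
    if features["word"] == "l'":
        return "ART"
    if features["tag-1"] == "ART":
        return "ADJ"
    return "NN"


print(tag_sentence(["l'", "ex", "leader"], toy_classify))
```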


An Example of Feature Extraction

l' ART

ex ADJ

leader NN

socialista ADJ

Bettino NN_P

Craxi NN_P

l' l' l' __nil__ __nil__ __nil__ l' __nil__ __nil__ __nil__ L A N N N N N N N N N N N Y N N N N N N N N Y N O O O O ART

ex ex ex __nil__ __nil__ __nil__ ex __nil__ __nil__ __nil__ L N N N N N N N N N N Y 2 N N N Y N N N N N N N O O O O ADJ

leader leader le lea lead leade er der ader eader L N N N N N N N N N N Y N N Y 0 N N N N N N N N O O O O NN

socialista socialista so soc soci socia ta sta ista lista L N N N N N N N N N N Y 2 N Y 0 N N N N N N N N O O O O ADJ

Bettino bettino be bet bett betti no ino tino ttino C N N N N N N N N N N N N N N N N N N N N Y N N O O O B-NAM NN_P

Craxi craxi cr cra crax craxi xi axi raxi craxi C N N N N N N N N N N N N N N N N N N N N Y N N O O O B-SUR NN_P


Finding the best features

Baseline: WORD (both unchanged and lower-cased)

window-size: +1,-1


Finding the best window-size

Given the best set of features (F1=97.42), we tried to improve accuracy by changing the window-size.


Multi-class problem: pair-wise vs. one vs rest

  • one vs rest: fewer bigger classifiers

  • pairwise:

    • a classifier for each possible pair of classes

    • choose the classifier with best confidence

    • many relatively small classifiers

    • faster, less memory
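
The "fewer, bigger" versus "many, small" contrast follows from how many binary classifiers each scheme trains: one per class for one vs rest, and one per unordered pair of classes for pairwise, i.e. K(K-1)/2 for K classes. A small Python sketch, using four tags from the feature extraction example as a toy tagset (the real tagsets are larger; the function names are hypothetical):

```python
from itertools import combinations


def one_vs_rest_tasks(classes):
    """One binary task per class: that class against all the others."""
    return [(c, "REST") for c in classes]


def pairwise_tasks(classes):
    """One binary task per unordered pair of classes."""
    return list(combinations(classes, 2))


tags = ["ART", "ADJ", "NN", "NN_P"]
print(len(one_vs_rest_tasks(tags)), len(pairwise_tasks(tags)))
```

Each pairwise classifier is trained only on the examples of its two classes, which is why the individual classifiers are relatively small.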


Evaluating the best algorithm: PKI vs. PKE

  • YamCha uses two implementations of SVMs: PKI and PKE.

  • Both are faster than the original SVMs.

  • PKI (3-12x faster) produces the same accuracy as the original SVMs.

  • PKE (10-300x faster) approximates the original SVM; it is slightly less accurate but much faster.




Conclusions

  • A statistical approach to PoS-Tagging for Italian based on YamCha / SVMs.

  • Results confirm that SVMs can deal with a large number of features without overfitting.

  • We used the same best configuration for both tagsets.

  • No specific method was applied for classifying unknown words.

  • Features:

    • AFFIX+ORTHO: +8.56 over baseline

    • MORPHO: +2.13 over AFFIX+ORTHO

    • GAZETteers do not contribute any further significant improvement

  • Features for unknown words:

    • AFFIX+ORTHO: +25.56; MORPHO: +7.62

  • No benefit from a larger context (e.g. window-size +2,-2 and more)


TagPro

  • TagPro is a system for PoS-tagging based on YamCha.

  • YamCha (Yet Another Multipurpose Chunk Annotator, by Taku Kudo)

    • is a generic, customizable, and open source text chunker.

    • is based on Support Vector Machines (SVMs)

  • TagPro exploits a rich set of linguistic features, such as morphological analysis, prefixes and suffixes.

  • The system is part of TextPro, a suite of NLP tools developed at FBK-irst.
