totale Multilingual Tokenisation, Tagging and Lemmatisation. Tomaž Erjavec Dept. of Knowledge Technologies, Jožef Stefan Institute Ljubljana, Slovenia JRC Workshop, 26-27 September 2005. Overview of the talk. Introduction The totale pipeline Training totale Annotating JRC-ACQUIS-sl

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

totale Multilingual Tokenisation, Tagging and Lemmatisation


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
totale multilingual tokenisation tagging and lemmatisation

totale: Multilingual Tokenisation, Tagging and Lemmatisation

Tomaž Erjavec

Dept. of Knowledge Technologies, Jožef Stefan Institute

Ljubljana, Slovenia

JRC Workshop, 26-27 September 2005

Overview of the talk
  • Introduction
  • The totale pipeline
  • Training totale
  • Annotating JRC-ACQUIS-sl
  • Conclusions

Tomaž Erjavec: Multilingual Tokenisation, Tagging & Lemmatisation

Introduction
  • Hypothesis: to exploit JRC-ACQUIS efficiently, its texts need to be linguistically pre-processed
  • This normalizes (reduces) the data and gives other tools more features to work with

Example

2. (a) Where an exporter has declared goods packaged using automatic systems for bagging, canning, bottling, etc.,

TOKEN      TYPE      LEMMA      MSD
--------------------------------------
2.         TOK_ENUM  2.         Rmp
(a)        TOK_ENUM  (a)        Rmp
Where      TOK       where      Cs
an         TOK       a          Di
exporter   TOK       exporter   Ncns
has        TOK       have       Vaip3s
declared   TOK       declare    Vmps
goods      TOK       good       Ncnp
packaged   TOK       package    Vmis
using      TOK       use        Vmpp
automatic  TOK       automatic  Afp
systems    TOK       system     Ncnp
for        TOK       for        Sp
bagging    TOK       bag        Vmpp
,          PUN
canning    TOK       can        Vmpp
,          PUN
bottling   TOK       bottle     Vmpp
,          PUN
etc.       TOK_ABBR  etc.       Rmp

MSD and LEMMA are context dependent

MSD useful for any syntactically oriented further processing (PoS filtering)

LEMMA useful for reducing the lexical space (easier searches)

The task is much harder for inflectionally rich (or agglutinative) languages than for English or most ‘old’ EU languages!
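
As a rough illustration of the two uses named above, the following Python sketch filters annotated tokens by the first letter of the MSD (the part of speech) and collapses word-forms to their lemmas. The rows are copied from the example; the function names `pos_filter` and `lemma_set` are hypothetical and not part of totale.

```python
# Rows (token, type, lemma, MSD) copied from the example above.
ROWS = [
    ("exporter", "TOK", "exporter", "Ncns"),
    ("has", "TOK", "have", "Vaip3s"),
    ("declared", "TOK", "declare", "Vmps"),
    ("goods", "TOK", "good", "Ncnp"),
    ("systems", "TOK", "system", "Ncnp"),
    ("bagging", "TOK", "bag", "Vmpp"),
]

def pos_filter(rows, pos):
    """Keep rows whose MSD starts with the given part-of-speech letter."""
    return [r for r in rows if r[3].startswith(pos)]

def lemma_set(rows):
    """Reduce the lexical space: map tokens to their distinct lemmas."""
    return {r[2] for r in rows}

nouns = pos_filter(ROWS, "N")
print([r[0] for r in nouns])     # noun tokens only
print(sorted(lemma_set(nouns)))  # their lemmas
```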

Nagging doubts
  • Normalization loses information
  • Annotation introduces errors and bias
  • Evaluation for IE non-conclusive
  • Unsupervised methods!

Still…

Wanted

A tool that would take text in any language and

  • tokenise,
  • PoS tag and
  • lemmatise it.

Should be simple to install and use, robust, fast, and adaptable to new languages, preferably with a large number of already available models

(and work under Linux!)

What is out there
  • Component software: tokenisers, taggers, (stemmers)
  • FS/RE environments: INTEX, CLARK
  • Various LT workbenches, most famous GATE
  • Alas: Java, time investment, history

Linguistic annotation with totale
  • Multilingual tokenisation, tagging and lemmatisation
  • Perl program with a simple pipeline architecture
  • Input is plain UTF-8 text
  • Output is a list of annotated tokens
  • Several output formats (tabular, XML)

Example use

$ totale -l en
Doctor, can you help?
^D
<TEXT>
Doctor  TOK       doctor  Ncfs
,       PUN
can     TOK       can     Voip
you     TOK       you     Pp2
help    TOK       help    Vmn
?       PUN_TERM
<S/>
</TEXT>
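
A downstream tool would typically read this tabular output back into records. The Python sketch below is one way to do that, assuming whitespace-separated columns and that `<S/>` closes a sentence, as the output above suggests. The function name `parse_totale` is hypothetical.

```python
def parse_totale(lines):
    """Group tabular output lines into sentences of (token, type, lemma, msd)."""
    sentences, current = [], []
    for line in lines:
        fields = line.split()
        if not fields or fields[0] in ("<TEXT>", "</TEXT>"):
            continue
        if fields[0] == "<S/>":  # assumed sentence boundary marker
            if current:
                sentences.append(current)
            current = []
            continue
        # Punctuation rows carry only token and type; pad lemma and MSD.
        fields += [None] * (4 - len(fields))
        current.append(tuple(fields[:4]))
    if current:
        sentences.append(current)
    return sentences

OUTPUT = """<TEXT>
Doctor TOK doctor Ncfs
, PUN
can TOK can Voip
you TOK you Pp2
help TOK help Vmn
? PUN_TERM
<S/>
</TEXT>"""

sents = parse_totale(OUTPUT.splitlines())
print(len(sents), sents[0][0])
```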


Totale building blocks

[Diagram labels: Perl; mlToken, TnT, CLOG; Multilingual resources (×3)]

Tokenisation in totale
  • Perl module mlToken.pm (Camelia Ignat, JRC)
  • Multilingual, with resource files for supported languages (also default rules)
  • Splits text into tokens, marks token type
  • Marks paragraph and sentence boundaries
  • Modelled on mtSeg
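
To make "splits text into tokens, marks token type" concrete, here is a minimal regex-based sketch in Python. It is not mlToken's algorithm: the patterns and the `ABBREVIATIONS` set are assumptions standing in for the per-language resource files, the type names are the ones seen in the examples earlier in the talk, and paragraph and sentence boundary marking is left out.

```python
import re

ABBREVIATIONS = {"etc."}  # hypothetical stand-in for a language resource file

# Ordered alternatives: the first one that matches at a position wins.
TOKEN_RE = re.compile(r"""
    (?P<ENUM>\(\w\)|\d+\.(?=\s))   # enumerators such as (a) or 2.
  | (?P<WORD>\w+\.?)               # word, optionally with a trailing full stop
  | (?P<PUN>[^\w\s])               # any other single punctuation character
""", re.VERBOSE)

def tokenise(text):
    """Return a list of (token, type) pairs."""
    tokens = []
    for m in TOKEN_RE.finditer(text):
        tok = m.group()
        if m.lastgroup == "ENUM":
            tokens.append((tok, "TOK_ENUM"))
        elif m.lastgroup == "WORD":
            if tok in ABBREVIATIONS:
                tokens.append((tok, "TOK_ABBR"))
            elif tok.endswith("."):
                # Not a known abbreviation: split off the final full stop.
                tokens.append((tok[:-1], "TOK"))
                tokens.append((".", "PUN_TERM"))
            else:
                tokens.append((tok, "TOK"))
        else:
            tokens.append((tok, "PUN_TERM" if tok in ".?!" else "PUN"))
    return tokens

print(tokenise("Doctor, can you help?"))
```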

Tagging in totale
  • Annotating words in the text with their context-disambiguated morphosyntactic descriptions (MSDs)
  • Used the tri-gram tagger TnT
  • Trainable, fast, unknown-word guessing module, able to accommodate the large morphosyntactic tagsets of various EU languages
  • Uses (and induces from annotated corpus) a lexicon with ambiguity classes and tri-gram file
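
The last point can be sketched as follows: from an annotated corpus, collect for each word-form the set of tags it occurs with (its ambiguity class) and count tag tri-grams. This Python sketch only illustrates the idea. It does not reproduce TnT's file formats or its smoothing, the `<s>` and `</s>` padding symbols are assumptions, and the toy corpus (including the tag `Ncns` for the noun "can") is invented for the example.

```python
from collections import Counter, defaultdict

def induce_model(tagged_sentences):
    """Return (lexicon, trigrams) from sentences of (word, tag) pairs."""
    lexicon = defaultdict(Counter)  # word-form -> Counter of tags seen with it
    trigrams = Counter()            # (t1, t2, t3) -> frequency
    for sent in tagged_sentences:
        # Pad so that sentence-initial and sentence-final contexts are counted.
        tags = ["<s>", "<s>"] + [t for _, t in sent] + ["</s>"]
        for word, tag in sent:
            lexicon[word.lower()][tag] += 1
        for i in range(len(tags) - 2):
            trigrams[tuple(tags[i:i + 3])] += 1
    return lexicon, trigrams

def ambiguity_class(lexicon, word):
    """The sorted set of tags a word-form was observed with."""
    return sorted(lexicon[word.lower()])

CORPUS = [
    [("can", "Voip"), ("you", "Pp2"), ("help", "Vmn")],
    [("a", "Di"), ("can", "Ncns")],
]
lexicon, trigrams = induce_model(CORPUS)
print(ambiguity_class(lexicon, "can"))
print(trigrams[("Voip", "Pp2", "Vmn")])
```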

Lemmatisation in totale
  • Used CLOG, which learns first-order decision lists (+ list of exceptions)
  • Learns lemmatisation rules for each MSD
  • CLOG produces Prolog programs, but these are converted into Perl

Tomaž Erjavec and Sašo Džeroski: Machine Learning of Morphosyntactic Structure: Lemmatising Unknown Slovene Words. Applied Artificial Intelligence 18(1), pp. 17-40, 2004.

Example CLOG rule

# Rules for one MSD (afcfda): the first matching suffix pattern wins.
sub SUB_afcfda {
  my $w = $_[0]; my $lem;
  if    ($w=~/^(.*)svetlejši$/) {$lem=$1."svetel"}
  elsif ($w=~/^(.*)polnejši$/)  {$lem=$1."poln"}
  elsif ($w=~/^(.*)bši$/)       {$lem=$1."b"}
  elsif ($w=~/^(.*)elejši$/)    {$lem=$1."el"}
  elsif ($w=~/^(.*)ivejši$/)    {$lem=$1."iv"}
  elsif ($w=~/^(.*)anejši$/)    {$lem=$1."an"}
  elsif ($w=~/^(.*)kejši$/)     {$lem=$1."ek"}
  elsif ($w=~/^(.*)tejši$/)     {$lem=$1."t"}
  elsif ($w=~/^(.*)ižji$/)      {$lem=$1."izek"}
  elsif ($w=~/^(.*)enejši$/)    {$lem=$1."en"}
  elsif ($w=~/^(.*)rejši$/)     {$lem=$1."er"}
  elsif ($w=~/^(.*)nejši$/)     {$lem=$1."en"}
  else  {$lem="???"}            # no rule applies
  return $lem;
}
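
The same decision list can be written as data: an ordered list of (suffix, replacement) pairs tried top to bottom. The Python sketch below mirrors the rule above for illustration only. The names `RULES_AFCFDA` and `lemmatise` are hypothetical, and `str.endswith` stands in for the anchored Perl patterns.

```python
# Ordered (suffix, replacement) pairs; the first matching suffix wins.
RULES_AFCFDA = [
    ("svetlejši", "svetel"),
    ("polnejši", "poln"),
    ("bši", "b"),
    ("elejši", "el"),
    ("ivejši", "iv"),
    ("anejši", "an"),
    ("kejši", "ek"),
    ("tejši", "t"),
    ("ižji", "izek"),
    ("enejši", "en"),
    ("rejši", "er"),
    ("nejši", "en"),
]

def lemmatise(word, rules=RULES_AFCFDA):
    """Apply the first rule whose suffix matches; '???' if none does."""
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return "???"

for form in ("svetlejši", "debelejši", "nižji"):
    print(form, "->", lemmatise(form))
```

Because the list is ordered, more specific suffixes such as "enejši" must stay ahead of the more general "nejši".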

Training totale with MULTEXT-East resources
  • Learning totale tagging and lemmatisation models
  • MULTEXT-East language resources V3, a standardised multilingual dataset for language engineering R&D
  • Covers mainly Central and Eastern European languages
  • Freely available for research use from http://nl.ijs.si/ME/V3/
  • Used MSD tagged “1984” corpus (100kW) for tagger training
  • Used MSD lexica (15k lemmas) for lemmatiser training

Currently supported languages
  • English
  • Slovene
  • Czech
  • Romanian
  • Serbian
  • Estonian
  • Hungarian

Processing JRC’s ACQUIS-sl with totale
  • sl.tar.gz 03-Sep-2005 03:51 34.4M; sl/slcelex_*.xml = 144M, 7772 files
  • Wrapper perl program: for each file
      • extract text (all <P>s except first)
      • | totale -l sl -f XML |
      • substitute contents of original <P>s with annotated ones
      • validate against DTD
  • 72 hrs on asterix, but of that the 10s startup time per file × 7772 files = 77720s ≈ 21hrs
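
The startup overhead can be checked with a few lines of Python, assuming totale is started once per file as the wrapper description implies. The variable names are illustrative.

```python
FILES = 7772      # number of sl/slcelex_*.xml files
STARTUP_S = 10    # startup time per totale invocation, in seconds
TOTAL_HOURS = 72  # reported total processing time

startup_total_s = FILES * STARTUP_S
startup_hours = startup_total_s / 3600

print(startup_total_s)                        # 77720
print(round(startup_hours, 1))                # 21.6
print(round(startup_hours / TOTAL_HOURS, 2))  # 0.3, the share spent starting up
```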

The problem of titles
  • Dual role of titles: as text and name of document
  • Should they contain P at all?
  • Many titles untranslated – experiment with TextCat:
      • 4,964 sl
      • 1,663 en “Ni na razpolago v slovenskem jeziku” (“Not available in Slovene”)
      • 1,074 en
      • 59 sl or en
      • 12 en or sl
  • Also cases like “ODLOCBA t. 1346/2001/ES …”
  • So, did not process them.

Quantitative results: elements

Lexical analysis

Extracted the MULTEXT lexicon from corpus:

 8  rafinacija    rafinacija    Ncfsn
 2  rafinacije    rafinacija    Ncfpa
40  rafinacije    rafinacija    Ncfsg
 2  rafinacije15  rafinacije15  Mc---d
26  rafinaciji    rafinacij     Npmpn
 9  rafinaciji    rafinacija    Ncfsl
17  rafinacijo    rafinacija    Ncfsa

Number of lexical entries: 381,068

Different word-forms: 221,876

Different lemmas: 154,241

Different MSDs: 970
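
A lexicon of this shape, together with the four summary counts, can be derived by counting distinct (word-form, lemma, MSD) triples over the annotated tokens. The Python sketch below shows the idea on a handful of rows taken from the listing above. The function names are hypothetical and this is not the script used for the talk.

```python
from collections import Counter

def extract_lexicon(tokens):
    """Count occurrences of each (word-form, lemma, MSD) triple."""
    return Counter(tokens)

def summarise(lexicon):
    """Compute the four summary figures over the lexicon entries."""
    return {
        "entries": len(lexicon),
        "word_forms": len({w for w, _, _ in lexicon}),
        "lemmas": len({lemma for _, lemma, _ in lexicon}),
        "msds": len({msd for _, _, msd in lexicon}),
    }

TOKENS = [
    ("rafinacija", "rafinacija", "Ncfsn"),
    ("rafinacija", "rafinacija", "Ncfsn"),
    ("rafinacije", "rafinacija", "Ncfpa"),
    ("rafinacije", "rafinacija", "Ncfsg"),
    ("rafinaciji", "rafinacij", "Npmpn"),
]
lexicon = extract_lexicon(TOKENS)
print(lexicon[("rafinacija", "rafinacija", "Ncfsn")])
print(summarise(lexicon))
```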

Some problems
  • Complex tokenisation – over 15% “weird” words:

priloge.opomba     priloge.opomba     Ncfsn
who/fsf/fos/97.7   who/fsf/fos/97.7   Rgp
zavarovalnica(-e)  zavarovalnica(-e)  Ncmsi

  • Weak tagging model (likes verbs!):

3  anion   anion     Ncmsa--n
4  anion   anion     Ncmsn
1  anion   anion     Npmsn
3  anion   anion     Vmp--smp
6  aniona  anion     Ncmsg
8  anione  anion     Ncmpa
1  anioni  anioen    Afpmsny
1  anioni  anion     Ncmpn
1  anioni  anioni    Vmp--pmp
1  anioni  anioniti  Vmip3s--n

Conclusions
  • Presented processing of ACQUIS-sl with totale and a quick evaluation
  • Further work:
    • methodology of semi-manual annotation (model tweaking)
    • “lexical priming” in totale
  • Translations and collocates
