Learning Morphological Disambiguation Rules for Turkish

Deniz Yuret

Ferhan Türe

Koç University, İstanbul


Overview

  • Turkish morphology

  • The morphological disambiguation task

  • The Greedy Prepend Algorithm

  • Training

  • Evaluation


Turkish Morphology

  • Turkish is an agglutinative language: Many syntactic phenomena expressed by function words and word order in English are expressed by morphology in Turkish.

    I will be able to go.

    (go) + (able to) + (will) + (I)

    git + ebil + ecek + im

    Gidebileceğim.


Fun with Turkish Morphology

Avrupalılaştıramadıklarımızdanmışsınız

  • Avrupa – Europe

  • lı – European

  • laş – become

  • tır – make

  • ama – not able to

  • dık – we were

  • larımız – those that

  • dan – from

  • mış – were

  • sınız – you


So how long can words be?

  • uyu – sleep

  • uyut – make X sleep

  • uyuttur – have Y make X sleep

  • uyutturt – have Z have Y make X sleep

  • uyutturttur – have W have Z have Y make X sleep

  • uyutturtturt – have Q have W have Z …
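
The stacking pattern in the list above can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `causative_chain` is hypothetical, and the alternation of "t" and "tur" is hard-coded to match the forms listed for uyu. Real Turkish causative suffixes depend on vowel harmony and stem shape, which this sketch does not model.

```python
def causative_chain(stem, depth):
    """Return the stem followed by `depth` stacked causative forms.

    Assumption: suffixes simply alternate "t", "tur", "t", ... as in the
    uyu example; no vowel harmony or stem-dependent allomorphy is handled.
    """
    forms = [stem]
    for level in range(depth):
        suffix = "t" if level % 2 == 0 else "tur"
        forms.append(forms[-1] + suffix)
    return forms


if __name__ == "__main__":
    for form in causative_chain("uyu", 5):
        print(form)
```

Since another causative suffix can always be appended, the loop has no natural upper bound on `depth`.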


Morphological Analyzer for Turkish

masalı

  • masal+Noun+A3sg+Pnon+Acc (= the story)

  • masal+Noun+A3sg+P3sg+Nom (= his story)

  • masa+Noun+A3sg+Pnon+Nom^DB+Adj+With (= with tables)

  • Oflazer, K. (1994). Two-level description of Turkish morphology. Literary and Linguistic Computing.

  • Oflazer, K., Hakkani-Tür, D. Z., and Tür, G. (1999). Design for a Turkish treebank. EACL’99.

  • Beesley, K. R. and Karttunen, L. (2003). Finite State Morphology. CSLI Publications.


Features, IGs and Tags

masa+Noun+A3sg+Pnon+Nom^DB+Adj+With

  • stem: masa

  • features: +Noun, +A3sg, +Pnon, +Nom, +Adj, +With

  • derivational boundary: ^DB

  • inflectional groups (IGs): Noun+A3sg+Pnon+Nom and Adj+With

  • tag: Noun+A3sg+Pnon+Nom^DB+Adj+With

  • 126 unique features

  • 9129 unique IGs

  • ∞ unique tags

  • 11084 distinct tags observed in 1M word training corpus
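
As a rough illustration of how an analysis string breaks down into stem, tag, and inflectional groups, here is a minimal Python sketch. The function name `parse_analysis` is hypothetical, and the sketch assumes the stem itself contains no "+" character and that IGs are separated by the literal marker ^DB.

```python
def parse_analysis(analysis):
    """Split an analyzer output string into (stem, tag, igs).

    Assumptions: the stem ends at the first "+", and inflectional groups
    are separated by the derivational boundary marker "^DB".
    """
    stem, _, tag = analysis.partition("+")
    igs = [ig.lstrip("+").split("+") for ig in tag.split("^DB")]
    return stem, tag, igs


stem, tag, igs = parse_analysis("masa+Noun+A3sg+Pnon+Nom^DB+Adj+With")
print(stem)  # stem of the word
print(tag)   # full tag: all features with the boundary marker
print(igs)   # one list of features per inflectional group
```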


Why not just do POS tagging?

[Figure from Oflazer (1999)]

  • Inflectional groups can independently act as heads or modifiers in syntactic dependencies.

  • Full morphological analysis is essential for further syntactic analysis.


Morphological disambiguation

  • Ambiguity rare in English:

    lives = live+s or life+s

  • More serious in Turkish:

    42.1% of the tokens ambiguous

    1.8 parses per token on average

    3.8 parses for ambiguous tokens


Morphological disambiguation

  • Task: pick correct parse given context

    • masal+Noun+A3sg+Pnon+Acc

    • masal+Noun+A3sg+P3sg+Nom

    • masa+Noun+A3sg+Pnon+Nom^DB+Adj+With

  • Uzun masalı anlat Tell the long story

  • Uzun masalı bitti His long story ended

  • Uzun masalı oda Room with long table


  • Key idea: build a separate classifier for each feature.


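Decision Lists

W is the target word; L1 and R1 are its left and right neighbors. The first rule that matches decides.

  • Rule 1: If (W = çok) and (R1 = +DA) Then W has +Det

  • Rule 2: If (L1 = pek) Then W has +Det

  • Rule 3: If (W = +AzI) Then W does not have +Det

  • Rule 4: If (W = çok) Then W does not have +Det

  • Rule 5: If TRUE Then W has +Det

Examples, with the rule that decides each:

  • “pek çok alanda” (rule 1)

  • “pek çok insan” (rule 2)

  • “insan çok daha” (rule 4)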
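
To make the first-match behavior concrete, the five +Det rules can be applied to the three example phrases with a short Python sketch. Everything beyond the rules themselves is an assumption made for illustration: the names `RULES`, `classify` and `suffix_regex`, and in particular the mapping of the capital letters in suffix patterns to character classes (A as a/e, I as ı/i/u/ü, D as d/t).

```python
import re

# Assumed archiphoneme classes for suffix patterns such as +DA and +AzI.
ARCHIPHONEMES = {"A": "[ae]", "I": "[ıiuü]", "D": "[dt]"}


def suffix_regex(pattern):
    """Compile a suffix pattern (without the leading '+') into a regex."""
    body = "".join(ARCHIPHONEMES.get(ch, re.escape(ch)) for ch in pattern)
    return re.compile(body + "$")


def matches(condition, window):
    """Check one (position, value) condition against a word window."""
    position, value = condition
    word = window.get(position, "")
    if value.startswith("+"):
        return bool(suffix_regex(value[1:]).search(word))
    return word == value


# (conditions, has_det) pairs in list order; the empty condition is "If TRUE".
RULES = [
    ([("W", "çok"), ("R1", "+DA")], True),
    ([("L1", "pek")], True),
    ([("W", "+AzI")], False),
    ([("W", "çok")], False),
    ([], True),
]


def classify(window):
    """Return (rule number, decision) for the first matching rule."""
    for number, (conditions, has_det) in enumerate(RULES, start=1):
        if all(matches(c, window) for c in conditions):
            return number, has_det


examples = [
    {"L1": "pek", "W": "çok", "R1": "alanda"},
    {"L1": "pek", "W": "çok", "R1": "insan"},
    {"L1": "insan", "W": "çok", "R1": "daha"},
]
for window in examples:
    print(window, classify(window))
```

Because the last rule has no conditions, `classify` always returns a decision.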


Greedy Prepend Algorithm

GPA(data)
    dlist = NIL
    default-class = Most-Common-Class(data)
    rule = [If TRUE Then default-class]
    while Gain(rule, dlist, data) > 0
        do dlist = prepend(rule, dlist)
           rule = Max-Gain-Rule(dlist, data)
    return dlist
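
The pseudocode above can be turned into a small runnable Python sketch. Two parts are assumptions, since the slide does not define them: `gain` is taken to be the net change in correctly classified training instances when a rule is prepended, and `max_gain_rule` searches only single-attribute patterns (a simplification of whatever rule search the authors use). Instances are (attribute set, label) pairs, and the toy data is invented for illustration.

```python
from collections import Counter


def predict(dlist, attrs):
    """Class of the first rule whose pattern is contained in attrs."""
    for pattern, cls in dlist:
        if pattern <= attrs:
            return cls
    return None  # an empty list classifies nothing


def gain(rule, dlist, data):
    """Assumed definition: net change in correct instances if rule is prepended."""
    pattern, cls = rule
    return sum(
        (cls == label) - (predict(dlist, attrs) == label)
        for attrs, label in data
        if pattern <= attrs
    )


def max_gain_rule(dlist, data):
    """Simplified search: best single-attribute rule (ties broken by sort order)."""
    classes = {label for _, label in data}
    candidates = sorted(
        {(frozenset([a]), c) for attrs, _ in data for a in attrs for c in classes},
        key=lambda r: (sorted(r[0]), str(r[1])),
    )
    return max(candidates, key=lambda r: gain(r, dlist, data), default=dlist[-1])


def gpa(data):
    dlist = []
    default_class = Counter(label for _, label in data).most_common(1)[0][0]
    rule = (frozenset(), default_class)  # [If TRUE Then default-class]
    while gain(rule, dlist, data) > 0:
        dlist.insert(0, rule)  # prepend
        rule = max_gain_rule(dlist, data)
    return dlist


# Invented toy data for a +Det-like feature.
toy_data = [
    ({"W=çok", "L1=pek"}, True),
    ({"W=çok"}, False),
    ({"W=çok"}, False),
    ({"W=çok"}, False),
    ({"W=bir"}, True),
    ({"W=bir"}, True),
    ({"W=bir"}, True),
]
for pattern, cls in gpa(toy_data):
    print(sorted(pattern), cls)
```

Each prepended rule strictly increases the number of correctly classified training instances, so the loop terminates after at most as many iterations as there are instances.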


Training Data

  • 1M words of news material

  • Semi-automatically disambiguated

  • Created 126 separate training sets, one for each feature

  • Each training set only contains instances which have the corresponding feature in at least one of their parses
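
The per-feature split described above might be sketched as follows. The function names and the token representation (context, candidate parses, correct parse) are assumptions for illustration, as is the labeling choice: an instance is labeled positive when the correct parse contains the feature.

```python
import re
from collections import defaultdict


def features_of(parse):
    """Set of features in a parse string (stem and ^DB marker excluded)."""
    _stem, _, tag = parse.partition("+")
    return {f for f in re.split(r"\^DB\+|\+", tag) if f}


def build_training_sets(tokens):
    """Map each feature to a list of (context, label) instances.

    A token contributes to a feature's training set only if that feature
    appears in at least one of its candidate parses. Assumption: the label
    is True when the correct parse contains the feature.
    """
    training_sets = defaultdict(list)
    for context, parses, correct in tokens:
        candidates = set().union(*(features_of(p) for p in parses))
        gold = features_of(correct)
        for feature in candidates:
            training_sets[feature].append((context, feature in gold))
    return training_sets


tokens = [
    (
        "Uzun masalı anlat",
        [
            "masal+Noun+A3sg+Pnon+Acc",
            "masal+Noun+A3sg+P3sg+Nom",
            "masa+Noun+A3sg+Pnon+Nom^DB+Adj+With",
        ],
        "masal+Noun+A3sg+Pnon+Acc",
    )
]
sets = build_training_sets(tokens)
for feature in sorted(sets):
    print(feature, sets[feature])
```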


Input attributes

For a five word window:

  • The exact word string (e.g. W=Ali'nin)

  • The lowercase version (e.g. W=ali'nin)

  • All suffixes (e.g. W=+n, W=+In, W=+nIn, W=+'nIn, etc.)

  • Character types (e.g. Ali'nin would be described with W=UPPER-FIRST, W=LOWER-MID, W=APOS-MID, W=LOWER-LAST)

    Average 40 features per instance.
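
A simplified sketch of this attribute extraction is shown below. Several details are assumptions rather than the authors' implementation: suffixes are emitted as lowercased surface strings (the slide's W=+nIn style normalizes vowels, which is not attempted here), the DIGIT and OTHER character types and the L2/R2 position names are invented, and Python's `str.lower()` does not apply Turkish-specific casing of I/İ.

```python
def char_type(ch):
    """Coarse character class; DIGIT and OTHER are assumed extra classes."""
    if ch.isupper():
        return "UPPER"
    if ch.islower():
        return "LOWER"
    if ch == "'":
        return "APOS"
    if ch.isdigit():
        return "DIGIT"
    return "OTHER"


def word_attributes(word, pos="W"):
    """Attributes of one word, prefixed with its window position."""
    if not word:
        return set()
    attrs = {f"{pos}={word}", f"{pos}={word.lower()}"}
    # All proper suffixes, as surface strings (no vowel normalization).
    attrs.update(f"{pos}=+{word[-k:].lower()}" for k in range(1, len(word)))
    # Character types of the first, middle and last characters.
    attrs.add(f"{pos}={char_type(word[0])}-FIRST")
    attrs.update(f"{pos}={char_type(ch)}-MID" for ch in word[1:-1])
    if len(word) > 1:
        attrs.add(f"{pos}={char_type(word[-1])}-LAST")
    return attrs


def window_attributes(words, i):
    """Attributes for the five-word window centered on words[i]."""
    attrs = set()
    for offset, pos in ((-2, "L2"), (-1, "L1"), (0, "W"), (1, "R1"), (2, "R2")):
        j = i + offset
        if 0 <= j < len(words):
            attrs |= word_attributes(words[j], pos)
    return attrs


print(sorted(word_attributes("Ali'nin")))
```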


Sample decision lists

+Acc

0

1 W=+InI

1 W=+yI

1 W=UPPER0

1 W=+IzI

1 L1=~bu

1 W=~onu

1 R1=+mAK

1 W=~beni

0 W=~günü

1 W=+InlArI

1 W=~onları

0 W=+olAyI

0 W=~sorunu

… (672 rules)

+Prop

1

0 W=STFIRST

0 W==Türk

1 W=STFIRST R1=UCFIRST

0 L1==.

0 W=+AnAl

1 R1==,

0 W=+yAD

1 W=UPPER0

0 W=+lAD

0 W=+AK

1 R1=UPPER

0 W==Milli

1 W=STFIRST R1=UPPER0

… (3476 rules)



Combining models

  • masal+Noun+A3sg+P3sg+Nom

  • masal+Noun+A3sg+Pnon+Acc

  • Decision list results and confidence (only distinguishing features necessary):

    • P3sg = yes (89.53%)

    • Nom = no (93.92%)

    • Pnon = no (95.03%)

    • Acc = yes (89.24%)

  • score(P3sg+Nom) = 0.8953 × (1 – 0.9392) ≈ 0.0544

  • score(Pnon+Acc) = (1 – 0.9503) × 0.8924 ≈ 0.0444

  • P3sg+Nom has the higher score, so masal+Noun+A3sg+P3sg+Nom is selected.
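
The scoring arithmetic above can be checked with a short Python sketch. The function names and the dictionary layout are assumptions for illustration; the numbers are the decision-list outputs from the slide. For each parse, only the features that distinguish it from the other candidates are multiplied in, using the confidence when the classifier says the feature is present and one minus the confidence otherwise.

```python
import re
from math import prod


def features_of(parse):
    """Set of features in a parse string (stem and ^DB marker excluded)."""
    _stem, _, tag = parse.partition("+")
    return {f for f in re.split(r"\^DB\+|\+", tag) if f}


def score_parses(parses, predictions):
    """Score each candidate parse from per-feature decisions.

    predictions maps feature -> (has_feature, confidence).
    Assumes at least one candidate parse is given.
    """
    feats = {p: features_of(p) for p in parses}
    shared = set.intersection(*feats.values())  # non-distinguishing features
    scores = {}
    for parse, fs in feats.items():
        probs = []
        for f in sorted(fs - shared):
            has_feature, confidence = predictions[f]
            probs.append(confidence if has_feature else 1.0 - confidence)
        scores[parse] = prod(probs)
    return scores


predictions = {
    "P3sg": (True, 0.8953),
    "Nom": (False, 0.9392),
    "Pnon": (False, 0.9503),
    "Acc": (True, 0.8924),
}
scores = score_parses(
    ["masal+Noun+A3sg+P3sg+Nom", "masal+Noun+A3sg+Pnon+Acc"], predictions
)
best = max(scores, key=scores.get)
print(scores)
print(best)
```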


Evaluation

  • Test corpus: 1000 words, hand tagged

  • Accuracy: 95.87% (conf. int: 94.57-97.08)

  • Better than the training data!?


Other Experiments

  • Retraining on own output: 96.03%

  • Training on unambiguous data: 82.57%

  • Forget disambiguation, let’s do tagging with a single decision list: 91.23%, 10000 rules


Contributions

  • Learning morphological disambiguation rules using GPA decision list learner.

  • Reducing data sparseness and increasing noise tolerance using separate models for individual output features.

  • ECOC (error-correcting output codes), WSD (word sense disambiguation), etc.

