ParaMor

Across Morphology

Finding Paradigms

Christian Monson

Turkish Morphology – Beads on a String

götür + ül + m + üyor + sun

  • götür: take
  • ül: passive
  • m: negative
  • üyor: present progressive
  • sun: 2nd person singular

You are not being taken

Applications of Computational Morphology
  • Machine Translation
    • Turkish-English (Oflazer, 2007)
    • Czech-English (Goldwater and McClosky, 2005)
  • Speech Recognition
    • Finnish (Creutz, 2006)
  • Information Retrieval

Challenges of Computational Morphology
  • Time Consuming for a New Language
    • Kemal Oflazer estimates
      • 3-4 months to build basic Turkish analyzer
      • Plus lexicon development and maintenance
  • Expertise Needed
    • Greenlandic
      • Official language of Greenland
      • Agglutinative Inuit language
      • 50,000 speakers
      • Per Langaard

The Solution

Raw Text

Unsupervised Morphology Induction

ParaMor – Paradigm Morphology
  • ParaMor
    • Unsupervised morphology induction system
  • Paradigm
    • The natural structure of morphology

Paradigms – The Structure of Morphology

  • Stem
    • götür: take
  • Voice
    • ül: passive
  • Polarity
    • m: negative
  • Tense & Mood
    • üyor: present progressive
    • yecek: future
  • Person & Number
    • sun: 2nd person singular
    • um: 1st person singular
    • Ø: 3rd person singular
    • uz: 1st person plural
  • Paradigm
    • Set of mutually replaceable strings

The ParaMor Algorithm
  • Identify suffix paradigms in 3 steps
    • Search for candidate paradigms
    • Cluster candidates modeling the same paradigm
    • Filter
  • Segment words
    • Using the discovered paradigms

Search for Candidate Paradigms

  • All character boundaries are candidate morpheme boundaries
  • Begin search with the most frequent word-final string
    • Spanish: autorizaciones, buscabamos, costas, importadoras, vallas
    • s: 10662 stems
  • Identify the most frequent mutually replaceable string
    • Stems that occur with one suffix in a paradigm will likely occur with other suffixes in that paradigm
    • Ø s: 5501 stems
  • Stop adding suffixes
    • When the most frequent mutually replaceable string severely decreases the stem count
    • Ø r s: 287 stems
  • Move on to the next most frequent word-final string
    • a: 8981 stems
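The first two ideas on these slides, that every character boundary is a candidate morpheme boundary and that the search starts from the most frequent word-final string, can be sketched as follows. This is a minimal illustration, not ParaMor's code: the function names are made up for this example, and skipping the empty suffix Ø as a seed is an assumption based on the seeds shown in the slides (s, a, ...).

```python
from collections import defaultdict


def candidate_splits(word):
    """Return every (stem, suffix) split of a word, including the empty suffix."""
    return [(word[:cut], word[cut:]) for cut in range(1, len(word) + 1)]


def rank_word_final_strings(vocabulary):
    """Rank non-empty word-final strings by how many distinct stems precede them."""
    stems_by_suffix = defaultdict(set)
    for word in vocabulary:
        for stem, suffix in candidate_splits(word):
            if suffix:  # assumption: the empty suffix never seeds a search
                stems_by_suffix[suffix].add(stem)
    counts = [(suffix, len(stems)) for suffix, stems in stems_by_suffix.items()]
    # Most stems first; ties broken alphabetically so the order is reproducible.
    return sorted(counts, key=lambda item: (-item[1], item[0]))


# Toy vocabulary, far smaller than the corpus behind the counts in the slides.
words = ["costa", "costas", "valla", "vallas"]
ranking = rank_word_final_strings(words)
```

On a real corpus the top of this ranking would play the role of the seeds s, a, n, ... in the slides.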

Search for Candidate Paradigms

Search paths, shown as suffix set: stem count

  • s: 10662 → Ø s: 5501 → Ø r s: 287
  • a: 8981 → a o: 2304 → a o os: 1410 → a as o os: 892
  • n: 6051 → Ø n: 1874 → Ø n r: 509 → Ø do n r: 354 → Ø dadas do dos n ndo r ron: 118
  • es: 2751 → Ø es: 874
  • an: 1786 → a an: 1049 → a an ar: 413 → a an ar ó: 353 → a ada adas ado ados an ar aron ó: 149
  • rado: 167 → rada rado: 89 → rada rado rados: 67 → rada radas rado rados: 53 → ra rada radas rado rados ran rar raron ró: 23
  • strado: 15 → strada strado: 12 → strada strado stró: 9 → strada strado strar stró: 8 → strada stradas strado strar stró: 7

...
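A greedy search of the kind described above can be sketched in a few lines. This is a toy version under stated assumptions: the slides only say to stop when the stem count "severely decreases", so the `min_ratio` threshold of 0.25 is a placeholder, and `stems_by_suffix` and `grow_paradigm` are names invented for this example.

```python
from collections import defaultdict


def stems_by_suffix(vocabulary):
    """Map each word-final string (including the empty one) to its stems."""
    table = defaultdict(set)
    for word in vocabulary:
        for cut in range(1, len(word) + 1):
            table[word[cut:]].add(word[:cut])
    return table


def grow_paradigm(table, seed, min_ratio=0.25):
    """Grow a suffix set from `seed`, keeping only stems shared by all suffixes."""
    suffixes = [seed]
    stems = set(table[seed])
    while True:
        best, best_stems = None, set()
        for candidate in sorted(table):
            if candidate in suffixes:
                continue
            shared = stems & table[candidate]
            if len(shared) > len(best_stems):
                best, best_stems = candidate, shared
        # Stop when no suffix shares a stem, or the stem count drops too far
        # (min_ratio is a hypothetical threshold, not taken from the slides).
        if best is None or len(best_stems) < min_ratio * len(stems):
            return sorted(suffixes), stems
        suffixes.append(best)
        stems = best_stems


vocabulary = [
    "costa", "costas", "valla", "vallas", "casa", "casas",
    "mesa", "mesas", "gato", "gatos", "costar",
]
suffixes, stems = grow_paradigm(stems_by_suffix(vocabulary), "s")
```

Here the search adds Ø to s and keeps all five stems, then rejects r because only one stem (costa) would survive, which mirrors the drop from Ø s to Ø r s in the slides.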

Cluster Candidates per Paradigm

17: a aba aban ada adas ado ados an ando ar ara aron arse ará arán aría ó

Cosine Similarity: 0.715

532 Covered Types

16: a aba ada adas ado ados an ando ar ara aron arse ará arán aría ó

Cosine Similarity: 0.664

451 Covered Types

15: a aba aban ada adas ado ados an ando ar aron arse ará arán ó

25 Stems: anunci, aplic, apoy, celebr, consider, …

375 Covered Types

15: a aba ada adas ado ados an ando ar ara aron arse ará arán ó

23 Stems: anunci, apoy, confirm, consider, declar, …

345 Covered Types

15: a aba ada adas ado ados an ando ar aron arse ará arán aría ó

22 Stems: anunci, aplic, apoy, celebr, concentr, …

330 Covered Types
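The slides give similarity values but not how they are computed. The reported numbers are consistent with treating each candidate as the set of word types it covers and taking the set cosine |A ∩ B| / sqrt(|A| · |B|); the sketch below checks that reading against the counts on the slide. The integer IDs are stand-ins for word types, chosen only so that the set sizes and unions match the slide.

```python
import math


def cosine_similarity(types_a, types_b):
    """Set cosine: shared covered types over the geometric mean of the set sizes."""
    if not types_a or not types_b:
        return 0.0
    return len(types_a & types_b) / math.sqrt(len(types_a) * len(types_b))


# 345 and 330 covered types whose union has 451 types (so 224 are shared).
first = set(range(0, 345))
second = set(range(121, 451))
merged = first | second

# 375 covered types whose union with the merged cluster has 532 types.
third = set(range(157, 532))

step_one = round(cosine_similarity(first, second), 3)   # slide reports 0.664
step_two = round(cosine_similarity(merged, third), 3)   # slide reports 0.715
```

Covered types for a single candidate equal stems times suffixes (23 × 15 = 345, 22 × 15 = 330, 25 × 15 = 375), which is why the set sizes above are fixed by the slide.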

Filter Candidate Paradigms

  • 2 types of filtering
    • Remove small unclustered candidate paradigms
    • Remove candidates modeling unlikely morpheme boundaries (Harris, 1955)

Segment Words Using Paradigms

  • Word to segment: administradas
  • Paradigm: a ada adas ado ados an ar aron ó ...
    • adas is replaceable by ada: administrada
    • administr +adas
  • Paradigm: a as o os
    • as is replaceable by a: administrada
    • administrad +as
    • Old way: Separate alternative analysis
      • administr +adas, administrad +as
    • New way: Augment the current segmentation
      • administr +ad +as
  • Paradigm: Ø s
    • s is replaceable by Ø: administrada
    • administr +adas, administrad +as, administrada +s
    • administr +ad +a +s
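One reading of these slides is: a boundary is placed before a word-final suffix when swapping that suffix for another suffix of the same paradigm yields an attested word, and boundaries from different paradigms are merged into a single segmentation (the "new way"). The sketch below follows that reading; `segment` is a hypothetical helper, and the first paradigm is truncated exactly where the slide truncates it.

```python
def segment(word, vocabulary, paradigms):
    """Split `word` at every boundary licensed by some paradigm."""
    boundaries = set()
    for paradigm in paradigms:
        for suffix in paradigm:
            # Skip the empty suffix and suffixes that do not end the word.
            if not suffix or len(suffix) >= len(word) or not word.endswith(suffix):
                continue
            stem = word[: len(word) - len(suffix)]
            # The boundary is licensed if another suffix gives an attested word.
            if any(stem + other in vocabulary for other in paradigm if other != suffix):
                boundaries.add(len(stem))
    cuts = [0] + sorted(boundaries) + [len(word)]
    return [word[start:end] for start, end in zip(cuts, cuts[1:])]


vocabulary = {"administrada", "administradas"}
paradigms = [
    ["a", "ada", "adas", "ado", "ados", "an", "ar", "aron", "ó"],
    ["a", "as", "o", "os"],
    ["", "s"],  # "" stands for Ø
]
pieces = segment("administradas", vocabulary, paradigms)
```

Merging the boundaries rather than listing separate analyses is what turns administr +adas, administrad +as, and administrada +s into administr +ad +a +s.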

Morpho Challenge 2007

  • Peer operated competition
    • For unsupervised morphology induction algorithms
  • 4 languages
    • English
    • German
    • Finnish
    • Turkish

ParaMor in Morpho Challenge 2007

  • Developed on Spanish
    • ParaMor’s free parameters were frozen

2 Methods of Evaluation

  • Linguistic
    • Segmentations compared to a morphologically analyzed lexicon
  • Task based
    • Information retrieval
    • Short two-sentence queries
    • About international news topics
    • Binary relevance assessments
    • About 50 queries and 20K relevance judgements for each language

Linguistic Evaluation

[Figure: bar charts of F1 for ParaMor, Morfessor, ParaMor & Morfessor, and Bernhard 2. Bar values shown: 60.8, 56.3, 53.4, 52.9, 52.0, 48.5, 48.2, 24.7.]
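F1 is the harmonic mean of precision and recall. The slides do not say how precision and recall are measured against the analyzed lexicon, so the sketch below covers only the final combination step; the input values are arbitrary illustrations, not competition results.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


balanced = f1_score(0.5, 0.5)
skewed = f1_score(0.6, 0.4)
```

Because the harmonic mean is pulled toward the smaller input, a system cannot reach a high F1 by over-segmenting (high recall, low precision) or under-segmenting (the reverse).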

IR Evaluation (TF/IDF)

[Figure: bar charts of Average Precision for ParaMor, Morfessor, Morfessor Baseline, ParaMor & Morfessor, and McNamee, with "No Morphological Analysis" reference lines at 27.0, 30.7, and 32.0. Bar values appearing in the slides: 41.2, 38.8, 38.3, 38.2, 37.2, 32.1, 29.3, 28.9, 26.4.]
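With binary relevance judgements, average precision for one query can be computed from the ranked result list. The slides do not state which variant was used, so this sketch assumes the standard uninterpolated form: the mean of precision at each rank where a relevant document appears. The function name and the optional `total_relevant` argument are choices made for this example.

```python
def average_precision(ranked_relevance, total_relevant=None):
    """Uninterpolated average precision for one ranked list of True/False judgements."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    # Divide by all relevant documents if known, else by those retrieved.
    denominator = total_relevant if total_relevant is not None else hits
    return precision_sum / denominator if denominator else 0.0


example = average_precision([True, False, True])
```

Averaging this value over all queries for a language gives a single score comparable across systems.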

ParaMor: State-of-the-Art Unsupervised Morphology Induction System

  • Combined system among the best in Morpho Challenge 2007
  • Consistent across languages
  • Better than no morphology
    • Task based (IR) measure

Many Future Directions

  • Improve Performance
    • F1 of 50-60% is state-of-the-art!
    • Inflection classes
    • Morphophonology
  • Beyond beads-on-a-string