
ParaMor - PowerPoint PPT Presentation



ParaMor

Across Morphology

Finding Paradigms

Christian Monson


Turkish Morphology – Beads on a String

götür

ül

m

üyor

sun

present progressive

2nd person singular

take

passive

negative

You are not being taken
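The "beads on a string" picture can be sketched in a few lines of Python. This is only an illustration of the slide's example: the names `BEADS` and `string_beads` are made up here, and the morpheme forms are copied from the slide as-is.

```python
# Morphemes of the slide's example, in surface order, paired with their glosses.
BEADS = [
    ("götür", "take"),
    ("ül", "passive"),
    ("m", "negative"),
    ("üyor", "present progressive"),
    ("sun", "2nd person singular"),
]


def string_beads(beads):
    """Concatenate the morphemes left to right, like beads on a string."""
    return "".join(morpheme for morpheme, _ in beads)


word = string_beads(BEADS)
gloss = " + ".join(meaning for _, meaning in BEADS)
print(word)   # the full surface word
print(gloss)  # the gloss of each bead, in order
```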


Applications of Computational Morphology

  • Machine Translation

    • Turkish-English (Oflazer, 2007)

    • Czech-English (Goldwater and McClosky, 2005)

  • Speech Recognition

    • Finnish (Creutz, 2006)

  • Information Retrieval


Challenges of Computational Morphology

  • Time Consuming for a New Language

    • Kemal Oflazer estimates

      • 3-4 months to build basic Turkish analyzer

      • Plus lexicon development and maintenance

  • Expertise Needed

    • Greenlandic

      • Official language of Greenland

      • Agglutinative Inuit language

      • 50,000 speakers

      • Per Langaard


The Solution

Raw Text

Unsupervised Morphology Induction


ParaMor – Paradigm Morphology

  • ParaMor

    • Unsupervised morphology induction system

  • Paradigm

    • The natural structure of morphology


Paradigms – The Structure of Morphology

Tense & Mood

Person & Number

Stem

Voice

Polarity

ül

m

üyor

sun

götür

present progressive

2nd person singular

take

passive

negative


Paradigms – The Structure of Morphology

Tense & Mood

Person & Number

Stem

Voice

Polarity

ül

m

üyor

um

götür

um

present progressive

take

passive

negative

1st person singular


Paradigms – The Structure of Morphology

Tense & Mood

Person & Number

Stem

Voice

Polarity

ül

m

üyor

um

götür

um

Ø

present progressive

take

passive

negative

3rd person singular


Paradigms – The Structure of Morphology

Tense & Mood

Person & Number

Stem

Voice

Polarity

ül

m

üyor

um

götür

um

Ø

uz

present progressive

take

passive

negative

1st person plural


Paradigms – The Structure of Morphology

Tense & Mood

Person & Number

Stem

Voice

Polarity

ül

m

üyor

um

götür

um

Ø

uz

present progressive

take

passive

negative


Paradigms – The Structure of Morphology

Tense & Mood

Person & Number

Stem

Voice

Polarity

ül

m

üyor

um

götür

yecek

um

Ø

uz

take

passive

negative

future


Paradigms – The Structure of Morphology

Tense & Mood

Person & Number

Stem

Voice

Polarity

ül

m

üyor

um

götür

yecek

um

Ø

uz

take

passive

negative


Paradigms – The Structure of Morphology

Tense & Mood

Person & Number

Stem

Voice

Polarity

ül

m

üyor

um

yecek

um

Ø

uz


Paradigms – The Structure of Morphology

Paradigms

ül

m

üyor

um

yecek

um

Ø

uz


Paradigms – The Structure of Morphology

Paradigms

  • Paradigm

    • Set of mutually replaceable strings

ül

m

üyor

um

yecek

um

Ø

uz


Paradigms – The Structure of Morphology

Paradigm

  • Paradigm

    • Set of mutually replaceable strings

ül

m

üyor

um

yecek

um

Ø

uz
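A paradigm as a "set of mutually replaceable strings" can be illustrated by treating each slot as a list of alternatives and expanding every combination. This is a sketch only: the slot contents are taken from the slides, the variable names and `expand` are hypothetical, and the plain concatenation ignores vowel harmony and other morphophonology, so some generated strings are not well-formed Turkish.

```python
from itertools import product

# One list per slot; "" stands for the null suffix (Ø on the slides).
STEM = ["götür"]
VOICE = ["ül"]                            # passive
POLARITY = ["m"]                          # negative
TENSE_MOOD = ["üyor", "yecek"]            # present progressive, future
PERSON_NUMBER = ["sun", "um", "", "uz"]   # 2sg, 1sg, 3sg (Ø), 1pl


def expand(slots):
    """Pick one string per slot, in every possible way, and concatenate."""
    return ["".join(parts) for parts in product(*slots)]


forms = expand([STEM, VOICE, POLARITY, TENSE_MOOD, PERSON_NUMBER])
for form in forms:
    print(form)
```

Swapping one member of a slot for another (for example `sun` for `um`) changes only that slot's contribution, which is the sense in which the strings are mutually replaceable.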


The ParaMor Algorithm

  • Identify suffix paradigms in 3 steps

    • Search for candidate paradigms

    • Cluster candidates modeling the same paradigm

    • Filter

  • Segment words

    • Using the discovered paradigms



Search for Candidate Paradigms

  • All character boundaries are candidate morpheme boundaries
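As a minimal sketch of this starting point, every position in a word can be treated as a possible stem/suffix boundary. The function name `candidate_splits` is hypothetical, and requiring a non-empty stem is an assumption made here.

```python
def candidate_splits(word):
    """Return every (stem, suffix) split of a word.

    The stem is assumed to be non-empty; the suffix may be empty,
    which corresponds to the null suffix Ø on the slides.
    """
    return [(word[:i], word[i:]) for i in range(1, len(word) + 1)]


splits = candidate_splits("costas")
for stem, suffix in splits:
    print(stem, "+", suffix or "Ø")
```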



Search for Candidate Paradigms

  • Begin search with the most frequent word-final string

Spanish

autorizaciones

buscabamos

costas

importadoras

vallas

s

10662



Search for Candidate Paradigms

  • Identify the most frequent mutually replaceable string

    • Stems that occur with one suffix in a paradigm will likely occur with other suffixes in that paradigm

Spanish

autorizaciones

buscabamos

costas

importadoras

vallas

Ø s

5501

s

10662



Search for Candidate Paradigms

  • Stop adding suffixes

    • When the most frequent mutually replaceable string severely decreases the stem count.

Ø r s

287

autorizaciones

buscabamos

costas

importadoras

vallas

Ø s

5501

s

10662
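The search described on these slides can be sketched as a greedy loop: start from one word-final string, repeatedly add the suffix that keeps the most stems, and stop when the stem count drops too sharply. This is an illustration, not ParaMor's implementation. The function names and the `min_keep` threshold of 0.25 are assumptions; the slides do not say how "severely" is measured.

```python
from collections import defaultdict


def stems_by_suffix(vocabulary):
    """Map every word-final string (including "") to the stems it attaches to."""
    table = defaultdict(set)
    for word in vocabulary:
        for i in range(1, len(word) + 1):
            table[word[i:]].add(word[:i])
    return table


def grow_paradigm(seed, table, min_keep=0.25):
    """Greedily grow a candidate paradigm from a seed suffix.

    min_keep is a hypothetical threshold: growth stops when the best
    next suffix would keep less than this fraction of the current stems.
    """
    suffixes = [seed]
    stems = set(table[seed])
    while True:
        best, best_stems = None, set()
        for candidate, candidate_stems in table.items():
            if candidate in suffixes:
                continue
            shared = stems & candidate_stems
            if len(shared) > len(best_stems):
                best, best_stems = candidate, shared
        if best is None or len(best_stems) < min_keep * len(stems):
            return sorted(suffixes), stems
        suffixes.append(best)
        stems = best_stems


vocabulary = ["costa", "costas", "valla", "vallas", "casa", "casas", "mesa"]
suffixes, stems = grow_paradigm("s", stems_by_suffix(vocabulary))
print(suffixes, sorted(stems))
```

With the stem counts shown on the slides, going from s (10662) to Ø s (5501) keeps about 52% of the stems, while going from Ø s (5501) to Ø r s (287) keeps about 5%, so a threshold in this range would stop the growth at Ø s.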



Search for Candidate Paradigms

  • Move on to the next most frequent word-final string

Ø r s

287

Ø s

5501

a

8981

s

10662



Search for Candidate Paradigms

a as o os

892

a o os

1410

Ø r s

287

a o

2304

Ø s

5501

a

8981

s

10662



Search for Candidate Paradigms

Ø dadas do dos n ndo r ron

118

a as o os

892

Ø do n r

354

Ø n r

509

a o os

1410

Ø r s

287

Ø n

1874

a o

2304

Ø s

5501

n

6051

a

8981

s

10662



Search for Candidate Paradigms

Ø dadas do dos n ndo r ron

118

a as o os

892

Ø do n r

354

Ø n r

509

a o os

1410

Ø r s

287

Ø es

874

Ø n

1874

a o

2304

Ø s

5501

es

2751

n

6051

a

8981

s

10662



Search for Candidate Paradigms

a ada adas ado ados an ar aron ó

149

Ø dadas do dos n ndo r ron

118

a an ar ó

353

a as o os

892

Ø do n r

354

a an ar

413

Ø n r

509

a o os

1410

Ø r s

287

a an

1049

Ø es

874

Ø n

1874

a o

2304

Ø s

5501

an

1786

es

2751

n

6051

a

8981

s

10662



Search for Candidate Paradigms

ra rada radas rado rados ran rar raron ró

23

a ada adas ado ados an ar aron ó

149

Ø dadas do dos n ndo r ron

118

strada stradas strado strar stró

7

a an ar ó

353

a as o os

892

strada strado strar stró

8

rada radas rado rados

53

Ø do n r

354

strada strado stró

9

rada rado rados

67

a an ar

413

Ø n r

509

a o os

1410

Ø r s

287

strada strado

12

rada rado

89

a an

1049

Ø es

874

Ø n

1874

a o

2304

Ø s

5501

strado

15

rado

167

an

1786

es

2751

n

6051

a

8981

s

10662

...

...


Cluster Candidates per Paradigm

15: a aba ada adas ado ados an ando ar ara aron arse ará arán ó

23 Stems: anunci, apoy, confirm, consider, declar, …

345 Covered Types

15: a aba ada adas ado ados an ando ar aron arse ará arán aría ó

22 Stems: anunci, aplic, apoy, celebr, concentr, …

330 Covered Types


Cluster Candidates per Paradigm

16: a aba ada adas ado ados an ando ar ara aron arse ará arán aría ó

Cosine Similarity: 0.664

451 Covered Types

15: a aba ada adas ado ados an ando ar ara aron arse ará arán ó

23 Stems: anunci, apoy, confirm, consider, declar, …

345 Covered Types

15: a aba ada adas ado ados an ando ar aron arse ará arán aría ó

22 Stems: anunci, aplic, apoy, celebr, concentr, …

330 Covered Types


Cluster Candidates per Paradigm

16: a aba ada adas ado ados an ando ar ara aron arse ará arán aría ó

Cosine Similarity: 0.664

451 Covered Types

15: a aba aban ada adas ado ados an ando ar aron arse ará arán ó

25 Stems: anunci, aplic, apoy, celebr, consider, …

375 Covered Types

15: a aba ada adas ado ados an ando ar ara aron arse ará arán ó

23 Stems: anunci, apoy, confirm, consider, declar, …

345 Covered Types

15: a aba ada adas ado ados an ando ar aron arse ará arán aría ó

22 Stems: anunci, aplic, apoy, celebr, concentr, …

330 Covered Types


Cluster Candidates per Paradigm

17: a aba aban ada adas ado ados an ando ar ara aron arse ará arán aría ó

Cosine Similarity: 0.715

532 Covered Types

16: a aba ada adas ado ados an ando ar ara aron arse ará arán aría ó

Cosine Similarity: 0.664

451 Covered Types

15: a aba aban ada adas ado ados an ando ar aron arse ará arán ó

25 Stems: anunci, aplic, apoy, celebr, consider, …

375 Covered Types

15: a aba ada adas ado ados an ando ar ara aron arse ará arán ó

23 Stems: anunci, apoy, confirm, consider, declar, …

345 Covered Types

15: a aba ada adas ado ados an ando ar aron arse ará arán aría ó

22 Stems: anunci, aplic, apoy, celebr, concentr, …

330 Covered Types
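The cosine similarities on these slides can be reproduced from the covered-type counts alone, under two assumptions made here: each candidate is a binary vector over the word types it covers (stems × suffixes, e.g. 23 × 15 = 345), and a merged cluster covers the union of its children's types. The function names are hypothetical.

```python
from math import sqrt


def cosine(types_a, types_b):
    """Cosine similarity of two candidates viewed as binary vectors over word types."""
    return len(types_a & types_b) / sqrt(len(types_a) * len(types_b))


def cosine_from_counts(size_a, size_b, size_union):
    """Same measure, using only the counts printed on the slides.

    The overlap is recovered by inclusion-exclusion, assuming the merged
    cluster's covered types are the union of the two children.
    """
    shared = size_a + size_b - size_union
    return shared / sqrt(size_a * size_b)


first_merge = round(cosine_from_counts(345, 330, 451), 3)
second_merge = round(cosine_from_counts(451, 375, 532), 3)
print(first_merge, second_merge)  # 0.664 0.715
```

Both values agree with the similarities shown for the 16-suffix and 17-suffix clusters, which suggests the slides' numbers are at least consistent with this reading.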



Filter Candidate Paradigms

  • 2 types of filtering

    • Remove small unclustered candidate paradigms

    • Remove candidates modeling unlikely morpheme boundaries (Harris, 1955)
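The idea usually attributed to Harris (1955) is that the number of distinct letters that can follow a word-initial string tends to rise at a morpheme boundary. The sketch below only computes that count; how ParaMor turns it into a filter is not stated on the slides, so `successor_variety` and the end-of-word marker `#` are illustrative assumptions.

```python
from collections import defaultdict


def successor_variety(vocabulary):
    """Map each word-initial string to the set of characters that follow it.

    "#" marks the end of a word, so a complete word counts as one
    possible continuation of itself.
    """
    followers = defaultdict(set)
    for word in vocabulary:
        for i in range(1, len(word) + 1):
            followers[word[:i]].add(word[i] if i < len(word) else "#")
    return followers


vocabulary = ["administra", "administrada", "administrado",
              "administrar", "administro"]
followers = successor_variety(vocabulary)
for prefix in ("adminis", "administr", "administra"):
    print(prefix, sorted(followers[prefix]))
```

In this toy vocabulary only one letter can follow `adminis`, while several can follow `administr` and `administra`, so a boundary inside `adminis|tr` would look unlikely.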




Segment Words Using Paradigms

administradas

a ada adas ado ados an ar aron ó ...



Segment Words Using Paradigms

administrada

administradas

a ada adas ado ados an ar aron ó ...



Segment Words Using Paradigms

administrada

administradas

administr +adas

a ada adas ado ados an ar aron ó ...



Segment Words Using Paradigms

administrada

administradas

administr +adas

a as o os



Segment Words Using Paradigms

administrada

administradas

administr +adas, administrad +as

Old way: Separate alternative analysis

a as o os



Segment Words Using Paradigms

administrada

administradas

administr +adas, administrad +as

New way: Augment the current segmentation

administr +ad +as

a as o os



Segment Words Using Paradigms

administrada Ø

administradas

administr +adas, administrad +as, administrada +s

administr +ad +a +s

Ø s
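The walk-through above can be sketched as follows: a boundary is proposed wherever a word ends in a suffix from some paradigm and swapping in another suffix from the same paradigm yields a word in the vocabulary; the "new way" then places all proposed boundaries in a single segmentation. The function `segment` and the tiny vocabulary are assumptions for illustration, not ParaMor's code.

```python
def segment(word, paradigms, vocabulary):
    """Segment a word at every boundary licensed by some paradigm."""
    cuts = set()
    for paradigm in paradigms:
        for suffix in paradigm:
            # The null suffix "" would only propose a cut at the word's end.
            if suffix == "" or not word.endswith(suffix):
                continue
            stem = word[: len(word) - len(suffix)]
            if not stem:
                continue
            # Mutual replaceability: the stem must occur with another suffix.
            if any(stem + other in vocabulary
                   for other in paradigm if other != suffix):
                cuts.add(len(stem))
    pieces, start = [], 0
    for cut in sorted(cuts):
        pieces.append(word[start:cut])
        start = cut
    pieces.append(word[start:])
    return " +".join(pieces)


paradigms = [
    {"a", "ada", "adas", "ado", "ados", "an", "ar", "aron", "ó"},
    {"a", "as", "o", "os"},
    {"", "s"},
]
vocabulary = {"administrada", "administradas"}
result = segment("administradas", paradigms, vocabulary)
print(result)  # administr +ad +a +s
```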



Morpho Challenge 2007

  • Peer operated competition

    • For unsupervised morphology induction algorithms

  • 4 languages

    • English

    • German

    • Finnish

    • Turkish



ParaMor in Morpho Challenge 2007

  • Developed on Spanish

    • ParaMor’s free parameters were frozen



2 Methods of Evaluation

  • Linguistic

    Segmentations compared to a morphologically analyzed lexicon
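The F1 scores reported later combine precision and recall. The competition's exact scoring procedure is not given in the slides, so the sketch below uses a simplified boundary-level comparison against a gold segmentation; the function names and the gold analysis in the example are hypothetical.

```python
def boundaries(segmentation):
    """Character offsets of the morpheme boundaries in a '+'-separated string."""
    cuts, offset = set(), 0
    for piece in segmentation.split("+")[:-1]:
        offset += len(piece.strip())
        cuts.add(offset)
    return cuts


def f1(predicted, gold):
    """Harmonic mean of boundary precision and boundary recall."""
    predicted_cuts, gold_cuts = boundaries(predicted), boundaries(gold)
    hits = len(predicted_cuts & gold_cuts)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted_cuts)
    recall = hits / len(gold_cuts)
    return 2 * precision * recall / (precision + recall)


# One of three (hypothetical) gold boundaries found, with no false alarms:
# precision 1.0, recall 1/3.
score = f1("administr +adas", "administr +ad +a +s")
print(score)
```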



2 Methods of Evaluation

  • Task based

    Information retrieval

    • Short two-sentence queries

    • About international news topics

    • Binary relevance assessments

    • About 50 queries and 20K relevance judgments for each language.
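With binary relevance judgments, average precision for one query can be computed from the ranked result list as below. This is the standard textbook definition, shown as a sketch; `average_precision` and its `total_relevant` parameter are names chosen here, and the slides do not describe the evaluation code actually used.

```python
def average_precision(ranked_relevance, total_relevant=None):
    """Average precision for one query.

    ranked_relevance: booleans, one per retrieved document, best rank first.
    total_relevant: number of relevant documents for the query; if omitted,
    only the relevant documents that were retrieved are counted.
    """
    hits, precision_sum = 0, 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank
    denominator = total_relevant if total_relevant is not None else hits
    return precision_sum / denominator if denominator else 0.0


# Relevant documents at ranks 1 and 3: (1/1 + 2/3) / 2.
ap = average_precision([True, False, True, False])
print(ap)
```

Averaging this value over all queries for a language gives one summary score per system.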


Linguistic Evaluation

[Bar charts, built up over several slides: F1 for Bernhard 2, Morfessor, ParaMor, and ParaMor & Morfessor. F1 values shown: 60.8, 56.3, 53.4, 52.9, 52.0, 50.7, 50.6, 48.5, 48.2, 47.2, 24.7.]


IR Evaluation (TF/IDF)

[Bar charts, built up over several slides: average precision for Morfessor, Morfessor Baseline, ParaMor, ParaMor & Morfessor (P & M), and McNamee. Values shown: 41.2, 38.8, 38.3, 38.2, 37.2, 32.1, 29.3, 28.9, 26.4. Reference lines for no morphological analysis: 27.0, 30.7, 32.0.]



ParaMor: State-of-the-Art Unsupervised Morphology Induction System

  • Combined system among the best in Morpho Challenge 2007

  • Consistent across languages

  • Better than no morphology

    • Task based (IR) measure


Many Future Directions

  • Improve Performance

    • F1 of 50-60% is state-of-the-art!

    • Inflection classes

    • Morphophonology

  • Beyond beads-on-a-string


Thank You!