Towards unsupervised induction of morphophonological rules. Erwin Chan, University of Pennsylvania. Morphochallenge workshop, 19 Sept 2007.

Towards unsupervised induction of morphophonological rules

Erwin Chan

University of Pennsylvania

Morphochallenge workshop

19 Sept 2007

Goals of unsupervised morphology induction
1. Provide analysis of input data
2. Analyzer for unseen data

Key task: generalize analysis of input data by inducing phonological characteristics

Example: inducing phonology (English plural nouns)

1. Input corpus

processes, witnesses, matches, hatches, maids, ferns, mates

2. Induce segmentation

process.es, witness.es, match.es, hatch.es, maid.s, fern.s, mate.s

3. Induce phonology

es: ends in ch or sh

s: other characters

4. Apply to novel words

bench.es, fate.s, foe.s, wish.es
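
The four steps above can be sketched in Python. This is only an illustration under simplifying assumptions: the segmentations of step 2 are given rather than induced, the "phonology" of a stem is approximated by its last two letters, and the function names `induce_contexts` and `pluralize` are hypothetical.

```python
from collections import defaultdict

# Step 2 output, copied from the slide (stem.suffix).
SEGMENTED = ["process.es", "witness.es", "match.es", "hatch.es",
             "maid.s", "fern.s", "mate.s"]

def induce_contexts(segmented, width=2):
    """Map each suffix to the set of stem endings (last `width` letters) seen with it."""
    contexts = defaultdict(set)
    for item in segmented:
        stem, suffix = item.split(".")
        contexts[suffix].add(stem[-width:])
    return contexts

def pluralize(stem, contexts, default="s", width=2):
    """Attach a non-default suffix if the stem ending was observed with it."""
    for suffix, endings in contexts.items():
        if suffix != default and stem[-width:] in endings:
            return f"{stem}.{suffix}"
    return f"{stem}.{default}"

contexts = induce_contexts(SEGMENTED)
print([pluralize(w, contexts) for w in ["bench", "fate", "foe", "wish"]])
```

With these seven training words the sketch produces bench.es, fate.s and foe.s, but wish.s, because no stem ending in sh was observed. Getting wish.es requires generalizing over a class of sounds rather than memorizing endings, which is the kind of phonological induction this slide is about.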

Base-and-transforms model of morphological paradigms

Apply transforms to base forms to generate inflections

[Figure: three lexemes (Lexeme 1, Lexeme 2, Lexeme 3), each with its own base form (base 1, base 2, base 3) and the same transforms t1 to t5]

Base forms
  • Base form serves as lexical entry for all inflections of a lexeme

e.g. base of {help, helps, helping, helped} is help

  • Same fine-grained POS type for all lexemes

e.g. “nominative singular” for all nouns

Transforms
  • Generates inflected form from base
  • Format: ( A, B )

A, B: simple regular expressions

A: characters in the base form to be replaced

B: characters that replace them in the inflected form
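
A minimal sketch of how a transform ( A, B ) could be applied to a base form. The slides do not spell out the notation, so two readings are assumed here: "$" marks the end of the word (nothing is removed), and a pattern written between asterisks, such as ( *a*, *u* ), replaces a substring inside the word. The function name `apply_transform` is hypothetical.

```python
def apply_transform(base, transform):
    """Apply a transform (A, B) to a base form; return None if A does not match."""
    a, b = transform
    if a.startswith("*") and a.endswith("*") and len(a) > 2:
        # Assumed reading of the non-concatenative case, e.g. ("*a*", "*u*"):
        # replace the first internal occurrence of A with B.
        inner_a, inner_b = a.strip("*"), b.strip("*")
        if inner_a not in base:
            return None
        return base.replace(inner_a, inner_b, 1)
    if a == "$":
        # "$" is assumed to mark the end of the base: append B, remove nothing.
        return base + b
    if not base.endswith(a):
        return None
    # Suffix replacement: drop A from the end of the base, then append B.
    return base[: len(base) - len(a)] + ("" if b == "$" else b)

for base, transform in [("eat", ("$", "ing")), ("time", ("$", "s")),
                        ("time", ("e", "ing")), ("hang", ("*a*", "*u*"))]:
    print(base, transform, apply_transform(base, transform))
```

Applied to the base forms eat, time and hang, the four transforms yield eating, times, timing and hung.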

Transform examples

Base form   Inflected   Transform
eat         eating      ( $, ing )
time        times       ( $, s )
time        timing      ( e, ing )
hang        hung        ( *a*, *u* )   non-concat

Comparison to phonological rules
  • Standard rewrite rule: A → B / C _ D

1. A → B: rewrite operation

2. C _ D: phonological context of application

  • A transform is an ungeneralized rule

A → B / { set of base forms }

  • Future work: induce phonological rules

Learn generalized phonological properties of base forms
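
The idea of a transform as a rule whose context is a plain set of base forms can be sketched as follows. The helper `suffix_transform` is hypothetical and handles only suffix changes; it assumes "$" stands for an empty string at the end of the word.

```python
from collections import defaultdict
from os.path import commonprefix

def suffix_transform(base, inflected):
    """Return the suffix transform (A, B) that rewrites base into inflected."""
    stem = commonprefix([base, inflected])   # longest shared prefix, character by character
    a = base[len(stem):] or "$"              # "$" stands for the empty end of the word
    b = inflected[len(stem):] or "$"
    return (a, b)

pairs = [("eat", "eating"), ("time", "times"), ("time", "timing"), ("save", "saving")]

# Each rule A -> B is stored with the set of base forms it applies to,
# i.e. the ungeneralized context { set of base forms }.
rules = defaultdict(set)
for base, inflected in pairs:
    rules[suffix_transform(base, inflected)].add(base)

for (a, b), bases in sorted(rules.items()):
    print(f"{a} -> {b} / {sorted(bases)}")
```

Generalizing such a rule would mean replacing the listed base forms with a shared phonological property, for example "ends in e" for ( e, ing ).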

Compare with stem-suffix model
  • Stem-suffix
    • saves = save + s
    • saving = sav + ing

Drawback: multiple lexical representations (save, sav) for one lexeme

  • Base-transform
    • saves = save + ( $,s )
    • saving = save + ( e,ing )
Limitations of model
  • Simple morphotactic structure:
    • assumes one suffix
    • a word is either a base form, or inflected from a base form

  • Does not account for:
    • agglutination
    • compounds
    • prefixing
    • irregulars, suppletion
Distribution of morphological forms
  • What information is available in corpora for learning?
  • Is there structure within the distribution of morphological forms that a learner can exploit?
  • Examine annotated corpora for several languages
Spanish newswire verbs

Sparse data

[Figure: Log(freq) of Spanish newswire verb forms; axes: Lemma, Inflection]

High frequency of base form

Most frequent inflection (in types) often matches intuitions of what inflection a base form should be

Slovene: A.Pos.Nom.Sg.Indef, N.Nom.Sg, V.Main.Ind.Pres.3.Sg

Swedish: A.Pos.Sg.Indef.Nom, N.Sg.Indef.Nom, V.Inf.Act

Spanish: A.Sg, N.Sg, V.Inf
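
A small sketch of what counting inflections "in types" means: each distinct word form counts once per tag, regardless of how often it occurs. The tokens and tags below are toy data invented for illustration, not taken from the corpora in the talk, and `type_counts` is a hypothetical helper.

```python
from collections import defaultdict

# Toy (word, tag) tokens, invented for illustration only.
tokens = [("hablar", "V.Inf"), ("comer", "V.Inf"), ("vivir", "V.Inf"),
          ("hablar", "V.Inf"), ("habla", "V.Pres.3.Sg"), ("come", "V.Pres.3.Sg"),
          ("habla", "V.Pres.3.Sg"), ("habla", "V.Pres.3.Sg"),
          ("vivieron", "V.Past.3.Pl")]

def type_counts(tokens):
    """Count distinct word types (not tokens) per inflectional tag."""
    types = defaultdict(set)
    for word, tag in tokens:
        types[tag].add(word)
    return {tag: len(words) for tag, words in types.items()}

counts = type_counts(tokens)
# The tag with the most word types is the candidate base category.
base_category = max(counts, key=counts.get)
print(counts, base_category)
```

In this toy sample V.Inf and V.Pres.3.Sg are tied in tokens (four each), but V.Inf has more types (three against two), so the type count is what singles it out as the base category.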
Goals of induction algorithm
  • Select words from corpus to be base forms
  • Formulate transforms

Technique: take advantage of high type frequency of base inflectional category

[Figure: start state and end state of the algorithm]

Start state: Transforms = {}; words are unmodeled

End state: Transforms = {($,s), ($,’s), …}; words are base forms, inflected forms, or unmodeled

Greedy algorithm

At each iteration,

  • construct potential transforms
  • add the transform(s) that accounts for most data
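
The loop above can be sketched as follows. This is a simplified reading, not the author's implementation: candidates are restricted to suffix transforms, "accounts for most data" is taken to mean the number of still-unmodeled words a transform explains, and the names `candidate_transforms` and `greedy_induce` as well as the parameters `max_suffix` and `min_stem` are assumptions.

```python
from collections import defaultdict

def candidate_transforms(vocab, max_suffix=3, min_stem=3):
    """Propose suffix transforms (A, B) and the (base, inflected) pairs each covers."""
    found = defaultdict(set)
    for base in vocab:
        for cut in range(max_suffix + 1):
            stem, a = base[:len(base) - cut], base[len(base) - cut:]
            if len(stem) < min_stem:
                break
            for other in vocab:
                b = other[len(stem):]
                if (other != base and other.startswith(stem)
                        and 0 < len(b) <= max_suffix
                        and not (a and b[0] == a[0])):   # keep A and B minimal
                    found[(a or "$", b)].add((base, other))
    return found

def greedy_induce(words):
    vocab = set(words)
    candidates = candidate_transforms(vocab)
    bases, inflected, unmodeled, grammar = set(), set(), set(vocab), []

    def gain(transform):
        # Pairs that explain a still-unmodeled word from a word not already inflected.
        return sum(1 for base, infl in candidates[transform]
                   if infl in unmodeled and base not in inflected)

    while candidates:
        best = max(sorted(candidates), key=gain)   # sorted() makes ties deterministic
        if gain(best) == 0:
            break
        for base, infl in candidates[best]:
            if infl in unmodeled and base not in inflected:
                bases.add(base)
                inflected.add(infl)
                unmodeled -= {base, infl}
        grammar.append(best)
    return grammar, bases, inflected, unmodeled

words = ["walk", "walks", "walking", "talk", "talks", "talking",
         "time", "times", "timing", "jump", "jumps"]
grammar, bases, inflected, unmodeled = greedy_induce(words)
print(grammar)
```

On this toy word list the transforms are added in the order ( $, s ), ( $, ing ), ( e, ing ), and every word ends up as either a base form or an inflected form.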
Sources of words for transform

[Figure: two panels. Current grammar: base, inflected, unmodeled. New transform: base, inflected.]

Choose direction of transform

Table for ( $, s )

Base greater: 3750, Inflected greater: 817

Choose ( $, s ) instead of ( s, $ )
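
One way to read the table for ( $, s ) is as a vote over word pairs: for each pair related by the transform, check which member is more frequent in the corpus, and orient the transform so that the more frequent side is the base. That reading of "greater" as corpus frequency is an assumption, as are the function name `choose_direction` and the toy frequencies below.

```python
def choose_direction(pairs, freq, transform):
    """pairs: (x, y) word pairs related by transform (A, B), with x the candidate base."""
    a, b = transform
    base_greater = sum(1 for x, y in pairs if freq[x] > freq[y])
    inflected_greater = sum(1 for x, y in pairs if freq[y] > freq[x])
    # Keep (A, B) if the candidate base wins more often; otherwise flip to (B, A).
    chosen = (a, b) if base_greater >= inflected_greater else (b, a)
    return chosen, base_greater, inflected_greater

# Toy frequencies, invented for illustration only.
freq = {"dog": 50, "dogs": 20, "cat": 40, "cats": 45, "tree": 30, "trees": 10}
pairs = [("dog", "dogs"), ("cat", "cats"), ("tree", "trees")]
print(choose_direction(pairs, freq, ("$", "s")))
```

With the counts on the slide (3750 against 817) the same comparison keeps ( $, s ) rather than ( s, $ ).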
Morphochallenge English data
  • High number of word types ( ~250,000 ) leads to spurious transforms

  • ( $, a )

(music, musica) (naam,naama)

(nucci,nuccia) (retin,retina)

(mash,masha) (gab,gaba)

  • ( $, o )

(rutili,rutilio) (lazar,lazaro)

(vern,verno) (berk,berko)

(rikky,rikkyo) (economic,economico)

Summary
  • Base-and-transforms model of morphological paradigms
    • First step towards learning morphophonological rules
    • More linguistically satisfying than stem-and-suffix
  • Algorithm:
    • learn inventory of base forms
    • learn transforms (base-specific rules)
  • Exploits high freq. of base inflectional category
More slides available…
  • Longer version of this presentation
    • base forms simplify POS induction
  • Different system: transforms in parallel
    • Slovene, Spanish