Minimally Supervised Morphological Analysis by Multimodal Alignment

David Yarowsky and Richard Wicentowski

- The algorithm is capable of inducing inflectional morphological analyses of both regular and highly irregular forms.
- The algorithm combines four original alignment models based on:
  - Relative corpus frequency.
  - Contextual similarity.
  - Weighted string similarity.
  - Incrementally retrained inflectional transduction probabilities.

- Task definition.
- Required and Optional resources.
- The Algorithm.
- Empirical Evaluation.

Consider this task as three steps:

- Estimate a probabilistic alignment between inflected forms and root forms.
- Train a supervised morphological analysis learner on a weighted subset of these aligned pairs.
- Use the result from step 2 to iteratively refine the alignment in step 1.
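The three steps above form a loop, which can be sketched as follows. This is only a skeleton under stated assumptions: `refine_alignment`, `estimate_alignment`, `train_analyzer` and the toy stand-ins are hypothetical names introduced for illustration, and the fixed iteration count is an assumption, not something the slides specify.

```python
def refine_alignment(estimate_alignment, train_analyzer, n_iterations=3):
    """Alternate between alignment (step 1) and supervised training (step 2).

    estimate_alignment(analyzer) -> dict mapping inflection to root
    train_analyzer(alignment)    -> any object usable by estimate_alignment
    Both callables are placeholders supplied by the caller.
    """
    analyzer = None
    alignment = estimate_alignment(analyzer)      # step 1, no analyzer yet
    for _ in range(n_iterations):
        analyzer = train_analyzer(alignment)      # step 2
        alignment = estimate_alignment(analyzer)  # step 3: refine step 1
    return alignment, analyzer


# Toy stand-ins (invented behaviour): the first pass picks a wrong root,
# and any later pass that has a trained analyzer corrects it.
def toy_estimate(analyzer):
    if analyzer is None:
        return {"sang": "singe"}
    return {"sang": "sing"}


def toy_train(alignment):
    return dict(alignment)


final_alignment, final_analyzer = refine_alignment(
    toy_estimate, toy_train, n_iterations=2
)
```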

- Definitions:

- The target output of step 1:

- For the given language we need:
  - A table of the inflectional parts of speech (POS).
  - A list of the canonical suffixes.
- A large text corpus.
- A list of the candidate noun, verb and adjective roots (from a dictionary), and any rough mechanism, not based on morphological analysis, for identifying the candidate POS of the remaining vocabulary.
- A list of the consonants and vowels.
- A list of common function words (not essential).
- Distance/similarity tables generated on previously studied languages (if available).

- Combines four original alignment models:
  - Alignment by Frequency Similarity.
  - Alignment by Context Similarity.
  - Alignment by Weighted Levenshtein Distance.
  - Alignment by Morphological Transformation Probabilities.

- The motivating dilemma:

[Figure: the roots sing and take with the candidate VBD forms singed and taked marked "?", and the VBD form sang with its root marked "?".]

- This Table is based on relative corpus frequency:

- A problem: the true alignments between inflections are unknown in advance.
- A simplifying assumption: the frequency ratios between inflections and roots are not significantly different between regular and irregular morphological processes.

- Similarity between regular and irregular forms:

- The expected frequency should also be estimable from the frequency of any of the other inflectional variants.
- VBD/VBG and VBD/VBZ could also be used as estimators.
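The frequency-similarity idea above can be sketched as follows. This is a minimal illustration, not the authors' estimator: it assumes the log of the inflection/root frequency ratio is roughly Gaussian, fits that Gaussian on a few pairs, and scores candidate roots by log-density. All counts and the names `fit_log_ratio_model` and `frequency_score` are invented for the example.

```python
import math
from statistics import mean, stdev


def log_ratio(inflected_count, root_count):
    """Log of the corpus frequency ratio inflection/root."""
    return math.log(inflected_count / root_count)


def fit_log_ratio_model(pairs):
    """Fit mean and standard deviation of log ratios on (inflected, root) counts."""
    ratios = [log_ratio(i, r) for i, r in pairs]
    return mean(ratios), stdev(ratios)


def frequency_score(inflected_count, root_count, mu, sigma):
    """Gaussian log-density of the candidate pair's log ratio (an assumption)."""
    z = (log_ratio(inflected_count, root_count) - mu) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2 * math.pi))


# Hypothetical (VBD count, root count) pairs used to estimate the ratio.
training_pairs = [(120, 100), (45, 60), (300, 210), (80, 95), (15, 20)]
mu, sigma = fit_log_ratio_model(training_pairs)

# Hypothetical corpus counts for one inflection and two candidate roots.
counts = {"sang": 1100, "sing": 1000, "singe": 3}
score_sing = frequency_score(counts["sang"], counts["sing"], mu, sigma)
score_singe = frequency_score(counts["sang"], counts["singe"], mu, sigma)
```

With these made-up counts, sing receives a higher score than singe as the root of sang, because the sang/sing ratio is close to the ratios seen in the training pairs while sang/singe is far from them.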

- Based on contextual similarity of the candidate forms.
- Computes similarity between vectors of weighted and filtered context features.
- Clusters inflectional variants of verbs (e.g. sipped, sipping, and sip).
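A minimal sketch of the context-similarity idea, under stated assumptions: context features are plain word counts in a fixed window, filtering is a stopword list, and the similarity measure is cosine. The window size, the stopword list, the toy sentence data and the function names are all choices made for this illustration; the slides do not specify the feature weighting.

```python
import math
from collections import Counter


def context_vector(tokens, target, window=3, stopwords=frozenset()):
    """Count non-stopword tokens within `window` positions of each target occurrence."""
    vec = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i and tokens[j] not in stopwords:
                vec[tokens[j]] += 1
    return vec


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy corpus, invented for the example.
tokens = (
    "she sipped the hot tea slowly . "
    "he will sip the hot tea now . "
    "they sang a long song"
).split()
stop = frozenset({"the", "a", ".", "she", "he", "they", "will"})

v_sipped = context_vector(tokens, "sipped", stopwords=stop)
v_sip = context_vector(tokens, "sip", stopwords=stop)
v_sang = context_vector(tokens, "sang", stopwords=stop)

sim_same_verb = cosine(v_sipped, v_sip)
sim_other_verb = cosine(v_sipped, v_sang)
```

In this toy data sipped and sip share the context words hot and tea, so they score as more similar to each other than sipped does to sang.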

CW_subj (AUX|NEG)* V_keyword DET? CW* CW_obj

- Example: Shlomo (CW_subj) is (AUX) eating (V_keyword) the (DET) apple (CW_obj).

- Consider overall stem edit distance.
- A cost matrix holds the distance costs, initially set to (0.5, 0.6, 1.0, 0.98).
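A weighted edit distance of this kind can be sketched as below. The assignment of costs to operations is an assumption, since the slide text does not preserve which cost is which: here 0.5 is used for vowel-to-vowel substitution, 1.0 for consonant-to-consonant and 0.98 for consonant-to-vowel, while the 0.6 value is left unused. The insertion/deletion cost of 1.0 and the English vowel set are also assumptions.

```python
VOWELS = set("aeiou")  # assumed vowel inventory for the illustration


def sub_cost(a, b, vv=0.5, cc=1.0, cv=0.98):
    """Substitution cost by character class (cost assignment is assumed)."""
    if a == b:
        return 0.0
    a_vowel, b_vowel = a in VOWELS, b in VOWELS
    if a_vowel and b_vowel:
        return vv
    if not a_vowel and not b_vowel:
        return cc
    return cv


def weighted_levenshtein(s, t, indel=1.0):
    """Dynamic-programming edit distance with class-based substitution costs."""
    prev = [j * indel for j in range(len(t) + 1)]
    for i, a in enumerate(s, 1):
        cur = [i * indel]
        for j, b in enumerate(t, 1):
            cur.append(min(
                prev[j] + indel,               # delete a
                cur[j - 1] + indel,            # insert b
                prev[j - 1] + sub_cost(a, b),  # substitute or match
            ))
        prev = cur
    return prev[-1]


d_sang_sing = weighted_levenshtein("sang", "sing")
d_sang_sank = weighted_levenshtein("sang", "sank")
d_sang_singe = weighted_levenshtein("sang", "singe")
```

Under these assumed costs a vowel change (sang/sing) is cheaper than a consonant change (sang/sank), which fits the intuition that irregular stems often differ from their roots in a vowel.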

The goal is to generalize a mapping function via a generative probabilistic model.

- Result table:

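<root> + <stem change> + <suffix> → <inflection>

The stem change is unique for a given root, suffix and inflection, so:

$$
P(\text{inflection} \mid \text{root}, \text{suffix}, \text{POS}) = P(\text{stem change} \mid \text{root}, \text{suffix}, \text{POS})
$$

Example:

$$
\begin{aligned}
P(\text{solidified} \mid \text{solidify}, \text{+ed}, \text{VBD})
&= P(y \to i \mid \text{solidify}, \text{+ed}, \text{VBD}) \\
&\approx \lambda_1 P(y \to i \mid \text{ify}, \text{+ed}) \\
&\quad + (1-\lambda_1)\big(\lambda_2 P(y \to i \mid \text{fy}, \text{+ed}) \\
&\quad + (1-\lambda_2)\big(\lambda_3 P(y \to i \mid \text{y}, \text{+ed}) \\
&\quad + (1-\lambda_3)\big(\lambda_4 P(y \to i \mid \text{+ed}) \\
&\quad + (1-\lambda_4)\, P(y \to i)\big)\big)\big)
\end{aligned}
$$

Note: POS can be deleted from the backoff terms.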
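The nested backoff in the example above can be evaluated from the innermost term outward. The sketch below assumes the component probabilities are already available in a lookup table. The table layout, the probability values, the interpolation weights and the name `backoff_probability` are all hypothetical.

```python
def backoff_probability(change, root, suffix, probs, lambdas):
    """Interpolated backoff over progressively shorter root endings.

    probs maps (change, root_ending, suffix) to a probability estimate;
    root_ending "" means no root context, suffix None means unconditioned.
    lambdas holds the four interpolation weights, most specific first.
    """
    contexts = [
        (root[-3:], suffix),  # e.g. "ify", "+ed"
        (root[-2:], suffix),  # e.g. "fy",  "+ed"
        (root[-1:], suffix),  # e.g. "y",   "+ed"
        ("", suffix),         # suffix only
    ]
    p = probs.get((change, "", None), 0.0)  # innermost term: P(change)
    # Work outward: each level mixes its own estimate with the backed-off value.
    for lam, (ending, suf) in zip(reversed(lambdas), reversed(contexts)):
        p = lam * probs.get((change, ending, suf), 0.0) + (1.0 - lam) * p
    return p


# Invented estimates for the stem change y -> i before +ed.
probs = {
    ("y->i", "ify", "+ed"): 1.0,
    ("y->i", "fy", "+ed"): 1.0,
    ("y->i", "y", "+ed"): 0.8,
    ("y->i", "", "+ed"): 0.1,
    ("y->i", "", None): 0.05,
}
p_solidified = backoff_probability(
    "y->i", "solidify", "+ed", probs, lambdas=(0.5, 0.5, 0.5, 0.5)
)
```

Setting every weight to 1 returns the most specific estimate unchanged, and setting every weight to 0 falls all the way back to the unconditioned probability of the stem change.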

- No single model is sufficiently effective on its own, so the four models are combined.
- Across iterations, the Frequency, Levenshtein and Context Similarity models retain equal relative weight.
- The Morphological Transformation Similarity model increases in relative weight.
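One way to picture this weighting is a weighted sum in which three models share equal weight and the fourth grows with each iteration. The linear growth schedule, the `growth` parameter, the assumption that every model's score is already normalised to a comparable range, and the function names are all invented for this sketch; the slides do not give the actual combination formula.

```python
MODELS = ("frequency", "levenshtein", "context", "morphology")


def model_weights(iteration, growth=0.5):
    """Normalised weights: three models stay equal, morphology grows (assumed schedule)."""
    base = 1.0
    morph = base * (1.0 + growth * iteration)
    total = 3 * base + morph
    weights = {name: base / total for name in MODELS[:3]}
    weights["morphology"] = morph / total
    return weights


def combined_score(scores, weights):
    """Weighted sum of per-model scores for one candidate (inflection, root) pair."""
    return sum(weights[name] * scores[name] for name in weights)


w0 = model_weights(0)
w2 = model_weights(2)

# Hypothetical normalised scores for one candidate pair.
scores = {"frequency": 0.8, "levenshtein": 0.9, "context": 0.6, "morphology": 0.2}
s0 = combined_score(scores, w0)
s2 = combined_score(scores, w2)
```

At iteration 0 all four models weigh the same; by iteration 2 the morphology model carries twice the weight of each of the others, so a pair with a weak morphology score drops in the combined ranking.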

- Example:

- The final alignment is based on the pigeonhole principle.
- For a given POS, a root should not have more than one inflection, nor should multiple inflections in the same POS share the same root.
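The one-to-one constraint above can be enforced in several ways. The sketch below uses a simple greedy pass over candidate pairs sorted by score, which is an assumption made for illustration and not necessarily how the authors resolve conflicts. The scores and the name `one_to_one_alignment` are hypothetical.

```python
def one_to_one_alignment(scored_pairs):
    """Greedily pick the best-scoring pairs so each root and inflection is used once.

    scored_pairs: iterable of (inflection, root, score) for a single POS.
    Returns a dict mapping inflection to root.
    """
    used_roots, used_inflections = set(), set()
    alignment = {}
    for inflection, root, _score in sorted(
        scored_pairs, key=lambda pair: pair[2], reverse=True
    ):
        if inflection in used_inflections or root in used_roots:
            continue  # pigeonhole: this slot is already taken
        alignment[inflection] = root
        used_inflections.add(inflection)
        used_roots.add(root)
    return alignment


# Invented combined scores for VBD candidates.
candidates = [
    ("sang", "sing", 0.9),
    ("singed", "sing", 0.6),
    ("singed", "singe", 0.8),
    ("sang", "singe", 0.1),
]
vbd_alignment = one_to_one_alignment(candidates)
```

Because sang claims sing first, singed cannot also align to sing and is assigned to singe instead.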

- Performance: