Unsupervised Discovery of Morphemes

1 / 22

# Unsupervised Discovery of Morphemes - PowerPoint PPT Presentation

linja-auton. autonkuljettajallakaan. linja-. auton. kuljettajallakaan. auto. n. kuljettajalla. kaan. kuljettaja. lla. Unsupervised Discovery of Morphemes. Presented by: Miri Vilkhov &amp; Daniel Feinstein. Aim:.

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.

## PowerPoint Slideshow about 'Unsupervised Discovery of Morphemes' - lumina

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

linja-auton

autonkuljettajallakaan

linja-

auton

kuljettajallakaan

auto

n

kuljettajalla

kaan

kuljettaja

lla

Unsupervised Discovery of Morphemes

Presented by:

Miri Vilkhov & Daniel Feinstein

Aim:

To find optimal segmentation of the input text into morpheme-like units (morphs) by using unsupervised algorithms.

• The first method is based on the Minimum Description Length (MDL) principle.
• The second method is based on the Maximum Likelihood (ML) principle.

Two segmentation techniques:

input text

Segmentation method

Output data

Definitions
• Input text:flat text that contains only an alphabet language letters and spaces.
• Word: is a sequence of letters bounded by spaces or start/end of the input text.
• Output:vocabulary of morphs (codebook)
• Morph Type:definition of a morph in the codebook
• Morph Token: instance of a morph type in the input text
Method 1:

Recursive Segmentation

and

MDL Cost

MDL Cost function

C = Cost(Input text) + Cost(Codebook)

• m1,…mn:sequence of morph tokens that makes up the input text.
• l(mi):the length of morph mi
• k:number of bits to code a character
• p(mi):token count of mi divided by total count of morph tokens.
Search Algorithm

For each word in input text do

{

If word has been observed before then

{

1. Remove word from the data structure

2. Remove word’s morphs from the codebook

}

Segmentation (word)

}

1. Recursive segmentation

Segmentation (string = c1,…cn)

{

1. Evaluate every possible split of the string into 2 parts.

2. Select the split (or no split) with min(MDL cost).

string split index is i.

3. If “no split” (i=0) selected

Codebook = Codebook U {string}

Else

Segmentation (c1,..,ci);

Segmentation (ci+1,..,cn);

}

Codebook:

• affect
• s

affect

ion

s

The morphs are:affect , ion , s

Example

• The order of splits can be represented as a binary tree.

affections

affect

ions

ion

s

Problem:

Words encountered in the beginning and not observed since may have a “wrong” segmentation, since at some point more suitable morphs have entered the codebook.

• Solution:

“Dreaming” stage.

“Dreaming”

At regular intervals do:

• Stop reading words from the input
• Go over the words already encountered in random order.
• Resegment these words.
Method 2:

Sequential Segmentation

and

ML cost

Method 2:
• Pre-processing: list of words and the frequencies of each word in the corpus.
• The total cost consists of the input text only

Cost(Input text) = Σ –logp(mi)

morth tokens

• mi:morph tokens that makes up the input text.
• p(mi):token count of mi divided by total count of morph tokens.
Search Algorithm – Sequential Segmentation
• Initialize: Split words into morths at random intervals. (used Poisson distribution)
• Repeat for a number of iterations:
• Estimate morph probability
• Re-segment the text using the Viterbi Algorithm for finding segmentation with lowest cost.
• If not the last iteration: Evaluate the segmentation against Rejection Criteria. If not accepted, segment this word randomly (as in 1)
Rejection criteria

Reject the segmentation of a word if it

contains one of the following:

• Rare morph: morph that was used in only one word type in the previous iteration.
• Sequence of one-letter morphs example: carefu + l + l + y

Back to Algoritm

Open issues – Method 2
• Why the coast function is defined?
• What is the iteration stage?
• How do the resegmentation works?
• How this method gives us the right morphs?

Back to Algoritm

Evaluation Measures

Correspondence with linguistic morphemes. Using Goldsmith’s program Linguistica.

Efficiency of compression of the data. Can be evaluated using MDL cost function.

Computational efficiency. Can be estimated from the running time of the program.

Experiments

&

Results

Correct and complete segmentation (i.e. all relevant morphemes were identified).

• Correct but incomplete segmentation (i.e. not all morphemes were identified).
• Incorrect segmentation (i.e. some proposed boundaries didn’t correspond to an actual morphemes).
Conclusions

Recursive splitting and MDL cost performed

better.

(method 1 is the best based on results)