Unsupervised Discovery of Morphemes

linja-auton autonkuljettajallakaan linja- auton kuljettajallakaan auto n kuljettajalla kaan kuljettaja lla Unsupervised Discovery of Morphemes Presented by: Miri Vilkhov & Daniel Feinstein

Aim: To find optimal segmentation of the input text into morpheme-like units (morphs) by using unsupervised algorithms. • The first method is based on the Minimum Description Length (MDL) principle. • The second method is based on the Maximum Likelihood (ML) principle. Two segmentation techniques:

input text Segmentation method Output data

Definitions • Input text:flat text that contains only an alphabet language letters and spaces. • Word: is a sequence of letters bounded by spaces or start/end of the input text. • Output:vocabulary of morphs (codebook) • Morph Type:definition of a morph in the codebook • Morph Token: instance of a morph type in the input text

Method 1: Recursive Segmentation and MDL Cost

MDL Cost function C = Cost(Input text) + Cost(Codebook) • m1,…mn:sequence of morph tokens that makes up the input text. • l(mi):the length of morph mi • k:number of bits to code a character • p(mi):token count of mi divided by total count of morph tokens.

Search Algorithm For each word in input text do { If word has been observed before then { 1. Remove word from the data structure 2. Remove word’s morphs from the codebook } Segmentation (word) }

1. Recursive segmentation Segmentation (string = c1,…cn) { 1. Evaluate every possible split of the string into 2 parts. 2. Select the split (or no split) with min(MDL cost). string split index is i. 3. If “no split” (i=0) selected Codebook = Codebook U {string} Else Segmentation (c1,..,ci); Segmentation (ci+1,..,cn); }

Codebook: • affect • s • … affect ion s The morphs are:affect , ion , s Example • The order of splits can be represented as a binary tree. affections affect ions ion s

Problem: Words encountered in the beginning and not observed since may have a “wrong” segmentation, since at some point more suitable morphs have entered the codebook. • Solution: “Dreaming” stage.

“Dreaming” At regular intervals do: • Stop reading words from the input • Go over the words already encountered in random order. • Resegment these words.

Method 2: Sequential Segmentation and ML cost

Method 2: • Pre-processing: list of words and the frequencies of each word in the corpus. • The total cost consists of the input text only Cost(Input text) = Σ –logp(mi) morth tokens • mi:morph tokens that makes up the input text. • p(mi):token count of mi divided by total count of morph tokens.

Search Algorithm – Sequential Segmentation • Initialize: Split words into morths at random intervals. (used Poisson distribution) • Repeat for a number of iterations: • Estimate morph probability • Re-segment the text using the Viterbi Algorithm for finding segmentation with lowest cost. • If not the last iteration: Evaluate the segmentation against Rejection Criteria. If not accepted, segment this word randomly (as in 1)

Rejection criteria Reject the segmentation of a word if it contains one of the following: • Rare morph: morph that was used in only one word type in the previous iteration. • Sequence of one-letter morphs example: carefu + l + l + y Back to Algoritm

Open issues – Method 2 • Why the coast function is defined? • What is the iteration stage? • How do the resegmentation works? • How this method gives us the right morphs? Back to Algoritm

Evaluation Measures Correspondence with linguistic morphemes. Using Goldsmith’s program Linguistica. Efficiency of compression of the data. Can be evaluated using MDL cost function. Computational efficiency. Can be estimated from the running time of the program.

Experiments & Results

Correct and complete segmentation (i.e. all relevant morphemes were identified). • Correct but incomplete segmentation (i.e. not all morphemes were identified). • Incorrect segmentation (i.e. some proposed boundaries didn’t correspond to an actual morphemes).

Conclusions Recursive splitting and MDL cost performed better. (method 1 is the best based on results)

The END!!!

Unsupervised Discovery of Morphemes

Unsupervised Discovery of Morphemes

Presentation Transcript

Kinds of Morphemes

Morphemes

Morphemes Are Marvelous

Morphemes

Morphemes

Unsupervised Commonality Discovery in Images

MORPHEMES

Mouth Morphemes

Unsupervised object discovery via self- organisation

Vocabulary through Morphemes

Greek Morphemes

Warm Ups: Morphemes

Unsupervised Segmentation of Words into Morphemes Morpho Challenge Workshop 2006

Types of morphemes

Down with Morphemes!

Unsupervised Constraint Driven Learning for Transliteration Discovery

Identifying Morphemes

Derivational morphemes

Results 3.1 Unsupervised sound discovery V: Vowels

Unsupervised and weakly-supervised discovery of events in video (and audio)

Translating words: morphemes

Joint Unsupervised Structure Discovery and Information Extraction