220 likes | 504 Views
linja-auton. autonkuljettajallakaan. linja-. auton. kuljettajallakaan. auto. n. kuljettajalla. kaan. kuljettaja. lla. Unsupervised Discovery of Morphemes. Presented by: Miri Vilkhov & Daniel Feinstein. Aim:.
E N D
linja-auton autonkuljettajallakaan linja- auton kuljettajallakaan auto n kuljettajalla kaan kuljettaja lla Unsupervised Discovery of Morphemes Presented by: Miri Vilkhov & Daniel Feinstein
Aim: To find optimal segmentation of the input text into morpheme-like units (morphs) by using unsupervised algorithms. • The first method is based on the Minimum Description Length (MDL) principle. • The second method is based on the Maximum Likelihood (ML) principle. Two segmentation techniques:
input text Segmentation method Output data
Definitions • Input text:flat text that contains only an alphabet language letters and spaces. • Word: is a sequence of letters bounded by spaces or start/end of the input text. • Output:vocabulary of morphs (codebook) • Morph Type:definition of a morph in the codebook • Morph Token: instance of a morph type in the input text
Method 1: Recursive Segmentation and MDL Cost
MDL Cost function C = Cost(Input text) + Cost(Codebook) • m1,…mn:sequence of morph tokens that makes up the input text. • l(mi):the length of morph mi • k:number of bits to code a character • p(mi):token count of mi divided by total count of morph tokens.
Search Algorithm For each word in input text do { If word has been observed before then { 1. Remove word from the data structure 2. Remove word’s morphs from the codebook } Segmentation (word) }
1. Recursive segmentation Segmentation (string = c1,…cn) { 1. Evaluate every possible split of the string into 2 parts. 2. Select the split (or no split) with min(MDL cost). string split index is i. 3. If “no split” (i=0) selected Codebook = Codebook U {string} Else Segmentation (c1,..,ci); Segmentation (ci+1,..,cn); }
Codebook: • affect • s • … affect ion s The morphs are:affect , ion , s Example • The order of splits can be represented as a binary tree. affections affect ions ion s
Problem: Words encountered in the beginning and not observed since may have a “wrong” segmentation, since at some point more suitable morphs have entered the codebook. • Solution: “Dreaming” stage.
“Dreaming” At regular intervals do: • Stop reading words from the input • Go over the words already encountered in random order. • Resegment these words.
Method 2: Sequential Segmentation and ML cost
Method 2: • Pre-processing: list of words and the frequencies of each word in the corpus. • The total cost consists of the input text only Cost(Input text) = Σ –logp(mi) morth tokens • mi:morph tokens that makes up the input text. • p(mi):token count of mi divided by total count of morph tokens.
Search Algorithm – Sequential Segmentation • Initialize: Split words into morths at random intervals. (used Poisson distribution) • Repeat for a number of iterations: • Estimate morph probability • Re-segment the text using the Viterbi Algorithm for finding segmentation with lowest cost. • If not the last iteration: Evaluate the segmentation against Rejection Criteria. If not accepted, segment this word randomly (as in 1)
Rejection criteria Reject the segmentation of a word if it contains one of the following: • Rare morph: morph that was used in only one word type in the previous iteration. • Sequence of one-letter morphs example: carefu + l + l + y Back to Algoritm
Open issues – Method 2 • Why the coast function is defined? • What is the iteration stage? • How do the resegmentation works? • How this method gives us the right morphs? Back to Algoritm
Evaluation Measures Correspondence with linguistic morphemes. Using Goldsmith’s program Linguistica. Efficiency of compression of the data. Can be evaluated using MDL cost function. Computational efficiency. Can be estimated from the running time of the program.
Experiments & Results
Correct and complete segmentation (i.e. all relevant morphemes were identified). • Correct but incomplete segmentation (i.e. not all morphemes were identified). • Incorrect segmentation (i.e. some proposed boundaries didn’t correspond to an actual morphemes).
Conclusions Recursive splitting and MDL cost performed better. (method 1 is the best based on results)