
Unsupervised Discovery of Morphemes

Presented by: Miri Vilkhov & Daniel Feinstein.


Presentation Transcript


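  1. Unsupervised Discovery of Morphemes. Presented by: Miri Vilkhov & Daniel Feinstein. Example split tree (Finnish): linja-autonkuljettajallakaan → linja- + autonkuljettajallakaan; autonkuljettajallakaan → auton + kuljettajallakaan; auton → auto + n; kuljettajallakaan → kuljettajalla + kaan; kuljettajalla → kuljettaja + lla.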

  2. Aim: To find an optimal segmentation of the input text into morpheme-like units (morphs) using unsupervised algorithms. Two segmentation techniques: • The first method is based on the Minimum Description Length (MDL) principle. • The second method is based on the Maximum Likelihood (ML) principle.

  3. Input text → Segmentation method → Output data

  4. Definitions • Input text: flat text that contains only the letters of the language's alphabet and spaces. • Word: a sequence of letters bounded by spaces or by the start/end of the input text. • Output: vocabulary of morphs (codebook). • Morph type: the definition of a morph in the codebook. • Morph token: an instance of a morph type in the input text.

  5. Method 1: Recursive Segmentation and MDL Cost

  6. MDL Cost function C = Cost(Input text) + Cost(Codebook) = Σ over morph tokens of -log p(mi) + Σ over morph types of k · l(mj) • m1,…,mn: sequence of morph tokens that makes up the input text. • l(mj): the length of morph mj in characters. • k: number of bits needed to code a character. • p(mi): token count of mi divided by the total count of morph tokens.
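The cost on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the presenters' implementation: the function name `mdl_cost`, the base-2 logarithm, and the default of 5 bits per character are all assumptions made for the example.

```python
import math
from collections import Counter


def mdl_cost(tokens, bits_per_char=5):
    """Return Cost(Input text) + Cost(Codebook) in bits.

    tokens: list of morph tokens that make up the input text.
    bits_per_char: k, the bits needed to code one character (assumed value).
    """
    counts = Counter(tokens)          # morph type -> token count
    total = len(tokens)               # total number of morph tokens
    # Cost of the text: sum of -log2 p(m_i) over all morph tokens,
    # grouped by type so each type contributes count * -log2(count/total).
    text_cost = sum(-c * math.log2(c / total) for c in counts.values())
    # Cost of the codebook: k bits for every character of every morph type.
    codebook_cost = sum(bits_per_char * len(m) for m in counts)
    return text_cost + codebook_cost


# Compare two ways of coding the same two words.
unsplit = ["affections", "affection"]
split = ["affect", "ion", "s", "affect", "ion"]
print(mdl_cost(unsplit), mdl_cost(split))
```

Splitting pays off when the shorter codebook saves more bits than the extra tokens cost.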

  7. Search Algorithm
     For each word in the input text do {
       If the word has been observed before then {
         1. Remove the word from the data structure
         2. Remove the word’s morphs from the codebook
       }
       Segmentation(word)
     }

  8. Recursive segmentation
     Segmentation(string = c1,…,cn) {
       1. Evaluate every possible split of the string into 2 parts.
       2. Select the split (or no split) with the minimum MDL cost. The selected split index is i.
       3. If “no split” (i = 0) is selected
            Codebook = Codebook ∪ {string}
          Else
            Segmentation(c1,…,ci); Segmentation(ci+1,…,cn);
     }
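The recursive procedure can be sketched as follows. This is an illustration under stated assumptions, not the presenters' code: the codebook is a `Counter` of morph token counts, `K = 5` bits per character is an assumed value, and each candidate split is scored by recomputing the whole cost, which is simple but slow.

```python
import math
from collections import Counter

K = 5  # assumed number of bits needed to code one character


def cost(counts):
    """MDL cost of a codebook given as morph -> token count."""
    total = sum(counts.values())
    text = sum(-c * math.log2(c / total) for c in counts.values())
    codebook = K * sum(len(m) for m in counts)
    return text + codebook


def segment(string, counts):
    """Recursively split `string`; update `counts` in place, return the morphs."""
    best_cost, best_i = None, 0
    # i == 0 stands for "no split"; i >= 1 splits into string[:i] and string[i:].
    for i in range(len(string)):
        parts = [string] if i == 0 else [string[:i], string[i:]]
        trial = counts.copy()
        trial.update(parts)           # tentatively add the candidate morphs
        c = cost(trial)
        if best_cost is None or c < best_cost:
            best_cost, best_i = c, i
    if best_i == 0:
        counts[string] += 1           # keep the string as one morph
        return [string]
    # Otherwise recurse on both halves of the best split.
    return segment(string[:best_i], counts) + segment(string[best_i:], counts)


codebook = Counter({"affect": 5, "ion": 5, "s": 5})
print(segment("affections", codebook))
```

A one-letter string has no split candidates, so the recursion always terminates.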

  9. Example • The order of splits can be represented as a binary tree: affections → affect + ions; ions → ion + s. • Segmented word: affect ion s. The morphs are: affect, ion, s. • Codebook: affect, s, …

  10. Problem: Words encountered early in the input and not observed since may keep a “wrong” segmentation, because more suitable morphs have entered the codebook in the meantime. • Solution: a “Dreaming” stage.

  11. “Dreaming” At regular intervals do: • Stop reading words from the input • Go over the words already encountered in random order. • Resegment these words.
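A minimal sketch of how a dreaming stage could be interleaved with reading the input. The function name `train_with_dreaming`, the `interval` parameter, and the `resegment` callback are hypothetical; the callback stands in for the per-word step of the search algorithm (remove the word's old morphs, then segment it again).

```python
import random


def train_with_dreaming(words, resegment, interval=100, seed=0):
    """Process `words` in order, pausing every `interval` words to 'dream'.

    resegment: callable taking one word; assumed to remove the word's old
               morphs from the codebook and segment the word again.
    """
    rng = random.Random(seed)         # seeded so runs are reproducible
    seen, seen_set = [], set()
    for n, word in enumerate(words, start=1):
        resegment(word)
        if word not in seen_set:
            seen_set.add(word)
            seen.append(word)
        if n % interval == 0:
            # Dreaming: stop reading input and revisit the words
            # encountered so far in random order.
            dream = seen[:]
            rng.shuffle(dream)
            for w in dream:
                resegment(w)


calls = []
train_with_dreaming(["a", "b", "a", "c"], calls.append, interval=2)
print(len(calls))
```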

  12. Method 2: Sequential Segmentation and ML cost

  13. Method 2: • Pre-processing: build a list of the words and the frequency of each word in the corpus. • The total cost consists of the cost of the input text only: Cost(Input text) = Σ over morph tokens of -log p(mi) • mi: the morph tokens that make up the input text. • p(mi): token count of mi divided by the total count of morph tokens.

  14. Search Algorithm – Sequential Segmentation
     1. Initialize: split the words into morphs at random intervals (drawn from a Poisson distribution).
     2. Repeat for a number of iterations:
        • Estimate the morph probabilities.
        • Re-segment the text, using the Viterbi algorithm to find the segmentation with the lowest cost.
        • If this is not the last iteration: evaluate the segmentation against the rejection criteria. If a word’s segmentation is not accepted, segment this word randomly (as in step 1).
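The two inner steps of an iteration, estimating morph probabilities and re-segmenting with Viterbi, can be sketched as below. This is an illustration only: the function names are hypothetical, the logarithm base is assumed to be 2, and a word containing no known morph sequence returns `None`.

```python
import math
from collections import Counter


def estimate_probs(segmentations):
    """Morph probability = token count / total token count."""
    counts = Counter(m for morphs in segmentations for m in morphs)
    total = sum(counts.values())
    return {m: c / total for m, c in counts.items()}


def viterbi_segment(word, probs):
    """Return the split of `word` minimizing the sum of -log2 p(morph)."""
    n = len(word)
    best = [math.inf] * (n + 1)       # best[j]: lowest cost of coding word[:j]
    back = [0] * (n + 1)              # back[j]: start index of the last morph
    best[0] = 0.0
    for j in range(1, n + 1):
        for i in range(j):
            p = probs.get(word[i:j])
            if p and best[i] - math.log2(p) < best[j]:
                best[j] = best[i] - math.log2(p)
                back[j] = i
    if math.isinf(best[n]):
        return None                   # no segmentation into known morphs
    morphs, j = [], n
    while j > 0:                      # follow back-pointers from the end
        morphs.append(word[back[j]:j])
        j = back[j]
    return morphs[::-1]


probs = estimate_probs([["care", "ful"], ["care", "ful", "ly"], ["quick", "ly"]])
print(viterbi_segment("carefully", probs))
```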

  15. Rejection criteria: Reject the segmentation of a word if it contains one of the following: • Rare morph: a morph that was used in only one word type in the previous iteration. • Sequence of one-letter morphs, for example: carefu + l + l + y
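The two criteria can be checked with a small predicate. The names below are hypothetical, and two details are assumptions the slide does not state: a "sequence" is taken to mean at least two consecutive one-letter morphs, and a morph missing from the previous iteration's table is not treated as rare.

```python
def has_one_letter_run(morphs, min_run=2):
    """True if `morphs` contains `min_run` or more consecutive one-letter morphs."""
    run = 0
    for m in morphs:
        run = run + 1 if len(m) == 1 else 0
        if run >= min_run:
            return True
    return False


def should_reject(morphs, word_types_per_morph, min_run=2):
    """Apply both rejection criteria to one proposed segmentation.

    word_types_per_morph: morph -> number of distinct word types that
    used the morph in the previous iteration.
    """
    # Rare morph: used in exactly one word type in the previous iteration.
    rare = any(word_types_per_morph.get(m) == 1 for m in morphs)
    return rare or has_one_letter_run(morphs, min_run)


usage = {"care": 3, "ful": 4, "ly": 9, "carefu": 2, "l": 7, "y": 6}
print(should_reject(["carefu", "l", "l", "y"], usage))
print(should_reject(["care", "ful", "ly"], usage))
```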

  16. Open issues – Method 2 • Why is the cost function defined this way? • What is the iteration stage? • How does the resegmentation work? • How does this method give us the right morphs?

  17. Evaluation Measures • Correspondence with linguistic morphemes: evaluated using Goldsmith’s program Linguistica. • Efficiency of compression of the data: can be evaluated using the MDL cost function. • Computational efficiency: can be estimated from the running time of the program.

  18. Experiments & Results

  19. • Correct and complete segmentation (i.e. all relevant morphemes were identified). • Correct but incomplete segmentation (i.e. not all morphemes were identified). • Incorrect segmentation (i.e. some proposed boundaries didn’t correspond to actual morpheme boundaries).
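One way to read these three categories is as a comparison of boundary positions between a proposed segmentation and a reference one. The sketch below follows that reading; the function names and the idea of a reference ("gold") segmentation given as a list of morphs are assumptions made for the example.

```python
def boundaries(morphs):
    """Return the set of character offsets where one morph ends and the next starts."""
    pos, out = 0, set()
    for m in morphs[:-1]:
        pos += len(m)
        out.add(pos)
    return out


def classify(proposed, gold):
    """Place a proposed segmentation in one of the three categories."""
    p, g = boundaries(proposed), boundaries(gold)
    if not p <= g:
        # At least one proposed boundary is not a real morpheme boundary.
        return "incorrect"
    if p == g:
        return "correct and complete"
    # All proposed boundaries are real, but some real ones are missing.
    return "correct but incomplete"


gold = ["affect", "ion", "s"]
for proposed in (["affect", "ion", "s"], ["affect", "ions"], ["aff", "ections"]):
    print(proposed, classify(proposed, gold))
```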

  20. Conclusions: Recursive splitting with the MDL cost (Method 1) performed better in the experiments.

  21. The END!!!
