Linguistica

Presentation Transcript


  1. Learning Morphology

  2. Linguistica
     • INPUT: a text file, typically 5,000 to 1,000,000 words
     • OUTPUT: partial morphological analysis of most of the words in the corpus
     • Unsupervised
     • No dictionary
     • No morphological rules
     • MDL framework (Rissanen 1989)

  3. The Problem
     • Determination of the correct morphological split of individual words into stem and suffixes.
     • Establishment of accurate categories of stems based on the range of suffixes they accept.

  4. Four Approaches
     • Identify morpheme boundaries (and hence morphemes) on the basis of the degree of predictability of the (n+1)st letter given the first n letters (Z. Harris, 1955, 1967).
     • Identify bigrams and trigrams that have a high probability of being morpheme-internal.
     • Discover patterns of phonological relationships between pairs of related words.
     • Seek the analysis that is globally most concise (Goldsmith 2001).
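
The first approach is often operationalised as a successor count: for each prefix of a word, count how many distinct letters follow that prefix anywhere in the vocabulary, and treat a peak as a likely morpheme boundary. The following is a minimal sketch under that reading. The function name `successor_counts`, the end-of-word marker `#`, and the toy vocabulary are illustrative assumptions, not part of the presentation or of Linguistica.

```python
from collections import defaultdict


def successor_counts(vocabulary, word):
    """For each prefix length n of `word`, count the distinct letters that
    follow the first n letters somewhere in `vocabulary`."""
    followers = defaultdict(set)
    for w in vocabulary:
        for i in range(len(w)):
            followers[w[:i]].add(w[i])
        followers[w].add("#")  # assumption: end of word counts as a successor
    return [len(followers[word[:n]]) for n in range(1, len(word) + 1)]


# Toy vocabulary (hypothetical data, chosen to match the slides' examples).
vocab = ["jump", "jumps", "jumped", "jumping",
         "walk", "walks", "walked", "walking"]
counts = successor_counts(vocab, "jumping")
boundary = counts.index(max(counts)) + 1  # prefix length with the peak
print(counts, "jumping"[:boundary], "jumping"[boundary:])
```

With this vocabulary the count peaks after `jump`, where low predictability of the next letter suggests a boundary before `ing`.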

  5. Minimum Description Length Model: 4 Components
     • A model of a set of data that assigns a probability distribution to the sample space from which the data is drawn.
     • The model can be used to assign a compressed length to the data using information-theoretic notions.
     • The model can itself be assigned a length.
     • The optimal analysis of the data is the one for which the sum of the length of the compressed data and the length of the model is the smallest.
     • In other words, we seek the most compact representation of the model and the data taken together.
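
The components above can be collected into a single objective. The notation here is an assumption introduced for illustration and does not appear on the slide: M is a candidate model, D is the corpus, L(·) is a length in bits, and P_M is the probability distribution that M assigns.

```latex
\[
  \hat{M} \;=\; \arg\min_{M}\,\bigl[\, L(M) + L(D \mid M) \,\bigr],
  \qquad
  L(D \mid M) \;=\; -\sum_{w \in D} \log_2 P_M(w)
\]
```

The first term penalises large models; the second penalises models that assign low probability to the observed words, so neither an empty model nor one that memorises the corpus is optimal.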

  6. An Example Model
     • List of stems: the set of unanalysed words plus the material that precedes the final suffix of any analysed word
     • List of suffixes that occur with at least one stem
     • List of signatures: each stem is associated with a list of observed suffixes, which is the stem's signature. This list is built from pointers.

  7. MDL Example
     STEMS (9): cat, dog, hat, John, jump, laugh, sav, the, walk
     AFFIXES (6): NULL, ed, ing, s, e, es

  8. MDL Example: Signatures
     S1: stems ptr(cat) ptr(dog) ptr(hat); suffixes ptr(NULL) ptr(s)
     S2: stems ptr(sav); suffixes ptr(e) ptr(es) ptr(ing)
     S3: stems ptr(jump) ptr(laugh) ptr(walk); suffixes ptr(NULL) ptr(ed) ptr(ing) ptr(s)
     S4: stems ptr(John) ptr(the)
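
The grouping of stems into signatures can be sketched in a few lines. This is an illustration only, not Linguistica's implementation: the function name `build_signatures` and the pair representation are assumptions, and the unanalysed words John and the are given the NULL suffix here as a further assumption, since the slide lists no suffix for S4.

```python
from collections import defaultdict


def build_signatures(analyses):
    """Group stems by the set of suffixes they occur with.

    `analyses` is an iterable of (stem, suffix) pairs; the result maps each
    signature (a sorted tuple of suffixes) to the sorted list of its stems.
    """
    suffixes_by_stem = defaultdict(set)
    for stem, suffix in analyses:
        suffixes_by_stem[stem].add(suffix)

    stems_by_signature = defaultdict(list)
    for stem, suffixes in suffixes_by_stem.items():
        stems_by_signature[tuple(sorted(suffixes))].append(stem)
    return {sig: sorted(stems) for sig, stems in stems_by_signature.items()}


# (stem, suffix) pairs reconstructed from the example; "NULL" is the empty suffix.
analyses = (
    [(t, f) for t in ("cat", "dog", "hat") for f in ("NULL", "s")]
    + [("sav", f) for f in ("e", "es", "ing")]
    + [(t, f) for t in ("jump", "laugh", "walk")
       for f in ("NULL", "ed", "ing", "s")]
    + [(t, "NULL") for t in ("John", "the")]  # assumption: NULL only
)
signatures = build_signatures(analyses)
for suffixes, stems in signatures.items():
    print(stems, suffixes)
```

Stems that share exactly the same suffix set end up under one key, which is what lets the model store each suffix list once rather than once per stem.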

  9. Notation
     t: a stem
     f: a suffix
     s: a signature
     T: set of stems in the corpus
     F: set of suffixes in the corpus
     S: set of signatures in the corpus
     <T>, <F>, <S>: cardinalities of T, F, S
     [t], [f]: frequency of t, f in the corpus
     W: set of words in the corpus
     [W]: length of the corpus
     <W>: vocabulary size

  10. A signature comprises two lists:
     • List of pointers to stems
     • List of pointers to suffixes
     Specifying a list of length N takes L(N) bits, where L(N) ≈ log2(N).
     A pointer to a stem t has length -log(P(t)), where P(t) = [t]/[W].
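
The two length formulas are easy to check numerically. The sketch below assumes base-2 logarithms throughout, so that both lengths come out in bits; the slide gives the base only for L(N). The function names and the counts in the example are hypothetical.

```python
import math


def list_length_bits(n):
    """Approximate cost L(N) ~= log2(N) of stating that a list has N items."""
    return math.log2(n)


def pointer_length_bits(stem_count, corpus_length):
    """Length -log2(P(t)) of a pointer to stem t, with P(t) = [t]/[W]."""
    return -math.log2(stem_count / corpus_length)


# Hypothetical counts: a stem seen 8 times in a corpus of 1,024 words.
frequent = pointer_length_bits(8, 1024)
rare = pointer_length_bits(1, 1024)
print(frequent, rare, list_length_bits(4))
```

A frequent stem gets a shorter pointer than a rare one, so the total cost of the signatures depends on how often each stem occurs, not only on how many stems there are.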
