    Presentation Transcript
    1. Learning linguistic structure John Goldsmith Computer Science Department University of Chicago February 7, 2003

    2. A large part of the field of computational linguistics has moved during the 1990s from • developing grammars, speech recognition engines, etc., that simply work, to • developing systems that learn language-specific parameters from large amounts of data.

    3. Credo… • Statistically driven methods of data analysis, applied to natural language data, will produce results that shed light on linguistic structure.

    4. Unsupervised learning Input: large texts in a natural language, with no prior knowledge of the language.

    5. A bit more about the goal • What’s the input? • “Data”, which comes to the learner in acoustic form, unsegmented: • Sentences not broken up into words • Words not broken up into their components (morphemes) • Words not assigned to lexical categories (noun, verb, article, etc.) • With a meaning representation?

    6. Idealization of the language-learning scheme • Segment the soundstream into words; the words form the lexicon of the language. • Discover internal structure of words; this is the morphology of the language. • Infer a set of lexical categories for words; each word is assigned to (at least) one lexical category. • Infer a set of phrase-structure rules for the language.

    7. Idealization? • While these tasks are individually coherent, we make no assumption that any one must be completed before another can be begun.

    8. Today’s task • To develop an algorithm capable of learning the morphology of a language, given knowledge of the words of the language, and of a large sample of utterances.

    9. Goals Given a corpus, learn: The set of word-roots, prefixes, and suffixes, and the principles of their combination; Principles of automatic alternations (e.g., e drops before the suffixes –ing, –ity, and –ed, but not before –s); Some suffixes have one grammatical function (-ness) while others have more than one (e.g., -s: song-s versus sing-s).

    10. Why? Practical applications: • Automatic stemming for multilingual information retrieval • A corpus broken into morphemes is far superior to a corpus broken into words for statistically-driven machine translation • Develop morphologies for speech recognition automatically

    11. Theoretically There is a strong bias currently in linguistics to underestimate the difficulty of language learning – For example, to identify language learning with the selection of a phrase-structure grammar, or with the independent setting of a small number of parameters.

    12. Morphology • The learning of morphology is a very difficult task, in the sense that every word W of length |W| can potentially be divided into 1, 2, …, |W| morphemes m_i, constrained only by Σ_i |m_i| = |W| – and that’s ignoring labeling (which is the stem, which the affix). • The number of potential morphologies for a given corpus is enormous.
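To make the size of this search space concrete: each of the |W| − 1 positions between letters can independently be cut or not, so a word of length n has 2^(n−1) segmentations into contiguous morphemes, before we even label stems and affixes. A minimal sketch (the function name is my own, not from the talk):

```python
def segmentations(word):
    """Enumerate all ways to cut a word into contiguous morphemes.
    Each of the len(word)-1 boundary positions is either cut or not,
    so a word of length n has 2**(n-1) segmentations."""
    if len(word) <= 1:
        return [[word]] if word else [[]]
    results = []
    for i in range(1, len(word) + 1):
        head, tail = word[:i], word[i:]
        if tail:
            for rest in segmentations(tail):
                results.append([head] + rest)
        else:
            results.append([head])
    return results

print(len(segmentations("jumping")))  # 7 letters -> 2**6 = 64 segmentations
```

Multiplying these counts across every word in a lexicon is what makes the space of morphologies enormous.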

    13. So the task is a reality check for discussions of language learning

    14. Ideally We would like to pose the problem of grammar-selection as an optimization problem, and cut our task into two parts: • Specification of the objective function to be optimized, and • Development of practical search techniques to find optima in reasonable time.

    15. Current status • Linguistica: a C++ Windows-based program available for download at http://humanities.uchicago.edu/faculty/goldsmith/Linguistica2000 • Technical discussion in Computational Linguistics (June 2001) • Good results with 5,000 words, very fine-grained results with 500,000 words (corpus length, not lexicon count), especially in European languages.

    16. Today’s talk • Specify the task in explicit terms • Minimum Description Length analysis: what it is, and why it is reasonable for this task; how it provides our optimization criteria. • Search heuristics: (1) bootstrap heuristic, and (2) incremental heuristics. • Morphology assigns a probability distribution over its words. • Computing the length of the morphology.

    17. Today’s talk (continued) 6. Results 7. Some work in progress: learning syntax to learn about morphology

    18. Given a text (but no prior knowledge of its language), we want: • List of stems, suffixes, and prefixes • List of signatures. • A signature: a list of all suffixes (prefixes) appearing in a given corpus with a given stem. • Hence, a stem in a corpus has a unique signature. • A signature has a unique set of stems associated with it

    19. Example of a signature in English • NULL.ed.ing.s, with stems ask, call, point, summarizes: ask asked asking asks; call called calling calls; point pointed pointing points
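The grouping of stems under a shared signature can be sketched directly. This is an illustrative reimplementation, not Linguistica's own code; the input format (a map from each stem to its observed suffix set) is my assumption:

```python
from collections import defaultdict

def signatures(analyses):
    """Group stems by their alphabetized suffix set.
    `analyses` maps each stem to the set of suffixes observed with it;
    a signature is the sorted suffix list, shared by all its stems."""
    sig_to_stems = defaultdict(list)
    for stem, suffixes in analyses.items():
        # Keep NULL first, then alphabetical, as in the slide's NULL.ed.ing.s
        sig = ".".join(sorted(suffixes, key=lambda s: (s != "NULL", s)))
        sig_to_stems[sig].append(stem)
    return dict(sig_to_stems)

analyses = {
    "ask":   {"NULL", "ed", "ing", "s"},
    "call":  {"NULL", "ed", "ing", "s"},
    "point": {"NULL", "ed", "ing", "s"},
}
print(signatures(analyses))
# {'NULL.ed.ing.s': ['ask', 'call', 'point']}
```

Because every stem contributes exactly one suffix set, each stem maps to a unique signature, while each signature collects its own set of stems, just as slide 18 states.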

    20. We would like to characterize the discovery of a signature as an optimization problem • Reasonable tack: formulate the problem in terms of Minimum Description Length (Rissanen, 1989)

    21. Today’s talk • Specify the task in explicit terms • Minimum Description Length analysis: what it is, and why it is reasonable for this task; how it provides our optimization criteria. • Search heuristics: (1) bootstrap heuristic, and (2) incremental heuristics. • Morphology assigns a probability distribution over its words. • Computing the length of the morphology.

    22. Minimum Description Length (MDL) • Jorma Rissanen: Stochastic Complexity in Statistical Inquiry (1989) • Work by Michael Brent and Carl de Marcken on word-discovery using MDL in the mid-1990s.

    23. Essence of MDL If we are given • a corpus, and • a probabilistic morphology, which technically means that we are given a distribution over certain strings of stems and affixes. Then we can compute an over-all measure (“description length”) which we can seek to minimize over the space of all possible analyses.

    24. Description length of a corpus C, given a morphology M The length, in bits, of the shortest formulation of the morphology expressible on a given Turing machine + the optimal compressed length of the corpus, using that morphology.

    25. Probabilistic morphology • To serve this function, the morphology must assign a distribution over the set of words it generates, so that the optimal compressed length of an actual, occurring corpus (the one we’re learning from) is -1 * log of the probability it assigns to that corpus.

    26. Essence of MDL… • The goodness of the morphology is also measured by how compact the morphology is. • We can measure the compactness of a morphology in information theoretic bits.

    27. How can we measure the compactness of a morphology? • Let’s consider a naïve version of description length: count the number of letters. • This naïve version is nonetheless helpful in seeing the intuition involved.

    28. Naive Minimum Description Length Corpus: jump, jumps, jumping, laugh, laughed, laughing, sing, sang, singing, the, dog, dogs (total: 61 letters) Analysis: Stems: jump laugh sing sang dog (20 letters) Suffixes: s ing ed (6 letters) Unanalyzed: the (3 letters) total: 29 letters. Notice that the description length goes UP if we analyze sing into s+ing
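The naive letter-counting version of description length is simple enough to recompute directly. A minimal check of the slide's arithmetic:

```python
# The corpus and analysis from the naive-MDL slide
corpus = ["jump", "jumps", "jumping", "laugh", "laughed", "laughing",
          "sing", "sang", "singing", "the", "dog", "dogs"]
stems = ["jump", "laugh", "sing", "sang", "dog"]
suffixes = ["s", "ing", "ed"]
unanalyzed = ["the"]

def letters(words):
    """Naive description length: just count letters."""
    return sum(len(w) for w in words)

print(letters(corpus))                                           # 61
print(letters(stems) + letters(suffixes) + letters(unanalyzed))  # 29
```

Listing each word whole costs 61 letters; factoring out shared stems and suffixes cuts the analysis to 29, which is the intuition MDL formalizes in bits.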

    29. Essence of MDL… The best overall theory of a corpus is the one for which the sum of • -1 * log prob (corpus) + • length of the morphology (that’s the description length) is the smallest.

    30. Essence of MDL…

    31. Overall logic • Search through morphology space for the morphology which provides the smallest description length.

    32. Brief foreshadowing of our calculation of the length of the morphology • A morphology is composed of three lists: a list of stems, a list of suffixes (say), and a list of ways in which the two can be combined (“signatures”). Information content of a list =
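The equation that followed on this slide was an image and is missing from the transcript. A hedged reconstruction, along the lines of the formulation in the Computational Linguistics (2001) paper: the information content of a list is the cost of encoding how many entries it has, plus the cost of encoding each entry,

```latex
\text{length}(\text{list}) \;\approx\; \lambda(N) \;+\; \sum_{i=1}^{N} \text{length}(\text{entry}_i),
\qquad \lambda(N) \approx \log_2 N \text{ bits},
```

where, for a stem t over a 26-letter alphabet, length(t) is roughly |t| · log₂ 26 bits. This is a reconstruction of the missing formula, not the slide's own equation.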

    33. Stem list

    34. Today’s talk • Specify the task in explicit terms • Minimum Description Length analysis: what it is, and why it is reasonable for this task; how it provides our optimization criteria. • Search heuristics: (1) bootstrap heuristic, and (2) incremental heuristics. • Morphology assigns a probability distribution over its words. • Computing the length of the morphology.

    35. Bootstrap heuristic • Find a method to locate likely places to cut a word. • Allow no more than 1 cut per word (i.e., maximum of 2 morphemes). • Assume this is stem + suffix. • Associate with each stem an alphabetized list of its suffixes; call this its signature. • Accept only those word analyses associated with robust signatures…

    36. …where a robust signature is one with a minimum of 5 stems (and at least two suffixes). Robust signatures are pieces of secure structure.

    37. Heuristic to find likely cuts… Best is a modification of a good idea of Zellig Harris (1955): Current variant: Cut words at certain peaks of successor frequency. Problems: can over-cut; can under-cut; and can put cuts too far to the right (“aborti-” problem). [Not a problem!]

    38. Successor frequency g o v e r → n Empirically, only one letter follows “gover”: “n”

    39. Successor frequency g o v e r n → e, i, m, o, s, # Empirically, 6 letters follow “govern”: e, i, m, o, s, and # (end of word)

    40. Successor frequency g o v e r n m → e Empirically, 1 letter follows “governm”: “e” Along the word: gover → 1, govern → 6, governm → 1; “govern” marks a peak of successor frequency

    41. Lots of errors… Successor frequencies along c o n s e r v a t i v e s: 9, 18, 11, 6, 4, 1, 2, 1, 1, 2, 1, 1. Peaks at “co” (wrong), “conserv” (right), and “conservati” (wrong).

    42. Even so… We set conditions: Accept cuts with stems at least 5 letters in length; Demand that successor frequency be a clear peak: 1 … N … 1 (e.g., govern-ment). Then, for each stem, collect all of its suffixes into a signature, and accept only signatures with at least 5 stems.
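The successor-frequency computation on the last few slides can be sketched in a few lines. This is an illustrative reimplementation of Harris's idea as the talk describes it (the function name and word list are my own):

```python
from collections import defaultdict

def successor_freq(words):
    """For every prefix occurring in the word list, count the distinct
    letters that can follow it ('#' marks end of word)."""
    successors = defaultdict(set)
    for w in words:
        for i in range(len(w)):
            successors[w[:i]].add(w[i])
        successors[w].add("#")  # the word can also simply end here
    return {prefix: len(s) for prefix, s in successors.items()}

words = ["govern", "governed", "governing", "governs",
         "government", "governor", "governess"]
sf = successor_freq(words)
print(sf["gover"], sf["govern"], sf["governm"])  # 1 6 1
```

The 1 … 6 … 1 pattern at "govern" is exactly the clear peak the heuristic accepts, yielding cuts like govern-ment and govern-ing.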

    43. Words->SuccessorFreq1( GetStems_Suffixed(), GetSuffixes(), GetSignatures(), SF1 );
    CheckSignatures();
    ExtendKnownStemsToKnownSuffixes();
    TakeSignaturesFindStems();
    ExtendKnownStemsToKnownSuffixes();
    FromStemsFindSuffixes();
    ExtendKnownStemsToKnownSuffixes();
    LooseFit();
    CheckSignatures();

    44. 2. Incremental heuristics • An enormous amount of detail is being skipped…let’s look at one simple case: • Loose fit (suffixes and signatures to split): Collect any string that precedes a known suffix; find all of its apparent suffixes, and use MDL to decide whether the analysis is worth it.

    45. Using MDL to judge a potential stem and potential signature Suppose we find: act, acted, action, acts. We have the suffixes NULL, ed, ion, and s, but not the signature NULL.ed.ion.s Let’s compute cost versus savings of signature NULL.ed.ion.s

    46. savings Savings: Stem savings: 3 copies of the stem act: that’s 3 x 3 = 9 letters = 40.5 bits (taking 4.5 bits/letter). Suffix savings: ed, ion, s: 6 letters, another 27 bits. Total of 67.5 bits.

    47. Cost of NULL.ed.ion.s • A pointer to each suffix. To give a feel for this: total cost of the suffix list is about 30 bits. Cost of the pointer to the signature: all the stems using it chip in to pay for its cost.

    48. Cost of signature: about 43 bits • Savings: about 67 bits • Slight worsening in the compressed length of these 4 words. So MDL says: Do it! Analyze the words as stem + suffix. Notice that the cost of the analysis would have been higher if one or more of the suffixes had not already “existed”.
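The savings side of this cost-benefit calculation is plain arithmetic and can be checked directly, using the 4.5 bits/letter figure the slide assumes:

```python
BITS_PER_LETTER = 4.5  # figure assumed on the slide

# Analyzing act / acted / action / acts as act + {NULL, ed, ion, s}:
# three of the four words no longer need their own copy of the stem,
stem_savings = 3 * len("act") * BITS_PER_LETTER           # 40.5 bits
# and each suffix is stored once instead of once per word.
suffix_savings = (len("ed") + len("ion") + len("s")) * BITS_PER_LETTER  # 27.0 bits

total_savings = stem_savings + suffix_savings
print(total_savings)  # 67.5 bits
```

Against the roughly 43-bit cost of the new signature, the 67.5-bit savings is what tips MDL in favor of the analysis.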

    49. Today’s talk • Specify the task in explicit terms • Minimum Description Length analysis: what it is, and why it is reasonable for this task; how it provides our optimization criteria. • Search heuristics: (1) bootstrap heuristic, and (2) incremental heuristics. • Morphology assigns a probability distribution over its words. • Computing the length of the morphology.

    50. Frequency of an analyzed word W is analyzed as belonging to signature σ, stem T, and suffix F. [x] means the count of x’s in the corpus (token count), where [W] is the total number of words. Actually what we care about is the log of this:
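The formula itself was an image on the slide and is missing from this transcript. A hedged reconstruction, consistent with the definitions just given and with the model in the Computational Linguistics (2001) paper, where the counts of T and F are taken within the signature σ:

```latex
\mathrm{freq}(W) \;=\; \frac{[\sigma]}{[W]} \cdot \frac{[T]}{[\sigma]} \cdot \frac{[F]}{[\sigma]},
\qquad
\log \mathrm{freq}(W) \;=\; \log\frac{[\sigma]}{[W]} \;+\; \log\frac{[T]}{[\sigma]} \;+\; \log\frac{[F]}{[\sigma]}.
```

The negative of this log, summed over the corpus, is the compressed-corpus term in the description length.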