
Morphology & FSTs


Presentation Transcript


  1. Morphology & FSTs Shallow Processing Techniques for NLP Ling570 October 17, 2011

  2. Roadmap • Two-level morphology summary • Unsupervised morphology

  3. Combining FST Lexicon & Rules • Two-level morphological system: ‘Cascade’ • Transducer from Lexicon to Intermediate • Rule transducers from Intermediate to Surface
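
As an illustration of the cascade, here is a minimal Python sketch assuming toy data: a lexicon step maps the lexical form to the intermediate tape, and an orthographic-rule step maps intermediate to surface. Real systems compose finite-state transducers; ordinary functions stand in for them here, and the lexicon table, `lexicon_step`, and `rule_step` are invented purely for the fox/cat plural example.

```python
import re

def lexicon_step(lexical):
    """Lexical -> intermediate: replace morphological features with the
    affix and boundary symbols (toy table, illustrative only)."""
    table = {"fox +N +Pl": "fox^s#", "cat +N +Pl": "cat^s#"}
    return table[lexical]

def rule_step(intermediate):
    """Intermediate -> surface: E-insertion (insert e after x, s, z before
    the plural s), then drop the boundary symbols."""
    surface = re.sub(r"([xsz])\^s#", r"\1es#", intermediate)
    return surface.replace("^", "").replace("#", "")

print(rule_step(lexicon_step("fox +N +Pl")))   # foxes
print(rule_step(lexicon_step("cat +N +Pl")))   # cats
```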

  4. Integrating the Lexicon • Replace classes with stems

  5. Using the E-insertion FST • (fox, fox): q0, q0, q0, q1, accept • (fox#, fox#): q0, q0, q0, q1, q0, accept • (fox^s#, foxes#): q0, q0, q0, q1, q2, q3, q4, q0, accept • (fox^s, foxs): q0, q0, q0, q1, q2, q5, reject • (fox^z#, foxz#): ?
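
The state sequences above can be replayed with a small simulator. The transition table below is reconstructed only from the traces shown on this slide, so it is a partial, hypothetical version of the e-insertion machine, not the full transducer; the state names follow the slide.

```python
# Transitions keyed by (state, intermediate_symbol, surface_symbol).
# '^' is the morpheme boundary, '#' the word boundary, '' is epsilon.
TRANS = {
    ("q0", "f", "f"): "q0",
    ("q0", "o", "o"): "q0",
    ("q0", "x", "x"): "q1",   # x opens the e-insertion context
    ("q1", "#", "#"): "q0",
    ("q1", "^", ""):  "q2",   # morpheme boundary deleted
    ("q2", "",  "e"): "q3",   # the inserted 'e'
    ("q2", "s", "s"): "q5",   # path taken when no 'e' is inserted
    ("q3", "s", "s"): "q4",
    ("q4", "#", "#"): "q0",
}
ACCEPT = {"q0", "q1"}

def trace(pairs):
    """Run a sequence of (intermediate, surface) symbol pairs and return
    the visited states plus accept/reject."""
    state, visited = "q0", ["q0"]
    for inp, out in pairs:
        nxt = TRANS.get((state, inp, out))
        if nxt is None:
            return visited, "reject"
        state = nxt
        visited.append(state)
    return visited, ("accept" if state in ACCEPT else "reject")

# (fox^s#, foxes#): note the epsilon:e pair for the inserted 'e'
print(trace([("f","f"),("o","o"),("x","x"),("^",""),("","e"),("s","s"),("#","#")]))
# (fox^s, foxs): ends in q5, which is not accepting -> reject
print(trace([("f","f"),("o","o"),("x","x"),("^",""),("s","s")]))
```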

  6. Issues • What do you think of creating all the rules for a language – by hand? • Time-consuming, complicated

  7. Issues • What do you think of creating all the rules for a language – by hand? • Time-consuming, complicated • Proposed approach: • Unsupervised morphology induction

  8. Issues • What do you think of creating all the rules for a language – by hand? • Time-consuming, complicated • Proposed approach: • Unsupervised morphology induction • Potentially useful for many applications • IR, MT

  9. Unsupervised Morphology • Start from tokenized text (or word frequencies) • talk 60 • talked 120 • walked 40 • walk 30

  10. Unsupervised Morphology • Start from tokenized text (or word frequencies) • talk 60 • talked 120 • walked 40 • walk 30 • Treat as coding/compression problem • Find most compact representation of lexicon • Popular model MDL (Minimum Description Length) • Smallest total encoding: • Weighted combination of lexicon size & ‘rules’

  11. Approach • Generate initial model: • Base set of words, compute MDL length

  12. Approach • Generate initial model: • Base set of words, compute MDL length • Iterate: • Generate a new set of words + some model to create a smaller description size

  13. Approach • Generate initial model: • Base set of words, compute MDL length • Iterate: • Generate a new set of words + some model to create a smaller description size • E.g. for talk, talked, walk, walked • 4 words

  14. Approach • Generate initial model: • Base set of words, compute MDL length • Iterate: • Generate a new set of words + some model to create a smaller description size • E.g. for talk, talked, walk, walked • 4 words • 2 words (talk, walk) + 1 affix (-ed) + combination info • 2 words (t,w) + 2 affixes (alk,-ed) + combination info
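
A back-of-the-envelope version of this comparison, under a deliberately naive cost model: description length is just the number of characters needed to spell out the pieces (stems, affixes) plus one combination flag per word. The `naive_dl` function and its costs are made up for illustration and are not a real MDL computation.

```python
words = ["talk", "talked", "walk", "walked"]

def naive_dl(stems, affixes, n_words):
    # characters to spell out the pieces, plus one symbol per word to
    # record which stem/affix combination that word uses
    return sum(len(s) for s in stems) + sum(len(a) for a in affixes) + n_words

print(naive_dl(words, [], len(words)))                  # 24: 4 whole words
print(naive_dl(["talk", "walk"], ["ed"], len(words)))   # 14: 2 stems + 1 affix
print(naive_dl(["t", "w"], ["alk", "ed"], len(words)))  # 11: 2 stems + 2 affixes
```

Under this toy metric the factored representations are smaller, which is the intuition MDL formalizes.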

  15. Successful Applications • Inducing word classes (e.g. N,V) by affix patterns • Unsupervised morphological analysis for MT • Word segmentation in CJK • Word text/sound segmentation in English

  16. Unit #1 Summary

  17. Formal Languages • Formal Languages and Grammars • Chomsky hierarchy • Languages and the grammars that accept/generate them

  18. Formal Languages • Formal Languages and Grammars • Chomsky hierarchy • Languages and the grammars that accept/generate them • Equivalences • Regular languages • Regular grammars • Regular expressions • Finite State Automata

  19. Finite-State Automata & Transducers • Finite-State Automata: • Deterministic & non-deterministic automata • Equivalence and conversion • Probabilistic & weighted FSAs

  20. Finite-State Automata & Transducers • Finite-State Automata: • Deterministic & non-deterministic automata • Equivalence and conversion • Probabilistic & weighted FSAs • Packages and operations: Carmel

  21. Finite-State Automata & Transducers • Finite-State Automata: • Deterministic & non-deterministic automata • Equivalence and conversion • Probabilistic & weighted FSAs • Packages and operations: Carmel • FSTs & regular relations • Closures and equivalences • Composition, inversion

  22. FSA/FST Applications • Range of applications: • Parsing • Translation • Tokenization…

  23. FSA/FST Applications • Range of applications: • Parsing • Translation • Tokenization… • Morphology: • Lexicon: cat: N, +Sg; -s: Pl • Morphotactics: N+PL • Orthographic rules: fox + s → foxes • Parsing & Generation

  24. Implementation • Tokenizers • FSA acceptors • FST acceptors/translators • Orthographic rule as FST
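
As a tiny example of the "FSA acceptor" item, here is a hypothetical two-state DFA that accepts strings of one or more lowercase letters, the kind of acceptor a simple tokenizer might use to recognize word tokens; the state names and function are invented for illustration.

```python
import string

ACCEPTING = {"in_word"}

def delta(state, ch):
    # From either state, a lowercase letter moves to 'in_word';
    # anything else has no transition and the machine blocks.
    if ch in string.ascii_lowercase:
        return "in_word"
    return None

def accepts(s):
    state = "start"
    for ch in s:
        state = delta(state, ch)
        if state is None:
            return False
    return state in ACCEPTING

print(accepts("foxes"))   # True
print(accepts("fox-es"))  # False: '-' has no transition
print(accepts(""))        # False: the empty string never leaves 'start'
```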

  25. Language Modeling

  26. Roadmap • Motivation: • LM applications • N-grams • Training and Testing • Evaluation: • Perplexity

  27. Predicting Words • Given a sequence of words, the next word is (somewhat) predictable: • I’d like to place a collect …..

  28. Predicting Words • Given a sequence of words, the next word is (somewhat) predictable: • I’d like to place a collect ….. • Ngram models: Predict next word given previous N • Language models (LMs): • Statistical models of word sequences

  29. Predicting Words • Given a sequence of words, the next word is (somewhat) predictable: • I’d like to place a collect ….. • Ngram models: Predict next word given previous N • Language models (LMs): • Statistical models of word sequences • Approach: • Build model of word sequences from corpus • Given alternative sequences, select the most probable

  30. N-gram LM Applications • Used in • Speech recognition • Spelling correction • Augmentative communication • Part-of-speech tagging • Machine translation • Information retrieval

  31. Terminology • Corpus (pl. corpora): • Online collection of text or speech • E.g. Brown corpus: 1M words, balanced text collection • E.g. Switchboard: 240 hrs of speech; ~3M words

  32. Terminology • Corpus (pl. corpora): • Online collection of text or speech • E.g. Brown corpus: 1M words, balanced text collection • E.g. Switchboard: 240 hrs of speech; ~3M words • Wordform: • Full inflected or derived form of a word: cats, glottalized

  33. Terminology • Corpus (pl. corpora): • Online collection of text or speech • E.g. Brown corpus: 1M words, balanced text collection • E.g. Switchboard: 240 hrs of speech; ~3M words • Wordform: • Full inflected or derived form of a word: cats, glottalized • Word types: # of distinct words in corpus

  34. Terminology • Corpus (pl. corpora): • Online collection of text or speech • E.g. Brown corpus: 1M words, balanced text collection • E.g. Switchboard: 240 hrs of speech; ~3M words • Wordform: • Full inflected or derived form of a word: cats, glottalized • Word types: # of distinct words in corpus • Word tokens: total # of words in corpus

  35. Corpus Counts • Estimate probabilities by counts in large collections of text/speech • Should we count: • Wordform vs. lemma? • Case? Punctuation? Disfluency? • Type vs. token?

  36. Words, Counts and Prediction • They picnicked by the pool, then lay back on the grass and looked at the stars.

  37. Words, Counts and Prediction • They picnicked by the pool, then lay back on the grass and looked at the stars. • Word types (excluding punct):

  38. Words, Counts and Prediction • They picnicked by the pool, then lay back on the grass and looked at the stars. • Word types (excluding punct): 14 • Word tokens (excluding punct):

  39. Words, Counts and Prediction • They picnicked by the pool, then lay back on the grass and looked at the stars. • Word types (excluding punct): 14 • Word tokens (excluding punct): 16 • I do uh main- mainly business data processing • Utterance (spoken “sentence” equivalent)

  40. Words, Counts and Prediction • They picnicked by the pool, then lay back on the grass and looked at the stars. • Word types (excluding punct): 14 • Word tokens (excluding punct): 16 • I do uh main- mainly business data processing • Utterance (spoken “sentence” equivalent) • What about: • Disfluencies • main-: fragment • uh: filler (aka filled pause)

  41. Words, Counts and Prediction • They picnicked by the pool, then lay back on the grass and looked at the stars. • Word types (excluding punct): 14 • Word tokens (excluding punct): 16 • I do uh main- mainly business data processing • Utterance (spoken “sentence” equivalent) • What about: • Disfluencies • main-: fragment • uh: filler (aka filled pause) • Keep, depending on app.: can help prediction; uh vs. um
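
A quick sketch that reproduces the type/token counts above for the picnic sentence, excluding punctuation (no lowercasing or other normalization is applied here).

```python
sentence = ("They picnicked by the pool, then lay back on the grass "
            "and looked at the stars.")

tokens = [w.strip(".,") for w in sentence.split()]   # drop attached punctuation
types = set(tokens)

print(len(tokens))   # 16 word tokens
print(len(types))    # 14 word types ('the' occurs three times)
```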

  42. LM Task • Training: • Given a corpus of text, learn probabilities of word sequences

  43. LM Task • Training: • Given a corpus of text, learn probabilities of word sequences • Testing: • Given trained LM and new text, determine sequence probabilities, or • Select most probable sequence among alternatives

  44. LM Task • Training: • Given a corpus of text, learn probabilities of word sequences • Testing: • Given trained LM and new text, determine sequence probabilities, or • Select most probable sequence among alternatives • LM types: • Basic, Class-based, Structured

  45. Word Prediction • Goal: • Given some history, what is probability of some next word? • Formally, P(w|h) • e.g. P(call|I’d like to place a collect)

  46. Word Prediction • Goal: • Given some history, what is probability of some next word? • Formally, P(w|h) • e.g. P(call|I’d like to place a collect) • How can we compute?

  47. Word Prediction • Goal: • Given some history, what is probability of some next word? • Formally, P(w|h) • e.g. P(call|I’d like to place a collect) • How can we compute? • Relative frequency in a corpus • C(I’d like to place a collect call)/C(I’d like to place a collect) • Issues?

  48. Word Prediction • Goal: • Given some history, what is probability of some next word? • Formally, P(w|h) • e.g. P(call|I’d like to place a collect) • How can we compute? • Relative frequency in a corpus • C(I’d like to place a collect call)/C(I’d like to place a collect) • Issues? • Zero counts: language is productive! • Joint word sequence probability of length N: • Count of all sequences of length N & count of that sequence
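
A sketch of this relative-frequency estimate on a made-up toy corpus (the corpus string and counts are invented for illustration, and plain substring counting stands in for proper n-gram counting). It also shows how zero counts appear as soon as a plausible continuation is unseen.

```python
corpus = ("i'd like to place a collect call "
          "i'd like to place a collect call please "
          "i'd like to place an order")

def count(phrase):
    # simplification: substring occurrences, not token-aligned n-gram counts
    return corpus.count(phrase)

h = "i'd like to place a collect"
print(count(h + " call") / count(h))   # 2/2 = 1.0 in this toy corpus
print(count(h + " card"))              # 0: unseen but perfectly good continuation
```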

  49. Word Sequence Probability • Notation: • P(Xi = the) written as P(the) • P(w1 w2 w3 … wn) = P(w1) P(w2|w1) P(w3|w1 w2) … P(wn|w1 … wn-1)

  50. Word Sequence Probability • Notation: • P(Xi = the) written as P(the) • P(w1 w2 w3 … wn) = P(w1) P(w2|w1) P(w3|w1 w2) … P(wn|w1 … wn-1) • Compute probability of word sequence by chain rule • Links to word prediction by history
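
A worked toy instance of the chain rule, with probability values made up purely for illustration: the joint probability of a three-word sequence is the product of one unconditional probability and successively longer conditional probabilities.

```python
p_w1          = 0.20   # P(i'd)                 -- hypothetical value
p_w2_given_1  = 0.50   # P(like | i'd)          -- hypothetical value
p_w3_given_12 = 0.40   # P(to | i'd like)       -- hypothetical value

p_sequence = p_w1 * p_w2_given_1 * p_w3_given_12
print(p_sequence)      # 0.04 = P(i'd like to) under these toy numbers
```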
