Language Modeling

Presentation Transcript


  1. Language Modeling

  2. Roadmap • Motivation: • LM applications • N-grams • Training and Testing • Evaluation: • Perplexity • Entropy • Smoothing (next class): • Laplace smoothing • Good-Turing smoothing • Interpolation & backoff

  5. Predicting Words • Given a sequence of words, the next word is (somewhat) predictable: • I’d like to place a collect … • N-gram models: predict the next word given the previous N-1 words • Language models (LMs): • Statistical models of word sequences • Approach: • Build a model of word sequences from a corpus • Given alternative sequences, select the most probable

  6. Predicting Sequences • Given an n-gram model, we can also answer questions about the probability of a whole sequence • Comparative probabilities of candidate sequences, e.g., in MT: • In: Gestern habe ich meine Mutter angerufen. • Out: Yesterday have I my mother called. • Yesterday I have my mom called. • Yesterday I have called my mom. • Yesterday I has called my mom. • Yesterday I call my mom. • I called my mother yesterday.

  7. N-gram LM Applications • Used in • Speech recognition • Spelling correction • Part-of-speech tagging • Machine translation • Information retrieval • Language Identification

  11. Terminology • Corpus (pl. corpora): • Online collection of text or speech • E.g. Brown corpus: 1M words, balanced text collection • E.g. Switchboard: 240 hrs of speech; ~3M words • Wordform: • Full inflected or derived form of a word: cats, glottalized • Word types: # of distinct words in a corpus • Word tokens: total # of words in a corpus

  12. Corpus Counts • Estimate probabilities by counts in large collections of text/speech • Should we count: • Wordform vs. lemma? • Case? Punctuation? Disfluency? • Type vs. token?

  18. Words, Counts and Prediction • They picnicked by the pool, then lay back on the grass and looked at the stars. • Word types (excluding punct): 14 • Word tokens (excluding punct): 16 • I do uh main- mainly business data processing • Utterance (spoken “sentence” equivalent) • What about: • Disfluencies • main-: fragment • uh: filler (aka filled pause) • Keep or strip, depending on the application: they can help prediction (uh vs. um)
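
To make the type/token distinction concrete, here is a minimal Python sketch (not from the slides) that counts both for the example sentence, using a simple regex tokenizer that drops punctuation:

```python
import re

sentence = ("They picnicked by the pool, then lay back on the grass "
            "and looked at the stars.")

# Lowercase and keep only alphabetic strings, so punctuation is excluded;
# whether to fold case is itself a counting decision (see Corpus Counts).
tokens = re.findall(r"[a-z]+", sentence.lower())

print("word tokens:", len(tokens))       # 16
print("word types: ", len(set(tokens)))  # 14 ("the" occurs three times)
```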

  20. LM Task • Training: • Given a corpus of text, learn probabilities of word sequences • Testing: • Given a trained LM and new text, determine sequence probabilities, or • Select the most probable sequence among alternatives

  24. Word Prediction • Goal: • Given some history, what is the probability of the next word? • Formally, P(w|h) • e.g. P(call | I’d like to place a collect) • How can we compute it? • Relative frequency in a corpus: • C(I’d like to place a collect call) / C(I’d like to place a collect) • Issues? • Zero counts: language is productive! • Estimating the joint probability of a whole length-N word sequence directly would likewise need the count of that exact sequence among all sequences of length N
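
As a sketch of the relative-frequency idea (the corpus below is an invented stand-in, not one of the corpora named earlier), P(call | I’d like to place a collect) is just the count of the history followed by call, divided by the count of the history:

```python
corpus = """I'd like to place a collect call to Seattle .
            I'd like to place a collect call , please .
            I'd like to place a collect international call .""".split()

def count_seq(tokens, seq):
    """Number of times the token sequence seq occurs in tokens."""
    n = len(seq)
    return sum(tokens[i:i + n] == seq for i in range(len(tokens) - n + 1))

history = "I'd like to place a collect".split()
print(count_seq(corpus, history + ["call"]) /
      count_seq(corpus, history))   # 2/3: one continuation is "international"
```

The zero-count problem shows up immediately: any continuation that never occurs in the corpus, however plausible, gets probability 0.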

  29. Word Sequence Probability • Notation: • P(Xi = the) written as P(the) • Chain rule: P(w1 w2 w3 … wn) = P(w1) P(w2|w1) P(w3|w1 w2) … P(wn|w1 … wn-1) • Compute the probability of a word sequence via the chain rule • Links to word prediction from a history • Issues? • Potentially infinite history • Language is infinitely productive: we cannot compute these probabilities exactly

  33. Markov Assumptions • Exact computation requires too much data • And we may not have all the data (even on the Web!) • Approximate the probability given all prior words • Assume a finite history • Unigram: probability of a word in isolation (0th order) • Bigram: probability of a word given 1 previous word (first-order Markov) • Trigram: probability of a word given 2 previous words • N-gram approximation: P(wn | w1 … wn-1) ≈ P(wn | wn-N+1 … wn-1) • Bigram sequence: P(w1 … wn) ≈ ∏i P(wi | wi-1)
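
The following sketch (Python, illustrative only) spells out which conditional factors an order-n model actually uses once the Markov assumption truncates each history; order=2 gives the bigram factorization above:

```python
def markov_factors(words, order=2):
    """Conditional factors P(w_i | truncated history) used by an
    order-n model (order=2: bigram, order=3: trigram)."""
    padded = ["BOS"] * (order - 1) + list(words)   # pad with start markers
    return [f"P({w} | {' '.join(padded[i:i + order - 1])})"
            for i, w in enumerate(words)]

for f in markov_factors("I'd like to place a collect call".split(), order=2):
    print(f)   # P(I'd | BOS), P(like | I'd), P(to | like), ...
```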

  38. Unigram Models • P(w1 w2 … wn) ≈ P(w1) P(w2) … P(wn) • Training: • Estimate P(w) from the corpus • Relative frequency: P(w) = C(w)/N, where N = # tokens in the corpus • How many parameters? • Testing: for a sentence s, compute P(s)
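
A minimal unigram sketch in Python (the twelve-token training "corpus" is invented for illustration): train P(w) = C(w)/N by relative frequency, then score a test sentence as a product of unigram probabilities:

```python
from collections import Counter

train = "the cat sat on the mat the dog sat on the log".split()

counts = Counter(train)
N = len(train)                                # 12 tokens
P = {w: c / N for w, c in counts.items()}     # unigram MLE: P(w) = C(w)/N

def unigram_prob(sentence):
    """P(s) under the unigram model: product of individual word probabilities."""
    p = 1.0
    for w in sentence.split():
        p *= P.get(w, 0.0)   # any unseen word zeroes the whole product
    return p

print(len(P))                        # 7 parameters, one per vocabulary type
print(unigram_prob("the cat sat"))   # (4/12) * (1/12) * (2/12)
```

With |V| = 7 the model has 7 parameters; word order contributes nothing, so "the cat sat" and "sat cat the" receive the same score.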

  42. Bigram Models • P(w1 w2 … wn) = P(BOS w1 w2 … wn EOS) • ≈ P(BOS) P(w1|BOS) P(w2|w1) … P(wn|wn-1) P(EOS|wn) • Training: • Relative frequency: P(wi|wi-1) = C(wi-1 wi) / C(wi-1) • How many parameters? • Testing: for a sentence s, compute P(s) • Model with a PFA (probabilistic finite automaton): • Input symbols? Probabilities on arcs? States?
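
A matching bigram sketch (same invented toy corpus), with BOS/EOS padding so sentence boundaries are modeled and P(wi|wi-1) estimated as C(wi-1 wi)/C(wi-1):

```python
from collections import Counter

sents = ["the cat sat on the mat", "the dog sat on the log"]
tokens = []
for s in sents:
    tokens += ["BOS"] + s.split() + ["EOS"]

bigram_counts = Counter(zip(tokens, tokens[1:]))
unigram_counts = Counter(tokens)

def bigram_prob(sentence):
    """P(s) ~ product of P(w_i | w_{i-1}) over the BOS/EOS-padded sentence."""
    ws = ["BOS"] + sentence.split() + ["EOS"]
    p = 1.0
    for prev, w in zip(ws, ws[1:]):
        p *= bigram_counts[(prev, w)] / unigram_counts[prev]
    return p

print(bigram_prob("the cat sat on the mat"))   # 1/16
print(bigram_prob("the cat sat on the log"))   # 1/16: both continuations of "the" seen
print(bigram_prob("the mat sat on the cat"))   # 0.0: bigram "mat sat" never observed
```

Unlike the unigram model, word order now matters, and any unseen bigram zeroes the whole sentence probability, which is exactly what smoothing (next class) addresses.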

  46. Trigram Models • P(w1 w2 … wn) = P(BOS w1 w2 … wn EOS) • ≈ P(BOS) P(w1|BOS) P(w2|BOS, w1) … P(wn|wn-2, wn-1) P(EOS|wn-1, wn) • Training: • P(wi|wi-2, wi-1) = C(wi-2 wi-1 wi) / C(wi-2 wi-1) • How many parameters? • How many states?
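
The same recipe generalizes to any order; a compact sketch (illustrative, not from the slides) that trains an order-n MLE model, with the trigram case as n = 3:

```python
from collections import Counter

def train_ngram(sentences, n=3):
    """Order-n MLE: P(w | history) = C(history, w) / C(history)."""
    ngrams, contexts = Counter(), Counter()
    for s in sentences:
        ws = ["BOS"] * (n - 1) + s.split() + ["EOS"]
        for i in range(n - 1, len(ws)):
            history, w = tuple(ws[i - n + 1:i]), ws[i]
            ngrams[history + (w,)] += 1
            contexts[history] += 1
    # Unseen histories raise ZeroDivisionError here; smoothing addresses that.
    return lambda w, history: ngrams[tuple(history) + (w,)] / contexts[tuple(history)]

P = train_ngram(["the cat sat on the mat", "the dog sat on the log"], n=3)
print(P("sat", ("the", "cat")))   # C(the cat sat) / C(the cat) = 1/1 = 1.0
```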

  49. Recap • N-grams: • # FSA states: |V|^(n-1) • # Model parameters: |V|^n • Issues: • Data sparseness, out-of-vocabulary (OOV) elements → smoothing • Mismatches between training & test data • Other language models
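
To see why sparseness bites, a quick back-of-the-envelope with an assumed vocabulary of 20,000 types (the figure is illustrative, not from the slides):

```python
V = 20_000   # assumed vocabulary size
for n, name in [(1, "unigram"), (2, "bigram"), (3, "trigram")]:
    print(f"{name}: {V ** (n - 1):.0e} states, {V ** n:.0e} parameters")
# unigram: 1e+00 states, 2e+04 parameters
# bigram:  2e+04 states, 4e+08 parameters
# trigram: 4e+08 states, 8e+12 parameters
```

Even a billion-word corpus contains at most on the order of 10^9 distinct trigrams, far fewer than 8 × 10^12 possible ones, so most parameters have zero counts.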

  50. Maximum Likelihood Estimation (MLE) • MLE estimate: normalize counts from the corpus so they lie between 0 and 1 • For a bigram count C(wn-1 wn), normalize by the counts of all bigrams sharing the same first word wn-1: • P(wn|wn-1) = C(wn-1 wn) / Σw C(wn-1 w) • Since Σw C(wn-1 w) = C(wn-1), this reduces to P(wn|wn-1) = C(wn-1 wn) / C(wn-1)
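
A quick check of that identity on a tiny padded corpus (invented): the bigram counts whose first word is wn-1 sum to C(wn-1), because every non-final occurrence of wn-1 starts exactly one bigram:

```python
from collections import Counter

tokens = "BOS the cat sat on the mat EOS".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))
unigram_counts = Counter(tokens)

w_prev = "the"
row_total = sum(c for (first, _), c in bigram_counts.items() if first == w_prev)
print(row_total, unigram_counts[w_prev])   # 2 2: sum_w C(the w) == C(the)
```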
