
Language Modeling: Probabilistic Word Sequences

Language Modeling (LM) is the art of determining the probability of word sequences. This article explores n-gram language modeling, the data sparseness problem, and smoothing techniques such as Katz and Kneser-Ney smoothing.


Presentation Transcript


  1. Introduction to Language Models (語言模型簡介) Presented by Wen-Hung Tsai, Speech Lab, CSIE, NTNU, 2005/07/13

  2. What is Language Modeling? • Language Modeling (LM) is the art of determining the probability of word sequences • Given a word sequence W = w1 w2 ... wn, its probability can be decomposed into a product of conditional probabilities: P(W) = P(w1) P(w2 | w1) ... P(wn | w1 ... wn-1)

  3. n-gram Language Modeling • The number of parameters of P(wi | w1 ... wi-1) is very large, on the order of |V|^i, where V denotes the vocabulary • n-gram assumption: the probability of word wi depends only on the previous n-1 words • Trigram: P(wi | w1 ... wi-1) ≈ P(wi | wi-2 wi-1)

  4. n-gram Language Modeling • Maximum likelihood estimate: P(wi | wi-2 wi-1) = C(wi-2 wi-1 wi) / C(wi-2 wi-1), where C(wi-2 wi-1 wi) represents the number of occurrences of wi-2 wi-1 wi in the training corpus, and similarly for C(wi-2 wi-1) • There are many three-word sequences that never occur in the training corpus; consider the sequence "party on Tuesday": what is P(Tuesday | party on)? • Data sparseness
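
A minimal sketch of the maximum likelihood trigram estimate on a toy corpus; the corpus, the padding symbols, and the function names are illustrative assumptions, not something from the original slides.

```python
from collections import Counter

def train_trigram_mle(sentences):
    """Count trigrams and their bigram prefixes from tokenized sentences."""
    tri_counts, bi_counts = Counter(), Counter()
    for sent in sentences:
        words = ["<s>", "<s>"] + sent + ["</s>"]  # pad so every word has two predecessors
        for i in range(2, len(words)):
            tri_counts[tuple(words[i-2:i+1])] += 1
            bi_counts[tuple(words[i-2:i])] += 1
    return tri_counts, bi_counts

def p_mle(w, w1, w2, tri_counts, bi_counts):
    """P(w | w1 w2) = C(w1 w2 w) / C(w1 w2); zero for unseen histories or trigrams."""
    denom = bi_counts[(w1, w2)]
    return tri_counts[(w1, w2, w)] / denom if denom else 0.0

corpus = [["party", "on", "Stan", "Chen's", "birthday"]]
tri, bi = train_trigram_mle(corpus)
print(p_mle("Tuesday", "party", "on", tri, bi))  # 0.0 -- the data sparseness problem
```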

  5. Smoothing • The training corpus might not contain any instances of the phrase, so C(party on Tuesday) would be 0 while there might still be 20 instances of the phrase "party on", giving P(Tuesday | party on) = 0 • Smoothing techniques take some probability away from observed occurrences and redistribute it • Imagine that "party on Stan Chen's birthday" appears in the training data and occurs only once

  6. Smoothing • By taking some probability away from some words, such as "Stan", and redistributing it to other words, such as "Tuesday", zero probabilities can be avoided • Katz smoothing • Jelinek-Mercer smoothing (deleted interpolation) • Kneser-Ney smoothing

  7. Smoothing: simple models • Add-one smoothing: pretend each trigram occurs once more than it actually does, P(wi | wi-2 wi-1) = (C(wi-2 wi-1 wi) + 1) / (C(wi-2 wi-1) + |V|) • Add-delta smoothing: add a fractional count δ instead of 1
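
A sketch of add-one and add-delta smoothing for the trigram case; the toy counts, vocabulary size, and delta value below are made up for illustration.

```python
from collections import Counter

def p_add_delta(w, w1, w2, tri_counts, bi_counts, vocab_size, delta=1.0):
    """(C(w1 w2 w) + delta) / (C(w1 w2) + delta * |V|); delta = 1 gives add-one smoothing."""
    return (tri_counts[(w1, w2, w)] + delta) / (bi_counts[(w1, w2)] + delta * vocab_size)

# Toy counts: "party on" seen 20 times, "party on Tuesday" never.
tri = Counter({("party", "on", "Stan"): 1})
bi = Counter({("party", "on"): 20})
V = 10_000  # assumed vocabulary size
print(p_add_delta("Tuesday", "party", "on", tri, bi, V))              # add-one
print(p_add_delta("Tuesday", "party", "on", tri, bi, V, delta=0.01))  # add-delta
```

Even an unseen trigram now receives a small nonzero probability, at the cost of taking mass away from the trigrams that were actually observed.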

  8. Simple interpolation • P_interp(wi | wi-2 wi-1) = λ P(wi | wi-2 wi-1) + μ P(wi | wi-1) + (1 - λ - μ) P(wi), where 0 ≤ λ, μ ≤ 1 • In practice, the uniform distribution 1/|V| is also interpolated; this ensures that no word is assigned probability 0
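
A sketch of the interpolation above across trigram, bigram, unigram, and uniform estimates; the weights are placeholders that would normally be tuned on held-out data.

```python
def p_interp(p_tri, p_bi, p_uni, vocab_size, lambdas=(0.5, 0.3, 0.15, 0.05)):
    """Linear interpolation of trigram, bigram, unigram, and uniform estimates.
    The weights must be non-negative and sum to 1."""
    l3, l2, l1, l0 = lambdas
    return l3 * p_tri + l2 * p_bi + l1 * p_uni + l0 * (1.0 / vocab_size)

# Even with zero trigram and bigram estimates, the result stays above zero.
print(p_interp(p_tri=0.0, p_bi=0.0, p_uni=0.0001, vocab_size=10_000))
```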

  9. Katz smoothing • Katz smoothing is based on the Good-Turing formula • Let nr represent the number of n-grams that occur r times; the discounted count is r* = (r + 1) n(r+1) / nr • The probability estimate for an n-gram with r counts is r*/N, where N is the size of the training data • Summed over all n-grams, including those with zero counts, the discounted counts still add up to N, so the size of the training data remains the same

  10. Katz smoothing • Let N represent the total size of the training set. Over the seen n-grams the discounted counts sum to N - n1 (for the largest observed count r, n(r+1) = 0, so its discounted count is 0), so the left-over probability mass, which is assigned to unseen n-grams, equals n1/N
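
A sketch of the Good-Turing discounting step that Katz smoothing builds on; the count-of-counts input is a made-up example.

```python
from collections import Counter

def good_turing_discounts(ngram_counts):
    """Compute discounted counts r* = (r+1) * n_{r+1} / n_r for each observed count r."""
    n = Counter(ngram_counts.values())      # n[r] = number of n-grams seen exactly r times
    total = sum(ngram_counts.values())      # N, the size of the training data
    r_star = {r: (r + 1) * n[r + 1] / n[r] for r in n}
    leftover = n[1] / total                 # probability mass reserved for unseen n-grams
    return r_star, leftover

# Made-up bigram counts, just to exercise the formula.
counts = Counter({("party", "on"): 3, ("on", "Stan"): 1, ("Stan", "Chen's"): 1, ("on", "my"): 2})
r_star, leftover = good_turing_discounts(counts)
print(r_star)    # discounted count for each observed r
print(leftover)  # n1 / N
```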

  11. Katz backoff smoothing
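
The equation for this slide is missing here; presumably it showed the Katz backoff model, whose standard bigram form is:

```latex
P_{\mathrm{Katz}}(w_i \mid w_{i-1}) =
\begin{cases}
  d_r \, \dfrac{C(w_{i-1} w_i)}{C(w_{i-1})} & \text{if } C(w_{i-1} w_i) > 0,\\[2ex]
  \alpha(w_{i-1}) \, P_{\mathrm{Katz}}(w_i) & \text{otherwise,}
\end{cases}
```

where d_r is the Good-Turing discount ratio r*/r for a bigram seen r times and alpha(w_{i-1}) is chosen so that the distribution sums to 1.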

  12. Katz backoff smoothing • Consider a bigram probability such as P_katz(Francisco | on). Since the phrase "San Francisco" is fairly common, the unigram probability of Francisco will also be fairly high. • This means that under Katz smoothing the backed-off probability will also be fairly high. But the word Francisco occurs in exceedingly few contexts, and its probability of occurring in a new one is very low

  13. Kneser-Ney smoothing • KN smoothing uses a modified backoff distribution based on the number of contexts each word occurs in, rather than the number of occurrences of the word. Thus, the probability P_KN(Francisco | on) would be fairly low, while for a word like Tuesday that occurs in many contexts, P_KN(Tuesday | on) would be relatively high, even if the phrase "on Tuesday" did not occur in the training data

  14. Kneser-Ney smoothing • Backoff Kneser-Ney smoothing: P_BKN(wi | wi-1) = max(C(wi-1 wi) - D, 0) / C(wi-1) if C(wi-1 wi) > 0, and α(wi-1) · |{v | C(v wi) > 0}| / Σ_w |{v | C(v w) > 0}| otherwise, where |{v | C(v wi) > 0}| is the number of distinct words v that wi occurs after (the number of contexts of wi), D is the discount, and α(wi-1) is a normalization constant such that the probabilities sum to 1
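
A rough sketch of the backoff formulation above for bigrams; the toy corpus, the discount value D = 0.75, and the normalization step are illustrative assumptions.

```python
from collections import Counter, defaultdict

def kn_backoff_bigram(tokens, D=0.75):
    """Backoff Kneser-Ney bigram model built from a token list (rough sketch)."""
    bigrams = list(zip(tokens, tokens[1:]))
    bi_counts = Counter(bigrams)
    hist_counts = Counter(tokens[:-1])            # C(w_{i-1}), counted as bigram starts
    contexts = defaultdict(set)
    for v, w in bigrams:
        contexts[w].add(v)                        # distinct left contexts of w
    cont = {w: len(vs) for w, vs in contexts.items()}
    total_cont = sum(cont.values())               # sum over w of |{v : C(v w) > 0}|

    def p_cont(w):
        return cont.get(w, 0) / total_cont

    def prob(w, prev):
        c = bi_counts[(prev, w)]
        if c > 0:
            return max(c - D, 0) / hist_counts[prev]
        seen = [x for (ctx, x) in bi_counts if ctx == prev]
        reserved = D * len(seen) / hist_counts[prev]              # mass freed by discounting
        alpha = reserved / max(1e-12, 1 - sum(p_cont(x) for x in seen))
        return alpha * p_cont(w)

    return prob

p = kn_backoff_bigram("a b b c a a c d a a b b c c".split())
print(p("b", "a"), p("d", "a"))   # seen bigram vs. backed-off estimate
```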

  15. Kneser-Ney smoothing • Worked example on a toy corpus over the vocabulary V = {a, b, c, d}: b b c a a c d a a b b b b c c a a b b c c c c d d a d c

  16. Kneser-Ney smoothing • Interpolated models always combine both the higher-order and the lower-order distribution • Interpolated Kneser-Ney smoothing: P_IKN(wi | wi-1) = max(C(wi-1 wi) - D, 0) / C(wi-1) + λ(wi-1) · |{v | C(v wi) > 0}| / Σ_w |{v | C(v w) > 0}|, where λ(wi-1) is a normalization constant such that the probabilities sum to 1
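
A matching sketch of the interpolated variant on the same kind of toy data; again the discount value and helper names are made up rather than taken from the slides.

```python
from collections import Counter, defaultdict

def kn_interpolated_bigram(tokens, D=0.75):
    """Interpolated Kneser-Ney bigram model (rough sketch)."""
    bigrams = list(zip(tokens, tokens[1:]))
    bi_counts = Counter(bigrams)
    hist_counts = Counter(tokens[:-1])
    contexts = defaultdict(set)
    for v, w in bigrams:
        contexts[w].add(v)
    cont = {w: len(vs) for w, vs in contexts.items()}
    total_cont = sum(cont.values())

    def prob(w, prev):
        higher = max(bi_counts[(prev, w)] - D, 0) / hist_counts[prev]
        distinct_after_prev = len({x for (ctx, x) in bi_counts if ctx == prev})
        lam = D * distinct_after_prev / hist_counts[prev]     # lambda(w_{i-1})
        return higher + lam * cont.get(w, 0) / total_cont     # lower order is always mixed in

    return prob

p = kn_interpolated_bigram("a b b c a a c d a a b b c c".split())
print(p("b", "a"), p("d", "a"))   # "d" never follows "a" but still gets continuation mass
```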

  17. Kneser-Ney smoothing • Multiple discounts can be used: one for one-counts, another for two-counts, and another for counts of three or more; using a separate discount for every count level would introduce too many parameters • Modified Kneser-Ney smoothing uses these three discounts (D1, D2, D3+)

  18. Jelinek-Mercer smoothing • Combines different n-gram orders by linearly interpolating the trigram, bigram, and unigram models whenever computing a trigram probability
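
The interpolation formula itself is missing here; presumably it was the usual Jelinek-Mercer form:

```latex
P_{\mathrm{JM}}(w_i \mid w_{i-2} w_{i-1}) =
  \lambda_3 \, P_{\mathrm{ML}}(w_i \mid w_{i-2} w_{i-1})
  + \lambda_2 \, P_{\mathrm{ML}}(w_i \mid w_{i-1})
  + \lambda_1 \, P_{\mathrm{ML}}(w_i),
\qquad \lambda_1 + \lambda_2 + \lambda_3 = 1
```

with the lambda weights typically estimated on held-out data (hence "deleted interpolation").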

  19. Absolute discounting • Absolute discounting subtracts a fixed discount D ≤ 1 from each nonzero count
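
The slide's equation is not in the transcript; the standard bigram form of absolute discounting is:

```latex
P_{\mathrm{abs}}(w_i \mid w_{i-1}) =
  \frac{\max\bigl(C(w_{i-1} w_i) - D,\, 0\bigr)}{C(w_{i-1})}
  + \lambda(w_{i-1}) \, P(w_i)
```

where lambda(w_{i-1}) redistributes the discounted mass over the lower-order (unigram) distribution.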

  20. Witten-Bell Discounting • Key concept (things seen once): use the count of things you've seen once to help estimate the count of things you've never seen • So we estimate the total probability mass of all the zero n-grams as the number of observed types divided by the number of tokens plus observed types, T / (N + T), where N is the number of tokens and T is the number of observed types

  21. Witten-Bell Discounting • T / (N + T) gives the total probability of unseen n-grams; we need to divide this up among all the zero n-grams • We could just choose to divide it equally, so each zero-count n-gram gets T / (Z (N + T)), where Z is the total number of n-grams with count zero
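
A sketch of the unseen-mass computation described above; the toy counts and the assumed number of possible n-grams are made up.

```python
from collections import Counter

def witten_bell_unseen_mass(ngram_counts, num_possible_ngrams):
    """Return the total probability mass for unseen n-grams and the share per zero-count n-gram."""
    N = sum(ngram_counts.values())      # number of tokens
    T = len(ngram_counts)               # number of observed types
    Z = num_possible_ngrams - T         # number of n-grams with count zero
    unseen_mass = T / (N + T)
    return unseen_mass, unseen_mass / Z

counts = Counter({("party", "on"): 20, ("on", "Stan"): 1, ("Stan", "Chen's"): 1})
mass, per_zero = witten_bell_unseen_mass(counts, num_possible_ngrams=10_000)
print(mass, per_zero)
```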

  22. Witten-Bell Discounting • Alternatively, we can represent the smoothed counts directly as:
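
The formula is missing from the transcript; presumably it is the usual smoothed-count form, which rescales every count so the total still sums to N:

```latex
c_i^{*} =
\begin{cases}
  \dfrac{T}{Z} \cdot \dfrac{N}{N + T} & \text{if } c_i = 0,\\[2ex]
  c_i \cdot \dfrac{N}{N + T} & \text{if } c_i > 0.
\end{cases}
```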

  23. Witten-Bell Discounting

  24. Witten-Bell Discounting • For bigrams, T is the number of bigram types and N is the number of bigram tokens
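
The bigram equations for this slide are missing; in the usual per-history form, the type and token counts are taken relative to the preceding word w_{i-1}:

```latex
P^{*}(w_i \mid w_{i-1}) =
\begin{cases}
  \dfrac{C(w_{i-1} w_i)}{N(w_{i-1}) + T(w_{i-1})} & \text{if } C(w_{i-1} w_i) > 0,\\[2ex]
  \dfrac{T(w_{i-1})}{Z(w_{i-1}) \bigl(N(w_{i-1}) + T(w_{i-1})\bigr)} & \text{if } C(w_{i-1} w_i) = 0,
\end{cases}
```

where T(w_{i-1}) is the number of distinct word types seen after w_{i-1}, N(w_{i-1}) is the number of bigram tokens starting with w_{i-1}, and Z(w_{i-1}) is the number of words never seen after w_{i-1}.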

  25. Evaluation • An LM that assigned equal probability to 100 words would have perplexity 100

  26. Evaluation • In general, the perplexity of an LM is equal to the geometric average of the inverse probability of the words measured on test data: Perplexity = P(w1 w2 ... wN)^(-1/N)

  27. Evaluation • The "true" model for any data source will have the lowest possible perplexity • The lower the perplexity of our model, the closer it is, in some sense, to the true model • Entropy is simply log2 of perplexity • Entropy is also the average number of bits per word that would be necessary to encode the test data using an optimal coder
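
A sketch connecting the two quantities: perplexity as the geometric average of inverse word probabilities on test data, and entropy as its log2; the per-word probabilities would come from whatever model is being evaluated.

```python
import math

def perplexity(per_word_probs):
    """Perplexity and entropy from per-word conditional probabilities P(w_i | history) on test data."""
    log_prob = sum(math.log2(p) for p in per_word_probs)
    n = len(per_word_probs)
    entropy = -log_prob / n        # average bits per word
    return 2 ** entropy, entropy

# A model that gives every word probability 1/100 has perplexity 100 and entropy log2(100) bits.
pp, h = perplexity([1 / 100] * 50)
print(pp, h)   # 100.0, ~6.64
```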

  28. Evaluation • A drop in entropy from 5 to 4 bits corresponds to a drop in perplexity from 32 to 16, a 50% reduction • A drop in entropy from 5 to 4.5 bits corresponds to a drop in perplexity from 32 to about 22.6, only a 29.3% reduction
