Language Modeling (LM) is the art of determining the probability of word sequences. This article explores n-gram language modeling, data sparseness smoothing techniques, and different smoothing models such as Katz and Kneser-Ney smoothing.
An Introduction to Language Modeling • Presented by Wen-Hung Tsai, Speech Lab, CSIE, NTNU, 2005/07/13
What is Language Modeling? • Language Modeling (LM) is the art of determining the probability of word sequences • Given a word sequence $W = w_1 w_2 \cdots w_m$, its probability can be decomposed into a product of conditional probabilities: $P(W) = P(w_1 w_2 \cdots w_m) = \prod_{i=1}^{m} P(w_i \mid w_1 \cdots w_{i-1})$
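n-gram Language Modeling • The number of parameters of $P(w_i \mid w_1 \cdots w_{i-1})$ is very large: $|V|^i$, where $V$ denotes the vocabulary • n-gram assumption: the probability of word $w_i$ depends only on the previous $n-1$ words • Trigram ($n = 3$): $P(w_i \mid w_1 \cdots w_{i-1}) \approx P(w_i \mid w_{i-2} w_{i-1})$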
n-gram Language Modeling • Maximum likelihood estimate: $P(w_i \mid w_{i-2} w_{i-1}) = \frac{C(w_{i-2} w_{i-1} w_i)}{C(w_{i-2} w_{i-1})}$ • $C(w_{i-2} w_{i-1} w_i)$ is the number of occurrences of $w_{i-2} w_{i-1} w_i$ in the training corpus, and similarly for $C(w_{i-2} w_{i-1})$ • Many three-word sequences never occur in the training corpus. Consider the sequence “party on Tuesday”: what is P(Tuesday | party on)? • This is the data sparseness problem
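The maximum likelihood estimate above can be sketched in a few lines of Python. This is only an illustration: the function name `trigram_mle` and the toy corpus are assumptions, not part of the original slides.

```python
from collections import Counter

def trigram_mle(tokens):
    """Return p(w, u, v) = C(u v w) / C(u v), estimated from a token list."""
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    # Count each history (u, v) once per trigram it starts, so that the
    # conditional probabilities for a given history sum to 1.
    hist = Counter((u, v) for (u, v, _w) in tri.elements())

    def prob(w, u, v):
        if hist[(u, v)] == 0:
            return 0.0  # history never seen: the MLE is undefined, report 0
        return tri[(u, v, w)] / hist[(u, v)]

    return prob

# Hypothetical toy corpus (not from the slides).
corpus = "we party on friday and we party on saturday".split()
p = trigram_mle(corpus)
seen = p("friday", "party", "on")     # trigram occurs in the corpus
unseen = p("tuesday", "party", "on")  # trigram never occurs
```

Here `unseen` comes out as exactly zero even though “party on tuesday” is a perfectly plausible phrase, which is the data sparseness problem in miniature.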
Smoothing • The training corpus might not contain any instance of the phrase, so C(party on Tuesday) would be 0, while there might still be 20 instances of the phrase “party on”. The maximum likelihood estimate is then P(Tuesday | party on) = 0/20 = 0 • Smoothing techniques take some probability away from the sequences that do occur • Imagine that “party on Stan Chen’s birthday” appears in the training data, and only once
Smoothing • By taking some probability away from some words, such as “Stan”, and redistributing it to other words, such as “Tuesday”, zero probabilities can be avoided • Katz smoothing • Jelinek-Mercer smoothing (deleted interpolation) • Kneser-Ney smoothing
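Smoothing: simple models • Add-one smoothing: $P(w_i \mid w_{i-2} w_{i-1}) = \frac{C(w_{i-2} w_{i-1} w_i) + 1}{C(w_{i-2} w_{i-1}) + |V|}$ • That is, pretend each trigram occurs once more than it actually does • Add-delta smoothing: $P(w_i \mid w_{i-2} w_{i-1}) = \frac{C(w_{i-2} w_{i-1} w_i) + \delta}{C(w_{i-2} w_{i-1}) + \delta |V|}$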
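A minimal sketch of add-delta smoothing (add-one is the special case delta = 1). The function name and the vocabulary size of 10,000 are assumptions chosen for illustration; the counts 0 and 20 are the “party on Tuesday” figures from the earlier slide.

```python
def add_delta_prob(ngram_count, history_count, vocab_size, delta=1.0):
    """Add-delta estimate: (C(h w) + delta) / (C(h) + delta * |V|)."""
    return (ngram_count + delta) / (history_count + delta * vocab_size)

# C(party on Tuesday) = 0, C(party on) = 20, hypothetical |V| = 10000.
p_add_one = add_delta_prob(0, 20, 10000)             # add-one smoothing
p_add_delta = add_delta_prob(0, 20, 10000, delta=0.01)

# Sanity check on a tiny 4-word vocabulary: the smoothed estimates for one
# history still sum to 1.
toy_counts = [3, 1, 0, 0]
toy_total = sum(add_delta_prob(c, 4, 4, delta=0.5) for c in toy_counts)
```

The unseen trigram now receives a small nonzero probability, at the cost of shaving a little mass off every seen trigram.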
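Simple Interpolation • $P_{interp}(w_i \mid w_{i-2} w_{i-1}) = \lambda P_{tri}(w_i \mid w_{i-2} w_{i-1}) + (1 - \lambda)\left[\mu P_{bi}(w_i \mid w_{i-1}) + (1 - \mu) P_{uni}(w_i)\right]$, where $0 \le \lambda, \mu \le 1$ • In practice, the uniform distribution $P_{uniform}(w_i) = 1/|V|$ is also interpolated; this ensures that no word is assigned probability 0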
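A small sketch of simple interpolation with two weights, here called `lam` and `mu` (the names, and the component probabilities in the example, are assumptions for illustration). The trigram estimate is mixed with a bigram/unigram mixture, so a zero trigram estimate no longer forces a zero result.

```python
def interpolate(p_tri, p_bi, p_uni, lam, mu):
    """lam * P_tri + (1 - lam) * (mu * P_bi + (1 - mu) * P_uni)."""
    if not (0.0 <= lam <= 1.0 and 0.0 <= mu <= 1.0):
        raise ValueError("interpolation weights must lie in [0, 1]")
    return lam * p_tri + (1.0 - lam) * (mu * p_bi + (1.0 - mu) * p_uni)

# Hypothetical component estimates: the trigram was never seen (0.0),
# but the bigram and unigram models still give the word some mass.
p = interpolate(p_tri=0.0, p_bi=0.01, p_uni=0.001, lam=0.6, mu=0.7)
```

In practice the weights are tuned on held-out data rather than fixed by hand.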
Katz smoothing • Katz smoothing is based on the Good-Turing formula • Let $n_r$ represent the number of n-grams that occur $r$ times; the discounted count is $disc(r) = (r+1)\frac{n_{r+1}}{n_r}$ • The probability estimate for an n-gram with $r$ counts is $P = \frac{disc(r)}{N}$, where $N$ is the size of the training data • The size of the training data, $N$, remains the same
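Katz smoothing • Summing the discounted counts over all seen n-grams gives $\sum_{r \ge 1} n_r \, disc(r) = \sum_{r \ge 1} (r+1)\, n_{r+1} = \sum_{r \ge 2} r\, n_r = N - n_1$ (for the largest observed count $r$, $(r+1)\, n_{r+1} = 0$) • Let $N$ represent the total size of the training set; the left-over count is $n_1$, so the left-over probability is equal to $n_1/N$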
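The Good-Turing bookkeeping can be checked numerically. The sketch below assumes the discounted count is disc(r) = (r+1) n_{r+1} / n_r, as in the Good-Turing formula; the function name and the toy counts are hypothetical.

```python
from collections import Counter

def good_turing(counts):
    """counts: dict mapping n-gram -> count.

    Returns (n, disc, leftover) where n[r] is the number of n-grams seen
    r times, disc(r) = (r + 1) * n[r + 1] / n[r], and leftover = n[1] / N.
    """
    n = Counter(counts.values())
    N = sum(counts.values())

    def disc(r):
        return (r + 1) * n[r + 1] / n[r]

    return n, disc, n[1] / N

# Hypothetical counts: n_1 = 3, n_2 = 2, n_3 = 1, N = 10.
toy = {"a": 3, "b": 2, "c": 2, "d": 1, "e": 1, "f": 1}
n, disc, leftover = good_turing(toy)
discounted_total = sum(n[r] * disc(r) for r in list(n))
```

With these counts the discounted total is N - n_1 and the left-over mass is n_1/N. Note that disc(r) is 0 for the largest observed r, which is one reason practical Katz smoothing applies the discount only to small counts.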
Katz backoff smoothing • Consider a bigram model of a phrase such as $P_{Katz}(\text{Francisco} \mid \text{on})$. Since the phrase “San Francisco” is fairly common, the unigram probability of “Francisco” will also be fairly high • This means that using Katz smoothing, the backed-off probability $P_{Katz}(\text{Francisco} \mid \text{on}) = \alpha(\text{on})\, P_{Katz}(\text{Francisco})$ (used when C(on Francisco) = 0) will also be fairly high. But the word “Francisco” occurs in exceedingly few contexts, and its probability of occurring in a new one is very low
Kneser-Ney smoothing • KN smoothing uses a modified backoff distribution based on the number of contexts each word occurs in, rather than the number of occurrences of the word. Thus, the probability PKN(Francisco | on) would be fairly low, while for a word like Tuesday that occurs in many contexts, PKN(Tuesday | on) would be relatively high, even if the phrase on Tuesday did not occur in the training data
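Kneser-Ney smoothing • Backoff Kneser-Ney smoothing: $P_{BKN}(w_i \mid w_{i-1}) = \frac{C(w_{i-1} w_i) - D}{C(w_{i-1})}$ if $C(w_{i-1} w_i) > 0$, and $P_{BKN}(w_i \mid w_{i-1}) = \alpha(w_{i-1}) \frac{|\{v \mid C(v w_i) > 0\}|}{\sum_{w} |\{v \mid C(v w) > 0\}|}$ otherwise • Here $|\{v \mid C(v w_i) > 0\}|$ is the number of distinct words $v$ that $w_i$ can follow (the number of contexts it occurs in), $D$ is the discount, and $\alpha(w_{i-1})$ is a normalization constant such that the probabilities sum to 1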
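To make the context-count idea concrete, the sketch below computes the lower-order Kneser-Ney distribution, i.e. the fraction of distinct bigram types that end in each word. The function name and the bigram data are hypothetical, chosen to mirror the Francisco/Tuesday contrast.

```python
from collections import defaultdict

def continuation_prob(bigrams):
    """bigrams: iterable of (v, w) pairs observed in training.

    Returns {w: |{v : C(v w) > 0}| / sum over w' of |{v : C(v w') > 0}|}.
    """
    preceders = defaultdict(set)
    for v, w in bigrams:
        preceders[w].add(v)
    total = sum(len(s) for s in preceders.values())
    return {w: len(s) / total for w, s in preceders.items()}

# "francisco" is frequent but always follows "san"; "tuesday" is rarer
# here but follows three different words.
pairs = [("san", "francisco")] * 5 + [
    ("on", "tuesday"), ("last", "tuesday"), ("next", "tuesday"),
]
p_cont = continuation_prob(pairs)
```

Although “francisco” has the higher raw count in this toy data, it gets the lower continuation probability because it appears in only one context.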
Kneser-Ney smoothing • Example over the vocabulary V = {a, b, c, d} (figure; symbol sequence as extracted): b b c a a c d a a b b b b c c a a b b c c c c d d a d c
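Kneser-Ney smoothing • Interpolated models always combine both the higher-order and the lower-order distribution • Interpolated Kneser-Ney smoothing: $P_{IKN}(w_i \mid w_{i-1}) = \frac{\max(C(w_{i-1} w_i) - D,\, 0)}{C(w_{i-1})} + \lambda(w_{i-1}) \frac{|\{v \mid C(v w_i) > 0\}|}{\sum_{w} |\{v \mid C(v w) > 0\}|}$, where $\lambda(w_{i-1})$ is a normalization constant such that the probabilities sum to 1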
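A compact sketch of interpolated Kneser-Ney for bigrams, under stated assumptions: a single discount D = 0.75, the normalization weight taken as D times the number of distinct followers of the history divided by the history count (the standard choice that makes the distribution sum to 1), and a hypothetical toy corpus. It is an illustration, not the presenter's implementation.

```python
from collections import Counter, defaultdict

def interpolated_kn(tokens, D=0.75):
    """Return prob(w, u) = P_IKN(w | u) estimated from a token list."""
    bigram = Counter(zip(tokens, tokens[1:]))
    hist = Counter()                 # hist[u] = C(u) as a bigram history
    followers = defaultdict(set)     # followers[u] = {w : C(u w) > 0}
    preceders = defaultdict(set)     # preceders[w] = {v : C(v w) > 0}
    for (u, w), c in bigram.items():
        hist[u] += c
        followers[u].add(w)
        preceders[w].add(u)
    n_types = len(bigram)            # equals the sum of |preceders[w]| over w

    def prob(w, u):
        p_cont = len(preceders.get(w, ())) / n_types
        if hist[u] == 0:
            return p_cont            # unseen history: lower-order model only
        lam = D * len(followers[u]) / hist[u]
        return max(bigram[(u, w)] - D, 0) / hist[u] + lam * p_cont

    return prob

# Hypothetical toy corpus.
tokens = "a b a c a b b c".split()
vocab = sorted(set(tokens))
p_ikn = interpolated_kn(tokens)
row_sums = {u: sum(p_ikn(w, u) for w in vocab) for u in vocab}
```

Each entry of `row_sums` should come out as 1 up to floating-point error, which confirms that the discounted mass is exactly what the lower-order term gives back.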
Kneser-Ney smoothing • Multiple discounts: one for one-counts, another for two-counts, and another for counts of three or more. The drawback is that this adds more parameters to estimate • Modified Kneser-Ney smoothing
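Jelinek-Mercer smoothing • Combines different n-gram orders by linearly interpolating all three models (trigram, bigram, and unigram) whenever a trigram probability is computed: $P(w_i \mid w_{i-2} w_{i-1}) = \lambda_1 P_{tri}(w_i \mid w_{i-2} w_{i-1}) + \lambda_2 P_{bi}(w_i \mid w_{i-1}) + \lambda_3 P_{uni}(w_i)$, with $\lambda_1 + \lambda_2 + \lambda_3 = 1$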
Absolute discounting • Absolute discounting subtracts a fixed discount $D \le 1$ from each nonzero count
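The discounting step on its own can be sketched as follows. The function name, the choice D = 0.5, and the counts of words following “party on” are assumptions for illustration; how the reserved mass is redistributed (for example to a lower-order model) is left out.

```python
def absolute_discount(counts, D=0.5):
    """Subtract D from every nonzero count.

    Returns (discounted probabilities, probability mass reserved for
    redistribution to other words).
    """
    total = sum(counts.values())
    discounted = {w: (c - D) / total for w, c in counts.items() if c > 0}
    reserved = D * len(discounted) / total
    return discounted, reserved

# Hypothetical counts of words following "party on" (20 instances in total).
after_party_on = {"stan": 1, "friday": 12, "saturday": 7}
probs, reserved = absolute_discount(after_party_on)
```

Every seen word gives up the same amount D of its count, so rare words such as “stan” lose a much larger share of their probability than frequent ones.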
Witten-Bell Discounting • Key concept (things seen once): use the count of things you’ve seen once to help estimate the count of things you’ve never seen • So we estimate the total probability mass of all the zero-count N-grams as the number of observed types divided by the number of tokens plus observed types: $\sum_{i : c_i = 0} p_i^* = \frac{T}{N + T}$, where $N$ is the number of tokens and $T$ is the number of observed types
Witten-Bell Discounting • $T/(N+T)$ gives the total “probability of unseen N-grams”; we need to divide this up among all the zero-count N-grams • We could simply divide it equally: $p_i^* = \frac{T}{Z(N + T)}$ for each N-gram with $c_i = 0$, where $Z$ is the total number of N-grams with count zero
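Witten-Bell Discounting • Alternatively, we can represent the smoothed counts directly as: $c_i^* = \frac{T}{Z} \cdot \frac{N}{N + T}$ if $c_i = 0$, and $c_i^* = c_i \cdot \frac{N}{N + T}$ if $c_i > 0$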
Witten-Bell Discounting • For bigrams • T: the number of bigram types • N: the number of bigram tokens
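A minimal sketch of Witten-Bell discounting with the unseen mass T/(N+T) divided equally among the Z zero-count events. The parameter `num_possible` (the total number of events that could occur) and the toy counts are assumptions for illustration.

```python
def witten_bell(counts, num_possible):
    """counts: dict mapping event -> count for the observed events.

    Returns (p_seen, p_unseen_each): c_i / (N + T) for each seen event,
    and T / (Z * (N + T)) for each of the Z unseen events.
    """
    N = sum(counts.values())                        # tokens
    T = sum(1 for c in counts.values() if c > 0)    # observed types
    Z = num_possible - T                            # zero-count events
    p_seen = {e: c / (N + T) for e, c in counts.items() if c > 0}
    p_unseen_each = T / (Z * (N + T)) if Z > 0 else 0.0
    return p_seen, p_unseen_each

# Hypothetical data: 3 observed types, 10 tokens, 5 possible events.
p_seen, p_unseen_each = witten_bell({"a": 5, "b": 3, "c": 2}, num_possible=5)
total_mass = sum(p_seen.values()) + 2 * p_unseen_each
```

Here N = 10, T = 3 and Z = 2, so the seen events share 10/13 of the mass and each unseen event receives 3/26.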
Evaluation • An LM that assigns equal probability to each of 100 words would have perplexity 100
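Evaluation • In general, the perplexity of an LM is equal to the geometric average of the inverse probability of the words measured on test data: $PP = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \cdots w_{i-1})}}$, where $N$ is the number of words in the test data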
Evaluation • The “true” model for any data source will have the lowest possible perplexity • The lower the perplexity of our model, the closer it is, in some sense, to the true model • Entropy is simply $\log_2$ of perplexity • Entropy is the average number of bits per word that would be necessary to encode the test data using an optimal coder
Evaluation • Entropy: 5 → 4 bits; perplexity: 32 → 16, a 50% reduction • Entropy: 5 → 4.5 bits; perplexity: 32 → $32/\sqrt{2} \approx 22.6$, a 29.3% reduction
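The relationship between perplexity and entropy can be sketched directly from the definitions above. The function name is an assumption; the inputs are the per-word probabilities a model assigns to the test data.

```python
import math

def perplexity(word_probs):
    """Return (perplexity, entropy in bits per word) for a list of
    per-word probabilities P(w_i | history) measured on test data."""
    entropy = -sum(math.log2(p) for p in word_probs) / len(word_probs)
    return 2.0 ** entropy, entropy

# A model that gives each of 100 words equal probability.
pp_uniform, h_uniform = perplexity([1 / 100] * 50)

# Perplexity reduction when entropy drops from 5 bits to 4.5 bits.
reduction = 1.0 - (2.0 ** 4.5) / (2.0 ** 5)
```

Working in log space avoids underflow from multiplying many small probabilities. The `reduction` value illustrates why a half-bit entropy gain corresponds to roughly a 29% drop in perplexity.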