
Language Modeling: Probabilistic Word Sequences

Language Modeling (LM) is the art of determining the probability of word sequences. This article explores n-gram language modeling, the data sparseness problem, and smoothing techniques such as Katz and Kneser-Ney smoothing.


Presentation Transcript


  1. Introduction to Language Models (語言模型簡介) Presented by Wen-Hung Tsai, Speech Lab, CSIE, NTNU, 2005/07/13

  2. What is Language Modeling? • Language Modeling (LM) is the art of determining the probability of word sequences • Given a word sequence W = w1 w2 ... wn, its probability can be decomposed into a product of conditional probabilities: P(W) = P(w1) P(w2 | w1) ... P(wn | w1 ... wn-1)

  3. n-gram Language Modeling • The number of parameters of P(wi | w1 ... wi-1) is very large, on the order of |V|^i, where V denotes the vocabulary • n-gram assumption: the probability of word wi depends only on the previous n-1 words • Trigram: P(wi | w1 ... wi-1) ≈ P(wi | wi-2 wi-1)

  4. n-gram Language Modeling • Maximum likelihood estimate: P(wi | wi-2 wi-1) = C(wi-2 wi-1 wi) / C(wi-2 wi-1), where C(wi-2 wi-1 wi) represents the number of occurrences of wi-2 wi-1 wi in the training corpus, and similarly for C(wi-2 wi-1) • There are many three-word sequences that never occur in the training corpus; consider the sequence "party on Tuesday": what is P(Tuesday | party on)? • Data sparseness
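
A minimal sketch of the maximum likelihood trigram estimate on a toy corpus; the corpus, the padding symbols, and the function names are illustrative assumptions, not something from the original slides.

```python
from collections import Counter

def train_trigram_mle(sentences):
    """Count trigrams and their bigram prefixes from tokenized sentences."""
    tri_counts, bi_counts = Counter(), Counter()
    for sent in sentences:
        words = ["<s>", "<s>"] + sent + ["</s>"]  # pad so every word has two predecessors
        for i in range(2, len(words)):
            tri_counts[tuple(words[i-2:i+1])] += 1
            bi_counts[tuple(words[i-2:i])] += 1
    return tri_counts, bi_counts

def p_mle(w, w1, w2, tri_counts, bi_counts):
    """P(w | w1 w2) = C(w1 w2 w) / C(w1 w2); zero for unseen histories or trigrams."""
    denom = bi_counts[(w1, w2)]
    return tri_counts[(w1, w2, w)] / denom if denom else 0.0

corpus = [["party", "on", "Stan", "Chen's", "birthday"]]
tri, bi = train_trigram_mle(corpus)
print(p_mle("Tuesday", "party", "on", tri, bi))  # 0.0 -- the data sparseness problem
```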

  5. Smoothing • The training corpus might not contain any instances of the phrase, so C(party on Tuesday) would be 0 while there might still be 20 instances of the phrase "party on", giving P(Tuesday | party on) = 0 • Smoothing techniques take some probability away from observed occurrences and redistribute it • Imagine that "party on Stan Chen's birthday" appears in the training data and occurs only once

  6. Smoothing • By taking some probability away from some words, such as "Stan", and redistributing it to other words, such as "Tuesday", zero probabilities can be avoided • Katz smoothing • Jelinek-Mercer smoothing (deleted interpolation) • Kneser-Ney smoothing

  7. Smoothing: simple models • Add-one smoothing: pretend each trigram occurs once more than it actually does, P(wi | wi-2 wi-1) = (C(wi-2 wi-1 wi) + 1) / (C(wi-2 wi-1) + |V|) • Add-delta smoothing: add a fractional count δ instead of 1
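
A sketch of add-one and add-delta smoothing for the trigram case; the toy counts, vocabulary size, and delta value below are made up for illustration.

```python
from collections import Counter

def p_add_delta(w, w1, w2, tri_counts, bi_counts, vocab_size, delta=1.0):
    """(C(w1 w2 w) + delta) / (C(w1 w2) + delta * |V|); delta = 1 gives add-one smoothing."""
    return (tri_counts[(w1, w2, w)] + delta) / (bi_counts[(w1, w2)] + delta * vocab_size)

# Toy counts: "party on" seen 20 times, "party on Tuesday" never.
tri = Counter({("party", "on", "Stan"): 1})
bi = Counter({("party", "on"): 20})
V = 10_000  # assumed vocabulary size
print(p_add_delta("Tuesday", "party", "on", tri, bi, V))              # add-one
print(p_add_delta("Tuesday", "party", "on", tri, bi, V, delta=0.01))  # add-delta
```

Even an unseen trigram now receives a small nonzero probability, at the cost of taking mass away from the trigrams that were actually observed.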

  8. Simple interpolation • P_interp(wi | wi-2 wi-1) = λ P(wi | wi-2 wi-1) + μ P(wi | wi-1) + (1 - λ - μ) P(wi), where 0 ≤ λ, μ ≤ 1 • In practice, the uniform distribution 1/|V| is also interpolated; this ensures that no word is assigned probability 0
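
A sketch of the interpolation above across trigram, bigram, unigram, and uniform estimates; the weights are placeholders that would normally be tuned on held-out data.

```python
def p_interp(p_tri, p_bi, p_uni, vocab_size, lambdas=(0.5, 0.3, 0.15, 0.05)):
    """Linear interpolation of trigram, bigram, unigram, and uniform estimates.
    The weights must be non-negative and sum to 1."""
    l3, l2, l1, l0 = lambdas
    return l3 * p_tri + l2 * p_bi + l1 * p_uni + l0 * (1.0 / vocab_size)

# Even with zero trigram and bigram estimates, the result stays above zero.
print(p_interp(p_tri=0.0, p_bi=0.0, p_uni=0.0001, vocab_size=10_000))
```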

  9. Katz smoothing • Katz smoothing is based on the Good-Turing formula • Let nr represent the number of n-grams that occur r times; the discounted count is r* = (r + 1) n(r+1) / nr • The probability estimate for an n-gram with r counts is r*/N, where N is the size of the training data • Summed over all n-grams, including those with zero counts, the discounted counts still add up to N, so the size of the training data remains the same

  10. Katz smoothing • Let N represent the total size of the training set. Over the seen n-grams the discounted counts sum to N - n1 (for the largest observed count r, n(r+1) = 0, so its discounted count is 0), so the left-over probability mass, which is assigned to unseen n-grams, equals n1/N
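
A sketch of the Good-Turing discounting step that Katz smoothing builds on; the count-of-counts input is a made-up example.

```python
from collections import Counter

def good_turing_discounts(ngram_counts):
    """Compute discounted counts r* = (r+1) * n_{r+1} / n_r for each observed count r."""
    n = Counter(ngram_counts.values())      # n[r] = number of n-grams seen exactly r times
    total = sum(ngram_counts.values())      # N, the size of the training data
    r_star = {r: (r + 1) * n[r + 1] / n[r] for r in n}
    leftover = n[1] / total                 # probability mass reserved for unseen n-grams
    return r_star, leftover

# Made-up bigram counts, just to exercise the formula.
counts = Counter({("party", "on"): 3, ("on", "Stan"): 1, ("Stan", "Chen's"): 1, ("on", "my"): 2})
r_star, leftover = good_turing_discounts(counts)
print(r_star)    # discounted count for each observed r
print(leftover)  # n1 / N
```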

  11. Katz backoff smoothing
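
The equation for this slide is missing here; presumably it showed the Katz backoff model, whose standard bigram form is:

```latex
P_{\mathrm{Katz}}(w_i \mid w_{i-1}) =
\begin{cases}
  d_r \, \dfrac{C(w_{i-1} w_i)}{C(w_{i-1})} & \text{if } C(w_{i-1} w_i) > 0,\\[2ex]
  \alpha(w_{i-1}) \, P_{\mathrm{Katz}}(w_i) & \text{otherwise,}
\end{cases}
```

where d_r is the Good-Turing discount ratio r*/r for a bigram seen r times and alpha(w_{i-1}) is chosen so that the distribution sums to 1.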

  12. Katz backoff smoothing • Consider a bigram probability such as P_katz(Francisco | on). Since the phrase "San Francisco" is fairly common, the unigram probability of Francisco will also be fairly high. • This means that under Katz smoothing the backed-off probability will also be fairly high. But the word Francisco occurs in exceedingly few contexts, and its probability of occurring in a new one is very low

  13. Kneser-Ney smoothing • KN smoothing uses a modified backoff distribution based on the number of contexts each word occurs in, rather than the number of occurrences of the word. Thus, the probability P_KN(Francisco | on) would be fairly low, while for a word like Tuesday that occurs in many contexts, P_KN(Tuesday | on) would be relatively high, even if the phrase "on Tuesday" did not occur in the training data

  14. Kneser-Ney smoothing • Backoff Kneser-Ney smoothing: P_BKN(wi | wi-1) = max(C(wi-1 wi) - D, 0) / C(wi-1) if C(wi-1 wi) > 0, and α(wi-1) · |{v | C(v wi) > 0}| / Σ_w |{v | C(v w) > 0}| otherwise, where |{v | C(v wi) > 0}| is the number of distinct words v that wi occurs after (the number of contexts of wi), D is the discount, and α(wi-1) is a normalization constant such that the probabilities sum to 1
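
A rough sketch of the backoff formulation above for bigrams; the toy corpus, the discount value D = 0.75, and the normalization step are illustrative assumptions.

```python
from collections import Counter, defaultdict

def kn_backoff_bigram(tokens, D=0.75):
    """Backoff Kneser-Ney bigram model built from a token list (rough sketch)."""
    bigrams = list(zip(tokens, tokens[1:]))
    bi_counts = Counter(bigrams)
    hist_counts = Counter(tokens[:-1])            # C(w_{i-1}), counted as bigram starts
    contexts = defaultdict(set)
    for v, w in bigrams:
        contexts[w].add(v)                        # distinct left contexts of w
    cont = {w: len(vs) for w, vs in contexts.items()}
    total_cont = sum(cont.values())               # sum over w of |{v : C(v w) > 0}|

    def p_cont(w):
        return cont.get(w, 0) / total_cont

    def prob(w, prev):
        c = bi_counts[(prev, w)]
        if c > 0:
            return max(c - D, 0) / hist_counts[prev]
        seen = [x for (ctx, x) in bi_counts if ctx == prev]
        reserved = D * len(seen) / hist_counts[prev]              # mass freed by discounting
        alpha = reserved / max(1e-12, 1 - sum(p_cont(x) for x in seen))
        return alpha * p_cont(w)

    return prob

p = kn_backoff_bigram("a b b c a a c d a a b b c c".split())
print(p("b", "a"), p("d", "a"))   # seen bigram vs. backed-off estimate
```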

  15. Kneser-Ney smoothing • Worked example on a toy corpus over the vocabulary V = {a, b, c, d}: b b c a a c d a a b b b b c c a a b b c c c c d d a d c

  16. Kneser-Ney smoothing • Interpolated models always combine both the higher-order and the lower-order distribution • Interpolated Kneser-Ney smoothing: P_IKN(wi | wi-1) = max(C(wi-1 wi) - D, 0) / C(wi-1) + λ(wi-1) · |{v | C(v wi) > 0}| / Σ_w |{v | C(v w) > 0}|, where λ(wi-1) is a normalization constant such that the probabilities sum to 1
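
A matching sketch of the interpolated variant on the same kind of toy data; again the discount value and helper names are made up rather than taken from the slides.

```python
from collections import Counter, defaultdict

def kn_interpolated_bigram(tokens, D=0.75):
    """Interpolated Kneser-Ney bigram model (rough sketch)."""
    bigrams = list(zip(tokens, tokens[1:]))
    bi_counts = Counter(bigrams)
    hist_counts = Counter(tokens[:-1])
    contexts = defaultdict(set)
    for v, w in bigrams:
        contexts[w].add(v)
    cont = {w: len(vs) for w, vs in contexts.items()}
    total_cont = sum(cont.values())

    def prob(w, prev):
        higher = max(bi_counts[(prev, w)] - D, 0) / hist_counts[prev]
        distinct_after_prev = len({x for (ctx, x) in bi_counts if ctx == prev})
        lam = D * distinct_after_prev / hist_counts[prev]     # lambda(w_{i-1})
        return higher + lam * cont.get(w, 0) / total_cont     # lower order is always mixed in

    return prob

p = kn_interpolated_bigram("a b b c a a c d a a b b c c".split())
print(p("b", "a"), p("d", "a"))   # "d" never follows "a" but still gets continuation mass
```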

  17. Kneser-Ney smoothing • Multiple discounts can be used: one for one-counts, another for two-counts, and another for counts of three or more; using a separate discount for every count level would introduce too many parameters • Modified Kneser-Ney smoothing uses these three discounts (D1, D2, D3+)

  18. Jelinek-Mercer smoothing • Combines different n-gram orders by linearly interpolating the trigram, bigram, and unigram models whenever computing a trigram probability
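
The interpolation formula itself is missing here; presumably it was the usual Jelinek-Mercer form:

```latex
P_{\mathrm{JM}}(w_i \mid w_{i-2} w_{i-1}) =
  \lambda_3 \, P_{\mathrm{ML}}(w_i \mid w_{i-2} w_{i-1})
  + \lambda_2 \, P_{\mathrm{ML}}(w_i \mid w_{i-1})
  + \lambda_1 \, P_{\mathrm{ML}}(w_i),
\qquad \lambda_1 + \lambda_2 + \lambda_3 = 1
```

with the lambda weights typically estimated on held-out data (hence "deleted interpolation").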

  19. Absolute discounting • Absolute discounting subtracts a fixed discount D ≤ 1 from each nonzero count
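
The slide's equation is not in the transcript; the standard bigram form of absolute discounting is:

```latex
P_{\mathrm{abs}}(w_i \mid w_{i-1}) =
  \frac{\max\bigl(C(w_{i-1} w_i) - D,\, 0\bigr)}{C(w_{i-1})}
  + \lambda(w_{i-1}) \, P(w_i)
```

where lambda(w_{i-1}) redistributes the discounted mass over the lower-order (unigram) distribution.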

  20. Witten-Bell Discounting • Key concept (things seen once): use the count of things you've seen once to help estimate the count of things you've never seen • So we estimate the total probability mass of all the zero n-grams as the number of observed types divided by the number of tokens plus observed types, T / (N + T), where N is the number of tokens and T is the number of observed types

  21. Witten-Bell Discounting • T / (N + T) gives the total probability of unseen n-grams; we need to divide this up among all the zero n-grams • We could just choose to divide it equally, so each zero-count n-gram gets T / (Z (N + T)), where Z is the total number of n-grams with count zero
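
A sketch of the unseen-mass computation described above; the toy counts and the assumed number of possible n-grams are made up.

```python
from collections import Counter

def witten_bell_unseen_mass(ngram_counts, num_possible_ngrams):
    """Return the total probability mass for unseen n-grams and the share per zero-count n-gram."""
    N = sum(ngram_counts.values())      # number of tokens
    T = len(ngram_counts)               # number of observed types
    Z = num_possible_ngrams - T         # number of n-grams with count zero
    unseen_mass = T / (N + T)
    return unseen_mass, unseen_mass / Z

counts = Counter({("party", "on"): 20, ("on", "Stan"): 1, ("Stan", "Chen's"): 1})
mass, per_zero = witten_bell_unseen_mass(counts, num_possible_ngrams=10_000)
print(mass, per_zero)
```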

  22. Witten-Bell Discounting • Alternatively, we can represent the smoothed counts directly as:
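
The formula is missing from the transcript; presumably it is the usual smoothed-count form, which rescales every count so the total still sums to N:

```latex
c_i^{*} =
\begin{cases}
  \dfrac{T}{Z} \cdot \dfrac{N}{N + T} & \text{if } c_i = 0,\\[2ex]
  c_i \cdot \dfrac{N}{N + T} & \text{if } c_i > 0.
\end{cases}
```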

  23. Witten-Bell Discounting

  24. Witten-Bell Discounting • For bigrams, T is the number of bigram types and N is the number of bigram tokens
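
The bigram equations for this slide are missing; in the usual per-history form, the type and token counts are taken relative to the preceding word w_{i-1}:

```latex
P^{*}(w_i \mid w_{i-1}) =
\begin{cases}
  \dfrac{C(w_{i-1} w_i)}{N(w_{i-1}) + T(w_{i-1})} & \text{if } C(w_{i-1} w_i) > 0,\\[2ex]
  \dfrac{T(w_{i-1})}{Z(w_{i-1}) \bigl(N(w_{i-1}) + T(w_{i-1})\bigr)} & \text{if } C(w_{i-1} w_i) = 0,
\end{cases}
```

where T(w_{i-1}) is the number of distinct word types seen after w_{i-1}, N(w_{i-1}) is the number of bigram tokens starting with w_{i-1}, and Z(w_{i-1}) is the number of words never seen after w_{i-1}.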

  25. Evaluation • An LM that assigned equal probability to 100 words would have perplexity 100

  26. Evaluation • In general, the perplexity of an LM is equal to the geometric average of the inverse probability of the words measured on test data: Perplexity = P(w1 w2 ... wN)^(-1/N)

  27. Evaluation • The "true" model for any data source will have the lowest possible perplexity • The lower the perplexity of our model, the closer it is, in some sense, to the true model • Entropy is simply log2 of perplexity • Entropy is also the average number of bits per word that would be necessary to encode the test data using an optimal coder
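
A sketch connecting the two quantities: perplexity as the geometric average of inverse word probabilities on test data, and entropy as its log2; the per-word probabilities would come from whatever model is being evaluated.

```python
import math

def perplexity(per_word_probs):
    """Perplexity and entropy from per-word conditional probabilities P(w_i | history) on test data."""
    log_prob = sum(math.log2(p) for p in per_word_probs)
    n = len(per_word_probs)
    entropy = -log_prob / n        # average bits per word
    return 2 ** entropy, entropy

# A model that gives every word probability 1/100 has perplexity 100 and entropy log2(100) bits.
pp, h = perplexity([1 / 100] * 50)
print(pp, h)   # 100.0, ~6.64
```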

  28. Evaluation • A drop in entropy from 5 to 4 bits corresponds to a drop in perplexity from 32 to 16, a 50% reduction • A drop in entropy from 5 to 4.5 bits corresponds to a drop in perplexity from 32 to about 22.6, only a 29.3% reduction
