
Chapter 6. Statistical Inference: n-gram Models over Sparse Data


Presentation Transcript


  1. Foundations of Statistical Natural Language Processing Chapter 6. Statistical Inference: n-gram Models over Sparse Data Pusan National University, 2014. 4. 22, Myoungjin Jung

  2. Introduction • Objective of Statistical NLP • Perform statistical inference for the domain of natural language. • Statistical inference (broadly divided into two steps) • 1. Taking some data generated by an unknown probability distribution (a corpus is needed). • 2. Making some inferences about this distribution (inferring the probability distribution from that corpus). • Divides the problem into three areas (the three steps of statistical language processing): • 1. Dividing the training data into equivalence classes. • 2. Finding a good statistical estimator for each equivalence class. • 3. Combining multiple estimators.

  3. Bins : Forming Equivalence Classes • Reliability vs Discrimination Ex)“large green ___________” tree? mountain? frog? car? “swallowed the large green ________” pill? broccoli? • smaller n: more instances in training data, better statistical estimates (more reliability) • larger n: more information about the context of the specific instance (greater discrimination)

  4. Bins : Forming Equivalence Classes • N-gram models • “n-gram” = sequence of n words • Predicting the next word : P(w_n | w_1 … w_{n-1}) • Markov assumption • Only the prior local context – the last few words – affects the next word. • Selecting an n : with a vocabulary of 20,000 words, a bigram model has about 20,000 × 19,999 ≈ 4 × 10^8 parameters, a trigram model about 8 × 10^12, and a four-gram model about 1.6 × 10^17.

  5. Bins : Forming Equivalence Classes • Probability dist. : P(s) where s : sentence • Ex. P(If you’re going to San Francisco, be sure ……) = P(If) * P(you’re|If) * P(going|If you’re) * P(to|If you’re going) * …… • Markov assumption • Only the last n-1 words are relevant for a prediction • Ex. With n=5 P(sure|If you’re going to San Francisco, be) = P(sure|San Francisco , be)
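
A minimal sketch of the idea on this slide: the chain-rule product becomes a product of short conditional probabilities under a Markov assumption. The toy bigram table, its probabilities, and the sentence_prob helper are invented for illustration (a bigram order is used here for brevity).

```python
# Sketch: sentence probability under the chain rule with a Markov assumption.
# The tiny bigram table is made up for illustration.
import math

bigram_prob = {
    ("<s>", "if"): 0.10, ("if", "you're"): 0.20, ("you're", "going"): 0.30,
    ("going", "to"): 0.40, ("to", "san"): 0.05, ("san", "francisco"): 0.60,
}

def sentence_prob(words, unseen=1e-6):
    """P(s) ~= product over i of P(w_i | w_{i-1})  (bigram Markov assumption)."""
    history = ["<s>"] + words[:-1]
    probs = [bigram_prob.get((h, w), unseen) for h, w in zip(history, words)]
    return math.prod(probs)

print(sentence_prob(["if", "you're", "going", "to", "san", "francisco"]))
```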

  6. Bins : Forming Equivalence Classes • N-gram : sequence of length n with a count • Ex. 5-gram : If you’re going to San • Sequence naming : w_{1,n} = w_1 w_2 … w_n • Markov assumption formalized : P(w_k | w_1 … w_{k-1}) ≈ P(w_k | w_{k-n+1} … w_{k-1}), i.e., only the previous n-1 words matter.

  7. Bins : Forming Equivalence Classes • Instead of P(s) : compute only one conditional probability P(w_k | w_1 … w_{k-1}) • Simplify P(w_k | w_1 … w_{k-1}) to P(w_k | w_{k-n+1} … w_{k-1}) (condition on the previous n-1 words only) • Next word prediction : NWP(w_1 … w_{k-1}) = arg max_{w ∈ V} P(w | w_{k-n+1} … w_{k-1}), where V is the set of all words in the corpus.

  8. Bins : Forming Equivalence Classes • Ex. The easiest way : relative frequency • P(w_k | w_{k-n+1} … w_{k-1}) = C(w_{k-n+1} … w_k) / C(w_{k-n+1} … w_{k-1}) • P(San | If you’re going to) = C(If you’re going to San) / C(If you’re going to)

  9. Statistical Estimators • Given the observed training data : • How do you develop a model (probability distribution) to predict future events? (i.e., obtain a better probability estimate) • Probability estimate of the target feature • Estimating the unknown probability distribution of n-grams.

  10. Statistical Estimators • Notation for the statistical estimation chapter : • N : number of training instances • B : number of bins (equivalence classes) the training instances are divided into • C(w_1 … w_n) : frequency of the n-gram w_1 … w_n in the training text • r : frequency of an n-gram, r = C(w_1 … w_n) • N_r : number of bins that have r training instances in them • T_r : total count, in further (held-out) data, of the n-grams of frequency r in the training data • h : “history” of preceding words

  11. Statistical Estimators • Example - Instances in the training corpus: “inferior to ________”

  12. Maximum Likelihood Estimation (MLE) • Definition • Use the relative frequency as a probability estimate. • Example : • In the corpus, we found 10 training instances of the bigram “comes across” • 8 times they were followed by “as” : P(as | comes across) = 0.8 • Once each by “more” and “a” : P(more | comes across) = 0.1, P(a | comes across) = 0.1 • For any word x other than these three : P(x | comes across) = 0.0 • Formula : P_MLE(w_1 … w_n) = C(w_1 … w_n) / N and P_MLE(w_n | w_1 … w_{n-1}) = C(w_1 … w_n) / C(w_1 … w_{n-1})

  13. Maximum Likelihood Estimation (MLE)

  14. Maximum Likelihood Estimation (MLE) • Toy example : a word sequence w_1, …, w_79 (N = 79) • 1-gram : P(the) = 7/79, P(bigram) = 2/79 • 2-gram : P(bigram | the) = 2/7 • 3-gram : P(model | the bigram) = 2/2 = 1
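
A minimal sketch of MLE by relative frequency. The toy corpus and helper names are invented; only the formulas P_MLE(w) = C(w)/N and P_MLE(w_n | w_1 … w_{n-1}) = C(w_1 … w_n) / C(w_1 … w_{n-1}) come from the slides.

```python
# Sketch: maximum likelihood estimates from raw n-gram counts (toy corpus).
from collections import Counter

tokens = "the bigram model is the model we train on the data".split()
N = len(tokens)
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def p_mle_unigram(w):
    return unigrams[w] / N                          # P_MLE(w) = C(w) / N

def p_mle_bigram(w_prev, w):
    return bigrams[(w_prev, w)] / unigrams[w_prev]  # C(w_prev w) / C(w_prev)

print(p_mle_unigram("the"), p_mle_bigram("the", "model"))
```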

  15. Laplace’s law, Lidstone’s law and the Jeffreys-Perks law • Laplace’s law (1814; 1995) • Add a little bit of probability space to unseen events : P_Lap(w_1 … w_n) = (C(w_1 … w_n) + 1) / (N + B)

  16. Laplace’s law, Lidstone’s law and the Jeffreys-Perks law • Toy example : N = 79 tokens, B = 51 seen word types + 70 unseen types = 121 bins, so P_Lap(w) = (C(w) + 1) / (79 + 121) = (C(w) + 1) / 200

  17. Laplace’s law, Lidstone’s law and the Jeffreys-Perks law • Pages 202-203 (Associated Press [AP] newswire vocabulary) • Laplace’s law adds a little probability space for unseen events, but it adds far too much. • With 44 million words of AP newswire, the vocabulary is 400,653 words, giving about 1.6 × 10^11 possible bigrams • The number of bins therefore far exceeds the number of training instances. • Laplace’s law puts B into the denominator to reserve probability space for unseen events, but as a result about 46.5% of the probability mass goes to unseen bigrams : • N_0 × P_Lap(unseen) = 74,671,100,000 × 0.000137 / 22,000,000 ≈ 0.465
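
A minimal sketch of add-one smoothing and of the arithmetic on this slide. The function name p_laplace is mine; the numbers (22 million training bigrams, a 400,653-word vocabulary, 74,671,100,000 unseen bigram types) are the ones quoted on the slide.

```python
# Sketch: add-one (Laplace) smoothing and the total mass it gives unseen bins.
def p_laplace(count, N, B):
    """P_Lap = (C + 1) / (N + B)."""
    return (count + 1) / (N + B)

N = 22_000_000              # training bigram tokens (one 22M-word half, per the slide)
B = 400_653 ** 2            # possible bigram types, ~1.6e11
N0 = 74_671_100_000         # unseen bigram types reported on the slide
unseen_mass = N0 * p_laplace(0, N, B)
print(f"probability mass assigned to unseen bigrams ~ {unseen_mass:.3f}")  # ~0.465
```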

  18. Laplace’s law, Lidstone’s law and the Jeffreys-Perks law • Lidstone’s law (1920) and the Jeffreys-Perks law (1973) • Lidstone’s law : add some positive value λ instead of 1 : P_Lid(w_1 … w_n) = (C(w_1 … w_n) + λ) / (N + Bλ) • Jeffreys-Perks law : λ = 0.5 • Called ELE (Expected Likelihood Estimation)

  19. Lidstone’s law • Using Lidstone’s law, instead of adding one, add some smaller value λ : P_Lid(w_1 … w_n) = (C(w_1 … w_n) + λ) / (N + Bλ), where the parameter λ > 0. • Equivalently, with μ = N / (N + Bλ), P_Lid = μ · C(w_1 … w_n)/N + (1 - μ) · 1/B, a linear interpolation between the MLE and the uniform prior.

  20. Lidstone’s law • Here, λ = 0 gives the maximum likelihood estimate, • λ = 1 gives Laplace’s law, • and if λ tends to ∞ we approach the uniform estimate 1/B. • μ = N / (N + Bλ) represents the trust we have in relative frequencies : • λ < 1 implies more trust in relative frequencies than Laplace’s law, • while λ > 1 represents less trust. • In practice, people use values of λ in the range 0 < λ ≤ 1, • a common value being λ = 0.5 (the Jeffreys-Perks law).
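
A minimal sketch of Lidstone's law as a single function; λ = 0 recovers the MLE, λ = 0.5 the ELE, and λ = 1 Laplace's law. The toy counts (C = 7, N = 79, B = 121) reuse the earlier example.

```python
# Sketch: Lidstone's law; lam=1 is Laplace, lam=0.5 is ELE (Jeffreys-Perks), lam=0 is MLE.
def p_lidstone(count, N, B, lam=0.5):
    """P_Lid = (C + lam) / (N + B * lam)."""
    return (count + lam) / (N + B * lam)

for lam in (0.0, 0.5, 1.0):
    print(lam, p_lidstone(7, 79, 121, lam))
```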

  21. Jeffreys-Perks law • Using Lidstone’s law with λ = 1/2 gives the Jeffreys-Perks law (ELE) : P_ELE(w_1 … w_n) = (C(w_1 … w_n) + 0.5) / (N + 0.5B) • Worked example for three cases A, B and C.

  22. Held out estimation (Jelinek and Mercer, 1985) • For each frequency r, let T_r = Σ_{w_1 … w_n : C_1(w_1 … w_n) = r} C_2(w_1 … w_n), the total number of times that the n-grams seen r times in the training data appear in the held-out data. • Then P_ho(w_1 … w_n) = T_r / (N_r · N), where r = C_1(w_1 … w_n) and N is the number of n-gram tokens in the held-out text.

  23. Held out estimation (Jelinek and Mercer, 1985) • Worked example (unigrams) : the full text is split into training data (70 unseen word types) and held-out data (51 unseen word types), and for each frequency r the counts N_r from the training data and the totals T_r from the held-out data are tabulated to give P_ho.

  24. Held out estimation (Jelinek and Mercer, 1985) • Look at how many times the bigrams that occurred r times in the training text occur in additionally sampled text (further text). • Held out estimation : a method for predicting how often a bigram that occurred r times in the training text will occur in further text. • Test data (independent of the training data) is only 5-10% of the total data, but that is enough to be reliable. • We want to divide the data into training data and held-out (test) data : data used to fit the model and data used to validate it. • Held out data (10%) • The held-out data is used to compute the held-out estimates of the n-grams.
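
A minimal sketch of held-out estimation for bigrams, following P_ho = T_r / (N_r · N) from the slide above. The corpus strings and function names are invented; for r = 0 the sketch only counts unseen bigrams that actually occur in the held-out data rather than all unseen bins.

```python
# Sketch: held-out estimation for bigrams.
# T_r = total held-out count of all bigrams seen r times in training;
# P_ho = T_r / (N_r * N), with N the number of bigram tokens in the held-out data.
from collections import Counter

def bigram_counts(tokens):
    return Counter(zip(tokens, tokens[1:]))

def held_out_probs(train_tokens, heldout_tokens):
    c_train = bigram_counts(train_tokens)
    c_held = bigram_counts(heldout_tokens)
    n_held = sum(c_held.values())
    N_r, T_r = Counter(), Counter()
    # Note: for r = 0 this only covers unseen bigrams that show up in the
    # held-out data; a full treatment would use all B - |seen| unseen bins.
    for bg in set(c_train) | set(c_held):
        r = c_train[bg]
        N_r[r] += 1
        T_r[r] += c_held[bg]
    return {r: T_r[r] / (N_r[r] * n_held) for r in N_r}

train = "a b a b c a b d a c".split()
held = "a b c a b a d b c a".split()
print(held_out_probs(train, held))
```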

  25. Cross-validation (deleted estimation; Jelinek and Mercer, 1985) • Use data for both training and validation • Divide the training data into 2 parts (A and B) • Model 1 : train on A, validate on B • Model 2 : train on B, validate on A • Combine Model 1 and Model 2 into the final model

  26. Cross-validation (deleted estimation; Jelinek and Mercer, 1985) • Cross validation : training data is used both as • initial training data • held out data • On large training corpora, deleted estimation works better than held-out estimation

  27. Cross-validation (deleted estimation; Jelinek and Mercer, 1985) • Worked example : the training data (70 unseen word types overall) is split into an A part (101 unseen word types) and a B part (90 unseen word types), and the counts N_r^A, N_r^B together with the cross-counts T_r^{AB}, T_r^{BA} are tabulated.

  28. Cross-validation (deleted estimation; Jelinek and Mercer, 1985) • Worked example (continued) : the same counts with the roles of the A and B parts swapped, so that the two directions (train on A / validate on B and train on B / validate on A) can be combined.

  29. Cross-validation (deleted estimation; Jelinek and Mercer, 1985) • With the held-out idea, we get the same effect by splitting the training data into two parts; this method is called cross-validation. • A more effective method : combining the two directions reduces the discrepancy between N_r^0 and N_r^1 : P_del(w_1 … w_n) = (T_r^{01} + T_r^{10}) / (N (N_r^0 + N_r^1)), where r = C(w_1 … w_n). • On a large training corpus, deleted estimation is more reliable than held-out estimation.
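
A minimal sketch of deleted estimation following the combined formula above. The two toy text halves and helper names are invented; as in the held-out sketch, the r = 0 class only covers unseen bigrams that happen to occur in the other half.

```python
# Sketch: deleted estimation (two-way cross-validation) for bigrams:
# P_del = (T_r^{01} + T_r^{10}) / (N * (N_r^0 + N_r^1)), r = count in one half.
from collections import Counter

def bigram_counts(tokens):
    return Counter(zip(tokens, tokens[1:]))

def deleted_estimates(part_a, part_b):
    c_a, c_b = bigram_counts(part_a), bigram_counts(part_b)
    n_total = sum(c_a.values()) + sum(c_b.values())
    N_r0, N_r1, T_r01, T_r10 = Counter(), Counter(), Counter(), Counter()
    for bg in set(c_a) | set(c_b):
        N_r0[c_a[bg]] += 1          # bins with count r in part A ...
        T_r01[c_a[bg]] += c_b[bg]   # ... and their total count in part B
        N_r1[c_b[bg]] += 1          # same thing in the other direction
        T_r10[c_b[bg]] += c_a[bg]
    return {r: (T_r01[r] + T_r10[r]) / (n_total * (N_r0[r] + N_r1[r]))
            for r in set(N_r0) | set(N_r1)}

a = "a b a b c a b d a c".split()
b = "a b c a b a d b c a".split()
print(deleted_estimates(a, b))
```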

  30. Good-Turing estimation (Good, 1953) : [binomial distribution] • Idea : re-estimate the probability mass assigned to n-grams with zero counts • Adjust actual counts r to expected counts r* with the formula r* = (r + 1) E[N_{r+1}] / E[N_r], and take P_GT(w_1 … w_n) = r* / N • (r* is an adjusted frequency; E denotes the expectation of a random variable)

  31. Good-Turing estimation (Good, 1953) : [binomial distribution] • In practice the observed N_r are used in place of the expectations : r* = (r + 1) N_{r+1} / N_r • When r is small, N_{r+1} < N_r and so r* < r : the counts of seen n-grams are discounted, and the freed mass goes to the unseen n-grams. • When r is large, N_r is small and the estimate is unreliable, so the MLE count is usually kept. • In short, what MLE over-estimates is adjusted downward (under-estimated).
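
A minimal sketch of simple Good-Turing adjusted counts using the observed N_r directly (no smoothing of the N_r themselves, and a fall-back to the raw count when N_{r+1} = 0); the toy corpus is invented.

```python
# Sketch: simple Good-Turing adjusted counts r* = (r + 1) * N_{r+1} / N_r.
from collections import Counter

def good_turing_counts(ngram_counts):
    N_r = Counter(ngram_counts.values())   # N_r: number of n-gram types with count r
    adjusted = {}
    for ngram, r in ngram_counts.items():
        if N_r[r + 1] > 0:
            adjusted[ngram] = (r + 1) * N_r[r + 1] / N_r[r]
        else:
            adjusted[ngram] = float(r)     # keep the raw count when N_{r+1} is empty
    return adjusted

tokens = "a b a b c a b d a c a b".split()
counts = Counter(zip(tokens, tokens[1:]))
print(good_turing_counts(counts))
# the total mass reserved for unseen bigrams is N_1 / N
```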

  32. Note • Drawback : over-estimation • [Two discounting models] (Ney and Essen, 1993; Ney et al., 1994) • Absolute discounting : lower each over-estimated non-zero count by a fixed δ : P_abs = (r - δ)/N if r > 0, and (B - N_0)δ / (N_0 N) otherwise. • Linear discounting : use a factor α to scale the non-zero counts : P_lin = (1 - α) r / N if r > 0, and α / N_0 otherwise.
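
A minimal sketch of the two discounting schemes as plain functions; the parameter values δ = 0.5 and α = 0.1 are arbitrary, and the toy numbers reuse the N = 79, B = 121, N_0 = 70 unigram example.

```python
# Sketch: absolute and linear discounting (Ney & Essen style).
def p_absolute(r, N, B, N0, delta=0.5):
    """P_abs = (r - delta)/N for seen n-grams, else (B - N0)*delta / (N0 * N)."""
    return (r - delta) / N if r > 0 else (B - N0) * delta / (N0 * N)

def p_linear(r, N, N0, alpha=0.1):
    """P_lin = (1 - alpha)*r/N for seen n-grams, else alpha / N0."""
    return (1 - alpha) * r / N if r > 0 else alpha / N0

print(p_absolute(7, 79, 121, 70), p_absolute(0, 79, 121, 70))
print(p_linear(7, 79, 70), p_linear(0, 79, 70))
```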

  33. Note • Drawback : over-estimation • [Natural Law of Succession] (Ristad, 1995)

  34. Combining Estimators • Basic Idea • Consider how to combine multiple probability estimates from various different models • How can you develop a model that uses different-length n-grams as appropriate? • Simple linear interpolation (of trigram, bigram and unigram) : P_li(w_n | w_{n-2} w_{n-1}) = λ_1 P_1(w_n) + λ_2 P_2(w_n | w_{n-1}) + λ_3 P_3(w_n | w_{n-2} w_{n-1}), where 0 ≤ λ_i ≤ 1 and Σ_i λ_i = 1.
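
A minimal sketch of simple linear interpolation over unigram, bigram, and trigram MLEs. The toy corpus and the fixed λ weights are invented; in practice the weights would be tuned on held-out data (e.g. with EM).

```python
# Sketch: simple linear interpolation of unigram, bigram, and trigram MLEs.
from collections import Counter

tokens = "the bigram model is the model we train on the bigram data".split()
N = len(tokens)
uni = Counter(tokens)
bi = Counter(zip(tokens, tokens[1:]))
tri = Counter(zip(tokens, tokens[1:], tokens[2:]))

def p_interp(w, h1, h2, lambdas=(0.2, 0.3, 0.5)):
    """P_li(w | h2 h1) = l1*P1(w) + l2*P2(w | h1) + l3*P3(w | h2 h1)."""
    l1, l2, l3 = lambdas
    p1 = uni[w] / N
    p2 = bi[(h1, w)] / uni[h1] if uni[h1] else 0.0
    p3 = tri[(h2, h1, w)] / bi[(h2, h1)] if bi[(h2, h1)] else 0.0
    return l1 * p1 + l2 * p2 + l3 * p3

print(p_interp("model", "bigram", "the"))   # P(model | the bigram)
```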

  35. Combining Estimators • [Katz’s backing-off] (Katz, 1987) • Example

  36. Combining Estimators • [Katz’s backing-off] (Katz, 1987) • If a sequence is unseen, use a shorter sequence • Ex. If P(San | going to) = 0, use P(San | to) • P_bo(w_i | w_{i-1}) = τ(w_i | w_{i-1}) if c(w_{i-1} w_i) > 0 (the discounted higher-order probability), and P_bo(w_i | w_{i-1}) = λ(w_{i-1}) · P_bo(w_i) if c(w_{i-1} w_i) = 0 (a weight times the lower-order probability).
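
A minimal sketch of back-off in the spirit of Katz. For simplicity it frees probability mass with a fixed absolute discount rather than Katz's Good-Turing discount, and the weight alpha(h) is computed so the conditional distribution still sums to one; the corpus and helper names are invented.

```python
# Sketch: Katz-style back-off for bigrams with a fixed absolute discount D.
from collections import Counter

tokens = "if you're going to san francisco be sure to wear some flowers".split()
N = len(tokens)
uni = Counter(tokens)
bi = Counter(zip(tokens, tokens[1:]))
D = 0.5

def p_uni(w):
    return uni[w] / N

def alpha(h):
    """Weight on the lower-order model: leftover discounted mass, renormalized."""
    seen = [w for (a, w) in bi if a == h]
    leftover = D * len(seen) / uni[h]
    return leftover / (1.0 - sum(p_uni(w) for w in seen))

def p_backoff(w, h):
    if bi[(h, w)] > 0:
        return (bi[(h, w)] - D) / uni[h]   # tau: discounted bigram estimate
    return alpha(h) * p_uni(w)             # back off to the weighted unigram

print(p_backoff("san", "to"), p_backoff("francisco", "to"))
```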

  37. Combining Estimators • [General linear interpolation] : P_li(w | h) = Σ_i λ_i(h) P_i(w | h), where 0 ≤ λ_i(h) ≤ 1 and Σ_i λ_i(h) = 1 (the weights may depend on the history h).

  38. Combining Estimators • Interpolated smoothing : P(w_i | w_{i-1}) = τ(w_i | w_{i-1}) + λ(w_{i-1}) · P(w_i) • i.e., the (discounted) higher-order probability plus a weight times the lower-order probability, used even when the higher-order count is non-zero. • Seems to work better than back-off smoothing.

  39. Note • Witten-Bell smoothing : P_WB(w_i | w_{i-1}) = λ_{w_{i-1}} · P_MLE(w_i | w_{i-1}) + (1 - λ_{w_{i-1}}) · P_WB(w_i) • where 1 - λ_{w_{i-1}} = N_{1+}(w_{i-1} •) / (N_{1+}(w_{i-1} •) + c(w_{i-1})) and N_{1+}(w_{i-1} •) = |{w : c(w_{i-1} w) > 0}|, the number of distinct word types that follow w_{i-1}.

  40. Note • Absolute discounting • Like Jelinek-Mercer, it interpolates higher- and lower-order models • But instead of multiplying the higher-order distribution by a λ, we subtract a fixed discount D ∈ [0, 1] from each nonzero count : • P_abs(w_i | w_{i-1}) = max(c(w_{i-1} w_i) - D, 0) / c(w_{i-1}) + (1 - λ_{w_{i-1}}) · P_abs(w_i) • To make it sum to 1 : (1 - λ_{w_{i-1}}) = (D / c(w_{i-1})) · N_{1+}(w_{i-1} •) • Choose D using held-out estimation.

  41. Note • KN smoothing (Kneser-Ney, 1995) • An extension of absolute discounting with a clever way of constructing the lower-order (backoff) model • Idea : the lower-order model is significant only when the count is small or zero in the higher-order model, and so should be optimized for that purpose : • P_KN(w_i | w_{i-1}) = max(c(w_{i-1} w_i) - D, 0) / c(w_{i-1}) + (D / c(w_{i-1})) · N_{1+}(w_{i-1} •) · P_KN(w_i)
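
A minimal sketch of interpolated Kneser-Ney for bigrams, with the lower-order model taken to be the continuation probability P_cont(w) = N_{1+}(• w) / N_{1+}(• •) derived on the following slides. The toy corpus and the discount D = 0.75 are invented.

```python
# Sketch: interpolated Kneser-Ney for bigrams with continuation probabilities.
from collections import Counter

tokens = "san francisco is foggy but new york and new jersey are not foggy".split()
bi = Counter(zip(tokens, tokens[1:]))
D = 0.75

continuations = Counter(w for (_, w) in bi)    # N1+(. w): distinct left contexts of w
followers = Counter(h for (h, _) in bi)        # N1+(h .): distinct words following h
h_totals = Counter()
for (h, _), c in bi.items():
    h_totals[h] += c                           # c(h): bigram tokens starting with h
total_bigram_types = len(bi)                   # N1+(. .)

def p_cont(w):
    return continuations[w] / total_bigram_types

def p_kn(w, h):
    """P_KN(w|h) = max(c(h w)-D, 0)/c(h) + (D/c(h)) * N1+(h .) * P_cont(w)."""
    c_h = h_totals[h]
    lam = (D / c_h) * followers[h]
    return max(bi[(h, w)] - D, 0) / c_h + lam * p_cont(w)

# "francisco" occurs only after "san", so its continuation probability stays low:
print(p_kn("york", "new"), p_kn("francisco", "new"))
```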

  42. Note • “An empirical study of smoothing techniques for language modeling” (Chen and Goodman, 1999) • For a bigram model, we would like to select a smoothed distribution P_KN that satisfies the following constraint on unigram marginals for all w_i : • (1) Σ_{w_{i-1}} P_KN(w_i | w_{i-1}) P(w_{i-1}) = c(w_i) / Σ_w c(w)  (the constraint) • (2) From (1), taking P(w_{i-1}) to be the empirical distribution c(w_{i-1}) / Σ_w c(w) : c(w_i) = Σ_{w_{i-1}} c(w_{i-1}) P_KN(w_i | w_{i-1}) • (3) From (2), substitute the interpolated absolute-discounting form of P_KN(w_i | w_{i-1}) from the previous slide and expand (next slide).

  43. Note • c(w_i) = Σ_{w_{i-1}} c(w_{i-1}) [ max(c(w_{i-1} w_i) - D, 0) / c(w_{i-1}) + (D / c(w_{i-1})) N_{1+}(w_{i-1} •) P_KN(w_i) ] = Σ_{w_{i-1}} max(c(w_{i-1} w_i) - D, 0) + D · P_KN(w_i) Σ_{w_{i-1}} N_{1+}(w_{i-1} •) = c(w_i) - D · N_{1+}(• w_i) + D · P_KN(w_i) · N_{1+}(• •)

  44. Note • where N_{1+}(• w_i) = |{w_{i-1} : c(w_{i-1} w_i) > 0}| • and N_{1+}(• •) = Σ_{w_i} N_{1+}(• w_i) = |{(w_{i-1}, w_i) : c(w_{i-1} w_i) > 0}| • Solving for the lower-order model : P_KN(w_i) = N_{1+}(• w_i) / N_{1+}(• •)

  45. Note • Generalizing to higher-order models, we have that • P_KN(w_i | w_{i-n+2} … w_{i-1}) = N_{1+}(• w_{i-n+2} … w_i) / N_{1+}(• w_{i-n+2} … w_{i-1} •) • where N_{1+}(• w_{i-n+2} … w_i) = |{w_{i-n+1} : c(w_{i-n+1} … w_i) > 0}| • and N_{1+}(• w_{i-n+2} … w_{i-1} •) = Σ_{w_i} N_{1+}(• w_{i-n+2} … w_i) = |{(w_{i-n+1}, w_i) : c(w_{i-n+1} … w_i) > 0}|
