
Chapter 6. Statistical Inference: n-gram Models over Sparse Data


Presentation Transcript


  1. Foundations of Statistical Natural Language Processing Chapter 6. Statistical Inference: n-gram Models over Sparse Data Pusan National University, 2014. 4. 22, Myoungjin Jung

  2. Introduction • Objective of Statistical NLP • Perform statistical inference for the domain of natural language. • Statistical inference (broadly divided into two steps) • 1. Taking some data generated by an unknown probability distribution (a corpus is needed). • 2. Making some inferences about this distribution (inferring the probability distribution from that corpus). • Divides the problem into three areas (the three steps of statistical language processing): • 1. Dividing the training data into equivalence classes. • 2. Finding a good statistical estimator for each equivalence class. • 3. Combining multiple estimators.

  3. Bins : Forming Equivalence Classes • Reliability vs Discrimination Ex)“large green ___________” tree? mountain? frog? car? “swallowed the large green ________” pill? broccoli? • smaller n: more instances in training data, better statistical estimates (more reliability) • larger n: more information about the context of the specific instance (greater discrimination)

  4. Bins : Forming Equivalence Classes • N-gram models • “n-gram” = sequence of n words • Predicting the next word : P(w_n | w_1 … w_{n-1}) • Markov assumption • Only the prior local context – the last few words – affects the next word. • Selecting an n : with a vocabulary of 20,000 words, a bigram model has about 20,000 × 19,999 ≈ 4 × 10^8 parameters, a trigram model about 8 × 10^12, and a four-gram model about 1.6 × 10^17.

  5. Bins : Forming Equivalence Classes • Probability dist. : P(s) where s : sentence • Ex. P(If you’re going to San Francisco, be sure ……) = P(If) * P(you’re|If) * P(going|If you’re) * P(to|If you’re going) * …… • Markov assumption • Only the last n-1 words are relevant for a prediction • Ex. With n=5 P(sure|If you’re going to San Francisco, be) = P(sure|San Francisco , be)
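
A minimal sketch of the idea on this slide: the chain-rule product becomes a product of short conditional probabilities under a Markov assumption. The toy bigram table, its probabilities, and the sentence_prob helper are invented for illustration (a bigram order is used here for brevity).

```python
# Sketch: sentence probability under the chain rule with a Markov assumption.
# The tiny bigram table is made up for illustration.
import math

bigram_prob = {
    ("<s>", "if"): 0.10, ("if", "you're"): 0.20, ("you're", "going"): 0.30,
    ("going", "to"): 0.40, ("to", "san"): 0.05, ("san", "francisco"): 0.60,
}

def sentence_prob(words, unseen=1e-6):
    """P(s) ~= product over i of P(w_i | w_{i-1})  (bigram Markov assumption)."""
    history = ["<s>"] + words[:-1]
    probs = [bigram_prob.get((h, w), unseen) for h, w in zip(history, words)]
    return math.prod(probs)

print(sentence_prob(["if", "you're", "going", "to", "san", "francisco"]))
```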

  6. Bins : Forming Equivalence Classes • N-gram : sequence of length n with a count • Ex. 5-gram : If you’re going to San • Sequence naming : w_{1,n} = w_1 w_2 … w_n • Markov assumption formalized : P(w_k | w_1 … w_{k-1}) ≈ P(w_k | w_{k-n+1} … w_{k-1}), i.e., only the previous n-1 words matter.

  7. Bins : Forming Equivalence Classes • Instead of P(s) : compute only one conditional probability P(w_k | w_1 … w_{k-1}) • Simplify P(w_k | w_1 … w_{k-1}) to P(w_k | w_{k-n+1} … w_{k-1}) (condition on the previous n-1 words only) • Next word prediction : NWP(w_1 … w_{k-1}) = arg max_{w ∈ V} P(w | w_{k-n+1} … w_{k-1}), where V is the set of all words in the corpus.

  8. Bins : Forming Equivalence Classes • Ex. The easiest way : relative frequency • P(w_k | w_{k-n+1} … w_{k-1}) = C(w_{k-n+1} … w_k) / C(w_{k-n+1} … w_{k-1}) • P(San | If you’re going to) = C(If you’re going to San) / C(If you’re going to)

  9. Statistical Estimators • Given the observed training data : • How do you develop a model (probability distribution) to predict future events? (i.e., obtain a better probability estimate) • Probability estimate of the target feature • Estimating the unknown probability distribution of n-grams.

  10. Statistical Estimators • Notation for the statistical estimation chapter : • N : number of training instances • B : number of bins (equivalence classes) the training instances are divided into • C(w_1 … w_n) : frequency of the n-gram w_1 … w_n in the training text • r : frequency of an n-gram, r = C(w_1 … w_n) • N_r : number of bins that have r training instances in them • T_r : total count, in further (held-out) data, of the n-grams of frequency r in the training data • h : “history” of preceding words

  11. Statistical Estimators • Example - Instances in the training corpus: “inferior to ________”

  12. Maximum Likelihood Estimation (MLE) • Definition • Use the relative frequency as a probability estimate. • Example : • In the corpus, we found 10 training instances of the bigram “comes across” • 8 times they were followed by “as” : P(as | comes across) = 0.8 • Once each by “more” and “a” : P(more | comes across) = 0.1, P(a | comes across) = 0.1 • For any word x other than these three : P(x | comes across) = 0.0 • Formula : P_MLE(w_1 … w_n) = C(w_1 … w_n) / N and P_MLE(w_n | w_1 … w_{n-1}) = C(w_1 … w_n) / C(w_1 … w_{n-1})

  13. Maximum Likelihood Estimation (MLE)

  14. Maximum Likelihood Estimation (MLE) • Toy example : a word sequence w_1, …, w_79 (N = 79) • 1-gram : P(the) = 7/79, P(bigram) = 2/79 • 2-gram : P(bigram | the) = 2/7 • 3-gram : P(model | the bigram) = 2/2 = 1
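
A minimal sketch of MLE by relative frequency. The toy corpus and helper names are invented; only the formulas P_MLE(w) = C(w)/N and P_MLE(w_n | w_1 … w_{n-1}) = C(w_1 … w_n) / C(w_1 … w_{n-1}) come from the slides.

```python
# Sketch: maximum likelihood estimates from raw n-gram counts (toy corpus).
from collections import Counter

tokens = "the bigram model is the model we train on the data".split()
N = len(tokens)
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def p_mle_unigram(w):
    return unigrams[w] / N                          # P_MLE(w) = C(w) / N

def p_mle_bigram(w_prev, w):
    return bigrams[(w_prev, w)] / unigrams[w_prev]  # C(w_prev w) / C(w_prev)

print(p_mle_unigram("the"), p_mle_bigram("the", "model"))
```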

  15. Laplace’s law, Lidstone’s law and the Jeffreys-Perks law • Laplace’s law (1814; 1995) • Add a little bit of probability space to unseen events : P_Lap(w_1 … w_n) = (C(w_1 … w_n) + 1) / (N + B)

  16. Laplace’s law, Lidstone’s law and the Jeffreys-Perks law • Toy example : N = 79 tokens, B = 51 seen word types + 70 unseen types = 121 bins, so P_Lap(w) = (C(w) + 1) / (79 + 121) = (C(w) + 1) / 200

  17. Laplace’s law, Lidstone’s law and the Jeffreys-Perks law • Pages 202-203 (Associated Press [AP] newswire vocabulary) • Laplace’s law adds a little probability space for unseen events, but it adds far too much. • With 44 million words of AP newswire, the vocabulary is 400,653 words, giving about 1.6 × 10^11 possible bigrams • The number of bins therefore far exceeds the number of training instances. • Laplace’s law puts B into the denominator to reserve probability space for unseen events, but as a result about 46.5% of the probability mass goes to unseen bigrams : • N_0 × P_Lap(unseen) = 74,671,100,000 × 0.000137 / 22,000,000 ≈ 0.465
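
A minimal sketch of add-one smoothing and of the arithmetic on this slide. The function name p_laplace is mine; the numbers (22 million training bigrams, a 400,653-word vocabulary, 74,671,100,000 unseen bigram types) are the ones quoted on the slide.

```python
# Sketch: add-one (Laplace) smoothing and the total mass it gives unseen bins.
def p_laplace(count, N, B):
    """P_Lap = (C + 1) / (N + B)."""
    return (count + 1) / (N + B)

N = 22_000_000              # training bigram tokens (one 22M-word half, per the slide)
B = 400_653 ** 2            # possible bigram types, ~1.6e11
N0 = 74_671_100_000         # unseen bigram types reported on the slide
unseen_mass = N0 * p_laplace(0, N, B)
print(f"probability mass assigned to unseen bigrams ~ {unseen_mass:.3f}")  # ~0.465
```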

  18. Laplace’s law, Lidstone’s law and the Jeffreys-Perks law • Lidstone’s law (1920) and the Jeffreys-Perks law (1973) • Lidstone’s law : add some positive value λ instead of 1 : P_Lid(w_1 … w_n) = (C(w_1 … w_n) + λ) / (N + Bλ) • Jeffreys-Perks law : λ = 0.5 • Called ELE (Expected Likelihood Estimation)

  19. Lidstone’s law • Using Lidstone’s law, instead of adding one, add some smaller value λ : P_Lid(w_1 … w_n) = (C(w_1 … w_n) + λ) / (N + Bλ), where the parameter λ > 0. • Equivalently, with μ = N / (N + Bλ), P_Lid = μ · C(w_1 … w_n)/N + (1 - μ) · 1/B, a linear interpolation between the MLE and the uniform prior.

  20. Lidstone’s law • Here, λ = 0 gives the maximum likelihood estimate, • λ = 1 gives Laplace’s law, • and if λ tends to ∞ we approach the uniform estimate 1/B. • μ = N / (N + Bλ) represents the trust we have in relative frequencies : • λ < 1 implies more trust in relative frequencies than Laplace’s law, • while λ > 1 represents less trust. • In practice, people use values of λ in the range 0 < λ ≤ 1, • a common value being λ = 0.5 (the Jeffreys-Perks law).
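
A minimal sketch of Lidstone's law as a single function; λ = 0 recovers the MLE, λ = 0.5 the ELE, and λ = 1 Laplace's law. The toy counts (C = 7, N = 79, B = 121) reuse the earlier example.

```python
# Sketch: Lidstone's law; lam=1 is Laplace, lam=0.5 is ELE (Jeffreys-Perks), lam=0 is MLE.
def p_lidstone(count, N, B, lam=0.5):
    """P_Lid = (C + lam) / (N + B * lam)."""
    return (count + lam) / (N + B * lam)

for lam in (0.0, 0.5, 1.0):
    print(lam, p_lidstone(7, 79, 121, lam))
```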

  21. Jeffreys-Perks law • Using Lidstone’s law with λ = 1/2 gives the Jeffreys-Perks law (ELE) : P_ELE(w_1 … w_n) = (C(w_1 … w_n) + 0.5) / (N + 0.5B) • Worked example for three cases A, B and C.

  22. Held out estimation (Jelinek and Mercer, 1985) • For each frequency r, let T_r = Σ_{w_1 … w_n : C_1(w_1 … w_n) = r} C_2(w_1 … w_n), the total number of times that the n-grams seen r times in the training data appear in the held-out data. • Then P_ho(w_1 … w_n) = T_r / (N_r · N), where r = C_1(w_1 … w_n) and N is the number of n-gram tokens in the held-out text.

  23. Held out estimation (Jelinek and Mercer, 1985) • Worked example (unigrams) : the full text is split into training data (70 unseen word types) and held-out data (51 unseen word types), and for each frequency r the counts N_r from the training data and the totals T_r from the held-out data are tabulated to give P_ho.

  24. Held out estimation (Jelinek and Mercer, 1985) • Look at how many times the bigrams that occurred r times in the training text occur in additionally sampled text (further text). • Held out estimation : a method for predicting how often a bigram that occurred r times in the training text will occur in further text. • Test data (independent of the training data) is only 5-10% of the total data, but that is enough to be reliable. • We want to divide the data into training data and held-out (test) data : data used to fit the model and data used to validate it. • Held out data (10%) • The held-out data is used to compute the held-out estimates of the n-grams.
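
A minimal sketch of held-out estimation for bigrams, following P_ho = T_r / (N_r · N) from the slide above. The corpus strings and function names are invented; for r = 0 the sketch only counts unseen bigrams that actually occur in the held-out data rather than all unseen bins.

```python
# Sketch: held-out estimation for bigrams.
# T_r = total held-out count of all bigrams seen r times in training;
# P_ho = T_r / (N_r * N), with N the number of bigram tokens in the held-out data.
from collections import Counter

def bigram_counts(tokens):
    return Counter(zip(tokens, tokens[1:]))

def held_out_probs(train_tokens, heldout_tokens):
    c_train = bigram_counts(train_tokens)
    c_held = bigram_counts(heldout_tokens)
    n_held = sum(c_held.values())
    N_r, T_r = Counter(), Counter()
    # Note: for r = 0 this only covers unseen bigrams that show up in the
    # held-out data; a full treatment would use all B - |seen| unseen bins.
    for bg in set(c_train) | set(c_held):
        r = c_train[bg]
        N_r[r] += 1
        T_r[r] += c_held[bg]
    return {r: T_r[r] / (N_r[r] * n_held) for r in N_r}

train = "a b a b c a b d a c".split()
held = "a b c a b a d b c a".split()
print(held_out_probs(train, held))
```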

  25. Cross-validation (deleted estimation; Jelinek and Mercer, 1985) • Use data for both training and validation • Divide the training data into 2 parts (A and B) • Model 1 : train on A, validate on B • Model 2 : train on B, validate on A • Combine Model 1 and Model 2 into the final model

  26. Cross-validation (deleted estimation; Jelinek and Mercer, 1985) • Cross validation : training data is used both as • initial training data • held out data • On large training corpora, deleted estimation works better than held-out estimation

  27. Cross-validation (deleted estimation; Jelinek and Mercer, 1985) • Worked example : the training data (70 unseen word types overall) is split into an A part (101 unseen word types) and a B part (90 unseen word types), and the counts N_r^A, N_r^B together with the cross-counts T_r^{AB}, T_r^{BA} are tabulated.

  28. Cross-validation (deleted estimation; Jelinek and Mercer, 1985) • Worked example (continued) : the same counts with the roles of the A and B parts swapped, so that the two directions (train on A / validate on B and train on B / validate on A) can be combined.

  29. Cross-validation (deleted estimation; Jelinek and Mercer, 1985) • With the held-out idea, we get the same effect by splitting the training data into two parts; this method is called cross-validation. • A more effective method : combining the two directions reduces the discrepancy between N_r^0 and N_r^1 : P_del(w_1 … w_n) = (T_r^{01} + T_r^{10}) / (N (N_r^0 + N_r^1)), where r = C(w_1 … w_n). • On a large training corpus, deleted estimation is more reliable than held-out estimation.
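
A minimal sketch of deleted estimation following the combined formula above. The two toy text halves and helper names are invented; as in the held-out sketch, the r = 0 class only covers unseen bigrams that happen to occur in the other half.

```python
# Sketch: deleted estimation (two-way cross-validation) for bigrams:
# P_del = (T_r^{01} + T_r^{10}) / (N * (N_r^0 + N_r^1)), r = count in one half.
from collections import Counter

def bigram_counts(tokens):
    return Counter(zip(tokens, tokens[1:]))

def deleted_estimates(part_a, part_b):
    c_a, c_b = bigram_counts(part_a), bigram_counts(part_b)
    n_total = sum(c_a.values()) + sum(c_b.values())
    N_r0, N_r1, T_r01, T_r10 = Counter(), Counter(), Counter(), Counter()
    for bg in set(c_a) | set(c_b):
        N_r0[c_a[bg]] += 1          # bins with count r in part A ...
        T_r01[c_a[bg]] += c_b[bg]   # ... and their total count in part B
        N_r1[c_b[bg]] += 1          # same thing in the other direction
        T_r10[c_b[bg]] += c_a[bg]
    return {r: (T_r01[r] + T_r10[r]) / (n_total * (N_r0[r] + N_r1[r]))
            for r in set(N_r0) | set(N_r1)}

a = "a b a b c a b d a c".split()
b = "a b c a b a d b c a".split()
print(deleted_estimates(a, b))
```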

  30. Good-Turing estimation (Good, 1953) : [binomial distribution] • Idea : re-estimate the probability mass assigned to n-grams with zero counts • Adjust actual counts r to expected counts r* with the formula r* = (r + 1) E[N_{r+1}] / E[N_r], and take P_GT(w_1 … w_n) = r* / N • (r* is an adjusted frequency; E denotes the expectation of a random variable)

  31. Good-Turing estimation (Good, 1953) : [binomial distribution] • In practice the observed N_r are used in place of the expectations : r* = (r + 1) N_{r+1} / N_r • When r is small, N_{r+1} < N_r and so r* < r : the counts of seen n-grams are discounted, and the freed mass goes to the unseen n-grams. • When r is large, N_r is small and the estimate is unreliable, so the MLE count is usually kept. • In short, what MLE over-estimates is adjusted downward (under-estimated).
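
A minimal sketch of simple Good-Turing adjusted counts using the observed N_r directly (no smoothing of the N_r themselves, and a fall-back to the raw count when N_{r+1} = 0); the toy corpus is invented.

```python
# Sketch: simple Good-Turing adjusted counts r* = (r + 1) * N_{r+1} / N_r.
from collections import Counter

def good_turing_counts(ngram_counts):
    N_r = Counter(ngram_counts.values())   # N_r: number of n-gram types with count r
    adjusted = {}
    for ngram, r in ngram_counts.items():
        if N_r[r + 1] > 0:
            adjusted[ngram] = (r + 1) * N_r[r + 1] / N_r[r]
        else:
            adjusted[ngram] = float(r)     # keep the raw count when N_{r+1} is empty
    return adjusted

tokens = "a b a b c a b d a c a b".split()
counts = Counter(zip(tokens, tokens[1:]))
print(good_turing_counts(counts))
# the total mass reserved for unseen bigrams is N_1 / N
```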

  32. Note • Drawback : over-estimation • [Two discounting models] (Ney and Essen, 1993; Ney et al., 1994) • Absolute discounting : lower each over-estimated non-zero count by a fixed δ : P_abs = (r - δ)/N if r > 0, and (B - N_0)δ / (N_0 N) otherwise. • Linear discounting : use a factor α to scale the non-zero counts : P_lin = (1 - α) r / N if r > 0, and α / N_0 otherwise.
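
A minimal sketch of the two discounting schemes as plain functions; the parameter values δ = 0.5 and α = 0.1 are arbitrary, and the toy numbers reuse the N = 79, B = 121, N_0 = 70 unigram example.

```python
# Sketch: absolute and linear discounting (Ney & Essen style).
def p_absolute(r, N, B, N0, delta=0.5):
    """P_abs = (r - delta)/N for seen n-grams, else (B - N0)*delta / (N0 * N)."""
    return (r - delta) / N if r > 0 else (B - N0) * delta / (N0 * N)

def p_linear(r, N, N0, alpha=0.1):
    """P_lin = (1 - alpha)*r/N for seen n-grams, else alpha / N0."""
    return (1 - alpha) * r / N if r > 0 else alpha / N0

print(p_absolute(7, 79, 121, 70), p_absolute(0, 79, 121, 70))
print(p_linear(7, 79, 70), p_linear(0, 79, 70))
```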

  33. Note • Drawback : over-estimation • [Natural Law of Succession] (Ristad, 1995)

  34. Combining Estimators • Basic Idea • Consider how to combine multiple probability estimates from various different models • How can you develop a model that uses different-length n-grams as appropriate? • Simple linear interpolation (of trigram, bigram and unigram) : P_li(w_n | w_{n-2} w_{n-1}) = λ_1 P_1(w_n) + λ_2 P_2(w_n | w_{n-1}) + λ_3 P_3(w_n | w_{n-2} w_{n-1}), where 0 ≤ λ_i ≤ 1 and Σ_i λ_i = 1.
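
A minimal sketch of simple linear interpolation over unigram, bigram, and trigram MLEs. The toy corpus and the fixed λ weights are invented; in practice the weights would be tuned on held-out data (e.g. with EM).

```python
# Sketch: simple linear interpolation of unigram, bigram, and trigram MLEs.
from collections import Counter

tokens = "the bigram model is the model we train on the bigram data".split()
N = len(tokens)
uni = Counter(tokens)
bi = Counter(zip(tokens, tokens[1:]))
tri = Counter(zip(tokens, tokens[1:], tokens[2:]))

def p_interp(w, h1, h2, lambdas=(0.2, 0.3, 0.5)):
    """P_li(w | h2 h1) = l1*P1(w) + l2*P2(w | h1) + l3*P3(w | h2 h1)."""
    l1, l2, l3 = lambdas
    p1 = uni[w] / N
    p2 = bi[(h1, w)] / uni[h1] if uni[h1] else 0.0
    p3 = tri[(h2, h1, w)] / bi[(h2, h1)] if bi[(h2, h1)] else 0.0
    return l1 * p1 + l2 * p2 + l3 * p3

print(p_interp("model", "bigram", "the"))   # P(model | the bigram)
```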

  35. Combining Estimators • [Katz’s backing-off] (Katz, 1987) • Example

  36. Combining Estimators • [Katz’s backing-off] (Katz, 1987) • If a sequence is unseen, use a shorter sequence • Ex. If P(San | going to) = 0, use P(San | to) • P_bo(w_i | w_{i-1}) = τ(w_i | w_{i-1}) if c(w_{i-1} w_i) > 0 (the discounted higher-order probability), and P_bo(w_i | w_{i-1}) = λ(w_{i-1}) · P_bo(w_i) if c(w_{i-1} w_i) = 0 (a weight times the lower-order probability).
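
A minimal sketch of back-off in the spirit of Katz. For simplicity it frees probability mass with a fixed absolute discount rather than Katz's Good-Turing discount, and the weight alpha(h) is computed so the conditional distribution still sums to one; the corpus and helper names are invented.

```python
# Sketch: Katz-style back-off for bigrams with a fixed absolute discount D.
from collections import Counter

tokens = "if you're going to san francisco be sure to wear some flowers".split()
N = len(tokens)
uni = Counter(tokens)
bi = Counter(zip(tokens, tokens[1:]))
D = 0.5

def p_uni(w):
    return uni[w] / N

def alpha(h):
    """Weight on the lower-order model: leftover discounted mass, renormalized."""
    seen = [w for (a, w) in bi if a == h]
    leftover = D * len(seen) / uni[h]
    return leftover / (1.0 - sum(p_uni(w) for w in seen))

def p_backoff(w, h):
    if bi[(h, w)] > 0:
        return (bi[(h, w)] - D) / uni[h]   # tau: discounted bigram estimate
    return alpha(h) * p_uni(w)             # back off to the weighted unigram

print(p_backoff("san", "to"), p_backoff("francisco", "to"))
```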

  37. Combining Estimators • [General linear interpolation] : P_li(w | h) = Σ_i λ_i(h) P_i(w | h), where 0 ≤ λ_i(h) ≤ 1 and Σ_i λ_i(h) = 1 (the weights may depend on the history h).

  38. Combining Estimators • Interpolated smoothing : P(w_i | w_{i-1}) = τ(w_i | w_{i-1}) + λ(w_{i-1}) · P(w_i) • i.e., the (discounted) higher-order probability plus a weight times the lower-order probability, used even when the higher-order count is non-zero. • Seems to work better than back-off smoothing.

  39. Note • Witten-Bell smoothing : P_WB(w_i | w_{i-1}) = λ_{w_{i-1}} · P_MLE(w_i | w_{i-1}) + (1 - λ_{w_{i-1}}) · P_WB(w_i) • where 1 - λ_{w_{i-1}} = N_{1+}(w_{i-1} •) / (N_{1+}(w_{i-1} •) + c(w_{i-1})) and N_{1+}(w_{i-1} •) = |{w : c(w_{i-1} w) > 0}|, the number of distinct word types that follow w_{i-1}.

  40. Note • Absolute discounting • Like Jelinek-Mercer, it interpolates higher- and lower-order models • But instead of multiplying the higher-order distribution by a λ, we subtract a fixed discount D ∈ [0, 1] from each nonzero count : • P_abs(w_i | w_{i-1}) = max(c(w_{i-1} w_i) - D, 0) / c(w_{i-1}) + (1 - λ_{w_{i-1}}) · P_abs(w_i) • To make it sum to 1 : (1 - λ_{w_{i-1}}) = (D / c(w_{i-1})) · N_{1+}(w_{i-1} •) • Choose D using held-out estimation.

  41. Note • KN smoothing (Kneser-Ney, 1995) • An extension of absolute discounting with a clever way of constructing the lower-order (backoff) model • Idea : the lower-order model is significant only when the count is small or zero in the higher-order model, and so should be optimized for that purpose : • P_KN(w_i | w_{i-1}) = max(c(w_{i-1} w_i) - D, 0) / c(w_{i-1}) + (D / c(w_{i-1})) · N_{1+}(w_{i-1} •) · P_KN(w_i)
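
A minimal sketch of interpolated Kneser-Ney for bigrams, with the lower-order model taken to be the continuation probability P_cont(w) = N_{1+}(• w) / N_{1+}(• •) derived on the following slides. The toy corpus and the discount D = 0.75 are invented.

```python
# Sketch: interpolated Kneser-Ney for bigrams with continuation probabilities.
from collections import Counter

tokens = "san francisco is foggy but new york and new jersey are not foggy".split()
bi = Counter(zip(tokens, tokens[1:]))
D = 0.75

continuations = Counter(w for (_, w) in bi)    # N1+(. w): distinct left contexts of w
followers = Counter(h for (h, _) in bi)        # N1+(h .): distinct words following h
h_totals = Counter()
for (h, _), c in bi.items():
    h_totals[h] += c                           # c(h): bigram tokens starting with h
total_bigram_types = len(bi)                   # N1+(. .)

def p_cont(w):
    return continuations[w] / total_bigram_types

def p_kn(w, h):
    """P_KN(w|h) = max(c(h w)-D, 0)/c(h) + (D/c(h)) * N1+(h .) * P_cont(w)."""
    c_h = h_totals[h]
    lam = (D / c_h) * followers[h]
    return max(bi[(h, w)] - D, 0) / c_h + lam * p_cont(w)

# "francisco" occurs only after "san", so its continuation probability stays low:
print(p_kn("york", "new"), p_kn("francisco", "new"))
```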

  42. Note • “An empirical study of smoothing techniques for language modeling” (Chen and Goodman, 1999) • For a bigram model, we would like to select a smoothed distribution P_KN that satisfies the following constraint on unigram marginals for all w_i : • (1) Σ_{w_{i-1}} P_KN(w_i | w_{i-1}) P(w_{i-1}) = c(w_i) / Σ_w c(w)  (the constraint) • (2) From (1), taking P(w_{i-1}) to be the empirical distribution c(w_{i-1}) / Σ_w c(w) : c(w_i) = Σ_{w_{i-1}} c(w_{i-1}) P_KN(w_i | w_{i-1}) • (3) From (2), substitute the interpolated absolute-discounting form of P_KN(w_i | w_{i-1}) from the previous slide and expand (next slide).

  43. Note • c(w_i) = Σ_{w_{i-1}} c(w_{i-1}) [ max(c(w_{i-1} w_i) - D, 0) / c(w_{i-1}) + (D / c(w_{i-1})) N_{1+}(w_{i-1} •) P_KN(w_i) ] = Σ_{w_{i-1}} max(c(w_{i-1} w_i) - D, 0) + D · P_KN(w_i) Σ_{w_{i-1}} N_{1+}(w_{i-1} •) = c(w_i) - D · N_{1+}(• w_i) + D · P_KN(w_i) · N_{1+}(• •)

  44. Note • where N_{1+}(• w_i) = |{w_{i-1} : c(w_{i-1} w_i) > 0}| • and N_{1+}(• •) = Σ_{w_i} N_{1+}(• w_i) = |{(w_{i-1}, w_i) : c(w_{i-1} w_i) > 0}| • Solving for the lower-order model : P_KN(w_i) = N_{1+}(• w_i) / N_{1+}(• •)

  45. Note • Generalizing to higher-order models, we have that • P_KN(w_i | w_{i-n+2} … w_{i-1}) = N_{1+}(• w_{i-n+2} … w_i) / N_{1+}(• w_{i-n+2} … w_{i-1} •) • where N_{1+}(• w_{i-n+2} … w_i) = |{w_{i-n+1} : c(w_{i-n+1} … w_i) > 0}| • and N_{1+}(• w_{i-n+2} … w_{i-1} •) = Σ_{w_i} N_{1+}(• w_{i-n+2} … w_i) = |{(w_{i-n+1}, w_i) : c(w_{i-n+1} … w_i) > 0}|
