# Statistical Methods

## Presentation Transcript

1. Statistical Methods • Allen's Chapter 7 • J&M's Chapters 8 and 12

2. Statistical Methods • Large data sets (corpora) of natural language make statistical methods possible that were not feasible before • The Brown Corpus includes about 1,000,000 words annotated with POS tags • The Penn Treebank contains full syntactic annotations

3. Basic Probability Theory • A random variable ranges over a predefined set of values, e.g. TOSS = {h, t} • If E is a random variable with possible values {e1 … en} then • P(ei) ≥ 0, for all i • P(ei) ≤ 1, for all i • Σi=1..n P(ei) = 1

4. An Example • R is a random variable with values {Win, Lose} • Harry's horse ran 100 races and won 20 • P(Win) = 0.2 and P(Lose) = 0.8 • In 30 races it was raining, and Harry won 15 of those races; so in rain P(Win) = 0.5 • This is captured by conditional probability • P(A | B) = P(A & B) / P(B) • P(Win | Rain) = 0.15 / 0.3 = 0.5

5. Bayes Rule • P(A | B) = P(A) * P(B | A) / P(B) • P(Rain | Win) = P(Rain) * P(Win | Rain) / P(Win) = 0.3 * 0.5 / 0.2 = 0.75 • The same result follows from the conditional probability: P(Rain | Win) = P(Rain & Win) / P(Win) = 0.15 / 0.2 = 0.75
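The arithmetic on this slide and the previous one can be checked directly; a minimal sketch in Python, using only the numbers quoted above:

```python
# Slide values: P(Rain) = 0.3, P(Win) = 0.2, P(Win | Rain) = 0.5,
# and the joint P(Rain & Win) = 0.15.
p_rain, p_win = 0.3, 0.2
p_win_given_rain = 0.5
p_rain_and_win = 0.15

# Bayes rule: P(Rain | Win) = P(Rain) * P(Win | Rain) / P(Win)
via_bayes = p_rain * p_win_given_rain / p_win

# Directly from the joint: P(Rain | Win) = P(Rain & Win) / P(Win)
via_joint = p_rain_and_win / p_win

# Both routes give 0.75, as the slide states
assert abs(via_bayes - 0.75) < 1e-9
assert abs(via_joint - 0.75) < 1e-9
```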

6. Independent Events • A and B are independent if P(A | B) = P(A), i.e. P(A & B) = P(A) * P(B) • Assume L is a random variable with values {F, E}, and P(F) = 0.6 • P(Win & F) = 0.12, so P(Win | F) = 0.12 / 0.6 = 0.2 = P(Win) • So Win and F are independent • But Win and Rain are not independent: P(Win & Rain) = 0.15 ≠ P(Win) * P(Rain) = 0.06

7. Part of Speech Tagging • Determining the most likely category of each word in a sentence with ambiguous words • Example: finding the POS of words that can be either nouns or verbs • We need two random variables: • C, which ranges over POS tags {N, V} • W, which ranges over all possible words

8. Part of Speech Tagging (Cont.) • Example: W = flies • Problem: which is greater, P(C=N | W=flies) or P(C=V | W=flies)? • P(N | flies) = P(N & flies) / P(flies) • P(V | flies) = P(V & flies) / P(flies) • Since both share the divisor P(flies), it suffices to compare P(N & flies) with P(V & flies)

9. Part of Speech Tagging (Cont.) • We don't have the true probabilities • We can estimate them from large data sets • Suppose: • There is a corpus with 1,273,000 words • There are 1,000 uses of flies: 400 in a noun sense and 600 in a verb sense • P(flies) = 1000 / 1273000 ≈ 0.0008 • P(flies & N) = 400 / 1273000 ≈ 0.0003 • P(flies & V) = 600 / 1273000 ≈ 0.0005 • P(V | flies) = P(V & flies) / P(flies) = 600 / 1000 = 0.6 (dividing the rounded values 0.0005 / 0.0008 gives 0.625, an artifact of rounding) • So on 60% of occasions flies is a verb
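This estimation can be sketched in a few lines of Python using the slide's counts; note that the corpus size cancels out of the conditional probability, leaving the relative frequency 600/1000:

```python
# Corpus counts from the slide (illustrative values).
corpus_size = 1_273_000
flies_total, flies_noun, flies_verb = 1000, 400, 600

p_flies = flies_total / corpus_size        # ~0.0008
p_flies_and_v = flies_verb / corpus_size   # ~0.0005

# P(V | flies) = P(V & flies) / P(flies); the 1/corpus_size factor cancels,
# so this equals 600 / 1000 = 0.6.
p_v_given_flies = p_flies_and_v / p_flies
assert abs(p_v_given_flies - 0.6) < 1e-9
```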

10. Estimating Probabilities • We want to use probabilities to predict future events • Using the information P(V | flies) = 0.6 to predict that the next occurrence of “flies” is more likely to be a verb • This is called Maximum Likelihood Estimation (MLE) • Generally, the larger the data set, the more accurate the estimates

11. Estimating Probabilities (Cont.) • Estimating the outcome probability of tossing a fair coin (i.e., 0.5) • Acceptable margin of error: (0.25 to 0.75) • The more trials performed, the more accurate the estimate • 2 trials: 50% chance of reaching an acceptable result • 3 trials: 75% chance • 4 trials: 87.5% chance • 8 trials: 93% chance • 12 trials: about 96% chance …
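These percentages follow from the binomial distribution: after n tosses the MLE estimate is k/n, and the chance it lands in [0.25, 0.75] is the probability that k falls in that band. A short sketch that reproduces the slide's figures exactly (the 12-trial case comes out to about 96%):

```python
from math import comb

def chance_within_margin(n, lo=0.25, hi=0.75, p=0.5):
    """Probability that the MLE estimate k/n of a fair coin's head
    probability falls inside [lo, hi] after n tosses."""
    # sum binomial probabilities C(n, k) * p^n over acceptable k
    return sum(comb(n, k) for k in range(n + 1) if lo <= k / n <= hi) * p ** n

for n in (2, 3, 4, 8, 12):
    print(n, chance_within_margin(n))
```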

12. Estimating tossing a coin outcome

13. Estimating Probabilities (Cont.) • So the larger the data set the better, but • there is the problem of sparse data • The Brown Corpus contains about a million words • but only about 49,000 distinct words, • so one would expect each word to occur about 20 times, • but over 40,000 words occur fewer than 5 times.

14. Estimating Probabilities (Cont.) • For a random variable X with values xi, probabilities are computed from counts Vi of the number of times X = xi • P(X = xi) ≈ Vi / Σi Vi • Maximum Likelihood Estimation (MLE) uses Vi = |xi| • Expected Likelihood Estimation (ELE) uses Vi = |xi| + 0.5

15. MLE vs ELE • Suppose a word w doesn't occur in the corpus • We want to estimate the probability of w occurring in one of 40 classes L1 … L40 • We have a random variable X, where X = xi only if w appears in word category Li • By MLE, P(Li | w) is undefined because the divisor is zero • By ELE, P(Li | w) ≈ 0.5 / 20 = 0.025 • Now suppose w occurs 5 times (4 times as a noun and once as a verb) • By MLE, P(N | w) = 4/5 = 0.8 • By ELE, P(N | w) = 4.5/25 = 0.18
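The contrast between the two estimators can be sketched directly from these definitions; the function names here are illustrative, not from the source:

```python
# MLE vs ELE over 40 lexical categories, with the slide's counts:
# an unseen word, and a word seen 5 times (4 as a noun, 1 as a verb).
N_CLASSES = 40

def mle(count_i, counts):
    total = sum(counts)
    # MLE is undefined when the word never occurs (division by zero)
    return None if total == 0 else count_i / total

def ele(count_i, counts, n_classes=N_CLASSES):
    # add 0.5 to every class count before normalizing
    return (count_i + 0.5) / (sum(counts) + 0.5 * n_classes)

unseen = [0] * N_CLASSES
print(mle(0, unseen))      # None: MLE undefined for an unseen word
print(ele(0, unseen))      # 0.5 / 20 = 0.025

seen = [4, 1] + [0] * 38   # 4 noun uses, 1 verb use
print(mle(seen[0], seen))  # 4 / 5 = 0.8
print(ele(seen[0], seen))  # 4.5 / 25 = 0.18
```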

16. Evaluation • The data set is divided into: • a training set (80-90% of the data) • a test set (10-20%) • Cross-validation: • repeatedly remove different parts of the corpus as the test set, • train on the remainder of the corpus, • then evaluate on the new test set.

17. Noisy Channel • The real language X has prior probability p(X); the noisy channel maps X → Y with probability p(Y | X), producing the noisy language Y • We want to recover x ∈ X from y ∈ Y • Choose the x that maximizes p(x | y), which is proportional to p(x) * p(y | x)

18. Part of speech tagging • Simplest algorithm: choose the interpretation that occurs most frequently • “flies” in the sample corpus was a verb 60% of the time • This algorithm's success rate is about 90% • Over 50% of the words appearing in most corpora are unambiguous • To improve the success rate, use the tags before or after the word under examination • If “flies” is preceded by the word “the”, it is almost certainly a noun

19. Part of speech tagging (Cont.) • General form of the POS problem: • There is a sequence of words w1 … wt, and • we want to find a sequence of lexical categories C1 … Ct such that • P(C1 … Ct | w1 … wt) is maximized • By the Bayes rule this equals • P(C1 … Ct) * P(w1 … wt | C1 … Ct) / P(w1 … wt) • Since the divisor does not depend on the categories, the problem reduces to finding C1 … Ct such that • P(C1 … Ct) * P(w1 … wt | C1 … Ct) is maximized • But no effective method exists for calculating the probability of such long sequences accurately, as it would require too much data • The probabilities can be estimated by making some independence assumptions

20. Part of speech tagging (Cont.) • Use information about • the previous word category: bigram • or the two previous word categories: trigram • or the n-1 previous word categories: n-gram • Using the bigram model • P(C1 … Ct) ≈ Πi=1..t P(Ci | Ci-1) • P(Art N V N) = P(Art | ∅) * P(N | Art) * P(V | N) * P(N | V) • P(w1 … wt | C1 … Ct) ≈ Πi=1..t P(wi | Ci) • Therefore we are looking for a sequence C1 … Ct such that Πi=1..t P(Ci | Ci-1) * P(wi | Ci) is maximized
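The scoring formula above can be sketched in a few lines. The probability tables below use values quoted on later slides (from Figs. 7-4 and 7-6) but are only small fragments, not the full published tables; 0.0001 is the minimum probability the slides later assume for unseen events:

```python
MIN_P = 0.0001  # minimum probability assumed for unseen transitions/words

# Fragments of transition (bigram) and lexical-generation tables; None marks
# the start-of-sentence context.
bigram = {(None, 'N'): 0.29, ('N', 'V'): 0.43, ('V', 'ART'): 0.65, ('ART', 'N'): 1.0}
lexgen = {('flies', 'N'): 0.025, ('like', 'V'): 0.1, ('a', 'ART'): 0.36,
          ('flower', 'N'): 0.063}

def sequence_score(words, tags):
    """prod_i P(Ci | Ci-1) * P(wi | Ci) for one candidate tag sequence."""
    score, prev = 1.0, None
    for w, c in zip(words, tags):
        score *= bigram.get((prev, c), MIN_P) * lexgen.get((w, c), MIN_P)
        prev = c
    return score

s = sequence_score(['flies', 'like', 'a', 'flower'], ['N', 'V', 'ART', 'N'])
```

With these fragments the score of N V ART N for “flies like a flower” comes out to roughly 4.6 * 10^-6, matching the HMM example discussed on a later slide.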

21. Part of speech tagging (Cont.) • The information needed by this formula can be extracted from the corpus • P(Ci = V | Ci-1 = N) = Count(N at position i-1 & V at position i) / Count(N at position i-1) (Fig. 7-4) • P(the | Art) = Count(# times the is an Art) / Count(# times an Art occurs) (Fig. 7-6)

22. Using an Artificial Corpus • An artificial corpus of 300 sentences was generated with the categories Art, N, V, P • 1,998 words: 833 nouns, 300 verbs, 558 articles, and 307 prepositions • To deal with the problem of sparse data, a minimum probability of 0.0001 is assumed

23. Bigram probabilities from the generated corpus

24. Word counts in the generated corpus

25. Lexical-generation probabilities (Fig. 7-6)

26. Part of speech tagging (Cont.) • How to find the sequence C1 … Ct that maximizes Πi=1..t P(Ci | Ci-1) * P(wi | Ci)? • Brute-force search: enumerate all possible sequences • With N categories and T words, there are N^T sequences • Using the independence assumption and bigram probabilities, the probability of wi being in category Ci depends only on Ci-1 • The process can be modeled by a special form of probabilistic finite state machine (Fig. 7-7)

27. Markov Chain • Probability of a sequence of 4 words having the categories ART N V N: 0.71 * 1 * 0.43 * 0.35 = 0.107 • The representation is accurate only if the probability of a category occurring depends only on the category immediately before it • This is called the Markov assumption • Such a network is called a Markov chain

28. Hidden Markov Model (HMM) • The Markov network can be extended to include the lexical-generation probabilities, too • Each node has an output probability for every possible output • e.g., node N is associated with a probability table indicating, for each word, how likely that word is to be selected if we randomly select a noun • The output probabilities are exactly the lexical-generation probabilities shown in Fig. 7-6 • A Markov network with output probabilities is called a Hidden Markov Model (HMM)

29. Hidden Markov Model (HMM) • The word hidden indicates that, for a specific sequence of words, it is not clear what state the Markov model is in • For instance, the word “flies” could be generated from state N with a probability of 0.025, or from state V with a probability of 0.076 • So it is no longer trivial to compute the probability of a sequence of words from the network • But, given a particular sequence of tags, the probability that it generates a particular output is easily computed

30. Hidden Markov Model (HMM) • The probability that the sequence N V ART N generates the output “Flies like a flower” is: • The probability of the path N V ART N: 0.29 * 0.43 * 0.65 * 1 = 0.081 • The probability of the output being “Flies like a flower”: P(flies | N) * P(like | V) * P(a | ART) * P(flower | N) = 0.025 * 0.1 * 0.36 * 0.063 ≈ 5.67 * 10^-5 • The likelihood that the HMM would generate the sentence is 5.67 * 10^-5 * 0.081 ≈ 4.6 * 10^-6 • In general, the probability of a sentence w1 … wt, given a sequence C1 … Ct, is Πi=1..t P(Ci | Ci-1) * P(wi | Ci)

31. Markov Chain

32. Viterbi Algorithm

33. Flies like a flower • SEQSCORE(i, 1) = P(flies | Li) * P(Li | ∅) • P(flies/V) = 0.076 * 0.0001 = 7.6 * 10^-6 • P(flies/N) = 0.025 * 0.29 = 0.00725 • P(like/V) = max( P(flies/N) * P(V | N), P(flies/V) * P(V | V) ) * P(like | V) = max(0.00725 * 0.43, 7.6 * 10^-6 * 0.0001) * 0.1 ≈ 0.00031

34. Flies like a flower

35. Flies like a flower • Brute-force search takes on the order of N^T steps • The Viterbi algorithm takes on the order of K * T * N^2 steps (for a constant K)
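The K * T * N^2 bound is visible in the structure of the algorithm: for each of the T words, every pair of categories is examined once. A minimal Viterbi sketch for bigram HMM tagging, using the table fragments quoted on the slides (0.0001 is the assumed minimum probability for unseen events):

```python
MIN_P = 0.0001  # minimum probability for unseen transitions/words

def viterbi(words, tags, bigram, lexgen):
    # score[c]: probability of the best tag sequence for the words so far
    # ending in category c; None marks the start-of-sentence context.
    score = {c: bigram.get((None, c), MIN_P) * lexgen.get((words[0], c), MIN_P)
             for c in tags}
    backptrs = []
    for w in words[1:]:
        new_score, ptrs = {}, {}
        for c in tags:
            # choose the best previous category for c (the max step)
            best_prev = max(tags, key=lambda p: score[p] * bigram.get((p, c), MIN_P))
            new_score[c] = (score[best_prev] * bigram.get((best_prev, c), MIN_P)
                            * lexgen.get((w, c), MIN_P))
            ptrs[c] = best_prev
        backptrs.append(ptrs)
        score = new_score
    # follow the back pointers from the best final category
    best = max(tags, key=lambda c: score[c])
    path = [best]
    for ptrs in reversed(backptrs):
        path.append(ptrs[path[-1]])
    return list(reversed(path)), score[best]

tags_out, prob = viterbi(
    ['flies', 'like'], ['N', 'V'],
    {(None, 'N'): 0.29, ('N', 'V'): 0.43},
    {('flies', 'N'): 0.025, ('flies', 'V'): 0.076, ('like', 'V'): 0.1})
```

With these fragments the best path for “flies like” is N V with score 0.00031175, matching the SEQSCORE values worked out on slide 33.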

36. Getting Reliable Statistics (Smoothing) • Suppose we have 40 categories • To collect unigrams, at least 40 samples, one for each category, are needed • For bigrams, 1,600 samples are needed • For trigrams, 64,000 samples are needed • For 4-grams, 2,560,000 samples are needed • P(Ci | C1 … Ci-1) ≈ λ1 P(Ci) + λ2 P(Ci | Ci-1) + λ3 P(Ci | Ci-2 Ci-1) • λ1 + λ2 + λ3 = 1
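Linear interpolation can be sketched directly from the formula above. The lambda weights and the three component probabilities below are made-up illustrative values, not from the source:

```python
def interpolate(p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    """Mix unigram, bigram, and trigram estimates with weights summing to 1."""
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9  # lambda1 + lambda2 + lambda3 = 1
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

# e.g. smoothing one transition probability from three estimators
# (illustrative numbers): 0.1*0.15 + 0.3*0.43 + 0.6*0.50 = 0.444
p = interpolate(p_uni=0.15, p_bi=0.43, p_tri=0.50)
```

The sparser the context, the less data backs the higher-order estimate, which is why the trigram term is interpolated with the more reliable bigram and unigram terms rather than used alone.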

37. Statistical Parsing • Corpus-based methods offer new ways to control parsers • We can use statistical methods to identify the common structures of a language • We can choose the most likely interpretation when a sentence is ambiguous • This might lead to much more efficient parsers that are almost deterministic

38. Statistical Parsing • What is the input of a statistical parser? • The input is the output of a POS tagging algorithm • If the POS tags are accurate, lexical ambiguity is removed • But if the tagging is wrong, the parser cannot find the correct interpretation, or may find a valid but implausible interpretation • With 95% per-word accuracy, the chance of correctly tagging a whole sentence of 8 words is 0.95^8 ≈ 0.66, and of 15 words is 0.95^15 ≈ 0.46

39. Obtaining Lexical Probabilities • A better approach is: • computing the probability that each word appears in each possible lexical category, and • combining these probabilities with some method of assigning probabilities to rule use in the grammar • The context-independent probability that the lexical category of a word w is Lj can be estimated by: P(Lj | w) = count(Lj & w) / Σi=1..N count(Li & w)

40. Context-independent lexical categories • P(Lj | w) = count(Lj & w) / Σi=1..N count(Li & w) • P(Art | the) = 300 / 303 = 0.99 • P(N | flies) = 21 / 44 = 0.48

41. Context-dependent lexical probabilities • A better estimate can be obtained by computing how likely it is that category Li occurs at position t, over all sequences of the input w1 … wt • Instead of just finding the sequence with the maximum probability, we add up the probabilities of all sequences that end in wt/Li • The probability that flies is a noun in the sentence “The flies like flowers” is calculated by adding the probabilities of all sequences that end with flies as a noun

42. Context-dependent lexical probabilities • Using the probabilities of Figs. 7-4 and 7-6, the sequences with nonzero values are: • P(The/Art flies/N) = P(the | ART) * P(ART | ∅) * P(N | ART) * P(flies | N) = 0.54 * 0.71 * 1.0 * 0.025 = 9.58 * 10^-3 • P(The/N flies/N) = P(the | N) * P(N | ∅) * P(N | N) * P(flies | N) = 1/833 * 0.29 * 0.13 * 0.025 = 1.13 * 10^-6 • P(The/P flies/N) = P(the | P) * P(P | ∅) * P(N | P) * P(flies | N) = 2/307 * 0.0001 * 0.26 * 0.025 = 4.55 * 10^-9 • These add up to 9.58 * 10^-3

43. Context-dependent lexical probabilities • Similarly, there are three nonzero sequences ending with flies as a V, with a total value of 1.13 * 10^-5 • P(The flies) = 9.58 * 10^-3 + 1.13 * 10^-5 = 9.591 * 10^-3 • P(flies/N | The flies) = P(flies/N & The flies) / P(The flies) = 9.58 * 10^-3 / 9.591 * 10^-3 = 0.9988 • P(flies/V | The flies) = P(flies/V & The flies) / P(The flies) = 1.13 * 10^-5 / 9.591 * 10^-3 = 0.0012

44. Forward Probabilities • The probability of producing the words w1 … wt and ending in state wt/Li is called the forward probability αi(t), defined as: • αi(t) = P(wt/Li & w1 … wt) • In “The flies like flowers”, α2(3) is the sum of the values computed for all sequences ending in a V (the second category) at position 3, for the input “the flies like” • P(wt/Li | w1 … wt) = P(wt/Li & w1 … wt) / P(w1 … wt) ≈ αi(t) / Σj=1..N αj(t)
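The forward computation is the same recurrence as Viterbi, except that it sums over predecessor categories where Viterbi takes the maximum. A minimal sketch, again using the small table fragments quoted on the slides rather than the full published figures:

```python
MIN_P = 0.0001  # minimum probability for unseen transitions/words

def forward(words, tags, bigram, lexgen):
    """alphas[t][c] = P(w_t/c & w_1..w_t), built left to right."""
    alphas = [{c: bigram.get((None, c), MIN_P) * lexgen.get((words[0], c), MIN_P)
               for c in tags}]
    for w in words[1:]:
        prev = alphas[-1]
        # sum over all predecessor categories (Viterbi would take max here)
        alphas.append({c: sum(prev[p] * bigram.get((p, c), MIN_P) for p in tags)
                          * lexgen.get((w, c), MIN_P)
                       for c in tags})
    return alphas

alphas = forward(['flies', 'like'], ['N', 'V'],
                 {(None, 'N'): 0.29, ('N', 'V'): 0.43},
                 {('flies', 'N'): 0.025, ('flies', 'V'): 0.076, ('like', 'V'): 0.1})

# Context-dependent lexical probability: alpha_i(t) / sum_j alpha_j(t)
p_like_v = alphas[1]['V'] / sum(alphas[1].values())
```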

45. Forward Probabilities

46. Context dependent lexical Probabilities

47. Context dependent lexical Probabilities

48. Backward Probability • The backward probability, βj(t), is the probability of producing the sequence wt … wT beginning from the state wt/Lj • P(wt/Li | w1 … wT) ≈ (αi(t) * βi(t)) / Σj=1..N (αj(t) * βj(t))

49. Probabilistic Context-free Grammars • CFGs can be generalized to PCFGs • We need some statistics on rule use • The simplest approach is to count the number of times each rule is used in a corpus of parsed sentences • If category C has rules R1 … Rm, then P(Rj | C) = count(# times Rj used) / Σi=1..m count(# times Ri used)
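The counting scheme above can be sketched in a few lines. The rule counts below are made-up illustrative values, not taken from any treebank:

```python
from collections import Counter

# (left-hand side, right-hand side) -> number of uses in a parsed corpus.
# Hypothetical counts for three NP rules.
rule_counts = Counter({
    ('NP', ('ART', 'N')): 300,
    ('NP', ('N',)): 150,
    ('NP', ('NP', 'PP')): 50,
})

def rule_prob(lhs, rhs, counts):
    """P(Rj | C) = count(Rj) / total count of all rules expanding C."""
    total = sum(n for (c, _), n in counts.items() if c == lhs)
    return counts[(lhs, rhs)] / total

p = rule_prob('NP', ('ART', 'N'), rule_counts)  # 300 / 500
```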

50. Probabilistic Context-free Grammars