
CS546: Machine Learning and Natural Language Probabilistic Classification Feb 24, 26 2009

Presentation Transcript


  1. CS546: Machine Learning and Natural Language. Probabilistic Classification. Feb 24 and 26, 2009

  2. Generative Model • In practice: make assumptions on the distribution’s type; the decision policy then depends on those assumptions. • Model the problem of text correction as that of generating correct sentences. • Goal: learn a model of the language; use it to predict. • PARADIGM: • Learn a probability distribution over all sentences. • Use it to estimate which sentence is more likely: Pr(I saw the girl it the park) vs. Pr(I saw the girl in the park) • [In the same paradigm we sometimes learn a conditional probability distribution.]

  3. Before: Error Driven (Discriminative) Learning • Consider a distribution D over the space X×Y • X is the instance space; Y is the set of labels (e.g., ±1) • Given a sample {(x_i, y_i)}, i = 1, ..., m, and a loss function L, find h ∈ H that minimizes Σ_{i=1..m} L(h(x_i), y_i) • L can be: 0-1 loss: L(h(x), y) = 1 if h(x) ≠ y, and 0 otherwise; squared (L2) loss: L(h(x), y) = (h(x) - y)^2; exponential loss: L(h(x), y) = exp{-y h(x)} • Find an algorithm that minimizes the average loss; then we know that things will be okay (as a function of H).
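
The three losses above, written as a small Python sketch (an illustration of the definitions only; the function names and the empirical-risk helper are ours, not from the lecture):

```python
import math

def zero_one_loss(prediction, label):
    # 0-1 loss: 1 if the prediction disagrees with the label, 0 otherwise
    return 0 if prediction == label else 1

def squared_loss(prediction, label):
    # L2 loss: squared difference between prediction and label
    return (prediction - label) ** 2

def exponential_loss(score, label):
    # exponential loss for a real-valued score h(x) and a label y in {-1, +1}
    return math.exp(-label * score)

def empirical_risk(h, sample, loss):
    # average loss of hypothesis h over a sample of (x, y) pairs
    return sum(loss(h(x), y) for x, y in sample) / len(sample)
```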

  4. Bayesian Decision Theory • Goal: find the best hypothesis from some space H of hypotheses, given the observed data D. • Define best to be: most probable hypothesis in H • In order to do that, we need to assume a probability distribution over the class H. • In addition, we need to know something about the relation between the data observed and the hypotheses (E.g., a coin problem.) • As we will see, we will be Bayesian about other things, e.g., the parameters of the model

  5. Basics of Bayesian Learning • P(h) - the prior probability of a hypothesis h (a prior over H). Reflects background knowledge before any data is observed; with no information, use a uniform distribution. • P(D) - the probability that this sample of the data is observed (with no knowledge of the hypothesis). • P(D|h) - the probability of observing the sample D, given that the hypothesis h holds. • P(h|D) - the posterior probability of h: the probability that h holds, given that D has been observed.

  6. Bayes Theorem • P(h|D) = P(D|h) P(h) / P(D) • P(h|D) increases with P(h) and with P(D|h) • P(h|D) decreases with P(D)

  7. Learning Scenario • The learner considers a set of candidate hypotheses H (models), and attempts to find the most probable one h ∈ H, given the observed data. • Such a maximally probable hypothesis is called the maximum a posteriori (MAP) hypothesis; Bayes theorem is used to compute it: h_MAP = argmax_{h ∈ H} P(h|D) = argmax_{h ∈ H} P(D|h) P(h) / P(D) = argmax_{h ∈ H} P(D|h) P(h)

  8. Learning Scenario (2) • We may assume that, a priori, hypotheses are equally probable: P(h_i) = P(h_j) for all h_i, h_j ∈ H. • We then get the Maximum Likelihood hypothesis: h_ML = argmax_{h ∈ H} P(D|h) • Here we just look for the hypothesis that best explains the data.

  9. Bayes Optimal Classifier • How should we use the general formalism? • What should H be? • H can be a collection of functions: given the training data, choose an optimal function; then, given new data, evaluate the selected function on it. • H can be a collection of possible predictions: given the data, try to directly choose the optimal prediction. • H can be a collection of (conditional) probability distributions. • These choices could lead to different predictions! • Specific examples we will discuss: • Naive Bayes: a maximum-likelihood-based algorithm • Max Entropy: seemingly, a different selection criterion • Hidden Markov Models

  10. Bayesian Classifier • f: X → V, V a finite set of values • Instances x ∈ X can be described as a collection of features • Given an example, assign it the most probable value in V

  11. Bayesian Classifier • f: X → V, V a finite set of values • Instances x ∈ X can be described as a collection of features • Given an example, assign it the most probable value in V • Bayes Rule: v_MAP = argmax_{v ∈ V} P(v | x_1, ..., x_n) = argmax_{v ∈ V} P(x_1, ..., x_n | v) P(v) / P(x_1, ..., x_n) • Notational convention: P(y) means P(Y=y)

  12. Bayesian Classifier • Given training data, we can estimate the two terms. • Estimating P(v_j) is easy: for each value v_j, count how many times it appears in the training data. • However, it is not feasible to estimate P(x_1, ..., x_n | v_j): we would have to estimate, for each target value, the probability of each instance (most of which will not occur). • In order to use a Bayesian classifier in practice, we need to make assumptions that allow us to estimate these quantities.

  13. Naive Bayes • Assumption: feature values are independent given the target value: P(x_1, x_2, ..., x_n | v_j) = ∏_k P(x_k | v_j)

  14. Naive Bayes • Assumption: feature values are independent given the target value • Generative model: • First choose a value v_j ∈ V according to P(v_j) • Then choose x_1, x_2, ..., x_n, each x_k according to P(x_k | v_j)

  15. Naive Bayes • Assumption: feature values are independent given the target value • Learning method: estimate n|V| parameters and use them to compute the new value. (How do we estimate them?)

  16. Naive Bayes • Assumption: feature values are independent given the target value • Learning method: estimate n|V| parameters and use them to compute the new value. • This is learning without search: given a collection of training examples, you just compute the best hypothesis (given the assumptions). • This is learning without trying to achieve consistency, or even approximate consistency. • Why does it work?
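
A minimal count-based sketch of this "learning without search" in Python (function and variable names are ours; plain maximum-likelihood estimates, no smoothing):

```python
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (features, label), where features is a dict {name: value}.
    Returns ML estimates of P(v) and of P(x_k = value | v)."""
    label_counts = Counter(label for _, label in examples)
    value_counts = defaultdict(Counter)   # (feature name, label) -> Counter over values
    for features, label in examples:
        for name, value in features.items():
            value_counts[(name, label)][value] += 1

    priors = {v: c / len(examples) for v, c in label_counts.items()}

    def likelihood(name, value, label):
        return value_counts[(name, label)][value] / label_counts[label]

    return priors, likelihood

def predict(priors, likelihood, features):
    # v_NB = argmax_v P(v) * prod_k P(x_k | v)
    def score(v):
        s = priors[v]
        for name, value in features.items():
            s *= likelihood(name, value, v)
        return s
    return max(priors, key=score)
```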

  17. Conditional Independence • Notice that the feature values are conditionally independent given the target value, but are not required to be independent. • Example: f(x,y) = xy over the product distribution defined by p(x=0) = p(x=1) = 1/2 and p(y=0) = p(y=1) = 1/2. • The distribution is defined so that x and y are independent: p(x,y) = p(x)p(y) (for every value of x and y). • But, given that f(x,y) = 0: p(x=1|f=0) = p(y=1|f=0) = 1/3, while p(x=1, y=1 | f=0) = 0, so x and y are not conditionally independent.
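
A quick brute-force check of the numbers above (a throwaway sketch; the enumeration is ours):

```python
from itertools import product

pairs = list(product([0, 1], repeat=2))      # four equally likely (x, y) pairs
f = lambda x, y: x * y                       # f(x, y) = xy

given_f0 = [(x, y) for x, y in pairs if f(x, y) == 0]
p_x1 = sum(1 for x, _ in given_f0 if x == 1) / len(given_f0)
p_y1 = sum(1 for _, y in given_f0 if y == 1) / len(given_f0)
p_x1_y1 = sum(1 for x, y in given_f0 if x == 1 and y == 1) / len(given_f0)

print(p_x1, p_y1, p_x1_y1)   # 1/3, 1/3, 0  --  so p(x,y|f=0) != p(x|f=0) p(y|f=0)
```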

  18. Conditional Independence • The other direction also does not hold. • x and y can be conditionally independent but not independent. • f=0: p(x=1|f=0) =1, p(y=1|f=0) = 0 • f=1: p(x=1|f=1) =0, p(y=1|f=1) = 1 • and assume, say, that p(f=0) = p(f=1)=1/2 • Given the value of f, x and y are independent. • What about unconditional independence ?

  19. Conditional Independence • The other direction also does not hold: x and y can be conditionally independent but not independent. • f=0: p(x=1|f=0) = 1, p(y=1|f=0) = 0 • f=1: p(x=1|f=1) = 0, p(y=1|f=1) = 1 • and assume, say, that p(f=0) = p(f=1) = 1/2. • Given the value of f, x and y are independent. • What about unconditional independence? • p(x=1) = p(x=1|f=0)p(f=0) + p(x=1|f=1)p(f=1) = 0.5 + 0 = 0.5 • p(y=1) = p(y=1|f=0)p(f=0) + p(y=1|f=1)p(f=1) = 0 + 0.5 = 0.5 • But p(x=1, y=1) = p(x=1,y=1|f=0)p(f=0) + p(x=1,y=1|f=1)p(f=1) = 0 ≠ 0.25 = p(x=1)p(y=1), so x and y are not independent.

  20. Example
  Day  Outlook   Temperature  Humidity  Wind    PlayTennis
  1    Sunny     Hot          High      Weak    No
  2    Sunny     Hot          High      Strong  No
  3    Overcast  Hot          High      Weak    Yes
  4    Rain      Mild         High      Weak    Yes
  5    Rain      Cool         Normal    Weak    Yes
  6    Rain      Cool         Normal    Strong  No
  7    Overcast  Cool         Normal    Strong  Yes
  8    Sunny     Mild         High      Weak    No
  9    Sunny     Cool         Normal    Weak    Yes
  10   Rain      Mild         Normal    Weak    Yes
  11   Sunny     Mild         Normal    Strong  Yes
  12   Overcast  Mild         High      Strong  Yes
  13   Overcast  Hot          Normal    Weak    Yes
  14   Rain      Mild         High      Strong  No
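
For concreteness, the table can be loaded into Python and the estimates used on the next slides read off by counting (a throwaway sketch; the data layout and variable names are ours):

```python
from collections import Counter

rows = [  # (Outlook, Temperature, Humidity, Wind, PlayTennis)
    ('Sunny', 'Hot', 'High', 'Weak', 'No'),          ('Sunny', 'Hot', 'High', 'Strong', 'No'),
    ('Overcast', 'Hot', 'High', 'Weak', 'Yes'),      ('Rain', 'Mild', 'High', 'Weak', 'Yes'),
    ('Rain', 'Cool', 'Normal', 'Weak', 'Yes'),       ('Rain', 'Cool', 'Normal', 'Strong', 'No'),
    ('Overcast', 'Cool', 'Normal', 'Strong', 'Yes'), ('Sunny', 'Mild', 'High', 'Weak', 'No'),
    ('Sunny', 'Cool', 'Normal', 'Weak', 'Yes'),      ('Rain', 'Mild', 'Normal', 'Weak', 'Yes'),
    ('Sunny', 'Mild', 'Normal', 'Strong', 'Yes'),    ('Overcast', 'Mild', 'High', 'Strong', 'Yes'),
    ('Overcast', 'Hot', 'Normal', 'Weak', 'Yes'),    ('Rain', 'Mild', 'High', 'Strong', 'No'),
]

labels = Counter(r[-1] for r in rows)
print(labels['Yes'], labels['No'])        # 9, 5  ->  P(yes) = 9/14, P(no) = 5/14

# e.g. P(Outlook = sunny | PlayTennis = yes)
sunny_yes = sum(1 for r in rows if r[0] == 'Sunny' and r[-1] == 'Yes')
print(sunny_yes, '/', labels['Yes'])      # 2 / 9
```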

  21. Estimating Probabilities • How do we estimate P(observation | v) ?

  22. Example • Compute P(PlayTennis = yes) and P(PlayTennis = no) • Compute P(Outlook = sunny/overcast/rain | PlayTennis = yes/no) (6 numbers) • Compute P(Temperature = hot/mild/cool | PlayTennis = yes/no) (6 numbers) • Compute P(Humidity = high/normal | PlayTennis = yes/no) (4 numbers) • Compute P(Wind = weak/strong | PlayTennis = yes/no) (4 numbers)

  23. Example • Compute P(PlayTennis = yes) and P(PlayTennis = no) • Compute P(Outlook = sunny/overcast/rain | PlayTennis = yes/no) (6 numbers) • Compute P(Temperature = hot/mild/cool | PlayTennis = yes/no) (6 numbers) • Compute P(Humidity = high/normal | PlayTennis = yes/no) (4 numbers) • Compute P(Wind = weak/strong | PlayTennis = yes/no) (4 numbers) • Given a new instance: (Outlook = sunny, Temperature = cool, Humidity = high, Wind = strong) • Predict: PlayTennis = ?

  24. Example • Given: (Outlook=sunny; Temperature=cool; Humidity=high; Wind=strong) • P(PlayTennis= yes)=9/14=0.64 P(PlayTennis= no)=5/14=0.36 • P(outlook = sunny|yes)= 2/9 P(outlook = sunny|no)= 3/5 • P(temp = cool | yes) = 3/9 P(temp = cool | no) = 1/5 • P(humidity = hi |yes) = 3/9 P(humidity = hi |no) = 4/5 • P(wind = strong | yes) = 3/9 P(wind = strong | no)= 3/5 • P(yes|…..) ~ 0.0053 P(no|…..) ~ 0.0206

  25. Example • Given: (Outlook=sunny; Temperature=cool; Humidity=high; Wind=strong) • P(PlayTennis= yes)=9/14=0.64 P(PlayTennis= no)=5/14=0.36 • P(outlook = sunny|yes)= 2/9 P(outlook = sunny|no)= 3/5 • P(temp = cool | yes) = 3/9 P(temp = cool | no) = 1/5 • P(humidity = hi |yes) = 3/9 P(humidity = hi |no) = 4/5 • P(wind = strong | yes) = 3/9 P(wind = strong | no)= 3/5 • P(yes|…..) ~ 0.0053 P(no|…..) ~ 0.0206 • What if we were asked about Outlook = overcast?

  26. Example • Given: (Outlook=sunny; Temperature=cool; Humidity=high; Wind=strong) • P(PlayTennis= yes)=9/14=0.64 P(PlayTennis= no)=5/14=0.36 • P(outlook = sunny|yes)= 2/9 P(outlook = sunny|no)= 3/5 • P(temp = cool | yes) = 3/9 P(temp = cool | no) = 1/5 • P(humidity = hi |yes) = 3/9 P(humidity = hi |no) = 4/5 • P(wind = strong | yes) = 3/9 P(wind = strong | no)= 3/5 • P(yes|…..) ~ 0.0053 P(no|…..) ~ 0.0206 • P(no|instance) = 0.0206/(0.0053+0.0206)=0.795
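
The arithmetic on the last three slides, as a short Python check (the probability estimates are copied from the slides; variable names are ours):

```python
p_yes, p_no = 9 / 14, 5 / 14
given_yes = [2 / 9, 3 / 9, 3 / 9, 3 / 9]   # sunny, cool, high, strong | yes
given_no  = [3 / 5, 1 / 5, 4 / 5, 3 / 5]   # sunny, cool, high, strong | no

# Unnormalized scores P(v) * prod_k P(x_k | v) for the given instance
score_yes = p_yes
for p in given_yes:
    score_yes *= p
score_no = p_no
for p in given_no:
    score_no *= p

print(round(score_yes, 4), round(score_no, 4))       # 0.0053 0.0206
print(round(score_no / (score_yes + score_no), 3))   # 0.795 -> predict PlayTennis = no
```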

  27. Naive Bayes: Two Classes • Notice that the naïve Bayes method seems to give a method for predicting rather than an explicit classifier. • In the case of two classes, v ∈ {0,1}, we predict that v=1 iff: P(v=1) ∏_i P(x_i | v=1) > P(v=0) ∏_i P(x_i | v=0)

  28. Naive Bayes: Two Classes • Notice that the naïve Bayes method gives a method for predicting rather than an explicit classifier. • In the case of two classes, v ∈ {0,1}, we predict that v=1 iff:

  29. Naive Bayes: Two Classes • In the case of two classes, v ∈ {0,1}, we predict that v=1 iff:

  30. Naïve Bayes: Two Classes • In the case of two classes, v ∈ {0,1}, we predict that v=1 iff:

  31. Naïve Bayes: Two Classes • In the case of two classes, v ∈ {0,1}, we predict that v=1 iff: • We get that the optimal Bayes behavior is given by a linear separator with:
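
For binary features x_i ∈ {0,1}, one standard way to spell out this linear separator (a reconstruction of the usual derivation, not necessarily the exact notation of the slide's formula) is:

```latex
\text{predict } v = 1 \;\iff\; \sum_i w_i x_i + b > 0, \qquad
w_i = \log\frac{p_i\,(1 - q_i)}{q_i\,(1 - p_i)}, \qquad
b = \log\frac{P(v=1)}{P(v=0)} + \sum_i \log\frac{1 - p_i}{1 - q_i},
```

where p_i = P(x_i = 1 | v=1) and q_i = P(x_i = 1 | v=0), i.e., we predict v=1 when the log-odds is positive.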

  32. Why does it work? • We have not addressed the question of why this classifier performs well, given that the assumptions are unlikely to be satisfied. • The linear form of the classifiers provides some hints.

  33. Naïve Bayes: Two Classes • In the case of two classes we have that:

  34. Naïve Bayes: Two Classes • In the case of two classes we have that: • but since • we get (plugging (2) into (1), after some algebra): • which is simply the logistic (sigmoid) function used in the neural network representation.
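
Under the naive Bayes assumption, the two-class posterior referred to above can be written as follows (a reconstruction in standard notation; w and b denote the linear coefficients from the previous slides):

```latex
P(v = 1 \mid x)
= \frac{P(v = 1,\, x)}{P(v = 1,\, x) + P(v = 0,\, x)}
= \frac{1}{1 + \exp\!\left(-\log\dfrac{P(v = 1,\, x)}{P(v = 0,\, x)}\right)}
= \frac{1}{1 + e^{-(\mathbf{w}\cdot\mathbf{x} + b)}}
```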

  35. Another Look at Naive Bayes • Graphical model: the graph encodes the NB independence assumption in its edge structure (siblings are independent given their parent). • This is a Linear Statistical Queries model.

  36. Hidden Markov Model (HMM) • HMM is a probabilistic generative model • It models how an observed sequence is generated • Let’s call each position in a sequence a time step • At each time step, there are two variables • Current state (hidden) • Observation

  37. HMM [Figure: a chain of hidden states s1, s2, ..., s6, each emitting an observation o1, o2, ..., o6] • Elements: • Initial state probability P(s1) • Transition probability P(st|st-1) • Observation probability P(ot|st) • As before, the graphical model is an encoding of the independence assumptions • Consider POS tagging: O - words, S - POS tags

  38. HMM for Shallow Parsing [Figure: state sequence s1=B, s2=I, s3=O, s4=B, s5=I, s6=O over the observations o1=Mr., o2=Brown, o3=blamed, o4=Mr., o5=Bob, o6=for] • States: {B, I, O} • Observations: actual words and/or part-of-speech tags

  39. HMM for Shallow Parsing [Figure: the same B/I/O tagging of "Mr. Brown blamed Mr. Bob for"] • Given a sentence, we can ask what the most likely state sequence is. • Initial state probability: P(s1=B), P(s1=I), P(s1=O) • Transition probability: P(st=B|st-1=B), P(st=I|st-1=B), P(st=O|st-1=B), P(st=B|st-1=I), P(st=I|st-1=I), P(st=O|st-1=I), ... • Observation probability: P(ot=‘Mr.’|st=B), P(ot=‘Brown’|st=B), ..., P(ot=‘Mr.’|st=I), P(ot=‘Brown’|st=I), ...
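
One way to hold these parameters in code (a sketch only; the probability values below are invented for illustration and are not estimated from any data in the lecture):

```python
states = ['B', 'I', 'O']

# Initial state probabilities P(s1) -- invented values
initial = {'B': 0.5, 'I': 0.1, 'O': 0.4}

# Transition probabilities P(s_t | s_{t-1}) -- invented values
transition = {
    'B': {'B': 0.1, 'I': 0.6, 'O': 0.3},
    'I': {'B': 0.1, 'I': 0.4, 'O': 0.5},
    'O': {'B': 0.4, 'I': 0.1, 'O': 0.5},
}

# Observation probabilities P(o_t | s_t) over a tiny vocabulary -- invented values
observation = {
    'B': {'Mr.': 0.5, 'Brown': 0.2, 'blamed': 0.0, 'Bob': 0.3, 'for': 0.0},
    'I': {'Mr.': 0.1, 'Brown': 0.5, 'blamed': 0.0, 'Bob': 0.4, 'for': 0.0},
    'O': {'Mr.': 0.0, 'Brown': 0.0, 'blamed': 0.5, 'Bob': 0.0, 'for': 0.5},
}
```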

  40. Finding most likely state sequence in HMM (1)
  P(s_k, s_{k-1}, ..., s_1, o_k, o_{k-1}, ..., o_1)
  = P(o_k | o_{k-1}, ..., o_1, s_k, s_{k-1}, ..., s_1) · P(o_{k-1}, ..., o_1, s_k, s_{k-1}, ..., s_1)
  = P(o_k | s_k) · P(o_{k-1}, ..., o_1, s_k, s_{k-1}, ..., s_1)
  = P(o_k | s_k) · P(s_k | s_{k-1}, ..., s_1, o_{k-1}, ..., o_1) · P(s_{k-1}, ..., s_1, o_{k-1}, ..., o_1)
  = P(o_k | s_k) · P(s_k | s_{k-1}) · P(s_{k-1}, s_{k-2}, ..., s_1, o_{k-1}, o_{k-2}, ..., o_1)
  = P(o_k | s_k) · [ ∏_{t=1}^{k-1} P(s_{t+1} | s_t) · P(o_t | s_t) ] · P(s_1)

  41. Finding most likely state sequence in HMM (2)
  argmax_{s_1, ..., s_k} P(s_k, s_{k-1}, ..., s_1 | o_k, o_{k-1}, ..., o_1)
  = argmax_{s_1, ..., s_k} P(s_k, s_{k-1}, ..., s_1, o_k, o_{k-1}, ..., o_1) / P(o_k, o_{k-1}, ..., o_1)
  = argmax_{s_1, ..., s_k} P(s_k, s_{k-1}, ..., s_1, o_k, o_{k-1}, ..., o_1)
  = argmax_{s_1, ..., s_k} P(o_k | s_k) · [ ∏_{t=1}^{k-1} P(s_{t+1} | s_t) · P(o_t | s_t) ] · P(s_1)

  42. Finding most likely state sequence in HMM (3)
  max_{s_1, ..., s_k} P(o_k | s_k) · [ ∏_{t=1}^{k-1} P(s_{t+1} | s_t) · P(o_t | s_t) ] · P(s_1)
  = max_{s_k} P(o_k | s_k) · max_{s_1, ..., s_{k-1}} [ ∏_{t=1}^{k-1} P(s_{t+1} | s_t) · P(o_t | s_t) ] · P(s_1)
  = max_{s_k} P(o_k | s_k) · max_{s_{k-1}} [ P(s_k | s_{k-1}) · P(o_{k-1} | s_{k-1}) ] · max_{s_1, ..., s_{k-2}} [ ∏_{t=1}^{k-2} P(s_{t+1} | s_t) · P(o_t | s_t) ] · P(s_1)
  = max_{s_k} P(o_k | s_k) · max_{s_{k-1}} [ P(s_k | s_{k-1}) · P(o_{k-1} | s_{k-1}) ] · max_{s_{k-2}} [ P(s_{k-1} | s_{k-2}) · P(o_{k-2} | s_{k-2}) ] · ... · max_{s_1} [ P(s_2 | s_1) · P(o_1 | s_1) ] · P(s_1)
  (Each bracketed term, after maximization, is a function of the next state; e.g., the max over s_{k-1} is a function of s_k.)

  43. Finding most likely state sequence in HMM (4)
  max_{s_k} P(o_k | s_k) · max_{s_{k-1}} [ P(s_k | s_{k-1}) · P(o_{k-1} | s_{k-1}) ] · max_{s_{k-2}} [ P(s_{k-1} | s_{k-2}) · P(o_{k-2} | s_{k-2}) ] · ... · max_{s_2} [ P(s_3 | s_2) · P(o_2 | s_2) ] · max_{s_1} [ P(s_2 | s_1) · P(o_1 | s_1) ] · P(s_1)
  • Viterbi’s Algorithm: evaluate the nested maximizations from the inside out, storing each intermediate result
  • Dynamic Programming
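
A compact Viterbi sketch in Python, implementing the inside-out maximization above (generic code, assuming parameter dictionaries shaped like the earlier HMM sketch; not code from the lecture):

```python
def viterbi(observations, states, initial, transition, observation):
    """Most likely state sequence. initial[s] = P(s1=s),
    transition[s][s2] = P(s_t=s2 | s_{t-1}=s), observation[s][o] = P(o_t=o | s_t=s)."""
    # delta[s]: probability of the best state sequence ending in s at the current step
    delta = {s: initial[s] * observation[s].get(observations[0], 0.0) for s in states}
    backpointers = []
    for o in observations[1:]:
        prev = delta
        delta, pointers = {}, {}
        for s in states:
            best = max(states, key=lambda sp: prev[sp] * transition[sp][s])
            delta[s] = prev[best] * transition[best][s] * observation[s].get(o, 0.0)
            pointers[s] = best
        backpointers.append(pointers)
    # Recover the best path by following backpointers from the best final state
    path = [max(states, key=lambda s: delta[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return list(reversed(path))

# e.g. viterbi("Mr. Brown blamed Mr. Bob for".split(), states, initial, transition, observation)
```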

  44. Learning the Model • Estimate: • Initial state probability P(s1) • Transition probability P(st|st-1) • Observation probability P(ot|st) • Unsupervised learning (states are not observed): EM algorithm • Supervised learning (states are observed; more common): ML estimates of the above terms directly from the data • Notice that this is completely analogous to the case of naive Bayes, and essentially all other models.
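
A sketch of the supervised case: with observed state sequences, the ML estimates are just normalized counts, exactly as in naive Bayes (function and variable names are ours; no smoothing is applied):

```python
from collections import Counter, defaultdict

def estimate_hmm(tagged_sequences):
    """tagged_sequences: list of sequences [(word, state), ...].
    Returns (initial, transition, observation) as nested probability dicts."""
    init_c, trans_c, obs_c = Counter(), defaultdict(Counter), defaultdict(Counter)
    for seq in tagged_sequences:
        init_c[seq[0][1]] += 1                          # count first states
        for word, state in seq:
            obs_c[state][word] += 1                     # count (state, word) emissions
        for (_, prev_s), (_, s) in zip(seq, seq[1:]):
            trans_c[prev_s][s] += 1                     # count state transitions

    def normalize(counter):
        total = sum(counter.values())
        return {k: v / total for k, v in counter.items()}

    return (normalize(init_c),
            {s: normalize(c) for s, c in trans_c.items()},
            {s: normalize(c) for s, c in obs_c.items()})
```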

  45. Another View of Markov Models • Input: • States: T • Observations: W • Assumptions: • Prediction: predict the t ∈ T that maximizes:

  46. Another View of Markov Models • Input: • States: T • Observations: W • As for NB: features are pairs and singletons of t’s and w’s • Only 3 features are active • This can be extended to an argmax that maximizes the prediction of the whole state sequence, and is computed, as before, via Viterbi.

  47. Learning with Probabilistic Classifiers • Learning Theory • We showed that probabilistic predictions can be viewed as predictions via Linear Statistical Queries models. • The low expressivity explains generalization and robustness. • Is that all? It does not explain why it is possible to (approximately) fit the data with these models. Namely, is there a reason to believe that these hypotheses minimize the empirical error on the sample? • In general, no (unless it corresponds to some probabilistic assumptions that hold).

  48. Learning Protocol • LSQ hypotheses are computed directly, w/o assumptions on the underlying distribution: - Choose features - Compute coefficients • Is there a reason to believe that an LSQ hypothesis minimizes the empirical error on the sample? • In general, no. (Unless it corresponds to some probabilistic assumptions that hold).

  49. Learning Protocol: Practice • LSQ hypotheses are computed directly: - Choose features - Compute coefficients • If the hypothesis does not fit the training data, augment the set of features (forget your original assumptions).

  50. Example: Probabilistic Classifiers • States: T • Observations: W • Features are pairs and singletons of t’s and w’s; additional features are included. • If the hypothesis does not fit the training data, augment the set of features (forget the assumptions).
