
Bayesian Learning


finola

Presentation Transcript


  1. Bayesian Learning

  2. Conditional Probability • Probability of an event given the occurrence of some other event. E.g., • Consider choosing a card from a well-shuffled standard deck of 52 playing cards. Given that the first card chosen is an ace, what is the probability that the second card chosen will be an ace?

  3. Event space = all possible pairs of cards, containing events X and Y

  4. Event space = all possible pairs of cards. Y = First card is Ace

  5. Event space = all possible pairs of cards. Y = First card is Ace; X = Second card is Ace

  6. P(Y) = 4/52 • P(X, Y) = # possible pairs of aces / total # of pairs = (4 × 3) / (52 × 51) = 12/2652 • P(X | Y) = P(X, Y) / P(Y) = (12/2652) / (4/52) = 3/51.
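
The card calculation can be checked by brute-force enumeration of the event space. The following is a minimal sketch; the deck encoding and the variable names are illustrative assumptions, not part of the original slides.

```python
from fractions import Fraction
from itertools import permutations

# Hypothetical deck encoding: (rank, suit) tuples, where rank "A" marks an ace.
ranks = ["A"] + [str(n) for n in range(2, 11)] + ["J", "Q", "K"]
suits = ["clubs", "diamonds", "hearts", "spades"]
deck = [(rank, suit) for rank in ranks for suit in suits]

# Event space: all ordered pairs (first card, second card), drawn without replacement.
pairs = list(permutations(deck, 2))

# Y = first card is an ace; X and Y = both cards are aces.
y = [pair for pair in pairs if pair[0][0] == "A"]
xy = [pair for pair in y if pair[1][0] == "A"]

p_y = Fraction(len(y), len(pairs))
p_xy = Fraction(len(xy), len(pairs))
p_x_given_y = p_xy / p_y  # definition of conditional probability

print(p_y, p_xy, p_x_given_y)
```

Exact fractions are used so the result can be compared directly with 4/52, 12/2652, and 3/51 in reduced form.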

  7. Deriving Bayes Rule
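
The derivation itself did not survive in this transcript. A standard derivation, using the same X and Y as the card example and assuming P(X) > 0 and P(Y) > 0, would run as follows:

```latex
\begin{align*}
P(X \mid Y) &= \frac{P(X, Y)}{P(Y)} && \text{(definition of conditional probability)} \\
P(Y \mid X) &= \frac{P(X, Y)}{P(X)} && \text{(same definition, roles swapped)} \\
P(X, Y) &= P(Y \mid X)\, P(X) && \text{(rearranging the second line)} \\
P(X \mid Y) &= \frac{P(Y \mid X)\, P(X)}{P(Y)} && \text{(substituting into the first line: Bayes rule)}
\end{align*}
```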

  8. Bayesian Learning

  9. Application to Machine Learning • In machine learning we have a space H of hypotheses: h1, h2, ..., hn • We also have a set D of data • We want to calculate P(h | D)

  10. Terminology • Prior probability of h: • P(h): Probability that hypothesis h is true given our prior knowledge • If no prior knowledge, all h ∈ H are equally probable • Posterior probability of h: • P(h | D): Probability that hypothesis h is true, given the data D. • Likelihood of D: • P(D | h): Probability that we will see data D, given hypothesis h is true.

  11. Bayes Rule: Machine Learning Formulation P(h | D) = P(D | h) P(h) / P(D)

  12. Example

  13. The Monty Hall Problem You are a contestant on a game show. There are 3 doors, A, B, and C. There is a new car behind one of them and goats behind the other two. Monty Hall, the host, asks you to pick a door, any door. You pick door A. Monty tells you he will open a door, different from A, that has a goat behind it. He opens door B: behind it there is a goat. Monty now gives you a choice: Stick with your original choice A or switch to C. Should you switch? http://math.ucsd.edu/~crypto/Monty/monty.html

  14. Bayesian probability formulation Hypothesis space H: h1= Car is behind door A h2= Car is behind door B h3= Car is behind door C Data D = Monty opened B What is P(h1 | D)? What is P(h2 | D)? What is P(h3 | D)?

  15. Event space X = Car behind door A Y = Goat behind door B Event space = All possible configurations of cars and goats behind doors A, B, C

  16. Event space X = Car behind door A Y = Goat behind door B Bayes Rule: P(X | Y) = P(Y | X) P(X) / P(Y)

  17. Using Bayes’ Rule to solve the Monty Hall problem
      You pick door A. Data D = Monty opened door B.
      Hypothesis space H: h1 = Car is behind door A; h2 = Car is behind door C; h3 = Car is behind door B (note that h2 and h3 are labeled differently here than on slide 14).
      What is P(h1 | D)? What is P(h2 | D)? What is P(h3 | D)?
      Prior probability: P(h1) = 1/3, P(h2) = 1/3, P(h3) = 1/3
      Likelihood: P(D | h1) = 1/2, P(D | h2) = 1, P(D | h3) = 0
      P(D) = P(D | h1) P(h1) + P(D | h2) P(h2) + P(D | h3) P(h3) = 1/6 + 1/3 + 0 = 1/2
      By Bayes rule:
      P(h1 | D) = P(D | h1) P(h1) / P(D) = (1/2 × 1/3) / (1/2) = 1/3
      P(h2 | D) = P(D | h2) P(h2) / P(D) = (1 × 1/3) / (1/2) = 2/3
      P(h3 | D) = P(D | h3) P(h3) / P(D) = (0 × 1/3) / (1/2) = 0
      So you should switch!
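
The same calculation can be sketched in a few lines of Python. The `posterior` helper and the choice to key hypotheses by door letter (rather than by h1, h2, h3) are assumptions made for illustration; the priors and likelihoods are the ones given in the slide.

```python
from fractions import Fraction

def posterior(priors, likelihoods):
    """Return P(h | D) for every hypothesis h, using Bayes rule."""
    joint = {h: likelihoods[h] * priors[h] for h in priors}  # P(D | h) P(h)
    evidence = sum(joint.values())                           # P(D)
    return {h: joint[h] / evidence for h in joint}

# Hypotheses keyed by the door hiding the car. You picked A; D = Monty opened B.
priors = {"A": Fraction(1, 3), "B": Fraction(1, 3), "C": Fraction(1, 3)}
likelihoods = {
    "A": Fraction(1, 2),  # car behind A: Monty picks B or C at random
    "B": Fraction(0),     # car behind B: Monty never opens B
    "C": Fraction(1),     # car behind C: Monty is forced to open B
}

post = posterior(priors, likelihoods)
for door, p in sorted(post.items()):
    print(f"P(car behind {door} | Monty opened B) = {p}")
```

The likelihood for door A encodes the assumption that Monty chooses uniformly at random when both remaining doors hide goats; a different host policy would change that entry and therefore the posterior.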

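  18. MAP (“maximum a posteriori”) Learning Bayes rule: P(h | D) = P(D | h) P(h) / P(D). Goal of learning: Find the maximum a posteriori hypothesis hMAP: hMAP = argmax_{h ∈ H} P(h | D) = argmax_{h ∈ H} P(D | h) P(h) / P(D) = argmax_{h ∈ H} P(D | h) P(h), because P(D) is a constant independent of h.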

  19. Note: If every h ∈ H is equally probable, then hMAP = argmax_{h ∈ H} P(D | h). This is called the “maximum likelihood hypothesis”, hML.

  20. A Medical Example Toby takes a test for leukemia. The test has two outcomes: positive and negative. It is known that if the patient has leukemia, the test is positive 98% of the time. If the patient does not have leukemia, the test is positive 3% of the time. It is also known that 0.008 of the population has leukemia. Toby’s test is positive. Which is more likely: Toby has leukemia or Toby does not have leukemia?

  21. Hypothesis space: h1 = T. has leukemia h2 = T. does not have leukemia • Prior: 0.008 of the population has leukemia. Thus P(h1) = 0.008 P(h2) = 0.992 • Likelihood: P(+ | h1) = 0.98, P(− | h1) = 0.02 P(+ | h2) = 0.03, P(− | h2) = 0.97 • Posterior knowledge: Blood test is + for this patient.

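  22. In summary P(h1) = 0.008, P(h2) = 0.992 P(+ | h1) = 0.98, P(− | h1) = 0.02 P(+ | h2) = 0.03, P(− | h2) = 0.97 • Thus: P(+ | h1) P(h1) = 0.98 × 0.008 = 0.00784 and P(+ | h2) P(h2) = 0.03 × 0.992 = 0.02976, so hMAP = h2 (Toby does not have leukemia).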

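  23. What is P(leukemia | +)? P(+) = P(+ | h1) P(h1) + P(+ | h2) P(h2) = 0.98 × 0.008 + 0.03 × 0.992 = 0.0376. So, P(leukemia | +) = 0.00784 / 0.0376 ≈ 0.21 and P(¬leukemia | +) = 0.02976 / 0.0376 ≈ 0.79. These are called the “posterior” probabilities.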
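
A short sketch of the leukemia calculation, showing both the MAP decision (which needs only the unnormalized scores) and the normalized posteriors. The dictionary keys and variable names are illustrative assumptions; the numbers are those given in the slides.

```python
# Priors and likelihoods of a positive test, as stated in the example.
priors = {"leukemia": 0.008, "no leukemia": 0.992}
p_positive_given = {"leukemia": 0.98, "no leukemia": 0.03}

# Unnormalized scores P(+ | h) P(h). These are enough to pick hMAP,
# because P(+) is the same constant for every hypothesis.
scores = {h: p_positive_given[h] * priors[h] for h in priors}
h_map = max(scores, key=scores.get)

# Dividing by P(+) turns the scores into posterior probabilities.
p_positive = sum(scores.values())
posteriors = {h: score / p_positive for h, score in scores.items()}

print("hMAP:", h_map)
for h, p in posteriors.items():
    print(f"P({h} | +) = {p:.3f}")
```

Even with a positive result, the low prior keeps the posterior probability of leukemia near 0.21, which is why the MAP hypothesis is "no leukemia".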

  24. In-Class Exercise Suppose you receive an e-mail message with the subject “Hi”. You have been keeping statistics on your e-mail, and have found that while only 10% of the total e-mail messages you receive are spam, 50% of the spam messages have the subject “Hi” and 2% of the non-spam messages have the subject “Hi”. What is the probability that the message is spam?

  25. Bayesianism vs. Frequentism • Classical probability: Frequentists • Probability of a particular event is defined relative to its frequency in a sample space of events. • E.g., probability of “the coin will come up heads on the next trial” is defined relative to the frequency of heads in a sample space of coin tosses. • Bayesian probability: • Combine measure of “prior” belief you have in a proposition with your subsequent observations of events. • Example: Bayesian can assign probability to statement “There was life on Mars a billion years ago” but frequentist cannot.

  26. Independence and Conditional Independence • Two random variables, X and Y, are independent if P(X, Y) = P(X) P(Y) • Two random variables, X and Y, are conditionally independent given Z if P(X, Y | Z) = P(X | Z) P(Y | Z) • Examples?

  27. Naive Bayes Classifier Let f (x) be a target function for classification: f (x) ∈ {+1, −1}. Let x = <x1, x2, ..., xn>. We want to find the most probable class value, hMAP, given the data x: hMAP = argmax_{class ∈ {+1, −1}} P(class | x1, x2, ..., xn)

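  28. By Bayes Theorem: hMAP = argmax_{class} P(x1, x2, ..., xn | class) P(class) / P(x1, x2, ..., xn) = argmax_{class} P(x1, x2, ..., xn | class) P(class). P(class) can be estimated from the training data. How? However, in general, it is not practical to use training data to estimate P(x1, x2, ..., xn | class). Why not?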

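  29. Naive Bayes classifier: Assume P(x1, x2, ..., xn | class) = P(x1 | class) P(x2 | class) ··· P(xn | class). Is this a good assumption? Given this assumption, here’s how to classify an instance x = <x1, x2, ..., xn>: Naive Bayes classifier: classNB(x) = argmax_{class} P(class) ∏_i P(xi | class). Estimate the values of these various probabilities over the training set.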

  30. Training data:

      Day  Outlook   Temp  Humidity  Wind    PlayTennis
      D1   Sunny     Hot   High      Weak    No
      D2   Sunny     Hot   High      Strong  No
      D3   Overcast  Hot   High      Weak    Yes
      D4   Rain      Mild  High      Weak    Yes
      D5   Rain      Cool  Normal    Weak    Yes
      D6   Rain      Cool  Normal    Strong  No
      D7   Overcast  Cool  Normal    Strong  Yes
      D8   Sunny     Mild  High      Weak    No
      D9   Sunny     Cool  Normal    Weak    Yes
      D10  Rain      Mild  Normal    Weak    Yes
      D11  Sunny     Mild  Normal    Strong  Yes
      D12  Overcast  Mild  High      Strong  Yes
      D13  Overcast  Hot   Normal    Weak    Yes
      D14  Rain      Mild  High      Strong  No

      Test data:

      D15  Sunny     Cool  High      Strong  ?

  31. In practice, use training data to compute a probabilistic model:

  32. Estimating probabilities • Recap: In previous example, we had a training set and a new example, <Outlook=sunny, Temperature=cool, Humidity=high, Wind=strong> • We asked: What classification is given by a naive Bayes classifier? • Let n(c) be the number of training instances with class c, and n(xi = ai, c) be the number of training instances with attribute value xi = ai and class c. Then P(xi = ai | c) = n(xi = ai, c) / n(c)
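
The count-based estimates can be sketched on the PlayTennis training set as follows. The function names `train` and `class_scores` and the tuple layout are assumptions made for this illustration; no smoothing is applied, so unseen attribute/class combinations get probability 0.

```python
from collections import Counter
from fractions import Fraction

# Each row: (Outlook, Temp, Humidity, Wind, PlayTennis), days D1 to D14.
TRAIN = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]

def train(rows):
    """Count n(c) and n(xi = ai, c) over the training rows."""
    class_counts = Counter(row[-1] for row in rows)
    value_counts = Counter()
    for row in rows:
        for i, value in enumerate(row[:-1]):
            value_counts[(i, value, row[-1])] += 1
    return class_counts, value_counts

def class_scores(x, class_counts, value_counts):
    """Return P(c) * product over i of P(xi | c) for each class c."""
    n = sum(class_counts.values())
    result = {}
    for c, n_c in class_counts.items():
        score = Fraction(n_c, n)  # P(c) = n(c) / n
        for i, value in enumerate(x):
            score *= Fraction(value_counts[(i, value, c)], n_c)  # n(xi = ai, c) / n(c)
        result[c] = score
    return result

class_counts, value_counts = train(TRAIN)
d15 = ("Sunny", "Cool", "High", "Strong")
d15_scores = class_scores(d15, class_counts, value_counts)
prediction = max(d15_scores, key=d15_scores.get)
print(d15_scores, prediction)
```

For D15 the unnormalized score for No is larger than the score for Yes, so this sketch classifies the day as PlayTennis = No.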

  33. Problem with this method: If n(c) is very small, gives a poor estimate. • E.g., P(Outlook = Overcast | no) = 0.

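  34. Now suppose we want to classify a new instance: <Outlook=overcast, Temperature=cool, Humidity=high, Wind=strong>. Then: P(no) P(Outlook=overcast | no) P(Temperature=cool | no) P(Humidity=high | no) P(Wind=strong | no) = 5/14 × 0 × 1/5 × 4/5 × 3/5 = 0. This incorrectly gives us zero probability due to small sample.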

  35. One solution: Laplace smoothing (also called “add-one” smoothing) For each class cj and attribute xi with value ai, add one “virtual” instance. That is, recalculate: P(xi = ai | cj) = (n(xi = ai, cj) + 1) / (n(cj) + k), where k is the number of possible values of attribute xi.

  36. Training data: the same 14 PlayTennis examples (D1 to D14) as on slide 30.
      Add virtual instances for Outlook:
      Outlook=Sunny: Yes; Outlook=Overcast: Yes; Outlook=Rain: Yes
      Outlook=Sunny: No; Outlook=Overcast: No; Outlook=Rain: No
      P(Outlook=Overcast | No) = 0/5 becomes (0 + 1) / (5 + 3) = 1/8
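
As a minimal sketch of add-one smoothing, the helper below (the name `laplace_estimate` is hypothetical) applies the add-one formula to the Outlook counts for class No, which are 3 Sunny, 0 Overcast, and 2 Rain out of 5 No days in the training table.

```python
from fractions import Fraction

def laplace_estimate(n_value_and_class, n_class, k):
    """Add-one estimate of P(xi = ai | c).

    n_value_and_class: number of training instances with xi = ai and class c
    n_class:           number of training instances with class c
    k:                 number of possible values of attribute xi
    """
    return Fraction(n_value_and_class + 1, n_class + k)

# Outlook has k = 3 possible values; 5 training days have class No.
outlook_counts_no = {"Sunny": 3, "Overcast": 0, "Rain": 2}
smoothed = {
    value: laplace_estimate(count, 5, 3)
    for value, count in outlook_counts_no.items()
}
print(smoothed)
```

Adding k to the denominator keeps the smoothed estimates for one attribute summing to 1 within each class.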

  37. Etc.

  38. In-class exercise 2

  39. Naive Bayes on continuous-valued attributes • How to deal with continuous-valued attributes? Two possible solutions: • Discretize • Assume particular probability distribution of classes over values (estimate parameters from training data)

  40. Simplest discretization method For each attribute xi, create k equal-size bins in the interval from min(xi) to max(xi). Choose thresholds in between bins. Example: Humidity: 25, 38, 50, 80, 90, 92, 96, 99, with thresholds 40, 80, 120. Probabilities to estimate: P(Humidity < 40 | yes), P(40 <= Humidity < 80 | yes), P(80 <= Humidity < 120 | yes), P(Humidity < 40 | no), P(40 <= Humidity < 80 | no), P(80 <= Humidity < 120 | no)
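
A rough sketch of threshold-based binning. The helper names are assumptions; `equal_width_thresholds` follows the min-to-max recipe in the slide, while the bin counts below use the slide's own example thresholds (40, 80, 120), which are round numbers rather than values derived from the min and max of the data.

```python
from bisect import bisect_right
from collections import Counter

def equal_width_thresholds(values, k):
    """Return the k - 1 interior thresholds of k equal-width bins over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]

def bin_index(value, thresholds):
    """Bin 0 holds values below thresholds[0]; bin i holds thresholds[i-1] <= value < thresholds[i]."""
    return bisect_right(thresholds, value)

humidity = [25, 38, 50, 80, 90, 92, 96, 99]

# Counts per bin using the thresholds from the slide.
slide_thresholds = [40, 80, 120]
counts = Counter(bin_index(v, slide_thresholds) for v in humidity)
print(dict(counts))

# Thresholds computed from the data range instead, here for k = 2 bins.
print(equal_width_thresholds(humidity, 2))
```

In a naive Bayes setting these counts would be tallied separately for each class to estimate quantities such as P(40 <= Humidity < 80 | yes).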

  41. Questions: What should k be? What if some bins have very few instances? Problem with balance between discretization bias and variance. The more bins, the lower the bias, but the higher the variance, due to small sample size.

  42. Alternative simple (but effective) discretization method (Yang & Webb, 2001) Let n = number of training examples. For each attribute Ai, create √n bins. Sort values of Ai in ascending order, and put √n of them in each bin. Don’t need add-one smoothing of probabilities. This gives good balance between discretization bias and variance.

  43. Same method, applied to the example values: Humidity: 25, 38, 50, 80, 90, 92, 96, 99

  44. Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier (P. Domingos and M. Pazzani) Naive Bayes classifier is called “naive” because it assumes attributes are independent of one another given the class.

  45. This paper asks: why does the naive (“simple”) Bayes classifier, SBC, do so well in domains with clearly dependent attributes?

  46. Experiments • Compare five classification methods on 30 data sets from the UCI ML database. SBC = Simple Bayesian Classifier Default = “Choose class with most representatives in data” C4.5 = Quinlan’s decision tree induction system PEBLS = An instance-based learning system CN2 = A rule-induction system

  47. For SBC, numeric values were discretized into ten equal-length intervals.

  48. Results table rows: (1) Number of domains in which SBC was more accurate versus less accurate than the corresponding classifier. (2) Same as row 1, but significant at 95% confidence. (3) Average rank over all domains (1 is best in each domain).

  49. Measuring Attribute Dependence They used a simple, pairwise mutual information measure: For attributes Am and An, dependence is defined as D(Am, An | C) = H(Am | C) + H(An | C) − H(AmAn | C), where H(A | C) is the entropy of attribute A given the class C, and AmAn is a “derived attribute” whose values consist of the possible combinations of values of Am and An. Note: If Am and An are independent given C, then D(Am, An | C) = 0.
