
Bayesian Networks


Presentation Transcript


  1. Bayesian Networks Tamara Berg CS 590-133 Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart Russell, Andrew Moore, Percy Liang, Luke Zettlemoyer, Rob Pless, Kilian Weinberger, Deva Ramanan

  2. Announcements • Office hours today canceled (due to conference deadline) • Rohit has office hours tomorrow 4-5pm

  3. Bayesian decision making • Suppose the agent has to make decisions about the value of an unobserved query variable X based on the values of an observed evidence variable E • Inference problem: given some evidence E = e, what is P(X | e)? • Learning problem: estimate the parameters of the probabilistic model P(X | E) given training samples {(x1, e1), …, (xn, en)}

  4. Learning in Bayes Nets • Task 1: Given the network structure and given data, where a data point is an observed setting for the variables, learn the CPTs for the Bayes Net. Might also start with priors for CPT probabilities. • Task 2: Given only the data (and possibly a prior over Bayes Nets), learn the entire Bayes Net (both Net structure and CPTs).

  5. Parameter learning • Suppose we know the network structure (but not the parameters), and have a training set of complete observations [Figure: example Bayes net with CPT entries shown as "?", next to a training-set table]

  6. Parameter learning • Suppose we know the network structure (but not the parameters), and have a training set of complete observations • P(X | Parents(X)) is given by the observed frequencies of the different values of X for each combination of parent values
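
A minimal sketch of this frequency-counting step, assuming a toy two-node network Rain → WetGrass with fully observed boolean data; the variable names and data below are invented for illustration:

```python
from collections import Counter

# Toy fully observed data for a two-node network Rain -> WetGrass
# (hypothetical values, just to illustrate the counting).
data = [
    {"Rain": True,  "WetGrass": True},
    {"Rain": True,  "WetGrass": True},
    {"Rain": False, "WetGrass": False},
    {"Rain": False, "WetGrass": True},
    {"Rain": False, "WetGrass": False},
]

# P(Rain): observed frequency of each value of the root node.
rain_counts = Counter(d["Rain"] for d in data)
p_rain = {v: c / len(data) for v, c in rain_counts.items()}

# P(WetGrass | Rain): for each parent value, frequency of each child value.
joint = Counter((d["Rain"], d["WetGrass"]) for d in data)
p_wet_given_rain = {(r, w): joint[(r, w)] / rain_counts[r] for (r, w) in joint}

print(p_rain)            # {True: 0.4, False: 0.6}
print(p_wet_given_rain)  # e.g. (True, True): 1.0, (False, False): 0.667, ...
```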

  7. Parameter learning • Incomplete observations • Expectation maximization (EM) algorithm for dealing with missing data [Figure: training-set table with some missing entries shown as "?"]

  8. Learning from incomplete data • Observe real-world data • Some variables are unobserved • Trick: Fill them in • EM algorithm • E=Expectation • M=Maximization

  9. EM-Algorithm • Initialize CPTs (e.g. randomly, uniformly) • Repeat until convergence • E-Step: Compute P(Xi = x | e) for all unobserved variables Xi and values x (inference) • M-Step: Update the CPT tables using the expected counts from the E-step
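
A compact sketch of this E/M loop for one special case: a Naïve Bayes-style net whose class variable is never observed (a mixture of Bernoulli features). The random toy data, the number of hidden classes, and the smoothing constants are all illustrative choices, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 5))   # 200 examples, 5 binary features (toy data)
K = 2                                    # number of hidden class values (assumed)

# Initialize CPTs: prior P(C) and per-class Bernoulli P(F_j = 1 | C).
prior = np.full(K, 1.0 / K)
theta = rng.uniform(0.25, 0.75, size=(K, X.shape[1]))

for _ in range(50):
    # E-step: responsibilities r[n, k] = P(C = k | x_n) under the current CPTs.
    log_lik = (X @ np.log(theta).T) + ((1 - X) @ np.log(1 - theta).T)
    log_post = np.log(prior) + log_lik
    log_post -= log_post.max(axis=1, keepdims=True)   # numerical stability
    r = np.exp(log_post)
    r /= r.sum(axis=1, keepdims=True)

    # M-step: re-estimate CPTs from expected counts (with light smoothing).
    Nk = r.sum(axis=0)
    prior = Nk / Nk.sum()
    theta = (r.T @ X + 1) / (Nk[:, None] + 2)          # Laplace-style smoothing

print(prior)
print(theta.round(2))
```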

  10. Parameter learning • What if the network structure is unknown? • Structure learning algorithms exist, but they are pretty complicated… [Figure: training-set table over variables C, S, R, W, with the network structure over them unknown]

  11. Summary: Bayesian networks • Structure • Independence • Inference • Parameters • Learning

  12. Naïve Bayes

  13. Bayesian inference, Naïve Bayes model http://xkcd.com/1236/

  14. Bayes Rule Rev. Thomas Bayes (1702-1761) • The product rule gives us two ways to factor a joint probability: P(A, B) = P(B | A) P(A) = P(A | B) P(B) • Therefore, P(A | B) = P(B | A) P(A) / P(B) • Why is this useful? • Key tool for probabilistic inference: can get diagnostic probability from causal probability • E.g., P(Cavity | Toothache) from P(Toothache | Cavity)

  15. Bayes Rule example • Marie is getting married tomorrow, at an outdoor ceremony in the desert. In recent years, it has rained only 5 days each year (5/365 = 0.014). Unfortunately, the weatherman has predicted rain for tomorrow. When it actually rains, the weatherman correctly forecasts rain 90% of the time. When it doesn't rain, he incorrectly forecasts rain 10% of the time. What is the probability that it will rain on Marie's wedding?

  16. Bayes Rule example • Marie is getting married tomorrow, at an outdoor ceremony in the desert. In recent years, it has rained only 5 days each year (5/365 = 0.014). Unfortunately, the weatherman has predicted rain for tomorrow. When it actually rains, the weatherman correctly forecasts rain 90% of the time. When it doesn't rain, he incorrectly forecasts rain 10% of the time. What is the probability that it will rain on Marie's wedding?
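
The transcript does not include the solution shown on the slide; plugging the stated numbers into Bayes' rule gives roughly an 11% chance of rain, for example:

```python
p_rain = 5 / 365            # prior probability of rain (~0.014)
p_fc_given_rain = 0.90      # P(forecast rain | rain)
p_fc_given_dry = 0.10       # P(forecast rain | no rain)

p_forecast = p_fc_given_rain * p_rain + p_fc_given_dry * (1 - p_rain)
p_rain_given_fc = p_fc_given_rain * p_rain / p_forecast
print(round(p_rain_given_fc, 3))   # about 0.111, i.e. roughly an 11% chance of rain
```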

  17. Bayes rule: Another example • 1% of women at age forty who participate in routine screening have breast cancer. 80% of women with breast cancer will get positive mammographies. 9.6% of women without breast cancer will also get positive mammographies. A woman in this age group had a positive mammography in a routine screening. What is the probability that she actually has breast cancer?
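
Again the worked solution is not in the transcript; applying Bayes' rule to the stated numbers gives a surprisingly small posterior, under 8%:

```python
p_cancer = 0.01            # prior: 1% of women in this group have breast cancer
p_pos_given_cancer = 0.80  # P(positive mammography | cancer)
p_pos_given_healthy = 0.096

p_pos = p_pos_given_cancer * p_cancer + p_pos_given_healthy * (1 - p_cancer)
p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos
print(round(p_cancer_given_pos, 3))   # about 0.078, despite the positive test
```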

  18. Probabilistic inference • Suppose the agent has to make a decision about the value of an unobserved query variable X given some observed evidence variable(s) E = e • Partially observable, stochastic, episodic environment • Examples: X = {spam, not spam}, e = email message; X = {zebra, giraffe, hippo}, e = image features

  19. Bayesian decision theory • Let x be the value predicted by the agent and x* be the true value of X. • The agent has a loss function, which is 0 if x = x* and 1 otherwise • Expected loss: E[L] = Σx* L(x, x*) P(x* | e) = 1 − P(x | e) • What is the estimate of X that minimizes the expected loss? • The one that has the greatest posterior probability P(x | e) • This is called the Maximum a Posteriori (MAP) decision

  20. MAP decision • Value x of X that has the highest posterior probability given the evidence E = e: x_MAP = argmax_x P(x | e) = argmax_x P(e | x) P(x) / P(e) = argmax_x P(e | x) P(x) (posterior ∝ likelihood × prior) • Maximum likelihood (ML) decision: x_ML = argmax_x P(e | x)
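
A tiny illustration of how the MAP and ML decisions can disagree, using made-up likelihoods and priors for a two-valued X (the animal labels just echo the earlier image example):

```python
# Hypothetical numbers: likelihood P(e | x) and prior P(x) for two values of X.
likelihood = {"zebra": 0.30, "hippo": 0.20}   # P(e | x)
prior = {"zebra": 0.05, "hippo": 0.95}        # P(x)

ml_decision = max(likelihood, key=likelihood.get)
map_decision = max(prior, key=lambda x: likelihood[x] * prior[x])
print(ml_decision, map_decision)   # ML picks "zebra", MAP picks "hippo"
```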

  21. Naïve Bayes model • Suppose we have many different types of observations (symptoms, features) E1, …, En that we want to use to obtain evidence about an underlying hypothesis X • MAP decision involves estimating P(X = x | E1, …, En) ∝ P(E1, …, En | X = x) P(X = x) • If each feature Ei can take on k values, how many entries are in the (conditional) joint probability table P(E1, …, En | X = x)?

  22. Naïve Bayes model • Suppose we have many different types of observations (symptoms, features) E1, …, En that we want to use to obtain evidence about an underlying hypothesis X • MAP decision involves estimating P(X = x | E1, …, En) ∝ P(E1, …, En | X = x) P(X = x) • We can make the simplifying assumption that the different features are conditionally independent given the hypothesis: P(E1, …, En | X = x) = P(E1 | X = x) × … × P(En | X = x) • If each feature can take on k values, what is the complexity of storing the resulting distributions?
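
To make the storage question concrete, here is a quick comparison of the full conditional joint table against the factored Naïve Bayes tables; the feature count and arities below are arbitrary:

```python
n = 10   # number of features E1..En (arbitrary)
k = 5    # values each feature can take (arbitrary)
m = 2    # values of the hypothesis X

full_joint_table = m * k ** n     # P(E1, ..., En | X = x) for every value combination
naive_bayes_tables = m * n * k    # one small table P(Ei | X) per feature

print(full_joint_table)    # 19531250 entries
print(naive_bayes_tables)  # 100 entries
```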

  23. Exchangeability • De Finetti Theorem of exchangeability (bag of words theorem): the joint probability distribution underlying the data is invariant to permutation.

  24. Naïve Bayes model • Posterior: P(X = x | E1 = e1, …, En = en) ∝ P(x) P(e1 | x) × … × P(en | x) (posterior ∝ prior × likelihood) • MAP decision: x_MAP = argmax_x P(x) P(e1 | x) × … × P(en | x)

  25. A Simple Example – Naïve Bayes • C – Class, F – Features • We only specify (parameters): the prior over class labels, P(C), and how each feature depends on the class, P(F | C)

  26. Slide from Dan Klein

  27. Spam filter • MAP decision: to minimize the probability of error, we should classify a message as spam if P(spam | message) > P(¬spam | message)

  28. Spam filter • MAP decision: to minimize the probability of error, we should classify a message as spam if P(spam | message) > P(¬spam | message) • We have P(spam | message) ∝ P(message | spam) P(spam) and P(¬spam | message) ∝ P(message | ¬spam) P(¬spam) • To enable classification, we need to be able to estimate the likelihoods P(message | spam) and P(message | ¬spam) and priors P(spam) and P(¬spam)

  29. Naïve Bayes Representation • Goal: estimate likelihoods P(message | spam) and P(message | ¬spam) and priors P(spam) and P(¬spam) • Likelihood: bag of words representation • The message is a sequence of words (w1, …, wn) • The order of the words in the message is not important • Each word is conditionally independent of the others given message class (spam or not spam)

  30. Naïve Bayes Representation • Goal: estimate likelihoods P(message | spam) and P(message | ¬spam) and priors P(spam) and P(¬spam) • Likelihood: bag of words representation • The message is a sequence of words (w1, …, wn) • The order of the words in the message is not important • Each word is conditionally independent of the others given message class (spam or not spam)

  31. Bag of words illustration US Presidential Speeches Tag Cloud http://chir.ag/projects/preztags/

  32. Bag of words illustration US Presidential Speeches Tag Cloud http://chir.ag/projects/preztags/

  33. Bag of words illustration US Presidential Speeches Tag Cloud http://chir.ag/projects/preztags/

  34. Naïve Bayes Representation • Goal: estimate likelihoods P(message | spam) and P(message | ¬spam) and priors P(spam) and P(¬spam) • Likelihood: bag of words representation • The message is a sequence of words (w1, …, wn) • The order of the words in the message is not important • Each word is conditionally independent of the others given message class (spam or not spam) • Thus, the problem is reduced to estimating marginal likelihoods of individual words P(wi | spam) and P(wi | ¬spam)
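
Concretely, under the bag-of-words assumption the message likelihood is just a product of per-word probabilities; a small sketch with invented word probabilities:

```python
import math

# Hypothetical per-word likelihoods learned from a labeled corpus.
p_word_given_spam = {"free": 0.05, "money": 0.04, "meeting": 0.001}
p_word_given_ham = {"free": 0.005, "money": 0.004, "meeting": 0.03}

message = ["free", "money", "free"]   # word order is ignored, repeats are kept

lik_spam = math.prod(p_word_given_spam[w] for w in message)
lik_ham = math.prod(p_word_given_ham[w] for w in message)
print(lik_spam, lik_ham)   # P(message | spam) and P(message | ¬spam), before the priors
```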

  35. Summary: Decision rule • General MAP rule for Naïve Bayes: x_MAP = argmax_x P(x) ∏i P(ei | x) (posterior ∝ prior × likelihood) • Thus, the filter should classify the message as spam if P(spam) ∏i P(wi | spam) > P(¬spam) ∏i P(wi | ¬spam)

  36. Slide from Dan Klein

  37. Slide from Dan Klein

  38. Percentage of documents in training set labeled as spam/ham Slide from Dan Klein

  39. In the documents labeled as spam, occurrence percentage of each word (e.g. # times “the” occurred/# total words). Slide from Dan Klein

  40. In the documents labeled as ham, occurrence percentage of each word (e.g. # times “the” occurred/# total words). Slide from Dan Klein

  41. Parameter estimation • Model parameters: feature likelihoods P(word | spam) and P(word | ¬spam) and priors P(spam) and P(¬spam) • How do we obtain the values of these parameters? • Need training set of labeled samples from both classes • This is the maximum likelihood (ML) estimate, or estimate that maximizes the likelihood of the training data: P(word | spam) = (# of word occurrences in spam messages) / (total # of words in spam messages) (d: index of training document, i: index of a word)

  42. Parameter estimation • Parameter estimate: P(word | spam) = (# of word occurrences in spam messages) / (total # of words in spam messages) • Parameter smoothing: dealing with words that were never seen or seen too few times • Laplacian smoothing: pretend you have seen every vocabulary word one more time than you actually did: P(word | spam) = (# of word occurrences in spam messages + 1) / (total # of words in spam messages + V), where V is the total number of unique words
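
A sketch of estimating these parameters from a toy labeled corpus, using the +1 / +V Laplacian smoothing described above (the messages, labels, and helper names are invented):

```python
from collections import Counter

# Toy labeled training data (invented for illustration).
train = [
    ("spam", "free money now".split()),
    ("spam", "win money free".split()),
    ("ham",  "meeting at noon".split()),
    ("ham",  "free to meet at noon".split()),
]

vocab = {w for _, words in train for w in words}
V = len(vocab)

word_counts = {"spam": Counter(), "ham": Counter()}
doc_counts = Counter()
for label, words in train:
    word_counts[label].update(words)
    doc_counts[label] += 1

# Priors: fraction of training documents with each label.
prior = {c: doc_counts[c] / len(train) for c in doc_counts}

# Laplace-smoothed likelihoods: (count + 1) / (total words in class + V).
def p_word_given_class(word, c):
    total = sum(word_counts[c].values())
    return (word_counts[c][word] + 1) / (total + V)

print(prior["spam"], p_word_given_class("free", "spam"))
```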

  43. Summary of model and parameters • Naïve Bayes model: P(class | w1, …, wn) ∝ P(class) ∏i P(wi | class) • Model parameters: • Prior: P(spam), P(¬spam) • Likelihood of spam: P(w1 | spam), P(w2 | spam), …, P(wn | spam) • Likelihood of ¬spam: P(w1 | ¬spam), P(w2 | ¬spam), …, P(wn | ¬spam)

  44. Classification • The class that maximizes: P(class) ∏i P(wi | class)

  45. Classification • In practice • Multiplying lots of small probabilities can result in floating point underflow

  46. Classification • In practice • Multiplying lots of small probabilities can result in floating point underflow • Since log(xy) = log(x) + log(y), we can sum log probabilities instead of multiplying probabilities.

  47. Classification • In practice • Multiplying lots of small probabilities can result in floating point underflow • Since log(xy) = log(x) + log(y), we can sum log probabilities instead of multiplying probabilities. • Since log is a monotonic function, the class with the highest score does not change.

  48. Classification • In practice • Multiplying lots of small probabilities can result in floating point underflow • Since log(xy) = log(x) + log(y), we can sum log probabilities instead of multiplying probabilities. • Since log is a monotonic function, the class with the highest score does not change. • So, what we usually compute in practice is: argmax_class [ log P(class) + Σi log P(wi | class) ]
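
A sketch of this log-space decision rule; the priors and smoothed word probabilities below are invented stand-ins for values a trained filter would produce (e.g. via the training sketch above):

```python
import math

# Toy smoothed parameters (values invented for illustration).
prior = {"spam": 0.5, "ham": 0.5}
p_word = {
    "spam": {"free": 0.20, "money": 0.20, "meeting": 0.07},
    "ham":  {"free": 0.12, "money": 0.06, "meeting": 0.18},
}

def classify(words):
    # Sum log probabilities instead of multiplying, to avoid floating point underflow.
    scores = {}
    for c in prior:
        scores[c] = math.log(prior[c]) + sum(math.log(p_word[c][w]) for w in words)
    return max(scores, key=scores.get), scores

print(classify(["free", "money", "free", "money"]))   # highest log-space score wins
```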

  49. Naïve Bayes on images

  50. Visual Categorization with Bags of Keypoints Gabriella Csurka, Christopher R. Dance, Lixin Fan, Jutta Willamowski, Cédric Bray
