Presentation Transcript


  1. Bayesian Probability, Bayes' Rule, Naïve Bayesian Classification: Overview

  2. Probability • Let P(A) represent the probability that proposition A is true. • Example: Let Risky represent that a customer is a high credit risk. • P(Risky) = 0.519 means that there is a 51.9% chance that a given customer is a high credit risk. • Without any other information, this probability is called the prior or unconditional probability.

  3. Random Variables • Could also consider a random variable X, which can take on one of many values in its domain <x1, x2, …, xn> • Example: Let Weather be a random variable with domain <sunny, rain, cloudy, snow>. • The probabilities of Weather taking on these values are P(Weather=sunny)=0.7, P(Weather=rain)=0.2, P(Weather=cloudy)=0.08, P(Weather=snow)=0.02

  4. Conditional Probability • Probabilities of events change when we know something about the world • The notation P(A|B) is used to represent the conditional or posterior probability of A • Read “the probability of A given that all we know is B.” P(Weather = snow | Temperature = below freezing) = 0.10

  5. Axioms of Probability • All probabilities are between 0 and 1: 0 ≤ P(A) ≤ 1 • Necessarily true propositions have probability 1, necessarily false propositions have probability 0: P(true) = 1, P(false) = 0 • The probability of a disjunction is given by P(A ∨ B) = P(A) + P(B) - P(A ∧ B)

  6. Axioms of Probability • We can use logical connectives for probabilities, e.g. P(Weather = snow ∧ Temperature = below freezing) • Can use disjunction (or) and negation (not) as well • The product rule: P(A ∧ B) = P(A|B)P(B) = P(B|A)P(A)

  7. Bayes' Theorem - 1 • Consider a Venn diagram in which the area of the rectangle is 1 and the area of each region gives the probability of the event(s) associated with that region • P(A|B) means "the probability of observing event A given that event B has already been observed", i.e. how much of the time that we see B do we also see A? (the ratio of the A ∧ B region to the B region) • P(A|B) = P(A ∧ B)/P(B), and also P(B|A) = P(A ∧ B)/P(A), therefore P(A|B) = P(B|A)P(A)/P(B) (Bayes formula for two events)
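
To make the two-event formula concrete, here is a minimal Python sketch of P(A|B) = P(B|A)·P(A)/P(B). The weather numbers are assumptions (not given on the slides), chosen only so the result matches the 0.10 figure from slide 4.

```python
# Minimal sketch of the two-event Bayes formula: P(A|B) = P(B|A) * P(A) / P(B).

def bayes(p_b_given_a: float, p_a: float, p_b: float) -> float:
    """Return P(A|B) from P(B|A), P(A) and P(B)."""
    return p_b_given_a * p_a / p_b

p_snow = 0.02                 # P(Weather = snow), from slide 3
p_freezing = 0.20             # P(Temperature = below freezing), assumed
p_freezing_given_snow = 1.0   # assume it is always below freezing when it snows

print(bayes(p_freezing_given_snow, p_snow, p_freezing))   # ~0.10
```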

  8. Bayes' Theorem - 2 • More formally: • Let X be the sample data (evidence) • Let H be a hypothesis that X belongs to class C • In classification problems we wish to determine the probability that H holds given the observed sample data X • i.e. we seek P(H|X), which is known as the posterior probability of H conditioned on X

  9. Bayes' Theorem - 3 • P(H) is the prior probability of H • Similarly, P(X|H) is the posterior probability of X conditioned on H • Bayes' Theorem (from the earlier slide) is then P(H|X) = P(X|H)·P(H) / P(X)

  10. Chapter 7. Classification and Prediction • What is classification? What is prediction? • Issues regarding classification and prediction • Classification by decision tree induction • Bayesian Classification • Classification by backpropagation • Classification based on concepts from association rule mining • Other Classification Methods • Prediction • Classification accuracy • Summary

  11. Bayesian Classification: Why? • A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities • Foundation: Based on Bayes' Theorem. • Performance: A simple Bayesian classifier, the naïve Bayesian classifier, has comparable performance with decision tree and selected neural network classifiers • Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data • Standard: Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured

  12. Bayes' Theorem • Given training data D, the posterior probability of a hypothesis h, P(h|D), follows Bayes' theorem: P(h|D) = P(D|h)·P(h) / P(D) • MAP (maximum a posteriori) hypothesis: h_MAP = argmax_h P(h|D) = argmax_h P(D|h)·P(h) • Predicts that X belongs to Ci iff the probability P(Ci|X) is the highest among all the P(Ck|X) for the k classes • Practical difficulty: requires initial knowledge of many probabilities, and significant computational cost
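
A minimal sketch of the MAP rule above: score each hypothesis by P(D|h)·P(h) and take the argmax. The priors and likelihoods here are made-up placeholders, not values from the slides.

```python
# Sketch of the MAP decision rule: choose the hypothesis h maximizing P(D|h) * P(h).

priors = {"h1": 0.6, "h2": 0.4}          # P(h), assumed
likelihoods = {"h1": 0.02, "h2": 0.05}   # P(D|h), assumed

h_map = max(priors, key=lambda h: likelihoods[h] * priors[h])
print(h_map)   # 'h2', since 0.05 * 0.4 = 0.020 > 0.02 * 0.6 = 0.012
```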

  13. Towards Naïve Bayesian Classifier • Let D be a training set of tuples and their associated class labels, where each tuple is represented by an n-dimensional attribute vector X = (x1, x2, …, xn) • Suppose there are m classes C1, C2, …, Cm. • Classification is to derive the maximum a posteriori probability, i.e., the maximal P(Ci|X). This can be derived from Bayes' theorem: P(Ci|X) = P(X|Ci)·P(Ci) / P(X) • Since P(X) is constant for all classes, only P(X|Ci)·P(Ci) needs to be maximized • Once the probability P(X|Ci) is known, assign X to the class with maximum P(X|Ci)·P(Ci)

  14. Derivation of Naïve Bayes Classifier • A simplifying assumption: attributes are conditionally independent given the class (i.e., no dependence relation between attributes): P(X|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci) • This greatly reduces the computation cost: only counts the class distribution • If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having value xk for Ak divided by |Ci,D| (# of tuples of Ci in D) • If Ak is continuous-valued, P(xk|Ci) is usually computed from a Gaussian distribution with mean μ and standard deviation σ, i.e. P(xk|Ci) = g(xk, μCi, σCi), where g(x, μ, σ) = (1 / (√(2π)·σ)) · exp(-(x - μ)² / (2σ²))
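
The sketch below shows the two estimation cases from this slide in Python: a count-based estimate of P(xk|Ci) for a categorical attribute and the Gaussian density g(x, μ, σ) for a continuous one. The attribute names and numbers are illustrative assumptions, not taken from the slides.

```python
import math

# Two kinds of class-conditional estimates used by naive Bayes:
# counts for categorical attributes, a per-class Gaussian for continuous ones.

def gaussian(x: float, mu: float, sigma: float) -> float:
    """g(x, mu, sigma) = exp(-(x - mu)^2 / (2 sigma^2)) / (sqrt(2 pi) sigma)."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def categorical_prob(value, class_values):
    """P(xk|Ci): count of tuples in Ci having this value, divided by |Ci,D|."""
    return class_values.count(value) / len(class_values)

# Continuous attribute, e.g. P(age = 35 | Ci) with an assumed per-class fit
# of mean 38 and standard deviation 12:
print(gaussian(35.0, 38.0, 12.0))

# Categorical attribute, e.g. P(student = yes | Ci) from assumed class values:
print(categorical_prob("yes", ["yes", "yes", "no", "yes"]))   # 0.75
```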

  15. Bayesian classification • The classification problem may be formalized using a-posteriori probabilities: • P(C|X) = probability that the sample tuple X = <x1, …, xk> is of class C. • E.g. P(class=N | outlook=sunny, windy=true, …) • Idea: assign to sample X the class label C such that P(C|X) is maximal

  16. Play-tennis example: estimating P(xi|C) • (This slide shows the 14-tuple play-tennis training data and the conditional probabilities P(xi|C) estimated from it by counting; those values are used in the calculation on the next slide.)

  17. Play-tennis example: classifying X • An unseen sample X = <rain, hot, high, false> • P(X|p)·P(p) = P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p) = 3/9·2/9·3/9·6/9·9/14 = 0.010582 • P(X|n)·P(n) = P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n) = 2/5·2/5·4/5·2/5·5/14 = 0.018286 • Sample X is classified in class n (don’t play)
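
As a check, this short Python snippet reproduces the arithmetic above using the conditional probabilities listed on the slide (a sketch; the variable names are mine).

```python
from math import prod

# Reproduces the slide-17 calculation for X = <rain, hot, high, false>.

p_play, p_dont = 9 / 14, 5 / 14
cond_play = [3 / 9, 2 / 9, 3 / 9, 6 / 9]   # P(rain|p), P(hot|p), P(high|p), P(false|p)
cond_dont = [2 / 5, 2 / 5, 4 / 5, 2 / 5]   # P(rain|n), P(hot|n), P(high|n), P(false|n)

score_play = prod(cond_play) * p_play   # ~0.010582
score_dont = prod(cond_dont) * p_dont   # ~0.018286
print("don't play" if score_dont > score_play else "play")   # don't play
```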

  18. Training dataset • Class: C1: buys_computer = 'yes', C2: buys_computer = 'no' • Data sample X = (age<=30, income=medium, student=yes, credit_rating=fair) • (The slide shows the 14-tuple buys_computer training data from which the counts on the next slide are taken.)

  19. Naïve Bayesian Classifier: Example • Compute P(X|Ci) for each class:
  P(age="<=30" | buys_computer="yes") = 2/9 = 0.222    P(age="<=30" | buys_computer="no") = 3/5 = 0.6
  P(income="medium" | buys_computer="yes") = 4/9 = 0.444    P(income="medium" | buys_computer="no") = 2/5 = 0.4
  P(student="yes" | buys_computer="yes") = 6/9 = 0.667    P(student="yes" | buys_computer="no") = 1/5 = 0.2
  P(credit_rating="fair" | buys_computer="yes") = 6/9 = 0.667    P(credit_rating="fair" | buys_computer="no") = 2/5 = 0.4
  • X = (age<=30, income=medium, student=yes, credit_rating=fair)
  • P(X|Ci): P(X|buys_computer="yes") = 0.222 × 0.444 × 0.667 × 0.667 = 0.044    P(X|buys_computer="no") = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
  • P(X|Ci)·P(Ci), with P(buys_computer="yes") = 9/14 and P(buys_computer="no") = 5/14:
  P(X|buys_computer="yes") · P(buys_computer="yes") = 0.028    P(X|buys_computer="no") · P(buys_computer="no") = 0.007
  • Therefore, X belongs to class "buys_computer = yes"

  20. Example 3 • Take the following training data, from bank loan applicants:
  ApplicantID  City    Children  Income  Status
  1            Philly  Many      Medium  DEFAULTS
  2            Philly  Many      Low     DEFAULTS
  3            Philly  Few       Medium  PAYS
  4            Philly  Few       High    PAYS
  • As our attributes are all categorical in this case, we obtain our probabilities using simple counts and ratios: • P[City=Philly | Status = DEFAULTS] = 2/2 = 1 • P[City=Philly | Status = PAYS] = 2/2 = 1 • P[Children=Many | Status = DEFAULTS] = 2/2 = 1 • P[Children=Few | Status = DEFAULTS] = 0/2 = 0 • etc.

  21. Example 3 • Summarizing, we have the conditional probabilities computed as above, plus the class probabilities: P[Status = DEFAULTS] = 2/4 = 0.5 and P[Status = PAYS] = 2/4 = 0.5 • For example, the probability of Income=Medium given that the applicant DEFAULTS = the number of applicants with Income=Medium who DEFAULT divided by the number of applicants who DEFAULT = 1/2 = 0.5

  22. Example 3 • Now, assume a new example is presented where City=Philly, Children=Many, and Income=Medium. • First, we estimate the likelihood that the example is a defaulter, given its attribute values: P[H1|E] ∝ P[E|H1]·P[H1] (denominator omitted*)
  P[Status = DEFAULTS | Philly, Many, Medium] = P[Philly|DEFAULTS] × P[Many|DEFAULTS] × P[Medium|DEFAULTS] × P[DEFAULTS] = 1 × 1 × 0.5 × 0.5 = 0.25
  • Then we estimate the likelihood that the example is a payer, given its attributes: P[H2|E] ∝ P[E|H2]·P[H2] (denominator omitted*)
  P[Status = PAYS | Philly, Many, Medium] = P[Philly|PAYS] × P[Many|PAYS] × P[Medium|PAYS] × P[PAYS] = 1 × 0 × 0.5 × 0.5 = 0
  • As the conditional likelihood of being a defaulter is higher (0.25 > 0), we conclude that the new example is a defaulter.
  *Note: We haven't divided by P[Philly, Many, Medium] in the calculations above: it is applied to both likelihoods, so it doesn't affect which of the two is higher, and hence doesn't affect our result.
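
The whole of Example 3 can be reproduced in a few lines of Python: the sketch below recomputes the counts-and-ratios probabilities from the four training rows of slide 20 and scores each class by P[E|H]·P[H] with the denominator omitted, matching the 0.25 and 0 above. The function and variable names are mine, not from the slides.

```python
# Recomputes Example 3 from the four training rows on slide 20, scoring each
# class by P[E|H] * P[H] with the Bayes denominator omitted (as on the slide).

data = [
    {"City": "Philly", "Children": "Many", "Income": "Medium", "Status": "DEFAULTS"},
    {"City": "Philly", "Children": "Many", "Income": "Low",    "Status": "DEFAULTS"},
    {"City": "Philly", "Children": "Few",  "Income": "Medium", "Status": "PAYS"},
    {"City": "Philly", "Children": "Few",  "Income": "High",   "Status": "PAYS"},
]

def score(example, label):
    """Naive Bayes numerator: product of per-attribute ratios times the class prior."""
    rows = [r for r in data if r["Status"] == label]
    prior = len(rows) / len(data)
    likelihood = 1.0
    for attr, value in example.items():
        likelihood *= sum(r[attr] == value for r in rows) / len(rows)
    return likelihood * prior

new_example = {"City": "Philly", "Children": "Many", "Income": "Medium"}
print(score(new_example, "DEFAULTS"))   # 1 * 1 * 0.5 * 0.5 = 0.25
print(score(new_example, "PAYS"))       # 1 * 0 * 0.5 * 0.5 = 0.0
```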

  23. Example 3 • Now, assume a new example is presented where City=Philly, Children=Many, and Income=High. • First, we estimate the likelihood that the example is a defaulter, given its attribute values:
  P[Status = DEFAULTS | Philly, Many, High] = P[Philly|DEFAULTS] × P[Many|DEFAULTS] × P[High|DEFAULTS] × P[DEFAULTS] = 1 × 1 × 0 × 0.5 = 0
  • Then we estimate the likelihood that the example is a payer, given its attributes:
  P[Status = PAYS | Philly, Many, High] = P[Philly|PAYS] × P[Many|PAYS] × P[High|PAYS] × P[PAYS] = 1 × 0 × 0 × 0.5 = 0
  • As the conditional likelihood of being a defaulter is the same as that of being a payer (both 0), we can come to no conclusion for this example.

  24. Example 4 • Take the following training data, for credit card authorizations (source: adapted from Dunham):
  TransactionID  Income     Credit     Decision
  1              Very High  Excellent  AUTHORIZE
  2              High       Good       AUTHORIZE
  3              Medium     Excellent  AUTHORIZE
  4              High       Good       AUTHORIZE
  5              Very High  Good       AUTHORIZE
  6              Medium     Excellent  AUTHORIZE
  7              High       Bad        REQUEST ID
  8              Medium     Bad        REQUEST ID
  9              High       Bad        REJECT
  10             Low        Bad        CALL POLICE
  • Assume we'd like to determine how to classify a new transaction, with Income=Medium and Credit=Good.

  25. Example 4 • Our conditional probabilities are again obtained by simple counts and ratios within each class (the ones needed for the new transaction appear in the calculation on the next slide). • Our class probabilities are: P[Decision = AUTHORIZE] = 6/10, P[Decision = REQUEST ID] = 2/10, P[Decision = REJECT] = 1/10, P[Decision = CALL POLICE] = 1/10

  26. Example 4 • Our goal is now to work out, for each class, the conditional probability of the new transaction (with Income=Medium & Credit=Good) being in that class. The class with the highest probability is the classification we choose. • Our conditional probabilities (again, ignoring Bayes's denominator) are:
  P[Decision = AUTHORIZE | Income=Medium & Credit=Good] = P[Income=Medium|Decision=AUTHORIZE] × P[Credit=Good|Decision=AUTHORIZE] × P[Decision=AUTHORIZE] = 2/6 × 3/6 × 6/10 = 36/360 = 0.1
  P[Decision = REQUEST ID | Income=Medium & Credit=Good] = P[Income=Medium|Decision=REQUEST ID] × P[Credit=Good|Decision=REQUEST ID] × P[Decision=REQUEST ID] = 1/2 × 0/2 × 2/10 = 0
  P[Decision = REJECT | Income=Medium & Credit=Good] = P[Income=Medium|Decision=REJECT] × P[Credit=Good|Decision=REJECT] × P[Decision=REJECT] = 0/1 × 0/1 × 1/10 = 0
  P[Decision = CALL POLICE | Income=Medium & Credit=Good] = P[Income=Medium|Decision=CALL POLICE] × P[Credit=Good|Decision=CALL POLICE] × P[Decision=CALL POLICE] = 0/1 × 0/1 × 1/10 = 0
  • The highest of these probabilities is the first, so we conclude that the decision for our new transaction should be AUTHORIZE.

  27. Example 5 • (This slide shows the loan-risk training data: 14 tuples with attributes Credit History, Debt, Collateral, and Income, and the class label Risk, taking the values low, moderate, and high.)

  28. Example 5 • Let D = <unknown, low, none, 15-35>, i.e. Credit History = unknown, Debt = low, Collateral = none, Income = 15-35. • Which risk category is D in? • Three hypotheses: Risk=low, Risk=moderate, Risk=high • Because of the naïve independence assumption, calculate the individual probabilities and then multiply them together.

  29. Example 5 • Conditional probabilities, estimated by counting:
  P(CH=unknown | Risk=low) = 2/5    P(CH=unknown | Risk=moderate) = 1/3    P(CH=unknown | Risk=high) = 2/6
  P(Debt=low | Risk=low) = 3/5    P(Debt=low | Risk=moderate) = 1/3    P(Debt=low | Risk=high) = 2/6
  P(Coll=none | Risk=low) = 3/5    P(Coll=none | Risk=moderate) = 2/3    P(Coll=none | Risk=high) = 6/6
  P(Inc=15-35 | Risk=low) = 0/5    P(Inc=15-35 | Risk=moderate) = 2/3    P(Inc=15-35 | Risk=high) = 2/6
  • Class priors: P(Risk=low) = 5/14, P(Risk=moderate) = 3/14, P(Risk=high) = 6/14
  • P(D|Risk=low) = 2/5 × 3/5 × 3/5 × 0/5 = 0
  P(D|Risk=moderate) = 1/3 × 1/3 × 2/3 × 2/3 = 4/81 = 0.0494
  P(D|Risk=high) = 2/6 × 2/6 × 6/6 × 2/6 = 48/1296 = 0.037
  • P(D|Risk=low)·P(Risk=low) = 0 × 5/14 = 0
  P(D|Risk=moderate)·P(Risk=moderate) = 4/81 × 3/14 = 0.0106
  P(D|Risk=high)·P(Risk=high) = 48/1296 × 6/14 = 0.0159
  • The largest product is for Risk=high, so D is classified as a high credit risk.

  30. Avoiding the 0-Probability Problem • Naïve Bayesian prediction requires each conditional probability to be non-zero; otherwise, the predicted probability will be zero • Ex. Suppose a dataset with 1000 tuples: income=low (0), income=medium (990), and income=high (10) • Use the Laplacian correction (or Laplacian estimator): add 1 to each count (and hence 3, the number of distinct income values, to the denominator): Prob(income = low) = 1/1003, Prob(income = medium) = 991/1003, Prob(income = high) = 11/1003 • The "corrected" probability estimates are close to their "uncorrected" counterparts
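
A minimal Python sketch of the Laplacian correction on the income counts above: add 1 to each value's count and, correspondingly, the number of distinct values (3) to the denominator.

```python
# Laplacian correction on the slide-30 income counts.

counts = {"low": 0, "medium": 990, "high": 10}
total = sum(counts.values())                      # 1000 tuples

uncorrected = {v: c / total for v, c in counts.items()}
corrected = {v: (c + 1) / (total + len(counts)) for v, c in counts.items()}

print(uncorrected["low"])    # 0.0 -> would zero out the whole product P(X|Ci)
print(corrected["low"])      # 1/1003
print(corrected["medium"])   # 991/1003
print(corrected["high"])     # 11/1003
```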

  31. Naïve Bayesian Classifier: Comments • Advantages: • Easy to implement • Good results obtained in most of the cases • Disadvantages: • Assumption of class-conditional independence, therefore loss of accuracy • Practically, dependencies exist among variables • E.g., in hospital patient data: Profile (age, family history, etc.), Symptoms (fever, cough, etc.), Disease (lung cancer, diabetes, etc.) • Dependencies among these cannot be modeled by a Naïve Bayesian Classifier • How to deal with these dependencies? Bayesian Belief Networks

  32. The independence hypothesis… • … makes computation possible • … yields optimal classifiers when satisfied • … but is seldom satisfied in practice, as attributes (variables) are often correlated. • Attempts to overcome this limitation: • Bayesian networks, which combine Bayesian reasoning with causal relationships between attributes • Decision trees, which reason on one attribute at a time, considering the most important attributes first

  33. Bayesian Belief Networks • A Bayesian belief network allows conditional independencies to be defined between subsets of the variables • A graphical model of causal relationships • Represents dependency among the variables • Gives a specification of the joint probability distribution • Nodes: random variables • Links: dependency • In the example graph, X and Y are the parents of Z, and Y is the parent of P • There is no dependency between Z and P • The graph has no loops or cycles

  34. Bayesian Belief Network: An Example • The network contains the nodes FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, and Dyspnea; FamilyHistory and Smoker are the parents of LungCancer. • The conditional probability table (CPT) for the variable LungCancer (LC), given its parents FamilyHistory (FH) and Smoker (S):
          (FH, S)  (FH, ~S)  (~FH, S)  (~FH, ~S)
  LC      0.8      0.5       0.7       0.1
  ~LC     0.2      0.5       0.3       0.9
  • The CPT shows the conditional probability for each possible combination of values of its parents • Derivation of the probability of a particular combination of values x1, …, xn of X from the CPTs: P(x1, …, xn) = Π i=1..n P(xi | Parents(Yi))
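
To illustrate the factored joint P(x1, …, xn) = Π P(xi | Parents(Yi)), here is a small Python sketch restricted to FamilyHistory, Smoker, and LungCancer. Only the LungCancer CPT comes from the slide; the root-node priors (0.3 and 0.4) are assumed for illustration.

```python
# Factored joint for three variables: FamilyHistory (FH), Smoker (S), LungCancer (LC).

p_fh = 0.3   # P(FamilyHistory = true), assumed
p_s = 0.4    # P(Smoker = true), assumed

# CPT for LungCancer, indexed by (family_history, smoker), from the table above:
cpt_lc = {(True, True): 0.8, (True, False): 0.5, (False, True): 0.7, (False, False): 0.1}

def joint(fh: bool, s: bool, lc: bool) -> float:
    """P(FH, S, LC): FH and S are root nodes, LC is conditioned on both parents."""
    p = (p_fh if fh else 1 - p_fh) * (p_s if s else 1 - p_s)
    p_lc_true = cpt_lc[(fh, s)]
    return p * (p_lc_true if lc else 1 - p_lc_true)

print(joint(True, True, True))   # 0.3 * 0.4 * 0.8 = 0.096
```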

  35. Training Bayesian Networks • Several scenarios: • Given both the network structure and all variables observable: learn only the CPTs • Network structure known, some hidden variables: gradient descent (greedy hill-climbing) method, analogous to neural network learning • Network structure unknown, all variables observable: search through the model space to reconstruct network topology • Unknown structure, all hidden variables: No good algorithms known for this purpose • Ref. D. Heckerman: Bayesian networks for data mining
