Bayesian Classification Dr. Navneet Goyal BITS, Pilani


Presentation Transcript


  1. Bayesian Classification. Dr. Navneet Goyal, BITS, Pilani

  2. Bayesian Classification • What are Bayesian Classifiers? • Statistical Classifiers • Predict class membership probabilities • Based on Bayes Theorem • Naïve Bayesian Classifier • Computationally simple • Comparable performance with decision tree (DT) and neural network (NN) classifiers

  3. Bayesian Classification • Probabilistic learning: calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems • Incremental: each training example can incrementally increase/decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.

  4. Bayes Theorem • Let X be a data sample whose class label is unknown • Let H be some hypothesis that X belongs to a class C • For classification, determine P(H|X) • P(H|X) is the probability that H holds given the observed data sample X • P(H|X) is the posterior probability

  5. Bayes Theorem Example: Sample space: all fruits. X is “round” and “red”; H = hypothesis that X is an apple. P(H|X) is our confidence that X is an apple given that X is “round” and “red” • P(H) is the prior probability of H, i.e., the probability that any given data sample is an apple, regardless of how it looks • P(H|X) is based on more information • Note that P(H) is independent of X

  6. Bayes Theorem Example: Sample space: all fruits • P(X|H)? • It is the probability that X is round and red given that we know that X is an apple • Here P(X) is the prior probability = P(a data sample from our set of fruits is red and round)

  7. Estimating Probabilities • P(X), P(H), and P(X|H) may be estimated from the given data • Bayes Theorem: P(H|X) = P(X|H) P(H) / P(X) • Use of Bayes Theorem in the Naïve Bayesian Classifier!!
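
A minimal sketch of this calculation in Python; the fruit-basket numbers below are hypothetical placeholders, not taken from the slides.

```python
# Bayes theorem: P(H|X) = P(X|H) * P(H) / P(X)
def posterior(prior_h, likelihood_x_given_h, evidence_x):
    """Return P(H|X) given P(H), P(X|H), and P(X)."""
    return likelihood_x_given_h * prior_h / evidence_x

# Hypothetical numbers: 30% of fruits are apples, 80% of apples are round and red,
# and 40% of all fruits are round and red.
print(posterior(prior_h=0.30, likelihood_x_given_h=0.80, evidence_x=0.40))  # 0.6
```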

  8. Naïve Bayesian Classification • Also called the Simple BC • Why Naïve/Simple?? • Class Conditional Independence • The effect of an attribute's value on a given class is independent of the values of the other attributes • This assumption simplifies computations

  9. Naïve Bayesian Classification Steps Involved • Each data sample is of the type X = (x1, x2, …, xn), where xi is the value of X for attribute Ai, i = 1, …, n • Suppose there are m classes Ci, i = 1, …, m. X is assigned to Ci iff P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i, i.e., the BC assigns X to the class Ci having the highest posterior probability conditioned on X

  10. Naïve Bayesian Classification The class Ci for which P(Ci|X) is maximized is called the maximum posterior hypothesis. From Bayes Theorem, P(Ci|X) = P(X|Ci) P(Ci) / P(X) • P(X) is constant, so only P(X|Ci) P(Ci) need be maximized • If the class prior probabilities are not known, assume all classes to be equally likely • Otherwise estimate P(Ci) = si/s, where si is the number of training samples of class Ci and s is the total number of training samples. Problem: computing P(X|Ci) directly is infeasible! (find out how you would compute it and why it is infeasible)

  11. Naïve Bayesian Classification • Naïve assumption: attribute independence, so P(x1, …, xn|Ci) = ∏k P(xk|Ci) • In order to classify an unknown sample X, evaluate P(X|Ci) P(Ci) for each class Ci. Sample X is assigned to the class Ci iff P(X|Ci) P(Ci) > P(X|Cj) P(Cj) for 1 ≤ j ≤ m, j ≠ i
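
A minimal Python sketch of this decision rule, assuming the priors P(Ci) and the per-attribute conditionals P(xk|Ci) have already been estimated from the training data (the dictionary layout is an illustrative assumption, not from the slides).

```python
import math

def classify(x, priors, conditionals):
    """Return the class Ci maximizing P(Ci) * prod_k P(x_k | Ci).

    x:            dict mapping attribute name -> observed value
    priors:       dict mapping class -> P(Ci)
    conditionals: dict mapping class -> attribute -> value -> P(value | Ci)
    """
    best_class, best_score = None, -math.inf
    for c, prior in priors.items():
        # Summing logs avoids underflow when many small probabilities are multiplied.
        score = math.log(prior) + sum(math.log(conditionals[c][a][v]) for a, v in x.items())
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```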

  12. Naïve Bayesian Classification EXAMPLE

  13. Naïve Bayesian Classification EXAMPLE X = (age <= 30, income = MEDIUM, student = Y, credit_rating = FAIR, buys_comp = ???) We need to maximize P(X|Ci) P(Ci) for i = 1, 2. P(Ci) is computed from the training sample: P(buys_comp=Y) = 9/14 = 0.643, P(buys_comp=N) = 5/14 = 0.357. How to calculate P(X|Ci) P(Ci) for i = 1, 2? P(X|Ci) = P(x1, x2, x3, x4|Ci) = ∏k P(xk|Ci)

  14. Naïve Bayesian Classification EXAMPLE P(age<=30 | buys_comp=Y)=2/9=0.222 P(age<=30 | buys_comp=N)=3/5=0.600 P(income=medium | buys_comp=Y)=4/9=0.444 P(income=medium | buys_comp=N)=2/5=0.400 P(student=Y | buys_comp=Y)=6/9=0.667 P(student=Y | buys_comp=N)=1/5=0.200 P(credit_rating=FAIR | buys_comp=Y)=6/9=0.667 P(credit_rating=FAIR | buys_comp=N)=2/5=0.400

  15. Naïve Bayesian Classification EXAMPLE P(X | buys_comp=Y) = 0.222*0.444*0.667*0.667 = 0.044 P(X | buys_comp=N) = 0.600*0.400*0.200*0.400 = 0.019 P(X | buys_comp=Y) P(buys_comp=Y) = 0.044*0.643 = 0.028 P(X | buys_comp=N) P(buys_comp=N) = 0.019*0.357 = 0.007 CONCLUSION: since 0.028 > 0.007, X is assigned to the class buys_comp = Y, i.e., X buys a computer
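
The same arithmetic, reproduced as a short Python check using the counts quoted on the slides:

```python
# X = (age<=30, income=medium, student=Y, credit_rating=FAIR); 9 Y / 5 N out of 14 tuples.
p_yes, p_no = 9 / 14, 5 / 14

likelihood_yes = (2/9) * (4/9) * (6/9) * (6/9)   # age, income, student, credit | buys_comp=Y
likelihood_no  = (3/5) * (2/5) * (1/5) * (2/5)   # age, income, student, credit | buys_comp=N

score_yes = likelihood_yes * p_yes   # ~0.028
score_no  = likelihood_no * p_no     # ~0.007
print("buys_comp =", "Y" if score_yes > score_no else "N")   # buys_comp = Y
```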

  16. Naïve Bayes Classifier: Issues • Probability values of ZERO! • Recall what you observed in WEKA! • If Ak is continuous valued! • Recall what you observed in WEKA! If there are no tuples in the training set with student = Y for the class buys_comp = N, then P(student=Y | buys_comp=N) = 0. Implications? Solution?

  17. Naïve Bayes Classifier: Issues • Laplacian Correction or Laplace Estimator • Philosophy – we assume that the training data set is so large that adding one to each count that we need would only make a negligible difference in the estimated prob. value. • Example: D (1000) • Class: buys_comp=Y income=low – zero tuples income=medium – 990 tuples income=high – 10 tuples Without Laplacian Correction the probs. are 0, 0.990, and 0.010 With Laplacian correction: 1/1003 = 0.001, 991/1003=0.988, and 11/1003=0.011 respectively.
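
A small Python sketch of the Laplace correction on the slide's example (1000 tuples of class buys_comp=Y with income counts low = 0, medium = 990, high = 10):

```python
counts = {"low": 0, "medium": 990, "high": 10}

def conditional_probs(counts, laplace=True):
    """Estimate P(income=v | buys_comp=Y), optionally adding 1 to every count."""
    k = 1 if laplace else 0
    total = sum(counts.values()) + k * len(counts)
    return {v: (c + k) / total for v, c in counts.items()}

print(conditional_probs(counts, laplace=False))  # {'low': 0.0, 'medium': 0.99, 'high': 0.01}
print(conditional_probs(counts, laplace=True))   # ~{'low': 0.001, 'medium': 0.988, 'high': 0.011}
```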

  18. Naïve Bayes Classifier: Issues • Continuous variables need more work than categorical attributes! • A continuous attribute is typically assumed to have a Gaussian distribution with a mean μ and a standard deviation σ, estimated from the training tuples of each class. • Do it yourself! And cross check with WEKA!
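
A minimal sketch of this in Python; the mean and standard deviation below are hypothetical placeholders that would normally be estimated from the training tuples of the class in question.

```python
import math

def gaussian_likelihood(x, mu, sigma):
    """Density of N(mu, sigma^2) at x, used in place of a categorical P(x | Ci)."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical figures: mean age 38, std. dev. 12 for one class.
print(gaussian_likelihood(x=30, mu=38, sigma=12))
```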

  19. Naïve Bayes (Summary) • Robust to isolated noise points • Handle missing values by ignoring the instance during probability estimate calculations • Robust to irrelevant attributes • Independence assumption may not hold for some attributes • Use other techniques such as Bayesian Belief Networks (BBN)

  20. Probability Calculations No. of attributes = 4 Distinct values = 3,3,3,3 No. of classes = 2 Total no. of probability calculations in NBC = 4*3*2 = 24 What if conditional independence were not assumed? O(k^p) probabilities for p k-valued attributes, multiplied by m classes.

  21. Bayesian Belief Networks • Naïve BC assumes Class Conditional Independence • This assumption simplifies computations • When this assumption holds true, the Naïve BC is the most accurate compared with all other classifiers • In real problems, dependencies do exist between variables • 2 methods to overcome this limitation of the NBC • Bayesian networks, which combine Bayesian reasoning with causal relationships between attributes • Decision trees, which reason on one attribute at a time, considering the ‘most important’ attributes first

  22. Bayesian Belief Networks (BBN) • Instead of requiring all attributes to be conditionally independent given the class, we specify which pairs of attributes are conditionally independent • A flexible way of modeling class conditional probabilities P(X|Y) • A BBN provides a graphical representation of the probabilistic relationships among a set of RVs • Directed Acyclic Graph (DAG) • Probability table • Consider RVs A, B, & C in which A & B are independent and each has a direct influence on C.

  23. Bayesian Networks • Parent-child relationship • Ancestor-descendant relationship • Fig. taken from Tan & Kumar, Introduction to Data Mining

  24. Bayesian Networks • Property 1 – Conditional Independence: a node in a BBN is conditionally independent of its non-descendants if its parents are known • Fig. taken from Tan & Kumar, Introduction to Data Mining

  25. Fig. taken from Tan & Kumar, Introduction to Data Mining

  26. Conditional Independence • Let X, Y, & Z denote three sets of random variables. The variables in X are said to be conditionally independent of Y, given Z, if P(X|Y,Z) = P(X|Z) • Relationship between a person's arm length and his/her reading skills!! • One might observe that people with longer arms tend to have higher levels of reading skills • How do you explain this relationship?

  27. Conditional Independence • Can be explained through a confounding factor, AGE • A young child tends to have short arms and lacks the reading skills of an adult • If the age of a person is fixed, then the observed relationship between arm length and reading skills disappears • We can thus conclude that arm length and reading skills are conditionally independent when the age variable is fixed: P(reading skills | long arms, age) = P(reading skills | age)

  28. Conditional Independence P(X,Y|Z) = P(X,Y,Z)/P(Z) = [P(X,Y,Z)/P(Y,Z)] × [P(Y,Z)/P(Z)] = P(X|Y,Z) × P(Y|Z) = P(X|Z) × P(Y|Z), where the last step uses the conditional independence of X and Y given Z. This explains the Naïve Bayes factorization: P(X|Ci) = P(x1, x2, x3, …, xn|Ci) = ∏k P(xk|Ci)

  29. Bayesian Belief Networks • Belief Networks • Bayesian Networks • Probabilistic Networks

  30. Bayesian Belief Networks • The Conditional Independence (CI) assumption made by the NBC may be too rigid • Especially for classification problems in which the attributes are somewhat correlated • We need a more flexible approach for modeling the class conditional probabilities P(X|Ci) = P(x1, x2, x3, …, xn|Ci) • Instead of requiring that all the attributes be CI given the class, a BBN allows us to specify which pairs of attributes are CI

  31. Bayesian Belief Networks • Belief Networks has 2 components • Directed Acyclic Graph (DAG) • Conditional Probability Table (CPT)

  32. Bayesian Belief Networks • A node in BBN is CI of its non-descendants, if its parents are known

  33. Bayesian Belief Networks Example network (nodes: FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, Dyspnea; figure not reproduced). The conditional probability table for the variable LungCancer:
              (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
       LC       0.8       0.5        0.7        0.1
      ~LC       0.2       0.5        0.3        0.9

  34. Bayesian Belief Networks • Six Boolean variables • Arcs allow representation of causal knowledge • Having lung cancer is influenced by family history and smoking • PositiveXRay is independent of whether the patient has a FH or whether he/she is a smoker, given that we know that the patient has lung cancer • Once we know the outcome of LungCancer, FH & Smoker do not provide any additional info. about PositiveXRay

  35. Bayesian Belief Networks CPT for LungCancer:
              (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
       LC       0.8       0.5        0.7        0.1
      ~LC       0.2       0.5        0.3        0.9
• LungCancer is CI of Emphysema, given its parents, FH & Smoker • A BBN has a Conditional Probability Table (CPT) for each variable in the DAG • The CPT for a variable Y specifies the conditional distribution P(Y|parents(Y)), e.g., P(LC=Y|FH=Y,S=Y) = 0.8 and P(LC=N|FH=N,S=N) = 0.9

  36. Bayesian Belief Networks • Let X = (x1, x2, …, xn) be a tuple described by variables or attributes Y1, Y2, …, Yn respectively • Each variable is CI of its non-descendants given its parents • This allows the DAG to provide a complete representation of the existing Joint Probability Distribution by: P(x1, x2, x3, …, xn) = ∏i P(xi|Parents(Yi)), where P(x1, x2, x3, …, xn) is the probability of a particular combination of values of X, and the values P(xi|Parents(Yi)) correspond to the entries in the CPT for Yi
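
A minimal Python sketch of this factorization on the FamilyHistory/Smoker/LungCancer fragment of the example network. The value P(LC=Y | FH=Y, S=Y) = 0.8 is quoted on the slides; the priors for FamilyHistory and Smoker are hypothetical placeholders.

```python
p_fh = {"Y": 0.2, "N": 0.8}                      # hypothetical P(FamilyHistory)
p_s = {"Y": 0.4, "N": 0.6}                       # hypothetical P(Smoker)
p_lc_given = {("Y", "Y"): 0.8, ("N", "N"): 0.1}  # P(LC=Y | FH, S), values quoted on the slides

def joint(fh, s, lc):
    """P(FH=fh, S=s, LC=lc) = P(FH) * P(S) * P(LC | FH, S); FH and S have no parents."""
    p_lc_y = p_lc_given[(fh, s)]
    return p_fh[fh] * p_s[s] * (p_lc_y if lc == "Y" else 1 - p_lc_y)

print(joint("Y", "Y", "Y"))   # 0.2 * 0.4 * 0.8 = 0.064
```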

  37. Bayesian Belief Networks • A node within the network can be selected as an ‘output’ node, representing a class label attribute • There may be more than one output node • Rather than returning a single class label, the classification process can return a probability distribution that gives the probability of each class • Training a BBN!!

  38. Training BBN • A number of scenarios are possible • The network topology may be given in advance or inferred from the data (how?) • Variables may be observable or hidden (missing or incomplete data) in all or some of the training tuples • Many algorithms exist for learning the network topology from the training data given observable variables [see references] • If the network topology is known and the variables are observable, training is straightforward (just compute the CPT entries and you are done)

  39. Training BBNs • Topology given, but some variables are hidden • Gradient Descent • Falls under the class of algorithms called Adaptive Probabilistic Networks • BBNs are computationally expensive • BBNs provide an explicit representation of causal structure • Domain experts can provide prior knowledge to the training process in the form of topology and/or conditional probability values • This leads to significant improvement in the learning process
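
A sketch of the gradient-ascent update usually cited for this setting (following Russell et al. 1995, see the references slide); here w_ijk denotes a CPT entry P(Y_i = y_ij | U_i = u_ik) and η a learning rate, both notations introduced only for illustration.

```latex
% Gradient-ascent step on \ln P(D|h) for one CPT entry w_{ijk} = P(Y_i = y_{ij} \mid U_i = u_{ik});
% after each step the entries are renormalized so that \sum_j w_{ijk} = 1 and 0 \le w_{ijk} \le 1.
w_{ijk} \leftarrow w_{ijk} + \eta \sum_{d \in D} \frac{P_h\big(Y_i = y_{ij},\, U_i = u_{ik} \mid d\big)}{w_{ijk}}
```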

  40. Training BBNs • BBNs are computationally expensive (compare with NBC) • BBNs provide explicit representation of Causal structure • Domain experts can provide prior knowledge to the training process in the form of topology and/or in conditional probability values • This leads to significant improvement in the learning process

  41. References for BBNs • Russell et al. (1995). Gradient Ascent Rule: the objective function that is maximized is P(D|h), the probability of the observed data D given the hypothesis h. P(D|h) is maximized by following the gradient of ln P(D|h) • Cooper & Herskovits (1992). K2, a heuristic-search algorithm for learning the structure of a BBN when the data are fully observable. Also presents a scoring metric for choosing among alternative networks. • Survey papers on learning Bayesian networks: • Heckerman (1995) • Buntine (1994)
