Bayesian Classification Dr. Navneet Goyal BITS, Pilani


Presentation Transcript


  1. Bayesian Classification. Dr. Navneet Goyal, BITS, Pilani

  2. Bayesian Classification • What are Bayesian Classifiers? • Statistical Classifiers • Predict class membership probabilities • Based on Bayes Theorem • Naïve Bayesian Classifier • Computationally simple • Comparable performance with decision tree (DT) and neural network (NN) classifiers

  3. Bayesian Classification • Probabilistic learning: calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems • Incremental: each training example can incrementally increase/decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.

  4. Bayes Theorem • Let X be a data sample whose class label is unknown • Let H be some hypothesis that X belongs to a class C • For classification, determine P(H|X) • P(H|X) is the probability that H holds given the observed data sample X • P(H|X) is the posterior probability

  5. Bayes Theorem Example: Sample space: all fruits. X is “round” and “red”; H = hypothesis that X is an apple. P(H|X) is our confidence that X is an apple given that X is “round” and “red” • P(H) is the prior probability of H, i.e., the probability that any given data sample is an apple, regardless of how it looks • P(H|X) is based on more information • Note that P(H) is independent of X

  6. Bayes Theorem Example: Sample space: all fruits • P(X|H)? • It is the probability that X is round and red given that we know that X is an apple • Here P(X) is the prior probability = P(a data sample from our set of fruits is red and round)

  7. Estimating Probabilities • P(X), P(H), and P(X|H) may be estimated from the given data • Bayes Theorem: P(H|X) = P(X|H) P(H) / P(X) • Use of Bayes Theorem in the Naïve Bayesian Classifier!!
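
A minimal sketch of this calculation in Python; the fruit-basket numbers below are hypothetical placeholders, not taken from the slides.

```python
# Bayes theorem: P(H|X) = P(X|H) * P(H) / P(X)
def posterior(prior_h, likelihood_x_given_h, evidence_x):
    """Return P(H|X) given P(H), P(X|H), and P(X)."""
    return likelihood_x_given_h * prior_h / evidence_x

# Hypothetical numbers: 30% of fruits are apples, 80% of apples are round and red,
# and 40% of all fruits are round and red.
print(posterior(prior_h=0.30, likelihood_x_given_h=0.80, evidence_x=0.40))  # 0.6
```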

  8. Naïve Bayesian Classification • Also called the Simple BC • Why Naïve/Simple?? • Class Conditional Independence • The effect of an attribute's value on a given class is independent of the values of the other attributes • This assumption simplifies computations

  9. Naïve Bayesian Classification Steps Involved • Each data sample is of the type X = (x1, x2, …, xn), where xi is the value of X for attribute Ai, i = 1, …, n • Suppose there are m classes Ci, i = 1, …, m. X is assigned to Ci iff P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i, i.e., the BC assigns X to the class Ci having the highest posterior probability conditioned on X

  10. Naïve Bayesian Classification The class Ci for which P(Ci|X) is maximized is called the maximum posterior hypothesis. From Bayes Theorem, P(Ci|X) = P(X|Ci) P(Ci) / P(X) • P(X) is constant, so only P(X|Ci) P(Ci) need be maximized • If the class prior probabilities are not known, assume all classes to be equally likely • Otherwise estimate P(Ci) = si/s, where si is the number of training samples of class Ci and s is the total number of training samples. Problem: computing P(X|Ci) directly is infeasible! (find out how you would compute it and why it is infeasible)

  11. Naïve Bayesian Classification • Naïve assumption: attribute independence, so P(x1, …, xn|Ci) = ∏k P(xk|Ci) • In order to classify an unknown sample X, evaluate P(X|Ci) P(Ci) for each class Ci. Sample X is assigned to the class Ci iff P(X|Ci) P(Ci) > P(X|Cj) P(Cj) for 1 ≤ j ≤ m, j ≠ i
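
A minimal Python sketch of this decision rule, assuming the priors P(Ci) and the per-attribute conditionals P(xk|Ci) have already been estimated from the training data (the dictionary layout is an illustrative assumption, not from the slides).

```python
import math

def classify(x, priors, conditionals):
    """Return the class Ci maximizing P(Ci) * prod_k P(x_k | Ci).

    x:            dict mapping attribute name -> observed value
    priors:       dict mapping class -> P(Ci)
    conditionals: dict mapping class -> attribute -> value -> P(value | Ci)
    """
    best_class, best_score = None, -math.inf
    for c, prior in priors.items():
        # Summing logs avoids underflow when many small probabilities are multiplied.
        score = math.log(prior) + sum(math.log(conditionals[c][a][v]) for a, v in x.items())
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```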

  12. Naïve Bayesian Classification EXAMPLE

  13. Naïve Bayesian Classification EXAMPLE X = (age <= 30, income = MEDIUM, student = Y, credit_rating = FAIR, buys_comp = ???) We need to maximize P(X|Ci) P(Ci) for i = 1, 2. P(Ci) is computed from the training sample: P(buys_comp=Y) = 9/14 = 0.643, P(buys_comp=N) = 5/14 = 0.357. How to calculate P(X|Ci) P(Ci) for i = 1, 2? P(X|Ci) = P(x1, x2, x3, x4|Ci) = ∏k P(xk|Ci)

  14. Naïve Bayesian Classification EXAMPLE P(age<=30 | buys_comp=Y)=2/9=0.222 P(age<=30 | buys_comp=N)=3/5=0.600 P(income=medium | buys_comp=Y)=4/9=0.444 P(income=medium | buys_comp=N)=2/5=0.400 P(student=Y | buys_comp=Y)=6/9=0.667 P(student=Y | buys_comp=N)=1/5=0.200 P(credit_rating=FAIR | buys_comp=Y)=6/9=0.667 P(credit_rating=FAIR | buys_comp=N)=2/5=0.400

  15. Naïve Bayesian Classification EXAMPLE P(X | buys_comp=Y) = 0.222*0.444*0.667*0.667 = 0.044 P(X | buys_comp=N) = 0.600*0.400*0.200*0.400 = 0.019 P(X | buys_comp=Y) P(buys_comp=Y) = 0.044*0.643 = 0.028 P(X | buys_comp=N) P(buys_comp=N) = 0.019*0.357 = 0.007 CONCLUSION: since 0.028 > 0.007, X is assigned to the class buys_comp = Y, i.e., X buys a computer
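
The same arithmetic, reproduced as a short Python check using the counts quoted on the slides:

```python
# X = (age<=30, income=medium, student=Y, credit_rating=FAIR); 9 Y / 5 N out of 14 tuples.
p_yes, p_no = 9 / 14, 5 / 14

likelihood_yes = (2/9) * (4/9) * (6/9) * (6/9)   # age, income, student, credit | buys_comp=Y
likelihood_no  = (3/5) * (2/5) * (1/5) * (2/5)   # age, income, student, credit | buys_comp=N

score_yes = likelihood_yes * p_yes   # ~0.028
score_no  = likelihood_no * p_no     # ~0.007
print("buys_comp =", "Y" if score_yes > score_no else "N")   # buys_comp = Y
```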

  16. Naïve Bayes Classifier: Issues • Probability values of ZERO! • Recall what you observed in WEKA! • If Ak is continuous valued! • Recall what you observed in WEKA! If there are no tuples in the training set with student = Y for the class buys_comp = N, then P(student=Y | buys_comp=N) = 0. Implications? Solution?

  17. Naïve Bayes Classifier: Issues • Laplacian Correction or Laplace Estimator • Philosophy – we assume that the training data set is so large that adding one to each count that we need would only make a negligible difference in the estimated prob. value. • Example: D (1000) • Class: buys_comp=Y income=low – zero tuples income=medium – 990 tuples income=high – 10 tuples Without Laplacian Correction the probs. are 0, 0.990, and 0.010 With Laplacian correction: 1/1003 = 0.001, 991/1003=0.988, and 11/1003=0.011 respectively.
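
A small Python sketch of the Laplace correction on the slide's example (1000 tuples of class buys_comp=Y with income counts low = 0, medium = 990, high = 10):

```python
counts = {"low": 0, "medium": 990, "high": 10}

def conditional_probs(counts, laplace=True):
    """Estimate P(income=v | buys_comp=Y), optionally adding 1 to every count."""
    k = 1 if laplace else 0
    total = sum(counts.values()) + k * len(counts)
    return {v: (c + k) / total for v, c in counts.items()}

print(conditional_probs(counts, laplace=False))  # {'low': 0.0, 'medium': 0.99, 'high': 0.01}
print(conditional_probs(counts, laplace=True))   # ~{'low': 0.001, 'medium': 0.988, 'high': 0.011}
```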

  18. Naïve Bayes Classifier: Issues • Continuous variables need more work than categorical attributes! • A continuous attribute is typically assumed to have a Gaussian distribution with a mean μ and a standard deviation σ, estimated from the training tuples of each class. • Do it yourself! And cross check with WEKA!
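
A minimal sketch of this in Python; the mean and standard deviation below are hypothetical placeholders that would normally be estimated from the training tuples of the class in question.

```python
import math

def gaussian_likelihood(x, mu, sigma):
    """Density of N(mu, sigma^2) at x, used in place of a categorical P(x | Ci)."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical figures: mean age 38, std. dev. 12 for one class.
print(gaussian_likelihood(x=30, mu=38, sigma=12))
```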

  19. Naïve Bayes (Summary) • Robust to isolated noise points • Handle missing values by ignoring the instance during probability estimate calculations • Robust to irrelevant attributes • Independence assumption may not hold for some attributes • Use other techniques such as Bayesian Belief Networks (BBN)

  20. Probability Calculations No. of attributes = 4 Distinct values = 3,3,3,3 No. of classes = 2 Total no. of probability calculations in NBC = 4*3*2 = 24 What if conditional independence were not assumed? O(k^p) probabilities for p k-valued attributes, multiplied by m classes.

  21. Bayesian Belief Networks • Naïve BC assumes Class Conditional Independence • This assumption simplifies computations • When this assumption holds true, the Naïve BC is the most accurate compared with all other classifiers • In real problems, dependencies do exist between variables • 2 methods to overcome this limitation of the NBC • Bayesian networks, which combine Bayesian reasoning with causal relationships between attributes • Decision trees, which reason on one attribute at a time, considering the ‘most important’ attributes first

  22. Bayesian Belief Networks (BBN) • Instead of requiring all attributes to be conditionally independent given the class, we specify which pairs of attributes are conditionally independent • A flexible way of modeling class conditional probabilities P(X|Y) • A BBN provides a graphical representation of the probabilistic relationships among a set of RVs • Directed Acyclic Graph (DAG) • Probability table • Consider RVs A, B, & C in which A & B are independent and each has a direct influence on C.

  23. Bayesian Networks • Parent-child relationship • Ancestor-descendant relationship • Fig. taken from Tan & Kumar, Introduction to Data Mining

  24. Bayesian Networks • Property 1 – Conditional Independence: a node in a BBN is conditionally independent of its non-descendants if its parents are known • Fig. taken from Tan & Kumar, Introduction to Data Mining

  25. Fig. taken from Tan & Kumar, Introduction to Data Mining

  26. Conditional Independence • Let X, Y, & Z denote three sets of random variables. The variables in X are said to be conditionally independent of Y, given Z, if P(X|Y,Z) = P(X|Z) • Relationship between a person's arm length and his/her reading skills!! • One might observe that people with longer arms tend to have higher levels of reading skills • How do you explain this relationship?

  27. Conditional Independence • Can be explained through a confounding factor, AGE • A young child tends to have short arms and lacks the reading skills of an adult • If the age of a person is fixed, then the observed relationship between arm length and reading skills disappears • We can thus conclude that arm length and reading skills are conditionally independent when the age variable is fixed: P(reading skills | long arms, age) = P(reading skills | age)

  28. Conditional Independence P(X,Y|Z) = P(X,Y,Z)/P(Z) = [P(X,Y,Z)/P(Y,Z)] × [P(Y,Z)/P(Z)] = P(X|Y,Z) × P(Y|Z) = P(X|Z) × P(Y|Z), where the last step uses the conditional independence of X and Y given Z. This explains the Naïve Bayes factorization: P(X|Ci) = P(x1, x2, x3, …, xn|Ci) = ∏k P(xk|Ci)

  29. Bayesian Belief Networks • Belief Networks • Bayesian Networks • Probabilistic Networks

  30. Bayesian Belief Networks • The Conditional Independence (CI) assumption made by the NBC may be too rigid • Especially for classification problems in which the attributes are somewhat correlated • We need a more flexible approach for modeling the class conditional probabilities P(X|Ci) = P(x1, x2, x3, …, xn|Ci) • Instead of requiring that all the attributes be CI given the class, a BBN allows us to specify which pairs of attributes are CI

  31. Bayesian Belief Networks • Belief Networks has 2 components • Directed Acyclic Graph (DAG) • Conditional Probability Table (CPT)

  32. Bayesian Belief Networks • A node in BBN is CI of its non-descendants, if its parents are known

  33. Bayesian Belief Networks Example network (nodes: FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, Dyspnea; figure not reproduced). The conditional probability table for the variable LungCancer:
              (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
       LC       0.8       0.5        0.7        0.1
      ~LC       0.2       0.5        0.3        0.9

  34. Bayesian Belief Networks • Six Boolean variables • Arcs allow representation of causal knowledge • Having lung cancer is influenced by family history and smoking • PositiveXRay is independent of whether the patient has a FH or whether he/she is a smoker, given that we know that the patient has lung cancer • Once we know the outcome of LungCancer, FH & Smoker do not provide any additional info. about PositiveXRay

  35. Bayesian Belief Networks CPT for LungCancer:
              (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
       LC       0.8       0.5        0.7        0.1
      ~LC       0.2       0.5        0.3        0.9
• LungCancer is CI of Emphysema, given its parents, FH & Smoker • A BBN has a Conditional Probability Table (CPT) for each variable in the DAG • The CPT for a variable Y specifies the conditional distribution P(Y|parents(Y)), e.g., P(LC=Y|FH=Y,S=Y) = 0.8 and P(LC=N|FH=N,S=N) = 0.9

  36. Bayesian Belief Networks • Let X = (x1, x2, …, xn) be a tuple described by variables or attributes Y1, Y2, …, Yn respectively • Each variable is CI of its non-descendants given its parents • This allows the DAG to provide a complete representation of the existing Joint Probability Distribution by: P(x1, x2, x3, …, xn) = ∏i P(xi|Parents(Yi)), where P(x1, x2, x3, …, xn) is the probability of a particular combination of values of X, and the values P(xi|Parents(Yi)) correspond to the entries in the CPT for Yi
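
A minimal Python sketch of this factorization on the FamilyHistory/Smoker/LungCancer fragment of the example network. The value P(LC=Y | FH=Y, S=Y) = 0.8 is quoted on the slides; the priors for FamilyHistory and Smoker are hypothetical placeholders.

```python
p_fh = {"Y": 0.2, "N": 0.8}                      # hypothetical P(FamilyHistory)
p_s = {"Y": 0.4, "N": 0.6}                       # hypothetical P(Smoker)
p_lc_given = {("Y", "Y"): 0.8, ("N", "N"): 0.1}  # P(LC=Y | FH, S), values quoted on the slides

def joint(fh, s, lc):
    """P(FH=fh, S=s, LC=lc) = P(FH) * P(S) * P(LC | FH, S); FH and S have no parents."""
    p_lc_y = p_lc_given[(fh, s)]
    return p_fh[fh] * p_s[s] * (p_lc_y if lc == "Y" else 1 - p_lc_y)

print(joint("Y", "Y", "Y"))   # 0.2 * 0.4 * 0.8 = 0.064
```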

  37. Bayesian Belief Networks • A node within the network can be selected as an ‘output’ node, representing a class label attribute • There may be more than one output node • Rather than returning a single class label, the classification process can return a probability distribution that gives the probability of each class • Training a BBN!!

  38. Training BBN • A number of scenarios are possible • The network topology may be given in advance or inferred from the data (how?) • Variables may be observable or hidden (missing or incomplete data) in all or some of the training tuples • Many algorithms exist for learning the network topology from the training data given observable variables [see references] • If the network topology is known and the variables are observable, training is straightforward (just compute the CPT entries and you are done)

  39. Training BBNs • Topology given, but some variables are hidden • Gradient Descent • Falls under the class of algorithms called Adaptive Probabilistic Networks • BBNs are computationally expensive • BBNs provide an explicit representation of causal structure • Domain experts can provide prior knowledge to the training process in the form of topology and/or conditional probability values • This leads to significant improvement in the learning process
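
A sketch of the gradient-ascent update usually cited for this setting (following Russell et al. 1995, see the references slide); here w_ijk denotes a CPT entry P(Y_i = y_ij | U_i = u_ik) and η a learning rate, both notations introduced only for illustration.

```latex
% Gradient-ascent step on \ln P(D|h) for one CPT entry w_{ijk} = P(Y_i = y_{ij} \mid U_i = u_{ik});
% after each step the entries are renormalized so that \sum_j w_{ijk} = 1 and 0 \le w_{ijk} \le 1.
w_{ijk} \leftarrow w_{ijk} + \eta \sum_{d \in D} \frac{P_h\big(Y_i = y_{ij},\, U_i = u_{ik} \mid d\big)}{w_{ijk}}
```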

  40. Training BBNs • BBNs are computationally expensive (compare with NBC) • BBNs provide explicit representation of Causal structure • Domain experts can provide prior knowledge to the training process in the form of topology and/or in conditional probability values • This leads to significant improvement in the learning process

  41. References for BBNs • Russell et al. (1995). Gradient Ascent Rule: the objective function that is maximized is P(D|h), the probability of the observed data D given the hypothesis h. P(D|h) is maximized by following the gradient of ln P(D|h) • Cooper & Herskovits (1992). K2, a heuristic-search algorithm for learning the structure of a BBN when the data are fully observable. Also presents a scoring metric for choosing among alternative networks. • Survey papers on learning Bayesian networks: • Heckerman (1995) • Buntine (1994)
