M. Sulaiman Khan (mskhan@liv.ac.uk) ‏ Dept. of Computer Science University of Liverpool 2009

COMP527: Data Mining COMP527: Data Mining M. Sulaiman Khan • (mskhan@liv.ac.uk)‏ • Dept. of Computer Science • University of Liverpool • 2009 Classification: Bayes February 17, 2009 Slide 1

COMP527: Data Mining COMP527: Data Mining Introduction to the Course Introduction to Data Mining Introduction to Text Mining General Data Mining Issues Data Warehousing Classification: Challenges, Basics Classification: Rules Classification: Trees Classification: Trees 2 Classification: Bayes Classification: Neural Networks Classification: SVM Classification: Evaluation Classification: Evaluation 2 Regression, Prediction Input Preprocessing Attribute Selection Association Rule Mining ARM: A Priori and Data Structures ARM: Improvements ARM: Advanced Techniques Clustering: Challenges, Basics Clustering: Improvements Clustering: Advanced Algorithms Hybrid Approaches Graph Mining, Web Mining Text Mining: Challenges, Basics Text Mining: Text-as-Data Text Mining: Text-as-Language Revision for Exam Classification: Bayes February 17, 2009 Slide 2

Today's Topics COMP527: Data Mining Statistical Modeling Bayes Rule Naïve Bayes Fixes to Naïve Bayes Document classification Bayesian Networks • Structure • Learning Classification: Bayes February 17, 2009 Slide 3

Bayes Rule COMP527: Data Mining The probability of hypothesis H, given evidence E: • Pr[H|E] = PR[E|H]*Pr[H] / Pr[E] Pr[H] = A Priori probability of H (before evidence seen)‏ Pr[H|E] = A Posterioriprobability of H (after evidence seen) We want to use this in a classification system, so our goal is to find the most probable hypothesis (class) given the evidence (test instance). Classification: Bayes February 17, 2009 Slide 4

Example COMP527: Data Mining Meningitis causes a stiff neck 50% of the time. Meningitis occurs 1/50,000, stiff necks occur 1/20. Probability of Meningitis, given that the patient has a stiff neck: • Pr[H|E] = PR[E|H]*Pr[H] / Pr[E] • Pr[M|SN] = Pr[SN|M]*Pr[M]/Pr[SN] • = 0.5 * 1/50000 / 1/20 • = 0.0002 Classification: Bayes February 17, 2009 Slide 5

Bayes Rule COMP527: Data Mining Our evidence E is made up of different attributes A[1..n], so: Pr[H|E] = Pr[A1|H]*Pr[A2|H]...Pr[An|H]*Pr[H]/Pr[E] So we need to work out the probability of the individual attributes per class. Easy... Outlook=Sunny appears twice for yes out of 9 yes instances. We can work these out for all of our training instances... Classification: Bayes February 17, 2009 Slide 6

Weather Probabilities Outlook Temperature Humidity Windy Play Yes No Yes No Yes No Yes No Yes No Sunny 2 3 Hot 2 2 High 3 4 False 6 2 9 5 Overcast 4 0 Mild 4 2 Normal 6 1 True 3 3 Rainy 3 2 Cool 3 1 Sunny 2/9 3/5 Hot 2/9 2/5 High 3/9 4/5 False 6/9 2/5 9/14 5/14 Overcast 4/9 0/5 Mild 4/9 2/5 Normal 6/9 1/5 True 3/9 3/5 Rainy 3/9 2/5 Cool 3/9 1/5 COMP527: Data Mining Given a test instance (sunny, cool, high, true) play=yes: 2/9 * 3/9 * 3/9 * 9/14 = 0.0053 play=no: 3/5 * 1/5 * 4/5 * 3/5 * 5/14 = 0.0206 So we'd predict play=no for that particular instance. Classification: Bayes February 17, 2009 Slide 7

Weather Probabilities COMP527: Data Mining play=yes: 2/9 * 3/9 * 3/9 * 9/14 = 0.0053 play=no: 3/5 * 1/5 * 4/5 * 3/5 * 5/14 = 0.0206 This is the likelihood, not the probability. We need to normalise these. Prob(yes) = 0.0053 / 0.0053 + 0.0206 = 20.5% This is when the Pr[E] denominator disappears from Bayes's rule. Nice. Surely there's more to it than this... ? Classification: Bayes February 17, 2009 Slide 8

Naïve Bayes COMP527: Data Mining Issue: It's only valid to multiply probabilities when the events are independent of each other. It is “naïve” to assume independence between attributes in datasets, hence the name. Eg: The probability of Liverpool winning a football match is not independent of the probabilities for each member of the team scoring a goal. But even given that, Naïve Bayes is still very effective in practice, especially if we can eliminate redundant attributes before processing. Classification: Bayes February 17, 2009 Slide 9

Naïve Bayes COMP527: Data Mining Issue: If an attribute value does not co-occur with a class value, then the probability generated for it will be 0. Eg: Given outlook=overcast, the probability of play=no is 0/5. The other attributes will be ignored as the final result will be multiplied by 0. This is bad for our 4 attribute set, but horrific for (say) a 1000 attribute set. You can easily imagine a case where the likelihood for all classes is 0. Eg: 'Viagra' is always spam, 'data mining' is never spam. An email with both will be 0 for spam=yes and 0 for spam=no ... probability will be undefined ... uh oh! Classification: Bayes February 17, 2009 Slide 10

Laplace Estimator COMP527: Data Mining The trivial solution is of course to mess with the probabilities such that you never have 0s. We add 1 to the numerator and 3 to the denominator to compensate. So we end up with 1/8 instead of 0/5. No reason to use 3, could use 2 and 6. No reason to split equally... we could add weight to some attributes by giving them a larger share: • a+3/na+6 * b+2/nb+6 * c+1/nc+6 However, how to assign these is unclear. For reasonable training sets, simply initialise counts to 1 rather than 0. Classification: Bayes February 17, 2009 Slide 11

Missing Values COMP527: Data Mining Naïve Bayes deals well with missing values: Training: Ignore the instance for the attribute/class combination, but we can still use it for the known attributes. Classification: Ignore the attribute in the calculation as the difference will be normalised during the final step anyway. Classification: Bayes February 17, 2009 Slide 12

Numeric Values COMP527: Data Mining Naïve Bayes does not deal well with numeric values without some help. The probability of it being exactly 65 degrees is zero. We could discretize the attribute, but instead we'll calculate the mean and standard deviation and use a density function to predict the probability. mean: sum(values) / count(values)‏ variance: sum(square(value - mean)) / count(values)-1 standard deviation: square root of variance Mean for temperature is 73, Std. Deviation is 6.2 Classification: Bayes February 17, 2009 Slide 13

Numeric Values COMP527: Data Mining Density function: f(x) = Unless you've a math background, just plug the numbers in... At which point we get a likelihood of 0.034 Then we continue with this number as before. This assumes a reasonably normal distribution. Often not the case. 2 x- ( )‏ m 1 e 2 2 s 2 p s Classification: Bayes February 17, 2009 Slide 14

Document Classification COMP527: Data Mining The Bayesian model is often used to classify documents as it deals well with a huge number of attributes simultaneously. (eg boolean occurrence of words within the text)‏ But we may know how many times the word occurs. This leads to Multinomial Naive Bayes. Assumptions: 1. Probability of a word occurring in a document is independent of its location within the document. 2. The document length is not related to the class. Classification: Bayes February 17, 2009 Slide 15

Document Classification COMP527: Data Mining Pr[E|H] = N! * product(pn/n!)‏ N = number of words in document p = relative frequency of word in documents of class H n = number of occurrences of word in document So, if A has 75% and B has 25% frequency in class H Pr[“A A A”|H] = 3! * 0.753/3! * 0.250/0! = 27/64 = 0.422 Pr[“A A A B B”|H] = 5! * 0.753/3! * 0.252/2! = 0.264 Classification: Bayes February 17, 2009 Slide 16

Document Classification COMP527: Data Mining Pr[E|H] = N! * product(pn/n!)‏ We don't need to work out all the factorials, as they'll normalise out at the end. We still end up with insanely small numbers, as vocabularies are much much larger than 2 words. Instead we can sum the logarithms of the probabilities instead of multiplying them. Classification: Bayes February 17, 2009 Slide 17

Bayesian Networks COMP527: Data Mining Back to the attribute independence assumption. Can we get rid of it? Yes, with a Bayesian Network. Each attribute has a node in a Directed Acyclic Graph. Each node has a table of all attributes with edges pointing at the node linked against the probabilities for the attribute's values. Examples will be hopefully enlightening... Classification: Bayes February 17, 2009 Slide 18

Less Simple Network COMP527: Data Mining play yes no .633 .367 windy play outlook false true yes sunny .500 .500 yes overcast .500 .500 yes rainy .125 .875 no sunny .375 .625 no overcast .500 .500 no rainy .833 .167 Outlook playsunny overcast rainy yes .238 .429 .333 no .538 .077 .385 temperature play outlook hot mild cold yes sunny .238 .429 .333 yes overcast .385 .385 .231 yes rainy .111 .556 .333 no sunny .556 .333 .111 no overcast .333 .333 .333 no rainy .143 .429 .429 humidity play temp high normal yes hot .500 .500 yes mild .500 .500 yes cool .125 .875 no hot .833 .167 no mild .833 .167 no cool .250 .750 Classification: Bayes February 17, 2009 Slide 20

Bayesian Networks COMP527: Data Mining To use the network, simply step through each node and multiply the results in the table together for the instance's attributes' values. Or, more likely, sum the logarithms as with the multinomial case. Then, as before, normalise them to sum to 1. This works because the links between the nodes determine the probability distribution at the node. Using it seems straightforward. So all that remains is to find out the best network structure to use. Given a large number of attributes, there's a LARGE number of possible networks... Classification: Bayes February 17, 2009 Slide 21

Training Bayesian Networks COMP527: Data Mining We need two components: • Evaluate a network based on the dataAs always we need to find a system that measures the 'goodness' without overfitting (overfitting in this case = too many edges)We need a penalty for the complexity of the network. • Search through the space of possible networksAs we know the nodes, we need to find where the edges in the graph are. Which nodes connect to which other nodes? Classification: Bayes February 17, 2009 Slide 22

Training Bayesian Networks COMP527: Data Mining Following the Minimum Description Length ideal, networks with lots of edges will be more complex, and hence likely to over-fit. We could add a penalty for each cell in the nodes' tables. AIC: -LL +K MDL: -LL + K/2 log(N)‏ LL is total log-likelihood of the network and training set. eg Sum of log of probabilities for each instance in the data set. K is the number of cells in tables, minus the number of cells in the last row (which can be calculated, by 1- sum of other cells in row)‏ N is the number of instances in the data. Classification: Bayes February 17, 2009 Slide 23

Network Training: K2 COMP527: Data Mining K2: for each node, • for each previous node, add node, calculate worth continue when doesn't improve (Use MDL or AIC to determine worth)‏ The results of K2 depend on initial order selected to process the nodes in. Run it several times with different orders and select the best. Can help to ensure that the class attribute is first and links to all nodes (not a requirement)‏ Classification: Bayes February 17, 2009 Slide 24

Other Structures COMP527: Data Mining TAN: Tree Augmented Naive Bayes. Class attribute is only parent for each node in Naive Bayes. Start here and consider adding a second parent to each node. Bayesian Multinet: Build a separate network for each class and combine the values. Classification: Bayes February 17, 2009 Slide 25

Further Reading COMP527: Data Mining • Witten 4.2, 6.7 • Han 6.4 • Dunham 4.2 • Devijver and Kittler, Pattern Recognition: A Statistical Approach, Chapter 2 • Berry and Browne, Chapter 2 Classification: Bayes February 17, 2009 Slide 26

M. Sulaiman Khan (mskhan@liv.ac.uk) ‏ Dept. of Computer Science University of Liverpool 2009