
Naïve Bayes Classifier for Managers

Learn about the Naïve Bayes classifier, a system that categorizes instances based on their feature values. Discover how to use this classifier for business intelligence, and explore its benefits and limitations.


Presentation Transcript


  1. Naïve-Bayes Classifiers: Business Intelligence for Managers

  2. Classifier definition, revisited • A classifier is a system that categorizes instances • Inputs to a classifier: feature/attribute values of a given instance • Output of a classifier: predicted category for that instance • Classifier algorithm often based on a training data set of instances with known categories

  3. Classifiers • [Diagram: feature values X1, X2, X3, …, Xn feed into a Classifier, which outputs a category Y; the classifier draws on a DB, a collection of instances with known categories] • Example: X1 (motility) = “flies”, X2 (number of legs) = 2, X3 (height) = 6 in -> Y = “bird”

  4. Classifier Algorithms • K Nearest Neighbors (kNN) • Naïve-Bayes • Decision trees • Many others (support vector machines, neural networks, genetic algorithms, etc.)

  5. Classifier algorithm (approach 1) • Select all instances in the dataset that match the input tuple (X1,X2,…,Xn) of feature values • Determine the distribution of Y-values over those matches • Output the most common Y-value (a sketch follows below)
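
A minimal Python sketch of this approach (the dataset layout and all names are illustrative, not taken from the slides):

    from collections import Counter

    # Approach 1: scan the whole dataset at query time.
    # The dataset is assumed to be a list of (feature tuple, category) pairs.
    def classify_by_exact_match(dataset, query):
        # Keep only the instances whose feature tuple matches the query exactly.
        matches = [category for features, category in dataset if features == query]
        if not matches:
            return None  # no stored instance matches the input tuple
        # Distribution of Y-values among the matches; return the most common one.
        return Counter(matches).most_common(1)[0][0]

    # Toy data: (motility, number of legs) -> category
    data = [(("flies", 2), "bird"), (("flies", 2), "bird"), (("walks", 4), "mammal")]
    print(classify_by_exact_match(data, ("flies", 2)))  # bird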

  6. Problems with this approach • Classification time is proportional to the dataset size • Not practical if the dataset is huge

  7. Pre-computing distributions (approach 2) • What if we pre-compute all distributions for all possible tuples? • The classification process is then a simple matter of looking up the pre-computed distribution • The time-complexity burden shifts to the pre-computation stage, which is done only once • Still not practical unless the number of features is small • Suppose there are only two possible values per feature and there are n features -> 2^n possible combinations! (see the sketch below)
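
A hedged sketch of what this pre-computation could look like (the function name and toy data are illustrative):

    from collections import Counter
    from itertools import product

    # Approach 2: one table entry per possible feature-value combination.  The
    # table has len(values_1) * ... * len(values_n) entries, i.e. 2**n when
    # every feature is binary, so it explodes as n grows.
    def precompute_distributions(dataset, feature_values):
        table = {}
        for combo in product(*feature_values):
            # Distribution of categories among the instances matching this combination.
            table[combo] = Counter(cat for feats, cat in dataset if feats == combo)
        return table

    data = [(("flies", "small"), "bird"), (("walks", "large"), "mammal")]
    feature_values = [("flies", "walks", "swims"), ("small", "large")]
    table = precompute_distributions(data, feature_values)
    print(table[("flies", "small")])  # Counter({'bird': 1}); a query is now a lookup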

  8. What we need • Typically, n (number of features) will be in the hundreds and m (number of instances in the dataset) will be in the tens of thousands • We want a classifier that pre-computes enough so that it does not need to scan through the instances during the query, but we do not want to pre-compute too many values

  9. Probability Notation • What we want to estimate from our dataset is a conditional probability • P( Y=c | X1=v1, X2=v2, …, Xn = vn ) represents the probability that the category of the instance is c, given that the feature values are v1,v2,…,vn (the input) • In our classifier, we output the c with maximum probability

  10. Bayes Theorem • Bayes theorem allows us to invert a conditional probability • P( A=a | B=b ) = P( B=b | A=a ) P( A=a ) / P( B=b ) • Why and how will this help? • The answer will come later
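
A tiny numeric illustration in Python (the probabilities are made up purely to show the mechanics of the inversion):

    # Made-up numbers: 30% of organisms are birds, 25% of organisms fly,
    # and 80% of birds fly.
    p_bird = 0.30               # P( A=a ):       P( Y = "bird" )
    p_flies = 0.25              # P( B=b ):       P( X = "flies" )
    p_flies_given_bird = 0.80   # P( B=b | A=a ): P( X = "flies" | Y = "bird" )

    # Bayes theorem: P( A=a | B=b ) = P( B=b | A=a ) * P( A=a ) / P( B=b )
    p_bird_given_flies = p_flies_given_bird * p_bird / p_flies
    print(p_bird_given_flies)   # 0.96 (up to floating-point rounding)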

  11. [Diagram: the instances are partitioned into four regions W, X, Y, Z, where Z contains the instances with both A=a and B=b, X those with only A=a, Y those with only B=b, and W the rest] Suppose U = W+X+Y+Z. Then P( A=a | B=b ) = Z/(Z+Y) and P( B=b | A=a ) = Z/(Z+X), while P( A=a ) = (Z+X)/U and P( B=b ) = (Z+Y)/U, so P( A=a ) / P( B=b ) = (Z+X)/(Z+Y). Therefore P( B=b | A=a ) P( A=a ) / P( B=b ) = [ Z/(Z+X) ] (Z+X)/(Z+Y) = Z/(Z+Y) = P( A=a | B=b )

  12. Another helpful equivalence • Assuming that two events are independent, the probability that both events occur is equal to the product of their individual probabilities • P( X1=v1, X2=v2 ) = P( X1=v1 ) P( X2=v2 )
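
A one-line illustration with made-up numbers:

    # Illustrative marginal probabilities:
    p_flies = 0.25      # P( X1 = "flies" )
    p_two_legs = 0.40   # P( X2 = 2 )

    # If the two features are independent, the joint probability is the product:
    print(p_flies * p_two_legs)  # 0.1, i.e. P( X1 = "flies", X2 = 2 )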

  13. The critical step • Goal: maximize this quantity over all possible Y-values: P( Y=c | X1=v1, X2=v2, …, Xn=vn ) • By Bayes theorem this equals P( X1=v1, X2=v2, …, Xn=vn | Y=c ) P( Y=c ) / P( X1=v1, X2=v2, …, Xn=vn ) • Assuming the features are independent given the category, the numerator factors into P( X1=v1 | Y=c ) P( X2=v2 | Y=c ) … P( Xn=vn | Y=c ) P( Y=c ), still over the same divisor • We can ignore the divisor since it remains the same regardless of the Y-value

  14. And here it is… • We want a classifier to compute the c with max P( Y=c | X1=v1, X2=v2, …, Xn=vn ) • We get the same c if we instead compute the c with max P( X1=v1 | Y=c ) P( X2=v2 | Y=c ) … P( Xn=vn | Y=c ) P( Y=c ) • These values can be pre-computed, and the number of computations is not combinatorially explosive

  15. Building a classifier (approach 3) • For each category c, estimate P( Y=c ) = (number of c-instances) / (total number of instances) • For each category c and each feature Xi, determine the distribution P( Xi | Y=c ): for each possible value v of Xi, estimate P( Xi=v | Y=c ) = (number of c-instances where Xi=v) / (number of c-instances) • A training sketch follows below
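
A minimal Python training sketch, assuming the dataset is a list of (feature tuple, category) pairs (all names are illustrative):

    from collections import Counter, defaultdict

    def train_naive_bayes(dataset):
        total = len(dataset)
        class_counts = Counter(cat for _, cat in dataset)
        # P( Y=c ) = (number of c-instances) / (total number of instances)
        priors = {c: n / total for c, n in class_counts.items()}

        # cond_counts[(i, v, c)] = number of c-instances whose i-th feature equals v
        cond_counts = defaultdict(int)
        for feats, cat in dataset:
            for i, v in enumerate(feats):
                cond_counts[(i, v, cat)] += 1
        # P( Xi=v | Y=c ) = (number of c-instances where Xi=v) / (number of c-instances)
        conditionals = {(i, v, c): n / class_counts[c]
                        for (i, v, c), n in cond_counts.items()}
        return priors, conditionals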

  16. Using the classifier (approach 3) • For a given input tuple (v1,v2,…,vn), determine the category c that yields max P( X1=v1 | Y=c ) P( X2=v2 | Y=c ) … P( Xn=vn | Y=c ) P( Y=c ), looking up each term from the pre-computed values (a query sketch follows below) • Output category c
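
Continuing the training sketch above, the query step could look like this (illustrative; a real implementation would typically add smoothing, e.g. Laplace, so that unseen feature values do not zero out the whole product):

    def classify(priors, conditionals, query):
        # Return the category c with the largest
        # P(X1=v1|Y=c) ... P(Xn=vn|Y=c) P(Y=c),
        # looking every factor up in the pre-computed tables.
        best_category, best_score = None, -1.0
        for c, prior in priors.items():
            score = prior
            for i, v in enumerate(query):
                score *= conditionals.get((i, v, c), 0.0)
            if score > best_score:
                best_category, best_score = c, score
        return best_category

    data = [(("flies", 2), "bird"), (("flies", 2), "bird"), (("walks", 4), "mammal")]
    priors, conditionals = train_naive_bayes(data)
    print(classify(priors, conditionals, ("flies", 2)))  # bird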

  17. Example • Suppose we wanted a classifier that categorizes organisms according to certain characteristics • Organism categories (Y) are: mammal, bird, fish, insect, spider • Characteristics (X1,X2,X3,X4): motility (walks, flies, swims), number of legs (2,4,6,8), size (small, large), body-covering (fur, scales, feathers) • The dataset contains 1000 organism samples • m = 1000, n = 4, number of categories = 5

  18. Comparing approaches • Approach 1: requires scanning all tuples for matching feature values • entails 1000*4 = 4000 comparisons per query, plus counting the occurrences of each category among the matches • Approach 2: pre-compute probabilities • Preparation: for each of the 3*4*2*3 = 72 combinations, determine the probability of each category (72*5 = 360 computations) • Query: straightforward lookup of the answer

  19. Comparing approaches • Approach 3: Naïve Bayes classifier • Preparation: compute the P( Y=c ) probabilities: 5 of them; compute P( Xi=v | Y=c ): 5*(3+4+2+3) = 60 of them • Query: straightforward computation of 5 probabilities, determine the maximum, return the category that yields the maximum

  20. About the Naïve Bayes Classifier • Computations and resources required are reasonable, both for the preparatory stage and the actual query stage • Even if the number n of features is in the thousands! • The classifier is naïve because it assumes the features are independent (which is likely not the case in practice) • It turns out that the classifier works well in practice even with this limitation • Logarithms of probabilities are often used instead of the actual probabilities, to avoid underflow when computing the probability products (a sketch follows below)
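
A short sketch of that log-probability trick, reusing the illustrative tables from the earlier sketches (and assuming every looked-up probability is non-zero, which smoothing would guarantee):

    import math

    def log_score(priors, conditionals, query, c):
        # Sum log-probabilities instead of multiplying raw probabilities,
        # which avoids floating-point underflow when n is large.
        score = math.log(priors[c])
        for i, v in enumerate(query):
            score += math.log(conditionals[(i, v, c)])
        return score

    # The category with the largest log-score is the same one that maximizes the
    # product of probabilities, because log is monotonically increasing.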
