
Generative classifiers: The Gaussian classifier


Presentation Transcript


  1. Generative classifiers: The Gaussian classifier Ata Kaban School of Computer Science University of Birmingham

  2. Outline • We have already seen how Bayes rule can be turned into a classifier • In all our examples so far we had discrete valued attributes (e.g. in {‘sunny’,’rainy’}, {+,-}) • Today we learn how to do this when the data attributes are continuous valued

  3. Example • Task: predict the gender of individuals based on their heights • Given: 100 height examples of women, 100 height examples of men • [Figure: overlaid histograms of the empirical height data for males and females; x-axis: Height (meters), y-axis: Frequency]
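A minimal sketch of a data set shaped like the slide's example (100 heights per class). The class means and spreads below are made-up placeholders for illustration, not the empirical values behind the slide's histogram.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder values only: these means/spreads are assumptions for this sketch,
# not the empirical data shown on the slide.
heights_female = rng.normal(loc=1.65, scale=0.07, size=100)  # 100 heights in metres
heights_male   = rng.normal(loc=1.78, scale=0.08, size=100)  # 100 heights in metres

x = np.concatenate([heights_female, heights_male])              # measurements
h = np.concatenate([np.zeros(100), np.ones(100)]).astype(int)   # 0 = female, 1 = male
```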

  4. Class priors • We can encode the values of the hypothesis (class) as 1 (male) and 0 (female), so h ∈ {0,1}. • Since in this example we had the same number of males and females, we have P(h=1)=P(h=0)=0.5. These are the prior probabilities of class membership because they can be set before measuring any data. • Note that in cases when the class proportions are imbalanced, we can use the priors to make predictions even before seeing any data.
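A small sketch of how the priors fall out of the class proportions; the label array h (1 = male, 0 = female) is a hypothetical stand-in for the training labels.

```python
import numpy as np

# Hypothetical labels: 100 males and 100 females, as in the lecture's example.
h = np.concatenate([np.ones(100, dtype=int), np.zeros(100, dtype=int)])

prior_male   = np.mean(h == 1)    # estimate of P(h=1)
prior_female = np.mean(h == 0)    # estimate of P(h=0)
print(prior_male, prior_female)   # 0.5 0.5 for balanced classes
```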

  5. Class-conditional likelihood • Our measurements are heights. This is our data, x. • Class-conditional likelihoods: p(x|h=1) is the probability (density) that a male has height x meters; p(x|h=0) is the probability (density) that a female has height x meters.

  6. Class posterior • As before, from Bayes rule we can obtain the class posteriors: P(h=1|x) = p(x|h=1)P(h=1) / p(x), where the denominator p(x) = p(x|h=1)P(h=1) + p(x|h=0)P(h=0) is the probability of measuring the height value x irrespective of the class. • If we can compute this then we can use it for predicting the gender from the height measurement.
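A sketch of the posterior computation from Bayes rule. The functions lik_male and lik_female are placeholder names for the class-conditional densities p(x|h=1) and p(x|h=0), which later slides model as Gaussians; they are not names used in the lecture.

```python
def posterior_male(x, lik_male, lik_female, prior_male=0.5):
    """P(h=1 | x) via Bayes rule for the two-class height example.

    lik_male(x) should return p(x | h=1) and lik_female(x) should return
    p(x | h=0). The denominator is p(x): the probability of measuring the
    height value x irrespective of the class.
    """
    prior_female = 1.0 - prior_male
    numerator = lik_male(x) * prior_male
    evidence = numerator + lik_female(x) * prior_female   # p(x)
    return numerator / evidence
```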

  7. Discriminant function • When does our prediction switch from predicting h=0 to predicting h=1? • … more precisely: at what value of x does P(h=1|x) become equal to P(h=0|x)? • [Figure: the empirical height histograms for male and female again; x-axis: Height (meters), y-axis: Frequency]

  8. Discriminant function • If we make a measurement x, we compute the posteriors P(h=1|x) and P(h=0|x). • If we find P(h=1|x) > P(h=0|x), we decide to predict h=1, i.e. male. • If for another measurement we find P(h=0|x) > P(h=1|x), we predict h=0, i.e. female.

  9. Discriminant function • We can define a discriminant function as the ratio of the two posteriors, f1(x) = P(h=1|x) / P(h=0|x), and compare the function value to 1. • More convenient to have the switching at 0 rather than at 1. Define the discriminant function as the log of f1: f(x) = log f1(x) = log P(h=1|x) − log P(h=0|x). • Then the sign of this function defines the prediction (if f(x)>0 => male, if f(x)<0 => female).
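A minimal sketch of the log discriminant and the sign-based decision it implies. Again, lik_male and lik_female are placeholder likelihood functions, to be filled in with the Gaussian densities of the following slides; the evidence p(x) cancels in the ratio, so only likelihoods and priors are needed.

```python
import numpy as np

def log_discriminant(x, lik_male, lik_female, prior_male=0.5):
    """f(x) = log P(h=1|x) - log P(h=0|x); the evidence p(x) cancels in the ratio."""
    prior_female = 1.0 - prior_male
    return (np.log(lik_male(x)) + np.log(prior_male)
            - np.log(lik_female(x)) - np.log(prior_female))

def predict(x, lik_male, lik_female, prior_male=0.5):
    # f(x) > 0 => predict male (h=1); f(x) < 0 => predict female (h=0)
    return 1 if log_discriminant(x, lik_male, lik_female, prior_male) > 0 else 0
```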

  10. How do we compute it? • Let's write it out using Bayes rule: f(x) = log p(x|h=1) + log P(h=1) − log p(x|h=0) − log P(h=0); the evidence p(x) cancels in the ratio. • Now, we need the class-conditional likelihood terms, p(x|h=1) and p(x|h=0). Note that x now takes continuous real values. • We will model each class by a Gaussian distribution. (Note, there are other ways to do it; this is a generic problem that Density Estimation deals with. Here we consider the specific case of using a Gaussian, which is fairly commonly done in practice.)

  11. Illustration – our 1D example • [Figure: empirical height data for male and female with the fitted Gaussian distribution for each class; x-axis: Height (meters), y-axis: Frequency]

  12. Gaussian - univariate • p(x|h=k) = 1/√(2π σ_k²) · exp( −(x−μ_k)² / (2σ_k²) ), where μ_k is the mean (center) and σ_k² is the variance (spread). These are the parameters that describe the distributions. • We will have a separate Gaussian for each class. So, the female class will have μ_0 as its mean and σ_0² as its variance; the male class will have μ_1 as its mean and σ_1² as its variance. • We need to estimate these parameters from the data.
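A sketch of the univariate Gaussian density and the per-class fit. The height arrays are again illustrative placeholders rather than the lecture's data.

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    """Univariate Gaussian density N(x; mu, var)."""
    return np.exp(-(x - mu) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

# Illustrative per-class data (assumed values, not the lecture's).
rng = np.random.default_rng(0)
heights_female = rng.normal(1.65, 0.07, size=100)
heights_male   = rng.normal(1.78, 0.08, size=100)

# One Gaussian per class: estimate that class's mean and variance from its own data.
mu_f, var_f = heights_female.mean(), heights_female.var()
mu_m, var_m = heights_male.mean(), heights_male.var()

print(gaussian_pdf(1.60, mu_f, var_f))   # p(x = 1.60 | female)
print(gaussian_pdf(1.60, mu_m, var_m))   # p(x = 1.60 | male)
```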

  13. Gaussian - multivariate • Let x ∈ R^d, so x has d attributes, and let k ∈ {0,1}. Then p(x|h=k) = 1/( (2π)^(d/2) |Σ_k|^(1/2) ) · exp( −½ (x−μ_k)ᵀ Σ_k⁻¹ (x−μ_k) ), where μ_k are the mean vectors and Σ_k are the covariance matrices. These are the parameters that describe the distributions, and they are estimated from the data.
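A sketch of evaluating the multivariate density for one class k; the mean vector and covariance matrix below are assumed example values.

```python
import numpy as np

def multivariate_gaussian_pdf(x, mu, Sigma):
    """p(x | h=k) for x in R^d with mean vector mu and covariance matrix Sigma."""
    d = len(mu)
    diff = x - mu
    norm_const = 1.0 / np.sqrt((2.0 * np.pi) ** d * np.linalg.det(Sigma))
    return norm_const * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

# Example with d = 2 attributes and assumed parameters (e.g. height in m, weight in kg).
mu_k = np.array([1.70, 65.0])
Sigma_k = np.array([[0.01, 0.3],
                    [0.3, 25.0]])          # symmetric, positive definite
print(multivariate_gaussian_pdf(np.array([1.75, 70.0]), mu_k, Sigma_k))
```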

  14. Gaussian - multivariate

  15. 2D example with 2 classes • [Figure: scatter plot of the two classes in two dimensions; x-axis: Attribute 1, y-axis: Attribute 2]
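A sketch of the picture on this slide: two synthetic 2D classes (the cluster parameters are assumptions), a full-covariance Gaussian fitted to each, and a test point classified by comparing p(x|h=k)P(h=k). It uses scipy.stats.multivariate_normal for the density.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)

# Two synthetic classes in (attribute 1, attribute 2); parameters are illustrative.
X0 = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.3], [0.3, 1.0]], size=100)
X1 = rng.multivariate_normal([2.5, 2.0], [[1.0, -0.2], [-0.2, 1.5]], size=100)

# Fit one Gaussian per class (full covariance) and assume equal priors.
dens0 = multivariate_normal(mean=X0.mean(axis=0), cov=np.cov(X0, rowvar=False))
dens1 = multivariate_normal(mean=X1.mean(axis=0), cov=np.cov(X1, rowvar=False))

x_test = np.array([1.5, 1.0])
scores = [0.5 * dens0.pdf(x_test), 0.5 * dens1.pdf(x_test)]   # p(x|h=k) * P(h=k)
print("predicted class:", int(np.argmax(scores)))
```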

  16. Naïve Bayes • Notice the full covariance matrices are d×d, so each contains on the order of d² parameters to estimate. • In many situations there is not enough data to estimate the full covariance – e.g. when d is large. • The Naïve Bayes assumption is again an easy simplification that we can make, and it tends to work well in practice. In the Gaussian model it means that the covariance matrix is diagonal. • For the brave: Check this last statement for yourself! – 3% extra credit if you hand in a correct solution to me before next Thursday’s class!
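A sketch of what the Naïve Bayes simplification looks like in the Gaussian model: each attribute gets its own per-class mean and variance, so the class-conditional log density is a sum of univariate Gaussian terms (equivalently, a diagonal covariance). This only illustrates the simplification; it is not the extra-credit derivation.

```python
import numpy as np

def fit_naive_bayes_gaussian(X):
    """Per-attribute mean and variance for one class (i.e. a diagonal covariance)."""
    return X.mean(axis=0), X.var(axis=0)

def log_likelihood_diag(x, mu, var):
    """log p(x | class) with independent Gaussians, one per attribute."""
    return np.sum(-0.5 * np.log(2.0 * np.pi * var) - (x - mu) ** 2 / (2.0 * var))

# Assumed example: 100 training points of one class with d = 5 attributes.
rng = np.random.default_rng(2)
X_class = rng.normal(size=(100, 5))
mu, var = fit_naive_bayes_gaussian(X_class)
print(log_likelihood_diag(np.zeros(5), mu, var))
```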

  17. Are we done? • How do we estimate the parameters, i.e. the means and the variance/covariance? • If we use the Naïve Bayes assumption, we can compute the estimates of the mean and variance in each class separately for each feature. • If d is small and you have many points in your training set, then working with the full covariance is expected to work better. • In MatLab there are built-in functions that you can use: mean, cov, var.
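A sketch of the parameter estimates done per class; the NumPy calls below play the role of the MatLab functions mean, var and cov mentioned on the slide, and the class data is an assumed placeholder.

```python
import numpy as np

rng = np.random.default_rng(3)
X_class = rng.normal(size=(100, 3))        # assumed: one class's points, d = 3 attributes

mu_hat = X_class.mean(axis=0)              # mean vector (cf. MatLab mean)
var_hat = X_class.var(axis=0, ddof=1)      # per-attribute variances (cf. var)
Sigma_hat = np.cov(X_class, rowvar=False)  # full d-by-d covariance matrix (cf. cov)

# Under the Naive Bayes assumption, only the diagonal of the covariance is kept:
Sigma_diag = np.diag(var_hat)
```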

  18. Multi-class classification • We may have more than 2 classes – e.g. ‘healthy’, ‘disease type 1’, ‘disease type 2’. • Our Gaussian classifier is easy to use in multi-class problems. • We compute the posterior probability for each of the classes • We predict the class whose posterior probability is highest.
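A sketch of the multi-class rule: evaluate p(x|h=k)P(h=k) for every class and take the argmax (the evidence p(x) is the same for all classes, so it can be ignored). The three classes and their parameters are assumed placeholders standing in for, e.g., ‘healthy’, ‘disease type 1’, ‘disease type 2’.

```python
import numpy as np
from scipy.stats import multivariate_normal

def predict_class(x, means, covs, priors):
    """Return the class k maximising the posterior P(h=k | x)."""
    scores = [multivariate_normal(mean=m, cov=c).pdf(x) * p
              for m, c, p in zip(means, covs, priors)]
    return int(np.argmax(scores))

# Three assumed classes with illustrative parameters.
means = [np.array([0.0, 0.0]), np.array([3.0, 0.0]), np.array([0.0, 3.0])]
covs = [np.eye(2), np.eye(2), 2.0 * np.eye(2)]
priors = [0.5, 0.3, 0.2]
print(predict_class(np.array([2.5, 0.5]), means, covs, priors))
```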

  19. Summing up • This type of classifier is called ‘generative’, because it rests on the assumption that the cloud of points in each class can be seen as generated by some distribution, e.g. a Gaussian, and works out its decisions based on estimating these distributions. • One could instead model the discriminant function directly! That type of classifier is called ‘discriminative’. • For the brave: Try to work out the form of the discriminant function by plugging into it the form of the Gaussian class conditional densities. You will get a quadratic function of x in general. When does it reduce to a linear function? • Recommended reading: Rogers & Girolami, Chapter 5.
