# The Classification Problem


#### Presentation Transcript

PGM Tirgul (recitation) 11: Naïve Bayesian Classifier + Tree Augmented Naïve Bayes (adapted from a tutorial by Nir Friedman and Moises Goldszmidt)

[Figure: Bayesian network for the heart-disease domain, with nodes Age, Sex, ChestPain, RestBP, Cholesterol, BloodSugar, ECG, MaxHeartRt, Angina, OldPeak, and Heart Disease.]

### The Classification Problem

• From a data set describing objects by vectors of features and a class

• Find a function F: features → class to classify a new object

Vector1 = <49, 0, 2, 134, 271, 0, 0, 162, 0, 0, 2, 0, 3> → Presence

Vector2 = <42, 1, 3, 130, 180, 0, 0, 150, 0, 0, 1, 0, 3> → Presence

Vector3 = <39, 0, 3, 94, 199, 0, 0, 179, 0, 0, 1, 0, 3> → Presence

Vector4 = <41, 1, 2, 135, 203, 0, 0, 132, 0, 0, 2, 0, 6> → Absence

Vector5 = <56, 1, 3, 130, 256, 1, 2, 142, 1, 0.6, 2, 1, 6> → Absence

Vector6 = <70, 1, 2, 156, 245, 0, 2, 143, 0, 0, 1, 0, 3> → Presence

Vector7 = <56, 1, 4, 132, 184, 0, 2, 105, 1, 2.1, 2, 1, 6> → Absence
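For concreteness, the same training set written as plain Python data a learner could consume (a minimal sketch; variable names are illustrative):

```python
# Training data from the vectors above: 13 feature values per vector
# (age, sex, chest pain, resting BP, cholesterol, blood sugar, ECG,
# max heart rate, angina, old peak, plus three further UCI heart-disease
# attributes), and the class label of each vector.
X = [
    [49, 0, 2, 134, 271, 0, 0, 162, 0, 0,   2, 0, 3],
    [42, 1, 3, 130, 180, 0, 0, 150, 0, 0,   1, 0, 3],
    [39, 0, 3,  94, 199, 0, 0, 179, 0, 0,   1, 0, 3],
    [41, 1, 2, 135, 203, 0, 0, 132, 0, 0,   2, 0, 6],
    [56, 1, 3, 130, 256, 1, 2, 142, 1, 0.6, 2, 1, 6],
    [70, 1, 2, 156, 245, 0, 2, 143, 0, 0,   1, 0, 3],
    [56, 1, 4, 132, 184, 0, 2, 105, 1, 2.1, 2, 1, 6],
]
y = ["Presence", "Presence", "Presence", "Absence",
     "Absence", "Presence", "Absence"]

# The classification problem: find F such that F(x) predicts the class
# of a new, unseen feature vector x.
```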

### Examples

• Predicting heart disease

• Features: cholesterol, chest pain, angina, age, etc.

• Class: {present, absent}

• Finding lemons in cars

• Features: make, brand, miles per gallon, acceleration, etc.

• Class: {normal, lemon}

• Digit recognition

• Features: matrix of pixel descriptors

• Class: {1, 2, 3, 4, 5, 6, 7, 8, 9, 0}

• Speech recognition

• Features: Signal characteristics, language model

• Class: {pause/hesitation, retraction}

### Approaches

• Memory based

• Define a distance between samples

• Nearest neighbor, support vector machines

• Decision surface

• Find best partition of the space

• CART, decision trees

• Generative models

• Induce a model and impose a decision rule

• Bayesian networks

### Generative Models

• Bayesian classifiers

• Induce a probability describing the data

P(A1,…,An,C)

• Impose a decision rule. Given a new object < a1,…,an >

c = argmaxc P(C = c | a1,…,an)

• We have shifted the problem to learning P(A1,…,An,C)

• We are learning how to do this efficiently: learn a Bayesian network representation for P(A1,…,An,C)
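A minimal sketch of this decision rule in Python, assuming a hypothetical function joint_prob(c, attrs) that evaluates P(C = c, a1,…,an), e.g. by querying a learned Bayesian network:

```python
def classify(attrs, classes, joint_prob):
    """Bayesian decision rule: argmax_c P(C = c | a1, ..., an).

    joint_prob(c, attrs) is assumed to return P(C = c, a1, ..., an);
    since P(a1, ..., an) is the same for every class, maximizing the
    joint is equivalent to maximizing the posterior.
    """
    return max(classes, key=lambda c: joint_prob(c, attrs))
```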

### Optimality of the Decision Rule: Minimizing the Error Rate

• Let ci be the true class, and let lj be the class returned by the classifier.

• A decision by the classifier is correct if ci = lj, and in error if ci ≠ lj.

• The error incurred by choosing label lj is err(lj | a1,…,an) = Σi≠j P(ci | a1,…,an) = 1 − P(lj | a1,…,an)

• Thus, had we access to P, we would minimize the error rate by choosing li when P(li | a1,…,an) ≥ P(lj | a1,…,an) for all j, which is exactly the decision rule of the Bayesian classifier.

### Advantages of the Generative Model Approach

• Output: a ranking over the outcomes, e.g. the likelihood of present vs. absent

• Explanation: what is the profile of a “typical” person with heart disease?

• Missing values: both in training and testing

• Value of information: If the person has high cholesterol and blood sugar, which other test should be conducted?

• Validation: confidence measures over the model and its parameters

• Background knowledge: priors and structure

### Evaluating the Performance of a Classifier: n-fold Cross-Validation

• Partition the data set into n segments D1, D2, …, Dn

• Do n times (Run 1, …, Run n): train the classifier on the n − 1 remaining segments and test its accuracy on the held-out segment

• Compute statistics on the n runs: mean accuracy and variance

• Accuracy on test data of size m: Acc = (number of correctly classified test instances) / m

[Figure: the original data set is split into segments D1, D2, …, Dn; in run k, segment Dk is held out for testing and the other segments are used for training.]
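A minimal sketch of n-fold cross-validation in Python; make_classifier and its fit/predict interface are illustrative assumptions, not part of the slides:

```python
import random
from statistics import mean, variance

def n_fold_cv(X, y, n, make_classifier, seed=0):
    """Estimate classifier accuracy by n-fold cross-validation.

    make_classifier() is a hypothetical factory returning an object
    with fit(X, y) and predict(x) methods.
    """
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::n] for k in range(n)]          # n disjoint segments
    accs = []
    for test in folds:
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        clf = make_classifier()
        clf.fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(clf.predict(X[i]) == y[i] for i in test)
        accs.append(correct / len(test))           # Acc = #correct / m
    return mean(accs), variance(accs)
```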


### Advantages of Using a Bayesian Network

• Efficiency in learning and query answering

• Combine knowledge engineering and statistical induction

• Algorithms for decision making, value of information, diagnosis and repair

[Figure: Bayesian network learned for the heart-disease domain, with nodes Outcome, Age, MaxHeartRate, Vessels, STSlope, Angina, BloodSugar, OldPeak, ChestPain, RestBP, ECG, Thal, Sex, and Cholesterol. Accuracy = 85%; data source: UCI repository.]

### Problems with BNs as Classifiers

When evaluating a Bayesian network B, we examine the likelihood of the model given the data D and try to maximize it:

LL(B | D) = Σ log PB(a1, …, an, c), where the sum ranges over the N instances in D

When learning structure we also add a penalty for structure complexity and seek a balance between the two terms (MDL or a variant). The following properties follow:

• A Bayesian network minimizes the error over all the variables in the domain, not necessarily the local error of the class given the attributes (this is OK with enough data).

• Because of the penalty, a Bayesian network in effect looks only at a small subset of the variables that affect a given node (its Markov blanket).

### Problems with BNs as Classifiers (cont.)

Let’s look closely at the likelihood term. It decomposes as:

LL(B | D) = Σ log PB(c | a1, …, an) + Σ log PB(a1, …, an), summing over the instances in D

• The first term estimates just what we want: the probability of the class given the attributes. The second term estimates the joint probability of the attributes.

• When there are many attributes, the second term starts to dominate (the magnitude of the log terms grows as the joint probabilities become small).

• Why not use just the first term? Because then the score no longer factorizes according to the network structure, and the calculations become much harder.


### The Naïve Bayesian Classifier

[Figure: naïve Bayes structure for the diabetes domain (Pima Indians, from the UCI repository): class node C with children F1, …, F6 = pregnant, age, glucose, insulin, mass, dpf.]

• Fixed structure encoding the assumption that features are independent of each other given the class.

• Learning amounts to estimating the parameters of P(Fi | C) for each feature Fi.

### The Naïve Bayesian Classifier (cont.)

What do we gain?

• We ensure that in the learned network, the probability P(C|A1…An) will take every attribute into account.

• We will show a polynomial-time algorithm for learning the network.

• Estimates are robust: they consist of low-order statistics that require few instances.

• Naïve Bayes has proven to be a powerful classifier, often exceeding unrestricted Bayesian networks.


### The Naïve Bayesian Classifier (cont.)

• Common practice is to estimate P(C = c) = N(c) / N and P(Fi = f | C = c) = N(f, c) / N(c), where N(·) counts occurrences in the data.

• These estimates are identical to the MLE for multinomials.
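A sketch of these estimates in code for discrete features, together with the resulting naïve Bayes prediction rule (names are illustrative; the tiny pseudo-count for unseen values is an implementation convenience, not part of the slide):

```python
from collections import Counter, defaultdict
from math import log

def fit_naive_bayes(X, y):
    """MLE for a discrete naive Bayes model:
    P(C = c) = N(c) / N  and  P(Fi = f | C = c) = N(f, c) / N(c)."""
    class_counts = Counter(y)
    feat_counts = defaultdict(Counter)   # (i, c) -> counts of values of Fi
    for xs, c in zip(X, y):
        for i, f in enumerate(xs):
            feat_counts[(i, c)][f] += 1
    return class_counts, feat_counts

def predict(x, class_counts, feat_counts):
    """argmax_c log P(c) + sum_i log P(fi | c); values never seen with
    class c get a tiny pseudo-probability to avoid log(0)."""
    N = sum(class_counts.values())
    def score(c):
        s = log(class_counts[c] / N)
        for i, f in enumerate(x):
            s += log((feat_counts[(i, c)][f] or 1e-9) / class_counts[c])
        return s
    return max(class_counts, key=score)
```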

### Improving Naïve Bayes

• Naïve Bayes encodes assumptions of independence that may be unreasonable:

Are pregnancy and age independent given diabetes?

Problem: the same evidence may be incorporated multiple times (a rare glucose level and a rare insulin level jointly over-penalize the class variable)

• The success of naïve Bayes is attributed to

• Robust estimation

• Decision may be correct even if probabilities are inaccurate

• Idea: improve on naïve Bayes by weakening the independence assumptions

Bayesian networks provide the appropriate mathematical language for this task


### Tree Augmented Naïve Bayes (TAN)

[Figure: TAN structure for the diabetes domain: class C is a parent of every feature (F1, …, F6 = pregnant, age, glucose, insulin, mass, dpf), and the features are additionally connected by tree edges.]

• Approximate the dependence among features with a tree Bayes net

• Tree induction algorithm

• Optimality: maximum likelihood tree

• Efficiency: polynomial algorithm

• Robust parameter estimation

### Optimal Tree Construction Algorithm

The procedure of Chow and Liu constructs a tree structure BT that maximizes LL(BT | D):

• Compute the mutual information between every pair of attributes: I(Ai; Aj) = Σ P(ai, aj) log [ P(ai, aj) / (P(ai) P(aj)) ], where P is the empirical distribution observed in the data.

• Build a complete undirected graph in which the vertices are the attributes and each edge is annotated with the corresponding mutual information as weight.

• Build a maximum weighted spanning tree of this graph.

Complexity: O(n²N) + O(n²) + O(n² log n) = O(n²N), where n is the number of attributes and N is the sample size.
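A sketch of the Chow-Liu procedure in Python: empirical mutual information as edge weights, then a maximum-weight spanning tree via Kruskal's algorithm (any MST algorithm works):

```python
from collections import Counter
from itertools import combinations
from math import log

def mutual_information(D, i, j):
    """Empirical I(Ai; Aj) from a list of samples D (O(N) per pair)."""
    N = len(D)
    pij = Counter((d[i], d[j]) for d in D)
    pi = Counter(d[i] for d in D)
    pj = Counter(d[j] for d in D)
    return sum((n / N) * log((n / N) / ((pi[a] / N) * (pj[b] / N)))
               for (a, b), n in pij.items())

def chow_liu_tree(D, n_attrs):
    """Maximum-weight spanning tree over the attributes, with edges
    weighted by mutual information (Kruskal with a simple union-find)."""
    edges = sorted(((mutual_information(D, i, j), i, j)
                    for i, j in combinations(range(n_attrs), 2)),
                   reverse=True)
    parent = list(range(n_attrs))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    tree = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                        # adding (i, j) keeps it a tree
            parent[ri] = rj
            tree.append((i, j))
    return tree   # n_attrs - 1 undirected edges; direct them from any root
```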

### Tree Construction Algorithm (cont.)

It is easy to “plant” the optimal tree in the TAN model by revising the algorithm to use a conditional measure that takes the class into account:

I(Ai; Aj | C) = Σ P(ai, aj, c) log [ P(ai, aj | c) / (P(ai | c) P(aj | c)) ], where P is the empirical distribution

This measures the gain in log-likelihood of adding Ai as a parent of Aj when C is already a parent.
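In the sketch above, only the edge weight changes: replace mutual_information with the class-conditional version, reusing the same function on each class-induced subset of the data, e.g.:

```python
def conditional_mutual_information(D, labels, i, j):
    """Empirical I(Ai; Aj | C): average the pairwise mutual information
    over the partition of D induced by the class labels."""
    N = len(D)
    cmi = 0.0
    for c in set(labels):
        Dc = [d for d, l in zip(D, labels) if l == c]
        cmi += (len(Dc) / N) * mutual_information(Dc, i, j)
    return cmi
```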

### Problem with TAN

When estimating parameters we estimate the conditional probability P(Ai | Parents(Ai)). This is done by partitioning the data according to the possible values of Parents(Ai).

• When a partition contains just a few instances we get an unreliable estimate

• In naïve Bayes the partition was only on the values of the class (and we have to assume that is adequate)

• In TAN each attribute is additionally conditioned on one more parent, so we have at least twice the number of partitions and get unreliable estimates, especially for small data sets

Solution: smooth the conditional estimate toward the unconditional (marginal) one:

θs(ai | pai) = [ N(pai) / (N(pai) + s) ] · P(ai | pai) + [ s / (N(pai) + s) ] · P(ai)

where s is the smoothing bias, typically small, and P denotes empirical frequencies.
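This estimate translates directly into code (a minimal sketch; the counts and the marginal are assumed to be precomputed from the training data):

```python
def smoothed_estimate(n_a_pa, n_pa, p_a_marginal, s=5.0):
    """Smoothed estimate of P(Ai = a | Parents(Ai) = pa): blend the
    empirical conditional frequency with the marginal P(a), with the
    mix controlled by the smoothing bias s.

    n_a_pa: count N(a, pa);  n_pa: count N(pa)."""
    alpha = n_pa / (n_pa + s)            # trust the partition more as it grows
    mle = n_a_pa / n_pa if n_pa > 0 else 0.0
    return alpha * mle + (1 - alpha) * p_a_marginal
```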

### Performance: TAN vs. Naïve Bayes


• 25 Data sets from UCI repository

• Medical

• Signal processing

• Financial

• Games

• Accuracy based on 5-fold cross-validation

• No parameter tuning

[Scatter plot: accuracy of TAN (x-axis) vs. naïve Bayes (y-axis), both axes from 65 to 100; one point per data set.]

### Performance: TAN vs. C4.5

• 25 Data sets from UCI repository

• Medical

• Signal processing

• Financial

• Games

• Accuracy based on 5-fold cross-validation

• No parameter tuning

[Scatter plot: accuracy of TAN (x-axis) vs. C4.5 (y-axis), both axes from 65 to 100; one point per data set.]

### Beyond TAN

• Can we do better by learning a more flexible structure?

• Experiment: learn a Bayesian network without restrictions on the structure


### Performance: TAN vs. Bayesian Networks

• 25 Data sets from UCI repository

• Medical

• Signal processing

• Financial

• Games

• Accuracy based on 5-fold cross-validation

• No parameter tuning

[Scatter plot: accuracy of TAN (x-axis) vs. unrestricted Bayesian networks (y-axis), both axes from 65 to 100; one point per data set.]

### Classification: Summary

• Bayesian networks provide a useful language to improve Bayesian classifiers

• Lesson: we need to be aware of the task at hand, the amount of training data vs. the dimensionality of the problem, etc.