
# The Classification Problem





## Presentation Transcript

PGM: Tirgul 11: Naïve Bayesian Classifier + Tree Augmented Naïve Bayes (adapted from a tutorial by Nir Friedman and Moises Goldszmidt)

(Figure: Bayesian network over the attributes Age, Sex, ChestPain, RestBP, Cholesterol, BloodSugar, ECG, MaxHeartRt, Angina, OldPeak, and the class Heart Disease.)

The Classification Problem

• From a data set describing objects by vectors of features and a class

• Find a function F: features → class to classify a new object

Vector1 = <49, 0, 2, 134, 271, 0, 0, 162, 0, 0, 2, 0, 3> → Presence

Vector2 = <42, 1, 3, 130, 180, 0, 0, 150, 0, 0, 1, 0, 3> → Presence

Vector3 = <39, 0, 3, 94, 199, 0, 0, 179, 0, 0, 1, 0, 3> → Presence

Vector4 = <41, 1, 2, 135, 203, 0, 0, 132, 0, 0, 2, 0, 6> → Absence

Vector5 = <56, 1, 3, 130, 256, 1, 2, 142, 1, 0.6, 2, 1, 6> → Absence

Vector6 = <70, 1, 2, 156, 245, 0, 2, 143, 0, 0, 1, 0, 3> → Presence

Vector7 = <56, 1, 4, 132, 184, 0, 2, 105, 1, 2.1, 2, 1, 6> → Absence

• Predicting heart disease

• Features: cholesterol, chest pain, angina, age, etc.

• Class: {present, absent}

• Finding lemons in cars

• Features: make, brand, miles per gallon, acceleration, etc.

• Class: {normal, lemon}

• Digit recognition

• Features: matrix of pixel descriptors

• Class: {1, 2, 3, 4, 5, 6, 7, 8, 9, 0}

• Speech recognition

• Features: Signal characteristics, language model

• Class: {pause/hesitation, retraction}

• Memory based

• Define a distance between samples

• Nearest neighbor, support vector machines

• Decision surface

• Find best partition of the space

• CART, decision trees

• Generative models

• Induce a model and impose a decision rule

• Bayesian networks

• Bayesian classifiers

• Induce a probability describing the data

P(A1,…,An,C)

• Impose a decision rule. Given a new object < a1,…,an >

c* = argmax_c P(C = c | a1,…,an)

• We have shifted the problem to learning P(A1,…,An,C)

• We are learning how to do this efficiently: learn a Bayesian network representation for P(A1,…,An,C)
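The decision rule above can be sketched in code. A hypothetical toy joint distribution stands in for the learned P(A1,…,An,C); the point is that maximizing P(C = c | a1,…,an) over c gives the same answer as maximizing the joint, since the evidence P(a1,…,an) is a common factor.

```python
# Toy joint distribution over (cholesterol, angina, heart_disease);
# the numbers are made up for illustration and sum to 1.
joint = {
    ("high", 1, "present"): 0.20, ("high", 1, "absent"): 0.05,
    ("high", 0, "present"): 0.10, ("high", 0, "absent"): 0.15,
    ("low",  1, "present"): 0.08, ("low",  1, "absent"): 0.07,
    ("low",  0, "present"): 0.05, ("low",  0, "absent"): 0.30,
}

def classify(a1, a2, classes=("present", "absent")):
    # argmax over c of P(C = c | a1, a2); dividing by P(a1, a2) does not
    # change the argmax, so we can maximize the joint P(a1, a2, c) directly.
    return max(classes, key=lambda c: joint[(a1, a2, c)])

print(classify("high", 1))  # "present": 0.20 > 0.05
print(classify("low", 0))   # "absent":  0.30 > 0.05
```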

Optimality of the decision rule: minimizing the error rate

• Let ci be the true class, and let lj be the class returned by the classifier.

• A decision by the classifier is correct if ci = lj, and in error if ci ≠ lj.

• The error incurred by choosing label lj is

Err(lj) = Σ_{i : ci ≠ lj} P(ci | a1,…,an) = 1 − P(lj | a1,…,an)

• Thus, had we had access to P, we would minimize the error rate by choosing li when P(li | a1,…,an) = max_j P(lj | a1,…,an), which is exactly the decision rule of the Bayesian classifier.

• Output: Rank over the outcomes---likelihood of present vs. absent

• Explanation: What is the profile of a “typical” person with a heart disease

• Missing values: both in training and testing

• Value of information: If the person has high cholesterol and blood sugar, which other test should be conducted?

• Validation: confidence measures over the model and its parameters

• Background knowledge: priors and structure

Evaluating the performance of a classifier: n-fold cross validation

• Partition the data set into n segments

• Do n times:

• Train the classifier on the other n−1 segments (green)

• Test accuracy on the held-out segment (red)

• Compute statistics over the n runs: mean accuracy and variance

• Accuracy on test data of size m: Acc = (1/m) Σ_{j=1..m} 1{F(a^j) = c^j}, the fraction of test instances classified correctly
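The procedure can be sketched as follows, assuming the classifier is given as a training function that returns a prediction function (all names are illustrative):

```python
import random

def n_fold_cv(data, labels, train, n=5, seed=0):
    # Partition shuffled indices into n roughly equal segments.
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n] for i in range(n)]
    accs = []
    for i in range(n):
        held_out = set(folds[i])                       # the test segment
        train_idx = [j for j in idx if j not in held_out]
        model = train([data[j] for j in train_idx],
                      [labels[j] for j in train_idx])
        correct = sum(model(data[j]) == labels[j] for j in folds[i])
        accs.append(correct / len(folds[i]))           # Acc = (1/m) * #correct
    mean = sum(accs) / n
    var = sum((a - mean) ** 2 for a in accs) / n
    return mean, var

# Trivial stand-in classifier: always predict the majority training label.
def train_majority(xs, ys):
    majority = max(set(ys), key=ys.count)
    return lambda x: majority

data = list(range(100))
labels = [0] * 70 + [1] * 30
mean, var = n_fold_cv(data, labels, train_majority, n=5)
print(mean)  # 0.7: the majority classifier is right on exactly the 0-labels
```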

(Figure: the original data set split into segments D1, D2, D3, …, Dn; in run i, segment Di is held out for testing and the rest are used for training.)

(Figure: Bayesian network learned for the heart disease domain, with nodes Age, Sex, ChestPain, RestBP, Cholesterol, BloodSugar, ECG, MaxHeartRate, Angina, OldPeak, STSlope, Vessels, Thal.)

Advantages of Using a Bayesian Network

• Efficiency in learning and query answering

• Combine knowledge engineering and statistical induction

• Algorithms for decision making, value of information, diagnosis and repair

(Heart disease data from the UCI repository; accuracy = 85%.)

When evaluating a Bayesian network, we examine the likelihood of the model B given the data D and try to maximize it:

LL(B | D) = Σ_{j=1..N} log P_B(a1^j,…,an^j, c^j)

When learning structure we also add a penalty for structure complexity and seek a balance between the two terms (MDL or a variant). The following properties follow:

• A Bayesian network minimizes the error over all the variables in the domain, not necessarily the local error of the class given the attributes (this is OK with enough data).

• Because of the penalty, a Bayesian network in effect looks at only a small subset of the variables that affect a given node (its Markov blanket)

Let’s look closely at the likelihood term. It decomposes as

LL(B | D) = Σ_j log P_B(c^j | a1^j,…,an^j) + Σ_j log P_B(a1^j,…,an^j)

• The first term estimates just what we want: the probability of the class given the attributes. The second term estimates the joint probability of the attributes.

• When there are many attributes, the second term starts to dominate (each summand is the log of the probability of a long attribute vector, so its magnitude grows with the number of attributes).

• Why not use just the first term? Because the conditional likelihood no longer factorizes according to the network structure, and the calculations become much harder.

The Naïve Bayesian Classifier

(Figure: naïve Bayes structure for the Pima Indians diabetes data from the UCI repository; the class Diabetes points to the features F1–F6: pregnant, glucose, insulin, mass, dpf, age.)

• Fixed structure encoding the assumption that features are independent of each other given the class.

• Learning amounts to estimating the parameters of P(Fi|C) for each feature Fi.

What do we gain?

• We ensure that in the learned network, the probability P(C|A1…An) will take every attribute into account.

• We will show polynomial time algorithm for learning the network.

• Estimates are robust: they consist of low-order statistics that require only a few instances to estimate reliably

• Naïve Bayes has proven to be a powerful classifier, often exceeding the accuracy of unrestricted Bayesian networks.
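A minimal sketch of a naïve Bayesian classifier for discrete features, using the factorization P(C) · Π_i P(Fi | C) with counting-based estimates (the toy data and feature names are made up for illustration):

```python
from collections import Counter
import math

class NaiveBayes:
    def fit(self, X, y):
        self.class_counts = Counter(y)
        self.n = len(y)
        # feature_counts[i][(c, v)] = number of class-c instances with Fi = v
        self.feature_counts = [Counter() for _ in range(len(X[0]))]
        for xs, c in zip(X, y):
            for i, v in enumerate(xs):
                self.feature_counts[i][(c, v)] += 1
        return self

    def log_posterior(self, xs, c, s=1.0):
        # log P(c) + sum_i log P(xi | c), with add-s smoothing to avoid log 0
        lp = math.log(self.class_counts[c] / self.n)
        for i, v in enumerate(xs):
            num = self.feature_counts[i][(c, v)] + s
            den = self.class_counts[c] + s * len({k[1] for k in self.feature_counts[i]})
            lp += math.log(num / den)
        return lp

    def predict(self, xs):
        return max(self.class_counts, key=lambda c: self.log_posterior(xs, c))

# Toy data: (chest_pain, high_cholesterol) -> heart disease present/absent
X = [(1, 1), (1, 0), (1, 1), (0, 0), (0, 0), (0, 1)]
y = ["present", "present", "present", "absent", "absent", "absent"]
model = NaiveBayes().fit(X, y)
print(model.predict((1, 1)))  # "present"
print(model.predict((0, 0)))  # "absent"
```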

The Naïve Bayesian Classifier (cont.)

(Figure: the same naïve Bayes structure, with the class C pointing to each of the features F1–F6.)

• Common practice is to estimate P(Fi = f | C = c) ≈ N(Fi = f, C = c) / N(C = c), the observed frequencies in the training data

• These estimates are identical to the MLE for multinomials

• Naïve Bayes encodes assumptions of independence that may be unreasonable:

Are pregnancy and age independent given diabetes?

Problem: the same evidence may be incorporated multiple times (e.g., a rare glucose level and a rare insulin level are correlated, yet each separately penalizes the class variable, over-penalizing it)

• The success of naïve Bayes is attributed to

• Robust estimation

• Decision may be correct even if probabilities are inaccurate

• Idea: improve on naïve Bayes by weakening the independence assumptions

Bayesian networks provide the appropriate mathematical language for this task

mass

dpf

pregnant

age

glucose

F1

F2

F4

F5

F6

insulin

F3

Tree Augmented Naïve Bayes (TAN)

• Approximate the dependence among features with a tree Bayes net

• Tree induction algorithm

• Optimality: maximum likelihood tree

• Efficiency: polynomial algorithm

• Robust parameter estimation

The procedure of Chow and Liu constructs a tree structure BT that maximizes LL(BT | D):

• Compute the mutual information between every pair of attributes: I(Ai; Aj) = Σ_{ai,aj} P(ai, aj) log [ P(ai, aj) / (P(ai) P(aj)) ]

• Build a complete undirected graph in which the vertices are the attributes and each edge is annotated with the corresponding mutual information as weight.

• Build a maximum weighted spanning tree of this graph.

Complexity: O(n^2 N) + O(n^2) + O(n^2 log n) = O(n^2 N), where n is the number of attributes and N is the sample size
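The three steps above can be sketched as follows. For TAN the same procedure is run with the conditional measure I(Ai; Aj | C) in place of I(Ai; Aj); the unconditional version shown here keeps the sketch short.

```python
from collections import Counter
import math

def mutual_information(data, i, j):
    # I(Ai; Aj) = sum over (a, b) of P(a, b) * log( P(a, b) / (P(a) P(b)) )
    N = len(data)
    pij = Counter((row[i], row[j]) for row in data)
    pi = Counter(row[i] for row in data)
    pj = Counter(row[j] for row in data)
    return sum((c / N) * math.log((c / N) / ((pi[a] / N) * (pj[b] / N)))
               for (a, b), c in pij.items())

def chow_liu_tree(data):
    n = len(data[0])
    # Complete graph on the attributes, edges weighted by mutual information.
    edges = sorted(((mutual_information(data, i, j), i, j)
                    for i in range(n) for j in range(i + 1, n)), reverse=True)
    # Kruskal's algorithm for the maximum-weight spanning tree.
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    tree = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((i, j))
    return tree

# A0 and A1 are perfectly correlated; A2 is independent of both.
data = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
tree = chow_liu_tree(data)
print(tree)  # the strongly dependent pair (0, 1) is always joined first
```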

It is easy to “plant” the optimal tree in the TAN by revising the algorithm to use a conditional measure that takes the class into account:

I(Ai; Aj | C) = Σ_{ai,aj,c} P(ai, aj, c) log [ P(ai, aj | c) / (P(ai | c) P(aj | c)) ]

This measures the gain in log-likelihood of adding Ai as a parent of Aj when C is already a parent.

When evaluating parameters we estimate the conditional probability P(Ai|Parents(Ai)). This is done by partitioning the data according to the possible values of Parents(Ai).

• When a partition contains just a few instances we get an unreliable estimate

• In naïve Bayes the partition was only on the values of the class variable (and we have to assume that is adequate)

• In TAN we have twice the number of partitions and get unreliable estimates, especially for small data sets.

Solution: use a smoothed estimate, a convex combination of the conditional and marginal frequencies:

theta_s(ai | pa) = [N(pa) / (N(pa) + s)] · P(ai | pa) + [s / (N(pa) + s)] · P(ai)

where s is the smoothing bias, typically small.
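A sketch of such a smoothed estimate, assuming the convex-combination form in which the estimate shrinks toward the marginal frequency when the partition is small (parameter names are illustrative):

```python
def smoothed_estimate(n_ai_pa, n_pa, p_marginal, s=5.0):
    # theta_s(ai | pa) = alpha * P(ai | pa) + (1 - alpha) * P(ai),
    # where alpha = N(pa) / (N(pa) + s): small partitions trust the data less.
    alpha = n_pa / (n_pa + s)
    cond = n_ai_pa / n_pa if n_pa > 0 else 0.0
    return alpha * cond + (1 - alpha) * p_marginal

# With a large partition the estimate follows the conditional frequency...
print(smoothed_estimate(80, 100, 0.5))   # ~0.786, close to 80/100
# ...with a tiny partition it falls back toward the marginal:
print(smoothed_estimate(2, 2, 0.5))      # ~0.643, pulled from 1.0 toward 0.5
```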


• 25 Data sets from UCI repository

• Medical

• Signal processing

• Financial

• Games

• Accuracy based on 5-fold cross-validation

• No parameter tuning

(Scatter plot: accuracy of TAN (x-axis) vs. Naïve Bayes (y-axis), both from 65% to 100%.)

(Scatter plot: accuracy of TAN (x-axis) vs. C4.5 (y-axis), both from 65% to 100%; same 25 UCI data sets and 5-fold cross-validation as above.)

• Can we do better by learning a more flexible structure?

• Experiment: learn a Bayesian network without restrictions on the structure

Performance: TAN vs. Bayesian Networks

(Scatter plot: accuracy of TAN (x-axis) vs. unrestricted Bayesian networks (y-axis), both from 65% to 100%; same 25 UCI data sets and 5-fold cross-validation as above.)

• Bayesian networks provide a useful language to improve Bayesian classifiers

• Lesson: we need to be aware of the task at hand, the amount of training data versus the dimensionality of the problem, and so on