
# Classification Algorithms - PowerPoint PPT Presentation



## PowerPoint Slideshow about 'Classification Algorithms' - guillermo


Presentation Transcript

Basic Principle (Inductive Learning Hypothesis): Any hypothesis found to approximate the target function well over a sufficiently large set of training examples will also approximate the target function well over other unobserved examples.

Typical Algorithms:

• Decision trees

• Rule-based induction

• Neural networks

• Memory (case) based reasoning

• Genetic algorithms

• Bayesian networks

General idea: Recursively partition data into sub-groups

• Select an attribute and formulate a logical test on that attribute.

• Branch on each outcome of the test, and move the subset of examples (training data) satisfying that outcome to the corresponding child node.

• Run recursively on each child node.

Termination rule specifies when to declare a leaf node.

Decision tree learning is a heuristic, one-step lookahead (hill climbing), non-backtracking search through the space of all possible decision trees.

Decision Tree: Example

Outlook = Sunny:

| Humidity = High: No

| Humidity = Normal: Yes

Outlook = Overcast: Yes

Outlook = Rain:

| Wind = Strong: No

| Wind = Weak: Yes

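| Day | Outlook | Temperature | Humidity | Wind | Play Tennis |
|---|---|---|---|---|---|
| 1 | Sunny | Hot | High | Weak | No |
| 2 | Sunny | Hot | High | Strong | No |
| 3 | Overcast | Hot | High | Weak | Yes |
| 4 | Rain | Mild | High | Weak | Yes |
| 5 | Rain | Cool | Normal | Weak | Yes |
| 6 | Rain | Cool | Normal | Strong | No |
| 7 | Overcast | Cool | Normal | Strong | Yes |
| 8 | Sunny | Mild | High | Weak | No |
| 9 | Sunny | Cool | Normal | Weak | Yes |
| 10 | Rain | Mild | Normal | Weak | Yes |
| 11 | Sunny | Mild | Normal | Strong | Yes |
| 12 | Overcast | Mild | High | Strong | Yes |
| 13 | Overcast | Hot | Normal | Weak | Yes |
| 14 | Rain | Mild | High | Strong | No |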

DecisionTree(examples) =
    Prune(Tree_Generation(examples))

Tree_Generation(examples) =
    IF termination_condition(examples)
    THEN leaf(majority_class(examples))
    ELSE
        LET Best_test = selection_function(examples)
        IN
            FOR EACH value v OF Best_test
                LET subtree_v = Tree_Generation({ e ∈ examples | e.Best_test = v })
            IN Node(Best_test, subtree_v)

Definition :

selection function: chooses the test used to partition the training data

termination condition: determines when to stop partitioning

pruning algorithm: attempts to prevent overfitting
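The Tree_Generation pseudocode can be sketched in Python roughly as follows. This is a minimal illustration, not the presenter's implementation: it assumes an entropy-based selection function, stops when a node is pure or no attributes remain, and omits pruning entirely. The tuple representation of a node and all function bodies are assumptions made for the example; the data is the PlayTennis table.

```python
from collections import Counter
from math import log2

ATTRS = ["Outlook", "Temperature", "Humidity", "Wind"]
# PlayTennis examples, days 1 to 14: attribute values followed by the class.
ROWS = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]
EXAMPLES = [dict(zip(ATTRS + ["Play"], row)) for row in ROWS]


def entropy(examples):
    """Entropy (in bits) of the class labels in a set of examples."""
    counts = Counter(e["Play"] for e in examples)
    n = len(examples)
    return -sum(c / n * log2(c / n) for c in counts.values())


def partition(examples, attr):
    """Group examples by their value for attr."""
    groups = {}
    for e in examples:
        groups.setdefault(e[attr], []).append(e)
    return groups


def selection_function(examples, attrs):
    """Pick the attribute with the lowest weighted child entropy
    (equivalently, the highest information gain)."""
    def remainder(attr):
        groups = partition(examples, attr).values()
        return sum(len(g) / len(examples) * entropy(g) for g in groups)
    return min(attrs, key=remainder)


def majority_class(examples):
    return Counter(e["Play"] for e in examples).most_common(1)[0][0]


def termination_condition(examples, attrs):
    # Assumed rule: stop on a pure node or when no attributes are left.
    return not attrs or len({e["Play"] for e in examples}) == 1


def tree_generation(examples, attrs):
    """Return a class label (leaf) or a (test, {value: subtree}) node."""
    if termination_condition(examples, attrs):
        return majority_class(examples)
    best = selection_function(examples, attrs)
    rest = [a for a in attrs if a != best]
    return (best, {v: tree_generation(subset, rest)
                   for v, subset in partition(examples, best).items()})


tree = tree_generation(EXAMPLES, ATTRS)
print(tree)
```

On this data the sketch splits on Outlook first, then on Humidity under Sunny and Wind under Rain, which matches the example tree.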

The basic approach to selecting an attribute is to examine each attribute and evaluate its likelihood of improving the overall decision performance of the tree.

The most widely used node-splitting evaluation functions work by reducing the degree of randomness or "impurity" in the current node:

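Entropy function (C4.5):

$$\mathrm{Entropy}(S) = -\sum_{i} p_i \log_2 p_i$$

where $p_i$ is the proportion of examples in $S$ that belong to class $i$.

Information gain:

$$D(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$

where $S_v$ is the subset of $S$ for which attribute $A$ has value $v$.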

• ID3 and C4.5 branch on every value and use an entropy minimisation heuristic to select the best attribute.

• CART branches on all values or on one value only, and uses entropy minimisation or the Gini function.

• GIDDY formulates a test by branching on a subset of attribute values (selection by entropy minimisation).

The algorithm searches through the space of possible decision trees from simplest to increasingly complex, guided by the information gain heuristic.

Outlook = Sunny: ? (examples {1, 2, 8, 9, 11})

Outlook = Overcast: Yes

Outlook = Rain: ? (examples {4, 5, 6, 10, 14})

D(Sunny, Humidity) = 0.97 - 3/5*0 - 2/5*0 = 0.97

D(Sunny, Temperature) = 0.97 - 2/5*0 - 2/5*1 - 1/5*0 = 0.57

D(Sunny, Wind) = 0.97 - 2/5*1.0 - 3/5*0.918 = 0.019
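The three values above can be checked with a short script. This is only a sketch: the tuple layout and the helper names `entropy` and `gain` are assumptions made for the example, and the rows are the five Sunny days (1, 2, 8, 9, 11) from the PlayTennis table.

```python
from collections import Counter
from math import log2

# Sunny examples: (Temperature, Humidity, Wind, PlayTennis)
SUNNY = [
    ("Hot", "High", "Weak", "No"),        # day 1
    ("Hot", "High", "Strong", "No"),      # day 2
    ("Mild", "High", "Weak", "No"),       # day 8
    ("Cool", "Normal", "Weak", "Yes"),    # day 9
    ("Mild", "Normal", "Strong", "Yes"),  # day 11
]
COLUMN = {"Temperature": 0, "Humidity": 1, "Wind": 2}


def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def gain(rows, attr):
    """Entropy of the node minus the weighted entropy of its children."""
    col = COLUMN[attr]
    result = entropy([r[-1] for r in rows])
    for value in {r[col] for r in rows}:
        subset = [r[-1] for r in rows if r[col] == value]
        result -= len(subset) / len(rows) * entropy(subset)
    return result


gains = {attr: gain(SUNNY, attr) for attr in COLUMN}
best = max(gains, key=gains.get)
for attr, g in gains.items():
    print(f"D(Sunny, {attr}) = {g:.3f}")
print("best test:", best)
```

Humidity has the largest gain, so it becomes the test under the Sunny branch. Without intermediate rounding the Wind value comes out closer to 0.020 than 0.019; the difference comes from rounding the node entropy to 0.97.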

Consider the error of a hypothesis h over:

• the training data: error_training(h)

• the entire distribution D of data: error_D(h)

Hypothesis h overfits the training data if there is an alternative hypothesis h' such that

error_training(h) < error_training(h')

and

error_D(h) > error_D(h')

Overfitting

We don't want these algorithms to fit to "noise".

The generated tree may overfit the training data:

• Too many branches, some of which may reflect anomalies due to noise or outliers

• The result is poor accuracy on unseen samples

Preventing Overfitting

[Figure: predicted versus actual classes, showing true positives and false negatives]

Evaluation of Classification Systems

Training Set: examples with class values for learning.

Test Set: examples with class values for evaluating.

Evaluation: Hypotheses are used to infer classification of examples in the test set; inferred classification is compared to known classification.

Accuracy: percentage of examples in the test set that are classified correctly.
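As a rough illustration of this evaluation procedure, the sketch below applies the PlayTennis tree from the earlier example to a small test set and reports accuracy. The four test examples are hypothetical, invented only to show the calculation, and the function names are assumptions.

```python
def play_tennis(x):
    """The PlayTennis decision tree from the earlier example, written as a function."""
    if x["Outlook"] == "Sunny":
        return "No" if x["Humidity"] == "High" else "Yes"
    if x["Outlook"] == "Overcast":
        return "Yes"
    # Outlook == "Rain"
    return "No" if x["Wind"] == "Strong" else "Yes"


def accuracy(hypothesis, test_set):
    """Percentage of test examples whose inferred class matches the known class."""
    correct = sum(1 for features, label in test_set
                  if hypothesis(features) == label)
    return 100.0 * correct / len(test_set)


# Hypothetical test set: (features, known class).
test_set = [
    ({"Outlook": "Sunny", "Humidity": "High", "Wind": "Weak"}, "No"),
    ({"Outlook": "Overcast", "Humidity": "High", "Wind": "Strong"}, "Yes"),
    ({"Outlook": "Rain", "Humidity": "Normal", "Wind": "Strong"}, "Yes"),
    ({"Outlook": "Rain", "Humidity": "High", "Wind": "Weak"}, "Yes"),
]

score = accuracy(play_tennis, test_set)
print(f"accuracy = {score:.1f}%")
```

The tree misclassifies the third example (Rain with Strong wind is predicted No), so three of the four hypothetical examples are correct.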

physician fee freeze = n:

| adoption of the budget resolution = y: democrat (151.0)

| adoption of the budget resolution = u: democrat (1.0)

| adoption of the budget resolution = n:

| | education spending = n: democrat (6.0)

| | education spending = y: democrat (9.0)

| | education spending = u: republican (1.0)

physician fee freeze = y:

| synfuels corporation cutback = n: republican (97.0/3.0)

| synfuels corporation cutback = u: republican (4.0)

| synfuels corporation cutback = y:

| | duty free exports = y: democrat (2.0)

| | duty free exports = u: republican (1.0)

| | duty free exports = n:

| | | education spending = n: democrat (5.0/2.0)

| | | education spending = y: republican (13.0/2.0)

| | | education spending = u: democrat (1.0)

physician fee freeze = u:

| water project cost sharing = n: democrat (0.0)

| water project cost sharing = y: democrat (4.0)

| water project cost sharing = u:

| | mx missile = n: republican (0.0)

| | mx missile = y: democrat (3.0/1.0)

| | mx missile = u: republican (2.0)

Simplified Decision Tree:

physician fee freeze = n: democrat (168.0/2.6)

physician fee freeze = y: republican (123.0/13.9)

physician fee freeze = u:

| mx missile = n: democrat (3.0/1.1)

| mx missile = y: democrat (4.0/2.2)

| mx missile = u: republican (2.0/1.0)

Evaluation on training data (300 items):

| | Size | Errors | Estimate |
|---|---|---|---|
| Before Pruning | 25 | 8 (2.7%) | |
| After Pruning | 7 | 13 (4.3%) | (6.9%) |

| Predicted class | Actual class + | Actual class - |
|---|---|---|
| Y | A: True + | B: False + |
| N | C: False - | D: True - |

Entries are counts of correct classifications and counts of errors.

• Other evaluation metrics

• True positive rate (TP) = A/(A+C) = 1 - false negative rate

• False positive rate (FP) = B/(B+D) = 1 - true negative rate

• Sensitivity = true positive rate

• Specificity = true negative rate

• Positive predictive value = A/(A+B)

• Recall = A/(A+C) = true positive rate = sensitivity

• Precision = A/(A+B) = PPV
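The metrics listed above follow directly from the four confusion-matrix counts. The sketch below computes them for hypothetical counts (A = 40, B = 10, C = 20, D = 30); the function name and dictionary keys are assumptions made for the example.

```python
def confusion_metrics(a, b, c, d):
    """Metrics from confusion-matrix counts:
    a = true +, b = false +, c = false -, d = true -."""
    return {
        "true_positive_rate": a / (a + c),    # sensitivity, recall
        "false_negative_rate": c / (a + c),
        "false_positive_rate": b / (b + d),
        "specificity": d / (b + d),           # true negative rate
        "precision": a / (a + b),             # positive predictive value
        "accuracy": (a + d) / (a + b + c + d),
    }


# Hypothetical counts, chosen only to illustrate the formulas.
m = confusion_metrics(a=40, b=10, c=20, d=30)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

The two rates within each actual class sum to one, which is why the true positive rate equals one minus the false negative rate and the false positive rate equals one minus specificity.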

• Prior probabilities, approximated by class frequencies: P(+), P(-)

• Likelihoods, approximated using error frequencies: P(Y | +), P(Y | -)

• Posterior probabilities: P(+ | Y), P(- | N)
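These quantities are linked by Bayes' rule. The derivation below is a sketch that assumes the probabilities are estimated from the confusion-matrix counts A to D defined earlier, with N = A + B + C + D:

$$P(+ \mid Y) = \frac{P(Y \mid +)\,P(+)}{P(Y \mid +)\,P(+) + P(Y \mid -)\,P(-)}$$

Substituting the frequency estimates $P(+) \approx \frac{A+C}{N}$, $P(-) \approx \frac{B+D}{N}$, $P(Y \mid +) \approx \frac{A}{A+C}$ and $P(Y \mid -) \approx \frac{B}{B+D}$ gives

$$P(+ \mid Y) \approx \frac{\frac{A}{A+C}\cdot\frac{A+C}{N}}{\frac{A}{A+C}\cdot\frac{A+C}{N} + \frac{B}{B+D}\cdot\frac{B+D}{N}} = \frac{A}{A+B}$$

which is the precision (positive predictive value) from the list of metrics.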

Model Evaluation within Context

• Class distribution: defined for a particular training set

• Confusion matrix: defined for a particular classifier

• Benefits of correct classification and costs of incorrect classification

Calculate expected profit:

profit = P(+) * (TP * B(Y, +) + (1 - TP) * C(N, +))

+ P(-) * ((1 - FP) * B(N, -) + FP * C(Y, -))

Choose the classifier that maximises profit.
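The expected-profit comparison can be sketched as below. Everything numeric here is hypothetical: the class prior, the two classifiers' TP and FP rates, and the benefit and cost values are invented for illustration. Because the formula adds the cost terms, costs are assumed to be entered as negative numbers.

```python
def expected_profit(p_pos, tp, fp, benefit, cost):
    """profit = P(+)*(TP*B(Y,+) + (1-TP)*C(N,+))
              + P(-)*((1-FP)*B(N,-) + FP*C(Y,-))"""
    p_neg = 1.0 - p_pos
    return (p_pos * (tp * benefit[("Y", "+")] + (1 - tp) * cost[("N", "+")])
            + p_neg * ((1 - fp) * benefit[("N", "-")] + fp * cost[("Y", "-")]))


# Hypothetical benefits and costs, keyed by (predicted, actual).
benefit = {("Y", "+"): 20.0, ("N", "-"): 0.0}
cost = {("N", "+"): 0.0, ("Y", "-"): -5.0}  # costs entered as negative profit

# Hypothetical classifiers described by their TP and FP rates.
classifiers = {
    "classifier_1": {"tp": 0.80, "fp": 0.20},
    "classifier_2": {"tp": 0.60, "fp": 0.05},
}

p_pos = 0.3  # hypothetical prior P(+)
profits = {name: expected_profit(p_pos, r["tp"], r["fp"], benefit, cost)
           for name, r in classifiers.items()}
best = max(profits, key=profits.get)
print(profits, "->", best)
```

With these made-up numbers the first classifier wins despite its higher false positive rate, because a missed positive forfeits more than a false alarm costs. Changing the prior or the cost values can reverse the choice, which is the point of evaluating within context.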

retain training data points; each potentially affects the estimation at new point