Classification Algorithms: Decision Trees
Basic Principle (Inductive Learning Hypothesis): Any hypothesis found to approximate the target function well over a sufficiently large set of training examples will also approximate the target function well over other unobserved examples.
Typical Algorithms:
General idea: Recursively partition data into subgroups
• Select an attribute and formulate a logical test on that attribute.
• Branch on each outcome of test, move subset of examples (training data) satisfying that outcome to the corresponding child node.
• Run recursively on each child node.
Termination rule specifies when to declare a leaf node.
Decision tree learning is a heuristic, one-step lookahead (hill-climbing), non-backtracking search through the space of all possible decision trees.
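[Figure: decision tree for the Play Tennis data. Outlook = Sunny: test Humidity (High: No, Normal: Yes). Outlook = Overcast: Yes. Outlook = Rain: test Wind (Strong: No, Weak: Yes).]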
Decision Tree: Example
Day Outlook Temperature Humidity Wind Play Tennis
1 Sunny Hot High Weak No
2 Sunny Hot High Strong No
3 Overcast Hot High Weak Yes
4 Rain Mild High Weak Yes
5 Rain Cool Normal Weak Yes
6 Rain Cool Normal Strong No
7 Overcast Cool Normal Strong Yes
8 Sunny Mild High Weak No
9 Sunny Cool Normal Weak Yes
10 Rain Mild Normal Weak Yes
11 Sunny Mild Normal Strong Yes
12 Overcast Mild High Strong Yes
13 Overcast Hot Normal Weak Yes
14 Rain Mild High Strong No
DecisionTree(examples) =
    Prune(Tree_Generation(examples))
Tree_Generation(examples) =
    IF termination_condition(examples)
    THEN leaf(majority_class(examples))
    ELSE
        LET
            Best_test = selection_function(examples)
        IN
            FOR EACH value v OF Best_test
                LET subtree_v = Tree_Generation({ e ∈ examples | e.Best_test = v })
            IN Node(Best_test, subtree_v)
Definitions:
selection function: chooses the test used to partition the training data
termination condition: determines when to stop partitioning
pruning algorithm: attempts to prevent overfitting
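The Tree_Generation pseudocode above can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the slide author's implementation: examples are assumed to be dicts, the termination condition is assumed to be "all examples share one class, or no attributes remain", the selection function is passed in as a parameter, and pruning is omitted. The names tree_generation, majority_class, classify, and first_attribute are hypothetical.

```python
from collections import Counter

def majority_class(examples, target):
    """Most frequent class label among the examples."""
    return Counter(e[target] for e in examples).most_common(1)[0][0]

def tree_generation(examples, attributes, target, select):
    """Recursive partitioning; `select` plays the role of selection_function."""
    classes = {e[target] for e in examples}
    # Assumed termination condition: pure node, or no attributes left to test.
    if len(classes) == 1 or not attributes:
        return majority_class(examples, target)  # leaf
    best = select(examples, attributes, target)
    remaining = [a for a in attributes if a != best]
    subtrees = {}
    for v in sorted({e[best] for e in examples}):
        subset = [e for e in examples if e[best] == v]
        subtrees[v] = tree_generation(subset, remaining, target, select)
    return (best, subtrees)  # internal node

def classify(tree, example):
    """Follow branches until a leaf (a plain class label) is reached."""
    while isinstance(tree, tuple):
        attribute, subtrees = tree
        tree = subtrees[example[attribute]]
    return tree

# Hypothetical selector for the demo: take the first remaining attribute.
def first_attribute(examples, attributes, target):
    return attributes[0]

# Sunny-day rows (days 1, 2, 8, 9, 11) of the Play Tennis table.
sunny = [
    {"Humidity": "High", "Wind": "Weak", "Play": "No"},
    {"Humidity": "High", "Wind": "Strong", "Play": "No"},
    {"Humidity": "High", "Wind": "Weak", "Play": "No"},
    {"Humidity": "Normal", "Wind": "Weak", "Play": "Yes"},
    {"Humidity": "Normal", "Wind": "Strong", "Play": "Yes"},
]
tree = tree_generation(sunny, ["Humidity", "Wind"], "Play", first_attribute)
print(tree)
```

In practice the selector would be an impurity-based function rather than first_attribute; passing it as a parameter keeps the recursion independent of that choice.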
The basic approach to selecting an attribute is to examine each attribute and evaluate how likely it is to improve the overall decision performance of the tree.
The most widely used node-splitting evaluation functions work by reducing the degree of randomness or "impurity" in the current node:
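Entropy function (C4.5): $Entropy(S) = -\sum_i p_i \log_2 p_i$, where $p_i$ is the fraction of examples in $S$ that belong to class $i$.
Information gain: $D(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)$, where $S_v$ is the subset of $S$ for which attribute $A$ has value $v$.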
The algorithm searches through the space of possible decision trees from simplest to increasingly complex, guided by the information gain heuristic.
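[Figure: partial tree after splitting on Outlook. Sunny: examples {1, 2, 8, 9, 11}, subtree still to be determined (?). Overcast: Yes. Rain: examples {4, 5, 6, 10, 14}, subtree still to be determined (?).]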
D (Sunny, Humidity) = 0.97 - 3/5*0 - 2/5*0 = 0.97
D (Sunny, Temperature) = 0.97 - 2/5*0 - 2/5*1 - 1/5*0.0 = 0.57
D (Sunny, Wind) = 0.97 - 2/5*1.0 - 3/5*0.918 = 0.019
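The three values above can be checked with a short Python sketch. It assumes that D denotes information gain computed from Shannon entropy in bits; the function names entropy and information_gain are hypothetical. Because the slide rounds the node entropy to 0.97 before subtracting, the unrounded result for Wind comes out as roughly 0.020 rather than 0.019.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attribute, target="Play"):
    """Entropy of the node minus the weighted entropy of its children."""
    gain = entropy([r[target] for r in rows])
    for v in {r[attribute] for r in rows}:
        child = [r[target] for r in rows if r[attribute] == v]
        gain -= len(child) / len(rows) * entropy(child)
    return gain

# Sunny-day rows (days 1, 2, 8, 9, 11) of the Play Tennis table.
sunny = [
    {"Temperature": "Hot", "Humidity": "High", "Wind": "Weak", "Play": "No"},
    {"Temperature": "Hot", "Humidity": "High", "Wind": "Strong", "Play": "No"},
    {"Temperature": "Mild", "Humidity": "High", "Wind": "Weak", "Play": "No"},
    {"Temperature": "Cool", "Humidity": "Normal", "Wind": "Weak", "Play": "Yes"},
    {"Temperature": "Mild", "Humidity": "Normal", "Wind": "Strong", "Play": "Yes"},
]
gains = {a: information_gain(sunny, a) for a in ("Humidity", "Temperature", "Wind")}
best = max(gains, key=gains.get)
print(best, {a: round(g, 3) for a, g in gains.items()})
```

Humidity has the largest gain, so it is the attribute tested under the Sunny branch.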
Consider the error of hypothesis h over
training data: error_training (h)
entire distribution D of data: error_D (h)
Hypothesis h overfits the training data if there is an alternative hypothesis h' such that
error_training (h) < error_training (h') and
error_D (h) > error_D (h')
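The definition can be read as a simple predicate over two hypotheses. The sketch below is only an illustration: the function name overfits is hypothetical, and the error rates are made-up numbers, since error_D can in practice only be estimated (for example on a held-out test set).

```python
def overfits(h, h_alt, error_training, error_d):
    """True if h_alt shows that h overfits: h is better on the training
    data but worse over the whole distribution."""
    return (error_training[h] < error_training[h_alt]
            and error_d[h] > error_d[h_alt])

# Hypothetical error rates for two hypotheses (illustrative values only).
error_training = {"full_tree": 0.03, "pruned_tree": 0.04}
error_d = {"full_tree": 0.12, "pruned_tree": 0.07}

print(overfits("full_tree", "pruned_tree", error_training, error_d))
```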
Overfitting
We don't want these algorithms to fit to "noise".
The generated tree may overfit the training data:
Too many branches, some of which may reflect anomalies due to noise or outliers.
The result is poor accuracy on unseen samples.
Preventing Overfitting
[Figure: confusion matrix of predicted versus actual class; surviving cell labels: True Positives, False Negatives.]
Evaluation of Classification Systems
Training Set: examples with class values for learning.
Test Set: examples with class values for evaluating.
Evaluation: Hypotheses are used to infer classification of examples in the test set; inferred classification is compared to known classification.
Accuracy: percentage of examples in the test set that are classified correctly.
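As a minimal sketch of the accuracy measure, the function below compares inferred labels against the known labels of a test set. The function name accuracy and the example labels are hypothetical.

```python
def accuracy(predicted, actual):
    """Percentage of test examples whose inferred class matches the known class."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty label lists of equal length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)

# Hypothetical test-set labels (illustrative values only).
predicted = ["Yes", "No", "Yes", "Yes"]
actual = ["Yes", "No", "No", "Yes"]
print(accuracy(predicted, actual))
```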
physician fee freeze = n:
 adoption of the budget resolution = y: democrat (151.0)
 adoption of the budget resolution = u: democrat (1.0)
 adoption of the budget resolution = n:
  education spending = n: democrat (6.0)
  education spending = y: democrat (9.0)
  education spending = u: republican (1.0)
physician fee freeze = y:
 synfuels corporation cutback = n: republican (97.0/3.0)
 synfuels corporation cutback = u: republican (4.0)
 synfuels corporation cutback = y:
  duty free exports = y: democrat (2.0)
  duty free exports = u: republican (1.0)
  duty free exports = n:
   education spending = n: democrat (5.0/2.0)
   education spending = y: republican (13.0/2.0)
   education spending = u: democrat (1.0)
physician fee freeze = u:
 water project cost sharing = n: democrat (0.0)
 water project cost sharing = y: democrat (4.0)
 water project cost sharing = u:
  mx missile = n: republican (0.0)
  mx missile = y: democrat (3.0/1.0)
  mx missile = u: republican (2.0)
Simplified Decision Tree:
physician fee freeze = n: democrat (168.0/2.6)
physician fee freeze = y: republican (123.0/13.9)
physician fee freeze = u:
 mx missile = n: democrat (3.0/1.1)
 mx missile = y: democrat (4.0/2.2)
 mx missile = u: republican (2.0/1.0)
Evaluation on training data (300 items):
Before Pruning        After Pruning
Size   Errors         Size   Errors      Estimate
25     8 (2.7%)       7      13 (4.3%)   (6.9%)   <
Confusion matrix (entries are counts of correct classifications and counts of errors):
                      Actual class
                      +              -
Predicted class   Y   A: True +      B: False +
                  N   C: False -     D: True -
Prior probabilities, approximated by class frequencies: P(+), P(-)
Likelihoods, approximated using error frequencies: P(Y | +), P(Y | -)
Posterior probabilities: P(+ | Y), P(- | N)
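One way to read these quantities off the confusion matrix is sketched below. It assumes the cell naming A (true +), B (false +), C (false -), D (true -), with rows as predicted class (Y, N) and columns as actual class (+, -). The function name and the counts are hypothetical.

```python
def matrix_probabilities(a, b, c, d):
    """Frequencies from a 2x2 confusion matrix.

    a: predicted Y, actual +   (true +)
    b: predicted Y, actual -   (false +)
    c: predicted N, actual +   (false -)
    d: predicted N, actual -   (true -)
    """
    n = a + b + c + d
    return {
        "P(+)": (a + c) / n,      # prior: class frequency of +
        "P(-)": (b + d) / n,      # prior: class frequency of -
        "P(Y|+)": a / (a + c),    # likelihood (true positive rate)
        "P(Y|-)": b / (b + d),    # likelihood (false positive rate)
        "P(+|Y)": a / (a + b),    # posterior given a Y prediction
        "P(-|N)": d / (c + d),    # posterior given an N prediction
    }

# Hypothetical counts (illustrative values only).
probs = matrix_probabilities(a=40, b=10, c=20, d=30)
for name, value in probs.items():
    print(name, round(value, 3))
```

The posterior agrees with Bayes' rule: P(+|Y) = P(Y|+) P(+) / (P(Y|+) P(+) + P(Y|-) P(-)).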
Class distribution: defined for a particular training set
Confusion matrix: defined for a particular classifier
Calculate expected profit:
profit = P(+)*(TP*B(Y, +) + (1-TP)*C(N, +))
       + P(-)*((1-FP)*B(N, -) + FP*C(Y, -))
Choose the classifier that maximises profit
Model Evaluation within Context
Benefits of correct classification
Costs of incorrect classification
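The expected-profit comparison can be sketched as follows. The sketch assumes that TP and FP are the true positive and false positive rates, and that costs are entered as negative numbers because the formula adds them. The function name, the benefit and cost values, and the two classifiers are hypothetical.

```python
def expected_profit(p_pos, tp, fp, benefit, cost):
    """Expected profit of a classifier with true positive rate tp and
    false positive rate fp, given the prior P(+) = p_pos."""
    p_neg = 1.0 - p_pos
    return (p_pos * (tp * benefit[("Y", "+")] + (1 - tp) * cost[("N", "+")])
            + p_neg * ((1 - fp) * benefit[("N", "-")] + fp * cost[("Y", "-")]))

# Hypothetical benefits and costs (costs negative), illustrative values only.
benefit = {("Y", "+"): 10.0, ("N", "-"): 0.0}
cost = {("N", "+"): -5.0, ("Y", "-"): -2.0}

# Two hypothetical classifiers described by their (TP rate, FP rate).
classifiers = {"a": (0.8, 0.3), "b": (0.6, 0.1)}
profits = {name: expected_profit(0.5, tp, fp, benefit, cost)
           for name, (tp, fp) in classifiers.items()}
best = max(profits, key=profits.get)
print(best, {name: round(p, 2) for name, p in profits.items()})
```

With these numbers classifier "a" is chosen even though it has the higher false positive rate, because missed positives are assumed to cost more than false alarms.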
Retain training data points; each potentially affects the estimate at a new point.