Supervised Learning


# Supervised Learning


### Supervised Learning

Introduction
• Key idea
• Known target concept (predict certain attribute)
• Find out how other attributes can be used
• Algorithms
• Rudimentary Rules (e.g., 1R)
• Statistical Modeling (e.g., Naïve Bayes)
• Divide and Conquer: Decision Trees
• Instance-Based Learning
• Neural Networks
• Support Vector Machines
1-Rule
• Generate a one-level decision tree
• One attribute
• Performs quite well!
• Basic idea:
• Rules testing a single attribute
• Classify according to frequency in training data
• Evaluate error rate for each attribute
• Choose the best attribute
• That’s all folks!
Apply 1R

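| # | Attribute | Rule | Errors | Total errors |
|---|-------------|----------------|--------|--------------|
| 1 | outlook | sunny → no | 2/5 | 4/14 |
| | | overcast → yes | 0/4 | |
| | | rainy → yes | 2/5 | |
| 2 | temperature | hot → no | 2/4 | 5/14 |
| | | mild → yes | 2/6 | |
| | | cool → yes | 1/4 | |
| 3 | humidity | high → no | 3/7 | 4/14 |
| | | normal → yes | 1/7 | |
| 4 | windy | false → yes | 2/8 | 5/14 |
| | | true → no | 3/6 | |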
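The 1R procedure above can be sketched in a few lines of Python. This is a minimal sketch, assuming the slides use the standard 14-instance weather data set (its counts match the error table); the names `DATA`, `ATTRS`, and `one_r` are illustrative and not from the slides. Ties between attributes are broken in favor of the first one listed.

```python
from collections import Counter, defaultdict

# Assumed data: the standard 14-instance weather data (last column = play).
ATTRS = ["outlook", "temperature", "humidity", "windy"]
DATA = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]

def one_r(data, attrs):
    """Return (attribute, rules, errors) for the single best attribute."""
    best = None
    for i, name in enumerate(attrs):
        counts = defaultdict(Counter)            # value -> class frequencies
        for row in data:
            counts[row[i]][row[-1]] += 1
        # Each attribute value predicts its most frequent class.
        rules = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        # Errors = instances that are not in the majority class of their value.
        errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
        if best is None or errors < best[2]:     # strict '<': first attribute wins ties
            best = (name, rules, errors)
    return best

best_attr, best_rules, best_errors = one_r(DATA, ATTRS)
print(best_attr, best_rules, f"{best_errors}/{len(DATA)}")
```

Outlook and humidity both make 4 errors out of 14; the sketch reports outlook only because it is checked first.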

Other Features
• Numeric Values
• Discretization :
• Sort training data
• Split range into categories
• Missing Values
• “Dummy” attribute
Naïve Bayes Classifier
• Allow all attributes to contribute equally
• Assumes
• All attributes equally important
• All attributes independent
• Realistic?
• Selection of attributes
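Bayes Theorem

$$P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)}$$

• H: hypothesis; E: evidence
• P(H | E): posterior probability (conditional probability of H given E)
• P(H): prior probability

Maximum a Posteriori (MAP)

$$h_{MAP} = \arg\max_{H} P(H \mid E) = \arg\max_{H} P(E \mid H)\,P(H)$$

Maximum Likelihood (ML)

$$h_{ML} = \arg\max_{H} P(E \mid H)$$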

Classification
• Want to classify a new instance (a1, a2,…, an) into finite number of categories from the set V.
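• Bayesian approach: Assign the most probable category vMAP given (a1, a2,…, an):

$$v_{MAP} = \arg\max_{v_j \in V} P(v_j \mid a_1,\dots,a_n) = \arg\max_{v_j \in V} P(a_1,\dots,a_n \mid v_j)\,P(v_j)$$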
• Can we estimate the probabilities from the training data?
Naïve Bayes Classifier
• Second probability easy to estimate
• How?
• The first probability difficult to estimate
• Why?
• Assume independence (this is the naïve bit):

$$P(a_1,\dots,a_n \mid v_j) = \prod_{i=1}^{n} P(a_i \mid v_j)$$
Estimation
• Given a new instance with
• outlook=sunny,
• temperature=high,
• humidity=high,
• windy=true
Normalization
• Note that we can normalize to get the probabilities:
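The estimation and normalization steps can be worked through numerically. The sketch below assumes the standard 14-instance weather data and estimates every probability by a raw frequency. Because `high` is not one of that data set's temperature values (hot, mild, cool), the sketch assumes `temperature=cool` for the new instance; that choice, and the function names, are illustrative assumptions.

```python
from collections import Counter

# Assumed data: the standard 14-instance weather data (last column = play).
DATA = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]

def naive_bayes_scores(data, instance):
    """Unnormalized P(v_j) * prod_i P(a_i | v_j), using raw frequencies."""
    class_counts = Counter(row[-1] for row in data)
    scores = {}
    for cls, n_cls in class_counts.items():
        score = n_cls / len(data)                          # prior P(v_j)
        for i, value in enumerate(instance):
            matches = sum(1 for r in data if r[-1] == cls and r[i] == value)
            score *= matches / n_cls                       # P(a_i | v_j)
        scores[cls] = score
    return scores

def normalize(scores):
    """Rescale the scores so that they sum to one."""
    total = sum(scores.values())
    return {cls: s / total for cls, s in scores.items()}

instance = ("sunny", "cool", "high", "true")   # temperature value is an assumption
scores = naive_bayes_scores(DATA, instance)
probs = normalize(scores)
print(scores)   # roughly {'no': 0.0206, 'yes': 0.0053}
print(probs)    # roughly {'no': 0.795, 'yes': 0.205}
```

Under these assumptions the classifier predicts play=no for the new instance.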
Problems ….
• Suppose we had the following training data:

Now what?

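Laplace Estimator
• Replace estimates $\frac{n_c}{n}$ with $\frac{n_c + 1}{n + k}$, where n_c is the number of instances of the class that have the attribute value, n is the number of instances of the class, and k is the number of values the attribute can take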

Numeric Values
• Assume a probability distribution for the numeric attributes → density f(x)
• normal
• fit a distribution (better)
• Similarly as before
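For the normal-distribution case, the density f(x) simply replaces the frequency estimate P(a_i | v_j) in the product. A minimal sketch, assuming a normal fit by sample mean and sample standard deviation; the temperature readings below are illustrative values, not taken from the slides.

```python
import math
import statistics

def gaussian_density(x, mean, std):
    """Normal density f(x) with the given mean and standard deviation."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)

# Illustrative numeric temperatures for the instances of one class (play = yes).
temps_yes = [83, 70, 68, 64, 69, 75, 75, 72, 81]

mean_yes = statistics.mean(temps_yes)    # sample mean
std_yes = statistics.stdev(temps_yes)    # sample standard deviation (n - 1)

# Density used in place of P(temperature = 66 | yes) in the Naive Bayes product.
density = gaussian_density(66, mean_yes, std_yes)
print(mean_yes, round(std_yes, 2), round(density, 4))
```

The density is not a probability, but since the same substitution is made for every class the normalization step still yields comparable scores.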
Discussion
• Simple methodology
• Powerful - good results in practice
• Missing values no problem
• Not so good if independence assumption is severely violated
• Extreme case: multiple attributes with same values
• Solutions:
• Preselect which attributes to use
• Non-naïve Bayesian methods: networks
Decision Tree Learning
• Basic Algorithm:
• Select an attribute to be tested
• If classification achieved return classification
• Otherwise, branch by setting attribute to each of the possible values
• Repeat with branch as your new tree
• Main issue: how to select attributes
Deciding on Branching
• What do we want to accomplish?
• Make good predictions
• Obtain simple to interpret rules
• No diversity (impurity) is best
• Lowest impurity: all instances in the same class
• Highest impurity: all classes equally likely
• Goal: select attributes to reduce impurity
Measuring Impurity/Diversity
• Let's say we only have two classes, with proportions p and 1 − p:
• Minimum: min(p, 1 − p)
• Gini index/Simpson diversity index: 2p(1 − p)
• Entropy: −p log2 p − (1 − p) log2 (1 − p)
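Impurity Functions

[Figure: entropy, Gini index, and minimum plotted as impurity functions]

Entropy

$$\mathrm{Entropy}(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$$

• S: training data (instances)
• c: number of classes
• p_i: proportion of S classified as i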

• Entropy is a measure of impurity in the training data S
• Measured in bits of information needed to encode a member of S
• Extreme cases
• All member same classification (Note: 0·log 0 = 0)
• All classifications equally frequent
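Expected Information Gain

$$\mathrm{Gain}(S,a) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(a)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$

• Values(a): all possible values for attribute a
• S_v: the instances in S for which attribute a has value v
• Gain(S,a) is the expected information provided about the classification from knowing the value of attribute a (reduction in number of bits needed)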
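Entropy and expected information gain can be computed directly from class counts. The sketch below assumes the standard 14-instance weather data; `entropy` and `info_gain` are illustrative names.

```python
import math
from collections import Counter

# Assumed data: the standard 14-instance weather data (last column = play).
ATTRS = ["outlook", "temperature", "humidity", "windy"]
DATA = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]

def entropy(labels):
    """Entropy in bits; classes with zero count never appear, so 0*log 0 is skipped."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(data, i):
    """Entropy of the whole set minus the weighted entropy after splitting on attribute i."""
    remainder = 0.0
    for value in {row[i] for row in data}:
        subset = [row[-1] for row in data if row[i] == value]
        remainder += len(subset) / len(data) * entropy(subset)
    return entropy([row[-1] for row in data]) - remainder

gains = {name: info_gain(DATA, i) for i, name in enumerate(ATTRS)}
for name, gain in gains.items():
    print(f"{name}: {gain:.3f} bits")
```

On this data outlook has the largest gain, which is why it ends up at the root of the tree.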

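Decision Tree: Root Node

[Figure: root split on Outlook, with the class of each training instance shown under its branch]
• Sunny: Yes, Yes, No, No, No
• Overcast: Yes, Yes, Yes, Yes
• Rainy: Yes, Yes, Yes, No, No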

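Next Level

[Figure: Outlook at the root (Rainy, Sunny, Overcast) with a candidate second-level split on Temperature; instance classes shown under it: No, No, Yes, No, Yes]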

Final Tree

[Figure: final decision tree]
• Outlook = Sunny → test Humidity: High → No, Normal → Yes
• Outlook = Overcast → Yes
• Outlook = Rainy → test Windy: True → No, False → Yes
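The recursive construction (pick the attribute with the highest information gain, branch on its values, repeat) can be sketched as below. This is a minimal ID3-style sketch without pruning, assuming the standard 14-instance weather data; the nested-dictionary tree representation is an illustrative choice.

```python
import math
from collections import Counter

# Assumed data: the standard 14-instance weather data (last column = play).
ATTRS = ["outlook", "temperature", "humidity", "windy"]
DATA = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(data, i):
    remainder = 0.0
    for value in {row[i] for row in data}:
        subset = [row[-1] for row in data if row[i] == value]
        remainder += len(subset) / len(data) * entropy(subset)
    return entropy([row[-1] for row in data]) - remainder

def id3(data, attr_indices):
    """Return a class label (leaf) or {attribute: {value: subtree}}."""
    labels = [row[-1] for row in data]
    if len(set(labels)) == 1:                    # classification achieved
        return labels[0]
    if not attr_indices:                         # no attributes left: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attr_indices, key=lambda i: info_gain(data, i))
    remaining = [i for i in attr_indices if i != best]
    branches = {}
    for value in sorted({row[best] for row in data}):
        subset = [row for row in data if row[best] == value]
        branches[value] = id3(subset, remaining)  # repeat with the branch as the new tree
    return {ATTRS[best]: branches}

tree = id3(DATA, list(range(len(ATTRS))))
print(tree)
```

On this data the sketch reproduces the tree with outlook at the root, humidity under sunny, and windy under rainy.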

What’s in a Tree?
• Our final decision tree correctly classifies every instance
• Is this good?
• Two important concepts:
• Overfitting
• Pruning
Overfitting
• Two sources of abnormalities
• Noise (randomness)
• Outliers (measurement errors)
• Chasing every abnormality causes overfitting
• Tree too large and complex
• Does not generalize to new data
• Solution: prune the tree
Pruning
• Prepruning
• Halt construction of decision tree early
• Use same measure as in determining attributes, e.g., halt if InfoGain < K
• Most frequent class becomes the leaf node
• Postpruning
• Construct complete decision tree
• Prune it back
• Prune to minimize expected error rates
• Prune to minimize bits of encoding (Minimum Description Length principle)
Scalability
• Need to design for large amounts of data
• Two things to worry about
• Large number of attributes
• Leads to a large tree (prepruning?)
• Takes a long time
• Large amounts of data
• Can the data be kept in memory?
• Some new algorithms do not require all the data to be memory resident
Discussion: Decision Trees
• The most popular methods
• Quite effective
• Relatively simple
• Have discussed in detail the ID3 algorithm:
• Information gain to select attributes
• No pruning
• Only handles nominal attributes
Selecting Split Attributes
• Other Univariate splits
• Gain Ratio: C4.5 Algorithm (J48 in Weka)
• CART (not in Weka)
• Multivariate splits
• May be possible to obtain better splits by considering two or more attributes simultaneously
Instance-Based Learning
• Classification
• Do not construct an explicit description of how to classify
• Store all training data (learning)
• New example: find most similar instance
• computing done at time of classification
• k-nearest neighbor
K-Nearest Neighbor
• Each instance lives in n-dimensional space
• Distance between instances, e.g., Euclidean distance: $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
Example: nearest neighbor

[Figure: query point xq surrounded by training instances labeled + and −]
• 1-Nearest neighbor?
• 6-Nearest neighbor?

Normalizing
• Some attributes may take large values and others small
• Normalize
• All attributes on equal footing
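The effect of normalization on k-nearest neighbor can be seen in a small sketch. It assumes Euclidean distance and min-max scaling to [0, 1], which is one common choice among several; the (income, age) data and all names are hypothetical.

```python
import math
from collections import Counter

def min_max_normalize(rows):
    """Scale each attribute to [0, 1]; also return the scaler for new instances."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]

    def scale(row):
        return tuple((x - lo) / (hi - lo) if hi > lo else 0.0
                     for x, lo, hi in zip(row, lows, highs))

    return [scale(row) for row in rows], scale

def knn_predict(train_x, train_y, query, k):
    """Majority class among the k training instances closest to the query."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical data: (income, age); the class depends on age only.
train_x = [(30000, 25), (32000, 60), (90000, 30), (88000, 65)]
train_y = ["-", "+", "-", "+"]
query = (30500, 61)

# Without normalization, income dominates the distance.
raw_prediction = knn_predict(train_x, train_y, query, k=1)

# With normalization, both attributes are on an equal footing.
scaled_x, scale = min_max_normalize(train_x)
scaled_prediction = knn_predict(scaled_x, train_y, scale(query), k=1)

print(raw_prediction, scaled_prediction)
```

All the work happens inside `knn_predict`, at classification time; "learning" is just storing `train_x` and `train_y`.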
Other Methods for Supervised Learning
• Neural networks
• Support vector machines
• Optimization
• Rough set approach
• Fuzzy set approach
Evaluating the Learning
• Measure of performance
• Classification: error rate
• Resubstitution error
• Performance on training set
• Poor predictor of future performance
• Overfitting
• Useless for evaluation
Test Set
• Need a set of test instances
• Independent of training set instances
• Representative of underlying structure
• Sometimes: validation data
• Fine-tune parameters
• Independent of training and test data
• Plentiful data - no problem!
Holdout Procedures
• Common case: data set large but limited
• Usual procedure:
• Reserve some data for testing
• Use remaining data for training
• Problems:
• Want both sets as large as possible
• Want both sets to be representative
"Smart" Holdout
• Simple check: Are the proportions of classes about the same in each data set?
• Stratified holdout
• Guarantee that classes are (approximately) proportionally represented
• Repeated holdout
• Randomly select holdout set several times and average the error rate estimates
Holdout w/ Cross-Validation
• Cross-validation
• Fixed number of partitions of the data (folds)
• In turn: each partition used for testing and remaining instances for training
• May use stratification and randomization
• Standard practice:
• Stratified tenfold cross-validation
• Instances divided randomly into the ten partitions
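Cross Validation

[Figure: tenfold cross-validation]
• Fold 1: train on 90% of the data → model → test on the remaining 10% of the data → error rate e1
• Fold 2: train on 90% of the data → model → test on the remaining 10% of the data → error rate e2
• Likewise for the remaining folds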

Cross-Validation
• Final estimate of error
• Quality of estimate
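A stratified k-fold procedure can be sketched as follows. The final estimate is taken here as the mean of the fold error rates and its quality as their standard deviation; both are common choices and are assumptions about what the slides intend. The majority-class learner is only a hypothetical stand-in for a real learning algorithm.

```python
import random
import statistics
from collections import Counter, defaultdict

def stratified_folds(labels, k, seed=0):
    """Deal the indices of each class round-robin into k folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    position = 0
    for indices in by_class.values():
        rng.shuffle(indices)                     # randomize within each class
        for index in indices:
            folds[position % k].append(index)
            position += 1
    return folds

def majority_learner(train_labels):
    """Stand-in learner: always predicts the most frequent training class."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda _x: majority

def cross_validate(xs, ys, k, seed=0):
    """Return (mean, standard deviation) of the k fold error rates."""
    errors = []
    for fold in stratified_folds(ys, k, seed):
        held_out = set(fold)
        train_labels = [y for i, y in enumerate(ys) if i not in held_out]
        model = majority_learner(train_labels)
        wrong = sum(1 for i in fold if model(xs[i]) != ys[i])
        errors.append(wrong / len(fold))
    return statistics.mean(errors), statistics.stdev(errors)

# Hypothetical data set: 70 'yes' and 30 'no' instances.
xs = list(range(100))
ys = ["yes"] * 70 + ["no"] * 30
folds = stratified_folds(ys, 10)
mean_error, std_error = cross_validate(xs, ys, 10)
print(mean_error, std_error)
```

Because the folds are stratified, every fold contains 7 'yes' and 3 'no' instances, so the stand-in learner makes the same error in each fold.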
Leave-One-Out Holdout
• n-Fold Cross-Validation (n instance set)
• Use all but one instance for training
• Maximum use of the data
• Deterministic
• High computational cost
• Non-stratified sample
Bootstrap
• Sample with replacement n times
• Use as training data
• Use instances not in training data for testing
• How many test instances are there?
0.632 Bootstrap
• On average, e^{-1} n ≈ 0.368 n instances will be in the test set
• Thus, on average, 63.2% of the instances are in the training set
• Estimate error rate

e = 0.632 etest + 0.368 etrain
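The 0.368 fraction follows from (1 − 1/n)^n ≈ e^{-1}, the chance that a given instance is never picked in n draws. A small simulation, with hypothetical names and an arbitrary seed, makes this concrete:

```python
import math
import random

def bootstrap_split(n, rng):
    """Sample n indices with replacement; unsampled indices form the test set."""
    train = [rng.randrange(n) for _ in range(n)]
    chosen = set(train)
    test = [i for i in range(n) if i not in chosen]
    return train, test

def bootstrap_632(e_test, e_train):
    """Combine test and training (resubstitution) error as on the slide."""
    return 0.632 * e_test + 0.368 * e_train

rng = random.Random(42)          # arbitrary seed, for repeatability
n = 1000
fractions = [len(bootstrap_split(n, rng)[1]) / n for _ in range(200)]
avg_test_fraction = sum(fractions) / len(fractions)

print(round(avg_test_fraction, 3), round(math.exp(-1), 3))
print(bootstrap_632(e_test=0.5, e_train=0.0))
```

The training sample still has n entries, but only about 63.2% of the distinct instances appear in it; the rest of its entries are repeats.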

Accuracy of our Estimate?
• Suppose we observe s successes in a testing set of ntest instances ...
• We then estimate the success rate

Rsuccess=s/ ntest.

• Each instance is either a success or failure (Bernoulli trial w/success probability p)
• Mean p
• Variance p(1-p)
Properties of Estimate
• We have

E[Rsuccess]=p

Var[Rsuccess]=p(1-p)/ntest

• If ntest is large enough, the Central Limit Theorem (CLT) states that, approximately,

Rsuccess~Normal(p,p(1-p)/ntest)

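Confidence Interval

• CI for normal: for a standard normal variable X, look up z in a table such that

$$P(-z \le X \le z) = c$$

• CI for p: standardize Rsuccess and solve for p

$$P\left(-z \le \frac{R_{success} - p}{\sqrt{p(1-p)/n_{test}}} \le z\right) = c$$

• c: confidence level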
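Solving the inequality for p gives an interval for the true success rate. The sketch below uses the closed-form solution (often called the Wilson score interval) and takes z from Python's `statistics.NormalDist` instead of a table; whether the slides use this form or a simpler approximation is an assumption.

```python
import math
from statistics import NormalDist

def success_rate_interval(successes, n_test, confidence=0.95):
    """Interval for p obtained by solving |R - p| <= z * sqrt(p(1-p)/n) for p."""
    r = successes / n_test                          # observed success rate
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # replaces the table lookup
    denom = 1 + z * z / n_test
    center = (r + z * z / (2 * n_test)) / denom
    half = z * math.sqrt(r * (1 - r) / n_test + z * z / (4 * n_test ** 2)) / denom
    return center - half, center + half

# Hypothetical example: 750 successes on 1000 test instances, 80% confidence.
low, high = success_rate_interval(750, 1000, confidence=0.80)
print(round(low, 3), round(high, 3))

# The same observed rate on only 100 instances gives a wider interval.
low_small, high_small = success_rate_interval(75, 100, confidence=0.80)
print(round(low_small, 3), round(high_small, 3))
```

The interval narrows as ntest grows, which is the practical reason to reserve as much test data as can be spared.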

Comparing Algorithms
• Know how to evaluate the results of our data mining algorithms (classification)
• How should we compare different algorithms?
• Evaluate each algorithm
• Rank
• Select best one
• Don't know if this ranking is reliable
Assessing Other Learning
• Developed procedures for classification
• Association rules
• Evaluated based on accuracy
• Same methods as for classification
• Numerical prediction
• Error rate no longer applies
• Same principles
• use independent test set and hold-out procedures
• cross-validation or bootstrap
Measures of Effectiveness
• Need to compare:
• Predicted values p1, p2,..., pn.
• Actual values a1, a2,..., an.
• Most common measure
• Mean-squared error
Other Measures
• Mean absolute error
• Relative squared error
• Relative absolute error
• Correlation
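These measures are short to write down. The sketch below uses their usual definitions, with the relative measures normalized by the error of always predicting the mean of the actual values; that normalization and the sample numbers are assumptions, not taken from the slides.

```python
import math

def mean_squared_error(p, a):
    return sum((x - y) ** 2 for x, y in zip(p, a)) / len(a)

def mean_absolute_error(p, a):
    return sum(abs(x - y) for x, y in zip(p, a)) / len(a)

def relative_squared_error(p, a):
    """Squared error relative to always predicting the mean of the actual values."""
    mean_a = sum(a) / len(a)
    return sum((x - y) ** 2 for x, y in zip(p, a)) / sum((y - mean_a) ** 2 for y in a)

def relative_absolute_error(p, a):
    """Absolute error relative to always predicting the mean of the actual values."""
    mean_a = sum(a) / len(a)
    return sum(abs(x - y) for x, y in zip(p, a)) / sum(abs(y - mean_a) for y in a)

def correlation(p, a):
    """Pearson correlation between predicted and actual values."""
    mean_p, mean_a = sum(p) / len(p), sum(a) / len(a)
    cov = sum((x - mean_p) * (y - mean_a) for x, y in zip(p, a))
    var_p = sum((x - mean_p) ** 2 for x in p)
    var_a = sum((y - mean_a) ** 2 for y in a)
    return cov / math.sqrt(var_p * var_a)

# Hypothetical predicted and actual values.
predicted = [2.5, 0.0, 2.0, 8.0]
actual = [3.0, -0.5, 2.0, 7.0]

mse = mean_squared_error(predicted, actual)
mae = mean_absolute_error(predicted, actual)
rse = relative_squared_error(predicted, actual)
rae = relative_absolute_error(predicted, actual)
corr = correlation(predicted, actual)
print(mse, mae, round(rse, 4), round(rae, 4), round(corr, 4))
```

A relative error below 1 means the model beats the trivial predictor that always outputs the mean.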
What to Do?
• “Large” amounts of data
• Hold-out 1/3 of data for testing
• Train a model on 2/3 of data
• Estimate error (or success) rate and calculate CI
• “Moderate” amounts of data
• Estimate error rate:
• Use 10-fold cross-validation with stratification,
• or use bootstrap.
• Train model on the entire data set
Predicting Probabilities
• Classification into k classes
• Predict probabilities p1, p2,..., pk, one for each class.
• Actual values a1, a2,..., ak (1 for the correct class, 0 otherwise).
• No longer 0-1 error

Information Loss Function

$$-\log_2 p_j$$

where the j-th prediction is correct, that is, j is the correct class.

• Information required to communicate which class is correct
• in bits
• with respect to the probability distribution
Occam's Razor
• Given a choice of theories that are equally good the simplest theory should be chosen
• Physical sciences: any theory should be consistent with all empirical observations
• Data mining:
• theory = predictive model
• good theory = good prediction
• What is good? Do we minimize the error rate?
Minimum Description Length
• MDL principle:
• Minimize
• size of theory + info needed to specify exceptions
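• Suppose training set E is mined, resulting in a theory T
• Want to minimize L(T) + L(E | T): the bits needed to encode the theory plus the bits needed to encode the exceptions given the theory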
Most Likely Theory
• Suppose we want to maximize P[T|E]
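• Bayes' rule

$$P[T \mid E] = \frac{P[E \mid T]\,P[T]}{P[E]}$$

• Take logarithms

$$\log P[T \mid E] = \log P[E \mid T] + \log P[T] - \log P[E]$$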
Information Function
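• Maximizing P[T|E] is equivalent to minimizing

$$-\log P[T \mid E] = -\log P[E \mid T] - \log P[T] + \log P[E]$$

• −log P[E|T]: number of bits it takes to transmit the exceptions
• −log P[T]: number of bits it takes to transmit the theory
• log P[E] does not depend on T and can be ignored
• That is, the MDL principle!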

Applications to Learning
• Classification, association, numeric prediction
• Several predictive models with 'similar' error rate (usually as small as possible)
• Select between them using Occam's razor
• Simplicity subjective
• Use MDL principle
• Clustering
• Important learning that is difficult to evaluate
• Can use MDL principle
Comparing Mining Algorithms
• Know how to evaluate the results
• Suppose we have two algorithms
• Obtain two different models
• Estimate the error rates e(1) and e(2).
• Compare estimates
• Select the better one
• Problem?
Weather Data Example
• Suppose we learn the rule

If outlook=rainy then play=yes

Otherwise play=no

• Test it on the following test set:
• Have zero error rate
Different Test Set 2
• Again, suppose we learn the rule

If outlook=rainy then play=yes

Otherwise play=no

• Test it on a different test set:
• Have 100% error rate!
Comparing Random Estimates
• Estimated error rate is just an estimate (random)
• Need variance as well as point estimates
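• Construct a t-test statistic

$$t = \frac{\bar{d}}{\sqrt{\sigma_d^2 / k}}$$

• $\bar{d}$: average of differences in error rates
• $\sqrt{\sigma_d^2 / k}$: estimated standard deviation of that average, from k paired estimates
• H0: Difference = 0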
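The statistic can be computed from paired error estimates, for example the per-fold error rates of two algorithms evaluated on the same folds. A minimal sketch; the error rates are hypothetical, and pairing by fold is an assumption about the setup.

```python
import math
import statistics

def paired_t_statistic(errors_a, errors_b):
    """t = mean difference / (sample std of differences / sqrt(k))."""
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    k = len(diffs)
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)          # sample standard deviation (k - 1)
    return mean_diff / (std_diff / math.sqrt(k))

# Hypothetical error rates of two algorithms on the same five folds.
errors_a = [0.20, 0.22, 0.19, 0.21, 0.20]
errors_b = [0.18, 0.19, 0.18, 0.20, 0.17]

t = paired_t_statistic(errors_a, errors_b)
print(round(t, 3))

# Under H0 the statistic follows Student's t distribution with k - 1 degrees of
# freedom; for 4 degrees of freedom the two-sided 5% critical value is 2.776.
significant = abs(t) > 2.776
print(significant)
```

Only when the statistic exceeds the critical value is there evidence that the ranking of the two algorithms is more than chance.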

Discussion
• Now know how to compare two learning algorithms and select the one with the better error rate
• We also know to select the simplest model that has 'comparable' error rate
• Is it really better?
• Minimising error rate can be misleading
Examples of 'Good Models'
• Application: loan approval
• Model: no applicants default on loans
• Evaluation: simple, low error rate
• Application: cancer diagnosis
• Model: all tumors are benign
• Evaluation: simple, low error rate
• Application: information assurance
• Model: all visitors to network are well intentioned
• Evaluation: simple, low error rate
What's Going On?
• Many (most) data mining applications can be thought about as detecting exceptions
• Ignoring the exceptions does not significantly increase the error rate!
• Ignoring the exceptions often leads to a simple model!
• Thus, we can find a model that we evaluate as good but completely misses the point
• Need to account for the cost of error types
Accounting for Cost of Errors
• Explicit modeling of the cost of each error
• costs may not be known
• often not practical
• visual inspection
• semi-automated learning
• Cost-sensitive learning
• assign costs to classes a priori
Explicit Modeling of Cost

Confusion Matrix

(Displayed in Weka)
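Combining a confusion matrix with a cost matrix shows how a low error rate can hide a costly model. In the sketch below rows are actual classes and columns are predicted classes; the counts and costs are hypothetical numbers in the spirit of the cancer diagnosis example.

```python
def error_rate_and_cost(confusion, cost):
    """Return (error rate, average cost per instance) for square matrices."""
    size = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(size))
    total_cost = sum(confusion[i][j] * cost[i][j]
                     for i in range(size) for j in range(size))
    return (total - correct) / total, total_cost / total

# Classes: index 0 = benign, index 1 = malignant (hypothetical counts).
# Missing a malignant tumor is assumed to cost 100 times a false alarm.
cost = [[0, 1],
        [100, 0]]

all_benign = [[990, 0],      # model that calls every tumor benign
              [10, 0]]
cautious = [[940, 50],       # model that raises some false alarms
            [1, 9]]

benign_error, benign_cost = error_rate_and_cost(all_benign, cost)
cautious_error, cautious_cost = error_rate_and_cost(cautious, cost)
print(benign_error, benign_cost)       # low error rate, high cost
print(cautious_error, cautious_cost)   # higher error rate, lower cost
```

Ranked by error rate the trivial model wins; ranked by average cost the cautious model wins.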

Cost Sensitive Learning
• Have used cost information to evaluate learning
• Better: use cost information to learn
• Simple idea:
• Increase instances that demonstrate important behavior (e.g., classified as exceptions)
• Applies for any learning algorithm
Discussion
• Evaluate learning
• Estimate error rate
• Minimum description length principle/Occam’s Razor
• Comparison of algorithms
• Based on evaluation
• Make sure difference is significant
• Cost of making errors may differ
• Use evaluation procedures with caution
• Incorporate into learning
Engineering the Output
• Prediction based on one model
• Model performs well on one training set, but poorly on others
• New data becomes available → new model
• Combine models
• Bagging
• Boosting
• Stacking
• All three improve prediction but complicate structure

Bagging
• Bias: error despite all the data in the world!
• Variance: error due to limited data
• Intuitive idea of bagging:
• Assume we have several data sets
• Apply learning algorithm to each set
• Vote on the prediction (classification/numeric)
• What type of error does this reduce?
• When is this beneficial?
Bootstrap Aggregating
• In practice: only one training data set
• Create many sets from one
• Sample with replacement (remember the bootstrap)
• Does this work?
• Often gives improvements in predictive performance
• Never degrades performance
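Bootstrap aggregating can be sketched generically: draw bootstrap samples, train one model per sample, and let the models vote. The 1-nearest-neighbor base learner, the one-dimensional data, and the seed below are hypothetical stand-ins chosen only to keep the example self-contained.

```python
import random
from collections import Counter

def bagging(train, learn, n_models, seed=0):
    """Train one model per bootstrap sample and return a voting predictor."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        sample = [rng.choice(train) for _ in train]   # sample with replacement
        models.append(learn(sample))

    def predict(x):
        votes = Counter(model(x) for model in models)
        return votes.most_common(1)[0][0]             # majority vote

    return predict

def learn_1nn(sample):
    """Stand-in base learner: 1-nearest neighbor on a single numeric attribute."""
    def model(x):
        return min(sample, key=lambda pair: abs(pair[0] - x))[1]
    return model

# Hypothetical training data: (attribute value, class).
train = [(x, "lo") for x in range(0, 10)] + [(x, "hi") for x in range(20, 30)]

predict = bagging(train, learn_1nn, n_models=25)
print(predict(3), predict(27))
```

For numeric prediction the vote would be replaced by an average of the individual predictions.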
Boosting
• Assume a stable learning procedure
• Low variance
• Bagging does very little
• Combine structurally different models
• Intuitive motivation:
• Any given model may be good for a subset of the training data
• Encourage models to explain part of the data
• Generate models:
• Assign equal weight to each training instance
• Iterate:
• Apply learning algorithm and store model
• e ← error of the model on the weighted training data
• If e = 0 or e ≥ 0.5, terminate
• For every instance:

If classified correctly, multiply weight by e/(1-e)

• Normalize weights
• Until STOP
• Classification:
• Assign zero weight to each class
• For every model:

Add -log(e/(1-e)) to the weight of the class predicted by the model

• Return class with highest weight
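The boosting loop above can be sketched with a very weak base learner. The threshold "stump" on a single numeric attribute and the ten-point data set are hypothetical; the handling of a perfect first model (it simply decides alone) is a design choice of this sketch.

```python
import math
from collections import defaultdict

def learn_stump(data, weights):
    """Weak learner: threshold rule with the smallest weighted error."""
    xs = sorted({x for x, _ in data})
    best = None
    for t in [(a + b) / 2 for a, b in zip(xs, xs[1:])]:
        for left, right in (("+", "-"), ("-", "+")):
            err = sum(w for (x, y), w in zip(data, weights)
                      if (left if x < t else right) != y)
            if best is None or err < best[0]:
                best = (err, t, left, right)
    err, t, left, right = best
    return (lambda x: left if x < t else right), err

def boost(data, rounds):
    """Reweight instances each round; models vote with weight -log(e/(1-e))."""
    n = len(data)
    weights = [1 / n] * n                       # equal weight to each instance
    models = []
    for _ in range(rounds):
        model, e = learn_stump(data, weights)
        if e == 0 or e >= 0.5:
            if e == 0 and not models:
                models.append((model, 1.0))     # perfect first model decides alone
            break
        models.append((model, -math.log(e / (1 - e))))
        # Shrink the weight of correctly classified instances, then normalize.
        weights = [w * e / (1 - e) if model(x) == y else w
                   for (x, y), w in zip(data, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]

    def predict(x):
        votes = defaultdict(float)              # zero weight for each class
        for model, alpha in models:
            votes[model(x)] += alpha
        return max(votes, key=votes.get)

    return predict

# Hypothetical data that no single threshold rule can classify correctly.
data = list(zip(range(10), ["+", "+", "+", "-", "-", "-", "+", "+", "+", "-"]))

single_error = learn_stump(data, [1 / len(data)] * len(data))[1]
predict = boost(data, rounds=3)
training_errors = sum(1 for x, y in data if predict(x) != y)
print(round(single_error, 2), training_errors)
```

The best single stump misclassifies 3 of the 10 instances, while three boosted stumps classify all of them correctly.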
Performance Analysis
• Training error of the combined classifier converges to zero at an exponential rate (very fast)
• Questionable value due to possible overfitting
• Must use independent test data
• Fails on test data if
• Classifier more complex than training data justifies
• Training error becomes too large too quickly
• Must achieve balance between model complexity and the fit to the data
Fitting versus Overfitting
• Overfitting very difficult to assess here
• Assume we have reached zero error
• May be beneficial to continue boosting!
• Occam's razor?
• Build complex models from simple ones
• Boosting offers very significant improvement
• Can hope for more improvement than bagging
• Can degrade performance
• Never happens with bagging
Stacking
• Models of different types
• Meta learner:
• Learn which learning algorithms are good
• Combine learning algorithms intelligently

[Figure: stacking architecture]
• Level-0 models: Decision Tree, Naïve Bayes, Instance-Based
• Level-1 model: Meta Learner

Meta Learning
• Holdout part of the training set
• Use remaining data for training level-0 methods
• Use holdout data to train level-1 learning
• Retrain level-0 algorithms with all the data
• Level-1 learning: use very simple algorithm (e.g., linear model)
• Can use cross-validation to allow level-1 algorithms to train on all the data
Supervised Learning
• Two types of learning
• Classification
• Numerical prediction
• Classification learning algorithms
• Decision trees
• Naïve Bayes
• Instance-based learning
• Many others are part of Weka, browse!
Other Issues in Supervised Learning
• Evaluation
• Accuracy: hold-out, bootstrap, cross-validation
• Simplicity: MDL principle
• Usefulness: cost-sensitive learning
• Metalearning
• Bagging, Boosting, Stacking