- By **matt**

## Supervised Learning

Introduction

- Key idea
  - Known target concept (predict a certain attribute)
  - Find out how the other attributes can be used to predict it
- Algorithms
  - Rudimentary Rules (e.g., 1R)
  - Statistical Modeling (e.g., Naïve Bayes)
  - Divide and Conquer: Decision Trees
  - Instance-Based Learning
  - Neural Networks
  - Support Vector Machines

1-Rule

- Generate a one-level decision tree
  - One attribute
- Performs quite well!
- Basic idea:
  - Rules testing a single attribute
  - Classify according to frequency in training data
  - Evaluate error rate for each attribute
  - Choose the best attribute
- That’s all folks!

Apply 1R

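| | Attribute | Rules | Errors | Total errors |
|---|---|---|---|---|
| 1 | outlook | sunny → no | 2/5 | 4/14 |
| | | overcast → yes | 0/4 | |
| | | rainy → yes | 2/5 | |
| 2 | temperature | hot → no | 2/4 | 5/14 |
| | | mild → yes | 2/6 | |
| | | cool → yes | 1/4 | |
| 3 | humidity | high → no | 3/7 | 4/14 |
| | | normal → yes | 1/7 | |
| 4 | windy | false → yes | 2/8 | 5/14 |
| | | true → no | 3/6 | |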
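The 1R procedure can be sketched in a few lines of Python. This is a minimal illustration, not the slides' own code: it assumes the data is the standard 14-instance weather data set (outlook, temperature, humidity, windy, play), and the names `WEATHER`, `ATTRIBUTES`, and `one_r` are made up for the example.

```python
from collections import Counter, defaultdict

# Assumption: the standard 14-instance weather data set.
# Columns: outlook, temperature, humidity, windy, play (the class).
ATTRIBUTES = ["outlook", "temperature", "humidity", "windy"]
WEATHER = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]


def one_r(rows, attributes):
    """Return (best attribute, its rules, its total errors) under 1R."""
    best = None
    for index, name in enumerate(attributes):
        # Count class frequencies for every value of this attribute.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[index]][row[-1]] += 1
        # Each attribute value predicts its most frequent class.
        rules = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        # Errors: instances that do not belong to the predicted class.
        errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
        # Strict "<" keeps the first attribute when error counts tie.
        if best is None or errors < best[2]:
            best = (name, rules, errors)
    return best


name, rules, errors = one_r(WEATHER, ATTRIBUTES)
print(name, rules, f"{errors}/{len(WEATHER)}")
```

On this data outlook and humidity tie at 4 errors out of 14; the sketch breaks the tie by attribute order, which is an arbitrary choice.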

Other Features

- Numeric Values
  - Discretization:
    - Sort training data
    - Split range into categories
- Missing Values
  - “Dummy” attribute

Naïve Bayes Classifier

- Allow all attributes to contribute equally
- Assumes
  - All attributes equally important
  - All attributes independent
- Realistic?
- Selection of attributes

Maximum a Posteriori (MAP)

Maximum Likelihood (ML)

Classification

- Want to classify a new instance (a1, a2,…, an) into finite number of categories from the set V.
- Bayesian approach: Assign the most probable category vMAP given (a1, a2,…, an).
- Can we estimate the probabilities from the training data?
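The formula for the most probable category is not preserved in the transcript. In the standard formulation, with $v_j$ ranging over the categories in $V$ (notation assumed here), it reads:

$$v_{MAP} = \arg\max_{v_j \in V} P(v_j \mid a_1, a_2, \ldots, a_n) = \arg\max_{v_j \in V} P(a_1, a_2, \ldots, a_n \mid v_j)\, P(v_j)$$

The second form follows from Bayes' rule, because the denominator $P(a_1, a_2, \ldots, a_n)$ is the same for every category and can be dropped.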

Naïve Bayes Classifier

- Second probability easy to estimate
- How?
- The first probability difficult to estimate
- Why?
- Assume independence (this is the naïve bit):
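Under the usual reading of this slide, the two probabilities are the factors $P(a_1, \ldots, a_n \mid v_j)$ (first) and $P(v_j)$ (second). The independence assumption, written in assumed standard notation, replaces the first factor by a product of per-attribute probabilities:

$$P(a_1, a_2, \ldots, a_n \mid v_j) = \prod_{i=1}^{n} P(a_i \mid v_j)$$

$$v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_{i=1}^{n} P(a_i \mid v_j)$$

Each $P(a_i \mid v_j)$ can be estimated from relative frequencies in the training data, which is far easier than estimating the joint probability of a full attribute combination.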

Estimation

- Given a new instance with
- outlook=sunny,
- temperature=high,
- humidity=high,
- windy=true

Calculations continued …

- Similarly
- Thus

Normalization

- Note that we can normalize to get the probabilities:
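The calculations on these slides are not preserved, so the following is a hedged sketch of the same kind of computation rather than a reproduction. It assumes the standard 14-instance weather data set and estimates every probability by relative frequency, without smoothing. Since `high` is not a temperature value in that data set, the sketch uses `temperature=cool` as a stand-in; the function names are illustrative.

```python
from collections import Counter

# Assumption: the standard 14-instance weather data set.
# Columns: outlook, temperature, humidity, windy, play (the class).
WEATHER = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]


def naive_bayes_scores(rows, instance):
    """Return the unnormalized score P(class) * prod P(value | class)."""
    class_counts = Counter(row[-1] for row in rows)
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / len(rows)  # prior P(class)
        for index, value in enumerate(instance):
            matches = sum(
                1 for row in rows if row[-1] == label and row[index] == value
            )
            score *= matches / n_label  # relative frequency, no smoothing
        scores[label] = score
    return scores


def normalize(scores):
    """Rescale the scores so that they sum to one."""
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}


# Stand-in instance: outlook=sunny, temperature=cool, humidity=high, windy=true.
instance = ("sunny", "cool", "high", "true")
scores = naive_bayes_scores(WEATHER, instance)
probabilities = normalize(scores)
print(scores, probabilities)
```

For this stand-in instance the normalized probability of play=no comes out at roughly 0.8, so the instance would be classified as no.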

Numeric Values

- Assume a probability density f(x) for each numeric attribute
  - normal distribution
  - or fit a distribution to the data (better)
- Then proceed as before, using the density value in place of the relative frequency
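A minimal sketch of the normal-distribution option: estimate the mean and standard deviation of a numeric attribute within one class, then use the density value where a relative frequency would otherwise appear. The temperature readings below are hypothetical, and `normal_density` is an illustrative name.

```python
import math
from statistics import mean, stdev


def normal_density(x, mu, sigma):
    """Density of the normal distribution with mean mu and std. dev. sigma."""
    exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
    return math.exp(exponent) / (math.sqrt(2 * math.pi) * sigma)


# Hypothetical temperatures observed for instances of one class.
temps_yes = [83, 70, 68, 64, 69, 75, 75, 72, 81]
mu, sigma = mean(temps_yes), stdev(temps_yes)

# Density for a new instance with temperature 66, given that class.
density = normal_density(66, mu, sigma)
print(mu, round(sigma, 2), round(density, 4))
```

A density is not a probability, but since the same attribute value is plugged in for every class, the comparison between classes still works after normalization.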

Discussion

- Simple methodology
- Powerful: good results in practice
- Missing values are no problem
- Not so good if the independence assumption is severely violated
  - Extreme case: multiple attributes with the same values
  - Solutions:
    - Preselect which attributes to use
    - Non-naïve Bayesian methods: networks

Decision Tree Learning

- Basic Algorithm:
  - Select an attribute to be tested
  - If classification achieved, return classification
  - Otherwise, branch by setting attribute to each of the possible values
  - Repeat with each branch as your new tree
- Main issue: how to select attributes
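The divide-and-conquer loop can be sketched with the attribute choice left as a pluggable function, since that choice is the open issue at this point. The sketch assumes the standard 14-instance weather data set; `build_tree`, `classify`, and the default "take the first remaining attribute" selector are illustrative and deliberately naive.

```python
from collections import Counter

# Assumption: the standard 14-instance weather data set.
ATTRIBUTES = ["outlook", "temperature", "humidity", "windy"]
WEATHER = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]


def build_tree(rows, attributes, select=lambda rows, attributes: attributes[0]):
    """Grow a tree recursively; `select` picks the attribute to test next."""
    classes = Counter(row[-1] for row in rows)
    if len(classes) == 1 or not attributes:
        return classes.most_common(1)[0][0]  # leaf: the (majority) class
    attribute = select(rows, attributes)
    index = ATTRIBUTES.index(attribute)
    remaining = [a for a in attributes if a != attribute]
    branches = {}
    # Branch only on values that actually occur in this subset.
    for value in sorted({row[index] for row in rows}):
        subset = [row for row in rows if row[index] == value]
        branches[value] = build_tree(subset, remaining, select)
    return (attribute, branches)


def classify(tree, instance):
    """Follow branches until a leaf; raises KeyError for an unseen value."""
    while isinstance(tree, tuple):
        attribute, branches = tree
        tree = branches[instance[ATTRIBUTES.index(attribute)]]
    return tree


tree = build_tree(WEATHER, ATTRIBUTES)
print(tree)
```

Passing a different `select` function, for example one that maximizes an impurity reduction, changes the shape of the tree without touching the recursion.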

Deciding on Branching

- What do we want to accomplish?
  - Make good predictions
  - Obtain simple-to-interpret rules
- No diversity (impurity) is best
  - Best case: all instances in the same class
  - Worst case: all classes equally likely
- Goal: select attributes to reduce impurity

Measuring Impurity/Diversity

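- Let's say we only have two classes, with proportions $p$ and $1 - p$:
  - Minimum: $\min(p, 1 - p)$
  - Gini index/Simpson diversity index: $2p(1 - p)$
  - Entropy: $-p \log_2 p - (1 - p) \log_2 (1 - p)$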
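The three two-class impurity measures named above can be compared directly. This is a small sketch using their usual two-class forms, with `p` the proportion of one class; the function names are illustrative.

```python
import math


def minimum_impurity(p):
    """Proportion of the minority class."""
    return min(p, 1 - p)


def gini(p):
    """Two-class Gini index: 1 - p^2 - (1 - p)^2 = 2p(1 - p)."""
    return 2 * p * (1 - p)


def entropy(p):
    """Two-class entropy in bits, using the convention 0 * log 0 = 0."""
    return sum(-q * math.log2(q) for q in (p, 1 - p) if q > 0)


for p in (0.0, 0.1, 0.3, 0.5):
    print(p, minimum_impurity(p), round(gini(p), 3), round(entropy(p), 3))
```

All three are zero for a pure set and largest at p = 0.5, which is why any of them can serve as the quantity a split tries to reduce.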

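Entropy

$$\mathrm{Entropy}(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$$

where $c$ is the number of classes, $S$ is the training data (instances), and $p_i$ is the proportion of $S$ classified as $i$.

- Entropy is a measure of impurity in the training data S
- Measured in bits of information needed to encode a member of S
- Extreme cases
  - All members have the same classification: entropy is 0 (note: 0·log 0 = 0)
  - All classifications equally frequent: entropy is at its maximum, $\log_2 c$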

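Expected Information Gain

$$\mathrm{Gain}(S, a) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(a)} \frac{|S_v|}{|S|}\, \mathrm{Entropy}(S_v)$$

where $\mathrm{Values}(a)$ is the set of all possible values for attribute $a$ and $S_v$ is the subset of $S$ for which $a$ has value $v$.

Gain(S,a) is the expected information provided about the classification from knowing the value of attribute a (the reduction in the number of bits needed).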

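Calculating the Gain

[Worked gain calculations for the candidate attributes are not preserved in the transcript. At each step, the attribute with the highest gain is selected.]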
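A hedged sketch of the gain calculation, assuming the standard 14-instance weather data set and the usual definitions of entropy and expected information gain. The names `entropy`, `gain`, and `gains` are illustrative.

```python
import math
from collections import Counter

# Assumption: the standard 14-instance weather data set.
ATTRIBUTES = ["outlook", "temperature", "humidity", "windy"]
WEATHER = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]


def entropy(rows):
    """Entropy (in bits) of the class distribution in rows."""
    counts = Counter(row[-1] for row in rows)
    return sum(-(c / len(rows)) * math.log2(c / len(rows)) for c in counts.values())


def gain(rows, index):
    """Expected reduction in entropy from splitting on attribute `index`."""
    remainder = 0.0
    for value in {row[index] for row in rows}:
        subset = [row for row in rows if row[index] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - remainder


gains = {name: gain(WEATHER, i) for i, name in enumerate(ATTRIBUTES)}
best = max(gains, key=gains.get)
print({name: round(g, 3) for name, g in gains.items()}, best)
```

On this data the root split goes to outlook; repeating the same calculation inside each branch yields the rest of the tree.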

What’s in a Tree?

- Our final decision tree correctly classifies every instance
- Is this good?
- Two important concepts:
- Overfitting
- Pruning

Overfitting

- Two sources of abnormalities
  - Noise (randomness)
  - Outliers (measurement errors)
- Chasing every abnormality causes overfitting
  - Tree too large and complex
  - Does not generalize to new data
- Solution: prune the tree

Pruning

- Prepruning
- Halt construction of decision tree early
- Use same measure as in determining attributes, e.g., halt if InfoGain < K
- Most frequent class becomes the leaf node
- Postpruning
- Construct complete decision tree
- Prune it back
- Prune to minimize expected error rates
- Prune to minimize bits of encoding (Minimum Description Length principle)

Scalability

- Need to design for large amounts of data
- Two things to worry about
  - Large number of attributes
    - Leads to a large tree (prepruning?)
    - Takes a long time
  - Large amounts of data
    - Can the data be kept in memory?
    - Some new algorithms do not require all the data to be memory resident

Discussion: Decision Trees

- Among the most popular methods
  - Quite effective
  - Relatively simple
- Have discussed in detail the ID3 algorithm:
  - Information gain to select attributes
  - No pruning
  - Only handles nominal attributes

Selecting Split Attributes

- Other Univariate splits
- Gain Ratio: C4.5 Algorithm (J48 in Weka)
- CART (not in Weka)
- Multivariate splits
- May be possible to obtain better splits by considering two or more attributes simultaneously

Instance-Based Learning

- Classification
  - Do not construct an explicit description of how to classify
  - Store all training data (learning)
  - New example: find the most similar instance
  - Computing is done at the time of classification
- k-nearest neighbor

K-Nearest Neighbor

- Each instance lives in n-dimensional space
- Distance between instances

Normalizing

- Some attributes may take large values and others small
- Normalize
  - All attributes on equal footing
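The distance and normalization formulas are not preserved in the transcript. The sketch below assumes two common choices, Euclidean distance and min-max scaling to the range 0 to 1, and applies them to hypothetical (age, income) data; all names are illustrative.

```python
import math
from collections import Counter


def min_max_normalize(points):
    """Return a function that rescales each attribute to the range [0, 1]."""
    columns = list(zip(*points))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]

    def scale(point):
        return tuple(
            (x - lo) / (hi - lo) if hi > lo else 0.0
            for x, lo, hi in zip(point, lows, highs)
        )

    return scale


def knn_classify(training, query, k=3):
    """Majority class among the k training points closest to query."""
    neighbors = sorted(training, key=lambda pair: math.dist(pair[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]


# Hypothetical data: (age in years, income in dollars) -> class.
raw = [
    ((25, 30000), "no"),
    ((30, 32000), "no"),
    ((45, 90000), "yes"),
    ((50, 95000), "yes"),
    ((48, 31000), "no"),
]
scale = min_max_normalize([point for point, _ in raw])
training = [(scale(point), label) for point, label in raw]
prediction = knn_classify(training, scale((47, 88000)), k=3)
print(prediction)
```

Without the scaling step, income differences in the tens of thousands would swamp age differences of a few years, so the neighbors would effectively be chosen on income alone.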

Other Methods for Supervised Learning

- Neural networks
- Support vector machines
- Optimization
- Rough set approach
- Fuzzy set approach

Evaluating the Learning

- Measure of performance
- Classification: error rate
- Resubstitution error
- Performance on training set
- Poor predictor of future performance
- Overfitting
- Useless for evaluation

Test Set

- Need a set of test instances
- Independent of training set instances
- Representative of underlying structure
- Sometimes: validation data
- Fine-tune parameters
- Independent of training and test data
- Plentiful data - no problem!

Holdout Procedures

- Common case: data set large but limited
- Usual procedure:
- Reserve some data for testing
- Use remaining data for training
- Problems:
- Want both sets as large as possible
- Want both sets to be representative

"Smart" Holdout

- Simple check: Are the proportions of classes about the same in each data set?
- Stratified holdout
- Guarantee that classes are (approximately) proportionally represented
- Repeated holdout
- Randomly select holdout set several times and average the error rate estimates
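A minimal sketch of a stratified holdout split: the instances of each class are shuffled separately and the same fraction of every class goes to the test set. The function name, the one-third test fraction, and the toy labels are assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_holdout(labels, test_fraction=1 / 3, seed=0):
    """Split instance indices so each class keeps roughly its proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * test_fraction)
        test.extend(indices[:cut])
        train.extend(indices[cut:])
    return train, test


# Toy labels: 60 "yes" and 30 "no" instances.
labels = ["yes"] * 60 + ["no"] * 30
train, test = stratified_holdout(labels)
print(len(train), len(test))
```

Calling the function with several different seeds and averaging the resulting error estimates gives the repeated holdout procedure.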

Holdout w/ Cross-Validation

- Cross-validation
- Fixed number of partitions of the data (folds)
- In turn: each partition used for testing and remaining instances for training
- May use stratification and randomization
- Standard practice:
- Stratified tenfold cross-validation
- Instances divided randomly into the ten partitions

Cross Validation

[Figure: for each fold i (fold 1, fold 2, ...), a model is trained on 90% of the data and tested on the remaining 10%, giving error rate e_i.]

Cross-Validation

- Final estimate of error
- Quality of estimate
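The formulas on this slide are not preserved. A common choice, assumed in the sketch below, is to average the per-fold error rates for the final estimate and to look at their spread as a rough indication of its quality. The folds here are random but not stratified, and the majority-class learner and toy data are placeholders.

```python
import random
from collections import Counter
from statistics import mean, stdev


def k_fold_indices(n, k, seed=0):
    """Randomly partition the indices 0..n-1 into k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(rows, train_fn, k=10, seed=0):
    """Return one error rate per fold; train_fn(rows) returns a classifier."""
    errors = []
    for fold in k_fold_indices(len(rows), k, seed):
        held_out = set(fold)
        training = [row for i, row in enumerate(rows) if i not in held_out]
        model = train_fn(training)
        wrong = sum(1 for i in fold if model(rows[i][:-1]) != rows[i][-1])
        errors.append(wrong / len(fold))
    return errors


def majority_learner(training):
    """Placeholder learner: always predict the most frequent class."""
    majority = Counter(row[-1] for row in training).most_common(1)[0][0]
    return lambda instance: majority


# Toy data: 70 "yes" and 30 "no" instances with a dummy attribute.
rows = [(i, "yes") for i in range(70)] + [(i, "no") for i in range(30)]
errors = cross_validate(rows, majority_learner)
estimate = mean(errors)
print(round(estimate, 3), round(stdev(errors), 3))
```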

Leave-One-Out Holdout

- n-Fold Cross-Validation (n instance set)
- Use all but one instance for training
- Maximum use of the data
- Deterministic
- High computational cost
- Non-stratified sample

Bootstrap

- Sample with replacement n times
- Use as training data
- Use instances not in training data for testing
- How many test instances are there?

0.632 Bootstrap

- On average, $e^{-1} n \approx 0.368\,n$ instances will be in the test set
- Thus, on average, 63.2% of the instances are in the training set
- Estimate error rate

$$e = 0.632\, e_{test} + 0.368\, e_{train}$$
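The 0.368 figure comes from the chance that a given instance is never picked in n draws with replacement, $(1 - 1/n)^n \approx e^{-1}$. The sketch below checks this by simulation and combines two error rates in the way the formula describes; the function names and the example error rates are illustrative.

```python
import random


def bootstrap_split(n, seed=0):
    """Draw n indices with replacement; the unpicked indices form the test set."""
    rng = random.Random(seed)
    train = [rng.randrange(n) for _ in range(n)]
    chosen = set(train)
    test = [i for i in range(n) if i not in chosen]
    return train, test


def error_632(e_test, e_train):
    """Weighted combination of test and training (resubstitution) error."""
    return 0.632 * e_test + 0.368 * e_train


n = 10000
train, test = bootstrap_split(n)
test_fraction = len(test) / n
print(round(test_fraction, 3), round(error_632(0.30, 0.10), 4))
```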

Accuracy of our Estimate?

- Suppose we observe $s$ successes in a test set of $n_{test}$ instances
- We then estimate the success rate

$$R_{success} = s / n_{test}$$

- Each instance is either a success or failure (Bernoulli trial with success probability $p$)
  - Mean $p$
  - Variance $p(1-p)$

Properties of Estimate

- We have

$$E[R_{success}] = p, \qquad \mathrm{Var}[R_{success}] = \frac{p(1-p)}{n_{test}}$$

- If $n_{test}$ is large enough, the Central Limit Theorem (CLT) states that, approximately,

$$R_{success} \sim \mathrm{Normal}\!\left(p, \frac{p(1-p)}{n_{test}}\right)$$
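One way to use the normal approximation is a confidence interval for the true success rate. The sketch below takes the simplest route, plugging the observed rate in for p in the variance, which is an assumption of this example; more careful intervals exist. The counts are hypothetical.

```python
import math
from statistics import NormalDist


def success_rate_interval(successes, n_test, confidence=0.95):
    """Normal-approximation interval around the observed success rate."""
    rate = successes / n_test
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    # Plug-in estimate of the standard deviation of the success rate.
    half_width = z * math.sqrt(rate * (1 - rate) / n_test)
    return rate - half_width, rate + half_width


# Hypothetical result: 750 successes on 1000 test instances.
low, high = success_rate_interval(750, 1000)
print(round(low, 3), round(high, 3))
```

With the same observed rate on only 100 test instances the interval is about three times wider, which shows how much the size of the test set matters.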

Comparing Algorithms

- Know how to evaluate the results of our data mining algorithms (classification)
- How should we compare different algorithms?
- Evaluate each algorithm
- Rank
- Select best one
- Don't know if this ranking is reliable

Assessing Other Learning

- Developed procedures for classification
- Association rules
  - Evaluated based on accuracy
  - Same methods as for classification
- Numerical prediction
  - Error rate no longer applies
  - Same principles
    - use independent test set and hold-out procedures
    - cross-validation or bootstrap

Measures of Effectiveness

- Need to compare:
- Predicted values p1, p2,..., pn.
- Actual values a1, a2,..., an.
- Most common measure
- Mean-squared error

Other Measures

- Mean absolute error
- Relative squared error
- Relative absolute error
- Correlation
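The formulas for these measures are not preserved in the transcript. The sketch below uses their common definitions, with the relative measures taken relative to always predicting the mean of the actual values (an assumption of this example) and the correlation taken as the Pearson coefficient. The predicted and actual values are hypothetical.

```python
import math
from statistics import mean


def mean_squared_error(p, a):
    return mean((pi - ai) ** 2 for pi, ai in zip(p, a))


def mean_absolute_error(p, a):
    return mean(abs(pi - ai) for pi, ai in zip(p, a))


def relative_squared_error(p, a):
    """Squared error relative to always predicting the mean of a."""
    a_bar = mean(a)
    return sum((pi - ai) ** 2 for pi, ai in zip(p, a)) / sum(
        (ai - a_bar) ** 2 for ai in a
    )


def relative_absolute_error(p, a):
    """Absolute error relative to always predicting the mean of a."""
    a_bar = mean(a)
    return sum(abs(pi - ai) for pi, ai in zip(p, a)) / sum(
        abs(ai - a_bar) for ai in a
    )


def correlation(p, a):
    """Pearson correlation between predicted and actual values."""
    p_bar, a_bar = mean(p), mean(a)
    covariance = sum((pi - p_bar) * (ai - a_bar) for pi, ai in zip(p, a))
    spread_p = sum((pi - p_bar) ** 2 for pi in p)
    spread_a = sum((ai - a_bar) ** 2 for ai in a)
    return covariance / math.sqrt(spread_p * spread_a)


# Hypothetical predictions and actual values.
predicted = [2.5, 0.0, 2.0, 8.0]
actual = [3.0, -0.5, 2.0, 7.0]
mse = mean_squared_error(predicted, actual)
mae = mean_absolute_error(predicted, actual)
rse = relative_squared_error(predicted, actual)
rae = relative_absolute_error(predicted, actual)
corr = correlation(predicted, actual)
print(mse, mae, round(rse, 4), round(rae, 4), round(corr, 4))
```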

What to Do?

- “Large” amounts of data
- Hold-out 1/3 of data for testing
- Train a model on 2/3 of data
- Estimate error (or success) rate and calculate CI
- “Moderate” amounts of data
- Estimate error rate:
- Use 10-fold cross-validation with stratification,
- or use bootstrap.
- Train model on the entire data set

Predicting Probabilities

- Classification into k classes
  - Predict probabilities $p_1, p_2, \ldots, p_k$ for each class
  - Actual values $a_1, a_2, \ldots, a_k$, where $a_j = 1$ for the correct class and 0 otherwise
- No longer 0-1 error
- Quadratic loss function

$$\sum_{j=1}^{k} (p_j - a_j)^2$$

Information Loss Function

- Instead of the quadratic function:

$$-\log_2 p_j$$

where the j-th prediction is correct.

- Information required to communicate which class is correct
  - in bits
  - with respect to the probability distribution
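Both loss functions are easy to compare on a single prediction. The sketch assumes the usual forms, a sum of squared differences for the quadratic loss and the negative base-2 logarithm of the probability assigned to the correct class for the information loss; the probabilities are hypothetical.

```python
import math


def quadratic_loss(probabilities, correct):
    """Sum over classes of (p_j - a_j)^2, with a_j = 1 for the correct class."""
    return sum(
        (p - (1.0 if j == correct else 0.0)) ** 2
        for j, p in enumerate(probabilities)
    )


def information_loss(probabilities, correct):
    """Bits needed to communicate the correct class: -log2 of its probability."""
    return -math.log2(probabilities[correct])


# Hypothetical prediction over three classes; class 0 is the correct one.
probabilities = [0.7, 0.2, 0.1]
q_loss = quadratic_loss(probabilities, 0)
i_loss = information_loss(probabilities, 0)
print(round(q_loss, 4), round(i_loss, 4))
```

The information loss grows without bound as the probability given to the correct class approaches zero, whereas the quadratic loss never exceeds 2.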

Occam's Razor

- Given a choice of theories that are equally good, the simplest theory should be chosen
- Physical sciences: any theory should be consistent with all empirical observations
- Data mining:
  - theory = predictive model
  - good theory = good prediction
- What is good? Do we minimize the error rate?

Minimum Description Length

- MDL principle:
  - Minimize size of theory + info needed to specify exceptions
- Suppose training set E is mined, resulting in a theory T
- Want to minimize

$$L[T] + L[E \mid T]$$

where $L[\cdot]$ denotes the description length in bits.

Most Likely Theory

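- Suppose we want to maximize P[T|E]
- Bayes' rule

$$P[T \mid E] = \frac{P[E \mid T]\, P[T]}{P[E]}$$

- Take logarithms

$$\log P[T \mid E] = \log P[E \mid T] + \log P[T] - \log P[E]$$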

Information Function

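- Maximizing P[T|E] is equivalent to minimizing

$$-\log P[E \mid T] - \log P[T]$$

since $\log P[E]$ does not depend on the theory. Here $-\log P[E \mid T]$ is the number of bits it takes to transmit the exceptions, and $-\log P[T]$ is the number of bits it takes to transmit the theory.

- That is, the MDL principle!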

Applications to Learning

- Classification, association, numeric prediction
  - Several predictive models with 'similar' error rate (usually as small as possible)
  - Select between them using Occam's razor
  - Simplicity is subjective
  - Use MDL principle
- Clustering
  - Important learning that is difficult to evaluate
  - Can use MDL principle

Comparing Mining Algorithms

- Know how to evaluate the results
- Suppose we have two algorithms
- Obtain two different models
- Estimate the error rates e(1) and e(2).
- Compare estimates
- Select the better one
- Problem?

Weather Data Example

- Suppose we learn the rule

If outlook=rainy then play=yes

Otherwise play=no

- Test it on the following test set:
- Have zero error rate

Different Test Set 2

- Again, suppose we learn the rule

If outlook=rainy then play=yes

Otherwise play=no

- Test it on a different test set:
- Have 100% error rate!

Comparing Random Estimates

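- Estimated error rate is just an estimate (random)
- Need variance as well as point estimates
- Construct a t-test statistic

$$t = \frac{\bar{d}}{\sqrt{\hat{\sigma}_d^2 / k}}$$

where $\bar{d}$ is the average of the differences in error rates over $k$ paired estimates (for example, one pair per cross-validation fold), $\hat{\sigma}_d$ is the estimated standard deviation of those differences, and the null hypothesis is H0: difference = 0.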
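A paired t statistic for two algorithms can be sketched as follows. It assumes both algorithms were evaluated on the same k folds, so the error rates pair up; the error rates themselves are hypothetical and the function name is illustrative.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(errors_a, errors_b):
    """Mean paired difference divided by its estimated standard error."""
    differences = [a - b for a, b in zip(errors_a, errors_b)]
    standard_error = stdev(differences) / math.sqrt(len(differences))
    return mean(differences) / standard_error


# Hypothetical error rates of two algorithms on the same five folds.
errors_a = [0.20, 0.22, 0.19, 0.21, 0.23]
errors_b = [0.18, 0.19, 0.18, 0.20, 0.19]
t = paired_t_statistic(errors_a, errors_b)
print(round(t, 2))
```

The statistic is compared with a Student t distribution with k - 1 degrees of freedom. With five pairs the two-sided 5% critical value is about 2.78, so a value above that would lead to rejecting H0.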

Discussion

- Now know how to compare two learning algorithms and select the one with the better error rate
- We also know to select the simplest model that has 'comparable' error rate
- Is it really better?
- Minimising error rate can be misleading

Examples of 'Good Models'

- Application: loan approval
- Model: no applicants default on loans
- Evaluation: simple, low error rate
- Application: cancer diagnosis
- Model: all tumors are benign
- Evaluation: simple, low error rate
- Application: information assurance
- Model: all visitors to network are well intentioned
- Evaluation: simple, low error rate

What's Going On?

- Many (most) data mining applications can be thought about as detecting exceptions
- Ignoring the exceptions does not significantly increase the error rate!
- Ignoring the exceptions often leads to a simple model!
- Thus, we can find a model that we evaluate as good but completely misses the point
- Need to account for the cost of error types

Accounting for Cost of Errors

- Explicit modeling of the cost of each error
- costs may not be known
- often not practical
- Look at trade-offs
- visual inspection
- semi-automated learning
- Cost-sensitive learning
- assign costs to classes a priori

Cost Sensitive Learning

- Have used cost information to evaluate learning
- Better: use cost information to learn
- Simple idea:
  - Increase the number (or weight) of instances that demonstrate important behavior (e.g., classified as exceptions)
  - Applies to any learning algorithm
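A small sketch of the simple idea: duplicate instances of the costly class before training, and evaluate with total cost rather than error rate. The cost values, class names, and function names are hypothetical.

```python
def reweight_by_cost(rows, cost_of_missing):
    """Duplicate each instance according to the integer cost of missing its class."""
    weighted = []
    for row in rows:
        weighted.extend([row] * cost_of_missing.get(row[-1], 1))
    return weighted


def total_cost(actual, predicted, cost):
    """Sum the cost of every (actual, predicted) pair; unlisted pairs cost 0."""
    return sum(cost.get((a, p), 0) for a, p in zip(actual, predicted))


# Hypothetical data: 95 benign and 5 malignant cases with a dummy attribute.
rows = [(i, "benign") for i in range(95)] + [(i, "malignant") for i in range(5)]
weighted = reweight_by_cost(rows, {"malignant": 20})

# The "all tumors are benign" model has a 5% error rate but a high cost.
actual = [row[-1] for row in rows]
predicted = ["benign"] * len(rows)
cost = {("malignant", "benign"): 20, ("benign", "malignant"): 1}
error_rate = sum(a != p for a, p in zip(actual, predicted)) / len(rows)
print(len(weighted), error_rate, total_cost(actual, predicted, cost))
```

After reweighting, the malignant cases make up more than half of the training data, so a learner that minimizes plain error rate can no longer ignore them.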

Discussion

- Evaluate learning
  - Estimate error rate
  - Minimum description length principle/Occam’s Razor
- Comparison of algorithms
  - Based on evaluation
  - Make sure difference is significant
- Cost of making errors may differ
  - Use evaluation procedures with caution
  - Incorporate into learning

Engineering the Output

- Prediction based on one model
  - Model performs well on one training set, but poorly on others
  - New data becomes available: new model
- Combine models (improves prediction but complicates structure)
  - Bagging
  - Boosting
  - Stacking

Bagging

- Bias: error despite all the data in the world!
- Variance: error due to limited data
- Intuitive idea of bagging:
- Assume we have several data sets
- Apply learning algorithm to each set
- Vote on the prediction (classification/numeric)
- What type of error does this reduce?
- When is this beneficial?

Bootstrap Aggregating

- In practice: only one training data set
  - Create many sets from one
  - Sample with replacement (remember the bootstrap)
- Does this work?
  - Often gives improvements in predictive performance
  - Never degrades performance
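Bagging can be sketched with any base learner. The example below uses a one-attribute threshold rule (a decision stump) on toy data; the base learner, the number of models, and the data are assumptions made for the example.

```python
import random
from collections import Counter


def train_stump(rows):
    """Base learner: best single threshold on a 1-D numeric attribute."""
    best = None
    for threshold, _ in rows:
        for low, high in (("no", "yes"), ("yes", "no")):
            errors = sum(
                1 for x, label in rows if (low if x < threshold else high) != label
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, low, high)
    _, threshold, low, high = best
    return lambda x: low if x < threshold else high


def bagging(rows, train_fn, n_models=25, seed=0):
    """Train one model per bootstrap sample and combine them by voting."""
    rng = random.Random(seed)
    models = [
        train_fn([rng.choice(rows) for _ in rows])  # sample with replacement
        for _ in range(n_models)
    ]

    def predict(x):
        return Counter(model(x) for model in models).most_common(1)[0][0]

    return predict


# Toy data: class is "yes" when x is at least 5.
rows = [(x, "yes" if x >= 5 else "no") for x in range(10)]
predict = bagging(rows, train_stump)
print(predict(0), predict(9))
```

For numeric prediction the vote would be replaced by an average of the individual predictions.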

Boosting

- Assume a stable learning procedure
- Low variance
- Bagging does very little
- Combine structurally different models
- Intuitive motivation:
- Any given model may be good for a subset of the training data
- Encourage models to explain part of the data

AdaBoost.M1

- Generate models:
  - Assign equal weight to each training instance
  - Iterate:
    - Apply learning algorithm to the weighted data and store the model
    - e ← error of the model on the weighted data
    - If e = 0 or e ≥ 0.5, terminate
    - For every instance: if classified correctly, multiply its weight by e/(1-e)
    - Normalize the weights
  - Until STOP

AdaBoost.M1

- Classification:
  - Assign zero weight to each class
  - For every model: add $-\log\frac{e}{1-e}$ to the class predicted by the model
  - Return class with highest weight
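The two AdaBoost.M1 slides translate fairly directly into code. This is a hedged sketch: the weighted decision stump, the toy data, and the cap of three models are assumptions of the example, a model with e = 0 is discarded rather than stored, and the loop stops when e ≥ 0.5 (a model at exactly 0.5 would receive zero vote weight anyway).

```python
import math
from collections import defaultdict


def train_weighted_stump(rows, weights):
    """Base learner: threshold rule with the lowest weighted error."""
    best = None
    for threshold, _ in rows:
        for low, high in (("no", "yes"), ("yes", "no")):
            error = sum(
                w
                for (x, label), w in zip(rows, weights)
                if (low if x < threshold else high) != label
            )
            if best is None or error < best[0]:
                best = (error, threshold, low, high)
    _, threshold, low, high = best
    return lambda x: low if x < threshold else high


def adaboost_m1(rows, train_fn, max_models=3):
    """Return a list of (model, vote weight) pairs."""
    weights = [1 / len(rows)] * len(rows)
    models = []
    for _ in range(max_models):
        model = train_fn(rows, weights)
        e = sum(w for (x, label), w in zip(rows, weights) if model(x) != label)
        if e == 0 or e >= 0.5:
            break
        models.append((model, -math.log(e / (1 - e))))
        # Shrink the weight of correctly classified instances, then normalize.
        weights = [
            w * e / (1 - e) if model(x) == label else w
            for (x, label), w in zip(rows, weights)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return models


def classify(models, x):
    """Weighted vote: each model adds its vote weight to its predicted class."""
    votes = defaultdict(float)
    for model, weight in models:
        votes[model(x)] += weight
    return max(votes, key=votes.get)


# Toy data: "yes" only on the interval 3..6, which no single stump can express.
rows = [(x, "yes" if 3 <= x <= 6 else "no") for x in range(10)]
models = adaboost_m1(rows, train_weighted_stump)
print(len(models), [classify(models, x) for x, _ in rows])
```

No single threshold rule gets below three errors on this data, but the weighted vote of three of them reproduces the interval.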

Performance Analysis

- Error of combined classifier on the training data converges to zero at an exponential rate (very fast)
  - Questionable value due to possible overfitting
  - Must use independent test data
- Fails on test data if
  - Classifier is more complex than training data justifies
  - Training error of the base classifiers becomes too large too quickly
- Must achieve balance between model complexity and the fit to the data

Fitting versus Overfitting

- Overfitting very difficult to assess here
  - Assume we have reached zero training error
  - May be beneficial to continue boosting!
  - Occam's razor?
- Build complex models from simple ones
- Boosting offers very significant improvement
  - Can hope for more improvement than bagging
  - Can degrade performance
  - Never happens with bagging

Stacking

- Models of different types
- Meta learner:
- Learn which learning algorithms are good
- Combine learning algorithms intelligently

[Figure: the level-0 models (decision tree, naïve Bayes, instance-based) pass their predictions to the level-1 model, the meta learner.]

Meta Learning

- Holdout part of the training set
- Use remaining data for training level-0 methods
- Use holdout data to train level-1 learning
- Retrain level-0 algorithms with all the data
- Comments:
- Level-1 learning: use very simple algorithm (e.g., linear model)
- Can use cross-validation to allow level-1 algorithms to train on all the data
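The holdout-based procedure above can be sketched as follows. The three level-0 learners (a threshold rule, 1-nearest neighbor, and a majority-class predictor) and the level-1 learner (a lookup table from level-0 predictions to the most frequent class) are stand-ins chosen to keep the example short; the slide suggests a simple algorithm such as a linear model for level 1. The data and the fixed every-third-instance holdout are also assumptions.

```python
from collections import Counter, defaultdict


def train_majority(rows):
    """Level-0 learner: always predict the most frequent class."""
    majority = Counter(label for _, label in rows).most_common(1)[0][0]
    return lambda x: majority


def train_nearest(rows):
    """Level-0 learner: 1-nearest neighbor on a 1-D attribute."""
    return lambda x: min(rows, key=lambda row: abs(row[0] - x))[1]


def train_stump(rows):
    """Level-0 learner: best single threshold on a 1-D attribute."""
    best = None
    for threshold, _ in rows:
        for low, high in (("no", "yes"), ("yes", "no")):
            errors = sum(
                1 for x, label in rows if (low if x < threshold else high) != label
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, low, high)
    _, threshold, low, high = best
    return lambda x: low if x < threshold else high


def train_stacked(rows, level0_learners, holdout_every=3):
    """Hold out part of the data, train level 0, then level 1, then retrain."""
    holdout = rows[::holdout_every]
    base = [row for i, row in enumerate(rows) if i % holdout_every != 0]
    models = [learn(base) for learn in level0_learners]
    # Level-1 training data: level-0 predictions on the holdout instances.
    table = defaultdict(Counter)
    for x, label in holdout:
        table[tuple(model(x) for model in models)][label] += 1
    fallback = Counter(label for _, label in holdout).most_common(1)[0][0]
    # Retrain the level-0 learners with all the data.
    final_models = [learn(rows) for learn in level0_learners]

    def predict(x):
        key = tuple(model(x) for model in final_models)
        return table[key].most_common(1)[0][0] if key in table else fallback

    return predict


# Toy data: class is "yes" when x is at least 6.
rows = [(x, "yes" if x >= 6 else "no") for x in range(12)]
predict = train_stacked(rows, [train_stump, train_nearest, train_majority])
print(predict(1), predict(8))
```

The level-1 table ends up ignoring the majority-class model, whose prediction never changes, and follows the other two.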

Supervised Learning

- Two types of learning
- Classification
- Numerical prediction
- Classification learning algorithms
- Decision trees
- Naïve Bayes
- Instance-based learning
- Many others are part of Weka, browse!

Other Issues in Supervised Learning

- Evaluation
- Accuracy: hold-out, bootstrap, cross-validation
- Simplicity: MDL principle
- Usefulness: cost-sensitive learning
- Metalearning
- Bagging, Boosting, Stacking
