
An Exercise in Machine Learning

Learn how to use WEKA, a machine learning workbench, to prepare data, build classifiers, and interpret results. Explore different learning schemes, handy tools, and resources available in WEKA.


Presentation Transcript


  1. An Exercise in Machine Learning • http://www.cs.iastate.edu/~cs573x/bbsilab.html • Machine Learning Software • Preparing Data • Building Classifiers • Interpreting Results • Test-driving WEKA

  2. Machine Learning Software • Suites (general purpose) • WEKA (Source: Java) • MLC++ (Source: C++) • SIPINA • List from KDnuggets (various) • Specific (task-oriented) • Classification: C4.5, SVMlight • Association rule mining • Bayesian networks … • Commercial vs. free vs. writing your own

  3. What does WEKA do? • Implementations of state-of-the-art learning algorithms • Main strength is classification • Also includes regression, association rule, and clustering algorithms • Extensible, so new learning schemes are easy to try • Large variety of handy tools (transforming datasets, filters, visualization, etc.)

  4. WEKA resources • API Documentation, Tutorial, Source code. • WEKA mailing list • Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations • Weka-related Projects: • Weka-Parallel - parallel processing for Weka • RWeka - linking R and Weka • YALE - Yet Another Learning Environment • Many others…

  5. Getting Started • Installation (Java runtime + WEKA) • Setting up the environment (CLASSPATH) • Reference book and online API documentation • Preparing data sets • Running WEKA • Interpreting results

  6. ARFF Data Format • Attribute-Relation File Format • Header – describes the relation and attribute types • Data – instances (examples) as comma-separated lists • Use the right data format: • Filestem (C4.5), CSV → ARFF format • Use C45Loader and CSVLoader to convert (see the example below)
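
For concreteness, here is a minimal ARFF file for the classic weather data bundled with WEKA (the same dataset behind the J48 output on slide 16); only the first few data rows are shown:

    @relation weather

    @attribute outlook {sunny, overcast, rainy}
    @attribute temperature real
    @attribute humidity real
    @attribute windy {TRUE, FALSE}
    @attribute play {yes, no}

    @data
    sunny,85,85,FALSE,no
    sunny,80,90,TRUE,no
    overcast,83,86,FALSE,yes
    rainy,70,96,FALSE,yes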

  7. Launching WEKA
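
Assuming weka.jar is on the classpath, WEKA can be launched from the command line; the installation path below is illustrative:

    export CLASSPATH=$CLASSPATH:/path/to/weka.jar
    java weka.gui.explorer.Explorer      # launch the Explorer GUI directly
    java -jar /path/to/weka.jar          # or launch the GUI chooser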

  8. Load Dataset into WEKA

  9. Data Filters • Useful support for data preprocessing • Removing or adding attributes, resampling the dataset, removing examples, etc. • The StratifiedRemoveFolds filter creates stratified cross-validation folds of a dataset, so class distributions are approximately retained within each fold • Data are typically split 2/3 for training and 1/3 for testing, as sketched below
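
A minimal sketch of a stratified 2/3–1/3 split via the Java API (the class and method names are real WEKA API; the file name weather.arff is just the bundled example):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import weka.core.Instances;
    import weka.filters.Filter;
    import weka.filters.supervised.instance.StratifiedRemoveFolds;

    public class SplitExample {
      public static void main(String[] args) throws Exception {
        // Load an ARFF file and mark the last attribute as the class
        Instances data = new Instances(
            new BufferedReader(new FileReader("weather.arff")));
        data.setClassIndex(data.numAttributes() - 1);

        // Keep 2 of 3 stratified folds for training (inverted selection)
        StratifiedRemoveFolds train = new StratifiedRemoveFolds();
        train.setNumFolds(3);
        train.setFold(1);
        train.setInvertSelection(true);   // everything except fold 1
        train.setInputFormat(data);
        Instances trainSet = Filter.useFilter(data, train);

        // The remaining fold (1/3 of the data) becomes the test set
        StratifiedRemoveFolds test = new StratifiedRemoveFolds();
        test.setNumFolds(3);
        test.setFold(1);
        test.setInvertSelection(false);
        test.setInputFormat(data);
        Instances testSet = Filter.useFilter(data, test);

        System.out.println("train: " + trainSet.numInstances()
                         + "  test: " + testSet.numInstances());
      }
    }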

  10. Building Classifiers • A classifier is a model that maps a dataset's attributes to the class (target) attribute; how the model is created, and what form it takes, differs by scheme • Decision Tree and Naïve Bayes classifiers • Which one is best? • No Free Lunch!

  11. Building a Classifier

  12. (1) weka.classifiers.rules.ZeroR • Builds and uses a 0-R classifier, which predicts the mean (for a numeric class) or the mode (for a nominal class). (2) weka.classifiers.bayes.NaiveBayes • Class for building a Naive Bayesian classifier.
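
A minimal sketch of building these two classifiers from the Java API (same illustrative weather.arff file as before):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import weka.core.Instances;
    import weka.classifiers.rules.ZeroR;
    import weka.classifiers.bayes.NaiveBayes;

    public class BaselineExample {
      public static void main(String[] args) throws Exception {
        Instances data = new Instances(
            new BufferedReader(new FileReader("weather.arff")));
        data.setClassIndex(data.numAttributes() - 1);

        ZeroR zr = new ZeroR();           // majority-class (or mean) baseline
        zr.buildClassifier(data);

        NaiveBayes nb = new NaiveBayes(); // Naive Bayesian classifier
        nb.buildClassifier(data);

        System.out.println(zr);           // print both models
        System.out.println(nb);
      }
    }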

  13. (3) weka.classifiers.trees.J48 • Class for generating an unpruned or a pruned C4.5 decision tree.
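
A corresponding sketch for J48; setUnpruned(true) would give the unpruned tree instead:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import weka.core.Instances;
    import weka.classifiers.trees.J48;

    public class J48Example {
      public static void main(String[] args) throws Exception {
        Instances data = new Instances(
            new BufferedReader(new FileReader("weather.arff")));
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();
        tree.setUnpruned(false);          // build a pruned C4.5 tree
        tree.buildClassifier(data);

        System.out.println(tree);         // the tree in text form
        System.out.println("Leaves: " + tree.measureNumLeaves());
        System.out.println("Size:   " + tree.measureTreeSize());
      }
    }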

  14. Test Options • Percentage Split (2/3 training; 1/3 testing) • Cross-validation • estimates generalization error by resampling when data are limited; fold errors are averaged • stratified • 10-fold • leave-one-out (LOO) • 10-fold vs. LOO
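
These test options map directly onto weka.classifiers.Evaluation; a sketch of 10-fold stratified cross-validation (for leave-one-out, set the fold count to the number of instances):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.Random;
    import weka.core.Instances;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;

    public class CrossValidationExample {
      public static void main(String[] args) throws Exception {
        Instances data = new Instances(
            new BufferedReader(new FileReader("weather.arff")));
        data.setClassIndex(data.numAttributes() - 1);

        Evaluation eval = new Evaluation(data);
        // 10-fold stratified CV; use data.numInstances() folds for LOO
        eval.crossValidateModel(new J48(), data, 10, new Random(1));

        System.out.println(eval.toSummaryString());
      }
    }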

  15. Understanding Output

  16. Decision Tree Output (1)

     J48 pruned tree
     ------------------
     outlook = sunny
     |   humidity <= 75: yes (2.0)
     |   humidity > 75: no (3.0)
     outlook = overcast: yes (4.0)
     outlook = rainy
     |   windy = TRUE: no (2.0)
     |   windy = FALSE: yes (3.0)

     Number of Leaves  : 5
     Size of the tree  : 8

     === Error on training data ===
     Correctly Classified Instances    14      100      %
     Incorrectly Classified Instances   0        0      %
     Kappa statistic                    1
     Mean absolute error                0
     Root mean squared error            0
     Relative absolute error            0      %
     Root relative squared error        0      %
     Total Number of Instances         14

     === Detailed Accuracy By Class ===
     TP Rate   FP Rate   Precision   Recall   F-Measure   Class
       1         0          1          1         1        yes
       1         0          1          1         1        no

     === Confusion Matrix ===
      a b   <-- classified as
      9 0 | a = yes
      0 5 | b = no

  17. Decision Tree Output (2)

     === Stratified cross-validation ===
     Correctly Classified Instances     9       64.2857 %
     Incorrectly Classified Instances   5       35.7143 %
     Kappa statistic                    0.186
     Mean absolute error                0.2857
     Root mean squared error            0.4818
     Relative absolute error           60      %
     Root relative squared error       97.6586 %
     Total Number of Instances         14

     === Detailed Accuracy By Class ===
     TP Rate   FP Rate   Precision   Recall   F-Measure   Class
     0.778     0.6       0.7         0.778    0.737       yes
     0.4       0.222     0.5         0.4      0.444       no

     === Confusion Matrix ===
      a b   <-- classified as
      7 2 | a = yes
      3 2 | b = no

  18. Performance Measures • Accuracy & error rate • Mean absolute error • Root mean-squared error (square root of the average quadratic loss) • Confusion matrix – a contingency table • True positive rate & false positive rate • Precision & F-measure
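
Each of these measures can also be read programmatically from the Evaluation object after testing; a fragment continuing the cross-validation sketch above:

    // Assumes "eval" is a weka.classifiers.Evaluation that has already
    // evaluated a classifier, as in the cross-validation sketch.
    System.out.println("Accuracy:   " + eval.pctCorrect() + " %");
    System.out.println("Error rate: " + eval.pctIncorrect() + " %");
    System.out.println("MAE:        " + eval.meanAbsoluteError());
    System.out.println("RMSE:       " + eval.rootMeanSquaredError());
    System.out.println("TP rate (class 0):   " + eval.truePositiveRate(0));
    System.out.println("FP rate (class 0):   " + eval.falsePositiveRate(0));
    System.out.println("Precision (class 0): " + eval.precision(0));
    System.out.println("F-measure (class 0): " + eval.fMeasure(0));
    System.out.println(eval.toMatrixString());   // the confusion matrix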

  19. Decision Tree Pruning • Counters over-fitting • Pre-pruning and post-pruning • Reduced-error pruning • Subtree raising with different confidence factors • Compare tree size and accuracy (see the option sketch below)
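
In J48 these pruning strategies are plain options; a fragment showing the relevant setters ("data" is a loaded Instances object, as in the earlier sketches):

    J48 tree = new J48();
    // Post-pruning with subtree raising at a given confidence factor
    tree.setUnpruned(false);
    tree.setSubtreeRaising(true);
    tree.setConfidenceFactor(0.25f);   // smaller values prune more aggressively

    // Alternatively, reduced-error pruning on a held-out subset:
    // tree.setReducedErrorPruning(true);

    tree.buildClassifier(data);        // then compare measureTreeSize() and accuracy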

  20. Subtree replacement • Bottom-up: tree is considered for replacement once all its subtrees have been considered

  21. Subtree Raising • Deletes node and redistributes instances • Slower than subtree replacement

  22. Naïve Bayesian Classifier • Outputs conditional probability tables (CPTs) and the same set of performance measures • By default, uses a normal distribution to model numeric attributes • A kernel density estimator can improve performance when the normality assumption does not hold (-K option)
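
Toggling the kernel density estimator from the Java API is a one-liner, equivalent to the -K command-line option ("data" loaded as in the earlier sketches):

    NaiveBayes nb = new NaiveBayes();
    nb.setUseKernelEstimator(true);   // kernel density estimates instead of a single Gaussian
    nb.buildClassifier(data);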

  23. Data Sets to work on • Data sets were preprocessed into ARFF format • Three data sets from UCI repository • Two data sets from Computational Biology • Protein Function Prediction • Surface Residue Prediction

  24. Protein Function Prediction • Build a decision tree classifier that assigns protein sequences to functional families based on characteristic motif compositions • Each attribute (motif) is identified by a Prosite accession number: PS#### • Class labels use Prosite documentation IDs: PDOC#### • 73 attributes (binary) & 10 classes (PDOC) • Suggested method: use 10-fold CV and prune the tree with subtree raising (see the sketch below)
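
Putting the suggested method together for this task (the file name proteins.arff is hypothetical; the API calls are the same as in the earlier sketches):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.Random;
    import weka.core.Instances;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;

    public class ProteinFunctionExample {
      public static void main(String[] args) throws Exception {
        // 73 binary motif attributes + 1 class attribute (PDOC label)
        Instances data = new Instances(
            new BufferedReader(new FileReader("proteins.arff")));
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();
        tree.setSubtreeRaising(true);      // prune with subtree raising
        tree.setConfidenceFactor(0.25f);

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));   // 10-fold CV
        System.out.println(eval.toSummaryString());
      }
    }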

  25. Surface Residue Prediction • Prediction is based on the identity of the target residue and its 4 sequence neighbors • Window size = 5 • Is the target residue on the surface or not? • 5 attributes and a binary class • Suggested method: use the Naïve Bayesian classifier with no kernel estimator

  26. Your Turn to Test Drive!
