
An Exercise in Machine Learning

Learn how to use WEKA, a machine learning workbench, to prepare data, build classifiers, and interpret results. Explore different learning schemes, handy tools, and resources available in WEKA.


Presentation Transcript


  1. An Exercise in Machine Learning • http://www.cs.iastate.edu/~cs573x/bbsilab.html • Machine Learning Software • Preparing Data • Building Classifiers • Interpreting Results • Test-driving WEKA

  2. Machine Learning Software • Suites (general purpose) • WEKA (Source: Java) • MLC++ (Source: C++) • SIPINA • List from KDnuggets (various) • Specific (task-oriented) • Classification: C4.5, SVMlight • Association rule mining • Bayesian networks … • Commercial vs. free vs. writing your own

  3. What does WEKA do? • Implementations of state-of-the-art learning algorithms • Main strength is classification • Also includes regression, association rule, and clustering algorithms • Extensible, so new learning schemes are easy to try • Large variety of handy tools (transforming datasets, filters, visualization, etc.)

  4. WEKA resources • API Documentation, Tutorial, Source code. • WEKA mailing list • Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations • Weka-related Projects: • Weka-Parallel - parallel processing for Weka • RWeka - linking R and Weka • YALE - Yet Another Learning Environment • Many others…

  5. Getting Started • Installation (Java runtime + WEKA) • Setting up the environment (CLASSPATH) • Reference book and online API documentation • Preparing data sets • Running WEKA • Interpreting results

  6. ARFF Data Format • Attribute-Relation File Format • Header – describes the relation and attribute types • Data – instances (examples) as comma-separated lists • Use the right data format: • Filestem (C4.5), CSV → ARFF format • Use C45Loader and CSVLoader to convert (see the example below)
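
For concreteness, here is a minimal ARFF file for the classic weather data bundled with WEKA (the same dataset behind the J48 output on slide 16); only the first few data rows are shown:

    @relation weather

    @attribute outlook {sunny, overcast, rainy}
    @attribute temperature real
    @attribute humidity real
    @attribute windy {TRUE, FALSE}
    @attribute play {yes, no}

    @data
    sunny,85,85,FALSE,no
    sunny,80,90,TRUE,no
    overcast,83,86,FALSE,yes
    rainy,70,96,FALSE,yes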

  7. Launching WEKA
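
Assuming weka.jar is on the classpath, WEKA can be launched from the command line; the installation path below is illustrative:

    export CLASSPATH=$CLASSPATH:/path/to/weka.jar
    java weka.gui.explorer.Explorer      # launch the Explorer GUI directly
    java -jar /path/to/weka.jar          # or launch the GUI chooser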

  8. Load Dataset into WEKA

  9. Data Filters • Useful support for data preprocessing • Removing or adding attributes, resampling the dataset, removing examples, etc. • The StratifiedRemoveFolds filter creates stratified cross-validation folds of a dataset, so class distributions are approximately retained within each fold • Data are typically split 2/3 for training and 1/3 for testing, as sketched below
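
A minimal sketch of a stratified 2/3–1/3 split via the Java API (the class and method names are real WEKA API; the file name weather.arff is just the bundled example):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import weka.core.Instances;
    import weka.filters.Filter;
    import weka.filters.supervised.instance.StratifiedRemoveFolds;

    public class SplitExample {
      public static void main(String[] args) throws Exception {
        // Load an ARFF file and mark the last attribute as the class
        Instances data = new Instances(
            new BufferedReader(new FileReader("weather.arff")));
        data.setClassIndex(data.numAttributes() - 1);

        // Keep 2 of 3 stratified folds for training (inverted selection)
        StratifiedRemoveFolds train = new StratifiedRemoveFolds();
        train.setNumFolds(3);
        train.setFold(1);
        train.setInvertSelection(true);   // everything except fold 1
        train.setInputFormat(data);
        Instances trainSet = Filter.useFilter(data, train);

        // The remaining fold (1/3 of the data) becomes the test set
        StratifiedRemoveFolds test = new StratifiedRemoveFolds();
        test.setNumFolds(3);
        test.setFold(1);
        test.setInvertSelection(false);
        test.setInputFormat(data);
        Instances testSet = Filter.useFilter(data, test);

        System.out.println("train: " + trainSet.numInstances()
                         + "  test: " + testSet.numInstances());
      }
    }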

  10. Building Classifiers • A classifier is a model that maps a dataset's attributes to the class (target) attribute; how the model is created, and what form it takes, differs by scheme • Decision Tree and Naïve Bayes classifiers • Which one is best? • No Free Lunch!

  11. Building a Classifier

  12. (1) weka.classifiers.rules.ZeroR • Builds and uses a 0-R classifier, which predicts the mean (for a numeric class) or the mode (for a nominal class). (2) weka.classifiers.bayes.NaiveBayes • Class for building a Naive Bayesian classifier.
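
A minimal sketch of building these two classifiers from the Java API (same illustrative weather.arff file as before):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import weka.core.Instances;
    import weka.classifiers.rules.ZeroR;
    import weka.classifiers.bayes.NaiveBayes;

    public class BaselineExample {
      public static void main(String[] args) throws Exception {
        Instances data = new Instances(
            new BufferedReader(new FileReader("weather.arff")));
        data.setClassIndex(data.numAttributes() - 1);

        ZeroR zr = new ZeroR();           // majority-class (or mean) baseline
        zr.buildClassifier(data);

        NaiveBayes nb = new NaiveBayes(); // Naive Bayesian classifier
        nb.buildClassifier(data);

        System.out.println(zr);           // print both models
        System.out.println(nb);
      }
    }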

  13. (3) weka.classifiers.trees.J48 • Class for generating an unpruned or a pruned C4.5 decision tree.
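
A corresponding sketch for J48; setUnpruned(true) would give the unpruned tree instead:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import weka.core.Instances;
    import weka.classifiers.trees.J48;

    public class J48Example {
      public static void main(String[] args) throws Exception {
        Instances data = new Instances(
            new BufferedReader(new FileReader("weather.arff")));
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();
        tree.setUnpruned(false);          // build a pruned C4.5 tree
        tree.buildClassifier(data);

        System.out.println(tree);         // the tree in text form
        System.out.println("Leaves: " + tree.measureNumLeaves());
        System.out.println("Size:   " + tree.measureTreeSize());
      }
    }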

  14. Test Options • Percentage Split (2/3 training; 1/3 testing) • Cross-validation • estimates generalization error by resampling when data are limited; fold errors are averaged • stratified • 10-fold • leave-one-out (LOO) • 10-fold vs. LOO
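
These test options map directly onto weka.classifiers.Evaluation; a sketch of 10-fold stratified cross-validation (for leave-one-out, set the fold count to the number of instances):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.Random;
    import weka.core.Instances;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;

    public class CrossValidationExample {
      public static void main(String[] args) throws Exception {
        Instances data = new Instances(
            new BufferedReader(new FileReader("weather.arff")));
        data.setClassIndex(data.numAttributes() - 1);

        Evaluation eval = new Evaluation(data);
        // 10-fold stratified CV; use data.numInstances() folds for LOO
        eval.crossValidateModel(new J48(), data, 10, new Random(1));

        System.out.println(eval.toSummaryString());
      }
    }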

  15. Understanding Output

  16. Decision Tree Output (1)

     J48 pruned tree
     ------------------
     outlook = sunny
     |   humidity <= 75: yes (2.0)
     |   humidity > 75: no (3.0)
     outlook = overcast: yes (4.0)
     outlook = rainy
     |   windy = TRUE: no (2.0)
     |   windy = FALSE: yes (3.0)

     Number of Leaves  : 5
     Size of the tree  : 8

     === Error on training data ===
     Correctly Classified Instances    14      100      %
     Incorrectly Classified Instances   0        0      %
     Kappa statistic                    1
     Mean absolute error                0
     Root mean squared error            0
     Relative absolute error            0      %
     Root relative squared error        0      %
     Total Number of Instances         14

     === Detailed Accuracy By Class ===
     TP Rate   FP Rate   Precision   Recall   F-Measure   Class
       1         0          1          1         1        yes
       1         0          1          1         1        no

     === Confusion Matrix ===
      a b   <-- classified as
      9 0 | a = yes
      0 5 | b = no

  17. Decision Tree Output (2)

     === Stratified cross-validation ===
     Correctly Classified Instances     9       64.2857 %
     Incorrectly Classified Instances   5       35.7143 %
     Kappa statistic                    0.186
     Mean absolute error                0.2857
     Root mean squared error            0.4818
     Relative absolute error           60      %
     Root relative squared error       97.6586 %
     Total Number of Instances         14

     === Detailed Accuracy By Class ===
     TP Rate   FP Rate   Precision   Recall   F-Measure   Class
     0.778     0.6       0.7         0.778    0.737       yes
     0.4       0.222     0.5         0.4      0.444       no

     === Confusion Matrix ===
      a b   <-- classified as
      7 2 | a = yes
      3 2 | b = no

  18. Performance Measures • Accuracy & error rate • Mean absolute error • Root mean-squared error (square root of the average quadratic loss) • Confusion matrix – a contingency table • True positive rate & false positive rate • Precision & F-measure
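
Each of these measures can also be read programmatically from the Evaluation object after testing; a fragment continuing the cross-validation sketch above:

    // Assumes "eval" is a weka.classifiers.Evaluation that has already
    // evaluated a classifier, as in the cross-validation sketch.
    System.out.println("Accuracy:   " + eval.pctCorrect() + " %");
    System.out.println("Error rate: " + eval.pctIncorrect() + " %");
    System.out.println("MAE:        " + eval.meanAbsoluteError());
    System.out.println("RMSE:       " + eval.rootMeanSquaredError());
    System.out.println("TP rate (class 0):   " + eval.truePositiveRate(0));
    System.out.println("FP rate (class 0):   " + eval.falsePositiveRate(0));
    System.out.println("Precision (class 0): " + eval.precision(0));
    System.out.println("F-measure (class 0): " + eval.fMeasure(0));
    System.out.println(eval.toMatrixString());   // the confusion matrix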

  19. Decision Tree Pruning • Counters over-fitting • Pre-pruning and post-pruning • Reduced-error pruning • Subtree raising with different confidence factors • Compare tree size and accuracy (see the option sketch below)
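
In J48 these pruning strategies are plain options; a fragment showing the relevant setters ("data" is a loaded Instances object, as in the earlier sketches):

    J48 tree = new J48();
    // Post-pruning with subtree raising at a given confidence factor
    tree.setUnpruned(false);
    tree.setSubtreeRaising(true);
    tree.setConfidenceFactor(0.25f);   // smaller values prune more aggressively

    // Alternatively, reduced-error pruning on a held-out subset:
    // tree.setReducedErrorPruning(true);

    tree.buildClassifier(data);        // then compare measureTreeSize() and accuracy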

  20. Subtree replacement • Bottom-up: tree is considered for replacement once all its subtrees have been considered

  21. Subtree Raising • Deletes node and redistributes instances • Slower than subtree replacement

  22. Naïve Bayesian Classifier • Outputs conditional probability tables (CPTs) and the same set of performance measures • By default, uses a normal distribution to model numeric attributes • A kernel density estimator can improve performance when the normality assumption does not hold (-K option)
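
Toggling the kernel density estimator from the Java API is a one-liner, equivalent to the -K command-line option ("data" loaded as in the earlier sketches):

    NaiveBayes nb = new NaiveBayes();
    nb.setUseKernelEstimator(true);   // kernel density estimates instead of a single Gaussian
    nb.buildClassifier(data);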

  23. Data Sets to work on • Data sets were preprocessed into ARFF format • Three data sets from UCI repository • Two data sets from Computational Biology • Protein Function Prediction • Surface Residue Prediction

  24. Protein Function Prediction • Build a decision tree classifier that assigns protein sequences to functional families based on characteristic motif compositions • Each attribute (motif) is identified by a Prosite accession number: PS#### • Class labels use Prosite documentation IDs: PDOC#### • 73 attributes (binary) & 10 classes (PDOC) • Suggested method: use 10-fold CV and prune the tree with subtree raising (see the sketch below)
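
Putting the suggested method together for this task (the file name proteins.arff is hypothetical; the API calls are the same as in the earlier sketches):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.Random;
    import weka.core.Instances;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;

    public class ProteinFunctionExample {
      public static void main(String[] args) throws Exception {
        // 73 binary motif attributes + 1 class attribute (PDOC label)
        Instances data = new Instances(
            new BufferedReader(new FileReader("proteins.arff")));
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();
        tree.setSubtreeRaising(true);      // prune with subtree raising
        tree.setConfidenceFactor(0.25f);

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));   // 10-fold CV
        System.out.println(eval.toSummaryString());
      }
    }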

  25. Surface Residue Prediction • Prediction is based on the identity of the target residue and its 4 sequence neighbors • Window size = 5 • Is the target residue on the surface or not? • 5 attributes and a binary class • Suggested method: use the Naïve Bayesian classifier with no kernel estimator

  26. Your Turn to Test Drive!
