
An Exercise in Machine Learning



  1. An Exercise in Machine Learning • http://www.cs.iastate.edu/~cs573x/BBSIlab/2006/ • Cornelia Caragea

  2. Outline • Machine Learning Software • Preparing Data • Building Classifiers • Interpreting Results

  3. Machine Learning Software • Suites (General Purpose) • WEKA (Source: Java) • MLC++ (Source: C++) • SAS • List from KDNuggets (Various) • Specific • Classification: C4.5, SVMlight • Association Rule Mining • Bayesian Net … • Commercial vs. Free

  4. What does WEKA do? • Implements state-of-the-art learning algorithms • Main strength is classification • Also provides regression, association rule, and clustering algorithms • Extensible, so new learning schemes can be tried • Large variety of handy tools (dataset transformation, filters, visualization, etc.)

  5. WEKA resources • API Documentation, Tutorials, Source code. • WEKA mailing list • Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations • Weka-related Projects: • Weka-Parallel - parallel processing for Weka • RWeka - linking R and Weka • YALE - Yet Another Learning Environment • Many others…

  6. Outline • Machine Learning Software • Preparing Data • Building Classifiers • Interpreting Results

  7. Preparing Data • ARFF Data Format • Header: describes the attributes and their types • Data: instances (examples), each given as a comma-separated list of attribute values
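
As a rough illustration of the header/data split described on this slide, the sketch below defines a tiny ARFF text and parses it with the Python standard library. The dataset (`weather_toy` and its values) and the helper `parse_arff` are made up for this example; the parser only handles the simplest case (unquoted names, dense data) and is not how WEKA reads ARFF files.

```python
# Hypothetical toy dataset; the relation name and values are made up.
ARFF_TEXT = """\
% lines starting with a percent sign are comments
@relation weather_toy

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute play {yes, no}

@data
sunny,85,no
overcast,83,yes
rainy,70,yes
"""


def parse_arff(text):
    """Parse a simple ARFF string into (relation, attributes, rows).

    Handles only unquoted names, nominal/numeric attributes and dense data.
    """
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        if in_data:
            # Data section: one instance per line, comma-separated values.
            rows.append([value.strip() for value in line.split(",")])
            continue
        lowered = line.lower()
        if lowered.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lowered.startswith("@attribute"):
            # Header section: attribute name followed by its type.
            _, name, attr_type = line.split(None, 2)
            attributes.append((name, attr_type))
        elif lowered.startswith("@data"):
            in_data = True
    return relation, attributes, rows


relation, attributes, rows = parse_arff(ARFF_TEXT)
print(relation)
print(attributes)
print(rows)
```

The last attribute (`play`) plays the role of the class attribute here, which is only a convention of this example.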

  8. Launching WEKA • java -jar weka.jar

  9. Load Dataset into WEKA

  10. Data Filters • Useful support for data preprocessing • Removing or adding attributes, resampling the dataset, removing examples, etc. • Can create stratified cross-validation folds of a given dataset, so that class distributions are approximately retained within each fold • Data is typically split 2/3 for training and 1/3 for testing
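
To make the idea of stratified folds concrete, here is a minimal standard-library sketch. The function `stratified_folds` and the toy labels are assumptions for illustration; this is not WEKA's filter code. It deals the instances of each class round-robin into the folds, so every fold keeps roughly the original class proportions, and then uses one of three folds as the 1/3 test set.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, k, seed=0):
    """Assign each instance index to one of k folds, class by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    next_fold = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal indices round-robin so each fold gets a similar class mix.
        for index in indices:
            folds[next_fold % k].append(index)
            next_fold += 1
    return folds


# Toy class column: 9 positive and 6 negative instances.
labels = ["yes"] * 9 + ["no"] * 6
folds = stratified_folds(labels, k=3)
for fold in folds:
    print(Counter(labels[i] for i in fold))

# A 2/3 training, 1/3 testing split: hold out one of the three folds.
test_idx = folds[0]
train_idx = folds[1] + folds[2]
print(len(train_idx), len(test_idx))
```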

  11. Data Filters

  12. Outline • Machine Learning Software • Preparing Data • Building Classifiers • Interpreting Results

  13. Building Classifiers • A classifier model is a mapping from dataset attributes to the class (target) attribute; how the model is created and what form it takes differ between algorithms • Decision Tree and Naïve Bayes Classifiers • Which one is the best? • No Free Lunch!

  14. Building Classifiers

  15. (1) weka.classifiers.rules.ZeroR • Class for building and using a 0-R classifier • Majority class classifier • Predicts the mean (for a numeric class) or the mode (for a nominal class)
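
The behaviour described on this slide can be sketched in a few lines. The function `zero_r` below is a hypothetical stand-in written for this transcript, not WEKA's class: it ignores every attribute and returns one constant prediction computed from the class column alone.

```python
from collections import Counter
from statistics import mean


def zero_r(class_values):
    """Return the constant prediction a 0-R style baseline would make."""
    if all(isinstance(v, (int, float)) for v in class_values):
        return mean(class_values)  # numeric class: predict the mean
    return Counter(class_values).most_common(1)[0][0]  # nominal class: the mode


print(zero_r(["yes", "yes", "no", "yes", "no"]))
print(zero_r([2.0, 4.0, 9.0]))
```

Such a baseline is useful mainly as a reference point: a classifier that cannot beat it has learned nothing from the attributes.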

  16. Exercise 1 • http://www.cs.iastate.edu/~cs573x/BBSIlab/2006/exercises/ex1.html

  17. (2) weka.classifiers.bayes.NaiveBayes • Class for building a Naive Bayes classifier
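
For readers who want to see the idea behind a Naive Bayes classifier, the following is a minimal sketch for nominal attributes only. The functions `train_naive_bayes` and `predict`, the toy data, and the add-one smoothing are all assumptions of this example; it does not reproduce the WEKA class. Each class is scored by its prior multiplied by the per-attribute conditional probabilities, treating the attributes as independent given the class.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(labels)
    # value_counts[(attribute_index, class_label)][value] -> count
    value_counts = defaultdict(Counter)
    distinct_values = defaultdict(set)
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            value_counts[(j, label)][value] += 1
            distinct_values[j].add(value)
    return class_counts, value_counts, distinct_values


def predict(model, row):
    """Pick the class with the highest log posterior (up to a constant)."""
    class_counts, value_counts, distinct_values = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # log prior
        for j, value in enumerate(row):
            # Add-one smoothing avoids zero probabilities for unseen values.
            numerator = value_counts[(j, label)][value] + 1
            denominator = count + len(distinct_values[j])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training data: (outlook, temperature) -> play
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"),
        ("rainy", "cool"), ("overcast", "hot")]
labels = ["no", "no", "yes", "yes", "yes"]
model = train_naive_bayes(rows, labels)
print(predict(model, ("sunny", "hot")))
print(predict(model, ("rainy", "cool")))
```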

  18. (3) weka.classifiers.trees.J48 • Class for generating a pruned or unpruned C4.5 decision tree
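
C4.5 chooses the attribute to split on using the gain ratio, that is, information gain divided by the entropy of the split itself. The sketch below computes it for a single nominal attribute; the helper names and toy columns are made up, and numeric attributes, missing values and the rest of the tree-growing procedure are left out.

```python
import math
from collections import Counter


def entropy(items):
    """Shannon entropy (in bits) of the value distribution in items."""
    total = len(items)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(items).values())


def gain_ratio(values, labels):
    """Information gain of a nominal split divided by its split information."""
    total = len(labels)
    partitions = {}
    for value, label in zip(values, labels):
        partitions.setdefault(value, []).append(label)
    # Expected class entropy after splitting on the attribute.
    remainder = sum(len(part) / total * entropy(part)
                    for part in partitions.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy(values)
    return info_gain / split_info if split_info > 0 else 0.0


labels = ["yes", "yes", "no", "no"]
informative = ["a", "a", "b", "b"]    # separates the classes perfectly
uninformative = ["a", "b", "a", "b"]  # tells us nothing about the class
print(gain_ratio(informative, labels))
print(gain_ratio(uninformative, labels))
```

An attribute that separates the classes perfectly gets the highest ratio, while one that is unrelated to the class gets a ratio of zero.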

  19. Test Options • Percentage Split (2/3 Training; 1/3 Testing) • Cross-validation • Estimates the generalization error by resampling when data is limited; the error estimate is averaged over the folds • Stratified • 10-fold • Leave-one-out (LOO)
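
The averaged error estimate can be illustrated with a small sketch. To stay self-contained it evaluates a majority-class baseline rather than a real learner, and it assigns folds by index without shuffling or stratification; the function names and labels are assumptions of this example. Leave-one-out is simply the case where the number of folds equals the number of instances.

```python
from collections import Counter


def majority_label(labels):
    """Most frequent label in the training portion."""
    return Counter(labels).most_common(1)[0][0]


def cross_validated_error(labels, k):
    """Average the test error of a majority-class baseline over k folds."""
    n = len(labels)
    fold_errors = []
    for fold in range(k):
        test_idx = [i for i in range(n) if i % k == fold]
        train = [labels[i] for i in range(n) if i % k != fold]
        prediction = majority_label(train)  # "train" on the other folds
        wrong = sum(1 for i in test_idx if labels[i] != prediction)
        fold_errors.append(wrong / len(test_idx))
    return sum(fold_errors) / k


labels = ["yes", "no", "yes", "yes", "no", "yes", "yes", "no", "yes", "yes"]
print(cross_validated_error(labels, k=5))            # 5-fold
print(cross_validated_error(labels, k=len(labels)))  # leave-one-out
```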

  20. Outline • Machine Learning Software • Preparing Data • Building Classifiers • Interpreting Results

  21. Understanding Output

  22. Decision Tree Output (1)

  23. Decision Tree Output (2)

  24. Exercise 2 • http://www.cs.iastate.edu/~cs573x/BBSIlab/2006/exercises/ex2.html

  25. Performance Measures • Accuracy & Error rate • Confusion matrix – contingency table • True Positive rate & False Positive rate (Area under the Receiver Operating Characteristic curve) • Precision, Recall & F-Measure • Sensitivity & Specificity • For more information on these, see uisp09-Evaluation.ppt
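
Most of the measures listed here follow directly from the four cells of a two-class confusion matrix. The sketch below shows the standard definitions; the function `binary_metrics` and the example predictions are made up, and the area under the ROC curve is omitted because it needs ranked scores rather than hard predictions.

```python
def binary_metrics(actual, predicted, positive):
    """Derive common measures from the 2x2 confusion matrix counts.

    Assumes every denominator is non-zero (no guarding for empty cells).
    """
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also the true positive rate and sensitivity
    return {
        "confusion": [[tp, fn], [fp, tn]],  # rows: actual, columns: predicted
        "accuracy": (tp + tn) / len(pairs),
        "error_rate": (fp + fn) / len(pairs),
        "tp_rate": recall,
        "fp_rate": fp / (fp + tn),
        "precision": precision,
        "recall": recall,
        "f_measure": 2 * precision * recall / (precision + recall),
        "specificity": tn / (fp + tn),
    }


actual = ["yes"] * 4 + ["no"] * 6
predicted = ["yes", "yes", "yes", "no", "yes", "no", "no", "no", "no", "no"]
metrics = binary_metrics(actual, predicted, positive="yes")
for name, value in metrics.items():
    print(name, value)
```

Note that specificity equals one minus the false positive rate, and sensitivity is another name for recall.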

  26. Decision Tree Pruning • Overcomes over-fitting • Pre-pruning and post-pruning • Reduced-error pruning • Subtree raising with different confidence factors • Comparing tree size and accuracy

  27. Subtree Replacement • Bottom-up: a subtree is considered for replacement only after all of its own subtrees have been considered
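
The bottom-up order can be sketched on a small dictionary-based tree. This example makes several assumptions that are not from the slides: the tree layout (`attr`, `children`, `majority`, `label` keys), the helper names, and the use of a reduced-error criterion on held-out instances to decide whether a subtree becomes a leaf. It is an illustration of the traversal order, not J48's pruning code.

```python
def classify(node, row):
    """Follow the branches matching the row until a leaf is reached."""
    while "children" in node:
        child = node["children"].get(row[node["attr"]])
        if child is None:  # unseen value: fall back to the node's majority
            return node["majority"]
        node = child
    return node["label"]


def errors(node, rows, labels):
    """Number of held-out instances the (sub)tree misclassifies."""
    return sum(1 for row, label in zip(rows, labels)
               if classify(node, row) != label)


def prune(node, rows, labels):
    """Bottom-up subtree replacement judged on held-out (rows, labels)."""
    if "children" not in node:
        return node
    # First consider every subtree, using only the instances routed to it.
    for value, child in list(node["children"].items()):
        subset = [(r, l) for r, l in zip(rows, labels)
                  if r[node["attr"]] == value]
        sub_rows = [r for r, _ in subset]
        sub_labels = [l for _, l in subset]
        node["children"][value] = prune(child, sub_rows, sub_labels)
    # Only then consider replacing this node itself by a leaf.
    leaf = {"label": node["majority"]}
    if errors(leaf, rows, labels) <= errors(node, rows, labels):
        return leaf
    return node


# Hypothetical tree: attribute 0 is outlook, attribute 1 is humidity.
tree = {
    "attr": 0, "majority": "yes",
    "children": {
        "sunny": {"attr": 1, "majority": "no",
                  "children": {"high": {"label": "no"},
                               "normal": {"label": "yes"}}},
        "rainy": {"attr": 1, "majority": "yes",
                  "children": {"high": {"label": "no"},
                               "normal": {"label": "yes"}}},
    },
}
held_out_rows = [("sunny", "high"), ("sunny", "normal"), ("rainy", "high"),
                 ("rainy", "normal"), ("rainy", "high")]
held_out_labels = ["no", "yes", "yes", "yes", "yes"]

pruned = prune(tree, held_out_rows, held_out_labels)
print(pruned)
```

In this toy case the `rainy` subtree is replaced by a single leaf because the leaf makes no more held-out errors than the subtree, while the `sunny` subtree and the root are kept.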

  28. Subtree Raising • Deletes a node and redistributes its instances • Slower than subtree replacement

  29. Exercise 3 • http://www.cs.iastate.edu/~cs573x/BBSIlab/2006/exercises/ex3.html
