Discover Weka, a powerful open-source tool for machine learning tasks like classification, regression, and clustering. Experiment with feature selection, evaluation schemes, and visualizations. Explore learning tasks like generating procedures for labeling unseen examples. Dive into data formats and evaluation schemes for precise and accurate results.
Weka: Just do it
Free and Open Source ML Suite
Ian Witten & Eibe Frank, University of Waikato, New Zealand
Overview
• Classifiers, regressors, and clusterers
• Multiple evaluation schemes
• Bagging and boosting
• Feature selection:
  • the right features and data are key to successful learning
• Experimenter
• Visualizer
• Text not up to date.
• They welcome additions.
Learning Tasks
• Classification: given examples labelled from a finite domain, generate a procedure for labelling unseen examples.
• Regression: given examples labelled with a real value, generate a procedure for labelling unseen examples.
• Clustering: given a set of examples, partition them into “interesting” groups. This is what scientists want.
Data Format: IRIS
@RELATION iris
@ATTRIBUTE sepallength REAL
@ATTRIBUTE sepalwidth REAL
@ATTRIBUTE petallength REAL
@ATTRIBUTE petalwidth REAL
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
Etc.
General form: @ATTRIBUTE attribute-name, followed by REAL or a list of values in braces
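To make the layout of the format concrete, here is a minimal Python sketch that reads the fragment above. It is not Weka's own loader: the function name parse_arff is hypothetical, and the sketch assumes only REAL attributes and brace-delimited value lists, with no quoting, sparse rows, or missing values.

```python
def parse_arff(text):
    """Parse a minimal ARFF string into (relation, attributes, rows).

    Assumption: attributes are either REAL or a {a,b,c} list of values.
    """
    relation = None
    attributes = []  # (name, kind); kind is "REAL" or a list of allowed values
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        if in_data:
            values = [v.strip() for v in line.split(",")]
            # Convert REAL columns to float, keep nominal columns as strings
            rows.append([float(v) if kind == "REAL" else v
                         for (_, kind), v in zip(attributes, values)])
            continue
        keyword, _, rest = line.partition(" ")
        keyword = keyword.upper()  # ARFF keywords are case-insensitive
        if keyword == "@RELATION":
            relation = rest.strip()
        elif keyword == "@ATTRIBUTE":
            name, _, kind = rest.strip().partition(" ")
            kind = kind.strip()
            if kind.startswith("{"):
                kind = [v.strip() for v in kind.strip("{}").split(",")]
            else:
                kind = kind.upper()
            attributes.append((name, kind))
        elif keyword == "@DATA":
            in_data = True
    return relation, attributes, rows


IRIS_SAMPLE = """\
@RELATION iris
@ATTRIBUTE sepallength REAL
@ATTRIBUTE sepalwidth REAL
@ATTRIBUTE petallength REAL
@ATTRIBUTE petalwidth REAL
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
"""

relation, attributes, rows = parse_arff(IRIS_SAMPLE)
print(relation, [name for name, _ in attributes])
print(rows[0])
```

The header declares the columns once, so each data row can stay a bare comma-separated line.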
J48 = Decision Tree
petalwidth <= 0.6: Iris-setosa (50.0)
petalwidth > 0.6
|   petalwidth <= 1.7
|   |   petallength <= 4.9: Iris-versicolor (48.0/1.0)
|   |   petallength > 4.9
|   |   |   petalwidth <= 1.5: Iris-virginica (3.0)
|   |   |   petalwidth > 1.5: Iris-versicolor (3.0/1.0)
|   petalwidth > 1.7: Iris-virginica (46.0/1.0)
Leaf counts (n/m): n = number of instances under the node, m = number of those that are wrong.
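The tree above is itself a "procedure for labelling unseen examples". As an illustration, it can be transcribed by hand into nested conditionals; the function name classify_iris is hypothetical and the thresholds are copied directly from the tree, so this is a sketch of what the tree says rather than code that Weka generates.

```python
def classify_iris(petallength, petalwidth):
    """Hand transcription of the J48 tree shown above.

    Only petal length and petal width are used, because the tree
    never tests the sepal measurements.
    """
    if petalwidth <= 0.6:
        return "Iris-setosa"
    # petalwidth > 0.6
    if petalwidth <= 1.7:
        if petallength <= 4.9:
            return "Iris-versicolor"
        # petallength > 4.9
        if petalwidth <= 1.5:
            return "Iris-virginica"
        return "Iris-versicolor"
    # petalwidth > 1.7
    return "Iris-virginica"


# First data row of the IRIS file: petallength 1.4, petalwidth 0.2
print(classify_iris(1.4, 0.2))
```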
Cross-validation
• Correctly Classified Instances: 143 (95.33%)
• Incorrectly Classified Instances: 7 (4.67%)
• Default is 10-fold cross-validation, i.e.
  • Split the data into 10 equal-sized pieces
  • Train on 9 pieces and test on the remaining one
  • Repeat so that each piece is the test set once, and average
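The split, train, test, and average steps can be sketched in a few lines of Python. This is an illustrative sketch, not Weka's implementation: the names k_fold_indices and cross_validate, the seed parameter, and the train_and_score callback (which is assumed to train on one index list and return a score on the other) are all hypothetical.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # shuffle so folds are not ordered by class
    folds = [indices[i::k] for i in range(k)]  # k pieces whose sizes differ by at most 1
    for i, test in enumerate(folds):
        # Training set = every fold except the one held out for testing
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(n, train_and_score, k=10, seed=0):
    """Average the score returned by train_and_score over all k folds."""
    scores = [train_and_score(train, test)
              for train, test in k_fold_indices(n, k, seed)]
    return sum(scores) / len(scores)


# With 150 instances and k = 10, each split trains on 135 and tests on 15.
for train, test in k_fold_indices(150):
    print(len(train), len(test))
```

Every instance lands in exactly one test fold, so each example is used for testing once and for training in the other folds.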
J48 Confusion Matrix
Old data set from statistics: 50 of each class

  a  b  c   <-- classified as
 49  1  0 |  a = Iris-setosa
  0 47  3 |  b = Iris-versicolor
  0  3 47 |  c = Iris-virginica
Precision, Recall, and Accuracy
• Precision: probability of being correct given your decision (the predicted class).
  • Precision of Iris-setosa is 49/49 = 100%
  • Called positive predictive value in the medical literature (specificity is a different quantity: the true-negative rate)
• Recall: probability of correctly identifying a class.
  • Recall for Iris-setosa is 49/50 = 98%
  • Called sensitivity in the medical literature
• Accuracy: # right / total = 143/150 ≈ 95%
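The three figures can be checked directly against the J48 confusion matrix. The short Python sketch below hard-codes that matrix (rows are the actual class, columns the predicted class); the function names are illustrative and not part of Weka.

```python
labels = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]

# Rows: actual class, columns: class assigned by J48
confusion = [
    [49, 1, 0],
    [0, 47, 3],
    [0, 3, 47],
]


def precision(matrix, c):
    """Correct predictions of class c divided by all predictions of class c."""
    predicted = sum(row[c] for row in matrix)  # column sum
    return matrix[c][c] / predicted if predicted else 0.0


def recall(matrix, c):
    """Correct predictions of class c divided by all actual members of class c."""
    actual = sum(matrix[c])  # row sum
    return matrix[c][c] / actual if actual else 0.0


def accuracy(matrix):
    """Diagonal total divided by the total number of instances."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total


for c, name in enumerate(labels):
    print(name, precision(confusion, c), recall(confusion, c))
print("accuracy", accuracy(confusion))
```

Precision reads down a column and recall reads across a row, which is why the two can differ for the same class.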
Other Evaluation Schemes
• Leave-one-out cross-validation
  • n-fold cross-validation where n = number of training instances
• Specific train and test set
  • Allows for exact replication
  • OK if the train and test sets are large, e.g. in the 10,000 range
Bootstrap sampling
• Randomly select n examples with replacement from the n available
• Expect about 2/3 (more precisely 1 - 1/e ≈ 63%) of the distinct examples to be chosen for training
• Probability that a given example is not chosen = (1 - 1/n)^n ≈ 1/e ≈ 0.37
• Test on the remainder
• Repeat about 30 times and average
• Avoids partition bias
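The sampling step and the 1/e estimate can be checked numerically. The Python sketch below is illustrative only: the names bootstrap_split and prob_not_chosen are hypothetical, and the sample size of 150 and the seed are arbitrary choices, not values taken from Weka.

```python
import math
import random


def bootstrap_split(n, rng):
    """Draw n indices with replacement for training; test on the ones never drawn."""
    train = [rng.randrange(n) for _ in range(n)]  # duplicates are expected
    chosen = set(train)
    test = [i for i in range(n) if i not in chosen]
    return train, test


def prob_not_chosen(n):
    """Probability that one given example is missed by all n draws."""
    return (1 - 1 / n) ** n


rng = random.Random(0)
n, repeats = 150, 30

# Average fraction of distinct examples that end up in the training sample
fractions = []
for _ in range(repeats):
    train, test = bootstrap_split(n, rng)
    fractions.append(len(set(train)) / n)
print("mean fraction chosen:", sum(fractions) / repeats)

# (1 - 1/n)^n approaches 1/e as n grows
print(prob_not_chosen(n), math.exp(-1))
```

Because the training sample contains duplicates, it has n entries but only about 63% distinct examples, which leaves a usable test set without fixing a single partition.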