
Data Mining CSCI 307, Spring 2019 Lecture 8

Explore various classification algorithms in WEKA, including decision trees, rules, and naïve Bayes, along with evaluation methods such as held-out test sets and cross-validation.


Presentation Transcript


  1. Data Mining, CSCI 307, Spring 2019, Lecture 8: WEKA

  2. Classification
  • Predicted target must be categorical/nominal
  • Implemented methods
    • Decision trees (J48, etc.)
    • Rules (ZeroR, OneR, etc.)
    • Naïve Bayes
  • Evaluation methods
    • Test data set
    • Cross-validation

  3. Classification Algorithms
  • ZeroR: Ignores all the attributes and relies only on the target class; always predicts the majority value.
  • OneR: Makes one rule for each attribute (based on the frequency of outcomes for each value of that attribute), then chooses the rule/attribute that gives the smallest error.
  • Naive Bayes: A probabilistic classifier based on Bayes' theorem; assumes all attributes are independent of one another given the class.
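These three learners can also be run through WEKA's Java API instead of the GUI. A minimal sketch, assuming WEKA 3.x is on the classpath and a nominal-class ARFF file; the file name breast_cancer.arff is only illustrative:

    import weka.classifiers.Classifier;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.classifiers.rules.OneR;
    import weka.classifiers.rules.ZeroR;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class ClassifierModels {
        public static void main(String[] args) throws Exception {
            // Load a nominal-class data set; by convention the last attribute is the class.
            Instances data = DataSource.read("breast_cancer.arff");   // illustrative file name
            data.setClassIndex(data.numAttributes() - 1);

            Classifier[] classifiers = { new ZeroR(), new OneR(), new NaiveBayes() };
            for (Classifier c : classifiers) {
                c.buildClassifier(data);   // learn the model from the data
                System.out.println(c);     // ZeroR: majority class, OneR: the chosen rule,
                                           // NaiveBayes: per-class probability tables
            }
        }
    }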

  4. Evaluation Methods
  • Test data set
    • Train on all data; test on all data (not recommended).
    • Split the data (e.g., 66% for training, 34% for testing).
    • Use separate files, one with training instances and one with testing instances.
  • Cross-validation
    • Divide the data set into groups (e.g., 10 groups of instances).
    • Choose one group for testing and use the rest for training.
    • Repeat multiple times with a different group for testing each time (e.g., repeat 10 times, using one of the 10 groups for testing each time and the rest for training).
    • Average the results of all the testing runs.
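Both options map onto WEKA's Evaluation class. A minimal sketch of 10-fold cross-validation with J48; the file name and the random seed are assumptions, not from the slides:

    import java.util.Random;

    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class CrossValidationDemo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("breast_cancer.arff");   // illustrative file name
            data.setClassIndex(data.numAttributes() - 1);

            // 10-fold cross-validation: WEKA splits the data into 10 groups, holds one
            // out for testing in each round, and averages the results over all rounds.
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new J48(), data, 10, new Random(1));

            System.out.println(eval.toSummaryString());   // averaged accuracy and error rates
            System.out.println(eval.toMatrixString());    // aggregated confusion matrix
        }
    }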

  5. WEKA Data Formats
  Data can be imported from a file in various formats:
  • ARFF (Attribute-Relation File Format) has two sections:
    • the Header section defines attribute names, types, and relations.
    • the Data section lists the data records (instances).
  • CSV: Comma-Separated Values (text file)
  • C4.5: a format used by the C4.5 decision tree induction algorithm; requires two separate files:
    • Names file: defines the names of the attributes
    • Data file: lists the records (samples)
  • Binary
  Data can also be read from a URL or from an SQL database (using JDBC; Java Database Connectivity is a Java API that defines how a client may access a database).
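WEKA's converter framework picks a loader from the file extension, so ARFF and CSV files go through the same call; database access goes through weka.experiment.InstanceQuery and a JDBC configuration. A small sketch of file loading, with placeholder file names:

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class LoadData {
        public static void main(String[] args) throws Exception {
            // DataSource chooses a loader from the extension: .arff, .csv, ...
            Instances fromArff = DataSource.read("breast_cancer.arff");   // illustrative paths
            Instances fromCsv  = DataSource.read("breast_cancer.csv");

            // The class attribute is not set automatically; by convention use the last one.
            fromArff.setClassIndex(fromArff.numAttributes() - 1);
            fromCsv.setClassIndex(fromCsv.numAttributes() - 1);

            System.out.println(fromArff.numInstances() + " instances, "
                    + fromArff.numAttributes() + " attributes");
        }
    }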

  6. Attribute-Relation File Format (ARFF)
  ARFF files consist of two distinct sections:
  • the Header section defines attribute names, types, and relations; each line starts with a keyword:
    • @relation <data-name>
    • @attribute <attribute-name> <type> or {range}
  • the Data section lists the data records and starts with:
    • @data
    • followed by the list of data instances
  • Comments: any line starting with %

  7. Breast Cancer Data in ARFF
  % Breast Cancer data*: 286 instances (no-recurrence-events: 201, recurrence-events: 85)
  % Part 1: Definitions of attribute names, types, and relations
  @relation breast-cancer
  @attribute age {'10-19','20-29','30-39','40-49','50-59','60-69','70-79','80-89','90-99'}
  @attribute menopause {'lt40','ge40','premeno'}
  @attribute tumor-size {'0-4','5-9','10-14','15-19','20-24','25-29','30-34','35-39','40-44','45-49','50-54','55-59'}
  @attribute inv-nodes {'0-2','3-5','6-8','9-11','12-14','15-17','18-20','21-23','24-26','27-29','30-32','33-35','36-39'}
  @attribute node-caps {'yes','no'}
  @attribute deg-malig {'1','2','3'}
  @attribute breast {'left','right'}
  @attribute breast-quad {'left_up','left_low','right_up','right_low','central'}
  @attribute irradiat {'yes','no'}
  @attribute Class {'no-recurrence-events','recurrence-events'}
  % Part 2: Data section
  @data
  '40-49','premeno','15-19','0-2','yes','3','right','left_up','no','recurrence-events'
  '50-59','ge40','15-19','0-2','no','1','right','central','no','no-recurrence-events'
  '50-59','ge40','35-39','0-2','no','2','left','left_low','no','recurrence-events'
  ……
  % source: http://archive.ics.uci.edu/ml/datasets/Breast+Cancer
  % NOTE: the single quotes are optional here; ARFF only requires quoting values that contain spaces or special characters.

  8. Interpreting the Output: The Confusion Matrix
  The confusion matrix shows how many instances of each class value were classified into each classification category.
  The confusion matrix:
     a   b   <-- classified as
    56   8 |  a = no-recurrence-events
    23  10 |  b = recurrence-events
  • 56 no-recurrence events (a) were correctly classified as (a)
  • 8 no-recurrence events (a) were incorrectly classified as (b)
  • 23 recurrence events (b) were incorrectly classified as (a)
  • 10 recurrence events (b) were correctly classified as (b)
  • Items on the main diagonal are correct classifications.
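The headline accuracy follows directly from this matrix; a quick check (the numbers match the percentage-split result shown later on the results slide):

    correct   = 56 + 10 = 66
    incorrect =  8 + 23 = 31
    total     = 66 + 31 = 97
    accuracy  = 66 / 97 ≈ 0.68, i.e. 68% correct and 32% incorrect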

  9. Interpreting the Output
  Text representation of a tree:

  J48 pruned tree
  ------------------
  node-caps = yes
  |   deg-malig = 1: recurrence-events (1.01/0.4)
  |   deg-malig = 2: no-recurrence-events (26.2/8.0)
  |   deg-malig = 3: recurrence-events (30.4/7.4)
  node-caps = no: no-recurrence-events (228.39/53.4)

  Number of Leaves  : 4
  Size of the tree  : 6
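The numbers in parentheses are the count of training instances reaching each leaf (possibly fractional because of missing values) and, after the slash, how many of those are misclassified. The same listing can also be produced outside the Explorer; a minimal sketch using the Java API, with an illustrative file name:

    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class PrintTree {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("breast_cancer.arff");   // illustrative file name
            data.setClassIndex(data.numAttributes() - 1);

            J48 tree = new J48();        // pruned C4.5 decision tree, default options
            tree.buildClassifier(data);

            // toString() yields the "J48 pruned tree" listing shown above,
            // including the number of leaves and the size of the tree.
            System.out.println(tree);
        }
    }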

  10. WEKA Explorer
  • Click Explorer on the WEKA GUI Chooser.
  • In the Explorer window, click "Open file..." to open a data file, e.g., the Breast Cancer data: breast_cancer.arff.
  • Or (if you don't have this data set) use one from the data folder provided with the WEKA package, e.g., iris.arff or weather.nominal.arff.

  11. WEKA Explorer: Open Data File
  Open the Breast Cancer data. Click an attribute, e.g., age, and its distribution will be displayed as a histogram.

  12. WEKA Explorer: Classifiers
  • After loading a data file, click the Classify tab.
  • Choose a classifier under Classifier:
    • Click the Choose button.
    • From the drop-down menu, open the trees folder.
    • Select J48, a decision tree algorithm.
  • Choose a test option:
    • Select the Percentage split radio button.
    • Use the default ratio: 66% for training, 34% for testing.
  • Click the Start button to train and test the classifier.
  • The training and testing information will be displayed in the Classifier output window.
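The same percentage-split run can be scripted. A rough equivalent of the Explorer steps above; the file name and the seed used for shuffling are assumptions, so the exact counts may differ from the GUI:

    import java.util.Random;

    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class PercentageSplit {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("breast_cancer.arff");   // illustrative file name
            data.setClassIndex(data.numAttributes() - 1);

            // Shuffle, then take the first 66% for training and the rest for testing.
            data.randomize(new Random(1));
            int trainSize = (int) Math.round(data.numInstances() * 0.66);
            Instances train = new Instances(data, 0, trainSize);
            Instances test  = new Instances(data, trainSize, data.numInstances() - trainSize);

            J48 tree = new J48();
            tree.buildClassifier(train);

            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(tree, test);
            System.out.println(eval.toSummaryString());
            System.out.println(eval.toMatrixString());
        }
    }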

  13. WEKA Explorer: Results
  97 cases were used in the test. Correct: 66 (68%); incorrect: 31 (32%).

  14. Result and Model Options
  Point to an entry in the result list window and right-click (or Option-click). A menu will display the options available for that model.

  15. Choose Visualize Tree

  16. View Classifier Errors
  • Correctly predicted cases
  • Incorrectly predicted cases

  17. Save the Model and Results
  Right-click (or Option-click) on a result in the result list. Choose Save model and Save result buffer to save the classifier and the results.
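Saving and reloading can also be done from code with weka.core.SerializationHelper; a minimal sketch, with placeholder file names:

    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.SerializationHelper;
    import weka.core.converters.ConverterUtils.DataSource;

    public class SaveAndLoadModel {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("breast_cancer.arff");   // illustrative file name
            data.setClassIndex(data.numAttributes() - 1);

            J48 tree = new J48();
            tree.buildClassifier(data);

            // Write the trained classifier to disk via Java serialization...
            SerializationHelper.write("j48_breast_cancer.model", tree);

            // ...and read it back later for reuse.
            J48 restored = (J48) SerializationHelper.read("j48_breast_cancer.model");
            System.out.println(restored);
        }
    }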

  18. Summary
  WEKA is open-source data mining software that offers:
  • GUI interfaces: Explorer, Experimenter, Knowledge Flow
  • Functions and tools:
    • Methods for classification: decision trees, rule learners, naive Bayes, etc.
    • Methods for regression/prediction: linear regression, model tree generators, etc.
    • Methods for clustering
    • Methods for feature selection
    • And more...
