
Experimental Data Processing (실험데이터처리): Data Mining (Ch. 9 - Ch. 10) Using Weka


Presentation Transcript


  1. Experimental Data Processing: Data Mining (Ch. 9 - Ch. 10) Using Weka. Major: Interdisciplinary program of integrated biotechnology, Graduate school of bio- & information technology. Youngil Lim (N110), Lab. FACS. Phone: +82 31 670 5200 (secretary), +82 31 670 5207 (direct). Fax: +82 31 670 5445, mobile phone: +82 10 7665 5207. Email: limyi@hknu.ac.kr, homepage: http://facs.maru.net

  2. Overview of this lecture — input (Ch. 2) → data mining → output (Ch. 3): information (data, databases) is mined to extract useful information — relationships, models, structural patterns — yielding knowledge (understanding, application, prediction); the technical tools are machine learning. • Machine learning = acquisition of structural descriptions, automatically or semi-automatically (similar to how the brain develops through repeated experience). • Weka is written in Java (an object-oriented programming language). Java is OS-independent, and its calculations run 2-3 times slower than those of C, C++, and Fortran. • The Java virtual machine translates the compiled byte-code into machine code.

  3. Outline of this lecture. Part I. Machine learning tools and techniques — Level 1: Ch 1. Applications, common problems; Ch 2. Input: concepts, instances, and attributes; Ch 3. Output: knowledge representation. Level 2: Ch 4. Numerical algorithms, the basic methods. Level 3: Ch 5-6 (advanced topics). Part II. Weka manual (ftp://facs/lim/lecture_related/weka3.4.exe) — Level 1: Ch 9. Introduction to Weka; Ch 10. The Explorer. Level 2: Ch 11-15 (advanced options in Weka). However, you will need to read those chapters to write a paper on data mining.

  4. Ch. 9. Introduction to Weka. Introduction: no single machine learning (ML) scheme is appropriate for all data mining (DM) problems; DM is an experimental science. Weka is a collection of state-of-the-art ML algorithms and includes 1) input data preparation (ARFF), 2) evaluation of various learning algorithms, 3) input data visualization, and 4) visualization of ML results.

  5. 9.1 What’s in Weka — the Weka workbench includes methods for DM: 1) regression (numerical prediction), 2) classification, 3) clustering, 4) association rules, 5) attribute selection.

  6. 9.2 How do you use it? Terminology and components — classifier: a learning method (or algorithm); object editor: adjustment of the classifier's tunable parameters; filter: a tool for data preparation (filtering algorithm). The 4 graphical user interfaces (GUIs) of Weka: 1) Explorer (for small/medium data sizes): the main GUI → Ch. 10; 2) Knowledge Flow (for large data sets): design of configurations for streamed data processing and incremental learning → Ch. 11; 3) Experimenter: automatic running of classifiers and filters with different parameter settings, parallel computing → Ch. 12; 4) command-line interface (CLI) in Java → Ch. 13.

  7. 9.2 How do you use it? Weka GUI chooser — the start-up window offers the same four interfaces listed above (Explorer, Knowledge Flow, Experimenter, and the CLI); a command-line sketch follows.
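As a taste of interface 4), a classifier can be run straight from the shell; a minimal sketch, assuming weka.jar is on the classpath and the dataset path is illustrative (-t names the training file; Weka prints the model and evaluation statistics):

    java -cp weka.jar weka.classifiers.trees.J48 -t data/weather.arff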

  8. Ch. 10. The Explorer. This lecture does not cover Sections 10.6 and 10.7. Outline: 10.1 Getting started 10.2 Exploring the explorer 10.3 Filtering algorithms 10.4 Learning algorithms 10.5 Meta-learning algorithms 10.6 Clustering algorithms 10.7 Association-rule learners 10.8 Attribute selection

  9. Ch. 10. The Explorer. Procedure for using the Explorer — build a decision tree from the data: • Prepare the data (comma-separated value format) • Fire up Weka • Load the data • Select a decision tree construction method • Build a tree • Interpret the output (a Java-API version of this procedure is sketched below).
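The same procedure can also be scripted against the Weka Java API instead of the GUI; a minimal, self-contained sketch (the file name is an assumption; J48 is Weka's implementation of the C4.5 decision-tree learner):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;

    public class BuildTree {
        public static void main(String[] args) throws Exception {
            // 1-3: prepare and load the data
            Instances data = new Instances(new BufferedReader(new FileReader("weather.arff")));
            // the last attribute is the class (dependent variable)
            data.setClassIndex(data.numAttributes() - 1);
            // 4-5: select a tree learner and build the tree
            J48 tree = new J48();
            tree.buildClassifier(data);
            // 6: print the tree for interpretation
            System.out.println(tree);
        }
    }

The later code fragments in this transcript assume data has been loaded and its class index set this way.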

  10. 10.1 Getting started (1) Preparing the data — ARFF (attribute-relation file format) by default; its @data section holds comma-separated values. Tags: 1) @relation (data title), 2) @attribute (variables), 3) @data (instances).

  11. 10.1 Getting started (1) Preparing the data. Open <weather.arff> using MS Word or another text editor.
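For reference, an excerpt in the style of the weather.arff file shipped with Weka (the data rows are illustrative):

    @relation weather
    @attribute outlook {sunny, overcast, rainy}
    @attribute temperature numeric
    @attribute humidity numeric
    @attribute windy {TRUE, FALSE}
    @attribute play {yes, no}
    @data
    sunny,85,85,FALSE,no
    overcast,83,86,FALSE,yes
    rainy,70,96,FALSE,yes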

  12. 10.1.1 Loading the data into the Explorer

  13. 10.1.1 Loading the data into the Explorer

  14. 10.1.2 Building a decision tree — class attribute (dependent variable)

  15. 10.1.3 Test options The result of applying the chosen classifier will be tested according to the options that are set by clicking in the Test options box. There are four test modes: 1) Use training set. The classifier is evaluated on how well it predicts the class of the instances it was trained on. 2) Supplied test set. The classifier is evaluated on how well it predicts the class of a set of instances loaded from a file. Clicking the Set... button brings up a dialog allowing you to choose the file to test on. 3) Cross-validation. The classifier is evaluated by cross-validation, using the number of folds that are entered in the Folds text field. 4) Percentage split. The classifier is evaluated on how well it predicts a certain percentage of the data which is held out for testing. The amount of data held out depends on the value entered in the % field.
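As an illustration, mode 4) (percentage split) can be reproduced in the Java API by partitioning the Instances object manually; a sketch continuing from the loading code after slide 9 (the 66%/34% split is an assumption; weka.classifiers.Evaluation supplies the statistics):

    // shuffle first so the split is not biased by the file order
    data.randomize(new java.util.Random(1));
    // 66% training / 34% testing
    int trainSize = (int) Math.round(data.numInstances() * 0.66);
    Instances train = new Instances(data, 0, trainSize);
    Instances test  = new Instances(data, trainSize, data.numInstances() - trainSize);
    J48 tree = new J48();
    tree.buildClassifier(train);
    Evaluation eval = new Evaluation(train);  // weka.classifiers.Evaluation
    eval.evaluateModel(tree, test);
    System.out.println(eval.toSummaryString());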

  16. 10.1.4 Cross-validation — random sampling is required! (stratification). The data is split into a training dataset (9/10 of the data) and a validation dataset (testing, 1/10 of the data); new data forms the prediction dataset. Cross-validation (repeated holdout): 1) fold: the number of partitions of the data; 2) 10-fold cross-validation is generally used for a single, fixed dataset; 3) divide the data randomly into 10 parts; 4) use 9 parts for training and 1 part for testing; 5) measure the error rate; 6) repeat 10 times, each time with a different training set; 7) the overall error is the average of the 10 error rates. Witten & Frank (2005), Data Mining, pp. 149-151.
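The same 10-fold procedure is a single call in the Java API; a minimal sketch continuing from the loading code (the seed is an assumption; crossValidateModel performs the stratified splitting, training, and averaging internally):

    // 10-fold cross-validation of J48 with a fixed seed for reproducible folds
    Evaluation eval = new Evaluation(data);               // weka.classifiers.Evaluation
    eval.crossValidateModel(new J48(), data, 10, new java.util.Random(1));
    System.out.println(eval.toSummaryString());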

  17. 10.1.5 Examining the output

  18. 10.1.6 Doing it again

  19. 10.1.7 Working with models. Exercise: analyze the iris dataset! — Load the iris data into Weka — Find the classification rule — Visualize the decision tree — Visualize the threshold curve

  20. 10.1.8 When things go wrong — what’s going on? (e.g., not enough memory available). To see the error message, click the Log button.

  21. 10.2 Exploring the Explorer — we have six tabs: 1) preprocess: choose the dataset and modify it; 2) classify: train learning schemes that perform classification or regression, and evaluate them; 3) cluster: learn clusters for the dataset; 4) associate: learn association rules for the data and evaluate them; 5) select attributes: select the most relevant attributes in the dataset; 6) visualize: view different two-dimensional plots of the data and interact with them.

  22. 10.2 Exploring the Explorer — the bird (the Weka) dances while Weka is active and sits when Weka is idle; the number beside it shows how many concurrent processes are running. If the bird is standing but stops moving, it is sick: something has gone wrong, and the Explorer should be restarted.

  23. 10.2.1 Converting files to ARFF — 3 file converters to ARFF: 1) .csv (comma-separated values) → .arff; 2) .names and .data (C4.5’s native file format) → .arff; 3) .bsi (binary serialized instances) → .arff
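For example, converter 1) can be driven from the command line; a sketch, assuming weka.jar is on the classpath and the file names are illustrative:

    java -cp weka.jar weka.core.converters.CSVLoader data.csv > data.arff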

  24. 10.2.2 Using filters Unsupervised attribute filter

  25. 10.2.3 Training and testing learning schemes Open file: cpu.with.vendor.arff

  26. 10.2.3 Training and testing learning schemes Classifiers>trees>M5P

  27. 10.2.3 Training and testing learning schemes How many leaves? How many nodes?

  28. 10.2.3 Training and testing learning schemes. Classifiers>functions>LinearRegression — it gives a single linear regression model rather than the two linear models produced by trees>M5P.

  29. 10.2.4 Visualizing errors — which is better, M5P or linear regression?
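Beyond the visual comparison, both schemes can be cross-validated on the same folds and compared numerically; a sketch, assuming the cpu data is loaded as before (correlation coefficient and mean absolute error are the usual yardsticks for numeric prediction):

    // requires weka.classifiers.trees.M5P and weka.classifiers.functions.LinearRegression
    Evaluation evalM5 = new Evaluation(data);
    evalM5.crossValidateModel(new M5P(), data, 10, new java.util.Random(1));
    Evaluation evalLR = new Evaluation(data);
    evalLR.crossValidateModel(new LinearRegression(), data, 10, new java.util.Random(1));
    System.out.println("M5P: corr = " + evalM5.correlationCoefficient()
        + ", MAE = " + evalM5.meanAbsoluteError());
    System.out.println("LR:  corr = " + evalLR.correlationCoefficient()
        + ", MAE = " + evalLR.meanAbsoluteError());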

  30. 10.2.5 Do it yourself: the user classifier — open data>segment-challenge.arff and segment the visual image data into classes (grass, sky, cement, …). Classifiers>Trees>UserClassifier; test set: data>segment-test.arff

  31. 10.2.5 Do it yourself: the user classifier — change the X and Y axes! The goal is to find a combination that separates the classes as clearly as possible.

  32. 10.2.5 Do it yourself: the user classifier — specify a region in the graph: 1. Select instance, 2. Rectangle, 3. Polygon, 4. Polyline. Clear: clear the selection; Save: save the instances in the current tree node as an ARFF file.

  33. 10.2.5 Do it yourself: the user classifier — accept the tree (right-click on any blank space). Building trees manually is very tedious. Correctly classified instances: 40%; incorrectly classified: 60%.

  34. 10.2.6 Using a metalearner — a metalearner takes simple learners and turns them into more powerful ones; here, adaptive boosting (AdaBoostM1) boosts decision stumps up to 10 times.
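The slide's configuration corresponds to the following Java API sketch (AdaBoostM1 with DecisionStump as the weak learner and 10 boosting iterations; the data is assumed loaded as before):

    // requires weka.classifiers.meta.AdaBoostM1 and weka.classifiers.trees.DecisionStump
    AdaBoostM1 boost = new AdaBoostM1();
    boost.setClassifier(new DecisionStump()); // the weak base learner
    boost.setNumIterations(10);               // boost up to 10 times
    boost.buildClassifier(data);              // data loaded and class index set as before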

  35. 10.2.6 Using a metalearner (continued)

  36. 10.2.7 Clustering and association rules. We skip clustering and association rules, so Sections 10.6 and 10.7 are also skipped; in Ch. 4, Sections 4.5 and 4.8 are skipped as well.

  37. 10.2.8 Attribute selection. We will learn more in Section 10.8.

  38. 10.2.9 Visualization of data — 2D scatter plots of every pair of attributes

  39. 10.3 Filtering algorithms — filtering of data (= attributes + instances). All filters transform the input dataset in some way. Two kinds of filter: 1) supervised (Section 7.2): to be used carefully; 2) unsupervised. Each kind is further divided into attribute filters and instance filters: 1) an attribute filter works on the attributes of the data; 2) an instance filter works on the instances of the data. See Section 7.3 for 1) PCA (principal component analysis) and 2) random projections.
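All filters share one calling pattern in the Java API; a minimal sketch with the unsupervised Normalize attribute filter (data loaded as before):

    // requires weka.filters.Filter and weka.filters.unsupervised.attribute.Normalize
    Normalize norm = new Normalize();
    norm.setInputFormat(data);                           // pass the dataset structure to the filter
    Instances normalized = Filter.useFilter(data, norm); // numeric values now lie in [0, 1]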

  40. 10.3.1 Adding and removing attributes — Add: insert a new attribute whose values are all empty. Copy: copy existing attributes and their values. Remove: the same as the <Remove> button. RemoveType: remove all attributes of a given type, such as nominal, numeric, string, or date. AddCluster: apply a clustering algorithm to the data before filtering it (see Section 10.6). AddExpression: create a new attribute by applying a mathematical function to numeric attributes, e.g., a1^2*a5/log(4*a7). NumericTransform: perform an arbitrary transformation by applying a given Java function to selected numeric attributes. Normalize: scale all numeric values to lie between 0 and 1. Standardize: transform all numeric values to have zero mean and unit variance.

  41. 10.3.2 Changing values — SwapValues: swap the positions of two values of a nominal attribute (this does not affect learning at all). MergeTwoValues: merge two values of a nominal attribute into a single category. ReplaceMissingValues: replace each missing value with the mean for numeric attributes and the mode for nominal ones; if a class is set, missing values of that attribute are not replaced.

  42. 10.3.3 Conversions — Discretize: convert numeric attributes to nominal ones (Section 7.2), using 1) equal-width binning or 2) equal-frequency binning. PKIDiscretize: discretize numeric attributes using equal-frequency binning, where the number of bins is the square root of the number of values; e.g., 83 instances without missing values are binned into 9 bins. MakeIndicator: convert a nominal attribute into a binary indicator attribute; this is necessary when a ML scheme requires numeric attributes. NominalToBinary: transform all multivalued nominal attributes in a dataset into binary ones (a k-valued attribute becomes k binary attributes). NumericToBinary: convert all numeric attributes into binary ones (if the numeric value is 0, the binary value is 0; otherwise, the binary value is 1). FirstOrder: take differences between consecutive attribute values.
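A sketch of the Discretize filter (the bin count is an assumption; equal-width binning is the default, and one setter switches to equal-frequency):

    // requires weka.filters.unsupervised.attribute.Discretize
    Discretize disc = new Discretize();
    disc.setBins(10);                  // number of bins (equal-width by default)
    // disc.setUseEqualFrequency(true); // uncomment for equal-frequency binning
    disc.setInputFormat(data);
    Instances nominalData = Filter.useFilter(data, disc);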

  43. 10.3.4 String conversions — StringToNominal: convert a string attribute to a nominal one. StringToWordVector: convert string attributes into a set of numeric attributes representing word occurrences.

  44. 10.3.5 Time series For time-series data, - TimeSeriesTranslate: replace attribute values in the current instance with the equivalent attribute values of some previous (or future) instance. - TimeSeriesDelta: replace attribute values in the current instance with the difference between the current value and the equivalent attribute value of some previous (or future) instance.

  45. 10.3.6 Randomizing — these filters change the values of the data. AddNoise: introduce noise into the data; it takes a nominal attribute and changes a given percentage of its values to other values. Obfuscate: rename the attributes and anonymize the data. RandomProjection: see Section 7.3.

  46. 10.3.7 Unsupervised instance filters — attribute filters affect all values of an attribute (a column of the data); instance filters affect all values of an instance (a row of the data).

  47. 10.3.8 Randomizing and subsampling — Randomize: the order of the instances is randomized. Normalize: all numeric attributes are treated as a vector and normalized to a given length. Resample: produce a random sample by sampling with replacement. RemoveFolds: first split the data into a given number of cross-validation folds, then reduce the data to just one of them; if a random number seed is provided, the dataset is shuffled before the subset is extracted. RemovePercentage: remove a given percentage of the instances. RemoveRange: remove a certain range of instance numbers. RemoveWithValues: remove all instances that have certain values, above or below a certain threshold.
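For instance, Randomize followed by RemovePercentage gives a shuffled, reduced dataset; a sketch (the seed and percentage are assumptions):

    // requires weka.filters.unsupervised.instance.Randomize and RemovePercentage
    Randomize rand = new Randomize();
    rand.setRandomSeed(42);
    rand.setInputFormat(data);
    Instances shuffled = Filter.useFilter(data, rand);

    RemovePercentage rp = new RemovePercentage();
    rp.setPercentage(10.0);            // remove 10% of the instances
    rp.setInputFormat(shuffled);
    Instances reduced = Filter.useFilter(shuffled, rp);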

  48. 10.3.9 Sparse instances — NonSparseToSparse: convert instances to the sparse representation. SparseToNonSparse: convert sparse instances back to the standard (non-sparse) representation.

  49. 10.3.10 Supervised filters — supervised filters are affected by the class attribute. There are two categories of supervised filters: 1) attribute and 2) instance. You need to be careful with them because they are not really preprocessing operations. Discretize: see Section 7.2. NominalToBinary: see Section 6.5. ClassOrder: change the ordering of the class values. Resample: like the unsupervised instance filter, except that it maintains the class distribution in the subsample. …..
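A sketch of the supervised Resample, which preserves the class distribution (the class index must be set before the filter is applied; the sample size is an assumption):

    // requires weka.filters.supervised.instance.Resample
    data.setClassIndex(data.numAttributes() - 1); // supervised filters need the class attribute
    Resample res = new Resample();
    res.setSampleSizePercent(50.0);    // stratified 50% subsample
    res.setInputFormat(data);
    Instances sample = Filter.useFilter(data, res);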

  50. 10.4 Learning algorithms — we have 7 categories of classifiers: 1) Bayesian: document classification (e.g., Google search); 2) Trees: decision trees, divide-and-conquer (stump, node, leaf, model tree); 3) Rules: the covering approach (excluding instances); a decision tree can be converted to a set of logical expressions; 4) Functions: linear and nonlinear models; 5) Lazy (instance-based learning): distance functions; 6) Metalearning algorithms: combine learners into more powerful ones; 7) Miscellaneous: diverse other methods. Ch. 4 and Ch. 6 cover these algorithms.
