
Data Mining Algorithms


Presentation Transcript


  1. Data Mining Algorithms Prof. S. Sudarshan CSE Dept, IIT Bombay Most Slides Courtesy Prof. Sunita Sarawagi School of IT, IIT Bombay

  2. Overview • Decision Tree classification algorithms • Clustering algorithms • Challenges • Resources

  3. Decision Tree Classifiers

  4. Decision tree classifiers • Widely used learning method • Easy to interpret: can be re-represented as if-then-else rules • Approximates functions by piecewise constant regions • Does not require any prior knowledge of the data distribution; works well on noisy data • Has been applied to: • classifying medical patients by disease, • equipment malfunctions by cause, • loan applicants by likelihood of payment

  5. Setting • Given old data about customers and payments, predict a new applicant's loan eligibility. • [Figure: previous customers (Age, Salary, Profession, Location, Customer type) feed a classifier that produces decision rules such as "Salary > 5 L" and "Prof. = Exec", which label a new applicant's data as good or bad]

  6. Decision trees • Tree where internal nodes are simple decision rules on one or more attributes and leaf nodes are predicted class labels. • [Figure: example tree testing "Salary < 1 M" at the root, then "Prof = teaching" and "Age < 30", with Good/Bad leaves]

  7. Topics to be covered • Tree construction: • Basic tree learning algorithm • Measures of predictive ability • High performance decision tree construction: Sprint • Tree pruning: • Why prune • Methods of pruning • Other issues: • Handling missing data • Continuous class labels • Effect of training size

  8. Tree learning algorithms • ID3 (Quinlan 1986) • Successor C4.5 (Quinlan 1993) • SLIQ (Mehta et al, 1996) • SPRINT (Shafer et al, 1996)

  9. Basic algorithm for tree building • Greedy top-down construction: Gen_Tree(node, data) • If the stopping criterion is met, make the node a leaf and stop • Otherwise, using the selection criteria, find the best attribute and the best split on that attribute • Partition the data on the split condition • For each child j of the node: Gen_Tree(node_j, data_j)

  10. Split criteria • Select the attribute that is best for classification. • Intuitively, pick the one that best separates instances of different classes. • Quantifying the intuition means measuring separability: • First define the impurity of an arbitrary set S consisting of K classes

  11. Impurity Measures • Information entropy: Entropy(S) = - sum_i p_i log p_i, where p_i is the fraction of S belonging to class i • Zero when S consists of only one class; one (with base-2 logs and K = 2) when all classes occur in equal number • Other measures of impurity: Gini index: Gini(S) = 1 - sum_i p_i^2
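A minimal Python sketch of the two impurity measures above (an illustration, not part of the original slides); the function names and the toy counts are assumptions:

```python
import math

def entropy(class_counts):
    """Information entropy of a set, given its per-class counts."""
    total = sum(class_counts)
    probs = [c / total for c in class_counts if c > 0]
    return sum(p * math.log2(1 / p) for p in probs)

def gini(class_counts):
    """Gini impurity: 1 minus the sum of squared class fractions."""
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

# A pure set has zero impurity; a 50/50 two-class set maximizes it.
print(entropy([10, 0]), entropy([5, 5]))  # 0.0, 1.0
print(gini([10, 0]), gini([5, 5]))        # 0.0, 0.5
```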

  12. Split criteria • K classes; a set S of instances is partitioned into r subsets S_1 ... S_r. Subset S_j has fraction p_ij of its instances in class i. • Information entropy: sum_j (|S_j| / |S|) x (- sum_i p_ij log p_ij) • Gini index: sum_j (|S_j| / |S|) x (1 - sum_i p_ij^2) • [Figure: entropy and Gini impurity plotted against the class fraction for r = 1, K = 2]

  13. Information gain • Information gain on partitioning S into r subsets: Gain = Impurity(S) - sum_j (|S_j| / |S|) x Impurity(S_j) • i.e. the impurity of S minus the weighted impurity of each subset

  14. Information gain: example • K = 2, |S| = 100, p1 = 0.6, p2 = 0.4 • E(S) = -0.6 log(0.6) - 0.4 log(0.4) = 0.29 • Split S into S1 and S2: • |S1| = 70, p1 = 0.8, p2 = 0.2, E(S1) = -0.8 log(0.8) - 0.2 log(0.2) = 0.21 • |S2| = 30, p1 = 0.13, p2 = 0.87, E(S2) = -0.13 log(0.13) - 0.87 log(0.87) = 0.16 • Information gain: E(S) - (0.7 E(S1) + 0.3 E(S2)) = 0.1
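The figures above are consistent with base-10 logarithms; the short check below (an addition, not from the slides) reproduces them up to rounding:

```python
import math

def entropy10(probs):
    """Entropy with base-10 logs, matching the figures on the slide."""
    return -sum(p * math.log10(p) for p in probs if p > 0)

e_s  = entropy10([0.6, 0.4])            # ~0.29
e_s1 = entropy10([0.8, 0.2])            # ~0.217 (0.21 on the slide)
e_s2 = entropy10([0.13, 0.87])          # ~0.168 (0.16 on the slide)
gain = e_s - (0.7 * e_s1 + 0.3 * e_s2)  # ~0.09, i.e. roughly 0.1
print(round(e_s, 2), round(e_s1, 2), round(e_s2, 2), round(gain, 2))
```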

  15. Meta learning methods • No single classifier is good in all cases • Difficult to evaluate in advance which classifier will work under which conditions • Meta learning: combine the outputs of several classifiers • Voting: sum up the votes of the component classifiers • Combiners: learn a new classifier on the outcomes of the previous ones • Boosting: staged classifiers • Disadvantage: interpretation is hard • Knowledge probing: learn a single classifier that mimics the meta classifier

  16. SPRINT (Serial PaRallelizable INduction of decision Trees) • Decision-tree classifier for data mining • Design goals: • Able to handle large disk-resident training sets • No restrictions on training-set size • Easily parallelizable

  17. Example • [Figure: example data (Age, CarType, class) and the resulting tree: if Age < 25 then High; otherwise if CarType in {sports} then High, else Low]

  18. Building tree • GrowTree(TrainingData D) • Partition(D); • Partition(Data D) • if(all points in D belong to the same class) then • return; • for each attribute A do • evaluate splits on attribute A; • use best split found to partition D into D1 and D2; • Partition(D1); • Partition(D2);
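A minimal Python sketch of the GrowTree/Partition recursion above, restricted to numeric attributes and binary splits; the helper names (`gini`, `best_split`) and the toy data are assumptions, not part of the original slides:

```python
from collections import Counter

def gini(labels):
    """Gini index of a list of class labels: 1 - sum_j p_j^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Exhaustive search over (attribute, threshold) pairs, minimizing weighted Gini."""
    n, best = len(labels), (None, None, gini(labels))
    for a in range(len(rows[0])):
        for v in sorted({r[a] for r in rows}):
            left = [labels[i] for i, r in enumerate(rows) if r[a] < v]
            right = [labels[i] for i, r in enumerate(rows) if r[a] >= v]
            if not left or not right:
                continue
            score = len(left) / n * gini(left) + len(right) / n * gini(right)
            if score < best[2]:
                best = (a, v, score)
    return best[0], best[1]

def grow_tree(rows, labels):
    """Partition(D) from the slide: recurse until the node is pure or unsplittable."""
    if len(set(labels)) == 1:
        return {"leaf": labels[0]}
    attr, thr = best_split(rows, labels)
    if attr is None:  # no split improves purity: majority-class leaf
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    L = [i for i, r in enumerate(rows) if r[attr] < thr]
    R = [i for i, r in enumerate(rows) if r[attr] >= thr]
    return {"split": (attr, thr),
            "left": grow_tree([rows[i] for i in L], [labels[i] for i in L]),
            "right": grow_tree([rows[i] for i in R], [labels[i] for i in R])}

# Toy data: rows are [age, salary]; labels echo the good/bad loan example earlier in the deck.
rows = [[23, 30], [30, 60], [40, 90], [45, 20]]
labels = ["bad", "good", "good", "bad"]
print(grow_tree(rows, labels))  # splits on salary, giving pure "bad"/"good" leaves
```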

  19. Data Setup: Attribute Lists • One list for each attribute • Entries in an attribute list consist of: • attribute value • class value • record id • Lists for continuous attributes are kept in sorted order • Lists may be disk-resident • Each leaf node has its own set of attribute lists representing the training examples belonging to that leaf

  20. Attribute Lists: Example • [Table: initial attribute lists for the root node]

  21. Evaluating Split Points • Gini Index • If data D contains examples from c classes: Gini(D) = 1 - sum_j p_j^2, where p_j is the relative frequency of class j in D • If D is split into D1 and D2 with n1 and n2 tuples respectively: Gini_split(D) = (n1 / n) Gini(D1) + (n2 / n) Gini(D2) • Note: only class frequencies are needed to compute the index

  22. Finding Split Points • For each attribute A do • evaluate splits on attribute A using attribute list • Keep split with lowest GINI index

  23. Finding Split Points: Continuous Attrib. • Consider splits of form: value(A) < x • Example: Age < 17 • Evaluate this split-form for every value in an attribute list • To evaluate splits on attribute A for a given tree-node: Initialize class-histogram of left child to zeroes; Initialize class-histogram of right child to same as its parent; for each record in the attribute list do evaluate splitting index for value(A) < record.value; using class label of the record, update class histograms;
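A hedged sketch of the one-pass histogram scan described above; it assumes the attribute list is already sorted by value, and the toy ages are reconstructed from the Age/CarType example used elsewhere in the deck:

```python
from collections import Counter

def best_continuous_split(attr_list):
    """attr_list: (value, class_label) pairs sorted by value.
    Scan once, moving counts from the right histogram to the left,
    evaluating Gini_split for every candidate split value(A) < x."""
    n = len(attr_list)
    left = Counter()                                  # left child starts empty
    right = Counter(lbl for _, lbl in attr_list)      # right child starts as the parent
    def gini(hist, m):
        return 1.0 - sum((c / m) ** 2 for c in hist.values()) if m else 0.0
    best_val, best_gini, prev = None, float("inf"), None
    for value, label in attr_list:
        if prev is not None and value != prev:
            nl, nr = sum(left.values()), sum(right.values())
            g = nl / n * gini(left, nl) + nr / n * gini(right, nr)
            if g < best_gini:
                best_val, best_gini = value, g
        left[label] += 1
        right[label] -= 1
        prev = value
    return best_val, best_gini

ages = [(17, "High"), (20, "High"), (23, "High"), (32, "Low"), (43, "High"), (68, "Low")]
print(best_continuous_split(ages))  # Age < 32 with Gini_split ~ 0.222, cf. the next slide
```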

  24. Finding Split Points: Continuous Attrib. • [Figure: cursor scanning the sorted attribute list while the left/right class histograms are updated] • Cursor at position 0: Age < 17, GINI = undef • Cursor at position 1: Age < 20, GINI = 0.4 • Cursor at position 3: Age < 32, GINI = 0.222 • Cursor at position 6: GINI = undef

  25. Finding Split Points: Categorical Attrib. • Consider splits of the form: value(A) in {x1, x2, ..., xn} • Example: CarType in {family, sports} • Evaluate this split-form for subsets of domain(A) • To evaluate splits on attribute A for a given tree node: initialize class/value matrix of node to zeroes; for each record in the attribute list do increment appropriate count in matrix; evaluate splitting index for various subsets using the constructed matrix;
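A sketch of the class/value matrix evaluation above, enumerating subsets exhaustively (the greedy shortcut for large domains mentioned later in the deck is omitted); the toy CarType records are assumptions modeled on the deck's example:

```python
from collections import defaultdict
from itertools import combinations

def best_categorical_split(attr_list):
    """attr_list: (value, class_label) pairs. Build the class/value count
    matrix in one pass, then score subsets of the domain with Gini_split."""
    matrix = defaultdict(lambda: defaultdict(int))    # value -> class -> count
    for value, label in attr_list:
        matrix[value][label] += 1
    values, n = list(matrix), len(attr_list)
    def gini(counts):
        m = sum(counts.values())
        return 1.0 - sum((c / m) ** 2 for c in counts.values()) if m else 0.0
    best_subset, best_g = None, float("inf")
    for k in range(1, len(values)):                   # proper, non-empty subsets
        for subset in combinations(values, k):
            left, right = defaultdict(int), defaultdict(int)
            for v in values:
                side = left if v in subset else right
                for cls, c in matrix[v].items():
                    side[cls] += c
            nl, nr = sum(left.values()), sum(right.values())
            g = nl / n * gini(left) + nr / n * gini(right)
            if g < best_g:
                best_subset, best_g = set(subset), g
    return best_subset, best_g

cars = [("family", "High"), ("sports", "High"), ("sports", "High"),
        ("family", "Low"), ("truck", "Low"), ("family", "High")]
print(best_categorical_split(cars))  # {'truck'} with Gini ~ 0.267, cf. the next slide
```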

  26. Finding Split Points: Categorical Attrib. • [Figure: attribute list and the class/value count matrix, with candidate left/right children] • GINI index of candidate splits: CarType in {family}: GINI = 0.444 • CarType in {sports}: GINI = 0.333 • CarType in {truck}: GINI = 0.267

  27. Performing the Splits • The attribute lists of every node must be divided among the two children • To split the attribute lists of a given node: for the list of the attribute used to split this node do use the split test to divide the records; collect the record ids; build a hashtable from the collected ids; for the remaining attribute lists do use the hashtable to divide each list; build class-histograms for each new leaf;
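A minimal sketch of the hash-table split above; the record ids and values are reconstructed from the deck's Age/CarType example and should be read as illustrative:

```python
def split_attribute_lists(lists, split_attr, predicate):
    """lists: {attr_name: [(value, class, rid), ...]}.
    Split the list of the splitting attribute, remember in a hash table which
    side each record id went to, then route every other list through it."""
    side = {}                                             # rid -> "L" or "R"
    left = {a: [] for a in lists}
    right = {a: [] for a in lists}
    for value, cls, rid in lists[split_attr]:
        side[rid] = "L" if predicate(value) else "R"
        (left if side[rid] == "L" else right)[split_attr].append((value, cls, rid))
    for attr, entries in lists.items():
        if attr == split_attr:
            continue
        for value, cls, rid in entries:
            (left if side[rid] == "L" else right)[attr].append((value, cls, rid))
    return left, right

lists = {"age": [(17, "High", 1), (20, "High", 5), (23, "High", 0),
                 (32, "Low", 4), (43, "High", 2), (68, "Low", 3)],
         "cartype": [("family", "High", 0), ("sports", "High", 1), ("sports", "High", 2),
                     ("family", "Low", 3), ("truck", "Low", 4), ("family", "High", 5)]}
left, right = split_attribute_lists(lists, "age", lambda v: v < 32)
print(sorted(rid for _, _, rid in left["cartype"]))  # [0, 1, 5], cf. the next slide
```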

  28. Performing the Splits: Example • Split test: Age < 32 • Resulting hash table of record ids: 0 → Left, 1 → Left, 2 → Right, 3 → Right, 4 → Right, 5 → Left

  29. Sprint: summary • Each node of the decision tree classifier requires examining possible splits on each value of each attribute. • After choosing a split attribute, all data must be partitioned into the resulting subsets. • Need to make this search efficient. • Evaluating splits on numeric attributes: • Sort on attribute value, incrementally evaluate gini • Splits on categorical attributes: • For each subset, find gini and choose the best • For large sets, use a greedy method

  30. Approaches to prevent overfitting • Stop growing the tree beyond a certain point • First over-fit, then post-prune (more widely used) • Tree building divided into phases: • Growth phase • Prune phase • Hard to decide when to stop growing the tree, so the second approach is more widely used.

  31. Criteria for finding the correct final tree size: • Cross validation with separate test data • Use all data for training but apply a statistical test to decide the right size • Use some criterion function to choose the best size • Example: Minimum description length (MDL) criterion • Cross validation approach: • Partition the dataset into two disjoint parts: • 1. Training set used for building the tree. • 2. Validation set used for pruning the tree. • Build the tree using the training set. • Evaluate the tree on the validation set and, at each leaf and internal node, keep a count of correctly labeled data. • Starting bottom-up, prune nodes whose error is lower than that of their children.

  32. Cross validation.. • Need a large validation set to smooth out over-fitting of the training data. Rule of thumb: one-third. • What if the training data set size is limited? • Generate many different partitions of the data. • n-fold cross validation: partition the training data into n parts D1, D2, ..., Dn. • Train n classifiers, using D - Di as training data and Di as the test set. • Average the results.
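A minimal sketch of the n-fold scheme above; `train` and `evaluate` are placeholder hooks supplied by the caller, not anything defined in the slides:

```python
def n_fold_cross_validation(data, n, train, evaluate):
    """Split data into n parts D1..Dn; train on D - Di, test on Di, average the scores."""
    folds = [data[i::n] for i in range(n)]            # n roughly equal partitions
    scores = []
    for i, test_fold in enumerate(folds):
        train_data = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = train(train_data)                     # placeholder: build the classifier
        scores.append(evaluate(model, test_fold))     # placeholder: score on held-out fold
    return sum(scores) / n

# Usage with trivial stand-in hooks:
data = list(range(12))
print(n_fold_cross_validation(data, 3, train=lambda d: None,
                              evaluate=lambda m, fold: len(fold)))  # 4.0
```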

  33. Rule-based pruning • Tree-based pruning limits the kind of pruning possible: if a node is pruned, all subtrees under it have to be pruned. • Rule-based: for each leaf of the tree, extract a rule using a conjunction of all tests up to the root. • On the validation set, independently prune tests from each rule to get the highest accuracy for that rule. • Sort rules by decreasing accuracy.

  34. MDL-based pruning • Idea: a branch of the tree is over-fitted if the training examples that fall under it can be explicitly enumerated (with their classes) in less space than that occupied by the tree • Prune a branch if it is over-fitted • Philosophy: use the tree that minimizes the description length of the training data

  35. Regression trees • Decision trees with continuous class labels: • Regression trees approximate the function with piecewise constant regions. • Split criteria for regression trees: • Predicted value for a set S = average of all values in S • Error: sum of the squared error of each member of S from the predicted average. • Pick the split with the smallest average error.

  36. Issues • Multiple splits on continuous attributes [Fayyad 93, Multi-interval discretization of continuous attributes] • Multi-attribute tests at nodes to handle correlated attributes • multivariate linear splits [Oblique trees, Murthy 94] • Methods of handling missing values: • assume the majority value • take the most probable path • Allowing varying costs for different attributes

  37. Pros and Cons of decision trees • Cons • Cannot handle complicated relationships between features • simple decision boundaries • problems with lots of missing data • Pros • Reasonable training time • Fast application • Easy to interpret • Easy to implement • Can handle a large number of features • More information: http://www.recursive-partitioning.com/

  38. Clustering or Unsupervised learning

  39. Distance functions • Numeric data: Euclidean, Manhattan distances • Minkowski metric: (sum_i |xi - yi|^m)^(1/m) • Larger m gives higher weight to larger distances • Categorical data: 0/1 to indicate presence/absence • Euclidean distance: equal weight to 1-1 and 0-0 matches • Hamming distance (# of dissimilarities) • Jaccard coefficient: # of 1-1 matches / # of positions with at least one 1 (0-0 matches not important) • Combined numeric and categorical data: weighted normalized distance
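Small Python versions of the distances listed above (an illustration, not from the slides):

```python
def minkowski(x, y, m=2):
    """Minkowski metric: (sum |xi - yi|^m)^(1/m); m=2 is Euclidean, m=1 Manhattan."""
    return sum(abs(a - b) ** m for a, b in zip(x, y)) ** (1.0 / m)

def hamming(x, y):
    """Number of positions at which two 0/1 vectors differ."""
    return sum(a != b for a, b in zip(x, y))

def jaccard(x, y):
    """Similarity of 0/1 vectors: 1-1 matches over positions with at least one 1."""
    both = sum(a and b for a, b in zip(x, y))
    either = sum(a or b for a, b in zip(x, y))
    return both / either if either else 1.0

a, b = [1, 0, 1, 1], [1, 1, 0, 1]
print(minkowski([1.0, 2.0], [4.0, 6.0]))  # 5.0 (Euclidean)
print(hamming(a, b), jaccard(a, b))       # 2 0.5
```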

  40. Distance functions on high dimensional data • Examples: time series, text, images • Euclidean measures make all points look equally far apart • Reduce the number of dimensions: • choose a subset of the original features using random projections or feature selection techniques • transform the original features using statistical methods like Principal Component Analysis • Define domain-specific similarity measures: e.g. for images define features like the number of objects or a color histogram; for time series define shape-based measures • Define non-distance-based (model-based) clustering methods

  41. Clustering methods • Hierarchical clustering • agglomerative vs. divisive • single link vs. complete link • Partitional clustering • distance-based: K-means • model-based: EM • density-based

  42. Partitional methods: K-means • Criterion: minimize the sum of squared distances • between each point and the centroid of its cluster, or • between each pair of points in the cluster • Algorithm: • Select an initial partition with K clusters: random, first K, K well-separated points • Repeat until stabilization: • Assign each point to the closest cluster center • Generate new cluster centers • Adjust clusters by merging/splitting
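A plain K-means sketch along the lines of the algorithm above (random initial centers rather than the other initialization options listed, and no merge/split adjustment); the toy points are illustrative:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """K-means: assign each point to its closest center, recompute centroids, repeat."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point goes to the closest cluster center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        # Update step: new centers are the centroids of the clusters.
        new_centers = [tuple(sum(v) / len(c) for v in zip(*c)) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:                    # stabilized
            break
        centers = new_centers
    return centers, clusters

points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
print(kmeans(points, 2)[0])  # two centers, one near (1, 1) and one near (8, 8)
```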

  43. Properties • May not reach global optima • Converges fast in practice: guaranteed for certain forms of optimization function • Complexity: O(KndI): • I number of iterations, n number of points, d number of dimensions, K number of clusters. • Database research on scalable algorithms: • Birch: one/two pass of data by keeping R-tree like index in memory [Sigmod 96]

  44. Model based clustering • Assume data is generated from K probability distributions • Typically Gaussian distributions • Soft or probabilistic version of K-means clustering • Need to find the distribution parameters • EM Algorithm

  45. EM Algorithm • Initialize K cluster centers • Iterate between two steps • Expectation step: assign points to clusters • Maximization step: estimate model parameters
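A compact 1-D sketch of EM for a mixture of K Gaussians, to make the two steps concrete (illustrative, not from the slides; a full implementation would handle multivariate data and convergence checks):

```python
import math

def em_gaussian_1d(xs, k, iters=50):
    """EM for a 1-D mixture of k Gaussians.
    E-step: soft-assign points to clusters (responsibilities).
    M-step: re-estimate mixture weights, means, and variances."""
    lo, hi = min(xs), max(xs)
    means = [lo + (j + 0.5) * (hi - lo) / k for j in range(k)]  # spread initial centers
    variances = [1.0] * k
    weights = [1.0 / k] * k
    def pdf(x, mu, var):
        return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
    for _ in range(iters):
        # E-step: r[i][j] = P(cluster j | point i)
        r = []
        for x in xs:
            p = [weights[j] * pdf(x, means[j], variances[j]) for j in range(k)]
            s = sum(p) or 1e-12
            r.append([pj / s for pj in p])
        # M-step: update parameters from the soft assignments
        for j in range(k):
            nj = sum(r[i][j] for i in range(len(xs))) or 1e-12
            weights[j] = nj / len(xs)
            means[j] = sum(r[i][j] * xs[i] for i in range(len(xs))) / nj
            variances[j] = max(sum(r[i][j] * (xs[i] - means[j]) ** 2
                                   for i in range(len(xs))) / nj, 1e-6)
    return weights, means, variances

xs = [1.0, 1.1, 0.9, 1.2, 5.0, 5.1, 4.9, 5.2]
print(em_gaussian_1d(xs, 2))  # roughly equal weights, means near 1.05 and 5.05
```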

  46. Properties • May not reach global optima • Converges fast in practice: guaranteed for certain forms of optimization function • Complexity: O(KndI): • I number of iterations, n number of points, d number of dimensions, K number of clusters.

  47. Scalable clustering algorithms • Birch: one/two pass of data by keeping R-tree like index in memory [Sigmod 96] • Fayyad and Bradley: Sample repetitively and update summary of clusters stored in memory (K-mean and EM) [KDD 98] • Dasgupta 99: Recent theoretical breakthrough, find Gaussian clusters with guaranteed performance • Random projections

  48. To Learn More

  49. Books • Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 1999 • Usama Fayyad et al. (eds), Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996 • Tom Mitchell, Machine Learning, McGraw-Hill, 1997

  50. Software • Public domain • Weka 3: data mining algos in Java (http://www.cs.waikato.ac.nz/~ml/weka) • classification, regression • MLC++: data mining tools in C++ • mainly classification • Free for universities • try convincing IBM to give it free! • Datasets: follow links from www.kdnuggets.com to UC Irvine site
