DSCI 4520/5240: Data Mining Fall 2013 – Dr. Nick Evangelopoulos

Presentation Transcript


  1. DSCI 4520/5240: Data MiningFall 2013 – Dr. Nick Evangelopoulos Lecture 4: Decision Trees: Overview Some slide material taken from: SAS Education

  2. On the News: The Rise of the Numerati. BusinessWeek, Sep 8, 2008: With the explosion of data from the Internet, cell phones, and credit cards, the people who can make sense of it all are changing our world. An excerpt from the introduction of the book The Numerati by Stephen Baker: Imagine you're in a café, perhaps the noisy one I'm sitting in at this moment. A young woman at a table to your right is typing on her laptop. You turn your head and look at her screen. She surfs the Internet. You watch. Hours pass. She reads an online newspaper. You notice that she reads three articles about China. She scouts movies for Friday night and watches the trailer for Kung Fu Panda. She clicks on an ad that promises to connect her to old high school classmates. You sit there taking notes. With each passing minute, you're learning more about her. Now imagine that you could watch 150 million people surfing at the same time. That's what is happening today in the business world.

  3. On the News: The Rise of the Numerati. By building mathematical models of its own employees, IBM aims to improve productivity and automate management. In 2005, IBM embarked on research to harvest massive data on employees and to build mathematical models of 50,000 of the company's consultants. The goal was to optimize them, using operations research, so that they could be deployed with ever more efficiency. Data on IBM employees include: • Allergies • Number of interns managed • Client visits • Computer languages • Number of words per e-mail • Amount spent entertaining clients • Number of weekends worked • Time spent in meetings • Social network participation • Time spent surfing the Web • Response time to e-mails • Amount of sales • Marital status • Ratio of personal to work e-mails

  4. Agenda • Introduce the concept of “Curse of Dimensionality” • Benefits and Pitfalls in Decision Tree modeling • Consequences of a decision

  5. The Curse of Dimensionality The dimension of a problem refers to the number of input variables (actually, degrees of freedom). Data mining problems are often massive in both the number of cases and the dimension. [Figure: eight points plotted in 1-D, 2-D, and 3-D space] The curse of dimensionality refers to the exponential increase in data required to densely populate space as the dimension increases. For example, the eight points fill the one-dimensional space but become more separated as the dimension increases. In 100-dimensional space, they would be like distant galaxies. The curse of dimensionality limits our practical ability to fit a flexible model to noisy data (real data) when there are a large number of input variables. A densely populated input space is required to fit highly complex models.
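
A minimal sketch (Python with NumPy; not part of the original slides) of how a fixed number of points becomes sparse as the dimension grows, echoing the eight-point example above:

    import numpy as np

    rng = np.random.default_rng(0)
    n_points = 8  # the slide's example uses eight points

    for d in (1, 2, 3, 100):
        # scatter the points uniformly in the unit hypercube [0, 1]^d
        pts = rng.uniform(size=(n_points, d))
        # the average pairwise distance grows with the dimension, so the same
        # points cover the space less and less densely
        dists = [np.linalg.norm(pts[i] - pts[j])
                 for i in range(n_points) for j in range(i + 1, n_points)]
        print(f"{d:>3}-D  mean pairwise distance = {np.mean(dists):.2f}")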

  6. Addressing the Curse of Dimensionality: Reduce the Dimensions [Figure: plots of E(Target) against Input1/Input2 and Input1/Input3, illustrating redundancy and irrelevancy] The two principal reasons for eliminating a variable are redundancy and irrelevancy. A redundant input does not give any new information that has not already been explained. Useful methods: principal components, factor analysis, variable clustering. An irrelevant input is not useful in explaining variation in the target. Interactions and partial associations make irrelevancy more difficult to detect than redundancy. It is often useful to first eliminate redundant dimensions and then tackle irrelevancy.
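
The slide names principal components, factor analysis, and variable clustering as useful methods; a minimal sketch of the first one, assuming Python with scikit-learn and made-up data (the course itself uses SAS):

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(1)
    x1 = rng.normal(size=500)
    x2 = 2 * x1 + rng.normal(scale=0.05, size=500)  # nearly redundant with x1
    x3 = rng.normal(size=500)                       # unrelated third input
    X = np.column_stack([x1, x2, x3])

    # two principal components capture essentially all of the variance,
    # revealing that one of the three inputs is redundant
    print(PCA().fit(X).explained_variance_ratio_.round(3))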

  7. Model Complexity [Figure: fitted curves that are too flexible and not flexible enough] A naïve modeler might assume that the most complex model should always outperform the others, but this is not the case. An overly complex model might be too flexible. This will lead to overfitting – accommodating nuances of the random noise in the particular sample (high variance). A model with just enough flexibility will give the best generalization.
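
A minimal sketch (Python with scikit-learn, synthetic data; not from the slides) of the point above: training performance keeps improving with flexibility, while holdout performance peaks at a moderate depth.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=2000, n_features=20, flip_y=0.1, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

    for depth in (2, 5, 10, None):  # None lets the tree grow until its leaves are pure
        tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
        # training accuracy vs. test accuracy
        print(depth, round(tree.score(X_tr, y_tr), 3), round(tree.score(X_te, y_te), 3))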

  8. Overfitting [Figure: an overfitted model on the training set versus the test set]

  9. Better Fitting [Figure: a better-fitting model on the training set versus the test set]

  10. The Cultivation of Trees • Split Search • Which splits are to be considered? • Splitting Criterion • Which split is best? • Stopping Rule • When should the splitting stop? • Pruning Rule • Should some branches be lopped off?

  11. Possible Splits to Consider: an enormous number. [Chart: number of possible splits (up to roughly 500,000) versus number of input levels (1 to 20), for a nominal input and an ordinal input]
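
A minimal sketch (Python; standard counting formulas, not read off the chart) of why the candidate-split count explodes for nominal inputs but stays linear for ordinal ones, considering binary splits only:

    def binary_splits(levels, nominal=True):
        """Number of distinct two-way splits of an input with `levels` values."""
        if nominal:
            # any nonempty subset of levels vs. its complement, counted once
            return 2 ** (levels - 1) - 1
        # an ordinal input only allows order-preserving cut points
        return levels - 1

    for L in (2, 5, 10, 20):
        print(L, binary_splits(L, nominal=True), binary_splits(L, nominal=False))
    # at 20 levels a nominal input already admits 524,287 binary splits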

  12. Splitting Criteria How is the best split determined? In some situations, the worth of a split is obvious. If the expected target is the same in the child nodes as in the parent node, no improvement was made, and the split is worthless! In contrast, if a split results in pure child nodes, the split is undisputedly best. For classification trees, the three most widely used splitting criteria are based on the Pearson chi-squared test, the Gini index, and entropy. All three measure the difference in class distributions across the child nodes. The three methods usually give similar results.
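
A minimal sketch (Python; textbook formulas for two of the three criteria named above, not SAS Enterprise Miner's implementation) of scoring a split from the class counts in its child nodes:

    import math

    def gini(counts):
        n = sum(counts)
        return 1.0 - sum((c / n) ** 2 for c in counts)

    def entropy(counts):
        n = sum(counts)
        return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

    def split_worth(children, impurity=gini):
        """Parent impurity minus the case-weighted impurity of the child nodes."""
        parent = [sum(cls) for cls in zip(*children)]
        n = sum(parent)
        weighted = sum(sum(child) / n * impurity(child) for child in children)
        return impurity(parent) - weighted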

  13. Splitting Criteria
  Two-way split on Debt-to-Income Ratio < 45:
               Left   Right   Total
      Not Bad  3196    1304    4500
      Bad       154     346     500
  A competing three-way split:
               Left   Center   Right   Total
      Not Bad  2521     1188     791    4500
      Bad       115      162     223     500
  A perfect split:
               Left   Right   Total
      Not Bad  4500       0    4500
      Bad         0     500     500
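
Continuing the sketch above with these counts (each child node listed as [Not Bad, Bad]): the perfect split scores far higher than either real candidate, while the two real candidates give similar, modest improvements.

    two_way   = [[3196, 154], [1304, 346]]
    three_way = [[2521, 115], [1188, 162], [791, 223]]
    perfect   = [[4500, 0], [0, 500]]

    for name, split in [("two-way", two_way), ("three-way", three_way), ("perfect", perfect)]:
        print(name, round(split_worth(split), 4), round(split_worth(split, entropy), 4))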

  14. Controlling tree growth: Stunting. A universally accepted rule is to stop growing if the node is pure. Two other popular rules for stopping tree growth are to stop if the number of cases in a node falls below a specified limit, or to stop when the split is not statistically significant at a specified level. This is called pre-pruning, or stunting.
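
A minimal sketch of such stopping rules, assuming Python with scikit-learn rather than the SAS Enterprise Miner settings used in the course:

    from sklearn.tree import DecisionTreeClassifier

    # stunting / pre-pruning: refuse splits on nodes that are too small,
    # or splits that improve impurity by less than a threshold
    stunted = DecisionTreeClassifier(
        min_samples_leaf=50,          # node-size limit
        min_impurity_decrease=0.001,  # stands in for a significance-style cutoff
    )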

  15. Controlling tree growth: Pruning. Pruning (also called post-pruning) creates a sequence of trees of increasing complexity. An assessment criterion is needed for deciding the best (sub)tree. The assessment criteria are usually based on performance on holdout samples (validation data or cross-validation). Cost or profit considerations can be incorporated into the assessment.
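
A minimal sketch of post-pruning, again assuming scikit-learn: cost-complexity pruning generates the sequence of subtrees, and a holdout sample picks among them.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=2000, random_state=0)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

    # each alpha indexes one tree in the nested sequence, from full-grown down to the root
    alphas = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr).ccp_alphas
    trees = [DecisionTreeClassifier(ccp_alpha=a, random_state=0).fit(X_tr, y_tr) for a in alphas]
    best = max(trees, key=lambda t: t.score(X_val, y_val))  # assess on the holdout data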

  16. Benefits of Trees • Interpretability • tree-structured presentation • Mixed Measurement Scales • nominal, ordinal, interval • Regression trees • Robustness • Missing Values

  17. Benefits of Trees • Automatically • Detects interactions (AID) • Accommodates nonlinearity • Selects input variables [Figure: predicted probability as a multivariate step function over two inputs]

  18. Drawbacks of Trees • Roughness • Difficulty with linear and main-effect relationships • Instability

  19. Building and Interpreting Decision Trees • Explore the types of decision tree models available in Enterprise Miner. • Build a decision tree model. • Examine the model results and interpret these results. • Choose a decision threshold theoretically and empirically.

  20. Consequences of a Decision

  21. Example • Recall the home equity line of credit scoring example. Presume that every two dollars loaned eventually returns three dollars if the loan is paid off in full.

  22. Consequences of a Decision

  23. Bayes Rule: Optimal threshold Using the cost structure defined for the home equity example, the optimal threshold is 1/(1+(2/1)) = 1/3. That is, reject all applications whose predicted probability of default exceeds 1/3 (about 0.33).
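
A minimal worked check of that threshold (Python; the 1-to-2 profit-to-loss ratio comes from the example on slide 21, the variable names are only illustrative):

    profit_if_paid = 1.0    # lend $2, collect $3
    loss_if_default = 2.0   # lend $2, collect nothing

    # accept an applicant with default probability p when expected profit is positive:
    #   (1 - p) * profit_if_paid - p * loss_if_default > 0  =>  p < profit / (profit + loss)
    threshold = 1 / (1 + loss_if_default / profit_if_paid)
    print(threshold)  # 0.333... -> reject when the predicted default probability exceeds 1/3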

  24. Consequences of a Decision: Profit matrix (SAS EM)
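
The matrix itself appears only as an image in the deck; a sketch of what it presumably contains, derived from the cost structure on slide 21 (profit per applicant in dollars, with rejection treated as zero either way):

    # rows: decision; columns: actual outcome.
    # Values are assumptions derived from slide 21, not read from the image.
    profit_matrix = {
        ("accept", "paid off"):  1.0,   # lend $2, collect $3
        ("accept", "default"):  -2.0,   # lend $2, collect nothing
        ("reject", "paid off"):  0.0,
        ("reject", "default"):   0.0,
    }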

  25. Decision Tree Algorithms [Figure: a small decision tree on the weather data, splitting on Outlook (sunny / overcast / rainy), then on Humidity (high / normal) or Windy (true / false), with yes/no leaves] • Read Lecture 5 notes (Tree Algorithms) before coming to class next week • Focus on Rule-induction using Entropy and Information Gain
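
As a preview of that reading, a minimal sketch (Python) of the information gain for the Outlook split in the tree above; the class counts come from the classic 14-record weather/play-tennis data set this figure is based on and do not appear on the slide:

    import math

    def entropy(counts):
        n = sum(counts)
        return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

    # counts of (play = yes, play = no) in each Outlook branch
    branches = {"sunny": [2, 3], "overcast": [4, 0], "rainy": [3, 2]}
    parent = [sum(b[i] for b in branches.values()) for i in (0, 1)]  # [9, 5]
    n = sum(parent)
    gain = entropy(parent) - sum(sum(b) / n * entropy(b) for b in branches.values())
    print(round(gain, 3))  # information gain of splitting on Outlook, about 0.247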
