The Process of Data Mining

Data Mining Methodology The Process of Data Mining

Why have a Methodology • Don’t want to learn things that aren’t true • May not represent any underlying reality • Spurious correlation • May not be statistically significant or may be statistically significant but coincidental • Because data mining makes less assumptions about the data and searches through a richer hypothesis space, this is a big issue • Model overfitting is an issue

Why have a Methodology II • Data may not reflect relevant population • Data mining normally assumes training data matches the test and score data • Quick overview of how data used for DM • Training set used to build the model • Validation set used to tune model or select amongst alternative models • Test set used to evaluate model & report quality • For prediction tasks, test set must have the “answer” • Model eventually applied to score set, which for predictive tasks, does not have the answer • Evaluation must always occur on data not used to build or tune or select the model

Why have a Methodology III • Do not want to learn things that are not useful • May be already known • May not be actionable

Hypothesis Testing • Data Mining is not usually used for hypothesis testing • B&L does not really say this • Typical assumption is data already collected and you have little influence on the process • Data may be in a data warehouse • usually do not modify the scenarios for collecting the data or the parameters • Experimental design not part of data mining • Active learning is related to this, where you carefully select the data to learn from

The Methodology (Fayyad) • According to the article by Fayyad et. al, the main steps are: • Data Selection • Preprocessing • Transformation • Data Mining • Interpretation/Evaluation

The Methodology (B & L) • According to Berry & Linoff, the main steps are: • Translate business problem into a DM problem • Select Data • Get to know the data • Create a model set • Fix problems with the data (“preprocess”) • Transform the data • Build models (“Data mining”) • Assess models (“Interpret/Evaluate”) • Deploy Models • Assess Results then start over

Steps in the Process: Selection • Many of the steps are not very complex, so her some selective comments: • Selection: • DM usually tries to use all available data • May not be necessary, can generate learning curves where see how performance varies with increasing amounts of data • Data Mining is not afraid of using lots of variables (unlike statistics). But some data mining methods (especially statistical ones) do have problems with many variables.

Steps in the Process: Know the Data • Getting to know the data: • always useful and also helps make sure you understand the problem • Data visualization can help • Data mining is not really like a black box where the computer does all of the work • having or generating good features (variables) is critical. Data visualization can help

Steps in the Process: Create Model Set • Creating a model (training) set • Sometimes you may want to form the training set other than by random sampling • It is often recommended to balance the classes if they are highly unbalanced • Not really a good idea or needed. Can use cost-sensitive learning instead, but we will address later • May want to focus on harder problems • Active learning skews the training data, but the purpose is to save effort in manually labeling the training data

Steps in the Process: Create Model Set • Data sets relevant to Data Mining • Training set: used to build initial model • Validation set: used to either tune model (e.g., pruning) or select amongst multiple models • Test set: used to evaluate goodness of model • For predictive tasks, must have class labels • Score set: Data that model ultimately build for • For predictive tasks, class labels are not available • Note that training, validation and test data come from labeled data • Cross validation can maximize size of labeled data • 10-fold cross validation uses 90% for training and 10% testing. It will entail 10 runs.

Steps in the Process: Fix Data • Many data mining methods don’t need as much variable “fixing” as statistical methods • Types of fixing • Missing values: many ways to fix • Too many categorical values: reduce • Binning, etc. • Numerical values skewed • Take log etc • Data preprocessing (Fayyad) may just alter the representation

Steps in the Process: Transform • Aggregate data to a higher level • Time series data often must be converted into examples for classification algorithms • Phone call data aggregated from call level to describe activity associated with a phone #/user • Construction of new features is part of this step. Feature construction can be critical. • Area of plot more useful for estimating value of home than length and width.

Steps in the Process: Assess Model • Predictive models are assessed based on the correctness of their predictions • Accuracy is the simplest measure, but often not very useful since not all errors are equal • we will learn more about this later • Lift curves are discussed in B&L (p 81) • Lift ratio = P(class|sample) / P(class|population) • Life only makes sense when we can be selective, like in direct marketing where we don’t have to judge every response • Descriptive models can be hard to evaluate since their may not be objective criteria • How do you tell if a clustering is meaningful? • More on assessment methods later

Steps in the Process: Deploy • Research models are fine, we run them off line and when we want to • In a business, must deal with real-world issues • In the WISDM project, we want to classify activities in real time. This is also needed for many fraud detection models. Must be able to execute the model and do it quickly, possibly on different hardware. • Some tools allow you to export the model as code • Even in off-line evaluation, may need to handle huge amounts of data

Steps in the Process: Assess Results • True assessment is not just of model, but includes the business context • Takes into account all costs and benefits • This may include costs that are very hard to quantify • How much does a false negative medical test cost it causes the patient to die of a preventable disease?

Steps in the Process: Iterate • Data Mining is an iterative process • Iteration can occur between most of the steps • Example: You don’t like overall results so you add another feature. You then assess its impact to see if you should keep it. • Example: You realize that assessment of your model does not make sense and is missing some costs, so you then incorporate these costs into the model

The Process of Data Mining