Data Mining Methods Course Dr. Russell Anderson Dr. Musa Jafar West Texas A&M University
What is Data Mining? • The process of discovering useful information in large data repositories. (Tan, P-N., Steinbach, M., and Kumar, V., Introduction to Data Mining, Addison-Wesley, 2006) • Discovered information should be: • Valid • Previously unknown • Actionable
Course Objectives • Seven objectives of Lenox and Cuff in 2002 (based on ACM 2001 Ironman Report) • Prepare and warehouse data • Process data based on a set of DM algorithms • Analyze results • Make predictions • Select proper algorithm • Make application • Motivated to continue graduate studies in DM • We have added • Get to know data using statistical analysis tools • Use visualization tools for analysis and review
Overall Approach • Get to know the data. • Select an appropriate data mining algorithm based on the data and the mining objective. • Construct a model using the selected algorithm. • Analyze the results. • Make application.
Get to Know the Data • How is it structured? • Single table/flat-file. • Multi-table – relationships • Number of observations • Number of dimensions (attributes) • Compute summary statistics using a tool such as MS-Excel • Visually evaluate characteristics of the data
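The slide names MS-Excel for summary statistics. As a rough equivalent, the sketch below counts observations and dimensions and summarizes each attribute with the Python standard library. The `rows` data and the `summarize` helper are hypothetical, made up for illustration only.

```python
import statistics

# Hypothetical flat-file data: one dict per observation, one key per attribute.
rows = [
    {"cap_diameter_cm": 5.0, "odor": "none"},
    {"cap_diameter_cm": 7.5, "odor": "foul"},
    {"cap_diameter_cm": 6.0, "odor": "none"},
    {"cap_diameter_cm": 12.0, "odor": "almond"},
]


def summarize(rows):
    """Return observation count, dimension count, and per-attribute summaries."""
    attributes = list(rows[0].keys())
    summary = {
        "observations": len(rows),
        "dimensions": len(attributes),
        "attributes": {},
    }
    for name in attributes:
        values = [row[name] for row in rows]
        if all(isinstance(v, (int, float)) for v in values):
            # Numeric attribute: range, center, and spread.
            summary["attributes"][name] = {
                "min": min(values),
                "max": max(values),
                "mean": statistics.mean(values),
                "median": statistics.median(values),
                "stdev": statistics.stdev(values),
            }
        else:
            # Discrete attribute: number of distinct values.
            summary["attributes"][name] = {"cardinality": len(set(values))}
    return summary


summary = summarize(rows)
print(summary)
```

Numeric attributes get range, mean, median, and standard deviation; discrete attributes get their cardinality only.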
Visual Exploration • Tools developed: • Correlation Matrix • Scatter Plot • Parallel Coordinate Plot
Visual Exploration Objectives • Distributions of data • Data ranges of numeric attributes • Cardinality of discrete attributes • Shape of distribution • Skewed • Multimodal • Location of outliers • Identification of possible relationships between attributes • Identification of subpopulations within the data
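One way to look for possible relationships between numeric attributes is a correlation matrix. The sketch below computes Pearson correlations in plain Python. It is not the course's correlation matrix tool, and the `columns` data is invented for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Map every pair of attribute names to its Pearson correlation."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical numeric attributes, one list per column.
columns = {
    "height": [1.0, 2.0, 3.0, 4.0, 5.0],
    "weight": [2.1, 3.9, 6.2, 8.0, 9.8],
    "noise": [3.0, 1.0, 4.0, 1.0, 5.0],
}

matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, {other: round(r, 2) for other, r in row.items()})
```

Values near +1 or -1 suggest a strong linear relationship worth inspecting in a scatter plot. Values near 0 rule out only a linear relationship, not a nonlinear one.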
The Data Mining Methodologies • Microsoft Business Intelligence Tools • Association Analysis – aka market basket analysis • Classification • Decision Trees • Artificial Neural Network • Bayesian Analysis • Regression • Cluster Analysis • Custom Tools with Embedded Visual Presentation • Artificial neural network for both classification and regression • Self-Organizing Map (SOM) for cluster analysis
What do students need to know? • Purpose of each methodology • Steps of underlying algorithm • Data types supported • Issues in construction and application • Parameter settings • Results interpretation
Issue - Overtraining • Does the model fit the training data too well? • Need to separate the available data into training and validation subsets. • A visual view of training progress is valuable.
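The sketch below shows the split into training and validation subsets, and one common way to spot overtraining: find the epoch after which validation error stops improving while training error keeps falling. The error values, the 30% validation fraction, and both helper functions are assumptions for illustration, not output from the course tools.

```python
import random


def split_train_validation(records, validation_fraction=0.3, seed=0):
    """Shuffle a copy of the records and split it into (training, validation)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


def best_epoch(validation_errors):
    """Index of the epoch with the lowest validation error."""
    return min(range(len(validation_errors)), key=validation_errors.__getitem__)


records = list(range(100))  # stand-in for 100 observations
training, validation = split_train_validation(records)

# Hypothetical per-epoch errors: training error keeps falling,
# validation error turns upward once the model starts to overtrain.
training_errors = [0.40, 0.25, 0.15, 0.08, 0.04, 0.02]
validation_errors = [0.42, 0.30, 0.22, 0.20, 0.24, 0.31]

stop_at = best_epoch(validation_errors)
print(f"{len(training)} training, {len(validation)} validation records")
print(f"validation error is lowest at epoch {stop_at}")
```

Plotting both error series against epoch gives the visual view of training progress: the gap that opens after the validation minimum is the sign of overtraining.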
Classification Errors: What are the costs? • Mushroom edibility classifiers
Classifier A
                      Actual Edible   Actual Poisonous
Predicted Edible      38%             0%
Predicted Poisonous   8%              54%
Classifier B
                      Actual Edible   Actual Poisonous
Predicted Edible      44%             1%
Predicted Poisonous   2%              53%
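To make the cost question concrete, the sketch below compares the two classifiers on accuracy and on a cost-weighted score. The percentages come from the slide. The cost values (1000 for calling a poisonous mushroom edible, 1 for discarding an edible one) are purely hypothetical and chosen only to show how unequal costs can reverse the ranking.

```python
# Keys are (predicted, actual); values are percent of all cases, from the slide.
classifier_a = {
    ("edible", "edible"): 38,
    ("edible", "poisonous"): 0,
    ("poisonous", "edible"): 8,
    ("poisonous", "poisonous"): 54,
}
classifier_b = {
    ("edible", "edible"): 44,
    ("edible", "poisonous"): 1,
    ("poisonous", "edible"): 2,
    ("poisonous", "poisonous"): 53,
}

# Hypothetical misclassification costs (assumed, not from the slides).
costs = {
    ("edible", "poisonous"): 1000,  # someone eats a poisonous mushroom
    ("poisonous", "edible"): 1,     # an edible mushroom is thrown away
}


def accuracy(matrix):
    """Percent of cases where the prediction matches the actual class."""
    return sum(share for (predicted, actual), share in matrix.items()
               if predicted == actual)


def expected_cost(matrix, costs):
    """Total cost per 100 cases; correct predictions cost nothing."""
    return sum(share * costs.get(cell, 0) for cell, share in matrix.items())


for name, matrix in [("A", classifier_a), ("B", classifier_b)]:
    print(name, "accuracy:", accuracy(matrix), "cost:", expected_cost(matrix, costs))
```

Under these assumed costs, Classifier B is more accurate (97% versus 92%), yet Classifier A is far cheaper because it never labels a poisonous mushroom as edible.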
Prediction Model Evaluation • Black Box - models built using sophisticated methodologies (ANNs, for example) perform very well, but gaining an understanding of the model itself is difficult. • Contribution of individual input attributes • Nature of contribution (shape of curve) • Interaction between input attributes
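One simple way to probe a black-box model is to vary a single input attribute while holding the others at baseline values, then record the model's output. The sketch below does this. It is not necessarily the method used in the course tools, and `black_box` is a made-up stand-in for a trained model.

```python
def response_curve(model, baseline, attribute, values):
    """Vary one attribute over `values`, holding the rest at `baseline`."""
    curve = []
    for value in values:
        inputs = dict(baseline)  # copy so the baseline is not modified
        inputs[attribute] = value
        curve.append((value, model(inputs)))
    return curve


def black_box(inputs):
    """Hypothetical stand-in for a trained model such as an ANN."""
    x1, x2 = inputs["x1"], inputs["x2"]
    return x1 ** 2 + 3 * x2 + x1 * x2


baseline = {"x1": 1.0, "x2": 2.0}
curve_x1 = response_curve(black_box, baseline, "x1", [0.0, 1.0, 2.0, 3.0])
print(curve_x1)
```

Plotting the resulting pairs shows the shape of one attribute's contribution. Repeating the sweep with a different baseline value for another attribute, and checking whether the curve changes shape, gives a rough indication of interaction between the two.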
See you tomorrow • For a detailed presentation of the mechanics of the software deployed, attend our workshop tomorrow morning. • Saturday: 8-10 AM • Kachina A • Microsoft SQL Server Business Intelligence Studio • Visualization Tools