
Data Mining Methods Course

Data Mining Methods Course. Dr. Russell Anderson Dr. Musa Jafar West Texas A&M University. What is Data Mining?. The process of discovering useful information in large data repositories. (Tan, P-N., Steinbach, M., and Kumar, V., Introduction to Data Mining, Addison-Wesley, 2006)


Presentation Transcript


  1. Data Mining Methods Course Dr. Russell Anderson Dr. Musa Jafar West Texas A&M University

  2. What is Data Mining? • The process of discovering useful information in large data repositories. (Tan, P-N., Steinbach, M., and Kumar, V., Introduction to Data Mining, Addison-Wesley, 2006) • Discovered information should be: • Valid • Previously unknown • Actionable

  3. Course Objectives • Seven objectives of Lenox and Cuff in 2002 (based on ACM 2001 Ironman Report) • Prepare and warehouse data • Process data based on set of DM algorithms • Analyze results • Make predictions • Select proper algorithm • Make application • Motivated to continue graduate studies in DM • We have added • Get to know data using statistical analysis tools • Use visualization tools for analysis and review

  4. Overall Approach • Get to know the data. • Select an appropriate data mining algorithm based on the data and the mining objective. • Construct a model using the selected algorithm. • Analyze the results. • Make application.

  5. Get to Know the Data • How is it structured? • Single table/flat-file. • Multi-table – relationships • Number of observations • Number of dimensions (attributes) • Compute summary statistics using a tool such as MS-Excel • Visually evaluate characteristics of the data
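As an illustration of the summary-statistics step, here is a minimal Python sketch. The slides use MS-Excel; Python is swapped in here only for the example. The function name `summarize`, the attribute names, and the sample records are all hypothetical.

```python
import statistics

def summarize(rows):
    """Return per-attribute summary statistics for a list of dict records."""
    summary = {}
    for name in rows[0]:
        values = [row[name] for row in rows]
        if all(isinstance(v, (int, float)) for v in values):
            # Numeric attribute: report range, mean, and sample standard deviation.
            summary[name] = {
                "count": len(values),
                "min": min(values),
                "max": max(values),
                "mean": statistics.mean(values),
                "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
            }
        else:
            # Discrete attribute: report cardinality (number of distinct values).
            summary[name] = {"count": len(values), "cardinality": len(set(values))}
    return summary

# Hypothetical sample records, invented for illustration only.
records = [
    {"cap_diameter": 5.0, "odor": "none"},
    {"cap_diameter": 7.0, "odor": "foul"},
    {"cap_diameter": 9.0, "odor": "none"},
]
stats = summarize(records)
```

The number of observations is `len(records)` and the number of dimensions is `len(records[0])`.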

  6. Visual Exploration • Tools developed: • Correlation Matrix • Scatter Plot • Parallel Coordinate Plot
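The correlation matrix behind such a tool can be sketched as follows. This is not the authors' tool: it assumes the Pearson correlation coefficient (the slides do not say which measure is used), and the column names and values are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Note: a constant column has zero variance and would divide by zero here.
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Map each pair of attribute names to its correlation coefficient."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical numeric attributes, invented for illustration only.
columns = {
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 4.0, 6.0, 8.0],
    "z": [4.0, 3.0, 2.0, 1.0],
}
matrix = correlation_matrix(columns)
```

Values near +1 or -1 point to attribute pairs worth inspecting in a scatter plot.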

  7. Visual Exploration Objectives • Distributions of data • Data ranges of numeric attributes • Cardinality of discrete attributes • Shape of distribution • Skewed • Multimodal • Location of outliers • Identification of possible relationships between attributes • Identification of subpopulations within the data

  8. The Data Mining Methodologies • Microsoft Business Intelligence Tools • Association Analysis – aka market basket analysis • Classification • Decision Trees • Artificial Neural Network • Bayesian Analysis • Regression • Cluster Analysis • Custom Tools with Embedded Visual Presentation • Artificial neural network for both classification and regression • Self-Organizing Map (SOM) for cluster analysis

  9. What do students need to know? • Purpose of each methodology • Steps of underlying algorithm • Data types supported • Issues in construction and application • Parameter settings • Results interpretation

  10. Issue - Overtraining • Does the model fit the training data too well? • Need to separate the available data into training and validation subsets. • A visual view of training progress is valuable.
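A minimal sketch of the training/validation split, with one simple way to read training progress: pick the epoch where validation error is lowest. The 30% validation fraction, the function names, and the per-epoch error values are assumptions made for illustration, not figures from the course.

```python
import random

def split_train_validation(rows, validation_fraction=0.3, seed=0):
    """Shuffle the rows and hold out a fraction of them for validation."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * validation_fraction)
    return shuffled[n_val:], shuffled[:n_val]

def best_epoch(validation_errors):
    """Index of the epoch with the lowest validation error."""
    return min(range(len(validation_errors)), key=validation_errors.__getitem__)

train, validation = split_train_validation(range(10))

# Hypothetical per-epoch errors: training error keeps falling while
# validation error turns upward, the usual signature of overtraining.
training_errors = [0.40, 0.30, 0.22, 0.15, 0.10, 0.06]
validation_errors = [0.42, 0.33, 0.27, 0.26, 0.29, 0.35]
stop_at = best_epoch(validation_errors)
```

Plotting both error series against the epoch number gives the kind of visual view the slide describes.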

  11. Classification Errors - What are the costs? • Mushroom edibility classifiers

      Classifier A            Actual Edible   Actual Poisonous
      Predicted Edible             38%               0%
      Predicted Poisonous           8%              54%

      Classifier B            Actual Edible   Actual Poisonous
      Predicted Edible             44%               1%
      Predicted Poisonous           2%              53%
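The two mushroom classifiers can be compared by accuracy and by a cost-weighted error, as in the sketch below. The percentages are the ones on the slide. The cost values are hypothetical, chosen only to show that eating a poisonous mushroom is far worse than discarding an edible one.

```python
# Percent of all cases, keyed by (predicted, actual), taken from the slide.
classifier_a = {
    ("edible", "edible"): 38, ("edible", "poisonous"): 0,
    ("poisonous", "edible"): 8, ("poisonous", "poisonous"): 54,
}
classifier_b = {
    ("edible", "edible"): 44, ("edible", "poisonous"): 1,
    ("poisonous", "edible"): 2, ("poisonous", "poisonous"): 53,
}

# Hypothetical misclassification costs (assumption, not from the slides).
costs = {
    ("edible", "poisonous"): 1000,  # poisonous mushroom predicted edible
    ("poisonous", "edible"): 1,     # edible mushroom predicted poisonous
}

def accuracy(matrix):
    """Percent of cases where the prediction matches the actual class."""
    return sum(share for (pred, actual), share in matrix.items() if pred == actual)

def expected_cost(matrix, costs):
    """Sum of each cell's share weighted by its cost; correct cells cost 0."""
    return sum(share * costs.get(cell, 0) for cell, share in matrix.items())

acc_a, acc_b = accuracy(classifier_a), accuracy(classifier_b)
cost_a, cost_b = expected_cost(classifier_a, costs), expected_cost(classifier_b, costs)
```

Classifier B is more accurate (97% against 92%), but under these assumed costs Classifier A is much cheaper, because it never labels a poisonous mushroom as edible.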

  12. Prediction Model Evaluation • Black box: models built using sophisticated methodologies (ANNs, for example) can perform very well, but gaining an understanding of the model itself is difficult. • Contribution of individual input attributes • Nature of contribution (shape of curve) • Interaction between input attributes
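One common way to probe a black-box model is one-at-a-time sensitivity analysis: vary a single input across a range, hold the others at a baseline, and record the output. The sketch below is an assumption about how such a probe might look, not the authors' method. The `black_box` function is a hypothetical stand-in for a trained model.

```python
def sensitivity_curve(model, baseline, attribute, values):
    """Vary one attribute over `values`, holding the rest at `baseline`."""
    curve = []
    for value in values:
        inputs = dict(baseline)      # copy so the baseline is not modified
        inputs[attribute] = value
        curve.append((value, model(inputs)))
    return curve

def black_box(inputs):
    # Hypothetical stand-in for a trained model such as an ANN.
    # The x1 * x2 term is an interaction between the two inputs.
    return inputs["x1"] ** 2 + 3 * inputs["x2"] + inputs["x1"] * inputs["x2"]

baseline = {"x1": 1.0, "x2": 2.0}
curve_x1 = sensitivity_curve(black_box, baseline, "x1", [0.0, 1.0, 2.0])
# curve_x1 == [(0.0, 6.0), (1.0, 9.0), (2.0, 14.0)]
```

The shape of the resulting curve shows the nature of one attribute's contribution. Repeating the sweep at a different baseline for the other attributes hints at interactions, since the curve changes shape when inputs interact.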

  13. See you tomorrow • For a detailed presentation of the mechanics of the software deployed, attend our workshop tomorrow morning. • Saturday: 8-10 AM • Kachina A • Microsoft SQL Server Business Intelligence Studio • Visualization Tools
