Data Mining The Art and Science of Obtaining Knowledge from Data

Saed Sayad, iSmartsoft Inc.

Overview • Explosion of data • Introduction to data mining • Examples of data mining in science and engineering • Challenges and opportunities

Explosion of Data • Data in the world doubles every 20 months! • NASA’s Earth Observing System: 46 megabytes of data per second (about 4,000,000,000,000 bytes a day) • FBI fingerprint image library: 200,000,000,000,000 bytes • In-line image analysis for particle detection: 1 megabyte per second
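As a quick check of the NASA figure, the per-second rate can be converted to a daily volume. This is a minimal sketch that assumes 1 megabyte = 10^6 bytes (the slide does not say which convention it uses); the function name is illustrative.

```python
# Convert a constant data rate in megabytes per second to bytes per day.
# Assumption: 1 megabyte = 10**6 bytes (decimal convention).
BYTES_PER_MEGABYTE = 10**6
SECONDS_PER_DAY = 24 * 60 * 60


def bytes_per_day(megabytes_per_second):
    """Return the daily data volume in bytes for a constant rate in MB/s."""
    return megabytes_per_second * BYTES_PER_MEGABYTE * SECONDS_PER_DAY


nasa_daily = bytes_per_day(46)
print(f"{nasa_daily:,.0f} bytes per day")  # 3,974,400,000,000 bytes per day
```

Under that assumption the result is just under four trillion bytes a day, which agrees with the rounded figure on the slide.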

Explosion of Data (cont.) Fast, accurate, and scalable data analysis techniques are needed to extract useful information: The answer is Data Mining.

What is Data Mining? “Data Mining is the exploration and analysis by automatic or semi-automatic means, of large or small quantities of data in order to discover meaningful patterns and rules.”

[Diagram: Data Mining combines Database technology with Data Analysis (Statistics; Artificial Intelligence and Machine Learning)]

The Role of Data Mining in Research Life Cycle [Diagram labels: Questions, Needs, Library Search, Experiment, Research Data, Database, Data Analysis, Modeling, Report]

Database Technology • Moving from simple text files to advanced database technology (e.g., relational databases) • Structured Query Language (SQL) • Data Definition Language to define different objects in the database (e.g., tables, views, and procedures) • Data Manipulation Language to select, add, update, and delete data in the database • Parallel data access • Save time and $$$, …
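The split between Data Definition Language and Data Manipulation Language can be illustrated with a small sketch using Python's built-in sqlite3 module. The table and column names (samples, value, label) are hypothetical and chosen only for illustration.

```python
import sqlite3

# In-memory database so the sketch leaves nothing on disk.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Data Definition Language: define an object (here, a table).
cur.execute(
    "CREATE TABLE samples ("
    "id INTEGER PRIMARY KEY, value REAL, label TEXT)"
)

# Data Manipulation Language: add rows.
cur.executemany(
    "INSERT INTO samples (value, label) VALUES (?, ?)",
    [(1.5, "low"), (7.25, "high")],
)

# Data Manipulation Language: update and delete rows.
cur.execute("UPDATE samples SET label = ? WHERE id = ?", ("medium", 2))
cur.execute("DELETE FROM samples WHERE id = ?", (1,))

# Data Manipulation Language: select what is left.
rows = cur.execute("SELECT id, value, label FROM samples").fetchall()
conn.commit()
conn.close()
```

After these statements only the second row remains, with its label changed to "medium".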

Data Analysis • Classification • Regression • Clustering • Association • Sequence Analysis

Data Analysis (cont.) • A function f maps input variables (independent variables or attributes) X1, X2, X3 to output variables (dependent variables or classes) Y1, Y2, Y3 • Variable types: numeric (3, 4.5, 102, …), categorical (hot, cold, high, low, …), crisp (0, 1, yes, no, …) • Numeric output: Regression • Categorical or crisp output: Classification

Data Analysis (cont.) • Clustering [Scatter plot: Income vs. Age] • Association: transactions 1: chips, coke, chocolate; 2: gum, chips; 3: chips, coke; 4: … Probability (chips, coke)? • Sequence Analysis: …ATCTTTAAGGGACTAAAATGCCATAAAAATCCATGGGAGAGACCCAAAAAA… [Diagram labels: Xt-1, Xt, T]
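The association question "Probability (chips, coke)?" can be sketched as a support and confidence calculation. This sketch uses only the three transactions that are fully listed on the slide (the fourth is elided there and is left out here); the function names are illustrative.

```python
# The three fully listed transactions from the slide.
transactions = [
    {"chips", "coke", "chocolate"},
    {"gum", "chips"},
    {"chips", "coke"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate P(consequent | antecedent) from the transactions."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


p_chips_and_coke = support({"chips", "coke"}, transactions)
p_chips_given_coke = confidence({"coke"}, {"chips"}, transactions)
```

On these three transactions, chips and coke occur together in two of three baskets, and every basket containing coke also contains chips.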

Data Analysis (cont.) • Classification: Linear Discriminant Analysis, Naïve Bayes / Bayesian Network, OneR, Neural Networks, Decision Tree (ID3, C4.5, …), K-Nearest Neighbors, Support Vector Machines, … • Regression: Multiple Linear Regression, Principal Components Regression, Partial Least Squares, Neural Networks, Regression Tree (CART, MARS, …), K-Nearest Neighbors, Support Vector Machines, … • Clustering: K-Means Clustering, Self-Organizing Map, Bayesian Clustering, … • Association & Sequence Analysis: Apriori, Markov Chain, Hidden Markov Models, …

Basic Steps in Data Mining • Define the problem • Build data mining database • Explore data • Prepare data for modeling • Build model • Evaluate model • Deploy model


Examples of data mining in science & engineering • 1. Data mining in Biomedical Engineering • “Robotic Arm Control Using Data Mining Techniques” • 2. Data mining in Chemical Engineering • “Data Mining for In-line Image Monitoring of Extrusion Processing”

1. Define the problem “Control a robotic arm by means of EMG signals from biceps and triceps muscles.” [Figure: the four arm movements: Supination, Pronation, Flexion, Extension]

2. Build a data mining database • The dataset includes 80 records. • There are two input variables: biceps signal and triceps signal. • There is one output variable with four possible values: supination, pronation, flexion, and extension.

3. Explore data [Scatter plot: Triceps signal vs. Record #, by class: Flexion, Extension, Supination, Pronation]

3. Explore data (cont.) [Scatter plot: Biceps signal vs. Record #, by class: Flexion, Extension, Supination, Pronation]

4. Prepare data for modeling • Build a dataset with the ARFF format:
    @relation EMG
    @attribute Triceps real
    @attribute Biceps real
    @attribute Move {Flexion,Extension,Pronation,Supination}
    @data
    13,31,Flexion
    14,30,Flexion
    10,31,Flexion
    13,29,Flexion
    …
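To show how such a file is structured, here is a minimal sketch of a reader for the subset of ARFF used above. It uses only the four records visible on the slide and is an assumption-laden illustration (the parse_arff helper is hypothetical and ignores quoting, sparse data, and missing values), not a full ARFF parser.

```python
# The header and the four data rows visible on the slide.
ARFF_TEXT = """\
@relation EMG
@attribute Triceps real
@attribute Biceps real
@attribute Move {Flexion,Extension,Pronation,Supination}
@data
13,31,Flexion
14,30,Flexion
10,31,Flexion
13,29,Flexion
"""


def parse_arff(text):
    """Parse a simple dense ARFF text into (relation, attributes, rows)."""
    relation = None
    attributes = []  # list of (name, type specification)
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blanks and comments
            continue
        lower = line.lower()
        if in_data:
            values = [v.strip() for v in line.split(",")]
            row = {}
            for (name, spec), value in zip(attributes, values):
                numeric = spec.lower() in ("real", "numeric")
                row[name] = float(value) if numeric else value
            rows.append(row)
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            attributes.append((name, spec))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


relation, attributes, rows = parse_arff(ARFF_TEXT)
```

Each data row becomes a dictionary keyed by attribute name, with numeric attributes converted to floats and the nominal Move attribute kept as a string.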

5. Build Model • Classification • OneR • Decision Tree • Naïve Bayesian • K-Nearest Neighbors • Neural Networks • Linear Discriminant Analysis • Support Vector Machines • …

OneR • Construct the best rule by using the following pseudo-code:
    For each attribute:
        For each value of that attribute, make a rule as follows:
            count how often each class appears
            find the most frequent class
            make the rule assign that class to this attribute-value
        Calculate the error rate of the rules
    Choose the rules with the smallest error rate
Resulting rule: Triceps: < 17.5 -> Flexion; < 33.5 -> Pronation; < 46.5 -> Supination; >= 46.5 -> Extension (65/80 instances correct)
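The pseudo-code above can be sketched in Python roughly as follows. This is a minimal illustration that assumes attribute values are already nominal; numeric attributes such as Triceps would first have to be discretized into ranges, which is not shown. The toy records and the one_r name are hypothetical, not the 80-record EMG dataset.

```python
from collections import Counter, defaultdict


def one_r(rows, attributes, target):
    """Return (attribute, rule, errors) for the single best attribute."""
    best = None
    for attr in attributes:
        # Count how often each class appears for each attribute value.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[attr]][row[target]] += 1
        # The rule assigns the most frequent class to each value.
        rule = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        # Errors are the records that are not in the majority class.
        errors = sum(
            sum(c.values()) - c.most_common(1)[0][1]
            for c in counts.values()
        )
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best


# Hypothetical, already-discretized toy records.
toy = [
    {"triceps": "low", "biceps": "high", "move": "Flexion"},
    {"triceps": "low", "biceps": "high", "move": "Flexion"},
    {"triceps": "high", "biceps": "low", "move": "Extension"},
    {"triceps": "high", "biceps": "high", "move": "Extension"},
    {"triceps": "low", "biceps": "low", "move": "Flexion"},
]
attr, rule, errors = one_r(toy, ["triceps", "biceps"], "move")
```

On the toy records the triceps attribute separates the two classes without error, so it is chosen over biceps.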

Naïve Bayesian • Posterior probability = likelihood × prior probability / normalization factor • Rec#1: Triceps=13, Biceps=31 • C ∈ {Flexion, Extension, Supination, Pronation} • P(C|Triceps=13, Biceps=31) = P(Triceps=13|C) × P(Biceps=31|C) × P(C) / P(X)
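A minimal sketch of this calculation is shown below. The slide does not say how the likelihoods P(Triceps|C) and P(Biceps|C) are estimated; this sketch assumes a Gaussian per class and feature, which is one common choice for numeric inputs. The Flexion records are the four shown in the ARFF example; the Extension records are hypothetical values added only so that there are two classes to compare.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate a prior and per-feature (mean, variance) for each class."""
    by_class = defaultdict(list)
    for features, label in zip(X, y):
        by_class[label].append(features)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        stats = []
        for column in zip(*rows):
            mean = sum(column) / n
            # Population variance, with a small floor to avoid zero.
            var = sum((v - mean) ** 2 for v in column) / n + 1e-9
            stats.append((mean, var))
        model[label] = (n / len(X), stats)
    return model


def gaussian_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def posterior(model, features):
    """Return P(C | features) for every class C."""
    scores = {}
    for label, (prior, stats) in model.items():
        likelihood = 1.0
        for x, (mean, var) in zip(features, stats):
            likelihood *= gaussian_pdf(x, mean, var)
        scores[label] = prior * likelihood
    evidence = sum(scores.values())  # normalization factor P(X)
    return {label: s / evidence for label, s in scores.items()}


# (Triceps, Biceps): Flexion rows from the slide, Extension rows hypothetical.
X = [(13, 31), (14, 30), (10, 31), (13, 29), (50, 12), (52, 10), (48, 11), (51, 13)]
y = ["Flexion"] * 4 + ["Extension"] * 4
post = posterior(fit_gaussian_nb(X, y), (13, 31))
```

For data with many features, summing log probabilities instead of multiplying raw likelihoods would be more robust against underflow.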

Neural Networks [Diagram labels: Input Nodes, Weights, Neuron i, Output Node(s)]

Decision Tree • Find the attribute that best classifies the training data. • Use this attribute as the root of the decision tree. • Repeat the process for each subtree. [Tree diagram: root split on Triceps (>37 / <=37); second-level nodes Triceps and Biceps, with splits at 14 and 17; leaves Flexion, Pronation, Extension, Supination]

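Decision Tree (cont.) Q: What does it mean to be the “best classifying” attribute? A: Use the attribute with the highest information gain: $\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$, where $S_v$ is the subset of $S$ in which attribute $A$ has value $v$, $|S_v|$ is the size of the subset, $|S|$ is the size of the set, and $\mathrm{Entropy}(S) = -\sum_i p_i \log_2 p_i$, with $p_i$ the proportion of examples in $S$ belonging to class $i$.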
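Entropy and information gain for nominal attributes can be sketched as below. The function names and the toy records are hypothetical and only illustrate the calculation; handling numeric attributes such as Triceps would additionally require choosing split thresholds.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum(
        (n / total) * math.log2(n / total) for n in Counter(labels).values()
    )


def information_gain(rows, attr, target):
    """Entropy of the whole set minus the weighted entropy of the subsets."""
    labels = [r[target] for r in rows]
    subsets = {}
    for r in rows:
        subsets.setdefault(r[attr], []).append(r[target])
    remainder = sum(
        len(s) / len(rows) * entropy(s) for s in subsets.values()
    )
    return entropy(labels) - remainder


# Hypothetical toy records: "a" separates the classes, "b" does not.
toy = [
    {"a": "x", "b": "k", "cls": "Flexion"},
    {"a": "x", "b": "k", "cls": "Flexion"},
    {"a": "y", "b": "k", "cls": "Extension"},
    {"a": "y", "b": "k", "cls": "Extension"},
]
gain_a = information_gain(toy, "a", "cls")
gain_b = information_gain(toy, "b", "cls")
```

Attribute "a" splits the set into two pure subsets and gains the full one bit of entropy, while "b" leaves the set unchanged and gains nothing, so "a" would be chosen as the root.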

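K-Nearest Neighbors • KNN is a simple algorithm that stores all available examples and classifies new instances based on a similarity measure. • Euclidean distance as the similarity function: $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$, where $x$ and $y$ are two examples with $n$ input variables each.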
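A minimal sketch of KNN with Euclidean distance and majority voting follows. The Flexion points are the four records from the ARFF example; the Extension points are hypothetical and are included only to give the classifier a second class. The function names are illustrative.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(train, query, k=3):
    """Classify query by majority vote among its k nearest examples.

    train is a list of (features, label) pairs.
    """
    neighbors = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# (Triceps, Biceps) examples: Flexion from the slide, Extension hypothetical.
train = [
    ((13, 31), "Flexion"),
    ((14, 30), "Flexion"),
    ((10, 31), "Flexion"),
    ((13, 29), "Flexion"),
    ((50, 12), "Extension"),
    ((52, 10), "Extension"),
    ((48, 11), "Extension"),
]
predicted = knn_classify(train, (12, 30), k=3)
```

Because the input variables share a scale here, no normalization is applied; with variables on very different scales, the distance would usually be computed on rescaled inputs.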

6. Evaluate Models • Simple validation: training set and test set • n-fold cross-validation • Leave-one-out [Results shown for 10-fold cross-validation]
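The way n-fold cross-validation partitions the records can be sketched as follows. The k_fold_indices helper and its seed parameter are hypothetical; setting k equal to the number of records gives leave-one-out.

```python
import random


def k_fold_indices(n_records, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]  # k nearly equal folds
    for i in range(k):
        test = folds[i]
        train = [
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        ]
        yield train, test


# 10-fold cross-validation over 80 records, as in the EMG example.
splits = list(k_fold_indices(80, 10))
```

Each record is used for testing exactly once, and the model is trained on the remaining 72 records in each round; the reported accuracy is typically the average over the ten rounds.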

7. Deploy Model • The neural network model was successfully implemented inside the robotic arm.

Examples of data mining in science & engineering • 1. Data mining in Biomedical Engineering • “Robotic Arm Control Using Data Mining Techniques” • 2. Data mining in Chemical Engineering • “Data Mining for In-line Image Monitoring of Extrusion Processing” K. Torabi, L. D. Ing, S. Sayad, and S. T. Balke

Plastics Extrusion [Figure labels: Plastic pellets, Plastic Melt]

Film Extrusion [Figure labels: Plastic Extruder, Plastic Film, Defect due to particle contaminant]

In-Line Monitoring [Figure labels: Transition Piece, Window Ports]

In-Line Monitoring [Figure labels: Optical Assembly, Light, Extruder and Interface, Light Source, Imaging Computer]

Melt Without Contaminant Particles (WO)

Melt With Contaminant Particles (WP)

1. Define the problem Classify images into those with particles (WP) and those without particles (WO). [Example images: WO, WP]

2. Build a data mining database • 2000 images • 54 input variables, all numeric • One output variable with two possible values (With Particle and Without Particle)

3. Explore data

4. Prepare data for modeling • Pre-processed images to remove noise • Dataset 1 with sharp images: 1350 images including 1257 without particles and 91 with particles • Dataset 2 with sharp and blurry images: 2000 images including 1909 without particles and blurry particles and 91 with particles • 54 Input variables, all numeric • One output variable, with two possible values (WP and WO)

5. Build a model • Classification: • OneR • Decision Tree • 3-Nearest Neighbors • Naïve Bayesian

6. Evaluate Models • 10-fold cross-validation • If pixel density Max < 142 then WP
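The rule above is simple enough to implement directly. The sketch below assumes that any image not matching the rule is labelled WO, which the slide implies through its two-class setup but does not state; the function and parameter names are hypothetical.

```python
def classify_image(max_pixel_density, threshold=142):
    """Apply the single-threshold rule from the slide.

    Returns "WP" (with particles) when the maximum pixel density is below
    the threshold. Assumption: everything else is "WO" (without particles).
    """
    return "WP" if max_pixel_density < threshold else "WO"


labels = [classify_image(v) for v in (100, 141.9, 142, 200)]
```

Keeping the threshold as a parameter makes it easy to re-tune the rule if the model is rebuilt on new images.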

7. Deploy model • A Visual Basic program will be developed to implement the model.


Challenges and Opportunities • Data mining is a ‘top ten’ emerging technology • Faster, more accurate and more scalable techniques • Incremental, on-line and real-time learning algorithms • Parallel and distributed data processing techniques

Data mining is an exciting and challenging field with the ability to solve many complex scientific problems. You can be part of the solution!

How a chemical engineer sees data mining! [Diagram labels: Data, Data Mining, Knowledge]
