Data Mining: The Art and Science of Obtaining Knowledge from Data

Presentation Transcript


  1. Data Mining: The Art and Science of Obtaining Knowledge from Data. Saed Sayad, iSmartsoft Inc.

  2. Overview • Explosion of data • Introduction to data mining • Examples of data mining in science and engineering • Challenges and opportunities

  3. Explosion of Data • Data in the world doubles every 20 months! • NASA’s Earth Observing System: 46 megabytes of data per second, or about 4,000,000,000,000 bytes a day • FBI fingerprint image library: 200,000,000,000,000 bytes • In-line image analysis for particle detection: 1 megabyte per second
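The daily volume quoted for the NASA system can be checked from its per-second rate. The short calculation below is only a sanity check and assumes decimal megabytes (1,000,000 bytes), which the slide does not state.

```python
# Check the slide's daily data volume against its per-second rate.
BYTES_PER_MEGABYTE = 1_000_000   # assumption: decimal megabytes
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400 seconds

rate_bytes_per_second = 46 * BYTES_PER_MEGABYTE
bytes_per_day = rate_bytes_per_second * SECONDS_PER_DAY

# Prints a value just under the slide's rounded figure of 4,000,000,000,000.
print(f"{bytes_per_day:,} bytes per day")
```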

  4. Explosion of Data (cont.) Fast, accurate, and scalable data analysis techniques are needed to extract useful information: The answer is Data Mining.

  5. What is Data Mining? “Data Mining is the exploration and analysis by automatic or semi-automatic means, of large or small quantities of data in order to discover meaningful patterns and rules.”

  6. [Figure: diagram relating Data Mining to Data Analysis, Artificial Intelligence and Machine Learning, Statistics, and Database]

  7. The Role of Data Mining in Research Life Cycle [Figure: research life cycle diagram with the elements Questions, Needs, Library Search, Experiment, Research Data, Database, Data Analysis, Modeling, and Report]

  8. Database Technology • Moving from simple text files to advanced database technology (e.g., relational databases) • Structured Query Language (SQL) • Data Definition Language (DDL) to define objects in the database (e.g., tables, views, and procedures) • Data Manipulation Language (DML) to select, add, update, and delete data in the database • Parallel data access • Save time and $$$, …
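The split between data definition and data manipulation statements can be illustrated with a minimal sketch using Python's built-in `sqlite3` module. The table name `emg` and its columns are hypothetical; the values are borrowed from the EMG records shown later in the deck, and nothing here is taken from the author's actual database.

```python
import sqlite3

# In-memory database so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Data Definition Language: define a table (hypothetical schema).
cur.execute(
    "CREATE TABLE emg ("
    "record_id INTEGER PRIMARY KEY, triceps REAL, biceps REAL, move TEXT)"
)

# Data Manipulation Language: add, update, delete, and select rows.
cur.executemany(
    "INSERT INTO emg (triceps, biceps, move) VALUES (?, ?, ?)",
    [(13, 31, "Flexion"), (14, 30, "Flexion"), (10, 31, "Flexion")],
)
cur.execute("UPDATE emg SET biceps = ? WHERE record_id = ?", (29, 3))
cur.execute("DELETE FROM emg WHERE record_id = ?", (2,))
conn.commit()

rows = cur.execute(
    "SELECT record_id, triceps, biceps, move FROM emg ORDER BY record_id"
).fetchall()
print(rows)
conn.close()
```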

  9. Data Analysis • Classification • Regression • Clustering • Association • Sequence Analysis

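  10. Data Analysis (cont.) [Figure: a function f maps input variables (also called independent variables or attributes) X1 (numeric), X2 (categorical), and X3 (crisp) to output variables (also called dependent variables or classes) Y1 (numeric), Y2 (categorical), and Y3 (crisp). Numeric values such as 3, 4.5, 102, … correspond to regression; categorical values such as hot, cold, high, low, … and crisp values such as 0, 1, yes, no, … correspond to classification.]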

  11. Data Analysis (cont.) • Clustering [Figure: scatter plot with axes Income and Age] • Association: transactions 1: chips, coke, chocolate; 2: gum, chips; 3: chips, coke; 4: …; Probability (chips, coke) ? • Sequence Analysis: …ATCTTTAAGGGACTAAAATGCCATAAAAATCCATGGGAGAGACCCAAAAAA… [Figure: time axis T with Xt-1 and Xt]
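The association question on this slide, the probability of chips and coke appearing together, can be estimated as the fraction of transactions that contain both items. The sketch below uses only the three transactions listed on the slide (the fourth is elided there and is left out); the function name `support` is an illustrative choice.

```python
# The three transactions listed on the slide; the elided fourth is omitted.
transactions = [
    {"chips", "coke", "chocolate"},
    {"gum", "chips"},
    {"chips", "coke"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for basket in transactions if itemset <= basket)
    return hits / len(transactions)

pair_support = support({"chips", "coke"}, transactions)

# Confidence of the rule "chips -> coke": P(coke | chips).
chips_to_coke = pair_support / support({"chips"}, transactions)

print(f"support(chips, coke) = {pair_support:.3f}")
print(f"confidence(chips -> coke) = {chips_to_coke:.3f}")
```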

  12. Data Analysis (cont.) • Classification: Linear Discriminant Analysis; Naïve Bayes / Bayesian Network; OneR; Neural Networks; Decision Tree (ID3, C4.5, …); K-Nearest Neighbors; Support Vector Machines; … • Regression: Multiple Linear Regression; Principal Components Regression; Partial Least Squares; Neural Networks; Regression Tree (CART, MARS, …); K-Nearest Neighbors; Support Vector Machines; … • Clustering: K-Means Clustering; Self-Organizing Map; Bayesian Clustering; … • Association & Sequence Analysis: Apriori; Markov Chain; Hidden Markov Models; …

  13. Basic Steps in Data Mining • Define the problem • Build data mining database • Explore data • Prepare data for modeling • Build model • Evaluate model • Deploy model

  14. Overview • Explosion of data • Introduction to data mining • Examples of data mining in science and engineering • Challenges and opportunities

  15. Examples of data mining in science & engineering • 1. Data mining in Biomedical Engineering • “Robotic Arm Control Using Data Mining Techniques” • 2. Data mining in Chemical Engineering • “Data Mining for In-line Image Monitoring of Extrusion Processing”

  16. 1. Define the problem “Control a robotic arm by means of EMG signals from biceps and triceps muscles.” [Figure: the four arm movements: supination, pronation, flexion, and extension]

  17. 2. Build a data mining database • The dataset includes 80 records. • There are two input variables: biceps signal and triceps signal. • There is one output variable with four possible values: supination, pronation, flexion, and extension.

  18. 3. Explore data [Figure: scatter plot of the triceps signal against record number, with legend Flexion, Extension, Supination, Pronation]

  19. 3. Explore data (cont.) [Figure: scatter plot of the biceps signal against record number, with legend Flexion, Extension, Supination, Pronation]

  20. 4. Prepare data for modeling • Build a dataset in the ARFF format:
      @relation EMG
      @attribute Triceps real
      @attribute Biceps real
      @attribute Move {Flexion,Extension,Pronation,Supination}
      @data
      13,31,Flexion
      14,30,Flexion
      10,31,Flexion
      13,29,Flexion
      ……
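To show how an ARFF file of this shape is laid out (a header of `@attribute` declarations followed by comma-separated rows after `@data`), here is a minimal reader sketch. It handles only the simple subset used on the slide (no quoting, sparse rows, or missing values) and is not a replacement for a full ARFF library; the function name `parse_arff` is an illustrative choice, and only the four records visible on the slide are included.

```python
ARFF_TEXT = """\
@relation EMG
@attribute Triceps real
@attribute Biceps real
@attribute Move {Flexion,Extension,Pronation,Supination}
@data
13,31,Flexion
14,30,Flexion
10,31,Flexion
13,29,Flexion
"""

def parse_arff(text):
    """Parse a simple ARFF string into (attributes, rows)."""
    attributes = []   # list of (name, kind) with kind "real" or "nominal"
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):   # skip blanks and comments
            continue
        if in_data:
            values = [v.strip() for v in line.split(",")]
            row = {}
            for (name, kind), value in zip(attributes, values):
                row[name] = float(value) if kind == "real" else value
            rows.append(row)
        elif line.lower().startswith("@attribute"):
            _, name, declared = line.split(None, 2)
            numeric = declared.lower() in ("real", "numeric")
            attributes.append((name, "real" if numeric else "nominal"))
        elif line.lower().startswith("@data"):
            in_data = True
    return attributes, rows

attributes, rows = parse_arff(ARFF_TEXT)
print(attributes)
print(rows[0])
```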

  21. 5. Build Model • Classification • OneR • Decision Tree • Naïve Bayesian • K-Nearest Neighbors • Neural Networks • Linear Discriminant Analysis • Support Vector Machines • …

  22. OneR • Construct the best rule by using the following pseudo-code:
      For each attribute:
        For each value of that attribute, make a rule as follows:
          count how often each class appears
          find the most frequent class
          make the rule assign that class to this attribute-value
        Calculate the error rate of the rules
      Choose the rules with the smallest error rate
      • Result: Triceps: < 17.5 -> Flexion; < 33.5 -> Pronation; < 46.5 -> Supination; >= 46.5 -> Extension (65/80 instances correct)
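The OneR pseudo-code can be sketched in Python as follows. This version assumes the attributes are already discretized into named bins; the slide's rule uses numeric thresholds on the triceps signal, and the binning step that produces them is not shown here. The six records are hypothetical and exist only to exercise the function.

```python
from collections import Counter, defaultdict

def one_r(records, attributes, target):
    """Return (attribute, rule, error_rate) for the best single-attribute rule."""
    best = None
    for attr in attributes:
        # Count how often each class appears for each value of this attribute.
        counts = defaultdict(Counter)
        for rec in records:
            counts[rec[attr]][rec[target]] += 1
        # Each attribute value predicts its most frequent class.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(
            sum(c.values()) - c.most_common(1)[0][1] for c in counts.values()
        )
        error_rate = errors / len(records)
        if best is None or error_rate < best[2]:
            best = (attr, rule, error_rate)
    return best

# Hypothetical, pre-discretized records (not the author's data).
records = [
    {"triceps": "low", "biceps": "high", "move": "Flexion"},
    {"triceps": "low", "biceps": "high", "move": "Flexion"},
    {"triceps": "mid", "biceps": "high", "move": "Pronation"},
    {"triceps": "mid", "biceps": "low", "move": "Pronation"},
    {"triceps": "high", "biceps": "low", "move": "Extension"},
    {"triceps": "high", "biceps": "low", "move": "Extension"},
]

best_attr, best_rule, best_error = one_r(records, ["triceps", "biceps"], "move")
print(best_attr, best_rule, best_error)
```

On this toy data the triceps attribute separates the classes without error, while the biceps attribute misclassifies two of six records, so the triceps rule is chosen.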

  23. Naïve Bayesian • Rec#1: Triceps=13, Biceps=31 • C = {Flexion, Extension, Supination, Pronation} • P(C | Triceps=13, Biceps=31) = P(Triceps=13 | C) × P(Biceps=31 | C) × P(C) / P(X) • Posterior probability: P(C | Triceps=13, Biceps=31) • Likelihood: P(Triceps=13 | C) × P(Biceps=31 | C) • Prior probability: P(C) • Normalization factor: P(X)
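The posterior for record 1 can be computed with a short sketch. The slide does not say how the likelihoods P(Triceps | C) and P(Biceps | C) are estimated; the code below assumes a Gaussian per class and attribute, which is one common choice for numeric inputs. The Flexion records come from the ARFF slide, while the Extension records are hypothetical and added only so that there is a second class to compare against.

```python
import math
from collections import defaultdict

def gaussian_pdf(x, mean, var):
    """Normal density, used here as the assumed per-attribute likelihood."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit(records):
    """Estimate the prior and per-attribute (mean, variance) for each class."""
    by_class = defaultdict(list)
    for triceps, biceps, move in records:
        by_class[move].append((triceps, biceps))
    model = {}
    for move, rows in by_class.items():
        stats = []
        for column in zip(*rows):
            mean = sum(column) / len(column)
            # Small constant keeps the variance strictly positive.
            var = sum((v - mean) ** 2 for v in column) / len(column) + 1e-6
            stats.append((mean, var))
        model[move] = (len(rows) / len(records), stats)
    return model

def posterior(model, x):
    """P(C | x) = likelihood * prior / normalization factor, for every class C."""
    joint = {}
    for move, (prior, stats) in model.items():
        likelihood = 1.0
        for value, (mean, var) in zip(x, stats):
            likelihood *= gaussian_pdf(value, mean, var)
        joint[move] = prior * likelihood
    evidence = sum(joint.values())   # the normalization factor P(X)
    return {move: p / evidence for move, p in joint.items()}

records = [
    (13, 31, "Flexion"), (14, 30, "Flexion"),     # from the ARFF slide
    (10, 31, "Flexion"), (13, 29, "Flexion"),
    (50, 12, "Extension"), (48, 14, "Extension"),  # hypothetical
    (52, 11, "Extension"), (47, 13, "Extension"),  # hypothetical
]

post = posterior(fit(records), (13, 31))
print(max(post, key=post.get))
```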

  24. Neural Networks [Figure: network diagram showing input nodes, weights, neuron i, and output node(s)]

  25. Decision Tree • Find the attribute that best classifies the training data. • Use this attribute as the root of the decision tree. • Repeat the process for each subtree. [Figure: decision tree with a root split on Triceps (>37, <=37), second-level nodes Triceps and Biceps with the splits <=14, >14 and <=17, >17, and the leaves Flexion, Pronation, Extension, and Supination]

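  26. Decision Tree (cont.) Q: What does it mean to be the “best classifying” attribute? A: Use the attribute with the highest information gain. [Equations: information gain, computed from the entropy of the set and the entropy of each subset weighted by the size of the subset over the size of the set; entropy, computed from the proportion of examples in S belonging to class i]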
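The slide's equations did not survive in the transcript, so the sketch below assumes the standard ID3 definitions: entropy as the negative sum of p_i log2 p_i over classes, and information gain as the entropy of the set minus the size-weighted entropy of the subsets produced by the attribute. The four records are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum(
        (n / total) * math.log2(n / total) for n in Counter(labels).values()
    )

def information_gain(records, attribute, target):
    """Entropy of the whole set minus the weighted entropy of its subsets."""
    labels = [rec[target] for rec in records]
    subsets = {}
    for rec in records:
        subsets.setdefault(rec[attribute], []).append(rec[target])
    remainder = sum(
        len(subset) / len(records) * entropy(subset)
        for subset in subsets.values()
    )
    return entropy(labels) - remainder

# Hypothetical records: "triceps" separates the classes, "biceps" does not.
records = [
    {"triceps": "low", "biceps": "high", "move": "Flexion"},
    {"triceps": "low", "biceps": "low", "move": "Flexion"},
    {"triceps": "high", "biceps": "high", "move": "Extension"},
    {"triceps": "high", "biceps": "low", "move": "Extension"},
]

gain_triceps = information_gain(records, "triceps", "move")
gain_biceps = information_gain(records, "biceps", "move")
print(gain_triceps, gain_biceps)
```

Under these definitions the attribute with the larger gain (here, triceps) would be chosen as the root of the tree.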

  27. K-Nearest Neighbors • KNN is a simple algorithm that stores all available examples and classifies new instances based on a similarity measure. • Euclidean distance as the similarity function: $d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$
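A minimal K-nearest-neighbors sketch with Euclidean distance follows. The Flexion examples are the records shown on the ARFF slide; the Extension examples and the query point are hypothetical, and the majority-vote tie handling is simply whatever `Counter.most_common` returns.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training examples."""
    # math.dist computes the Euclidean distance between two points.
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# (triceps, biceps) -> move
train = [
    ((13, 31), "Flexion"), ((14, 30), "Flexion"),     # from the ARFF slide
    ((10, 31), "Flexion"), ((13, 29), "Flexion"),
    ((50, 12), "Extension"), ((48, 14), "Extension"),  # hypothetical
    ((52, 11), "Extension"),                           # hypothetical
]

prediction = knn_predict(train, (12, 30), k=3)
print(prediction)
```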

  28. 6. Evaluate Models • Simple validation: training set and test set • n-fold cross-validation • Leave-one-out • 10-fold cross-validation
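The splitting step behind n-fold cross-validation can be sketched as an index generator. Every record is used for testing exactly once, and leave-one-out is the special case where the number of folds equals the number of records. The function name `k_fold_indices` is an illustrative choice; the sketch does not shuffle, which a real evaluation would usually do first.

```python
def k_fold_indices(n_records, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_records))
    # Deal the indices round-robin into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

# 10-fold cross-validation over 80 records, the size of the EMG dataset.
splits = list(k_fold_indices(80, 10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))

# Leave-one-out: one fold per record.
loo_splits = list(k_fold_indices(80, 80))
```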

  29. 7. Deploy Model • The neural network model was successfully implemented inside the robotic arm.

  30. Examples of data mining in science & engineering • 1. Data mining in Biomedical Engineering • “Robotic Arm Control Using Data Mining Techniques” • 2. Data mining in Chemical Engineering • “Data Mining for In-line Image Monitoring of Extrusion Processing” K. Torabi, L. D. Ing, S. Sayad, and S. T. Balke

  31. Plastics Extrusion Plastic pellets Plastic Melt

  32. Film Extrusion Defect due to particle contaminant Plastic Extruder Plastic Film

  33. Transition Piece In-Line Monitoring Window Ports

  34. In-Line Monitoring Optical Assembly Light Extruder and Interface Light Source Imaging Computer

  35. Melt Without Contaminant Particles (WO)

  36. Melt With Contaminant Particles (WP)

  37. 1. Define the problem Classify images into those with particles (WP) and those without particles (WO). [Figure: example WO and WP images]

  38. 2. Build a data mining database • 2000 images • 54 input variables, all numeric • One output variable with two possible values (With Particle and Without Particle)

  39. 3. Explore data

  40. 4. Prepare data for modeling • Pre-processed images to remove noise • Dataset 1 with sharp images: 1350 images, including 1257 without particles and 91 with particles • Dataset 2 with sharp and blurry images: 2000 images, including 1909 without particles or with blurry particles and 91 with particles • 54 input variables, all numeric • One output variable with two possible values (WP and WO)

  41. 5. Build a model • Classification: • OneR • Decision Tree • 3-Nearest Neighbors • Naïve Bayesian

  42. 6. Evaluate Models • 10-fold cross-validation • If pixel density Max < 142 then WP

  43. 7. Deploy model • A Visual Basic program will be developed to implement the model.

  44. Overview • Explosion of data • Introduction to data mining • Examples of data mining in science & engineering • Challenges and opportunities

  45. Challenges and Opportunities • Data mining is a ‘top ten’ emerging technology • Faster, more accurate and more scalable techniques • Incremental, on-line and real-time learning algorithms • Parallel and distributed data processing techniques

  46. Data mining is an exciting and challenging field with the ability to solve many complex scientific problems. You can be part of the solution!

  47. How a chemical engineer sees data mining! [Figure: Data goes in, Data Mining turns it into Knowledge]
