
Data Mining Theory and Practice Dr. Azuraliza Abu Bakar


Presentation Transcript


  1. Data Mining Theory and Practice Dr. Azuraliza Abu Bakar http://www.ftsm.ukm.my/jabatan/ts/aab/index.htm

  2. What is Pattern Recognition?
     • Pattern recognition by humans
       - perceptual
       - specialized (decision making)
     • Pattern recognition by computers
       - benefit of automated pattern recognition
       - advantage in complex calculations
     • Pattern recognition from data (data mining)

  3. Pattern Recognition from Data
     • Pattern recognition from data is the process of learning from past data: studying the dependencies in the data and extracting knowledge from it.

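  4. What is Data?

     | #   | Studies  | Education | Works | Income (D) |
     |-----|----------|-----------|-------|------------|
     | 1   | Poor     | SPM       | Poor  | None       |
     | 2   | Poor     | SPM       | Good  | Low        |
     | 3   | Moderate | SPM       | Poor  | Low        |
     | 4   | Moderate | Diploma   | Poor  | Low        |
     | 5   | Poor     | SPM       | Poor  | None       |
     | 6   | Moderate | Diploma   | Poor  | Low        |
     | 7   | Good     | MSC       | Good  | Medium     |
     | …   |          |           |       |            |
     | 99  | Poor     | SPM       | Good  | Low        |
     | 100 | Moderate | Diploma   | Poor  | Low        |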

  5. What is Knowledge?
     studies(Poor) AND work(Poor) => income(None)
     studies(Poor) AND work(Good) => income(Low)
     education(Diploma) => income(Low)
     education(MSc) => income(Medium) OR income(High)
     studies(Mod) => income(Low)
     studies(Good) => income(Medium) OR income(High)
     education(SPM) AND work(Good) => income(Low)
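One way to make the rules above concrete is to store each rule as a set of conditions plus the income values it allows, then check which rules fire for a record. The sketch below is only an illustration: the names `RULES`, `matching_rules`, and `predicted_incomes` are hypothetical, "Mod" is written out as "Moderate", and intersecting the outcomes of all fired rules is an assumed way of combining them, not something the slides specify.

```python
# Each rule: (conditions, allowed income values), taken from the slide.
# "Mod" is spelled "Moderate" and "MSC" is normalized to "MSc" (assumptions).
RULES = [
    ({"studies": "Poor", "work": "Poor"}, {"None"}),
    ({"studies": "Poor", "work": "Good"}, {"Low"}),
    ({"education": "Diploma"}, {"Low"}),
    ({"education": "MSc"}, {"Medium", "High"}),
    ({"studies": "Moderate"}, {"Low"}),
    ({"studies": "Good"}, {"Medium", "High"}),
    ({"education": "SPM", "work": "Good"}, {"Low"}),
]


def matching_rules(record, rules=RULES):
    """Return the rules whose conditions all hold for the record."""
    return [
        (conditions, incomes)
        for conditions, incomes in rules
        if all(record.get(attr) == value for attr, value in conditions.items())
    ]


def predicted_incomes(record, rules=RULES):
    """Intersect the outcomes of all fired rules; None if no rule fires.

    Intersection is an assumed combination strategy for this sketch.
    """
    fired = matching_rules(record, rules)
    if not fired:
        return None
    result = set(fired[0][1])
    for _, incomes in fired[1:]:
        result &= incomes
    return result


# Records 1 and 7 from the data table ("None" is an income label, a string).
record_1 = {"studies": "Poor", "education": "SPM", "work": "Poor"}
record_7 = {"studies": "Good", "education": "MSc", "work": "Good"}

print(predicted_incomes(record_1))
print(predicted_incomes(record_7))
```

Record 1 fires only the first rule, while record 7 fires both the education(MSc) and studies(Good) rules, which agree on Medium or High.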

  6. What is Data Mining?
     • Extraction of knowledge from data
     • Exploration and analysis of large quantities of data to discover meaningful patterns
     • Discovering knowledge

  7. How does data mining look into data?
     (Figure labels: Data, Data, Data)

  8. Data Mining: Motivation
     • Huge amounts of data
     • An important need to turn data into useful information
     • The fast-growing amount of data, collected and stored in large and numerous databases, exceeds the human ability to comprehend it without powerful tools

  9. Questions
     • What goods should be promoted to this customer?
     • What is the probability that a certain customer will respond to a planned promotion?
     • Can one predict the most profitable securities to buy or sell during the next trading session?
     • Will this customer default on a loan or pay it back on schedule?
     • What medical diagnosis should be assigned to this patient?
     • What kinds of cars should be sold this year?

  10. Data Mining is simply...
      • finding relationships
      • making predictions

  11. Data Mining: One Step of KDD
      (Figure labels: KDD, Data mining, Task, Techniques)

  12. Data Mining as a Step of KDD
      Databases, Flat files → Cleaning and Integration → Data Warehouse → Selection and Transformation → Data Mining → Patterns → Evaluation & Presentation → Knowledge

  13. Early Steps of Data Mining
      • Data preprocessing
        - handling incomplete data, noisy data, uncertain data
      • Data discretization/representation
        - transforms data into suitable values for the mining algorithm to find patterns
      • Data selection
        - selects the suitable data for mining purposes
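The three early steps can be sketched on a tiny example. Everything here is hypothetical: the `RAW` records, the numeric `score` attribute, and the cut points 50 and 75 are made up for illustration, and dropping incomplete records is only one of several ways to handle missing values.

```python
# Hypothetical raw records; Python None marks a missing value.
# Note that the income label "None" is an ordinary string, not a missing value.
RAW = [
    {"score": 35, "education": "SPM", "income": "None"},
    {"score": None, "education": "SPM", "income": "Low"},
    {"score": 62, "education": "Diploma", "income": "Low"},
    {"score": 88, "education": "MSc", "income": "Medium"},
]


def drop_incomplete(records):
    """Preprocessing: discard records that have any missing value."""
    return [r for r in records if all(v is not None for v in r.values())]


def discretize_score(score, cut_points=(50, 75)):
    """Discretization: map a numeric score to a symbolic level.

    The cut points are assumed values chosen for this sketch.
    """
    low, high = cut_points
    if score < low:
        return "Poor"
    if score < high:
        return "Moderate"
    return "Good"


def prepare(records, attributes=("studies", "education", "income")):
    """Run preprocessing, discretization, then attribute selection."""
    prepared = []
    for record in drop_incomplete(records):
        row = dict(record, studies=discretize_score(record["score"]))
        # Selection: keep only the attributes wanted for mining.
        prepared.append({name: row[name] for name in attributes})
    return prepared


PREPARED = prepare(RAW)
for row in PREPARED:
    print(row)
```

The output rows have the same symbolic form as the Studies/Education/Income columns used elsewhere in the slides, which is the shape a rule-based mining algorithm expects.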

  14. Data Mining Techniques
      • Decision Trees
      • Neural Networks
      • Genetic Algorithms
      • Fuzzy Set Theory
      • Rough Set Theory
      • Statistical Methods (Regression Analysis)

  15. Classification of Data Mining Systems
      • Kinds of DB: Relational, Data warehouse, Transactional DB, Advanced DB system, Flat files, WWW
      • Kinds of Knowledge: Classification, Association, Clustering, Prediction, …

  16. Classification of Data Mining Systems
      • Techniques used: DB-oriented techniques, Statistics, Machine learning, Pattern recognition, Neural Network, Rough Set, etc.
      • Applications adapted: Finance, Marketing, Medical, Stock, Telecommunication, etc.

  17. Data Mining: Confluence of Multiple Disciplines
      • Database technology
      • Statistics
      • High-performance computing
      • Machine learning
      • Visualization
      • Information retrieval
      • Pattern recognition
      • Information science
      • Spatial data analysis
      • Neural networks

  18. Data Mining
      • What are we looking at?
      • What are we looking for?

  19. Data Mining Tasks
      • Prediction
      • Classification
      • Clustering
      • Association Rules
      • Sequential Analysis
      • Deviation Analysis
      • Similarity Analysis
      • Trend Analysis

  20. Classification
      Training data → Classification algorithm → Classification rules
      Example rule: If studies=“poor” and work=“poor” then Income=“None”

      | #   | Studies  | Education | Works | Income (D) |
      |-----|----------|-----------|-------|------------|
      | 1   | Poor     | SPM       | Poor  | None       |
      | 2   | Poor     | SPM       | Good  | Low        |
      | 3   | Moderate | SPM       | Poor  | Low        |
      | 4   | Moderate | Diploma   | Poor  | Low        |
      | 5   | Poor     | SPM       | Poor  | None       |
      | 6   | Moderate | Diploma   | Poor  | Low        |
      | 7   | Good     | MSC       | Good  | Medium     |
      | …   |          |           |       |            |
      | 99  | Poor     | SPM       | Good  | Low        |
      | 100 | Moderate | Diploma   | Poor  | Low        |

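  21. Classification
      New data → Classification rules → classify
      Example: studies=“poor” and work=“poor” is classified as Income=“None”
      Test data:

      | Studies  | Education | Works | Income (D) |
      |----------|-----------|-------|------------|
      | Moderate | Diploma   | Poor  | ?          |
      | Poor     | SPM       | Poor  | ?          |
      | Moderate | Diploma   | Poor  | ?          |
      | Good     | MSC       | Good  | ?          |
      | …        |           |       |            |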
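The flow from training data to rules to classified test data can be sketched with a very simple learner. This is not one of the algorithms named in the slides: it is an assumed frequency-based approach that maps each (studies, works) pair to its most common income in the visible training rows. The names `TRAINING`, `learn_rules`, and `classify` are hypothetical.

```python
from collections import Counter, defaultdict

# The nine training rows visible on the slide: studies, education, works, income.
TRAINING = [
    ("Poor", "SPM", "Poor", "None"),
    ("Poor", "SPM", "Good", "Low"),
    ("Moderate", "SPM", "Poor", "Low"),
    ("Moderate", "Diploma", "Poor", "Low"),
    ("Poor", "SPM", "Poor", "None"),
    ("Moderate", "Diploma", "Poor", "Low"),
    ("Good", "MSC", "Good", "Medium"),
    ("Poor", "SPM", "Good", "Low"),
    ("Moderate", "Diploma", "Poor", "Low"),
]


def learn_rules(rows):
    """Map each (studies, works) pair to its most frequent income."""
    outcomes = defaultdict(Counter)
    for studies, _education, works, income in rows:
        outcomes[(studies, works)][income] += 1
    return {key: counts.most_common(1)[0][0] for key, counts in outcomes.items()}


def classify(rules, studies, works, default="unknown"):
    """Look up the learned rule; fall back to a default for unseen pairs."""
    return rules.get((studies, works), default)


rules = learn_rules(TRAINING)

# Test rows from the slide; their income is the "?" to be filled in.
TEST = [
    ("Moderate", "Diploma", "Poor"),
    ("Poor", "SPM", "Poor"),
    ("Moderate", "Diploma", "Poor"),
    ("Good", "MSC", "Good"),
]
for studies, education, works in TEST:
    print(studies, education, works, "->", classify(rules, studies, works))
```

Using only two attributes keeps the sketch short; a real classifier would also decide which attributes are worth using.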

  22. Types of Classifiers
      • Neural Classifier
        - Hopfield Network
        - Multilayer Perceptron
        - Radial Basis Function
        - Kohonen Networks
      • Statistical Classifier
        - Bayesian approach
        - Multiple Regression
        - K-nearest neighbour
        - Naïve Bayes
        - Causal Network
        - Discriminant Analysis
      • Rough Classifier

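  23. DATASET

      | #   | Studies  | Education | Works | Income (D) |
      |-----|----------|-----------|-------|------------|
      | 1   | Poor     | SPM       | Poor  | None       |
      | 2   | Poor     | SPM       | Good  | Low        |
      | 3   | Moderate | SPM       | Poor  | Low        |
      | 4   | Moderate | Diploma   | Poor  | Low        |
      | 5   | Poor     | SPM       | Poor  | None       |
      | 6   | Moderate | Diploma   | Poor  | Low        |
      | 7   | Good     | MSC       | Good  | Medium     |
      | …   |          |           |       |            |
      | 99  | Poor     | SPM       | Good  | Low        |
      | 100 | Moderate | Diploma   | Poor  | Low        |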

  24. RULES
      studies(Poor) AND work(Poor) => income(None)
      studies(Poor) AND work(Good) => income(Low)
      education(Diploma) => income(Low)
      education(MSc) => income(Medium) OR income(High)
      studies(Mod) => income(Low)
      studies(Good) => income(Medium) OR income(High)
      education(SPM) AND work(Good) => income(Low)

  25. Comparing Classifiers
      • Predictive Accuracy
      • Speed
      • Robustness
      • Scalability
      • Interpretability

  26. Data Mining: Problems and Challenges
      • Noisy Data
      • Large Databases
      • Dynamic Databases
      • Difficult Training Set
      • Incomplete Data

  27. Performance Issues
      • Cost of the Learning Set
      • Time and Memory Constraint
      • Predictive Ability

  28. Performance Issues
      • Cost of the Learning Set
        - number of examples necessary for training
        - cost of ensuring good accuracy

  29. Performance Issues
      • Time and Memory Constraint
        - time complexity of the learning phase
        - time taken for evaluation
        - time it takes to reach a certain level of accuracy

  30. Performance Issues
      • Predictive Ability
        - the ability to predict the correct decision for test (unseen) data
        - involves the generation of rules
        - involves measuring the quality or accuracy of the rules
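Measuring the quality of a rule can be illustrated with two common figures: coverage (the share of records the rule applies to) and accuracy (the share of covered records whose decision matches the rule). The slides do not name specific measures, so these two are an assumption, as are the names `ROWS` and `rule_quality`. The rule education(SPM) => income(Low) in the second call is hypothetical, added only to show a rule that is not always right.

```python
COLUMNS = ("studies", "education", "works", "income")

# The nine rows visible in the Studies/Education/Works/Income table.
ROWS = [dict(zip(COLUMNS, values)) for values in [
    ("Poor", "SPM", "Poor", "None"),
    ("Poor", "SPM", "Good", "Low"),
    ("Moderate", "SPM", "Poor", "Low"),
    ("Moderate", "Diploma", "Poor", "Low"),
    ("Poor", "SPM", "Poor", "None"),
    ("Moderate", "Diploma", "Poor", "Low"),
    ("Good", "MSC", "Good", "Medium"),
    ("Poor", "SPM", "Good", "Low"),
    ("Moderate", "Diploma", "Poor", "Low"),
]]


def rule_quality(rows, conditions, decision):
    """Return (coverage, accuracy) of 'conditions => income(decision)'."""
    covered = [
        row for row in rows
        if all(row[attr] == value for attr, value in conditions.items())
    ]
    correct = [row for row in covered if row["income"] == decision]
    coverage = len(covered) / len(rows)
    accuracy = len(correct) / len(covered) if covered else 0.0
    return coverage, accuracy


# A rule from the slides: education(SPM) AND work(Good) => income(Low)
print(rule_quality(ROWS, {"education": "SPM", "works": "Good"}, "Low"))

# A hypothetical, weaker rule: education(SPM) => income(Low)
print(rule_quality(ROWS, {"education": "SPM"}, "Low"))
```

On these nine rows the first rule covers two records and is right on both, while the hypothetical rule covers five records and is right on three of them. In practice the figures would be computed on test data the rules were not learned from.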

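  31. DATA
      Samples of the CLEV dataset (before scaling)

      | #  | AGE | SEX    | CP             | TRESTBPS | CHOL | FBS | RESTECG  | THALACH | EXANG | OLDPEAK | SLOPE     | CA | THAL         | DISEASE |
      |----|-----|--------|----------------|----------|------|-----|----------|---------|-------|---------|-----------|----|--------------|---------|
      | 1  | 63  | Male   | Typical angina | 145      | 233  | T   | LV hyper | 150     | No    | 2.3     | Downslope | 0  | Fixed        | No      |
      | 2  | 67  | Male   | Asymp          | 160      | 286  | F   | LV hyper | 108     | Yes   | 1.5     | Flat      | 3  | Normal       | Yes     |
      | 3  | 67  | Male   | Asymp          | 120      | 229  | F   | LV hyper | 129     | Yes   | 2.6     | Flat      | 2  | Reversable   | Yes     |
      | 4  | 37  | Male   | Non-anginal    | 130      | 250  | F   | Normal   | 187     | No    | 3.5     | Downslope | 0  | Normal       | No      |
      | 5  | 41  | Female | Atypical       | 130      | 204  | F   | LV hyper | 172     | No    | 1.4     | Upsloping | 0  | Normal       | No      |
      | 6  | 56  | Male   | Atypical       | 120      | 236  | F   | Normal   | 178     | No    | 0.8     | Upsloping | 0  | Normal       | No      |
      | 7  | 62  | Female | Asymp          | 140      | 268  | F   | LV hyper | 160     | No    | 3.6     | Downslope | 2  | Normal       | Yes     |
      | 8  | 57  | Female | Asymp          | 120      | 354  | F   | Normal   | 163     | Yes   | 0.6     | Upsloping | 0  | Normal       | No      |
      | 9  | 63  | Male   | Asymp          | 130      | 254  | F   | LV hyper | 147     | No    | 1.4     | Flat      | 1  | Reversable   | Yes     |
      | 10 | 53  | Male   | Asymp          | 140      | 203  | T   | LV hyper | 155     | Yes   | 3.1     | Downslope | 0  | Reversable   | Yes     |
      | 11 | 57  | Male   | Asymp          | 140      | 192  | F   | Normal   | 148     | No    | 0.4     | Flat      | 0  | Fixed defect | No      |
      | 12 | 56  | Female | Atypical       | 140      | 294  | F   | LV hyper | 153     | No    | 1.3     | Flat      | 0  | Normal       | No      |
      | 13 | 56  | Male   | Non-anginal    | 130      | 256  | T   | LV hyper | 142     | Yes   | 0.6     | Flat      | 1  | Fixed defect | Yes     |
      | 14 | 44  | Male   | Atypical       | 120      | 263  | F   | Normal   | 173     | No    | 0       | Upsloping | 0  | Reversable   | No      |
      | 15 | 52  | Male   | Non-anginal    | 172      | 199  | T   | Normal   | 162     | No    | 0.5     | Upsloping | 0  | Reversable   | No      |
      | 16 | 57  | Male   | Non-anginal    | 150      | 168  | F   | Normal   | 174     | No    | 1.6     | Upsloping | 0  | Normal       | No      |
      | 17 | 48  | Male   | Atypical       | 110      | 229  | F   | Normal   | 168     | No    | 1       | Downslope | 0  | Reversable   | Yes     |
      | 18 | 54  | Male   | Asymp          | 140      | 239  | F   | Normal   | 160     | No    | 1.2     | Upsloping | 0  | Normal       | No      |
      | 19 | 48  | Female | Non-anginal    | 130      | 275  | F   | Normal   | 139     | No    | 0.2     | Upsloping | 0  | Normal       | No      |
      | 20 | 49  | Male   | Atypical       | 130      | 266  | F   | Normal   | 171     | No    | 0.6     | Upsloping | 0  | Normal       | No      |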

  32. Rules generated from data mining process
      oldpeak(0.7) => disease(No)
      oldpeak(4.4) => disease(Yes)
      chol(233) AND restecg(LV hypertrophy) => disease(No)
      chol(204) AND restecg(LV hypertrophy) => disease(No)
      chol(236) AND restecg(Normal) => disease(No)
      chol(203) AND restecg(LV hypertrophy) => disease(Yes)
      chol(294) AND restecg(LV hypertrophy) => disease(No)
      chol(275) AND restecg(Normal) => disease(No)
      chol(266) AND restecg(Normal) => disease(No)
      chol(247) AND restecg(Normal) => disease(No)
      chol(219) AND restecg(LV hypertrophy) => disease(No)
      chol(266) AND restecg(LV hypertrophy) => disease(Yes)
      chol(304) AND restecg(Normal) => disease(No)
      chol(254) AND restecg(Normal) => disease(Yes)
      chol(267) AND restecg(Normal) => disease(Yes)
      chol(264) AND restecg(LV hypertrophy) => disease(No)
      chol(234) AND restecg(LV hypertrophy) => disease(No)
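Rules written in this `attribute(value) AND ... => decision(value)` notation are easy to read by machine as well. The sketch below parses a rule string with a regular expression and checks it against a few of the sample rows. The names `parse_rule`, `check`, and `SAMPLES` are hypothetical, values are kept as strings, and the table's "LV hyper" is assumed to be shorthand for the rules' "LV hypertrophy".

```python
import re

# One term looks like name(value), e.g. chol(233) or restecg(LV hypertrophy).
TERM = re.compile(r"(\w+)\(([^)]*)\)")


def parse_rule(text):
    """Split 'a(x) AND b(y) => d(z)' into condition and decision terms."""
    left, _, right = text.partition("=>")
    return TERM.findall(left), TERM.findall(right)


# Sample rows 1, 10, 20 and 9, reduced to the attributes the rules use.
# "LV hyper" is expanded to "LV hypertrophy" (assumed to be the same value).
SAMPLES = [
    {"chol": "233", "restecg": "LV hypertrophy", "disease": "No"},
    {"chol": "203", "restecg": "LV hypertrophy", "disease": "Yes"},
    {"chol": "266", "restecg": "Normal", "disease": "No"},
    {"chol": "254", "restecg": "LV hypertrophy", "disease": "Yes"},
]


def check(text, rows):
    """Return (rows covered by the rule, covered rows it decides correctly)."""
    conditions, decisions = parse_rule(text)
    decision_attr = decisions[0][0]
    allowed = {value for _, value in decisions}
    covered = [
        row for row in rows
        if all(row.get(attr) == value for attr, value in conditions)
    ]
    correct = [row for row in covered if row[decision_attr] in allowed]
    return len(covered), len(correct)


for rule in [
    "chol(233) AND restecg(LV hypertrophy) => disease(No)",
    "chol(203) AND restecg(LV hypertrophy) => disease(Yes)",
    "chol(254) AND restecg(Normal) => disease(Yes)",
]:
    print(rule, "->", check(rule, SAMPLES))
```

The first two rules each cover exactly one of these sample rows. The third covers none of them, since sample row 9 has chol 254 but a different restecg value.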
