
Fuzzy Machine Learning Methods for Biomedical Data Analysis



Presentation Transcript


  1. Fuzzy Machine Learning Methods for Biomedical Data Analysis • Yanqing Zhang, Department of Computer Science, Georgia State University, Atlanta, GA 30302-5060, yzhang@gsu.edu • Yan-Qing Zhang, Georgia State University

  2. Outline • Background • Fuzzy Association Rule Mining for Decision Support (FARM-DS) • FARM-DS on Medical Data • FARM-DS on Microarray Expression Data • Fuzzy-Granular Gene Selection on Microarray Expression Data • Conclusion and Future Work

  3. Background • Theory: computational intelligence, granular computing, fuzzy sets; knowledge discovery and data mining (KDD); decision support (DS) systems; rule-based reasoning (RBR), association rule mining • Application: bioinformatics, medical informatics, etc. • Concerns: accuracy, interpretability

  4. Outline • Background • Fuzzy Association Rule Mining for Decision Support (FARM-DS) • FARM-DS on Medical Data • FARM-DS on Microarray Expression Data • Fuzzy-Granular Gene Selection on Microarray Expression Data • Conclusion and Future Work

  5. Motivation: dealing with numeric data • Traditional association rule mining algorithms • Rule form: if X, then Y • Conf = Pr(Y|X), Supp = Pr(X and Y) • Do not work on numeric data • Fuzzy logic • Feature transformation • Fuzzy AR mining (Zadeh, 1965)
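
The two rule measures on this slide, Conf = Pr(Y|X) and Supp = Pr(X and Y), can be illustrated with a minimal Python sketch. The transactions and item names below are hypothetical toy data, not taken from the presentation.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(transactions, x, y):
    """Conf(X -> Y) = Pr(Y | X) = Supp(X and Y) / Supp(X)."""
    supp_x = support(transactions, x)
    if supp_x == 0:
        return 0.0  # rule never fires; report zero rather than divide by zero
    return support(transactions, frozenset(x) | frozenset(y)) / supp_x


# Hypothetical toy transactions (assumption, for illustration only).
transactions = [
    frozenset({"fever", "cough", "flu"}),
    frozenset({"fever", "flu"}),
    frozenset({"cough"}),
    frozenset({"fever", "cough"}),
]

supp = support(transactions, {"fever", "flu"})
conf = confidence(transactions, {"fever"}, {"flu"})
print(f"supp={supp:.2f} conf={conf:.2f}")
```

Both measures count exact item matches, which is why crisp association rule mining needs numeric features to be converted into discrete items first.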

  6. Motivation: decision support • Fuzzy association rules (FARs) for classification • Accuracy vs. interpretability • Very few prior works • Hu et al. 2002: combinatorial rule explosion • Chatterjee et al. 2004: requires human intervention

  7. FARM-DS • Target: numeric data, binary classification • Effectiveness: accuracy, interpretability • Modeling process: training, testing

  8. Step 1: Fuzzy Interval Partition • 1-in-1-out 0-order TSK model • ANFIS for model optimization and parameter selection (Jang, 1993)
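
As a rough illustration of a one-input, one-output zero-order TSK model, the sketch below evaluates a weighted average of constant rule outputs and maps a value to its strongest fuzzy interval. Everything here is an assumption made for illustration: the Gaussian membership shape, the parameter values, and the names `tsk0_predict` and `fuzzy_label` are hypothetical, and the parameters are fixed by hand rather than tuned by ANFIS as the slide describes.

```python
import math


def gaussian(x, center, sigma):
    """Gaussian membership degree of x in a fuzzy set (assumed shape)."""
    return math.exp(-0.5 * ((x - center) / sigma) ** 2)


def tsk0_predict(x, rules):
    """Zero-order TSK output: firing-strength-weighted mean of constants.

    rules: list of (center, sigma, constant) triples, one per fuzzy interval.
    """
    weights = [gaussian(x, c, s) for c, s, _ in rules]
    return sum(w * k for w, (_, _, k) in zip(weights, rules)) / sum(weights)


def fuzzy_label(x, rules, names):
    """Name of the fuzzy interval in which x has the highest membership."""
    weights = [gaussian(x, c, s) for c, s, _ in rules]
    return names[weights.index(max(weights))]


# Hand-picked parameters (assumption); ANFIS would learn these from data.
rules = [(0.0, 1.0, -1.0), (5.0, 1.0, 1.0)]
names = ["low", "high"]

print(fuzzy_label(0.5, rules, names), fuzzy_label(4.0, rules, names))
print(tsk0_predict(2.5, rules))
```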

  9. Step 2: Data Abstraction • Clustering: K-means, fuzzy C-means • Validation: number of clusters, optimal clustering, silhouette value • [Figure: positive cluster and negative cluster]
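
The silhouette value used for cluster validation can be computed directly from pairwise distances. The following is a minimal pure-Python sketch using Euclidean distance; the points and label assignments are hypothetical, and the slides do not say how the authors computed the value.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette value over all points, using Euclidean distance."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    def mean_dist(i, members):
        others = [j for j in members if j != i]
        if not others:
            return 0.0
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    scores = []
    for i, lab in enumerate(labels):
        if len(clusters[lab]) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        a = mean_dist(i, clusters[lab])  # cohesion: own cluster
        b = min(mean_dist(i, m) for other, m in clusters.items() if other != lab)
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return sum(scores) / len(scores)


# Hypothetical 2-D points with one sensible and one poor labeling.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good = silhouette_score(points, [0, 0, 1, 1])
bad = silhouette_score(points, [0, 1, 0, 1])
print(good > bad)
```

One way to choose the number of clusters is to run the clustering for several candidate values and keep the one with the highest mean silhouette value.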

  10. Step 3: Generating Fuzzy Discrete Transactions • Project the center of each cluster onto each feature • Create transactions • For a positive cluster, +1 is inserted • For a negative cluster, -1 is inserted

  11. Step 3: example • [Figure: clusters plotted on axes f1 and f2] • 5-2 = 3 transactions • 1 f1_1 • 1 f1_1 • 1 f1_1 • Avoids combinatorial rule explosion • The number of different transactions is determined by the number of clusters
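
Step 3 can be sketched as follows: each cluster center is projected onto every feature, each projected value is replaced by the name of its strongest fuzzy interval, and the cluster's class label is appended. The membership functions, the item naming scheme, and the helper names are assumptions for illustration; the slides do not specify them.

```python
def low(v):
    """Hypothetical 'low' membership on a feature scaled to [0, 1]."""
    return max(0.0, 1.0 - v)


def high(v):
    """Hypothetical 'high' membership on a feature scaled to [0, 1]."""
    return max(0.0, min(1.0, v))


def label_for(value, intervals):
    """Name of the fuzzy interval with the highest membership for value."""
    return max(intervals, key=lambda name: intervals[name](value))


def make_transactions(centers, classes, partitions):
    """Build one fuzzy discrete transaction per cluster center.

    centers:    one tuple of feature values per cluster
    classes:    +1 or -1 per cluster
    partitions: one dict per feature, mapping interval name -> membership fn
    """
    transactions = []
    for center, cls in zip(centers, classes):
        items = {f"f{j + 1}_{label_for(v, partitions[j])}"
                 for j, v in enumerate(center)}
        items.add(f"y={cls:+d}")  # class label of the cluster
        transactions.append(frozenset(items))
    return transactions


partitions = [{"low": low, "high": high}, {"low": low, "high": high}]
centers = [(0.2, 0.9), (0.8, 0.1)]  # hypothetical cluster centers
classes = [+1, -1]
transactions = make_transactions(centers, classes, partitions)
print(transactions)
```

Because one transaction is produced per cluster, the number of distinct transactions is bounded by the number of clusters, which is how the combinatorial explosion is avoided.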

  12. Step 4: Association Rule Mining • Association rule mining on fuzzy discrete transactions • Traditional Apriori algorithm (Agrawal and Srikant 1994) • Rule form: if f1 is low, f2 is high, …, fh is low, then y = 1/-1 • Rule pruning: for a pair of rules A and B, if B is more specific than A (that is, the conditions of A are a subset of the conditions of B) and B has the same support as A, then A is eliminated • Example: A: if f1 is low, then y = 1, sup = 50%; B: if f1 is low and f2 is high, then y = 1, sup = 50%
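
The pruning rule on this slide can be sketched in a few lines of Python, using the slide's own rules A and B as input. The rule representation (a tuple of antecedent, consequent, support) is hypothetical, and requiring the two rules to share the same consequent is an assumption based on the slide's example.

```python
import math


def prune_rules(rules):
    """Drop rule A when a more specific rule B has the same support.

    rules: list of (antecedent frozenset, consequent, support) tuples.
    """
    kept = []
    for a_items, a_cons, a_sup in rules:
        dominated = any(
            a_items < b_items              # A's conditions are a proper subset of B's
            and a_cons == b_cons           # assumption: same class label
            and math.isclose(a_sup, b_sup)
            for b_items, b_cons, b_sup in rules
        )
        if not dominated:
            kept.append((a_items, a_cons, a_sup))
    return kept


# Rules A and B from the slide.
rule_a = (frozenset({"f1=low"}), 1, 0.5)
rule_b = (frozenset({"f1=low", "f2=high"}), 1, 0.5)
pruned = prune_rules([rule_a, rule_b])
print(pruned)
```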

  13. Testing Phase

  14. Adaptive FARM-DS (He, et al. 2006a, IJDMB) • Train: fuzzy interval partition, data abstraction, generation of fuzzy discrete transactions, AR mining • Test

  15. Outline • Background • Fuzzy Association Rule Mining for Decision Support (FARM-DS) • FARM-DS on Medical Data • FARM-DS on Microarray Expression Data • Fuzzy-Granular Gene Selection on Microarray Expression Data • Conclusion and Future Work

  16. Empirical Studies • Classification algorithms • C4.5 decision trees (Quinlan, 1993) • Support vector machines (Vapnik, 1995) • FARM-DS (He, et al. 2006a, IJDMB) • Accuracy estimation • 5-fold cross-validation • Interpretability

  17. Evaluation metrics • Accuracy: classification error; area under ROC curve (Bradley, 1997; future work) • Interpretability: number of rules, average rule length

  18. Datasets (Merz, et al., UCI repository of machine learning databases, 1998)

  19. Result analysis on Accuracy • FARM-DS ≈ SVM > C4.5 • SVM2 and C4.5 results are from Bennett et al. 1997

  20. Result analysis on Interpretability • SVM: high accuracy, hard to interpret • C4.5: low accuracy, easy to interpret • FARM-DS: high accuracy, easy to interpret

  21. Interpretability (1) • FARs extracted by FARM-DS are short and compact, and hence easy to understand. • 22 positive rules and 8 negative rules are extracted. • On average, a positive rule has length 2.6 and a negative rule has length 4.3. • On average, every sample activates 3.3 positive rules and 5.6 negative rules.

  22. Interpretability (2) • FARs may help human experts correct wrongly classified samples.

  23. Interpretability (3) • The larger support of the negative rules may help human experts make correct final decisions and find the underlying mechanisms of disease.

  24. Interpretability (4) • FARs help select important features. • A higher activation frequency indicates a more important feature.

  25. Outline • Background • Fuzzy Association Rule Mining for Decision Support (FARM-DS) • FARM-DS on Medical Data • FARM-DS on Microarray Expression Data • Fuzzy-Granular Gene Selection on Microarray Expression Data • Conclusion and Future Work

  26. Microarray Expression Data • Extremely high dimensionality • Gene selection • Cancer classification • Rule-based reasoning

  27. Empirical Studies • Rule-based reasoning/classification • CART for decision tree modeling (Breiman, et al. 1984) • ANFIS for fuzzy neural network modeling (Jang, 1993) • FARM-DS (He, et al. 2006a, IJDMB)

  28. Evaluation metrics • Accuracy: classification error; area under ROC curve (Bradley, 1997) • Accuracy estimation: leave-one-out cross-validation • Interpretability: number of rules, average rule length

  29. AML/ALL leukemia dataset (Tang, et al. 2006)

  30. Result analysis: AML/ALL leukemia dataset • Higher accuracy than CART • Easier to interpret than ANFIS

  31. Rules extracted by FARM-DS: AML/ALL leukemia dataset • IF gene2 (Y12670), gene3 (D14659), and gene5 (M80254) are down-regulated, THEN the tissue is ALL (-1)

  32. Prostate cancer dataset (Tang, et al. 2006)

  33. Result analysis: prostate cancer dataset • Higher accuracy than CART • Easier to interpret than ANFIS

  34. Rules extracted by FARM-DS: prostate cancer dataset

  35. Outline • Background • Fuzzy Association Rule Mining for Decision Support (FARM-DS) • FARM-DS on Medical Data • FARM-DS on Microarray Expression Data • Fuzzy-Granular Gene Selection on Microarray Expression Data • Conclusion and Future Work

  36. Gene Selection and Cancer Classification on Microarray Expression Data • Extremely high dimensionality • AML/ALL leukemia dataset: 72 samples × 7129 genes • No more than 10% relevant genes (Golub, et al. 1999) • Gene selection • Accurate classification • Helpful for cancer study

  37. Gene Categorization and Gene Ranking • Informative genes • Redundant genes • Irrelevant genes • Noisy genes

  38. Information Loss • Noise • Overfitting themselves • Complementary to redundant/irrelevant genes • Conflict with informative genes • Imbalanced gene selection • Inflexibility • How to decrease information loss? Granulation!

  39. Coarse Granulation with Relevance Indexes • Target: remove irrelevant genes • [Figure: panels labeled imbalance, imbalance, balance] • Target: tune thresholds to select genes in balance

  40. Fine Granulation with Fuzzy C-Means Clustering • Clustering in the training sample space • Genes with similar expression patterns have similar functions • A gene may have multiple functions (fuzzy clustering works here!)
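
The reason fuzzy clustering suits genes with multiple functions is that fuzzy C-means gives each point a graded membership in every cluster. A minimal sketch of the standard membership formula, u_i = 1 / Σ_j (d_i / d_j)^(2/(m-1)), is shown below. The cluster centers and the point are hypothetical, and only the membership step is shown, not the full iterative algorithm.

```python
import math


def fcm_memberships(point, centers, m=2.0):
    """Fuzzy C-means membership of one point in each cluster.

    m is the fuzzifier (m > 1); m = 2 is a common default.
    """
    dists = [math.dist(point, c) for c in centers]
    if any(d == 0 for d in dists):
        # The point coincides with a center: give it full membership there.
        hits = [1.0 if d == 0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_j) ** power for d_j in dists) for d_i in dists]


# Hypothetical centers and a point lying between them.
centers = [(0.0, 0.0), (4.0, 0.0)]
u = fcm_memberships((1.0, 0.0), centers)
print(u)
```

A point close to one center still keeps a nonzero membership in the other cluster, so a gene can be considered in more than one cluster.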

  41. Conquer with correlation-based Ranking • Lower-ranked genes are removed as redundant genes

  42. Aggregation with Data Fusion • Pick genes from different clusters in balance • An informative gene is more likely to survive (due to fuzzy clustering)

  43. [Flowchart] Original Gene Set → Relevance-index-based pre-filtering → Relevant Gene Set → Fuzzy C-Means Clustering → Gene Cluster 1, Gene Cluster 2, …, Gene Cluster K → Correlation-based Gene Ranking 1, 2, …, K (one per cluster) → Final Gene Set
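
The final two stages of this flow (rank within each cluster, then aggregate across clusters) might look like the sketch below. It is only an illustration under stated assumptions: the membership threshold that decides which cluster a gene belongs to, the per-cluster quota, the function name `select_genes`, and all scores and memberships are hypothetical and are not given in the slides.

```python
def select_genes(scores, memberships, per_cluster, threshold=0.3):
    """Pick the top-ranked genes from each fuzzy cluster and merge them.

    scores:      gene -> ranking score (higher is better)
    memberships: gene -> list of fuzzy memberships, one per cluster
    per_cluster: how many genes to keep from each cluster
    threshold:   minimum membership for a gene to count as a cluster member
                 (assumption; the slides do not state the assignment rule)
    """
    n_clusters = len(next(iter(memberships.values())))
    selected = set()
    for k in range(n_clusters):
        members = [g for g, u in memberships.items() if u[k] >= threshold]
        members.sort(key=lambda g: scores[g], reverse=True)
        selected.update(members[:per_cluster])  # lower-ranked genes are dropped
    return selected


# Hypothetical scores and memberships for four genes and two clusters.
scores = {"g1": 0.9, "g2": 0.8, "g3": 0.2, "g4": 0.7}
memberships = {
    "g1": [0.9, 0.1],
    "g2": [0.8, 0.2],
    "g3": [0.1, 0.9],
    "g4": [0.5, 0.5],
}
final_genes = select_genes(scores, memberships, per_cluster=1)
print(sorted(final_genes))
```

In this toy run, g4 loses to g1 in the first cluster but is still selected through the second one, which illustrates why a gene with several memberships has more chances to survive.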

  44. Empirical Study • Comparison • Signal to Noise (S2N) (Furey, et al. 2000) • Fuzzy-Granular + S2N • Fisher Criterion (FC) (Pavlidis, et al. 2001) • Fuzzy-Granular + FC • T-Statistics (TS) (Duan, et al. 2004) • Fuzzy-Granular + TS
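
The slides name the three ranking criteria without giving formulas. The sketch below uses commonly cited definitions for a single gene measured in a positive and a negative class; these definitions, the use of the sample standard deviation, and the expression values are assumptions, so the exact variants in the cited papers may differ.

```python
import math
from statistics import mean, stdev


def s2n(pos, neg):
    """Signal to noise: (mu+ - mu-) / (sd+ + sd-)."""
    return (mean(pos) - mean(neg)) / (stdev(pos) + stdev(neg))


def fisher(pos, neg):
    """Fisher criterion: (mu+ - mu-)^2 / (sd+^2 + sd-^2)."""
    return (mean(pos) - mean(neg)) ** 2 / (stdev(pos) ** 2 + stdev(neg) ** 2)


def t_stat(pos, neg):
    """Unequal-variance t-statistic for the difference in class means."""
    return (mean(pos) - mean(neg)) / math.sqrt(
        stdev(pos) ** 2 / len(pos) + stdev(neg) ** 2 / len(neg)
    )


# Hypothetical expression values of one gene in the two classes.
# Note: a gene that is constant in both classes would divide by zero here.
pos = [5.0, 6.0, 7.0]
neg = [1.0, 2.0, 3.0]
print(s2n(pos, neg), fisher(pos, neg), t_stat(pos, neg))
```

Genes are typically ranked by the absolute value of such a score, with the highest-scoring genes treated as the most relevant.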

  45. Evaluation Methods • Metrics: accuracy, sensitivity, specificity, area under ROC curve • Estimation: leave-one-out CV, .632 bootstrapping • .632 perf = 0.368 × training perf + 0.632 × testing perf
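
The .632 bootstrap estimate on this slide combines performance on the bootstrap training sample with performance on the samples left out of it. A minimal sketch follows; the split helper, the seed, and the performance numbers are hypothetical and only illustrate the weighting.

```python
import random


def bootstrap_split(n, rng):
    """Draw n indices with replacement; the unused indices form the test set."""
    train = [rng.randrange(n) for _ in range(n)]
    chosen = set(train)
    test = [i for i in range(n) if i not in chosen]
    return train, test


def perf_632(train_perf, test_perf):
    """.632 estimate: 0.368 * training perf + 0.632 * testing perf."""
    return 0.368 * train_perf + 0.632 * test_perf


rng = random.Random(0)  # fixed seed so the split is reproducible
train_idx, test_idx = bootstrap_split(20, rng)

# Hypothetical accuracies measured on the two parts of one bootstrap round.
estimate = perf_632(train_perf=1.0, test_perf=0.8)
print(len(train_idx), len(test_idx), estimate)
```

In practice the estimate is averaged over many bootstrap rounds rather than taken from a single split.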

  46. Prostate cancer dataset

  47. Result analysis: prostate cancer dataset

  48. Colon cancer dataset

  49. Result analysis: colon cancer dataset

  50. Conclusion • High-level data abstraction: data clustering techniques • Quantitative data transformed to fuzzy discrete transactions: fuzzy interval partition; Apriori algorithm for AR mining • Strong decision support for biomedical study: high accuracy and easy to interpret • More accurate cancer classification: eliminate irrelevant/redundant genes to decrease noise; select informative genes in balance
