Presentation Transcript


  1. DIMACS Mixer Series, September 19, 2002. Datascope - a new tool for Logical Analysis of Data (LAD). Sorin Alexe, RUTCOR, Rutgers University, Piscataway, NJ. E-mail: salexe@rutcor.rutgers.edu. URL: rutcor.rutgers.edu/~salexe

  2. LAD - Problem (diagram): Dataset, Hidden Function, LAD Approximation.
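
The setting sketched on this slide — a finite dataset of labeled observations drawn from an unknown ("hidden") Boolean function that LAD tries to approximate — can be illustrated with a few lines of code. The attribute count, sample size, and the particular hidden rule below are hypothetical, chosen only to make the example self-contained.

```python
# Minimal sketch of the LAD setting (illustrative only, not Datascope code).
# A hidden Boolean function labels observations; LAD sees only a finite sample.
import random

def hidden_function(x):
    # Hypothetical hidden rule: positive iff (x1 AND x3) OR (NOT x2 AND x4)
    return (x[0] and x[2]) or ((not x[1]) and x[3])

random.seed(0)
dataset = []
for _ in range(20):
    obs = tuple(random.randint(0, 1) for _ in range(4))
    dataset.append((obs, int(hidden_function(obs))))

positives = [obs for obs, y in dataset if y == 1]
negatives = [obs for obs, y in dataset if y == 0]
print(len(positives), "positive and", len(negatives), "negative observations")
```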

  3. LAD - Patterns (diagram): Positive Pattern, Negative Pattern.
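
In LAD, a pattern is a conjunction of attribute conditions; a positive pattern covers some positive observations and no (or few) negative ones, and symmetrically for negative patterns. A minimal sketch of this notion on 0/1 data, with hypothetical helper names (`covers`, `is_positive_pattern`):

```python
# A pattern as a set of (attribute index, required value) literals.
def covers(pattern, obs):
    """True if the observation satisfies every literal of the pattern."""
    return all(obs[i] == v for i, v in pattern)

def is_positive_pattern(pattern, positives, negatives):
    """A (pure) positive pattern covers some positive and no negative observations."""
    return any(covers(pattern, p) for p in positives) and \
           not any(covers(pattern, n) for n in negatives)

# Example on toy 0/1 observations.
positives = [(1, 0, 1, 1), (1, 1, 1, 0)]
negatives = [(0, 0, 1, 1), (1, 0, 0, 0)]
pattern = [(0, 1), (2, 1)]          # "x1 = 1 AND x3 = 1"
print(is_positive_pattern(pattern, positives, negatives))  # True
```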

  4. LAD - Theories, Models, Classifications (diagram): Positive Theory, Negative Theory, Model.

  5. Datascope Functions: Support Set Identification; Space Discretization; Pattern Detection; Model Construction; Discriminant / Prognostic Index; Classification; Feature Analysis.

  6. Datascope Dataflow (diagram): Raw Data, Pre-Processing, Discretization (Cutpoints, Support Set), Pandect Generation (Pattern Report), Theories/Models, Discriminant Construction, Feature Analysis (Significant Features), Pattern Space (Diagnosis, Prognosis, Risk Stratification); solvers: Internal Solver, Matlab Solver, User Excel Model.

  7. 1. Support Set Identification: Selects a Small Subset of Significant Features; Preserves Hidden Knowledge. Feature Ranking Criteria: Statistical Correlation with Outcome, Combinatorial Entropy, Distribution Monotonicity, Class Separation, Envelope Eccentricity. E.g., 10 proteins selected out of 15,144.
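
Of the ranking criteria listed above, statistical correlation with the outcome is the easiest to sketch; the greedy selection below is only a rough illustration of keeping a small support set, not the procedure Datascope actually uses, and it ignores the other criteria on the slide.

```python
# Rough sketch: rank binary features by absolute correlation with the outcome
# and keep a small support set. Illustrative only; Datascope combines several
# criteria (entropy, monotonicity, separation, eccentricity) not shown here.
from math import sqrt

def correlation(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def select_support_set(data, labels, k):
    """Keep the k features most correlated (in absolute value) with the outcome."""
    n_features = len(data[0])
    scores = []
    for j in range(n_features):
        column = [row[j] for row in data]
        scores.append((abs(correlation(column, labels)), j))
    return [j for _, j in sorted(scores, reverse=True)[:k]]

data = [(1, 0, 1, 1), (1, 1, 1, 0), (0, 0, 1, 1), (1, 0, 0, 0)]
labels = [1, 1, 0, 0]
print(select_support_set(data, labels, 2))
```

On the proteomics example cited on the slide, a ranking of this kind is what reduces 15,144 candidate features to a handful before pattern generation.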

  8. Data: Spreadsheet Oriented; OLE (via Clipboard) / Excel Spreadsheet / dBase Tables; Training / Test Generation (Bootstrap, k-Folding, Jackknife); New Features; Correlation.

  9. Data: Training/Test

  10. 2. Space Discretization. Criteria: Entropy, Correlation with Output, Bins (equipartitioning), Intervals, Clustered, Class Separation. Parameter Choice: User Defined, Minimizing Support Set; Quality Measures: Entropy, Separability.
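
The entropy criterion can be sketched as follows: candidate cutpoints are placed between differently labeled values of an attribute and scored by the weighted entropy of the resulting split. This is a generic illustration of the criterion, not Datascope's discretization routine.

```python
# Sketch of entropy-based cutpoint selection for one numeric attribute.
from math import log2

def entropy(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def best_cutpoint(values, labels):
    """Pick the threshold minimizing the weighted entropy of the two sides."""
    pairs = sorted(zip(values, labels))
    best, best_score = None, float("inf")
    for (v1, y1), (v2, y2) in zip(pairs, pairs[1:]):
        if y1 == y2 or v1 == v2:
            continue                  # only cut between differently labeled values
        cut = (v1 + v2) / 2
        left = [y for v, y in pairs if v <= cut]
        right = [y for v, y in pairs if v > cut]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if score < best_score:
            best, best_score = cut, score
    return best

values = [0.3, 1.1, 1.8, 2.5, 3.0, 4.2]
labels = [0, 0, 0, 1, 1, 1]
print(best_cutpoint(values, labels))  # 2.15
```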

  11. Discretization criteria illustrated (plots): Entropy, Correlation with Output, Bins, Intervals, Clustered, Class Separation.

  12. 3. Generation of Maximal Patterns. Pattern Type Selection: Prime, Cones, Intervals, Spanned. Parameter Bound Settings: Prevalence (% of positive observations, % of negative observations), Homogeneity (on positive patterns, on negative patterns), Degree. Post-Generation Filters: By Characteristics, Maximality, Strongness.
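
The prevalence, homogeneity, and degree bounds can be illustrated by a brute-force enumeration of low-degree patterns over binarized data; real pattern generation is far more efficient than this sketch, and the threshold names below are assumptions chosen for illustration.

```python
# Brute-force sketch: enumerate candidate positive patterns up to a given degree
# and keep those meeting prevalence and homogeneity thresholds.
from itertools import combinations, product

def covers(pattern, obs):
    return all(obs[i] == v for i, v in pattern)

def generate_patterns(positives, negatives, max_degree, min_prevalence, min_homogeneity):
    n_attrs = len(positives[0])
    kept = []
    for degree in range(1, max_degree + 1):
        for attrs in combinations(range(n_attrs), degree):
            for values in product((0, 1), repeat=degree):
                pattern = list(zip(attrs, values))
                pos_cov = sum(covers(pattern, p) for p in positives)
                neg_cov = sum(covers(pattern, n) for n in negatives)
                if pos_cov + neg_cov == 0:
                    continue
                prevalence = pos_cov / len(positives)        # share of positives covered
                homogeneity = pos_cov / (pos_cov + neg_cov)  # purity of the coverage
                if prevalence >= min_prevalence and homogeneity >= min_homogeneity:
                    kept.append((pattern, prevalence, homogeneity))
    return kept

positives = [(1, 0, 1, 1), (1, 1, 1, 0), (1, 0, 1, 0)]
negatives = [(0, 0, 1, 1), (1, 0, 0, 0), (0, 1, 0, 1)]
for pat, prev, hom in generate_patterns(positives, negatives, 2, 0.6, 1.0):
    print(pat, round(prev, 2), hom)
```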

  13. Positive Patterns: Pattern Definition, Training Set, Test Set.

  14. Negative Patterns: Pattern Definition, Training Set, Test Set.

  15. 4. Theories and Models. Pandect. Theory Selection via Set Covering Heuristics: Greedy, Bottleneck Greedy, Lexicographic Greedy. Model Selection: 2 Set-Covering Problems, Quadratic Set-Covering Problem.
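
The greedy set-covering heuristic mentioned for theory selection can be sketched as: repeatedly add the pattern covering the most still-uncovered observations of its class. A generic sketch, not the heuristics implemented in Datascope:

```python
# Greedy set-covering sketch for theory selection: choose patterns until every
# positive observation is covered (generic heuristic, illustrative only).
def covers(pattern, obs):
    return all(obs[i] == v for i, v in pattern)

def greedy_theory(patterns, observations):
    uncovered = set(range(len(observations)))
    theory = []
    while uncovered:
        best, best_gain = None, 0
        for pattern in patterns:
            gain = sum(1 for i in uncovered if covers(pattern, observations[i]))
            if gain > best_gain:
                best, best_gain = pattern, gain
        if best is None:          # remaining observations cannot be covered
            break
        theory.append(best)
        uncovered -= {i for i in uncovered if covers(best, observations[i])}
    return theory

positives = [(1, 0, 1, 1), (1, 1, 1, 0), (0, 1, 1, 1)]
patterns = [[(0, 1), (2, 1)], [(1, 1), (3, 1)], [(1, 1)]]
print(greedy_theory(patterns, positives))
```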

  16. 4. Example (Model)

  17. 5. Example (Classification)

  18. 5. Discriminants (weighted sums of patterns). Weight Selection Methods: Direct: 1. Prognostic Index, 2. Weighted Prognostic Index; LP-Based: 3. Distance Maximizing Separator (SVM), 4. Cost Minimizing Separator, 5. Expected Value Separator; NLP-Based: 6. Regression in Pattern Space (ANN), 7. Best Correlation with Output.
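
The simplest direct weighting listed, the prognostic index, is commonly described in the LAD literature as the fraction of positive patterns covering an observation minus the fraction of negative patterns covering it; the sketch below assumes that reading (the sign of the score classifies, zero is left unclassified).

```python
# Sketch of a prognostic-index style discriminant: the normalized number of
# positive patterns covering an observation minus the normalized number of
# negative patterns covering it. Sign gives the classification; 0 is unclassified.
def covers(pattern, obs):
    return all(obs[i] == v for i, v in pattern)

def prognostic_index(obs, pos_patterns, neg_patterns):
    pos = sum(covers(p, obs) for p in pos_patterns) / max(len(pos_patterns), 1)
    neg = sum(covers(p, obs) for p in neg_patterns) / max(len(neg_patterns), 1)
    return pos - neg

pos_patterns = [[(0, 1), (2, 1)], [(1, 1), (3, 0)]]
neg_patterns = [[(0, 0)], [(2, 0), (3, 0)]]
obs = (1, 0, 1, 1)
score = prognostic_index(obs, pos_patterns, neg_patterns)
print("positive" if score > 0 else "negative" if score < 0 else "unclassified")
```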

  19. Discriminants compared (plots): Prognostic Index, Weighted Prognostic Index, Expected Value Separator, Distance Maximizing Separator, Cost Minimizing Separator, Best Correlation with Output.

  20. Accuracy, Specificity, Sensitivity.

  21. Reporting: Cutpoints; Discretized Space; Pandect; Coverage of Observations by Patterns; Pattern Report (Compact/Full Versions); Theories/Models; Attribute Analysis; Log File.

  22. Pattern Space (plots): Positive/Negative Patterns vs. Positive, Negative, and Unclassified Observations, on the Training and Test sets.

  23. Clustered Pattern Space

  24. Validation Procedures (diagram): Raw Data, Stratified Random Partition (Bootstrap, K-Folding, Jackknife), LAD Model on Training Set, Performance Evaluation (Accuracy, Sensitivity, Specificity).
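
The k-folding variant of this validation loop, together with the accuracy, sensitivity, and specificity measures, can be sketched generically as below; `train_and_classify` is a hypothetical callback standing in for building an LAD model on the training folds and classifying the held-out fold.

```python
# Generic k-fold validation sketch with accuracy, sensitivity, and specificity.
import random

def k_fold_validate(data, labels, k, train_and_classify, seed=0):
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    tp = tn = fp = fn = 0
    for fold in folds:
        fold_set = set(fold)
        train_idx = [i for i in indices if i not in fold_set]
        train = ([data[i] for i in train_idx], [labels[i] for i in train_idx])
        test = [data[i] for i in fold]
        predictions = train_and_classify(train, test)
        for i, pred in zip(fold, predictions):
            if labels[i] == 1:
                tp += pred == 1
                fn += pred == 0
            else:
                tn += pred == 0
                fp += pred == 1
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return accuracy, sensitivity, specificity

# Toy usage with a trivial stand-in classifier (predicts the majority class).
def majority_classifier(train, test):
    majority = int(sum(train[1]) * 2 >= len(train[1]))
    return [majority] * len(test)

data = [(i,) for i in range(10)]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
print(k_fold_validate(data, labels, 5, majority_classifier))
```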

  25. Special Features: User Model Generation (Excel Files); Datascope Macro Language; Multiple and Complex Experiments; Interface with Other Applications (Datascope Server).

  26. Performance. Reference: Tjen-Sien Lim, Wei-Yin Loh and Yu-Shan Shih, A Comparison of Prediction Accuracy, Complexity, and Training Time of Thirty-three Old and New Classification Algorithms, Machine Learning, 40, 203-229 (2000). http://www.ics.uci.edu/~mlearn/MLRepository.html

  27. LAD Case Studies: Assessing Long-Term Mortality Risk After Exercise Electrocardiography; Ovarian Cancer Detection Using Proteomic Data; Combinatorial Analysis of Breast Cancer Data from Image Cytometry and Gene Expression Microarrays; Cell Proliferation on Medical Implants; Country Risk Rating.
