Machine Learning Methods for Decision Support and Discovery Constantin Aliferis M.D., Ph.D., Ioannis Tsamardinos Ph.D. Discovery Systems Laboratory, Department of Biomedical Informatics, Vanderbilt University 2004 MEDINFO Tutorial 7 September 2004
Acknowledgments • Alexander Statnikov for code and for putting together the Resource Web page and CD • Doug Hardin, Pierre Massion, Yindalon Aphinyanaphongs, Laura E. Brown, and Nafeh Fananapazir for access to data and results for case studies
Goal The purpose of this tutorial is: • To help participants develop a solid understanding of some of the most useful machine learning methods. • To give several examples of how these methods can be applied in practice, and • To provide resources for expanding the knowledge gained in the tutorial.
Outline Part I: Overview and Foundations 1. Tutorial Overview and goals 2. Importance of Machine Learning for discovery and decision-support system construction 3. A framework for inductive Machine Learning 4. Generalization and Over-fitting 5. Quick review of data preparation and model evaluation 6. Families of methods a. Bayesian classifiers break b. Neural Networks c. Support Vector Machines break 7. Quick Review of Additional families a. K-Nearest Neighbors, b. Clustering, c. Decision Tree Induction, d. Genetic Algorithms
Outline (cont’d) Part II: More Advanced Methods and Case Studies 1. More Advanced Methods a. Causal Discovery methods using Causal Probabilistic Networks b. Feature selection break 2. Case Studies • Building a diagnostic model from gene expression data • Building a diagnostic model from mass spectrometry data • Categorizing text into content categories break • Discovery of causal structure using Causal Probabilistic Network induction (demo) 3. Conclusions and wrap-up a. Resources for machine learning b. Questions & feedback
A (Simplified) Motivating Example • Assume we wish to create a decision support system capable of diagnosing patients according to two categories: Lung Cancer and Normal. • The input to the system will be array gene expression measurements from tissue biopsies.
A (Simplified) Motivating Example • Little is currently known about how gene expression values differentiate human lung cancer tissue from normal tissue. • Thus we will use an automated approach in which a computer system will examine patients’ array gene expression measurements and the correct diagnosis (provided by a pathologist).
A (Simplified) Motivating Example • The system will produce a program that implements a function assigning the correct diagnostic label to any pattern of array gene expression data (and not just to the input-output patterns of the training data). • Thus the system will learn (i.e., generalize) from training data the general input-output function for our diagnosis problem.
A (Simplified) Motivating Example • What are the principles and specific methods that enable the creation of such learning systems? • What flavors of learning systems currently exist? • What are their capabilities and limitations? …These are some of the questions we will be addressing in this tutorial
What is Machine Learning (ML)? How is it different than Statistics and Data Mining? • Machine Learning is the branch of Computer Science (Artificial Intelligence in particular) that studies systems that learn. • Systems that learn = systems that improve their performance with experience.
What is Machine Learning (ML)? How is it different than Statistics and Data Mining? • Typical tasks: • image recognition, • diagnosis, • elicitation of possible causal structure of problem domain, • game playing, • solving optimization problems, • prediction of structure or function of biomolecules, • text categorization, • identification of relevant variables, etc.
Indicative Example applications of ML in Biomedicine • Bioinformatics • Prediction of Protein Secondary Structure • Prediction of Signal Peptides • Gene Finding and Intron/Exon Splice Site Prediction • Diagnosis using cDNA and oligonucleotide array gene expression data • Identification of molecular subtypes of patients with various forms of cancer • Clinical problem areas • Survival after Pneumonia (CAP) • Survival after Syncope • Diagnosis of Acute M.I. • Diagnosis of Prostate Cancer • Diagnosis of Breast Cancer • Prescription and monitoring in hemodialysis • Prediction of renal transplant graft failure
Importance of ML: Task Types • Diagnosis (what is the most likely disease given a set of clinical findings?), • Prognosis (what will be the outcome after a certain treatment has been given to a patient?), • Treatment selection (what treatment to give to a specific patient?), • Prevention (what is the likelihood that a specific patient will develop disease X if preventable risk factor Y is present?). ML has practically replaced Knowledge Acquisition for building Decision Support (“Expert”) Systems.
Importance of ML: Task Types (cont’d) • Discovery • Feature selection (e.g., what is a minimal set of laboratory values needed for pneumonia diagnosis?); • Concept formation (e.g., what are patterns of genomic instability as measured by array CGH that constitute molecular subtypes of lung cancer capable of guiding development of new treatments?); • Feature construction (e.g., how can mass-spectrometry signals be decomposed into individual variables that are highly predictive for detection of cancer and can be traced back to individual proteins that may play important roles in carcinogenesis?); • Information retrieval query construction (e.g., what are PubMed MeSH terms that predict with high sensitivity and specificity whether medical journals talk about treatment?); • Questions about function, interactions, and structure (e.g., how do genes and proteins regulate each other in the cells of lower and higher organisms? what is the most likely function of a protein given the sequence of its amino acids?), etc.
What is Machine Learning (ML)? How is it different than Statistics and Data Mining? • Broadly speaking, ML, DM, and Statistics have similar goals (modeling for classification and hypothesis generation or testing). • Statistics has traditionally emphasized models that can be solved analytically (for example various versions of the Generalized Linear Model – GLM). To achieve this, restrictions on both the expressive power of models and their parametric distributions are heavily used. • Data Mining emphasizes very large-scale data storage, integration, retrieval and analysis (typically the last one as a secondary focus). • Machine Learning seeks to use computationally powerful approaches to learn very complex non- or quasi-parametric models of the data. Some of these models are closer to human representations of the problem domain per se (or of problem solving in the domain).
Importance of ML: Data Types and Volume • Overwhelming production of data: • Bioinformatics (mass-throughput assays for gene expression, protein abundance, SNPs…) • Clinical Systems (EPR, POE) • Bibliographic collections • The Web: web pages, transaction records,…
Importance of ML: Reliance on Hard data and evidence • Machine learning has become critical for Decision Support System Construction given extensive cognitive biases and the corresponding need to base MDSSs on hard scientific evidence and high-quality data
Supplementary: Cognitive Biases • Main thesis: • Human cognitive abilities are tailored to support instinctive, reflexive, life-preserving reactions that can be traced back to our evolution as a species. They are not designed for rational, rigorous reasoning such as the reasoning needed in science and engineering. • In other words, there is a disconnect between our innate cognitive ability and the complexity of reasoning tasks required by the explosive advances in science and technology in the last few hundred years.
Supplementary: But is the Cognitive Biases Thesis Correct? • Psychology of Judgment and Decision Making (Plous) • Tversky and Kahneman (Judgment under uncertainty: Heuristics and Biases) • Methods of Influence (Cialdini) And highly recommended supplementary information can be found in: • Professional Judgment (Elstein) • Institute of Medicine’s Report on Medical Errors (1999)
Supplementary: Tversky and Kahneman “Judgment under uncertainty: Heuristics and Biases” • This work (a constellation of psychological studies converging to a description of human decision making under uncertainty) is very highly regarded and influential. It was recently (2002) recognized with the Nobel Prize in Economics. • Main points: • People use a few simple heuristics when making judgments under uncertainty • These heuristics are sometimes useful and at other times lead to severe and systematic errors • These heuristics are: representativeness, availability, and anchoring
Supplementary: Representativeness • E.g.: the probability P that patient X has disease D given that she has findings F is assessed by the similarity of X to a prototypical description of D (found in a textbook, or recalled from earlier practice and training). • Why is this wrong? • Reason #1: similarity ignores base-rate of D • Reason #2: similarity ignores sample size • Reason #3: similarity ignores predictability • Reason #4: similarity is affected by redundant features
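Reason #1 can be made concrete with Bayes' theorem. The sketch below is illustrative only: the function name `posterior` and all numbers (prior, sensitivity, false-positive rate) are assumptions chosen for the example, not values from the tutorial. It shows that the same "similarity" of findings to the textbook picture yields very different probabilities of disease depending on the base rate.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(D | F) by Bayes' theorem for a binary disease D and findings F.

    prior               = P(D), the base rate of the disease
    sensitivity         = P(F | D)
    false_positive_rate = P(F | not D)
    """
    p_findings = sensitivity * prior + false_positive_rate * (1.0 - prior)
    return sensitivity * prior / p_findings


# Same findings, same match to the prototype; only the base rate differs.
# All numbers are made up for illustration.
common = posterior(prior=0.10, sensitivity=0.90, false_positive_rate=0.05)
rare = posterior(prior=0.001, sensitivity=0.90, false_positive_rate=0.05)
print(f"common disease: {common:.3f}, rare disease: {rare:.3f}")
```

Judging by similarity alone would assign both patients the same probability, whereas the base rate changes the answer by more than an order of magnitude.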
Supplementary: Availability • E.g.: the probability P that patient X with disease D will become healthy given that she is given treatment T is assessed by recalling such occurrences in one’s prior experience • Why is this wrong? • Reason #1: availability is influenced by familiarity • Reason #2: availability is influenced by salience • Reason #3: availability is influenced by elapsed time • Reason #4: availability is influenced by rate of abstract terms • Reason #5: availability is influenced by imaginability • Reason #6: availability is influenced by perceived association
Supplementary: Adjustment and Anchoring • E.g.: the probability P that patient X has disease D given that she has findings F is assessed by making an initial estimate P1 for findings F1 and updating it when new evidence F2, F3, …, and so on, becomes available. • What goes wrong? • Problem #1: initial estimate over-influences the final estimate • Problem #2: initial estimate is often based on quick and then extrapolated calculations • Problem #3: people overestimate the probability of conjunctive events • Problem #4: people’s predictions are calibrated differently depending on the initial anchor
Supplementary: Additional • Methods of Influence (Cialdini, 1993): • Reciprocation • Commitment & Consistency • Social Proof • Liking • Authority • Scarcity • Professional Judgment (Dowie and Elstein 1988) • Institute of Medicine’s Report on Medical Errors (1999)
Supplementary: Putting MDSSs and Machine Learning in Historical Context • 40s • Foundations of Formal Decision-Making Theory by von Neumann and Morgenstern • 50s • Ledley and Lusted lay out how logic and probabilistic reasoning can help in diagnosis and treatment selection in medicine • 60s • Applications of Bayes’ theorem for diagnosis and treatment selection pioneered by Warner and de Dombal • Medline (NLM) • Early 70s • Ad-hoc systems (Myers et al; Pauker et al) • Study of Cognitive Biases (Kahneman, Tversky) • Late 70s • Rule-based systems (Buchanan & Shortliffe)
Supplementary: Milestones in MDSSs • 80s • Analysis of ad-hoc and RBSs (Heckerman et al.) • Bayesian Networks (Pearl, Cooper, Heckerman et al.) • Medical Decision Making as discipline (Pauker) • Literature-driven decision support (Rennels & Shortliffe) • Early 90s • Web-enabled decision support & wide-spread information retrieval • Computational Causal Discovery (Pearl, Spirtes et al., Cooper et al.) • Sound re-formulation of very large ad-hoc systems (Shwe et al) • Analysis of Bayesian systems (Domingos et al, Henrion et al.) • Proliferation of focused Statistics and Machine Learning MDSSs • First-order Logics that combine classical FOL with probabilistic reasoning, causation and planning (Haddawy)
Supplementary: Milestones in MDSSs • Late 90s • Efficient Inference for very large probabilistic systems (Jordan et al) • Kernel-based methods for sample-efficient learning (Vapnik) • Evidence-Based Medicine (Haynes et al) • 21st Century • Diagnosis, Prognosis and Treatment selection (a.k.a. “Personalized medicine” or “Pharmacogenomics”) based on molecular information (proteomic spectra, gene expression arrays, SNPs) collected via mass-throughput assaying technology, and modeled using machine learning methods • Provider-order entry delivery of advanced decision support • Advanced representation, storage, retrieval and application of EBM information (guidelines, journals, meta-analyses, clinical bioinformatics models)
Importance of ML • How often are ML techniques being used? Number of articles in Medline (last 2 years in parentheses): • Artificial Intelligence: 12,441 (2,358) • Expert systems: 2,271 (121) • Neural Networks: 5,403 (1,158) • Support Vector Machines: 163 (121) • Clustering: 17,937 (4,080) • Genetic Algorithms: 2,798 (969) • Decision Trees: 4,958 (752) • Bayesian (Belief) Networks: 1,627 (585) • Bayes (Bayesian Statistics + Nets): 4,369 (561) Compare to: • Regression: 164,305 (28,134) • Knowledge acquisition: 310 (56) • Knowledge representation: 227 (27) • 4 major Symbolic DSS (Internist-I, QMR, ILIAD, DxPlain): 145 (10) • Rule-based systems: 802 (151)
Importance of ML • Importance of ML becomes very evident in cases where: • data analysis is too time consuming (e.g., classify web pages or Medline documents into content or quality categories) • there is little or no domain theory What is the diagnosis? Is this an early cancer?
What is the difference between supervised and unsupervised ML methods? • Supervised learning: • Give the learning algorithm several instances of input-output pairs; the algorithm learns to predict the correct output for a given input, not only for previously seen inputs but also for previously unseen ones (“generalization”). • In our original example: show the learning algorithm array gene expression measurements from several patient cases as well as normal subjects; the learning algorithm then induces a classifier that can assign a previously unseen subject to the correct diagnostic category given the gene expression values observed in that subject
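Classification [Figure: train instances (A1, B1, C1, D1, E1; A2, B2, C2, D2, E2; …; An, Bn, Cn, Dn, En) feed a classifier-inductive algorithm, which produces a classifier over variables A, B, C, D, E; the classifier is applied to application instances and its classification performance is assessed.]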
What is the difference between supervised and unsupervised ML methods? • Unsupervised learning: • Discover the categories (or other structural properties of the domain) • Example: give the learning algorithm gene expression measurements of patients with Lung Cancer as well as normal subjects; the algorithm finds sub-types (“molecular profiles”) of patients that are very similar to each other and different from the rest of the types. Or another algorithm may discover how various genes interact among themselves to determine development of cancer.
Discovery [Figure: train instances (A1, B1, C1, D1, E1; A2, B2, C2, D2, E2; …; An, Bn, Cn, Dn, En) feed a structure-induction algorithm, which produces a structure over variables A, B, C, D, E; its performance is then assessed.]
A first concrete attempt at solving our hypothetical diagnosis problem using a particular type of learning approach (decision tree induction)
Decision Tree Induction • An example decision tree to solve the problem of how to classify subjects into lung cancer vs normal [Figure: decision tree with root Gene139 (branches: over-expressed, normally expressed, under-expressed), internal nodes Gene202 and Gene8766 (branches: over-expressed, normally expressed), and leaves labeled Lung cancer or Normal]
How Can I Learn Such A Decision Tree Automatically? • A basic induction procedure is very simple in principle: • Start with an empty tree • Put at the root of the tree the variable that best classifies the training examples • Create branches under the variable corresponding to its values • Under each branch repeat the process with the remaining variables • Until we run out of variables or sample
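The procedure above can be sketched in a few lines. This is a minimal illustration, not the tutorial's implementation: it assumes discrete variables, uses information gain as one common way to pick the variable that "best classifies" the examples, and runs on made-up toy data. The names `induce_tree`, `classify`, and the toy gene values are hypothetical.

```python
from collections import Counter
import math


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, var):
    """Reduction in label entropy obtained by splitting on variable `var`."""
    remainder = 0.0
    for value in set(row[var] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[var] == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder


def induce_tree(rows, labels, variables):
    """Greedy top-down induction.

    Returns a class label (leaf) or a pair (variable, {value: subtree}).
    """
    majority = Counter(labels).most_common(1)[0][0]
    # Stop when the sample is pure or we have run out of variables.
    if len(set(labels)) == 1 or not variables:
        return majority
    best = max(variables, key=lambda v: information_gain(rows, labels, v))
    remaining = [v for v in variables if v != best]
    branches = {}
    for value in set(row[best] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        branches[value] = induce_tree(
            [rows[i] for i in idx], [labels[i] for i in idx], remaining
        )
    return (best, branches)


def classify(tree, row, default="Normal"):
    """Follow branches until a leaf; unseen values fall back to `default`."""
    while isinstance(tree, tuple):
        var, branches = tree
        if row[var] not in branches:
            return default
        tree = branches[row[var]]
    return tree


# Made-up toy data: expression levels are "over", "normal", or "under".
rows = [
    {"Gene139": "over", "Gene202": "over"},
    {"Gene139": "over", "Gene202": "normal"},
    {"Gene139": "normal", "Gene202": "over"},
    {"Gene139": "normal", "Gene202": "normal"},
    {"Gene139": "under", "Gene202": "normal"},
]
labels = ["Lung cancer", "Normal", "Lung cancer", "Normal", "Normal"]
tree = induce_tree(rows, labels, ["Gene139", "Gene202"])
print(tree)
```

On this toy sample Gene202 separates the classes perfectly, so it is placed at the root and the recursion stops after one split. Practical learners add handling for continuous values, missing data, and pruning.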
A General Description of supervised Inductive ML • Inductive Machine Learning algorithms can be designed and analyzed using the following framework: • A language L in which we express models. The set of all possible models expressible in L constitutes our hypothesis space H • A scoring metric M tells us how good a particular model is • A search procedure S helps us identify the best model in H [Figure: the space of all possible models, with the models in H shown as a subset]
A General Description of supervised Inductive ML • In our decision tree example: • A language L in which we express models = decision trees • The hypothesis space H = space of all decision trees that can be constructed with genes 1 to n • A scoring metric M telling us how good a particular model is = classification error + model complexity (to be minimized) • A search procedure S = greedy search
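One way to write the scoring metric and the goal of the search for the decision tree example is sketched below. The notation is an assumption made for illustration and does not come from the tutorial: $\widehat{\mathrm{err}}_D(h)$ is the fraction of training cases in sample $D$ that tree $h$ misclassifies, $\mathrm{size}(h)$ is some measure of tree complexity such as the number of nodes, and $\lambda \ge 0$ is a hypothetical weight trading one off against the other.

```latex
\[
M(h) \;=\; \widehat{\mathrm{err}}_D(h) \;+\; \lambda \,\mathrm{size}(h),
\qquad
h^{*} \;=\; \arg\min_{h \in H} M(h)
\]
```

Greedy search does not evaluate $M$ for every $h \in H$; it builds a single tree one split at a time, so it may return a tree other than $h^{*}$.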
How can ML methods fail? • Wrong language Bias: best model is not in H • Example: we look for models expressible as discrete decision trees but the domain is continuous [Figure: the space of all possible models, with the models in H shown as a subset]
How can ML methods fail? • Search Failure: best model is in H but search fails to examine it • Example: greedy search fails to capture a strong gene-gene interaction effect [Figure: the space of all possible models, with the models in H shown as a subset]
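A small sketch of how such a search failure can arise, under assumed conditions: two hypothetical binary genes, `geneA` and `geneB`, determine the class through an exclusive-or interaction, and the search scores candidate splits by information gain (an illustrative choice of criterion). Scored one at a time, neither gene looks informative, although the pair together determines the class completely.

```python
from collections import Counter
import math


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, variables):
    """Entropy reduction from splitting jointly on the tuple of `variables`."""
    groups = {}
    for row, lab in zip(rows, labels):
        key = tuple(row[v] for v in variables)
        groups.setdefault(key, []).append(lab)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical data: cancer occurs exactly when one, but not both,
# of the two genes is over-expressed (1 = over-expressed, 0 = normal).
rows = [{"geneA": a, "geneB": b} for a in (0, 1) for b in (0, 1)]
labels = ["cancer" if row["geneA"] != row["geneB"] else "normal" for row in rows]

gain_a = information_gain(rows, labels, ["geneA"])
gain_b = information_gain(rows, labels, ["geneB"])
gain_ab = information_gain(rows, labels, ["geneA", "geneB"])
print(gain_a, gain_b, gain_ab)
```

A greedy learner that discards variables with no individual gain would never examine the tree that splits on both genes, even though that tree is in H.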
Generalization & Over-fitting • It was mentioned previously that a good learning program learns something about the data beyond the specific cases that have been presented to it. • Indeed, it is trivial to just store and retrieve the cases that have been seen in the past (“rote learning” implemented as a lookup table). This does not address the problem of how to handle new cases, however.
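Rote learning as a lookup table can be sketched as follows. The class name `RoteLearner` and the discretized expression patterns are hypothetical and serve only to illustrate the point: stored cases are retrieved perfectly, while a pattern that was never seen gets no answer at all.

```python
class RoteLearner:
    """Memorizes training cases verbatim; has no answer for unseen inputs."""

    def __init__(self):
        self.table = {}

    def fit(self, inputs, outputs):
        # Store each input pattern together with its label.
        for x, y in zip(inputs, outputs):
            self.table[tuple(x)] = y
        return self

    def predict(self, x):
        # Returns None when the exact pattern was never stored.
        return self.table.get(tuple(x))


# Hypothetical discretized expression patterns (gene1, gene2).
train_x = [("over", "normal"), ("normal", "normal")]
train_y = ["Lung cancer", "Normal"]
learner = RoteLearner().fit(train_x, train_y)

seen = learner.predict(("over", "normal"))
unseen = learner.predict(("over", "under"))
print(seen, unseen)
```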
Generalization & Over-fitting • In supervised learning we typically seek to minimize “i.i.d.” error, that is, error over future cases (not used in training). Such cases include both previously encountered and new cases. • “i.i.d.” = independently sampled and identically distributed problem instances. • In other words, the training and application samples come from the same population (distribution) with identical probability to be selected for inclusion, and this population/distribution is time-invariant. (Note: if the distribution is not time-invariant, we can restore the i.i.d. condition by incorporating time as an independent variable or by other appropriate transformations.)
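In symbols, and under notation assumed here for illustration ($P$ for the population distribution, $h$ for the learned classifier, $(x_i, y_i)$ for the $n$ training cases), the error we want to minimize and the error we can actually measure on the training sample can be written as:

```latex
\[
\mathrm{err}(h) \;=\; \Pr_{(x,y)\sim P}\bigl[\,h(x) \neq y\,\bigr],
\qquad
\widehat{\mathrm{err}}_{\mathrm{train}}(h) \;=\; \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\bigl[\,h(x_i) \neq y_i\,\bigr],
\quad (x_i, y_i) \overset{\text{i.i.d.}}{\sim} P
\]
```

The i.i.d. condition is what links the two quantities: training cases and future cases are drawn from the same $P$, so performance on a sample carries information about performance on the population.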
Supplementary: Generalization & Over-fitting • Consider now the following simplified diagnostic classification problem: classify patients into cancer (red/vertical pattern) versus normal (green/no pattern) on the basis of the values of two genes (gene1, gene2) [Figure: plot with axes Gene1 and Gene2]
Supplementary: Generalization & Over-fitting • The diagonal line represents a perfect classifier for this problem (do not worry for the time being about how to mathematically represent or computationally implement the line; we will see how to do so in the Neural Network and Support Vector Machine segments) [Figure: plot with axes Gene1 and Gene2]
Supplementary: Generalization & Over-fitting • Let’s solve the same problem from a small sample; one such possible small sample is: [Figure: plot with axes Gene1 and Gene2]
Supplementary: Generalization & Over-fitting • We may be tempted to solve the problem with a fairly complicated line: [Figure: plot with axes Gene1 and Gene2]