
Multifactor Dimensionality Reduction

Multifactor Dimensionality Reduction. Laura Mustavich. Introduction to Data Mining, Final Project Presentation, April 26, 2007. The Inspiration for a Method. The Nature of Complex Diseases: most common diseases are complex, caused by multiple genes that often interact with one another.


Presentation Transcript


  1. Multifactor Dimensionality Reduction • Laura Mustavich • Introduction to Data Mining, Final Project Presentation • April 26, 2007

  2. The Inspiration For a Method

  3. The Nature of Complex Diseases • Most common diseases are complex • Caused by multiple genes • Often interacting with one another • This interaction is termed epistasis

  4. Epistasis • When an allele at one locus masks the effect of an allele at another locus

  5. The Failure of Traditional Methods • Traditional gene-hunting methods are successful for rare Mendelian (single-gene) diseases • They are unsuccessful for complex diseases: • Since many genes interact to cause the disease, the effect of any single gene is too small to detect • They do not take this interaction into account

  6. MDR: The Algorithm

  7. Multifactor Dimensionality Reduction • A data mining approach to identify interactions among discrete variables that influence a binary outcome • A nonparametric alternative to traditional statistical methods such as logistic regression • Driven by the need to improve the power to detect gene-gene interactions

  8. Multifactor Dimensionality Reduction

  9. MDR Step 0 Divide the data (genotypes, discrete environmental factors, and affection status) into 10 distinct subsets for cross-validation
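A minimal sketch of Step 0, assuming the individuals are simply shuffled and dealt into 10 disjoint folds. The function name `split_into_folds`, the fixed seed, and the sample size of 79 (borrowed from the case study later in the talk) are illustrative choices, not part of the MDR software.

```python
import random

def split_into_folds(n_individuals, n_folds=10, seed=0):
    """Shuffle individual indices and deal them into n_folds disjoint subsets."""
    indices = list(range(n_individuals))
    random.Random(seed).shuffle(indices)
    # Fold i receives every n_folds-th shuffled index, starting at position i.
    return [indices[i::n_folds] for i in range(n_folds)]

folds = split_into_folds(79)
for i, test_idx in enumerate(folds):
    # Each fold serves once as the testing set; the other nine form the training set.
    train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
    print(f"fold {i}: {len(train_idx)} training, {len(test_idx)} testing")
```

Every individual appears in exactly one testing set, so each record is used for testing once and for training nine times.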

  10. Multifactor Dimensionality Reduction

  11. MDR Step 1 Select a set of n genetic or environmental factors (which are suspected of epistasis together) from the set of all variables in the training set

  12. Multifactor Dimensionality Reduction

  13. MDR Step 2 Create a contingency table for these multilocus genotypes, counting the number of affected and unaffected individuals with each multilocus genotype

  14. Multifactor Dimensionality Reduction

  15. MDR Step 3 Calculate the ratio of cases to controls for each multilocus genotype

  16. Multifactor Dimensionality Reduction

  17. MDR Step 4 Label each multilocus genotype as “high-risk” or “low-risk”, depending on whether the case-control ratio is above a certain threshold • This is the dimensionality reduction step • Reduces n-dimensional space to 1 dimension with 2 levels
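Steps 2 to 4 can be sketched as follows for a toy data set. This is an illustration under stated assumptions, not the MDR software's code: genotypes are coded as tuples of integers, status is 1 for cases and 0 for controls, and the threshold of 1.0 is an assumed default (the slides only say "a certain threshold"). The name `label_genotypes` is hypothetical.

```python
from collections import defaultdict

def label_genotypes(genotypes, status, threshold=1.0):
    """Return {multilocus genotype: 'high-risk' or 'low-risk'}.

    genotypes: one tuple per individual (one entry per selected factor).
    status:    1 for a case (affected), 0 for a control (unaffected).
    """
    # Step 2: contingency table, genotype -> [number of cases, number of controls].
    counts = defaultdict(lambda: [0, 0])
    for g, s in zip(genotypes, status):
        counts[g][0 if s == 1 else 1] += 1

    labels = {}
    for g, (cases, controls) in counts.items():
        # Step 3: case-control ratio (infinite when a cell has no controls).
        ratio = cases / controls if controls else float("inf")
        # Step 4: collapse the n-dimensional cell into one of two levels.
        labels[g] = "high-risk" if ratio > threshold else "low-risk"
    return labels

# Toy example with two loci.
genotypes = [(0, 0), (0, 0), (0, 1), (0, 1), (0, 1), (1, 1), (1, 1), (1, 1)]
status = [0, 0, 1, 1, 0, 1, 0, 0]
print(label_genotypes(genotypes, status))
```

Genotypes never observed in the training data get no label here; how such empty cells are handled is a design choice the slides do not specify.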

  18. Multifactor Dimensionality Reduction

  19. MDR Step 5 Use labels to classify individuals as cases or controls, and calculate the misclassification rate
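A hedged sketch of Step 5: individuals whose genotype is labelled high-risk are predicted to be cases, everyone else is predicted to be a control, and the misclassification rate is the fraction of wrong predictions. The `labels` dictionary and the `default` label for unseen genotypes are assumptions made for the example.

```python
def misclassification_rate(labels, genotypes, status, default="low-risk"):
    """Fraction of individuals whose predicted status differs from the true one.

    labels:  {genotype: 'high-risk' or 'low-risk'} learned on the training set.
    default: label assumed for genotypes absent from `labels` (an assumption).
    """
    errors = 0
    for g, s in zip(genotypes, status):
        predicted_case = labels.get(g, default) == "high-risk"
        if predicted_case != (s == 1):
            errors += 1
    return errors / len(status)

# Hypothetical labels for a two-locus model, applied to a toy sample.
labels = {(0, 0): "low-risk", (0, 1): "high-risk", (1, 1): "low-risk"}
genotypes = [(0, 0), (0, 0), (0, 1), (0, 1), (0, 1), (1, 1), (1, 1), (1, 1)]
status = [0, 0, 1, 1, 0, 1, 0, 0]
print(misclassification_rate(labels, genotypes, status))  # 0.25
```

Applied to the training set this gives the training error; applied to the held-out subset it gives the prediction error used to compare models.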

  20. Multifactor Dimensionality Reduction

  21. Repeat steps 1-5 for: • All possible combinations of n factors • All possible values of n • Across all 10 training and testing sets

  22. The Best Model • Minimizes prediction error: the average misclassification rate across all 10 cross-validation subsets • Maximizes cross-validation consistency: the number of times a particular model was the best model across cross-validation subsets
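The two criteria can be computed from per-fold results roughly as below. This sketch assumes each fold reports the model that was best on its training set together with that model's testing error; the marker names and error values are invented for illustration.

```python
from collections import Counter
from statistics import mean

def summarize_cv(fold_results):
    """fold_results: list of (best_model, testing_error), one pair per fold.

    Returns (most frequent model, its cross-validation consistency,
    average prediction error across all folds).
    """
    model_counts = Counter(model for model, _ in fold_results)
    best_model, consistency = model_counts.most_common(1)[0]
    prediction_error = mean(error for _, error in fold_results)
    return best_model, consistency, prediction_error

# Invented results for 10 folds: one model wins 8 times, another twice.
fold_results = [(("SNP1", "SNP2"), 0.30)] * 8 + [(("SNP1", "SNP3"), 0.40)] * 2
model, cvc, error = summarize_cv(fold_results)
print(model, f"CVC = {cvc}/10", f"prediction error = {error:.2f}")
```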

  23. Hypothesis test of best model: • Evaluate magnitude of cross-validation consistency and prediction error estimates by permutation testing: • Randomize disease labels • Repeat MDR analysis several times to get distribution of cross-validation consistencies and prediction errors • Use distributions to determine p-values for your actual cross-validation consistencies and prediction errors
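A minimal sketch of the permutation test for the prediction error. In a real analysis the whole MDR procedure would be rerun on every permuted data set; here a stand-in `statistic` function (the error of a single binary factor) takes its place. The function names, the add-one correction, and the toy data are assumptions.

```python
import random

def permutation_p_value(statistic, status, n_permutations=1000, seed=0):
    """Estimate how often shuffled disease labels give an error at least as
    low as the observed one (lower error = more extreme)."""
    rng = random.Random(seed)
    observed = statistic(status)
    shuffled = list(status)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # randomize the disease labels
        if statistic(shuffled) <= observed:
            hits += 1
    # Add-one correction (an assumed convention) avoids reporting p = 0.
    return (hits + 1) / (n_permutations + 1)

# Stand-in statistic: misclassification rate when one binary factor is
# used directly as the prediction.
factor = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
status = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]

def error(labels):
    return sum(f != s for f, s in zip(factor, labels)) / len(labels)

print(permutation_p_value(error, status))
```

The same loop works for cross-validation consistency by passing a statistic where larger is more extreme and flipping the comparison.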

  24. Permutation Testing: An Illustration • Sample Quantiles: 0.4500 • The probability that we would see results as extreme as, or more extreme than, 0.4500 simply by chance is between 5% and 10%

  25. Strengths • Facilitates simultaneous detection and characterization of multiple genetic loci associated with a discrete clinical endpoint by reducing the dimensionality of the multilocus data • Non-parametric: no parameters are estimated • Assumes no particular genetic model • False positives due to multiple testing are minimized (through cross-validation and permutation testing)

  26. Weaknesses • Computationally intensive (especially with >10 loci) • The curse of dimensionality: decreased predictive ability with high dimensionality and small samples, due to cells with no data

  27. MDR Software

  28. The Authors • Multifactor dimensionality reduction software for detecting gene-gene and gene-environment interactions. Hahn, Ritchie, Moore, 2003. • www.sourceforge.net

  29. Values Calculated by MDR

  30. Sign Test • n = number of cross-validation intervals • c = number of cross-validation intervals with testing accuracy ≥ 0.5 • p = the probability of observing c or more cross-validation intervals with testing accuracy ≥ 0.5 if each case were actually classified randomly
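The slide's formula did not survive in the transcript. Assuming that, under random classification, each interval independently reaches accuracy ≥ 0.5 with probability 1/2, the description corresponds to the binomial tail p = sum over k from c to n of C(n, k) (1/2)^n. A short sketch under that assumption (the function name `sign_test_p` is illustrative):

```python
from math import comb

def sign_test_p(n, c):
    """P(at least c of n intervals reach accuracy >= 0.5) when each interval
    does so with probability 1/2 (assumed null model)."""
    return sum(comb(n, k) for k in range(c, n + 1)) * 0.5 ** n

print(round(sign_test_p(10, 10), 4))  # 0.001
```

With n = 10 and c = 10 this gives 0.5^10 ≈ 0.00098, which agrees with the sign test value of p = 0.0010 reported in the results later in the talk.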

  31. The Problem of Alcoholism A Case Study

  32. Genes Associated With Alcoholism • ADH (alcohol dehydrogenase) and ALDH2 (acetaldehyde dehydrogenase 2) genes are associated with alcoholism • Both are involved in alcohol metabolism • Pathway (figure): Alcohol → [ADH enzymes] → Acetaldehyde → [ALDH2 enzyme] → Acetate

  33. ADH Genes • Gene cluster on chromosome 4, spanning 370 kb (figure labelled 3’ and 5’) • Gene order in the figure: ADH7 (Class IV), ADH1C, ADH1B, ADH1A (Class I), ADH6 (Class V), ADH4 (Class II), ADH5 (Class III)

  34. Taste Receptors and Aversion to Alcohol • A person must be willing to drink in order to be an alcoholic • TAS2R38 (PTC) affects the amount of alcohol a person is willing to drink • Therefore, it is related to alcoholism, although no direct association has been found • We hope to provide a direct link between TAS2R38 and alcoholism by demonstrating that it acts epistatically with other genes associated with alcoholism • Figure: Tasters: alcohol tastes bitter, drink less alcohol • Non-tasters: alcohol tastes sweet, drink more alcohol

  35. Actual Analysis

  36. Data • A sample of cases and controls (alcoholics and non-alcoholics) from three East Asian populations: the Ami, Atayal, and Taiwanese • Genotyped for 98 markers within several genes: ALDH2, all ADH genes, and 2 taste receptor genes, TAS2R16 and TAS2R38 (PTC)

  37. Computational Limitations • The software package has a problem reading missing data • I was forced to use only complete records, which reduced my (already small) sample to 79 records

  38. Computational Limitations • The computation time is far too long for higher-order models, especially with a large number of attributes • I was advised to restrict my attributes to markers within ADH1C and the 2 taste receptor genes, which left me with 36 attributes • I considered models only up to order 4
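To make the cost concrete, the number of candidate models of order k among m attributes is the binomial coefficient C(m, k). The sketch below counts them for the 36 attributes used here and for the full set of 98 markers; the helper name `n_models` is illustrative, and actual run time also depends on the number of cross-validation folds and permutations.

```python
from math import comb

def n_models(n_attributes, max_order):
    """Number of attribute combinations to evaluate at each model order."""
    return {k: comb(n_attributes, k) for k in range(1, max_order + 1)}

restricted = n_models(36, 4)   # attributes actually analysed
full = n_models(98, 4)         # all genotyped markers
print(restricted, sum(restricted.values()))
print(full, sum(full.values()))
```

For 36 attributes this gives 36, 630, 7140 and 58905 models at orders 1 to 4 (66711 in total), whereas 98 markers would require 3612280 models at order 4 alone.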

  39. Summary of Results: All Populations • Instances: 79 • Attributes: 36 • Ratio: 1.3235

  40. Summary of Results: Ami • Instances: 30 • Attributes: 36 • Ratio: 0.8750

  41. Cross-Validation Statistics • Sign Test: 10 (p = 0.0010) • Cross-validation Consistency: 10/10

  42. Whole Dataset Statistics: • Training Balanced Accuracy: 0.9688 • Training Accuracy: 0.9667 • Training Sensitivity: 1.0000 • Training Specificity: 0.9375 • Training Odds Ratio: ∞ • Training Χ²: 26.2500 (p < 0.0001) • Training Precision: 0.9333 • Training Kappa: 0.9333 • Training F-Measure: 0.9655
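These statistics are all functions of one 2×2 confusion matrix. The counts below are not stated on the slides; they are inferred on the assumption that the figures refer to the Ami sample (30 instances, case-control ratio 0.8750, i.e. 14 cases and 16 controls) with one control misclassified. The formulas are the standard definitions, with χ² computed without a continuity correction.

```python
from math import inf

def training_statistics(tp, fn, fp, tn):
    """Standard 2x2 classification statistics from confusion-matrix counts."""
    n = tp + fn + fp + tn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp)
    # Agreement expected by chance, used for Cohen's kappa.
    expected = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / n ** 2
    return {
        "balanced accuracy": (sensitivity + specificity) / 2,
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "odds ratio": (tp * tn) / (fp * fn) if fp * fn else inf,
        "chi-square": n * (tp * tn - fp * fn) ** 2
                      / ((tp + fn) * (fp + tn) * (tp + fp) * (fn + tn)),
        "precision": precision,
        "kappa": (accuracy - expected) / (1 - expected),
        "F-measure": 2 * precision * sensitivity / (precision + sensitivity),
    }

# Inferred counts (an assumption): 14 cases all correct, 1 of 16 controls wrong.
stats = training_statistics(tp=14, fn=0, fp=1, tn=15)
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```

With these counts every statistic reproduces the value listed on the slide, including the infinite odds ratio that results from having no false negatives.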

  43. Graphical Model

  44. Classification Rules

  45. Locus Dendrogram

  46. Future Work • Simulations to calculate the power of MDR, especially in relation to sample size • Comparison of MDR with logistic regression, and other proposed methods to detect epistasis, with respect to the current data set and simulated data • Research how different methods to search the sample space can be incorporated into MDR implementation to improve computational feasibility
