Data Mining in Linkage Disequilibrium Mapping

  1. Data Mining in Linkage Disequilibrium Mapping Jing Hua Zhao Epidemiology j.zhao@public-health.ucl.ac.uk June 2003

  2. Outline of the Talk • The problem • Why data mining? • Haplotype construction • Challenging issues

  3. Current Paradigm • Complex traits (Lander & Schork 1994) • Association mapping (Risch & Merikangas 1996) • The need for both family-based and population-based studies (Hodge et al. 2003) • Number of SNPs

  4. Linkage Disequilibrium • The raw data are genetic markers • LD is the non-random association between alleles at different loci • LD contains information on the genetics of a population (selection, mutation, recombination, admixture)
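
The definition of LD on this slide can be made concrete with the usual pairwise measures. The following is a minimal sketch, assuming two biallelic loci and known haplotype frequencies; the function name `ld_measures` and its parameters are illustrative and do not come from the talk.

```python
def ld_measures(p_ab, p_a, p_b):
    """Return (D, D', r^2) for two biallelic loci.

    p_ab : frequency of the haplotype carrying allele A and allele B
    p_a  : frequency of allele A at the first locus
    p_b  : frequency of allele B at the second locus
    (All names are hypothetical, chosen for this illustration.)
    """
    # D is the departure of the haplotype frequency from independence.
    d = p_ab - p_a * p_b

    # D' rescales D by its maximum attainable magnitude given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0

    # r^2 is the squared correlation between the two allele indicators.
    var = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / var if var > 0 else 0.0
    return d, d_prime, r2


if __name__ == "__main__":
    # Independent loci: all three measures are zero.
    print(ld_measures(0.25, 0.5, 0.5))
```

Under random association the haplotype frequency equals the product of the allele frequencies, so D is zero; any other value is the "non-random association" the slide refers to.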

  5. A Model with LDs • Log-linear model to allow for higher order interaction (Weir & Wilson 1986) • Applicable to a variety of null hypotheses (Huttley & Wilson 2000) • Number of terms is exponential
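
To illustrate why the number of terms is exponential, consider the special case of m biallelic loci (an assumption made here for illustration; the slide does not restrict the number of alleles). A saturated log-linear model over the 2^m haplotypes then has one term per non-empty subset of loci:

```latex
\underbrace{\binom{m}{1}}_{\text{main effects}}
\;+\;
\underbrace{\sum_{k=2}^{m}\binom{m}{k}}_{\text{interaction (disequilibrium) terms}}
\;=\; 2^{m}-1,
\qquad
\sum_{k=2}^{m}\binom{m}{k} \;=\; 2^{m}-m-1 .
```

For m = 10 this already gives 1013 interaction terms.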

  6. Why Data Mining? • 1.8 million SNPs, 1,240 hits on “haplotype and data mining” in 0.15 seconds • Data mining is the process of exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and results (Berry & Linoff, 1997, 2000)

  7. A Statistical Perspective • Traditionally EDA, for a particular question • Sheer size of data is problematic • Now DM could be defined as the process of secondary analysis of large databases aimed at finding unsuspected relationships which are of interest or value to the database owners (Hand 1998)

  8. Haplotype Pattern Mining • Figure 1 (a) Strongly disease-associated haplotype patterns • Enumeration • DFS, which has good running time property
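
The enumeration by depth-first search mentioned on this slide can be sketched as follows. This is a simplified illustration, not the published haplotype pattern mining algorithm: it assumes patterns are per-marker alleles with `None` as a wildcard, and it prunes only by a minimum number of matching chromosomes. The name `mine_patterns` and the parameter `min_support` are hypothetical.

```python
def mine_patterns(haplotypes, min_support):
    """Depth-first enumeration of haplotype patterns.

    haplotypes  : list of equal-length tuples of alleles, one per chromosome
    min_support : minimum number of chromosomes a pattern must match
    Returns a list of (pattern, support); None in a pattern is a wildcard.
    """
    n_markers = len(haplotypes[0])
    results = []

    def dfs(pos, pattern, matching):
        if pos == n_markers:
            # Skip the all-wildcard pattern, which matches everything.
            if any(a is not None for a in pattern):
                results.append((tuple(pattern), len(matching)))
            return
        # Branch 1: leave this marker unspecified.
        dfs(pos + 1, pattern + [None], matching)
        # Branch 2: fix each allele seen among the chromosomes still matching.
        for allele in sorted({haplotypes[i][pos] for i in matching}):
            subset = [i for i in matching if haplotypes[i][pos] == allele]
            # Support can only shrink as a pattern grows, so pruning is safe.
            if len(subset) >= min_support:
                dfs(pos + 1, pattern + [allele], subset)

    dfs(0, [], list(range(len(haplotypes))))
    return results


if __name__ == "__main__":
    for pattern, support in mine_patterns([(1, 2), (1, 2), (1, 1)], min_support=2):
        print(pattern, support)
```

Because extending a pattern can never increase the number of matching chromosomes, a branch that falls below `min_support` can be abandoned at once, which is what keeps a depth-first search of this kind tractable.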

  9. Significance • A simple chi-squared statistic from a 2x2 table of disease-associated and control chromosomes, in accordance with D’, with significance determined via simulation • Simulations on prevalence, evolutionary history and sample size; robustness • Applicable to family data (Zhang et al. 2001)
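
The 2x2 chi-squared statistic with simulation-based significance can be sketched as below. Label permutation is used here as one simple form of simulation; that choice is an assumption of this sketch and is not stated on the slide. The function names and the default `n_perm` and `seed` values are hypothetical.

```python
import random


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic (no continuity correction) for the table
    [[a, b], [c, d]], e.g. rows = pattern present/absent, columns = case/control."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a margin is empty, so the table carries no information
    return n * (a * d - b * c) ** 2 / denom


def permutation_p_value(carrier, is_case, n_perm=1000, seed=0):
    """Empirical p-value obtained by shuffling case/control labels.

    carrier : list of bools, True if the chromosome carries the pattern
    is_case : list of bools, True if the chromosome is disease-associated
    """
    def stat(labels):
        a = sum(1 for x, y in zip(carrier, labels) if x and y)
        b = sum(1 for x, y in zip(carrier, labels) if x and not y)
        c = sum(1 for x, y in zip(carrier, labels) if not x and y)
        d = sum(1 for x, y in zip(carrier, labels) if not x and not y)
        return chi2_2x2(a, b, c, d)

    observed = stat(is_case)
    rng = random.Random(seed)
    labels = list(is_case)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if stat(labels) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return observed, (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    carrier = [True] * 10 + [False] * 10
    is_case = [True] * 10 + [False] * 10
    print(permutation_p_value(carrier, is_case))
```

An empirical p-value of this kind avoids relying on the asymptotic chi-squared distribution, which matters when many correlated patterns are tested on the same chromosomes.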

  10. Emerging Rules • LD patterns are highly structured (Daly et al. 2001) • 5-8 markers (Niu et al. 2002; Zaykin et al. 2002; Toivonen et al. 2000) • htSNPs (Johnson et al. 2001)

  11. Problem of Haplotype Uncertainty • EM (Ceppellini et al. 1955) • MCMC (Guo & Thompson 1992; Lazzaroni & Lange 1997; Stephens et al. 2001; Niu et al. 2002) • Heuristic algorithms
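
The EM approach to haplotype uncertainty can be illustrated in its simplest setting. The sketch below assumes two biallelic loci, unrelated individuals, no missing data and Hardy-Weinberg equilibrium, so that only double heterozygotes have ambiguous phase. It is a textbook gene-counting illustration and not the implementation of any of the cited programs; `em_two_locus` and its parameters are hypothetical names.

```python
def em_two_locus(genotypes, max_iter=200, tol=1e-12):
    """Estimate two-locus haplotype frequencies from unphased genotypes.

    genotypes : list of (g1, g2), each the count (0, 1 or 2) of allele "1"
                at locus 1 and locus 2
    Returns a dict mapping haplotypes (a1, a2) to estimated frequencies.
    """
    if not genotypes:
        raise ValueError("at least one genotype is required")
    haps = [(0, 0), (0, 1), (1, 0), (1, 1)]
    known = dict.fromkeys(haps, 0.0)
    n_double_het = 0
    for g1, g2 in genotypes:
        if g1 == 1 and g2 == 1:
            n_double_het += 1  # phase unknown: 00/11 or 01/10
        else:
            # At most one heterozygous locus, so both haplotypes are determined.
            known[(g1 // 2, g2 // 2)] += 1
            known[(g1 - g1 // 2, g2 - g2 // 2)] += 1

    total = 2.0 * len(genotypes)
    freq = dict.fromkeys(haps, 0.25)
    for _ in range(max_iter):
        # E-step: probability that a double heterozygote is 00/11 rather than 01/10.
        cis = freq[(0, 0)] * freq[(1, 1)]
        trans = freq[(0, 1)] * freq[(1, 0)]
        w = cis / (cis + trans) if cis + trans > 0 else 0.5
        # M-step: renormalise the expected haplotype counts.
        counts = dict(known)
        counts[(0, 0)] += n_double_het * w
        counts[(1, 1)] += n_double_het * w
        counts[(0, 1)] += n_double_het * (1 - w)
        counts[(1, 0)] += n_double_het * (1 - w)
        new_freq = {h: counts[h] / total for h in haps}
        change = max(abs(new_freq[h] - freq[h]) for h in haps)
        freq = new_freq
        if change < tol:
            break
    return freq


if __name__ == "__main__":
    print(em_two_locus([(0, 0), (2, 2), (1, 1), (1, 2)]))
```

With more loci the number of phase configurations per individual grows quickly, which is one motivation for the MCMC and heuristic alternatives listed on the slide.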

  12. Haplotype Reconstruction • Table of genotypes (Xie & Ott 1993) • Table of sufficient statistics (Zhao et al. 2000) and linked list • Binary trees (Zhao & Sham 2002) • Mixed-radix number (Zhao & Sham 2003) • QuickSort (Zhao & Qian submitted)
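
The mixed-radix idea can be sketched as follows: a multi-locus haplotype is treated as the digits of a number whose base at each position is the number of alleles at that locus, giving every haplotype a unique integer index. This is a generic illustration of mixed-radix indexing, not necessarily the scheme used in the cited paper; `encode` and `decode` are hypothetical names, and alleles are assumed to be coded from 0.

```python
def encode(alleles, n_alleles):
    """Map a haplotype to a single integer.

    alleles   : allele codes per locus, each in range(n_alleles[i])
    n_alleles : number of alleles at each locus (the radix of each digit)
    """
    index = 0
    for allele, radix in zip(alleles, n_alleles):
        if not 0 <= allele < radix:
            raise ValueError("allele code out of range")
        index = index * radix + allele
    return index


def decode(index, n_alleles):
    """Invert encode(): recover the allele codes from the integer index."""
    alleles = []
    for radix in reversed(n_alleles):
        index, allele = divmod(index, radix)
        alleles.append(allele)
    return alleles[::-1]


if __name__ == "__main__":
    radices = [2, 3, 4]            # three loci with 2, 3 and 4 alleles
    code = encode([1, 0, 2], radices)
    print(code, decode(code, radices))
```

An integer index of this kind lets haplotype counts be stored in a flat array or as sortable keys instead of a multi-dimensional table, at the cost of an index range that grows with the product of the allele numbers.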

  13. Examples • HLA (the evolution of EM algorithms, information content of SNP and SSR) • ALDH2 (missing data, effectiveness of heuristic method) • APOC (the disadvantage of QuickSort, heuristics, the inclusion of covariates)

  14. Challenging Issues • Genotype/Phenotype relationship in Whitehall II data (10,308 civil servants, with 7,000 APOE genotypings) • Associated with cognitive decline • Need longitudinal data • Will tie up with BioBank project

  15. Statistical Methodology • GLM needs to be extended • The same with LDA models such as GLMM • Search and Sort paradigm (Knuth)
