
Imputation 2

Presenter: Ka-Kit Lam


Presentation Transcript


  1. Imputation 2 • Presenter: Ka-Kit Lam

  2. Outline • Big Picture and Motivation • IMPUTE • IMPUTE2 • Experiments • Conclusion and Discussion • Supplementary: GWAS; estimate of the mutation rate

  3. Big Picture and Motivation

  4. Background • Genome-wide association study (GWAS): identify common genetic factors that influence health and disease

  5. Background • It is important to know the SNPs. • However, not all SNPs are genotyped for all individuals in a GWAS case-control study. • How can we guess the missing parts? • Individual 1: ACCCAATTACCAGTATTTA… • Individual 2: CCCCATTTACCACTATTTA… • Individual 3: ACCCATTTACCACTATTTA… • Individual 4: CCCCATTTACCAGTATTTA…

  6. Information known • Luckily, we now have references for human DNA. • But how can we use the reference genomes?

  7. Main Question • Objective: design algorithms to impute the missing genotypes of the individuals being studied • Criteria for algorithms: scalable, accurate

  8. Big Picture on Algorithm Design • Input: SNPs in study, reference haplotypes/genotypes • Algorithms • Output: imputed genotypes, associated confidence • In theory, it makes sense: scalability, accuracy • In practice, it works: 1. experimental validation 2. application

  9. IMPUTE

  10. Notations and Setting • Reference haplotypes: N × L • Genotypes in the study sample: K × L • (Remark: genotype coding 0 = 00, 1 = 01, 2 = 11)

  11. Formulation • Observed genotype and missing genotype • Classical inference problem: • A reasonable estimate: • Confidence:
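
The formulas on this slide did not survive in the transcript. One plausible reading, assuming the usual posterior-inference setup, is sketched below. The subscripts "obs" and "mis" and the symbol \hat{g} are notation introduced here, not taken from the slides; H denotes the reference haplotypes and genotypes are coded 0, 1, 2.

```latex
\begin{align*}
\text{Inference problem:}\quad & P(G_{\mathrm{mis}} = g \mid G_{\mathrm{obs}}, H), \qquad g \in \{0, 1, 2\} \\
\text{A reasonable estimate:}\quad & \hat{g} = \arg\max_{g} P(G_{\mathrm{mis}} = g \mid G_{\mathrm{obs}}, H) \\
\text{Confidence:}\quad & \max_{g} P(G_{\mathrm{mis}} = g \mid G_{\mathrm{obs}}, H)
\end{align*}
```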

  12. Modeling (HMM): Relationship between (H, G) • Assumptions: • Study individuals are independent • The copying process, in which each study haplotype is a mosaic of reference haplotypes, is captured by a Hidden Markov Model • Mutations at different sites are conditionally independent given the copied haplotype

  13. Modeling (HMM): Relationship between (H, G) • Reference haplotypes: N × L • Study individual: (figure not preserved)

  14. Modeling (HMM): Relationship between (H, G) • (figure not preserved: N × L reference panel)

  15. Modeling (Transition Probability) • States • Transition • What is the intuition?

  16. Modeling: Relationship between Transition Probability and Recombination • Recombination process:

  17. Modeling: Relationship between Transition Probability and Recombination • Recombination process: • The more reference haplotypes, the longer the copy length • Copy length in our model depends on the genetic distance between SNPs • (figure: reference panel 1, reference panel 2, and a study individual, with a region marked "more likely to have longer copy length here")

  18. Modeling (Transition Probability) • States • Transition
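
The transition formulas are missing from the transcript. As a hedged reconstruction, assuming the Li and Stephens style copying model on which the published IMPUTE method is based, each of the two copied haplotypes switches independently between sites l and l+1. The symbols Z_l, \rho_l, r_l and N_e (effective population size, matching the `-Ne` option in the demo command) are assumptions about the slide's notation.

```latex
\begin{align*}
\rho_l &= 4 N_e r_l \quad (r_l:\ \text{genetic distance between sites } l \text{ and } l+1), \\
P(Z_{l+1} = j' \mid Z_l = j) &= e^{-\rho_l / N}\,\mathbf{1}[j' = j] + \frac{1 - e^{-\rho_l / N}}{N}, \\
P\big((j', k') \mid (j, k)\big) &= P(j' \mid j)\, P(k' \mid k).
\end{align*}
```

The intuition under this assumption: with probability e^{-\rho_l / N} there is no recombination and the same reference haplotype keeps being copied; otherwise a new haplotype is chosen uniformly among the N references.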

  19. Modeling (Emission Probability) • Emission probability • Define the mutation rate: • Since mutations are assumed independent across sites
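
The emission formula is not preserved either. The sketch below assumes the form commonly given for the IMPUTE model: a per-allele mutation probability lambda = theta / (2 (theta + N)) with theta = 1 / (1 + 1/2 + ... + 1/(N-1)), applied independently to each of the two copied alleles. The function names are hypothetical and chosen only for illustration.

```python
import numpy as np

def mutation_rate(n_ref: int) -> float:
    """Assumed form: lambda = theta / (2 * (theta + N)), theta = 1 / sum_{i=1}^{N-1} 1/i.

    Requires n_ref >= 2 so that the harmonic sum is non-empty.
    """
    theta = 1.0 / sum(1.0 / i for i in range(1, n_ref))
    return theta / (2.0 * (theta + n_ref))

def emission_table(lam: float) -> np.ndarray:
    """table[a, g] = P(observed genotype g | the two copied alleles sum to a).

    Each copied allele flips independently with probability lam.
    """
    keep, flip = 1.0 - lam, lam
    return np.array([
        [keep * keep, 2 * keep * flip, flip * flip],            # copied 0/0
        [keep * flip, keep * keep + flip * flip, keep * flip],  # copied 0/1
        [flip * flip, 2 * keep * flip, keep * keep],            # copied 1/1
    ])
```

Each row of the table is a probability distribution over the genotype codes 0, 1, 2 from the notation slide.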

  20. Extension (completely missing) • Problem: • A genotype is missing across all references and study samples. How do we impute it? • What can we expect? • Can we generate information from no information? • We cannot expect to know the genotype • But we can guess the relationship between them • Our friend, population genetics, may help!

  21. Imputation on Reference • Illustration 0 0 1 0 1

  22. Imputation on Reference • Algorithm: 1. Randomly select an ordering 2. Sample the first mutation according to 3. Treat previous as references and impute 4. Repeat several times to get a stable output 5. Use the imputed reference to impute the study

  23. Computational Complexity: Imputation • O(N^2 L) for each individual

  24. Computational Complexity: Imputation • O(N^2 L) for each individual

  25. Computational Complexity: Forward-Backward Algorithm • Forward equations: • Naïve application takes O(N^4)

  26. Computational Complexity: Forward-Backward Algorithm • Q: How to compute the following in O(N^2)? • A: (suggested in fastPHASE)

  27. Computational Complexity: Forward-Backward Algorithm • Finally, we have (formula not preserved; term-by-term cost annotations: O(N^2); O(N^2) in total; O(N) for each j, O(N^2) in total; O(N) for each i, O(N^2) in total) • Similarly for the backward part
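
To make the O(N^4) to O(N^2) reduction concrete, here is a minimal sketch of one forward step for a diploid copying HMM. It assumes the two copied haplotypes switch independently with a "stay or jump uniformly" transition, so the sum over all state pairs collapses into row sums, column sums and a grand total. The function names and the parameter `rho` are hypothetical; the slide's own equations are not in the transcript.

```python
import numpy as np

def forward_step_fast(alpha: np.ndarray, emit: np.ndarray, rho: float) -> np.ndarray:
    """One forward update in O(N^2). alpha[i, j] is the forward value for state (i, j)."""
    n = alpha.shape[0]
    stay = np.exp(-rho / n)        # probability of no switch for one haplotype
    jump = (1.0 - stay) / n        # probability of switching to one given haplotype
    row = alpha.sum(axis=1)        # row[i] = sum_j alpha[i, j], O(N) for each i
    col = alpha.sum(axis=0)        # col[j] = sum_i alpha[i, j], O(N) for each j
    total = row.sum()              # sum over all states
    pred = (stay ** 2) * alpha \
        + stay * jump * (row[:, None] + col[None, :]) \
        + (jump ** 2) * total
    return emit * pred

def forward_step_naive(alpha: np.ndarray, emit: np.ndarray, rho: float) -> np.ndarray:
    """Reference version that sums over every pair of states, O(N^4)."""
    n = alpha.shape[0]
    stay = np.exp(-rho / n)
    jump = (1.0 - stay) / n
    t = stay * np.eye(n) + jump    # single-haplotype transition matrix
    pred = np.einsum("ij,ik,jl->kl", alpha, t, t)
    return emit * pred
```

Both functions return the same matrix; only the amount of work differs, which is why the per-individual cost drops to O(N^2 L) over L sites.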

  28. Demo • ./impute -h example/haplo.txt -l example/legend.txt -g example/geno.txt -m example/map.txt -s example/strand.txt -Ne 11400 -int 62000000 63000000

  29. Demo

  30. IMPUTE2

  31. Motivation • Accuracy: • Not all information is used during imputation (e.g. other study individuals) • Complexity: • Need to scale well if we incorporate all information (e.g. previously it was O(L N^2)) • New data type: • Diploid reference (1000 Genomes Project) • Q: How to design algorithms to handle this?

  32. Description of Setting (Scenario A) • Reference haplotypes: N_hap × L • Genotypes in the inference panel: N_inf × L • Sets of SNP indices: T, U • (Remark: genotype coding 0 = 00, 1 = 01, 2 = 11)

  33. Description of Setting (Scenario B) • Reference haplotypes: N_hap × L • Diploid reference panel: N_dip × L • Inference panel: N_inf × L • Sets of SNP indices: T, U1, U2 • (Remark: genotype coding 0 = 00, 1 = 01, 2 = 11)

  34. Algorithm for Scenario A • Illustration:

  35. Algorithm for Scenario A • Illustration (Burn in)

  36. Algorithm for Scenario A • Illustration (Phasing) Update i (genotype) (1) (0) (1)

  37. Algorithm for Scenario A • Illustration (Imputing) Update i (genotype) (1) (0) (1)

  38. Phasing Step: Path Sampling • How to sample path? … …
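
The slide's answer is not preserved. A standard way to draw a hidden path from an HMM posterior is forward filtering followed by backward sampling, sketched below for a generic HMM. This is an assumption about the technique rather than the presenter's exact procedure, and all names are hypothetical.

```python
import numpy as np

def sample_path(init, trans, like, rng):
    """Draw one state path from P(path | observations).

    init:  (S,) initial state probabilities
    trans: (L-1, S, S), trans[l][a, b] = P(state b at site l+1 | state a at site l)
    like:  (L, S), like[l, s] = P(observation at site l | state s)
    rng:   numpy.random.Generator
    """
    n_sites, n_states = like.shape
    alpha = np.empty((n_sites, n_states))
    # Forward pass with per-site normalisation to avoid underflow.
    alpha[0] = init * like[0]
    alpha[0] /= alpha[0].sum()
    for l in range(1, n_sites):
        alpha[l] = like[l] * (alpha[l - 1] @ trans[l - 1])
        alpha[l] /= alpha[l].sum()
    # Backward pass: sample the last state, then each earlier state
    # conditional on the state already chosen to its right.
    path = np.empty(n_sites, dtype=int)
    path[-1] = rng.choice(n_states, p=alpha[-1])
    for l in range(n_sites - 2, -1, -1):
        w = alpha[l] * trans[l][:, path[l + 1]]
        path[l] = rng.choice(n_states, p=w / w.sum())
    return path
```

In the phasing step, a state would correspond to a pair of copied haplotypes, and the sampled path yields one phased haplotype pair for the individual being updated.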

  39. Imputation Step: Extract Posterior Probability • After many rounds, we can get: • For each individual and for each missing site • Assuming independence in sampling the haplotype pair • Then take the average
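
As an illustration of the averaging step, the sketch below assumes that each MCMC iteration yields, for one individual at one missing site, the probability that each of the two sampled haplotypes carries allele 1. Under the independence assumption the genotype probabilities are products of the two, and the final posterior is the average over iterations. The function name and input layout are hypothetical.

```python
import numpy as np

def genotype_posterior(p1, p2):
    """Average genotype probabilities over MCMC iterations.

    p1, p2: sequences of length m; p1[t] is P(allele = 1) on the first
            haplotype at iteration t, p2[t] likewise for the second.
    Returns [P(G=0), P(G=1), P(G=2)].
    """
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    per_iteration = np.stack(
        [(1 - p1) * (1 - p2),            # both haplotypes carry allele 0
         p1 * (1 - p2) + (1 - p1) * p2,  # exactly one carries allele 1
         p1 * p2],                       # both carry allele 1
        axis=1,
    )
    return per_iteration.mean(axis=0)
```

The largest of the three averaged values can serve as the confidence attached to the imputed genotype.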

  40. Algorithm for Scenario A: Complexity Analysis • A) Burn-in phase • B) MCMC iterations, m times: • For each individual i • i) phase(i, T, hap+inf): O((N_hap + N_inf)^2 L_T) • ii) impute(i, T+U, hap): O(N_hap L_{T+U}) • iii) record(posterior probability): O(L_{T+U}) • C) Average over different runs of MCMC to get the genotype and confidence
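
To see how the three per-individual costs on this slide compare, the sketch below plugs panel sizes into the stated orders of growth. It assumes L_{T+U} means the number of SNPs in T plus the number in U, ignores constant factors, and uses a hypothetical function name.

```python
def scenario_a_cost(n_hap: int, n_inf: int, l_t: int, l_u: int) -> dict:
    """Operation counts per individual per MCMC iteration (constants ignored)."""
    return {
        "phase": (n_hap + n_inf) ** 2 * l_t,  # quadratic in panel size, T sites only
        "impute": n_hap * (l_t + l_u),        # linear in reference size, all sites
        "record": l_t + l_u,                  # one update per site
    }
```

For example, with made-up sizes `scenario_a_cost(100, 1000, 10, 90)` the quadratic phasing term is confined to the 10 sites in T, while the 90 sites in U are only touched by the linear imputation step.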

  41. Benefits of the Algorithm • Faster: • Reducing the load in the imputation step • More accurate: • Utilize information available to guess

  42. Algorithm for Scenario B • Illustration (panels: N_hap, N_dip, N_inf; SNP sets: T, U1, U2)

  43. Algorithm for Scenario B • Illustration (burn-in)

  44. Algorithm for Scenario B • Illustration (phase T and U2 in the diploid reference): update individual i in the N_dip panel

  45. Algorithm for Scenario B • Illustration (impute U1 in the diploid reference): update individual i in the N_dip panel

  46. Algorithm for Scenario B • Illustration (phase T in the inference panel): update individual i in the N_inf panel

  47. Algorithm for Scenario B • Illustration (impute U2 in the inference panel): update individual i in the N_inf panel

  48. Algorithm for Scenario B • Illustration (impute U1 in the inference panel): update individual i in the N_inf panel

  49. Algorithm for Scenario B: Complexity Analysis • A) Burn-in phase • B) MCMC iterations, m times: • For each individual i in dip: • i) phase(i, T+U2, hap+dip): O((N_hap + N_inf)^2 L_{T+U2}) • ii) impute(i, T+U1, hap): O(N_hap L_{T+U1}) • iii) record(posterior probability): O(L_{T+U1}) • For each individual i in inference: • i) phase(i, T, hap+dip+inf): O((N_hap + N_dip + N_inf)^2 L_T) • ii) impute(i, T+U2, hap+dip): O(N_{hap+dip} L_{T+U2}) • iii) impute(i, U1, hap): O(N_hap L_{U1}) • iv) record(posterior probability): O(L_{T+U1+U2}) • C) Average over different runs of MCMC to get the genotype and confidence

  50. Benefits of the Algorithm • Able to handle new data type • Faster and more accurate
