Imputation 2 Presenter: Ka-Kit Lam
Outline • Big Picture and Motivation • IMPUTE • IMPUTE2 • Experiments • Conclusion and Discussion • Supplementary: GWAS; estimate of the mutation rate
Background • Genome-wide association study: • Identify common genetic factors that influence health/disease
Background • Important to know the SNPs • However, not all SNPs are genotyped for all individuals in the case-control study of a GWAS. • How can we guess the missing parts (marked "?" in the figure)? Individual 1: ACCCAATTACCAGTATTTA… Individual 2: CCCCATTTACCACTATTTA… Individual 3: ACCCATTTACCACTATTTA… Individual 4: CCCCATTTACCAGTATTTA…
Information known • Luckily, we now have references for human DNA: • But, how can we use the reference genomes?
Main Question • Objective: • Design algorithms • to impute the missing genotypes of the individuals being studied • Criteria for algorithms • Scalable • Accurate
Big Picture on Algorithm Design • Input: SNPs in study, reference haplotypes/genotypes • Algorithms • Output: imputed genotypes, associated confidence • In theory, it makes sense: scalability, accuracy • In practice, it works: 1. Experimental validation 2. Application
Notations and Setting • Reference haplotypes $H$: $N \times L$ • Genotypes in the study sample $G$: $K \times L$ • Genotype coding: 0 = 00, 1 = 01, 2 = 11
Formulation • Observed genotype and missing genotype • Classical inference problem: • A reasonable estimate: • Confidence:
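The formulas on this slide did not survive extraction. One plausible reading, written in assumed notation ($G_O$ for the observed genotypes, $G_M$ for the missing genotypes, $H$ for the reference haplotypes; none of these subscripts is confirmed by the slide), is the usual posterior-based formulation:

```latex
% Assumed notation: G_O = observed genotypes, G_M = missing genotypes,
% H = reference haplotypes. A sketch, not the slide's original formulas.
\begin{align}
  \text{Inference problem:}\quad & P(G_M \mid G_O, H) \\
  \text{Estimate:}\quad & \hat{G}_M = \arg\max_{g \in \{0,1,2\}} P(G_M = g \mid G_O, H) \\
  \text{Confidence:}\quad & P(G_M = \hat{G}_M \mid G_O, H)
\end{align}
```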
Modeling (HMM model): Relationship between (H, G) • Assumptions: • Study individuals are independent • Each study haplotype is a mosaic copied from the reference haplotypes; the copying process is captured by a Hidden Markov Model • Mutations at different sites are conditionally independent given the copied haplotype
Modeling (HMM model): Relationship between (H, G) • Reference haplotypes: $N \times L$ • Study individual:
Modeling (Transition Probability) • States • Transition • What is the intuition?
Modeling: Relationship between Transition Probability and Recombination • Recombination process: • The more reference haplotypes, the longer the copy length • Copy length in our model depends on the genetic distance between SNPs • Figure labels: Ref panel 1, Ref panel 2, study individual; annotation: "More likely to have longer copy length here"
Modeling (Transition Probability) • States • Transition
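The transition formula itself is missing from the slide. A minimal sketch, assuming the Li and Stephens style copying model that IMPUTE-type methods are usually described with: the chance of staying on the same reference haplotype is $e^{-\rho/N}$ plus a uniform jump term, with $\rho = 4 N_e r$. The function name, argument names and the unit of `genetic_dist` are illustrative assumptions.

```python
import numpy as np

def transition_matrix(n_ref, genetic_dist, n_e=11400):
    """Haploid copying transitions between two adjacent SNPs (assumed form).

    n_ref: number of reference haplotypes N (the hidden states).
    genetic_dist: genetic distance between the two SNPs, assumed in Morgans.
    n_e: effective population size; 11400 mirrors the -Ne value in the demo.
    """
    rho = 4.0 * n_e * genetic_dist      # scaled recombination rate (assumption)
    stay = np.exp(-rho / n_ref)         # probability of no recombination event
    jump = (1.0 - stay) / n_ref         # after an event, copy any haplotype uniformly
    return jump * np.ones((n_ref, n_ref)) + stay * np.eye(n_ref)

# Same genetic distance, two reference panel sizes.
small = transition_matrix(n_ref=10, genetic_dist=1e-5)
large = transition_matrix(n_ref=100, genetic_dist=1e-5)
```

Under this form the diagonal of `large` exceeds the diagonal of `small`, which matches the intuition that a larger reference panel gives longer copied segments.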
Modeling (Emission Probability) • Emission probability • Define mutation rate: • Since mutation is assumed independent across sites
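The emission formulas are not recoverable from the slide. As a sketch under a stated assumption, suppose each of the two copied alleles is independently flipped with probability `lam` (the mutation rate); the genotype emission table then follows directly. The names `emission_probs`, `site_likelihood` and `lam` are illustrative, not taken from the slides.

```python
import numpy as np

def emission_probs(copied_sum, lam):
    """P(G = 0, 1, 2 | the two copied alleles sum to copied_sum).

    Assumes each copied allele mutates (flips) independently with probability lam.
    """
    keep, flip = 1.0 - lam, lam
    if copied_sum == 0:      # copied alleles (0, 0)
        return np.array([keep * keep, 2 * keep * flip, flip * flip])
    if copied_sum == 1:      # copied alleles (0, 1)
        return np.array([keep * flip, keep * keep + flip * flip, keep * flip])
    if copied_sum == 2:      # copied alleles (1, 1)
        return np.array([flip * flip, 2 * keep * flip, keep * keep])
    raise ValueError("copied_sum must be 0, 1 or 2")

def site_likelihood(genotypes, copied_sums, lam):
    """Mutation is independent across sites, so per-site terms multiply."""
    terms = [emission_probs(c, lam)[g] for g, c in zip(genotypes, copied_sums)]
    return float(np.prod(terms))
```

The product in `site_likelihood` is the code counterpart of the independence-across-sites assumption on the slide.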
Extension (completely missing) • Problem: • Genotypes missing across all references and study samples. How to impute? • What can we expect? • Generate information from no information? • We cannot expect to know the genotype • But we can guess the relationship between them • Our friend: population genetics may help!
Imputation on Reference • Illustration 0 0 1 0 1
Imputation on Reference Algorithm: 1. Randomly select an ordering 2. Sample the first mutation according to 3. Treat the previous haplotypes as references and impute 4. Repeat several times to get a stable output 5. Use the imputed reference to impute the study
Computational Complexity: Imputation • $O(N^2 L)$ for each individual
Computational Complexity: Forward-Backward Algorithm • Forward equations: • Naïve application takes $O(N^4)$
Computational Complexity: Forward-Backward Algorithm • Q: How to compute the following in $O(N^2)$? • A: (suggested in fastPHASE)
Computational Complexity: Forward-Backward Algorithm • Finally, we have a sum of terms, annotated with their costs: $O(N^2)$; $O(N^2)$ in total; $O(N)$ for each $j$, so $O(N^2)$ in total; $O(N)$ for each $i$, so $O(N^2)$ in total • Similarly for the backward part
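The reduction from $O(N^4)$ to $O(N^2)$ can be sketched as follows, assuming the hidden state is a pair of reference haplotypes $(i, j)$ and that each haplotype either stays (probability `stay` plus `jump`) or jumps uniformly (probability `jump` per target). That transition form, and all names below, are assumptions for illustration; the slide's own equations were lost. Expanding the product of the two transitions leaves only row sums, column sums and a grand total of the previous forward table.

```python
import numpy as np

def forward_step_naive(alpha, emit, stay, jump):
    """O(N^4): sum over every previous pair (i, j) for every new pair (a, b)."""
    n = alpha.shape[0]
    trans = jump * np.ones((n, n)) + stay * np.eye(n)   # trans[i, a] = P(i -> a)
    new = np.einsum("ij,ia,jb->ab", alpha, trans, trans)
    return emit * new

def forward_step_fast(alpha, emit, stay, jump):
    """O(N^2): reuse marginal sums instead of summing over all previous pairs."""
    row = alpha.sum(axis=1)      # row[a] = sum_j alpha[a, j], O(N) each, O(N^2) total
    col = alpha.sum(axis=0)      # col[b] = sum_i alpha[i, b], O(N) each, O(N^2) total
    total = row.sum()            # sum over all pairs
    new = (stay * stay * alpha
           + stay * jump * (row[:, None] + col[None, :])
           + jump * jump * total)
    return emit * new

# Small random check that both versions agree.
rng = np.random.default_rng(0)
n = 6
alpha = rng.random((n, n))
emit = rng.random((n, n))
stay, jump = 0.9, 0.1 / n
fast = forward_step_fast(alpha, emit, stay, jump)
naive = forward_step_naive(alpha, emit, stay, jump)
```

The backward recursion admits the same expansion, with the emission term folded into the table before the sums are taken.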
Demo • ./impute -h example/haplo.txt -l example/legend.txt -g example/geno.txt -m example/map.txt -s example/strand.txt -Ne 11400 -int 62000000 63000000
Motivation • Accuracy: • Not all information is used during imputation (e.g. other study individuals) • Complexity: • Need to scale well if we incorporate all information (e.g. previously it was $O(LN^2)$) • New data type: • Diploid reference (1000 Genomes Project) • Q: How to design algorithms to handle this?
Description of Setting (Scenario A) • Reference haplotypes: $N_{hap} \times L$ • Genotypes in the inference panel: $N_{inf} \times L$ • $T$, $U$: sets of SNP indices • Genotype coding: 0 = 00, 1 = 01, 2 = 11
Description of Setting (Scenario B) • Reference haplotypes: $N_{hap} \times L$ • Diploid reference panel: $N_{dip} \times L$ • Inference panel: $N_{inf} \times L$ • $T$, $U_1$, $U_2$: sets of SNP indices • Genotype coding: 0 = 00, 1 = 01, 2 = 11
Algorithm for Scenario A • Illustration: • Burn in • Phasing: update individual i (genotype) • Imputing: update individual i (genotype)
Phasing Step: Path Sampling • How to sample a path?
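The slide does not show how the path is drawn. One standard option for sampling a hidden path from an HMM posterior is forward filtering followed by backward sampling; the sketch below assumes that technique and a generic state space (in the phasing step a state would be a pair of copied haplotypes). The function name and the array layout are illustrative assumptions.

```python
import numpy as np

def sample_path(init, trans, emit, rng):
    """Draw one hidden path from the HMM posterior.

    init:  (S,) initial state probabilities.
    trans: (L-1, S, S) transition matrices between adjacent sites.
    emit:  (L, S) likelihood of the observed data at each site for each state.
    """
    n_sites, n_states = emit.shape
    fwd = np.zeros((n_sites, n_states))
    fwd[0] = init * emit[0]
    fwd[0] /= fwd[0].sum()
    for l in range(1, n_sites):
        fwd[l] = (fwd[l - 1] @ trans[l - 1]) * emit[l]
        fwd[l] /= fwd[l].sum()                  # rescale to avoid underflow
    path = np.zeros(n_sites, dtype=int)
    path[-1] = rng.choice(n_states, p=fwd[-1])  # last state from the filtered distribution
    for l in range(n_sites - 2, -1, -1):
        # Weight each state by its forward mass and its transition into the next sampled state.
        w = fwd[l] * trans[l][:, path[l + 1]]
        path[l] = rng.choice(n_states, p=w / w.sum())
    return path
```

Each call returns one sampled path, so repeated calls inside the MCMC loop give different phasings in proportion to their posterior probability.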
Imputation Step: Extract Posterior Probability • After many rounds, we get a posterior for each individual and each missing site • Assuming independence in sampling the haplotype pair • Then take the average over rounds
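A minimal sketch of this step, assuming each MCMC round yields, for one individual and one missing site, the probability of allele 1 on each of the two sampled haplotypes (`p1`, `p2`; hypothetical names). Independence turns the pair into genotype probabilities, and averaging over rounds gives the final posterior and its confidence.

```python
import numpy as np

def genotype_probs(p1, p2):
    """Genotype probabilities (0, 1, 2) from two independent haplotype allele probabilities."""
    return np.array([(1 - p1) * (1 - p2),
                     p1 * (1 - p2) + (1 - p1) * p2,
                     p1 * p2])

def average_posterior(per_round):
    """per_round: list of (p1, p2) pairs, one per MCMC round."""
    probs = np.mean([genotype_probs(p1, p2) for p1, p2 in per_round], axis=0)
    best = int(np.argmax(probs))     # imputed genotype
    return best, probs               # probs[best] acts as the confidence

# Illustrative values only, not data from the slides.
best, probs = average_posterior([(0.9, 0.1), (0.8, 0.2), (0.95, 0.05)])
```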
Algorithm for Scenario A: Complexity Analysis • A) Burn-in phase • B) MCMC iterations, m times: • For each individual i: • i) phase(i,T,hap+inf): $O((N_{hap} + N_{inf})^2 L_T)$ • ii) impute(i,T+U,hap): $O(N_{hap} L_{T+U})$ • iii) record(posterior probability): $O(L_{T+U})$ • C) Average over different runs of MCMC to get the genotype and confidence
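To make the per-individual costs concrete, the sketch below plugs numbers into the three big-O terms, ignoring constants. The panel sizes are made up for illustration, and the comparison function assumes that a "use everything, quadratic everywhere" alternative would cost $(N_{hap} + N_{inf})^2 (L_T + L_U)$; that reading is an assumption, not a figure from the slides.

```python
def scenario_a_cost(n_hap, n_inf, l_t, l_u):
    """Operation counts per individual per MCMC iteration, constants ignored."""
    phase = (n_hap + n_inf) ** 2 * l_t      # quadratic, but only over typed SNPs T
    impute = n_hap * (l_t + l_u)            # linear in the haploid reference size
    record = l_t + l_u
    return {"phase": phase, "impute": impute, "record": record,
            "total": phase + impute + record}

def quadratic_everywhere_cost(n_hap, n_inf, l_t, l_u):
    """Assumed alternative: quadratic HMM over all haplotypes at all SNPs."""
    return (n_hap + n_inf) ** 2 * (l_t + l_u)

# Hypothetical sizes chosen only to show the arithmetic.
cost = scenario_a_cost(n_hap=120, n_inf=1000, l_t=500, l_u=2000)
baseline = quadratic_everywhere_cost(n_hap=120, n_inf=1000, l_t=500, l_u=2000)
```

The phasing term dominates `cost["total"]`, and it stays below `baseline` because the quadratic work is confined to the typed SNPs.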
Benefits of the Algorithm • Faster: • Reduces the load in the imputation step • More accurate: • Uses all available information (e.g. other study individuals) when phasing
Algorithm for Scenario B • Illustration (panels: $N_{hap}$, $N_{dip}$, $N_{inf}$; SNP sets: $T$, $U_1$, $U_2$): • Burn in • Phase T and U2 in diploid ref (update individual i) • Impute U1 in diploid ref (update individual i) • Phase T in inference panel (update individual i) • Impute U2 in inference panel (update individual i) • Impute U1 in inference panel (update individual i)
Algorithm for Scenario B: Complexity Analysis • A) Burn-in phase • B) MCMC iterations, m times: • For each individual i in dip: • i) phase(i,T+U2,hap+dip): $O((N_{hap} + N_{dip})^2 L_{T+U_2})$ • ii) impute(i,T+U1,hap): $O(N_{hap} L_{T+U_1})$ • iii) record(posterior probability): $O(L_{T+U_1})$ • For each individual i in inference: • i) phase(i,T,hap+dip+inf): $O((N_{hap} + N_{dip} + N_{inf})^2 L_T)$ • ii) impute(i,T+U2,hap+dip): $O(N_{hap+dip} L_{T+U_2})$ • iii) impute(i,U1,hap): $O(N_{hap} L_{U_1})$ • iv) record(posterior probability): $O(L_{T+U_1+U_2})$ • C) Average over different runs of MCMC to get the genotype and confidence
Benefits of the Algorithm • Able to handle new data type • Faster and more accurate