Genotype Phasing and Imputation in 1x Sequencing Data. Warren W. Kretzschmar. DPhil Genomic Medicine and Statistics Wellcome Trust Centre for Human Genetics, Oxford , UK Supervisor: Jonathan Marchini. Major Depression.
Warren W. Kretzschmar
DPhil Genomic Medicine and Statistics
Wellcome Trust Centre for Human Genetics, Oxford, UK
Supervisor: Jonathan Marchini
DALY : Disability adjusted life year
: number of years lost due to ill-health, disability or early death
Major Depressive Disorder Working Group of the Psychiatric GWAS Consortium (2012). A mega-analysis of genome-wide association studies for major depressive disorder. Molecular Psychiatry18.4:497-511.
(China, Oxford and VCU Experimental Research on Genetic Epidemiology)
59 hospitals, 45 cities, 21 provinces.
Genetically Homogeneous : All subjects are female and their grandparents are Han Chinese
6,000 cases : typically severe affected: 85% qualify for a diagnosis of melancholia by DSM-IV. >25% reported a family history of MD in one or more first-degree relatives
6,000 controls : patients undergoing minor surgical procedures.
Extensive Phenotyping: primary disorder of major depression, common comorbid disorders (e.g. generalized anxiety disorder, panic disorder), within disorder symptoms (e.g. suicidal ideation), disorder subtypes (e.g. melancholia, dysthymia), possible endophenotypes (e.g. neuroticism) and a range of risk factors (e.g. child abuse, stressful life events, social and marital relationships, parenting, post-natal depression, demographics).
Sequencing : mean depth 1.7X using llluminaHiSeqat Beijing Genomics Institute
Current status Sequencing finished. We have data on 12,000 samples. For now we have only considered ~13M sites polymorphic 1000 Genomes Asian samples. Analysis ongoing…
Phase 1: genotype likelihood estimation
One sample at a time
Phase 2: phasing and imputation
All samples together
2.7 CPU years
Base quality GATK
5 CPU years
Genotype likelihood SNPTools
4.6 CPU years
Example SNP chip data
Unphased: G/G A/T A/A T/T G/T A/T T/T A/A G/G G/C
Hap 1: G A A T T T T A G C
Hap 2: G T A T G A T A G G
J Marchiniand B Howie. Nature Rev. Genet. 2010
Genotype likelihoods (aka GL) are defined on a site by site basis.
GLs are conditional probabilities.
Genotype Likelihood = Pr( R | G )
R = Reads; also known as the “observed data”
G = Genotype; usually one of ref/ref, ref/alt, alt/alt
Genotype likelihoods allow us to quantify how much the reads support each possible genotype independent of other information.
To determine the most likely genotype call, we need a genotype probability.
Genotype Probability = Pr ( G | R ) proportional to Pr( R | G ) * Pr( G )
Pr( G ) = prior probability of G.
May be determined through haplotype phasing and imputation approaches.
Genotype Likelihood Creation with SNPTools
Pr(R|G = ref/ref) = 0.06
Pr(R|G = alt/alt) = 10e-6
Pr(R|G = ref/alt) = 10e-3
Y Wang, J Lu, J Yu, RA Gibbs, FL Yu. Genome Research. 2013
Hap 1: G A A T T A C A G G
Hap 2: G T A T T A T A G G
Hap 3: G T A T G A C A G G
Hap 4: G T A T G A T A G C
Example GL data
Pr(ref/ref): G/G A/AA/A T/T G/G A/A T/T A/A G/G G/G
Pr(ref/alt): G/AA/TA/GT/A G/T A/T T/C A/G G/C G/C
Pr(alt/alt): A/A T/TG/GA/A T/TT/TC/C G/G C/C C/C
Plausible Haplotypes after Phasing
Hap 5: G A A T T A T A G C
Hap 6: G T A T T A T A G G
A Bioinformatician’s Best Practices
according to Nick Loman & Mick Watson. Nature Biotechnology. 2013
see also: W. S. Noble. PLoS Computational Biology. 2009
according to W. S. Noble. PLoS Computational Biology. 2009