Genotype phasing and imputation in 1x sequencing data


Warren W. Kretzschmar

DPhil Genomic Medicine and Statistics

Wellcome Trust Centre for Human Genetics, Oxford, UK

Supervisor: Jonathan Marchini


Major Depression

  • The most common psychiatric disorder and the second-ranking cause of morbidity worldwide.

  • Affects 1 in 10 people in their lifetime.

  • Heritability estimates range from 30% to 40%.


Top Ten Causes of DALYs

DALY: disability-adjusted life year, the number of years lost due to ill health, disability, or early death.


Genetics of Major Depression

Major Depressive Disorder Working Group of the Psychiatric GWAS Consortium (2012). A mega-analysis of genome-wide association studies for major depressive disorder. Molecular Psychiatry 18(4):497-511.

  • Study Design

    • Unrelated Europeans

    • 9240 cases

    • 9519 controls

    • 1.2 million SNPs

  • Hypotheses

    • Depression has heterogeneous environmental and genetic causes

    • Depression is a complex trait with genetic components of small effect size


CONVERGE

(China, Oxford and VCU Experimental Research on Genetic Epidemiology)

59 hospitals, 45 cities, 21 provinces.

Genetically homogeneous: all subjects are female and their grandparents are Han Chinese.

6,000 cases: typically severely affected. 85% qualify for a diagnosis of melancholia by DSM-IV, and >25% reported a family history of MD in one or more first-degree relatives.

6,000 controls: patients undergoing minor surgical procedures.

Extensive phenotyping: primary disorder of major depression, common comorbid disorders (e.g. generalized anxiety disorder, panic disorder), within-disorder symptoms (e.g. suicidal ideation), disorder subtypes (e.g. melancholia, dysthymia), possible endophenotypes (e.g. neuroticism) and a range of risk factors (e.g. child abuse, stressful life events, social and marital relationships, parenting, post-natal depression, demographics).

Sequencing: mean depth 1.7X using Illumina HiSeq at the Beijing Genomics Institute.

Current status: sequencing is finished. We have data on 12,000 samples. For now we have only considered the ~13M sites that are polymorphic in the 1000 Genomes Asian samples. Analysis is ongoing.


Sequence analysis pipeline

Phase 1: genotype likelihood estimation (one sample at a time)

  • Input: raw reads

  • Mapping (Stampy)

  • Duplicate marking (Picard)

  • Base quality recalibration (GATK)

  • Genotype likelihood estimation (SNPTools)

  • Output: genotype likelihoods

Phase 2: phasing and imputation (all samples together). My focus!

  • Input: genotype likelihoods

  • Phasing and imputation

  • Output: genotype probabilities

Data sizes and compute costs labelled on the diagram: 48 TB, 350 GB, 650 GB; 2.7, 5 and 4.6 CPU years.


Genotype Phasing and Imputation


Genotype Phasing

Example SNP chip data

Unphased: G/G A/T A/A T/T G/T A/T T/T A/A G/G G/C

After Phasing

Hap 1: G A A T T T T A G C

Hap 2: G T A T G A T A G G

Phase-informative Sites
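
The example above can be checked mechanically. The following is a minimal sketch, not part of the original slides: it verifies that the two haplotypes are consistent with the unphased genotypes and lists the heterozygous sites, which are the only sites where phase is informative. The function names `is_consistent` and `heterozygous_sites` are hypothetical.

```python
# Data copied from the slide; sites are numbered from 1.
unphased = "G/G A/T A/A T/T G/T A/T T/T A/A G/G G/C".split()
hap1 = "G A A T T T T A G C".split()
hap2 = "G T A T G A T A G G".split()


def is_consistent(genotypes, hap_a, hap_b):
    """True if, at every site, the two haplotype alleles make up the genotype."""
    return all(
        sorted(g.split("/")) == sorted([a, b])
        for g, a, b in zip(genotypes, hap_a, hap_b)
    )


def heterozygous_sites(genotypes):
    """1-based positions of sites whose two alleles differ."""
    return [
        i + 1
        for i, g in enumerate(genotypes)
        if len(set(g.split("/"))) == 2
    ]


print(is_consistent(unphased, hap1, hap2))  # True
print(heterozygous_sites(unphased))         # [2, 5, 6, 10]
```

Homozygous sites carry the same allele on both haplotypes, so only the four heterozygous sites constrain which alleles travel together.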


Genotype Imputation from Haplotypes

J Marchini and B Howie. Nature Rev. Genet. 2010


Genotype Likelihoods


What is a Genotype Likelihood?

Genotype likelihoods (aka GL) are defined on a site by site basis.

GLs are conditional probabilities.

Genotype Likelihood = Pr( R | G )

R = Reads; also known as the “observed data”

G = Genotype; usually one of ref/ref, ref/alt, alt/alt
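
To make Pr( R | G ) concrete, here is a minimal sketch of a genotype likelihood at one biallelic site. It assumes a toy error model in which every read base is wrong with the same fixed probability and reads are independent; this is an illustrative assumption, not the model SNPTools uses. The function name `genotype_likelihoods` and the `error_rate` parameter are hypothetical.

```python
def genotype_likelihoods(reads, error_rate=0.01):
    """Return (Pr(R | ref/ref), Pr(R | ref/alt), Pr(R | alt/alt)).

    reads: list of observed alleles at one site, 0 = ref, 1 = alt.
    error_rate: assumed probability that a read base is miscalled.
    """
    likelihoods = []
    for n_alt in (0, 1, 2):  # number of alt alleles in the genotype
        # Probability that a single read shows the alt allele:
        # a true alt chromosome is sampled with probability n_alt / 2.
        p_alt = (n_alt / 2) * (1 - error_rate) + (1 - n_alt / 2) * error_rate
        lik = 1.0
        for allele in reads:
            lik *= p_alt if allele == 1 else 1 - p_alt
        likelihoods.append(lik)
    return tuple(likelihoods)


# A site covered by no reads, which is common at ~1x depth:
print(genotype_likelihoods([]))  # (1.0, 1.0, 1.0)
```

With no reads the three likelihoods are equal, so the reads alone say nothing about the genotype at that site.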


How are Genotype Likelihoods Useful?

Genotype likelihoods allow us to quantify how much the reads support each possible genotype independent of other information.

To determine the most likely genotype call, we need a genotype probability.

Genotype Probability = Pr( G | R ) ∝ Pr( R | G ) * Pr( G )

Pr( G ) = prior probability of G.

May be determined through haplotype phasing and imputation approaches.
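
The relationship above can be sketched in a few lines. This example uses a Hardy-Weinberg prior built from an alt allele frequency purely as a stand-in for Pr( G ); in the talk's setting the prior would come from phasing and imputation instead. The names `hwe_prior` and `genotype_probabilities` are hypothetical.

```python
def hwe_prior(alt_freq):
    """Assumed prior over (ref/ref, ref/alt, alt/alt) under Hardy-Weinberg."""
    q = alt_freq
    p = 1 - q
    return (p * p, 2 * p * q, q * q)


def genotype_probabilities(gls, prior):
    """Pr(G | R) for each genotype: normalize Pr(R | G) * Pr(G)."""
    unnormalized = [gl * pr for gl, pr in zip(gls, prior)]
    total = sum(unnormalized)
    if total == 0:
        raise ValueError("likelihoods and prior have no overlapping support")
    return [u / total for u in unnormalized]


# Uninformative reads (all GLs equal): the posterior is just the prior.
print(genotype_probabilities((1.0, 1.0, 1.0), hwe_prior(0.5)))  # [0.25, 0.5, 0.25]
```

When the reads are uninformative, the genotype call is driven entirely by the prior, which is why a good prior matters at low depth.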


Genotype Likelihood Creation with SNPTools

Three distributions

observed reads

Pr(R|G = ref/ref) = 0.06

Pr(R|G = alt/alt) = 10e-6

Pr(R|G = ref/alt) = 10e-3

Y Wang, J Lu, J Yu, RA Gibbs, FL Yu. Genome Research. 2013


Genotype Phasing using Genotype Likelihoods

Reference Haplotypes

Hap 1: G A A T T A C A G G

Hap 2: G T A T T A T A G G

Hap 3: G T A T G A C A G G

Hap 4: G T A T G A T A G C

Example GL data

Pr(ref/ref): G/G A/A A/A T/T G/G A/A T/T A/A G/G G/G

Pr(ref/alt): G/A A/T A/G T/A G/T A/T T/C A/G G/C G/C

Pr(alt/alt): A/A T/T G/G A/A T/T T/T C/C G/G C/C C/C

Plausible Haplotypes after Phasing

Hap 5: G A A T T A T A G C

Hap 6: G T A T T A T A G G


General MCMC Scheme for Phasing from GLs

  • When using GLs, haplotype estimation is currently done with an iterative Markov chain Monte Carlo (MCMC) scheme

  • Initialize haplotypes for each sample randomly

  • for a predetermined number of iterations

    • for each sample

      • Find a plausible haplotype pair using its GLs and all other haplotypes as a reference panel

      • Update that sample’s haplotypes with the plausible haplotype pair

  • Return each sample’s current pair of haplotypes
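
The scheme above can be sketched as a toy program. This is a simplified illustration under strong assumptions, not the method used in the talk: alleles are coded 0 = ref and 1 = alt, and "find a plausible haplotype pair" is reduced to sampling two whole haplotypes from the panel with probability proportional to the likelihood of the genotypes they imply. Real phasing tools instead model each haplotype as a mosaic of panel haplotypes. The names `pair_likelihood` and `phase_from_gls` are hypothetical.

```python
import itertools
import random


def pair_likelihood(gls, hap_a, hap_b):
    """Pr(reads | haplotype pair): product over sites of the implied genotype's GL."""
    lik = 1.0
    for site_gl, a, b in zip(gls, hap_a, hap_b):
        lik *= site_gl[a + b]  # a + b = number of alt alleles (0, 1 or 2)
    return lik


def phase_from_gls(all_gls, n_iterations=20, seed=0):
    """all_gls[i][s] = (GL ref/ref, GL ref/alt, GL alt/alt) for sample i, site s."""
    rng = random.Random(seed)
    n_sites = len(all_gls[0])
    # Initialize haplotypes for each sample randomly.
    haps = [
        [[rng.randint(0, 1) for _ in range(n_sites)] for _ in range(2)]
        for _ in all_gls
    ]
    for _ in range(n_iterations):
        for i, gls in enumerate(all_gls):
            # All other samples' current haplotypes form the reference panel.
            panel = [h for j, pair in enumerate(haps) if j != i for h in pair]
            candidates = list(itertools.combinations_with_replacement(panel, 2))
            weights = [pair_likelihood(gls, a, b) for a, b in candidates]
            if sum(weights) == 0:
                continue  # no candidate is compatible; keep the current pair
            a, b = rng.choices(candidates, weights=weights, k=1)[0]
            # Update this sample's haplotypes with the sampled pair.
            haps[i] = [list(a), list(b)]
    # Return each sample's current pair of haplotypes.
    return haps


# Three samples, two sites, with GLs favouring ref/ref then alt/alt.
example_gls = [[(0.9, 0.09, 0.01), (0.01, 0.09, 0.9)] for _ in range(3)]
print(phase_from_gls(example_gls, n_iterations=5, seed=1))
```

Because this toy version copies whole haplotypes, it can only ever propose haplotypes that already exist in the panel; the mosaic model used by real tools removes that limitation.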


The Tools/Languages I use


A Bioinformatician’s Best Practices

according to Nick Loman & Mick Watson. Nature Biotechnology. 2013

see also: W. S. Noble. PLoS Computational Biology. 2009

  • Understand your goals and choose appropriate methods

  • Be suspicious and trust nobody

    • Set traps for your own scripts and other people’s

  • Be a detective

  • You're a scientist, not a programmer

  • Use version control software

  • Pipelineitis is a nasty disease

  • An Obama frame of mind

  • Someone has already done this. Find them!


Good Directory Structure

according to W. S. Noble. PLoS Computational Biology. 2009


Thank you. Questions?

