1000 Genomes Project Haplotype Integration

1000 Genomes Project Haplotype Integration Androniki Menelaou University Medical Center Utrecht

Phase 1 integrated haplotypes • Haplotypes from 1,092 samples. • The official release can be found here: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/analysis_results/integrated_call_sets/ • It includes: • 38 million single nucleotide polymorphisms (SNPs) • 1.4 million short insertions and deletions (indels)

Phase 1 integrated haplotypes • Information on the samples: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/analysis_results/integrated_call_sets/integrated_call_samples.20101123.ped • Genome build: Build 37

Use of phased haplotypes • Inference of human demographic history • Inference of points of recombination • Understanding the interplay of genetic variation and disease • Imputation of untyped genetic variation

Human disease genetics • Genome-wide SNP microarray genotypes • Samples: cases and controls (usually > 1000) • Markers: SNPs (usually > 500,000 genome-wide) • [Figure: matrix of unphased genotypes, e.g. g/g a/c g/t g/a t/t a/t a/c] • Slide credit: Jonathan Marchini

Human disease genetics • Haplotypes are estimated using statistical methods • Samples: cases and controls (usually > 1000) • Markers: SNPs (usually > 500,000 genome-wide) • [Figure: matrix of phased haplotypes for the study samples] • Slide credit: Jonathan Marchini
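Why haplotypes have to be estimated can be seen from a toy case: an unphased genotype vector with several heterozygous sites is compatible with more than one pair of haplotypes. The sketch below is illustrative only; the function name and the input encoding (one two-letter string per SNP) are assumptions made for this example, and it is not how any phasing software works internally.

```python
from itertools import product


def compatible_haplotype_pairs(genotypes):
    """Enumerate the unordered haplotype pairs consistent with unphased genotypes.

    `genotypes` holds one 2-character string per SNP, e.g. "ac" for an a/c call.
    """
    het_sites = [i for i, g in enumerate(genotypes) if g[0] != g[1]]
    pairs = set()
    # Each heterozygous site can be assigned to the two haplotypes in two ways.
    for flips in product([False, True], repeat=len(het_sites)):
        hap1 = [g[0] for g in genotypes]
        hap2 = [g[1] for g in genotypes]
        for site, flip in zip(het_sites, flips):
            if flip:
                hap1[site], hap2[site] = hap2[site], hap1[site]
        # Sort within the pair so (A, B) and (B, A) count once.
        pairs.add(tuple(sorted(("".join(hap1), "".join(hap2)))))
    return sorted(pairs)


if __name__ == "__main__":
    # First three genotypes from the microarray slide: g/g a/c g/t
    for pair in compatible_haplotype_pairs(["gg", "ac", "gt"]):
        print(pair)
```

With k heterozygous sites there are 2^(k-1) distinct pairs (for k >= 1), so enumeration is hopeless genome-wide and statistical methods that borrow information across samples are used instead.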

Human disease genetics • Reference haplotypes via sequencing studies (1000 Genomes Project) • Study samples: cases and controls (usually > 1000) • [Figure: reference haplotype panel shown above the phased study haplotypes] • Slide credit: Jonathan Marchini

Human disease genetics • Reference haplotypes via sequencing studies: 1000 Genomes Project (~2,200 haplotypes) • Study samples: cases and controls (usually > 1000) • Imputation of unobserved alleles via matching of shared haplotypes • [Figure: reference haplotype panel shown above the phased study haplotypes] • Slide credit: Jonathan Marchini

Human disease genetics • Reference haplotypes via sequencing studies: 1000 Genomes Project (~2,200 haplotypes) • Study samples: cases and controls (usually > 1000) • Imputation of unobserved alleles via matching of shared haplotypes • [Figure: reference haplotype panel shown above the study haplotypes, now with the unobserved alleles filled in] • Slide credit: Jonathan Marchini
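The idea of "matching of shared haplotypes" can be illustrated with a deliberately naive sketch: pick the reference haplotype that agrees best with the study haplotype at the typed sites, then copy its alleles at the untyped sites. This is a simplification for intuition only. Tools such as BEAGLE, IMPUTE2 and MINIMAC use probabilistic models that copy from a mosaic of reference haplotypes, and the function name and toy data below are assumptions made for this example.

```python
def impute_from_best_match(study_hap, reference_haps):
    """Fill untyped positions (None) in study_hap from the closest reference haplotype."""
    typed = [i for i, allele in enumerate(study_hap) if allele is not None]

    def mismatches(ref):
        # Number of typed sites where the reference disagrees with the study haplotype.
        return sum(ref[i] != study_hap[i] for i in typed)

    best = min(reference_haps, key=mismatches)
    return [best[i] if allele is None else allele for i, allele in enumerate(study_hap)]


if __name__ == "__main__":
    # Toy reference panel (three haplotypes, eight sites).
    reference = [list("agagttga"), list("atgagacg"), list("ctgcgacg")]
    # Study haplotype typed at five of the eight sites; None marks untyped sites.
    study = ["a", None, "g", None, "g", "a", None, "g"]
    print("".join(impute_from_best_match(study, reference)))
```

A single best match ignores recombination and uncertainty, which is why real imputation output is reported as genotype probabilities or dosages rather than hard calls.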

Human disease genetics • GWAS of imputed genotypes • Increased power • Better resolution • Facilitates meta-analysis • [Figure: reference haplotype panel and imputed study haplotypes]

Imputation using the 1000 Genomes data • Samples are genotyped on a microarray (e.g. Affy500k, Illumina1M, etc.) • Quality control • Choose an imputation algorithm: • BEAGLE (http://faculty.washington.edu/browning/beagle/beagle.html) • IMPUTE2 (http://mathgen.stats.ox.ac.uk/impute/impute_v2.html) • MINIMAC (http://genome.sph.umich.edu/wiki/Minimac)

Imputation using the 1000 Genomes data • NOTE: Each of these imputation tools provides the 1000 Genomes haplotypes already converted to its required format (check their websites) • Impute the study samples using the 1000 Genomes haplotypes as the reference panel.

Imputation using the 1000 Genomes data • Example IMPUTE2 run:
./impute2 \
  -m ./Example/example.chr22.map \
  -h ./Example/example.chr22.1kG.haps \
  -l ./Example/example.chr22.1kG.legend \
  -g ./Example/example.chr22.study.gens \
  -strand_g ./Example/example.chr22.study.strand \
  -int 20.4e6 20.5e6 \
  -Ne 20000 \
  -o ./Example/example.chr22.one.phased.impute2
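The command above imputes a single interval given by -int. A whole region is usually covered by running one such command per interval; the sketch below only builds the command strings and does not run anything. The function name, the output naming scheme, the region bounds and the chunk size are assumptions made for this example, while the flags and input file names are the ones shown on the slide.

```python
def impute2_chunk_commands(start, end, chunk_size, prefix="./Example/example.chr22"):
    """Build one IMPUTE2 command string per interval covering [start, end]."""
    commands = []
    for lo in range(start, end + 1, chunk_size):
        hi = min(lo + chunk_size - 1, end)  # inclusive upper bound, no overlap
        commands.append(
            " ".join(
                [
                    "./impute2",
                    f"-m {prefix}.map",
                    f"-h {prefix}.1kG.haps",
                    f"-l {prefix}.1kG.legend",
                    f"-g {prefix}.study.gens",
                    f"-strand_g {prefix}.study.strand",
                    f"-int {lo} {hi}",
                    "-Ne 20000",
                    # Hypothetical output name: one file per interval.
                    f"-o {prefix}.{lo}_{hi}.impute2",
                ]
            )
        )
    return commands


if __name__ == "__main__":
    # Hypothetical region split into 100 kb intervals.
    for command in impute2_chunk_commands(20_400_000, 20_599_999, 100_000):
        print(command)
```

Suitable interval lengths depend on the tool and the data, so the IMPUTE2 documentation should be checked before choosing a chunk size.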

Extracting information from the haplotypes • Look up the allele frequency of a variant in the 1000 Genomes data • Focus on a specific set of samples (e.g. only the European samples) • Filter out some positions

Extracting information from the haplotypes • VCFtools: http://vcftools.sourceforge.net/index.html • A program package designed for working with VCF files • It can validate, merge and compare VCF files, and calculate some basic population genetic statistics.

Extracting information from the haplotypes • E.g. allele frequencies for chromosome 1:
./vcftools \
  --gzvcf ALL.chr1.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf.gz \
  --freq \
  --out chr1
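To make concrete what a per-site allele frequency is, and how restricting to a subset of samples changes it, here is a minimal sketch that counts alleles in the GT field of one VCF record. It is not how VCFtools is implemented; the function name, the sample names S1 to S3 and the example record are made up for illustration.

```python
import re


def allele_frequencies(header_line, record_line, keep_samples=None):
    """Return {allele: frequency} for one VCF record, optionally on a sample subset."""
    samples = header_line.rstrip("\n").split("\t")[9:]
    fields = record_line.rstrip("\n").split("\t")
    alleles = [fields[3]] + fields[4].split(",")  # REF followed by ALT alleles
    gt_index = fields[8].split(":").index("GT")
    counts = [0] * len(alleles)
    for name, entry in zip(samples, fields[9:]):
        if keep_samples is not None and name not in keep_samples:
            continue
        genotype = entry.split(":")[gt_index]
        # "|" separates phased alleles, "/" unphased ones; "." is a missing call.
        for code in re.split(r"[|/]", genotype):
            if code != ".":
                counts[int(code)] += 1
    total = sum(counts)
    return {a: c / total for a, c in zip(alleles, counts)} if total else {}


if __name__ == "__main__":
    header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3"
    record = "1\t10583\t.\tG\tA\t100\tPASS\t.\tGT\t0|0\t0|1\t1|1"
    print(allele_frequencies(header, record))
    print(allele_frequencies(header, record, keep_samples={"S1", "S2"}))
```

For the real files, the population of each sample would come from the sample information file listed earlier, and the resulting list of sample IDs would define the subset.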

From Phase 1 to Phase 3 • ~2,500 samples • Site detection: multiple methods are employed • Types of variants: additional variant types are to be included (e.g. STRs, multi-allelic variants) • Integrated haplotypes: haplotypes are to include SNPs, indels, complex variants and SVs

SNPs, indels, MNPs and multi-allelic variants • Callers, grouped on the slide as local assembly, alignment based, or global assembly: Freebayes, Haplotype Caller, Platypus, SNPTools, Unified Genotyper, samtools, RTG snp, GotCloud, SGA / DINDEL, Cortex

Summary • [Table not recovered; autosomes only, unfiltered set] • Slide credit: Adrian Tan, Hyun Min Kang, Goncalo Abecasis

Structural Variants • Variant classes: • deletions (26k, length: 204 bp to 100 kb) • bi-allelic tandem and dispersed duplications • multi-allelic CNVs • balanced inversions • mobile element insertions • nuclear mitochondrial insertions • Slide credit: Jan Korbel

STRs • Two methods are used for STR detection: • lobSTR • RepeatSeq • ~1.5 million STRs detected • Slide credit: Gareth Highnam, Thomas Willems, David Mittelman, Yaniv Erlich

Phase 3 reference panel • The pipeline for the construction of the reference panel combines both the microarray and the sequencing data of the samples in the project. • Genotype calling and phasing software used: SHAPEIT2 and MVNcall

Step 1: Create scaffold • [Figure: individuals by microarray SNPs]

Step 2: Phase bi-allelic sites • [Figure: individuals by bi-allelic SNPs, indels and structural variants]

Step 3: Phase multi-allelic variants • [Figure: individuals by multi-allelic and other complex variants]

Downstream imputation experiment • Performance of haplotype sets for imputation • Haplotype sets compared: July 2012 release (phase 1), new pipeline (phase 1), new pipeline (phase 3) • Input: genotypes at chip SNPs • Imputation target: genotypes at SNPs not on the chip • Truth set: Complete Genomics high-coverage genotypes • Metric: R2 between imputed and true genotypes
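The accuracy metric named on this slide, R2 between imputed and true genotypes, can be sketched as the squared Pearson correlation at one variant. The encoding is an assumption made for this example: true genotypes as 0, 1 or 2 copies of the alternate allele, imputed genotypes as expected dosages in the range 0 to 2. The exact definition and any frequency binning used in the experiment are not given in the slides.

```python
import math


def r_squared(imputed, truth):
    """Squared Pearson correlation between imputed dosages and true genotypes."""
    n = len(imputed)
    if n != len(truth) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 samples")
    mean_x = sum(imputed) / n
    mean_y = sum(truth) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(imputed, truth))
    sxx = sum((x - mean_x) ** 2 for x in imputed)
    syy = sum((y - mean_y) ** 2 for y in truth)
    if sxx == 0 or syy == 0:
        return math.nan  # undefined when either side has no variation
    return (sxy * sxy) / (sxx * syy)


if __name__ == "__main__":
    truth = [0, 1, 2, 0, 1]                # hypothetical high-coverage genotypes
    dosages = [0.1, 0.9, 1.8, 0.2, 1.3]    # hypothetical imputed dosages
    print(round(r_squared(dosages, truth), 3))
```

Values near 1 mean the imputed dosages track the true genotypes closely; monomorphic sites give an undefined value and are returned as NaN here.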

Downstream imputation accuracy • [Plot comparing: new pipeline phase 3, new pipeline phase 1, July 2012 phase 1] • Slide credit: Olivier Delaneau, Jonathan Marchini

Downstream imputation accuracy (continued) • [Plot comparing: new pipeline phase 3, new pipeline phase 1, July 2012 phase 1] • Slide credit: Olivier Delaneau, Jonathan Marchini

Phase 3 reference panel • Includes more samples from diverse populations • The number of SNPs will increase (~75 million) • Inclusion of different types of variants • Higher haplotype accuracy from methods development, which should lead to higher downstream imputation accuracy • Timeline: Summer 2014
