
1000 Genomes Project Haplotype Integration

1000 Genomes Project Haplotype Integration. Androniki Menelaou, University Medical Center Utrecht. Phase 1 integrated haplotypes: haplotypes from 1,092 samples. The official release can be found here: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/analysis_results/integrated_call_sets/


Presentation Transcript


  1. 1000 Genomes Project Haplotype Integration Androniki Menelaou University Medical Center Utrecht

  2. Phase 1 integrated haplotypes • Haplotypes from 1,092 samples. • The official release can be found here: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/analysis_results/integrated_call_sets/ • It includes: • 38 million single nucleotide polymorphisms • 1.4 million short insertions and deletions

  3. Phase 1 integrated haplotypes • Information on the samples: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/analysis_results/integrated_call_sets/integrated_call_samples.20101123.ped • Coordinates are on Build 37 (GRCh37)

  4. Use of phased haplotypes • Inference of human demographic history • Inference of points of recombination • Understanding the interplay of genetic variation and disease • Imputation of untyped genetic variation

  5. Human disease genetics • [Figure: genome-wide SNP microarray genotypes (unphased pairs such as g/g, a/c, g/t) at SNPs (usually > 500,000 genome-wide) for cases and controls (usually > 1000)] • Jonathan Marchini

  6. Human disease genetics • [Figure: haplotypes at SNPs (usually > 500,000 genome-wide) for cases and controls (usually > 1000)] • Haplotypes are estimated using statistical methods • Jonathan Marchini

  7. Human disease genetics • [Figure: reference haplotypes from sequencing studies (1000 Genomes Project) alongside the study haplotypes of cases and controls (usually > 1000)] • Jonathan Marchini

  8. Human disease genetics • [Figure: reference haplotypes from sequencing studies (1000 Genomes Project, ~2,200 haplotypes) alongside the study haplotypes of cases and controls (usually > 1000)] • Imputation of unobserved alleles via matching of shared haplotypes • Jonathan Marchini

  9. Human disease genetics • [Figure: reference haplotypes from sequencing studies (1000 Genomes Project, ~2,200 haplotypes) alongside the study haplotypes of cases and controls (usually > 1000), with the previously unobserved alleles filled in] • Imputation of unobserved alleles via matching of shared haplotypes • Jonathan Marchini

  10. Human disease genetics • [Figure: reference haplotypes and imputed study haplotypes] • GWAS of imputed genotypes • Increased power • Better resolution • Facilitates meta-analysis
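The idea of imputing unobserved alleles by matching shared haplotypes can be illustrated with a deliberately simplified sketch. It assumes a tiny made-up reference panel and copies untyped alleles from the single closest reference haplotype; the imputation programs named in this talk use statistical models rather than a nearest-match rule, so this is only a toy illustration, and all function and variable names are hypothetical.

```python
from typing import List, Optional


def mismatches_at_typed_sites(ref_hap: str, study_hap: List[Optional[str]]) -> int:
    """Count mismatches, looking only at sites where the study haplotype is typed."""
    return sum(1 for r, s in zip(ref_hap, study_hap) if s is not None and r != s)


def impute_by_best_match(reference: List[str], study_hap: List[Optional[str]]) -> str:
    """Fill untyped sites (None) by copying from the closest reference haplotype."""
    best = min(reference, key=lambda ref: mismatches_at_typed_sites(ref, study_hap))
    return "".join(s if s is not None else r for r, s in zip(best, study_hap))


# Toy reference panel (made-up sequences, one character per variant site).
reference_panel = ["agagttgagg", "atgagacgag", "ctgcgacggt"]

# Study haplotype typed on a sparse "chip": None marks an untyped site.
study = ["a", None, "g", None, "g", "a", None, "g", "a", None]

imputed = impute_by_best_match(reference_panel, study)
print(imputed)
```

Here the second reference haplotype agrees with the study haplotype at every typed site, so its alleles are copied into the gaps.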

  11. Imputation using the 1000 Genomes data • Samples are genotyped on a microarray (e.g. Affy500k, Illumina1M) • Quality control • Choose an imputation algorithm: • BEAGLE (http://faculty.washington.edu/browning/beagle/beagle.html) • IMPUTE2 (http://mathgen.stats.ox.ac.uk/impute/impute_v2.html) • MINIMAC (http://genome.sph.umich.edu/wiki/Minimac)

  12. Imputation using the 1000 Genomes data • NOTE: Each of these imputation programs provides the 1000 Genomes haplotypes already converted to its required format (check their websites) • Impute samples using the 1000 Genomes haplotypes as a reference panel.

  13. Imputation using the 1000 Genomes data • E.g.
./impute2 \
 -m ./Example/example.chr22.map \
 -h ./Example/example.chr22.1kG.haps \
 -l ./Example/example.chr22.1kG.legend \
 -g ./Example/example.chr22.study.gens \
 -strand_g ./Example/example.chr22.study.strand \
 -int 20.4e6 20.5e6 \
 -Ne 20000 \
 -o ./Example/example.chr22.one.phased.impute2
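The command above imputes a single interval given by `-int`. When a longer region has to be covered, one option is to generate one such command per interval. The sketch below only builds the argument lists and does not run anything; it reuses the flags and file-name pattern from the command above, while the chunk size, the output naming scheme and the function name are assumptions made for illustration.

```python
import shlex
from typing import List


def impute2_chunk_commands(prefix: str, start: int, end: int,
                           chunk: int, ne: int = 20000) -> List[List[str]]:
    """Build one impute2 argument list per non-overlapping interval [lo, hi]."""
    commands = []
    for lo in range(start, end + 1, chunk):
        hi = min(lo + chunk - 1, end)
        commands.append([
            "./impute2",
            "-m", f"{prefix}.map",
            "-h", f"{prefix}.1kG.haps",
            "-l", f"{prefix}.1kG.legend",
            "-g", f"{prefix}.study.gens",
            "-strand_g", f"{prefix}.study.strand",
            "-int", str(lo), str(hi),
            "-Ne", str(ne),
            # Hypothetical output name: one file per interval.
            "-o", f"{prefix}.{lo}_{hi}.impute2",
        ])
    return commands


# Assumed example: three 100 kb intervals on the example chromosome 22 files.
cmds = impute2_chunk_commands("./Example/example.chr22", 20_400_001, 20_700_000, 100_000)
for cmd in cmds:
    print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run` once the input files exist; keeping the arguments as a list avoids shell-quoting problems.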

  14. Extracting information from the haplotypes • Look up the allele frequency of a variant in the 1000 Genomes data • Focus on a specific set of samples (e.g. only the European samples) • Filter some positions

  15. Extracting information from the haplotypes • VCFtools: http://vcftools.sourceforge.net/index.html • A program package designed for working with VCF files • It can validate, merge and compare files, and calculate some basic population genetic statistics.

  16. Extracting information from the haplotypes • E.g.
./vcftools \
 --gzvcf ALL.chr1.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf.gz \
 --freq \
 --out chr1
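To make the allele-frequency calculation concrete, here is a minimal sketch of what such a computation involves, optionally restricted to a subset of samples. It is not how VCFtools is implemented: it reports a single non-reference allele frequency per site, reads only the GT field, and runs on a made-up three-sample VCF whose sample names (S1 to S3) are hypothetical.

```python
from typing import Dict, Iterable, Optional, Set


def non_ref_allele_frequencies(vcf_lines: Iterable[str],
                               keep_samples: Optional[Set[str]] = None) -> Dict[str, float]:
    """Return {"CHROM:POS": non-reference allele frequency} from VCF text lines."""
    freqs: Dict[str, float] = {}
    sample_cols = []
    for line in vcf_lines:
        if line.startswith("##"):
            continue  # meta-information lines
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):
            # Sample columns start at index 9; keep all or only the requested ones.
            sample_cols = [i for i, name in enumerate(fields)
                           if i >= 9 and (keep_samples is None or name in keep_samples)]
            continue
        alt_count = total = 0
        for i in sample_cols:
            genotype = fields[i].split(":")[0]  # GT is the first FORMAT field
            for allele in genotype.replace("/", "|").split("|"):
                if allele == ".":
                    continue  # missing allele
                total += 1
                alt_count += allele != "0"
        if total:
            freqs[f"{fields[0]}:{fields[1]}"] = alt_count / total
    return freqs


# Made-up phased genotypes for three hypothetical samples.
example_vcf = [
    "##fileformat=VCFv4.1",
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT",
               "S1", "S2", "S3"]),
    "\t".join(["1", "100", ".", "A", "G", ".", "PASS", ".", "GT", "0|1", "1|1", "0|0"]),
    "\t".join(["1", "200", ".", "C", "T", ".", "PASS", ".", "GT", "0|0", "0|1", "0|0"]),
]

all_freqs = non_ref_allele_frequencies(example_vcf)
subset_freqs = non_ref_allele_frequencies(example_vcf, keep_samples={"S1", "S2"})
print(all_freqs)
print(subset_freqs)
```

Restricting to a sample subset changes the denominator as well as the count, which is why the frequencies for S1 and S2 alone differ from those for all three samples.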

  17. From Phase 1 to Phase 3 • ~2,500 samples • Site detection: multiple methods are employed • Types of variants: additional variant types are to be included (e.g. STRs, multi-allelic variants) • Integrated haplotypes: haplotypes are to include SNPs, indels, complex variants and SVs

  18. SNPs, indels, MNPs and multi-allelic variants • Calling approaches: local assembly, alignment based, global assembly • Callers: Freebayes, Haplotype Caller, Platypus, SNPTools, Unified Genotyper, samtools, RTG snp, GotCloud, SGA / DINDEL, Cortex

  19. Summary Adrian Tan, Hyun Min Kang, Goncalo Abecasis *Autosomes only, unfiltered set

  20. Structural Variants • Variant classes: • deletions (26k, length: 204bp – 100kb) • bi-allelic tandem and dispersed duplications • multi-allelic CNVs • balanced inversions • mobile element insertions • nuclear mitochondrial insertions • Jan Korbel

  21. STRs • Two methods are used for STR detection: • lobSTR • RepeatSeq • ~1.5 million STRs detected • Gareth Highnam, Thomas Willems, David Mittelman, Yaniv Erlich

  22. Phase 3 reference panel • The pipeline for the construction of the reference panel combines both the microarray and sequencing data of the samples in the project. • Genotype calling and phasing software used : SHAPEIT2 and MVNcall

  23. Step 1: Create scaffold • [Figure labels: Individual; microarray SNPs]

  24. Step 2: Phase bi-allelic sites • [Figure labels: Individual; bi-allelic SNPs, indels, structural variants]

  25. Step 3: Phase multi-allelic variants • [Figure labels: Individual; multi-allelic and other complex variants]

  26. Downstream imputation experiment • Performance of haplotype sets for imputation: July 2012 release (phase 1), new pipeline (phase 1), new pipeline (phase 3) • Genotypes at chip SNPs are used to impute genotypes at SNPs not on the chip • True genotypes: Complete Genomics high-coverage genotypes • Compare R2 between imputed and true genotypes
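The R2 comparison can be read as the squared Pearson correlation between imputed and true genotypes at a site. The sketch below computes it for one site, assuming imputed genotypes are expressed as expected allele dosages between 0 and 2 and true genotypes as allele counts; the example values are invented and the function name is hypothetical.

```python
from typing import Sequence


def r_squared(imputed: Sequence[float], truth: Sequence[float]) -> float:
    """Squared Pearson correlation between imputed dosages and true genotypes."""
    if len(imputed) != len(truth) or not imputed:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(imputed)
    mean_x = sum(imputed) / n
    mean_y = sum(truth) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(imputed, truth))
    var_x = sum((x - mean_x) ** 2 for x in imputed)
    var_y = sum((y - mean_y) ** 2 for y in truth)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when either vector is constant
    return cov * cov / (var_x * var_y)


# Invented example: five individuals at one SNP that is not on the chip.
imputed_dosages = [0.1, 0.9, 1.8, 0.0, 1.1]
true_genotypes = [0, 1, 2, 0, 1]

r2 = r_squared(imputed_dosages, true_genotypes)
print(round(r2, 3))
```

In practice such values are typically summarised across many sites, for example by allele-frequency bin, rather than reported one site at a time.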

  27. Downstream imputation accuracy • [Figure legend: new pipeline phase 3; new pipeline phase 1; July 2012 phase 1] • Olivier Delaneau, Jonathan Marchini

  28. Downstream imputation accuracy • [Figure legend: new pipeline phase 3; new pipeline phase 1; July 2012 phase 1] • Olivier Delaneau, Jonathan Marchini

  29. Phase 3 reference panel • Includes more samples from diverse populations • The number of SNPs will increase (~75 million) • Inclusion of different types of variants • Higher haplotype accuracy from improved methods, which should lead to higher downstream imputation accuracy • Timeline: Summer 2014
