Completion of the 1000 Genomes Project

Completion of the 1000 Genomes Project. Goncalo Abecasis University of Michigan School of Public Health. Project Goals (2008). >95% of accessible genetic variants with a frequency of >1% in each of multiple continental regions

    Presentation Transcript
    1. Completion of the 1000 Genomes Project Goncalo Abecasis University of Michigan School of Public Health

    2. Project Goals (2008)
        • >95% of accessible genetic variants with a frequency of >1% in each of multiple continental regions
        • Extend discovery effort to lower frequency variants in coding regions of the genome
        • Define haplotype structure in the genome

    3. Pilot Projects (2010)
        • 2 deeply sequenced trios
        • 179 whole genomes sequenced at low coverage
        • 8,820 exons deeply sequenced in 697 individuals
        • 15M SNPs, 1M indels, 20,000 structural variants

    4. Phase I (2012)
        • More diverse set of populations sequenced
        • Total >1,092 individuals (EUR, ASN, AFR, AMR groupings)
        • >38.5 million SNPs
            • 8.5M sites discovered before project (dbSNP 129)
            • 30M sites newly discovered
        • 98.9% of HapMap III sites rediscovered
        • Transition/transversion ratio of 2.16 vs 2.04 in pilot
        • ~1.5M insertion/deletion polymorphisms
        • ftp://ftp.1000genomes.ebi.ac.uk
        • ftp://ftp.ncbi.nlm.nih.gov/1000genomes/
        Reference: The 1000 Genomes Project (Nature, 2012)
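
The transition/transversion (Ti/Tv) ratio quoted on this slide is a common sanity check on SNP callsets. The following is a minimal sketch of how such a ratio can be computed from (reference, alternate) allele pairs. The function name `titv_ratio` and the toy input are illustrative assumptions, not part of the project's pipeline.

```python
# Minimal sketch: Ti/Tv ratio from biallelic SNPs given as (ref, alt) pairs.
# The function name and the toy data are hypothetical, for illustration only.

BASES = {"A", "C", "G", "T"}

# Transitions swap purine with purine (A<->G) or pyrimidine with pyrimidine (C<->T).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def titv_ratio(snps):
    """Return the number of transitions divided by the number of transversions."""
    ti = 0
    tv = 0
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        # Skip anything that is not a single-base substitution.
        if ref not in BASES or alt not in BASES or ref == alt:
            continue
        if (ref, alt) in TRANSITIONS:
            ti += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions observed; ratio is undefined")
    return ti / tv


# Toy example: three transitions and one transversion.
toy_snps = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C")]
toy_ratio = titv_ratio(toy_snps)
```

Multi-base records (indels, MNPs) are skipped here, since the ratio is defined for single-base substitutions only.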

    5. Phase 1 samples [Figure: bubble size = sample size]

    6. Samples in the final phase [Figure: bubble size = sample size]

    7. 1000 Genomes data generation
        • Total dataset: 84 TB of BAM files
        • Data generation complete: May 2013
        [Figure: 1000 Genomes data in terabases (axis 0 to 80) by year (2010 to 2013), annotated with "Pilot data" and "Phase 1 data freeze"]

    8. Multiple input variant callsets
        Short Variants (SNPs, indels)
        • Baylor College of Medicine - SNPTOOLs
        • Sanger - SAMtools
        • Sanger - SGA-Dindel
        • Broad - UnifiedGenotyper
        • Broad - HaplotypeCaller
        • Boston College - Freebayes
        • UMich - GotCloud
        • Oxford - Platypus
        • Oxford - Cortex
        • Stanford - RealtimeGenomics
        Structural Variants
        • Variation Hunter
        • MD Anderson TIGRA
        • Delly
        • Invy
        • Genome STRiP
        • Breakdancer
        • Dinumt
        • CNVnator
        • Boston College ALUs
        • Pindel
        • UMaryland MELT
        • UW Deletions

    9. Phase 3 variant calling pipeline
        [Diagram labels]
        • Stages: Raw data, 24 initial callsets, Consensus callsets, Final callset
        • Inputs: Genotyping arrays; Low Coverage and Exome Read Data; PCR-free data
        • Initial callsets: Callset 1, Callset 2, Callset 3, Callset 4, Callset 5, ... Callset n (10 SNP/INDEL callsets, 2 STR callsets, 12 SV callsets)
        • Variant classes: SNPs and high confidence indels; Multi-allelic SNPs, indels, and MNPs; Structural Variants; Short Tandem Repeats
        • Steps: Quality assessment and filtering; Integration; Phasing; Integrated Haplotypes; Phasing of multi-allelic variants onto haplotype scaffold

    10. More and longer indels [Figure: indel lengths, with axis labels at 4 bp, 8 bp and 12 bp]

    11. Quality Control of Short Variants
        • For short variants, the high coverage PCR-free data from 26 individuals was used to assess the false discovery rate for each variant type.
        • An allele is considered ‘validated’ if multiple supporting reads can be identified in PCR-free data.
        • Sites included in the Phase 3 haplotypes have been selected to control the allele False Discovery Rate at 5%.
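
The validation rule on this slide can be illustrated with a short sketch: count the called alleles that lack enough supporting reads in the PCR-free data and divide by the number of alleles assessed. The threshold `min_reads=2` is an assumption (the slide says only "multiple supporting reads"), and the function name and input counts are hypothetical.

```python
# Minimal sketch of an allele false discovery rate (FDR) estimate.
# Assumptions: min_reads=2 stands in for "multiple supporting reads";
# the function name and the example counts are made up for illustration.

def allele_fdr(support_counts, min_reads=2):
    """Fraction of called alleles NOT validated in the PCR-free data.

    support_counts holds, for each called allele, the number of PCR-free
    reads that support it.
    """
    counts = list(support_counts)
    if not counts:
        raise ValueError("no alleles to assess")
    not_validated = sum(1 for n in counts if n < min_reads)
    return not_validated / len(counts)


# Toy example: 19 alleles with ample support and 1 with none.
example_counts = [3] * 19 + [0]
example_fdr = allele_fdr(example_counts)

# A callset would meet the 5% target from the slide if its estimated
# FDR does not exceed 0.05.
meets_target = example_fdr <= 0.05
```

In practice such an estimate would be computed separately per variant type, as the slide describes, so that SNPs and indels are each held to the target.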

    12. Quality Control of Structural Variants
        • CNVs (deletions, duplications)
            • Custom 400K aCGH array
            • Custom 1M aCGH array
            • Omni 2.5 probe intensities
            • Affy 6.0 probe intensities
            • Complete Genomics data
        • Mobile element insertions (MEIs)
            • PCR (N ~ 100 per MEI class)
            • Custom aCGH, incl. targeted breakpoint probes
        • NUMTs (nuclear mitochondrial sequence insertions)
            • PCR (N = 50)
            • PacBio Amplicon-Seq
        • Inversions
            • PCR (N ~ 50)
            • PacBio Amplicon-Seq
        • Multiple SV types
            • PacBio NA12878 WGS data to resolve breakpoints & verify sites
            • PCR-free trios to verify sites and genotypes

    13. Verification & further characterization of inversions by PacBio sequencing
        • Regular (“simple”) inversion
        • Inversion with flanking deletion
        • Complex SVs with inverted sequences

    14. Contribution of the 1000G to dbSNP
        • 61.9 million novel variants discovered by 1000G (62% of dbSNP)
        • 22.4 million variants validated by 1000G (58.4% of non-1000G variants)
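
As a rough consistency check on these figures, the implied totals can be back-calculated from the counts and percentages. This sketch assumes that "62% of dbSNP" means novel 1000G variants divided by all dbSNP variants, and that both percentages refer to the same dbSNP build; neither assumption is stated on the slide, and the variable names are illustrative.

```python
# Back-of-the-envelope check of the dbSNP figures on the slide.
# Assumption: both percentages refer to the same dbSNP build.

novel_1000g = 61.9e6          # novel variants discovered by 1000G
novel_share_of_dbsnp = 0.62   # stated as 62% of dbSNP

validated_by_1000g = 22.4e6   # non-1000G variants validated by 1000G
validated_share = 0.584       # stated as 58.4% of non-1000G variants

# Implied size of dbSNP from the first bullet (about 99.8 million).
implied_dbsnp_total = novel_1000g / novel_share_of_dbsnp

# Implied number of non-1000G variants from the second bullet (about 38.4 million).
implied_non_1000g = validated_by_1000g / validated_share

# If the two bullets are consistent, novel + non-1000G should be close to
# the implied dbSNP total; the small gap comes from rounded percentages.
reconstructed_total = novel_1000g + implied_non_1000g
relative_gap = abs(reconstructed_total - implied_dbsnp_total) / implied_dbsnp_total
```

Under these assumptions the two bullets agree to within about half a percent, which is what rounding of the percentages would produce.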

    15. Private vs. Shared Variation (Population View)

    16. Private vs. Shared Variation (Individual View) [Figure: two panels, "Variants per Genome" (axis to 5 million) and "Variants per Genome (Zoom)" (axis to 160,000)]

    17. Variants per genome

    18. Population histories

    19. Biases in Variation Databases? [Figure: axis labels "ClinVar Variants per Genome" and "Genetic Distance (Fst) to CEU"]

    20. Imputation Accuracy (TODO: multiallelic SNPs and indels to be renamed)

    21. AMD Imputation Example #1

    22. Imputation Example #2: del443ins54

    23. Parting Thoughts
        • Variation is extremely rare
        • In any one genome, nearly all variation is shared …
        • But almost all variants are unique to a population or continent
        • Great benefits to integrated analyses
        • But analysis still requires time comparable to data generation
        • Major improvements in genome coverage, variant quality and integration
        • Advances can be transferred to disease studies through imputation

    24. Phase 3 release
        • The Phase 3 final release is available at www.1000genomes.org.