Extracting genetic variation from human genome sequences Stephen Sherry, PhD

Presentation Transcript


  1. Extracting genetic variation from human genome sequences. Stephen Sherry, PhD
     • pipeline for 1000 genomes
     • cSRA deployment
     • software to support use of NGS data
     • post NGS analysis data sets

  2. 1000 genomes project: goals
     • A public database of essentially all SNPs and detectable CNVs with allele frequency >1% in each of multiple human population samples
     • N = 2,600 (100 each from 26 populations)
     • Pioneer and evaluate methods for:
       • Generating data from next-generation sequencing platforms
       • Exchanging and combining data and analytical methods
       • Discovering and genotyping SNPs and CNVs from next-gen data
       • Imputation with and from next-generation sequencing data
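One way to read the sampling design on this slide is to ask how likely a 1% allele is to be present at all among 100 sampled individuals. The sketch below is a back-of-the-envelope calculation, not the project's own power analysis: it assumes independently sampled diploid chromosomes and ignores sequencing coverage and genotyping error. The function name is hypothetical.

```python
def prob_allele_sampled(freq: float, n_individuals: int, ploidy: int = 2) -> float:
    """Probability that an allele at population frequency `freq` occurs at
    least once among n_individuals * ploidy independently drawn chromosomes."""
    n_chromosomes = n_individuals * ploidy
    return 1.0 - (1.0 - freq) ** n_chromosomes


# Study size stated on the slide: 100 individuals from each of 26 populations.
total_samples = 100 * 26
print(total_samples)  # 2600

# Chance that a 1% allele is carried by at least one of 100 sampled individuals.
print(round(prob_allele_sampled(0.01, 100), 3))  # 0.866
```

Under these simplifying assumptions, most 1% alleles are present in a 100-person sample, which fits the stated aim of cataloguing variants above that frequency in each population.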

  3. 1000 Genomes Project Sampling Sites
     Finland; United Kingdom; Beijing, China; Italy; Xishuangbanna, China; Utah, U.S.; Southwest U.S.; Japan; Mississippi, U.S.; Pakistan; Spain; Puerto Rico; Shenzhen, China; California, U.S.; Gambia; India; Vietnam; Barbados; Nigeria; Colombia; Kenya; Ghana; Peru; Malawi

  4. Primary project data formats
     • FASTQ: sequences with base qualities

         @IL11_193:4:1:878:501
         TATTTTGACTTTGAGCGTATCGAGGCTCTTTAACCTGAACGTCAGAAGCAGCCTTATGGCCGTCAACATACC
         +
         IIIIIIIIIIIIIIIIIIIIIIIIIIIIII1IDII<IIIIIIIIIIIIIIIIIIIIIIIIII(I&/97.,8&

     • SAM/BAM: multiple sequence alignments

         @HD VN:1.0
         @SQ SN:chr20 LN:62435964
         @RG ID:L1 PU:SC_1_10 LB:SC_1 SM:NA12891
         @RG ID:L2 PU:SC_2_12 LB:SC_2 SM:NA12891
         read_28833_29006_6945 99 chr20 28833 20 10M1D25M = 28993 195 \
         AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG <<<<<<<<<<<<<<<<<<<<<:<9/,&,22;;<<< \
         NM:i:1 RG:Z:L1
         read_28701_28881_323b 147 chr20 28834 30 35M = 28701 -168 \
         ACCTATATCTTGGCCTTGGCCGATGCGGCCTTGCA <<<<<;<<<<7;:<<<6;<<<<<<<<<<<<7<<<< \
         MF:i:18 RG:Z:L2
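To make the fields in these two formats concrete, here is a minimal Python sketch that parses a FASTQ record, decodes the SAM FLAG values shown on the slide (99 and 147), and measures a CIGAR string. All function names are hypothetical. The Phred+33 quality offset is an assumption, since the slide does not say which encoding the example read uses, and the short FASTQ record in the usage lines is made up for illustration.

```python
import re

# Bit meanings from the SAM specification (subset relevant to paired reads).
SAM_FLAGS = {
    0x1: "paired", 0x2: "proper_pair", 0x4: "unmapped", 0x8: "mate_unmapped",
    0x10: "reverse", 0x20: "mate_reverse",
    0x40: "first_in_pair", 0x80: "second_in_pair",
}


def parse_fastq_record(lines):
    """Parse one four-line FASTQ record into (name, sequence, quality string)."""
    header, seq, plus, qual = [ln.strip() for ln in lines]
    if not header.startswith("@") or not plus.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(seq) != len(qual):
        raise ValueError("sequence and quality lengths differ")
    return header[1:], seq, qual


def phred_scores(qual, offset=33):
    """Convert quality characters to integers (offset 33 is an assumption)."""
    return [ord(c) - offset for c in qual]


def decode_flag(flag):
    """List the names of the SAM FLAG bits that are set."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


def cigar_lengths(cigar):
    """Return (bases consumed from the read, bases consumed from the reference)."""
    query = ref = 0
    for count, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        n = int(count)
        if op in "MIS=X":
            query += n
        if op in "MDN=X":
            ref += n
    return query, ref


# Hypothetical four-base record, not taken from the slide.
name, seq, qual = parse_fastq_record(["@read1", "ACGT", "+", "II5!"])
print(name, phred_scores(qual))   # read1 [40, 40, 20, 0]
print(decode_flag(99))            # ['paired', 'proper_pair', 'mate_reverse', 'first_in_pair']
print(decode_flag(147))           # ['paired', 'proper_pair', 'reverse', 'second_in_pair']
print(cigar_lengths("10M1D25M"))  # (35, 36)
```

The CIGAR `10M1D25M` of the first alignment covers 35 read bases, matching the 35-base sequence shown, but spans 36 reference positions because of the one-base deletion.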

  5. Primary project data formats
     • VCF: variants with genomic location & genotypes

         ##fileformat=VCFv4.0
         ##fileDate=20100721
         ##source=VCFtools
         ##reference=NCBI36 (preferred use is assembly accession.version)
         ##INFO= <ID=AA, Number=1, Type=String, Description="Ancestral Allele">
         ##INFO= <ID=H2, Number=0, Type=Flag, Description="HapMap2 membership">
         ##FORMAT=<ID=GT, Number=1, Type=String, Description="Genotype">
         ##FORMAT=<ID=GQ, Number=1, Type=Integer, Description="Genotype Quality">
         ##FORMAT=<ID=DP, Number=1, Type=Integer, Description="Read Depth">
         ##ALT= <ID=DEL, Description="Deletion">
         ##INFO= <ID=SVTYPE, Number=1, Type=String, Description="Type of structural variant">
         ##INFO= <ID=END, Number=1, Type=Integer, Description="End position of the variant">
         #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE1 SAMPLE2
         1 1 . ACG A,AT 40 PASS . GT:DP 1/1:13 2/2:29
         1 2 . C T,CT . PASS H2;AA=T GT 0|1 2/2
         1 5 rs12 A G 67 PASS . GT:DP 1|0:16 2/2:20
         X 100 . T <DEL> . PASS SVTYPE=DEL;END=300 GT:GQ:DP 1:12:15 0/0:20:13
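The genotype columns in the VCF example are indexes into the REF/ALT allele list, which is easy to misread on a slide. The following sketch parses one data line and translates a GT value into allele strings. The function names are hypothetical, and the code splits on any whitespace because the transcript shows spaces; real VCF files are tab-delimited.

```python
import re


def parse_vcf_line(line, sample_names):
    """Split one VCF data line into a small dictionary.

    Assumption: fields are whitespace-separated, as in the slide transcript.
    """
    fields = line.split()
    chrom, pos, var_id, ref, alt = fields[:5]
    fmt_keys = fields[8].split(":")
    samples = {
        name: dict(zip(fmt_keys, raw.split(":")))
        for name, raw in zip(sample_names, fields[9:])
    }
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": var_id,
        "alleles": [ref] + alt.split(","),  # index 0 = REF, 1.. = ALT alleles
        "samples": samples,
    }


def genotype_alleles(gt, alleles):
    """Translate a GT string such as '0|1' or '2/2' into allele strings."""
    return [None if i == "." else alleles[int(i)] for i in re.split(r"[/|]", gt)]


names = ["SAMPLE1", "SAMPLE2"]
rec = parse_vcf_line("1 1 . ACG A,AT 40 PASS . GT:DP 1/1:13 2/2:29", names)
print(genotype_alleles(rec["samples"]["SAMPLE1"]["GT"], rec["alleles"]))  # ['A', 'A']
print(genotype_alleles(rec["samples"]["SAMPLE2"]["GT"], rec["alleles"]))  # ['AT', 'AT']
```

In the first data line, SAMPLE1 is homozygous for the first alternate allele (`A`) and SAMPLE2 is homozygous for the second (`AT`). A `|` separator marks a phased genotype and `/` an unphased one.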

  6. Gabor Marth

  7. Chunlin Xiao

  8. 1061 genomes produced 23M novel filtered calls

  9. BAM FASTQ VCF VCF

  10. cSRA testing and deployment

  11. Chunlin Xiao and Eugene Yaschenko

  12. Chunlin Xiao

  13. Chunlin Xiao

  14. Chunlin Xiao

  15. Chunlin Xiao

  16. Results & Benefits
     • Whole genomes and exomes can be efficiently stored in 1/3 to 1/10 of the space of a BAM file.
     • cSRA lossless compression achieved a ~3x reduction in size (bits per base) compared to the original BAM files. Near-lossless compression (4- and 8-level quantization of base qualities) further reduced this to <2 bits per base, demonstrating a reduced storage requirement for these sequences.
     • cSRA can be quickly ‘sliced’ to extract genomic intervals of interest in BAM, SAM or FASTQ format.
     • The format stores original base qualities (OQ) or recalibrated quality scores (RQ) and can produce recalibrated quality scores (RQ) during extraction.
     • Conversion times are rapid: BAM-to-cSRA encoding runs at 15-20 GB per hour on 2 CPU cores.
     • Processing requires significant RAM to match up paired read names during mate pair reconstruction. Memory requirements are typically 1/3 of the size of the BAM input file.
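The near-lossless mode mentioned above reduces base qualities to 4 or 8 levels. The slides do not give the binning scheme cSRA uses, so the sketch below assumes evenly spaced bins over Phred scores 0 to 40 and replaces each score with its bin midpoint. It illustrates the general idea only. The function names and the `max_score` default are hypothetical.

```python
import math


def quantize_qualities(scores, levels=8, max_score=40):
    """Map Phred scores onto `levels` evenly spaced bins (an assumed scheme)
    and return the integer midpoint of each score's bin."""
    if levels < 2:
        raise ValueError("need at least two levels")
    width = (max_score + 1) / levels
    quantized = []
    for q in scores:
        q = min(max(q, 0), max_score)        # clamp to the supported range
        bin_index = int(q // width)
        quantized.append(int(bin_index * width + width / 2))
    return quantized


def raw_bits_per_quality(levels):
    """Fixed-width cost of one quantized quality value, before any entropy coding."""
    return math.log2(levels)


print(quantize_qualities([40, 40, 16, 27, 5], levels=8))
print(raw_bits_per_quality(8), raw_bits_per_quality(4))  # 3.0 2.0
```

With fewer distinct symbols, runs of identical quality values become longer and more compressible. That is presumably where the additional savings come from, although the slide reports only the end result.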

  17. Sources of problematic alignments:
     • low complexity regions
     • imperfect aligner technology
     • lack of essential quality control
     Errors corrected by cSRA:
     • incorrect mate flags
     • inconsistent quality flags
     • errors in CIGAR strings
     • multiple placements
     Errors impact variant detection and may introduce false positives into the final variant call set.
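Two of the error classes listed here, incorrect mate flags and CIGAR errors, can be detected from a single SAM record. The sketch below shows what such consistency checks might look like. It is not cSRA's actual validation logic, which the slides do not describe, and the function name is hypothetical. The rules come from the SAM specification: mate-related FLAG bits are only meaningful when the paired bit 0x1 is set, and the CIGAR operations that consume read bases must add up to the length of SEQ.

```python
import re

MATE_BITS = 0x2 | 0x8 | 0x20 | 0x40 | 0x80  # bits that only apply to paired reads


def check_alignment(flag, cigar, seq):
    """Return a list of problems found in one alignment record (empty if none)."""
    problems = []

    if not flag & 0x1 and flag & MATE_BITS:
        problems.append("mate flags set on an unpaired read")

    if cigar != "*":  # '*' means the CIGAR is unavailable
        if not re.fullmatch(r"(\d+[MIDNSHP=X])+", cigar):
            problems.append("malformed CIGAR string")
        else:
            query_len = sum(
                int(n) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)
                if op in "MIS=X"
            )
            if seq != "*" and query_len != len(seq):
                problems.append(
                    f"CIGAR covers {query_len} bases but SEQ has {len(seq)}"
                )
    return problems


seq = "AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG"   # first read from the SAM example
print(check_alignment(99, "10M1D25M", seq))   # []
print(check_alignment(99, "36M", seq))        # length mismatch reported
print(check_alignment(0x40, "35M", seq))      # mate flag without the paired bit
```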

  18. Empirical testing & validation possible
     • Variation detection: comparison of unfiltered call sets produced by the variation detection pipeline using submitted and archive-restored BAMs
       • Comparisons will include measures of false negatives: the potential variations that would be ‘lost’ by archive treatment
       • Lists of variants that truncate proteins should be evaluated. These include nonsense (stop gain), loss of transcription start site, and splice site donor and acceptor positions.
     • SV performance with and without map quality (CREST)
     • Individual genotyping accuracy
     • Accuracy of remapping to new assemblies
     • Pipeline consequences for dropping secondary base calls
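The call-set comparison described on this slide amounts to set arithmetic over variant keys. The sketch below assumes each call is identified by chromosome, position, reference allele and alternate allele. It treats calls present in the submitted-BAM set but missing from the archive-restored set as the ‘lost’ variants. The function names and the example calls are hypothetical, and the code is not the project's evaluation pipeline.

```python
def variant_key(vcf_line):
    """Identify a call by (chrom, pos, ref, alt); whitespace-delimited input assumed."""
    chrom, pos, _var_id, ref, alt = vcf_line.split()[:5]
    return (chrom, int(pos), ref, alt)


def load_calls(lines):
    """Collect variant keys, skipping header and blank lines."""
    return {variant_key(ln) for ln in lines if ln.strip() and not ln.startswith("#")}


def compare_call_sets(submitted_lines, restored_lines):
    """Compare calls made from submitted BAMs with calls from archive-restored BAMs."""
    submitted = load_calls(submitted_lines)
    restored = load_calls(restored_lines)
    lost = submitted - restored      # would be false negatives after archiving
    gained = restored - submitted    # calls that appear only after archiving
    return {
        "shared": len(submitted & restored),
        "lost": sorted(lost),
        "gained": sorted(gained),
        "fraction_lost": len(lost) / len(submitted) if submitted else 0.0,
    }


# Made-up calls for illustration only.
submitted = ["#CHROM POS ID REF ALT", "1 5 rs12 A G", "1 9 . C T", "X 100 . T <DEL>"]
restored = ["1 5 rs12 A G", "X 100 . T <DEL>", "2 7 . G A"]
report = compare_call_sets(submitted, restored)
print(report["lost"])    # [('1', 9, 'C', 'T')]
print(report["gained"])  # [('2', 7, 'G', 'A')]
```

The same keys could then be intersected with a list of protein-truncating variants to check whether any high-impact calls are among those lost.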

  19. Software to support NGS

  20. Stationary night blindness due to premature termination in TRPM1
     • Nonsense CA in TRPM1
     • Data for NA11918 placed by two different aligners (mosaik & bwa)
     • All individual genotypes for rs3784589
     Deanna Church & Eugene Yaschenko

  21. Deanna Church & Eugene Yaschenko

  22. Developing characterized gDNA reference material for NGS
     • NCBI contributions:
       • analyze sequence data and variant calls in target gene regions
       • create consensus VCFs for NA12878 and NA19240
       • host a genome-specific browser for published sequences and genotype calls
     • NIST will use this information to further develop standard reference materials for NGS
     Technology-specific genotypes from publications and groups collaborating in the GET-RM project.

  23. Don Preuss & Chris Cope

  24. Post-NGS analysis & data

  25. A genotype dataset is a very large matrix with orthogonal access patterns
     Problem:
     • List all genotypes for a given variation
     • List all variations for a subject
     Solution: Divide the data into chunks
     vcf, asn.1, json, xml; SciDB Cluster (array-based storage); querygt.cgi
     • 1000G November 2010 release (pilot)
       • 18 GB compressed VCF
       • 38.8M SNPs
     • 1000G May 2011 release (phase 1)
       • 164.4 GB compressed VCF
       • 38.2M SNPs
       • 3.9M Short Indels
       • 14K Deletions
     Douglas Slotta
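The chunking idea on this slide can be illustrated with a small in-memory model. It stores the variant-by-subject matrix as rectangular tiles, so that a row query (all genotypes for one variation) and a column query (all variations for one subject) each touch only a strip of tiles. This is a toy sketch in plain Python, not the SciDB-backed service the slide refers to. The class name, method names and tile sizes are all hypothetical.

```python
class ChunkedGenotypeMatrix:
    """Variants are rows, subjects are columns; cells live in rectangular tiles."""

    def __init__(self, chunk_rows=1000, chunk_cols=100):
        self.chunk_rows = chunk_rows
        self.chunk_cols = chunk_cols
        self.chunks = {}  # (row_block, col_block) -> {(variant, subject): genotype}
        self.n_variants = 0
        self.n_subjects = 0

    def put(self, variant, subject, genotype):
        key = (variant // self.chunk_rows, subject // self.chunk_cols)
        self.chunks.setdefault(key, {})[(variant, subject)] = genotype
        self.n_variants = max(self.n_variants, variant + 1)
        self.n_subjects = max(self.n_subjects, subject + 1)

    @staticmethod
    def _n_blocks(n_items, block_size):
        return (n_items + block_size - 1) // block_size

    def genotypes_for_variant(self, variant):
        """Row query: reads only the tiles in one horizontal strip."""
        row_block = variant // self.chunk_rows
        result = {}
        for col_block in range(self._n_blocks(self.n_subjects, self.chunk_cols)):
            for (v, s), gt in self.chunks.get((row_block, col_block), {}).items():
                if v == variant:
                    result[s] = gt
        return result

    def variants_for_subject(self, subject):
        """Column query: reads only the tiles in one vertical strip."""
        col_block = subject // self.chunk_cols
        result = {}
        for row_block in range(self._n_blocks(self.n_variants, self.chunk_rows)):
            for (v, s), gt in self.chunks.get((row_block, col_block), {}).items():
                if s == subject:
                    result[v] = gt
        return result


matrix = ChunkedGenotypeMatrix(chunk_rows=2, chunk_cols=2)
for v in range(4):
    for s in range(4):
        matrix.put(v, s, f"g{v}_{s}")
print(len(matrix.chunks))               # 4 tiles for a 4 x 4 matrix
print(matrix.genotypes_for_variant(1))  # genotypes of variant 1 for all subjects
print(matrix.variants_for_subject(3))   # genotypes of subject 3 for all variants
```

A row-major file such as VCF makes the first query cheap and the second expensive. Tiling trades a little overhead on both for acceptable cost in either direction.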

  26. ClinVar: organizing allele significance relative to disorders
     e.g. Severe Combined Immunodeficiency Disease
     • Variants co-observed in affected patients
     • Allele focus of the report
     • Semantic properties of the disorder or phenotype
     Donna Maglott and Wendy Rubinstein

  27. The translational research process has archives at each stage
     Stages: Genome, Biology, Medicine
     Archives: PheGenI, OMIM, 1000 Genomes, Genetic Test Registry, NHGRI GWAS, Gene, SRA, dbVar, ClinVar, dbGaP, dbSNP, PharmGKB, RefSeqGene

  28. THANK YOU
     • 1000 Genomes Roadmap: Chunlin Xiao
     • Genotype archive: Chunlei Liu, Douglas Slotta
     • Variation Pipeline: Chunlin Xiao; Anatoly Mnev; Gonçalo Abecasis, U Mich; Tom Blackwell, U Mich; Gabor Marth, Boston College; Alistair Ward, Boston College
     • 1000 Genomes Browser: Victor Ananiev, Deanna Church, Cliff Clausen, Rob Cohen, Peter Meric, The sequence viewer team! The SRA team!
     • cSRA deployment: Chunlin Xiao, Michael Kimelman, Eugene Yaschenko, The VDB team!
     • Systems: Chris Cope, Don Preuss
     • GetRM Browser: Richa Agarwala, Deanna Church, Donna Maglott, Chris O’Sullivan, Chunlin Xiao, Eugene Yaschenko, The dbSNP team! The Clinical Variation team!
