Charles Schmitt Director, Informatics and Data Sciences Senior Researcher – Data Mining


  1. Charles Schmitt, Director, Informatics and Data Sciences; Senior Researcher – Data Mining, Renaissance Computing Institute. Searching for the Genetic Causes of Disease with Hadoop (and other big data technologies…)

  2. Who is involved?
     • Biomedical Informatics Group: Kirk Wilhelmsen, M.D.; Chris Bizon, Ph.D.; Xiaoshu Wang, Ph.D.; Jason Reilly; Phil Owens; Guifeng Jin; Michael Spiegel, Ph.D.; Joshua Salisbury, Ph.D.
     • Data Sciences Group: Charles Schmitt, Ph.D.; Erik Scott; Nassib Nasser; Keary Cavin; Michael Shoffner
     • Collaborators: Jonathan Berg, M.D.; Jim Evans, M.D.; Kari North, Ph.D.; Ethan Lange, Ph.D.; Rob Fowler, Ph.D.; UNC HTSF; UNC LCCC; UNC Center for Bioinformatics; UNC ITS RC; RENCI ACIS; UNC IPIT; multiple remote collaboration sites

  3. Human DNA • Dynamic 3-d structure • 23 chromosomes • Nearly identical copies

  4. Human Genetic Variations
     ATCGATCGATCAGACTA__GGGCTAGACTACGATCGATC – reference genome
     ATCGATCGGTCAGACTATCGGGCTA__CTACGAGCGCTC – patient maternal
     ATCGATCGGTCAGACTATCGGGCTA__CTACGATCGCTC – patient paternal
     • SNPs: low millions
     • Indels: low 100k
     • Structural variations: ~5-15% of genome is larger structural variants (Nature Biotechnology 29, 723–730 (2011))

  5. Next-Generation Sequencing [Figure: reads aligned against a genome with two exons, 4x coverage] • Low coverage/targeted sequencing: cheaper and faster to sequence, less data to store • But… • Greater reliance on making statistical inferences • Different strategies for research and clinical use

  6. Identifying variations • Likely heterozygous: (6 C, 9 G), (7 T, 9 G) • Likely sequencing error: (2 C, 14 T), (1 C, 15 A)

  7. Identifying variations • 2 homozygous SNPs • Unclear: (6 C, 14 T)
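One way to see why some of the read counts on slide 6 look heterozygous while others look like sequencing error is a simple binomial likelihood comparison. The sketch below is illustrative only and is not the caller used in the talk; the function names, the allele labels `a`/`b`, and the 1% per-base error rate are assumptions.

```python
from math import comb


def genotype_likelihoods(count_a, count_b, error_rate=0.01):
    """Binomial likelihood of the observed read counts under three genotypes.

    count_a / count_b are the reads supporting each of two alleles at one site.
    error_rate is an assumed per-base sequencing error rate (hypothetical value).
    """
    n = count_a + count_b
    # Probability that a single read shows allele b under each genotype.
    p_b = {"hom_a": error_rate, "het": 0.5, "hom_b": 1.0 - error_rate}
    return {
        genotype: comb(n, count_b) * p ** count_b * (1.0 - p) ** count_a
        for genotype, p in p_b.items()
    }


def call_genotype(count_a, count_b, error_rate=0.01):
    """Return the genotype label with the highest likelihood."""
    likelihoods = genotype_likelihoods(count_a, count_b, error_rate)
    return max(likelihoods, key=likelihoods.get)


if __name__ == "__main__":
    # Read counts quoted on slide 6.
    for counts in [(6, 9), (7, 9), (2, 14), (1, 15)]:
        print(counts, call_genotype(*counts))
```

Under these assumptions the balanced counts come out as `het`, and the counts with only one or two minority reads come out as homozygous for the majority allele, which is the "likely sequencing error" reading. A production caller would also weigh base and mapping quality, which this sketch ignores.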

  8. Identifying variations: CTT deletion (deltaF508) is the most common cause of cystic fibrosis

  9. Clinical Binning – the critical information Slide provided by Jim Evans, M.D., Ph.D., Department of Genetics, UNC-CH

  10. The promise of genetics requires a greater understanding of the underlying structure of the data

  11. Computing on the Genome: Imputation
      ATCGATCGATCAG – reference
      ATCGGTCGATCAG – patient
      TCGGTNNNTCAG
      GTCGGTCAG
      ATCGGTCGGTCA
      ATCGGTCGGTC
      It's unclear if this patient is A/A, A/G, or G/G

  12. Computing on the Genome: Imputation, Population Evidence
      ATCGATCGATCAG – reference
      ATCGGTCGATCAG – patient
      TCGGTNNNTCAG
      GTCGGTCAG
      ATCGGTCGGTCA
      ATCGGTCGGTC
      ATCGGTCGGTCAG – patient 2
      ATCGGTCGGTCAG – patient 3
      ATCGGTCGGTCAG – patient 4
      ATCGGTCGGTCAG – patient 5
      ATCGATCGATCAG – patient 6
      ATCGATCGATCAG – patient 7
      ATCGATCGATCAG – patient 8
      Infer the patient is homozygous for GG
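The population evidence on this slide can be read as a co-occurrence argument: in patients 2 to 8, a G at the fifth base always travels with a G at the ninth base. The sketch below tabulates that relationship. It is a deliberately simplified stand-in for real imputation, not the Hidden Markov Model approach used by tools such as Thunder; the function name, the 0-based site indices 4 and 8, and the treatment of each patient line as a single haplotype are assumptions made for illustration.

```python
from collections import Counter

# Sequences for patients 2-8 as listed on the slide, treated here as a
# reference panel of haplotypes (an assumption for this sketch).
PANEL = ["ATCGGTCGGTCAG"] * 4 + ["ATCGATCGATCAG"] * 3

LINKED_SITE = 4  # 0-based index of the site where the patient clearly carries G
TARGET_SITE = 8  # 0-based index of the site that is ambiguous in the patient


def conditional_allele_counts(panel, linked_site, linked_allele, target_site):
    """Count target-site alleles among panel haplotypes that carry
    `linked_allele` at `linked_site`."""
    return Counter(
        haplotype[target_site]
        for haplotype in panel
        if haplotype[linked_site] == linked_allele
    )


if __name__ == "__main__":
    counts = conditional_allele_counts(PANEL, LINKED_SITE, "G", TARGET_SITE)
    total = sum(counts.values())
    for allele, n in counts.items():
        print(f"P(target={allele} | linked=G) ~ {n}/{total}")
```

In this toy panel every haplotype with G at the linked site also has G at the target site, which is the intuition behind inferring G/G for the patient. Real imputation models recombination and uncertainty across many sites at once.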

  13. Computing on the Genome: Imputation, Population Evidence
      ATCGATCGATCAG – reference
      ATCGGTCGATCAG – patient
      TCGGTNNNTCAG
      GTCGGTCAG
      ATCGGTCGGTCA
      ATCGGTCGGTC
      ATCGGTCGGTCAG – patient 2
      ATCGGTCGGTCAG – patient 3
      ATCGGTCGGTCAG – patient 4
      ATCGGTCGGTCAG – patient 5
      ATCGATCGATCAG – patient 6
      ATCGATCGATCAG – patient 7
      ATCGATCGATCAG – patient 8
      • Hidden Markov Models for cross-genome statistical correlations (Thunder*)
      • Imputation on 708 samples takes over 200,000 CPU hours to complete, or 22 CPU years
      • How many samples do we need to impute on rare variants?
      * Li Y, Sidore C, Kang HM, Boehnke M, Abecasis GR. Low-coverage sequencing: Implications for design of complex trait association studies. Genome Res. 2011 Jun;21(6):940-51

  14. Convergent Haplotype Association Tagging (CHAT): identifying moderately penetrant mutations from cross-population genetic structures. CHAT was developed by Kirk Wilhelmsen

  15. Using Graph Theory in CHAT [Figure: graph diagrams with nodes labeled 1 to 5 and panels labeled A, B, C]

  16. The discovered CHAT is 2,800 SNPs in length and spans 26 Mb

  17. The promise of genetics requires better approaches to store and analyze large data

  18. The cost of storing 100,000 genomes
      • 10 PB = full human genomes at low coverage (1)
      • 2 PB = human exomes at medium coverage (2)
      • Or: $5 to $25 million for UNC Health Care System
        • Every patient’s genome once on enterprise data storage
        • Not including archived copies, not including analysis data sets
      • Or: $15 to $75 billion for the US to store every patient’s genome once
      • Cost of disk space alone, not including refresh of equipment
      (1) Empirical data, assuming ~100 GB per sample: compressed fastq, bam, vcf, and ancillary data files at coverage between 3-15x
      (2) Empirical data, assuming ~20 GB per sample at around 30x, storing only compressed fastq and bam files
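The storage and cost figures on this slide can be checked with simple arithmetic. In the sketch below, the per-sample sizes and the 100,000-genome count come from the slide; the price of 2.50 USD per gigabyte and the population of 300 million are not stated in the talk and are back-calculated assumptions, chosen only because they reproduce the quoted dollar ranges.

```python
GB_PER_PB = 1_000_000  # decimal units: 1 PB = 1,000,000 GB

# Assumed values, not stated on the slide (see lead-in).
ASSUMED_USD_PER_GB = 2.50
ASSUMED_US_POPULATION = 300_000_000


def total_petabytes(n_genomes, gb_per_genome):
    """Total storage in petabytes for n_genomes at gb_per_genome each."""
    return n_genomes * gb_per_genome / GB_PER_PB


def storage_cost_usd(n_genomes, gb_per_genome, usd_per_gb=ASSUMED_USD_PER_GB):
    """Disk cost for storing each genome once, with no replicas or refresh."""
    return n_genomes * gb_per_genome * usd_per_gb


if __name__ == "__main__":
    for label, gb in [("exome, ~20 GB", 20), ("low-coverage genome, ~100 GB", 100)]:
        print(label)
        print("  100,000 samples:", total_petabytes(100_000, gb), "PB,",
              f"${storage_cost_usd(100_000, gb):,.0f}")
        print("  assumed US population:",
              f"${storage_cost_usd(ASSUMED_US_POPULATION, gb):,.0f}")
```

With these assumptions, 100,000 samples cost between 5 and 25 million USD and the national figure falls between 15 and 75 billion USD, matching the ranges on the slide.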

  19. There is more clinical genetic data … • Gene expression (RNA-seq) • Per-tissue data • Time series data • The personal microbiome (Image courtesy of NIH via WikiCommons)

  20. An Informatics Ecosystem for Clinical Genomics • At ~8K genomes, will scale to ~10-20K genomes • Need to scale to 100,000-200,000+ genomes

  21. High Performance Computing (HPC)
      Leverages:
      • Traditional bioinformatics tools
      • Traditional HPC workflow systems
      Computing:
      • KillDevil (ITS RC): 706 traditional, GPU-based, and large-memory compute nodes
      • BlueRidge (RENCI): 204 traditional, GPU-based, and large-memory compute nodes
      • Croatan (RENCI): 30-node big-data configuration with 1 PB spinning disk
      • Topsail (UNC Genomics): 400 traditional compute nodes
      • Kure (ITS RC): 220 traditional and large-memory compute nodes
      • Open Science Grid: distributed cycle-scavenging grid across research institutions
      • TeraGrid: national HPC grid
      Storage:
      • PB+ Dell/Isilon system at UNC
      • PB+ DDN/NetApp/Dell systems at RENCI

  22. Aggregating genomic knowledge
      Leverages strengths of RDBMS in structured knowledge representation
      [Diagram labels: NCBI RefSeq; dbSNP; VarDB; Annotations of Clinical Variations; PolyPhen; HGMD (commercial); Protein Effects; Other tools…; Other databases]
      • VarDB: several-TB database
        • Reference Genomes
        • Canonical Variants
        • Annotations
        • Indexes
      • AnnoBot: automated query system to update VarDB

  23. HadoopVCF Example: Allele Frequency [Diagram labels: Variant Data file 1; Variant Data file 2] HadoopVCF developed by Chris Bizon

  24. HadoopVCF Example: Generalized [Diagram labels: Samples; Genomic Variants; Genomic Loci] Each file holds different data for different samples and locations.

  25. Hadoop: Generalized algorithm • Mapper • Key = subset of sample and loci • Value = intermediate sums • Reducer • Calculation over intermediate sums • Allele Frequencies, %missing, HWE p-values,… • Hadoop Distributed Cache • Context from VCF headers and/or RDBMS for each mapper
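The mapper/reducer pattern on this slide can be sketched without a Hadoop cluster. The code below is a minimal in-memory illustration, not the HadoopVCF implementation: the record layout `(chrom, pos, genotypes)`, the function names, and the choice of sums are assumptions, and it computes only allele frequency and percent missing (HWE p-values are omitted).

```python
from collections import defaultdict


def mapper(record):
    """Emit (locus, intermediate sums) for one variant record from one file.

    record is a hypothetical simplified tuple (chrom, pos, genotypes), where
    genotypes are VCF-style GT strings such as "0/1", "1|1", or "./.".
    """
    chrom, pos, genotypes = record
    alt = called = missing = 0
    for gt in genotypes:
        alleles = gt.replace("|", "/").split("/")
        if "." in alleles:
            missing += 1  # sample has no call at this locus
            continue
        called += len(alleles)
        alt += sum(allele != "0" for allele in alleles)
    yield (chrom, pos), (alt, called, missing, len(genotypes))


def reducer(locus, values):
    """Combine intermediate sums for one locus into summary statistics."""
    alt = called = missing = samples = 0
    for a, c, m, s in values:
        alt, called, missing, samples = alt + a, called + c, missing + m, samples + s
    return locus, {
        "alt_allele_freq": alt / called if called else None,
        "pct_missing": 100 * missing / samples if samples else None,
    }


def run_job(records):
    """Simulate the shuffle step: group mapper output by key, then reduce."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(grouped.items()))


if __name__ == "__main__":
    file_1 = [("1", 100, ["0/0", "0/1", "./."])]  # three samples
    file_2 = [("1", 100, ["1/1", "0/1"])]         # two other samples
    print(run_job(file_1 + file_2))
```

Because the mapper emits sums rather than frequencies, records for the same locus from different files (and therefore different samples) combine correctly in the reducer, which is the point of keeping "intermediate sums" as the value.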

  26. Why Hadoop? • Scalability for certain genome analysis patterns • Challenges: • Other analysis patterns: Hidden Markov Models, Permutation testing, Haplotype blocks, Graphs, Hierarchical graphs? • Share resources • Running on scheduled HPC clusters • Running on centralized HP storage system + local disks • Moving data to and from the worker nodes

  27. Managing an R&D ecosystem with big data
      [Diagram labels, resources: External Partner Resources; Open Science Grid; TeraGrid; Lab Machines; UNC Storage (Tape, Drives); RENCI Storage (Tape, Drives); Genomics Storage; Genomics HPC; RENCI HPC; IT Machines; UNC HPC; Clouds; RENCI Hadoop; Genomics Hadoop]
      [Diagram labels, users: Wild West; Analysts; Automated Processes; Developers; IT Staff; External Partners; Data Providers]
      • Research versus production use
      • Life cycle: control increases over time as
        • Work scope increases
        • Expertise and technology mature
        • Risk increases
        • Number of groups touching data increases

  28. iRODS Data Virtualization
      • User client: views and manages data
      • Data grid: user sees a single “Virtual Collection” (/cuahsi/catalog, /cuahsi/modeling, /cuahsi/terrain)
      • Physical locations: SDSC holds /cuahsi/terrain; RENCI holds /cuahsi/modeling; Utah State Univ holds /cuahsi/catalog
      The iRODS Data Grid installs in a “layer” over storage systems, so you can view, manage, access, add, and share part or all of your data and metadata in a unified Collection.

  29. Managing an R&D ecosystem with big data
      [Diagram labels, resources: External Partners; Open Science Grid; TeraGrid; Lab Machines; UNC Storage (Tape, Drives); RENCI Storage (Tape, Drives); Genomics Storage; Genomics HPC; RENCI HPC; IT Machines; UNC HPC; Clouds; RENCI Hadoop; Genomics Hadoop]
      [Diagram labels, interfaces: Posix; DDN WOS; RDBMS; Web services; NFS; Hadoop]
      [Diagram labels, middle layer: Integrated Rules-Oriented Data System (iRODS); Data Services; Programmatic APIs; Data Workflows; iRODS Clients]
      [Diagram labels, users: Analysts; Automated Processes; Developers; IT Staff; External Partners; Data Providers]
      • Control over:
        • Data movement and replication
        • Metadata standards
        • Archival, deletion, and retention
        • Integration with workflows, Hadoop, databases
        • Hiding complexities
        • Automation
        • …, all policy driven
        • …, without breaking the in-place systems

  30. Thank You
