
Cluster-based SNP Calling on Large Scale Genome Sequencing Data

Mucahid Kutlu, Gagan Agrawal. Department of Computer Science and Engineering, The Ohio State University. CCGrid 2014, Chicago, IL.


Presentation Transcript


  1. Cluster-based SNP Calling on Large Scale Genome Sequencing Data
     Mucahid Kutlu, Gagan Agrawal
     Department of Computer Science and Engineering, The Ohio State University
     CCGrid 2014, Chicago, IL

  2. What is an SNP?
     • Stands for Single-Nucleotide Polymorphism
     • A DNA sequence variation that occurs when a single nucleotide differs between members of a biological species
     • Essential for medical research and for developing personalized medicine
     • A single SNP may cause a Mendelian disease
     *Adapted from Wikipedia

  3. Motivation
     • Sequencing costs are decreasing
     *Adapted from genome.gov/sequencingcosts

  4. Motivation
     • Big data problem
     • The 1000 Genomes Project has already produced 200 TB of data
     • Parallel processing is inevitable!
     *Adapted from https://www.nlm.nih.gov/about/2015CJ.html

  5. Outline
     • Motivation
     • Parallel SNP Calling
     • Proposed Scheduling Schemes
     • Experiments
     • Conclusion

  6. General Idea of SNP Calling Algorithms
     [Figure: Alignment File-1 and Alignment File-2; marks ✖ ✓ ✖]
     Two main observations:
     • To detect an SNP at a given location, we have to check the alignments of ALL genomes at that location.
     • The existence of an SNP at one location is independent of the others.
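The two observations can be illustrated with a toy caller. This is a minimal sketch and not VarScan's actual algorithm: the pooled-count rule, the function names `call_snp` and `call_region`, and the thresholds `min_depth` and `min_alt_fraction` are all assumptions chosen for illustration.

```python
from collections import Counter


def call_snp(reference_base, bases_per_sample, min_depth=8, min_alt_fraction=0.2):
    """Return a candidate alternate base at one location, or None.

    bases_per_sample holds one string per sample: the bases that the
    sample's aligned reads show at this location (a toy pileup column).
    The thresholds are illustrative assumptions.
    """
    # Observation 1: reads from ALL samples are needed for a single call.
    pooled = Counter("".join(bases_per_sample))
    depth = sum(pooled.values())
    if depth < min_depth:
        return None
    alternates = {b: n for b, n in pooled.items() if b != reference_base}
    if not alternates:
        return None
    alt_base, alt_count = max(alternates.items(), key=lambda item: item[1])
    return alt_base if alt_count / depth >= min_alt_fraction else None


def call_region(reference, pileups, start, end):
    """Observation 2: each location in [start, end) is called independently."""
    calls = {}
    for loc in range(start, end):
        alt = call_snp(reference[loc], pileups[loc])
        if alt is not None:
            calls[loc] = alt
    return calls


reference = "ACGT"
pileups = [
    ["AAAA", "AAAA"],  # matches the reference
    ["CCCC", "CCTT"],  # 2 of 8 reads show T
    ["GG", "G"],       # too shallow to call
    ["TTTT", "GGGG"],  # half of the reads show G
]
whole = call_region(reference, pileups, 0, 4)
```

Because locations are independent, calling `[0, 2)` and `[2, 4)` separately and merging the results gives the same answer as calling `[0, 4)` at once, which is the property a location-based split relies on.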

  7. Parallel SNP Calling
     How to distribute data among nodes?
     • Location-based
     • Sample-based
     • Checkerboard
     [Figure: genome files divided among four processes under each scheme; annotation: requires communication among processes]
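The three ways of dividing the data can be sketched as block assignments over a samples-by-locations grid. The helper names below are hypothetical, and the `needs_communication` rule is an inference from the observation that a call at one location needs all genomes, not a statement taken from the slides.

```python
def split(n, parts):
    """Cut range(n) into `parts` near-equal contiguous (start, end) slices."""
    return [(n * i // parts, n * (i + 1) // parts) for i in range(parts)]


def location_based(num_samples, num_locations, procs):
    """Every process sees all samples but only a slice of the locations."""
    return [((0, num_samples), loc) for loc in split(num_locations, procs)]


def sample_based(num_samples, num_locations, procs):
    """Every process sees all locations but only a slice of the samples."""
    return [(smp, (0, num_locations)) for smp in split(num_samples, procs)]


def checkerboard(num_samples, num_locations, rows, cols):
    """Split both dimensions: rows over samples, cols over locations."""
    return [(smp, loc)
            for smp in split(num_samples, rows)
            for loc in split(num_locations, cols)]


def needs_communication(blocks, num_samples):
    """Assumed rule: a block lacking some samples must exchange partial results."""
    return any(smp != (0, num_samples) for smp, _ in blocks)


by_location = location_based(4, 100, 4)
by_sample = sample_based(4, 100, 4)
grid = checkerboard(4, 100, 2, 2)
```

Under this rule only the location-based split lets each process finish its calls alone, which matches the later choice of location-based division for all three scheduling schemes.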

  8. Challenges
     • Load imbalance due to the nature of genomic data
     • It is not just an array of A, G, C and T characters
     • I/O contention
     • High overhead of random access to a particular region
     [Figure: Coverage Variance]

  9. Histogram Showing Coverage Variance
     [Figure: histogram. Chromosome: 1; locations: 1-200M; number of samples: 256; interval size: 1M]

  10. Outline
      • Motivation
      • Parallel SNP Calling
      • Proposed Scheduling Schemes
      • Experiments
      • Conclusion

  11. Proposed Scheduling Schemes
      • Dynamic Scheduling
      • Static Scheduling
      • Combined Scheduling
      …Each scheduling scheme uses location-based data division: the genome is divided into regions and each task is responsible for a region.

  12. Dynamic Scheduling
      [Figure: Alignment File-1 and Alignment File-2, with chunks of size B marked]
      • Master-worker approach
      • Tasks are assigned dynamically
      • Two types of data chunks are used
        • Big chunk: covers B locations
        • Small chunk: covers S locations
        • B > S
      • Big chunks are assigned first, then small chunks
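The big-then-small ordering can be sketched as a work queue served by a master. This is an illustrative simulation rather than the authors' MPI code: `boundary` (where big chunks stop and small chunks begin) and the `cost` callback are hypothetical, since the slides only state that B > S and that big chunks go first.

```python
from collections import deque


def build_chunk_queue(total_locations, big, small, boundary):
    """Split [0, total_locations) into big chunks, then small chunks.

    Locations before `boundary` are cut into chunks of `big` locations,
    the rest into chunks of `small` locations. `boundary` is a
    hypothetical parameter.
    """
    if not big > small > 0:
        raise ValueError("expected B > S > 0")
    queue = deque()
    pos = 0
    while pos < boundary:
        end = min(pos + big, boundary)
        queue.append((pos, end))
        pos = end
    while pos < total_locations:
        end = min(pos + small, total_locations)
        queue.append((pos, end))
        pos = end
    return queue


def simulate(queue, num_workers, cost):
    """Master loop: hand the next chunk to whichever worker is free first."""
    finish = [0.0] * num_workers
    while queue:
        start, end = queue.popleft()
        worker = min(range(num_workers), key=finish.__getitem__)
        finish[worker] += cost(start, end)
    return finish


queue = build_chunk_queue(100, big=20, small=5, boundary=60)
chunks = list(queue)
finish_times = simulate(queue, num_workers=2, cost=lambda s, e: e - s)
```

Big chunks keep the number of master requests low early on, while the trailing small chunks let workers even out their finish times near the end of the run.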

  13. Static Scheduling
      [Figure: Alignment File-1, Alignment File-2]
      • Tasks are scheduled statically; no master-worker approach
      • Pre-processing step
        • We count the number of alignments for each region and generate a histogram
      • Estimated cost
        • We use an estimation function and our histogram for data partitioning
        • k: histogram interval
        • TR: cost of accessing/reading the region
        • TP: cost of processing an alignment
        • N(l): number of alignments at location l
      • Each task is responsible for regions having the same estimated cost
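A histogram-driven static partition might look like the sketch below. The estimation function itself does not appear in the transcript, so the linear form used here (a fixed read cost per interval plus a per-alignment processing cost) is an assumption that only reuses the symbols TR, TP and N(l); the function names and default values are hypothetical as well.

```python
def estimated_costs(histogram, t_read, t_process):
    """Assumed cost per histogram interval k: TR + TP * (alignments in k)."""
    return [t_read + t_process * count for count in histogram]


def partition_equal_cost(histogram, num_tasks, t_read=1.0, t_process=0.01):
    """Cut the intervals into at most num_tasks contiguous regions of similar cost.

    Returns (start, end) pairs of histogram interval indices, end exclusive.
    """
    costs = estimated_costs(histogram, t_read, t_process)
    total = sum(costs)
    regions, running, start = [], 0.0, 0
    for k, cost in enumerate(costs):
        running += cost
        # Close the current region once its cumulative cost target is reached.
        target = total * (len(regions) + 1) / num_tasks
        if running >= target and len(regions) < num_tasks - 1:
            regions.append((start, k + 1))
            start = k + 1
    regions.append((start, len(costs)))
    return regions


# One dense interval (400 alignments) balances four sparse ones.
skewed = partition_equal_cost([100, 100, 100, 100, 400, 0, 0, 0], 2,
                              t_read=0, t_process=1)
uniform = partition_equal_cost([10] * 8, 4, t_read=0, t_process=1)
```

In the skewed example the two regions have equal length but would have very different cost under a naive equal-length split into halves of the dense part; weighting by alignment count is what lets the cut points follow coverage.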

  14. Combined Scheduling
      [Figure: Alignment File-1 and Alignment File-2 divided into big chunks and small chunks]
      • Combination of static and dynamic scheduling
      • We use small and big chunks as in dynamic scheduling
      • The sizes of the chunks are determined according to the histogram
      • Master-worker approach
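One way to read "chunk sizes determined according to the histogram" is to size chunks by estimated work instead of by length. The sketch below follows that reading under stated assumptions: `big_budget`, `small_budget` and `switch_at` are hypothetical tuning parameters, and work is approximated by the alignment count alone.

```python
def cost_sized_chunks(histogram, big_budget, small_budget, switch_at):
    """Group histogram intervals into chunks by estimated work, not length.

    Intervals before index switch_at are packed into chunks of roughly
    big_budget alignments, the rest into chunks of roughly small_budget.
    Returns (start, end, alignments) triples, end exclusive.
    """
    if not big_budget > small_budget > 0:
        raise ValueError("expected big_budget > small_budget > 0")
    chunks, start, load = [], 0, 0
    for k, count in enumerate(histogram):
        load += count
        budget = big_budget if k < switch_at else small_budget
        # Also close the open chunk at the end of each phase.
        last_of_phase = k + 1 in (switch_at, len(histogram))
        if load >= budget or last_of_phase:
            chunks.append((start, k + 1, load))
            start, load = k + 1, 0
    return chunks


histogram = [10, 10, 40, 20, 20, 5, 5, 10, 5, 5]
plan = cost_sized_chunks(histogram, big_budget=40, small_budget=10, switch_at=5)
```

The resulting list would then be handed out in order by a master, as in dynamic scheduling, so a dense region yields short chunks and a sparse region yields long ones.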

  15. Parameters of Scheduling Schemes
      • Our proposed scheduling schemes have user-defined parameters
      • Dynamic scheduling
        • Length of big and small chunks
      • Static scheduling
        • Histogram interval size
        • Estimation function parameters
      • Combined scheduling
        • All parameters for dynamic and static scheduling
      • All parameters can be determined with an offline training phase

  16. Outline
      • Motivation
      • Parallel SNP Calling
      • Proposed Scheduling Schemes
      • Experiments
      • Conclusion

  17. Experiments
      • Local cluster, each node with 2 quad-core 2.53 GHz Xeon(R) processors and 12 GB RAM
      • We obtained genomes of 256 samples from the 1000 Genomes Project
      • The data is replicated to all local disks unless noted otherwise
      • Parallel implementation:
        • We implemented VarScan in the C programming language
        • We also modified VarScan such that BAM files can be read directly
        • Used the MPI library for parallelization

  18. Experiments: Scalability
      [Figure: scalability results for the first 192M locations of Chr. 1]

  19. Experiments: Data Size Impact
      [Figure: data size impact; 128 cores are allocated]

  20. Experiments: I/O Contention Impact
      [Figure: I/O contention impact; 128 cores are allocated]

  21. Comparison with Hadoop
      • The first 192M locations of Chr. 2 in 512 samples are analyzed
      • Lower (dark) portions of the bars show pre-processing time

  22. Scheduling With Replication
      • Data-intensive processing motivates new schemes
      • Replicate each chunk a fixed or variable number of times
      • Dynamic scheduling while processing only local chunks
      • Interesting new tradeoffs
      • Under submission

  23. Other Work
      • PAGE: A Map-Reduce-like middleware for easy parallelization of genomic applications (IPDPS 2014)
        • Mappers and reducers are executable programs
        • Allows us to exploit existing applications
        • No restriction on programming language

  24. PAGE vs. State-of-the-Art
      • A middleware system
      • Specific to parallel genetic data processing
      • Allows parallelization of a variety of genetic algorithms
      • Able to work with different popular genetic data formats
      • Allows use of existing programs

  25. Conclusion
      • We have developed a methodology for parallel identification of variants in large-scale genome sequencing data
      • Coverage variance and I/O contention are the two main problems
      • We proposed 3 scheduling schemes
      • Combined scheduling gives the best results
      • Our approach has good speedup and outperforms Hadoop
