The Chinese University of Hong Kong Bioinformatics Regular Group Meeting

Presentation Transcript


  1. The Chinese University of Hong Kong Bioinformatics Regular Group Meeting
     Presenter: Yip Kit Sang Danny
     Date and Venue: 4th Mar, 2014 at SHB1027

  2. Agenda
     • Introduction
     • Shuffle Mechanism in MapReduce
     • MapReduce in Alignment
     • MapReduce in Mapping and Assembly
     • MapReduce in Gene Expression Analysis and SNP Analysis
     • MapReduce in other Biological Applications
     • Conclusions and Discussions
     Bioinformatics Regular Group Meeting | Danny Yip | 7th Mar, 2013

  3. Motivations
     • Bioinformatics routinely has to handle large-scale data from high-throughput experiments
     • Parallel computing in bioinformatics faces many challenges
       • Data and databases, algorithms and solutions, data analyses, etc.

  4. Introduction
     • Message Passing Interface (MPI)
     • Graphics Processing Unit (GPU)
     • Hadoop MapReduce

  5. Message Passing Interface (MPI)
     • Data transfer
     • Requires cooperation of sender and receiver
     • For parallel computers, clusters, and heterogeneous networks
     • Designed to permit the development of parallel software libraries
     • Features:
       • Communications combine context and group for message security
       • Thread safety cannot be assumed for MPI programs
       • Cannot deal with node failure

  6. Graphics Processing Unit (GPU)
     Reference: Kent State University, wcheng
     • It is a processor optimized for 2D/3D graphics, video, visual computing, and display
     • It is a highly parallel, highly multithreaded multiprocessor optimized for visual computing
     • It provides real-time visual interaction with computed objects via graphics, images, and video
     • It serves as both a programmable graphics processor and a scalable parallel computing platform
     • Heterogeneous systems: combine a GPU with a CPU

  7. Hadoop
     Reference: CSCI5120 Lecture note, Prof. James Cheng
     • Hadoop is an open-source project of the Apache Software Foundation
     • The key concepts of Hadoop are based on papers published by Google in 2003 and 2004 (Google File System and MapReduce)
     • Hadoop has matured thanks to the commitment of many organizations
       • Google, Yahoo, Facebook, Cloudera, etc.

  8. MapReduce
     Reference: CSCI5120 Lecture note, Prof. James Cheng
     • MapReduce is a method for distributing a task across multiple nodes
     • Each node processes data stored on that node, where possible
     • Consists of two phases: Map and Reduce
     • Features:
       • Automatic parallelization
       • Fault tolerance
       • A clean abstraction for programmers (away from housekeeping work)

  9. Related Software and Projects on MapReduce
     Table 1: Related software and projects on MapReduce

  10. Shuffle Mechanism in MapReduce
     Reference: CSCI5120 Lecture note, Prof. James Cheng
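
The slide's figure did not survive in this transcript. As a rough illustration only, the shuffle step between Map and Reduce can be sketched in Python as: route every (key, value) pair to a reducer chosen from its key, then sort and group by key within each reducer. The function names `partition` and `shuffle`, the CRC32-based partitioner, and the two-mapper input are assumptions made for this sketch; they are not Hadoop's API.

```python
import zlib
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def partition(key, num_reducers):
    """Pick a reducer for a key; CRC32 keeps the choice stable across runs."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(mapper_outputs, num_reducers):
    """Route pairs from every mapper to reducers, then sort and group by key."""
    buckets = defaultdict(list)
    for pairs in mapper_outputs:              # one list of pairs per mapper
        for key, value in pairs:
            buckets[partition(key, num_reducers)].append((key, value))

    grouped = {}
    for reducer_id, pairs in buckets.items():
        pairs.sort(key=itemgetter(0))         # sort by key within each reducer
        grouped[reducer_id] = [
            (key, [value for _, value in group])
            for key, group in groupby(pairs, key=itemgetter(0))
        ]
    return grouped


# Hypothetical output of two mappers.
mapper_outputs = [
    [("hello", 1), ("world", 1)],
    [("hello", 1), ("HeLLo", 1)],
]
reducer_inputs = shuffle(mapper_outputs, num_reducers=2)
```

Because the partitioner depends only on the key, all values for one key reach the same reducer, which is what lets each reducer work independently.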

  11. An Example on MapReduce
     Reference: CSCI5120 Lecture note, Prof. James Cheng
     Count the number of occurrences of each word in a large amount of input data

  12. An Example on MapReduce
     • Input: “hello hello hello world world HeLLo”
     • After mapper:
       • (“hello”, 1), (“hello”, 1), (“hello”, 1), (“world”, 1), (“world”, 1), (“HeLLo”, 1)
     • After reducer (output):
       • (“hello”, 3)
       • (“world”, 2)
       • (“HeLLo”, 1)
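
The word-count example above can be reproduced with a small single-process sketch in Python. It only mimics the Map and Reduce phases in memory; the names `mapper`, `reducer`, and `word_count` are illustrative and are not part of Hadoop.

```python
from collections import defaultdict


def mapper(line):
    """Emit (word, 1) for every whitespace-separated word; case is preserved."""
    for word in line.split():
        yield word, 1


def reducer(word, counts):
    """Sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run the map phase, group values by key, then run the reduce phase."""
    groups = defaultdict(list)
    for line in lines:
        for word, one in mapper(line):
            groups[word].append(one)
    return dict(reducer(word, counts) for word, counts in groups.items())


result = word_count(["hello hello hello world world HeLLo"])
print(result)  # {'hello': 3, 'world': 2, 'HeLLo': 1}
```

Since the mapper does not normalize case, “hello” and “HeLLo” are counted as different keys, matching the slide's output.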

  13. MapReduce in Alignment
     • Exact match of fixed-length sub-sequences
     • Find the best matches using a simple score
     • Re-evaluate the best matches using formal substitution matrices
     • Combine the best matches by allowing gaps
     • Use dynamic programming on the combined matches
     Image credit: Wikipedia
     The most famous alignment tool: BLAST
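
As a rough illustration of the first step only (exact matching of fixed-length sub-sequences), the Python sketch below indexes every k-length word of one sequence and looks up the words of the other. The function names, the toy sequences, and the choice k = 4 are assumptions for this example; it is not the BLAST implementation, and the scoring, gap, and dynamic-programming steps are left out.

```python
from collections import defaultdict


def index_kmers(sequence, k):
    """Map every k-length word of the sequence to the positions where it starts."""
    positions = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        positions[sequence[i:i + k]].append(i)
    return positions


def find_seeds(query, subject, k):
    """Return (query_pos, subject_pos) pairs where k-length words match exactly."""
    subject_index = index_kmers(subject, k)
    seeds = []
    for q in range(len(query) - k + 1):
        for s in subject_index.get(query[q:q + k], []):
            seeds.append((q, s))
    return seeds


seeds = find_seeds("GATTACA", "TTGATTACAGG", k=4)
print(seeds)  # [(0, 2), (1, 3), (2, 4), (3, 5)]
```

All four seeds share the same offset (subject position minus query position is 2), which is the kind of consistent run that the later steps would score and extend.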

  14. MapReduce in Alignment
     • Alignment tools using parallel computing:
       • mpiBLAST
       • GPU-BLAST
       • CloudBLAST
       • NCBI BLAST2
       • bCloudBLAST

  15. bCloudBLAST

  16. bCloudBLAST

  17. bCloudBLAST

  18. MapReduce in Mapping and Assembly
     • Mapping
       • CloudBurst
       • BlastReduce
       • SEAL
       • CloudAligner
     • Assembly
       • ABySS (MPI-based)
       • Ray (MPI-based)
       • Contrail (Hadoop)
     Image credit: Commins et al., Biological Procedures Online 11(1):52-78, (2009), Wikipedia

  19. BlastReduce
     Reference: BlastReduce, Schatz
     • Three stages of MapReduce:
       • MerReduce: obtain all mers from both the reads and the reference genome
       • SeedReduce: merge consistent mers using the sorted read positions
       • ExtendReduce: remove duplicate alignments
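
A minimal Python sketch of the idea behind the first stage is given below: the map step emits every mer of the reads and the reference keyed by the mer itself, and the reduce step pairs read and reference occurrences that share a mer. This is an in-memory illustration under assumed names (`mer_map`, `mer_reduce`, `shared_mers`) and toy data; it is not the BlastReduce code, and the SeedReduce and ExtendReduce stages are not shown.

```python
from collections import defaultdict


def mer_map(name, sequence, is_reference, k):
    """Emit (mer, (is_reference, name, offset)) for every k-mer of a sequence."""
    for offset in range(len(sequence) - k + 1):
        yield sequence[offset:offset + k], (is_reference, name, offset)


def mer_reduce(occurrences):
    """Pair each read occurrence of one mer with each reference occurrence."""
    ref_hits = [(n, o) for is_ref, n, o in occurrences if is_ref]
    read_hits = [(n, o) for is_ref, n, o in occurrences if not is_ref]
    for read_name, read_off in read_hits:
        for ref_name, ref_off in ref_hits:
            yield read_name, read_off, ref_name, ref_off


def shared_mers(reference, reads, k):
    """Group all emitted mers by key, then reduce each group to shared seeds."""
    groups = defaultdict(list)
    for mer, occ in mer_map("ref", reference, True, k):
        groups[mer].append(occ)
    for name, seq in reads.items():
        for mer, occ in mer_map(name, seq, False, k):
            groups[mer].append(occ)
    seeds = []
    for occurrences in groups.values():
        seeds.extend(mer_reduce(occurrences))
    return sorted(seeds)


# Hypothetical reference and read, chosen only to make the output easy to check.
seeds = shared_mers("ACGTACGGT", {"read1": "TACGG"}, k=3)
```

Each resulting tuple is (read name, read offset, reference name, reference offset); sorting by read position is what would let a following stage merge consistent neighbouring mers.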

  20. MapReduce in Gene Expression Analysis and SNP Analysis
     • The reads are aligned to the reference and categorized by some features
       • Mapping can use MapReduce
     • The distribution of reads assigned to each feature is normalized
       • The normalization calculation can be parallelized
     • Statistical analysis is conducted to identify features with differential abundance
       • The evaluation of statistical significance for different features is usually independent
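
To illustrate why the counting and normalization steps parallelize well, the Python sketch below counts reads per feature in separate chunks, merges the partial counts, and then scales them. The slide does not say which normalization is used; counts per million is an assumption made for this sketch, as are the function names and the toy gene labels.

```python
from collections import Counter


def count_reads(assignments):
    """Count reads per feature; each element is the feature one read was assigned to."""
    return Counter(assignments)


def counts_per_million(counts):
    """Scale raw counts so that they sum to one million (assumed normalization)."""
    total = sum(counts.values())
    return {feature: n * 1_000_000 / total for feature, n in counts.items()}


# Two hypothetical chunks of read assignments, as if processed on different nodes.
chunks = [["geneA", "geneA", "geneB"], ["geneA", "geneC"]]

# Partial counts are independent per chunk and are merged by simple addition.
total_counts = sum((count_reads(chunk) for chunk in chunks), Counter())
cpm = counts_per_million(total_counts)
print(cpm)  # {'geneA': 600000.0, 'geneB': 200000.0, 'geneC': 200000.0}
```

The per-chunk counting corresponds to a map step and the merge to a reduce step; only the final scaling needs the global total.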

  21. MapReduce in Gene Expression Analysis and SNP Analysis
     • Some tools: Myrna, FX, Crossbow, Sequence_Analyzer, CloudTSS
     • Advantages
       • Scalability, speedup
     • Disadvantages
       • Some bottlenecks
       • Inheritance of limitations of some previous work
       • Restrictive running environment

  22. MapReduce in other Biological Applications
     • Mapping stage in NGS or de novo assembly for other analyses (e.g. gene expression analysis, miRNA study)
     • Hadoop-based algorithms for constructing the suffix array (SA) and the Burrows-Wheeler transform (BWT)
     • Hadoop-BAM, tools for analysing BAM files
     • SeqWare Query Engine, a scalable NoSQL database built on HBase from the Hadoop project
     • Efficient graph algorithms for PPI networks and other topological analyses

  23. Conclusions and Discussions
     • Pros:
       • Scalability, speedup, fault tolerance
       • Wide range of applications in bioinformatics
       • Hadoop is an open-source Apache implementation
     • Cons:
       • Limited running environment (e.g. system calls used in MPI, GPU and cloud)
       • Readability of the framework of other parallel computing algorithms

  24. Thanks
