
Kepler + Hadoop: A General Architecture Facilitating Data-Intensive Applications in Scientific Workflow Systems

Jianwu Wang, Daniel Crawl, Ilkay Altintas. San Diego Supercomputer Center, University of California, San Diego, 9500 Gilman Drive, MC 0505, La Jolla, CA 92093-0505, U.S.A. {jianwu, crawl, altintas}@sdsc.edu. Presentation by Woodrow H. Edwards.


Presentation Transcript


  1. Jianwu Wang, Daniel Crawl, Ilkay Altintas. San Diego Supercomputer Center, University of California, San Diego, 9500 Gilman Drive, MC 0505, La Jolla, CA 92093-0505, U.S.A. {jianwu, crawl, altintas}@sdsc.edu. Presentation by Woodrow H. Edwards. Kepler + Hadoop: A General Architecture Facilitating Data-Intensive Applications in Scientific Workflow Systems

  2. Kepler
     • Open source scientific workflow system
     • Executable model of the many stages transforming data into the desired result in a scientific domain
     • Scientific domains using Kepler: Bioinformatics, Computational Chemistry, Ecoinformatics, and Geoinformatics
       • All have large data sets and require a lot of computation

  3. Kepler
     • User-friendly GUI to connect data sources to built-in procedures or independent applications with the ease of drag and drop
     • Promotes component reuse and sharing
     • Written in Java
     • Designed to run on clusters, grids, or the Web
     • A nice match for integration with MapReduce

  4. Kepler
     • Components of a Kepler workflow (see the code sketch below)
       • Actors
         • Independently process data
         • Atomic or composite
         • Ports input and output data (tokens) or signals
         • Can be R or MATLAB scripts or an outside application
       • Channels
         • Link actors
         • Carry data or other signals
       • Directors
         • Specify when actors run
         • Sequential (SDF) or parallel (PN)
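Kepler builds on the Ptolemy II Java framework, so the actor / port / channel / director vocabulary above maps directly onto Java objects. The following is a minimal sketch of that mapping, assuming the standard Ptolemy II library actors Const and Display and the SDF director; it is illustrative only and is not code from the slides or the paper.

    import ptolemy.actor.Manager;
    import ptolemy.actor.TypedCompositeActor;
    import ptolemy.actor.lib.Const;
    import ptolemy.actor.lib.gui.Display;
    import ptolemy.domains.sdf.kernel.SDFDirector;

    public class TinyWorkflow {
        public static void main(String[] args) throws Exception {
            // Composite actor = the workflow container
            TypedCompositeActor workflow = new TypedCompositeActor();
            workflow.setName("TinyWorkflow");

            // Director: decides when actors fire (here, sequential SDF)
            new SDFDirector(workflow, "SDFDirector");

            // Two atomic actors: a constant source and a display sink
            Const source = new Const(workflow, "Source");
            source.value.setExpression("\"hello, Kepler\"");
            Display sink = new Display(workflow, "Display");

            // Channel: connects the source's output port to the sink's input port
            workflow.connect(source.output, sink.input);

            // Run the workflow
            Manager manager = new Manager(workflow.workspace(), "manager");
            workflow.setManager(manager);
            manager.execute();
        }
    }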

  5. Figure 1: Example Kepler workflow [2]

  6. Hadoop
     • Open source implementation of MapReduce (standard word count example below)
       • map(in_key, in_value) → (out_key, intermediate_value) list
       • reduce(out_key, intermediate_value list) → out_value list
     • HDFS
       • Data partitioning, scheduling, load balancing, and fault tolerance
     • Also written in Java
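As a point of reference for the Kepler actors on the next slide, this is the conventional word count written directly against the Hadoop MapReduce Java API (org.apache.hadoop.mapreduce). It is the standard textbook formulation, not code taken from the paper.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

        // map(in_key, in_value) -> list of (out_key, intermediate_value)
        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer tokens = new StringTokenizer(value.toString());
                while (tokens.hasMoreTokens()) {
                    word.set(tokens.nextToken());
                    context.write(word, ONE);   // emit (word, 1)
                }
            }
        }

        // reduce(out_key, list of intermediate_value) -> out_value
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable count : values) {
                    sum += count.get();
                }
                context.write(key, new IntWritable(sum)); // emit (word, total)
            }
        }
    }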

  7. Kepler + Hadoop
     • Implement a MapReduce composite actor (delegation sketch below)
       • Map actor
         • MapInputKey: in_key
         • MapInputValue: in_value
         • MapOutputList: (out_key, intermediate_value) list
       • Reduce actor
         • ReduceInputKey: out_key
         • ReduceInputList: intermediate_value list
         • ReduceOutputValue: out_value list
     Figure 2: (a) MapReduce composite actor. (b) Map actor. (c) Reduce actor. [1]
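The architecture in [1] runs the Map and Reduce sub-workflows inside Hadoop's Mapper and Reducer tasks. The sketch below only illustrates that delegation pattern on the Map side; MapSubWorkflow, loadSubWorkflow, and the string-pair output type are hypothetical stand-ins, since the slides do not show the actual Kepler execution API.

    import java.io.IOException;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical glue class: a Hadoop Mapper that hands each (key, value) pair to the
    // Kepler Map sub-workflow via its MapInputKey / MapInputValue ports and emits the
    // (out_key, intermediate_value) pairs read back from MapOutputList.
    public class KeplerMapperStub extends Mapper<Text, Text, Text, Text> {

        /** Stand-in for Kepler's sub-workflow execution API (names are invented). */
        interface MapSubWorkflow {
            List<String[]> run(String inKey, String inValue);  // each String[] = {outKey, value}
        }

        private MapSubWorkflow mapSubWorkflow;

        @Override
        protected void setup(Context context) {
            Configuration conf = context.getConfiguration();
            // Assumed: the Map sub-workflow definition travels with the job
            // configuration and is instantiated once per map task.
            mapSubWorkflow = loadSubWorkflow(conf);
        }

        @Override
        protected void map(Text key, Text value, Context context)
                throws IOException, InterruptedException {
            // MapInputKey <- key, MapInputValue <- value
            for (String[] pair : mapSubWorkflow.run(key.toString(), value.toString())) {
                // MapOutputList -> intermediate (out_key, value) pairs for the shuffle
                context.write(new Text(pair[0]), new Text(pair[1]));
            }
        }

        private MapSubWorkflow loadSubWorkflow(Configuration conf) {
            // Placeholder: a real implementation would deserialize the Kepler
            // sub-workflow and return an executable wrapper around it.
            throw new UnsupportedOperationException("Kepler loader not shown in the slides");
        }
    }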

  8. Kepler + Hadoop Figure 3: Hierarchical execution of MapReduce composite actor with Hadoop [1]
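On the driver side, Figure 3's hand-off from the composite actor to Hadoop amounts to configuring and submitting a Job. A rough sketch with the standard org.apache.hadoop.mapreduce.Job API follows; the HDFS paths and the reuse of the WordCount classes from the earlier sketch are illustrative assumptions, not details from the paper.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCountDriver.class);

            // Plug in the Mapper and Reducer from the earlier sketch
            job.setMapperClass(WordCount.TokenizerMapper.class);
            job.setCombinerClass(WordCount.IntSumReducer.class);
            job.setReducerClass(WordCount.IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // Illustrative HDFS paths; the composite actor stages data in and out of HDFS
            FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
            FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }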

  9. Kepler + Hadoop Figure 4: (a) Word Count workflow. (b) Map actor. (c) Reduce actor. (d) IterateOverArray actor. [1]

  10. Kepler + Hadoop
     • Takes 10 to 15% longer than native Hadoop MapReduce
     • Makes up for it in ease of implementation
       • Scientists can use MapReduce without needing to know the framework
       • They only need to know where their workflow can benefit from parallelism

  11. References
     [1] J. Wang, D. Crawl, and I. Altintas. Kepler + Hadoop: A General Architecture Facilitating Data-Intensive Applications in Scientific Workflow Systems. In WORKS '09, ACM, Nov. 2009.
     [2] The Kepler Project. https://kepler-project.org.
     [3] The Apache Hadoop Project. http://hadoop.apache.org.
