
Kepler + Hadoop: A General Architecture Facilitating Data-Intensive Applications in Scientific Workflow Systems



Jianwu Wang, Daniel Crawl, Ilkay Altintas. San Diego Supercomputer Center, University of California, San Diego, 9500 Gilman Drive, MC 0505, La Jolla, CA 92093-0505, U.S.A. {jianwu, crawl, altintas}@sdsc.edu. Presentation by Woodrow H. Edwards.

Presentation Transcript

Jianwu Wang, Daniel Crawl, Ilkay Altintas

San Diego Supercomputer Center, University of California, San Diego

9500 Gilman Drive, MC 0505

La Jolla, CA 92093-0505, U.S.A.

{jianwu, crawl, altintas}@sdsc.edu

Presentation by Woodrow H. Edwards

Kepler + Hadoop: A General Architecture Facilitating Data-Intensive Applications in Scientific Workflow Systems


Kepler

  • Open source scientific workflow system

  • A workflow is an executable model of the stages that transform data into the desired result in a scientific domain

  • Scientific domains using Kepler

    • Bioinformatics, Computational Chemistry,

    • Ecoinformatics, and Geoinformatics

  • All of these domains involve large data sets and heavy computation


Kepler

  • User-friendly GUI connects data sources to built-in procedures or independent applications by drag and drop

  • Promotes component reuse and sharing

  • Written in Java

  • Designed to run on clusters, grids, or the Web

  • A good fit for integration with MapReduce


Kepler

  • Components of a Kepler workflow

    • Actors

      • Independently process data

      • Atomic or composite

      • Ports receive input and emit output data (tokens) or signals

      • Could be R or MATLAB scripts or an outside application

    • Channels

      • Link actors

      • Carry data or other signals

    • Directors

      • Specify when actors run

      • Sequential (SDF, Synchronous Dataflow) or parallel (PN, Process Networks)
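The three component types above can be illustrated with a small toy model. This is only a sketch in plain Python, not Kepler's API: the class names `Channel`, `Actor`, and `SequentialDirector` and the three-actor pipeline are hypothetical, chosen to mirror the roles in the list (actors process tokens, channels link them, a director decides when each actor fires).

```python
from collections import deque


class Channel:
    """Links one actor's output port to another actor's input port."""

    def __init__(self):
        self.tokens = deque()

    def put(self, token):
        self.tokens.append(token)

    def get(self):
        return self.tokens.popleft()

    def has_token(self):
        return bool(self.tokens)


class Actor:
    """Atomic actor: consumes one token per input channel, emits one result."""

    def __init__(self, name, func, inputs=(), output=None):
        self.name = name
        self.func = func
        self.inputs = list(inputs)
        self.output = output

    def can_fire(self):
        # An actor with no inputs (a source) is always ready.
        return all(channel.has_token() for channel in self.inputs)

    def fire(self):
        args = [channel.get() for channel in self.inputs]
        result = self.func(*args)
        if self.output is not None:
            self.output.put(result)
        return result


class SequentialDirector:
    """Toy stand-in for a sequential director: fires actors one by one in order."""

    def __init__(self, actors):
        self.actors = actors

    def run(self):
        results = {}
        for actor in self.actors:
            if actor.can_fire():
                results[actor.name] = actor.fire()
        return results


# Hypothetical pipeline: source -> square -> sink
source_to_square = Channel()
square_to_sink = Channel()
source = Actor("source", lambda: 7, output=source_to_square)
square = Actor("square", lambda x: x * x, inputs=[source_to_square], output=square_to_sink)
sink = Actor("sink", lambda x: x, inputs=[square_to_sink])

results = SequentialDirector([source, square, sink]).run()
print(results)
```

A parallel director would instead run each actor concurrently and let the channels act as queues between them; the actors themselves would not need to change, which is the point of separating directors from actors.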


Figure 1: Example Kepler workflow [2]


Hadoop

  • Open source implementation of MapReduce

    • map(in_key, in_value) → (out_key, intermediate_value) list

    • reduce(out_key, intermediate_value list) → out_value list

  • Includes HDFS (Hadoop Distributed File System)

  • Handles data partitioning, scheduling, load balancing, and fault tolerance

  • Also written in Java
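The map and reduce signatures above can be made concrete with a single-process simulation. This is a minimal sketch in plain Python, not Hadoop's API: `run_mapreduce`, the sample records, and the two user functions are hypothetical, and the sketch leaves out everything Hadoop adds on top (HDFS, partitioning, scheduling, fault tolerance).

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Apply map_fn to every (in_key, in_value) pair, group by out_key, then reduce."""
    intermediate = defaultdict(list)
    for in_key, in_value in records:
        # map(in_key, in_value) -> (out_key, intermediate_value) list
        for out_key, intermediate_value in map_fn(in_key, in_value):
            intermediate[out_key].append(intermediate_value)
    # reduce(out_key, intermediate_value list) -> out_value list
    return {
        out_key: reduce_fn(out_key, values)
        for out_key, values in sorted(intermediate.items())
    }


def map_year_temperature(in_key, in_value):
    """Parse a hypothetical 'year,temperature' record into one (year, temperature) pair."""
    year, temperature = in_value.split(",")
    return [(year, int(temperature))]


def reduce_max(out_key, values):
    """Keep only the largest temperature seen for a year."""
    return [max(values)]


records = [("line1", "1950,22"), ("line2", "1950,31"), ("line3", "1951,18")]
result = run_mapreduce(records, map_year_temperature, reduce_max)
print(result)
```

The user writes only the two small functions; the grouping step in the middle is what the framework provides.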


Kepler + Hadoop

  • Implement a MapReduce composite actor

    • Map actor

      • MapInputKey: in_key

      • MapInputValue: in_value

      • MapOutputList: (out_key, intermediate_value) list

    • Reduce actor

      • ReduceInputKey: out_key

      • ReduceInputList: intermediate_value list

      • ReduceOutputValue: out_value list

Figure 2: (a) MapReduce composite actor. (b) Map actor. (c) Reduce actor. [1]
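The port layout listed above can be sketched as follows. This is an illustration only, assuming each sub-workflow can be modeled as a function from a dictionary of input ports to a dictionary of output ports; the class `MapReduceCompositeActor` and the inverted-index example are hypothetical and do not reproduce the real Kepler actor, which hands execution to Hadoop rather than looping in-process.

```python
from collections import defaultdict


class MapReduceCompositeActor:
    """Toy composite actor wrapping a Map sub-workflow and a Reduce sub-workflow."""

    def __init__(self, map_subworkflow, reduce_subworkflow):
        self.map_subworkflow = map_subworkflow
        self.reduce_subworkflow = reduce_subworkflow

    def fire(self, records):
        grouped = defaultdict(list)
        for in_key, in_value in records:
            out_ports = self.map_subworkflow(
                {"MapInputKey": in_key, "MapInputValue": in_value}
            )
            for out_key, intermediate_value in out_ports["MapOutputList"]:
                grouped[out_key].append(intermediate_value)
        results = {}
        for out_key in sorted(grouped):
            out_ports = self.reduce_subworkflow(
                {"ReduceInputKey": out_key, "ReduceInputList": grouped[out_key]}
            )
            results[out_key] = out_ports["ReduceOutputValue"]
        return results


def index_map(ports):
    """Emit one (word, document name) pair per word in the document text."""
    document = ports["MapInputKey"]
    words = ports["MapInputValue"].lower().split()
    return {"MapOutputList": [(word, document) for word in words]}


def index_reduce(ports):
    """Collapse the documents seen for a word into a sorted, de-duplicated list."""
    return {"ReduceOutputValue": sorted(set(ports["ReduceInputList"]))}


actor = MapReduceCompositeActor(index_map, index_reduce)
index = actor.fire([("a.txt", "Kepler workflow"), ("b.txt", "Hadoop workflow")])
print(index)
```

Because the sub-workflows see only their named ports, either one could be swapped for a different implementation without touching the composite actor.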


Kepler + Hadoop

Figure 3: Hierarchical execution of MapReduce composite actor with Hadoop [1]


Kepler + Hadoop

Figure 4: (a) Word Count workflow. (b) Map actor. (c) Reduce actor. (d) IterateOverArray actor. [1]
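Since the figure itself is not reproduced in this transcript, the Word Count logic can be approximated in code. The sketch below is an assumption about how the pieces fit together, not a transcription of the workflow: it supposes the Map side splits a line into an array of words and uses an IterateOverArray-style step to turn each word into a (word, 1) pair, and that the Reduce side sums the ones. All function names are hypothetical.

```python
from collections import defaultdict


def iterate_over_array(array, element_fn):
    """Apply element_fn to each element, loosely mimicking an IterateOverArray step."""
    return [element_fn(element) for element in array]


def word_count_map(in_key, in_value):
    """Split one line of text and emit a (word, 1) pair for every word."""
    words = in_value.split()
    return iterate_over_array(words, lambda word: (word, 1))


def word_count_reduce(out_key, values):
    """Sum the ones collected for a single word."""
    return [sum(values)]


def word_count(lines):
    """Run map over every line, group by word, then reduce each group."""
    grouped = defaultdict(list)
    for line_number, line in enumerate(lines):
        for word, one in word_count_map(line_number, line):
            grouped[word].append(one)
    return {word: word_count_reduce(word, ones)[0] for word, ones in grouped.items()}


counts = word_count(["to be or not", "to be"])
print(counts)
```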


Kepler + Hadoop

  • Takes 10 to 15% longer than native Hadoop MapReduce

  • Makes up for it in ease of implementation

  • Scientists can use MapReduce without needing to know the framework

  • They only need to know where they can benefit from parallelism in their workflow
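To put the 10 to 15% figure in concrete terms, the short calculation below turns a native Hadoop runtime into the corresponding Kepler + Hadoop range. The function name and the 200-second baseline are hypothetical, picked only for illustration; they are not measurements from the paper.

```python
def kepler_runtime_range(native_seconds, low_pct=10, high_pct=15):
    """Return the (low, high) runtime implied by a percentage overhead range."""
    low = native_seconds * (100 + low_pct) / 100
    high = native_seconds * (100 + high_pct) / 100
    return low, high


# Hypothetical job that takes 200 seconds on native Hadoop MapReduce.
low, high = kepler_runtime_range(200)
print(f"Kepler + Hadoop estimate: {low:.0f} to {high:.0f} seconds")
```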


References

  • [1] J. Wang, D. Crawl, and I. Altintas. Kepler + Hadoop: A General Architecture Facilitating Data-Intensive Applications in Scientific Workflow Systems. In WORKS '09, ACM, Nov. 2009.

  • [2] The Kepler Project. https://kepler-project.org.

  • [3] The Apache Hadoop Project. http://hadoop.apache.org.

