
Kepler + Hadoop: A General Architecture Facilitating Data-Intensive Applications in Scientific Workflow Systems



Jianwu Wang, Daniel Crawl, Ilkay Altintas
San Diego Supercomputer Center, University of California, San Diego
9500 Gilman Drive, MC 0505
La Jolla, CA 92093-0505, U.S.A.
{jianwu, crawl, altintas}@sdsc.edu

Presentation by Woodrow H. Edwards




Kepler

  • Open source scientific workflow system

  • A workflow is an executable model of the stages that transform data into the desired result in a scientific domain

  • Scientific domains using Kepler

    • Bioinformatics, Computational Chemistry,

    • Ecoinformatics, and Geoinformatics

  • All have large data sets and require a lot of computation


Kepler

  • User-friendly GUI for connecting data sources to built-in procedures or independent applications by drag and drop

  • Promotes component reuse and sharing

  • Written in Java

  • Designed to run on clusters, grids, or the Web

  • A good fit for integration with MapReduce


Kepler

  • Components of a Kepler workflow

    • Actors

      • Independently process data

      • Atomic or composite

      • Ports input and output data (tokens) or signals

      • Could be R or MATLAB scripts or an outside application

    • Channels

      • Link actors

      • Carry data or other signals

    • Directors

      • Specify when actors run

      • Sequential (SDF, Synchronous Dataflow) or parallel (PN, Process Network)
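
The three component types above can be illustrated with a small sketch. This is not Kepler code (Kepler is written in Java); the class names `Channel`, `Actor`, and `SequentialDirector` are hypothetical stand-ins chosen only to show how actors process tokens, channels carry them, and a director decides when each actor fires.

```python
from collections import deque


class Channel:
    """Links two actors and carries tokens between them (hypothetical class)."""

    def __init__(self):
        self._tokens = deque()

    def put(self, token):
        self._tokens.append(token)

    def get(self):
        return self._tokens.popleft()

    def has_token(self):
        return bool(self._tokens)


class Actor:
    """Atomic actor: reads one token, applies a function, writes one token."""

    def __init__(self, name, func, inp=None, out=None):
        self.name, self.func, self.inp, self.out = name, func, inp, out

    def fire(self):
        # A source actor has no input channel and receives None.
        token = self.inp.get() if self.inp is not None else None
        result = self.func(token)
        if self.out is not None:
            self.out.put(result)
        return result


class SequentialDirector:
    """Fires actors one at a time in a fixed order (assumed scheduling policy)."""

    def __init__(self, schedule):
        self.schedule = schedule

    def run(self):
        result = None
        for actor in self.schedule:
            result = actor.fire()
        return result


# Three-actor pipeline: source -> square -> format
a_to_b, b_to_c = Channel(), Channel()
source = Actor("source", lambda _: 7, out=a_to_b)
square = Actor("square", lambda x: x * x, inp=a_to_b, out=b_to_c)
fmt = Actor("format", lambda x: f"result={x}", inp=b_to_c)

final = SequentialDirector([source, square, fmt]).run()
print(final)  # result=49
```

Because the actors only talk through channels, swapping the director for one that runs each actor in its own thread would not require changing the actors themselves, which is the separation the slide describes.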


Figure 1: Example Kepler workflow [2]


Hadoop

  • Open source implementation of MapReduce

    • map(in_key, in_value) → (out_key, intermediate_value) list

    • reduce(out_key, intermediate_value list) → out_value list

  • Includes HDFS (Hadoop Distributed File System)

  • Handles data partitioning, scheduling, load balancing, and fault tolerance

  • Also written in Java
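
The map and reduce signatures above can be exercised with a minimal in-memory sketch. It does not use Hadoop; the driver `map_reduce` and the sensor-reading example (`max_map`, `max_reduce`) are assumptions made for illustration, and the grouping step stands in for the shuffle that Hadoop performs between the two phases.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Run map, group intermediate values by key, then reduce each group."""
    groups = defaultdict(list)
    for in_key, in_value in records:
        # map(in_key, in_value) -> (out_key, intermediate_value) list
        for out_key, intermediate_value in map_fn(in_key, in_value):
            groups[out_key].append(intermediate_value)
    # reduce(out_key, intermediate_value list) -> out_value list
    return {key: reduce_fn(key, values) for key, values in sorted(groups.items())}


def max_map(in_key, in_value):
    # in_key is a line number (unused); in_value is "sensor,reading".
    sensor, reading = in_value.split(",")
    return [(sensor, float(reading))]


def max_reduce(out_key, intermediate_values):
    # Keep only the largest reading seen for this sensor.
    return [max(intermediate_values)]


records = [(0, "a,1.5"), (1, "b,4.0"), (2, "a,3.25")]
result = map_reduce(records, max_map, max_reduce)
print(result)  # {'a': [3.25], 'b': [4.0]}
```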


Kepler + Hadoop

  • Implement a MapReduce composite actor

    • Map actor

      • MapInputKey: in_key

      • MapInputValue: in_value

      • MapOutputList: (out_key, intermediate_value) list

    • Reduce actor

      • ReduceInputKey: out_key

      • ReduceInputList: intermediate_value list

      • ReduceOutputValue: out_value list

Figure 2: (a) MapReduce composite actor. (b) Map actor. (c) Reduce actor. [1]
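
One way to picture how the ports listed above fit together is the sketch below. It is not the authors' implementation and does not call Kepler or Hadoop; the `MapReduceActor` class is hypothetical, and each sub-workflow is modeled as a plain function that takes a dict keyed by input port name and returns a dict keyed by output port name. Word count is used as the sample computation.

```python
from collections import defaultdict


class MapReduceActor:
    """Composite actor holding a map and a reduce sub-workflow (hypothetical)."""

    def __init__(self, map_workflow, reduce_workflow):
        self.map_workflow = map_workflow
        self.reduce_workflow = reduce_workflow

    def fire(self, records):
        groups = defaultdict(list)
        for in_key, in_value in records:
            out = self.map_workflow(
                {"MapInputKey": in_key, "MapInputValue": in_value}
            )
            # Group intermediate values by key, as the framework would.
            for out_key, value in out["MapOutputList"]:
                groups[out_key].append(value)
        results = {}
        for out_key in sorted(groups):
            out = self.reduce_workflow(
                {"ReduceInputKey": out_key, "ReduceInputList": groups[out_key]}
            )
            results[out_key] = out["ReduceOutputValue"]
        return results


def word_count_map(ports):
    # Emit (word, 1) for every word in the input line.
    words = ports["MapInputValue"].split()
    return {"MapOutputList": [(word, 1) for word in words]}


def word_count_reduce(ports):
    # Sum the ones collected for this word.
    return {"ReduceOutputValue": [sum(ports["ReduceInputList"])]}


lines = [(0, "kepler runs hadoop"), (1, "hadoop runs jobs")]
counts = MapReduceActor(word_count_map, word_count_reduce).fire(lines)
print(counts)  # {'hadoop': [2], 'jobs': [1], 'kepler': [1], 'runs': [2]}
```

The sub-workflows see only their named ports, so the person writing them does not need to know how the grouping or distribution happens.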


Kepler + Hadoop

Figure 3: Hierarchical execution of MapReduce composite actor with Hadoop [1]


Kepler + Hadoop

Figure 4: (a) Word Count workflow. (b) Map actor. (c) Reduce actor. (d) IterateOverArray actor. [1]


Kepler + Hadoop

  • Takes 10 to 15% longer than native Hadoop MapReduce

  • Makes up for it in ease of implementation

  • Scientists can use MapReduce without needing to know the framework

  • They only need to know where they can benefit from parallelism in their workflow


References

  • [1] J. Wang, D. Crawl, and I. Altintas. Kepler + Hadoop: A General Architecture Facilitating Data-Intensive Applications in Scientific Workflow Systems. In WORKS '09, ACM, Nov. 2009.

  • [2] The Kepler Project. https://kepler-project.org.

  • [3] The Apache Hadoop Project. http://hadoop.apache.org.

