Workflow Management

CMSC 491/691 Hadoop-Based Distributed Computing
Spring 2014
Adam Shook

Problem!
  • "Okay, Hadoop is great, but how do people actually do this?“ – A Real Person
    • Package jobs?
    • Chaining actions together?
    • Run these on a schedule?
    • Pre and post processing?
    • Retry failures?
Apache Oozie: Workflow Scheduler for Hadoop
  • Scalable, reliable, and extensible workflow scheduler system to manage Apache Hadoop jobs
  • Workflow jobs are DAGs of actions
  • Coordinator jobs are recurrent Oozie Workflow jobs triggered by time and data availability
  • Supports several types of jobs:
    • Java MapReduce
    • Streaming MapReduce
    • Pig
    • Hive
    • Sqoop
    • DistCp
    • Java programs
    • Shell scripts
Why should I care?
  • Retry jobs in the event of a failure
  • Execute jobs at a specific time or when data is available
  • Correctly order job execution based on dependencies
  • Provide a common framework for communication
  • Couple resources through the workflow instead of some home-grown code base
Layers of Oozie
  • Bundles
  • Coordinators
  • Workflows
  • Actions
Actions
  • Have a type, and each type has a defined set of configuration variables
  • Each action must specify what to do based on success or failure
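A minimal action skeleton (node names here are placeholders, not from the deck) shows the shape every action shares: a typed body plus explicit success/failure transitions.

<action name="my-action">
    <!-- the typed body and its configuration go here,
         e.g. <map-reduce>, <pig>, <java>, <fs>, ... -->
    <ok to="next-node"/>
    <error to="fail"/>
</action>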
Workflow DAGs

[Figure: an example workflow DAG built from start, fork, join, decision, and end control nodes, connecting a streaming M/R job, Java main actions, a Pig job, an M/R job, and an FS job via OK transitions; the decision node branches on MORE vs. ENOUGH processing.]
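As a rough sketch (node names and the decision predicate are invented for illustration), the control nodes in the figure map to workflow XML like this:

<fork name="forking">
    <path start="java-main"/>
    <path start="pig-job"/>
</fork>

<!-- both forked paths must reach the join before the workflow continues -->
<join name="joining" to="decision-node"/>

<decision name="decision-node">
    <switch>
        <case to="mr-job">${needMore eq "true"}</case>  <!-- MORE -->
        <default to="fs-job"/>                          <!-- ENOUGH -->
    </switch>
</decision>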

Oozie Workflow Application
  • An HDFS directory containing:
    • Definition file: workflow.xml
    • Configuration file: config-default.xml
    • App files: lib/ directory with JARs and other dependencies
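For illustration, config-default.xml supplies fallback values for variables referenced in workflow.xml; the property names below are hypothetical.

<configuration>
    <property>
        <name>inputDir</name>
        <value>/user/hadoop/input</value>
    </property>
    <property>
        <name>outputDir</name>
        <value>/user/hadoop/output</value>
    </property>
</configuration>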
WordCount Workflow

<workflow-app name='wordcount-wf' xmlns='uri:oozie:workflow:0.1'>
    <start to='wordcount'/>
    <action name='wordcount'>
        <map-reduce>
            <job-tracker>foo.com:9001</job-tracker>
            <name-node>hdfs://bar.com:9000</name-node>
            <configuration>
                <property>
                    <name>mapred.input.dir</name>
                    <value>${inputDir}</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>${outputDir}</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to='end'/>
        <error to='kill'/>
    </action>
    <kill name='kill'>
        <message>WordCount failed</message>
    </kill>
    <end name='end'/>
</workflow-app>

[Figure: the corresponding DAG. Start leads to the M-R "wordcount" action, which transitions to End on OK and to Kill on Error.]
Coordinators
  • Oozie executes workflows based on
    • Time Dependency
    • Data Dependency

[Figure: Oozie architecture. The Oozie client talks to the Oozie server (running in Tomcat) through the WS API; the Oozie Coordinator checks data availability and triggers Oozie Workflows, which submit jobs to Hadoop.]
Time Triggers

<coordinator-app name="coord1"

start="2009-01-01T00:00Z"

end="2010-01-01T00:00Z"

frequency="15"

xmlns="uri:oozie:coordinator:0.1">

<action>

<workflow>

<app-path>hdfs://bar:9000/apps/processor-wf</app-path>

<configuration>

<property>

<name>key1</name>

<value>value1</value>

</property>

</configuration>

</workflow>

</action>

</coordinator-app>

Data Triggers

<coordinator-app name="coord1" frequency="${1*HOURS}"...>

<datasets>

<dataset name="logs" frequency="${1*HOURS}" initial-instance="2009-01-01T00:00Z">

<uri-template>hdfs://bar:9000/app/logs/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>

</dataset>

</datasets>

<input-events>

<data-in name="inputLogs" dataset="logs">

<instance>${current(0)}</instance>

</data-in>

</input-events>

<action>

<workflow>

<app-path>hdfs://bar:9000/usr/abc/logsprocessor-wf</app-path>

<configuration>

<property>

<name>inputData</name>

<value>${dataIn('inputLogs')}</value>

</property>

</configuration>

</workflow>

</action>

</coordinator-app>

Bundle
  • Bundles are higher-level abstractions that batch a set of coordinators together
  • No explicit dependencies between them, but they can be used to define a pipeline
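A minimal bundle definition (names and paths invented for illustration) just lists the coordinators to batch together:

<bundle-app name="my-bundle" xmlns="uri:oozie:bundle:0.1">
    <coordinator name="coord1">
        <app-path>hdfs://bar:9000/apps/coord1</app-path>
    </coordinator>
    <coordinator name="coord2">
        <app-path>hdfs://bar:9000/apps/coord2</app-path>
    </coordinator>
</bundle-app>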
Interacting with Oozie
  • Read-Only Web Console
  • CLI
  • Java client
  • Web Service Endpoints
  • Directly with Oozie DB using SQL
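A few common CLI interactions, assuming a default server URL (the job ID below is made up):

# check server status
oozie admin -oozie http://localhost:11000/oozie -status

# submit and start a workflow using a properties file
oozie job -oozie http://localhost:11000/oozie -config job.properties -run

# get the status of a running job
oozie job -oozie http://localhost:11000/oozie -info 0000001-140101000000000-oozie-W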
Extending Oozie
  • Minimal workflow language containing a handful of controls and actions
  • Extensibility for custom action nodes
  • Creation of a custom action requires:
    • Java implementation, extending ActionExecutor
    • Implementation of the action’s XML schema, which defines the action’s configuration parameters
    • Packaging of the Java implementation and configuration schema into a JAR, which is added to the Oozie WAR
    • Extending oozie-site.xml to register information about the custom executor (see the sketch below)
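A rough skeleton of a custom executor, assuming Oozie's ActionExecutor API of this era; the "mytype" action type and all behavior here are invented for illustration.

import org.apache.oozie.action.ActionExecutor;
import org.apache.oozie.action.ActionExecutorException;
import org.apache.oozie.client.WorkflowAction;

public class MyActionExecutor extends ActionExecutor {

    public MyActionExecutor() {
        super("mytype");  // must match the element name in the custom XML schema
    }

    @Override
    public void start(Context context, WorkflowAction action)
            throws ActionExecutorException {
        // parse action.getConf(), do the work (or launch it asynchronously),
        // then report execution data back to Oozie
        context.setExecutionData("OK", null);
    }

    @Override
    public void end(Context context, WorkflowAction action)
            throws ActionExecutorException {
        // decide the action's final status once execution is complete
        context.setEndData(WorkflowAction.Status.OK, "OK");
    }

    @Override
    public void check(Context context, WorkflowAction action)
            throws ActionExecutorException {
        // poll an asynchronous action; a no-op for synchronous actions
    }

    @Override
    public void kill(Context context, WorkflowAction action)
            throws ActionExecutorException {
        context.setEndData(WorkflowAction.Status.KILLED, "KILLED");
    }

    @Override
    public boolean isCompleted(String externalStatus) {
        return true;  // synchronous action: complete as soon as start() returns
    }
}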
What do I need to deploy a workflow?
  • coordinator.xml
  • workflow.xml
  • Libraries
  • Properties
    • Contains things like NameNode and ResourceManager addresses and other job-specific properties
Configuring Workflows
  • Three mechanisms to configure a workflow
    • config-default.xml
    • job.properties
    • Job Arguments
  • Processed as such:
    • Use all of the parameters from command line invocation
    • Anything unresolved? Use job.properties
    • Use config-default.xml for everything else
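For illustration, a typical job.properties (host names and paths are placeholders):

nameNode=hdfs://bar.com:9000
jobTracker=foo.com:9001
queueName=default
inputDir=/user/hadoop/input
outputDir=/user/hadoop/output
oozie.wf.application.path=${nameNode}/user/hadoop/oozie/app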
Okay, I've built those
  • Now you can put it in HDFS and run it

hdfs dfs -put my_job oozie/app

oozie job -run -config job.properties

Java Action
  • A Java action will execute the main method of the specified Java class
  • Java classes should be packaged in a JAR and placed in the workflow application's lib/ directory
    • wf-app-dir/workflow.xml
    • wf-app-dir/lib
    • wf-app-dir/lib/myJavaClasses.JAR
Java Action

$ java -Xms512m a.b.c.MyJavaMain arg1 arg2

The equivalent action definition:

<action name='java1'>
    <java>
        ...
        <main-class>a.b.c.MyJavaMain</main-class>
        <java-opts>-Xms512m</java-opts>
        <arg>arg1</arg>
        <arg>arg2</arg>
        ...
    </java>
</action>

Java Action Execution
  • Executed as an MR job with a single task
  • So you need the MR information

<action name='java1'>
    <java>
        <job-tracker>foo.bar:8021</job-tracker>
        <name-node>foo1.bar:8020</name-node>
        ...
        <configuration>
            <property>
                <name>abc</name>
                <value>def</value>
            </property>
        </configuration>
    </java>
</action>

Capturing Output
  • How do I pass a parameter from my Java action to other actions?
  • Add the <capture-output/> element to your Java action
  • Reference the parameter in your following actions
  • Write some Java code to link them

<action name='java1'>
    <java>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
            <property>
                <name>mapred.job.queue.name</name>
                <value>${queueName}</value>
            </property>
        </configuration>
        <main-class>org.apache.oozie.test.MyTest</main-class>
        <arg>${outputFileName}</arg>
        <capture-output/>
    </java>
    <ok to="pig1"/>
    <error to="fail"/>
</action>


<action name='pig1'>
    <pig>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
            <property>
                <name>mapred.job.queue.name</name>
                <value>${queueName}</value>
            </property>
        </configuration>
        <script>script.pig</script>
        <param>MY_VAR=${wf:actionData('java1')['PASS_ME']}</param>
    </pig>
    <ok to="end"/>
    <error to="fail"/>
</action>


public static void main(String[] args) {
    String fileName = args[0];
    try {
        // Oozie tells the action where to write its output properties
        File file = new File(
                System.getProperty("oozie.action.output.properties"));
        Properties props = new Properties();
        props.setProperty("PASS_ME", "123456");
        OutputStream os = new FileOutputStream(file);
        props.store(os, "");
        os.close();
        System.out.println(file.getAbsolutePath());
    } catch (Exception e) {
        e.printStackTrace();
    }
    System.exit(0);
}

A Use Case: Hourly Jobs
  • Replace a CRON job that runs a bash script once a day
    • Java main class that pulls data from a file stream and dumps it to HDFS
    • Runs a MapReduce job on the files
    • Emails a person when finished
    • Start within X amount of time
    • Complete within Y amount of time
    • And retry Z times on failure

<workflow-app name="filestream_wf" xmlns="uri:oozie:workflow:0.1">
    <start to="java-node"/>

    <!-- step 1: Java main pulls data from the file stream into HDFS -->
    <action name="java-node">
        <java>
            <job-tracker>foo:9001</job-tracker>
            <name-node>bar:9000</name-node>
            <main-class>org.foo.bar.PullFileStream</main-class>
        </java>
        <ok to="mr-node"/>
        <error to="fail"/>
    </action>

    <!-- step 2: run a MapReduce job on the files -->
    <action name="mr-node">
        <map-reduce>
            <job-tracker>foo:9001</job-tracker>
            <name-node>bar:9000</name-node>
            <configuration>
                ...
            </configuration>
        </map-reduce>
        <ok to="email-node"/>
        <error to="fail"/>
    </action>

    ...

    <!-- step 3: email a person when finished -->
    <action name="email-node">
        <email xmlns="uri:oozie:email-action:0.1">
            <to>customer@foo.bar</to>
            <cc>employee@foo.bar</cc>
            <subject>Email notification</subject>
            <body>The wf completed</body>
        </email>
        <ok to="myotherjob"/>
        <error to="errorcleanup"/>
    </action>

    <end name="end"/>
    <kill name="fail">
        <message>Job failed</message>
    </kill>
</workflow-app>


<?xml version="1.0"?>
<coordinator-app name="daily_job_coord"
                 frequency="${coord:days(1)}"
                 start="${COORD_START}"
                 end="${COORD_END}"
                 timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.1"
                 xmlns:sla="uri:oozie:sla:0.1">
    <action>
        <workflow>
            <app-path>hdfs://bar:9000/user/hadoop/oozie/app/test_job</app-path>
        </workflow>
        <!-- the SLA block covers the "start within X" and "complete within Y" requirements -->
        <sla:info>
            <sla:nominal-time>${coord:nominalTime()}</sla:nominal-time>
            <sla:should-start>${X * MINUTES}</sla:should-start>
            <sla:should-end>${Y * MINUTES}</sla:should-end>
            <sla:alert-contact>foo@bar.com</sla:alert-contact>
        </sla:info>
    </action>
</coordinator-app>

Review
  • Oozie ties together many Hadoop ecosystem components to "productionalize" this stuff
  • Advanced control flow and action extensibility let Oozie do whatever you need it to do at any point in the workflow
  • XML is gross
References
  • http://oozie.apache.org
  • https://cwiki.apache.org/confluence/display/OOZIE/Index
  • http://www.slideshare.net/mattgoeke/oozie-riot-games
  • http://www.slideshare.net/mislam77/oozie-sweet-13451212
  • http://www.slideshare.net/ChicagoHUG/everything-you-wanted-to-know-but-were-afraid-to-ask-about-oozie