Oozie-HCatalog Integration

Oozie Team

Agenda
  • Why does Oozie need HCatalog support?
  • Architecture
  • How to support existing synchronous data processing using HCatalog
  • Examples
  • Future work
Current Oozie Coordinator

<coordinator-app frequency="${coord:hours(4)}" start="2011-01-01T04:00Z" end="2011-01-10T00:00Z" ..>
  <datasets>
    <dataset name="input1" frequency="60" initial-instance="2011-01-01T00:00Z" timezone="UTC">
      <uri-template>hdfs://<namenode>:8020/data/click/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="coordInput1" dataset="input1">
      <start-instance>${coord:current(-3)}</start-instance>
      <end-instance>${coord:current(0)}</end-instance>
    </data-in>
  </input-events>
  ……..
  <workflow>
    <configuration>
      <property>
        <name>myinput</name>
        <value>${coord:dataIn('coordInput1')}</value>
      </property>
      <property>
        <name>MY_VAR</name>
        <value>ANYVALUE</value>
      </property>
    </configuration>
  </workflow>

High Level Diagram (Oozie-HCat-Notification)

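[Diagram] Components: HCatalog, Oozie, MessageQ. Labelled steps:

  1. Query/poll partition (between Oozie and HCatalog)
  2. Register topic (between Oozie and MessageQ)
  3. Notify new partition (between HCatalog and MessageQ)
  4. Push <new partition> (between MessageQ and Oozie)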

Architecture

[Diagram] Components: Message Bus, HCat Server, Oozie, Database.

Inside Oozie: Partition Dependency Manager Service, JMS Message Handler, Recovery Service, Materialize Action.

Labelled flows: Cold start; Partition is available; Mark as available; Add entry; Action READY?; Update action table for dependencies; Persist missing dependencies.
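
The bookkeeping suggested by the "Add entry", "Partition is available" and "Action READY?" labels can be pictured with a small in-memory sketch. This is only an illustration under assumptions: the class name `PartitionDependencySketch`, its methods, and the use of plain strings for action ids and partitions are hypothetical and are not taken from Oozie's code. Persistence to the database and the JMS wiring are left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not Oozie's actual Partition Dependency Manager.
public class PartitionDependencySketch {
    // missing partition -> ids of the actions waiting on it
    private final Map<String, Set<String>> waiting = new HashMap<>();
    // action id -> partitions that action still needs
    private final Map<String, Set<String>> missing = new HashMap<>();

    /** "Add entry": record that an action still waits for a partition. */
    public void addMissingDependency(String actionId, String partition) {
        waiting.computeIfAbsent(partition, k -> new HashSet<>()).add(actionId);
        missing.computeIfAbsent(actionId, k -> new HashSet<>()).add(partition);
    }

    /**
     * "Partition is available": mark the partition as available and return
     * the ids of the actions whose dependencies are now all satisfied.
     */
    public List<String> partitionAvailable(String partition) {
        List<String> ready = new ArrayList<>();
        Set<String> actions = waiting.remove(partition);
        if (actions == null) {
            return ready; // nobody was waiting for this partition
        }
        for (String actionId : actions) {
            Set<String> deps = missing.get(actionId);
            deps.remove(partition);
            if (deps.isEmpty()) { // "Action READY?"
                missing.remove(actionId);
                ready.add(actionId);
            }
        }
        Collections.sort(ready); // stable order for callers
        return ready;
    }
}
```

In this sketch a message handler would call `partitionAvailable` for each notification it receives, and the returned ids are the actions that no longer have missing dependencies.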

HCat-based Dataset in Coordinator
  • Dataset definition:

<dataset name="my_ds" initial-instance="DateStamp" frequency="5" type="metadata">
  <uri-template>hcat://server:port/db/mydb/table/T1/?p_key1=v1;p_key2=v2;p_key3=v3</uri-template>
</dataset>

  • URI-template example:

hcat://server:port/db/mydb/table/clicks/?datestamp=${YEAR}${MONTH}${DAY};region=us
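
To make the template concrete, here is a minimal sketch of how such a URI template could be resolved for one dataset instance. It assumes the `${YEAR}`, `${MONTH}` and `${DAY}` placeholder syntax used in the HDFS dataset earlier; the class and method names are hypothetical, and Oozie's own resolution logic is not shown in the slides.

```java
import java.time.LocalDate;

// Hypothetical helper: substitutes date placeholders in a dataset URI template.
public class HCatUriTemplateSketch {
    public static String resolve(String template, LocalDate date) {
        return template
            .replace("${YEAR}", String.format("%04d", date.getYear()))
            .replace("${MONTH}", String.format("%02d", date.getMonthValue()))
            .replace("${DAY}", String.format("%02d", date.getDayOfMonth()));
    }

    public static void main(String[] args) {
        String template =
            "hcat://server:port/db/mydb/table/clicks/?datestamp=${YEAR}${MONTH}${DAY};region=us";
        // Prints the template with datestamp=20120915
        System.out.println(resolve(template, LocalDate.of(2012, 9, 15)));
    }
}
```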

Input/Output partition
  • Oozie passes the input/output partitions to the workflow application as a string through the <configuration> section.
  • Example of a resolved set of partitions:
    • [hcat://server:port/db/mydb/table/clicks/?datestamp=20120915;region=us][hcat://server:port/db/mydb/table/clicks/?datestamp=20120916;region=us]
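
A workflow application that receives such a string has to pull the database, table and partition keys back out of each URI. The following sketch shows one way to do that for a single URI of the form used on these slides (`hcat://server:port/db/<db>/table/<table>/?k1=v1;k2=v2`). The class name and fields are hypothetical, and the layout is taken from the examples here, not from a published URI specification.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical parser for the hcat:// URI layout shown on the slides.
public class HCatUriSketch {
    public final String server;
    public final String db;
    public final String table;
    // Partition keys in the order they appear in the URI
    public final Map<String, String> partition = new LinkedHashMap<>();

    public HCatUriSketch(String uri) {
        String prefix = "hcat://";
        if (!uri.startsWith(prefix)) {
            throw new IllegalArgumentException("not an hcat URI: " + uri);
        }
        int q = uri.indexOf('?');
        String path = q < 0 ? uri.substring(prefix.length())
                            : uri.substring(prefix.length(), q);
        // Expected segments: server:port, "db", <db>, "table", <table>
        String[] seg = path.split("/");
        if (seg.length < 5 || !seg[1].equals("db") || !seg[3].equals("table")) {
            throw new IllegalArgumentException("unexpected path: " + path);
        }
        server = seg[0];
        db = seg[2];
        table = seg[4];
        if (q >= 0) {
            for (String kv : uri.substring(q + 1).split(";")) {
                if (kv.isEmpty()) {
                    continue;
                }
                int eq = kv.indexOf('=');
                if (eq < 0) {
                    throw new IllegalArgumentException("bad partition spec: " + kv);
                }
                partition.put(kv.substring(0, eq), kv.substring(eq + 1));
            }
        }
    }
}
```

Plain string handling is used instead of `java.net.URI` because the placeholder authority `server:port` and the `;`-separated key/value list are specific to this layout.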
Pig script Using HCatalog
  • A typical Pig script:

A = LOAD 'dbname1.tablename1' USING org.apache.hcatalog.pig.HCatLoader();
B = FILTER A BY (datestamp == '2012-09-12' AND region == 'us') OR (datestamp == '2012-09-11' AND region == 'us');
my_processed_data = ...
STORE my_processed_data INTO 'dbname2.tablename2' USING org.apache.hcatalog.pig.HCatStorer('date=20120912', 'a:int,b:chararray,c:map[]');
Map-Reduce Job using HCatalog
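import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hcatalog.mapreduce.HCatInputFormat;
import org.apache.hcatalog.mapreduce.InputJobInfo;

// Fragment from inside a job driver; dbName, tableName, filter and thriftUri are supplied by the caller.
Configuration conf = new Configuration();
Job job = new Job(conf, "hcatmapreduce read test");
job.setJarByClass(this.getClass());
job.setMapperClass(HCatMapReduceTest.MapRead.class);
job.setInputFormatClass(HCatInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
InputJobInfo inputJobInfo = InputJobInfo.create(dbName, tableName, filter, thriftUri, null);
HCatInputFormat.setInput(job, inputJobInfo);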
A Typical HCatalog App Needs
  • DB Name
  • Table Name
  • Thrift URI of HCat server
    • For Pig, it can be passed as a -D option
    • Q: Will there be any other protocol (other than thrift) supported for HCAT?
  • Filters
    • Same partition: Keys are separated by AND
      • Ex: region = us AND date = 20110811
    • Different partitions: Partitions are separated by OR
      • Ex: (region = us AND date = 20110811) OR (region = us AND date = 20110812)
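
The AND/OR rule above can be sketched as a small helper that turns a list of partitions into one filter string. The class and method names are hypothetical, values are left unquoted as on the slide, and the exact syntax HCatalog expects for filter values is not covered here.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper: keys of one partition joined by AND, partitions joined by OR.
public class PartitionFilterSketch {
    public static String toFilter(List<Map<String, String>> partitions) {
        return partitions.stream()
            .map(p -> p.entrySet().stream()
                .map(e -> e.getKey() + " = " + e.getValue())
                .collect(Collectors.joining(" AND ", "(", ")")))
            .collect(Collectors.joining(" OR "));
    }
}
```

Passing maps with a predictable iteration order (for example `LinkedHashMap`) keeps the keys in the generated filter in the same order as in the partition URI.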
Parameter Passing from Coordinator
  • Oozie provides EL functions to:
    • Get the DB name of input/output datasets
      • e.g. getDatabaseIn('dsName') and getDatabaseOut('dsName')
    • Get the table name of a dataset
      • e.g. getTableIn('dsName') and getTableOut('dsName')
    • Get the partition filter string for each input event
      • getPartitionsPigFilter('in-event')
    • Get a specific partition key's value for use in range filtering
      • getPartitionValue('key', 'dsName')
    • Get the partition definition for each output event
      • getOutputPartitionsPig('out-event')
Future work
  • Support asynchronous data processing?
  • Wildcard-like support through the HCatalog mark-set-done feature
Challenges
  • Scalability, scalability….