Oozie-HCatalog Integration

cadman-buck

Presentation Transcript


  1. Oozie-HCatalog Integration Oozie Team

  2. Agenda • Why does Oozie need HCatalog support? • Architecture • How to support existing synchronous data processing using HCatalog? • Examples • Future work

  3. Current Oozie Coordinator
     <coordinator-app frequency="${coord:hours(4)}" start="2011-01-01T04:00Z" end="2011-01-10T00:00Z" ..>
       <datasets>
         <dataset name="input1" frequency="60" initial-instance="2011-01-01T00:00Z" timezone="UTC">
           <uri-template>hdfs://<namenode>:8020/data/click/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
         </dataset>
       </datasets>
       <input-events>
         <data-in name="coordInput1" dataset="input1">
           <start-instance>${coord:current(-3)}</start-instance>
           <end-instance>${coord:current(0)}</end-instance>
         </data-in>
       </input-events>
       ……..
       <workflow>
         <configuration>
           <property>
             <name>myinput</name>
             <value>${coord:dataIn('coordInput1')}</value>
           </property>
           <property>
             <name>MY_VAR</name>
             <value>ANYVALUE</value>
           </property>
         </configuration>
       </workflow>
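To make the `coord:current(-3)` to `coord:current(0)` range concrete, here is a minimal sketch of how the four hourly input directories could be derived for one coordinator action. It is not Oozie code: the class `DatasetInstances`, its methods, and the host name `namenode` are hypothetical, and it assumes `current(0)` is the latest dataset instance at or before the action's nominal time, counted from `initial-instance`. It also ignores the rule that instances before `initial-instance` do not exist.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only; not part of Oozie.
final class DatasetInstances {
    // Mirrors ${YEAR}/${MONTH}/${DAY}/${HOUR} from the uri-template, in UTC.
    private static final DateTimeFormatter PATH =
            DateTimeFormatter.ofPattern("yyyy/MM/dd/HH").withZone(ZoneOffset.UTC);

    // Instance at or before nominalTime, shifted by `offset` dataset periods.
    static Instant current(Instant initialInstance, Duration frequency,
                           Instant nominalTime, int offset) {
        long elapsedMinutes = Duration.between(initialInstance, nominalTime).toMinutes();
        long index = Math.floorDiv(elapsedMinutes, frequency.toMinutes()) + offset;
        return initialInstance.plus(frequency.multipliedBy(index));
    }

    // Resolves every instance from current(start) to current(end), inclusive.
    static List<String> resolve(String uriPrefix, Instant initialInstance, Duration frequency,
                                Instant nominalTime, int start, int end) {
        List<String> uris = new ArrayList<>();
        for (int n = start; n <= end; n++) {
            uris.add(uriPrefix + PATH.format(current(initialInstance, frequency, nominalTime, n)));
        }
        return uris;
    }
}
```

For the first action of the coordinator above (nominal time 2011-01-01T04:00Z, dataset frequency 60 minutes), the sketch yields the directories for hours 01 through 04 of 2011-01-01.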

  4. High Level Diagram (Oozie-HCat-Notification) • Components: Oozie, HCatalog, MessageQ • 1. Query/Poll Partition • 2. Register Topic • 3. Notify New Partition • 4. Push <New Partition>

  5. Architecture (diagram) • Components: Message Bus, HCat Server, Oozie (Partition Dependency Manager Service, JMS Message handler, Recovery Service), Database • Diagram labels: Cold start; Partition is available; Mark as available; Add entry; Materialize Action; Action READY?; Update action table for dependencies; Persist missing dependencies
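The bookkeeping suggested by the "Add entry", "Mark as available", and "Action READY?" labels can be sketched as a small in-memory tracker. This is only an illustration under assumptions: the class `PartitionDependencyTracker` and its method names are hypothetical, and the real service also persists missing dependencies to the database and receives notifications over JMS, neither of which is modeled here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory sketch; not the actual Oozie service.
final class PartitionDependencyTracker {
    // Partition URI -> IDs of actions waiting on that partition.
    private final Map<String, Set<String>> waitingByPartition = new HashMap<>();
    // Action ID -> partition URIs that are still missing for that action.
    private final Map<String, Set<String>> missingByAction = new HashMap<>();

    // "Add entry": record that an action is waiting for a partition.
    void addMissingDependency(String actionId, String partitionUri) {
        waitingByPartition.computeIfAbsent(partitionUri, k -> new HashSet<>()).add(actionId);
        missingByAction.computeIfAbsent(actionId, k -> new HashSet<>()).add(partitionUri);
    }

    // "Mark as available": called when a new-partition notification arrives.
    // Returns the IDs of actions whose dependencies are now all satisfied.
    List<String> partitionAvailable(String partitionUri) {
        List<String> ready = new ArrayList<>();
        Set<String> waiting = waitingByPartition.remove(partitionUri);
        if (waiting == null) {
            return ready; // nobody was waiting on this partition
        }
        for (String actionId : waiting) {
            Set<String> missing = missingByAction.get(actionId);
            missing.remove(partitionUri);
            if (missing.isEmpty()) {
                missingByAction.remove(actionId);
                ready.add(actionId);
            }
        }
        return ready;
    }
}
```

Keeping both maps lets a notification be handled by looking up only the actions that wait on the announced partition, rather than scanning every pending action.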

  6. HCat-based Dataset in Coordinator
     <dataset name="my_ds" initial-instance="DateStamp" frequency="5" type="metadata">
       <uri-template>hcat://server:port/db/mydb/table/T1/?p_key1=v1;p_key2=v2;p_key=v3</uri-template>
     </dataset>
     • URI-template example: hcat://server:port/db/mydb/table/clicks/?datestamp=${YEAR}${MONTH}${DAY};region=us
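The URI layout on this slide (server, `db/<name>`, `table/<name>`, then `;`-separated partition key/value pairs after `?`) can be taken apart with plain string handling. The sketch below assumes exactly that layout; the class `HcatUri` is a hypothetical helper and is not an Oozie or HCatalog API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical parser for the URI layout shown on the slide.
final class HcatUri {
    final String server;
    final String db;
    final String table;
    final Map<String, String> partition; // keys kept in URI order

    private HcatUri(String server, String db, String table, Map<String, String> partition) {
        this.server = server;
        this.db = db;
        this.table = table;
        this.partition = partition;
    }

    static HcatUri parse(String uri) {
        String scheme = "hcat://";
        if (!uri.startsWith(scheme)) {
            throw new IllegalArgumentException("not an hcat URI: " + uri);
        }
        int q = uri.indexOf('?');
        String path = q < 0 ? uri.substring(scheme.length()) : uri.substring(scheme.length(), q);
        String query = q < 0 ? "" : uri.substring(q + 1);

        // Expected segments: server:port, "db", <db name>, "table", <table name>
        String[] parts = path.split("/");
        if (parts.length < 5 || !parts[1].equals("db") || !parts[3].equals("table")) {
            throw new IllegalArgumentException("unexpected path layout: " + path);
        }

        Map<String, String> partition = new LinkedHashMap<>();
        for (String pair : query.split(";")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            if (eq <= 0) {
                throw new IllegalArgumentException("bad partition spec: " + pair);
            }
            partition.put(pair.substring(0, eq), pair.substring(eq + 1));
        }
        return new HcatUri(parts[0], parts[2], parts[4], partition);
    }
}
```

Parsing a resolved URI such as `hcat://server:port/db/mydb/table/clicks/?datestamp=20120915;region=us` gives database `mydb`, table `clicks`, and the partition keys `datestamp` and `region`.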

  7. Input/Output partition • Oozie will pass input/output partitions to the WF application as a string through the <configuration> section. • Example of a resolved set of partitions: [hcat://server:port/db/mydb/table/clicks/?datestamp=20120915;region=us][hcat://server:port/db/mydb/table/clicks/?datestamp=20120916;region=us]

  8. Pig script Using HCatalog • A typical Pig script:
     A = LOAD 'dbname1.tablename1' USING org.apache.hcatalog.pig.HCatLoader();
     B = FILTER A BY (datestamp == '2012-09-12' AND region == 'us') OR (datestamp == '2012-09-11' AND region == 'us');
     my_processed_data = ...
     STORE my_processed_data INTO 'dbname2.tablename2' USING org.apache.hcatalog.pig.HCatStorer('date=20120912', 'a:int,b:chararray,c:map[]');

  9. Map-Reduce Job using HCatalog
     Configuration conf = new Configuration();
     Job job = new Job(conf, "hcatmapreduce read test");
     job.setJarByClass(this.getClass());
     job.setMapperClass(HCatMapReduceTest.MapRead.class);
     // Read input through HCatalog; write plain text output.
     job.setInputFormatClass(HCatInputFormat.class);
     job.setOutputFormatClass(TextOutputFormat.class);
     // Database, table, partition filter, and Thrift URI identify the input.
     InputJobInfo inputJobInfo = InputJobInfo.create(dbName, tableName, filter, thriftUri, null);
     HCatInputFormat.setInput(job, inputJobInfo);

  10. A Typical HCatalog App Needs
     • DB Name
     • Table Name
     • Thrift URI of HCat server
       • For Pig, it could be passed as a -D option
       • Q: Will any protocol other than Thrift be supported for HCat?
     • Filters
       • Same partition: keys are separated by AND
         • Ex: region = us AND date = 20110811
       • Different partitions: partitions are separated by OR
         • Ex: region = us AND date = 20110811 OR region = us AND date = 20110812
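The AND/OR rule above translates directly into code: join the keys of one partition with AND, then join the partitions with OR. The following is a minimal sketch; the class `PartitionFilter` is hypothetical, and the parentheses and single-quoted values are a formatting choice made here for readability, not something the slides prescribe.

```java
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper that builds a filter string from partition specs.
final class PartitionFilter {
    // Each map is one partition: key -> value, iterated in map order.
    static String toFilter(List<Map<String, String>> partitions) {
        StringJoiner anyPartition = new StringJoiner(" OR ");
        for (Map<String, String> partition : partitions) {
            // Keys within the same partition are combined with AND.
            StringJoiner allKeys = new StringJoiner(" AND ", "(", ")");
            for (Map.Entry<String, String> e : partition.entrySet()) {
                allKeys.add(e.getKey() + " = '" + e.getValue() + "'");
            }
            // Different partitions are combined with OR.
            anyPartition.add(allKeys.toString());
        }
        return anyPartition.toString();
    }
}
```

Passing maps with a stable iteration order (for example `LinkedHashMap`) keeps the keys in the generated filter in the same order as in the partition spec.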

  11. Parameter Passing from Coordinator
     • Oozie provides multiple EL functions for the following:
       • Get the DB name of input/output datasets
         • e.g. getDatabaseIn('dsName') & getDatabaseOut('dsName')
       • Get the table name of a dataset
         • e.g. getTableIn('dsName') & getTableOut('dsName')
       • Get the partition filter string for each input-event
         • getPartitionsPigFilter('in-event')
       • Get a specific partition key's value for use in range filtering
         • getPartitionValue('key','dsName')
       • Get the partition definition for each output-event
         • getOutputPartitionsPig('out-event')

  12. Future work • Support asynchronous data processing? • Support wildcard-like matching through HCatalog's mark-set-done feature.

  13. Challenges • Scalability, scalability, ...
