
YSmart: Yet Another SQL-to-MapReduce Translator


Presentation Transcript


  1. YSmart: Yet Another SQL-to-MapReduce Translator
Rubao Lee [1], Tian Luo [1], Yin Huai [1], Fusheng Wang [2], Yongqiang He [3], Xiaodong Zhang [1]
• [1] Dept. of Computer Sci. & Eng., The Ohio State University
• [2] Center for Comprehensive Informatics, Emory University
• [3] Data Infrastructure Team, Facebook

  2. Data Explosion in Human Society
[Chart: the global storage capacity and the amount of digital information created and replicated in a year, 2000–2007 — by 2007, analog storage held 18.86 billion GB while digital storage held 276.12 billion GB]
Source: “Exabytes: Documenting the 'digital age' and huge growth in computing capacity,” The Washington Post

  3. Challenges of Big Data Management and Analytics
• Existing DB technology is not prepared for such huge volumes
• Storage and system scale (e.g. 70TB/day and 3000 nodes in Facebook)
• Many applications in science, medicine and business follow the same trend
• Big data demands more complex analytics than data transactions
• Data mining, time series analysis, etc. to gain deep insights
• The conventional DB business model is not affordable
• Software licenses, maintenance fees, $10,000/TB*
• The conventional DB processing model is to “scale up”
• By upgrading CPU/memory/storage/networks (HPC model)
• The big data processing model is to “scale out”: continuously adding low-cost computing and storage nodes in a distributed way
The MapReduce (MR) programming model has become an effective data processing engine for big data analytics
• Hadoop is the most widely used implementation of MapReduce
• in hundreds of society-dependent corporations/organizations for big data analytics: AOL, Baidu, EBay, Facebook, IBM, NY Times, Yahoo! …
*: http://www.dbms2.com/2010/10/15/pricing-of-data-warehouse-appliances/

  4. MapReduce Overview
• A simple but effective programming model designed to process huge volumes of data concurrently on large-scale clusters
• Key/value pair transitions (a minimal code sketch follows below)
• Map: (k1, v1) → (k2, v2)
• Reduce: (k2, a list of v2) → (k3, v3)
• Shuffle: Partition Key (it could be the same as k2, or not)
• Partition Key: determines how a key/value pair in the map output is transferred to a reduce task
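To make the key/value transitions concrete, here is a minimal sketch of the two functions in the Hadoop Java API, using word count as a running example (the example and its class names are ours, not from the talk):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountSketch {
  // Map: (k1 = byte offset, v1 = line of text) -> (k2 = word, v2 = 1)
  public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        word.set(token);
        context.write(word, ONE); // shuffled to a reduce task by the partition key (here, k2 itself)
      }
    }
  }
  // Reduce: (k2 = word, a list of v2) -> (k3 = word, v3 = total count)
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }
}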

  5. MR (Hadoop) Job Execution Overview
The execution of an MR job involves 6 steps.
[Diagram: an MR program (job) is submitted to the master node; worker nodes run map tasks and reduce tasks; data is stored in a distributed file system (e.g. the Hadoop Distributed File System)]
• 1: Job submission
• 2: Assign tasks — control-level work, e.g. job scheduling and task assignment; worker nodes do the data processing work specified by the Map or Reduce function

  6. MR (Hadoop) Job Execution Overview (cont.)
• 3: Map phase — concurrent map tasks
• 4: Shuffle phase — map output is shuffled to different reduce tasks based on Partition Keys (PKs), usually the map output keys

  7. MR (Hadoop) Job Execution Overview (cont.)
• 5: Reduce phase — concurrent reduce tasks
• 6: Output is stored back to the distributed file system

  8. MR (Hadoop) Job Execution Overview (cont.)
A MapReduce (MR) job is resource-consuming:
• 1: Input data scan in the Map phase => local or remote I/Os
• 2: Store intermediate results of the Map output => local I/Os
• 3: Transfer data across nodes in the Shuffle phase => network costs
• 4: Store the final results of the MR job => local I/Os + network costs (replicating data)

  9. MR programming is not that “simple”!

package tpch;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Q18Job1 extends Configured implements Tool {

  public static class Map extends Mapper<Object, Text, IntWritable, Text> {
    private final static Text value = new Text();
    private IntWritable word = new IntWritable();
    private String inputFile;
    private boolean isLineitem = false;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
      // Decide which table this map task is scanning from the input file name.
      inputFile = ((FileSplit) context.getInputSplit()).getPath().getName();
      if (inputFile.compareTo("lineitem.tbl") == 0) {
        isLineitem = true;
      }
      System.out.println("isLineitem:" + isLineitem + " inputFile:" + inputFile);
    }

    public void map(Object key, Text line, Context context)
        throws IOException, InterruptedException {
      String[] tokens = line.toString().split("\\|");
      if (isLineitem) {
        // lineitem side: emit (l_orderkey, quantity) tagged with "l"
        word.set(Integer.valueOf(tokens[0]));
        value.set(tokens[4] + "|l");
        context.write(word, value);
      } else {
        // orders side: emit (o_orderkey, custkey|orderdate|totalprice) tagged with "o"
        word.set(Integer.valueOf(tokens[0]));
        value.set(tokens[1] + "|" + tokens[4] + "|" + tokens[3] + "|o");
        context.write(word, value);
      }
    }
  }

  public static class Reduce extends Reducer<IntWritable, Text, IntWritable, Text> {
    private Text result = new Text();

    public void reduce(IntWritable key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      double sumQuantity = 0.0;
      IntWritable newKey = new IntWritable();
      boolean isDiscard = true;
      String thisValue = new String();
      int thisKey = 0;
      for (Text val : values) {
        String[] tokens = val.toString().split("\\|");
        if (tokens[tokens.length - 1].compareTo("l") == 0) {
          sumQuantity += Double.parseDouble(tokens[0]);
        } else if (tokens[tokens.length - 1].compareTo("o") == 0) {
          thisKey = Integer.valueOf(tokens[0]);
          thisValue = key.toString() + "|" + tokens[1] + "|" + tokens[2];
        } else {
          continue;
        }
      }
      if (sumQuantity > 314) {
        isDiscard = false;
      }
      if (!isDiscard) {
        thisValue = thisValue + "|" + sumQuantity;
        newKey.set(thisKey);
        result.set(thisValue);
        context.write(newKey, result);
      }
    }
  }

  public int run(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 3) {
      System.err.println("Usage: Q18Job1 <orders> <lineitem> <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "TPC-H Q18 Job1");
    job.setJarByClass(Q18Job1.class);
    job.setMapperClass(Map.class);
    job.setMapOutputKeyClass(IntWritable.class);
    job.setMapOutputValueClass(Text.class);
    job.setReducerClass(Reduce.class);
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileInputFormat.addInputPath(job, new Path(otherArgs[1]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[2]));
    return (job.waitForCompletion(true) ? 0 : 1);
  }

  public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new Q18Job1(), args);
    System.exit(res);
  }
}

This complex code is for a simple MR job — Low productivity! Do you miss something like … “SELECT * FROM Book WHERE price > 100.00”?

  10. High-Level Programming and Data Processing Environment on top of Hadoop
A job description in an SQL-like declarative language is the interface between users and MR programs (jobs): an SQL-to-MapReduce translator writes the MR programs (jobs), which run on workers over the Hadoop Distributed File System (HDFS).

  11. High-Level Programming and Data Processing Environment on top of Hadoop (cont.)
The translator layer is implemented by Hive, a data warehousing system (Facebook), and Pig, a high-level programming environment (Yahoo!).
• They improve productivity over hand-coding MapReduce programs
• 95%+ of Hadoop jobs in Facebook are generated by Hive
• 75%+ of Hadoop jobs in Yahoo! are invoked by Pig*
* http://hadooplondon.eventbrite.com/

  12. Translating SQL-like Queries to MapReduce Jobs: the Existing Approach
• “Sentence by sentence” translation
• [C. Olston et al., SIGMOD 2008], [A. Gates et al., VLDB 2009] and [A. Thusoo et al., ICDE 2010]
• Implementations: Hive and Pig
• Three steps
• Identify major sentences with operations that shuffle the data
• Such as: Join, Group by and Order by
• For every operation in a major sentence, a corresponding MR job is generated
• e.g. a join op. => a join MR job
• Add other operations, such as selection and projection, into the corresponding MR jobs
Can existing SQL-to-MapReduce translators provide performance comparable to optimized MR job(s)?

  13. An Example: TPC-H Q21
• One of the most complex and time-consuming queries in the TPC-H benchmark for data warehousing performance
• Optimized MR jobs vs. Hive in a Facebook production cluster: the optimized jobs are 3.7x faster
What’s wrong?

  14. The Execution Plan of TPC-H Q21
[Plan tree: lineitem and orders feed Join1, and two further scans of lineitem feed AGG1 and AGG2; their results flow through Join2 and a Left-outer-Join, then Join3 and Join4 bring in supplier and nation, followed by AGG3 and SORT]
• Hive handles this sub-tree in a different way from our optimized MR jobs
• This sub-tree dominates the execution time (~90%)

  15. What’s wrong with existing SQL-to-MR translators?
[Diagram: the five MR jobs J1–J5 translated from the dominant sub-tree of Q21 (JOIN jobs, AGG jobs and a composite MR job) over the tables lineitem and orders; every job partitions on Key: l_orderkey]
However, inter-job correlations exist. Let’s look at the Partition Key:
• J1, J2 and J4 all need the input table ‘lineitem’
• J1 to J5 all use the same partition key ‘l_orderkey’
Existing translators are correlation-unaware. They
• ignore common data input,
• ignore common data transition, and
• add unnecessary data re-partition.

  16. Approaches of Big Data Analytics in MR: the Landscape
The landscape trades productivity against performance:
• Hand-coding MR jobs — Pro: high-performance MR programs. Con: 1: lots of coding, even for a simple job; 2: redundant coding is inevitable; 3: hard to debug [J. Tan et al., ICDCS 2010]
• Existing SQL-to-MR translators — Pro: easy programming, high productivity. Con: poor performance on complex queries (and complex queries are usual in daily operations)
• A correlation-aware SQL-to-MR translator aims at both high productivity and high performance

  17. Outline • Background and Motivation • Our Solution: YSmart, a correlation-aware SQL-to-MapReduce translator • Performance Evaluation • YSmart in Hive • Conclusion

  18. Our Approaches and Critical Challenges
A correlation-aware SQL-to-MR translator takes SQL-like queries, generates primitive MR jobs, identifies correlations among them, and merges correlated MR jobs into MR jobs with the best performance. The critical challenges:
• 1: Correlation possibilities and their detection
• 2: Rules for automatically exploiting correlations
• 3: Implementing high-performance and low-overhead MR jobs

  19. Primitive MR Jobs
• Selection-Projection (SP) MR Job:
• A simple query with only selection and/or projection
• Aggregation (AGG) MR Job:
• Groups the input relation and applies aggregation functions
• e.g. SELECT count(*) FROM faculty GROUP BY age
• The partition key can be any column(s) in the GROUP BY clause (see the sketch after this list)
• Join (JOIN) MR Job:
• For an equi-join on two relations (inner or left/right/full outer join)
• e.g. SELECT * FROM faculty LEFT OUTER JOIN awards ON …
• The partition key is in the JOIN predicate
• Sort (SORT) MR Job:
• Sorts the input relation
• Usually the last step in a complex query
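As a sketch of how the AGG primitive could be compiled into a single MR job for the GROUP BY example above, assuming faculty is stored as a '|'-delimited text table with age in column 2 (the file format and column position are illustrative assumptions, as are the class names):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// SELECT count(*) FROM faculty GROUP BY age
// The partition key is age, a column from the GROUP BY clause.
public class AggJobSketch {
  public static class AggMapper extends Mapper<Object, Text, IntWritable, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final IntWritable age = new IntWritable();
    public void map(Object key, Text row, Context context)
        throws IOException, InterruptedException {
      String[] cols = row.toString().split("\\|");
      age.set(Integer.parseInt(cols[2])); // assumed position of the age column
      context.write(age, ONE);
    }
  }
  public static class AggReducer extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
    public void reduce(IntWritable age, Iterable<IntWritable> ones, Context context)
        throws IOException, InterruptedException {
      int count = 0;
      for (IntWritable one : ones) {
        count += one.get();
      }
      context.write(age, new IntWritable(count)); // one (age, count(*)) row per group
    }
  }
}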

  20. Input Correlation (IC)
• Multiple MR jobs have input correlation (IC) if their input relation sets are not disjoint
[Diagram: the map functions of MR jobs J1 and J2 share the input relation lineitem; J1 also reads orders]

  21. Transit Correlation (TC)
• Multiple MR jobs have transit correlation (TC) if
• they have input correlation (IC), and
• they have the same partition key
[Diagram: J1 and J2 share the input relation lineitem and both partition on Key: l_orderkey]

  22. Job Flow Correlation (JFC)
• An MR job has job flow correlation (JFC) with one of its child MR jobs if it has the same partition key as that child job
[Diagram: the output of MR job J2 (map and reduce over lineitem and orders) feeds the reduce function of its parent J1, which uses the same partition key]

  23. Query Optimization Rules for Automatically Exploiting Correlations
• Job-Merging: exploits both input correlation and transit correlation
• AGG-Pushdown: exploits the job flow correlation associated with aggregation jobs
• JOIN-Merging-Full: exploits the job flow correlation associated with JOIN jobs and their transit-correlated parent jobs
• JOIN-Merging-Half: exploits the job flow correlation associated with JOIN jobs

  24. Rule 1: Job-Merging
• Condition: Job 1 and Job 2 have both IC and TC
• Action: merge the Map/Reduce functions of Job 1 and Job 2 into a single Map/Reduce function (sketch below)
[Diagram: the merged job reads lineitem and orders once; its map function does the work of both jobs’ map functions, and its reduce function does all the work of Job 1’s and Job 2’s reduce functions, generating two outputs]
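A minimal sketch of what the merged map function could look like for two jobs that both scan lineitem and both partition on l_orderkey. The tag strings and field positions are our illustration, not YSmart's actual wire format (the real common framework tags records differently to shrink intermediate data; see slide 28):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// One scan of lineitem does the work of both jobs' map functions. Because the jobs
// share the partition key (l_orderkey), their outputs can ride the same shuffle;
// a tag in the value tells the merged reduce function which job each pair belongs to.
public class MergedMapSketch extends Mapper<Object, Text, IntWritable, Text> {
  private final IntWritable orderKey = new IntWritable();
  private final Text tagged = new Text();
  public void map(Object key, Text row, Context context)
      throws IOException, InterruptedException {
    String[] cols = row.toString().split("\\|");
    orderKey.set(Integer.parseInt(cols[0])); // l_orderkey, the shared partition key
    tagged.set("J1|" + cols[4]);             // what Job 1's map would have emitted
    context.write(orderKey, tagged);
    tagged.set("J2|" + cols[2]);             // what Job 2's map would have emitted
    context.write(orderKey, tagged);
  }
}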

  25. Rule 2: AGG-Pushdown
• Condition: an AGG job and its parent Job 1 have JFC
• Action: merge Job 1’s Reduce function and the AGG job’s Map and Reduce functions into a single Reduce function (sketch below)
[Diagram: the reduce function of the merged job does all the work of Job 1’s Reduce function, the AGG Map function and the AGG Reduce function]
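A hedged sketch of an AGG-pushdown reduce function, assuming Job 1 joins lineitem and orders on l_orderkey (with 'l'/'o' value tags as in the Q18Job1 code earlier) and the pushed-down aggregate is a SUM over one lineitem column; the column choices and class name are illustrative:

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// For each l_orderkey group, this reduce function does Job 1's join work and then the
// AGG job's map and reduce work in place, so no second shuffle on l_orderkey is needed.
public class AggPushdownSketch extends Reducer<IntWritable, Text, IntWritable, DoubleWritable> {
  public void reduce(IntWritable orderKey, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    double sum = 0.0;         // the pushed-down SUM aggregate
    boolean sawOrder = false; // whether the orders side matched (inner-join semantics)
    for (Text val : values) {
      String[] f = val.toString().split("\\|");
      if (f[f.length - 1].equals("l")) {
        sum += Double.parseDouble(f[0]); // lineitem side feeds the aggregate
      } else {
        sawOrder = true;                 // orders side completes the join
      }
    }
    if (sawOrder) {
      context.write(orderKey, new DoubleWritable(sum)); // joined and aggregated in one pass
    }
  }
}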

  26. Rule 3: JOIN-Merging-Full
• Condition 1: Job 1 and Job 2 have transit correlation (TC)
• Condition 2: the JOIN job has JFC with Job 1 and Job 2
• Action: merge Job 1’s Reduce function, Job 2’s Reduce function and the JOIN job into a single Reduce function
[Diagram: the merged job reads lineitem and orders once; its reduce function does all the work of Job 1’s Reduce function, Job 2’s Reduce function, the JOIN Map function and the JOIN Reduce function]

  27. Rule 4: JOIN-Merging-Half
• Condition: Job 2 and the JOIN job have JFC
• Action: merge the Reduce function of Job 2 and that of the JOIN job into a single Reduce function
• For data from Job 1: the merged reduce function does the work of the JOIN Reduce function
• For data from Job 2: it does all the work of Job 2’s Reduce function, the JOIN Map function and the JOIN Reduce function
• Must consider data dependency: Job 1 must be executed before the merged job

  28. The Common MapReduce Framework
• Aims at providing a framework for executing merged tasks with low overhead
• Minimize the size of intermediate data
• Minimize data access in the reduce phase
• Common Map function and common Reduce function
• Common Map function: tags a record with the Job-IDs it does not belong to
• Fewer intermediate results
• Common Reduce function: dispatches key/value pairs => post-job computations (sketch below)
• Refer to our paper for details
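A sketch of the dispatch step in a common reduce function. For readability it uses the positive "J1|"/"J2|" tags from the merged-map sketch above; the paper's actual design tags each record with the Job-IDs it does not belong to, which yields less intermediate data:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class CommonReduceSketch extends Reducer<IntWritable, Text, IntWritable, Text> {
  public void reduce(IntWritable key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    List<String> job1Rows = new ArrayList<String>();
    List<String> job2Rows = new ArrayList<String>();
    for (Text val : values) {              // dispatch each key/value pair to its job
      String v = val.toString();
      if (v.startsWith("J1|")) {
        job1Rows.add(v.substring(3));
      } else {
        job2Rows.add(v.substring(3));
      }
    }
    // Post-job computations would run each merged job's reduce logic on its own rows;
    // here we simply emit the per-job group sizes as a placeholder.
    context.write(key, new Text("job1:" + job1Rows.size() + " job2:" + job2Rows.size()));
  }
}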

  29. Outline • Background and Motivation • Our Solution: YSmart, a correlation-aware SQL-to-MapReduce translator • Performance Evaluation • YSmart in Hive • Conclusion

  30. Exp1: Four Cases of TPC-H Q21
• 1: Sentence-to-sentence translation — 5 MR jobs
• 2: Input correlation + transit correlation — 3 MR jobs
• 3: Input correlation + transit correlation + job flow correlation — 1 MR job
• 4: Hand-coding (similar to Case 3) — in the reduce function, we optimize the code according to the query semantics
[Diagram: the plan trees for the cases, built from Join1, Join2, the Left-outer-Join, AGG1 and AGG2 over lineitem and orders]

  31. Breakdowns of Execution Time (sec)
[Chart: execution-time breakdowns for the four cases — No Correlation; Input Correlation + Transit Correlation; Input Correlation + Transit Correlation + Job Flow Correlation; Hand-Coding]
• From 888 sec in total down to 510 sec
• From 768 sec in total down to 567 sec
• Only a 17% difference from hand-coding

  32. Exp2: Clickstream Analysis
A typical query in production clickstream analysis: “what is the average number of pages a user visits between a page in category ‘X’ and a page in category ‘Y’?”
• In YSmart, JOIN1, AGG1, AGG2, JOIN2 and AGG3 are executed in a single MR job
• YSmart outperforms Pig by 8.4x and Hive by 4.8x

  33. YSmart in the Hadoop Ecosystem
• See patch HIVE-2206 at apache.org
[Diagram: YSmart is integrated as Hive + YSmart, running on top of the Hadoop Distributed File System (HDFS)]

  34. Conclusion
• YSmart is a correlation-aware SQL-to-MapReduce translator
• YSmart can outperform Hive by 4.8x and Pig by 8.4x
• YSmart is being integrated into Hive
• The standalone version of YSmart will be released soon
Thank You!

  35. Backup Slides

  36. Performance Evaluation
• Exp 1: How can correlations affect query execution performance?
• A small local cluster; each worker node is a quad-core machine
• Workload: TPC-H Q21, 10GB dataset
• Exp 2: How much can YSmart outperform Hive and Pig?
• Amazon EC2 cluster
• Workload: click-stream, 20GB
• Compare execution time among YSmart, Hive and Pig

  37. Why is MapReduce an effective data processing engine for big data analytics?
• Two unique properties
• Minimum dependency among tasks (almost shared-nothing)
• Simple task operations in each node (low-cost machines are sufficient)
• Two strong merits for big data analytics
• Scalability (Amdahl’s Law): increase throughput by increasing the number of nodes
• Fault tolerance (quick and low-cost recovery from task failures)
• Hadoop is the most widely used implementation of MapReduce
• in hundreds of society-dependent corporations/organizations for big data analytics: AOL, Baidu, EBay, Facebook, IBM, NY Times, Yahoo! …

  38. An Example A typical question in click stream analysis: “what is the average number of pages a user visits between a page in category ‘Travel’ and a page in category ‘Shopping’?”

  39. An Example
A typical question in click stream analysis: “what is the average number of pages a user visits between a page in category ‘Travel’ and a page in category ‘Shopping’?”
3 steps:
1: Identify the time window for every user-visit path from a page in category ‘Travel’ to a page in category ‘Shopping’;
2: Count the number of pages in each time window (excluding the start and end pages);
3: Calculate the average of the counts generated by step 2.

  40. Step 1: Identify the time window for every user-visit path from a page in category ‘Travel’ to a page in category ‘Shopping’
Step 1-1: For each user, identify pairs of clicks from category ‘Travel’ to ‘Shopping’
[Plan fragment: Join1 on C1.user = C2.user over Clicks (C1) and Clicks (C2)]

  41. [Example: Join1 on C1.user = C2.user pairs each user’s click timestamps from Clicks (C1) and Clicks (C2), e.g. 2011-01-01 04:05:06, 04:05:07, 04:07:06 and 04:07:20]

  42. Step 1-2: Find the end TS for every time window; Step 1-3: Find the start TS for every time window
[Plan fragment: AGG2 (group by user, TS_End) and AGG1 (group by user, TS_Travel) on top of Join1 on C1.user = C2.user over Clicks (C1) and Clicks (C2)]

  43. Step 2: Count the number of pages in each time window (excluding the start and end pages)
Step 2-1: Identify the clicks in each time window
[Plan fragment: Join2 on user = C3.user joins Clicks (C3) with the outputs of AGG2 (group by user, TS_End) and AGG1 (group by user, TS_Travel), which sit on top of Join1 over Clicks (C1) and Clicks (C2)]

  44. Step 2-2: Count the number of pages in each time window (excluding the start and end pages)
[Plan fragment: AGG3 (group by user, TS_Start) on top of Join2 and the jobs from the previous steps]

  45. Step 3: Calculate the average of the counts generated by step 2
[Plan fragment: AGG4 on top of AGG3 computes the average number of pages a user visits between a page in category ‘Travel’ and a page in category ‘Shopping’]

  46. From 6 MR jobs to 2 MR jobs
[Diagram: translated job-by-job, the plan (Join1, AGG1, AGG2, Join2, AGG3, AGG4) needs 6 MR jobs; Join1, AGG1, AGG2, Join2 and AGG3 are correlated (they share the same Partition Key), and thus can be merged into 1 MR job, leaving 2 MR jobs in total — a ~3x reduction]

  47. Exp2: Evaluation Environment
• Amazon EC2 cluster
• 11 nodes
• 20GB click-stream dataset
• Compare execution time among YSmart, Hive and Pig
• Facebook cluster
• 747 nodes, 8 cores per node
• 1TB dataset of the TPC-H benchmark
• Compare execution time between YSmart and Hive
• Every query was executed three times for both YSmart and Hive

  48. Amazon EC2 11-node Cluster
Query: “what is the average number of pages a user visits between a page in category ‘X’ and a page in category ‘Y’?”
• In YSmart, JOIN1, AGG1, AGG2, JOIN2 and AGG3 are executed in a single MR job
• YSmart outperforms Pig by 8.4x and Hive by 4.8x

  49. Facebook Cluster
Query: TPC-H “Suppliers Who Kept Orders Waiting” (Q21)
[Chart: speedups of YSmart over Hive across the three runs — 3.3x, 3.7x and 3.1x]
