ParaTimer: A Progress Indicator for MapReduce DAGs

Kristi Morton, Magdalena Balazinska, and Dan Grossman
Computer Science and Engineering Department, University of Washington
Isha Khosla, Masters in Informatics
Advisor: Martin Theobald

Overview

Parallel Database Management Systems
• Designed to process massive-scale datasets.
• Parallelism speeds up query execution.
• Examples of programming models and frameworks that provide parallel processing of data sets:
• MapReduce
• Pig Latin

MapReduce
• MapReduce is a programming model for processing large data sets.
• Each MapReduce job contains seven phases of execution.
• The split phase reads the data from a file and splits it.
• The record reader phase iterates through the data set and generates key-value pairs.
• The map function processes these records with the appropriate operators.
[Figure: map task phases, reading from a file: Split, Record Reader, Map, Combine]

MapReduce
• The combine phase sorts and pre-aggregates the data and writes the records locally.
• The copy phase copies the relevant data from the nodes where the map tasks executed.
• The sort phase merges all the files and passes the data to the reduce phase.
• The reducer applies the appropriate operators and writes the data to disk.
[Figure: the seven phases in order. Map task: Split, Record Reader, Map, Combine. Reduce task: Copy, Sort, Reduce. Labels: File, Storage, Storage]

Pig Latin
• Extension of the MapReduce framework.
• Provides a declarative interface to MapReduce.
• Transform a SQL query… to a Pig query.
Suppose we have a table urls: (url, category, pagerank). The following is a simple SQL query that finds, for each sufficiently large category, the average pagerank of high-pagerank urls in that category.
    SELECT category, AVG(pagerank)
    FROM urls WHERE pagerank > 0.2
    GROUP BY category HAVING COUNT(*) > 10^6
The equivalent Pig Latin program:
    good_urls = FILTER urls BY pagerank > 0.2;
    groups = GROUP good_urls BY category;
    big_groups = FILTER groups BY COUNT(good_urls) > 10^6;
    output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);

Pig Latin Example

Summarizing Pig Latin Query
    visits = load '/data/visits' as (user, url, time);
    gVisits = group visits by url;
    visitCounts = foreach gVisits generate url, count(visits);
    urlInfo = load '/data/urlInfo' as (url, category, pRank);
    visitCounts = join visitCounts by url, urlInfo by url;
    gCategories = group visitCounts by category;
    topUrls = foreach gCategories generate top(visitCounts,10);
    store topUrls into '/data/topUrls';

Compilation to MapReduce

Overview
• Motivation

Parallel Environment
• To improve parallel DBMSs:
• Resource allocation
• Enable query debugging
• Tune the cluster configuration
What do we need?

Framing the situation
• Given massive volumes of data and queries,
• we need more than efficient query processing.
• What else does the user need?
• Accurate, time-based progress estimation.
• Intra-query fault tolerance.
• Query scheduling and resource management.
• All this without too much runtime overhead.

Challenges
• Accurate progress estimation in a parallel environment is a challenging task.
• Parallel environments add:
• Distribution
• Concurrency
• Failures
• Data skew

Parallax: Progress Indicator for Parallel Queries
• Accurate time-remaining estimates for parallel queries.
• Why is accurate progress important?
• Users need to plan their time.
• Users need to know when to stop queries.
• Parallel queries are translated into a sequence of MapReduce jobs.
• Assumptions: uniform data distribution and absence of node failures.

For accurate, time-based progress estimation, Parallax is proposed.

Parallax
• Breaks the query into pipelines, which are groups of interconnected operators.
• From the seven phases of MapReduce, Parallax considers three pipelines.
[Figure: the seven phases (Split, Record Reader, Map, Combine in the map task; Copy, Sort, Reduce in the reduce task) grouped into three numbered pipelines]

Time Estimation in Parallax
• Let N = total number of tuples that a pipeline must process.
• K = number of tuples processed so far.
• Work remaining: w = N - K.
• For each pipeline p, we are given $N_p$ and $K_p$, and $\alpha_p$ is the estimated processing cost per tuple.
• So the time remaining for the pipeline is $\alpha_p (N_p - K_p)$.
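As a worked illustration of the per-pipeline estimate, the sketch below multiplies the remaining work N - K by a per-tuple processing cost. The function name, the parameter `alpha` (standing for the estimated processing cost), and the numbers are assumptions made for this example, not values from the talk.

```python
def pipeline_time_remaining(n_total, k_done, alpha):
    """Estimate the time left for one pipeline.

    n_total: tuples the pipeline must process (N)
    k_done:  tuples processed so far (K)
    alpha:   assumed estimated processing cost per tuple (e.g. ms/tuple)
    """
    if not 0 <= k_done <= n_total:
        raise ValueError("expected 0 <= K <= N")
    work_remaining = n_total - k_done  # w = N - K
    return alpha * work_remaining


# Hypothetical numbers: 1,000,000 tuples, 250,000 done, 0.5 ms per tuple.
remaining_ms = pipeline_time_remaining(1_000_000, 250_000, 0.5)
print(remaining_ms)  # 375000.0
```

The estimate shrinks linearly as K grows, so its quality depends on how well the per-tuple cost was estimated beforehand.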

Time Estimation in Parallax
Given $J$, the set of all MapReduce jobs, and $P_j$, the set of all pipelines within job $j \in J$, the time remaining for a computation is given by a formula over the per-pipeline estimates, where the $N_{jp}$ and $K_{jp}$ values are aggregated across all partitions of the same pipeline and $Setup_{remaining}$ is the overhead for the unscheduled map and reduce tasks.
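One plausible way to combine these quantities is sketched below. It assumes that each pipeline has an estimated per-tuple processing cost $\alpha_{jp}$, that the aggregated work of each pipeline is spread evenly over a number of parallel partitions, written here as $\mathit{width}_{jp}$ (a symbol introduced only for this illustration), and that the setup overhead is simply added. The exact form used by Parallax may differ.

```latex
T_{remaining} \approx Setup_{remaining}
  + \sum_{j \in J} \sum_{p \in P_j}
    \frac{\alpha_{jp}\,\bigl(N_{jp} - K_{jp}\bigr)}{\mathit{width}_{jp}}
```

With a single job, a single pipeline, one partition, and no setup overhead, this reduces to the per-pipeline estimate of cost per tuple times remaining tuples.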

Pig Progress Indicator
• Considers only the record reader, copy, and reducer phases.
• Limited accuracy.
• Assumes that all operators (within and across jobs) perform the same amount of work.
• Ignores the high degree of parallelism.

None of the progress indicators discussed so far is accurate under a high degree of parallelism, and none accounts for node failures or non-uniform data distribution.

Problem Statement
We need a time-remaining indicator for a broader class of queries that handles real system challenges such as failures and data skew.

Overview
• Motivation
• Solution: ParaTimer

ParaTimer
• A progress indicator for parallel queries that take the form of directed acyclic graphs (DAGs) of MapReduce jobs.
• To handle complex-shaped query plans in the form of trees, ParaTimer identifies and tracks the critical path in the query plan.

Critical-Path Based Progress Estimation
• Step 1: Computing the task schedule.
• A FIFO scheduler is considered: jobs are launched one after the other in sequence.
• Consider a cluster with a capacity of 5 concurrent map tasks and 5 concurrent reduce tasks.
• Assume Job1 = 2 map tasks + 1 reduce task, Job2 = 6 map tasks + 1 reduce task, and Job3 = 1 map task + 1 reduce task.
[Figure: Pig Latin query plan with a join operator, made up of Job1, Job2, and Job3]

Critical-Path Based Progress Estimation
• Step 1: Computing the task schedule.
• Job1 = 2 map tasks + 1 reduce task.
• Job2 = 6 map tasks + 1 reduce task.
• Job3 = 1 map task + 1 reduce task.
• 5 concurrent map and 5 concurrent reduce slots in the cluster.
Given a DAG of MapReduce jobs, ParaTimer computes a schedule S, such as the one shown.
[Figure: schedule S with map tasks m11, m12 (Job1), m21 to m26 (Job2), m3 (Job3) and reduce tasks r1, r2, r3]
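A minimal sketch of how a FIFO schedule could be computed for the map tasks of Job1 and Job2 on 5 map slots. The function name and the task durations (8 and 10 time units) are hypothetical, and the sketch leaves out m3 and the reduce tasks, which can only start once earlier phases have finished. It is an illustration of FIFO slot assignment, not ParaTimer's scheduler model.

```python
import heapq


def fifo_schedule(tasks, num_slots):
    """Assign tasks to slots in FIFO order.

    tasks: list of (name, duration) in launch order.
    Returns {name: (slot, start, end)}.
    """
    # Each heap entry is (time the slot becomes free, slot id).
    slots = [(0.0, s) for s in range(num_slots)]
    heapq.heapify(slots)
    schedule = {}
    for name, duration in tasks:
        free_at, slot = heapq.heappop(slots)  # earliest available slot
        schedule[name] = (slot, free_at, free_at + duration)
        heapq.heappush(slots, (free_at + duration, slot))
    return schedule


# Hypothetical durations: Job1 maps take 8 units, Job2 maps take 10.
map_tasks = [("m11", 8), ("m12", 8)] + [(f"m2{i}", 10) for i in range(1, 7)]
schedule = fifo_schedule(map_tasks, num_slots=5)
for name, (slot, start, end) in schedule.items():
    print(f"{name}: slot {slot}, {start} -> {end}")
```

Under these assumed durations, m24 and m25 reuse the slots freed by m11 and m12, and m26 follows m21 on its slot.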

Critical-Path Based Progress Estimation
• Step 2: Breaking a schedule into path fragments.
• Typically, batches of tasks are scheduled at the same time.
Given a schedule S, a task round T is a set of tasks t in S that all begin within a time x1 of each other and end within a time x1 of each other.
[Figure: the schedule with its tasks grouped into Batch1, Batch2, and Batch3]
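The definition of a task round can be sketched as a simple grouping over start and end times. The function name, the tolerance `delta`, and the task times below are hypothetical. As a simplification, each task is compared only against the first task of a round rather than against every member.

```python
def task_rounds(schedule, delta):
    """Group tasks into rounds.

    schedule: {name: (start, end)}
    delta:    assumed tolerance on both start and end times
    Returns a list of rounds, each a list of task names.
    """
    rounds = []
    ordered = sorted(schedule.items(), key=lambda kv: (kv[1][0], kv[1][1], kv[0]))
    for name, (start, end) in ordered:
        for rnd in rounds:
            ref_start, ref_end = schedule[rnd[0]]  # compare with round's first task
            if abs(start - ref_start) <= delta and abs(end - ref_end) <= delta:
                rnd.append(name)
                break
        else:
            rounds.append([name])  # no matching round: start a new one
    return rounds


# Hypothetical (start, end) times for the map tasks of Job1 and Job2.
times = {
    "m11": (0, 8), "m12": (0, 8),
    "m21": (0, 10), "m22": (0, 10), "m23": (0, 10),
    "m24": (8, 18), "m25": (8, 18),
    "m26": (10, 20),
}
rounds = task_rounds(times, delta=1)
print(rounds)
```

With these assumed times the grouping yields four rounds: {m11, m12}, {m21, m22, m23}, {m24, m25}, and {m26}.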

Critical-Path Based Progress Estimation
• Step 2: Breaking a schedule into path fragments.
A path fragment is a set of tasks, all of the same type (i.e., either maps or reduces), that execute in consecutive rounds. In a path fragment, all rounds have the same width (i.e., the same number of parallel tasks) except the last round, which may or may not be full.
• Six path fragments are scheduled:
• P1 = {m11, m12, m24, m25}
• P2 = {m21, m22, m23, m26}
• P3 = {r1}
• P4 = {r2}
• P5 = {m3}
• P6 = {r3}

How do these path fragments represent parallel query execution?
• Case 1: The query comprises only a sequence of MapReduce jobs.
[Figure: path fragments P1, P2, P3, P4 corresponding to Map1, Reduce1, Map2, Reduce2]
The critical path is the sequence of all these path fragments. In this case ParaTimer is equivalent to Parallax.

How do these path fragments represent parallel query execution?
• Case 2: The query comprises parallel MapReduce jobs.
• Identify the critical path.

Critical-Path Based Progress Estimation
• Step 3: Identifying the critical path fragments.
Given a schedule and an assignment of tasks to path fragments, it is easy to derive a schedule in terms of path fragments, where each path fragment has a start time and a duration.
Case 1: If two overlapping path fragments start at the same time, keep only the one expected to take longer. In the example, P1 and P2 execute in parallel. Hence, the shorter fragment P1 can be ignored.
• P1 = {m11, m12, m24, m25}
• P2 = {m21, m22, m23, m26}

Critical-Path Based Progress Estimation
• Step 3: Identifying the critical path fragments.
Case 2: If two overlapping path fragments start at different times, keep the one that starts earlier. Remove the other one, but add back its extra time. In our example, P2 and P3 overlap. Because the overlap is total, P3's time can be ignored. However, if r1 stretched past the end of m26, the extra time would be taken into account on the critical path.
• P2 = {m21, m22, m23, m26}
• P3 = {r1}
Critical path: P2 = {m21, m22, m23, m26}.
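The two overlap rules can be sketched as a single sweep over fragments ordered by start time. The function name and the start times and durations below are hypothetical, chosen so that P1 runs in parallel with P2 and P3 is fully overlapped by P2. This is an illustration of the rules as stated, not ParaTimer's implementation.

```python
def critical_path(fragments):
    """Pick the path fragments that contribute time to the critical path.

    fragments: {name: (start, duration)}
    Returns a list of (name, contributed_time).
    """
    # Earlier start first; for equal starts, the longer fragment first (Case 1).
    ordered = sorted(fragments.items(), key=lambda kv: (kv[1][0], -kv[1][1]))
    path = []
    covered_until = None
    for name, (start, duration) in ordered:
        end = start + duration
        if covered_until is None or start >= covered_until:
            path.append((name, duration))  # no overlap: counts in full
            covered_until = end
        elif end > covered_until:
            # Case 2: overlapped fragment, only its extra time is added back.
            path.append((name, end - covered_until))
            covered_until = end
        # Otherwise the fragment is totally overlapped and is ignored.
    return path


# Hypothetical (start, duration) values for four fragments.
example = {"P1": (0, 18), "P2": (0, 20), "P3": (18, 2), "P4": (20, 5)}
print(critical_path(example))  # [('P2', 20), ('P4', 5)]
```

Summing the contributed times gives the length of the critical path, here 25 time units under the assumed values.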

Critical-Path Based Progress Estimation
• Step 4: Estimating the time remaining at run time.
ParaTimer could monitor only one thread of tasks within a path fragment (or some subset of these threads), where a thread is a sequence of tasks from the beginning to the end of a path fragment.

Overview
• Motivation
• Solution: ParaTimer
• Key Contributions

Contributions of ParaTimer
• Handling failures
• Failures affect progress estimation. There is no way to accurately predict the running time of a query if failures occur.
• How can the remaining time for queries be estimated?
• Proposed approach: comprehensive progress estimation. Users should be shown multiple guesses about the remaining query time.

Comprehensive Progress Estimation
• StdEstimator + Pessimistic Failure Estimator (PFE)
• The Pessimistic Failure Estimator assumes that:
• The longest remaining task is the one that fails.
• The task fails right before finishing, as this adds the greatest delay.
• The task was scheduled in the last round of tasks for the given job and phase.
[Figure: the schedule for which StdEstimator estimates the time remaining, and the schedule for which PFE estimates the time remaining]
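A rough sketch of how the two estimates might be reported together. It assumes that the failed task is rerun from scratch, so that a failure right before completion adds about one full task duration on top of the standard estimate. The function name and the numbers are hypothetical, and the sketch does not cover the case where the failed task lies off the critical path.

```python
def pessimistic_failure_estimate(std_remaining, remaining_task_durations):
    """Worst-case time remaining under one assumed failure.

    std_remaining:            standard (no-failure) time-remaining estimate
    remaining_task_durations: expected durations of the tasks still to finish
    Assumption: the longest remaining task fails just before finishing and
    is rerun in full, adding its whole duration to the estimate.
    """
    if not remaining_task_durations:
        return std_remaining  # nothing left that could fail
    return std_remaining + max(remaining_task_durations)


# Hypothetical numbers, in seconds.
std = 120.0
pfe = pessimistic_failure_estimate(std, [30.0, 45.0, 40.0])
print(f"time remaining: {std} s (no failure), {pfe} s (worst-case failure)")
```

Showing both values gives the user a range rather than a single guess.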

Comprehensive Progress Estimation
• Pessimistic Failure Estimator
[Figure captions: pipelines are scheduled or blocked; a failure adds a path fragment but does not change latency]

Handling Data Skew
• Data skew is the uneven distribution of data across partitions.
• In MapReduce, skew due to uneven distribution can occur only in reduce tasks.
• Example:
• Path fragments are no longer wide.
• Each slot in the cluster becomes its own path fragment.

Overview
• Motivation
• Solution: ParaTimer
• Key Contributions
• Evaluation

Experimental Setup
• Configuration
• 8-node cluster configured with Hadoop-17 and Pig Latin.
• Each node: 2.00 GHz dual quad-core Intel Xeon CPU, 16 GB of RAM.
• Parallelism: 16 concurrent map and reduce tasks.
• Assumptions, for each pipeline:
• Input cardinality estimate = N
• Processing rate estimate = α
• Both Parallax and ParaTimer are evaluated in two forms:
• Perfect: uses values from a prior run over the entire data set.
• 1%: uses values collected from a single prior run over a 1% sampled subset.

Experimental Setup
• Experiment script
• ParaTimer handles a Pig Latin script that contains a join operator and yields a query plan with concurrent MapReduce jobs.
• Job1: 1 GB of data, 4 parallel map tasks and 16 reduce tasks.
• Job2: 4.2 GB of data, 17 map tasks and 16 reduce tasks.

[Figure: percent-time complete estimates for the parallel query with a join, 4.2 GB and 1 GB data sets]
Instantaneous error is computed from the following quantities:
• f_i: the percent-time-done estimate.
• t_i: the current time.
• t_n: the time when the query completes.
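One plausible reading of the instantaneous error is the absolute difference between the estimated percent done and the actual percent of elapsed time. The sketch below assumes a query start time `t_0` (defaulting to 0), which is not named in the text above, and the function name is hypothetical.

```python
def instantaneous_error(f_i, t_i, t_n, t_0=0.0):
    """Absolute error of a percent-time-done estimate, in percentage points.

    f_i: estimated percent of time done at time t_i
    t_i: current time
    t_n: time when the query completes
    t_0: assumed query start time (not defined in the text above)
    """
    if t_n <= t_0:
        raise ValueError("query must finish after it starts")
    actual_percent_done = 100.0 * (t_i - t_0) / (t_n - t_0)
    return abs(f_i - actual_percent_done)


# Hypothetical numbers: the indicator shows 40% done at 30 s of a 60 s query.
print(instantaneous_error(40.0, 30.0, 60.0))  # 10.0
```

An error of 0 means the indicator showed exactly the fraction of the run time that had actually elapsed.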

Overview • Motivation • Solution: ParaTimer • Key Contributions • Evaluation

Conclusion
• ParaTimer is a system for estimating the time remaining for parallel queries consisting of multiple MapReduce jobs running on a cluster.
• Key idea: identify the critical path for the entire query and produce multiple estimates.

Related Work
• Parallel DBMSs provide coarse-grained indicators for running parallel queries.
• DB2. SQL/monitoring facility. http://www.sprdb2.com/SQLMFVSE.PDF, 2000.
• DB2. DB2 Basics: The whys and how-tos of DB2 UDB monitoring. http://www.ibm.com/developerworks/db2/library/techarticle/dm-0408hubel/index.html, 2004.
• Parallax (already discussed).
• Pig progress indicator.

Thanks
