Optimizing Iterative MapReduce Jobs

Map-Reduce for Iterative Computation Dhruv Kumar

Outline • Introduction • HaLoop • Architecture and API • Loop Aware Task Scheduling • Caching and Indexing • Experimental Results • Conclusion

Map Reduce Heaven • MR framework is great for certain problems: word count, equi-join queries, generating an inverted index. • Equi-join: • SELECT E.EmpName, E.EmpNum, E.Salary, D.DepNum, D.DepCity from E, D WHERE E.EmpNum = D.DepNum • Inverted Index: • S0 = “it is what it is” • S1 = “it is a fruit” • “it” : {0, 1} • “is” : {0, 1} • “a” : {1} • “what” : {0} • “fruit”: {1}
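The inverted index on this slide can be sketched as a map step and a reduce step. This is a minimal in-memory illustration, not Hadoop code; the function names `map_doc`, `reduce_word`, and `inverted_index` are hypothetical, and the grouping dictionary merely stands in for the shuffle phase.

```python
from collections import defaultdict

def map_doc(doc_id, text):
    """Map: emit one (word, doc_id) pair per word occurrence."""
    return [(word, doc_id) for word in text.split()]

def reduce_word(word, doc_ids):
    """Reduce: collapse the doc ids seen for a word into a sorted, de-duplicated list."""
    return word, sorted(set(doc_ids))

def inverted_index(docs):
    groups = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for word, d in map_doc(doc_id, text):
            groups[word].append(d)  # stands in for the shuffle phase
    return dict(reduce_word(w, ids) for w, ids in groups.items())

# S0 and S1 from the slide
index = inverted_index(["it is what it is", "it is a fruit"])
```

Each map call touches only its own sentence, which is what makes the problem easy to parallelize.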

Map Reduce Panacea • Word count, equi-join and inverted index belong to a class of “embarrassingly parallel” problems. • Programmers can easily define a Map() and a Reduce() function when working with an MR implementation (Hadoop). • What about other operations in the age of petascale data? • What are they? Can I use MR?

For a large data set… • Analyze hypertext links and compute page rank: matrix-vector computation. • Group similar items together: k-means clustering. • Analyze a social network and discover friends: descendant query. • What is the common underlying feature of all these algorithms?

Iterative Problems • These data mining algorithms are iterative. • In iterative algorithms we repeatedly process some data until either: • The computation converges: |R(t) - R(t-1)| < Delta • We reach a stopping condition: n(R) > Num • Can we use MR on them?

Recall that MR… • Is great for “embarrassingly parallel” algorithms, which are: • Divisible: The problem can be broken down into simpler “map” tasks. • Independent: The individual map tasks do not need to communicate with each other. • Simple to agglomerate: The collection step is a simple “reduce” task which can combine all the map results. • E.g.: Word Count (everyone’s favorite).

Retrofit Iteration on MR • MR does not support iteration out of the box. • But we still want Page Rank, clustering, etc., which are iterative. • One “easy” solution: split the iteration into multiple Map() and Reduce() jobs and write a driver program for orchestration (Apache Mahout). • Wasteful because: • Reloading constant data from GFS/HDFS at each iteration consumes CPU, I/O and network bandwidth. • Fixpoint detection requires an extra MR job on each iteration.
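The driver-program approach can be sketched as a plain loop. This is a toy in-memory illustration, not Mahout or Hadoop code; `drive` and all of its parameters are hypothetical names. Note the two costs called out above: the invariant data is loaded on every pass, and the convergence check is a separate step on every pass.

```python
def drive(load_invariant, initial, step, distance, delta, max_iters):
    """Hypothetical client-side driver: one loop pass = one round of MR jobs."""
    result, loads = initial, 0
    for iteration in range(1, max_iters + 1):
        invariant = load_invariant()          # constant data is re-read every pass
        loads += 1
        new_result = step(result, invariant)  # stands in for the Map()/Reduce() jobs
        diff = distance(new_result, result)   # stands in for the extra fixpoint job
        result = new_result
        if diff < delta:                      # convergence test from the earlier slide
            break
    return result, iteration, loads

# Toy computation: each value moves halfway toward a constant target.
targets = {"a": 8.0}
result, iterations, loads = drive(
    load_invariant=lambda: dict(targets),
    initial={"a": 0.0},
    step=lambda r, inv: {k: (r[k] + inv[k]) / 2 for k in r},
    distance=lambda new, old: max(abs(new[k] - old[k]) for k in new),
    delta=1.0,
    max_iters=20,
)
```

In this toy run the number of loads equals the number of iterations, which is the waste the slide refers to.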

Meet HaLoop • HaLoop: Efficient Iterative Data Processing on Large Clusters by Yingyi Bu, Bill Howe, Magdalena Balazinska, Michael D. Ernst. In VLDB'10: The 36th International Conference on Very Large Data Bases, Singapore, 24-30 September, 2010. • http://code.google.com/p/haloop/ • Open source; currently only a prototype.

HaLoop’s Elevator Pitch • New programming model and architecture for iterative computations: relieves the programmer of loop control and provides new APIs to express iteration. • Loop-aware task scheduling: reuses data across iterations by exploiting physical proximity. • Caching for loop-invariant data: detects invariants in the first iteration and caches them for subsequent iterations to save resources. • Caching to support fixpoint evaluation: avoids the need for a dedicated MR step on each iteration. • Outperforms Hadoop: improves query runtime by a factor of 1.85 and shuffles only 4% of the data between mappers and reducers when tested on “real world” and synthetic data.

MR Architecture

Iteration on MR [Figure: steps labeled 1 to 6]

Iteration in HaLoop [Figure: steps labeled 1 to 6]

Loop Control: HaLoop vs Hadoop [Figure: two stacks, each with an Application layer and a Framework layer] Notice the difference in loop control

API and Loop Control

Architecture [Figure callouts: new, additional API; leverage data locality; caching for fast access; starts new MR jobs repeatedly]

Architecture [Figure callouts: new, additional API; starts new MR jobs repeatedly]

Programming API

So far… • We have a layer of software over MapReduce which handles the “driving.” • Automatically dispatch jobs • Check for iteration stop • The benefit is primarily convenience. • What about efficiency?

Architecture [Figure callout: leverage data locality]

General Form of Iteration [Figure labels: next result; join; initial data; previous result; invariant relation (constant data)]
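The labels on this slide can be read as a recurrence: each iteration joins the previous result with the invariant relation and merges in the initial data. One way to write it, assuming R_0 denotes the initial data, R_i the previous result, R_{i+1} the next result, and L the invariant relation (the symbols are introduced here for illustration):

```latex
R_{i+1} = R_0 \cup \left( R_i \bowtie L \right)
```

Only R_i changes from one iteration to the next; L stays constant, which is what the later caching slides exploit.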

Page Rank

Descendant Query

Page Rank Inspection • The linkage table L is invariant across iterations. • Conventional MR iteration methods are unaware of this. Why? • Conventional MR: process and shuffle the Linkage table at each iteration. • Can we do better?
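To make the invariance concrete, here is a toy, single-machine Page Rank step. It is a sketch only: the function name `pagerank_step`, the three-page graph, and the damping factor 0.85 are assumptions for illustration, not values from the slides. The linkage table `links` is read in every iteration but never modified; only `ranks` changes.

```python
def pagerank_step(ranks, links, damping=0.85):
    """One iteration: join the current ranks with the invariant linkage table, then aggregate."""
    n = len(ranks)
    contrib = {page: 0.0 for page in ranks}
    for src, dests in links.items():       # join ranks with links on the source page
        share = ranks[src] / len(dests)     # rank is split evenly over outgoing links
        for dest in dests:
            contrib[dest] += share          # aggregate contributions per destination
    return {p: (1 - damping) / n + damping * c for p, c in contrib.items()}

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}  # invariant across iterations
ranks = {page: 1 / 3 for page in links}            # changes every iteration
for _ in range(10):
    ranks = pagerank_step(ranks, links)
```

A conventional MR driver would reload and reshuffle `links` on each of the ten passes even though it is identical every time.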

Exploit Inter-iteration Locality [Figure: Iteration 1, Iteration 2] • Place on the same physical machines those map and reduce tasks that occur in different iterations but access the same data. • What is the constraint?

HaLoop’s Task Scheduler • Use: AddInvariantTable(HDFS file); • Tracks the data partitions accessed by the map and reduce workers on each node. • When a node is free, assigns it a new task which uses the invariant data that node processed in the previous iteration. • Searches nearby nodes in case of failure.
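The scheduling idea can be sketched as follows. This is a simplified illustration, not HaLoop's scheduler: the function `schedule` and its parameters are hypothetical, and the fallback is plain round-robin over live nodes rather than a real proximity search.

```python
def schedule(partitions, nodes, previous=None, failed=frozenset()):
    """Assign each data partition to a node, preferring the node that held it last iteration."""
    live = [n for n in nodes if n not in failed]
    previous = previous or {}
    assignment = {}
    for i, part in enumerate(partitions):
        prev_node = previous.get(part)
        if prev_node in live:
            assignment[part] = prev_node            # reuse the locally cached data
        else:
            assignment[part] = live[i % len(live)]  # first iteration, or the node failed
    return assignment

parts, nodes = ["p0", "p1", "p2"], ["n0", "n1", "n2"]
it1 = schedule(parts, nodes)                                # initial placement
it2 = schedule(parts, nodes, previous=it1)                  # same placement is reused
it3 = schedule(parts, nodes, previous=it2, failed={"n1"})   # p1 moves off the failed node
```

The partition that moves in the third call has to rebuild its cache on the new node, which is why failures cost more than a simple reassignment.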

Architecture [Figure callout: caching for fast access]

Caching • The task scheduler places map and reduce tasks which occur in different iterations but access the same data on the same machine. • Success depends on the ability of the master to schedule tasks “intelligently.” • Hence, a particular loop-invariant data partition is usually needed by only one physical node. • It is needed by some other node only in case of failure. • The caching and indexing components build on the task scheduler to: • Cache data on local disk instead of HDFS • Index it for fast object retrieval • Save network bandwidth and I/O time.

Cache and Index [Figure: steps labeled 1 to 6] Where is the network I/O going on?

Network I/O [Figure: steps labeled 1 to 6; components labeled HDFS, Mapper, Reducer, Master]

Cache and Index: Mapper Input Cache [Figure: steps labeled 1 to 6] In Hadoop, the data-local mapping rate is between 70% and 95%.

Cache and Index: Mapper Input Cache [Figure: steps labeled 1 to 6] • Avoids non-local data reads after the first iteration. • Specify SetMapperInputCache(true); • Subsequent iterations use the loop-aware task scheduler to read data from local disks. • Useful for algorithms where the mapper input does not change across iterations: model-fitting algorithms, e.g. k-means.

Cache and Index: Reducer Input Cache [Figure: steps labeled 1 to 6] • Specify AddInvariantTable(); and SetReducerInputCache(true); • In the cache, keys and values are stored in separate files, with pointers from keys to values. The sorted keys are indexed for fast, seek-forward-only access. • Useful for algorithms with repeated joins against large invariant data: Page Rank.
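The cache layout described above can be imitated in memory. This is a sketch under stated assumptions, not HaLoop's on-disk format: the class `ReducerInputCache` is hypothetical, Python lists stand in for the separate key and value files, and list offsets stand in for the key-to-value pointers. Lookups must arrive in ascending key order, mirroring the forward-only seek.

```python
class ReducerInputCache:
    """Toy cache: sorted keys, a separate value store, and offsets linking the two."""

    def __init__(self, pairs):
        self.keys, self.offsets, self.values = [], [], []
        for key, value in sorted(pairs, key=lambda kv: kv[0]):
            if not self.keys or self.keys[-1] != key:
                self.keys.append(key)                  # "key file": one entry per distinct key
                self.offsets.append(len(self.values))  # pointer into the "value file"
            self.values.append(value)
        self.offsets.append(len(self.values))          # end marker for the last key
        self.pos = 0                                   # cursor only ever moves forward

    def lookup(self, key):
        """Return cached values for key; callers must ask for keys in ascending order."""
        while self.pos < len(self.keys) and self.keys[self.pos] < key:
            self.pos += 1
        if self.pos < len(self.keys) and self.keys[self.pos] == key:
            start, end = self.offsets[self.pos], self.offsets[self.pos + 1]
            return self.values[start:end]
        return []

cache = ReducerInputCache([("b", "c"), ("a", "b"), ("a", "c"), ("c", "a")])
out_a = cache.lookup("a")
out_b = cache.lookup("b")
out_missing = cache.lookup("bb")
out_c = cache.lookup("c")
```

Because reducer input already arrives sorted by key, a forward-only cursor is enough to join it against the cached invariant data in a single pass.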

Reducer Output Cache [Figure: steps labeled 1 to 6] • Specify SetReducerOutputCache(true); • Used for evaluating the termination condition in parallel without an extra MR job. • Each reducer compares its current output with the previous iteration’s cached output and reports to the master. • Useful for Page Rank, descendant query, etc.
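The distributed termination check can be sketched like this. The names `local_distance` and `master_converged` are hypothetical, and the Manhattan distance is an assumption chosen for illustration; the slides do not specify a distance measure.

```python
def local_distance(current, cached):
    """Reducer side: compare this iteration's output with the cached previous output."""
    keys = current.keys() | cached.keys()
    return sum(abs(current.get(k, 0.0) - cached.get(k, 0.0)) for k in keys)

def master_converged(reports, delta):
    """Master side: combine the per-reducer distances and test against the threshold."""
    return sum(reports) < delta

# Two reducers, each holding its own partition of the previous and current output.
cached = [{"a": 0.5, "b": 0.25}, {"c": 0.25}]
current = [{"a": 0.5, "b": 0.375}, {"c": 0.125}]
reports = [local_distance(cur, prev) for cur, prev in zip(current, cached)]
done = master_converged(reports, delta=0.5)
```

Each reducer sends the master a single number, so the check rides along with the normal reduce phase instead of requiring its own job.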

Caching and Indexing • Mapper Input Cache • Reducer Input Cache • Reducer Output Cache • Why is there no Mapper Output Cache?

Testing Methods • Amazon’s EC2 • 50 and 90 slave nodes • One master • Semi-synthetic and “real-world” datasets: • LiveJournal (Social Network): 18GB • Triples (Semantic Web RDF): 120GB • Freebase (Concept Linkage Graph): 12GB • Evaluate independently the effect of cache management: • Reducer Input • Reducer Output • Mapper Input

Page Rank • Run for only 10 iterations. • Join and aggregate in every iteration. • Overhead in the first step for caching the input. • Catches up soon and outperforms Hadoop. • Low shuffling time (the time between RPC invocation by the reducer and sorting of keys).

Descendant Query • Join and duplicate elimination in every iteration. • Less striking performance on LiveJournal: as a social network it has high fan-out and generates excessive duplicates, so duplicate elimination dominates the join cost and reducer input caching is less useful.

Testing… • Evaluate independently the effect of cache management: • Reducer Input Cache • Reducer Output Cache • Mapper Input Cache

Reducer Output Cache • Recall that the reducer output cache saves an extra MR job for fixpoint evaluation.

Testing… • Evaluate independently the effect of cache management: • Reducer Input Cache • Reducer Output Cache • Mapper Input Cache

Mapper Input Cache • Cannot use Page Rank or Desc. Query. (Why?) • Use a model fitting algorithm: k-means.
