Optimizing Iterative MapReduce Jobs
This presentation is a summary of optimization techniques for Hadoop discovered and published by Yingyi Bu et al. in the paper "HaLoop: Efficient Iterative Data Processing on Large Clusters." I gave a talk on this paper in a graduate school course about Big Data and Online Social Networks. Please feel free to share.

Outline
  • Introduction
  • HaLoop
    • Architecture and API
    • Loop-Aware Task Scheduling
    • Caching and Indexing
  • Experimental Results
  • Conclusion
Map Reduce Heaven
  • The MR framework is great for certain problems: word count, equi-join queries, and generating an inverted index.
  • Equi-join:
    • SELECT E.EmpName, E.EmpNum, E.Salary, D.DepNum, D.DepCity FROM E, D WHERE E.DepNum = D.DepNum
  • Inverted Index (a map/reduce sketch follows below):
    • S0 = “it is what it is”
    • S1 = “it is a fruit”
    • “it” : {0, 1}
    • “is” : {0, 1}
    • “a” : {1}
    • “what” : {0}
    • “fruit”: {1}
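For concreteness, here is a minimal sketch of the inverted-index example in Hadoop's MapReduce API. This is my illustration, not from the slides; it assumes input lines of the form "docId<TAB>document text".

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class InvertedIndex {
  // Map: emit (word, docId) for every word in the document.
  public static class IndexMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split("\t", 2);  // parts[0] = docId
      for (String word : parts[1].toLowerCase().split("\\s+")) {
        ctx.write(new Text(word), new Text(parts[0]));
      }
    }
  }

  // Reduce: collect the distinct doc ids per word, e.g. "it" -> {0, 1}.
  public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text word, Iterable<Text> docIds, Context ctx)
        throws IOException, InterruptedException {
      Set<String> ids = new HashSet<>();
      for (Text id : docIds) ids.add(id.toString());
      ctx.write(word, new Text(ids.toString()));
    }
  }
}
```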
Map Reduce Panacea
  • Word count, equi-join, and inverted index belong to a class of “embarrassingly parallel” problems.
  • Programmers can effortlessly define a Map() and a Reduce() function and let an MR implementation (Hadoop) do the rest.
  • What about other operations in the age of petascale data?
  • What are they? Can I use MR?
For a large data set…
  • Analyze hypertext links and compute page rank: matrix-vector computation (the standard form is sketched below).
  • Group similar items together: k-means clustering.
  • Analyze a social network and discover friends: descendant query.
  • What is the common underlying feature of all these algorithms?
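For reference, the matrix-vector form of the page rank iteration; this is the standard textbook formulation, not taken from the slides.

```latex
% r_t: rank vector at iteration t; M: column-stochastic link matrix;
% d: damping factor (typically 0.85); N: number of pages.
\[
  r_{t+1} \;=\; \frac{1-d}{N}\,\mathbf{1} \;+\; d\,M\,r_t
\]
```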
Iterative Problems
  • These data mining algorithms are iterative.
  • In iterative algorithms we repeatedly process some data until
    • The computation converges: R(t) – R(t-1) < Delta
    • We reach a stopping condition: n(R) > Num
  • Can we use MR on them?
Recall that MR…
  • Is great for “embarrassingly parallel” algorithms in which:
    • Divisible: The problem can be broken down into simpler “map” tasks.
    • Independent: The individual map tasks do not need to communicate with each other.
    • Simple Agglomeration: Collecting the results is a simple “reduce” step that combines all the map outputs.
  • E.g., word count (everyone’s favorite).
Retrofit Iteration on MR
  • MR does not support iteration out of the box.
  • But we still want Page Rank, clustering, etc., which are iterative.
  • One “easy” solution: split the iteration into multiple Map() and Reduce() jobs and write a driver program for orchestration, as Apache Mahout does (see the sketch below).
  • Wasteful because:
    • Constant data is reloaded from GFS/HDFS at each iteration, consuming CPU, I/O, and network bandwidth.
    • Fixpoint detection requires an extra MR job on each iteration.
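A minimal sketch of that driver-program approach, assuming plain Hadoop; StepMapper and StepReducer are hypothetical placeholders for the per-iteration logic.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IterativeDriver {
  static final int MAX_ITERATIONS = 10;

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path current = new Path(args[0]);
    for (int i = 0; i < MAX_ITERATIONS; i++) {
      Path next = new Path(args[1] + "/iter-" + i);
      Job step = Job.getInstance(conf, "iteration-" + i);
      step.setJarByClass(IterativeDriver.class);
      step.setMapperClass(StepMapper.class);    // hypothetical per-iteration mapper
      step.setReducerClass(StepReducer.class);  // hypothetical per-iteration reducer
      step.setOutputKeyClass(Text.class);
      step.setOutputValueClass(Text.class);
      FileInputFormat.addInputPath(step, current);
      // The loop-invariant data must also be re-read from HDFS here,
      // every single iteration; exactly the waste HaLoop targets.
      FileOutputFormat.setOutputPath(step, next);
      if (!step.waitForCompletion(true)) System.exit(1);
      // Fixpoint detection would need yet another MR job here, diffing
      // `current` against `next` (not shown).
      current = next;
    }
  }
}
```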
Meet HaLoop
  • HaLoop: Efficient Iterative Data Processing on Large Clusters by Yingyi Bu, Bill Howe, Magdalena Balazinska, and Michael D. Ernst. In VLDB ’10: The 36th International Conference on Very Large Data Bases, Singapore, 24-30 September 2010.
  • http://code.google.com/p/haloop/
  • Currently only a prototype; open source.
HaLoop’s Elevator Pitch
  • New programming model and architecture for iterative computations: absolves the programmer of loop control and provides new APIs to express iteration.
  • Loop-aware task scheduling: reuses data across iterations by exploiting physical proximity.
  • Caching for loop-invariant data: detects invariants in the first iteration and caches them for subsequent iterations to save resources.
  • Caching to support fixpoint evaluation: avoids the need for a dedicated MR step on each iteration.
  • Outperforms Hadoop: improves query runtime by a factor of 1.85 and shuffles only 4% of the data between mappers and reducers when tested on “real world” and synthetic data.
Outline
  • Introduction
  • HaLoop
    • Architecture and API
    • Loop-Aware Task Scheduling
    • Caching and Indexing
  • Experimental Results
  • Conclusion
Loop Control: HaLoop vs. Hadoop

[Diagram: side-by-side application/framework stacks. In Hadoop, loop control lives in the application; in HaLoop, it is pushed down into the framework. Notice the difference in loop control.]
Architecture

[Architecture diagram, shown in build steps: HaLoop adds a new, additional API, a loop-aware task scheduler that leverages data locality, and caching for fast access; the loop control repeatedly starts new MR jobs.]
Outline
  • Introduction
  • HaLoop
    • Architecture and API
    • Loop-Aware Task Scheduling
    • Caching and Indexing
  • Experimental Results
  • Conclusion
So far…
  • We have a layer of software over MapReduce which handles the “driving”:
    • Automatically dispatches jobs.
    • Checks for iteration stop.
  • The benefit is primarily convenience.
  • What about efficiency?
Architecture

[The architecture diagram again, now highlighting the loop-aware task scheduler, which leverages data locality.]
General Form of Iteration

  • Next Result = Initial Data ∪ (Previous Result ⋈ Invariant Relation)
  • In symbols: R(i+1) = R0 ∪ (R(i) ⋈ L), where R0 is the initial data, R(i) the previous result, and L the invariant (constant) relation (a worked instance follows below).

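As a concrete instance (my own illustration, consistent with the deck's descendant-query example): take R0 = the start person(s) and L = a friend(person1, person2) table; each iteration's join discovers the next hop of friends.

```latex
% Descendant query as an instance of R(i+1) = R0 ∪ (R(i) ⋈ L):
\[
  R_{i+1} \;=\; R_0 \,\cup\, \pi_{\mathrm{person2}}\bigl(R_i \bowtie_{R_i.\mathrm{person} \,=\, L.\mathrm{person1}} L\bigr)
\]
```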
Page Rank Inspection
  • The linkage table L is invariant across iterations.
  • Conventional MR iteration methods are unaware of this: they process and shuffle the linkage table at each iteration.
  • Can we do better?
Exploit Inter-Iteration Locality

[Diagram: the map and reduce tasks of iteration 1 and iteration 2, placed on the same machines.]

Place on the same physical machines those map and reduce tasks that occur in different iterations but access the same data.

What is the constraint?
HaLoop’s Task Scheduler
  • Use: AddInvariantTable(HDFS file);
  • Track the data partitions accessed by the map and reduce workers on each node.
  • When a node is free, assign it a new task that uses the invariant data cached there in the previous iteration (see the sketch below).
  • In case of failure, search nearby nodes.
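An illustrative sketch of that scheduling rule; this is my own pseudocode-in-Java, not HaLoop's actual scheduler.

```java
import java.util.*;

class LoopAwareScheduler {
  // node -> partitions of the invariant table it processed (and cached) earlier
  private final Map<String, Set<Integer>> cachedPartitions = new HashMap<>();

  // Called when `freeNode` asks for work: prefer a task whose invariant
  // partition that node already cached in the previous iteration.
  int assign(String freeNode, List<Integer> pendingPartitions) {
    Set<Integer> local = cachedPartitions.getOrDefault(freeNode, Collections.emptySet());
    for (int p : pendingPartitions) {
      if (local.contains(p)) return p;  // inter-iteration locality hit
    }
    // No local match (e.g. the caching node failed): fall back to any pending
    // task; a real scheduler would prefer a node near the cached data.
    return pendingPartitions.get(0);
  }

  // Record which partition a node just processed, so the next iteration
  // can be steered back to it.
  void recordCompletion(String node, int partition) {
    cachedPartitions.computeIfAbsent(node, k -> new HashSet<>()).add(partition);
  }
}
```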
Outline
  • Introduction
  • HaLoop
    • Architecture and API
    • Loop-Aware Task Scheduling
    • Caching and Indexing
  • Experimental Results
  • Conclusion
Architecture

[The architecture diagram again, now highlighting the caching and indexing layer used for fast access.]
Caching
  • The task scheduler places map and reduce tasks that occur in different iterations but access the same data on the same machine.
    • Success depends on the ability of the master to schedule tasks “intelligently.”
  • Hence, a particular loop-invariant data partition is usually needed by only one physical node.
    • It is needed by some other node only in case of failure.
  • The caching and indexing components build on the task scheduler to:
    • Cache data on local disk instead of HDFS.
    • Index it for fast object retrieval.
    • Save network bandwidth and I/O time.
Cache and Index

[Diagram: the six numbered steps of a MapReduce job’s dataflow. Where is the network I/O going on?]
Network I/O

[The same six-step diagram, annotated with the components involved (HDFS, Mapper, Master, Reducer, HDFS, Mapper), highlighting where data crosses the network.]
Cache and Index: Mapper Input Cache

[The six-step diagram, highlighting the mapper’s input read.]

In Hadoop, the data-local mapping rate is between 70% and 95%.
  • SetMapperInputCache(true); avoids non-local data reads after the first iteration.
  • Subsequent iterations use the loop-aware task scheduler to read data from local disks.
  • Useful for algorithms where the mapper input does not change across iterations:
    • model-fitting algorithms, e.g., k-means.
Cache and Index: Reducer Input Cache

[The six-step diagram, highlighting the reducer’s input.]
  • Specify AddInvariantTable(); and SetReducerInputCache(true);
  • In the cache, keys and values are stored in separate local files, with pointers from keys to values; the sorted keys are indexed for fast, seek-forward-only access (see the sketch below).
  • Useful for algorithms having repeated joins against large invariant data: Page Rank.
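A toy sketch of that layout; this is my own illustration (HaLoop's actual on-disk format is not specified on these slides): a sorted key index pointing at offsets in a separate values file.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.TreeMap;

class ReducerInputCache {
  // Sorted key index: key -> byte offset of its values in the values file.
  private final TreeMap<String, Long> keyIndex = new TreeMap<>();
  private final RandomAccessFile valuesFile;

  ReducerInputCache(File values) throws IOException {
    valuesFile = new RandomAccessFile(values, "r");
  }

  void addIndexEntry(String key, long offset) {
    keyIndex.put(key, offset);
  }

  // Lookup during the join; since both the cached keys and the shuffled
  // input arrive sorted, reads only ever seek forward.
  String valuesFor(String key) throws IOException {
    Long offset = keyIndex.get(key);
    if (offset == null) return null;
    valuesFile.seek(offset);
    return valuesFile.readUTF();  // assumes values were written with writeUTF
  }
}
```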
Reducer Output Cache

[The six-step diagram, highlighting the reducer’s output.]
  • Specify SetReducerOutputCache(true);
  • Used for evaluating the termination condition in parallel, without an extra MR job.
  • Each reducer compares its current output with the previous iteration’s cached output and reports to the master (see the sketch below).
  • Useful for Page Rank, descendant queries, etc.
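A sketch of that per-reducer check; mine, not HaLoop's actual implementation, using an L1 distance as the R(t) – R(t-1) measure from earlier.

```java
import java.util.Map;

class FixpointCheck {
  // Each reducer diffs its current output against the cached output of the
  // previous iteration and reports this partial distance to the master,
  // which sums the reports and stops the loop when the total < Delta.
  static double partialDistance(Map<String, Double> current, Map<String, Double> cached) {
    double sum = 0.0;
    for (Map.Entry<String, Double> e : current.entrySet()) {
      double previous = cached.getOrDefault(e.getKey(), 0.0);
      sum += Math.abs(e.getValue() - previous);  // e.g. rank change per page
    }
    return sum;
  }
}
```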
Caching and Indexing
  • Mapper Input Cache
  • Reducer Input Cache
  • Reducer Output Cache
  • Why is there no Mapper Output Cache?

A sketch combining the three cache settings in one job driver follows.
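This is a hedged sketch of how the knobs might fit together for Page Rank: the four method names come from these slides, but the HaLoopJob class, its constructor, and the exact signatures are my assumptions, not verified against the prototype.

```java
// Hypothetical HaLoop job setup for Page Rank (method names from the slides;
// everything else is assumed).
HaLoopJob job = new HaLoopJob(conf, "pagerank");
job.AddInvariantTable(new Path("/data/linkage"));  // the loop-invariant table L
job.SetReducerInputCache(true);    // cache L at the reducers across iterations
job.SetReducerOutputCache(true);   // per-reducer fixpoint check, no extra MR job
job.SetMapperInputCache(false);    // the rank vector changes every iteration
// For k-means one would instead set SetMapperInputCache(true), since there
// it is the mapper input (the data points) that stays constant.
```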
Outline
  • Introduction
  • HaLoop
    • Architecture and API
    • Loop-Aware Task Scheduling
    • Caching and Indexing
  • Experimental Results
  • Conclusion
Testing Methods
  • Amazon’s EC2:
    • 50 and 90 slave nodes
    • One master
  • Semi-synthetic and “real-world” datasets:
    • LiveJournal (social network): 18 GB
    • Triples (Semantic Web RDF): 120 GB
    • Freebase (concept linkage graph): 12 GB
  • Evaluate independently the effect of cache management:
    • Reducer Input Cache
    • Reducer Output Cache
    • Mapper Input Cache
Page Rank
  • Run for only 10 iterations.
  • Join and aggregation in every iteration.
  • Overhead in the first step for caching the input.
  • Catches up soon and outperforms Hadoop.
  • Low shuffling time: the time between the RPC invocation by the reducer and the sorting of keys.
Descendant Query
  • Join and duplicate elimination in every iteration.
  • Performance is less striking on LiveJournal: a social network with high fan-out generates excessive duplicates, so duplicate elimination dominates the join cost and reducer input caching is less useful.
Testing…
  • Evaluate independently the effect of cache management:
    • Reducer Input Cache
    • Reducer Output Cache
    • Mapper Input Cache
Reducer Output Cache
  • Recall that the ROC saves an extra MR job for fixpoint evaluation.
Testing…
  • Evaluate independently the effect of cache management:
    • Reducer Input Cache
    • Reducer Output Cache
    • Mapper Input Cache
Mapper Input Cache
  • Cannot use Page Rank or the descendant query. (Why?)
  • Use a model-fitting algorithm: k-means (one iteration is sketched below).
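One iteration of k-means as plain MapReduce; this is my sketch, not HaLoop code, and parsePoint and squaredDistance are hypothetical helpers. The mappers read the constant set of points, which is exactly what the mapper input cache keeps local across iterations.

```java
import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class KMeansIteration {
  // Map: assign each (constant) point to its nearest current centroid.
  public static class KMeansMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
    private double[][] centroids;  // load in setup() from the previous iteration's output

    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      double[] p = parsePoint(value.toString());      // hypothetical helper
      int best = 0;
      double bestDist = Double.MAX_VALUE;
      for (int c = 0; c < centroids.length; c++) {
        double d = squaredDistance(p, centroids[c]);  // hypothetical helper
        if (d < bestDist) { bestDist = d; best = c; }
      }
      ctx.write(new IntWritable(best), value);        // point -> nearest centroid
    }
  }

  // Reduce: average the points in each cluster to get the new centroid.
  public static class KMeansReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
    @Override
    protected void reduce(IntWritable cluster, Iterable<Text> points, Context ctx)
        throws IOException, InterruptedException {
      double[] sum = null;
      long n = 0;
      for (Text t : points) {
        double[] p = parsePoint(t.toString());
        if (sum == null) sum = new double[p.length];
        for (int j = 0; j < p.length; j++) sum[j] += p[j];
        n++;
      }
      for (int j = 0; j < sum.length; j++) sum[j] /= n;
      ctx.write(cluster, new Text(Arrays.toString(sum)));  // new centroid
    }
  }
}
```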
Outline
  • Introduction
  • HaLoop
    • Architecture and API
    • Loop-Aware Task Scheduling
    • Caching and Indexing
  • Experimental Results
  • Conclusion
Good Things about HaLoop
  • HaLoop extends MapReduce:
    • Easier programming of iterative algorithms.
    • Efficiency improvements due to loop awareness and caching.
    • Lets users reuse major building blocks from existing application implementations in Hadoop.
    • Fully backward compatible with Hadoop.
The Not So Good…
  • Only useful for algorithms which can be expressed as R(i+1) = R0 ∪ (R(i) ⋈ L).
  • Imposes constraints: a fixed partition function for each iteration.
  • Does not improve the asymptotic running time: still O(M+R) scheduling decisions, and O(M*R) state kept in memory. And more overhead…
  • Not completely novel: iMapReduce and Twister.
  • People still do iteration using traditional MapReduce: Google, Nutch, Mahout…
Thanks!
  • Also see:
    • Apache Mahout: a Hadoop subproject providing many iterative machine learning algorithms.
    • iMapReduce
    • Twister