HaLoop: Efficient Iterative Data Processing On Large Scale Clusters
Presentation by Amr Swafta
Outlines
  • Introduction / Motivation
  • Iterative Application Example
  • HaLoop Architecture
  • Task Scheduling
  • Caching and Indexing
  • Experiments & Results
  • Conclusion
Introduction / Motivation
  • HaLoop is a modified version of the Hadoop MapReduce framework, designed to serve iterative applications.
  • MapReduce framework can’t directly support recursion/iteration.
  • Many data analysis techniques require iterative computations:
    • PageRank
    • Clustering
    • Neural-network analysis
    • Social network analysis
Iterative Application Example
  • PageRank algorithm: a system for ranking web pages.

PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

  • Where:

- PR(A) is the PageRank of page A.

- PR(Ti) is the PageRank of each page Ti that links to page A.

- C(Ti) is the number of outbound links on page Ti.

- d is a damping factor, which can be set between 0 and 1.

Consider a small web consisting of three pages A, B and C, with d = 0.5.
  • The PageRank values are calculated as follows:

PR(A) = 0.5 + 0.5 PR(C)

PR(B) = 0.5 + 0.5 (PR(A) / 2)

PR(C) = 0.5 + 0.5 (PR(A) / 2 + PR(B))
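The three equations above can be solved by repeated substitution until the ranks stop changing, which is exactly the kind of fixpoint loop HaLoop targets. A minimal Python sketch (illustrative only, not HaLoop code):

```python
def pagerank_3pages(d=0.5, tol=1e-9, max_iter=100):
    """Iterate the three PageRank equations until a fixpoint is reached."""
    pr = {"A": 1.0, "B": 1.0, "C": 1.0}  # arbitrary starting ranks
    for _ in range(max_iter):
        new = {
            "A": (1 - d) + d * pr["C"],
            "B": (1 - d) + d * (pr["A"] / 2),
            "C": (1 - d) + d * (pr["A"] / 2 + pr["B"]),
        }
        if max(abs(new[p] - pr[p]) for p in pr) < tol:  # fixpoint test
            return new
        pr = new
    return pr

ranks = pagerank_3pages()
# Converges to PR(A) = 14/13, PR(B) = 10/13, PR(C) = 15/13
```

In a MapReduce setting, each loop iteration becomes a map-reduce step, and the fixpoint test becomes a separate job, which is what makes plain Hadoop inefficient here.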

HaLoop Architecture
HaLoop’s master node contains a new loop control module that repeatedly starts new map-reduce steps that compose the loop body.
  • HaLoop uses a modified task scheduler for iterative applications.
  • HaLoop caches and indexes application data on slave nodes.
Differences between Hadoop and HaLoop for iterative applications:
  • Note: The loop control is pushed from the application into the infrastructure.
Task Scheduling
  • Inter-iteration locality: place on the same physical machines those map and reduce tasks that occur in different iterations but access the same data.
  • This allows cached data to be reused between iterations.
  • The scheduling exhibits inter-iteration locality if:

For all i > 1, Ti(d) and Ti−1(d) are assigned to the same physical node, where:

- d: a mapper/reducer input.

- Ti(d): the task that consumes d in iteration i.

- Scheduling of the first iteration is the same in Hadoop and HaLoop.
- In subsequent iterations, tasks that access the same data are placed on the same physical node.
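The placement rule above can be sketched as a small lookup table that remembers which node processed each input partition in the previous iteration. All names here are illustrative, not HaLoop's actual scheduler API:

```python
class LoopAwareScheduler:
    """Sketch of HaLoop-style inter-iteration locality (illustrative names)."""

    def __init__(self, nodes):
        self.nodes = nodes
        self.placement = {}  # partition d -> node that consumed d last iteration

    def assign(self, partition, iteration):
        if iteration > 1 and partition in self.placement:
            # Subsequent iterations: reuse the node that cached this partition.
            return self.placement[partition]
        # First iteration: placement as in vanilla Hadoop (round-robin here).
        node = self.nodes[len(self.placement) % len(self.nodes)]
        self.placement[partition] = node
        return node

sched = LoopAwareScheduler(["node1", "node2", "node3"])
first = sched.assign("split-0", iteration=1)
# Every later iteration sends split-0 back to the same node.
```

The key point is that the table persists across iterations, so the scheduler (not the application) guarantees that Ti(d) lands where Ti−1(d) ran.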
Caching and Indexing
  • To reduce I/O cost, HaLoop caches loop-invariant data partitions on the physical node’s local disk for subsequent reuse.
  • To further accelerate processing, it indexes the cached data.

- Keys and values are stored in separate local files.

    • Types of caches:

- Reducer Input Cache

- Reducer Output Cache

- Mapper Input Cache
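The "keys and values in separate local files" layout can be sketched as a value file plus a key-to-offset index, so a reducer can seek straight to a cached value without scanning. This is an illustrative layout, not HaLoop's actual on-disk format:

```python
import os
import pickle
import tempfile

def build_cache(pairs, directory):
    """Write values to one local file; keep a key -> byte-offset index
    in a separate file (sketch of the keys/values split)."""
    index = {}
    with open(os.path.join(directory, "values.dat"), "wb") as vf:
        for key, value in pairs:
            index[key] = vf.tell()      # remember where this value starts
            pickle.dump(value, vf)
    with open(os.path.join(directory, "keys.idx"), "wb") as kf:
        pickle.dump(index, kf)          # the index lives in the key file

def lookup(key, directory):
    """Seek directly to a cached value via the index -- no full scan."""
    with open(os.path.join(directory, "keys.idx"), "rb") as kf:
        index = pickle.load(kf)
    with open(os.path.join(directory, "values.dat"), "rb") as vf:
        vf.seek(index[key])
        return pickle.load(vf)

cachedir = tempfile.mkdtemp()
build_cache([("url1", 0.5), ("url2", 1.2)], cachedir)
rank = lookup("url2", cachedir)
```

Separating the index from the values keeps the index small enough to load quickly while the (larger) value file is accessed by random seeks.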

Reducer Input Cache

    • Provides access to loop-invariant data without the map/shuffle phases.
  • RI cached data is used by the reducer function.
  • Assumes:
    • Mapper output is constant across iterations.
    • Static partitioning (implies no new nodes).
  • In HaLoop, the number of reducer tasks is unchanged across iterations.
Reducer Output Cache

  • Stores and indexes the most recent local output on each reducer node.
    • Provides distributed access to the output of previous iterations.
    • RO cached data is used by fixpoint evaluation.
  • It is particularly efficient when fixpoint evaluation must be conducted after each iteration.
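With the previous iteration's reducer output cached locally, fixpoint evaluation reduces to comparing the current local output against the cache, with no extra MapReduce job. A hedged sketch (the function name and threshold are illustrative):

```python
def fixpoint_reached(current, cached_previous, epsilon=0.001):
    """Compare this iteration's reducer output with the cached output of
    the previous iteration; the loop stops once no value moved by more
    than epsilon (a sketch, not HaLoop's actual termination check)."""
    if cached_previous is None:  # first iteration: nothing to compare against
        return False
    return all(
        abs(current[k] - cached_previous.get(k, float("inf"))) < epsilon
        for k in current
    )

prev = {"A": 1.070, "B": 0.769}  # cached reducer output, iteration i-1
curr = {"A": 1.077, "B": 0.769}  # fresh reducer output, iteration i
converged = fixpoint_reached(curr, prev, epsilon=0.01)
```

Because each reducer holds both its current and previous output locally, this comparison runs on the reducer nodes in parallel instead of shuffling all the data again.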
Mapper Input Cache

  • In the first iteration, if a mapper performs a non-local read on an input split, the split is cached on the local disk of the mapper’s physical node.
  • In later iterations, all mappers read data only from local disks.
    • MI cached data is used during the scheduling of map tasks.
Cache Reloading

  • Cached data must be reloaded in two cases:

1. The hosting node fails.

2. The hosting node is fully loaded, and a map or reduce task must be scheduled on a different (substitution) node.

Experiments & Results
  • HaLoop is evaluated on real queries and real datasets.
  • Compared with Hadoop, HaLoop reduces query runtimes by a factor of 1.85 on average, and shuffles only 4% of the data between mappers and reducers.
Evaluation of Reducer Output Cache
  • Time spent on fixpoint evaluation in each iteration.

[Figure: fixpoint evaluation time (s) vs. iteration number; Livejournal dataset on 50 nodes, Freebase dataset on 90 nodes.]

Evaluation of Mapper Input Cache
  • Overall runtime.

[Figure: overall runtime; Cosmo-dark and Cosmo-gas datasets, each on 8 nodes.]

Conclusion
  • The authors present the design, implementation, and evaluation of HaLoop, a novel parallel and distributed system that supports large-scale iterative data analysis applications.
  • HaLoop is built on top of Hadoop and extends it with several important optimizations:

- A loop-aware task scheduler

- Loop-invariant data caching

- Caching for efficient fixpoint verification
