
## HaLoop: Efficient Iterative Data Processing On Large Scale Clusters

Presentation by

Amr Swafta

### Outline

• Introduction / Motivation

• Iterative Application Example

• HaLoop Architecture

• Task Scheduling

• Caching and Indexing

• Experiments & Results

• Conclusion

### Introduction / Motivation

• HaLoop is a modified version of the Hadoop MapReduce framework, designed to serve iterative applications.

• MapReduce framework can’t directly support recursion/iteration.

• Many data analysis techniques require iterative computations:

• PageRank

• Clustering

• Neural-network analysis

• Social network analysis

### Iterative Application Example

• PageRank algorithm: a system for ranking web pages.

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

• Where:

- PR(A): the PageRank of page A.

- PR(Ti): the PageRank of each page Ti that links to page A.

- C(Ti): the number of outbound links on page Ti.

- d: a damping factor, set between 0 and 1.

• Consider a small web consisting of three pages A, B and C, with d = 0.5.

• The PageRank will be calculated as the following:

PR(A) = 0.5 + 0.5 PR(C)

PR(B) = 0.5 + 0.5 (PR(A) / 2)

PR(C) = 0.5 + 0.5 (PR(A) / 2 + PR(B))
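The fixpoint of these three equations can be found by simple iteration. A minimal Python sketch, where the link structure (A links to B and C, B links to C, C links to A) is inferred from the equations above:

```python
# Iterative PageRank for the 3-page example with d = 0.5.
# Link structure inferred from the equations above:
#   A links to B and C (so C(A) = 2), B links to C, C links to A.
d = 0.5
pr = {"A": 1.0, "B": 1.0, "C": 1.0}  # arbitrary starting ranks

for _ in range(50):  # iterate until the values stop changing
    a, b, c = pr["A"], pr["B"], pr["C"]
    pr = {
        "A": (1 - d) + d * c,            # only C links to A
        "B": (1 - d) + d * (a / 2),      # A links to B, C(A) = 2
        "C": (1 - d) + d * (a / 2 + b),  # A and B link to C
    }

print({k: round(v, 4) for k, v in pr.items()})
# converges to PR(A) = 14/13 ≈ 1.0769, PR(B) = 10/13 ≈ 0.7692, PR(C) = 15/13 ≈ 1.1538
```

Each pass through the loop is analogous to one MapReduce iteration; a real job repeats this until the ranks stop changing, which is exactly the fixpoint evaluation HaLoop optimizes.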

### HaLoop Architecture

• HaLoop’s master node contains a new loop control module that repeatedly starts new map-reduce steps that compose the loop body.

• HaLoop uses a modified task scheduler for iterative applications.

• HaLoop caches and indexes application data on slave nodes.

### Differences between Hadoop and HaLoop for iterative applications

• Note: The loop control is pushed from the application into the infrastructure.

### Task Scheduling

• Inter-iteration locality: place map and reduce tasks that occur in different iterations but access the same data on the same physical machines.

• This allows cached data to be reused between iterations.

• The scheduling exhibits inter-iteration locality if:

For all i > 1, Ti(d) and Ti−1(d) are assigned to the same physical node.

d: mapper / reducer input.

Ti: the task that consumes d during iteration i.

- Scheduling of the first iteration is the same in Hadoop and HaLoop.

- Subsequent iterations place tasks that access the same data on the same physical node.
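A minimal sketch of this policy (hypothetical code, not HaLoop's actual scheduler): remember which node was chosen for each input partition in the first iteration, and pin later iterations to the same node.

```python
# Hypothetical sketch of inter-iteration locality (not HaLoop's real scheduler).
def schedule(partitions, nodes, iteration, placement):
    """Assign each input partition d to a node; placement is filled in
    during iteration 1 and reused afterwards so cached data stays local."""
    assignment = {}
    for i, d in enumerate(partitions):
        if iteration == 1:
            placement[d] = nodes[i % len(nodes)]  # any first-pass policy
        assignment[d] = placement[d]  # Ti(d) reuses the node of Ti-1(d)
    return assignment

placement = {}
it1 = schedule(["d0", "d1", "d2"], ["n0", "n1"], 1, placement)
it2 = schedule(["d0", "d1", "d2"], ["n0", "n1"], 2, placement)
assert it1 == it2  # inter-iteration locality holds
```

The key design point is that the placement table outlives a single iteration, which is exactly what vanilla Hadoop's per-job scheduler lacks.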

### Caching and Indexing

• To reduce I/O cost, HaLoop caches loop-invariant data partitions on the physical node’s local disk for subsequent re-use.

• To further accelerate processing, it indexes the cached data.

- Keys and values are stored in separate local files.

• Type of caches:

- Reducer Input Cache

- Reducer Output Cache

- Mapper Input Cache

### Reducer Input Cache

• RI cached data is used by the reducer function.

• Assumes:

• Mapper output is constant across iterations.

• Static partitioning (implies: no new nodes).

• In HaLoop, the number of reducer tasks is unchanged across iterations.

### Reducer Output Cache

• Stores and indexes the most recent local output on each reducer node.

• RO cached data is used by Fixpoint evaluation.

• It’s very efficient when the fixpoint evaluation is conducted after each iteration.
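As a sketch of the idea (the function name and threshold below are assumptions, not HaLoop's API): with the previous iteration's output cached locally on each reducer node, the fixpoint test becomes a local comparison instead of an extra distributed job.

```python
# Sketch of a fixpoint test against the reducer output cache (assumed API).
def fixpoint_reached(current, cached, epsilon=1e-4):
    """True if no key's value moved by more than epsilon since last iteration."""
    return all(abs(v - cached.get(k, float("inf"))) <= epsilon
               for k, v in current.items())

prev = {"A": 1.07, "B": 0.77}     # cached output of iteration i-1
curr = {"A": 1.0769, "B": 0.7692}  # output of iteration i
fixpoint_reached(curr, prev)  # not yet converged at epsilon = 1e-4
```

Because the index lets each reducer look up its own previous values by key, this check touches only local disk.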

### Mapper Input Cache

• In the first iteration, if a mapper performs a non-local read on an input split, the split is cached on the local disk of the mapper’s physical node.

• In later iterations, all mappers read data only from local disks.

• MI cached data is used during the scheduling of map tasks.

• A map task is placed away from its cached data only when:

1- The hosting node fails.

2- The hosting node has a full load, and a map or reduce task must be scheduled on a different, substitute node.

### Experiments & Results

• HaLoop is evaluated on real queries and real datasets.

• Compared with Hadoop, on average, HaLoop reduces query runtimes by a factor of 1.85 and shuffles only 4% of the data between mappers and reducers.

### Evaluation of Reducer Input Cache

• Overall runtime, and time spent on shuffle and reduce. (Figures omitted.)

### Evaluation of Reducer Output Cache

• Time spent on fixpoint evaluation in each iteration.

(Figures omitted: Livejournal dataset on 50 nodes; Freebase dataset on 90 nodes.)

### Evaluation of Mapper Input Cache

• Overall runtime.

(Figures omitted: Cosmo-dark and Cosmo-gas datasets, 8 nodes each.)

### Conclusion

• The authors present the design, implementation, and evaluation of HaLoop, a novel parallel and distributed system that supports large-scale iterative data-analysis applications.

• HaLoop is built on top of Hadoop and extends it with several important optimizations, including a loop-aware task scheduler, loop-invariant data caching, and caching for efficient fixpoint evaluation.