HaLoop: Efficient Iterative Data Processing On Large Scale Clusters. Presentation by Amr Swafta. Outline. Introduction / Motivation Iterative Application Example HaLoop Architecture Task Scheduling Caching and Indexing Experiments & Results Conclusion. Introduction / Motivation.
- PR(A): is the PageRank of page A.
- PR(Ti): is the PageRank of pages Ti which link to page A.
- C(Ti): is the number of outbound links on page Ti.
- d: is a damping factor which can be set between 0 and 1.
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
Example with d = 0.5:
PR(A) = 0.5 + 0.5 PR(C)
PR(B) = 0.5 + 0.5 (PR(A) / 2)
PR(C) = 0.5 + 0.5 (PR(A) / 2 + PR(B))
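The recurrence above can be run to a fixpoint. A minimal sketch, assuming the three-page link structure implied by the example (A links to B and C, B links to C, C links to A) and a fixed iteration count chosen for illustration:

```python
# Iterative PageRank for the three-page example with damping factor d = 0.5.
# The link structure and iteration count are assumptions for illustration.
d = 0.5
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}  # page -> outbound links
pr = {p: 1.0 for p in links}                       # initial ranks

for _ in range(50):  # iterate until (approximately) a fixpoint
    new = {}
    for page in links:
        inbound = [src for src, outs in links.items() if page in outs]
        new[page] = (1 - d) + d * sum(pr[src] / len(links[src]) for src in inbound)
    pr = new
```

At convergence the ranks satisfy exactly the three equations shown above.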
For all i > 1, Ti(d) and Ti−1(d) are assigned to the same physical node.
- d: mapper / reducer input.
- Ti: the task that consumes d during iteration i.
- Scheduling the first iteration in Hadoop and HaLoop is the same.
- Subsequent iterations put tasks that access the same data on the same physical node.
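This inter-iteration locality can be sketched as a scheduler that remembers, for each data partition, which node processed it last. The class and node names below are hypothetical, not HaLoop's actual API:

```python
# Minimal sketch of loop-aware scheduling: after the first iteration, a task
# that consumes partition d is placed on the same node that processed d before,
# so the node's cached loop-invariant data can be reused.
class LoopAwareScheduler:
    def __init__(self, nodes):
        self.nodes = nodes
        self.placement = {}  # partition -> node chosen in an earlier iteration

    def schedule(self, partition):
        if partition in self.placement:      # iterations i > 1: reuse placement
            return self.placement[partition]
        # first iteration: any placement policy works (hash used here)
        node = self.nodes[hash(partition) % len(self.nodes)]
        self.placement[partition] = node
        return node
```

Repeated calls for the same partition always return the same node, which is the property the caching layer depends on.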
- Keys and values stored in separate local files.
- Reducer Input Cache: caches loop-invariant reducer input across iterations.
- Reducer Output Cache: stores the most recent local output on each reducer node, used for fixpoint evaluation.
- Mapper Input Cache: when a mapper performs a non-local read on an input split, the split will be cached in the local disk of the mapper's physical node.
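The mapper input cache behaves like a local read-through cache. A sketch under assumed names (the path layout and `remote_read` callback are illustrative, not HaLoop internals):

```python
# Sketch of the Mapper Input Cache idea: a non-local read of an input split
# leaves a copy on the mapper's local disk, so later iterations read locally.
import os

def read_split(split_id, local_dir, remote_read):
    """Return the split's bytes, caching remote reads on local disk."""
    path = os.path.join(local_dir, f"{split_id}.cache")
    if os.path.exists(path):          # local hit in a later iteration
        with open(path, "rb") as f:
            return f.read()
    data = remote_read(split_id)      # non-local read (first iteration)
    with open(path, "wb") as f:       # cache on the mapper's physical node
        f.write(data)
    return data
```

Only the first iteration pays the network cost; every later iteration reads the split from local disk.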
The cache must be reconstructed when:
1- The hosting node fails.
2- The hosting node is fully loaded and a map or reduce task must be scheduled on a different, substitution node.
[Experiment figures: reduce and shuffle time; fixpoint evaluation time (s).]
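Fixpoint evaluation compares the current iteration's reducer output against the cached output of the previous iteration. A minimal sketch; the function name, dict representation, and tolerance are assumptions, not HaLoop's actual interface:

```python
# Sketch of fixpoint evaluation over a reducer output cache: the loop stops
# when every key's value changes by less than a tolerance between iterations.
def reached_fixpoint(prev_output, curr_output, tol=1e-4):
    if prev_output is None or prev_output.keys() != curr_output.keys():
        return False  # first iteration, or the key set itself changed
    return all(abs(curr_output[k] - prev_output[k]) < tol
               for k in curr_output)
```

Because the previous output is cached locally on each reducer node, this check avoids an extra MapReduce job per iteration.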
- A loop-aware task scheduler
- Loop-invariant data caching
- Caching for efficient fixpoint verification.
Thank You.