
## HaLoop: Efficient Iterative Data Processing On Large Scale Clusters

Presentation by

Amr Swafta

### Outline

• Introduction / Motivation

• Iterative Application Example

• HaLoop Architecture

• Task Scheduling

• Caching and Indexing

• Experiments & Results

• Conclusion

### Introduction / Motivation

• HaLoop is a modified version of the Hadoop MapReduce framework, designed to serve iterative applications.

• MapReduce framework can’t directly support recursion/iteration.

• Many data analysis techniques require iterative computations:

• PageRank

• Clustering

• Neural-network analysis

• Social network analysis

### Iterative Application Example

• PageRank algorithm: a system for ranking web pages.

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

• Where:

- PR(A): the PageRank of page A.

- PR(Ti): the PageRank of each page Ti that links to page A.

- C(Ti): the number of outbound links on page Ti.

- d: a damping factor, set between 0 and 1.

• Consider a small web consisting of three pages A, B and C, with d = 0.5.

• The PageRank will be calculated as the following:

PR(A) = 0.5 + 0.5 PR(C)

PR(B) = 0.5 + 0.5 (PR(A) / 2)

PR(C) = 0.5 + 0.5 (PR(A) / 2 + PR(B))
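The fixpoint of these three equations can be found by simple iteration. A minimal Python sketch, where the link structure (A links to B and C, B links to C, C links to A) is inferred from the equations above:

```python
# Iterative PageRank for the 3-page example with d = 0.5.
# Link structure inferred from the equations above:
#   A links to B and C (so C(A) = 2), B links to C, C links to A.
d = 0.5
pr = {"A": 1.0, "B": 1.0, "C": 1.0}  # arbitrary starting ranks

for _ in range(50):  # iterate until the values stop changing
    a, b, c = pr["A"], pr["B"], pr["C"]
    pr = {
        "A": (1 - d) + d * c,            # only C links to A
        "B": (1 - d) + d * (a / 2),      # A links to B, C(A) = 2
        "C": (1 - d) + d * (a / 2 + b),  # A and B link to C
    }

print({k: round(v, 4) for k, v in pr.items()})
# converges to PR(A) = 14/13 ≈ 1.0769, PR(B) = 10/13 ≈ 0.7692, PR(C) = 15/13 ≈ 1.1538
```

Each pass through the loop is analogous to one MapReduce iteration; a real job repeats this until the ranks stop changing, which is exactly the fixpoint evaluation HaLoop optimizes.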

### HaLoop Architecture

• HaLoop’s master node contains a new loop control module that repeatedly starts new map-reduce steps that compose the loop body.

• HaLoop uses a modified task scheduler for iterative applications.

• HaLoop caches and indexes application data on slave nodes.

### Differences between Hadoop and HaLoop for iterative applications

• Note: The loop control is pushed from the application into the infrastructure.

### Task Scheduling

• Inter-iteration locality: place map and reduce tasks that occur in different iterations but access the same data on the same physical machines.

• This allows cached data to be reused between iterations.

• The scheduling exhibits inter-iteration locality if:

For all i > 1, Ti(d) and Ti−1(d) are assigned to the same physical node.

d: mapper / reducer input.

Ti: the task that consumes d during iteration i.

- Scheduling of the first iteration is the same in Hadoop and HaLoop.

- Subsequent iterations place tasks that access the same data on the same physical node.
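A minimal sketch of this policy (hypothetical code, not HaLoop's actual scheduler): remember which node was chosen for each input partition in the first iteration, and pin later iterations to the same node.

```python
# Hypothetical sketch of inter-iteration locality (not HaLoop's real scheduler).
def schedule(partitions, nodes, iteration, placement):
    """Assign each input partition d to a node; placement is filled in
    during iteration 1 and reused afterwards so cached data stays local."""
    assignment = {}
    for i, d in enumerate(partitions):
        if iteration == 1:
            placement[d] = nodes[i % len(nodes)]  # any first-pass policy
        assignment[d] = placement[d]  # Ti(d) reuses the node of Ti-1(d)
    return assignment

placement = {}
it1 = schedule(["d0", "d1", "d2"], ["n0", "n1"], 1, placement)
it2 = schedule(["d0", "d1", "d2"], ["n0", "n1"], 2, placement)
assert it1 == it2  # inter-iteration locality holds
```

The key design point is that the placement table outlives a single iteration, which is exactly what vanilla Hadoop's per-job scheduler lacks.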

### Caching and Indexing

• To reduce I/O cost, HaLoop caches loop-invariant data partitions on the physical node’s local disk for subsequent re-use.

• To further accelerate processing, it indexes the cached data.

- Keys and values are stored in separate local files.

• Type of caches:

- Reducer Input Cache

- Reducer Output Cache

- Mapper Input Cache

### Reducer Input Cache

• RI cached data is used by the reducer function.

• Assumes:

• Mapper output is constant across iterations.

• Static partitioning (implies: no new nodes).

• In HaLoop, the number of reducer tasks is unchanged across iterations.

### Reducer Output Cache

• Stores and indexes the most recent local output on each reducer node.

• RO cached data is used by Fixpoint evaluation.

• It’s very efficient when the fixpoint evaluation is conducted after each iteration.
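As a sketch of the idea (the function name and threshold below are assumptions, not HaLoop's API): with the previous iteration's output cached locally on each reducer node, the fixpoint test becomes a local comparison instead of an extra distributed job.

```python
# Sketch of a fixpoint test against the reducer output cache (assumed API).
def fixpoint_reached(current, cached, epsilon=1e-4):
    """True if no key's value moved by more than epsilon since last iteration."""
    return all(abs(v - cached.get(k, float("inf"))) <= epsilon
               for k, v in current.items())

prev = {"A": 1.07, "B": 0.77}     # cached output of iteration i-1
curr = {"A": 1.0769, "B": 0.7692}  # output of iteration i
fixpoint_reached(curr, prev)  # not yet converged at epsilon = 1e-4
```

Because the index lets each reducer look up its own previous values by key, this check touches only local disk.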

### Mapper Input Cache

• In the first iteration, if a mapper performs a non-local read on an input split, the split is cached on the local disk of the mapper’s physical node.

• In later iterations, all mappers read data only from local disks.

• MI cached data is used during the scheduling of map tasks.

• A map task is placed away from its cached data only when:

1- The hosting node fails.

2- The hosting node has a full load, and a map or reduce task must be scheduled on a different, substitute node.

### Experiments & Results

• HaLoop is evaluated on real queries and real datasets.

• Compared with Hadoop, on average, HaLoop reduces query runtimes by a factor of 1.85 and shuffles only 4% of the data between mappers and reducers.

### Evaluation of Reducer Input Cache

• Overall runtime, and time spent on shuffle and reduce. (Figures omitted.)

### Evaluation of Reducer Output Cache

• Time spent on fixpoint evaluation in each iteration.

(Figures omitted: Livejournal dataset on 50 nodes; Freebase dataset on 90 nodes.)

### Evaluation of Mapper Input Cache

• Overall runtime.

(Figures omitted: Cosmo-dark and Cosmo-gas datasets, 8 nodes each.)

### Conclusion

• The authors present the design, implementation, and evaluation of HaLoop, a novel parallel and distributed system that supports large-scale iterative data-analysis applications.

• HaLoop is built on top of Hadoop and extends it with several important optimizations, including a loop-aware task scheduler, loop-invariant data caching, and caching for efficient fixpoint evaluation.