HaLoop: Efficient Iterative Data Processing On Large Scale Clusters


Presentation by Amr Swafta


Outline

  • Introduction / Motivation

  • Iterative Application Example

  • HaLoop Architecture

  • Task Scheduling

  • Caching and Indexing

  • Experiments & Results

  • Conclusion

Introduction / Motivation

  • HaLoop is a modified version of the Hadoop MapReduce framework, designed to serve iterative applications.

  • The MapReduce framework cannot directly support recursion/iteration; the loop must be driven by code outside the framework.

  • Many data analysis techniques require iterative computations:

    • PageRank

    • Clustering

    • Neural-network analysis

    • Social network analysis

Iterative Application Example

  • PageRank algorithm: a system for ranking web pages.

PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

  • Where:

    - PR(A): the PageRank of page A.

    - PR(Ti): the PageRank of a page Ti which links to page A.

    - C(Ti): the number of outbound links on page Ti.

    - d: a damping factor which can be set between 0 and 1.
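The formula above can be sketched as a small iterative program. This is a minimal illustration, not HaLoop's implementation; the tiny three-page link graph and the convergence threshold are made-up values for demonstration:

```python
# Minimal PageRank iteration for a toy link graph, following the
# formula PR(A) = (1 - d) + d * sum(PR(Ti)/C(Ti)) from the slide.

def pagerank(links, d=0.5, tol=1e-6, max_iters=100):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    pr = {p: 1.0 for p in pages}          # initial rank for every page
    for _ in range(max_iters):
        new_pr = {}
        for a in pages:
            # Sum PR(Ti)/C(Ti) over every page Ti that links to page A.
            incoming = sum(pr[t] / len(links[t])
                           for t in pages if a in links[t])
            new_pr[a] = (1 - d) + d * incoming
        converged = all(abs(new_pr[p] - pr[p]) < tol for p in pages)
        pr = new_pr
        if converged:
            break
    return pr

# Toy graph: A links to B and C, B links to C, C links back to A.
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
```

Each pass of the outer loop corresponds to one MapReduce job: this repeated launching of near-identical jobs over mostly unchanged data is exactly the pattern HaLoop optimizes.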

HaLoop Architecture

  • HaLoop’s master node contains a new loop control module that repeatedly starts new map-reduce steps that compose the loop body.

  • HaLoop uses a modified task scheduler for iterative applications.

  • HaLoop caches and indexes application data on slave nodes.

Differences between Hadoop and HaLoop for iterative applications

  • Note: The loop control is pushed from the application into the infrastructure.

Task Scheduling

  • Inter-iteration locality: place on the same physical machines those map and reduce tasks that occur in different iterations but access the same data.

  • This allows cached data to be reused between iterations.

  • The scheduling exhibits inter-iteration locality if:

For all i > 1, Ti(d) and Ti−1(d) are assigned to the same physical node, where:

d: a mapper / reducer input.

Ti(d): the task which consumes (d) during iteration i.

  • Scheduling of the first iteration in Hadoop and HaLoop is the same.

  • Subsequent iterations put tasks that access the same data on the same physical node.
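The locality rule above can be sketched as follows. This is a toy model with hypothetical names (`schedule`, `history`); the real scheduler lives in the HaLoop master and handles many more concerns:

```python
# Toy sketch of inter-iteration locality: iteration 1 places each data
# partition on a node by any policy (here, simple hashing); later
# iterations look the placement up, so Ti(d) and Ti-1(d) land on the
# same physical node and can reuse its local cache.

def schedule(iteration, partition, nodes, history):
    """history: dict partition -> node, recorded during iteration 1."""
    if iteration == 1:
        node = nodes[hash(partition) % len(nodes)]
        history[partition] = node          # remember the placement
        return node
    # Subsequent iterations: reuse the node that cached this partition.
    return history[partition]

nodes = ["node-0", "node-1", "node-2"]
history = {}
parts = ["d0", "d1", "d2", "d3"]
first = {p: schedule(1, p, nodes, history) for p in parts}
second = {p: schedule(2, p, nodes, history) for p in parts}
```

Under this sketch, `first == second` for every partition, which is the inter-iteration locality property stated above.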

Caching and Indexing

  • To reduce I/O cost, HaLoop caches loop-invariant data partitions on the physical node’s local disk for subsequent re-use.

  • To further accelerate processing, it indexes the cached data.

    - Keys and values are stored in separate local files.

    • Types of caches:

      - Reducer Input Cache

      - Reducer Output Cache

      - Mapper Input Cache

Reducer Input Cache

  • Access to loop invariant data without map/shuffle.

  • RI-cached data is used by the reducer function.

  • Assumes:

    • Mapper output is constant across iterations.

    • Static partitioning (implies: no new nodes).

  • In HaLoop, the number of reducer tasks is unchanged across iterations.
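The saving the reducer input cache buys can be sketched like this. All names here (`reducer_input`, `simulated_shuffle`) are hypothetical; the point is only that the expensive map/shuffle of the loop-invariant data runs once:

```python
# Toy sketch of a reducer input cache: the loop-invariant part of the
# reducer's input (e.g. the link structure in PageRank) is shuffled
# once, cached on the reducer's node, and merged with the variant part
# (e.g. the current ranks) in every later iteration.

cache = {}  # stands in for the reducer node's local-disk cache

def reducer_input(reducer_id, iteration, shuffle_invariant, variant):
    """shuffle_invariant() simulates the expensive map/shuffle phase."""
    if iteration == 1:
        cache[reducer_id] = shuffle_invariant()   # pay the cost once
    invariant = cache[reducer_id]                 # reuse afterwards
    return {k: (invariant[k], variant[k]) for k in invariant}

calls = []
def simulated_shuffle():
    calls.append(1)                 # count how often the shuffle runs
    return {"A": ["B"]}

first = reducer_input("r0", 1, simulated_shuffle, {"A": 1.0})
second = reducer_input("r0", 2, simulated_shuffle, {"A": 0.8})
```

After both calls, the simulated shuffle has run only once; this is why the assumptions above (constant mapper output, static partitioning) matter, since a cached partition is only valid if the same reducer keeps seeing the same invariant data.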

Reducer Output Cache

    • Stores and indexes the most recent local output on each reducer node.

      • Provides distributed access to the output of previous iterations.

      • RO-cached data is used by fixpoint evaluation.

    • It is especially efficient when fixpoint evaluation must be conducted after each iteration.
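A fixpoint check against such a cache can be sketched as below. The function name and tolerance are hypothetical; the idea is that each reducer compares its fresh output with its locally cached previous output, instead of launching an extra MapReduce job to re-read both iterations:

```python
# Toy sketch of fixpoint evaluation using a reducer output cache:
# prev_output is the locally cached result of the previous iteration,
# curr_output is what the reducer just produced.

def fixpoint_reached(prev_output, curr_output, tol=1e-3):
    """True iff every key's value changed by less than tol."""
    return all(abs(curr_output[k] - prev_output.get(k, float("inf"))) < tol
               for k in curr_output)
```

A key missing from the cache (e.g. in the very first iteration) counts as "not converged", so the loop always runs at least twice before it can stop.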

Mapper Input Cache

    • In the first iteration, if a mapper performs a non-local read on an input split, the split will be cached on the local disk of the mapper’s physical node.

    • In later iterations, all mappers read data only from local disks.

      • MI-cached data is used during scheduling of map tasks.

Cache Reloading

    • Cached data must be reconstructed in two cases:

    1- The hosting node fails.

    2- The hosting node has a full load, and a map or reduce task must be scheduled on a different (substitution) node.

Experiments & Results

    • HaLoop is evaluated on real queries and real datasets.

    • Compared with Hadoop, on average, HaLoop reduces query runtimes by a factor of 1.85, and shuffles only 4% of the data between mappers and reducers.

Evaluation of Reducer Input Cache

    (Figures: overall runtime; reduce and shuffle time.)

Evaluation of Reducer Output Cache

    • Time spent on fixpoint evaluation in each iteration.

    (Figures: fixpoint evaluation time (s) vs. iteration # — Livejournal dataset on 50 nodes; Freebase dataset on 90 nodes.)

Evaluation of Mapper Input Cache

    • Overall runtime.

    (Figures: overall runtime on 8 nodes.)

Conclusion

    • Authors present the design, implementation, and evaluation of HaLoop, a novel parallel and distributed system that supports large-scale iterative data analysis applications.

    • HaLoop is built on top of Hadoop and extends it with several important optimizations, including:

      - A loop-aware task scheduler

      - Loop-invariant data caching

      - Caching for efficient fixpoint verification.

Questions

Thank You