
HaLoop: Efficient Iterative Data Processing On Large Scale Clusters

Presentation by Amr Swafta


Outline

  • Introduction / Motivation

  • Iterative Application Example

  • HaLoop Architecture

  • Task Scheduling

  • Caching and Indexing

  • Experiments & Results

  • Conclusion



Introduction / Motivation

  • HaLoop is a modified version of the Hadoop MapReduce framework, designed to serve iterative applications.

  • The MapReduce framework cannot directly support recursion or iteration.

  • Many data analysis techniques require iterative computations:

    • PageRank

    • Clustering

    • Neural-network analysis

    • Social network analysis



Iterative Application Example

  • The PageRank algorithm: a system for ranking web pages.

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

  • Where:

    - PR(A) is the PageRank of page A.

    - PR(Ti) is the PageRank of a page Ti that links to page A.

    - C(Ti) is the number of outbound links on page Ti.

    - d is a damping factor, which can be set between 0 and 1.



  • Consider a small web consisting of three pages A, B, and C, with d = 0.5.

  • The PageRank values are calculated as follows:

    PR(A) = 0.5 + 0.5 PR(C)

    PR(B) = 0.5 + 0.5 (PR(A) / 2)

    PR(C) = 0.5 + 0.5 (PR(A) / 2 + PR(B))
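These three equations can be evaluated iteratively until the values stop changing (the fixpoint). A minimal sketch in Python, illustrative only and not HaLoop code:

```python
# Iteratively evaluate the three PageRank equations above (d = 0.5),
# starting from an initial guess of 1.0 for every page.
def pagerank_3pages(iterations=60):
    pr = {"A": 1.0, "B": 1.0, "C": 1.0}
    for _ in range(iterations):
        pr = {
            "A": 0.5 + 0.5 * pr["C"],
            "B": 0.5 + 0.5 * (pr["A"] / 2),
            "C": 0.5 + 0.5 * (pr["A"] / 2 + pr["B"]),
        }
    return pr

ranks = pagerank_3pages()
# converges to PR(A) = 14/13, PR(B) = 10/13, PR(C) = 15/13
```

Solving the three equations algebraically gives the same fixpoint: PR(A) = 14/13 ≈ 1.077, PR(B) = 10/13 ≈ 0.769, PR(C) = 15/13 ≈ 1.154.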



HaLoop Architecture



  • HaLoop’s master node contains a new loop control module that repeatedly starts the map-reduce steps that compose the loop body.

  • HaLoop uses a modified task scheduler for iterative applications.

  • HaLoop caches and indexes application data on slave nodes.


Differences between Hadoop and HaLoop for iterative applications

  • Note: The loop control is pushed from the application into the infrastructure.
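The contrast can be sketched as follows: in plain Hadoop the client application drives the loop, re-submitting one job per iteration and checking convergence itself, while in HaLoop the infrastructure runs the loop body until a termination condition holds. All names below are illustrative stand-ins, not the real Hadoop or HaLoop APIs:

```python
# Hadoop style: the application owns the loop and the convergence check.
def hadoop_style_driver(run_job, data, converged):
    while True:
        new_data = run_job(data)        # submit one MapReduce job
        if converged(data, new_data):   # client-side termination test
            return new_data
        data = new_data

# HaLoop style: the application registers the loop body and the
# termination condition once; the framework runs the iterations.
class LoopAwareFramework:
    def __init__(self, loop_body, fixpoint):
        self.loop_body = loop_body      # one map-reduce step
        self.fixpoint = fixpoint        # termination condition

    def run(self, data):
        while True:
            new_data = self.loop_body(data)
            if self.fixpoint(data, new_data):
                return new_data
            data = new_data

# Toy loop body: x -> (x + 10) / 2 converges to the fixpoint 10.
step = lambda x: (x + 10) / 2
done = lambda old, new: abs(new - old) < 1e-9
result = LoopAwareFramework(step, done).run(0.0)
```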



Task Scheduling

  • Inter-iteration locality: place on the same physical machines those map and reduce tasks that occur in different iterations but access the same data.

  • This allows cached data to be reused between iterations.

  • The scheduling exhibits inter-iteration locality if:

For all i > 1, Ti(d) and Ti−1(d) are assigned to the same physical node, where:

d: a mapper/reducer input partition.

Ti(d): the task that consumes d during iteration i.



  • Scheduling the first iteration in Hadoop and HaLoop is the same.

  • Subsequent iterations put tasks that access the same data on the same physical node.
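A toy sketch of this policy (illustrative, not HaLoop's actual scheduler): the first iteration assigns partitions to nodes normally and records the placement; later iterations simply look the placement up.

```python
class LoopAwareScheduler:
    """Toy scheduler that keeps Ti(d) on the node that ran T1(d)."""
    def __init__(self, nodes):
        self.nodes = nodes
        self.placement = {}   # partition d -> node chosen in iteration 1
        self.cursor = 0       # round-robin cursor for the first iteration

    def assign(self, partition, iteration):
        if iteration == 1:
            node = self.nodes[self.cursor % len(self.nodes)]
            self.cursor += 1
            self.placement[partition] = node
            return node
        # Subsequent iterations: same partition, same physical node.
        return self.placement[partition]
```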



Caching and Indexing

  • To reduce I/O cost, HaLoop caches loop-invariant data partitions on the physical node’s local disk for subsequent re-use.

  • To further accelerate processing, it indexes the cached data.

    - Keys and values are stored in separate local files.

    • Types of caches:

      - Reducer Input Cache

      - Reducer Output Cache

      - Mapper Input Cache



Reducer Input Cache

  • Provides access to loop-invariant data without the map/shuffle phases.

  • Reducer-input (RI) cached data is used by the reducer function.

  • Assumes:

    • Mapper output is constant across iterations.

    • Static partitioning (implies: no new nodes).

  • In HaLoop, the number of reducer tasks is unchanged across iterations.
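In outline (a toy in-memory stand-in for the on-disk cache, not HaLoop's implementation): the loop-invariant tuples are written to the reducer's local cache once, and every later iteration joins only the freshly shuffled per-iteration data against that cache.

```python
class ReducerWithInputCache:
    def __init__(self):
        self.cache = {}   # key -> loop-invariant values ("local disk")

    def reduce(self, iteration, shuffled, invariant=None):
        if iteration == 1 and invariant:
            self.cache.update(invariant)   # cache invariant data once
        # Later iterations: only the small variant data is shuffled in;
        # the invariant side of the join comes from the local cache.
        return {k: vals + self.cache.get(k, [])
                for k, vals in shuffled.items()}
```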


Reducer Output Cache

    • Stores and indexes the most recent local output on each reducer node.

      • Provides distributed access to output of previous iterations.

      • Reducer-output (RO) cached data is used by the fixpoint evaluation.

    • It is especially efficient when the fixpoint evaluation must be conducted after each iteration.
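The idea can be sketched like this (illustrative only): each reducer keeps its previous local output and compares the new output against it, so the convergence test needs no extra MapReduce job over the full dataset.

```python
def fixpoint_reached(prev_output, new_output, threshold=1e-6):
    """Compare the cached previous reducer output with the new one."""
    keys = set(prev_output) | set(new_output)
    # Sum the per-key changes between the two iterations.
    distance = sum(abs(new_output.get(k, 0.0) - prev_output.get(k, 0.0))
                   for k in keys)
    return distance < threshold
```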



    Mapper Input Cache

    • In the first iteration, if a mapper performs a non-local read on an input split, the split will be cached on the local disk of the mapper’s physical node.

    • In later iterations, all mappers read data only from local disks.

      • MI cached data is used during scheduling of map tasks.
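A toy rendering of the idea, with an in-memory dict standing in for the mapper's local disk (not HaLoop code):

```python
class MapperNode:
    def __init__(self, remote_read):
        self.remote_read = remote_read   # non-local (network) read
        self.local_disk = {}             # cached input splits
        self.remote_reads = 0

    def read_split(self, split_id):
        if split_id in self.local_disk:      # later iterations: local only
            return self.local_disk[split_id]
        self.remote_reads += 1               # iteration 1: non-local read
        data = self.remote_read(split_id)
        self.local_disk[split_id] = data     # cache the split locally
        return data
```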



    Cache Reloading

    Cached data must be reloaded in two cases:

    1. The hosting node fails.

    2. The hosting node has a full load and a map or reduce task must be scheduled on a different (substitution) node.



    Experiments & Results

    • HaLoop is evaluated on real queries and real datasets.

    • Compared with Hadoop, on average, HaLoop reduces query runtimes by a factor of 1.85 and shuffles only 4% of the data between mappers and reducers.



    Evaluation of Reducer Input Cache

    • Overall runtime (chart).

    • Reduce and shuffle time (chart).



    Evaluation of Reducer Output Cache

    • Time spent on fixpoint evaluation in each iteration.

    [Charts: fixpoint evaluation time (s) vs. iteration #, for the Livejournal dataset (50 nodes) and the Freebase dataset (90 nodes).]



    Evaluation of Mapper Input Cache

    • Overall runtime.

    [Charts: overall runtime for the Cosmo-dark and Cosmo-gas datasets, 8 nodes each.]



    Conclusion

    • The authors present the design, implementation, and evaluation of HaLoop, a novel parallel and distributed system that supports large-scale iterative data analysis applications.

    • HaLoop is built on top of Hadoop and extends it with several important optimizations:

      - A loop-aware task scheduler

      - Loop-invariant data caching

      - Caching for efficient fixpoint verification.



    Questions



    Thank You

