
CARDIO: Cost-Aware Replication for Data-Intensive workflOws

Presented by Chen He



Motivation

  • Are large-scale clusters reliable?

    • An average of 5 worker deaths per MapReduce job

    • At least 1 disk failure in every run of a 6-hour MapReduce job on a 4000-node cluster



Motivation

  • How can node failures be prevented from affecting performance?

    • Replication

      • Capacity constraint

      • Replication time, etc.

    • Regeneration through re-execution

      • Delays program progress

      • Cascaded re-execution



Motivation

[Figure: the trade-off between COST and AVAILABILITY. All pictures adapted from the Internet.]



Outline

  • Problem Exploration

  • CARDIO Model

  • Hadoop CARDIO System

  • Evaluation

  • Discussion



Problem Exploration

  • Performance Costs

    • Replication cost (R)

    • Regeneration cost (G)

    • Reliability cost (Z)

    • Execution cost (A)

    • Total cost (T)

    • Disk cost (Y)

      T = A + Z

      Z = R + G



Problem Exploration

  • Experiment Environment

    • Hadoop 0.20.2

    • 25 VMs

    • Workload: a four-stage pipeline, Tagger -> Join -> Grep -> RecordCounter



Problem Exploration Summary

  • Replication Factor for MR Stages



Problem Exploration Summary

  • Detailed Execution Time of 3 Cases



CARDIO Model

  • Block Failure Model

    • Output size of stage i

    • Replication factor of stage i

    • Total block number of stage i

    • Single-block failure probability

    • Failure probability in stage i
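
A plausible reconstruction of the equations missing from this slide, assuming (the notation is not from the source) d_i for the output of stage i, b for the block size, x_i for the replication factor, and p for the single-block failure probability:

    n_i = d_i / b                          % total block number of stage i
    f_i = 1 - (1 - p^{x_i})^{n_i}          % probability that stage i loses every replica of at least one block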



CARDIO Model

  • Cost Computation Model

    • Total time of stage i

    • Replication cost of stage i

    • Expected regeneration time of stage i

    • Reliability cost over all stages

    • Storage constraint C over all stages

    • Choose the replication factors to minimize Z
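
Continuing the assumed notation (with \delta the per-unit replication time from the Evaluation section, and t_i the re-execution time of stage i), the missing cost formulas plausibly take this shape:

    R_i = \delta \, d_i \, (x_i - 1)       % replication cost: time to make extra copies of stage i's output
    G_i = f_i \, t_i                       % expected regeneration time: failure probability times re-execution time
    Z   = \sum_i (R_i + G_i)               % reliability cost over all stages
    \sum_i x_i \, d_i \le C                % storage constraint
    \min_{x_1, \ldots, x_N} Z              % choose the replication factors that minimize Z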



CARDIO Model

  • Dynamic Replication

    • The replication factor x may vary as the program progresses

      • When the job is at step k, each stage carries a step-specific replication factor
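
Under the same assumed notation, the step-dependent factor is presumably written with a step index:

    x_i^{(k)}                              % replication factor of stage i while the job executes step k

so that CARDIO can re-solve for the whole vector \{x_i^{(k)} : i \le k\} each time the job advances a step.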



CARDIO Model

  • Model for Reliability

    • Minimize the reliability cost Z

    • Based on the block failure and cost models above

    • Subject to the storage constraint C
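
Assembled from the reconstructed pieces above, the reliability optimization plausibly reads:

    \min_{x_1, \ldots, x_N} \; Z = \sum_i \left[ R_i(x_i) + G_i(x_i) \right]
    \text{subject to } \sum_i x_i \, d_i \le C, \quad x_i \ge 1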



CARDIO Model

  • Resource Utilization Model

    • Model: cost = resources utilized

    • Resource types Q

      • CPU, network, disk I/O, and storage resources, etc.

      • Utilization of resource q in stage i

      • Normalized usage

      • Relative cost weights
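
A plausible formalization of this slide (the symbols u, \hat{u}, U, and w are assumed, not from the source):

    u_i^q                                  % utilization of resource q \in Q in stage i
    \hat{u}_i^q = u_i^q / U^q              % usage normalized by the capacity U^q of resource q
    w_q                                    % relative cost weight of resource q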



CARDIO Model

  • Resource Utilization Model

    • The execution cost A

    • Total cost T

    • Optimization target:

      • Choose the replication factors to minimize T
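
With the assumed symbols above, the slide's formulas plausibly read:

    A = \sum_i \sum_{q \in Q} w_q \, \hat{u}_i^q      % execution cost, weighted over all resources and stages
    T = A + Z                                          % total cost, as defined in Problem Exploration
    \min_{x} T                                         % choose the replication factors that minimize T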



CARDIO Model

  • Optimization Problem

    • Job optimality (JO): choose all replication factors at once for the whole job

    • Stage optimality (SO): choose replication factors incrementally, as each stage completes



Hadoop CARDIO System

  • CardioSense

    • Obtains job progress from the JobTracker (JT) periodically

    • Triggered by a pre-configured threshold value

    • Collects resource-usage statistics for running stages

    • Relies on HMon on each worker node

      • HMon, based on Atop, has low overhead



Hadoop CARDIO System

  • CardioSolve

    • Receives data from CardioSense

    • Solves the SO problem (a brute-force sketch follows below)

    • Decides the replication factors for the current and previous stages
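
The slides do not show CardioSolve's algorithm. Below is a minimal brute-force sketch in Java of an SO-style search, built on the cost model reconstructed above; the parameter values (taken from the Evaluation slides), the stage data, and the treatment of one GB of output as one block are all illustrative assumptions, not CARDIO's actual implementation.

    public class CardioSolveSketch {
        // Assumed parameters, taken from the Evaluation slides.
        static double p = 0.08;     // single-replica failure probability
        static double delta = 0.6;  // time to replicate one data unit
        static double C = 204;      // storage constraint (GB)

        static int[] best;
        static double bestZ;

        // Probability that a stage loses every replica of at least one
        // block, treating each GB of output as one block (illustrative).
        static double fail(double blocks, int x) {
            return 1.0 - Math.pow(1.0 - Math.pow(p, x), blocks);
        }

        // Reliability cost Z = sum of replication cost R_i plus expected
        // regeneration cost G_i, per the reconstructed model above.
        static double cost(double[] d, double[] regen, int[] x) {
            double z = 0.0;
            for (int i = 0; i < d.length; i++) {
                z += delta * d[i] * (x[i] - 1)      // R_i
                   + fail(d[i], x[i]) * regen[i];   // G_i
            }
            return z;
        }

        // Exhaustive search over replication factors 1..maxRep per stage,
        // keeping the cheapest assignment that fits the storage budget.
        static void search(double[] d, double[] regen, int[] x, int i, int maxRep) {
            if (i == x.length) {
                double used = 0.0;
                for (int j = 0; j < x.length; j++) used += x[j] * d[j];
                double z = cost(d, regen, x);
                if (used <= C && z < bestZ) { bestZ = z; best = x.clone(); }
                return;
            }
            for (int r = 1; r <= maxRep; r++) {
                x[i] = r;
                search(d, regen, x, i + 1, maxRep);
            }
        }

        public static void main(String[] args) {
            double[] d = {40, 80, 20};      // stage output sizes (GB), illustrative
            double[] regen = {10, 30, 5};   // re-execution times, illustrative
            bestZ = Double.MAX_VALUE;
            search(d, regen, new int[d.length], 0, 3);
            System.out.println(java.util.Arrays.toString(best) + "  Z = " + bestZ);
        }
    }

With a handful of stages and small maximum factors the exhaustive search is cheap; a production solver would prune the search space or solve the relaxation analytically.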



Hadoop CARDIO System

  • CardioAct

    • Implements the commands from CardioSolve

    • Uses the HDFS API setReplication(file, replicaNumber), as sketched below
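
A minimal sketch of how CardioAct could issue this call; the path and target factor are hypothetical, but FileSystem.setReplication(Path, short) is the standard HDFS API the slide names.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CardioActSketch {
        public static void main(String[] args) throws Exception {
            // Connect to the default file system named in the Hadoop config.
            FileSystem fs = FileSystem.get(new Configuration());

            // Hypothetical stage-output path and solver-chosen factor.
            Path stageOutput = new Path("/user/chen/stage2/output");
            short replicaNumber = 2;

            // setReplication only records the new target factor; the NameNode
            // adds or deletes replicas asynchronously in the background.
            boolean accepted = fs.setReplication(stageOutput, replicaNumber);
            System.out.println("setReplication accepted: " + accepted);
        }
    }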



Hadoop CARDIO System



Evaluation

  • Several Important Parameters

    • p is the failure rate, 0.2 if not specified

    • δ is the time to replicate a data unit, 0.2 as well

    • The computation resource of stage i follows a uniform distribution U(1, Cmax), with Cmax = 100 in general

    • The output of stage i is drawn from a uniform distribution U(1, Dmax); Dmax varies within [1, Cmax]

    • C is the storage constraint for the whole process. Default value is



Evaluation

  • Effect of Dmax



Evaluation

  • Effect of Failure rate p



Evaluation

  • Effect of block size



Evaluation

  • Effect of different resource constraints

"++" means over-utilized; an over-utilized resource is regarded as expensive

p = 0.08, C = 204 GB, δ = 0.6

S3 is CPU-intensive

DSK shows a performance pattern similar to NET

Resource-constraint settings: CPU 0010, NET 0011, DSKIO 0011, STG 0011



Evaluation

S2 re-executes more frequently under the failure injection because it has a large data output.

p = 0.02, 0.08, and 0.1

1, 3, 21

Due to the HDFS setReplication() API (see Discussion)



Discussion

  • Problems

    • Typos and misleading symbols

    • HDFS API setReplication()

  • Any other ideas?

