CARDIO: Cost-Aware Replication for Data-Intensive workflOws

Presented by Chen He

Presentation Transcript
Motivation
  • Are large-scale clusters reliable?
    • An average of 5 worker deaths per MapReduce job
    • At least 1 disk failure in every run of a 6-hour MapReduce job on a 4000-node cluster
Motivation
  • How can we keep node failures from hurting performance?
    • Replication
      • Capacity constraint
      • Replication time, etc.
    • Regeneration through re-execution
      • Delays program progress
      • Cascaded re-execution
Motivation
  • The trade-off: COST vs. AVAILABILITY
    (illustrations omitted; all pictures adapted from the Internet)

Outline
  • Problem Exploration
  • CARDIO Model
  • Hadoop CARDIO System
  • Evaluation
  • Discussion
Problem Exploration
  • Performance Costs
    • Replication cost (R)
    • Regeneration cost (G)
    • Reliability cost (Z)
    • Execution cost (A)
    • Total cost (T)
    • Disk cost (Y)

T = A + Z, where Z = R + G

Problem Exploration
  • Experiment Environment
    • Hadoop 0.20.2
    • 25 VMs
    • Workload: Tagger → Join → Grep → RecordCounter
Problem Exploration Summary
  • Replication Factor for MR Stages
Problem Exploration Summary
  • Detailed Execution Time of 3 Cases
CARDIO Model
  • Block Failure Model
    • Output size of stage i
    • Replication factor of stage i
    • Total number of blocks in stage i
    • Single-block failure probability
    • Failure probability of stage i
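The formulas on this slide were images in the original deck and did not survive extraction. A plausible reconstruction, with notation introduced here for illustration (D_i: output size of stage i; B: block size; x_i: replication factor; p: single-block failure probability):

    n_i = \lceil D_i / B \rceil                  % total number of blocks in stage i
    \Pr[\text{block lost}] = p^{x_i}             % a block is lost only if all x_i replicas fail
    f_i = 1 - \bigl(1 - p^{x_i}\bigr)^{n_i}      % stage i fails if at least one of its blocks is lost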
CARDIO Model
  • Cost Computation Model
    • Total time of stage i
    • Replication cost of stage i
    • Expected regeneration time of stage i
    • Reliability cost over all stages
    • Storage constraint C over all stages
    • Choose the replication factors to minimize Z
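The cost equations here were likewise images. A sketch consistent with the cost names above and with the evaluation section later in the deck (\delta, the time to replicate one data unit, appears there; t_i, the execution time of stage i, and the linear forms below are our assumptions, not the authors' exact formulas):

    R_i = \delta \, D_i \, (x_i - 1)             % replication cost: write x_i - 1 extra copies of D_i
    G_i = f_i \, t_i                             % expected regeneration time: re-run stage i on failure
    Z = \sum_i (R_i + G_i)                       % reliability cost over all stages
    \sum_i x_i D_i \le C                         % storage constraint
    \min_{x_1,\dots,x_n} Z                       % objective, subject to the constraint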
CARDIO Model
  • Dynamic Replication
    • The replication factor x may vary as the program progresses
    • While the job is in step k, each stage's output carries a replication factor specific to that step
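The step-indexed factor was also an image. A plausible reading: while the job executes step k, the output of each completed stage i ≤ k carries its own factor

    x_i(k), \qquad 1 \le i \le k,

so replication levels can be revised at every step boundary instead of being fixed once.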
CARDIO Model
  • Model for Reliability
    • Minimize the reliability cost Z
    • Based on the block failure model
    • Subject to the storage constraint C
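Assembling the pieces defined on the previous slides, the optimization this slide states is presumably:

    \min_{\{x_i(k)\}} \; Z = \sum_i \bigl(R_i + G_i\bigr)
    \quad \text{based on} \quad f_i = 1 - \bigl(1 - p^{x_i(k)}\bigr)^{n_i}
    \quad \text{subject to} \quad \sum_{i \le k} x_i(k)\, D_i \le C \;\; \text{at every step } k.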
CARDIO Model
  • Resource Utilization Model
    • Cost is modeled as the amount of resources utilized
    • Resource types Q
      • CPU, network, disk I/O, storage, etc.
      • Utilization of resource q in stage i
      • Usage normalized against total capacity
      • Relative cost weights per resource
CARDIO Model
  • Resource Utilization Model (cont.)
    • Execution cost A: weighted sum of normalized resource usage
    • Total cost T
    • Optimization target:
      • Choose the replication factors to minimize T
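The utilization formulas were images as well. A sketch in assumed notation (u_{i,q}: usage of resource q in stage i; U_q: total capacity of q; w_q: its relative cost weight):

    \hat{u}_{i,q} = u_{i,q} / U_q                        % normalized usage
    A = \sum_i \sum_{q \in Q} w_q \, \hat{u}_{i,q}       % execution cost as weighted resource usage
    T = A + Z                                            % total cost, as defined earlier
    \min_{\{x_i\}} T                                     % optimization target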
CARDIO Model
  • Optimization Problem
    • Job optimality (JO)
    • Stage optimality (SO)
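The deck does not spell out the two formulations here; a common reading (an assumption, not the authors' exact definitions) is that JO solves for all replication factors at once with full knowledge of the job, while SO re-solves at each stage boundary using only what has been observed so far:

    \text{JO:} \quad \min_{x_1,\dots,x_n} T                                % one global solve over the whole job
    \text{SO:} \quad \min_{x_1(k),\dots,x_k(k)} T_k \;\; \text{at each step } k   % incremental solves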
Hadoop CARDIO System
  • CardioSense
    • Obtains job progress from the JobTracker (JT) periodically
    • Triggered by a pre-configured threshold value
    • Collects resource usage statistics for running stages
    • Relies on HMon on each worker node
      • HMon, based on Atop, has low overhead
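To make the polling loop concrete, here is a minimal sketch against the Hadoop 0.20 Java API the deck targets. The class name, threshold, and polling period are our assumptions; only JobClient, JobStatus, RunningJob, and the methods shown are actual Hadoop 0.20 API.

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobStatus;
import org.apache.hadoop.mapred.RunningJob;

// Hypothetical CardioSense-style poller: ask the JobTracker for progress
// and fire once a stage crosses a pre-configured threshold.
public class CardioSensePoller {
  private static final float PROGRESS_THRESHOLD = 0.95f; // assumed trigger value
  private static final long POLL_INTERVAL_MS = 10000L;   // assumed polling period

  public static void main(String[] args) throws Exception {
    JobClient client = new JobClient(new JobConf()); // connects to the JobTracker
    while (true) {
      for (JobStatus status : client.jobsToComplete()) {
        RunningJob job = client.getJob(status.getJobID());
        if (job == null) {
          continue;
        }
        // When the reduce phase nears completion, CARDIO would gather the
        // per-node resource statistics from HMon and hand them to CardioSolve.
        if (job.reduceProgress() >= PROGRESS_THRESHOLD) {
          System.out.printf("stage %s near completion (reduce %.0f%%)%n",
              job.getID(), job.reduceProgress() * 100);
          // ... trigger statistics collection + CardioSolve here ...
        }
      }
      Thread.sleep(POLL_INTERVAL_MS);
    }
  }
}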
Hadoop CARDIO System
  • CardioSolve
    • Receives data from CardioSense
    • Solves the SO problem
    • Decides the replication factors for the current and previous stages
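A brute-force sketch of the kind of search CardioSolve could run, using the hedged cost reconstruction from the model slides; this is illustrative, not the authors' solver.

import java.util.Arrays;

// Illustrative exhaustive search: pick replication factors x_i in [1, maxX]
// minimizing the reconstructed reliability cost Z under the storage constraint.
public class CardioSolveSketch {

  // Reliability cost Z = sum_i (R_i + G_i) for one assignment of factors.
  static double reliabilityCost(int[] x, double[] d, double[] t,
                                double p, double delta, double blockSize) {
    double z = 0;
    for (int i = 0; i < x.length; i++) {
      long blocks = (long) Math.ceil(d[i] / blockSize);          // n_i
      double fi = 1 - Math.pow(1 - Math.pow(p, x[i]), blocks);   // stage failure probability
      z += delta * d[i] * (x[i] - 1)   // replication cost R_i
         + fi * t[i];                  // expected regeneration cost G_i
    }
    return z;
  }

  // Enumerate all assignments subject to sum_i x_i * d_i <= capacity.
  static int[] solve(double[] d, double[] t, double p, double delta,
                     double blockSize, double capacity, int maxX) {
    int n = d.length;
    int[] x = new int[n];
    int[] best = null;
    double bestZ = Double.POSITIVE_INFINITY;
    Arrays.fill(x, 1);
    while (true) {
      double storage = 0;
      for (int i = 0; i < n; i++) {
        storage += x[i] * d[i];
      }
      if (storage <= capacity) {
        double z = reliabilityCost(x, d, t, p, delta, blockSize);
        if (z < bestZ) {
          bestZ = z;
          best = x.clone();
        }
      }
      int i = 0; // odometer-style increment over the x vector
      while (i < n && ++x[i] > maxX) {
        x[i++] = 1;
      }
      if (i == n) {
        break; // every combination visited
      }
    }
    return best;
  }
}

A real deployment would replace the exhaustive loop with the paper's optimization, but the shape of the problem is the same: costs in the objective, storage in the constraint.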
Hadoop CARDIO System
  • CardioAct
    • Implements the commands issued by CardioSolve
    • Uses the HDFS API setReplication(file, replicaNumber)
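setReplication(Path, short) is a real FileSystem method in the HDFS API; the path and target factor below are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal CardioAct-style action: apply a replication factor chosen by
// CardioSolve to one output file of a stage (path is hypothetical).
public class CardioActExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path stageOutput = new Path("/cardio/stage1/part-00000"); // hypothetical stage output
    short newFactor = 3;                                      // factor decided by CardioSolve
    boolean ok = fs.setReplication(stageOutput, newFactor);
    System.out.println("setReplication " + (ok ? "succeeded" : "failed"));
  }
}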
Evaluation
  • Several Important Parameters
    • p is the failure rate, 0.2 if not specified
    • δ is the time to replicate one data unit, also 0.2 by default
    • The computation resource of stage i follows a uniform distribution U(1, Cmax), with Cmax = 100 in general
    • The output of stage i is drawn from a uniform distribution U(1, Dmax); Dmax varies within [1, Cmax]
    • C is the storage constraint for the whole process
Evaluation
  • Effect of Dmax
Evaluation
  • Effect of failure rate p
Evaluation
  • Effect of block size
Evaluation
  • Effect of different resource constraints
    • Settings: p = 0.08, C = 204 GB, δ = 0.6
    • ++ means over-utilized; such a resource type is regarded as expensive
    • Relative cost-weight settings: CPU 0010, NET 0011, DSKIO 0011, STG 0011
    • S3 is CPU-intensive
    • DSK shows a performance pattern similar to NET
Evaluation
  • Failure injection
    • S2 re-executes more frequently under failure injection because it has a large data output
    • p = 0.02, 0.08, and 0.1
    • 1, 3, 21
    • API reason

Discussion
  • Problems
    • Typos and misleading symbols
    • HDFS API setReplication()
  • Any other ideas?