
Fault-Tolerant Programming Models and Computing Frameworks


Presentation Transcript


  1. Fault-Tolerant Programming Models and Computing Frameworks Candidacy Examination 12/11/2013 Mehmet Can Kurt

  2. Increasing need for resilience • Performance is not the sole consideration anymore. • increasing number of components → decreasing MTBF • long-running nature of applications (weeks, months) • MTBF < running time of an application • Projected failure rate in the exascale era: a failure every 3-26 minutes • Existing Solutions • Checkpoint/Restart • the size of checkpoints matters (ex: 100,000-core job, MTBF = 5 years, checkpoint + restart + recomputation = 65% of execution time) • Redundant Execution • low resource utilization

  3. Outline • DISC: a domain-interaction based programming model with support for heterogeneous execution and low-overhead fault-tolerance • A Fault-Tolerant Data-Flow Programming Model • A Fault-Tolerant Environment for Large-Scale Query Processing • Future Work

  4. DISC programming model • Increasing heterogeneity due to several factors: • decreasing feature sizes • local power optimizations • popularity of accelerators and co-processors • Existing programming models are designed for homogeneous settings • DISC: a high-level programming model and associated runtime on top of MPI • Automatic Partitioning and Communication • Low-Overhead Checkpointing for Resilience • Heterogeneous Execution Support with Work Redistribution

  5. DISC Abstractions • Domain • input space as a multidimensional domain • data points as domain elements • domain initialization through the API • leverages automatic partitioning • Interaction between Domain Elements • grid-based interactions (inferred from domain type) • radius-based interactions (by cutoff distance) • explicit-list based interactions (by point connectivity)

  6. compute-function and computation-space • compute-function • a set of functions to perform the main computations in a program • calculate new values for point attributes • ex: Jacobi and Sobel kernels, time-step integration function in MD • computation-space • any updates must be performed directly on the computation-space • contains an entry for each local point in the assigned subdomain
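A minimal sketch, assuming a Jacobi-style kernel, of what a compute-function might look like under this contract; the signature and names are illustrative, not the actual DISC API. New attribute values are written only into the computation-space, one entry per local point.

    /* Illustrative compute-function: reads current point attributes and writes
     * the new values only into the computation-space (one entry per local point).
     * This is a sketch under assumed names, not the real DISC interface. */
    void jacobi_compute(const double *domain_vals,  /* current attribute values    */
                        double *computation_space,  /* updates go here exclusively */
                        int nx, int ny)             /* local subdomain dimensions  */
    {
        for (int i = 1; i < nx - 1; i++)
            for (int j = 1; j < ny - 1; j++)
                computation_space[i * ny + j] =
                    0.25 * (domain_vals[(i - 1) * ny + j] +
                            domain_vals[(i + 1) * ny + j] +
                            domain_vals[i * ny + j - 1] +
                            domain_vals[i * ny + j + 1]);
    }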

  7. Work Redistribution for Heterogeneity • shrinking/expanding a subdomain changes a processor's workload • t_i: unit-processing time of subdomain i, defined as t_i = T_i / n_i, where T_i is the total time spent on compute-functions and n_i is the number of local points in subdomain i

  8. Work Redistribution for Heterogeneity • 1D Case • the size of each subdomain should be inversely proportional to its unit-processing time • 2D/3D Case • expressed as a non-linear optimization problem: minimize T_max subject to x_r1 · y_r1 · t_1 ≤ T_max, x_r2 · y_r1 · t_2 ≤ T_max, …, x_r1 + x_r2 + x_r3 = x_r, y_r1 + y_r2 = y_r
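A minimal sketch of the 1D redistribution rule, assuming per-subdomain compute times have already been measured; sizes are made inversely proportional to the unit-processing times t_i = T_i / n_i. All names are illustrative, not the DISC runtime's internals.

    #include <stdio.h>

    /* 1D work redistribution sketch: assign each subdomain a share of the domain
     * proportional to 1 / t_i, where t_i = T_i / n_i is its unit-processing time.
     * Handling of rounding remainders is left to the caller. */
    void redistribute_1d(int nprocs, const double *T, const long *n,
                         long total_points, long *new_size)
    {
        double t[128];          /* assumes nprocs <= 128 in this sketch */
        double inv_sum = 0.0;

        for (int i = 0; i < nprocs; i++) {
            t[i] = T[i] / (double)n[i];      /* unit-processing time */
            inv_sum += 1.0 / t[i];
        }
        for (int i = 0; i < nprocs; i++)
            new_size[i] = (long)(total_points * (1.0 / t[i]) / inv_sum);
    }

    int main(void)
    {
        double T[4] = {10.0, 10.0, 14.0, 10.0};  /* time spent in compute-functions */
        long   n[4] = {250, 250, 250, 250};      /* current local point counts      */
        long   new_size[4];

        redistribute_1d(4, T, n, 1000, new_size);
        for (int i = 0; i < 4; i++)
            printf("subdomain %d gets %ld points\n", i, new_size[i]);
        return 0;
    }

For the 2D/3D case the runtime would instead solve the non-linear program above for the cut positions rather than a simple proportional split.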

  9. Fault-Tolerance Support: Checkpointing • When do we need to initiate a checkpoint? the end of an iteration forms a natural point • Which data-structures should be checkpointed? the computation-space captures the application state • (figures: layouts of an MD checkpoint file and a 2D-stencil checkpoint file)
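As a rough illustration of the idea, a sketch of application-level checkpointing at an iteration boundary, where only the computation-space is written out; the file layout and names are assumptions, not the actual DISC checkpoint format.

    #include <stdio.h>

    /* Sketch: at the end of an iteration, write a small header plus the
     * computation-space (one entry per local point) to a per-process file.
     * Returns -1 if the file cannot be opened. */
    int write_checkpoint(const char *path, int iteration,
                         const double *computation_space, long n_local)
    {
        FILE *f = fopen(path, "wb");
        if (!f)
            return -1;
        fwrite(&iteration, sizeof iteration, 1, f);
        fwrite(&n_local, sizeof n_local, 1, f);
        fwrite(computation_space, sizeof *computation_space, (size_t)n_local, f);
        return fclose(f);
    }

    /* usage sketch inside the main loop:
     *   if (iter % checkpoint_freq == 0)
     *       write_checkpoint("ckpt.bin", iter, computation_space, n_local); */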

  10. Experiments • Implemented in C on MPICH2 • Each node has two quad-core 2.53 GHz Intel(R) Xeon(R) processors with 12 GB RAM • Up to 128 nodes (using a single core at each node) • Applications • Stencil (Jacobi, Sobel) • Unstructured grid (Euler) • Molecular dynamics (MiniMD)

  11. Experiments: Checkpointing • Comparison with MPI implementations (MPICH2-BLCR for checkpointing) • Jacobi: 400 million elements for 1000 iterations, checkpoint frequency 250 iterations, checkpoint size 6 GB vs 3 GB • MiniMD: 4 million atoms for 1000 iterations, checkpoint frequency 100 iterations, checkpoint size ~2 GB vs 192 MB • (chart annotations: 42% for Jacobi, 60% for MiniMD)

  12. Experiments: Heterogeneous Exec. • A varying number of nodes is slowed down by 40% • Sobel: load-balance frequency 20 it. (out of 100 it.), load-balance overhead 8%, slowdown reduced from 64% to 25-27% • MiniMD: load-balance frequency 200 it. (out of 1000 it.), load-balance overhead 1%, slowdown reduced from 65% to 9-16%

  13. Experiments: Charm++ Comparison • Euler (6.4 billion elements for 100 iterations) • 4 nodes out of 16 are slowed down • Different load-balancing strategies for Charm++ (RefineLB) • load-balance once at the beginning • (a) homogeneous setting: Charm++ is 17.8% slower than DISC • (c) heterogeneous setting with load balancing: Charm++, at 64 chares (best case), is 14.5% slower than DISC

  14. Outline • DISC: a domain-interaction based programming model with support for heterogeneous execution and low-overhead fault-tolerance • A Fault-Tolerant Data-Flow Programming Model • A Fault-Tolerant Environment for Large-Scale Query Processing • Future Work

  15. Why do we need to revisit data-flow programming? • Massive parallelism in future systems • synchronous nature of existing models (SPMD, BSP) • Data-flow programming • data-availability triggers execution • asynchronous execution due to latency hiding • Majority of FT solutions in the context of MPI

  16. Our Data-Flow Model • Tasks • unit of computation • consumes/produces a set of data-blocks • side-effect free execution • task-generation via user-defined iterator objects (creates a task descriptor from a given index) • Data-Blocks • single-assignment rule • interface to access a data-block: put() and get() • multiple versions for each data-block; each version v_i stores size (int), value (void*), usage_counter (int), status (int), and a wait_list (vector) • (figure: a task T consuming data-block versions (d_i, v_i) with statuses not-ready, ready, and garbage-collected, and usage counters 3, 2, and 1)
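A hedged C sketch of the per-version metadata listed above; the field names follow the slide, while the wait_list representation and status encoding are assumptions.

    /* Illustrative data-block version record: written once (single-assignment
     * rule), consumed usage_counter times, then garbage-collected. */
    enum db_status { DB_NOT_READY, DB_READY, DB_GARBAGE_COLLECTED };

    struct waiting_task {               /* a task blocked in get() on this version */
        int task_id;
        struct waiting_task *next;
    };

    struct db_version {
        int   size;                     /* payload size in bytes                    */
        void *value;                    /* payload set by put(); read by get()      */
        int   usage_counter;            /* remaining consumers; 0 => collectable    */
        enum db_status status;
        struct waiting_task *wait_list; /* tasks to wake up when status == DB_READY */
    };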

  17. Work-Stealing Scheduler • Working-phase • enumerate task T • check the data-dependencies of T • if satisfied, insert T into <ready queue>; otherwise, insert T into <waiting queue> • Steal-phase • a node becomes a thief • steals tasks from a random victim • the unit of steal is an iterator-slice • ex: a victim iterator object operating on (100-200); the thief can steal the slice (100-120), leaving (120-200) to the victim • Repeat until no tasks can be executed
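A self-contained sketch of the working-phase decision, assuming simplified task and queue types (all names are illustrative): tasks whose input data-blocks are all ready go to the ready queue, the rest to the waiting queue.

    #include <stdio.h>

    #define MAX_TASKS 128

    struct task  { int id; int n_inputs; int inputs_ready; };
    struct queue { struct task *items[MAX_TASKS]; int n; };

    static void push(struct queue *q, struct task *t) { q->items[q->n++] = t; }

    /* Working-phase sketch: enumerate tasks and sort them into the ready or
     * waiting queue depending on whether their input data-blocks are ready. */
    void working_phase(struct task *tasks, int n_tasks,
                       struct queue *ready, struct queue *waiting)
    {
        for (int i = 0; i < n_tasks; i++) {
            struct task *T = &tasks[i];
            if (T->inputs_ready == T->n_inputs)   /* dependencies satisfied */
                push(ready, T);
            else
                push(waiting, T);
        }
        /* ready tasks are then executed; when both queues drain, the node
         * becomes a thief and steals an iterator-slice from a random victim */
    }

    int main(void)
    {
        struct task tasks[] = { {1, 2, 2}, {2, 1, 0}, {3, 0, 0} };
        struct queue ready = {.n = 0}, waiting = {.n = 0};
        working_phase(tasks, 3, &ready, &waiting);
        printf("ready: %d, waiting: %d\n", ready.n, waiting.n);
        return 0;
    }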

  18. Fault-Tolerance Support • Lost state due to a failure includes: • task executions in the failure domain (past, present, future) • data-blocks stored in the failure domain • Checkpoint/Restart as the traditional solution • checkpoint the execution-frontier • roll back to the latest checkpoint and restart from there • downside: significant task re-execution overhead • Our Approach: Checkpoint and Selective Recovery • task recovery • data-block recovery

  19. Task Recovery • Tasks to recover: • un-enumerated, waiting, ready, and currently executing tasks • these should be scheduled for execution • But the work-stealing scheduler implies that tasks in the failure domain are not known a priori • Solution: • the victim remembers each steal as a (stolen iterator-slice, thief id) pair • reconstruct the working-phases of the failure domain by asking the alive nodes
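A minimal sketch of the bookkeeping a victim might keep, assuming iterator-slices are index ranges (names are illustrative): e.g., a victim iterating over (100-200) that loses the slice (100-120) to a thief records that pair.

    /* Illustrative steal bookkeeping: the victim logs (stolen slice, thief id)
     * so that, after a failure, the working-phases of the failed node can be
     * reconstructed by querying the surviving nodes. */
    struct iterator_slice {
        long begin;                    /* first task index in the slice */
        long end;                      /* one past the last task index  */
    };

    struct steal_record {
        struct iterator_slice slice;   /* e.g. (100, 120) stolen from (100, 200) */
        int thief_id;                  /* node that now owns the slice           */
    };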

  20. Data-Block Recovery • Identify lost data-blocks and re-execute completed tasks to produce them • Do we need (d_i, v_i) for recovery? • not needed if we can show that its status was "garbage-collected" • consumption_info structure at each worker • holds the number of times that a data-block version has been consumed • U_init = initial usage counter, U_acc = number of consumptions so far, U_r = U_init - U_acc (reconstructed usage counter) • Case 1: U_r == 0 (not needed) • Case 2: U_r > 0 && U_r < U_init (needed) • Case 3: U_r == U_init (needed)
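A sketch of the recovery test, assuming the consumption_info structure can report how many times each version was consumed by the surviving workers (names are illustrative).

    #include <stdio.h>

    /* Reconstruct the usage counter of a lost data-block version and decide
     * whether it must be recovered. u_init is its initial usage counter,
     * u_acc the number of consumptions reported by the surviving workers. */
    int version_needed(int u_init, int u_acc)
    {
        int u_r = u_init - u_acc;      /* reconstructed usage counter */

        if (u_r == 0)                  /* Case 1: fully consumed, was gc'ed */
            return 0;                  /* not needed for recovery           */
        return 1;                      /* Cases 2 and 3: still has consumers */
    }

    int main(void)
    {
        printf("%d %d %d\n",
               version_needed(3, 3),   /* Case 1: not needed */
               version_needed(3, 1),   /* Case 2: needed     */
               version_needed(3, 0));  /* Case 3: needed     */
        return 0;
    }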

  21. Data-Block Recovery • (figure: example task graph over tasks T1-T11 and data-blocks d1-d7, marking completed tasks, ready tasks, garbage-collected data-blocks, and ready data-blocks) • We know that T5 won't be re-executed • Re-execute T7 and T4

  22. Transitive Re-execution • (figure: example task graph over tasks T1-T7 and data-blocks d1-d5, marking completed tasks, ready tasks, garbage-collected data-blocks, and ready data-blocks) • to produce d1 and d5, re-execute T1 and T5 • to produce d4, re-execute T4 • to produce d2 and d3, re-execute T2 and T3

  23. Outline • DISC: a domain-interaction based programming model with support for heterogeneous execution and low-overhead fault-tolerance • A Fault-Tolerant Data-Flow Programming Model • A Fault-Tolerant Environment for Large-Scale Query Processing • Future Work

  24. Our Work • Focus on two specific query types over a massive dataset: • Range Queries on Spatial datasets • Aggregation Queries on Point datasets • Primary Goals • high efficiency of execution when there are no failures • handling failures efficiently up to a certain number of nodes • a modest slowdown in processing times when recovering from a failure

  25. Range Queries on Spatial Data • query: for a given 2D rectangle, return the intersecting rectangles • parallelization: master/worker model • data-organization: • the chunk is the smallest data-unit • group close data-objects together into chunks via a Hilbert Curve (chunk size is a parameter) • round-robin distribution of chunks to workers • spatial-index support: • deploy a Hilbert R-Tree at the master node • leaf nodes correspond to chunks • initial filtering at the master tells workers which chunks to further examine • (figure example: sorted objects o1, o3, o8, o6, o2, o7, o4, o5; chunk1 = {o1, o3}, chunk2 = {o8, o6}, chunk3 = {o2, o7}, chunk4 = {o4, o5})
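A small sketch of the chunk formation step, assuming the objects have already been sorted by their Hilbert-curve value; consecutive objects are grouped into fixed-size chunks, which are then handed out round-robin (names are illustrative). It reproduces the slide's example with chunk size 2 and 4 workers.

    #include <stdio.h>

    /* Group Hilbert-sorted objects into chunks of chunk_size and assign the
     * chunks round-robin to workers. Sorting by Hilbert value is assumed to
     * have happened already. */
    int main(void)
    {
        const char *sorted[] = {"o1", "o3", "o8", "o6", "o2", "o7", "o4", "o5"};
        int n_objects = 8, chunk_size = 2, n_workers = 4;

        for (int i = 0; i < n_objects; i++) {
            int chunk_id = i / chunk_size;        /* consecutive objects share a chunk */
            int worker   = chunk_id % n_workers;  /* round-robin chunk placement       */
            printf("%s -> chunk%d on worker%d\n", sorted[i], chunk_id + 1, worker + 1);
        }
        return 0;
    }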

  26. Range Queries: Subchunk Replication • step 1: divide each chunk into k sub-chunks • step 2: distribute the sub-chunks in round-robin fashion • (figure: with k = 2, chunks 1-4 held by Workers 1-4 are each split into two sub-chunks, which are then distributed round-robin across the workers) • rack-failure: same approach, but distribute sub-chunks to nodes in a different rack
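One plausible reading of the sub-chunk placement, sketched below: each chunk is split into k sub-chunks and the replicas are placed round-robin on workers other than the chunk's primary owner. The skipping rule is an assumption for illustration, not necessarily the exact scheme used.

    #include <stdio.h>

    /* Sketch: replicate each chunk (owned by worker `owner`) as k sub-chunks
     * placed round-robin on the remaining workers. Only illustrates the
     * round-robin idea; the paper's exact placement rule may differ. */
    int main(void)
    {
        int n_workers = 4, k = 2;

        for (int chunk = 0; chunk < 4; chunk++) {
            int owner = chunk;                           /* chunk i lives on worker i */
            for (int s = 0; s < k; s++) {
                int dest = (owner + 1 + s) % n_workers;  /* skip the owner itself     */
                printf("chunk%d,%d -> worker%d\n", chunk + 1, s + 1, dest + 1);
            }
        }
        return 0;
    }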

  27. Aggregation Queries on Point Data • query: • each data object is a point in 2D space • each query is defined with a dimension (X or Y) and an aggregation function (SUM, AVG, …) • parallelization: • master/worker model • divide the space into M partitions • no indexing support • standard 2-phase algorithm: local and global aggregation • (figure: 2D space divided into M = 4 partitions among workers 1-4, with a partial result highlighted in worker 2)
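A compact MPI sketch of the standard 2-phase algorithm mentioned above: each worker aggregates its own partition locally, then a global reduction combines the partial results. The data layout and the SUM query are illustrative stand-ins.

    #include <stdio.h>
    #include <mpi.h>

    /* 2-phase aggregation sketch: phase 1 is a local SUM over the worker's own
     * partition, phase 2 combines the partial results with MPI_Reduce. */
    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* stand-in for the worker's local partition of the point dataset */
        double local_points[4] = {1.0, 2.0, 3.0, 4.0};
        double local_sum = 0.0, global_sum = 0.0;

        for (int i = 0; i < 4; i++)                 /* phase 1: local aggregation  */
            local_sum += local_points[i];

        MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE,
                   MPI_SUM, 0, MPI_COMM_WORLD);     /* phase 2: global aggregation */

        if (rank == 0)
            printf("SUM = %f\n", global_sum);
        MPI_Finalize();
        return 0;
    }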

  28. Aggregation Queries: Subpartition Replication • step 1: divide each partition evenly into M' sub-partitions • step 2: send each of the M' sub-partitions to a different worker node • Important questions: • how many sub-partitions (M')? • how to divide a partition (c_v' and c_h')? • where to send each sub-partition? (random vs. rule-based) • rule-based selection: assign to nodes which share the same coordinate-range; a better distribution reduces communication overhead • (figure: a partition divided with M' = 4, c_h' = 2, c_v' = 2)

  29. Experiments • Each node has two quad-core 2.53 GHz Xeon(R) processors with 12 GB RAM • the entire system is implemented in C using the MPI library • 64 nodes used, unless noted otherwise • range queries • comparison with the chunk replication scheme • 32 GB spatial data • 1000 queries are run, and the aggregate time is reported • aggregation queries • comparison with the partition replication scheme • 24 GB point data

  30. Experiments: Range Queries • Execution times with no replication and no failures • (figures: optimal chunk size selection, scalability; chunk size = 10000)

  31. Experiments: Range Queries • Execution times under failure scenarios (64 workers in total) • k is the number of sub-chunks per chunk • (figures: single-machine failure, rack failure)

  32. Future Work • Retaining the Task-Graph in Data-Flow Models and Experimental Evaluation (continuation of the 2nd work) • Protection against Soft Errors with the DISC Programming Model

  33. Retaining the Task-Graph • Requires knowledge of the task-graph structure • efficient detection of producer tasks • Retaining the task-graph structure • storing (producer, consumers) pairs per task → large space overhead • instead, use a compressed representation of dependencies via iterator-slices • an iterator-slice represents a grouping of tasks • an iterator-slice remembers the dependent iterator-slices
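A hedged sketch of the compressed representation: an iterator-slice stands for a contiguous group of tasks and records only the slices that depend on it, rather than per-task (producer, consumers) pairs. The field names and the fixed fan-out are assumptions.

    /* Illustrative compressed dependency record: one entry per iterator-slice
     * instead of one entry per task. */
    #define MAX_SLICE_DEPS 16

    struct task_slice {
        long begin, end;                          /* task indices grouped by the slice */
        struct task_slice *deps[MAX_SLICE_DEPS];  /* dependent iterator-slices         */
        int n_deps;
    };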

  34. Retaining the Task-Graph • The same dependency can also be stored in the reverse direction. • (figures: (a) before the data-block has been garbage-collected, (b) after the data-block has been garbage-collected)

  35. 16 Cases of Recovery • expose all possible cases for recovery • define four dimensions to categorize each data-block • d1: alive or failed (its producer) • d2: alive or failed (its consumers) • d3: alive or failed (where it is stored) • d4: true or false (garbage-collected) • examples: <alive, alive, failed, false>, <alive, alive, alive, true>, <alive, alive, alive, false>, <alive, alive, failed, true>

  36. Experimental Evaluation • Benchmarks to test • LU-decomposition • 2D-Jacobi • Smith-Waterman Sequence Alignment • Evaluation goals • performance of the model without FT support • space overhead caused by additional data-structures for FT • efficiency of the proposed schemes under different failure scenarios

  37. Future Work • Retaining the Task-Graph in Data-Flow Models and Experimental Evaluation (continuation of the 2nd work) • Protection against Soft Errors with the DISC Programming Model

  38. Soft Errors • Increasing soft error rate in current large-scale systems • random bit flips in processing cores, memory, or disk • due to radiation, increasing intra-node complexity, low-voltage execution, … • "soft errors in some data-structures/parameters have more impact on the execution than others" (*) • program halt/crash: size and identity of the domain, index arrays, function handles, … • output incorrectness: parameters specific to an application • ex: atom density, temperature, … • (*) Dong Li, Jeffrey S. Vetter, and Weikuan Yu, "Classifying soft error vulnerabilities in extreme-scale applications using a binary instrumentation tool" (SC'12)

  39. DISC model against soft errors • DISC abstractions • the runtime internally maintains the critical data-structures • it can protect them transparently to the programmer • protection: • periodic verification • storing them in more reliable memory • more reliable execution of compute-functions against silent data corruption (SDC)

  40. THANKS!
