
More on Adaptivity in Grids



  1. More on Adaptivity in Grids Sathish S. Vadhiyar Source/Credits: Figures from the referenced papers

  2. Predicting the Cost and Benefit of Adapting Data Parallel Applications in Clusters – Jon Weissman • Deals with 3 adaptation techniques in response to resource or load changes • Migration • Involves remote process creation followed by transmission of the old worker’s data to the new worker • Dynamic load balancing • Involves collecting load indices, determining the redistribution and initiating data transmission • Addition or removal of processors • Followed by data transmission to maintain load balance • A prediction framework decides automatically which adaptation technique is better at a given time under given resource conditions

  3. Terms for Prediction • N – problem size; P – number of processors • i – current iteration • Esize – size in bytes of a data element • DM(N,P,i) – amount of data that must be moved at iteration i to achieve load balance • Tcomm(N,P,j,i) – cost of the ith communication step for the jth worker; Tcomp – cost of the ith computation step; Texec = Tcomm + Tcomp • Tpc – cost of creating a remote process • Tdm(B) – cost of transferring B bytes of data • Tnn – cost of establishing new worker neighbors

  4. Cost of adaptive methods • Migration from a processor • Involves process creation and data movement • Tmigrate(N,P,i) = Tpc + Tdm(Esize × DM(N,P,i)) + Tnn • Addition of a processor • Each worker sends a fraction of its data to worker 0 • Worker 0 distributes the collected data to the new worker • Tadd(N,P,i) = max{Tdm(Esize × DM(N,P,i)), Tpc} + Tdm(Esize × DM(N,P+1,i)) + Tnn

  5. Cost of adaptive methods • Removal of a processor • The leaving worker’s data is sent to worker 0 • This data is distributed across the remaining workers • Tremove(N,P,i) = 2Tdm(Esize × DM(N,P,i)) + Tnn • Dynamic load balancing – moving data from overloaded to underloaded workers • Load indices of each worker are collected by worker 0 • Worker 0 computes the redistribution and collects data from each overloaded worker • Worker 0 transmits data to each underloaded worker to achieve load balance • Tdlb(N,P,i) = 2Tdm(Esize × DM(N,P,i))
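The four cost formulas map directly onto code. The sketch below is a minimal illustration under assumed constants: the STEN values from slide 8 are used as placeholders for Tpc and the Tdm parameters; real deployments would measure these per cluster, as slide 7 describes.

```python
# Minimal sketch of the adaptation-cost formulas from slides 4-5.
# The constants are the STEN placeholders from slide 8; in practice they
# are cluster-specific and measured by small test programs (slide 7).

E_SIZE = 8      # bytes per data element
T_PC = 330.0    # remote process creation cost, ms
T_NN = 0.0      # cost of establishing new worker neighbors (inexpensive)

def t_dm(nbytes: float) -> float:
    """Cost (ms) of transferring nbytes: latency + per-byte cost."""
    return 1.0 + 0.00103 * nbytes

def t_migrate(dm: float) -> float:
    """Migration: create the remote process, then move the old worker's data."""
    return T_PC + t_dm(E_SIZE * dm) + T_NN

def t_add(dm_p: float, dm_p1: float) -> float:
    """Addition: collection at worker 0 overlaps with process creation,
    then worker 0 redistributes for the (P+1)-processor decomposition."""
    return max(t_dm(E_SIZE * dm_p), T_PC) + t_dm(E_SIZE * dm_p1) + T_NN

def t_remove(dm: float) -> float:
    """Removal: data goes to worker 0, then out to the remaining workers."""
    return 2 * t_dm(E_SIZE * dm) + T_NN

def t_dlb(dm: float) -> float:
    """Load balancing: collect from overloaded, send to underloaded workers."""
    return 2 * t_dm(E_SIZE * dm)
```

Here dm stands for DM(N,P,i), the data volume in elements defined on the next slide.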

  6. Contd… • All of the above costs depend on DM(N,P,i) • L(j,i) – individual load index of the jth worker at iteration i • D(j,i) – amount of data held by the jth worker • DM(N,P,i) is computed from L(j,i) and D(j,i); the defining equation appears as a figure in the paper and is not reproduced in this transcript
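Since the defining equation survives only as a figure, the following is a plausible reconstruction rather than the paper's exact formula: give each worker an ideal share inversely proportional to its load index, and count the data that must leave overloaded workers to reach that share.

```python
# Hedged reconstruction of DM(N,P,i): NOT the paper's exact equation (that
# appears only as a figure). Idea: a worker's ideal share of the data is
# inversely proportional to its load index; DM is the total surplus that
# overloaded workers must shed (equal to the deficit of underloaded ones).

def dm(D: list[float], L: list[float]) -> float:
    """D[j] = data held by worker j; L[j] = load index of worker j."""
    total = sum(D)
    inv = [1.0 / l for l in L]                   # less-loaded machines get more
    ideal = [total * w / sum(inv) for w in inv]  # load-balanced targets
    return sum(max(0.0, d - t) for d, t in zip(D, ideal))
```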

  7. Experiment Setup • Applications • Iterative Jacobi solver • Gaussian elimination • Gene sequence comparison • Tnn = 0 (inexpensive) • A few test programs determine the cluster-specific cost constants – Tpc, Tdm • Each application is run with 3 problem sizes on 3 processor configurations to obtain the application-specific constants – Tcomp, Tcomm, Texec

  8. STEN • 5-point stencil iterative Jacobi solver • 1D communication topology where the processor division is across rows • Size of each data point, Esize – 8 bytes • Tcomm(N,P,j,i) = 5 + 0.00219 × 8N ms • Latency – 5 ms, 0.00219 – per-byte transfer cost, message size – 8N bytes • Tcomp(N,P,j,i) = 0.000263 × 5N × D(j,i) × L(j,i) ms • 5 floating-point operations per element; 0.000263 – time to update a single element on an unloaded machine • Texec = Tcomp + Tcomm; Tpc = 330 ms • Tdm(M) = 1 + 0.00103M ms • 1 ms latency; 0.00103 – per-byte transfer cost
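The STEN constants transcribe directly into a per-iteration predictor; this sketch simply plugs in the slide's values (all times in milliseconds):

```python
# Per-iteration time model for STEN, transcribed from slide 8 (times in ms).

def sten_tcomm(N: int) -> float:
    """Boundary-row exchange: 5 ms latency + per-byte cost on 8N bytes."""
    return 5.0 + 0.00219 * 8 * N

def sten_tcomp(N: int, d_ji: float, l_ji: float) -> float:
    """Update cost scaled by the worker's data D(j,i) and load index L(j,i),
    as written on the slide (5 flops per element, 0.000263 ms per update)."""
    return 0.000263 * 5 * N * d_ji * l_ji

def sten_texec(N: int, d_ji: float, l_ji: float) -> float:
    return sten_tcomp(N, d_ji, l_ji) + sten_tcomm(N)
```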

  9. GE with partial pivoting • Row-cyclic decomposition of the matrix • Master-slave broadcast topology for pivot exchange • Esize – 8 bytes, N iterations, Tnn = 0 • Tcomm(N,P,j,i) = 1.14 + 0.00114 × (N-i)P ms • Tcomp(N,P,j,i) = 0.000335 × (N-i) × D(j,i) × L(j,i) ms • N-i entries are modified in iteration i • Texec = Tcomp + Tcomm; Tpc = 55 ms • Tdm(M) – same as previous

  10. CL (CompLib) • Biology application that classifies protein sequences • Compares a source library of sequences to a target library of sequences (a string-matching problem) • Parallel implementation – the target library is decomposed across workers in a load-balanced fashion • In a single iteration, each worker compares all target sequences it is assigned to one source sequence • The amount of computation in each worker depends on the size of the source sequence and the sizes and number of target sequences

  11. CL (Contd…) • N – total number of target sequence blocks • D(j,i) – number of target sequence blocks stored in the jth worker • Each block – 5000 bytes • seq(i) – size of the source sequence • Data transferred – the source sequence to each worker by the master; results sent back • Tcomm(N,P,j,i) = 1.14P + 0.00130 × (seq(i) + 180D(j,i)) ms • 1.14P – latency, 0.00130 – per-byte transfer cost, 180 bytes – comparison score for each target sequence • Tcomp(N,P,j,i) = 0.00000424 × D(j,i) × 5000 × seq(i) × L(j,i) ms • 0.00000424 – cost of an integer comparison • Tpc = 550 ms • Tdm(M) – same as above

  12. Accuracy in Predicting Cost of Adaptation

  13. Accuracy in Predicting Cost of Adaptation

  14. Accuracy in Predicting Benefit of Adaptation

  15. Accuracy in Predicting Benefit of Adaptation

  16. Sensitivity to Increasing Load • Differing amounts of load added to a machine • The benefit of migration grows as the load increases

  17. Sensitivity to Load Introduction Times • Migration benefit decreases with increasing load injection times

  18. Dynamically Choosing the Best Adaptive Method • An adaptive run-time system chooses the best adaptive method under the current conditions • Automated adaptive method selection in response to two events • Addition of a new processor • Presence of external CPU load
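A hypothetical sketch of what such a selection step could look like: predict the cost and the remaining-run benefit of each applicable method, then apply the one with the largest positive net gain. The predictor callables here are stand-ins for the framework's models, not the paper's API.

```python
# Hypothetical selection step for the adaptive runtime: pick the adaptation
# whose predicted benefit most exceeds its predicted cost. The predictor
# callables are stand-ins for the framework's cost/benefit models.

def choose_adaptation(methods: dict, state):
    """methods maps a method name to a (predict_cost, predict_benefit) pair
    of callables taking the current state (N, P, i, load indices)."""
    best_name, best_gain = None, 0.0
    for name, (predict_cost, predict_benefit) in methods.items():
        gain = predict_benefit(state) - predict_cost(state)
        if gain > best_gain:        # adapt only if it is predicted to pay off
            best_name, best_gain = name, gain
    return best_name                # None => keep the current configuration
```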

  19. Addition of free nodes • * represents the number of nodes picked by the adaptive method

  20. Adaptation due to load events • Prediction-based selection gives better results

  21. Fault-Tolerance, Malleability and Migration for Divide-and-Conquer Applications on the Grid Wrzesinska et al.

  22. Fault-Tolerance, Malleability and Migration for Divide-and-Conquer Applications on the Grid • 3 general classes of divisible applications • Master-worker paradigm – 1 level • Hierarchical master-worker grid system – 2 levels • Divide-and-conquer paradigm – allows the computation to be split up in a general way, e.g. search algorithms, ray tracing • The work provides mechanisms to deal with processors leaving • Handling partial results from leaving processors • Handling orphan work • Restructuring the computation tree • 2 cases of processors leaving • When processors leave gracefully (e.g. when a processor reservation comes to an end) • When processors crash

  23. Introduction • Divide-and-conquer • Recursive subdivision; after solving the subproblems, their results are recursively combined until the final solution is reached • Work is distributed across processors by work stealing • When a processor runs out of work, it picks another processor at random and steals a job from its work queue • After computing the job, the result is returned to the originating processor • Uses a work-stealing algorithm called CRS (Cluster-aware Random Stealing) that overlaps intra-cluster steals with inter-cluster steals
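A minimal sketch of the CRS idea, assuming hypothetical queue and steal primitives (the runtime's real internals are not shown in the deck): an idle worker keeps at most one asynchronous wide-area steal in flight and overlaps its latency with cheap synchronous steals inside its own cluster.

```python
# Minimal sketch of Cluster-aware Random Stealing (CRS). Worker, the queue
# layout, and the steal primitives are hypothetical stand-ins; only the
# overlap of WAN and LAN steals reflects the slide.
import random

class Worker:
    def __init__(self):
        self.queue = []                   # local work queue of jobs

def sync_local_steal(victim: Worker):
    """Synchronous intra-cluster steal: cheap enough to block on."""
    return victim.queue.pop(0) if victim.queue else None

def crs_idle_step(local_peers, remote_peers, state):
    """One attempt to find work when this worker's queue is empty."""
    # Keep at most one asynchronous inter-cluster steal in flight; its
    # wide-area latency is overlapped with the local attempts below.
    if remote_peers and not state.get("wan_pending"):
        state["wan_pending"] = True
        state["wan_victim"] = random.choice(remote_peers)  # reply arrives later
    if local_peers:
        return sync_local_steal(random.choice(local_peers))
    return None
```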

  24. Malleability • Adding a new machine to a divide-and-conquer computation is simple • The new machine starts stealing jobs from other machines • Leaving of a processor requires restructuring the computation tree to reuse as many partial results as possible • What happens when processors leave • The remaining processors are notified by the leaving processor (when processors leave gracefully) • The leave is detected by the communication layer (in unexpected leaves)

  25. Recomputing jobs stolen by leaving processors • Each processor maintains a list of the jobs stolen from it and the processor IDs of the thieves • When processors leave • Each of the remaining processors traverses its stolen-jobs list and searches for jobs stolen by the leaving processors • Such jobs are put back in the work queues of their owners, marked as “restarted” • Children of “restarted” jobs are also marked “restarted” when they are spawned
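The bookkeeping above might be sketched as follows; the record shapes are hypothetical, while the scan, requeue, and mark-propagation mechanism is the slide's:

```python
# Sketch of the restart bookkeeping on slide 25. Job and record shapes are
# hypothetical; the scan/requeue/mark-propagation mechanism is the slide's.

class Job:
    def __init__(self, job_id, parent=None):
        self.job_id = job_id
        self.parent = parent
        self.restarted = False

def on_processors_left(leavers, stolen_jobs, work_queue):
    """stolen_jobs: list of (job, thief_id) pairs this processor keeps."""
    for job, thief in list(stolen_jobs):
        if thief in leavers:
            stolen_jobs.remove((job, thief))
            job.restarted = True          # will be recomputed locally
            work_queue.append(job)

def spawn_child(parent: Job, child_id) -> Job:
    child = Job(child_id, parent)
    child.restarted = parent.restarted    # the mark propagates down the tree
    return child
```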

  26. Example

  27. Example (Contd…)

  28. Orphan Jobs • Jobs stolen from leaving processors • In existing approaches, a processor working on an orphan job must discard the result, since it does not know where to return it • It would need to know the new address to return the result to • Salvaging orphan jobs requires creating a link between the orphan and its restarted parent

  29. Orphan Jobs (Contd…) • For each finished orphan job • A small message containing the jobID of the orphan and the processorID that computed it is broadcast • Unfinished intermediate nodes of orphan subtrees are aborted • The (jobID, processorID) tuples are stored by each processor in a local orphan table

  30. Orphan Jobs (Contd…) • When a processor tries to recompute a “restarted” job • It performs a lookup in the orphan table • If the jobIDs match, the processor removes the job from the work queue and puts it in its list of stolen jobs • It sends a message to the orphan owner requesting the result of the job • The orphan owner marks the job as stolen by the sender of the request • The link between the restarted parent and the orphaned child is thus restored • Reusing orphans improves the performance of the system
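A sketch of that lookup, reusing the hypothetical structures from the earlier sketches (send() stands in for the runtime's messaging layer):

```python
# Sketch of the orphan-table mechanism from slides 29-30. orphan_table maps
# a finished orphan's jobID to the processor holding its result; send() is a
# hypothetical stand-in for the runtime's messaging layer.

orphan_table = {}   # job_id -> processor_id that computed the orphan

def on_orphan_broadcast(job_id, proc_id):
    """Every processor records the broadcast (jobID, processorID) tuple."""
    orphan_table[job_id] = proc_id

def before_recompute(job, work_queue, stolen_jobs, my_id, send):
    """Called when a 'restarted' job is about to be recomputed locally."""
    owner = orphan_table.get(job.job_id)
    if owner is None:
        return False                       # no saved result: recompute it
    work_queue.remove(job)                 # skip the recomputation
    stolen_jobs.append((job, owner))       # treat as stolen by the owner
    send(owner, ("request_result", job.job_id, my_id))
    return True                            # parent-child link restored
```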

  31. Example

  32. Partial Results on Leaving Processors • If a processor knows it has to leave • It chooses another processor randomly • It transfers all results of its finished jobs to that processor • These jobs are treated as orphan jobs • The processor receiving the finished jobs broadcasts a (jobID, processorID) tuple for each • The partial results are thus linked to the restarted parents

  33. Special Cases • Master leaving – a special case; it owns the root job, which was not stolen from anyone • The remaining processors elect a new master, which respawns the root job • The new run reuses the partial results of orphan jobs from the previous run • Adding processors • A new processor downloads an orphan table from one of the other processors • Orphan-table requests are piggybacked on steal requests • Message combining • One small broadcast message has to be sent for each orphan and for each finished job on a leaving processor • These messages are combined

  34. Results • 3 types of experiments • Overhead when no processors are leaving • Comparison with the traditional approach that does not save orphans • Showing that the mechanism can be used for efficient migration of the computation • Testbeds • DAS-2 system – 5 clusters at five Dutch universities • European GridLab – 24 processors at 4 sites in Europe • 8 in Leiden and 8 in Delft (DAS-2) • 4 in Berlin • 4 in Brno

  35. Overhead during Normal Execution • 4 applications run on the system with and without the mechanisms • RayTracer, TSP, SAT solver, Knapsack problem • The overhead is negligible

  36. Impact of Salvaging Partial Results • RayTracer application • 2 DAS-2 clusters with 16 processors each • One cluster was removed in the middle of the computation, i.e. after half of the time the run would take on 2 clusters without processors leaving • Comparison of • The traditional approach (without saving partial results) • Recomputing trees when processors leave unexpectedly • Recomputing trees when processors leave gracefully • The runtime on 1.5 clusters (16 processors in one cluster and 8 in the other) • The difference between the last two gives the overhead of transferring partial results from the leaving processors plus the work lost because of their departure

  37. Results

  38. Migration • One cluster replaced with another • RayTracer application on 3 clusters • In the middle of the computation, one cluster was gracefully removed and another, identical cluster was added • Compared against a run without migration • Overhead of migration – 2%

  39. References • Jon B. Weissman, “Predicting the cost and benefit of adapting data parallel applications in clusters,” Journal of Parallel and Distributed Computing, vol. 62, no. 8, pp. 1248–1271, August 2002. • G. Wrzesinska et al., “Fault-Tolerance, Malleability and Migration for Divide-and-Conquer Applications on the Grid,” Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2005), p. 13a, 04–08 April 2005.

  40. Predicting the Cost and Benefit of Adapting Data Parallel Applications in Clusters – Jon Weissman • Library of adaptation techniques • Migration • Involves remote process creation followed by transmission of the old worker’s data to the new worker • Dynamic load balancing • Collecting load indices, determining the redistribution and initiating data transmission • Addition or removal of processors • Followed by data transmission to maintain load balance • Library calls detect and initiate adaptation actions within the applications • The adaptation event is sent from an external detector to all workers
