
Fault-tolerant Scheduling of Fine-grained Tasks in Grid Environments


joy-reed


Presentation Transcript


  1. Fault-tolerant Scheduling of Fine-grained Tasks in Grid Environments

  2. Assumptions • Fail-stop model: If a processor fails, it no longer transmits valid messages. • Reliable communication: Processor crashes are eventually detected by the communication layer.

  3. 2 is a descendant of 1 (Master)

  4. Map of Thief to Set of Tasks • Each victim keeps a table of the tasks stolen from it: Map<Computer, Set<Task>> thiefTaskSet = new … • When a task is stolen, a copy is put in the Set<Task> associated with that thief (a Computer).
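
The victim-side table on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: the class name `StolenTaskTable` and the methods `recordSteal` and `reclaim` are hypothetical, and the slide's `Computer` and `Task` types are left as type parameters because the slides do not define them.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical wrapper around the slide's thiefTaskSet map.
// C plays the role of Computer and T the role of Task.
class StolenTaskTable<C, T> {
    // thief -> copies of the tasks that thief stole from this victim
    private final Map<C, Set<T>> thiefTaskSet = new HashMap<>();

    // Called on the victim when `thief` steals `task`: keep a copy.
    void recordSteal(C thief, T task) {
        thiefTaskSet.computeIfAbsent(thief, k -> new HashSet<>()).add(task);
    }

    // Assumed recovery hook: when `thief` is reported crashed, return the
    // copies it held so the victim can put them back on its task queue.
    Set<T> reclaim(C thief) {
        Set<T> lost = thiefTaskSet.remove(thief);
        return lost == null ? Collections.<T>emptySet() : lost;
    }
}
```

For example, after `recordSteal("B", 7)` and `recordSteal("B", 9)`, a call to `reclaim("B")` returns both tasks, and a second call returns an empty set.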

  5. Global Result Table (GRT) • Each compute server has a GRT replica: Map<taskParameters, Result>. Entries are broadcast to all compute servers. • The Map key and value are potentially large. • It should instead be (more explanation later …) Map<TaskId, Computer>, where Computer is the server on which the Result is stored.
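
The alternative proposed on this slide, a small table that records where a result lives rather than replicating the result itself, might look like the sketch below. Everything here is an assumption made for illustration: the class name `ResultLocationTable`, its methods, and in particular the `forget` step, which the slides do not describe.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical location table: TaskId -> the Computer holding the Result.
// Both types are left generic because the slides do not define them.
class ResultLocationTable<TaskId, Computer> {
    private final Map<TaskId, Computer> location = new HashMap<>();

    // Record that the result of task `id` is stored on `holder`.
    void publish(TaskId id, Computer holder) {
        location.put(id, holder);
    }

    // Where, if anywhere, can the result of task `id` be fetched?
    Optional<Computer> lookup(TaskId id) {
        return Optional.ofNullable(location.get(id));
    }

    // Assumption: results stored on a crashed server are unreachable,
    // so every entry that points at it is dropped.
    void forget(Computer crashed) {
        location.values().removeIf(crashed::equals);
    }
}
```

Only the small (TaskId, Computer) pair would need to be broadcast in such a design; the potentially large Result stays on one server until someone asks for it.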

  6. Crash recovery method • If ( master crashed ) Elect a new master; • For all ( tasks stolen by a crashed processor ) Put task in task queue; • For all ( descendants of tasks stolen from a crashed processor ) If ( descendant is finished ) Then store its result in Global Result Table; Else abort the task; • If ( old master crashed && I am the master ) Restart the application;
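
The four steps on this slide can be traced in a small single-node sketch. It is an illustration under stated assumptions, not the authors' implementation: all names (`RecoveringNode`, `LocalTask`, `onCrash`) are hypothetical, servers and tasks are identified by plain strings, results are integers, and the "election" simply picks the alphabetically smallest survivor as a placeholder for whatever protocol the authors use.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class RecoveringNode {
    // A task held locally that descends from a task stolen from `victim`.
    static final class LocalTask {
        final String id;
        final String victim;
        final Integer result; // null means the task has not finished
        LocalTask(String id, String victim, Integer result) {
            this.id = id;
            this.victim = victim;
            this.result = result;
        }
    }

    final String self;
    String master;
    final Map<String, Set<String>> thiefTaskSet = new HashMap<>(); // thief -> tasks stolen from this node
    final Deque<String> taskQueue = new ArrayDeque<>();
    final List<LocalTask> localTasks = new ArrayList<>();
    final Map<String, Integer> grt = new HashMap<>(); // keyed by task id here (an assumption)
    final List<String> aborted = new ArrayList<>();
    boolean restarted = false;

    RecoveringNode(String self, String master) {
        this.self = self;
        this.master = master;
    }

    // `survivors` must contain at least this node.
    void onCrash(Set<String> crashed, Set<String> survivors) {
        // Step 1: if the master crashed, elect a new one (placeholder rule).
        boolean masterCrashed = crashed.contains(master);
        if (masterCrashed) {
            master = new TreeSet<>(survivors).first();
        }
        // Step 2: tasks stolen by a crashed processor go back on the queue.
        for (String thief : crashed) {
            Set<String> stolen = thiefTaskSet.remove(thief);
            if (stolen != null) {
                taskQueue.addAll(stolen);
            }
        }
        // Step 3: descendants of tasks stolen from a crashed processor are
        // saved to the GRT if finished, aborted otherwise.
        localTasks.removeIf(t -> {
            if (!crashed.contains(t.victim)) {
                return false;
            }
            if (t.result != null) {
                grt.put(t.id, t.result);
            } else {
                aborted.add(t.id);
            }
            return true;
        });
        // Step 4: the new master restarts the application.
        if (masterCrashed && self.equals(master)) {
            restarted = true;
        }
    }
}
```

In a real system each surviving server would run this handler when the communication layer reports the crash; the sketch only shows the bookkeeping on one server.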

  7. Notes • A task is an orphan if its parent task is on a crashed server. • The authors' stated contribution: some descendants of orphaned tasks are not recomputed. • Descendants of orphaned tasks are aborted if they are incomplete at the time they become orphans. • They do not use explicit continuation passing: there are no composition tasks, so descendant decompositions that were already complete must be recomputed!

  8. Complete decomposition • Tasks 4, 8, & 14 are lost. • In-progress task 21 is lost. • Decompositions 2, 5, 10, 16 are lost.

  9. Notes • Their GRT key is the task parameters. • The hash code is the sum of the hashes of the parameters. If a parameter is an array, they sum the hash of each element! • The key should be a TaskId, but they do not have a processor-independent TaskId. This is claimed as future work. • They claim that only 1 in 1,000 to 10,000 tasks is stolen, which is key to the efficiency of their scheme. Their tests crash whole clusters, rather than individual compute servers within a cluster. Why?
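
To make the hashing note concrete, here is a minimal sketch of a sum-of-hashes function as the slide describes it. The class `ParameterHash` and method `sumHash` are hypothetical, and only `int[]` arrays are handled, for brevity. One property worth noticing, and possibly part of what the slide's exclamation mark is pointing at, is that a sum ignores element order, so any two permutations of the same array collide.

```java
import java.util.Arrays;

class ParameterHash {
    // Sum of the parameters' hashes; an int[] parameter contributes
    // the sum of the hashes of its elements.
    static int sumHash(Object... parameters) {
        int h = 0;
        for (Object p : parameters) {
            if (p instanceof int[]) {
                for (int x : (int[]) p) {
                    h += Integer.hashCode(x);
                }
            } else {
                h += p.hashCode();
            }
        }
        return h;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3};
        int[] b = {3, 2, 1};
        // Summation is commutative, so the two arrays hash identically.
        System.out.println(sumHash((Object) a) == sumHash((Object) b)); // prints true
        // The standard order-sensitive array hash tells them apart.
        System.out.println(Arrays.hashCode(a) == Arrays.hashCode(b));   // prints false
    }
}
```

Every parameter and every array element must also be visited to compute the key, which is one reason a compact, processor-independent TaskId would be a cheaper GRT key.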
