
Instant-access cycle stealing for parallel applications requiring interactive response


  1. Instant-access cycle stealing for parallel applications requiring interactive response Paul Kelly (Imperial College) Susanna Pelagatti (University of Pisa) Mark Rossiter (ex-Imperial, now with Telcordia)

  2. Application scenario… • Workplace with fast LAN and many PCs • Some users occasionally need high computing power to accelerate interactive tasks • Example: CAD • Interactive design of components/structures • Analyse structural properties • Simulate fluid flow • Compute high-resolution rendering • Most PCs are under-utilised most of the time • Can we use spare CPU cycles to improve responsiveness?

  3. The challenge… • Cycle stealing the easy way… • Maintain a batch queue • Maximise throughput for multiple, long-running jobs • Wait until desktop users leave their desks • This paper is about doing it the hard way: • Using spare cycles to accelerate short, parallel tasks (5-60 seconds) • In order to reduce interactive response time • While desktop users are at their desks • This means: • No batch queue – execute immediately using resources instantaneously available • No time to migrate or checkpoint tasks • No time to ship data across a wide-area network

  4. A challenging environment… • For our experiments, we used a group of 32 Linux PCs in a very busy CS student lab • Graph shows hourly-average percentage utilisation (on a log scale) over a typical day • Although not 100% busy, the machines are in continuous use

  5. Scenario • Host PCs service interactive desktop users • Requests to execute parallel guest jobs arrive intermittently • System allocates group of idle PCs to execute guest job • Objectives: • Minimise average response time for guest jobs • Keep interference suffered by hosts within reasonable limits • We show that this can really work, even in our extremely challenging environment • Next: characterise patterns of idleness • Then: design software to assign guest tasks • Then: evaluate alternative strategies by simulation

  6. Earlier work • Batch queue, multiple long-running jobs • Parallel jobs • “60-workstation cluster can handle job arrival trace taken from a dedicated 32-node CM-5” • Wide-area networks • Our goal: Improve response time for individual tasks • Litzkow, Livny, Mutka, “Condor - a hunter of idle workstations”. ICDCS’88. • Atallah, Black, et al, “Models and algorithms for co-scheduling compute-intensive tasks on networks of workstations”. JPDC 1992. • Arpaci, Dusseau et al, “The interaction of parallel and sequential workloads on a network of workstations”. SIGMETRICS’95. • Acharya, Edjlali, Saltz, “The utility of exploiting idle workstations for parallel computing”. SIGMETRICS’97. • Petrini, Feng, “Buffered coscheduling: a new methodology for multitasking parallel jobs on distributed systems”. IPDPS 2000. • United Devices, Seti@home, Entropia • Subhlok, Lieu, Lowekamp, “Automatic node selection for high performance applications on networks”. PPoPP 1999.

  7-8. Characterize patterns of idleness • Idle periods occur frequently – 90% of idle periods begin within 5s of the previous one • Idle periods don’t last long – only 50% last more than 3.3s • Idle = over a one-second period, less than 10% of CPU time is spent executing user processes, and at least 90% of CPU time could be devoted to a new process
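
To make the idleness criterion concrete, here is a minimal sketch of how that one-second test could be checked on Linux by sampling /proc/stat; this illustrates the stated definition and is not the paper's actual monitoring code.

    # Sketch of the one-second idleness test, assuming the standard Linux
    # /proc/stat field layout: user nice system idle iowait irq softirq ...
    import time

    def cpu_times():
        with open("/proc/stat") as f:
            fields = [int(x) for x in f.readline().split()[1:]]
        user = fields[0] + fields[1]          # user + nice jiffies
        idle = fields[3] + fields[4]          # idle + iowait jiffies
        return user, idle, sum(fields)

    def is_idle(max_user=0.10, min_free=0.90):
        u0, i0, t0 = cpu_times()
        time.sleep(1.0)                       # the one-second window
        u1, i1, t1 = cpu_times()
        total = t1 - t0
        user_share = (u1 - u0) / total        # share spent on user processes
        free_share = (i1 - i0) / total        # share a new process could claim
        return user_share < max_user and free_share >= min_free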

  9-10. Distribution of idleness – 32 PCs in busy student lab • It’s very likely that we’ll have up to 15 idle machines at any time • It’s unlikely that the same 15 machines will stay idle for long

  11. So how much can we hope to get? • With our 32-PC cluster, an idle group of 5 processors has about a 50% chance of remaining idle for more than 5 seconds • This is our parallel computing resource!

  12. The mpidled software • mpidled is a Linux daemon process that runs on every participating PC: • Monitors system utilisation, determines whether the system is idle • Uses this and past measurements to predict short-term future utilisation • mpidle is a client application that lists the participating PCs currently predicted to be idle • Produces a list of machine names, for use as an MPI machinefile (see the usage sketch below)
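
As a usage illustration, the sketch below drives a client like mpidle and feeds its output to an MPI launch. The tool names come from the slides, but the flags, one-hostname-per-line output format, and the guest binary name are assumptions.

    # Hedged sketch: query mpidle for predicted-idle hosts and launch a
    # guest job on them. Output format and mpirun flags are assumptions
    # (-machinefile is the classic MPICH spelling; other MPIs differ).
    import subprocess

    hosts = subprocess.run(["mpidle"], capture_output=True, text=True,
                           check=True).stdout.split()
    with open("machines", "w") as f:
        f.write("\n".join(hosts) + "\n")

    subprocess.run(["mpirun", "-machinefile", "machines",
                    "-np", str(len(hosts)), "./guest_job"], check=True)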

  13. Zero administration by leader election • Participating PCs are regularly unplugged and rebooted • Vital to minimise systems administration overheads… • mpidled daemons autonomously elect a “leader” to handle client requests (current implementation relies on LAN broadcast, confined to one subnet) – one possible scheme is sketched below • mpidle usually responds in less than 0.15s
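
The slide gives only the outline of the election, so the following is a hedged sketch of one way a broadcast election could work on a single subnet, using a lowest-identifier-wins rule; the real mpidled protocol may well differ, and the port number is invented.

    # Sketch of a LAN-broadcast leader election (one subnet). Each daemon
    # announces itself; the lexicographically lowest identifier wins.
    import socket

    PORT = 9999  # hypothetical election port

    def elect(my_id, timeout=0.5):
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        s.bind(("", PORT))
        s.sendto(my_id.encode(), ("<broadcast>", PORT))  # announce candidacy
        s.settimeout(timeout)
        leader = my_id
        try:
            while True:                       # collect peer announcements
                data, _ = s.recvfrom(64)
                leader = min(leader, data.decode())
        except socket.timeout:
            return leader                     # I lead iff leader == my_id

    # e.g. elect(socket.gethostname())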

  14. Load prediction • We use recent measurements of idleness to predict how idle each PC will be in the future • Good prediction leads to • Shorter execution time for guest jobs • Less interference with host processes, i.e. with the desktop user • We’re interested in short-running guest jobs – so we don’t consider migrating tasks if the prediction turns out wrong

  15. How good is load prediction? • Previous studies (Dinda and O’Hallaron, Wolski et al) have shown that taking the weighted mean of the last few samples works as well as anything [Graph: prediction error vs. forecast length (seconds), for a 10-second prediction horizon]
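
As an illustration of that weighted-mean approach, here is a minimal predictor sketch using an exponentially weighted moving average over recent samples; the smoothing factor 0.3 is an assumed value, not one from the paper.

    # Weighted-mean load predictor: recent samples weigh more heavily.
    def predict_load(samples, alpha=0.3):
        """Predict near-future load from recent samples (oldest first)."""
        estimate = samples[0]
        for s in samples[1:]:
            estimate = alpha * s + (1 - alpha) * estimate
        return estimate

    # Example: a machine trending towards idleness
    print(predict_load([0.9, 0.5, 0.2, 0.1]))  # ≈ 0.45: history still dominates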

  16. How well does it work? • Simulation, driven by traces from 32 machines gathered over one week, during busy working hours • Uses application’s speedup curve to predict execution time given number of processors available • Also uses trace load data to compute CPU share available on each processor • For this study, we simulated execution of a ray-tracing task • Sequential execution takes 42 seconds • Speedup is more-or-less linear with 50-60% efficiency • Requests to execute a guest task arrive with an exponential distribution, with mean inter-arrival time of 20 seconds
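
The core of such a simulation is small. The sketch below generates exponentially distributed arrivals and converts an allocated group size into a predicted run time using the numbers on this slide (42 s sequential, 50-60% efficiency, 20 s mean inter-arrival); the exact speedup model is an assumption.

    # Sketch of the simulation's workload model, not the authors' simulator.
    import random

    T_SEQ = 42.0          # sequential ray-tracer time (s), from the slide
    EFFICIENCY = 0.55     # "50-60% efficiency", midpoint assumed here

    def exec_time(n_procs):
        """Predicted run time on n allocated processors."""
        speedup = max(1.0, n_procs * EFFICIENCY)   # roughly linear speedup
        return T_SEQ / speedup

    def arrivals(n, mean_gap=20.0, seed=0):
        """Yield n guest-job arrival times with exponential inter-arrivals."""
        rng = random.Random(seed)
        t = 0.0
        for _ in range(n):
            t += rng.expovariate(1.0 / mean_gap)
            yield t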

  17. How well does it work – baseline • Disruption to desktop users is dramatically reduced compared to assigning work at random (but not zero) • Although many processors are used, speedup is low • Quite often, a guest task is rejected because no processor is idle • Usually because an earlier guest task is still running

  18. Allocation policy matters… • The simplest policy is to allocate all available (idle) processors to each guest job • This leads to a bimodal distribution: a substantial proportion of guest jobs get little or no benefit

  19. A better strategy – holdback • The problem: • If a second guest task arrives before the first has finished, very few processors are available to run it • Idea: “holdback” • Hold back a fraction r of the processors in reserve • Each guest task is allocated (1-r) of the available (idle) processors (sketched below)
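
A minimal sketch of the holdback rule follows, assuming the idle hosts are supplied as a list; r = 0.3 is purely illustrative, since the right value depends on the guest-job arrival rate.

    # Holdback allocation: reserve a fraction r of the idle processors
    # for later arrivals, allocate the rest to the incoming guest job.
    def allocate(idle_hosts, r=0.3):
        n = int(len(idle_hosts) * (1 - r))
        return idle_hosts[:max(n, 1)] if idle_hosts else []

    # Example: 10 idle hosts, 30% holdback -> 7 allocated, 3 in reserve
    print(allocate([f"pc{i:02d}" for i in range(10)]))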

  20. Holdback improves fairness • By holding back some resources at each allocation, guest tasks get a more predictable and consistent share • How much to hold back depends on the rate of arrival of guest tasks [Histograms: frequency (%) vs. allocated group size, for three holdback settings]

  21. How much to hold back • Mean speedup is maximised with the right holdback • Parallel efficiency is lower than it would be on a dedicated parallel system, due to interference • A larger group size doesn’t imply higher speedup • Details depend on the speedup characteristics of the guest application workload

  22. Conclusions & Further work • Simple, effective tool, to be made freely available • Even extremely busy environments can host a substantial parallel workload • Short interactive jobs can be accelerated, if • Relatively small startup cost, data size • Parallel execution time lies within scope of load prediction – 10 seconds or so • Desktop users prepared to tolerate some interference • Plenty of scope for further study… • Memory contention • Adaptive holdback • Integrate with queuing to handle longer-running jobs • How to reduce startup delay?
