
Communication-Aware Processor Allocation for Supercomputers

Michael Bender, SUNY Stony Brook; David Bunde, University of Illinois Urbana; Erik Demaine, MIT; Sandor Fekete, Braunschweig University of Technology; Vitus Leung, Sandia National Laboratories; Henk Meijer, Queen’s University, Ontario


Presentation Transcript


  1. Communication-Aware Processor Allocation for Supercomputers. Michael Bender, SUNY Stony Brook; David Bunde, University of Illinois Urbana; Erik Demaine, MIT; Sandor Fekete, Braunschweig University of Technology; Vitus Leung, Sandia National Laboratories; Henk Meijer, Queen’s University, Ontario; Cynthia Phillips, Sandia National Laboratories. Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy under contract DE-AC04-94AL85000.

  2. Computational Plant (Cplant) • Commodity-based supercomputers at Sandia National Laboratories (off-the-shelf components) • Up to 1500 processors • Production computing environment • Our Job: Improve parallel node allocation on Cplant to optimize performance.

  3. The Cplant System • DEC Alpha processors • Myrinet interconnect (Sandia modified) • MPI • Different sizes/topologies: usually 2D or 3D grid with toroidal wraps • Ross = ~1500 processors, 3D mesh • Zermatt = 128-processor 2D mesh • Alaska = ~600 processors, heavily augmented 2D mesh (cannibalized) • Modified Linux OS (now public domain) • Four processors/switch (compute, I/O, service nodes)

  4. Scheduling Environment • Users submit jobs to queue (online) • Users specify number of processors and runtime estimate • If a job runs past this estimate by 5 min, it is killed • No preemption, no migration, no multitasking (security) • Actual runtime depends on set of processors allocated and placement of other jobs Goals: • User - minimum response time • Bureaucracy (GAO) - high utilization

  5. Scheduler/Allocator Association • The scheduler and the allocator affect each other’s performance (performance dependencies). [Diagram labels: Scheduler, Allocator]

  6. Scheduler/Allocator Dissociation • Job: user, executable, number of processors, requested time • Scheduler enforces policy • Management sets priorities for access, utilization policy • Allocator can optimize performance [Diagram labels: job, queue, PBS scheduler, node allocator, Cplant]

  7. What’s a Good Allocation? Objective: Allocate jobs to processors to minimize network contention ⇒ processor locality. • Especially important for commodity networks [Figures: a good allocation and a bad allocation for a 2D mesh]

  8. Quantitative Effect of Processor Locality [Figure comparing allocations; the legend includes a marker for empty processors.] But, speed-up anomaly: one allocation shown is 2× faster than another.

  9. Communication Hops on a 2D grid • L1 distance = # hops (~ # switches) between 2 processors on grid

  10. Allocation Problem • Given n available points on grid (some unavailable) • Find a set of k available points with minimum average (or total) L1 distance. • Example: green allocation: 3(2) + 3(1) = 9
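The objective on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the four-point example is an assumed shape chosen only because it reproduces the slide's arithmetic (three pairs at distance 2 plus three pairs at distance 1). The brute-force search is exponential in k and is meant only to make the problem statement concrete.

```python
from itertools import combinations


def l1(p, q):
    """L1 (Manhattan) distance between two grid points, i.e. the hop count."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def total_pairwise_l1(points):
    """Sum of L1 distances over all unordered pairs of the given points."""
    return sum(l1(p, q) for p, q in combinations(points, 2))


def best_allocation_bruteforce(available, k):
    """Try every k-subset of the available points and keep the cheapest one.

    Exponential in k; useful only as a reference on tiny instances.
    """
    return min(combinations(available, k), key=total_pairwise_l1)


# Assumed example shape: a center point with three neighbors.
t_shape = [(0, 0), (1, 0), (0, 1), (-1, 0)]
print(total_pairwise_l1(t_shape))  # 3 pairs at distance 2 + 3 at distance 1
```

Minimizing the total and minimizing the average pairwise distance pick the same set, because every k-subset has the same number of pairs.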

  11. Empirical Correlation • Leung et al., 2002 • Related support: Mache and Lo, 1996

  12. Previous Work • Various work forcing a convex set • Insufficient processor utilization • Mache, Lo, Windisch: MC algorithm • Krumke et al.: 2-approximation, NP-hard with a general metric • Complexity open for grids • Dispersion problem (max distance): linear time for fixed k (Fekete and Meijer)

  13. Optimal Unconstrained Shape [Bender, Bender, Demaine, Fekete 2004] Almost a circle but not quite. Only 0.05 percent difference in area. 0.650 245 952 951

  14. Our Results • 7/4-approximation (2 − 1/(2d) in d dimensions) • PTAS ((1+ε)-approximation in polynomial time for any fixed ε) • MC is a 4-approximation • Linear-time exact dynamic program in 1D • O(n log n) time for k=3 • Simulations (performance on job streams)

  15. An L1 Ball on a 2D Grid • Unit ball: a diamond with vertices (1,0), (0,1), (-1,0), (0,-1) • Bounding lines: x + y = 1, y - x = 1, x + y = -1, x - y = 1

  16. Possible medians of selected set • A median will always share x coordinate with an available point and y coordinate with a (possibly different) available point.

  17. Manhattan Median (MM) Algorithm • For each possible median p: • Pick k free processors closest to p (in L1) • Compute total pairwise L1 distance • Return set with the smallest total distance • Krumke et al. (1997) previously showed this is a 2-approximation in arbitrary metric spaces. • We proved it is a 7/4-approximation for L1. This is tight.
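The MM steps above can be sketched as follows. This is a hedged illustration rather than the Cplant implementation: the function name `manhattan_median` and the tie-breaking rule (lexicographic order among equidistant processors) are assumptions, and the candidate medians are taken to be every pairing of an x coordinate and a y coordinate of free processors, as slide 16 describes.

```python
from itertools import combinations


def l1(p, q):
    """L1 (Manhattan) distance between two grid points."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def manhattan_median(free, k):
    """Sketch of MM: return (chosen_points, total_pairwise_distance).

    free: list of (x, y) tuples for the free processors.
    k: number of processors the job requests.
    """
    if k > len(free):
        raise ValueError("not enough free processors")
    # Candidate medians share x with some free point and y with some free point.
    xs = sorted({x for x, _ in free})
    ys = sorted({y for _, y in free})
    best_set, best_cost = None, None
    for cx in xs:
        for cy in ys:
            center = (cx, cy)
            # k free processors closest to the candidate (ties broken by
            # coordinates; this tie-breaking rule is an assumption).
            chosen = sorted(free, key=lambda p: (l1(p, center), p))[:k]
            cost = sum(l1(p, q) for p, q in combinations(chosen, 2))
            if best_cost is None or cost < best_cost:
                best_set, best_cost = chosen, cost
    return best_set, best_cost
```

With n free processors there are at most n² candidate medians, so this straightforward version is polynomial but not tuned; a production allocator would likely avoid re-sorting for every candidate.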

  18. Lower Bound Instance (7/4)

  19. Upper Bound Techniques • WLOG assume the origin is a median of OPT • Let M be the k points closest to the origin • Candidate point set for algorithm MM • Set returned by MM can only be better • Compare M to optimal • Assume M is the worst-case example

  20. Upper Bound Techniques • Transform optimal and M to point placements that have the same performance ratio, but are easy to analyze • Transform in steps • Argue the ratio gets worse if we deviate from this form (impossible if M is the worst case) All points of Opt and M at these 5 points

  21. Simulations: Performance on a Job Stream • We’ve analyzed a greedy algorithm for placing a single job. How well does it do for a stream of jobs? • Consider two types of algorithms: • Situation algorithm: Places job stream prefix (system normal/default) • Decision algorithm: Places current job (can be a 1-time override)

  22. Simulation Setup • Job stream from LLNL Cray T3D trace • 21323 jobs, 256 processors [Diagram labels: job stream, situation algorithm, current allocation, 1-time decision algorithm]

  23. Simulations: Alternative Placement Algorithm MC • Search in shell from minimum-size region of preferred shape. • Weight processors by shells • Return processor set with minimum weight.

  24. Alternative: One-Dimensional Reduction • Order processors so that close in linear order ⇒ close in physical processor graph • Consider one-dimensional processor allocation • Pack jobs onto the line (or ring), allowing fragmentation
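Packing onto the line can be illustrated with a best-fit rule. The slides do not spell out the exact policy, so the sketch below is an assumption: pick the smallest free run that holds the whole job, and if no single run fits, fall back to the first k free ranks in the linear order (fragmentation). The names `free_runs` and `best_fit_1d` are hypothetical, and the ring wrap-around case is omitted.

```python
def free_runs(free):
    """Return maximal runs of free ranks as (start, length) pairs.

    free: list of booleans indexed by position in the linear order.
    """
    runs, start = [], None
    for i, is_free in enumerate(free):
        if is_free and start is None:
            start = i
        elif not is_free and start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:
        runs.append((start, len(free) - start))
    return runs


def best_fit_1d(free, k):
    """Choose k ranks on the line for a job, or None if too few are free."""
    fitting = [run for run in free_runs(free) if run[1] >= k]
    if fitting:
        # Smallest run that fits; the earliest such run wins ties.
        start, _ = min(fitting, key=lambda run: (run[1], run[0]))
        return list(range(start, start + k))
    # No contiguous run is large enough: allow fragmentation (assumed rule).
    ranks = [i for i, is_free in enumerate(free) if is_free]
    return ranks[:k] if len(ranks) >= k else None
```

The returned ranks are positions in the linear order; mapping them back to grid coordinates is the job of whichever processor ordering is in use.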

  25. Hilbert (Space-Filling) Curves • For 2D and 3D grids • Previous applications • I/O efficient and cache-oblivious computation • Compression (images) • Domain decomposition
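One way to obtain such an ordering on a 2D grid is the standard iterative conversion from grid coordinates to a Hilbert curve index. The sketch assumes a square grid whose side n is a power of two; the function names are illustrative, and nothing here is taken from the Cplant allocator itself.

```python
def hilbert_index(n, x, y):
    """Position of cell (x, y) along the Hilbert curve on an n x n grid.

    n must be a power of two; 0 <= x, y < n.
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has standard orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


def hilbert_order(n):
    """All cells of the n x n grid, listed in Hilbert curve order."""
    cells = [(x, y) for x in range(n) for y in range(n)]
    return sorted(cells, key=lambda c: hilbert_index(n, c[0], c[1]))
```

Consecutive cells in `hilbert_order(n)` are always grid neighbors, which is the property the one-dimensional reduction relies on: a short interval of the linear order maps to a compact region of the mesh.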

  26. Four Algorithms for Simulation • MM • MM + Incremental improvement • Hilbert curve with best fit • MC

  27. Results • Ordering in a row consistent with proven approximation performance: MM+Inc, MM, MC1x1, HilbertBF • Ordering on diagonal (normal operation): approximately opposite

  28. Results • MM “paints into a corner” on streams • But good for single high-priority job • Thoughts: rectangles pack better than circles

  29. New System Red Storm • 10,368 AMD Opteron 2 GHz • 31.2 TB memory, 240 TB disk • 41.47 TF peak performance • 3D mesh

  30. Impact • Changed the node allocator on Cplant • 1D default allocator • Carried over to Red Storm system software • 1D algorithms current default • 2D algorithms implemented on Red Storm • Awaiting testing for use • R&D 100 submission (must win internal competition)

  31. Questions • What’s the right allocation for a stream (online)? • Scheduling + Allocation • Simulation issues • Nondeterminism • Credit for good placement in timing
