
Seaborg Code Scalability Project


Presentation Transcript


  1. Seaborg Code Scalability Project
  Richard Gerber, NERSC User Services

  2. NERSC Scaling Objectives
  NERSC wants to promote higher-concurrency jobs. To this end NERSC has:
  • Reconfigured the LoadLeveler job scheduler to favor large jobs
  • Implemented a large-job reimbursement program
  • Provided users assistance with their codes
  • Begun a detailed study of a number of selected projects

  3. Code Scalability Project
  About 20 large user projects were chosen for NERSC to study more closely. Each is assigned to a staff member from NERSC User Services. The study will:
  • Interview users to determine why they run their jobs the way they do
  • Collect scaling information for major codes
  • Identify classes of codes that scale well or poorly
  • Identify bottlenecks to scaling
  • Analyze cost/benefit for large-concurrency jobs
  • Note lessons learned and tips for scaling well

  4. Current Usage of Seaborg
  • We can examine current job statistics on Seaborg to check:
    • User behavior (how jobs are run)
    • Queue wait times
  • We can also look at the results of the large-job reimbursement program to see how it influenced the way users ran jobs

  5. Job Distribution (regular charge class, 3/3/2003-5/26/2003)

  6. Connect Time Usage Distribution (regular charge class, 3/3/2003-5/26/2003)

  7. Queue Wait Times (regular charge class, 3/3/2003-5/26/2003)

  8. Processor Time/Wait Ratio (regular charge class, 3/3/2003-5/26/2003)

  9. Current Usage Summary
  • Users run many small jobs
  • However, 55% of computing time is spent running jobs that use more than 16 nodes (256 processors)
  • And 45% of computing time is used by jobs running on 32+ nodes (512+ CPUs)
  • Current queue policy favors large jobs; it is not a barrier to running on many nodes

  10. Factors that May Affect Scaling
  Why aren’t even more jobs run at high concurrency? Are any of the following bottlenecks to scaling?
  • Algorithmic issues
  • Coding effort needed
  • MPP cost per amount of science achieved
  • Any remaining scheduling / job turnaround issues
  • Other?

  11. Hints from Reimbursement
  • During April NERSC reimbursed a number of projects for jobs using 64+ nodes
  • Time was set aside to let users investigate the scaling performance of their codes
  • Some projects made great use of the program, showing that they would run at high concurrency if given free time

  12. Reimbursement Usage
  Run-time percentage using 64+ nodes (examples). Batchelor went from 0% to 66% of time running on 128+ nodes (2,048 CPUs).

  13. Project Activity
  • Many projects are working with their User Services Group contacts on:
    • Characterizing scaling performance
    • Profiling codes
    • Parallel I/O strategies
    • Enhancing code for high concurrency
    • Compiler and runtime bug fixes and optimizations
  • Examples: Batchelor (Jaeger), Ryne (Qiang, Adelmann), Vahalla, Toussaint, Mezzacappa (Swesty, Strayer, Blondin), Butalov, Guzdar (Swisdak), Spong

  14. Project Example 1
  • Qiang’s (Ryne) BeamBeam3D beam dynamics code, written in Fortran
  • Poor scaling noted on N3E compared to N3
  • We performed many scaling runs and noticed very bad performance using 16 tasks/node
  • Tracked the problem to a routine making heavy use of the RANDOM_NUMBER intrinsic
  • Identified a runtime problem with IBM’s default threading of RANDOM_NUMBER
  • Found an undocumented setting that improved performance dramatically (see the sketch after this slide); reported to IBM
  • Identified one run strategy that minimized execution time and another that minimized cost
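  The slides do not include source, but the pattern at issue is easy to sketch. Below is a minimal Fortran illustration (not the actual BeamBeam3D code) of a particle loop that leans on the RANDOM_NUMBER intrinsic, together with the workaround labelled intrinthds=1 in the run-time table on slide 16; the XLFRTEOPTS environment variable in the trailing comment is an assumption about how that setting is passed, since the slide describes it as undocumented.

    ! Minimal sketch, not the actual BeamBeam3D source: a particle-
    ! initialization loop that calls the RANDOM_NUMBER intrinsic once per
    ! particle. With IBM's thread-safe XL Fortran runtime, RANDOM_NUMBER is
    ! threaded by default; at 16 MPI tasks per node those extra threads
    ! oversubscribe the CPUs, consistent with the 16-tasks/node slowdown in
    ! the run-time table on slide 16.
    subroutine init_particles(np, coords)
      implicit none
      integer, intent(in)  :: np
      real(8), intent(out) :: coords(6, np)   ! 6 phase-space coordinates per particle
      integer :: i
      do i = 1, np
        call random_number(coords(:, i))      ! heavy intrinsic use, once per particle
      end do
    end subroutine init_particles

    ! Assumed workaround (the slide calls the setting undocumented): restrict
    ! the intrinsic to a single thread via the XL Fortran runtime options
    ! variable, for example in the batch script:
    !   export XLFRTEOPTS="intrinthds=1"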

  15. BeamBeam3D Scaling

  16. BeamBeam Run Time with intrinthds=1 (default setting in parentheses)

                                      Tasks per Node
  Number of Tasks        16             8              4              2              1
        4                 -             -         206.1 (209.0)  205.2 (207.5)  202.1 (201.7)
        8                 -        103.6 (115.6)  101.6 (100.8)   97.6 (98.0)    96.1 (96.1)
       16            53.1 (106.6)   47.0 (53.2)    45.3 (45.9)    44.2 (44.7)    43.7 (44.0)
       32            25.0 (62.0)    21.9 (27.2)    21.6 (22.8)    21.2 (21.6)       -
       64            15.6 (73.9)    14.1 (21.7)    12.7 (14.1)    12.3 (12.7)       -
      128            20.8 (75.4)    13.5 (16.7)    12.2 (12.1)       -              -
      256            38.1 (181.2)   21.8 (32.9)       -              -              -

  17. MPP Charges with intrinthds=1 (default setting in parentheses)

                                             Number of Nodes
  Number of Tasks        1               2               4               8               16               32
        4          8,244 (8,360)  16,416 (16,600) 32,336 (32,272)       -                -                -
        8          4,144 (4,624)   8,128 (8,064)  15,616 (15,680) 30,750 (30,750)        -                -
       16          2,124 (4,264)   3,760 (4,256)   7,248 (7,344)  14,144 (14,304) 27,968 (28,160)         -
       32               -          2,000 (4,960)   3,504 (4,352)   6,912 (7,296)  13,568 (13,824)         -
       64               -               -          2,496 (11,824)  4,512 (6,944)   8,122 (9,024)   15,744 (16,230)
      128               -               -               -          6,656 (24,128)  8,640 (10,688)  15,616 (15,539)
      256               -               -               -               -         24,384 (115,968) 27,904 (42,112)

  18. BeamBeam Summary
  • Found a fix for the runtime performance problem
  • Reported it to IBM; seeking clarification and documentation
  • Identified the run configuration that solved the problem fastest
  • Identified the cheapest job configuration
  • Quantified the MPP cost of various configurations (see the worked numbers below)
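  Reading the two tables above together (and noting that in both, the charge grows in proportion to run time multiplied by the number of nodes): the fastest configuration in the data is 128 tasks at 4 tasks per node (32 nodes), with a run time of 12.2 and an MPP charge of 15,616, while the cheapest is 32 tasks at 16 tasks per node (2 nodes), costing 2,000 MPP units with a run time of 25.0. The fastest job therefore runs roughly twice as fast as the cheapest at nearly eight times the cost, and pushing beyond 128 tasks (the 256-task rows) makes the code both slower and more expensive.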

  19. Project Example 2
  • Adelmann’s (Ryne) PARSEC code
  • A 3D self-consistent iterative field solver and particle code for studying accelerator beam dynamics; written in C++
  • Scales extremely well to 4,096 processors, but Mflop/s performance is disappointing
  • Migrating from KCC to xlC; found a fatal xlC compiler bug and are pushing IBM for a fix so the code can be optimized with the IBM compiler
  • Using HPMlib profiling calls, found that a large amount of run time is spent in integer-only stenciling routines, which naturally gives low Mflop/s (see the sketch below)
  • Have recently identified possible load-balancing problems; working to resolve them
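  PARSEC itself is C++ and its source is not shown here; the short Fortran sketch below (a hypothetical routine, kept in Fortran for consistency with the other example in this transcript) only illustrates the kind of integer-only stencil bookkeeping that registers almost no floating-point operations on hardware counters even when it runs efficiently, which is why its dominance in the profile depresses the code's overall Mflop/s figure.

    ! Hypothetical illustration (not PARSEC code): an integer-only stencil
    ! pass that counts interior grid points by testing the six neighbours of
    ! each point. The work is all integer loads, compares, and index
    ! arithmetic, so counters report near-zero Mflop/s regardless of speed.
    subroutine count_interior_points(nx, ny, nz, mask, ninterior)
      implicit none
      integer, intent(in)  :: nx, ny, nz
      integer, intent(in)  :: mask(nx, ny, nz)   ! 1 = active grid point, 0 = inactive
      integer, intent(out) :: ninterior
      integer :: i, j, k
      ninterior = 0
      do k = 2, nz - 1
        do j = 2, ny - 1
          do i = 2, nx - 1
            ! A point counts as interior when all six stencil neighbours are active.
            if (mask(i-1,j,k) + mask(i+1,j,k) + mask(i,j-1,k) + &
                mask(i,j+1,k) + mask(i,j,k-1) + mask(i,j,k+1) == 6) then
              ninterior = ninterior + 1
            end if
          end do
        end do
      end do
    end subroutine count_interior_points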

  20. In Conclusion
  • This work is under way.
  • We don’t expect to be able to characterize every code we are studying, but we hope to survey a number of algorithms and scientific applications.
  • A draft report is scheduled for July.
