Seaborg Code Scalability Project
Richard Gerber, NERSC User Services
NERSC Scaling Objectives
NERSC wants to promote higher-concurrency jobs. To this end, NERSC
• Reconfigured the LoadLeveler job scheduler to favor large jobs
• Implemented a large-job reimbursement program
• Provides users assistance with their codes
• Began a detailed study of a number of selected projects
Code Scalability Project
About 20 large user projects were chosen for NERSC to study more closely. Each is assigned to a staff member from NERSC User Services. The study will:
• Interview users to determine why they run their jobs the way they do.
• Collect scaling information for major codes.
• Identify classes of codes that scale well or poorly.
• Identify bottlenecks to scaling.
• Analyze cost/benefit for large-concurrency jobs.
• Note lessons learned and tips for scaling well.
Current Usage of Seaborg
• We can examine current job statistics on Seaborg to check
  • User behavior (how jobs are run)
  • Queue wait times
• We can also look at the results of the large-job reimbursement program to see how it influenced the way users ran jobs
Job Distribution (regular charge class, 3/3/2003-5/26/2003)
Connect Time Usage Distribution (regular charge class, 3/3/2003-5/26/2003)
Queue Wait Times (regular charge class, 3/3/2003-5/26/2003)
Processor Time/Wait Ratio (regular charge class, 3/3/2003-5/26/2003)
Current Usage Summary
• Users run many small jobs
• However, 55% of computing time is spent running jobs that use more than 16 nodes (256 processors)
• And 45% of computing time is used by jobs running on 32+ nodes (512+ CPUs)
• Current queue policy favors large jobs; it is not a barrier to running on many nodes
Factors that May Affect Scaling
Why aren't even more jobs run at high concurrency? Are any of the following bottlenecks to scaling?
• Algorithmic issues
• Coding effort needed
• MPP cost per amount of science achieved
• Any remaining scheduling / job turnaround issues
• Other?
Hints from Reimbursement
• During April, NERSC reimbursed a number of projects for jobs using 64+ nodes
• Time was set aside to let users investigate the scaling performance of their codes
• Some projects made great use of the program, showing that they would run at high concurrency if given free time
Reimbursement Usage
Run time percentage using 64+ nodes (examples). Batchelor went from 0% to 66% of time running on 128+ nodes (2,048 CPUs).
Project Activity
• Many projects are working with their User Services Group contacts on:
  • Characterizing scaling performance
  • Profiling codes
  • Parallel I/O strategies
  • Enhancing code for high concurrency
  • Compiler and runtime bug fixes and optimizations
• Examples: Batchelor (Jaeger), Ryne (Qiang, Adelmann), Vahalla, Toussaint, Mezzacappa (Swesty, Strayer, Blondin), Butalov, Guzdar (Swisdak), Spong
Project Example 1
• Qiang's (Ryne) BeamBeam3D beam dynamics code, written in Fortran
• Poor scaling noted on N3E compared to N3
• We performed many scaling runs and noticed very bad performance using 16 tasks/node
• Tracked the problem to a routine making heavy use of the RANDOM_NUMBER intrinsic (sketched below)
• Identified a runtime problem with IBM's default threading of RANDOM_NUMBER
• Found an undocumented setting that improved performance dramatically; reported it to IBM
• Identified a run strategy that minimized execution time and another that minimized cost (see the tables that follow)
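The sketch below is not BeamBeam3D source; it is a minimal, hypothetical Fortran loop of the kind described above, one that leans heavily on the RANDOM_NUMBER intrinsic, with comments marking where the intrinthds=1 setting from the following tables comes in. The XLFRTEOPTS environment variable named in the final comment is an assumption about how such an XL Fortran runtime option is applied; the slide itself only says the setting was undocumented.

! Hypothetical particle-initialization loop that makes heavy use of the
! RANDOM_NUMBER intrinsic (the pattern that exposed the threading problem).
program random_heavy
  implicit none
  integer, parameter :: np = 1000000
  real(kind=8), allocatable :: coords(:, :)
  integer :: i

  allocate(coords(6, np))
  do i = 1, np
     ! Each call goes through the XL Fortran runtime; with 16 tasks per node
     ! the default threaded handling of the intrinsic performed very badly,
     ! which is what the scaling runs uncovered.
     call random_number(coords(:, i))
  end do
  print *, 'sample value:', coords(1, 1)
end program random_heavy

! The fix referenced in the tables below is the runtime setting intrinthds=1.
! Assumed mechanism (not stated on the slide): pass it through the XLFRTEOPTS
! environment variable in the job script, e.g. XLFRTEOPTS=intrinthds=1.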
BeamBeam Run Time with intrinthds=1 (times with the default setting in parentheses)
Rows: number of tasks; columns: tasks per node

Tasks |  16 per node  |  8 per node   |  4 per node   |  2 per node   |  1 per node
    4 |               |               | 206.1 (209.0) | 205.2 (207.5) | 202.1 (201.7)
    8 |               | 103.6 (115.6) | 101.6 (100.8) |  97.6 (98.0)  |  96.1 (96.1)
   16 |  53.1 (106.6) |  47.0 (53.2)  |  45.3 (45.9)  |  44.2 (44.7)  |  43.7 (44.0)
   32 |  25.0 (62.0)  |  21.9 (27.2)  |  21.6 (22.8)  |  21.2 (21.6)  |
   64 |  15.6 (73.9)  |  14.1 (21.7)  |  12.7 (14.1)  |  12.3 (12.7)  |
  128 |  20.8 (75.4)  |  13.5 (16.7)  |  12.2 (12.1)  |               |
  256 |  38.1 (181.2) |  21.8 (32.9)  |               |               |
MPP Charges (charges with the default setting in parentheses)
Rows: number of tasks; columns: number of nodes

Tasks |    1 node     |    2 nodes      |    4 nodes      |    8 nodes      |    16 nodes      |    32 nodes
    4 | 8,244 (8,360) | 16,416 (16,600) | 32,336 (32,272) |                 |                  |
    8 | 4,144 (4,624) |  8,128 (8,064)  | 15,616 (15,680) | 30,750 (30,750) |                  |
   16 | 2,124 (4,264) |  3,760 (4,256)  |  7,248 (7,344)  | 14,144 (14,304) | 27,968 (28,160)  |
   32 |               |  2,000 (4,960)  |  3,504 (4,352)  |  6,912 (7,296)  | 13,568 (13,824)  |
   64 |               |                 |  2,496 (11,824) |  4,512 (6,944)  |  8,122 (9,024)   | 15,744 (16,230)
  128 |               |                 |                 |  6,656 (24,128) |  8,640 (10,688)  | 15,616 (15,539)
  256 |               |                 |                 |                 | 24,384 (115,968) | 27,904 (42,112)
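One cross-check on the two tables (a reading of the data, not stated on the slides): the charges track node-time, with each entry close to

charge ≈ 40 × nodes × run time,  e.g.  40 × 16 × 38.1 = 24,384 for 256 tasks at 16 tasks per node.

Read this way, the fastest configuration shown is 128 tasks at 4 tasks per node (run time 12.2), while the cheapest is 32 tasks at 16 tasks per node (charge 2,000), the trade-off summarized on the next slide.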
BeamBeam Summary
• Found a fix for the runtime performance problem
• Reported it to IBM; seeking clarification and documentation
• Identified the run configuration that solved the problem fastest
• Identified the cheapest job
• Quantified MPP cost for the various configurations
Project Example 2
• Adelmann's (Ryne) PARSEC code
• 3D self-consistent iterative field solver and particle code for studying accelerator beam dynamics; written in C++
• Scales extremely well to 4,096 processors, but Mflop/s performance is disappointing
• Migrating from KCC to xlC; found a fatal xlC compiler bug; pushing IBM for a fix so the code can be optimized with the IBM compiler
• Using HPMlib profiling calls, found that a large amount of run time is spent in integer-only stenciling routines, which naturally give low Mflop/s (see the sketch below)
• Recently identified possible load-balancing problems; working to resolve them
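PARSEC itself is written in C++, but the HPM observation above is language-independent, so the hypothetical sketch below stays in Fortran for consistency with the earlier example. A routine dominated by integer index arithmetic and memory traffic performs no floating-point operations, so a flop-based rate such as Mflop/s is low by construction even when the routine is doing useful work.

! Hypothetical integer-only stencil pass (not PARSEC code): for each interior
! point of an nx*ny*nz grid, record the linear indices of its six nearest
! neighbours.  The loop is all integer arithmetic and stores, so the hardware
! counters report essentially zero floating-point operations.
program stencil_demo
  implicit none
  integer, parameter :: nx = 32, ny = 32, nz = 32
  integer :: neigh(6, nx*ny*nz)

  call build_stencil(nx, ny, nz, neigh)
  print *, 'neighbours of point (2,2,2):', neigh(:, 2 + nx + nx*ny)
end program stencil_demo

subroutine build_stencil(nx, ny, nz, neigh)
  implicit none
  integer, intent(in)  :: nx, ny, nz
  integer, intent(out) :: neigh(6, nx*ny*nz)
  integer :: i, j, k, p

  do k = 2, nz - 1
     do j = 2, ny - 1
        do i = 2, nx - 1
           p = i + nx*(j - 1) + nx*ny*(k - 1)   ! linear index of (i,j,k)
           neigh(1, p) = p - 1                  ! (i-1, j,   k  )
           neigh(2, p) = p + 1                  ! (i+1, j,   k  )
           neigh(3, p) = p - nx                 ! (i,   j-1, k  )
           neigh(4, p) = p + nx                 ! (i,   j+1, k  )
           neigh(5, p) = p - nx*ny              ! (i,   j,   k-1)
           neigh(6, p) = p + nx*ny              ! (i,   j,   k+1)
        end do
     end do
  end do
end subroutine build_stencil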
In Conclusion
• This work is underway.
• We don't expect to be able to characterize every code we are studying, but we hope to survey a number of algorithms and scientific applications.
• A draft report is scheduled for July.