
User experiences on Heterogeneous TACC IBM Power4 System


Presentation Transcript


1. User experiences on Heterogeneous TACC IBM Power4 System. Avi Purkayastha, Kent Milfeld, Chona Guiang. Texas Advanced Computing Center, University of Texas at Austin. ScicomP 8, Minneapolis, MN, August 4-8, 2003

2. Outline
• Architectural Overview
  • TACC Heterogeneous Power4 System
  • Fundamental differences/similarities of TACC Power4 nodes
• Resource Allocation Policies
  • Scheduling
  • Simple and advanced job scripts
  • Pre-processing LL with job filter
• Performance Analysis
  • STREAM, NPB benchmarks
  • Finite Difference and Molecular Dynamics applications
  • MPI bandwidth performance
• Conclusions

3. TACC Cache/Micro-architecture Features
• L1: 32KB data, 2-way assoc. (write through); 64KB instruction, direct mapped
• L2: 1.44MB (unified), 8-way assoc.
• L3: 32MB, 8-way assoc.
• Memory: 32 GB/node (p690H), 128 GB/node (p690T), 8 GB/node (p655H)
• Cache lines: 128/128/4x128 bytes for L1/L2/L3

4. Comparison of TACC Power4 Nodes
• All nodes have the same processor speed but different memory configurations: the p690H and p655H have 2 GB/proc, while the p690T has 4 GB/proc.
• Only the p690T has dual-core processors, so its cores share an L2 cache; the other nodes have dedicated L2 caches.
• The p655 nodes have PCI-X adapters while the other nodes have PCI adapters, so the former have faster message-passing throughput.
• Global address snooping is absent on the p655s; this provides about a 10% performance improvement over the p690s.

5. TACC Power4 System: longhorn.tacc.utexas.edu
• Login node: 1 x 13-way, 16 GB; GPFS nodes: 3 x 1-way, 6 GB (totals: 16 procs, 22 GB)
• P690 Turbo: 1 node, 32-way, 128 GB (totals: 32 procs, 128 GB)
• P690 HPC: 3 nodes, 16-way SMP, 32 GB/node (totals: 48 procs, 96 GB)
• P655 HPC: 32 nodes, 4-way SMP, 8 GB/node (totals: 128 procs, 256 GB)

6. TACC Power4 System: longhorn.tacc.utexas.edu
[Diagram: the login, GPFS, P690 HPC, P690 Turbo, and P655 HPC nodes all connect to an IBM dual-plane SP Switch2, with 32 ports per plane.]

7. TACC Power4 System: longhorn.tacc.utexas.edu
[Diagram: local /scratch disks on each node (36 GB on the p690s, 18 GB on the p655s), a 1/4 TB /home on the login node, a 4.5 TB /work filesystem reached over the SP Switch2, and archival storage at /archive.]

8. LoadLeveler Batch Facility
• Used to execute a batch parallel job
• POE options: use environment variables for LoadLeveler scheduling (a sketch follows this list)
  • Adapter specification
  • MPI parameters
  • Number of nodes
  • Class (priority)
  • Consumable resources
• Simple, PBS-like job scripts are also provided for users migrating from clusters
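For reference, the main POE settings named above can also be supplied as environment variables rather than script directives; a minimal sketch (the values shown are illustrative, and the exact variable set varies by POE release):

    setenv MP_EUILIB us             # user-space communication protocol
    setenv MP_EUIDEVICE csss        # adapter specification (SP Switch2)
    setenv MP_PROCS 16              # total number of MPI tasks
    setenv MP_NODES 4               # number of nodes to spread tasks over
    setenv MP_SHARED_MEMORY true    # shared memory for on-node messages
    poe a.out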

9. Job Filter
• The TACC Power4 system is heterogeneous:
  • some nodes have large memories
  • some have faster communication throughput
  • some have dual cores, with different processor counts
  • cluster users also need to be accommodated
• Part of scheduling is simply categorizing job requests into classes.
• LoadLeveler cannot move a job from one class to another, hence a filter has evolved.
• The filter also optimizes resource allocation and scheduling policies, with emphasis on application performance.
• Flow: job submission → job filter → one of the queues {LH13, LH16, LH32, LH4} → scheduler determines priority and releases jobs for execution

10. POE: Simple Job Script I (MPI example)
#!/bin/csh
…
# @ job type = parallel
# @ tasks = 16
# @ memory = 1000
# @ walltime = 00:30:00
# @ class = normal
# @ queue
poe a.out
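A script like this would be submitted and tracked with the standard LoadLeveler commands; a typical session (the script file name is illustrative):

    llsubmit simple_mpi.ll          # submit the job script to LoadLeveler
    llq -u $USER                    # list your queued and running jobs
    llcancel <job_id>               # cancel a job if needed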

11. POE: Simple Job Script II (OpenMP example)
#!/bin/csh
…
# @ job type = parallel
# @ threads = 16
# @ memory = 1000
# @ walltime = 00:30:00
# @ class = normal
# @ queue
setenv OMP_NUM_THREADS 16
poe a.out

12. POE: Advanced Job Script (MPI example across nodes)
#!/bin/csh
…
# @ resources = ConsumableCpus(1) ConsumableMemory(1500mb)
# @ network.MPI = csss,not_shared,us
# @ node = 4
# @ tasks_per_node = 16
# @ class = normal
# @ queue
setenv MP_SHARED_MEMORY true
poe a.out
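The same keywords extend naturally to hybrid MPI/OpenMP runs; a hedged sketch using the directives shown above (the task, thread, and memory figures are illustrative, not from the talk):

    #!/bin/csh
    # @ resources = ConsumableCpus(4) ConsumableMemory(1500mb)
    # @ network.MPI = csss,not_shared,us
    # @ node = 4
    # @ tasks_per_node = 4
    # @ class = normal
    # @ queue
    setenv OMP_NUM_THREADS 4        # 4 threads per MPI task
    setenv MP_SHARED_MEMORY true    # shared memory for on-node messages
    poe a.out                       # 16 tasks x 4 threads = 64 CPUs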

13. TACC Power4 System Filter Logic (MPI)
[Decision-matrix diagram. User input: N, CpT, TpN, MpT; derived values C = N x CpT x TpN and M = TpN x MpT. Jobs are routed by CPU count and memory: roughly, 5 <= C <= 16 to LH16, 17 <= C <= 32 to LH32, and C = 4 or C > 32 to LH4, with per-queue memory checks (M < 2 or M < 4, in GB), a wall-clock check (time < f(nodes)), and shared vs. non-shared csss network assignment; requests that fit no queue are removed.]
Legend: N = nodes, CpT = cpus/task, TpN = tasks/node, MpT = mem/task, C = CPUs, M = mem/task.
A csh sketch of this routing follows.
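The sketch below is illustrative only: it approximates the routing in the matrix above, and the exact thresholds and per-queue memory checks of TACC's production filter are not reproduced here.

    #!/bin/csh
    set N = 2; set CpT = 1; set TpN = 16; set MpT = 1   # example request
    @ C = $N * $CpT * $TpN      # total CPUs = nodes x cpus/task x tasks/node
    @ M = $TpN * $MpT           # memory figure from the slide (checks omitted)
    if ($C > 32) then
        set queue = LH4         # large MPI jobs span the 32 p655 nodes
    else if ($C >= 17) then
        set queue = LH32        # 17-32 CPUs fit the 32-way p690 Turbo
    else if ($C >= 5) then
        set queue = LH16        # 5-16 CPUs fit a 16-way p690H
    else
        set queue = LH4         # small jobs fit a 4-way p655
    endif
    echo "routing to class $queue"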

14. TACC Power4 System Filter Logic (OMP)
[Decision-matrix diagram. User input: CpT and MpT; derived values C = CpT and M = MpT / C. A threaded job occupies a single node (N = 1, TpN = 1, CpT > 1): roughly, 17 <= C <= 32 to LH32; 5 <= C <= 16 to LH16 when M < 2 GB and to LH32 when 2 < M < 4 GB; 1 <= C <= 4 to LH4; with non-shared checks and a wall-clock check as in the MPI case.]
Legend: CpT = cpus/task, MpT = mem/task, C = CPUs, M = mem/CPU.
A matching sketch for the threaded case follows.
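Again an illustrative sketch only, following the OMP matrix above; the M > 2 routing to LH32 is an approximation of the 2 < M < 4 column.

    #!/bin/csh
    set CpT = 16; set MpT = 16  # example: 16 threads, 16 GB total
    @ C = $CpT                  # a threaded job runs on one node
    @ M = $MpT / $C             # memory per CPU (integer GB, for the sketch)
    if ($C >= 17) then
        set queue = LH32        # more than 16 threads only fit the Turbo
    else if ($M > 2) then
        set queue = LH32        # >2 GB/CPU needs the 4 GB/proc Turbo
    else if ($C >= 5) then
        set queue = LH16        # 5-16 threads fit a 16-way p690H
    else
        set queue = LH4         # up to 4 threads fit a 4-way p655
    endif
    echo "routing to class $queue"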

15. Batch Resource Limits
• serial (front end): 8 hours, up to 2 jobs
• development (front end): 12 GB, 2 hours, up to 8 CPUs
• normal/high (some default examples):
  • <= 8 GB, < 4 CPUs (LH16)
  • <= 8 GB, 4 CPUs (LH4)
  • > 32 GB, 5 <= CPUs <= 16 (LH32)
• For various other combination possibilities, see the User Guide
• dedicated (by special request only)
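The per-class limits can also be checked directly from LoadLeveler; for example (output omitted, class name illustrative):

    llclass                         # summary of all classes and their limits
    llclass -l normal               # long listing for a single class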

16. Application Performance
• Effects of compute-intensive and memory-intensive scientific applications on the different nodes. Examples are STREAM, NPB, etc.
• Effects of different kinds of MPI functionality on the different nodes. Examples include MPI ping-pong and MPI all-to-all Send_Recv's.

17. Scientific Applications
• SM: Stommel model of ocean "circulation"; solves a 2-D partial differential equation. Uses finite-difference approximations for derivatives on a discretized domain (timed for a constant number of Jacobi iterations). The parallel version uses domain decomposition on a 1K x 1K grid.
  • Memory-intensive application
  • Nearest-neighbor communication
• MD: molecular dynamics of a solid argon lattice. Uses the Verlet algorithm for propagation (displacements and velocities). The calculation is run for 1 picosecond for size 4.153.
  • Compute-intensive application
  • Global communications

18. Scientific Applications II
[Chart: run times in seconds for the SM and MD applications across node types.]
• The lh16 architecture is best suited for a memory-intensive application combined with nearest-neighbor communication.
• L2 cache sharing is most ill-suited for a memory-intensive application.

19. NAS Parallel Benchmarks

Class B results, time (secs):
        cg    bt    ep    is   ft    lu   mg
lh4     14    49.7  8.6   1.1  11.9  25.2 1.38
lh16    16    67.5  8.6   2    20.4  29.1 1.8
lh32    11.1  90.5  8.5   0.7  11.2  28   2

Class C results, time (secs):
        cg    bt    ep    is   ft    lu   mg
lh4     32.1  224   34.2  4.7  53    102  14.1
lh16    37.6  304   34.1  8.5  87.3  107  19
lh32    37.1  421   34.2  3.1  50.8  171  23.5

20. STREAM Benchmarks*
[Chart: STREAM bandwidth results across node types.]
* Results courtesy of John McCalpin; STREAM web site: http://www.cs.virginia.edu/stream/

21. STREAM Benchmarks*
• Results for the p655 used large pages and threads
• Results for the p690s used small pages and threads
• The tested p655 system had 64 GB of memory; the TACC system has 8 GB
• The tested p690 system had 128 GB of memory; the TACC system has 32 GB
• Systems with more memory and CPUs can prefetch streaming data, so applications with STREAM-like kernels should perform better on the p690s than on the p655s
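On AIX, large pages of the sort used in the p655 runs are typically requested per process through the loader control variable; a hedged sketch (assumes large pages were configured by the administrator, and the binary name is illustrative):

    setenv LDR_CNTRL LARGE_PAGE_DATA=Y  # request large pages for data segments
    setenv OMP_NUM_THREADS 4            # threaded STREAM run
    ./stream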

22. MPI On-node Performance

Ping-pong bandwidth:
IBM P690 Turbo: 1.93 GB/s @ 256K; 1.39 GB/s @ 2M
IBM P690 HPC:   1.89 GB/s @ 256K; 1.32 GB/s @ 2M
IBM P655 HPC:   2.47 GB/s @ 256K; 1.62 GB/s @ 2M

Bisection bandwidth:
IBM P690 Turbo: 1.10 GB/s @ 32K;  549 MB/s @ 2M
IBM P690 HPC:   1.76 GB/s @ 128K; 862 MB/s @ 2M
IBM P655 HPC:   2.71 GB/s @ 256K; 1.67 GB/s @ 2M

23. MPI Off-node Performance

Sustained off-node range measurements:
IBM P690: 320-330 MB/s @ 2M-4M
IBM P655: 400-410 MB/s @ 2M-4M
(Cruiser adapter on the p655s vs. Corsair on the p690s)

24. Thoughts and Comments
• The p690T is best suited for large-memory, threaded (OpenMP) applications.
• Applications such as FD, which typically combine nearest-neighbor communication with large memory requirements, are best suited for p690H-type nodes.
• Large distributed MPI jobs are best suited for the p655 nodes, as they are the most balanced nodes.
• Latency-sensitive but small MPI jobs are better suited to a p690H node than to the slower interconnect with the p655s.
• In general, the p690s are more limited by the slower interconnect than helped by shared memory; exceptions include FD and Linpack.
