
SDSC Blue Gene: Overview


Presentation Transcript


  1. SDSC Blue Gene: Overview
  Mahidhar Tatineni
  Introduction to and Optimization for SDSC Systems Workshop, October 16, 2006

  2. BG System Overview: Novel, massively parallel system from IBM
  • Full system installed at LLNL from 4Q04 to 3Q05
  • 65,000+ compute nodes in 64 racks
  • Each node has two low-power PowerPC processors + memory
  • Compact footprint with very high processor density
  • Slow processors & modest memory per processor
  • Very high peak speed of 367 Tflop/s
  • #1 Linpack speed of 280 Tflop/s
  • Two applications (CPMD & ddcMD) run at over 100 Tflop/s
  • 1024 compute nodes in a single rack installed at SDSC in 4Q04
  • Maximum I/O configuration with 128 I/O nodes for data-intensive computing
  • Has achieved more than 4 GB/s for writes using GPFS
  • Systems at 14 sites outside IBM & 4 within IBM as of 2Q06
  • Need to select apps carefully
  • Must scale (at least weakly) to many processors (because they're slow)
  • Must fit in limited memory
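  (The peak speed follows from the node count and the per-processor peak given on slide 5: 64 racks × 1,024 nodes/rack × 2 processors/node × 2.8 Gflop/s per processor ≈ 367 Tflop/s.)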

  3. BG System Overview: Chips to Racks

  4. BG System Overview: SDSC’s single-rack system

  5. BG System Overview: Processor Chip (= system-on-a-chip)
  • Two 700-MHz PowerPC 440 processors
  • Each with two floating-point units
  • Each with 32-kB L1 data caches that are not coherent
  • 4 flops/proc-clock peak (= 2.8 Gflop/s-proc)
  • 2 8-B loads or stores/proc-clock peak in L1 (= 11.2 GB/s-proc)
  • Shared 2-kB L2 cache (or prefetch buffer)
  • Shared 4-MB L3 cache
  • Five network controllers (though not all wired to each node)
  • 3-D torus (for point-to-point MPI operations: 175 MB/s nom x 6 links x 2 ways)
  • Tree (for most collective MPI operations: 350 MB/s nom x 3 links x 2 ways)
  • Global interrupt (for MPI_Barrier: low latency)
  • Gigabit Ethernet (for I/O)
  • JTAG (for machine control)
  • Memory controller for 512 MB of off-chip, shared memory
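  (These per-processor peaks follow directly from the clock rate: 4 flops/clock × 700 MHz = 2.8 Gflop/s, and 2 transfers/clock × 8 B × 700 MHz = 11.2 GB/s.)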

  6. BG System Overview: Processor Chip

  7. BG System Overview: Integrated system

  8. BG System Overview: SDSC’s single-rack system
  • 1024 compute nodes & 128 I/O nodes (each with 2 processors)
  • Most I/O-rich configuration possible (8:1 compute:I/O node ratio)
  • Identical hardware in each node type, with different networks wired
  • Compute nodes connected to: torus, tree, global interrupt, & JTAG
  • I/O nodes connected to: tree, global interrupt, Gigabit Ethernet, & JTAG
  • Two half racks (also confusingly called midplanes)
  • Connected via link chips
  • Front-end nodes (2 B80s, each with 4 pwr3 processors, plus 1 pwr5 node)
  • Service node (Power 275 with 2 Power4+ processors)
  • Two parallel file systems using GPFS
  • Shared /gpfs-wan serviced by 58 NSD nodes (each with 2 IA-64s)
  • Local /bggpfs serviced by 12 NSD nodes (each with 2 IA-64s)

  9. BG System Overview: Multiple operating systems & functions
  • Compute nodes: run the Compute Node Kernel (CNK = blrts)
  • Each runs only one job at a time
  • Each uses very little memory for the CNK
  • I/O nodes: run Embedded Linux
  • Run CIOD to manage compute nodes
  • Perform file I/O
  • Run GPFS
  • Front-end nodes: run SuSE Linux
  • Support user logins
  • Run cross compilers & linker
  • Run parts of mpirun to submit jobs & LoadLeveler to manage jobs
  • Service node: runs SuSE Linux
  • Uses DB2 to manage four system databases
  • Runs control system software, including MMCS
  • Runs other parts of mpirun & LoadLeveler
  • Software comes in drivers: we are currently running Driver V1R3

  10. BG System Overview: Parallel I/O via GPFS

  11. SDSC Blue Gene: Getting Started
  Logging on & moving files
  • Logging on:
    ssh bglogin.sdsc.edu
    or
    ssh -l username bglogin.sdsc.edu
  • Moving files:
    scp file username@bglogin.sdsc.edu:~
    or
    scp -r directory username@bglogin.sdsc.edu:~
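  For example, assuming a hypothetical username jdoe and a local source directory myproject (both names are illustrative only), a typical session might look like this:

    ssh -l jdoe bglogin.sdsc.edu               # log in to the Blue Gene front end
    scp -r myproject jdoe@bglogin.sdsc.edu:~   # copy a whole directory to your home directory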

  12. SDSC Blue Gene: Getting Started
  Places to store your files
  • /users (home directory)
  • 1.1 TB NFS-mounted file system
  • Recommended for storing source / important files
  • Do not write data/output to this area: slow and limited in size!
  • Regular backups
  • /bggpfs available for parallel I/O via GPFS
  • ~18.5 TB accessed via IA-64 NSD servers
  • No backups
  • /gpfs-wan (225 TB) available for parallel I/O and shared with DataStar and the TG IA-64 cluster

  13. Using the compilers: Important programming considerations
  • Front-end nodes have different processors & run a different OS than the compute nodes
  • Hence codes must be cross-compiled
  • Care must be taken with configure scripts
  • Discovery of system characteristics during compilation (e.g., via configure) may require modifications to the configure script
  • Make sure that if code has to be executed during the configure step, it runs on the compute nodes
  • Alternatively, system characteristics can be specified by the user and the configure script modified to take this into account (see the sketch below)
  • Some system calls are not supported by the compute node kernel
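  A minimal sketch of what such a cross-compilation setup might look like for a GNU-autoconf-style package; the host/build triplets and the idea of pre-answering run-time tests are illustrative assumptions, not the exact incantation for any particular code:

    # Point configure at the Blue Gene cross compilers and declare a cross
    # build, so it does not try to run compute-node binaries on the front end
    # (the triplet names below are illustrative).
    ./configure CC=blrts_xlc CXX=blrts_xlC FC=blrts_xlf \
                --build=powerpc64-suse-linux \
                --host=powerpc-bgl-blrts-gnu
    # If the script still insists on running test programs, either run those
    # tests as small compute-node jobs or supply the answers by hand
    # (for example, through configure cache variables).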

  14. Using the compilers: Compiler versions, paths, & wrappers
  • Compilers (version numbers the same as on DataStar)
    XL Fortran V10.1: blrts_xlf & blrts_xlf90
    XL C/C++ V8.0: blrts_xlc & blrts_xlC
  • Paths to compilers in the default .bashrc
    export PATH=/opt/ibmcmp/xlf/bg/10.1/bin:$PATH
    export PATH=/opt/ibmcmp/vac/bg/8.0/bin:$PATH
    export PATH=/opt/ibmcmp/vacpp/bg/8.0/bin:$PATH
  • Compilers with MPI wrappers (recommended)
    mpxlf, mpxlf90, mpcc, & mpCC
  • Path to MPI-wrapped compilers in the default .bashrc
    export PATH=/usr/local/apps/bin:$PATH

  15. Using the compilers: Options & example
  • Compiler options
    -qarch=440 uses only a single FPU per processor (minimum option)
    -qarch=440d allows both FPUs per processor (alternate option)
    -qtune=440 (after -qarch) seems superfluous, but avoids warnings
    -O3 gives minimal optimization with no SIMDization
    -O3 -qhot=simd adds SIMDization (seems to be the same as -O5)
    -O4 adds compile-time interprocedural analysis
    -O5 adds link-time interprocedural analysis (but sometimes has problems)
    -qdebug=diagnostic gives SIMDization info
  • Current recommendation
    -O3 -qarch=440
  • Example using an MPI-wrapped compiler
    mpxlf90 -O3 -qarch=440 -o hello hello.f
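  As a further illustration, a hedged example of the more aggressive option combination described above, applied to a hypothetical source file mycode.f90:

    # Enable both FPUs, ask for SIMDization, and print SIMDization diagnostics
    # (mycode.f90 is a placeholder for your own source file)
    mpxlf90 -O3 -qhot=simd -qarch=440d -qtune=440 -qdebug=diagnostic -o mycode mycode.f90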

  16. Using libraries
  • ESSL
    Version 4.2 is available in /usr/local/apps/lib
  • MASS/MASSV
    Version 4.3 is available in /usr/local/apps/lib
  • FFTW
    Versions 2.1.5 and 3.1.2 are available in both single & double precision; the libraries are located in /usr/local/apps/V1R3
  • NETCDF
    Versions 3.6.0p1 and 3.6.1 are available in /usr/local/apps/V1R3
  • Example link paths
    -Wl,--allow-multiple-definition -L/usr/local/apps/lib -lmassv -lmass -lesslbg -L/usr/local/apps/V1R3/fftw-3.1.2s/lib -lfftw3f
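  Putting the recommended compiler options and the example link paths together, a complete build line might look like the following; myapp.f is a hypothetical source file, and the library flags are exactly the example paths listed above:

    mpxlf90 -O3 -qarch=440 -o myapp myapp.f \
        -Wl,--allow-multiple-definition \
        -L/usr/local/apps/lib -lmassv -lmass -lesslbg \
        -L/usr/local/apps/V1R3/fftw-3.1.2s/lib -lfftw3f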

  17. Running jobs: Overview
  • There are two compute modes
  • Coprocessor (CO) mode: one compute processor per node
  • Virtual node (VN) mode: two compute processors per node
  • Jobs run in partitions or blocks
  • These are typically powers of two
  • Blocks must be allocated (or booted) before a run & are restricted to a single user at a time
  • Only batch jobs are supported
  • Batch jobs are managed by LoadLeveler
  • Users can monitor jobs using llq -b & llq -x

  18. Running jobs: LoadLeveler for batch jobs
  • Here is an example LoadLeveler run script (test.cmd)
    #!/usr/bin/ksh
    #@ environment = COPY_ALL;
    #@ job_type = BlueGene
    #@ class = parallel
    #@ bg_partition = <partition name; for example: top>
    #@ output = file.$(jobid).out
    #@ error = file.$(jobid).err
    #@ notification = complete
    #@ notify_user = <your email address>
    #@ wall_clock_limit = 00:10:00
    #@ queue
    mpirun -mode VN -np <number of procs> -exe <your executable> -cwd <working directory>
  • Submit as follows: llsubmit test.cmd
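  For concreteness, here is the same script with hypothetical values filled in: the predefined top partition, 1,024 processes in VN mode on its 512 nodes, and a made-up user jdoe with executable /users/jdoe/hello. All user-specific values are placeholders.

    #!/usr/bin/ksh
    #@ environment = COPY_ALL;
    #@ job_type = BlueGene
    #@ class = parallel
    #@ bg_partition = top
    #@ output = file.$(jobid).out
    #@ error = file.$(jobid).err
    #@ notification = complete
    #@ notify_user = jdoe@example.edu
    #@ wall_clock_limit = 00:10:00
    #@ queue
    mpirun -mode VN -np 1024 -exe /users/jdoe/hello -cwd /users/jdoe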

  19. Running jobs: mpirun options
  • Key mpirun options are
    -mode     compute mode: CO or VN
    -connect  connectivity: TORUS or MESH
    -np       number of compute processors
    -mapfile  logical mapping of processors
    -cwd      full path of current working directory
    -exe      full path of executable
    -args     arguments of executable (in double quotes)
    -env      environment variables (in double quotes)
    (These are mostly different than for TeraGrid)
  • See the mpirun user’s manual for syntax

  20. Running jobs: mpirun options
  • -mode may be CO (default) or VN
  • Generally you must specify a partition to run in VN mode
  • For a given number of nodes, VN mode is usually faster than CO mode
  • Memory per processor in VN mode is half that of CO mode
  • Recommendation: use VN mode unless there is not enough memory
  • -connect may be TORUS or MESH
  • Option only applies if -partition is not specified (MESH is the default)
  • Performance is generally better with TORUS
  • Recommendation: use a predefined partition, which ensures TORUS

  21. Running jobs: mpirun options
  • -np gives the number of processors
  • Must fit in the available partition
  • -mapfile gives the logical mapping of processors
  • Can improve MPI performance in some cases
  • Can be used to ensure VN mode for 2*np ≤ partition size
  • Can be used to change the ratio of compute nodes to I/O nodes
  • Recommendation: contact SDSC if you want to use a mapfile, since no documentation is available
  • -cwd gives the current working directory
  • Needed if there is an input file
  • An example mpirun line combining these options is shown below
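  For illustration only, a complete mpirun line combining these options with hypothetical paths and arguments (a 256-processor CO-mode run whose executable takes an input file name and an iteration count):

    mpirun -mode CO -np 256 \
           -cwd /bggpfs/jdoe/run1 \
           -exe /users/jdoe/bin/myapp \
           -args "input.dat 100"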

  22. Running jobs: Partition Layout and Usage Guidelines
  • To make effective use of the Blue Gene, production runs should generally use one-fourth or more of the machine, i.e., 256 or more compute nodes. The following predefined partitions are therefore provided for production runs.
  • rack: all 1,024 nodes
  • top & bot: 512 nodes each
  • top256-1 & top256-2: 256 nodes in each half of the top midplane
  • bot256-1 & bot256-2: 256 nodes in each half of the bottom midplane
  • Smaller 64-node (bot64-1, …, bot64-8) and 128-node (bot128-1, …, bot128-4) partitions are available for test runs.
  • The partition is selected in the LoadLeveler script, as in the example below.
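  For example, keeping the rest of the script from slide 18 unchanged, a 256-node production run in the first half of the top midplane would use

    #@ bg_partition = top256-1

  while a small test run could instead use one of the 64-node partitions, e.g.

    #@ bg_partition = bot64-1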

  23. Running jobs: Partition Layout and Usage Guidelines

  24. BG System Overview: References
  • Blue Gene Web site at SDSC
    http://www.sdsc.edu/user_services/bluegene
  • LoadLeveler guide
    http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp?topic=/com.ibm.cluster.loadl.doc/loadl331/am2ug30305.html
  • Blue Gene application development guide (from IBM Redbooks)
    http://www.redbooks.ibm.com/abstracts/sg247179.html
