Introduction to Scientific Computing on the IBM SP and Regatta


  1. Introduction to Scientific Computing on the IBM SP and Regatta Doug Sondak sondak@bu.edu

  2. Outline • Friendly Users (access to Regatta) • hardware • batch queues (LSF) • compilers • libraries • MPI • OpenMP • debuggers • profilers and hardware counters

  3. Friendly Users

  4. Friendly Users • Regatta • not presently open to general user community • will be open to a small number of “friendly users” to help us make sure everything’s working ok

  5. Friendly Users (cont’d) • Friendly-user rules 1. We expect the friendly-user period to last 4-6 weeks 2. No charge for CPU time! 3. Must have “mature” code • code must currently run (we don’t want to test how well the Regatta runs emacs!) • serial or parallel

  6. Friendly Users (3) • Friendly-user rules (cont’d) 4. We want feedback! • What did you encounter that prevented porting your code from being a “plug and play” operation? • If it was not obvious to you, it was not obvious to some other users! 5. Timings are required for your code • use time command • report wall-clock time • web-based form for reporting results
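The required timing can be collected with the shell's time command; a sketch (mycode, myin, and myout are placeholder names):

```shell
# "real" in the output is the wall-clock time to report
time ./mycode < myin > myout
```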

  7. Friendly Users (4) • Friendly-user application and report form: • first go to SP/Regatta repository: http://scv.bu.edu/SCV/IBMSP/ • click on Friendly Users link at bottom of menu on left-hand side of page • timings required for the Regatta and either the O2k or SP (both would be great!)

  8. Hardware

  9. Hal (SP) • Power3 processors • 375 MHz • 4 nodes • 16 processors each • shared memory on each node • 8GB memory per node • presently can use up to 16 procs.

  10. Hal (cont’d) • L1 cache • 64 KB • 128-byte line • 128-way set associative • L2 cache • 4 MB • 128-byte line • direct-mapped (“1-way” set assoc.)

  11. Twister (Regatta) • Power4 processors • 1.3 GHz • 2 CPUs per chip (interesting!) • 3 nodes • 32 processors each • shared memory on each node • 32GB memory per node • presently can use up to 32 procs.

  12. Twister (cont’d) • L1 cache • 32 KB per proc. (64 KB per chip) • 128-byte line • 2-way set associative

  13. Twister (3) • L2 cache • 1.41 MB • shared by both procs. on a chip • 128-byte line • 4-to-8 way set associative • unified • data, instructions, page table entries

  14. Twister (4) • L3 cache • 128 MB • off-chip • shared by 8 procs. • 512-byte “blocks” • coherence maintained at 128-bytes • 8-way set associative

  15. Batch Queues

  16. Batch Queues • LSF batch system • bqueues gives a list of queues:

  QUEUE_NAME  PRIO  STATUS       MAX  JL/U  JL/P  JL/H  NJOBS  PEND  RUN  SUSP
  p4-mp32       10  Open:Active    1     1     -     -      0     0    0     0
  p4-mp16        9  Open:Active    2     1     -     -      0     0    0     0
  p4-short       8  Open:Active    2     1     -     -      0     0    0     0
  p4-long        7  Open:Active   16     5     -     -      0     0    0     0
  sp-mp16        6  Open:Active    2     1     -     1      2     1    1     0
  sp-mp8         5  Open:Active    2     1     -     -      1     0    1     0
  sp-long        4  Open:Active    8     2     -     -     20    12    8     0
  sp-short       3  Open:Active    2     1     -     -      0     0    0     0
  graveyard      2  Open:Inact     -     -     -     -      0     0    0     0
  donotuse       1  Open:Active    -     -     -     -      0     0    0     0

  17. Batch Queues (cont’d) • p4 queues are on the Regatta • sp queues are on the SP (surprise!) • “long” and “short” queues are serial • for details see http://scv.bu.edu/SCV/scf-techsumm.html • will not include Regatta info. until it’s open to all users • bsub to submit job • bjobs to monitor job
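A minimal LSF submission script might look like the following sketch (the queue name is taken from the bqueues listing above; the executable and file names are placeholders):

```shell
#!/bin/tcsh
#BSUB -q sp-short      # serial queue on the SP
#BSUB -o myjob.out     # file for standard output
mycode < myin > myout
```

Submit with bsub < myscript, then monitor it with bjobs.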

  18. Compilers

  19. Compiler Names • AIX uses different compiler names to perform some tasks that are handled by compiler flags on many other systems • parallel compiler names differ for SMP, message-passing, and combined parallelization methods

  20. Compilers (cont’d)

              Serial   MPI       OpenMP    Mixed
  Fortran 77  xlf      mpxlf     xlf_r     mpxlf_r
  Fortran 90  xlf90    mpxlf90   xlf90_r   mpxlf90_r
  Fortran 95  xlf95    mpxlf95   xlf95_r   mpxlf95_r
  C           cc       mpcc      cc_r      mpcc_r
              xlc      mpxlc     xlc_r     mpxlc_r
  C++         xlC      mpCC      xlC_r     mpCC_r

  gcc and g++ are also available
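Illustrative compile lines, one per column of the table (file names are placeholders):

```shell
xlf90   -O3 mycode.f            # serial Fortran 90
mpcc    -O3 mycode.c            # C with MPI
xlf90_r -O3 -qsmp=omp mycode.f  # Fortran 90 with OpenMP
mpcc_r  -O3 -qsmp=omp mycode.c  # mixed MPI + OpenMP C
```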

  21. Compilers (3) • xlc default flags
  -qalias=ansi    optimizer assumes that pointers can only point to an object of the same type (potentially better optimization)
  -qlanglvl=ansi  ANSI C
  -qro            string literals (e.g., char *p = "mystring";) are placed in "read-only" memory (the text segment) and cannot be modified

  22. Compilers (4) • xlc default flags (cont’d) -qroconst • constants placed in read-only memory

  23. Compilers (5) • cc default flags
  -qalias=extended    optimizer assumes that pointers may point to any object whose address is taken, regardless of type (potentially weaker optimization)
  -qlanglvl=extended  extended (not ANSI) C; "compatibility with the RT compiler and classic language levels"
  -qnoro              string literals (e.g., char *p = "mystring";) can be modified; may use more memory than -qro

  24. Compilers (6) • cc default flags (cont’d) • -qnoroconst • constants not placed in read-only memory

  25. Default Fortran Suffixes

  xlf      .f
  xlf90    .f
  f90      .f90
  xlf95    .f
  f95      .f95
  mpxlf    .f
  mpxlf90  .f90
  mpxlf95  .f

  Same except for suffix

  26. Compiler flags • Specify the source file suffix: -qsuffix=f=f90 (lets you use xlf90 with the .f90 suffix) • 64-bit: -q64 • use it if you need more than 2GB

  27. flags cont’d • Presently a foible on twister (Regatta) • if compiling with -q64 and using MPI, must compile with mp…_r compiler, even if you’re not using SMP parallelization
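For example, a 64-bit MPI build on twister would use one of the mp..._r compilers even for pure message-passing code (file name is a placeholder):

```shell
mpxlf90_r -q64 -O3 mycode.f
```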

  28. flags (3) • IBM optimization levels
  -O   basic optimization
  -O2  same as -O
  -O3  more aggressive optimization
  -O4  even more aggressive optimization; optimizes for the current architecture; IPA
  -O5  aggressive IPA

  29. flags (4) • If using -O3 or below, you can optimize for the local hardware (done automatically for -O4 and -O5):
  -qarch=auto   optimize for the resident architecture
  -qtune=auto   optimize for the resident processor
  -qcache=auto  optimize for the resident cache

  30. flags (5) • If you’re using IPA and you get warnings about partition sizes, try -qipa=partition=large • default data segment limit 256MB • data segment contains static, common, and allocatable variables and arrays • can increase limit to a maximum of 2GB with 32-bit compilation -bmaxdata:0x80000000 • can use more than 2GB data with -q64
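Example link lines for the two approaches to large data segments (file names are placeholders):

```shell
xlf90 -O3 -bmaxdata:0x80000000 mycode.f  # 32-bit, data segment limit raised to 2GB
xlf90 -O3 -q64 mycode.f                  # 64-bit, for more than 2GB of data
```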

  31. flags (6) • -O5 does not include function inlining • function inlining flags:
  -Q                  compiler decides which functions to inline
  -Q+func1:func2      inline only the specified functions
  -Q -Q-func1:func2   let the compiler decide, but do not inline the specified functions
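A sketch of combining -O5 with inlining control (the function names saxpy, dot, and solver are illustrative):

```shell
xlf90 -O5 -Q mycode.f             # compiler chooses inline candidates
xlf90 -O5 -Q+saxpy:dot mycode.f   # inline only saxpy and dot
xlf90 -O5 -Q -Q-solver mycode.f   # compiler chooses, but never inlines solver
```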

  32. Libraries

  33. Scientific Libraries • Contain • Linear Algebra Subprograms • Matrix Operations • Linear Algebraic Equations • Eigensystem Analysis • Fourier Transforms, Convolutions and Correlations, and Related Computations • Sorting and Searching • Interpolation • Numerical Quadrature • Random Number Generation

  34. Scientific Libs. Cont’d • Documentation - go to IBM Repository: http://scv.bu.edu/SCV/IBMSP/ • click on Libraries • ESSLSMP • for use with “SMP processors” (that’s us) • some serial, some parallel • parallel versions use multiple threads • thread safe; serial versions may be called within multithreaded regions (or on single thread) • link with -lesslsmp

  35. Scientific Libs. (3) • PESSLSMP • message-passing (MPI, PVM) -lpesslsmp -lesslsmp -lblacssmp

  36. Fast Math • MASS library • Mathematical Acceleration SubSystem • faster versions of some Fortran intrinsic functions • sqrt, rsqrt, exp, log, sin, cos, tan, atan, atan2, sinh, cosh, tanh, dnint, x**y • work with Fortran or C • differ from standard functions in last bit (at most)

  37. Fast Math (cont’d) • simply link to the MASS library: Fortran: -lmass C: -lmass -lm • sample approximate speedups:
  exp           2.4
  log           1.6
  sin           2.2
  complex atan  4.7

  38. Fast Math (3) • Vector routines offer even more speedup, but require minor code changes • link to -lmassv • subroutine calls • prefix name with vs for 4-byte reals (single precision) and v for 8-byte reals (double precision)

  39. Fast Math (4) • example: single-precision exponential call vsexp(y,x,n) • x is the input vector of length n • y is the output vector of length n • sample speedups:
                single  double
  exp             9.7     6.7
  log            12.3    10.4
  sin            10.0     9.8
  complex atan   16.7    16.5

  40. Fast Math (5) • For details see the following file on hal: file:/usr/lpp/mass/MASS.readme

  41. MPI

  42. MPI • MPI works differently on IBM than on other systems • first compile code using compiler with mp prefix, e.g., mpcc • this automatically links to MPI libraries; do not use -lmpi

  43. POE • Parallel Operating Environment • controls parallel operation, including running MPI code

  44. Running MPI Code • Do not use mpirun! • poe mycode -procs 4 • file re-direction: poe mycode < myin > myout -procs 4 • note: no quotes • a useful flag: -labelio yes labels output with the process number (0, 1, 2, …) • alternatively, setenv MP_LABELIO yes
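Putting the pieces together, an MPI run under POE might look like this sketch (the executable and file names are placeholders):

```shell
setenv MP_LABELIO yes               # label output lines with process numbers
poe mycode < myin > myout -procs 4  # 4 MPI processes; note: no quotes
```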

  45. OpenMP

  46. SMP Compilation • OpenMP • append the compiler name with _r and use the -qsmp=omp flag SGI: f77 -mp mycode.f IBM: xlf_r -qsmp=omp mycode.f • Automatic parallelization SGI: f77 -apo mycode.f IBM: xlf_r -qsmp mycode.f

  47. SMP Compilation cont’d • Listing files for auto-parallelization SGI: f77 -apo list mycode.f IBM: xlf_r -qsmp -qreport=smplist mycode.f

  48. SMP Environment • Per-thread stack limit • default 4MB • can be increased by using environment variable setenv XLSMPOPTS $XLSMPOPTS\:stack=size where size is the new size limit in bytes
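For example, to raise the per-thread stack limit to 64 MB (the size is given in bytes; 67108864 here is an illustrative value):

```shell
setenv XLSMPOPTS $XLSMPOPTS\:stack=67108864
```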

  49. Running SMP • Running is the same as on other systems, e.g., #!/bin/tcsh setenv OMP_NUM_THREADS 4 mycode < myin > myout exit

  50. OpenMP functions • On IBM, must declare OpenMP Fortran functions integer OMP_GET_NUM_THREADS (not necessary on SGI)