
VTF Applications Performance and Scalability

Sharon Brunett

CACR/Caltech

ASCI Site Review

October 28-29, 2003

ASCI Platform Specifics
  • LLNL’s IBM SP3 (Frost)
    • 65-node SMP, 375 MHz Power3 Nighthawk-2 (16 CPUs/node)
    • 16 GB memory/node
    • ~20 TB global parallel file system
    • SP Switch2 (Colony) interconnect
      • 2 GB/sec bi-directional node-to-node bandwidth
  • LANL’s HP/Compaq AlphaServer ES45 (QSC)
    • 256-node SMP, 1.25 GHz Alpha EV6 (4 CPUs/node)
    • 16 GB memory/node
    • ~12 TB global file system
    • Quadrics (QsNet) interconnect
      • 2 μs latency
      • 300 MB/sec bandwidth
Multiscale Polycrystal Studies
  • Quantitative assessment of microstructural effects in macroscopic material response through the computation of full-field solutions of polycrystals
  • Inhomogeneous plastic deformation fields
  • Grain-boundary effects:
    • Stress concentration
    • Dislocation pile-up
    • Constraint-induced multislip
  • Size dependence: (inverse) Hall-Petch effect
  • Resolve (as opposed to model) mesoscale behavior, exploiting the power of high-performance computing
  • Enable full-scale simulation of engineering systems incorporating micromechanical effects.
Mesh Generation
  • In-grain subdivision behavior can be simulated in both single crystals and polycrystals.
    • Texture simulation results agree well with experimental results
  • The mesh generation method preserves the topology of individual grain shapes
    • Enables effective interactions between grains
  • Increasing the grain count in polycrystals gives a more stable mechanical response.

Single grain corresponding to a single cell in a crystal

1.5 Million Element, 1241 Grain Multiscale Polycrystal Simulation

Simulation carried out on 1024 processors of LLNL’s IBM SP3, Frost

Multiscale Polycrystal Performance
  • Aggregate parallel performance
    • LANL’s QSC
      • Floating point operations 10.67% of peak
      • Integer operations 15.39% of peak
      • Memory operations 22.08% of peak
        • DCPI hardware counters used to collect data
        • Qopcounter tool used to analyze DCPI database
    • LLNL’s Frost
      • L1 cache hit rate 98%
        • Load/store instructions executed w/o main memory access
      • Load Store Unit idle 36%
      • Floating point operations 4.47% of peak (see the sketch after this list)
        • Hpmcount tool used to count hardware events during program execution
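
The percent-of-peak figures above are derived from raw hardware-counter output (hpmcount on Frost, DCPI on QSC). Below is a minimal C sketch of that arithmetic only, not of the counting itself; the FLOP count, wall-clock time, and the assumed 1.5 GFLOP/s per-CPU peak (375 MHz Power3 issuing two fused multiply-adds per cycle) are illustrative values, not numbers from these runs.

    /* Illustrative sketch: deriving a "% of peak" figure from counter
     * output.  All numbers below are assumed, not taken from the runs. */
    #include <stdio.h>

    int main(void)
    {
        double flops_counted = 1.2e12;  /* hypothetical FLOP count from hpmcount */
        double elapsed_sec   = 1.8e4;   /* hypothetical wall-clock time (s)      */
        double peak_per_cpu  = 1.5e9;   /* assumed peak: 375 MHz * 4 flops/cycle */

        double achieved = flops_counted / elapsed_sec;
        printf("achieved %.1f MFLOP/s = %.2f%% of peak\n",
               achieved / 1e6, 100.0 * achieved / peak_per_cpu);
        return 0;
    }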
Multiscale Polycrystal Performance II
  • MPI routines can consume ~30% of runtime for large runs on Frost
    • Workload imbalance as grains are distributed across nodes
      • MPI_Waitall called every step dominates communication time (see the exchange sketch after this list)
        • Nearest-neighbor sends take longer from nodes with computationally heavy grains
    • Routines taking the most CPU time on QSC
      • resolved_fcc_cuitino 18.85%
      • upslip_fcc_cuitino_explicit 11.74%
      • setafcc 9.16%
      • matvec 8.5%
        • ~50% of execution time spent in these 4 routines
    • Room for performance improvement with better load balancing and routine level optimization
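
To make the MPI_Waitall cost concrete, here is a minimal C sketch of a nonblocking nearest-neighbor exchange of the kind described above; the function and buffer names are hypothetical, not taken from the VTF source. Because every rank waits on all of its neighbors each step, a rank whose neighbor owns computationally heavy grains sits in MPI_Waitall until that neighbor catches up, so load imbalance is charged to MPI time.

    /* Hypothetical sketch, not VTF source: nonblocking nearest-neighbor
     * exchange ending in the MPI_Waitall that dominates the profile. */
    #include <mpi.h>
    #include <stdlib.h>

    void exchange_boundaries(double *sendbuf, double *recvbuf, int count,
                             const int *neighbors, int nneighbors,
                             MPI_Comm comm)
    {
        MPI_Request *reqs = malloc(2 * nneighbors * sizeof(MPI_Request));

        for (int i = 0; i < nneighbors; ++i) {
            MPI_Irecv(recvbuf + i * count, count, MPI_DOUBLE,
                      neighbors[i], 0, comm, &reqs[2 * i]);
            MPI_Isend(sendbuf + i * count, count, MPI_DOUBLE,
                      neighbors[i], 0, comm, &reqs[2 * i + 1]);
        }
        /* Blocks until the slowest neighbor has sent: the wait time the
         * profile attributes to MPI_Waitall. */
        MPI_Waitall(2 * nneighbors, reqs, MPI_STATUSES_IGNORE);
        free(reqs);
    }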
Scaling for Polycrystalline Copper in a Shear Compression Specimen Configuration

[Scaling plot: element counts, LANL’s HP/Compaq QSC system]

3D Converging Shock Simulations in a Wedge
  • 1024 processor ASCI Frost run of a converging shock. The interface is nominally a 2D ellipse perturbed with a prescribed spectrum and randomized phases.
    • The 2D elliptical interface is computed using local shock polar analysis to yield a perfectly circular transmitted shock
  • Resolution: 2000x400x400, with over 1 TB of data generated (see the arithmetic sketch below).
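
A rough sense of where the terabyte comes from, as a minimal C sketch; the field count and number of output dumps are assumptions for illustration, not values from the run.

    /* Back-of-the-envelope sketch; field and dump counts are assumed. */
    #include <stdio.h>

    int main(void)
    {
        double cells  = 2000.0 * 400.0 * 400.0;  /* 3.2e8 grid cells         */
        double fields = 5.0;                     /* assumed variables/cell   */
        double dumps  = 100.0;                   /* assumed number of dumps  */
        double bytes  = cells * fields * dumps * sizeof(double);

        printf("total output: %.2f TB\n", bytes / 1e12);  /* ~1.3 TB */
        return 0;
    }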

[Figure: density and pressure fields]

Density Field in a 3D Wedge

Density field in the wedge. The transmitted shock front appears to be stable, while the gas interface is Richtmyer-Meshkov unstable. The simulation took place on 1024 processors of LLNL’s IBM SP3, Frost, with a 2000x400x400 initial grid.

Wedge3D Performance on LLNL’s IBM SP3, Frost
  • Aggregate parallel performance for 1400x280x280 grid
    • LLNL’s Frost
      • Floating point operations 5.8 to 10% of peak, depending on node
        • Hpmcount tool used to count hardware events during program execution
    • Most time consuming communication calls
      • MPI_Wait() and MPI_Allreduce
      • Accounting for 3 to 30% of runtime on a 128-way run
        • 175x70x70 grid per processor (see the decomposition sketch after this list)
        • Occasional high MPI time on a few nodes seems to be caused by system daemons competing for resources
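
The 175x70x70 per-processor block is consistent with a uniform 8x4x4 split of the 1400x280x280 grid. The minimal C sketch below reproduces that arithmetic with MPI_Dims_create; whether Wedge3D actually partitions its grid this way is an assumption.

    /* Sketch: per-rank block size under an assumed uniform 3-D
     * decomposition; not necessarily how Wedge3D partitions its grid. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int global[3] = {1400, 280, 280};
        int dims[3]   = {0, 0, 0};

        MPI_Init(&argc, &argv);
        MPI_Dims_create(128, 3, dims);   /* balanced factorization, e.g. 8x4x4 */
        printf("%dx%dx%d ranks -> local block %dx%dx%d\n",
               dims[0], dims[1], dims[2],
               global[0] / dims[0], global[1] / dims[1], global[2] / dims[2]);
        MPI_Finalize();
        return 0;
    }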
Fragmentation 2D Scaling on LANL’s HP/Compaq, QSC

[Scaling plots: levels of subdivision, 450K to 1.1M elements; curves for the 450K, 61K -> 915K, and 85K -> 1.1M element cases]

Fragmentation 2D Performance on LANL’s HP/Compaq, QSC
  • Procedures with highest CPU cycle consumption
    • element_driver 14.9%
    • assemble 13.9%
    • NewNeohookean 8.12%
      • 16 processor run with 2 levels of subdivision (60K elements)
      • Dcpiprof tool used to profile the run
      • Problems processing DCPI database FLOP rates for large runs
        • Reported to LANL support
        • Small runs yield 3% of FLOP peak
    • Only ~10% spent in fragmentation routines!
  • Much room for improvement in our I/O performance when dumping to the parallel file system (/scratch[1,2]); see the sketch below
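
As a point of reference for the I/O item above, here is a minimal C sketch of a collective MPI-IO dump to a shared file, one standard way to improve write performance on a parallel file system. The file name, buffer, and the assumption that every rank writes the same number of values are hypothetical; this is not the fragmentation code’s actual I/O path.

    /* Hypothetical sketch: collective dump to a shared file via MPI-IO.
     * Assumes every rank writes the same number of doubles. */
    #include <mpi.h>

    void write_dump(const double *local, int nlocal, MPI_Comm comm)
    {
        int rank;
        MPI_File fh;
        MPI_Offset offset;

        MPI_Comm_rank(comm, &rank);
        offset = (MPI_Offset)rank * nlocal * sizeof(double);

        MPI_File_open(comm, "/scratch1/frag2d_dump.bin",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        /* All ranks participate, so the MPI-IO layer can aggregate the
         * requests into large, well-aligned writes. */
        MPI_File_write_at_all(fh, offset, local, nlocal,
                              MPI_DOUBLE, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);
    }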