Analysis and Performance Results of a Molecular Modeling Application on Merrimac

Presentation Transcript


  1. Analysis and Performance Results of a Molecular Modeling Application on Merrimac Mattan Erez, Jung Ho Ahn, Ankit Garg, William J. Dally, Eric Darve (Stanford Univ.) Presented by Jiahua He

  2. Content • Background • Motivation • Merrimac Architecture • Application: StreamMD • Performance Evaluation • Conclusions and Discussions

  3. Parallel Architectures • Flynn’s taxonomy: SISD (sequential machine), SIMD, MIMD, MISD (no commercial system) • SIMD: processor-array machines, single-processor vector machines • MIMD: PVP, SMP, DSM, MPP, cluster

  4. Processor-Array Machine • [Figure: a control unit fetches, decodes, and broadcasts a single stream of instructions to an array of processing elements] • The control processor issues instructions • All processors in the array execute them in lock-step • Distributed memory • Permutation is needed if data are not aligned

  5. Vector Machine • [Figure: vr3 = vr1 + vr2, logically performing #elts adds in parallel] • A processor can do element-wise operations on entire vectors with a single instruction • Dominated the high-performance computer market for about 15 years • Overtaken by MPPs in the 90s • Re-emerging in recent years (Earth Simulator and Cray X1)
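
A minimal sketch of vector-style execution, with numpy arrays standing in for the vector registers in the figure; this is only an illustration of the element-wise model, not any particular machine's instruction set.

```python
# Minimal sketch of vector-style execution, with numpy arrays standing in for
# vector registers: one expression performs #elts additions element-wise.
import numpy as np

vr1 = np.array([1.0, 2.0, 3.0, 4.0])
vr2 = np.array([10.0, 20.0, 30.0, 40.0])
vr3 = vr1 + vr2        # element-wise add over the whole vector
print(vr3)             # [11. 22. 33. 44.]
```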

  6. MPP and Cluster • [Figure: nodes P0 … Pn, each with private memory and a network interface (NI), connected by an interconnect] • Distributed memory: each processor/node has its own private memory; nodes may be SMPs • MIMD: nodes execute different instructions asynchronously • Nodes communicate and synchronize over the interconnection network
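
A minimal sketch of this distributed-memory model, assuming mpi4py is available (the slide does not prescribe any particular library): each rank holds private data and cooperation happens only through explicit communication over the interconnect.

```python
# Minimal sketch of the distributed-memory MIMD model, assuming mpi4py is
# available (run with e.g. `mpiexec -n 4 python this_script.py`).
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

local_value = rank * 10                          # each node's private memory
total = comm.allreduce(local_value, op=MPI.SUM)  # communicate/synchronize over the network
if rank == 0:
    print("sum across all nodes:", total)
```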

  7. Earth Simulator • Vector machines re-emerge • Rmax of about 36 TFLOPS, more than the combined Rmax of the other top-10 systems • Vector machines focus on powerful processors • MPPs and clusters focus on large-scale “clustering” • Trend: merge the two approaches

  8. Content • Background • Motivation • Merrimac Architecture • Application: StreamMD • Performance Evaluation • Conclusions and Discussions

  9. Modern VLSI Technology • Arithmetic is cheap • 100s of GFLOPS/chip today • TFLOPS in 2010 • Bandwidth is expensive • General purpose processor architectures have not adapted to this change

  10. Stream Processor • One control unit and 100s of FPUs • In a 90 nm process, a 64-bit FPU takes about 0.5 mm² and 50 pJ per operation • Deep register hierarchy with high local bandwidth, matching bandwidth demands to technology limits • Stream: a sequence of data objects • Exposes large amounts of data parallelism • Keeps 100s of FPUs per processor busy • Hides long latencies of memory operations

  11. Stream Processor (cont’d) • Exposes multiple levels of locality • Short-term producer-consumer locality (LRF) • Long-term producer-consumer locality (SRF) • Cannot be exploited by caches – no reuse, no spatial locality • Scalable: a 128 GFLOPS processor, a 16-node 2 TFLOPS single-board workstation, a 16,384-node 2 PFLOPS supercomputer in 16 cabinets

  12. Content • Background • Motivation • Merrimac Architecture • Application: StreamMD • Performance Evaluation • Conclusions and Discussions

  13. Merrimac Processor • Scalar core (1): runs control code and issues stream instructions • Arithmetic clusters (16), each with 4 64-bit multiply-accumulate (MADD) FPUs executing the same VLIW instruction • Local register file (LRF) per FPU (192 words): short-term producer-consumer locality within a kernel • Stream register file (SRF) per cluster (8K words): long-term producer-consumer locality across kernels; staging area for memory transfers to hide latency
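
These figures are consistent with the 128 GFLOPS peak quoted on slide 11; a quick back-of-the-envelope check follows (the 1 GHz clock is an assumption about the design, not stated on this slide).

```python
# Back-of-the-envelope peak-rate check for the configuration above
# (the 1 GHz clock is an assumption, not stated on this slide).
clusters = 16
fpus_per_cluster = 4
flops_per_madd = 2          # a multiply-accumulate counts as two floating-point ops
clock_ghz = 1.0

peak_gflops = clusters * fpus_per_cluster * flops_per_madd * clock_ghz
print(peak_gflops, "GFLOPS")   # 128.0 GFLOPS
```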

  14. Architecture of Merrimac

  15. Stream Programming Model • [Figure: streams flowing through kernel1 and kernel2] • Cast the computation as a collection of streams passing through a series of computational kernels • Data parallelism across stream elements • Task parallelism across kernels
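
A minimal sketch of this model in plain Python (an illustration only, not the actual Merrimac toolchain): kernels are per-element functions applied to streams, so data parallelism is across stream elements and kernel2 consumes what kernel1 produces.

```python
# Minimal sketch of the stream programming model in plain Python (an
# illustration, not the actual Merrimac toolchain): kernels are per-element
# functions applied to streams, and kernel2 consumes what kernel1 produces.
def kernel1(x):
    return x * x                  # some per-element computation

def kernel2(y):
    return y + 1.0                # consumes kernel1's output element-wise

input_stream = [0.5, 1.5, 2.5, 3.5]
intermediate = [kernel1(x) for x in input_stream]    # data parallel across elements
output_stream = [kernel2(y) for y in intermediate]   # producer-consumer locality
print(output_stream)
```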

  16. Memory System • A stream memory instruction transfers an entire stream • Address generators (2): 8 single-word addresses every cycle, strided access or gather/scatter patterns • Cache (128K words, 64 GB/s): directly interfaces with external DRAM and the network • External DRAM (2 GB, 38.4 GB/s) • Single-word remote memory accesses • Scatter-add operation
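
The scatter-add semantics can be illustrated with numpy standing in for the hardware operation: values destined for the same address are accumulated rather than overwriting each other.

```python
# Illustrative scatter-add semantics, with numpy standing in for the hardware
# operation: values sent to the same address are summed, not overwritten.
import numpy as np

mem = np.zeros(4)
addresses = np.array([0, 2, 2, 3])         # note the collision at address 2
values = np.array([1.0, 2.0, 3.0, 4.0])

np.add.at(mem, addresses, values)          # accumulate per address
print(mem)                                 # [1. 0. 5. 4.]
```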

  17. Interconnection Network (Fat Tree)

  18. Content • Background • Motivation • Merrimac Architecture • Application: StreamMD • Performance Evaluation • Conclusions and Discussions

  19. Molecular Dynamics • Explore kinetic and thermodynamic properties of molecular systems by simulating atomic models • Typical systems: water (water molecules), a protein surrounded by water molecules • GROMACS: fastest MD code available • Cut-off distance approximation • Neighbor list (neighbors within rc)
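
A minimal sketch of the cut-off/neighbor-list idea (a brute-force O(N²) build for illustration; the box size and molecule count are arbitrary choices, and GROMACS builds these lists far more efficiently):

```python
# Minimal cut-off / neighbor-list sketch (brute-force O(N^2) build, purely
# illustrative; the box size and molecule count are arbitrary choices).
import numpy as np

rc = 1.0                                     # cut-off radius
positions = np.random.rand(900, 3) * 10.0    # e.g. 900 molecules in a 10x10x10 box

neighbor_list = []
for i in range(len(positions)):
    d = np.linalg.norm(positions - positions[i], axis=1)
    mask = (d < rc) & (np.arange(len(positions)) > i)   # only j > i: store each pair once
    neighbor_list.append(np.where(mask)[0])
```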

  20. StreamMD • Single kernel: non-bonded interactions between all atom pairs of a molecule and one of its neighbors • Pseudo code:
      c_positions = gather(positions, i_central);
      n_positions = gather(positions, i_neighbor);
      partial_forces = compute_force(c_positions, n_positions);
      forces = scatter_add(partial_forces, i_forces);
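
A runnable stand-in for this pipeline, with numpy indexing and np.add.at playing the roles of the stream gather and scatter-add; compute_force here is a toy pair force, not GROMACS’s water-water interaction, and splitting the scatter-add into a central and a neighbor update is an illustrative choice.

```python
# Runnable stand-in for the StreamMD kernel pipeline (numpy in place of the
# stream hardware; compute_force is a toy law, not GROMACS's water-water kernel).
import numpy as np

def compute_force(c_pos, n_pos):
    r = c_pos - n_pos                          # displacement for each pair
    d2 = np.sum(r * r, axis=1, keepdims=True)  # squared distance
    return r / d2**2                           # toy pair force (magnitude ~ 1/d^3)

positions = np.random.rand(900, 3)             # e.g. the 900-molecule water system
i_central = np.array([0, 0, 1, 2])             # central molecule of each interaction
i_neighbor = np.array([1, 2, 3, 3])            # its neighbor in that interaction

c_positions = positions[i_central]             # gather
n_positions = positions[i_neighbor]            # gather
partial_forces = compute_force(c_positions, n_positions)

forces = np.zeros_like(positions)
np.add.at(forces, i_central, partial_forces)   # scatter-add onto central molecules
np.add.at(forces, i_neighbor, -partial_forces) # equal and opposite onto neighbors
```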

  21. Latency Tolerance • Pipeline the requests: amortize the long initial latency by issuing memory operations on long streams • Hide memory ops with computation: memory operations and kernel computations execute concurrently • Strip-mining: split a large data set into smaller strips and process them with a manually written outer loop (see the sketch below)
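
A minimal strip-mining sketch (illustrative only; the strip size is an assumption, and the overlap of the next strip's load with the current kernel is not modeled, only the manual outer loop):

```python
# Minimal strip-mining sketch: a large data set is processed in SRF-sized
# strips by a manually written outer loop (the overlap of the next strip's
# load with the current kernel is not modeled here).
import numpy as np

data = np.random.rand(1_000_000)
STRIP = 8192                                      # strip small enough to fit in the SRF

result = np.empty_like(data)
for start in range(0, len(data), STRIP):          # manual outer loop over strips
    strip = data[start:start + STRIP]             # "load" one strip
    result[start:start + STRIP] = np.sqrt(strip)  # kernel computation on the strip
```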

  22. Parallelism • 4 variants to exploit parallelism • Also implemented on Pentium 4 for comparison

  23. “Expanded” Variant • Simplest version • Fully expand the interaction list • For each cluster per iteration • Read 2 interacting molecules • Produce 2 partial forces

  24. “Fixed” Variant • Fixed-length neighbor list of length L • For each cluster: read a central molecule once every L iterations, read a neighbor molecule every iteration • Partial forces of the central molecule are reduced within the cluster • Repeat the central molecule in i_central; add dummy_neighbor entries to i_neighbor if needed (see the sketch below)
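
An illustrative sketch of how the fixed-length index streams might be built; the value of L, the dummy index, and the example neighbor lists are made up, and only the padding/repetition pattern is the point.

```python
# Illustrative construction of the fixed-length index streams: each neighbor
# list is split into chunks of length L, the last chunk is padded with a dummy
# neighbor, and the central index is repeated once per slot. (L, the dummy
# index, and the example lists are made-up values.)
L = 8
DUMMY = -1

neighbor_lists = {0: [3, 7, 9], 1: [2, 3, 4, 5, 6, 8, 9, 10, 11], 2: [5, 6]}

i_central, i_neighbor = [], []
for central, neighbors in neighbor_lists.items():
    for k in range(0, len(neighbors), L):
        chunk = neighbors[k:k + L]
        chunk = chunk + [DUMMY] * (L - len(chunk))   # pad with dummy_neighbor entries
        i_central += [central] * L                   # repeat the central molecule
        i_neighbor += chunk
```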

  25. “Variable” Variant • Variable-length neighbor list • Each cluster processes inputs and produces outputs at a different rate • Relies on Merrimac’s inter-cluster communication, conditional-streams mechanism, and indexable SRF • Instructions to read a new central position and write partial forces are issued on every iteration, but execute under a condition • Slight overhead from unexecuted instructions

  26. “Duplicated” Variant • Fixed-length neighbor list • Duplicate all interaction calculations • Reduce complete force for central molecule within cluster • No partial force for neighbor molecule is written out

  27. Locality • Only short-term producer-consumer locality within a single kernel: computing partial forces, internal reduction of forces within a cluster • Computation/bandwidth trade-off • Extra computation for interactions with dummy molecules: “fixed” variant • Extreme case: “duplicated” variant • More sophisticated schemes are needed (discussed later)

  28. Content • Background • Motivation • Merrimac Architecture • Application: StreamMD • Performance Evaluation • Conclusions and Discussions

  29. Experiment Setup • Single-node experiments • 900 water-molecule system • Cycle-accurate simulator of Merrimac • 4 variants of StreamMD • Pentium 4 version: latest version of GROMACS, fully hand-optimized, single-precision SSE

  30. Latency Tolerance • Snippet of the execution of the “duplicated” variant (left column: kernel computations, right column: memory operations) • Perfect overlap of memory and computation

  31. Locality • Arithmetic intensities: “fixed” and “variable” depend on the data set; the small difference shows the compiler efficiently utilizes the register hierarchy • Reference percentages: nearly all references go to the LRFs; the small difference shows the SRF is used just as a staging area for memory

  32. Performance • “variable” outperforms “expanded” by 84%, “fixed” by 26%, “duplicated” by 119%, and the Pentium 4 by a factor of 13.2 • Its 38.8 GFLOPS is 50% of the optimal solution

  33. Automatic Optimizations • Communication scheduling: the SRF decouples memory from computation; loop unrolling and software pipelining improve the execution rate by 83% • Stream scheduling: the SRF is software-managed; long-term producer-consumer locality is captured by intelligent eviction

  34. Computation/bandwidth Trade-off • Blocking technique: group molecules into cubic clusters of volume r³ and pave the cut-off sphere of radius rc with these clusters • Memory bandwidth requirement scales as O(r⁻³) • Extra computation for interactions between rc and rc + 2√3·r • Minimum occurs at about 3 molecules per cluster (≈1.4³)
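
An illustrative model of this trade-off (the water-like density and the sample cluster sizes are assumptions, not the paper's numbers): larger clusters reduce memory traffic per interaction roughly as r⁻³, but add wasted pair computations in the shell between rc and rc + 2√3·r, so an intermediate cluster size minimizes the combined cost.

```python
# Illustrative model of the blocking trade-off (the density and cluster sizes
# are assumptions, not the paper's numbers): traffic per interaction falls as
# ~r^-3 while wasted computation in the rc .. rc + 2*sqrt(3)*r shell grows.
import math

rc = 1.0                        # cut-off radius (arbitrary units)
density = 33.0                  # molecules per unit volume (water-like, assumed)

for r in [0.2, 0.3, 0.45, 0.6, 0.9]:                     # cluster edge lengths
    per_cluster = density * r**3                         # molecules per cubic cluster
    rel_traffic = 1.0 / per_cluster                      # memory words per interaction (relative)
    rel_compute = ((rc + 2 * math.sqrt(3) * r) / rc)**3  # pair computations (relative)
    print(f"r={r:.2f}  molecules/cluster={per_cluster:5.1f}  "
          f"traffic={rel_traffic:6.3f}  compute={rel_compute:6.2f}")
```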

  35. Content • Background • Motivation • Merrimac Architecture • Application: StreamMD • Performance Evaluation • Conclusions and Discussions

  36. Conclusions • Reviewed the architecture and organization of Merrimac • Presented the StreamMD application, implemented 4 variants, and evaluated their performance • Compared Merrimac’s suitability for molecular dynamics applications against a conventional Pentium 4 processor

  37. Special Applications? • Merrimac is tuned for scientific applications • Programming model: a collection of streams passes through a series of computational kernels • Needs large data-level parallelism to utilize the FPUs • Task parallelism can only be exploited across nodes because of the SIMD execution model

  38. Easy to Program? • Effective automatic compilation: communication scheduling and stream scheduling (shown earlier) • Highly optimized code for conventional processors is often written in assembly • Performance of the different StreamMD variants varies only by about a factor of 2 (shown earlier)

  39. Compare with Supercomputers? • Comparing only against a Pentium 4 is not fully convincing • MDGRAPE-3 of Protein Explorer can achieve 165 GFLOPS out of a 200 GFLOPS peak, but it is a special-purpose design • What about vector machines? • Lack of standard benchmarks

  40. Thanks! And questions?
