
Optimizing the Performance of Streaming Numerical Kernels on the IBM Blue Gene/P PowerPC 450

Tareq Malas. Advisors: Prof. David Keyes, Dr. Aron Ahmadia. Collaborators: Jed Brown, Dr. John Gunnels. King Abdullah University of Science and Technology, November 2011.


Presentation Transcript


  1. Optimizing the Performance of Streaming Numerical Kernels on the IBM Blue Gene/P PowerPC 450
     Tareq Malas
     Advisors: Prof. David Keyes, Dr. Aron Ahmadia
     Collaborators: Jed Brown, Dr. John Gunnels
     King Abdullah University of Science and Technology, November 2011

  2. Motivation
     • PowerPC 450: representative of exascale architectures
     • Increased parallelism: vectorization and multi-issue pipeline
     • Silicon and power savings: in-order execution
     • Streaming numerical kernels:
       • At the heart of many scientific applications
       • Bottleneck in scientific codes
     [Figures: 7-point stencil operator; 27-point stencil operator]

  3. Why is tuning computation on the BG/P PowerPC 450 difficult?
     • Utilizes features to improve efficiency:
       • SIMDized fused floating point units
     Example loop (labeled "Not aligned" on the slide):
         for (i = 0; i < N; i++) A[i] = B[i] + B[i+1];
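The alignment issue in the loop above can be sketched as follows. This is a minimal illustration, assuming a 2-wide SIMD unit that can only load aligned pairs `(B[2k], B[2k+1])`; the helper names `aligned_pairs` and `shifted_pair` are hypothetical and do not come from the presentation.

```python
def aligned_pairs(values):
    """Split a list into aligned 2-element pairs, as a 2-wide SIMD load sees it."""
    return [tuple(values[k:k + 2]) for k in range(0, len(values) - 1, 2)]


def shifted_pair(pairs, k):
    """Build the unaligned pair (B[2k+1], B[2k+2]) from two aligned pairs."""
    return (pairs[k][1], pairs[k + 1][0])


B = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
pairs = aligned_pairs(B)  # (B0, B1), (B2, B3), (B4, B5)

# A[i] = B[i] + B[i+1] for i = 0..3, computed two elements at a time.
A_simd = []
for k in range(len(pairs) - 1):
    lo = pairs[k]                # aligned operand (B[2k], B[2k+1])
    hi = shifted_pair(pairs, k)  # unaligned operand (B[2k+1], B[2k+2])
    A_simd.extend([lo[0] + hi[0], lo[1] + hi[1]])

# Scalar reference for comparison.
A_scalar = [B[i] + B[i + 1] for i in range(4)]
```

The second operand never coincides with an aligned pair, so in this model every output pair needs data from two aligned loads.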

  4. Why is tuning computation on the BG/P PowerPC 450 difficult?
     • Utilizes features to improve efficiency:
       • SIMDized fused floating point units
       • Superscalar processor with in-order execution at the core level
     Original order:  1 load A, 2 add B, 3 load C, 4 load D, 5 add D, 6 add E, 7 add F
     Reordered:       1 load A, 2 add B, 3 load C, 6 add E, 4 load D, 7 add F, 5 add D
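The effect of reordering on an in-order core can be illustrated with a toy stall counter. This is a sketch under stated assumptions: a single-issue pipeline, a load latency of 2 cycles (an illustrative value, not a PowerPC 450 figure), and the assumption that `add X` depends on an earlier `load X` when one exists. The function `count_stalls` is hypothetical.

```python
LOAD_LATENCY = 2  # assumed latency in cycles, not a PowerPC 450 figure


def count_stalls(program):
    """Issue instructions strictly in order, one per cycle, and count stall cycles.

    Each instruction is (opcode, operand). An 'add X' is assumed to depend on
    an earlier 'load X' if there is one.
    """
    ready_at = {}  # operand -> first cycle at which its loaded value is usable
    cycle = 0
    stalls = 0
    for opcode, operand in program:
        if opcode == "add" and operand in ready_at:
            wait = max(0, ready_at[operand] - cycle)
            stalls += wait
            cycle += wait
        if opcode == "load":
            ready_at[operand] = cycle + LOAD_LATENCY
        cycle += 1
    return stalls


original = [("load", "A"), ("add", "B"), ("load", "C"), ("load", "D"),
            ("add", "D"), ("add", "E"), ("add", "F")]
reordered = [("load", "A"), ("add", "B"), ("load", "C"), ("add", "E"),
             ("load", "D"), ("add", "F"), ("add", "D")]

stalls_original = count_stalls(original)
stalls_reordered = count_stalls(reordered)
```

Under these assumptions, placing the independent `add F` between `load D` and `add D` hides the load latency that the original order exposes.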

  5. Engineering tactics
     • Divide and conquer: 3-point stencil
       • Optimize, then replicate into larger stencils
     • Design focus: computer architecture
       • Fully utilize SIMD capabilities
       • Reduce pipeline stalls: unroll-and-jam and instruction interleaving (reordering)
     • Technique: assembly synthesis in Python
       • Accelerates prototyping
       • Simplifies source

  6. 3-point stencil SIMDization
     r3 = a2*W0 + a3*W1 + a4*W2
     Utilizing the SIMD-like unit features:
     • Regular SIMD
     • Cross
     • Copy-primary
     • And more …
     [Figure: three registers, each split into Primary | Secondary halves]
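One way to read the stencil formula above in terms of Primary | Secondary register halves is sketched below. This is a simplified model, not the hardware instruction set: `fma_copy_primary` is a hypothetical helper that scales both halves of an operand by the primary half of a weight register, and the choice to keep each weight in its own register is an assumption made for illustration.

```python
def fma_copy_primary(acc, x, w):
    """Simplified model: acc + x * (primary half of w), applied to both halves."""
    return (acc[0] + x[0] * w[0], acc[1] + x[1] * w[0])


def stencil3_pair(a, i, W0, W1, W2):
    """Compute (r[i], r[i+1]) with r[i] = a[i-1]*W0 + a[i]*W1 + a[i+1]*W2."""
    # Each weight sits in the primary half of its own register (assumption).
    w0, w1, w2 = (W0, 0.0), (W1, 0.0), (W2, 0.0)
    acc = (0.0, 0.0)
    acc = fma_copy_primary(acc, (a[i - 1], a[i]), w0)
    acc = fma_copy_primary(acc, (a[i], a[i + 1]), w1)
    acc = fma_copy_primary(acc, (a[i + 1], a[i + 2]), w2)
    return acc


a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
W0, W1, W2 = 0.5, 1.0, 2.0
r1, r2 = stencil3_pair(a, 1, W0, W1, W2)

# Scalar reference for the same two outputs.
ref = [a[i - 1] * W0 + a[i] * W1 + a[i + 1] * W2 for i in (1, 2)]
```

Three fused multiply-adds produce two stencil outputs in this model, which is the benefit the SIMD-like unit is meant to deliver.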

  7. Mutate-mutate vs. load-copy
     • Mutate-mutate
       • Fully utilizes the FPU
       • Requires fewer registers
     • Load-copy
       • Requires fewer load cycles

  8. Unroll-and-jam: reduce data hazards
     Before (2 sources, 1 destination):
         A[0] += q*B[0]   (stall)
         A[0] += p*B[1]   (stall)
         A[0] += q*B[2]   (stall)
         A[0] += p*B[3]   ...
     After (2 sources, 2 destinations):
         A[0] += q*B[0]
         A[1] += q*B[6]
         A[0] += p*B[1]
         A[1] += p*B[7]   ...
     Original loop nest:
         for (i = 0; i < 4; i++)
             for (j = 0; j < 5; j++)
                 A[i] += q*B[i][j] + p*B[i][j+1];
     Unrolled by 2 and jammed:
         for (i = 0; i < 4; i += 2)
             for (j = 0; j < 5; j++) {
                 A[i]   += q*B[i][j]   + p*B[i][j+1];
                 A[i+1] += q*B[i+1][j] + p*B[i+1][j+1];
             }
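The transformation on this slide can be checked with a short sketch that runs both loop nests on the same data. The array sizes (4 rows, 6 columns) follow from the loop bounds on the slide; the function names and the test values of `q` and `p` are illustrative assumptions.

```python
def stencil_rows(A, B, q, p):
    """Reference loop nest: one destination A[i] per inner loop."""
    for i in range(4):
        for j in range(5):
            A[i] += q * B[i][j] + p * B[i][j + 1]
    return A


def stencil_rows_unroll_jam(A, B, q, p):
    """Outer loop unrolled by 2, with both copies jammed into one inner loop."""
    for i in range(0, 4, 2):
        for j in range(5):
            # Two independent destinations: consecutive updates no longer
            # form a chain on the same accumulator.
            A[i] += q * B[i][j] + p * B[i][j + 1]
            A[i + 1] += q * B[i + 1][j] + p * B[i + 1][j + 1]
    return A


# Row-major test data: B[i][j] holds 6*i + j, so B[1][0] is flat element 6.
B = [[float(6 * i + j) for j in range(6)] for i in range(4)]
q, p = 0.25, 0.75
ref = stencil_rows([0.0] * 4, B, q, p)
uj = stencil_rows_unroll_jam([0.0] * 4, B, q, p)
```

Each `A[i]` receives the same updates in the same order in both versions, so the results match exactly; only the interleaving across destinations changes.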

  9. Unroll-and-jam: data reuse

  10. Pythonic code synthesis: overview
     [Diagram components]
     • Python code
     • Instructions (list of objects)
     • Register allocation
     • Instruction scheduler and simulator
     • PowerPC 450 simulator: GPR, FPR, memory
     • Simulation log and debugging information
     • C code generator
     • C code template
     • Documented C code
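The idea of holding instructions as a list of objects and filling a C code template can be sketched as below. Everything here is hypothetical: the `Instr` class, the `C_FORMATS` table, the template text, and `emit_c` are illustrative stand-ins, not the framework described in the presentation, and the sketch emits plain C statements rather than assembly.

```python
from dataclasses import dataclass
from string import Template


@dataclass
class Instr:
    op: str            # "load", "fma", or "store"
    dst: str           # destination variable ("" for stores)
    srcs: tuple        # operands, interpreted per opcode
    comment: str = ""  # carried into the generated C as documentation


C_FORMATS = {
    "load": "{dst} = {0}[{1}];",
    "fma": "{dst} = {0} * {1} + {2};",
    "store": "{0}[{1}] = {2};",
}

KERNEL_TEMPLATE = Template(
    "void $name(const double *a, double *r,\n"
    "           double W0, double W1, double W2) {\n"
    "    double $decls;\n"
    "$body\n"
    "}\n"
)


def emit_c(name, instrs):
    """Render a list of Instr objects into a documented C function."""
    lines = []
    for ins in instrs:
        line = "    " + C_FORMATS[ins.op].format(*ins.srcs, dst=ins.dst)
        if ins.comment:
            line += "  /* " + ins.comment + " */"
        lines.append(line)
    decls = ", ".join(sorted({ins.dst for ins in instrs if ins.dst}))
    return KERNEL_TEMPLATE.substitute(name=name, decls=decls,
                                      body="\n".join(lines))


program = [
    Instr("load", "x0", ("a", 0), "left neighbour"),
    Instr("load", "x1", ("a", 1), "centre"),
    Instr("load", "x2", ("a", 2), "right neighbour"),
    Instr("fma", "t", ("x0", "W0", "0.0")),
    Instr("fma", "t", ("x1", "W1", "t")),
    Instr("fma", "t", ("x2", "W2", "t")),
    Instr("store", "", ("r", 1, "t"), "r[1] = a0*W0 + a1*W1 + a2*W2"),
]
code = emit_c("stencil3_point", program)
```

Keeping the instruction list separate from the emitter means the same list could be passed through a scheduler or simulator before any C text is produced.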

  11. Pythonic code synthesis: instruction scheduling
     • Goal:
       • Run load/store and FMA instructions each cycle
       • Reduce read-after-write (RAW) data dependency hazards
     • Technique (greedy), per cycle:
       • Create a list of instructions with no RAW hazards
       • Execute the instruction(s) that will require the minimal stall
       • Repeat until all instructions are executed
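The greedy per-cycle procedure above can be sketched as follows. This is one possible reading, not the authors' scheduler: it assumes one load/store slot and one FMA slot per cycle, uses made-up latencies, and assumes the dependency names form a valid acyclic graph. All identifiers are hypothetical.

```python
LATENCY = {"load": 3, "fma": 5, "store": 1}   # assumed cycle counts
UNIT = {"load": "lsu", "store": "lsu", "fma": "fpu"}


def earliest_issue(ins, done_at):
    """First cycle at which all of an instruction's inputs are available."""
    return max([done_at[d] for d in ins[2]], default=0)


def greedy_schedule(instrs):
    """instrs: list of (name, op, deps). Returns {name: issue_cycle}."""
    done_at = {}  # name -> cycle at which its result becomes available
    issued = {}
    pending = list(instrs)
    cycle = 0
    while pending:
        # Candidates: every producer has already been issued.
        candidates = [ins for ins in pending
                      if all(d in done_at for d in ins[2])]
        # Prefer the candidates that would wait the least.
        candidates.sort(key=lambda ins: earliest_issue(ins, done_at))
        used_units = set()
        for ins in candidates:
            name, op, _ = ins
            ready = earliest_issue(ins, done_at) <= cycle
            if ready and UNIT[op] not in used_units:
                issued[name] = cycle
                done_at[name] = cycle + LATENCY[op]
                used_units.add(UNIT[op])
                pending.remove(ins)
        cycle += 1
    return issued


program = [
    ("l0", "load", ()), ("l1", "load", ()),
    ("f0", "fma", ("l0",)), ("f1", "fma", ("l1",)),
    ("s0", "store", ("f0",)), ("s1", "store", ("f1",)),
]
schedule = greedy_schedule(program)
```

Sorting by earliest possible issue cycle is the simplest stand-in for "minimal stall"; a fuller scheduler would presumably also weigh how many later instructions depend on each candidate.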

  12. Unroll-and-jam effects: 27-point stencil

  13. Kernel and L2 effects: 7-point stencil

  14. Unroll-and-jam effects: 3-point stencil

  15. Instruction scheduling optimization formulation

  16. Conclusion
     • SIMDizing the computations of streaming numerical kernels is challenging
     • Assembly programming is important for “peak” hardware utilization
     • We introduced a code synthesis and simulation framework that facilitates:
       • A faster development-testing loop
       • Instruction reordering for improved efficiency
       • Cycle-accurate performance modeling
