
Tuning Stencils

Kaushik Datta

Microsoft Site Visit

April 29, 2008

Stencil Code Overview
  • For a given point, a stencil is a pre-determined set of nearest neighbors (possibly including itself)
  • A stencil code updates every point in a regular grid with a weighted sum of a subset of its neighbors, using the same constant weights at every point (“applying the stencil”)

[Figures: example 2D and 3D stencils]

Stencil Applications
  • Stencils are critical to many scientific applications:
    • Diffusion, Electromagnetics, Computational Fluid Dynamics
    • Both uniform and adaptive block-structured meshes
  • Many types of stencils
    • 1D, 2D, 3D meshes
    • Number of neighbors (5-pt, 7-pt, 9-pt, 27-pt,…)
    • Gauss-Seidel (update in place) vs. Jacobi iterations (2 meshes)
  • Varying boundary conditions (constant vs. periodic)
Naïve Stencil Code

void stencil3d(double A[], double B[], int nx, int ny, int nz) {
  /* Jacobi sweep: read the 7-point neighborhood from A, write to B.
     z is the unit-stride dimension; S0, S1 are the stencil weights.
     Only interior points are updated (boundaries hold fixed values). */
  for (int i = 1; i < nx - 1; i++) {
    for (int j = 1; j < ny - 1; j++) {
      for (int k = 1; k < nz - 1; k++) {
        int center = (i * ny + j) * nz + k;
        B[center] = S0 * A[center] +
                    S1 * (A[center - ny * nz] + A[center + ny * nz] +  /* x */
                          A[center - nz]      + A[center + nz]      +  /* y */
                          A[center - 1]       + A[center + 1]);        /* z */
      }
    }
  }
}

Our Stencil Code
  • Executes a 3D, 7-point Jacobi iteration on a 256³ grid
  • Performs 8 flops (6 adds, 2 mults) per point
  • Parallelization performed with pthreads
  • Thread affinity: multithreading, then multicore, then multisocket
  • Flop:Byte Ratio
    • 0.33 (write allocate architectures)
    • 0.5 (Ideal)
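These ratios follow from the per-point traffic: each point performs 8 flops; on a write-allocate architecture it moves 24 bytes (an 8-byte read of A, an 8-byte write-allocate read of B, and an 8-byte write of B), giving 8/24 ≈ 0.33, while eliminating the write-allocate read leaves 16 bytes and 8/16 = 0.5 (neighbor reads of A are assumed to hit in cache).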
Cache-Based Architectures

[Architecture diagrams: Intel Clovertown, AMD Barcelona, Sun Victoria Falls]

Autotuning
  • Provides a portable and effective method for tuning
  • Limiting the search space:
    • Searching the entire space is intractable
    • Instead, we ordered the optimizations appropriately for a given platform
    • To find the best parameters for a given optimization, we performed an exhaustive search
    • Each optimization was applied on top of all previous optimizations
    • In general, heuristics/models can also be used to prune the search space
Naive Code

[Figure: 3D grid with axes x, y, and z (unit-stride), partitioned among threads in the least contiguous (x) dimension]

  • Naïve code is a simple, threaded stencil kernel
  • Domain partitioning is performed only in the least contiguous dimension
  • No optimizations or tuning were performed
Naïve

[Performance charts for Intel Clovertown, AMD Barcelona, and Sun Victoria Falls; bars: Naive]

NUMA-Aware

[Figures: Intel Clovertown, AMD Barcelona, Sun Victoria Falls]

  • Exploited the “first-touch” page mapping policy on NUMA architectures
  • Due to our affinity policy, the benefit is only seen when using both sockets
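A minimal sketch of the idea (illustrative, not the presentation's actual code; the struct and partition names are hypothetical): before any timed sweep, each pthread writes exactly the slab of A and B that it will later update, so the OS's first-touch policy maps those pages to that thread's local memory.

#include <pthread.h>

typedef struct { double *A, *B; long start, end; } slab_t;

/* Run once per worker thread, pinned with the same affinity as the
   compute phase; the first write to each page fixes its placement. */
static void *first_touch(void *arg) {
    slab_t *s = (slab_t *)arg;
    for (long i = s->start; i < s->end; i++)
        s->A[i] = s->B[i] = 0.0;
    return NULL;
}

With a serial initialization loop, every page would land on the initializing thread's socket, which is why the benefit only appears once the second socket is in use.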
NUMA-Aware

[Performance charts for Intel Clovertown, AMD Barcelona, and Sun Victoria Falls; cumulative bars: Naive, +NUMA-Aware]

Loop Unrolling/Reordering
  • Allows for better use of registers and functional units
  • Best inner loop chosen by iterating many times over a grid size that fits into L1 cache (x86 machines) or L2 cache (VF)
    • This should eliminate any effects from the memory subsystem
  • This optimization is independent of later memory optimizations
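As a sketch of the transformation (illustrative; the best unroll/reordering was found by search), the inner loop of the naive kernel might be unrolled by two in the unit-stride dimension, exposing independent multiply-add chains to the scheduler:

/* Inner z-loop of stencil3d unrolled by 2; assumes (nz - 2) is even,
   otherwise a cleanup loop handles the remainder. */
for (int k = 1; k < nz - 1; k += 2) {
    int c0 = (i * ny + j) * nz + k, c1 = c0 + 1;
    B[c0] = S0 * A[c0] + S1 * (A[c0 - ny*nz] + A[c0 + ny*nz] +
                               A[c0 - nz] + A[c0 + nz] +
                               A[c0 - 1] + A[c0 + 1]);
    B[c1] = S0 * A[c1] + S1 * (A[c1 - ny*nz] + A[c1 + ny*nz] +
                               A[c1 - nz] + A[c1 + nz] +
                               A[c1 - 1] + A[c1 + 1]);
}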
Loop Unrolling/Reordering

[Performance charts for Intel Clovertown, AMD Barcelona, and Sun Victoria Falls; cumulative bars: Naive, +NUMA-Aware, +Loop Unrolling/Reordering]

Padding

[Figure: 3D grid with axes x, y, and z (unit-stride), showing the padding amount added at the end of the unit-stride dimension]

  • Used to reduce conflict misses and DRAM bank conflicts
  • Drawback: Larger memory footprint
  • Performed search to determine best padding amount
  • Only padded in unit-stride dimension

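Concretely, the allocation might look like this (a sketch; the PAD search range and the IDX macro are illustrative):

/* Pad only the unit-stride (z) dimension; the best PAD (e.g. 0..31
   doubles) is chosen by exhaustive search on each platform. */
int nz_pad = nz + PAD;
double *A = (double *)malloc((size_t)nx * ny * nz_pad * sizeof(double));
double *B = (double *)malloc((size_t)nx * ny * nz_pad * sizeof(double));
#define IDX(i, j, k) (((i) * ny + (j)) * nz_pad + (k))  /* padded strides */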

Padding

[Performance charts for Intel Clovertown, AMD Barcelona, and Sun Victoria Falls; cumulative bars: Naive through +Padding]

Thread/Cache Blocking

[Figure: 3D grid with axes x, y, and z (unit-stride), cut into thread blocks and cache blocks]

  • Performed exhaustive search over all possible power-of-two parameter values
  • Every thread block is the same size and shape
    • Preserves load balancing
  • Did NOT cut in contiguous dimension on x86 machines
    • Avoids interrupting HW prefetchers
  • Only performed cache blocking in one dimension
    • Sufficient to fit three read planes and one write plane into cache

[Figure labels: 4 thread blocks in x, 2 in y, 2 in z; 2 cache blocks in y]
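For one thread, the resulting loop structure might look like the following sketch (x0/x1, y0/y1, z0/z1 denote this thread's block bounds and CY the tuned cache-block size in y; all names illustrative):

/* The extra jj loop tiles y so the working set (three read planes plus
   one write plane) fits in cache; z runs uncut on the x86 machines. */
for (int jj = y0; jj < y1; jj += CY) {
    int jend = (jj + CY < y1) ? jj + CY : y1;
    for (int i = x0; i < x1; i++)
        for (int j = jj; j < jend; j++)
            for (int k = z0; k < z1; k++)
                B[IDX(i, j, k)] = S0 * A[IDX(i, j, k)] + S1 * (
                    A[IDX(i-1, j, k)] + A[IDX(i+1, j, k)] +
                    A[IDX(i, j-1, k)] + A[IDX(i, j+1, k)] +
                    A[IDX(i, j, k-1)] + A[IDX(i, j, k+1)]);
}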

Thread/Cache Blocking

[Performance charts for Intel Clovertown, AMD Barcelona, and Sun Victoria Falls; cumulative bars: Naive through +Thread/Cache Blocking]

Software Prefetching
  • Allows us to hide memory latency
  • Searched over varying prefetch distances and granularities (e.g. prefetch every register block, plane, or pencil)
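In the inner loop this might look like the following sketch (DIST, the prefetch distance in elements, is a tuned parameter; the intrinsic is the standard x86 one):

#include <xmmintrin.h>

/* One prefetch per iteration in the unit-stride dimension; the search
   also tried coarser granularities (per register block, pencil, plane). */
for (int k = z0; k < z1; k++) {
    _mm_prefetch((const char *)&A[IDX(i, j, k) + DIST], _MM_HINT_T0);
    B[IDX(i, j, k)] = S0 * A[IDX(i, j, k)] + S1 * (
        A[IDX(i-1, j, k)] + A[IDX(i+1, j, k)] +
        A[IDX(i, j-1, k)] + A[IDX(i, j+1, k)] +
        A[IDX(i, j, k-1)] + A[IDX(i, j, k+1)]);
}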
Software Prefetching

[Performance charts for Intel Clovertown, AMD Barcelona, and Sun Victoria Falls; cumulative bars: Naive through +Prefetching]

SIMDization
  • Requires complete code rewrite to utilize 128-bit SSE registers
  • Allows a single instruction to add/multiply two doubles
  • Only possible on the x86 machines
  • Padding performed to achieve proper data alignment (not to avoid conflicts)
  • Searched over register block sizes and prefetch distances simultaneously
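One SSE2 update of two adjacent z points might look like this sketch (assuming the padded grids are 16-byte aligned, e.g. via _mm_malloc, and nz_pad is even so the x/y neighbor loads stay aligned):

#include <emmintrin.h>

int c = IDX(i, j, k);                        /* k even => 16-byte aligned */
__m128d s0 = _mm_set1_pd(S0), s1 = _mm_set1_pd(S1);
__m128d sum = _mm_add_pd(_mm_loadu_pd(&A[c - 1]),       /* z neighbors,  */
                         _mm_loadu_pd(&A[c + 1]));      /* unaligned     */
sum = _mm_add_pd(sum, _mm_add_pd(_mm_load_pd(&A[c - nz_pad]),       /* y */
                                 _mm_load_pd(&A[c + nz_pad])));
sum = _mm_add_pd(sum, _mm_add_pd(_mm_load_pd(&A[c - ny * nz_pad]),  /* x */
                                 _mm_load_pd(&A[c + ny * nz_pad])));
_mm_store_pd(&B[c], _mm_add_pd(_mm_mul_pd(s0, _mm_load_pd(&A[c])),
                               _mm_mul_pd(s1, sum)));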
SIMDization

[Performance charts for Intel Clovertown, AMD Barcelona, and Sun Victoria Falls; cumulative bars: Naive through +SIMDization]

Cache Bypass
  • Writes data directly to write-back buffer
    • No data load on write miss
  • Changes stencil kernel’s flop:byte ratio from 1/3 to 1/2
    • Reduces memory data traffic by 33%
  • Still requires the SIMDized code from the previous optimization
  • Searched over register block sizes and prefetch distances simultaneously
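In the SSE code this amounts to swapping the store (a sketch): the non-temporal store writes B past the cache, so the output line is never read in.

#include <emmintrin.h>

/* Instead of _mm_store_pd(&B[c], result): */
_mm_stream_pd(&B[c], result);   /* no write-allocate read of B */

/* Once per sweep, before B is read again: */
_mm_mfence();                   /* order the streaming stores */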
Cache Bypass

[Performance charts for Intel Clovertown, AMD Barcelona, and Sun Victoria Falls; cumulative bars: Naive through +Cache Bypass]

Collaborative Threading

[Figure: grids with axes x, y, and z (unit-stride), each cell labeled by its owning thread t0-t7, without and with collaborative threading]

  • Requires another complete code rewrite
  • Collaborative threading (CT) allows for better L1 cache utilization when switching threads
  • Only effective on VF due to:
    • very small L1 cache (8 KB) shared by 8 HW threads
    • lack of hardware prefetchers (allows us to cut in the contiguous dimension)
  • Drawback: the parameter space becomes very large (see the mapping sketch below)

[Figure labels — no collaboration: 4 thread blocks in x, 2 in y, 2 in z, 2 cache blocks in y; with collaboration: 4 large collaborative thread blocks in y and 2 in z, 2 small collaborative thread blocks in y and 4 in z]
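A minimal sketch of how the eight HW threads of one core might be mapped onto sub-blocks of a shared collaborative block (names and shapes entirely hypothetical; the real configurations come out of the search):

/* COLL_Y * COLL_Z == 8: the core's HW threads tile one shared block so
   they stream through the same planes and share the tiny 8 KB L1. */
int rank  = tid % 8;           /* this thread's position within its core */
int sub_y = rank % COLL_Y;
int sub_z = rank / COLL_Y;
int y0 = block_y0 + sub_y * (block_ny / COLL_Y);
int z0 = block_z0 + sub_z * (block_nz / COLL_Z);
/* ...the thread then sweeps its (block_ny/COLL_Y) x (block_nz/COLL_Z)
   sub-block of the shared collaborative block. */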

Collaborative Threading

[Performance charts for Intel Clovertown, AMD Barcelona, and Sun Victoria Falls; cumulative bars: Naive through +Collaborative Threading]

Autotuning Results

[Performance charts for Intel Clovertown, AMD Barcelona, and Sun Victoria Falls; cumulative bars: Naive through +Collaborative Threading. Best autotuned code is 1.9x better than naïve on Clovertown, 5.4x on Barcelona, and 10.4x on Victoria Falls]

Architecture Comparison

[Charts: double- and single-precision performance and power efficiency across architectures]

Conclusions
  • Compilers alone fail to fully utilize system resources
  • Programmers may not even know that the system is being underutilized
  • Autotuning provides a portable and effective solution
    • Produces up to a 10.4x improvement over the compiler alone
  • To make autotuning tractable:
    • Choose the order of optimizations appropriately for the platform
    • Prune the search space intelligently for large searches
  • Power efficiency has become a valuable metric
  • Local store-based architectures (e.g. Cell and G80) are usually more efficient than cache-based machines
Acknowledgements
  • Sam Williams for:
    • writing the Cell stencil code
    • guiding my work by autotuning SpMV and LBMHD
  • Vasily Volkov for writing the G80 CUDA code
  • Kathy Yelick and Jim Demmel for general advice and feedback