
### Tuning Stencils

Kaushik Datta

Microsoft Site Visit

April 29, 2008

### Stencil Code Overview
• For a given point, a stencil is a pre-determined set of nearest neighbors (possibly including itself)
• A stencil code updates every point in a regular grid with a constant weighted subset of its neighbors (“applying a stencil”)

[Figures: example 2D and 3D stencil diagrams]

### Stencil Applications

• Stencils are critical to many scientific applications:
  • Diffusion, electromagnetics, computational fluid dynamics
  • Both uniform and adaptive block-structured meshes
• Many types of stencils:
  • 1D, 2D, 3D meshes
  • Number of neighbors (5-pt, 7-pt, 9-pt, 27-pt, …)
  • Gauss-Seidel (update in place) vs. Jacobi iterations (2 meshes)
  • Varying boundary conditions (constant vs. periodic)

### Naïve Stencil Code

```c
static const double S0 = -6.0, S1 = 1.0;  /* stencil weights; values illustrative */

void stencil3d(double A[], double B[], int nx, int ny, int nz) {
  /* Jacobi sweep over the interior points: read A, write B.
   * Row-major (x, y, z) layout: z is the unit-stride dimension. */
  for (int i = 1; i < nx - 1; i++) {
    for (int j = 1; j < ny - 1; j++) {
      for (int k = 1; k < nz - 1; k++) {
        int center = (i * ny + j) * nz + k;
        B[center] = S0 * A[center] +
                    S1 * (A[center - ny * nz] + A[center + ny * nz] +  /* x neighbors */
                          A[center - nz]      + A[center + nz]      +  /* y neighbors */
                          A[center - 1]       + A[center + 1]);        /* z neighbors */
      }
    }
  }
}
```

### Our Stencil Code

• Executes a 3D, 7-point, Jacobi iteration on a 256³ grid
• Performs 8 flops (6 adds, 2 mults) per point
• Parallelization performed with pthreads
• Thread affinity: multithreading, then multicore, then multisocket
• Flop:byte ratio (a quick check follows below):
  • 0.33 (write-allocate architectures)
  • 0.5 (ideal)
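
As a quick check on those ratios (a back-of-the-envelope count, assuming 8-byte doubles: each point performs 8 flops, must read one double of A and write one double of B, and on a write-allocate architecture the write miss first reads B's cache line as well):

$$
\frac{8\ \text{flops}}{(8 + 8 + 8)\ \text{bytes}} = \frac{1}{3}
\qquad\text{vs.}\qquad
\frac{8\ \text{flops}}{(8 + 8)\ \text{bytes}} = \frac{1}{2}\ \text{(ideal, no write-allocate)}
$$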

### Cache-Based Architectures

[Diagrams: Intel Clovertown, AMD Barcelona, and Sun Victoria Falls system architectures]

### Autotuning

• Provides a portable and effective method for tuning
• Limiting the search space:
  • Searching the entire space is intractable
  • Instead, we ordered the optimizations appropriately for a given platform
  • To find the best parameters for a given optimization, we performed an exhaustive search
  • Each optimization was applied on top of all previous optimizations
  • In general, heuristics/models can also be used to prune the search space (a sketch of the greedy strategy follows below)
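
A minimal sketch of that greedy strategy, assuming a fixed optimization ordering and a per-configuration timing harness; `Config`, `NUM_OPTS`, and `run_and_time` are illustrative names, not the actual tuner's interface:

```c
#define NUM_OPTS 6   /* hypothetical: one slot per optimization stage */

typedef struct { int choice[NUM_OPTS]; } Config;  /* one parameter index per stage */

/* Greedy search: visit optimizations in a fixed, platform-appropriate order;
 * within each one, exhaustively time every candidate on top of the best
 * configuration found so far, then lock in the winner before moving on. */
Config autotune(const int n_choices[NUM_OPTS],
                double (*run_and_time)(const Config *)) {
    Config best = {{0}};
    double best_time = run_and_time(&best);
    for (int opt = 0; opt < NUM_OPTS; opt++) {
        int winner = best.choice[opt];
        for (int c = 0; c < n_choices[opt]; c++) {   /* exhaustive within a stage */
            Config trial = best;
            trial.choice[opt] = c;
            double t = run_and_time(&trial);
            if (t < best_time) { best_time = t; winner = c; }
        }
        best.choice[opt] = winner;   /* later stages build on this choice */
    }
    return best;
}
```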

### Naïve Code

[Figure: 3D grid; x is the least contiguous dimension, z is unit-stride]

• Naïve code is a simple, threaded stencil kernel
• Domain partitioning performed only in the least contiguous dimension
• No optimizations or tuning performed

[Charts: naïve performance on Intel Clovertown, AMD Barcelona, and Sun Victoria Falls]

### NUMA-Aware

• Exploited the "first-touch" page mapping policy on NUMA architectures (a sketch follows below)
• Due to our affinity policy, the benefit is only seen when using both sockets
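
A minimal sketch of first-touch initialization under this policy, assuming the same x-dimension partitioning as the compute phase; `slab_t` and `first_touch` are hypothetical names:

```c
#include <stddef.h>

typedef struct {                  /* hypothetical per-thread arguments     */
    double *A, *B;
    int x_start, x_end, ny, nz;   /* this thread's slab of the x dimension */
} slab_t;

/* Run as each pthread's first task, pinned to its core: touching a page
 * first causes the OS to allocate it from that core's local memory node,
 * so later sweeps read and write socket-local DRAM. */
static void *first_touch(void *arg) {
    slab_t *s = (slab_t *)arg;
    for (int i = s->x_start; i < s->x_end; i++)
        for (int j = 0; j < s->ny; j++)
            for (int k = 0; k < s->nz; k++) {
                size_t idx = ((size_t)i * s->ny + j) * s->nz + k;
                s->A[idx] = 0.0;   /* first touch maps the page locally */
                s->B[idx] = 0.0;
            }
    return NULL;
}
```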

[Charts: performance with NUMA-aware mapping added, on Intel Clovertown, AMD Barcelona, and Sun Victoria Falls]

### Loop Unrolling/Reordering

• Allows for better use of registers and functional units
• Best inner loop chosen by iterating many times over a grid size that fits into the L1 cache (x86 machines) or the L2 cache (Victoria Falls)
  • This should eliminate any effects from the memory subsystem
• This optimization is independent of the later memory optimizations (a sketch follows below)
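
For illustration, one unrolled variant the search might generate: the unit-stride (z) loop unrolled by 4, with a scalar remainder loop. `c0` is the pencil's base index and `xs`/`ys` are the x- and y-neighbor strides; the function name and signature are illustrative:

```c
/* Sweep one unit-stride pencil, unrolled by 4 to expose more
 * independent adds/multiplies to the functional units. */
static void sweep_pencil_unroll4(const double *A, double *B, int c0, int nz,
                                 int xs, int ys, double S0, double S1) {
    int k;
    for (k = 1; k + 3 < nz - 1; k += 4) {
        int c = c0 + k;
        B[c+0] = S0*A[c+0] + S1*(A[c-1] + A[c+1] + A[c-ys]   + A[c+ys]   + A[c-xs]   + A[c+xs]);
        B[c+1] = S0*A[c+1] + S1*(A[c]   + A[c+2] + A[c+1-ys] + A[c+1+ys] + A[c+1-xs] + A[c+1+xs]);
        B[c+2] = S0*A[c+2] + S1*(A[c+1] + A[c+3] + A[c+2-ys] + A[c+2+ys] + A[c+2-xs] + A[c+2+xs]);
        B[c+3] = S0*A[c+3] + S1*(A[c+2] + A[c+4] + A[c+3-ys] + A[c+3+ys] + A[c+3-xs] + A[c+3+xs]);
    }
    for (; k < nz - 1; k++) {   /* remainder iterations */
        int c = c0 + k;
        B[c] = S0*A[c] + S1*(A[c-1] + A[c+1] + A[c-ys] + A[c+ys] + A[c-xs] + A[c+xs]);
    }
}
```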

[Charts: performance with loop unrolling/reordering added, on Intel Clovertown, AMD Barcelona, and Sun Victoria Falls]

### Array Padding

[Figure: 3D grid padded in the unit-stride (z) dimension]

• Used to reduce conflict misses and DRAM bank conflicts
• Drawback: larger memory footprint
• Performed a search to determine the best padding amount
• Only padded in the unit-stride dimension (sketch below)
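
A minimal sketch of a padded allocation, assuming row-major (x, y, z) storage; `alloc_padded` is a hypothetical helper, and `pad` is the tuned parameter:

```c
#include <stdlib.h>

/* Pad only the unit-stride (z) dimension: successive pencils then start at
 * different cache-set / DRAM-bank offsets, reducing conflicts. Indexing
 * must use the padded stride:  idx = (i*ny + j) * nz_pad + k. */
double *alloc_padded(int nx, int ny, int nz, int pad, int *nz_pad) {
    *nz_pad = nz + pad;   /* pad is a tuned parameter, found by search */
    return malloc((size_t)nx * ny * (size_t)(*nz_pad) * sizeof(double));
}
```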

[Charts: performance with array padding added, on Intel Clovertown, AMD Barcelona, and Sun Victoria Falls]

### Thread/Cache Blocking

[Figure: 3D grid decomposition; z is unit-stride]

• Performed an exhaustive search over all possible power-of-two parameter values
• Every thread block is the same size and shape
  • Preserves load balancing
• Did NOT cut in the contiguous dimension on the x86 machines
  • Avoids interrupting the HW prefetchers
• Only performed cache blocking in one dimension (see the sketch after the figure below)
  • Sufficient to fit three read planes and one write plane into cache

[Figure: decomposition into 4 × 2 × 2 thread blocks (x, y, z), each cut into 2 cache blocks in y]
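
A sketch of one thread's traversal of its sub-box with a single level of cache blocking in y, as described above; `sweep_block` is an illustrative name and `cb` is the tuned, power-of-two cache-block height:

```c
#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* One thread sweeps its [x0,x1) x [y0,y1) x [z0,z1) block. Cutting y into
 * blocks of height cb keeps a block's three read planes plus one write
 * plane resident in cache while the x loop streams through memory. */
static void sweep_block(const double *A, double *B,
                        int x0, int x1, int y0, int y1, int z0, int z1,
                        int ny, int nz, int cb, double S0, double S1) {
    int xs = ny * nz, ys = nz;                    /* neighbor strides  */
    for (int jj = y0; jj < y1; jj += cb)          /* cache blocks in y */
        for (int i = x0; i < x1; i++)
            for (int j = jj; j < MIN(jj + cb, y1); j++)
                for (int k = z0; k < z1; k++) {   /* full unit-stride z */
                    int c = (i * ny + j) * nz + k;
                    B[c] = S0 * A[c] + S1 * (A[c-1] + A[c+1] + A[c-ys] + A[c+ys]
                                             + A[c-xs] + A[c+xs]);
                }
}
```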

[Charts: performance with thread/cache blocking added, on Intel Clovertown, AMD Barcelona, and Sun Victoria Falls]

### Software Prefetching

• Allows us to hide memory latency
• Searched over varying prefetch distances and granularities (e.g., prefetch every register block, plane, or pencil); a sketch follows below
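
A sketch of the pencil-granularity variant using the SSE prefetch intrinsic; `dist`, the prefetch distance in doubles, is the tuned parameter, and the function name is illustrative:

```c
#include <xmmintrin.h>   /* _mm_prefetch */

/* Sweep one unit-stride pencil, prefetching `dist` elements ahead in both
 * the read (A) and write (B) streams to overlap DRAM latency with compute.
 * Prefetches past the end of the pencil are harmless hints. */
static void sweep_pencil_prefetch(const double *A, double *B, int c0, int nz,
                                  int xs, int ys, int dist,
                                  double S0, double S1) {
    for (int k = 1; k < nz - 1; k++) {
        int c = c0 + k;
        _mm_prefetch((const char *)&A[c + dist], _MM_HINT_T0);
        _mm_prefetch((const char *)&B[c + dist], _MM_HINT_T0);
        B[c] = S0 * A[c] + S1 * (A[c-1] + A[c+1] + A[c-ys] + A[c+ys]
                                 + A[c-xs] + A[c+xs]);
    }
}
```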

[Charts: performance with software prefetching added, on Intel Clovertown, AMD Barcelona, and Sun Victoria Falls]

### SIMDization

• Requires a complete code rewrite to utilize the 128-bit SSE registers
  • Allows a single instruction to add/multiply two doubles
• Only possible on the x86 machines
• Padding performed to achieve proper data alignment (not to avoid conflicts)
• Searched over register block sizes and prefetch distances simultaneously (a sketch follows below)
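
A minimal SSE2 sketch of the rewritten inner loop, processing two doubles per instruction. Unaligned loads keep the example short; the tuned code instead relies on the alignment padding so that most accesses are aligned. The function name is illustrative:

```c
#include <emmintrin.h>   /* SSE2: __m128d and double-precision intrinsics */

/* Sweep one unit-stride pencil with SSE2, two grid points per instruction.
 * c0 is the pencil's base index; xs and ys are the x- and y-neighbor strides. */
static void sweep_pencil_sse2(const double *A, double *B, int c0, int nz,
                              int xs, int ys, double s0, double s1) {
    const __m128d vs0 = _mm_set1_pd(s0), vs1 = _mm_set1_pd(s1);
    int k;
    for (k = 1; k + 1 < nz - 1; k += 2) {
        int c = c0 + k;
        __m128d sum = _mm_add_pd(_mm_loadu_pd(&A[c - 1]), _mm_loadu_pd(&A[c + 1]));
        sum = _mm_add_pd(sum, _mm_add_pd(_mm_loadu_pd(&A[c - ys]), _mm_loadu_pd(&A[c + ys])));
        sum = _mm_add_pd(sum, _mm_add_pd(_mm_loadu_pd(&A[c - xs]), _mm_loadu_pd(&A[c + xs])));
        _mm_storeu_pd(&B[c], _mm_add_pd(_mm_mul_pd(vs0, _mm_loadu_pd(&A[c])),
                                        _mm_mul_pd(vs1, sum)));
    }
    for (; k < nz - 1; k++) {   /* scalar remainder for an odd interior extent */
        int c = c0 + k;
        B[c] = s0*A[c] + s1*(A[c-1] + A[c+1] + A[c-ys] + A[c+ys] + A[c-xs] + A[c+xs]);
    }
}
```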

[Charts: performance with SIMDization added, on Intel Clovertown, AMD Barcelona, and Sun Victoria Falls]

### Cache Bypass

• Writes data directly to the write-back buffer
  • No data load on a write miss
• Changes the stencil kernel's flop:byte ratio from 1/3 to 1/2
  • Reduces memory data traffic by 33%
• Still requires the SIMDized code from the previous optimization (a sketch follows below)
• Searched over register block sizes and prefetch distances simultaneously
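
A sketch of the same SSE2 loop with the plain store replaced by a non-temporal store, which is one way such a cache bypass is expressed on x86; `_mm_stream_pd` requires a 16-byte-aligned destination, which the alignment padding guarantees:

```c
#include <emmintrin.h>   /* SSE2: _mm_stream_pd, _mm_mfence */

/* As the SSE2 sweep above, but B is written around the cache: the write miss
 * no longer pulls B's line from DRAM, cutting per-point traffic from 24 to
 * 16 bytes (flop:byte 1/3 -> 1/2). Assumes &B[c0 + 1] is 16-byte aligned. */
static void sweep_pencil_stream(const double *A, double *B, int c0, int nz,
                                int xs, int ys, double s0, double s1) {
    const __m128d vs0 = _mm_set1_pd(s0), vs1 = _mm_set1_pd(s1);
    int k;
    for (k = 1; k + 1 < nz - 1; k += 2) {
        int c = c0 + k;
        __m128d sum = _mm_add_pd(_mm_loadu_pd(&A[c - 1]), _mm_loadu_pd(&A[c + 1]));
        sum = _mm_add_pd(sum, _mm_add_pd(_mm_loadu_pd(&A[c - ys]), _mm_loadu_pd(&A[c + ys])));
        sum = _mm_add_pd(sum, _mm_add_pd(_mm_loadu_pd(&A[c - xs]), _mm_loadu_pd(&A[c + xs])));
        _mm_stream_pd(&B[c], _mm_add_pd(_mm_mul_pd(vs0, _mm_loadu_pd(&A[c])),
                                        _mm_mul_pd(vs1, sum)));  /* non-temporal */
    }
    for (; k < nz - 1; k++) {   /* remainder falls back to a cached store */
        int c = c0 + k;
        B[c] = s0*A[c] + s1*(A[c-1] + A[c+1] + A[c-ys] + A[c+ys] + A[c-xs] + A[c+xs]);
    }
    _mm_mfence();   /* drain streaming stores before threads synchronize */
}
```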

[Charts: performance with cache bypass added, on Intel Clovertown, AMD Barcelona, and Sun Victoria Falls]

### Collaborative Threading

[Figure: 8 hardware threads (t0–t7) assigned to grid blocks; z is unit-stride]

• Requires another complete code rewrite
• Collaborative threading (CT) allows for better L1 cache utilization when switching threads
• Only effective on Victoria Falls due to:
  • a very small L1 cache (8 KB) shared by 8 HW threads
  • the lack of hardware prefetchers (allows us to cut in the contiguous dimension)
• Drawback: the parameter space becomes very large

[Figure: decomposition without collaboration (4 × 2 × 2 thread blocks in x, y, z; 2 cache blocks in y) vs. with collaboration (large collaborative thread blocks: 4 in y, 2 in z; each cut into small collaborative thread blocks: 2 in y, 4 in z)]

[Charts: performance with collaborative threading added, on Intel Clovertown, AMD Barcelona, and Sun Victoria Falls]

### Autotuning Results

[Charts: fully autotuned vs. naïve performance, with all optimizations applied cumulatively: 1.9x better on Intel Clovertown, 5.4x better on AMD Barcelona, and 10.4x better on Sun Victoria Falls]

### Architecture Comparison

[Charts: double- and single-precision performance and power efficiency across architectures]

### Conclusions

• Compilers alone fail to fully utilize system resources
  • Programmers may not even know that the system is being underutilized
• Autotuning provides a portable and effective solution
  • Produces up to a 10.4x improvement over the compiler alone
• To make autotuning tractable:
  • Choose the order of optimizations appropriately for the platform
  • Prune the search space intelligently for large searches
• Power efficiency has become a valuable metric
  • Local store-based architectures (e.g., Cell and G80) are usually more efficient than cache-based machines

### Acknowledgements

• Sam Williams for:
  • writing the Cell stencil code
  • guiding my work by autotuning SpMV and LBMHD
• Vasily Volkov for writing the G80 CUDA code
• Kathy Yelick and Jim Demmel for general advice and feedback