Divergence Analysis with Affine Constraints


λ Programming Languages Laboratory. Divergence Analysis with Affine Constraints. Diogo Sampaio, Sylvain Collange and Fernando Pereira, The Federal University of Minas Gerais, Brazil. The objective of this work is to speed up code that runs on GPUs.


λ

Programming Languages Laboratory

### Divergence Analysis with Affine Constraints

Diogo Sampaio, Sylvain Collange and Fernando Pereira

The Federal University of Minas Gerais - Brazil

• We will achieve this goal via two contributions:
• Divergence analysis with affine constraints, which enables…
• Divergence-aware register allocation.

Motivation
• General-purpose programming on Graphics Processing Units (GPGPU) is a reality today.
• Lots of academic research.
• Many industrial applications.
• Yet, programming efficient GPGPU applications is hard.
• Complex interplay with the hardware.
• Threads execute in lock step, but divergences may happen.
What are Divergences?

Below we have a simple kernel, and its Control Flow Graph:

__global__ void ex(float* v) {
    int tid = threadIdx.x;  // thread identifier (assumed: one-dimensional indexing)
    if (v[tid] < 0.0) {
        v[tid] /= 2;
    } else {
        v[tid] = 0.0;
    }
}

• Why do we have divergences in this kernel?
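One way to see where the divergence comes from is to model a warp on the host. The sketch below is not GPU code and not the authors' tooling: `run_warp` and `WarpRun` are hypothetical names, and the model simply assumes that a warp executes each side of the branch that has at least one active thread, one side after the other, with the remaining threads masked off.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Result of running the kernel body for one simulated warp.
struct WarpRun {
    std::vector<float> v;  // memory contents after execution
    int passes;            // number of branch sides executed serially
};

// Host-side model (an assumption, not real GPU behavior in detail):
// all threads evaluate the predicate, then the warp walks through every
// side of the branch that has at least one active thread.
WarpRun run_warp(std::vector<float> v) {
    std::vector<bool> takes_then(v.size());
    bool any_then = false;
    bool any_else = false;
    for (std::size_t tid = 0; tid < v.size(); ++tid) {
        takes_then[tid] = v[tid] < 0.0f;
        if (takes_then[tid]) any_then = true; else any_else = true;
    }
    int passes = 0;
    if (any_then) {  // threads on the else side sit idle here
        ++passes;
        for (std::size_t tid = 0; tid < v.size(); ++tid)
            if (takes_then[tid]) v[tid] /= 2;
    }
    if (any_else) {  // threads on the then side sit idle here
        ++passes;
        for (std::size_t tid = 0; tid < v.size(); ++tid)
            if (!takes_then[tid]) v[tid] = 0.0f;
    }
    return {v, passes};
}
```

When every element has the same sign, the model runs a single pass. With mixed signs the predicate depends on `v[tid]`, which differs per thread, so both sides are executed and part of the warp is idle in each pass.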

Uniform and Divergent Variables
• If a variable always has the same value for all the threads in execution, then we call it uniform.
• If different threads in execution may see the same variable name with different values, this variable is called divergent.
• Which variables are divergent?
• The thread identifier is always divergent.
• Variables that depend on divergent variables are also divergent.
• Data dependences.
• Control dependences.
Data Dependences

If a variable v is defined by an instruction that uses a variable u, then v is data-dependent on u.

In the figure, %r1 depends on v and on %tid, so the value of %r1 may be different for each thread.

Control Dependences

If the value assigned to a variable v is controlled by a variable u, then v is control-dependent on u.

In the figure, %f2 is control-dependent on %p1: depending on how each thread branches at the end of B0, %f2 may be %f1/2 or 0.0 at BST.

Affine Variables

Some divergent variables are special: they are affine expressions of the thread identifier, e.g., v = C×Tid + N.

Example: a kernel that computes the average of each column of a matrix. The loop always executes the same number of iterations for all the threads.

Affine Variables

Variable i is divergent, yet it is very regular: each thread sees it as "Tid + N × c", where N is the current loop iteration.

We say that i is an affine variable.

For instance, at iteration N = 10, i = Tid + 10 × c.
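The kernel from the slide is not part of this transcript, so the following is only a plausible reconstruction on the host. It assumes a row-major matrix with `c` columns, one thread per column, and a loop of the form `for (i = tid; i < rows * c; i += c)`. The names `index_trace` and `column_average` are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Values taken by the loop index i in thread `tid`.
// Assumed loop shape: for (i = tid; i < rows * c; i += c)
std::vector<int> index_trace(int tid, int rows, int c) {
    std::vector<int> trace;
    for (int i = tid; i < rows * c; i += c) trace.push_back(i);
    return trace;
}

// Average of column `tid` of a row-major matrix with `rows` rows and `c` columns.
float column_average(const std::vector<float>& m, int tid, int rows, int c) {
    float sum = 0.0f;
    for (int i = tid; i < rows * c; i += c) sum += m[i];
    return sum / rows;
}
```

Under these assumptions, every thread with `tid < c` runs exactly `rows` iterations, and at iteration `N` its index is `tid + N * c`. The index differs between threads, but only by the thread identifier.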

The Divergence Analysis with Affine Constraints
• This analysis classifies variables as uniform, affine or divergent.
• Our divergence analysis is a dataflow analysis.
• We associate an abstract state with each variable.
• This abstract state is a pair (a, b), which means a × Tid + b.
• Each element in the pair can be:
• A constant, which we denote by 'C'
• A non-initialized value, which we denote by '?'
• An unknown value, which we denote by 'D'
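The pair (a, b) can be written down as a small data structure. This is a minimal sketch, not the representation used in the authors' implementation. The type and function names (`Elem`, `State`, `show`, and the three constructors) are made up for illustration.

```cpp
#include <cassert>
#include <string>

// One element of the pair: '?', a known constant 'C', or 'D'.
struct Elem {
    enum Kind { Undefined, Constant, Unknown } kind;
    long value;  // meaningful only when kind == Constant
};

Elem undef() { return {Elem::Undefined, 0}; }
Elem constant(long c) { return {Elem::Constant, c}; }
Elem unknown() { return {Elem::Unknown, 0}; }

// Abstract state (a, b), read as a * Tid + b.
struct State {
    Elem a;
    Elem b;
};

// Textual form in the notation of the slides.
std::string show(const Elem& e) {
    switch (e.kind) {
        case Elem::Undefined: return "?";
        case Elem::Unknown:   return "D";
        default:              return std::to_string(e.value);
    }
}

std::string show(const State& s) {
    return "(" + show(s.a) + ", " + show(s.b) + ")";
}
```

For example, `show(State{constant(10), constant(3)})` yields `(10, 3)`, which stands for 10 × Tid + 3.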
Uniform Variables

No worries: we shall explain how we find these abstract states!

• A uniform variable v is bound to the state (0, X), which means 0 × Tid + X.
• If X is a known constant, then v is a constant.
Divergent Variables


A divergent variable v is bound to the state (D, D), which means that we do not know anything about the runtime values that this variable can assume.

Affine Variables

Ok: it is about time to explain how we find these abstract states.

An affine variable v is bound to the state (c, X), which means c × Tid + X. The factor c is always a known constant; X can be either a known constant or D.
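The three classes can be read directly off the abstract state. The sketch below makes two assumptions that the slides do not state: an affine variable has a non-zero factor c (a zero factor would make it uniform), and any state whose first element is D is treated as divergent. The `NotYetKnown` case for states that still contain '?' is also an addition for illustration.

```cpp
#include <cassert>

struct Elem {
    enum Kind { Undefined, Constant, Unknown } kind;  // '?', 'C', 'D'
    long value;                                       // used only when kind == Constant
};
struct State { Elem a, b; };  // read as a * Tid + b

enum class VarClass { NotYetKnown, Uniform, Affine, Divergent };

VarClass classify(const State& s) {
    if (s.a.kind == Elem::Undefined || s.b.kind == Elem::Undefined)
        return VarClass::NotYetKnown;                             // still (?, _) or (_, ?)
    if (s.a.kind == Elem::Unknown) return VarClass::Divergent;    // (D, _)
    if (s.a.value == 0) return VarClass::Uniform;                 // (0, X)
    return VarClass::Affine;                                      // (c, X) with c != 0
}
```

A uniform variable whose second element is a known constant, such as (0, 10), is additionally a compile-time constant.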

Solving Divergence Analysis

Initially every variable is bound to the abstract state (?, ?), unless:

• It is initialized with a constant, e.g., if we have the assignment v = 10, then [v] = (0, 10).
• It is initialized with a constant expression of Tid, e.g., if v = 10 * Tid + 3, then [v] = (10, 3).
• The variable is a function parameter, in which case its abstract state is (0, D).

Once we have initialized every variable, we start iterating a few propagation rules until we reach a fixed point.
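The initialization rules can be sketched as a single function. The `Def` record below is a hypothetical summary of how a variable is first defined; a real compiler would derive this from its intermediate representation instead.

```cpp
#include <cassert>

struct Elem {
    enum Kind { Undefined, Constant, Unknown } kind;  // '?', 'C', 'D'
    long value;                                       // used only when kind == Constant
};
struct State { Elem a, b; };  // read as a * Tid + b

// Hypothetical description of a variable's defining statement.
struct Def {
    enum Form { Parameter, ConstantAssign, AffineOfTid, Other } form;
    long a;  // factor of Tid (AffineOfTid only)
    long b;  // constant term (ConstantAssign and AffineOfTid)
};

State initial_state(const Def& d) {
    const Elem undef{Elem::Undefined, 0};
    const Elem unknown{Elem::Unknown, 0};
    switch (d.form) {
        case Def::ConstantAssign: return {{Elem::Constant, 0}, {Elem::Constant, d.b}};    // v = 10
        case Def::AffineOfTid:    return {{Elem::Constant, d.a}, {Elem::Constant, d.b}};  // v = 10 * Tid + 3
        case Def::Parameter:      return {{Elem::Constant, 0}, unknown};                  // (0, D)
        default:                  return {undef, undef};                                  // (?, ?)
    }
}
```

Under this encoding the thread identifier itself would be described as `{Def::AffineOfTid, 1, 0}`, giving the state (1, 0).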

The Propagation Rules
• There are many different propagation tables (we call them dataflow equations).
• We have one table for each different program instruction.
• Let's consider, for instance, that the program contains an instruction v = v1 + v2. The abstract state of v1, i.e., [v1], is given by the blue column, and [v2] by the cantaloupe row.
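The table for addition is not reproduced in this transcript, so the sketch below is an assumption about its contents and not a copy of it. It adds the two states component by component, since (a1 × Tid + b1) + (a2 × Tid + b2) = (a1 + a2) × Tid + (b1 + b2). It further assumes that D absorbs everything and that '?' absorbs constants.

```cpp
#include <cassert>

struct Elem {
    enum Kind { Undefined, Constant, Unknown } kind;  // '?', 'C', 'D'
    long value;                                       // used only when kind == Constant
};
struct State { Elem a, b; };  // read as a * Tid + b

// Assumed element-wise rule: D wins over everything, then '?', otherwise add constants.
Elem add_elem(const Elem& x, const Elem& y) {
    if (x.kind == Elem::Unknown || y.kind == Elem::Unknown) return {Elem::Unknown, 0};
    if (x.kind == Elem::Undefined || y.kind == Elem::Undefined) return {Elem::Undefined, 0};
    return {Elem::Constant, x.value + y.value};
}

// Assumed rule for v = v1 + v2: [v] = (a1 + a2, b1 + b2).
State add_state(const State& s1, const State& s2) {
    return {add_elem(s1.a, s2.a), add_elem(s1.b, s2.b)};
}
```

For example, adding Tid, with state (1, 0), to a function parameter, with state (0, D), gives (1, D): an affine variable with an unknown offset. Adding (2, 3) and (-2, 4) gives (0, 7), a constant, because the Tid factors cancel.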
Applying the Rules

We work on the program dependence graph.

Variables to be processed are placed in a worklist.

Applying the Rules

While there is any variable v in the worklist, we try to process the instructions that use v.

Applying the Rules

We have removed Tid from the worklist and processed it.

If all the dependences of a variable v have been processed, then we can remove v from the worklist.

If we process an instruction that defines variable w, then we add w to the worklist.

Reaching a Fixed Point
• We keep performing this abstract interpretation, until the worklist is empty.
• This happens once we reach a fixed point.
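The worklist iteration can be sketched for a toy program made only of additions. This is a simplified illustration under several assumptions: the program has no cycles, every operand is either seeded or defined by an instruction, control dependences are ignored, and the rule for addition is the component-wise one assumed here, not the authors' table. The names `AddInst` and `solve` are hypothetical.

```cpp
#include <cassert>
#include <deque>
#include <map>
#include <string>
#include <vector>

struct Elem {
    enum Kind { Undefined, Constant, Unknown } kind;  // '?', 'C', 'D'
    long value;                                       // used only when kind == Constant
};
struct State { Elem a, b; };  // read as a * Tid + b

bool same(const Elem& x, const Elem& y) {
    return x.kind == y.kind && (x.kind != Elem::Constant || x.value == y.value);
}

// Assumed rule for addition: D absorbs, then '?', otherwise add constants.
Elem add_elem(const Elem& x, const Elem& y) {
    if (x.kind == Elem::Unknown || y.kind == Elem::Unknown) return {Elem::Unknown, 0};
    if (x.kind == Elem::Undefined || y.kind == Elem::Undefined) return {Elem::Undefined, 0};
    return {Elem::Constant, x.value + y.value};
}

struct AddInst { std::string dst, lhs, rhs; };  // dst = lhs + rhs

// `env` arrives holding the initialized variables; every other variable starts at (?, ?).
std::map<std::string, State> solve(const std::vector<AddInst>& prog,
                                   std::map<std::string, State> env) {
    std::deque<std::string> worklist;
    for (const auto& kv : env) worklist.push_back(kv.first);
    for (const auto& inst : prog)
        env.emplace(inst.dst, State{{Elem::Undefined, 0}, {Elem::Undefined, 0}});
    while (!worklist.empty()) {
        std::string v = worklist.front();
        worklist.pop_front();
        for (const auto& inst : prog) {  // process the instructions that use v
            if (inst.lhs != v && inst.rhs != v) continue;
            const State& l = env.at(inst.lhs);
            const State& r = env.at(inst.rhs);
            State next{add_elem(l.a, r.a), add_elem(l.b, r.b)};
            State& cur = env.at(inst.dst);
            if (!same(cur.a, next.a) || !same(cur.b, next.b)) {
                cur = next;
                worklist.push_back(inst.dst);  // the defined variable must be revisited
            }
        }
    }
    return env;  // worklist empty: nothing changes any more
}
```

With `tid` seeded as (1, 0), a parameter `n` as (0, D) and a constant `k` as (0, 5), the program `x = tid + n; y = x + x; z = k + k` settles at x = (1, D), y = (2, D) and z = (0, 10). A full analysis would also need a rule for merging values at control-flow joins, which this sketch leaves out.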
How to Use the Divergence Analysis
• Divergence analysis with affine constraints.
• Divergence aware register allocation.
• There are many compiler optimizations that need the information provided by the divergence analysis.
• We are using the results of our divergence analysis with affine constraints to guide a register allocator.
• We call it The Divergence Aware Register Allocator.
What is Register Allocation?
• Register allocation is the problem of finding locations for the variables in a program.
• Variables can stay in registers or in memory.
• Variables sent to memory are called spills.
• In Graphics Processing Units we have roughly three types of memory:
• Local: off-chip and private to each thread.
• Global: off-chip and visible to every thread.
• Shared: on-chip and visible to every thread (in the same warp; let's abstract this detail away).
The Key Insight: where to place spills
• A traditional allocator moves every spilled variable to the local memory. However, we can do much better:
• Uniform spilled variables can be placed in the shared memory.
• Affine spilled variables can also be placed in the shared memory.
• But this is a bit trickier, and I shall explain it later.
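The placement decision and its effect on memory use can be sketched as follows. The slot counting is an illustrative model, not a measurement: it assumes one shared slot per uniform or affine variable for a whole group of threads, and one local slot per thread for everything else. `spill_space` and `spill_slots` are hypothetical names.

```cpp
#include <cassert>
#include <vector>

enum class VarClass { Uniform, Affine, Divergent };
enum class Space { Shared, Local };

// Uniform and affine spills go to shared memory; divergent spills stay local.
Space spill_space(VarClass c) {
    return c == VarClass::Divergent ? Space::Local : Space::Shared;
}

// Spill slots needed by a group of `threads` threads (assumed model).
int spill_slots(const std::vector<VarClass>& spilled, int threads, bool divergence_aware) {
    int slots = 0;
    for (VarClass c : spilled) {
        bool shared = divergence_aware && spill_space(c) == Space::Shared;
        slots += shared ? 1 : threads;
    }
    return slots;
}
```

In this model, spilling one uniform, one affine and one divergent variable for 32 threads takes 96 slots with a traditional allocator and 34 with the divergence-aware placement.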
Example

Uniform: 0×Tid + D

Affine: c×Tid + D

Divergent: D×Tid + D

Redundancy:

Uniform variables always have the same value for all the threads. Would it not be better to keep only one image of each spilled uniform variable?

Moreover, we can also share affine variables, as we will explain soon.

The benefits of our allocator
• A traditional allocator spills everything to the local memory.
• The divergence-aware allocator makes more use of the shared memory. This has many advantages:
• Shared memory is faster.
• Less memory is used to spill variables.
How to Spill Affine Values?

An affine value is like C×Tid + N, where C is a constant known at compilation time. Let's assume an expression like: N = 2*tid + t0

store: st.local N 0xFFFFFC32
changes to: st.shared t0 0xFFFFFC32

load: ld.local N 0xFFFFFC32
changes to: ld.shared t0 0xFFFFFC32
N = 2*tid + t0
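The store and load rewriting can be mimicked on the host. In this sketch `SharedSlot` is a hypothetical stand-in for one shared-memory location, and the two functions model the rewritten store and load. It is not PTX and not the code that Ocelot emits.

```cpp
#include <cassert>

// Stand-in for one shared-memory location used by all threads of a group.
struct SharedSlot {
    long t0;
};

// Rewritten store: only the thread-independent offset t0 is written.
void spill_affine(SharedSlot& slot, long t0) {
    slot.t0 = t0;
}

// Rewritten load: read t0 back, then rebuild N = c * tid + t0 in a register.
long reload_affine(const SharedSlot& slot, long c, long tid) {
    return c * tid + slot.t0;
}
```

Every thread rebuilds its own N from the same stored t0, so a single shared location serves the whole group. The price is one multiply-add after each reload.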

Implementation
• We have successfully tested our divergence analysis on all 177 CUDA kernels that we took from the Rodinia and NVIDIA SDK 3.1 benchmark suites.
• We have implemented the affine analysis and the divergence aware register allocator in Ocelot, an open source PTX optimizer.
• More than 10,000 lines of code!
• This compiler is used in the industry.
Performance

[Chart: % faster (execution time) than naive linear scan register allocation]

Setup: GTX 570, NVIDIA CUDA driver and toolkit 3.2, 32-bit Linux, 8 registers per thread.

Conclusions

• New directions for divergence-aware optimizations.
• So far, optimizations have focused on branch fusion and synchronization of divergent threads.
• Open source implementation already being used by the Ocelot community.
• To know more:
• http://simdopt.wordpress.com

Questions?
What if the affine expression is formed by constants only?

If the affine expression is like C0×Tid + C1, where C0 and C1 are constants, then we need neither loads nor stores (this is rematerialization). For instance, assume N = 2*tid + 3.

store: st.local N 0xFFFFFC32
the store is completely removed

load: ld.local N 0xFFFFFC32
changes to: N = 2*tid + 3

We have all the information to reconstruct N!
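Putting the spilling cases together, the choice of strategy can be read off the abstract state. The sketch below is a summary under assumptions and not the allocator's actual logic. The `Plan` names are made up. Treating a constant (0, C) as rematerializable and sending any state containing '?' to local memory are both guesses.

```cpp
#include <cassert>

struct Elem {
    enum Kind { Undefined, Constant, Unknown } kind;  // '?', 'C', 'D'
    long value;                                       // used only when kind == Constant
};
struct State { Elem a, b; };  // read as a * Tid + b

enum class Plan { Rematerialize, SharedValue, SharedOffset, LocalValue };

Plan spill_plan(const State& s) {
    bool a_const = s.a.kind == Elem::Constant;
    bool b_const = s.b.kind == Elem::Constant;
    if (a_const && b_const) return Plan::Rematerialize;       // C0 * Tid + C1: no load, no store
    if (a_const && s.b.kind == Elem::Unknown)
        return s.a.value == 0 ? Plan::SharedValue             // uniform: one shared copy
                              : Plan::SharedOffset;           // affine: keep only the offset
    return Plan::LocalValue;                                  // divergent, or not yet known
}

// Rebuilds the value from constants alone, e.g. N = 2 * tid + 3.
long rematerialize(long c0, long c1, long tid) {
    return c0 * tid + c1;
}
```

For N = 2*tid + 3 the state is (2, 3), the plan is `Rematerialize`, and thread 5 recomputes N = 13 without touching memory.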

Classification of spilled Variables.