
### Automatic Data Movement and Computation Mapping for Multi-level Parallel Architectures with Explicitly Managed Memories

PPoPP 2008

Muthu Baskaran¹, Uday Bondhugula¹, Sriram Krishnamoorthy¹,

J. Ramanujam², Atanas Rountev¹, P. Sadayappan¹

¹Department of Computer Science & Engineering, The Ohio State University

²Department of Electrical and Computer Engineering, Louisiana State University

Talk Outline

- Introduction
- Challenges
- Automatic Data Management
- Multi-level Tiling
- Experiments
- Related Work
- Summary
- Ongoing and Future Work


Emergence of Multi-core Architectures

- Single-processor performance improved by ~50%/yr for almost two decades
  - Clock speed, ILP, …
  - Clock speed increased over 100x
- Limits to single-processor performance growth
  - Increase in power density
  - Flattening of clock speed due to power limitation
- Transistor density continues to rise unabated
- Multiple cores are now the best option for sustained performance growth


Scratchpad Memories (1/2)

- Need to optimize memory bandwidth and latency in multi-core architectures
- Traditional solution: introduce a cache hierarchy
- Drawback: caches are hardware-managed, making it difficult to model miss behavior and to predict program execution times
- Solution in many modern architectures: fast on-chip, explicitly managed memory, i.e. scratchpad memory (local memory store)

Automatic Data Movement and Computation Mapping for Multi-level Parallel Architectures with Explicitly Managed Memories, PPoPP 2008

Scratchpad Memories (2/2)

- Scratchpads are software-managed
  - Control over data movement
  - Easier to model performance
  - Burden on the programmer/compiler to manage and utilize
  - Lower power per chip area compared to a cache
- Some modern architectures with scratchpad memories:
  - GPU
  - Cell
  - MPSoC


Challenges

- Effective management of on-chip scratchpads in multi-core architectures
- Utilize limited capacity of scratchpad
- Optimize data movement
- Effective computation mapping in many-core architectures with multiple levels of parallelism
- Exploit available parallelism
- Account for scratchpad capacity constraints


Data Management Issues

- Orchestration of data movement between off-chip global and on-chip scratchpad memory
- Decisions on
- What data elements to move in and out of scratchpad
- When to move data
- How to move data
- How to access the data elements copied to scratchpad

Overview of Automatic Data Management Approach (1/2)

- Allocation of storage space (as arrays) in the scratchpad memory for local copies
- Determination of access functions of arrays in scratchpad memories
- Generation of code for moving data between scratchpad (local) and off-chip (global) memories

Overview of Automatic Data Management Approach (2/2)

- Targeted at affine programs
  - Dense arrays
  - Loop bounds – affine functions of outer loop variables, constants, and program parameters
  - Array access functions – affine functions of surrounding loop variables, constants, and program parameters
- Developed using the polyhedral model
  - An algebraic framework for representing affine programs (statement domains, dependences, array access functions) and affine program transformations

Polyhedral Model

```c
for (i = 1; i <= 4; i++)
  for (j = 2; j <= 4; j++)
S1: a[i][j] = a[j][i] + a[i][j-1];
```

The iteration vector of statement S1 is $x_{S1} = (i, j)^T$. Its iteration domain (the rectangle $i \ge 1$, $i \le 4$, $j \ge 2$, $j \le 4$) is the set of points satisfying

$$I_{S1}:\; \begin{pmatrix} 1 & 0 & -1 \\ -1 & 0 & 4 \\ 0 & 1 & -2 \\ 0 & -1 & 4 \end{pmatrix} \begin{pmatrix} i \\ j \\ 1 \end{pmatrix} \ge 0$$

The three references to array a have the affine access functions

$$F_{1a}(x_{S1}) = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}\!\begin{pmatrix} i \\ j \end{pmatrix} + \begin{pmatrix} 0 \\ 0 \end{pmatrix} \;(a[i][j]), \qquad F_{2a}(x_{S1}) = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}\!\begin{pmatrix} i \\ j \end{pmatrix} + \begin{pmatrix} 0 \\ 0 \end{pmatrix} \;(a[j][i]),$$

$$F_{3a}(x_{S1}) = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}\!\begin{pmatrix} i \\ j \end{pmatrix} + \begin{pmatrix} 0 \\ -1 \end{pmatrix} \;(a[i][j-1])$$

The data space accessed by a reference is the image of the iteration domain under its access function, e.g. $D_{S1a} = F_{1a}(I_{S1})$.

[Figure: the 2-D iteration space of S1, bounded by i ≥ 1, i ≤ 4, j ≥ 2, j ≤ 4.]
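As a small illustration (our own sketch, not from the paper; the helper names `in_domain` and `access_f3a` are hypothetical), the domain and one access function above can be encoded directly as integer matrices and checked mechanically:

```c
#include <assert.h>

/* I_S1: each row r encodes one inequality r[0]*i + r[1]*j + r[2] >= 0 */
static const int IS1[4][3] = {
    { 1,  0, -1},   /*  i - 1 >= 0  (i >= 1) */
    {-1,  0,  4},   /* -i + 4 >= 0  (i <= 4) */
    { 0,  1, -2},   /*  j - 2 >= 0  (j >= 2) */
    { 0, -1,  4},   /* -j + 4 >= 0  (j <= 4) */
};

/* Membership test: (i, j) lies in the domain iff all rows are satisfied */
static int in_domain(int i, int j) {
    for (int r = 0; r < 4; r++)
        if (IS1[r][0] * i + IS1[r][1] * j + IS1[r][2] < 0)
            return 0;
    return 1;
}

/* F3a(x) = identity * (i, j)^T + (0, -1)^T : the reference a[i][j-1] */
static void access_f3a(int i, int j, int out[2]) {
    out[0] = 1 * i + 0 * j + 0;
    out[1] = 0 * i + 1 * j + (-1);
}
```

For instance, `in_domain(1, 2)` holds while `in_domain(0, 2)` does not, and `access_f3a` maps iteration (3, 4) to array element (3, 3).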

Automatic Data Allocation

- Given a program block, identify the storage space needed for each non-overlapping accessed region of all arrays
- Access functions of array references may be non-uniformly generated
- For architectures (e.g. NVIDIA GeForce GPUs) supporting direct data access from off-chip memory
- Estimate extent of reuse of data to determine whether or not to copy to scratchpad

Algorithm and Illustration

```c
for (i = 10; i <= 14; i++) {
  for (j = 10; j <= 14; j++) {
    A[i][j+1] = A[i+j][j+1] * 3;
    for (k = 11; k <= 20; k++)
      B[i][j+k] = A[i][k] + B[i+j][k];
  }
}
```

- Find the set of all data spaces accessed by all references to an array, using
  - the access function of each reference
  - the iteration space of the statement that holds the reference
- Partition the set of all data spaces into maximal disjoint non-overlapping subsets of data spaces
- Find the bounding box of each partition of data spaces
- Allocate a local memory array for each bounding box

For array A above, this yields two local arrays:

Local array LA0 (references A[i][j+1] and A[i][k]):
lb(i) = 10; ub(i) = 14
lb(j) = 11; ub(j) = 20

Local array LA1 (reference A[i+j][j+1]):
lb(i) = 20; ub(i) = 28
lb(j) = 11; ub(j) = 15

[Figure: the accessed regions of array A and their bounding boxes.]
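The bounding-box step can be sketched as below. This is our own illustration, not the paper's implementation: it assumes the partitioning into {A[i][j+1], A[i][k]} and {A[i+j][j+1]} has already been computed, and simply walks the iteration space tracking the min/max index touched by each partition.

```c
#include <assert.h>
#include <limits.h>

struct box { int lo0, hi0, lo1, hi1; };   /* per-dimension bounds */

/* Extend a bounding box to cover the accessed point (d0, d1) */
static void grow(struct box *b, int d0, int d1) {
    if (d0 < b->lo0) b->lo0 = d0;
    if (d0 > b->hi0) b->hi0 = d0;
    if (d1 < b->lo1) b->lo1 = d1;
    if (d1 > b->hi1) b->hi1 = d1;
}

static struct box la0, la1;

static void compute_boxes(void) {
    la0 = la1 = (struct box){INT_MAX, INT_MIN, INT_MAX, INT_MIN};
    for (int i = 10; i <= 14; i++)
        for (int j = 10; j <= 14; j++) {
            grow(&la0, i, j + 1);       /* A[i][j+1]   -> partition LA0 */
            grow(&la1, i + j, j + 1);   /* A[i+j][j+1] -> partition LA1 */
            for (int k = 11; k <= 20; k++)
                grow(&la0, i, k);       /* A[i][k]     -> partition LA0 */
        }
}
```

Running `compute_boxes` reproduces the bounds listed above: LA0 spans [10,14] × [11,20] and LA1 spans [20,28] × [11,15].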

Accessing Arrays in Scratchpad

Code with accesses redirected to scratchpad arrays:

```c
for (i = 10; i <= 14; i++) {
  for (j = 10; j <= 14; j++) {
    LA0[i-10][j+1-11] = LA1[i+j-20][j+1-11] * 3;
    for (k = 11; k <= 20; k++)
      LB0[i-10][j+k-21] = LA0[i-10][k-11] + LB1[i+j-20][k-11];
  }
}
```

Original code:

```c
for (i = 10; i <= 14; i++) {
  for (j = 10; j <= 14; j++) {
    A[i][j+1] = A[i+j][j+1] * 3;
    for (k = 11; k <= 20; k++)
      B[i][j+k] = A[i][k] + B[i+j][k];
  }
}
```

- The dimensionality of an array in the scratchpad may be lower than that of the original array, depending on the data accessed
- Access function of a local memory array: either the original access function, or a reduced access function with offsets – the lower bounds (in each dimension) of the scratchpad array

Data Movement Code Generation

```c
/* Data move-in code */
for (i = 10; i <= 14; i++) {
  for (j = 11; j <= 20; j++)
    LA0[i-10][j-11] = A[i][j];
}
for (i = 20; i <= 28; i++) {
  for (j = max(i-13,11); j <= min(15,i-9); j++)
    LA1[i-20][j-11] = A[i][j];
}

/* Data move-out code */
for (i = 10; i <= 14; i++) {
  for (j = 11; j <= 15; j++)
    A[i][j] = LA0[i-10][j-11];
}
```

- Generation of the loop structure
  - Scanning of polytopes (using CLooG, a polyhedral code generation tool) corresponding to the data spaces of
    - read references – for moving data into the scratchpad
    - write references – for moving data out of the scratchpad
- Generation of the loop body (data movement statement)
  - Copy from a location in the scratchpad buffer to an off-chip memory location, or vice versa
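The generated move-in, scratchpad compute, and move-out code together should reproduce exactly what the original nest computes. The harness below is our own sketch of that check, assuming illustrative array sizes and initial values; the move-in loop for LB1 and the move-out loop for LB0 are our reconstruction, written analogously to the A loops shown above.

```c
#include <assert.h>
#include <string.h>

static int A[29][21], B[29][35];     /* globals, reference run    */
static int Ag[29][21], Bg[29][35];   /* globals, scratchpad run   */
static int LA0[5][10], LA1[9][5], LB0[5][14], LB1[9][10];

static void run_both(void) {
    for (int i = 0; i < 29; i++)
        for (int j = 0; j < 21; j++) A[i][j] = Ag[i][j] = i * 31 + j;
    for (int i = 0; i < 29; i++)
        for (int j = 0; j < 35; j++) B[i][j] = Bg[i][j] = i * 7 - j;
    /* reference: original computation, directly on global arrays */
    for (int i = 10; i <= 14; i++)
        for (int j = 10; j <= 14; j++) {
            A[i][j+1] = A[i+j][j+1] * 3;
            for (int k = 11; k <= 20; k++)
                B[i][j+k] = A[i][k] + B[i+j][k];
        }
    /* move in */
    for (int i = 10; i <= 14; i++)
        for (int j = 11; j <= 20; j++) LA0[i-10][j-11] = Ag[i][j];
    for (int i = 20; i <= 28; i++)
        for (int j = (i-13 > 11 ? i-13 : 11); j <= (15 < i-9 ? 15 : i-9); j++)
            LA1[i-20][j-11] = Ag[i][j];
    for (int i = 20; i <= 28; i++)           /* our reconstruction */
        for (int k = 11; k <= 20; k++) LB1[i-20][k-11] = Bg[i][k];
    /* compute on scratchpad copies */
    for (int i = 10; i <= 14; i++)
        for (int j = 10; j <= 14; j++) {
            LA0[i-10][j+1-11] = LA1[i+j-20][j+1-11] * 3;
            for (int k = 11; k <= 20; k++)
                LB0[i-10][j+k-21] = LA0[i-10][k-11] + LB1[i+j-20][k-11];
        }
    /* move out (written regions only) */
    for (int i = 10; i <= 14; i++)
        for (int j = 11; j <= 15; j++) Ag[i][j] = LA0[i-10][j-11];
    for (int i = 10; i <= 14; i++)           /* our reconstruction */
        for (int jk = 21; jk <= 34; jk++) Bg[i][jk] = LB0[i-10][jk-21];
}
```

After `run_both()`, the two pairs of global arrays should be bit-identical, since the scratchpad version performs the same reads and writes in the same order, merely rerouted through the local arrays.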


GPU Architecture

[Figure: a set of multiprocessors, each with its own scratchpad, connected to shared off-chip memory.]

- Architectural components
- Slow off-chip (global) memory
- Two levels of parallelism
- Set of multiprocessors
- Set of processor cores in each multiprocessor
- Scratchpad on each multiprocessor, shared by its processor cores

Multi-level Tiling Approach

- Tiling transformation framework recently developed at OSU by Bondhugula (CC-08, PLDI-08)
  - Finds tiling transformations (hyperplanes) for sequences of imperfectly nested loops
  - Enables communication-minimal parallelization and locality optimization
  - Identifies loops to tile for parallelism and data locality
- Multiple levels of tiling, for exploiting parallelism across multiple parallel levels
- Additional (sequential) tiling at each level with a scratchpad memory, if the data required by a tile executing at that level exceeds the scratchpad capacity
- Data movement at the start and end of each sequential tile, with synchronization points to ensure consistency

Example

Original loop nest:

```
FORALL i = 1, Ni
  FORALL j = 1, Nj
    FOR k = 1, WS
      FOR l = 1, WS
        S1
      END FOR
    END FOR
  END FORALL
END FORALL
```

Multi-level tiled version:

```
// Tiling to distribute at the outer level
FORALL iT = 1, Ni, Ti
  FORALL jT = 1, Nj, Tj
    // Tiling to satisfy scratchpad memory limit
    FOR i' = iT, min(iT+Ti-1,Ni), ti'
      FOR j' = jT, min(jT+Tj-1,Nj), tj'
        FOR k' = 1, WS, tk'
          FOR l' = 1, WS, tl'
            <Data move-in code>
            // Tiling to distribute at the inner level
            FORALL it = i', min(i'+ti'-1,Ni), ti
              FORALL jt = j', min(j'+tj'-1,Nj), tj
                FOR i = it, min(it+ti-1,Ni)
                  FOR j = jt, min(jt+tj-1,Nj)
                    FOR k = k', min(k'+tk'-1,WS)
                      FOR l = l', min(l'+tl'-1,WS)
                        S1
                      END FOR
                    END FOR
                  END FOR
                END FOR
              END FORALL
            END FORALL
            <Data move-out code>
          END FOR
        END FOR
      END FOR
    END FOR
  END FORALL
END FORALL
```
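A concrete, hand-written (not compiler-generated) instance of this kind of tiling is sketched below on a small matrix product: an untiled reference version against a two-level tiled version with `min()` bounds for the remainder tiles, as in the pseudocode above. The tile sizes and the computation itself are illustrative choices of ours.

```c
#include <assert.h>
#include <string.h>

#define N 37           /* deliberately not a multiple of the tile sizes */
#define MIN(a,b) ((a) < (b) ? (a) : (b))

static double A[N][N], Bm[N][N], C0[N][N], C1[N][N];

static void run_tiled_vs_untiled(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j]  = (i * 13 + j) % 7;
            Bm[i][j] = (i + j * 11) % 5;
            C0[i][j] = C1[i][j] = 0.0;
        }
    /* untiled reference */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C0[i][j] += A[i][k] * Bm[k][j];
    /* two-level tiling: outer tiles Ti=16 (one per "multiprocessor"),
       inner tiles ti=4 (one per "core"); MIN handles partial tiles */
    const int Ti = 16, ti = 4;
    for (int iT = 0; iT < N; iT += Ti)
        for (int jT = 0; jT < N; jT += Ti)
            for (int it = iT; it < MIN(iT + Ti, N); it += ti)
                for (int jt = jT; jt < MIN(jT + Ti, N); jt += ti)
                    for (int i = it; i < MIN(it + ti, N); i++)
                        for (int j = jt; j < MIN(jt + ti, N); j++)
                            for (int k = 0; k < N; k++)
                                C1[i][j] += A[i][k] * Bm[k][j];
}
```

Because each output element accumulates over the same `k` order in both versions, the tiled result matches the untiled one exactly.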

Tile Size Determination

- Handling scratchpad memory constraints
- Cost model for data movement:

C = N × (S + (V × L) / P)

where

- N – number of data movements
- S – synchronization cost per data movement
- V – number of elements per data movement (based on tile sizes)
- L – cost to transfer one element
- P – number of processes involved in the data movement

- Tile size search formulation
  - Constraint: memory requirement within the scratchpad limit
  - Objective function: minimize the data movement cost, C
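The cost model and the search it drives can be sketched as below. This is our own simplified illustration, not the paper's formulation: it assumes a single array processed in square t × t tiles, each moved in and out once, and brute-forces the tile size under a scratchpad capacity limit.

```c
#include <assert.h>

/* C = N * (S + (V * L) / P), the data movement cost model above */
static double move_cost(double N, double S, double V, double L, double P) {
    return N * (S + (V * L) / P);
}

/* Choose t minimizing total movement cost for an n x n array processed
 * in t x t tiles, subject to t*t <= cap scratchpad words (illustrative
 * single-array setting; real formulations sum over all local arrays). */
static int best_tile(int n, int cap, double S, double L, double P) {
    int best = 1;
    double best_c = -1.0;
    for (int t = 1; t <= n; t++) {
        if (t * t > cap) break;                     /* memory constraint */
        int per_dim = (n + t - 1) / t;              /* tiles per dimension */
        double ntiles = per_dim * (double)per_dim;
        double c = move_cost(2.0 * ntiles,          /* one move in, one out */
                             S, (double)t * t, L, P);
        if (best_c < 0 || c < best_c) { best_c = c; best = t; }
    }
    return best;
}
```

With a high per-movement synchronization cost S, the search favors the largest tile that fits: fewer, bigger transfers amortize the fixed cost, which is exactly the trade-off the model captures.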

Illustration of Tile Size Search Formulation

- Loop nest of m loops with tile sizes t1, t2, …, tm
- nl local arrays
- Mj – memory (as a function of tile sizes) required for local array j
- Vin_j and Vout_j – volume (as a function of tile sizes) moved into and out of local array j, respectively
- rj – position in the loop nest at which the data movement code for array j is placed
- Mup – total scratchpad memory

Variables: t1, t2, …, tm

Memory constraint:

$$\sum_{j=1}^{nl} M_j(t_1, \dots, t_m) \le M_{up}$$

Objective function (reconstructed here from the definitions above; the slide's original formula was an image):

$$\text{minimize} \sum_{j=1}^{nl} N_j \left( S + \frac{(V^{in}_j + V^{out}_j)\, L}{P} \right)$$

where Nj, the number of data movements for array j, is determined by the trip counts of the loops surrounding position rj.


Experiments: Motion Estimation and 1D Jacobi Kernels

Machine information:

- NVIDIA GeForce 8800 GTX
- 16 multiprocessors × 8 cores @ 1.35 GHz
- 768 MB off-chip memory
- 16 × 16 KB scratchpad

[Figures: performance results for the Motion Estimation kernel (slides 1/2 and 2/2) and the 1D Jacobi kernel (slides 1/2 and 2/2); the 2/2 charts mark the tile size chosen by the model.]


Related Work

- Scratchpad memory management
- Data reuse - Issenin et al. [DAC06]
- Allocation for uniformly generated references
- Schreiber and Cronquist [HPLTR04]
- Anantharaman and Pande [RTSS98]
- Kandemir et al. [CAD04]
- Improving performance on cached architectures
- Ferrante et al. [LCPC92]
- Gallivan et al. [ICS88]
- Multi-level tiling
- Fatahalian et al. [SC06] – various levels of memory
- Bikshandi et al. [PPoPP06] and Renganarayanan et al. [SC07, IPDPS07] – parallelism and locality


Summary

- Addressed two issues in compiling for modern multi-level parallel architectures with scratchpads
- Data management in scratchpad memory
- Data allocation
- Access in scratchpad
- Code generation for data movement
- Mapping of computation in regular programs onto multiple levels of parallel units
- Experimental evaluation using an NVIDIA GPU


Ongoing and Future Work

- Developing an end-to-end compiler framework for modern many-core architectures such as GPUs
- The algorithms developed in this work form an integral part of the overall compiler framework
- Further optimizing transformations such as tiling for modern architectures like GPUs, using model-driven empirical search

Thank you