Automatic Data Movement and Computation Mapping for Multi-level Parallel Architectures with Explicitly Managed Memories

Muthu Baskaran¹ Uday Bondhugula¹ Sriram Krishnamoorthy¹

J. Ramanujam² Atanas Rountev¹ P. Sadayappan¹

¹Department of Computer Science & Engineering

The Ohio State University

²Department of Electrical and Computer Engineering

Louisiana State University

Talk Outline

Introduction

Challenges

Automatic Data Management

Multi-level Tiling

Experiments

Related Work

Summary

Ongoing and Future Work

Emergence of Multi-core Architectures
  • Single-processor performance
    • Improved by ~50%/yr for almost two decades
    • Clock speed, ILP, …
    • Clock speed increased over 100x
  • Limits to single-processor performance growth
    • Increase in power density
    • Flattening of clock speed due to power limitation
  • Transistor density continues to rise unabated
  • Multiple cores are now the best option for sustained performance growth

Scratchpad Memories (1/2)
  • Need to optimize memory bandwidth and latency in multi-core architectures
  • Traditional solution: introduce a cache hierarchy
  • Drawback
    • Caches are hardware-managed, making it difficult to model miss behavior and to predict program execution times
  • Solution in many modern architectures: fast on-chip explicitly managed memory – a scratchpad memory (local store)

Scratchpad Memories (2/2)
  • Scratchpads
    • Software-managed
      • Control over data movement
      • Easier to model performance
      • Burden on programmer/compiler to manage and utilize
    • Lower power consumption per chip area compared to caches
  • Some modern architectures with scratchpad memories
    • GPU
    • Cell
    • MPSoC

Challenges
  • Effective management of on-chip scratchpads in multi-core architectures
    • Utilize limited capacity of scratchpad
    • Optimize data movement
  • Effective computation mapping in many-core architectures with multiple levels of parallelism
    • Exploit available parallelism
    • Account for scratchpad capacity constraints

Data Management Issues
  • Orchestration of data movement between off-chip global and on-chip scratchpad memory
  • Decisions on
    • What data elements to move in and out of scratchpad
    • When to move data
    • How to move data
    • How to access the data elements copied to scratchpad

Overview of Automatic Data Management Approach (1/2)
  • Allocation of storage space (as arrays) in the scratchpad memory for local copies
  • Determination of access functions of arrays in scratchpad memories
  • Generation of code for moving data between scratchpad (local) and off-chip (global) memories

Overview of Automatic Data Management Approach (2/2)
  • Targeted at affine programs
      • Dense arrays
      • Loop bounds – affine functions of outer loop variables, constants and program parameters
      • Array access functions – affine functions of surrounding loop variables, constants and program parameters
  • Developed using polyhedral model
    • an algebraic framework for representing affine programs – statement domains, dependences, array access functions – and affine program transformations
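
To make the affine restriction concrete, here is a minimal illustrative sketch (not from the talk; the function and array names are hypothetical) contrasting a loop nest the polyhedral model can represent with one it cannot:

/* Affine loop nest: bounds and subscripts are affine functions of the
   surrounding loop variables and the parameter N (handled by the framework). */
void affine_example(int N, double A[N][N], double B[N][N], double C[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = i; j < N; j++)
            C[i][j] = A[i][j] + B[j][i];
}

/* Non-affine loop nest: the indirect subscript idx[i] and the data-dependent
   inner bound put this outside the affine program class targeted here. */
void non_affine_example(int N, double A[N][N], const int idx[N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < idx[i]; j++)
            A[idx[i]][j] = 0.0;
}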

Polyhedral Model

for (i=1; i<=4; i++)
  for (j=2; j<=4; j++)
S1: a[i][j] = a[j][i] + a[i][j-1];

Iteration vector of statement S1:  x_S1 = (i, j)^T

Iteration domain of S1 (the constraints i ≥ 1, i ≤ 4, j ≥ 2, j ≤ 4 in matrix form):

  I_S1 :  [  1  0  -1 ]
          [ -1  0   4 ]  .  (i, j, 1)^T  ≥  0
          [  0  1  -2 ]
          [  0 -1   4 ]

Affine access functions of the three references to array a:

  F1_a(x_S1) = [ 1 0 ; 0 1 ] . (i, j)^T + (0,  0)^T    – reference a[i][j]
  F2_a(x_S1) = [ 0 1 ; 1 0 ] . (i, j)^T + (0,  0)^T    – reference a[j][i]
  F3_a(x_S1) = [ 1 0 ; 0 1 ] . (i, j)^T + (0, -1)^T    – reference a[i][j-1]

Data space accessed by the first reference:  D_S1a = F1_a(I_S1)

(Figure: the iteration space of S1 drawn in the (i, j) plane, bounded by i ≥ 1, i ≤ 4, j ≥ 2, j ≤ 4.)

Automatic Data Allocation
  • Given a program block, identify the storage space needed for each non-overlapping accessed region of all arrays
    • Access functions of array references may be non-uniformly generated
  • For architectures (e.g., NVIDIA GeForce GPUs) that support direct data access from off-chip memory
    • Estimate extent of reuse of data to determine whether or not to copy to scratchpad
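
As a rough sketch of the bounding-box step (the struct and function names here are illustrative, not the actual implementation): given the per-dimension bounds of every data space in a partition, the scratchpad array's extent is the per-dimension minimum/maximum over the partition.

#define MAX_DIM 4

/* Rectangular bounds of the data space touched by one array reference. */
typedef struct { int ndim; int lb[MAX_DIM], ub[MAX_DIM]; } DataSpace;

/* Bounding box of one partition of overlapping data spaces: per dimension,
   take the smallest lower bound and the largest upper bound.  The local
   (scratchpad) array for the partition has extent ub[d]-lb[d]+1 in each
   dimension d, and its base offsets are the lb[d] values. */
DataSpace bounding_box(const DataSpace *refs, int nrefs)
{
    DataSpace box = refs[0];
    for (int r = 1; r < nrefs; r++)
        for (int d = 0; d < box.ndim; d++) {
            if (refs[r].lb[d] < box.lb[d]) box.lb[d] = refs[r].lb[d];
            if (refs[r].ub[d] > box.ub[d]) box.ub[d] = refs[r].ub[d];
        }
    return box;
}

In the example on the next slide, the references A[i][j+1] and A[i][k] fall in one partition whose box spans rows 10–14 and columns 11–20 (local array LA0), while A[i+j][j+1] forms another spanning rows 20–28 and columns 11–15 (LA1).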

Algorithm and Illustration

  • Find the set of all data spaces accessed by all references to an array
    • Access function of the reference
    • Iteration space of the statement that holds the reference
  • Partition the set of all data spaces into maximal disjoint non-overlapping subsets of data spaces
  • Find the bounding box of each partition of data spaces
  • Local memory array for each bounding box

Example:

for (i=10; i<=14; i++) {
  for (j=10; j<=14; j++) {
    A[i][j+1] = A[i+j][j+1] * 3;
    for (k=11; k<=20; k++)
      B[i][j+k] = A[i][k] + B[i+j][k];
  }
}

Local array LA0 (for the references A[i][j+1] and A[i][k]):
  lb(i) = 10; ub(i) = 14
  lb(j) = 11; ub(j) = 20

Local array LA1 (for the reference A[i+j][j+1]):
  lb(i) = 20; ub(i) = 28
  lb(j) = 11; ub(j) = 15

(Figure: the regions of array A accessed by each partition – rows 10–14 and 20–28, columns 11–20 – and their bounding boxes.)

Accessing Arrays in Scratchpad

Original code:

for (i=10; i<=14; i++) {
  for (j=10; j<=14; j++) {
    A[i][j+1] = A[i+j][j+1] * 3;
    for (k=11; k<=20; k++)
      B[i][j+k] = A[i][k] + B[i+j][k];
  }
}

Code with accesses redirected to the scratchpad (local) arrays:

for (i=10; i<=14; i++) {
  for (j=10; j<=14; j++) {
    LA0[i-10][j+1-11] = LA1[i+j-20][j+1-11] * 3;
    for (k=11; k<=20; k++)
      LB0[i-10][j+k-21] = LA0[i-10][k-11] + LB1[i+j-20][k-11];
  }
}

  • Array dimension in scratchpad may be lower than original array dimension, depending on accessed data
  • Access function into a local memory array
    • Either the original access function, or a reduced access function offset by the lower bounds (in each dimension) of the scratchpad array
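
For instance, the reference A[i][k] above falls in local array LA0, whose bounding-box lower bounds are (10, 11), so it is rewritten as LA0[i-10][k-11]; likewise A[i+j][j+1] becomes LA1[i+j-20][j+1-11], using LA1's lower bounds (20, 11).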

Data Movement Code Generation

/* Data move-in code */

for (i=10; i<=14; i++) {
  for (j=11; j<=20; j++)
    LA0[i-10][j-11] = A[i][j];
}

for (i=20; i<=28; i++) {
  for (j=max(i-13,11); j<=min(15,i-9); j++)
    LA1[i-20][j-11] = A[i][j];
}

/* Data move-out code */

for (i=10; i<=14; i++) {
  for (j=11; j<=15; j++)
    A[i][j] = LA0[i-10][j-11];
}

  • Generation of loop structure
    • Scanning of polytopes (using CLooG, a polyhedral code generation tool) corresponding to the data spaces of
      • read references – for moving data into scratchpad
      • write references – for moving data out of scratchpad
  • Generation of loop body (data movement statement)
    • Copy from a location in scratchpad buffer to off-chip memory location or vice versa

GPU Architecture

(Figure: a set of multiprocessors, each with its own scratchpad, connected to a common off-chip memory.)

  • Architectural components
    • Slow off-chip (global) memory
    • Two levels of parallelism
      • Set of multiprocessors
      • Set of processor cores in each multiprocessor
    • Scratchpad on each multiprocessor, shared by its processor cores

Multi-level Tiling Approach
  • Tiling transformation framework recently developed at OSU by Bondhugula (CC-08, PLDI-08)
    • Finds tiling transformations or hyperplanes
      • for sequences of imperfectly nested loops
      • enables communication-minimal parallelization and locality optimization
    • Identifies loops to tile for parallelism and data locality
  • Multiple levels of tiling
    • for exploiting parallelism across multiple parallel levels
  • Additional (sequential) tiling at each level that has a scratchpad memory
      • If the data required by a tile executing at that level exceeds the scratchpad capacity
      • Data movement at the start and end of each sequential tile
      • Synchronization points to ensure consistency

Example

Original code:

FORALL i = 1, Ni
  FORALL j = 1, Nj
    FOR k = 1, WS
      FOR l = 1, WS
        S1
      END FOR
    END FOR
  END FORALL
END FORALL

Multi-level tiled code:

// Tiling to distribute at the outer level
FORALL iT = 1, Ni, Ti
  FORALL jT = 1, Nj, Tj
    // Tiling to satisfy scratchpad memory limit
    FOR i' = iT, min(iT+Ti-1,Ni), ti'
      FOR j' = jT, min(jT+Tj-1,Nj), tj'
        FOR k' = 1, WS, tk'
          FOR l' = 1, WS, tl'
            <Data move-in code>
            // Tiling to distribute at the inner level
            FORALL it = i', min(i'+ti'-1,Ni), ti
              FORALL jt = j', min(j'+tj'-1,Nj), tj
                FOR i = it, min(it+ti-1,Ni)
                  FOR j = jt, min(jt+tj-1,Nj)
                    FOR k = k', min(k'+tk'-1,WS)
                      FOR l = l', min(l'+tl'-1,WS)
                        S1
                      END FOR
                    END FOR
                  END FOR
                END FOR
              END FORALL
            END FORALL
            <Data move-out code>
          END FOR
        END FOR
      END FOR
    END FOR
  END FORALL
END FORALL

Tile Size Determination
  • Handling scratchpad memory constraints
    • Cost model for data movement

C = N × (S + (V × L) / P)

N – number of data movements

S – synchronization cost per data movement

V – number of elements per data movement (based on tile sizes)

L – cost to transfer one element

P – number of processes involved in a data movement

    • Tile size search formulation
      • Constraint: memory requirement within limit
      • Objective function: minimize data movement cost, C
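
As a purely illustrative calculation (the numbers below are assumed, not measured): with N = 100 data movements, sync cost S = 4, V = 2048 elements per movement, per-element transfer cost L = 0.01, and P = 8 cooperating processes, C = 100 × (4 + (2048 × 0.01)/8) = 656. Doubling the tile's data footprint so that N = 50 and V = 4096 gives C = 50 × (4 + 5.12) = 456, which is why the model favors fewer, larger transfers as long as the scratchpad capacity constraint is met.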

Illustration of Tile Size Search Formulation
  • Loop nest of m loops with tile sizes t1, t2, ..., tm
  • nl local arrays
  • Mj – memory required (as a function of the tile sizes) for local array j
  • Vin_j and Vout_j – volume (as a function of the tile sizes) moved into and out of local array j, respectively
  • rj – position in the loop nest where the data movement code of array j is placed
  • Mup – total scratchpad memory

Variables:

t1, t2, ..., tm

Memory constraint:

M1 + M2 + ... + Mnl ≤ Mup

Objective function:

Minimize the total data movement cost C, i.e., the sum over all local arrays j of Nj × (S + (Vin_j + Vout_j) × L / P), where Nj, the number of data movements for array j, is determined by the position rj and the tile sizes of the loops surrounding it
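
A minimal sketch of how such a constrained search could be enumerated (everything here, including the 2-D tiling, the candidate sizes, and the footprint/cost expressions, is an illustrative assumption rather than the framework's actual formulation):

#include <stdio.h>
#include <float.h>

/* Illustrative footprint/cost models for a 2-D tile (ti x tj); in the real
   framework these come from the per-array Mj and Vin_j/Vout_j expressions. */
static double footprint_bytes(int ti, int tj) {
    return 3.0 * ti * tj * sizeof(float);                    /* e.g., three local arrays  */
}

static double move_cost(int ti, int tj, long Ni, long Nj,
                        double S, double L, int P) {
    double n_moves = ((double)Ni / ti) * ((double)Nj / tj);  /* N: one movement per tile  */
    double volume  = 3.0 * ti * tj;                          /* V: elements per movement  */
    return n_moves * (S + volume * L / P);                   /* C = N * (S + V*L/P)       */
}

int main(void) {
    const long Ni = 4096, Nj = 4096;   /* assumed problem size                */
    const double Mup = 16 * 1024;      /* assumed scratchpad capacity (16 KB) */
    const double S = 4.0, L = 0.01;    /* assumed sync and per-element costs  */
    const int P = 8;                   /* processes cooperating per movement  */
    int best_ti = 0, best_tj = 0;
    double best_c = DBL_MAX;

    /* Enumerate candidate tile sizes; keep the cheapest one that fits. */
    for (int ti = 4; ti <= 256; ti *= 2)
        for (int tj = 4; tj <= 256; tj *= 2) {
            if (footprint_bytes(ti, tj) > Mup) continue;     /* memory constraint   */
            double c = move_cost(ti, tj, Ni, Nj, S, L, P);   /* objective function  */
            if (c < best_c) { best_c = c; best_ti = ti; best_tj = tj; }
        }

    printf("best tile: %d x %d, modeled cost %.1f\n", best_ti, best_tj, best_c);
    return 0;
}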

Motion Estimation Kernel (1/2)

Machine Information:

NVIDIA GeForce 8800 GTX

16 x 8 cores @ 1.35 GHz

768 MB off-chip memory

16 x 16 KB scratchpad

1D Jacobi Kernel (1/2)

Machine Information:

NVIDIA GeForce 8800 GTX

16 x 8 cores @ 1.35 GHz

768 MB off-chip memory

16 x 16 KB scratchpad

Motion Estimation Kernel (2/2)

Machine Information:

NVIDIA GeForce 8800 GTX

16 x 8 cores @ 1.35 GHz

768 MB off-chip memory

16 x 16 KB scratchpad

(Figure: performance results, with the tile size chosen by the model marked.)

1D Jacobi Kernel (2/2)

Machine Information:

NVIDIA GeForce 8800 GTX

16 x 8 cores @ 1.35 GHz

768 MB off-chip memory

16 x 16 KB scratchpad

(Figure: performance results, with the tile size chosen by the model marked.)

Related Work
  • Scratchpad memory management
    • Data reuse - Issenin et al. [DAC06]
    • Allocation for uniformly generated references
      • Schreiber and Cronquist [HPLTR04]
      • Anantharaman and Pande [RTSS98]
      • Kandemir et al. [CAD04]
    • Improving performance on cached architectures
      • Ferrante et al. [LCPC92]
      • Gallivan et al. [ICS88]
  • Multi-level tiling
    • Fatahalian et al. [SC06] – various levels of memory
    • Bikshandi et al. [PPOPP06] and Renganarayanan et al. [SC07, IPDPS07] – parallelism and locality

Summary
  • Addressed two issues in compiling for modern multi-level parallel architectures with scratchpads
    • Data management in scratchpad memory
      • Data allocation
      • Access in scratchpad
      • Code generation for data movement
    • Mapping of computation in regular programs onto multiple levels of parallel units
  • Experimental evaluation using an NVIDIA GPU

Ongoing and Future Work

Developing an end-to-end compiler framework for modern many-core architectures like GPUs

The algorithms developed in this work form an integral part of the overall compiler framework

Further optimize transformations such as tiling for modern architectures like GPUs, using model-driven empirical search

Thank you

Automatic Data Movement and Computation Mapping for Multi-level Parallel Architectures with Explicitly Managed Memories, PPoPP 2008