

1. Automatic Data Movement and Computation Mapping for Multi-level Parallel Architectures with Explicitly Managed Memories

Muthu Baskaran¹, Uday Bondhugula¹, Sriram Krishnamoorthy¹, J. Ramanujam², Atanas Rountev¹, P. Sadayappan¹

¹Department of Computer Science & Engineering, The Ohio State University
²Department of Electrical and Computer Engineering, Louisiana State University

2. Talk Outline

• Introduction
• Challenges
• Automatic Data Management
• Multi-level Tiling
• Experiments
• Related Work
• Summary
• Ongoing and Future Work

3. Emergence of Multi-core Architectures

• Single-processor performance
  • Improved by ~50%/yr for almost two decades (clock speed, ILP, …)
  • Clock speed increased over 100x
• Limits to single-processor performance growth
  • Increase in power density
  • Flattening of clock speed due to power limitations
• Transistor density continues to rise unabated
• Multiple cores are now the best option for sustained performance growth

4. Scratchpad Memories (1/2)

• Need to optimize memory bandwidth and latency in multi-core architectures
• Traditional solution: introduce a cache hierarchy
  • Drawback: caches are hardware-managed, making it difficult to model miss behavior and to predict program execution times
• Solution in many modern architectures: fast on-chip explicitly managed memory, i.e., scratchpad memory (local memory store)

5. Scratchpad Memories (2/2)

• Scratchpads
  • Software-managed, giving control over data movement
  • Easier to model performance
  • Burden on the programmer/compiler to manage and utilize
  • Lower power per chip area than a cache
• Some modern architectures with scratchpad memories: GPU, Cell, MPSoC

6. Talk Outline

7. Challenges

• Effective management of on-chip scratchpads in multi-core architectures
  • Utilize the limited capacity of the scratchpad
  • Optimize data movement
• Effective computation mapping in many-core architectures with multiple levels of parallelism
  • Exploit available parallelism
  • Account for scratchpad capacity constraints

8. Talk Outline

9. Data Management Issues

• Orchestration of data movement between off-chip global memory and on-chip scratchpad memory
• Decisions on
  • What data elements to move in and out of the scratchpad
  • When to move data
  • How to move data
  • How to access the data elements copied to the scratchpad

10. Overview of Automatic Data Management Approach (1/2)

• Allocation of storage space (as arrays) in the scratchpad memory for local copies
• Determination of access functions of arrays in scratchpad memories
• Generation of code for moving data between scratchpad (local) and off-chip (global) memories

11. Overview of Automatic Data Management Approach (2/2)

• Targeted at affine programs
  • Dense arrays
  • Loop bounds: affine functions of outer loop variables, constants, and program parameters
  • Array access functions: affine functions of surrounding loop variables, constants, and program parameters
• Developed using the polyhedral model
  • An algebraic framework for representing affine programs (statement domains, dependences, array access functions) and affine program transformations

12. Polyhedral Model

for (i=1; i<=4; i++)
  for (j=2; j<=4; j++)
S1: a[i][j] = a[j][i] + a[i][j-1];

• Iteration vector of S1: xS1 = (i, j)
• Iteration space, from the bounds i ≥ 1, i ≤ 4, j ≥ 2, j ≤ 4:

IS1 = { (i, j) : [ 1 0 ; -1 0 ; 0 1 ; 0 -1 ] . (i, j)^T + (-1, 4, -2, 4)^T ≥ 0 }

• Access functions of the three references to array a:

F1a(xS1) = [ 1 0 ; 0 1 ] . (i, j)^T + (0, 0)^T   (for a[i][j])
F2a(xS1) = [ 0 1 ; 1 0 ] . (i, j)^T + (0, 0)^T   (for a[j][i])
F3a(xS1) = [ 1 0 ; 0 1 ] . (i, j)^T + (0, -1)^T  (for a[i][j-1])

• Data space of the first reference: DS1a = F1a(IS1)

13. Automatic Data Allocation

• Given a program block, identify the storage space needed for each non-overlapping accessed region of all arrays
  • Access functions of array references may be non-uniformly generated
• For architectures (e.g., NVIDIA GeForce GPUs) that support direct data access from off-chip memory
  • Estimate the extent of reuse of the data to determine whether or not to copy it to the scratchpad (a sketch of this decision follows below)
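As a concrete illustration of that copy-or-not decision, here is a minimal sketch under an assumed heuristic (the threshold and the counting scheme are hypothetical, not the paper's actual criterion): copy a region to the scratchpad only if its expected accesses per distinct element exceed a threshold.

/* Hypothetical copy decision: data with little reuse can be read
 * directly from off-chip memory (e.g., on NVIDIA GeForce GPUs);
 * data with enough reuse is worth staging in the scratchpad. */
#define REUSE_THRESHOLD 2.0

int copy_to_scratchpad(long accesses, long distinct_elements) {
    double reuse = (double)accesses / (double)distinct_elements;
    return reuse > REUSE_THRESHOLD;   /* 1 = stage in scratchpad */
}

For instance, the reference A[i][k] in the next slide's example performs 5 x 5 x 10 = 250 accesses to 5 x 10 = 50 distinct elements (a reuse of 5), so under this heuristic it would be staged.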

14. Algorithm and Illustration

for (i=10; i<=14; i++) {
  for (j=10; j<=14; j++) {
    A[i][j+1] = A[i+j][j+1] * 3;
    for (k=11; k<=20; k++)
      B[i][j+k] = A[i][k] + B[i+j][k];
  }
}

• Find the set of all data spaces accessed by all references to an array, using
  • The access function of each reference
  • The iteration space of the statement that holds the reference
• Partition the set of all data spaces into maximal disjoint non-overlapping subsets of data spaces
• Find the bounding box of each partition of data spaces
• Allocate a local memory array for each bounding box

For array A, this yields two disjoint regions and hence two local arrays:
• Local array LA0: lb(i) = 10, ub(i) = 14; lb(j) = 11, ub(j) = 20
• Local array LA1: lb(i) = 20, ub(i) = 28; lb(j) = 11, ub(j) = 15

[Figure: accessed regions of array A with the two bounding boxes LA0 and LA1]
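A compact sketch of the partition-and-bounding-box step for the example above. The descriptor, names, and the fixed set of references are illustrative assumptions; the paper's implementation operates on polyhedral data spaces rather than boxes, and a single merge pass suffices only for this example (a full implementation would iterate to a fixed point).

#include <stdio.h>

/* Hypothetical rectangular data-space descriptor: per-dimension bounds. */
typedef struct { int lb[2], ub[2]; } Box;

/* Do two boxes overlap in every dimension? */
static int overlaps(Box a, Box b) {
    for (int d = 0; d < 2; d++)
        if (a.ub[d] < b.lb[d] || b.ub[d] < a.lb[d]) return 0;
    return 1;
}

/* Bounding box of two boxes. */
static Box merge(Box a, Box b) {
    Box r;
    for (int d = 0; d < 2; d++) {
        r.lb[d] = a.lb[d] < b.lb[d] ? a.lb[d] : b.lb[d];
        r.ub[d] = a.ub[d] > b.ub[d] ? a.ub[d] : b.ub[d];
    }
    return r;
}

int main(void) {
    /* Data spaces of the three references to A in the example:
     * A[i][j+1], A[i][k], and A[i+j][j+1] (box-approximated). */
    Box refs[3] = {
        { {10, 11}, {14, 15} },   /* A[i][j+1]   */
        { {10, 11}, {14, 20} },   /* A[i][k]     */
        { {20, 11}, {28, 15} },   /* A[i+j][j+1] */
    };
    /* Greedy partition: merge each reference's box into the first
     * local array it overlaps, else start a new local array. */
    Box la[3];
    int n = 0;
    for (int i = 0; i < 3; i++) {
        int j;
        for (j = 0; j < n; j++)
            if (overlaps(la[j], refs[i])) { la[j] = merge(la[j], refs[i]); break; }
        if (j == n) la[n++] = refs[i];
    }
    for (int i = 0; i < n; i++)   /* prints the LA0 and LA1 bounds above */
        printf("LA%d: i in [%d,%d], j in [%d,%d]\n",
               i, la[i].lb[0], la[i].ub[0], la[i].lb[1], la[i].ub[1]);
    return 0;
}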

15. Accessing Arrays in Scratchpad

Original code:

for (i=10; i<=14; i++) {
  for (j=10; j<=14; j++) {
    A[i][j+1] = A[i+j][j+1] * 3;
    for (k=11; k<=20; k++)
      B[i][j+k] = A[i][k] + B[i+j][k];
  }
}

Code accessing the scratchpad copies:

for (i=10; i<=14; i++) {
  for (j=10; j<=14; j++) {
    LA0[i-10][j+1-11] = LA1[i+j-20][j+1-11] * 3;
    for (k=11; k<=20; k++)
      LB0[i-10][j+k-21] = LA0[i-10][k-11] + LB1[i+j-20][k-11];
  }
}

• The array dimensionality in the scratchpad may be lower than the original array dimensionality, depending on the accessed data
• Access function into the local memory array: the original access function, or a reduced access function, offset by the lower bounds (in each dimension) of the scratchpad array

16. Data Movement Code Generation

/* Data move-in code */
for (i=10; i<=14; i++)
  for (j=11; j<=20; j++)
    LA0[i-10][j-11] = A[i][j];
for (i=20; i<=28; i++)
  for (j=max(i-13,11); j<=min(15,i-9); j++)
    LA1[i-20][j-11] = A[i][j];

/* Data move-out code */
for (i=10; i<=14; i++)
  for (j=11; j<=15; j++)
    A[i][j] = LA0[i-10][j-11];

• Generation of the loop structure
  • Scanning of the polytopes (using CLooG, a tool for code generation) corresponding to the data spaces of
    • Read references, for moving data into the scratchpad
    • Write references, for moving data out of the scratchpad
• Generation of the loop body (the data movement statement)
  • Copy from a location in the scratchpad buffer to an off-chip memory location, or vice versa

17. Talk Outline

18. GPU Architecture

[Figure: multiple multiprocessors, each with its own scratchpad, connected to slow off-chip memory]

• Architectural components
  • Slow off-chip (global) memory
  • Two levels of parallelism
    • A set of multiprocessors
    • A set of processor cores in each multiprocessor
  • A scratchpad on each multiprocessor, shared by its processor cores
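A hypothetical CUDA sketch (not from the slides) of how these components are used together: each thread block stages a tile of the input into its multiprocessor's scratchpad (__shared__ memory), synchronizes, and computes a 1D-Jacobi-style relaxation from the local copy. The kernel name, TILE, and the halo handling are illustrative assumptions.

#define TILE 256

__global__ void jacobi1d_step(const float *in, float *out, int n) {
    __shared__ float s[TILE + 2];          /* tile plus one halo cell per side */
    int i = blockIdx.x * TILE + threadIdx.x;
    int t = threadIdx.x + 1;

    if (i < n) s[t] = in[i];               /* data move-in */
    if (threadIdx.x == 0)                  /* left halo cell */
        s[0] = (i > 0) ? in[i - 1] : 0.0f;
    if (threadIdx.x == TILE - 1 || i == n - 1)
        s[t + 1] = (i + 1 < n) ? in[i + 1] : 0.0f;  /* right halo cell */
    __syncthreads();                       /* tile fully staged in scratchpad */

    if (i > 0 && i < n - 1)                /* compute from the local copy;
                                              boundary cells left unchanged */
        out[i] = (s[t - 1] + s[t] + s[t + 1]) / 3.0f;
}

A typical launch would be jacobi1d_step<<<(n + TILE - 1) / TILE, TILE>>>(d_in, d_out, n) on hypothetical device buffers d_in and d_out.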

19. Multi-level Tiling Approach

• Tiling transformation framework recently developed at OSU by Bondhugula (CC-08, PLDI-08)
  • Finds tiling transformations (hyperplanes) for sequences of imperfectly nested loops
  • Enables communication-minimal parallelization and locality optimization
  • Identifies loops to tile for parallelism and data locality
• Multiple levels of tiling, for exploiting parallelism across multiple parallel levels
• Additional (sequential) tiling at each level with scratchpad memory
  • Applied if the data required by the tile executing at that level exceeds the memory
• Data movement at the start and end of each sequential tile
  • Synchronization points to ensure consistency

20. Example

Original loop nest:

FORALL i = 1, Ni
  FORALL j = 1, Nj
    FOR k = 1, WS
      FOR l = 1, WS
        S1
      END FOR
    END FOR
  END FORALL
END FORALL

Multi-level tiled code:

// Tiling to distribute at the outer level
FORALL iT = 1, Ni, Ti
  FORALL jT = 1, Nj, Tj
    // Tiling to satisfy the scratchpad memory limit
    FOR i' = iT, min(iT+Ti-1,Ni), ti'
      FOR j' = jT, min(jT+Tj-1,Nj), tj'
        FOR k' = 1, WS, tk'
          FOR l' = 1, WS, tl'
            <Data move-in code>
            // Tiling to distribute at the inner level
            FORALL it = i', min(i'+ti'-1,Ni), ti
              FORALL jt = j', min(j'+tj'-1,Nj), tj
                FOR i = it, min(it+ti-1,Ni)
                  FOR j = jt, min(jt+tj-1,Nj)
                    FOR k = k', min(k'+tk'-1,WS)
                      FOR l = l', min(l'+tl'-1,WS)
                        S1
                      END FOR
                    END FOR
                  END FOR
                END FOR
              END FORALL
            END FORALL
            <Data move-out code>
          END FOR
        END FOR
      END FOR
    END FOR
  END FORALL
END FORALL

21. Tile Size Determination

• Handling scratchpad memory constraints
• Cost model for data movement:

C = N × (S + (V × L) / P)

  N: number of data movements
  S: synchronization cost per data movement
  V: number of elements per data movement (based on tile sizes)
  L: cost to transfer one element
  P: number of processes involved in the data movement

• Tile size search formulation
  • Constraint: memory requirement within the scratchpad limit
  • Objective function: minimize the data movement cost C
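A direct transcription of this cost model into code; the variable names are the slide's, and the function itself is only illustrative:

/* Cost of moving data: N movements, each paying sync cost S plus the
 * time to transfer V elements at cost L each, spread over P processes. */
double data_movement_cost(double N, double S, double V, double L, double P) {
    return N * (S + (V * L) / P);
}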

22. Illustration of the Tile Size Search Formulation

• Loop nest of m loops with tile sizes t1, t2, …, tm
• nl local arrays
• Mj: memory requirement (as a function of the tile sizes) of local array j
• Vin_j and Vout_j: volume (as a function of the tile sizes) moved into and out of local array j, respectively
• rj: position in the loop nest where the data movement code of array j is placed
• Mup: total scratchpad memory

Variables: t1, t2, …, tm
Memory constraint and objective function: see the reconstruction below.
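The constraint and objective appeared as images on the original slide. A hedged reconstruction from the variable definitions above and the cost model of slide 21 (the exact form of N_j is an assumption):

\[
\text{Memory constraint:}\quad \sum_{j=1}^{n_l} M_j(t_1,\dots,t_m) \;\le\; M_{up}
\]
\[
\text{Objective:}\quad \min_{t_1,\dots,t_m}\; C \;=\; \sum_{j=1}^{n_l} N_j \left( S + \frac{\left(V^{in}_j + V^{out}_j\right) L}{P} \right)
\]

where N_j, the number of data movements for array j, is the number of iterations of the loops surrounding position r_j, itself a function of the tile sizes t1, …, tm.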

23. Talk Outline

24. Motion Estimation Kernel (1/2)

[Figure: performance results]

Machine information: NVIDIA GeForce 8800 GTX, 16 x 8 cores @ 1.35 GHz, 768 MB off-chip memory, 16 x 16 KB scratchpad

25. 1D Jacobi Kernel (1/2)

[Figure: performance results]

Machine information: NVIDIA GeForce 8800 GTX, 16 x 8 cores @ 1.35 GHz, 768 MB off-chip memory, 16 x 16 KB scratchpad

26. Motion Estimation Kernel (2/2)

[Figure: performance across tile sizes; the tile size chosen by the model is marked]

Machine information: NVIDIA GeForce 8800 GTX, 16 x 8 cores @ 1.35 GHz, 768 MB off-chip memory, 16 x 16 KB scratchpad

27. 1D Jacobi Kernel (2/2)

[Figure: performance across tile sizes; the tile size chosen by the model is marked]

Machine information: NVIDIA GeForce 8800 GTX, 16 x 8 cores @ 1.35 GHz, 768 MB off-chip memory, 16 x 16 KB scratchpad

28. Talk Outline

29. Related Work

• Scratchpad memory management
  • Data reuse: Issenin et al. [DAC06]
  • Allocation for uniformly generated references
    • Schreiber and Cronquist [HPLTR04]
    • Anantharaman and Pande [RTSS98]
    • Kandemir et al. [CAD04]
• Improving performance on cached architectures
  • Ferrante et al. [LCPC92]
  • Gallivan et al. [ICS88]
• Multi-level tiling
  • Fatahalian et al. [SC06]: various levels of memory
  • Bikshandi et al. [PPOPP06] and Renganarayanan et al. [SC07, IPDPS07]: parallelism and locality

30. Talk Outline

31. Summary

• Addressed two issues in compiling for modern multi-level parallel architectures with scratchpads
  • Data management in scratchpad memory
    • Data allocation
    • Access in the scratchpad
    • Code generation for data movement
  • Mapping of computation in regular programs onto multiple levels of parallel units
• Experimental evaluation using an NVIDIA GPU

32. Talk Outline

33. Ongoing and Future Work

• Developing an end-to-end compiler framework for modern many-core architectures like GPUs; the algorithms developed in this work form an integral part of that framework
• Further optimizing transformations such as tiling for modern architectures like GPUs, using model-driven empirical search

34. Thank you
