
Processor Architectures and Program Mapping

Processor Architectures and Program Mapping. Data Memory Management Part b: Loop transformations & Data Reuse. 5kk10 TU/e Henk Corporaal Jef van Meerbergen Bart Mesman. Thanks to the IMEC DTSE experts:. Erik Brockmeyer IMEC, Leuven, Belgium and also


Presentation Transcript


  1. Processor Architectures and Program Mapping. Data Memory Management, Part b: Loop transformations & Data Reuse. 5kk10, TU/e. Henk Corporaal, Jef van Meerbergen, Bart Mesman.

  2. Thanks to the IMEC DTSE experts: Erik Brockmeyer (IMEC, Leuven, Belgium), and also Martin Palkovic, Sven Verdoolaege, Tanja van Achteren, Sven Wuytack, Arnout Vandecappelle, Miguel Miranda, Cedric Ghez, Tycho van Meeuwen, Eddy Degreef, Michel Eyckmans, Francky Catthoor, et al.

  3. DM methodology. Design flow from C-in to C-out:
     • C-in
     • Analysis/Preprocessing
     • Dataflow Transformations
     • Loop/control-flow transformations
     • Data Reuse
     • Storage Cycle Budget Distribution
     • Memory Allocation and Assignment
     • Memory Layout organisation
     • Address optimization
     • C-out
     H.C. TD5102

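  4. Locality of Reference. Separate production and consumption loops:
         for (i=0; i < 8; i++) A[i] = …;
         for (i=0; i < 8; i++) B[7-i] = f(A[i]);
     Merged loop, in which each A[i] is consumed right after it is produced:
         for (i=0; i < 8; i++) { A[i] = …; B[7-i] = f(A[i]); }
     [Figure: location versus time plots of the production and consumption of A, one per version]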

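  5. Regularity. Version in which A is consumed in the reverse of its production order:
         for (i=0; i < 8; i++) A[i] = …;
         for (i=0; i < 8; i++) B[i] = f(A[7-i]);
     Version in which A is consumed in its production order:
         for (i=0; i < 8; i++) A[i] = …;
         for (i=0; i < 8; i++) B[7-i] = f(A[i]);
     [Figure: location versus time plots of the production and consumption of A, one per version]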

  6. Enabling Reuse. Two separate loops that both consume A:
         for (i=0; i < 8; i++) B[i] = f1(A[i]);
         for (i=0; i < 8; i++) C[i] = f2(A[i]);
     Merged loop, in which both consumptions of A[i] fall in the same iteration:
         for (i=0; i < 8; i++) { B[i] = f1(A[i]); C[i] = f2(A[i]); }
     [Figure: location versus time plots of the two consumptions of A, one per version]

  7. How to do these loop transformations automatically?
     • Requires cost function
     • Requires technique
     Let's introduce some terminology:
     • iteration spaces
     • polytopes
     • ordering vector / execution order

  8. Iteration space and polytopes
         // assume A[][] exists
         for (i=1; i<6; i++) {
           for (j=2; j<6; j++) {
             B[i][j] = g( A[i-1][j-2] );
           }
         }
     [Figure: (i, j) grid with i and j running from 0 to 5, showing the iteration space, the consumption space, the production space and the dependency vectors]

  9. Example with 3 polytopes. Algorithm having 3 loops:
         A: for (i=1; i<=N; ++i)
              for (j=1; j<=N-i+1; ++j)
                a[i][j] = in[i][j] + a[i-1][j];
         B: for (p=1; p<=N; ++p)
              b[p][1] = f( a[N-p+1][p], a[N-p][p] );
         C: for (k=1; k<=N; ++k)
              for (l=1; l<=k; ++l)
                b[k][l+1] = g( b[k][l] );
     [Figure: the three polytopes A, B and C with their iterators (i, j), p and (k, l)]

  10. Common iteration space
          for (i=1; i<=(2*N+1); ++i)
            for (j=1; j<=2*N; ++j) {
              if (i>=1 && i<=N && j>=1 && j<=N-i+1)
                a[i][j] = in[i][j] + a[i-1][j];
              if (i==N+1 && j>=1 && j<=N)
                b[j][1] = f( a[N-j+1][j], a[N-j][j] );
              if (i>=N+2 && i<=2*N+1 && j>=N+1 && j<=i-1)
                b[i-N-1][j-N+1] = g( b[i-N-1][j-N] );
            }
      Initial solution having a common iteration space:
      • Bad locality
      • Bad regularity
      • Requires 2N memory locations
      • Many dummy iterations
      [Figure: common iteration space with i from 1 to 2*N+1 and j from 1 to 2*N, with the ordering vector]

  11. Cost function needed for automation
      • Regularity
          - Equal direction for dependency vectors
          - Avoid dependency vectors that cross each other
          - Good for storage size
      • Temporal locality
          - Equal length of all dependency vectors
          - Good for storage size
          - Good for data reuse

  12. Regularity. [Figure: a regular and an irregular pattern of dependency vectors]

  13. Bad regularity limits the ordering freedom. Ordering freedom = 90 degrees. [Figure: common iteration space with i from 1 to 2*N+1 and j from 1 to 2*N]

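  14. Locality estimates. Dependency vector length is a measure for locality. Candidate estimates over the dependency vectors di between a production (P) and its consumptions (C):
      • Sum{di}
      • Max{di}
      • Spanning tree
      Q: Which length is the best estimate?
      [Figure: productions P and consumptions C connected by dependency vectors di]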

  15. Three step approach for loop transformation tool
      • Affine loop transformations
          - Only geometric information is available during placement
          - Rotation, skewing, interchange, reversal
      • Polytope placement
          - Only geometric information is available during placement
          - Translation
      • Choose ordering vector
      Combined transformation: [formula not preserved in the transcript]

  16. Three step approach for loop transformation tool
      • Affine loop transformations
      • Polytope placement
      • Choose ordering vector
          A: (i: 1..N):: (j: 1 .. N-i+1):: a[i][j] = in[i][j] + a[i-1][j];
          B: (p: 1..N):: b[p][1] = f( a[N-p+1][p], a[N-p][p] );
          C: (k: 1..N):: (l: 1..k):: b[N-k+1][l+1] = g( b[N-k+1][l] );
      [Figure: the polytopes with their iterators (i, j), p and (k, l)]

  17. Three step approach for loop transformation tool
      • Affine loop transformations
      • Polytope placement
      • Choose ordering vector

  18. Three step approach for loop transformation tool
      • Affine loop transformations
      • Polytope placement = merging loops
      • Choose ordering vector

  19. Choose optimal ordering vector. [Figure: the placed polytopes with two candidate orderings, Ordering Vector 1 and Ordering Vector 2]

  20. From the Polyhedral model back to C
      • Affine loop transformations
      • Polytope placement
      • Choose ordering vector
          for (j=1; j<=N; ++j) {
            for (i=1; i<=N-j+1; ++i)
              a[i][j] = in[i][j] + a[i-1][j];
            b[j][1] = f( a[N-j+1][j], a[N-j][j] );
            for (l=1; l<=j; ++l)
              b[j][l+1] = g( b[j][l] );
          }
      Optimized solution having a common iteration space:
      • Optimal locality
      • Optimal regularity
      • Requires 2 memory locations
      [Figure: the merged polytopes with iterators j, i and l]

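  21. Loop trafo - cavity detection. Processing chain: Scanner → Gauss Blur x → Gauss Blur y, with N x M images between the stages. The x filter traverses the image in X order and the y filter in Y order; after X-Y loop interchange the buffer size goes from N x M to N x (2GB+1).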

  22. Loop trafo-cavity (1). Steps: 1. Transform: interchange; 2. Translate: merge; 3. Order.

  23. Loop trafo-cavity (2). Steps: 1. Transform: interchange; 2. Translate: merge; 3. Order. x-blur filter:

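  24. Loop trafo - cavity detection. Processing chain: Scanner → Gauss Blur x → Gauss Blur y, with N x M images between the stages. The x filter traverses the image in X order and the y filter in Y order; after X-Y loop interchange the buffer size goes from N x M to N x (2GB+1).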

  25. Loop trafo-cavity (3). Comparing different translations (step 2): Translate 1 and Translate 2, each followed by ordering (step 3).

  26. Loop trafo-cavity (4). Combining (merging) multiple polytopes, followed by ordering (step 3). [Figure: two polytopes added into one]

  27. Result on gauss filter
          for (y=0; y<M+GB; ++y) {
            for (x=0; x<N+GB; ++x) {
              /* x blur of row y */
              if (x>=GB && x<=N-1-GB && y>=GB && y<=M-1-GB) {
                gauss_x_compute = 0;
                for (k=-GB; k<=GB; ++k)
                  gauss_x_compute += image_in[x+k][y]*Gauss[abs(k)];
                gauss_x_image[x][y] = gauss_x_compute/tot;
              } else if (x<N && y<M)
                gauss_x_image[x][y] = 0;
              /* y blur of row y-GB, delayed until its x-blurred inputs exist */
              if (x>=GB && x<=N-1-GB && (y-GB)>=GB && (y-GB)<=M-1-GB) {
                gauss_xy_compute = 0;
                for (k=-GB; k<=GB; ++k)
                  gauss_xy_compute += gauss_x_image[x][y-GB+k]*Gauss[abs(k)];
                gauss_xy_image[x][y-GB] = gauss_xy_compute/tot;
              } else if (x<N && (y-GB)>=0 && (y-GB)<M)
                gauss_xy_image[x][y-GB] = 0;
            }
          }

  28. Intermezzo
      • Before we continue with data reuse, have a look at other loop transformations

  29. DM methodology. Design flow from C-in to C-out:
      • C-in
      • Analysis/Preprocessing
      • Dataflow Transformations
      • Loop/control-flow transformations
      • Data Reuse
      • Storage Cycle Budget Distribution
      • Memory Allocation and Assignment
      • Memory Layout organisation
      • Address optimization
      • C-out

  30. Memory hierarchy and Data reuse
      • Determine reuse candidates
      • Combine reuse candidates into reuse chains
      • If there are multiple access statements per array, combine the chains into reuse trees
      • Determine number of layers (if architecture is not fixed)
      • Select candidates and assign to memory layers
      • Add extra transfers between the different memory layers (for scratchpad RAM; not for caches)
      [Figure: memory hierarchy with Layer 3, Layer 2 and Layer 1 above the data paths]

  31. TI C55@200MHz example platform
      • L2, off-chip (fixed size RAM partition): SRAM/EPROM/SDRAM/SBSRAM, size 16 MB (MAX: 8MBx16), single port, bandwidth 50M words/s
      • ROM partition (L1 ROM, data/program/DMA): size 32kB (16Kx16), first 3 cycles, next 2 cycles, bandwidth 100M words/s. It seems this can be in parallel with the 256Kb memory.
      • Variable size RAM partition: size 320kB, bandwidth 400M words/s
          - 32x 4Kx16 single port, total 256Kb, 1 element in 1 cycle
          - 8x 4Kx16 dual port, total 64Kb, 2 elements in 1 cycle
      • L0, processor partition: register file + core (TMS320vc5510@200MHz), size 2x16 registers, bandwidth 4.8G words/s
      • Vdd = 1.5 V, P = unknown

  32. Exploiting Memory Hierarchy for reduced Power: principle. Before: the processor data paths and register file access array A directly in the main memory M, with #A = 100% of the accesses at P = 1 per access, so P total (before) = 100%.

  33. Exploiting Memory Hierarchy for reduced Power: principle. After: copies A’ and A’’ are placed in smaller memories between the main memory M and the register file. 100% of the accesses go to the smallest copy at P = 0.01, 10% go to the intermediate copy at P = 0.1 and 1% go to M at P = 1.
      P total (before) = 100%
      P total (after) = 100% x 0.01 + 10% x 0.1 + 1% x 1 = 3%

  34. Data reuse decision and memory hierarchy: principle. Customized connections in the memory subsystem to bypass the memory hierarchy and avoid the overhead. [Figure: arrays A and B in main memory M, copies A’ and A’’, and customized connections to the register file and processor data paths]

  35. Step 1: identify arrays with data reuse potential
          for (i=0; i<4; i++)
            for (j=0; j<3; j++)
              for (k=0; k<6; k++)
                … = A[i*4+k];
      [Figure: array index versus time; copy1 to copy4 in time frames 1 to 4, showing intra-copy reuse and inter-copy reuse]

  36. Importance of high level cost estimate
          for (i=0; i<4; i++)
            for (j=0; j<3; j++)
              for (k=0; k<6; k++)
                … = A[i*4+k];
      Array copies are stored in-place!
      [Figure: array index versus time; copy1 to copy4 in time frames 1 to 4, marked Mk = 6]

  37. Step 1: determine gains. Intra-copy reuse factor
          for (i=0; i<4; i++)
            for (j=0; j<3; j++)
              for (k=0; k<6; k++)
                … = A[i*4+k];
      The j iterator is not present in the index expression, so there is intra-copy reuse. Its range is 3, so the intra-copy reuse factor = 3.
      [Figure: array index versus time; copy1 to copy4 in time frames 1 to 4, marked Mk = 6]

  38. Step 1: determine gains. Inter-copy reuse factor
          for (i=0; i<4; i++)
            for (j=0; j<3; j++)
              for (k=0; k<6; k++)
                … = A[i*4+k];
      The i iterator has a smaller weight (4) than the k range (6), so there is inter-copy reuse: consecutive copies overlap in 2 of their 6 elements.
      Inter-copy reuse factor = 1/(1-1/3) = 3/2
      [Figure: array index versus time; copy1 to copy4 in time frames 1 to 4, marked Mk = 6]

  39. Possibility for multi-level hierarchy
          for (i=0; i<10; i++)
            for (j=0; j<2; j++)
              for (k=0; k<3; k++)
                for (l=0; l<3; l++)
                  for (m=0; m<5; m++)
                    … = A[i*15+k*5+m];
      [Figure: array index versus time; time frames 1 and 2 subdivided into nested time frames (tf 1.1 to tf 1.6, tf 2.1 to tf 2.3), marked Mk = 15 and Mm = 5]

  40. Step 2: determine data reuse chains for each memory access
      • Many reuse possibilities
      • Prune for promising ones
      • Cost estimate needed
      [Figure: four alternative reuse chains for access R1(A), built from A and the copies A’ and A’’]

  41. Cost function needs both size and number of accesses to intermediate array. Estimate #misses from different levels for one iteration of i:
          for (i=0; i<10; i++)
            for (j=0; j<2; j++)
              for (k=0; k<3; k++)
                for (l=0; l<3; l++)
                  for (m=0; m<5; m++)
                    … = A[i*15+k*5+m];
      2*3*5 = 30
      3*5 = 15
      2*3*3*5 = 90
      [Figure: estimated #misses (0 to 100) versus size in #elements (0 to 20) for access R1(A), with points marked at sizes 5 and 15]

  42. Very simplistic power and area estimation for different data-reuse versions. [Figure: four alternative reuse chains for access R1(A), built from A, A’ and A’’, each annotated with accesses, size and energy]

  43. Step 3: determine data reuse trees for multiple accesses. First access, R1(A), with reuse chain A → A’ → A’’:
          for (i=0; i<10; i++)
            for (j=0; j<2; j++)
              for (k=0; k<3; k++)
                for (l=0; l<3; l++)
                  for (m=0; m<5; m++)
                    … = A[i*15+k*5+m];
      Second access, R2(A), with reuse chain A → A’:
          for (x=0; x<8; x++)
            for (y=0; y<5; y++)
              … = A[x*5+y];

  44. Step 3: determine data reuse trees for multiple accesses. [Figure: the reuse chains of R1(A) (A, A’, A’’) and R2(A) (A, A’) combined into one reuse tree rooted at A]

  45. Assign all data reuse trees (multiple arrays) to memory hierarchy. [Figure: reuse trees of A (accesses R1(A), R2(A)) and B (access R1(B)) with their copies A’, A’’, B’, B’’, B’’’ assigned to Layer 1, Layer 2 and Layer 3]

  46. Step 4: Determine number of layers. [Figure: data reuse trees A and B next to the hierarchy layers Layer1, Layer2, Layer3, foreground memory and datapath]

  47. Step 5: Select and assign reuse candidates. [Figure: data reuse trees, hierarchy layers and the candidate hierarchy assignments 1 to 5, from A down to the foreground memory (FG)]

  48. Step 5: All freedom in array to memory hierarchy. [Figure: data reuse trees A and B against the hierarchy layers]

  49. Step 5: Prune reuse graph (platform independent). Quite some solutions never make sense. [Figure: hierarchy layers with full freedom versus the pruned graph]

  50. Step 5: Prune reuse graph further (platform dependent). [Figure: pruned graph against the hierarchy layers, and the final solution on a 4 layer platform with A, A', B, B' and the foreground memory (FG)]
