Processor Architectures and Program Mapping

Processor Architectures and Program Mapping. Data Memory Management Part b: Loop transformations & Data Reuse. 5KK70 TU/e Henk Corporaal Bart Mesman. Thanks to the IMEC DTSE experts:. Erik Brockmeyer IMEC, Leuven, Belgium and also


Presentation Transcript


  1. Processor Architectures and Program Mapping. Data Memory Management, Part b: Loop Transformations & Data Reuse. 5KK70, TU/e. Henk Corporaal, Bart Mesman.

  2. Thanks to the IMEC DTSE experts: Erik Brockmeyer (IMEC, Leuven, Belgium), and also Martin Palkovic, Sven Verdoolaege, Tanja van Achteren, Sven Wuytack, Arnout Vandecappelle, Miguel Miranda, Cedric Ghez, Tycho van Meeuwen, Eddy Degreef, Michel Eyckmans, Francky Catthoor, et al.

  3. DM methodology (steps from C-in to C-out)
     • C-in
     • Analysis/Preprocessing
     • Dataflow Transformations
     • Loop/control-flow transformations
     • Data Reuse
     • Storage Cycle Budget Distribution
     • Memory Allocation and Assignment
     • Memory Layout organisation
     • Address optimization
     • C-out
     @HC 5KK70 Platform-based Design

  4. Locality of Reference
     Separate loops:
         for (i=0; i<8; i++) A[i] = …;
         for (i=0; i<8; i++) B[7-i] = f(A[i]);
     Merged loop:
         for (i=0; i<8; i++) {
             A[i] = …;
             B[7-i] = f(A[i]);
         }
     [Figures: location versus time, showing production and consumption for each version]
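The effect of the merge on locality can be sketched in C. This is a minimal illustration, not the course's own code: `produce` and `f` are hypothetical stand-ins for the elided `…` and for `f()`. It shows that once production and consumption sit in the same iteration, a scalar can replace the whole array `A`.

```c
#include <string.h>

#define LEN 8

/* Hypothetical stand-ins for the elided producer "…" and for f(). */
static int produce(int i) { return 3 * i + 1; }
static int f(int a) { return a * a; }

/* Separate loops: all of A[] stays live between production and consumption. */
void separate_loops(int B[LEN]) {
    int A[LEN];
    for (int i = 0; i < LEN; i++) A[i] = produce(i);
    for (int i = 0; i < LEN; i++) B[LEN - 1 - i] = f(A[i]);
}

/* Merged loop: each value is consumed right after it is produced,
   so one scalar is enough and A[] disappears. */
void merged_loop(int B[LEN]) {
    for (int i = 0; i < LEN; i++) {
        int a = produce(i);
        B[LEN - 1 - i] = f(a);
    }
}

/* Returns 1 when both versions fill B[] identically. */
int same_result(void) {
    int B1[LEN], B2[LEN];
    separate_loops(B1);
    merged_loop(B2);
    return memcmp(B1, B2, sizeof B1) == 0;
}
```

Calling `same_result()` checks that the merge preserves the values written to `B`.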

  5. Regularity
     Version 1:
         for (i=0; i<8; i++) A[i] = …;
         for (i=0; i<8; i++) B[i] = f(A[7-i]);
     Version 2:
         for (i=0; i<8; i++) A[i] = …;
         for (i=0; i<8; i++) B[7-i] = f(A[i]);
     [Figures: location versus time, showing production and consumption for each version]

  6. Enabling Reuse
     Separate loops:
         for (i=0; i<8; i++) B[i] = f1(A[i]);
         for (i=0; i<8; i++) C[i] = f2(A[i]);
     Merged loop:
         for (i=0; i<8; i++) {
             B[i] = f1(A[i]);
             C[i] = f2(A[i]);
         }
     [Figures: location versus time, showing the two consumptions of A for each version]
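A small C sketch can make the reuse gain countable. It assumes hypothetical bodies for `f1` and `f2` and routes every read of `A` through a counting helper `load_A` (also hypothetical), so the number of fetches from `A` can be compared between the two versions.

```c
#define LEN 8

static int A[LEN];
static int reads_of_A;                  /* number of loads from A[] */

static int load_A(int i) { reads_of_A++; return A[i]; }

/* Hypothetical stand-ins for f1() and f2(). */
static int f1(int a) { return a + 1; }
static int f2(int a) { return 2 * a; }

/* Two loops: A[i] is fetched once in each loop. */
int loads_separate(int B[LEN], int C[LEN]) {
    reads_of_A = 0;
    for (int i = 0; i < LEN; i++) B[i] = f1(load_A(i));
    for (int i = 0; i < LEN; i++) C[i] = f2(load_A(i));
    return reads_of_A;
}

/* Merged loop: A[i] is fetched once and reused from a local variable. */
int loads_merged(int B[LEN], int C[LEN]) {
    reads_of_A = 0;
    for (int i = 0; i < LEN; i++) {
        int a = load_A(i);
        B[i] = f1(a);
        C[i] = f2(a);
    }
    return reads_of_A;
}
```

With `LEN` equal to 8, the separate version performs 16 loads of `A` and the merged version 8.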

  7. How to do these loop transformations automatically?
     • Requires a cost function
     • Requires a technique
     Let's introduce some terminology:
     • iteration spaces
     • polytopes
     • ordering vector / execution order

  8. Iteration space and polytopes
         // assume A[][] exists
         for (i=1; i<6; i++) {
             for (j=2; j<6; j++) {
                 B[i][j] = g(A[i-1][j-2]);
             }
         }
     [Figure: grid with axes i (0..5) and j (0..5); legend: iteration space, consumption space, production space, dependency vector]
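The spaces named in the legend can be enumerated directly from the loop nest. The sketch below is an illustration under assumptions: the `Space` type and `scan_spaces` are hypothetical helpers that record the bounding box and point count of the iteration space and of the elements of `A` that are read.

```c
/* Bounding box and point count of a set of 2-D points. */
typedef struct { int i_min, i_max, j_min, j_max, points; } Space;

static void add_point(Space *s, int i, int j) {
    if (s->points == 0 || i < s->i_min) s->i_min = i;
    if (s->points == 0 || i > s->i_max) s->i_max = i;
    if (s->points == 0 || j < s->j_min) s->j_min = j;
    if (s->points == 0 || j > s->j_max) s->j_max = j;
    s->points++;
}

/* Walks the loop nest of the slide and records both spaces. */
void scan_spaces(Space *iteration, Space *consumption) {
    iteration->points = 0;
    consumption->points = 0;
    for (int i = 1; i < 6; i++)
        for (int j = 2; j < 6; j++) {
            add_point(iteration, i, j);            /* B[i][j] is written here */
            add_point(consumption, i - 1, j - 2);  /* A[i-1][j-2] is read here */
        }
}
```

The iteration space covers i in 1..5 and j in 2..5 (20 points); the elements of `A` that are consumed cover rows 0..4 and columns 0..3. Every iteration reads the element at offset (1, 2) behind it, which is the constant dependency vector drawn in the figure.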

  9. Example with 3 polytopes
     Algorithm having 3 loops:
         A: for (i=1; i<=N; ++i)
                for (j=1; j<=N-i+1; ++j)
                    a[i][j] = in[i][j] + a[i-1][j];
         B: for (p=1; p<=N; ++p)
                b[p][1] = f( a[N-p+1][p], a[N-p][p] );
         C: for (k=1; k<=N; ++k)
                for (l=1; l<=k; ++l)
                    b[k][l+1] = g( b[k][l] );
     [Figure: polytopes A (axes i, j), B (axis p) and C (axes k, l)]

  10. Common iteration space
         for (i=1; i<=(2*N+1); ++i)
             for (j=1; j<=2*N; ++j) {
                 if (i>=1 && i<=N && j>=1 && j<=N-i+1)
                     a[i][j] = in[i][j] + a[i-1][j];
                 if (i==N+1 && j>=1 && j<=N)
                     b[j][1] = f( a[N-j+1][j], a[N-j][j] );
                 if (i>=N+2 && i<=2*N+1 && j>=N+1 && j<=i-1)   /* slide: j<=N+k, with k = i-N-1 */
                     b[i-N-1][j-N+1] = g( b[i-N-1][j-N] );
             }
     Initial solution having a common iteration space:
     • Bad locality
     • Bad regularity
     • Requires 2N memory locations
     • Many dummy iterations
     [Figure: common iteration space with i from 1 to 2*N+1 and j from 1 to 2*N, and the ordering vector]
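The "many dummy iterations" remark can be checked by counting. The sketch below walks the common iteration space and counts how many points execute one of the three guarded statements. It assumes the upper bound of the third guard is `j <= i-1`, which is the slide's `N+k` read with k = i-N-1; `count_iterations` is a hypothetical helper name.

```c
/* Counts all iterations of the common (2N+1) x 2N space and the ones
   that execute statement A, B or C. */
void count_iterations(int N, int *total, int *useful) {
    *total = 0;
    *useful = 0;
    for (int i = 1; i <= 2 * N + 1; ++i)
        for (int j = 1; j <= 2 * N; ++j) {
            ++*total;
            if (i >= 1 && i <= N && j >= 1 && j <= N - i + 1)
                ++*useful;                                  /* statement A */
            if (i == N + 1 && j >= 1 && j <= N)
                ++*useful;                                  /* statement B */
            if (i >= N + 2 && i <= 2 * N + 1 && j >= N + 1 && j <= i - 1)
                ++*useful;                                  /* statement C (assumed bound) */
        }
}
```

For N = 4 this gives 72 iterations in total, of which 24 do useful work (10 for A, 4 for B, 10 for C); the other 48 are dummy iterations.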

  11. Cost function needed for automation
     • Regularity
         - Equal direction for dependency vectors
         - Avoid dependency vectors that cross each other
         - Good for storage size
     • Temporal locality
         - Equal length of all dependency vectors
         - Good for storage size
         - Good for data reuse

  12. Regularity
     [Figure panels: Regular, Irregular]

  13. Bad regularity limits the ordering freedom
     Ordering freedom = 90 degrees
     [Figure: common iteration space with i from 1 to 2*N+1 and j from 1 to 2*N]

  14. Locality estimates
     Candidate estimates: Sum{di}, Max{di}, spanning tree
     P = production, C = consumption
     Dependency vector length is a measure for locality.
     Q: Which length is the best estimate?
     [Figure: productions P with dependency vectors di to their consumptions C]
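Two of the three candidate estimates are easy to sketch in C. The code below is an illustration under assumptions: it measures each dependency length di as the distance in execution time between one production and a consumption of the same value, and `LocalityEstimate` and `estimate_locality` are hypothetical names. The spanning-tree estimate is not covered.

```c
/* Sum{di} and Max{di} for one produced value. */
typedef struct { int sum; int max; } LocalityEstimate;

/* production_time: time step at which the value is produced.
   consumption_times: time steps at which it is consumed (n entries). */
LocalityEstimate estimate_locality(int production_time,
                                   const int *consumption_times, int n) {
    LocalityEstimate e = {0, 0};
    for (int c = 0; c < n; c++) {
        int d = consumption_times[c] - production_time;  /* length di */
        e.sum += d;
        if (d > e.max) e.max = d;
    }
    return e;
}
```

For a value produced at time 2 and consumed at times 3, 5 and 9, the lengths are 1, 3 and 7, so Sum{di} is 11 and Max{di} is 7. Max{di} corresponds to how long the value has to stay stored, while Sum{di} also grows with the number of consumers.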

  15. Three-step approach for loop transformation tool
     • Affine loop transformations
         - Only geometric information is available during placement
         - Rotation, skewing, interchange, reversal
     • Polytope placement
         - Only geometric information is available during placement
         - Translation
     • Choose ordering vector
     Combined transformation: [formula not preserved in this transcript]
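The listed transformations can all be written as an affine map on an iteration point, p' = M*p + t. The slide's own combined formula is not preserved here, so the sketch below only uses this standard form; the `Point` and `Affine` types and the named matrices are hypothetical illustrations.

```c
typedef struct { int i, j; } Point;

/* Affine map p' = M*p + t on a 2-D iteration point. */
typedef struct { int m[2][2]; int t[2]; } Affine;

Point apply_affine(Affine a, Point p) {
    Point q;
    q.i = a.m[0][0] * p.i + a.m[0][1] * p.j + a.t[0];
    q.j = a.m[1][0] * p.i + a.m[1][1] * p.j + a.t[1];
    return q;
}

/* Step 1 examples: linear part only, no translation. */
static const Affine INTERCHANGE = { { {0, 1}, {1, 0} }, {0, 0} };  /* swap i and j */
static const Affine REVERSE_I   = { { {-1, 0}, {0, 1} }, {0, 0} }; /* run i backwards */
static const Affine SKEW_J_BY_I = { { {1, 0}, {1, 1} }, {0, 0} };  /* j' = i + j */
static const Affine ROTATE_90   = { { {0, -1}, {1, 0} }, {0, 0} }; /* quarter turn */

/* Step 2 example: placement is a pure translation (identity matrix). */
static const Affine SHIFT_I_BY_4 = { { {1, 0}, {0, 1} }, {4, 0} };
```

Applied to the point (2, 3), interchange gives (3, 2), reversal of i gives (-2, 3), skewing gives (2, 5), the rotation gives (-3, 2) and the translation gives (6, 3).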

  16. Three-step approach for loop transformation tool
     • Affine loop transformations
     • Polytope placement
     • Choose ordering vector
         A: (i: 1..N):: (j: 1..N-i+1):: a[i][j] = in[i][j] + a[i-1][j];
         B: (p: 1..N):: b[p][1] = f( a[N-p+1][p], a[N-p][p] );
         C: (k: 1..N):: (l: 1..k):: b[N-k+1][l+1] = g( b[N-k+1][l] );
     [Figure: the three polytopes with axes i, j, p, k, l]

  17. Three-step approach for loop transformation tool
     • Affine loop transformations
     • Polytope placement
     • Choose ordering vector

  18. Three-step approach for loop transformation tool
     • Affine loop transformations
     • Polytope placement = merging loops
     • Choose ordering vector

  19. Choose optimal ordering vector
     [Figure panels: Ordering Vector 1, Ordering Vector 2]

  20. From the polyhedral model back to C
     • Affine loop transformations
     • Polytope placement
     • Choose ordering vector
         for (j=1; j<=N; ++j) {
             for (i=1; i<=N-j+1; ++i)
                 a[i][j] = in[i][j] + a[i-1][j];
             b[j][1] = f( a[N-j+1][j], a[N-j][j] );
             for (l=1; l<=j; ++l)
                 b[j][l+1] = g( b[j][l] );
         }
     Optimized solution having a common iteration space:
     • Optimal locality
     • Optimal regularity
     • Requires 2 memory locations
     [Figure: merged polytopes with axes i, j, l]
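That the transformed loop nest computes the same `b` as the three original loop nests can be checked with a short C sketch. Everything not on the slides is an assumption made for the test: the bodies of `f` and `g`, the input values, the size N = 5, and the zero initialisation of row `a[0][*]`.

```c
#include <string.h>

enum { N = 5 };

/* Hypothetical stand-ins for f() and g(). */
static int f(int x, int y) { return x - 2 * y; }
static int g(int x) { return 3 * x + 1; }

/* The three separate loop nests A, B and C. */
void original(int in[N + 2][N + 2], int b[N + 2][N + 2]) {
    int a[N + 2][N + 2];
    memset(a, 0, sizeof a);                 /* a[0][j] assumed to be 0 */
    for (int i = 1; i <= N; ++i)
        for (int j = 1; j <= N - i + 1; ++j)
            a[i][j] = in[i][j] + a[i - 1][j];
    for (int p = 1; p <= N; ++p)
        b[p][1] = f(a[N - p + 1][p], a[N - p][p]);
    for (int k = 1; k <= N; ++k)
        for (int l = 1; l <= k; ++l)
            b[k][l + 1] = g(b[k][l]);
}

/* The merged loop nest obtained after the three steps. */
void transformed(int in[N + 2][N + 2], int b[N + 2][N + 2]) {
    int a[N + 2][N + 2];
    memset(a, 0, sizeof a);
    for (int j = 1; j <= N; ++j) {
        for (int i = 1; i <= N - j + 1; ++i)
            a[i][j] = in[i][j] + a[i - 1][j];
        b[j][1] = f(a[N - j + 1][j], a[N - j][j]);
        for (int l = 1; l <= j; ++l)
            b[j][l + 1] = g(b[j][l]);
    }
}

/* Returns 1 when both versions produce the same b. */
int equivalent(void) {
    int in[N + 2][N + 2], b1[N + 2][N + 2], b2[N + 2][N + 2];
    for (int i = 0; i < N + 2; ++i)
        for (int j = 0; j < N + 2; ++j)
            in[i][j] = 7 * i + j;           /* arbitrary test input */
    memset(b1, 0, sizeof b1);
    memset(b2, 0, sizeof b2);
    original(in, b1);
    transformed(in, b2);
    return memcmp(b1, b2, sizeof b1) == 0;
}
```

The sketch still stores `a` as a full array to keep the comparison simple. In the merged version each column j of `a` is produced and consumed inside one iteration of the outer loop, which is what allows the storage to shrink.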

  21. Loop trafo - cavity detection
     Pipeline: Scanner, Gauss Blur x, Gauss Blur y, with an N x M image between the stages.
     X-Y loop interchange: buffer size goes from N x M to N x (2GB+1).

  22. Loop trafo-cavity (1)
     Step 1: Transform: interchange
     Step 2: Translate: merge
     Step 3: Order

  23. Loop trafo-cavity (2)
     Step 1: Transform: interchange
     Step 2: Translate: merge
     Step 3: Order
     x-blur filter: [figure not preserved in this transcript]

  24. Loop trafo - cavity detection
     Pipeline: Scanner, Gauss Blur x, Gauss Blur y, with an N x M image between the stages.
     X-Y loop interchange: buffer size goes from N x M to N x (2GB+1).

  25. Loop trafo-cavity (3)
     Comparing different translations (step 2)
     [Figure panels: Translate 1, Translate 2]

  26. Loop trafo-cavity (4)
     Combining (merging) multiple polytopes (step 3: Order)
     [Figure: two polytopes added together into one merged polytope]

  27. Result on gauss filter
         for (y=0; y<M+GB; ++y) {
             for (x=0; x<N+GB; ++x) {
                 if (x>=GB && x<=N-1-GB && y>=GB && y<=M-1-GB) {
                     gauss_x_compute = 0;
                     for (k=-GB; k<=GB; ++k)
                         gauss_x_compute += image_in[x+k][y]*Gauss[abs(k)];
                     gauss_x_image[x][y] = gauss_x_compute/tot;
                 } else if (x<N && y<M)
                     gauss_x_image[x][y] = 0;

                 if (x>=GB && x<=N-1-GB && (y-GB)>=GB && (y-GB)<=M-1-GB) {
                     gauss_xy_compute = 0;
                     for (k=-GB; k<=GB; ++k)
                         gauss_xy_compute += gauss_x_image[x][y-GB+k]*Gauss[abs(k)];
                     gauss_xy_image[x][y-GB] = gauss_xy_compute/tot;
                 } else if (x<N && (y-GB)>=0 && (y-GB)<M)
                     gauss_xy_image[x][y-GB] = 0;
             }
         }

  28. Intermezzo
     • Before we continue with data reuse, have a look at other loop transformations

  29. DM methodology (steps from C-in to C-out)
     • C-in
     • Analysis/Preprocessing
     • Dataflow Transformations
     • Loop/control-flow transformations
     • Data Reuse
     • Storage Cycle Budget Distribution
     • Memory Allocation and Assignment
     • Memory Layout organisation
     • Address optimization
     • C-out

  30. Memory hierarchy and Data reuse
     • Determine reuse candidates
     • Combine reuse candidates into reuse chains
     • If an array has multiple access statements, combine the chains into reuse trees
     • Determine number of layers (if architecture is not fixed)
     • Select candidates and assign to memory layers
     • Add extra transfers between the different memory layers (for scratchpad RAM; not for caches)
     [Figure: data paths connected to memory Layer 1, Layer 2 and Layer 3]

  31. TI C55@200MHz example platform
     • L2, off-chip (fixed-size RAM partition): SRAM/EPROM/SDRAM/SBSRAM, single port, max 8M x 16, size 16 MB, bandwidth 50M words/s
     • L1 ROM partition (data/program/DMA): 16K x 16, size 32 kB, first access 3 cycles, next 2 cycles, bandwidth 100M words/s. It seems this can be accessed in parallel with the 256Kb memory.
     • Variable-size RAM partition: size 320 kB, bandwidth 400M words/s
         - 32 single-port blocks of 4K x 16 (total 256Kb), 1 element in 1 cycle
         - 8 dual-port blocks of 4K x 16 (total 64Kb), 2 elements in 1 cycle
     • L0, processor partition: register file + core TMS320vc5510@200MHz, size 2x16 registers, bandwidth 4.8G words/s
     • Vdd = 1.5 V, P = unknown

  32. Exploiting Memory Hierarchy for reduced Power: principle
     #A = 100%, P = 1
     P total (before) = 100%
     [Figure: processor data paths and register file accessing array A in memory M]

  33. Exploiting Memory Hierarchy for reduced Power: principle
     P total (before) = 100%
     P total (after) = 100% x 0.01 + 10% x 0.1 + 1% x 1 = 3%
     [Figure: processor data paths and register file with copies A'' and A' between them and array A in memory M; labels: access fractions 100%, 10%, 5%, 1%; energy per access P = 0.01, 0.1, 0.3, 1]
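The power estimate is a weighted sum: each layer contributes the fraction of accesses it serves times its relative energy per access. A minimal C sketch, with a hypothetical `Layer` type and `total_power` helper:

```c
/* One memory layer: fraction of all accesses that reach it, and its
   relative energy per access (the large memory M has energy 1.0). */
typedef struct { double access_fraction; double energy_per_access; } Layer;

/* Total relative power: sum of access_fraction * energy_per_access. */
double total_power(const Layer *layers, int n) {
    double p = 0.0;
    for (int k = 0; k < n; k++)
        p += layers[k].access_fraction * layers[k].energy_per_access;
    return p;
}
```

With a single layer `{1.0, 1.0}` the result is 1.0 (100%). With the three layers `{1.0, 0.01}`, `{0.10, 0.1}` and `{0.01, 1.0}` each term is 0.01, giving 0.03, the 3% on the slide.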

  34. Data reuse decision and memory hierarchy: principle
     Customized connections in the memory subsystem to bypass the memory hierarchy and avoid the overhead.
     [Figure: processor data paths and register file, copies A'' and A', arrays A and B in memory M, and the customized connections]

  35. Step 1: identify arrays with data reuse potential
         for (i=0; i<4; i++)
             for (j=0; j<3; j++)
                 for (k=0; k<6; k++)
                     … = A[i*4+k];
     [Figure: array index versus time; copy1 to copy4 in time frames 1 to 4, with intra-copy reuse and inter-copy reuse marked]

  36. Importance of high level cost estimate
         for (i=0; i<4; i++)
             for (j=0; j<3; j++)
                 for (k=0; k<6; k++)
                     … = A[i*4+k];
     Array copies are stored in-place!
     [Figure: array index versus time; copy1 to copy4 of size 6 (Mk) in time frames 1 to 4]

  37. Step 1: determine gains. Intra-copy reuse factor
         for (i=0; i<4; i++)
             for (j=0; j<3; j++)
                 for (k=0; k<6; k++)
                     … = A[i*4+k];
     The j iterator is not present in the index expression, so there is intra-copy reuse.
     Intra-copy reuse factor = 3
     [Figure: array index versus time; copy1 to copy4 of size 6 (Mk) in time frames 1 to 4]
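Intra-copy reuse can be made explicit by introducing the copy in the code, as one would for a scratchpad memory. The sketch below is an assumption-laden illustration: the consuming statement is replaced by a running sum, and `reads_with_copy` and `reference_sum` are hypothetical helpers. It counts the reads that still go to the large array `A`.

```c
enum { COPY_SIZE = 6, A_SIZE = 18 };   /* A[i*4+k] touches indices 0..17 */

/* Original loop nest; the sum stands in for the elided consumer. */
long reference_sum(const int A[A_SIZE]) {
    long sum = 0;
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 3; j++)
            for (int k = 0; k < 6; k++)
                sum += A[i * 4 + k];
    return sum;
}

/* Same computation through a 6-element copy that is refilled once per
   iteration of i. Returns the number of reads from A. */
int reads_with_copy(const int A[A_SIZE], long *sum) {
    int copy[COPY_SIZE];
    int reads_from_A = 0;
    *sum = 0;
    for (int i = 0; i < 4; i++) {
        for (int k = 0; k < COPY_SIZE; k++) {   /* fill the copy */
            copy[k] = A[i * 4 + k];
            reads_from_A++;
        }
        for (int j = 0; j < 3; j++)
            for (int k = 0; k < 6; k++)
                *sum += copy[k];                /* was: A[i*4+k] */
    }
    return reads_from_A;
}
```

The loop nest performs 4*3*6 = 72 reads; with the copy only 4*6 = 24 of them reach `A`, which is the reuse factor of 3 contributed by the j loop.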

  38. Step 1: determine gains. Inter-copy reuse factor
         for (i=0; i<n; i++)
             for (j=0; j<3; j++)
                 for (k=0; k<6; k++)
                     … = A[i*4+k];
     The weight of the i iterator (4) is smaller than the range of k (6), so consecutive copies overlap and there is inter-copy reuse.
     Inter-copy reuse factor = 1/(1-1/3) = 3/2
     [Figure: array index versus time; copy1 to copy4 of size 6 (Mk) in time frames 1 to 4, i.e. n = 4]
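The inter-copy reuse factor can be reproduced by counting how many elements of `A` are new in each time frame. This is a sketch under assumptions: it models an in-place copy by remembering the highest index fetched so far, and the function names are hypothetical.

```c
/* Elements fetched when every copy is loaded completely. */
long loads_without_inter_copy_reuse(long n) {
    return 6 * n;
}

/* Elements fetched when elements already present from the previous
   time frame are kept in place and only new ones are loaded. */
long loads_with_inter_copy_reuse(long n) {
    long highest_loaded = -1;   /* highest index of A fetched so far */
    long loads = 0;
    for (long i = 0; i < n; i++)
        for (long k = 0; k < 6; k++) {
            long idx = i * 4 + k;
            if (idx > highest_loaded) {
                loads++;
                highest_loaded = idx;
            }
        }
    return loads;
}

double inter_copy_reuse_factor(long n) {
    return (double)loads_without_inter_copy_reuse(n)
         / (double)loads_with_inter_copy_reuse(n);
}
```

Each new copy shares 2 of its 6 elements with the previous one, so only 4 are fetched. For n = 4 this gives 18 loads instead of 24, a factor of 4/3 because the first copy has nothing to reuse; as n grows the factor approaches 6/4 = 3/2, matching 1/(1-1/3).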

  39. Possibility for multi-level hierarchy
         for (i=0; i<10; i++)
             for (j=0; j<2; j++)
                 for (k=0; k<3; k++)
                     for (l=0; l<3; l++)
                         for (m=0; m<5; m++)
                             … = A[i*15+k*5+m];
     [Figure: array index versus time; copies of size 15 (Mk) per time frame and copies of size 5 (Mm) per sub time frame (tf 1.1, tf 1.2, …)]

  40. Step 2: determine data reuse chains for each memory access
     • Many reuse possibilities
     • Prune for promising ones
     • Cost estimate needed
     [Figure: four candidate reuse chains from A to the access R1(A), using none, one or both of the copies A' and A'']

  41. Cost function needs both size and number of accesses to intermediate array
     Estimate #misses from different levels for one iteration of i:
         for (i=0; i<10; i++)
             for (j=0; j<2; j++)
                 for (k=0; k<3; k++)
                     for (l=0; l<3; l++)
                         for (m=0; m<5; m++)
                             … = A[i*15+k*5+m];
     2*3*5 = 30, 3*5 = 15, 2*3*3*5 = 90
     [Figure: estimated #misses (0 to 100) versus copy size in #elements (0 to 20), with points marked at sizes 5 and 15; reuse chain from A via A' to R1(A)]
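The three products can be reproduced with a small simulation of one iteration of i. This is a sketch under assumptions: the copy is modelled as holding one aligned block of `copy_size` consecutive elements that is refilled completely whenever an access falls outside it, and `transfers_per_i` is a hypothetical helper that only makes sense for the sizes 0, 5 and 15.

```c
/* Number of elements transferred from A during one iteration of i.
   copy_size 0: no copy, every access goes to A.
   copy_size 5: copy holds the 5 elements of one k value.
   copy_size 15: copy holds all elements touched in the iteration. */
int transfers_per_i(int copy_size) {
    int transfers = 0;
    int held_block = -1;    /* block of A currently in the copy */
    for (int j = 0; j < 2; j++)
        for (int k = 0; k < 3; k++)
            for (int l = 0; l < 3; l++)
                for (int m = 0; m < 5; m++) {
                    int offset = k * 5 + m;     /* index minus i*15 */
                    if (copy_size == 0) {
                        transfers++;
                        continue;
                    }
                    int block = offset / copy_size;
                    if (block != held_block) {  /* miss: refill the copy */
                        transfers += copy_size;
                        held_block = block;
                    }
                }
    return transfers;
}
```

Under this model no copy gives 2*3*3*5 = 90 transfers, a 5-element copy gives 2*3*5 = 30 and a 15-element copy gives 3*5 = 15. A larger copy removes more transfers but costs more storage, which is why the cost function needs both numbers.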

  42. Very simplistic power and area estimation for different data-reuse versions
     [Figure: four reuse chains from A to R1(A), with and without the copies A' and A'', each level annotated with its accesses, size and energy; the numeric annotations are not recoverable from this transcript]

  43. Step 3: determine data reuse trees for multiple accesses
     Access R1(A), reuse chain A, A', A'', R1(A):
         for (i=0; i<10; i++)
             for (j=0; j<2; j++)
                 for (k=0; k<3; k++)
                     for (l=0; l<3; l++)
                         for (m=0; m<5; m++)
                             … = A[i*15+k*5+m];
     Access R2(A), reuse chain A, A', R2(A):
         for (x=0; x<8; x++)
             for (y=0; y<5; y++)
                 … = A[i*5+y];

  44. Step 3: determine data reuse trees for multiple accesses
     [Figure: the reuse chains of R1(A) and R2(A) combined into one reuse tree rooted at A, with copies A' and A'']

  45. Step 4: Determine number of layers
     [Figure: data reuse trees A and B next to the hierarchy layers Layer1, Layer2, Layer3, foreground memory and datapath]

  46. Step 5: Select and assign reuse candidates
     [Figure: data reuse trees of A, hierarchy layers down to foreground memory (FG), and the numbered hierarchy assignments]

  47. Step 5: All freedom in array to memory hierarchy
     [Figure: data reuse trees A and B against the hierarchy layers]

  48. Step 5: Prune reuse graph (platform independent)
     Quite a few solutions never make sense.
     [Figure panels: hierarchy layers with full freedom, hierarchy layers pruned]

  49. Step 5: Prune reuse graph further (platform dependent)
     [Figure: pruned hierarchy layers and the final solution for a 4-layer platform, with A, A', B, B' and foreground memory (FG)]

  50. Assign all data reuse trees (multiple arrays) to memory hierarchy
     [Figure: reuse trees of A (A, A', A'', R1(A), R2(A)) and B (B, B', B'', B''', R1(B)) assigned to Layer 1, Layer 2 and Layer 3]
