1 / 57

Embedded Computer Architecture

Embedded Computer Architecture. Data Memory Management Overview. 5KK73 TU/e Henk Corporaal Bart Mesman. Data Memory Management Overview. Motivation Example application DMM steps Results Notes: We concentrate on Static Data structures like arrays

dewei
Download Presentation

Embedded Computer Architecture

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Embedded Computer Architecture Data Memory Management Overview 5KK73 TU/e Henk Corporaal Bart Mesman

  2. Data Memory Management Overview • Motivation • Example application • DMM steps • Results Notes: • We concentrate on Static Data structures like arrays • The Data Transfer and Storage Exploration (DTSE)methodology, on which these slides are based, has been developed at IMEC, Leuven 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman

  3. You want the big picture? • Take a well written C-program, which is dominated by: • "well structured" loopnests, and by • accesses to "static" big (multi-dimensional) arrays • Transform your code in order to • reduced number of external memory accesses by a factor 10 • drastic reduction of energy consumption • significant improvement of speed 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman

  4. SDRAM SDRAM Serial I/O video-in B[j] = A[i*4+k]; B[j] = A[i*4+k]; B[j] = A[i*4+k]; PCI bridge video-out timers I2C I/O Data storage bottleneck audio-out I$ VLIW cpu audio-in Data transfer bottleneck D$ D$ The underlying idea for (i=0;i<n;i++) for (j=0; j<3; j++) for (k=1; k<7; k++) B[j] = A[i*4+k]; 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman

  5. TriMedia (5-issue VLIW processor) 256M 1-port SDRAM Hardware accelerators 5 out of 27 processor FUs 128*32b 16-port RegFile 16K2-port SRAM 256M 1-port SDRAM TriMedia cache use CPU SW cache 8KB Cache bypass HW cache 8/16KB SW controlled HW controlled Platform example: TriMedia 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman

  6. Generic platform model Level-2 Level-3 Level-4 Level-1 SCSI bus bus bus Chip on-chip busses bus-if bridge SCSI Disk L2 Cache ICache CPUs DCache Main Memory Disk HW accel Local Memory Local Memory Disk Local Memory 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman

  7. Power(memory) = 33 Power(arithmetic) Data transfer and storage power 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman

  8. What about delay of memories? Global wiring delay becomes dominant over gate delay 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman

  9. Data transfer and data storage specific rewrites in the application code Positioning in the Y-chart Architecture Instance Applications Applications Applications Mapping Performance Analysis Performance Numbers 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman

  10. If you are in a hurry, do this: • Given • architecture e.g. TriMedia TM1000 • reference C code for applicatione.g. MPEG-4 Motion Estimation • Task • map application on architecture • But … wait a moment me@work> tmcc -o mpeg4_me mpeg4_me.cThank you for running TriMedia compiler.Your program uses 257321886 bytes,78 Watt, and 428798765291 clock cycles 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman

  11. Let’s help the compiler ...DTSE: data transfer and storage exploration • Transforms C-code of the application • By focusing on multi-dimensional arrays • To better exploit platform capabilities • This overview covers the major steps to improve power, area, performance trade-off in the context of platform based design 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman

  12. Application domain: Computer Tomography in medical imaging Algorithm: Cavity detection in CT-scans Detect dark regions in successive images Indicate cavity in brain Application example Bad news for owner of brain 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman

  13. Application Max Value Reference (conceptual) C code for the algorithm • all functions: image_in[N x M] -> image_out[N x M] • new value of pixel depends on its neighbors • neighbor pixels read from background memory • approximately 110 lines of C code (ignoring file I/O etc) • experiments with e.g. N x M = 640 x 400 pixels • straightforward implementation: 6 image buffers Gauss Blur x Gauss Blur y Compute Edges Reverse Detect Roots 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman

  14. Concurrent OO spec Remove OO overhead Dynamic memory mgmt Task concurrency mgmt Static memory mgmt Address optimization SW/HW co-design SW design flow HW design flow Where does this fit in the whole design flow? Design flow 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman

  15. Avoid N-port Memories within real-time constraints local latch 1 & bank 1 Processor Data Paths L1 mem/ cache L2 mem/ cache Cache Bank Combine local latch N & bank N Introduce Locality Reduce redundant transfers Exploit memory hierarchy DMM (data mem. mgt.) principles Off-chip SDRAM Exploit limited life-time of array elements 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman

  16. C-in DMM steps Preprocessing Dataflow transformations Loop transformations Data reuse Memory hierarchy layer assignment Cycle budget distribution Memory allocation and assignment Data layout Address optimization C-out 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman

  17. The DMM steps • Preprocessing • Rewrite code in 3 layers (parts) • Selective inlining, Single Assignment form, .... • Data flow transformations • Eliminate redundant transfers and storage • Loop and control flow transformations • Improve regularity of accesses and data locality • Data re-use and memory hierarchy layer assignment • Determine when to move which data between memories to meet the cycle budget of the application with low cost • Determine in which layer to put the arrays (and copies) 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman

  18. The DMM steps Per memory layer: • Cycle budget distribution • determine memory access constraints for given cycle budget • Memory allocation and assignment • which memories to use, and where to put the arrays • Data layout • determine how to combine and put arrays into memories • Address optimization on the final C-code 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman

  19. for (i=0;i<N; i++) for (j=0; j<M; j++) if (i == 0) B[i][j] = 1; else B[i][j] = func1(A[i][j], A[i-1][j]); Preprocessing: Dividing an application in the 3 layers Module1a LAYER1 Module2 Module3 Module1b - testbench call - dynamic event behaviour Synchronisation - mode selection LAYER2 int func1(int a, int b) LAYER3 { return a*b; } 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman

  20. Layered code structure: level 1+2 main(){ /* Layer 1 code */ read_image(IN_NAME, image_in); cav_detect(); write_image(image_out); } void cav_detect() { /* Layer 2 code */ for (x=GB; x<=N-1-GB; ++x) { // filtersize 2GB+1 for (y=GB; y<=M-1-GB; ++y) { gauss_x_tmp = 0; for (k=-GB; k<=GB; ++k) { gauss_x_tmp += in_image[x+k][y] * Gauss[abs(k)]; } gauss_x_image[x][y] = foo(gauss_x_tmp); } } } 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman

  21. N M N-2 M-2 Data-flow trafo - cavity detection for (x=0; x<N; ++x) for (y=0; y<M; ++y) gauss_x_image[x][y]=0; for (x=1; x<=N-2; ++x) { for (y=1; y<=M-2; ++y) { gauss_x_tmp = 0; for (k=-1; k<=1; ++k) { gauss_x_tmp += image_in[x+k][y]*Gauss[abs(k)]; } gauss_x_image[x][y] = foo(gauss_x_tmp); } } #accesses: N * M + (N-2) * (M-2) 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman

  22. N M M-2 N-2 Data-flow trafo - cavity detection for (x=0; x<N; ++x) for (y=0; y<M; ++y) if ((x>=1 && x<=N-2) && (y>=1 && y<=M-2)) { gauss_x_tmp = 0; for (k=-1; k<=1; ++k) { gauss_x_tmp += image_in[x+k][y]*Gauss[abs(k)]; } gauss_x_image[x][y] = foo(gauss_x_tmp); } else { gauss_x_image[x][y] = 0; } } } #accesses: N * M gain is almost 50 % 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman

  23. Data-flow transformation • In total 5 types of data-flow transformations: • advanced signal substitution and (copy) propagation • algebraic transformations (associativity, distributive law, etc.) • shifting “delay lines” • re-computation (instead of keeping values alive too long) • transformations to eliminate bottlenecksfor subsequent loop transformations 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman

  24. Data-flow transformation - result 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman

  25. for (j=1; j<=M; j++) for (i=1; i<=N; i++) A[i]= foo(A[i]); for (i=1; i<=N; i++) out[i] = A[i]; for (i=1; i<=N; i++) { for (j=1; j<=M; j++) { A[i] = foo(A[i]); } out[i] = A[i]; } storage size 1 storage size N Loop transformations • Loop transformations • improve regularity of accesses • improve temporal locality: production  consumption • Expected influence • reduce temporary storage and (anticipated) background storage 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman

  26. Global loop transformation steps applied to cavity detection • Make all loop dimensions equal • Regularize loop traversal:possibly Y and X loop interchange • follow order of input stream • Y loop folding and global mergingX loop folding and global merging • full, global scope regularity • nearly complete locality for main signals (arrays) 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman

  27. scandevice Data enters Cavity Detectorrow-wise serial scan Buffer =image_in GaussBlur loop Cavity Detector 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman

  28. N x M Gauss Blur x X-Y Loop Interchange Loop trafo - cavity detection N x M Scanner X Y From double bufferto single buffer 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman

  29. Loop interchange (Y  X) • Not always possible; check dependences • For all loops, to maintain regularity for (x=0;x<N;x++) for (y=0;y<M;y++) /* filtering code */ for (y=0;y<M;y++) for (x=0;x<N;x++) /* filtering code */ 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman

  30. Loop trafo - cavity detection N x (2GB+1) N x 3 Gauss Blur x Gauss Blur y Compute Edges Repeated fold and loop merge 3(offset arrays) 2GB+1 From N x M toN x (3) buffer size From N x M toN x (2GB+1) buffer size 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman

  31. Improve regularity and locality Loop Merging !! Often impossible due to dependencies! for (y=0;y<M;y++) for (x=0;x<N;x++) /* 1st filtering code */ for (y=0;y<M;y++) for (x=0;x<N;x++) /* 2nd filtering code */ for (y=0;y<M;y++) for (x=0;x<N;x++) /* 1st filtering code */ for (x=0;x<N;x++) /* 2nd filtering code */ 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman

  32. Data dependencies between1st and 2nd loop for (y=0;y<M;y++) for (x=0;x<N;x++) … gauss_x_image[x][y] = … for (y=0;y<M;y++) for (x=0;x<N;x++) … for (k=-GB; k<=GB; k++) … = … gauss_x_image[x][y+k] … 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman

  33. Enable merging withLoop Folding (bumping) for (y=0;y<M;y++) for (x=0;x<N;x++) … gauss_x_image[x][y] = … for (y=0+GB;y<M+GB;y++) for (x=0;x<N;x++) … y-GB … for (k=-GB; k<=GB; k++) … gauss_x_image[x][y+k-GB] … 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman

  34. Y-loop merging resultof 1st and 2nd loop nest for (y=0;y<M+GB;y++) if (y<M) for (x=0;x<N;x++) … gauss_x_image[x][y] = … if (y>=GB) for (x=0;x<N;x++) if (x>=GB && x<=N-1-GB && (y-GB)>=GB && (y-GB)<=M-1-GB) … for (k=-GB; k<=GB; k++) … gauss_x_image[x][y-GB+k] … else 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman

  35. Simplify conditions in merged loop for (y=0;y<M+GB;y++) for (x=0;x<N;x++) if (y<M) … gauss_x_image[x][y] = … for (x=0;x<N;x++) if (y>=GB && x>=GB && x<=N-1-GB && (y-GB)>=GB && (y-GB)<=M-1-GB) … for (k=-GB; k<=GB; k++) … gauss_x_image[x][y-GB+k] … else if (y>=GB) 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman

  36. Global loop merging/folding steps 1 x  y Loop interchange (done) 2 Global y-loop folding/merging: 1st and 2nd nest (done) 3 Global y-loop folding/merging: 1st/2nd and 3rd nest 4 Global y-loop folding/merging: 1st/2nd/3rd and 4th nest 5 Global x-loop folding/merging: 1st and 2nd nest 6 Global x-loop folding/merging: 1st/2nd and 3rd nest 7 Global x-loop folding/merging: 1st/2nd/3rd and 4th nest 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman

  37. End result of global loop trafo for (y=0; y<M+GB+2; ++y) { for (x=0; x<N+2; ++x) { … if (x>=GB && x<=N-1-GB && (y-GB)>=GB && (y-GB)<=M-1-GB) { gauss_xy_compute[x][y-GB][0] = 0; for (k=-GB; k<=GB; ++k) gauss_xy_compute[x][y-GB][GB+k+1] = gauss_xy_compute[x][y-GB][GB+k] + gauss_x_image[x][y-GB+k] * Gauss[abs(k)]; gauss_xy_image[x][y-GB] = gauss_xy_compute[x][y-GB][(2*GB)+1]/tot; } else if (x<N && (y-GB)>=0 && (y-GB)<M) gauss_xy_image[x][y-GB] = 0; … 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman

  38. Loop transformations - result 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman

  39. M’’ M’ M Main memory P = 0.01 P = 0.1 P = 1 Data re-use & memory hierarchy • Introduce memory hierarchy • reduce number of reads from main memory • heavily accessed arrays stored in smaller memories Processor Data Paths Reg File #A = 100 100 10 1 P (original) = # access x power/access = 100 P (after) = 100 x 0.01 + 10 x 0.1 + 1 x 1 = 3 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman

  40. int[2][6] A;for (h=0; h<N; h++) for (i=0; i<2; i++) for (j=0; j<3; j++) for (k=0; k<6; k++) B[j] = A[i][k]; array index (6 * i + k) iterations Data re-use • Data flow transformations to introduce extracopies of heavily accessed signals • Step 1: figure out data re-use possibilities • Step 2: calculate possible gain • Step 3: decide on data assignment to memory hierarchy 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman

  41. array index 6*2 6*1 N*2*3*6 iterations frame1 frame2 frame3 CPU Data re-use • Data flow transformations to introduce extracopies of heavily accessed signals • Step 1: figure out data re-use possibilities • Step 2: calculate possible gain • Step 3: decide on data assignment to memory hierarchy 1*2*1*6 N*2*1*6 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman

  42. Data re-use tree image_in gauss_xy/comp_edge gauss_x image_out N*M M*3 M*3 M*3 N*M N*M N*M*3 N*M*3 N*M 0 1*1 N*1 3*3 1*3 N*M N*M*8 N*M*8 N*M*3 3*1 N*M*3 CPU CPU CPU CPU CPU 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman

  43. L2 L1 L0 Memory hierarchy assignment image_in image_out gauss_xy comp_edge gauss_x N*M N*M 1MB SDRAM 0 N*M M*3 M*3 M*3 16KB Cache N*M*3 N*M N*M N*M*3 N*M*3 128 B RegFile 1*1 1*1 3*1 3*3 3*3 N*M*3 N*M*8 N*M*8 N*M*8 N*M*8 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman

  44. Data-reuse - cavity detection code Code before reuse transformation for (y=0; y<M+3; ++y) { for (x=0; x<N+2; ++x) { if (x>=1 && x<=N-2 && y>=1 && y<=M-2) { gauss_x_tmp = 0; for (k=-1; k<=1; ++k) gauss_x_tmp += image_in[x+k][y]*Gauss[abs(k)]; gauss_x_image[x][y]= foo(gauss_x_compute); } else { if (x<N && y<M) gauss_x_lines[x][y] = 0; } /* Other merged code omitted … */ } } 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman

  45. Data-reuse - cavity detection code Code after reuse transformation: for (y=0; y<M+3; ++y) { for (x=0; x<N+2; ++x) { /* first in_pixels initialized */ if (x==0 && y>=1 && y<=M-2) for (k=0; k<1; ++k) in_pixels[(x+k)%3] = image_in[x+k][y]; /* copy rest of in_pixels in row */ if (x>=0 && x<=N-2 && y>=1 && y<=M-2) in_pixels[(x+1)%3] = image_in[x+1][y]; if (x>=1 && x<=N-1-1 && y>=1 && y<=M-2) { gauss_x_tmp=0; for (k=-1; k<=1; ++k) gauss_x_tmp += in_pixels[(x+k)%3]*Gauss[Abs(k)]; gauss_x_lines[x][y%3]= foo(gauss_x_tmp); } else if (x<N && y<M) gauss_x_lines[x][y%3] = 0; 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman

  46. Data reuse & memory hierarchy 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman

  47. Data layout optimization • At this point multi-dimensional arraysare to be assigned to physical memories • Data layout optimization determines exactly where in each memory an array should be placed, to • reduce memory size by “in-placing” arrays that do not overlap in time (disjoint lifetimes) • to avoid cache misses due to conflicts • exploit spatial locality of the data in memory to improve performance of e.g. page-mode memory access sequences 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman

  48. A A A A D D B B D D C C C C B B E E E E In-place mapping Inter in-place Both intra+inter addresses Intra in-place time 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman

  49. 0x0 b8 mem1[10400]; for (i,j,k,l; …) mem1[10000+i+20*j] = f(mem1[10000+j+20*i], b6(mem1[i+k+100*(j+l)]); A 0x2710 B 0x28a0 In-place mapping • Implements all the “anticipated” memory size savings obtained in previous steps • Modifies code to introduce one array per “real” memory • Changes indices to addresses in mem. arrays b8 A[100][100]; b6 B[20][20]; for (i,j,k,l; …) B[i][j] = f(B[j][i], A[i+k][j+l]); 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman

  50. address Image time In-place mapping • Input image is partly consumed, and not reused again, by the time first results for output image are ready index Image_in time index Image_out time 5KK73 Embedded Computer Architecture H. Corporaal and B. Mesman

More Related