Embedded Computer Architecture

Embedded Computer Architecture. Data Management Part c: SCBD, MAA, and Data Layout. 5KK73, TU/e, Henk Corporaal. Part 3 overview: recap on design flow; platform-dependent steps; SCBD (Storage Cycle Budget Distribution); MAA (Memory Allocation and Assignment); data layout techniques for RAM.

Presentation Transcript


  1. Embedded Computer Architecture Data Management Part c: SCBD, MAA, and Data Layout 5KK73 TU/e Henk Corporaal

  2. Part 3 overview • Recap on design flow • Platform dependent steps • SCBD: Storage Cycle Budget Distribution • MAA: Memory Allocation and Assignment • Data layout techniques for RAM • Data layout techniques for Caches • Results • Conclusions Thanks to the IMEC DTSE people Embedded Computer Architecture 5KK73 @H.C.

  3. DM design flow: Concurrent OO spec -> Remove OO overhead -> Dynamic memory mgmt -> Task concurrency mgmt -> Physical memory mgmt -> Address optimization -> SW/HW co-design -> SW design flow and HW design flow.

  4. DM steps: C-in -> Preprocessing -> Dataflow transformations -> Loop transformations -> Data reuse -> Memory hierarchy layer assignment -> Cycle budget distribution -> Memory allocation and assignment -> Data layout -> Address optimization -> C-out. Today: the steps from cycle budget distribution onward.

  5. Result of memory hierarchy assignment for cavity detection. [Figure: the arrays image_in, gauss_x, gauss_xy, comp_edge, and image_out placed across three layers: L2 (1 MB SDRAM), L1 (16 KB cache), and L0 (128 B register file). Size and access-count labels on the figure: 0, N*M, M*3, 1*1, 3*1, 3*3, N*M*3, N*M*8.]

  6. Data-reuse - cavity detection code. Code after reuse transformation (partly):

     for (y=0; y<M+3; ++y) {
       for (x=0; x<N+2; ++x) {
         /* first in_pixels initialized */
         if (x==0 && y>=1 && y<=M-2)
           in_pixels[x%3] = image_in[x][y];
         /* copy rest of in_pixels in row */
         if (x>=0 && x<=N-2 && y>=1 && y<=M-2)
           in_pixels[(x+1)%3] = image_in[x+1][y];
         if (x>=1 && x<=N-2 && y>=1 && y<=M-2) {
           gauss_x_tmp = 0;
           for (k=-1; k<=1; ++k) /* 3x1 filter */
             gauss_x_tmp += in_pixels[(x+k)%3]*Gauss[Abs(k)];
           gauss_x_lines[x][y%3] = foo(gauss_x_tmp);
         } else if (x<N && y<M)
           gauss_x_lines[x][y%3] = 0;
         /* remainder of the loop body not shown on the slide */
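The reuse idea in this code (fetch each input pixel once and keep only the last three in a small buffer indexed modulo 3) can be sketched for a single image row. This is a minimal illustration under assumptions, not the course's code: the function names, the `GAUSS` weights, and the zeroed border outputs are hypothetical, and the slide's `foo()` post-processing is left out.

```c
#include <assert.h>

/* Hypothetical 3-tap weights indexed by |k|; the slide does not give Gauss[]. */
const int GAUSS[2] = {2, 1};

/* Reference version: every filter tap is read from the large input array. */
void gauss_row_direct(const int *in, int *out, int n) {
    assert(n >= 0);
    for (int x = 0; x < n; ++x) {
        out[x] = 0;
        if (x >= 1 && x <= n - 2)
            for (int k = -1; k <= 1; ++k)
                out[x] += in[x + k] * GAUSS[k < 0 ? -k : k];
    }
}

/* Reuse version: each input pixel is fetched once into a 3-entry
   circular buffer, and the filter reads only from that buffer. */
void gauss_row_reuse(const int *in, int *out, int n) {
    assert(n >= 0);
    int in_pixels[3] = {0, 0, 0};
    for (int x = 0; x < n; ++x) {
        if (x == 0)
            in_pixels[0] = in[0];               /* first pixel of the row */
        if (x + 1 < n)
            in_pixels[(x + 1) % 3] = in[x + 1]; /* fetch the next pixel once */
        out[x] = 0;
        if (x >= 1 && x <= n - 2)
            for (int k = -1; k <= 1; ++k)
                out[x] += in_pixels[(x + k) % 3] * GAUSS[k < 0 ? -k : k];
    }
}
```

Both functions produce the same output row; the reuse version reads `in[]` once per pixel instead of three times, at the price of modulo address arithmetic.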

  7. Storage Cycle Budget Distribution & Memory Allocation and Assignment

  8. Define the memory organization that provides enough bandwidth at minimal cost.

  9. Balancing memory bandwidth: reduce the maximum number of loads/stores per cycle. [Figure: required memory bandwidth over time, once with a high peak and once with a low, balanced peak.]
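The quantity being balanced here is the largest number of accesses that fall into one cycle. A minimal sketch, assuming a schedule is given as one cycle number per access (the representation and the names `peak_accesses_per_cycle` and `MAX_CYCLES` are hypothetical):

```c
#include <assert.h>

enum { MAX_CYCLES = 16 };

/* Returns the largest number of memory accesses scheduled in any single cycle.
   cycle_of[a] is the cycle assigned to access a, with 0 <= cycle < budget. */
int peak_accesses_per_cycle(const int *cycle_of, int n_accesses, int budget) {
    assert(budget > 0 && budget <= MAX_CYCLES);
    int count[MAX_CYCLES] = {0};
    int peak = 0;
    for (int a = 0; a < n_accesses; ++a) {
        assert(cycle_of[a] >= 0 && cycle_of[a] < budget);
        if (++count[cycle_of[a]] > peak)
            peak = count[cycle_of[a]];
    }
    return peak;
}
```

For six accesses in a budget of three cycles, the assignment {0,0,0,0,1,2} has a peak of four accesses per cycle, while {0,0,1,1,2,2} has a peak of two; both fit the same budget.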

  10. Data management approach. Idea: find a schedule which • fits in the number of cycles (= budget) • reduces the number of ports • avoids multi-ported memories. [Figure: one of the many possible schedules.]

  11. Data management approach; details

  12. Conflict cost calculation. Key issues: • Number of conflicts • Self conflicts • Chromatic number (at least the size of the maximum clique)

  13. Self conflict -> dual-port memory. [Figure: rescheduling step.]

  14. Chromatic number -> minimum number of single-port memories. [Figure: rescheduling step.]
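The relation between a schedule, its conflict graph, and the number of single-port memories can be sketched as follows. Two arrays accessed in the same cycle get a conflict edge; an array accessed twice in one cycle has a self conflict; a colouring of the graph assigns arrays to single-port memories. The data structures and names are hypothetical, and the greedy colouring used here only gives an upper bound on the chromatic number, not necessarily the minimum.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

enum { MAX_ARRAYS = 8, MAX_PER_CYCLE = 4 };

typedef struct {
    int n_acc;                    /* number of accesses issued in this cycle */
    int array_id[MAX_PER_CYCLE];  /* array touched by each access */
} Cycle;

typedef struct {
    bool conflict[MAX_ARRAYS][MAX_ARRAYS]; /* two arrays accessed in one cycle */
    bool self_conflict[MAX_ARRAYS];        /* one array accessed twice in one cycle */
} ConflictGraph;

void build_conflict_graph(const Cycle *sched, int n_cycles, ConflictGraph *g) {
    memset(g, 0, sizeof *g);
    for (int c = 0; c < n_cycles; ++c) {
        assert(sched[c].n_acc <= MAX_PER_CYCLE);
        for (int p = 0; p < sched[c].n_acc; ++p) {
            for (int q = p + 1; q < sched[c].n_acc; ++q) {
                int a = sched[c].array_id[p], b = sched[c].array_id[q];
                if (a == b)
                    g->self_conflict[a] = true;
                else
                    g->conflict[a][b] = g->conflict[b][a] = true;
            }
        }
    }
}

/* Greedy colouring in index order. memory_of[v] is the single-port memory
   chosen for array v; the return value is the number of memories used. */
int greedy_memories(const ConflictGraph *g, int n_arrays, int *memory_of) {
    int used = 0;
    for (int v = 0; v < n_arrays; ++v) {
        bool taken[MAX_ARRAYS] = {false};
        for (int u = 0; u < v; ++u)
            if (g->conflict[v][u])
                taken[memory_of[u]] = true;
        int m = 0;
        while (taken[m]) ++m;
        memory_of[v] = m;
        if (m + 1 > used) used = m + 1;
    }
    return used;
}
```

With arrays A..D numbered 0..3 and cycles {A,B}, {B,C}, {C,D}, {A,A}, the graph is a path that fits in two single-port memories, but A carries a self conflict and would need a dual-port memory unless the schedule is changed.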

  15. Lower number of conflicts -> larger assignment freedom. [Figure: a schedule of accesses to arrays A, B, C, D, the resulting valid conflict graph, and the final memory configuration. The first schedule admits one solution; after rescheduling, multiple solutions exist.]

  16. Conflict Directed Ordering is used to find a good schedule • Reduce intervals until all conflicts known • Driven by cost of conflicts • Constructive algorithm. [Figure: read and write accesses R(A), R(B), R(C), R(D), W(A), W(B), W(C), W(D) being placed into time slots 1 to 6.]

  17. Local optimization is not good for global optimization

  18. Budget distribution has large impact on memory cost

  19. Decreasing basic block length until target cycle budget is met

  20. What's the effect of merging loops? • More scheduling freedom! [Figure: rescheduling step.]

  21. Memory allocation and assignment

  22. Memory Allocation and Assignment substeps: • Memory allocation • Array-to-memory assignment • Port assignment and bus sharing. Allocation = select the number and type of memories. [Figure: arrays A, B, C, D assigned to memories 1, 2, 3.]

  23. Influence of MAA • Bit width • Address range • Nr. memories • Nr. ports • Assign arrays to memory • Memory interconnect • Minimize power & area. [Figure: memories MEMORY-1 to MEMORY-N, each with a (maximum) bit width, a (maximum) size, and a number of ports (R/W/RW); arrays A and B with different bit widths, where unused bits are marked X.]

  24. Example of bus sharing possibilities. [Figure: a given schedule of read and write accesses to arrays A, B, C, held in memories m1, m2, m3, with three alternative bus configurations.]
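One way to read bus sharing against a given schedule: two memories may share a bus only if the schedule never accesses both in the same cycle. The sketch below checks that condition; the bitmask representation and the name `can_share_bus` are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* busy_mask[c] has bit m set when memory m is accessed in cycle c.
   Two memories can share one bus only if no cycle accesses both. */
bool can_share_bus(const unsigned *busy_mask, int n_cycles, int m1, int m2) {
    assert(m1 != m2 && m1 >= 0 && m2 >= 0 && m1 < 32 && m2 < 32);
    unsigned both = (1u << m1) | (1u << m2);
    for (int c = 0; c < n_cycles; ++c)
        if ((busy_mask[c] & both) == both)
            return false;
    return true;
}
```

For a three-cycle schedule with masks {0x3, 0x4, 0x1}, memories 0 and 1 are accessed together in the first cycle and need separate buses, while memory 2 could share a bus with either of them.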

  25. Decreasing cycle budget limits freedom and raises cost

  26. Example: resulting Pareto curve for DAB synchro application. [Figure: energy cost versus cycle budget, between the minimum budget and the sequential budget. Annotations: "Self conflict, forcing dual port mem."; "Conflict graph changed, change in assignment"; "Conflict graph changed, but no impact on assignment".]

  27. Example conflict graph for cavity detection

  28. MAA result Power: On-chip area:

  29. Data layout: how to put data into memory

  30. Memory data layout for custom and cache architectures. [Figure: processing elements (PE) attached to custom memories MEM1 and MEM2 or to a cache; question marks indicate where the arrays A, B, C, F, G, H and the copies A', B' should be placed.]

  31. Intra-array in-place mapping reduces the size of one array:

     for (i=1; i<5; i++)
       for (j=0; j<5; j++)
         a[i][j] = f(a[i-1][j]);

     Window: max. nr. of live elements, measured as the address range they span. This number depends on the layout! Compare e.g. row-major and column-major ordering. [Figure: memory addresses of the elements a[i][j] over time, with the window covering rows i-1 and i.]

  32. Two-phase mapping of array elements onto addresses: array domains (A, B, C) -> storage order -> abstract addresses (aA, aB, aC) -> allocation -> real addresses (a).

  33. Exploration of storage orders for a 2-dimensional array (variable domain a1 in {0,1}, a2 in {0,1,2}, mapped to memory address a): 8 options • a=3a1+a2 • a=3a1+(2-a2) • a=3(1-a1)+a2 • a=3(1-a1)+(2-a2) • a=2a2+a1 • a=2(2-a2)+a1 • a=2a2+(1-a1) • a=2(2-a2)+(1-a1)

  34. Chosen storage order determines window size.

     for (i=1; i<5; i++)
       for (j=0; j<5; j++)
         a[i][j] = f(a[i-1][j]);

     Row-major ordering: a=5i+j
       for (i=1; i<5; i++)
         for (j=0; j<5; j++)
           a[5*i+j] = f(a[5*i+j-5]);
       Highest live address: 5*i+j
       Lowest live address: 5*i+j-5
       Difference + 1 = Window: 6

     Column-major ordering: a=5j+i
       for (i=1; i<5; i++)
         for (j=0; j<5; j++)
           a[5*j+i] = f(a[5*j+i-1]);
       Highest live address: 5*4+i-1
       Lowest live address: 5*0+i-1
       Difference + 1 = Window: 21
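The two window sizes on this slide (6 for row-major, 21 for column-major) can be reproduced by brute force. The sketch assumes that at iteration (i, j) the live elements are a[i-1][j..4] (still to be read, including the one read now) and a[i][0..j] (already written, including the one written now); the function names are hypothetical.

```c
#include <assert.h>

enum { ROWS = 5, COLS = 5 };

typedef int (*storage_order_fn)(int i, int j);

int row_major(int i, int j) { return COLS * i + j; }  /* a = 5i + j */
int col_major(int i, int j) { return ROWS * j + i; }  /* a = 5j + i */

/* Largest (highest live address - lowest live address + 1) over all
   iterations of: for i in 1..4, for j in 0..4: a[i][j] = f(a[i-1][j]). */
int window_size(storage_order_fn addr) {
    assert(addr != 0);
    int window = 0;
    for (int i = 1; i < ROWS; i++) {
        for (int j = 0; j < COLS; j++) {
            int lo = addr(i - 1, j), hi = lo;
            for (int jj = j; jj < COLS; jj++) {   /* row i-1, not yet consumed */
                int a = addr(i - 1, jj);
                if (a < lo) lo = a;
                if (a > hi) hi = a;
            }
            for (int jj = 0; jj <= j; jj++) {     /* row i, already produced */
                int a = addr(i, jj);
                if (a < lo) lo = a;
                if (a > hi) hi = a;
            }
            if (hi - lo + 1 > window) window = hi - lo + 1;
        }
    }
    return window;
}
```

Calling `window_size(row_major)` and `window_size(col_major)` gives the two values quoted above, which makes it easy to try the other storage orders as well.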

  35. Static allocation: no in-place mapping. [Figure: memory size over time; arrays A, B, C, D, E each keep their own address range aA, aB, aC, aD, aE.]

  36. Windowed allocation: intra-array in-place mapping. [Figure: memory size over time for arrays A to E, shown for two variants: static, windowed and dynamic, windowed.]

  37. Dynamic allocation: inter-array in-place mapping. [Figure: memory size over time; arrays A to E with address ranges aA to aE that may share memory space.]

  38. Dynamic allocation strategy with common window. [Figure: memory size over time; arrays A to E placed dynamically in a common window.]

  39. Expressing memory data layout in source code. Example: array of 10x20 elements. A: offset 120, no window. B: storage order [20, 2], offset 134, window 78.

     Before:
       bit8 B[10][20];
       bit6 A[30];
       for (x=0; x<10; ++x)
         for (y=0; y<20; ++y) {
           … = A[3*x-y];
           B[x][y] = …;
         }

     After:
       bit8 memory[334];
       bit8* B = (bit8*)&memory[134];
       bit6* A = (bit6*)&memory[120];
       for (x=0; x<10; ++x)
         for (y=0; y<20; ++y) {
           … = A[3*x-y];
           B[(x*20+y*2)%78] = …;
         }

  40. Example of memory data layout for storage size reduction:

     int x[W], y[W];
     for (i1=0; i1 < W; i1++)
       x[i1] = getInput();
     for (i2=0; i2 < W; i2++) {
       sum = 0;
       for (di2=-N; di2 <= N; di2++) {
         sum += c[N+di2] * x[wrap(i2+di2,W)];
       }
       y[i2] = sum;
     }
     for (i3=0; i3 < W; i3++)
       putOutput(y[i3]);

  41. Occupied address-time domain of x[] and y[]

  42. Optimized source code after memory data layout:

     int mem1[N+W];
     for (i1=0; i1 < W; i1++)
       mem1[N+i1] = getInput();
     for (i2=0; i2 < W; i2++) {
       sum = 0;
       for (di2=-N; di2 <= N; di2++) {
         sum += c[N+di2] * mem1[N+wrap(i2+di2,W)];
       }
       mem1[i2] = sum;
     }
     for (i3=0; i3 < W; i3++)
       putOutput(mem1[i3]);
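A runnable sketch of this inter-array in-place mapping, comparing the two-array version with the single-buffer version. Several things here are assumptions: the slides do not define `wrap()`, so the hypothetical `clamp_index()` below clamps the index to the valid range (with a modulo wrap, late iterations would read x[0..N-1] after `mem1[i2]` has already overwritten them); `getInput()`/`putOutput()` are replaced by plain arrays; and `W = 16`, `N = 2` are arbitrary example sizes.

```c
#include <assert.h>

enum { W = 16, N = 2 };

/* Hypothetical stand-in for the slide's wrap(): clamps i to [0, w-1]. */
int clamp_index(int i, int w) {
    assert(w > 0);
    return i < 0 ? 0 : (i >= w ? w - 1 : i);
}

/* Separate x[] and y[]: W + W words of storage. */
void filter_two_arrays(const int *input, const int *c, int *output) {
    int x[W], y[W];
    for (int i = 0; i < W; i++) x[i] = input[i];
    for (int i = 0; i < W; i++) {
        int sum = 0;
        for (int d = -N; d <= N; d++)
            sum += c[N + d] * x[clamp_index(i + d, W)];
        y[i] = sum;
    }
    for (int i = 0; i < W; i++) output[i] = y[i];
}

/* x[] and y[] share one buffer of N + W words: x[k] lives at mem1[N + k],
   y[i] at mem1[i]. Writing y[i] overwrites x[i - N], which is last read
   earlier in that same iteration. */
void filter_in_place(const int *input, const int *c, int *output) {
    int mem1[N + W];
    for (int i = 0; i < W; i++) mem1[N + i] = input[i];
    for (int i = 0; i < W; i++) {
        int sum = 0;
        for (int d = -N; d <= N; d++)
            sum += c[N + d] * mem1[N + clamp_index(i + d, W)];
        mem1[i] = sum;
    }
    for (int i = 0; i < W; i++) output[i] = mem1[i];
}
```

Under these assumptions both functions return identical outputs, while the in-place version needs N + W words instead of 2W.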

  43. Optimized OAT domain after memory data layout

  44. In-place mapping for cavity detection example • Input image is partly consumed by the time first results for output image are ready. [Figure: index versus time for Image_in and Image_out, and address versus time for the merged Image.]

  45. In-place - cavity detection code.

     Before:
       for (y=0; y<=M+3; ++y) {
         for (x=0; x<N+5; ++x) {
           image_out[x-5][y-3] = …;
           /* code removed */
           … = image_in[x+1][y];
         }
       }

     After (image_in and image_out mapped onto one array):
       for (y=0; y<=M+3; ++y) {
         for (x=0; x<N+5; ++x) {
           image[x-5][y-3] = …;
           /* code removed */
           … = image[x+1][y];
         }
       }

  46. Cavity detection summary Overall result: • Local accesses reduced by factor 3 • Memory size reduced by factor 5 • Power reduced by factor 5 • System bus load reduced by factor 12 • Performance worsened by factor 6

  47. The last step: ADOPT (Address OPTimization) • Increased execution time introduced by DTSE: complicated address arithmetic (modulo: a%b) and additional complex control flow • Additional transformations needed to: simplify control flow; simplify address arithmetic (common sub-expression elimination, modulo expansion, …); match remaining expressions on target machine

  48. ADOPT principles • How to avoid % in address expressions, like

       int A[7];
       for (i=0; i<…; i++)
         … A[i % 7]

     • Increase buffer size to a power of 2, so that i % 8 => i & 0x07 (for non-negative i)
     • Use an if-statement:

       int A[7];
       for (i=0, j=0; i<…; i++) {
         … A[j]
         if (++j == 7) j = 0;
       }
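The two ways of avoiding the modulo can be compared against the plain `%` version. This is a small sketch with hypothetical function names; the mask variant assumes the buffer has been enlarged to 8 entries and that the index is non-negative.

```c
#include <assert.h>

/* Baseline: circular index computed with a modulo on every access. */
int index_modulo(int i) { return i % 7; }

/* Variant 1: buffer padded from 7 to 8 entries, so the modulo becomes
   a bit mask. Only valid for non-negative i. */
int index_mask(int i) {
    assert(i >= 0);
    return i & 0x07;
}

/* Variant 2: keep the 7-entry buffer and carry a second counter j that
   is reset with an if-statement instead of being computed with %. */
void fill_indices_counter(int *idx, int n) {
    for (int i = 0, j = 0; i < n; i++) {
        idx[i] = j;          /* the access would be A[j] */
        if (++j == 7) j = 0;
    }
}
```

The counter variant produces exactly the indices of `i % 7`; the mask variant produces those of `i % 8` and therefore trades one extra buffer entry for cheaper address arithmetic.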

  49. ADOPT principles: CSE. Example: Full-search Motion Estimation, applying Common Subexpression Elimination (CSE). Algebraic transformations at word-level.

     Before:
       for (i=-8; i<=8; i++) {
         for (j=-4; j<=3; j++) {
           for (k=-4; k<=3; k++)
             A[((208+i)*257+8+j)*257+16+i+k] = B[(8+j)*257+16+i+k];
         }
         dist += A[3096] - B[((208+i)*257+4)*257+16+i-4];
       }

     After:
       for (i=-8; i<=8; i++) {
         cse1 = (33025*i+6869616)*2;
         cse3 = 1040+i;
         for (j=-4; j<=3; j++) {
           cse4 = j*257+1032;
           for (k=-4; k<=3; k++) {
             cse5 = k+cse4;
             A[cse5+cse1] = B[cse5+cse3];
           }
         }
         dist += A[3096] - B[cse1];
       }
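The cse1, cse3, cse4, and cse5 expressions on this slide can be checked against the original address expressions over the whole loop range. The sketch below only verifies the arithmetic; the helper names and the `Addresses` struct are hypothetical.

```c
#include <assert.h>

/* Original address expressions from the motion estimation loop nest. */
int addr_lhs_orig(int i, int j, int k) { return ((208 + i) * 257 + 8 + j) * 257 + 16 + i + k; }
int addr_rhs_orig(int i, int j, int k) { return (8 + j) * 257 + 16 + i + k; }
int addr_dist_orig(int i)              { return ((208 + i) * 257 + 4) * 257 + 16 + i - 4; }

typedef struct { int lhs, rhs, dist; } Addresses;

/* The same addresses rebuilt from the shared subexpressions. */
Addresses addresses_cse(int i, int j, int k) {
    int cse1 = (33025 * i + 6869616) * 2;  /* depends on i only */
    int cse3 = 1040 + i;                   /* depends on i only */
    int cse4 = j * 257 + 1032;             /* depends on j only */
    int cse5 = k + cse4;                   /* innermost loop */
    Addresses a = { cse5 + cse1, cse5 + cse3, cse1 };
    return a;
}

/* Aborts via assert if any address differs over the full iteration space. */
void check_cse(void) {
    for (int i = -8; i <= 8; i++)
        for (int j = -4; j <= 3; j++)
            for (int k = -4; k <= 3; k++) {
                Addresses a = addresses_cse(i, j, k);
                assert(a.lhs == addr_lhs_orig(i, j, k));
                assert(a.rhs == addr_rhs_orig(i, j, k));
                assert(a.dist == addr_dist_orig(i));
            }
}
```

Because cse1 and cse3 depend only on i and cse4 only on j, they can be hoisted out of the inner loops, leaving one addition per address in the innermost loop.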

  50. Conclusion on Data Management • In multi-media applications, exploring data transfer and storage issues should be done at source code level • DMM method: reducing number of external memory accesses; reducing external memory size; trade-offs between internal memory complexity and speed • Platform independent high-level transformations • Platform dependent transformations exploit platform characteristics (efficient use of memory, cache, …) • Substantial energy reduction
