
Automatic Synthesis of Customized Local Memories for Multicluster Application Accelerators

Manjunath Kudlur, Kevin Fan, Michael Chu, Scott Mahlke. Advanced Computer Architecture Laboratory, University of Michigan.


Presentation Transcript


  1. Automatic Synthesis of Customized Local Memories for Multicluster Application Accelerators
     Manjunath Kudlur, Kevin Fan, Michael Chu, Scott Mahlke
     Advanced Computer Architecture Laboratory, University of Michigan

  2. Motivation
     • Custom application accelerators (ASICs/ASIPs) require careful data memory system design
     • Large volumes of data accessed at high bandwidth
     • Distributed local memories (scratchpads)
     • Achieve high bandwidth through parallel access
     • Achieve low latency by placing data near computation
     • Custom memory design is complex
     • Multiple considerations: bandwidth, size requirements, data distribution
     • Decentralized datapath: another monkey wrench

  3. Background – Our System
     • Synthesis of non-programmable accelerators
     • System similar to PICO (Program-In Chip-Out)
     • Input is a “hot” loop nest expressed in C
     • Throughput-directed synthesis
     • Required throughput expressed as II (initiation interval)
     • Innermost loop is modulo scheduled
     • Datapath derived directly from the schedule
     • FU allocation to meet II
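The "FU allocation to meet II" step can be illustrated with the standard resource bound from modulo scheduling: a fully pipelined FU offers II issue slots per loop iteration, so a type with N operations needs at least ceil(N / II) units. This is a minimal sketch, not the system's allocator; the function name `min_fus`, the FU type names, and the operation counts are hypothetical, and full pipelining is an assumption.

```python
from math import ceil

def min_fus(op_counts, ii):
    """Lower bound on FUs per type, assuming fully pipelined FUs:
    each unit provides `ii` issue slots per loop iteration."""
    if ii < 1:
        raise ValueError("II must be a positive integer")
    return {fu: ceil(n / ii) for fu, n in op_counts.items()}

# Hypothetical loop body: 7 adds, 3 multiplies, 4 memory operations.
ops = {"ADD": 7, "MUL": 3, "MEM": 4}
print(min_fus(ops, 2))  # tighter II, more hardware
print(min_fus(ops, 4))  # relaxed II, less hardware
```

A smaller target II (higher throughput) raises the FU count, which is the cost/throughput trade-off that throughput-directed synthesis explores.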

  4. Background – Multicluster Datapath
     • FUs divided into clusters
     • Intercluster communication through global bus
     • Reduced wire lengths, reduced porting on register file structures
     • Increased compiler complexity
     [Figure: a C program mapped to a two-cluster datapath; each cluster has register FIFOs, FUs, MEM units, and local memories; the clusters are connected by an interconnection network]

  5. Background – Local Memories
     • SRAMs connected to MEM units in clusters
     • Data structures assigned to a single SRAM
     • Can be whole arrays or parts of an array
     • Currently only whole arrays are considered
     • Multiple arrays can be combined in a single SRAM
     [Figure: Cluster 1 with register FIFOs, FUs, MEM units, and local memories]

  6. Problem Statement and Approach
     • “Given a set of arrays, their sizes and bitwidths, the corresponding loop nest, the number of clusters and the target II, find an allocation of arrays to SRAMs and an allocation of SRAMs to clusters such that overall cost is minimized”
     • Phase-ordered approach which handles two subproblems separately
     • Memory synthesis
     • Operation partitioning

  7. Combining Arrays
     • Combining arrays into a single SRAM reduces hardware cost (row decoders, sense amps)
     • Issues with combining:
     • Consider two arrays with (Bitwidth, Size) = (B1, S1) and (B2, S2)
     • Suppose A1 and A2 are the numbers of static accesses in the loop
     • Combined SRAM: bitwidth = MAX(B1, B2), size = S1 + S2, number of ports = (A1 + A2) / II
     [Figure: arrays X (B1 wide, S1 deep) and Y (B2 wide, S2 deep) stacked into one SRAM of depth S1 + S2]
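The arithmetic behind combining two arrays can be sketched as follows. The sketch assumes the reading that the combined SRAM takes the wider bitwidth, the summed size, and enough ports to serve all static accesses within II cycles; rounding the port count up to an integer is an assumption, as are the names `Array`, `combine`, `ports`, and `padding_bits` and the example numbers.

```python
from dataclasses import dataclass
from math import ceil

@dataclass(frozen=True)
class Array:
    bitwidth: int   # B: bits per word
    size: int       # S: number of words
    accesses: int   # A: static accesses in the loop body

def combine(a, b):
    """Parameters of one SRAM holding both arrays."""
    return Array(max(a.bitwidth, b.bitwidth), a.size + b.size,
                 a.accesses + b.accesses)

def ports(arr, ii):
    """Each port serves one access per cycle, so II accesses per iteration."""
    return ceil(arr.accesses / ii)

def padding_bits(a, b):
    """Bits wasted because the narrower array is stored at the wider width."""
    c = combine(a, b)
    return c.bitwidth * c.size - (a.bitwidth * a.size + b.bitwidth * b.size)

x = Array(bitwidth=8, size=256, accesses=2)    # hypothetical array X
y = Array(bitwidth=16, size=128, accesses=3)   # hypothetical array Y
xy = combine(x, y)
print(xy, ports(xy, 2), padding_bits(x, y))
```

With these numbers the combined SRAM needs no more ports than the two separate SRAMs in total, but it pays for padding the 8-bit array out to 16 bits, which is one of the issues with combining.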

  8. Combining Arrays
     • Multicluster issues
     • Can cause imbalance in operation distribution
     • All load/store operations for the combined arrays must be assigned to the same cluster
     • Can increase intercluster traffic
     • Address calculations and load-uses would cause extra intercluster moves
     [Figure: address computed as R1 + R2 in one cluster, followed by an intercluster (IC) move to the LD and its USE]

  9. Solution 1
     • Formulate the problem as an integer program
     • A binary decision variable X(i,j,k,l) denotes assignment of array ‘i’ to local memory ‘j’ with ‘k’ ports on cluster ‘l’
     • Constraints ensure that intercluster move bandwidth is not violated
     • Perform operation partitioning and modulo scheduling after memory synthesis
     [Figure: flow from input arrays (A, B, C, D) and target II through Memory Synthesis, Operation Partitioning, and Modulo Schedule; arrays shown allocated across Cluster 1 and Cluster 2]
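To give a feel for the decision being optimized, here is a toy exhaustive search over array-to-SRAM groupings. It is not the integer program from the slides: it drops the cluster dimension and the intercluster bandwidth constraints, and it uses a made-up cost model (a fixed periphery overhead per SRAM plus bit cells scaled by port count). The array parameters, the `overhead` value, and the function names are all hypothetical.

```python
from itertools import product
from math import ceil

# Hypothetical arrays: name -> (bitwidth, words, static accesses per iteration).
ARRAYS = {"A": (8, 256, 1), "B": (8, 128, 1), "C": (32, 64, 2)}

def sram_cost(members, ii, overhead=4000):
    """Toy cost of one SRAM holding `members` (assumed model, not Artisan data)."""
    width = max(ARRAYS[m][0] for m in members)
    words = sum(ARRAYS[m][1] for m in members)
    ports = ceil(sum(ARRAYS[m][2] for m in members) / ii)
    return overhead + width * words * ports

def best_allocation(ii):
    """Try every labeling of arrays with SRAM indices; equal labels share an SRAM."""
    names = sorted(ARRAYS)
    best = None
    for labels in product(range(len(names)), repeat=len(names)):
        groups = {}
        for name, label in zip(names, labels):
            groups.setdefault(label, []).append(name)
        cost = sum(sram_cost(g, ii) for g in groups.values())
        if best is None or cost < best[0]:
            best = (cost, sorted(groups.values()))
    return best

print(best_allocation(2))
```

Under this toy model the two narrow arrays are merged to save one SRAM's fixed overhead, while the wide array stays separate because merging it would pad the narrow words and add a port. An ILP solver such as lp_solve explores the same kind of trade-off without enumerating every labeling.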

  10. Experiments
     • System implemented in the Trimaran framework
     • Memory costs obtained from ARTISAN SRAM generator scripts
     • lp_solve used to solve the integer programs
     • A set of DSP kernels evaluated
     • Loop oriented
     • Many arrays accessed in the loops

  11. Results for Solution 1
     [Figure: four plots, one per benchmark (huffman, channel, LU, lyapunov); x-axis: target initiation interval (II)]

  12. Achieved II in Solution 1
     • Solution 1 eagerly combines arrays
     • Potential increase in intercluster moves due to imbalance in the distribution of LD/ST ops
     • Achieved II is poor due to intercluster (IC) moves in recurrence cycles
     [Figure: best II achieved]

  13. Solution 2
     • Phase-ordered approach
     • Two highly intertwined decisions: allocation of local memories and partitioning of operations
     • Three phases:
     • Pre-Partitioning
     • Memory Synthesis
     • Operation Partitioning

  14. Pre-Partitioning
     • Performance-oriented operation partitioning
     • Memory operations accessing the same array are bound to the same cluster
     • Consequently, arrays are bound to clusters
     [Figure: operation graph with nodes 1 to 14 partitioned across Cluster 1 and Cluster 2; arrays A, B, C, D, E bound to clusters after pre-partitioning]
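The binding rule can be sketched as a post-pass over an existing partition. This is only one simple way to enforce it, assumed here for illustration and not necessarily how the authors' partitioner works: each array goes to the cluster that already holds most of its memory operations, and every memory operation on that array is then rebound to that cluster. The function name `bind_arrays`, the tie-breaking rule, and the example operation ids are hypothetical.

```python
from collections import Counter, defaultdict

def bind_arrays(mem_ops, op_cluster):
    """mem_ops: op id -> array accessed; op_cluster: op id -> cluster from a
    prior performance-oriented partitioning (assumed to be given).
    Returns (array -> cluster, updated op -> cluster)."""
    votes = defaultdict(Counter)
    for op, array in mem_ops.items():
        votes[array][op_cluster[op]] += 1
    # Majority vote per array; ties go to the lowest cluster id (arbitrary choice).
    array_cluster = {
        array: min(count, key=lambda c: (-count[c], c))
        for array, count in votes.items()
    }
    rebound = dict(op_cluster)
    for op, array in mem_ops.items():
        rebound[op] = array_cluster[array]
    return array_cluster, rebound

# Hypothetical example: ops 1-3 access A, ops 4-5 access B, op 6 is not a memory op.
mem_ops = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B"}
op_cluster = {1: 1, 2: 1, 3: 2, 4: 2, 5: 2, 6: 1}
print(bind_arrays(mem_ops, op_cluster))
```

Once every array has a home cluster, memory synthesis only needs to consider combining arrays that live on the same cluster.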

  15. Memory Synthesis
     • ILP used to optimally combine arrays within clusters
     • Pre-partitioning effectively disables combining of arrays that would cause operation imbalance
     [Figure: arrays A, B, C, D, E bound to Cluster 1 and Cluster 2, before and after memory synthesis combines them into SRAMs within each cluster]

  16. Results for Solution 2
     [Figure: four plots, one per benchmark (channel, huffman, LU, lyapunov); x-axis: target initiation interval (II)]

  17. Achieved II for Solution 2
     • Cost of synthesized memory is not substantially different
     • But achieved II is 36% better with pre-partitioning
     [Figure: best II achieved]

  18. Conclusion
     • An approach for synthesizing custom local memories
     • ILP-based optimal solution
     • Works for clustered datapaths
     • Pre-partitioning improves achieved throughput, with minimal impact on cost
     • For more information: http://cccp.eecs.umich.edu

  19. Example
