

Memory Organization and Data Layout for Instruction Set Extensions with Architecturally Visible Storage. Panagiotis Athanasopoulos EPFL Philip Brisk UCR Yusuf Leblebici EPFL Paolo Ienne EPFL École Polytechnique Fédérale de Lausanne (EPFL)



Presentation Transcript


  1. Memory Organization and Data Layout for Instruction Set Extensions with Architecturally Visible Storage Panagiotis Athanasopoulos EPFL Philip Brisk UCR Yusuf Leblebici EPFL Paolo Ienne EPFL École Polytechnique Fédérale de Lausanne (EPFL) University of California, Riverside (UCR) First_name.Second_name@{epfl.ch|ucr.edu}

  2. Motivation • Classic challenge • Increase performance while keeping area and cost constrained • Typical solutions • Customizable and extensible processors • Instruction set extension (ISE) • Custom functional units (CFU) • Architecturally visible storage (AVS)

  3. Typical embedded application extract • 2D DCT 8x8 Matrix Pseudo: dct{ for(int i=0; i<num_of_rows; i++){ . . 1D DCT Slice . . } for(int j=0; j<num_of_columns; j++){ . . 1D DCT Slice . . } }

  4. Typical embedded application extract • 2D DCT 8x8 Matrix for(int i=0; i<num_of_rows; i++){ . . 1D DCT Slice . . } Row accesses: each 1D DCT slice reads the data in row i, across all columns j

  5. Typical embedded application extract • 2D DCT 8x8 Matrix for(int j=0; j<num_of_columns; j++){ . . 1D DCT Slice . . } Column accesses: each 1D DCT slice reads the data in column j, across all rows i
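The row pass followed by the column pass on slides 3 to 5 can be sketched in Python as below. This is a minimal illustration, not the authors' implementation: the function names `dct_1d` and `dct_2d` are hypothetical, and the 1D slice is assumed to be a naive, unnormalized DCT-II.

```python
import math

N = 8  # matrix dimension used on the slides

def dct_1d(x):
    """Naive, unnormalized 1D DCT-II of a sequence (one '1D DCT slice')."""
    n = len(x)
    return [
        sum(x[t] * math.cos(math.pi * (2 * t + 1) * k / (2 * n)) for t in range(n))
        for k in range(n)
    ]

def dct_2d(block):
    """2D DCT of an N x N block: a row pass, then a column pass."""
    # Row pass: each slice reads one full row (row accesses).
    rows = [dct_1d(row) for row in block]
    # Column pass: each slice reads one full column (column accesses).
    cols = [dct_1d([rows[i][j] for i in range(N)]) for j in range(N)]
    # cols[j][i] holds output element (i, j); transpose back.
    return [[cols[j][i] for j in range(N)] for i in range(N)]

out = dct_2d([[1.0] * N for _ in range(N)])
```

The two passes touch the same 64 elements with different access patterns, which is what makes the memory layout matter in the following slides.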

  6. Speeding up the execution • ISE • Extend the basic processor instruction set with a new instruction: DCT_instr • CFU • Assign the execution of the new instruction to a dedicated unit

  7. Reasonable ISE/CFU implementation Pseudo: dct{ DCT_instr(0,1,2,...,7) DCT_instr(8,9,10,...,15) . . DCT_instr(56,57,58,...,63) DCT_instr(0,8,16,...,56) DCT_instr(1,9,17,...,57) . . DCT_instr(7,15,23,...,63) } 16 executions
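The operand lists of the 16 `DCT_instr` executions follow directly from row-major element numbering of the 8x8 matrix. The sketch below generates them; the helper name `dct_instr_operands` is hypothetical and the numbering (element (r, c) has index 8r + c) is inferred from the operands shown on the slide.

```python
N = 8  # 8x8 matrix, elements numbered 0..63 in row-major order

def dct_instr_operands(n=N):
    """Return the operand index lists: n row slices followed by n column slices."""
    rows = [[r * n + c for c in range(n)] for r in range(n)]
    cols = [[r * n + c for r in range(n)] for c in range(n)]
    return rows + cols

ops = dct_instr_operands()
for operands in ops:
    print("DCT_instr(" + ",".join(str(i) for i in operands) + ")")
```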

  8. Speeding up the execution • Memory bandwidth • Usually limited to 2 read/write ports • Caches, scratchpads, architecturally visible storage • Area grows quadratically with the number of ports [ref] • The new instruction takes longer to execute because it must wait until all of its data is available

  9. Speeding up the execution • Ideally • 8 read and 8 write ports • Minimum area • Full bandwidth utilization • Can we achieve this?

  10. Speeding up the execution • Minimum area • What is the smallest memory organization for 64 elements with 8 read and 8 write ports? • 8 individual single-port memory arrays, each with a capacity of 8 words (flip-flop based)

  11. Speeding up the execution • Full bandwidth utilization • Row-major order • Good for row accesses • Bad for column accesses (figure: 1D DCT slices reading a row and a column)

  12. Speeding up the execution • Full bandwidth utilization • Column-major order • Good for column accesses • Bad for row accesses (figure: 1D DCT slices reading a row and a column)
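Why each order is good for one access pattern and bad for the other can be checked by counting bank conflicts. The sketch below assumes the 8 single-port arrays from slide 10 and assumes that an element's bank is its linear index modulo 8; that mapping and the function names are illustrative assumptions, not taken from the slides.

```python
from collections import Counter

N = 8  # 8x8 matrix stored in 8 single-port banks of 8 words each

def row_major_bank(r, c):
    # Assumption: linear index r*N + c, bank = index mod N, which is c.
    return c

def col_major_bank(r, c):
    # Assumption: linear index c*N + r, bank = index mod N, which is r.
    return r

def cycles(bank_of, cells):
    """Cycles needed to read `cells` when each bank serves one access per cycle."""
    return max(Counter(bank_of(r, c) for r, c in cells).values())

row0 = [(0, c) for c in range(N)]  # one row slice
col0 = [(r, 0) for r in range(N)]  # one column slice

print(cycles(row_major_bank, row0), cycles(row_major_bank, col0))
print(cycles(col_major_bank, row0), cycles(col_major_bank, col0))
```

Under these assumptions the favorable pattern completes in one cycle and the unfavorable one serializes on a single bank.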

  13. Speeding up the execution • Full bandwidth utilization • Is there a data layout that allows row and column accesses with the same latency? • Not with the existing organization • What if we relax the requirements and allow the data to come out of the memories misaligned? • Introduce alignment layers • A cheap form of register clustering [RWTH ICCAD’07]
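One well-known layout with this property is a skewed (diagonal) layout, where element (r, c) goes to bank (r + c) mod 8. The slides do not state which layout the authors use, so the sketch below is an assumption for illustration only; `store`, `read_row`, `read_col`, and `rotate_left` are hypothetical names, and the rotation stands in for the alignment layer.

```python
N = 8  # 8 single-port banks of 8 words each

def store(matrix):
    """Place element (r, c) in bank (r + c) % N at address r."""
    banks = [[None] * N for _ in range(N)]
    for r in range(N):
        for c in range(N):
            banks[(r + c) % N][r] = matrix[r][c]
    return banks

def rotate_left(vec, k):
    """Stand-in for the alignment layer: rotate the bank outputs by k positions."""
    k %= len(vec)
    return vec[k:] + vec[:k]

def read_row(banks, r):
    raw = [banks[b][r] for b in range(N)]            # one access per bank
    return rotate_left(raw, r)                       # undo the skew

def read_col(banks, c):
    raw = [banks[b][(b - c) % N] for b in range(N)]  # one access per bank
    return rotate_left(raw, c)                       # undo the skew

matrix = [[r * N + c for c in range(N)] for r in range(N)]
banks = store(matrix)
```

Every row and every column touches each bank exactly once, so both access patterns take a single cycle; the price is that the values leave the banks in rotated order and must be realigned.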

  14. 1D DCT Slice

  15. Memory Area Comparison (chart; axis: area in mm²)

  16. Methodology • Optimizing the memory system • Enumerate memories • Memory organization • Cost estimation • Data layout • Limitedly Improper Constrained Color Assignment (LICCA) • Alignment layer

  17. LICCA Formulation • Input: • Graph G = (V, E, I) • Vertices V = {v_0, ..., v_{n-1}} • Edges E = {e_0, ..., e_{m-1}} • Set of sets of vertices I = {I_0, ..., I_{L-1}} • Where: • E = {(v_x, v_y) | ∃ I_j ∈ I such that v_x ∈ I_j and v_y ∈ I_j}
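The edge set is fully determined by I: two vertices are adjacent exactly when some set contains both. A minimal sketch of that construction follows, with vertices represented as integers and a hypothetical helper name `edges_from_sets`; the input is the three access sets from the LICCA example slide.

```python
from itertools import combinations

def edges_from_sets(access_sets):
    """Return every unordered pair of vertices that share at least one access set."""
    edges = set()
    for group in access_sets:
        for x, y in combinations(sorted(group), 2):
            edges.add((x, y))
    return edges

# Access sets I_0, I_1, I_2 from the LICCA example, with v_i written as i.
I = [{0, 1, 2}, {3, 4, 5}, {0, 2, 5}]
E = edges_from_sets(I)
print(sorted(E))
```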

  18. LICCA Formulation • Solution: • An assignment of colors to vertices, i.e. a function f: V → {0, ..., k-1}, such that: • At most n_i vertices receive color i, for 0 ≤ i ≤ k-1; that is, |{v ∈ V | f(v) = i}| ≤ n_i • For each set I_j ∈ I, at most a_i vertices of I_j receive color i • Any instance of the k-colorability problem can be reduced to an instance of LICCA by setting I = {{v_x, v_y} | (v_x, v_y) ∈ E} and, for 0 ≤ i ≤ k-1, n_i = |V| and a_i = 1
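The two constraints and the reduction from k-colorability can be written down as a small checker. This is a sketch under assumed data representations (a dict for f, lists for n_i and a_i); the names `is_legal_licca` and `licca_from_k_coloring` are hypothetical.

```python
from collections import Counter

def is_legal_licca(coloring, access_sets, capacity, ports):
    """coloring: vertex -> color; capacity[i] is n_i; ports[i] is a_i."""
    k = len(capacity)
    if any(not 0 <= color < k for color in coloring.values()):
        return False
    # Capacity constraint: at most n_i vertices receive color i.
    per_color = Counter(coloring.values())
    if any(per_color[i] > capacity[i] for i in per_color):
        return False
    # Port constraint: within each set I_j, at most a_i vertices receive color i.
    for group in access_sets:
        in_group = Counter(coloring[v] for v in group)
        if any(in_group[i] > ports[i] for i in in_group):
            return False
    return True

def licca_from_k_coloring(vertices, edges, k):
    """Reduction: one access set per edge, n_i = |V|, a_i = 1."""
    access_sets = [{x, y} for x, y in edges]
    return access_sets, [len(vertices)] * k, [1] * k

# A triangle is 3-colorable but not 2-colorable.
triangle_v = [0, 1, 2]
triangle_e = [(0, 1), (1, 2), (0, 2)]
sets3, cap3, ports3 = licca_from_k_coloring(triangle_v, triangle_e, 3)
sets2, cap2, ports2 = licca_from_k_coloring(triangle_v, triangle_e, 2)
```

With a_i = 1 and one set per edge, the port constraint forbids two adjacent vertices from sharing a color, which is exactly the proper-coloring condition.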

  19. LICCA Relation to the problem • An edge e = (v_x, v_y) indicates that v_x and v_y are read in the same cycle • Each set I_j ∈ I is a set of vertices that are read in parallel • k is the number of memories • n_i is the capacity of the i-th memory • a_i is the number of read/write ports of the i-th memory

  20. LICCA Example • V = {v_0, v_1, v_2, v_3, v_4, v_5} • I_0 = {v_0, v_1, v_2} • I_1 = {v_3, v_4, v_5} • I_2 = {v_0, v_2, v_5} • E = {(v_0,v_1), (v_0,v_2), (v_0,v_5), (v_1,v_2), (v_2,v_5), (v_3,v_4), (v_3,v_5), (v_4,v_5)} • Legal k-coloring? • Legal LICCA coloring? (figure: graph G on vertices v_0 to v_5)

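  21. LICCA Example (figure: the vertices assigned to two memories, M_0 with n_0 = 4 and a_0 = 2, and M_1 with n_1 = 2 and a_1 = 1, shown against the access sets I_0 = {v_0, v_1, v_2}, I_1 = {v_3, v_4, v_5}, I_2 = {v_0, v_2, v_5})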
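The two questions posed in the example can be explored by brute force over all assignments of the six vertices to two colors. The sketch below uses the sets, edges, capacities, and port counts given on the slides; the specific assignments it finds are computed here and are not necessarily the one drawn in the original figure. Function names are hypothetical.

```python
from collections import Counter
from itertools import product

I = [{0, 1, 2}, {3, 4, 5}, {0, 2, 5}]           # access sets, v_i written as i
E = [(0, 1), (0, 2), (0, 5), (1, 2), (2, 5), (3, 4), (3, 5), (4, 5)]
CAPACITY = [4, 2]                               # n_0, n_1
PORTS = [2, 1]                                  # a_0, a_1

def is_proper(coloring):
    """Classic coloring: adjacent vertices must differ."""
    return all(coloring[x] != coloring[y] for x, y in E)

def is_licca_legal(coloring):
    """LICCA: respect memory capacities and per-set port limits."""
    total = Counter(coloring)
    if any(total[i] > CAPACITY[i] for i in total):
        return False
    for group in I:
        used = Counter(coloring[v] for v in group)
        if any(used[i] > PORTS[i] for i in used):
            return False
    return True

candidates = list(product(range(2), repeat=6))  # coloring[v] is the memory of v
proper_2 = [c for c in candidates if is_proper(c)]
licca_2 = [c for c in candidates if is_licca_legal(c)]
print(len(proper_2), len(licca_2))
```

No proper 2-coloring exists because v_0, v_1, v_2 form a triangle, yet legal LICCA assignments to two memories do exist, since M_0 may serve two vertices of the same set per cycle.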

  22. Comparison Example • AVS: single/dual-port memory or 8x8 non-clustered RF (block diagram: baseline processor with RF and ISE logic, main memory, memory decoder, baseline processor ports (DMA))

  23. Comparison Example • AVS: 8x8 clustered RF (block diagram: baseline processor with RF and ISE logic, main memory, memory decoder, baseline processor ports (DMA), alignment layers, alignment layer decoders)

  24. Comparison Example • 2D DCT 8x8 Matrix • DCT row/column slice vs. 2-point • 8x8 clustered RF vs. single-port memory • 150 MHz • 2D FFT 8x8 Matrix • 12 butterflies vs. 1 butterfly • 8x8 clustered RF vs. single-port memory • 150 MHz

  25. Comparison Example • 2D DCT 8x8 Matrix 3x 8x

  26. Comparison Example • 2D FFT 8x8 Matrix 2.5x 12x

  27. Conclusion • Methodology to efficiently increase bandwidth to AVS-enhanced ISEs • LICCA • Memory system optimization • Future work • Commutativity • LICCA extension for multiple ISEs and shift registers

  28. References

  29. Thank you! Questions?
