
MEMOCode 2007 Design Contest – MIT Submission


Presentation Transcript


  1. MEMOCode 2007 Design Contest – MIT Submission N. Dave, K. Fleming, M. King, M. Pellauer, M. Vijayaraghavan

  2. Resources • Five “insufficiently busy” grad students • Three weeks • Nine man-weeks used • Bluespec expertise • Easy parameterization/Fast concurrency • The promise of food

  3. Basic Facts • Matrix multiply is embarrassingly parallel • More multipliers and adders should help • Matrices are too large to be stored in FPGA memory • Time was short, so the design needed to be partitioned to make use of all designers • Latency-insensitive methodology

  4. Outline • The Problem • Partitioning the Computation • Architectural Overview • Implementation • Results • Things We Wish We Could Do

  5. The Standard N³ Algorithm
  for(int i = 0; i < N; i++)
    for(int j = 0; j < N; j++)
      for(int k = 0; k < N; k++)
        c[i][j] += a[i][k] * b[k][j];

  6. Kernel swap and blocking is well understood… • split the i and j loops into block/offset pairs, then interchange so each K×K block of c is finished before the next is touched – this reduces memory traffic
  split:
  for(int ib = 0; ib < N; ib += K)
    for(int io = 0; io < K; io++)
      for(int jb = 0; jb < N; jb += K)
        for(int jo = 0; jo < K; jo++)
          for(int k = 0; k < N; k++)
            c[ib+io][jb+jo] += a[ib+io][k] * b[k][jb+jo];
  interchange:
  for(int ib = 0; ib < N; ib += K)
    for(int jb = 0; jb < N; jb += K)
      for(int io = 0; io < K; io++)
        for(int jo = 0; jo < K; jo++)
          for(int k = 0; k < N; k++)
            c[ib+io][jb+jo] += a[ib+io][k] * b[k][jb+jo];

  7. Outline • The Problem • Partitioning the Computation • Architectural Overview • Implementation • Results • Things We Wish We Could Do

  8. Hardware Facts • If we accelerate the computation, DRAM access becomes the bottleneck • CPU has slow access to DRAM • HW can directly access DRAM via PLB (Processor Local Bus)

  9. Hardware Facts • CPU-to-HW memory bandwidth is bounded at 150 MB/sec • With software overhead in data orchestration, probably only 50% of this bandwidth can be used • The memory bus supports 800 MB/sec • A direct interface can therefore provide up to a 5x improvement over software transfer • The special hardware need not be complicated, because the memory access patterns are simple

  10. High-Level Architecture [block diagram: CPU and several functional units connected through interconnection logic to DRAM over the PLB]

  11. Architecture [block diagram: functional units behind a switch, with a controller and feeder; a PLB master connects to DRAM and the CPU over the PLB]

  12. Software Example (C = A x B) • The controller issues Ld A 0, Ld B 0, MAC 0, St C 0 to a functional unit • In reality – the execution of several blocks will be overlapped [block diagram: controller, switch, feeder, PLB master between the functional units, DRAM, and CPU]

  13. Outline • The Problem • Partitioning the Computation • Architectural Overview • Implementation • Results • Things We Wish We Could Do

  14. Functional Unit - Design • Instructions: • Load operand (memory) • Store operand (memory) • Zero (C = 0) • Multiply-Add-Accumulate (C += A*B) • Two FSMs (Read/Write and Compute) • Allows overlapping of Instructions

  15. Functional Unit – Algorithm • Take the algorithm & unroll P loop iterations • Adder tree of P inputs • Critical path grows logarithmically • Can pipeline • Complicated because of parameterization
  for(int i = 0; i < K; i++)
    for(int j = 0; j < K; j++)
      for(int k = 0; k < K; k++)
        c[i][j] += a[i][k] * b[k][j];

  16. Functional Unit – Algorithm • Different algorithm • reorder multiplies • writes c[i][j] multiple times • Unroll by P • same # of adders and multipliers • shorter critical path • Pipelining is easy • 2 stages
  for(int i = 0; i < K; i++)
    for(int j = 0; j < K; j++)
      for(int k = 0; k < K; k++)
        c[j][k] += a[i][k] * b[j][i];

  17. FU Microarchitecture

  18. Memory Bus Master (PLB) • 32-bit bus interface • 16-word burst transfers • Amortize bus setup costs • DRAM may refresh during transfer • Added burst buffer for rapid recovery

  19. Memory Bus Master (PLB) • Half of the critical path is through the bus arbiter • Beyond our control • Substantial retiming needed • Register pushing • State decoupling • Need fine-grained control over scheduling

  20. Outline • The Problem • Partitioning the Computation • Architectural Overview • Implementation • Results • Things We Wish We Could Do

  21. Design Parameters • Architecture: Number of functional units • Functional Unit: degree of parallelism, matrix size • Memory Bus (PLB) Master: matrix memory layout, matrix size • Switch: Number of functional units • Algorithm Generator: Block size

  22. Final Results • 100MHz • 1 functional unit • 64² subblocks – 8 complex multiplies • Lines of code – 10K total • Unit testing framework – 1.5K • C code – 2K • BSV – 5.5K • Multiple FU implementations – 1K • Additional unused hardware – 1K • More than 3 GOps/sec

  23. Performance [chart: 125x speedup]

  24. Things we would have done with more time • We believe we could have obtained 10 billion ops per second • 32-bit PLB -> 64-bit PLB • doubles memory bandwidth • fairly simple improvement • Multiple clock domains • implemented, but had trouble synthesizing in EDK • Play with # of FUs / registers per FU • HW parameterized for this • Explore alternative machine organizations • Algorithmic exploration

  25. Fin
