
MEMOCODE 2007 HW/SW Co-design Contest




  1. MEMOCODE 2007 HW/SW Co-design Contest Documentation of the submission by Eric Simpson, Pengyuan Yu, Sumit Ahuja, Sandeep Shukla, Patrick Schaumont Electrical and Computer Engineering Department, Virginia Tech

  2. Table of Contents • Section 1 Performance Evaluation and Analysis • Section 2 Matrix Multiplication Algorithm Optimization • Section 3 HW/SW System Implementation • Section 4 Co-design Flow and Methodology • Section 5 Conclusion

  3. Section 1 Performance Evaluation and Analysis

  4. Performance Results Section 1 Performance Evaluation and Analysis

  5. Performance Calculation • F_CPU-Speed = 1 (we used a 300 MHz PPC) • F_FPGA-Capacity = 1 (we used the XUP’s XC2VP30) • F_FPGA-Speed = 1 (we used a 100 MHz clock for bus and coprocessor) • Time_Effective = (T_meas,N=1024 + 64 * T_meas,N=256) * F_CPU-Speed * F_FPGA-Capacity * F_FPGA-Speed = (11.882 + 64 * 0.217) * 1 * 1 * 1 = 25.77 seconds Section 1 Performance Evaluation and Analysis
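The effective-time formula on this slide can be checked with a small C helper (a sketch; the function name and argument order are ours, not part of the submission):

```c
/* Effective time per the contest formula: the N=1024 run plus 64 weighted
 * N=256 runs, scaled by the CPU-speed, FPGA-capacity and FPGA-speed factors. */
double effective_time(double t_n1024, double t_n256,
                      double f_cpu, double f_capacity, double f_speed)
{
    return (t_n1024 + 64.0 * t_n256) * f_cpu * f_capacity * f_speed;
}
```

With the measured times from this slide, `effective_time(11.882, 0.217, 1.0, 1.0, 1.0)` evaluates to 25.77 seconds, matching the slide.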

  6. Performance Results Section 1 Performance Evaluation and Analysis

  7. Section 2 Matrix Multiplication Algorithm Optimization

  8. Algorithm Optimization • Algorithm is optimized for the target platform (Virtex-II Pro VP30) • Optimization goals: • Make best use of the slow DDR memory interface • Optimal case: 128-bit/cycle transfers => 4 complex numbers • Linear accesses result in better throughput • Utilize as many of the fast discrete FPGA resources as possible • 136 18x18 hardware multipliers • 136 18-kbit Block RAMs Section 2 Matrix Multiplication Algorithm Optimization

  9. Optimized Algorithm • Legend for the animation on the following slides: • [A] currently in coprocessor • [A] currently used for calculation • [B] currently used for calculation • [C] stored and accumulated in BRAM • [C] being multiplied and accumulated Section 2 Matrix Multiplication Algorithm Optimization

  10. Optimized Algorithm • Bring in 4 complex numbers from “A” Section 2 Matrix Multiplication Algorithm Optimization

  11.–14. (Animation frames: the legend from slide 9 repeats while the A values are loaded; no additional text.)

  15. Optimized Algorithm • Bring in four numbers from “B” and perform the following calculations: • C[0][0] = C[0][0] + A[0][0]*B[0][0] • C[0][1] = C[0][1] + A[0][0]*B[0][1] • C[0][2] = C[0][2] + A[0][0]*B[0][2] • C[0][3] = C[0][3] + A[0][0]*B[0][3] • … • C[7][0] = C[7][0] + A[7][0]*B[0][0] • C[7][1] = C[7][1] + A[7][0]*B[0][1] • C[7][2] = C[7][2] + A[7][0]*B[0][2] • C[7][3] = C[7][3] + A[7][0]*B[0][3] • where “A*B” is a complex multiplication • 32 complex multiplications in parallel = 128 multiplies, 64 additions/subtractions, and 64 accumulates per cycle Section 2 Matrix Multiplication Algorithm Optimization
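In software terms, one cycle of this update is the following loop nest (a behavioral C sketch; the split real/imaginary arrays and the `mac_step` name are our illustration, not the submission's actual datatypes, and what hardware does in parallel runs here sequentially):

```c
#define ROWS 8   /* rows of the A block / C slice */
#define COLS 4   /* B values brought in per step  */

/* One update step: 32 complex MACs, C[i][j] += A[i] * B[j].
 * In hardware all 32 run in one cycle (128 real multiplies);
 * here they are sequential. Caller supplies split re/im arrays. */
void mac_step(float cre[ROWS][COLS], float cim[ROWS][COLS],
              const float are[ROWS], const float aim[ROWS],
              const float bre[COLS], const float bim[COLS])
{
    for (int i = 0; i < ROWS; i++) {
        for (int j = 0; j < COLS; j++) {
            /* complex multiply-accumulate:
             * (cr + j*ci) += (ar + j*ai) * (br + j*bi) */
            cre[i][j] += are[i] * bre[j] - aim[i] * bim[j];
            cim[i][j] += are[i] * bim[j] + aim[i] * bre[j];
        }
    }
}
```

Counting operations confirms the slide: 32 complex multiplies = 128 real multiplies, 64 additions/subtractions inside the multiplies, and 64 accumulates.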

  16.–35. (Animation frames: the legend from slide 9 repeats as the B scan and accumulation step across the 8xN C slice; no additional text.)

  36. Optimized Algorithm • At this point we have completed calculating the first 8xN rows of C in our coprocessor and we write the results back to RAM Section 2 Matrix Multiplication Algorithm Optimization

  37. (Animation frame: the completed C slice is written back; the legend from slide 9 repeats with no additional text.)

  38. Optimized Algorithm • Next, we repeat the previous steps to calculate the next 8xN C slice Section 2 Matrix Multiplication Algorithm Optimization

  39. Optimized Algorithm • Performs 128 MACs per cycle (utilizing 128 out of 136 hard multipliers) • Linear scan through B matrix (optimizing interface to DDR storage) Section 2 Matrix Multiplication Algorithm Optimization
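Putting slides 9–38 together, the whole computation is equivalent to the following behavioral C (a sketch only: the `cplx` struct, the small N, and row-major float arithmetic are illustrative assumptions; the hardware works on fixed-point data streamed from DDR):

```c
#include <string.h>

#define N 16      /* illustrative; the contest sizes are 256 and 1024 */
#define SLICE 8   /* rows of C computed per pass, as in the slides    */

typedef struct { float re, im; } cplx;

/* For each SLICE-row band of A, stream linearly through B (good for the
 * DDR interface) and accumulate the matching band of C, which stays
 * resident (in hardware: in BRAM) until the band is complete. */
void slice_matmul(cplx C[N][N], const cplx A[N][N], const cplx B[N][N])
{
    memset(C, 0, sizeof(cplx) * N * N);
    for (int s = 0; s < N; s += SLICE) {
        for (int k = 0; k < N; k++) {          /* linear scan over B rows */
            for (int j = 0; j < N; j++) {
                for (int i = s; i < s + SLICE; i++) {
                    C[i][j].re += A[i][k].re * B[k][j].re
                                - A[i][k].im * B[k][j].im;
                    C[i][j].im += A[i][k].re * B[k][j].im
                                + A[i][k].im * B[k][j].re;
                }
            }
        }
    }
}
```

Note the access pattern this loop order buys: B is read strictly in row-major (linear) order, while only the SLICE-row band of C is touched per pass, which is what lets the hardware keep C in BRAM.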

  40. Section 3 HW/SW System Implementation

  41. System Architecture • Block diagram (not captured in this transcript); the system is built around the Processor Local Bus Section 3 HW/SW System Implementation

  42. Coprocessor Architecture vs. Optimized Algorithm • Minor deviation from the proposed algorithm • Coprocessor I/O size: B elements are loaded 2 at a time instead of 4 • PLB DMA failed to function, resulting in a much slower {DDR->PPC->Coprocessor FIFO} datapath • 64-bit FIFO width => 2-number sends from the PPC to the coprocessor FIFO • To maintain the SAME calculation capacity: A-block dimension doubled from 8x4 to 16x4, C slice doubled from 8xN to 16xN • Still utilizes 128 hardware multipliers Section 3 HW/SW System Implementation
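The 64-bit FIFO sends can be pictured as follows (a hypothetical packing: the slides do not state the fixed-point format, so the 16-bit real / 16-bit imaginary split here is only an assumption):

```c
#include <stdint.h>

/* Pack two complex numbers (16-bit real, 16-bit imaginary each) into one
 * 64-bit FIFO word, so each PPC-to-coprocessor send carries 2 B elements. */
uint64_t pack2(int16_t re0, int16_t im0, int16_t re1, int16_t im1)
{
    return ((uint64_t)(uint16_t)re0 << 48)
         | ((uint64_t)(uint16_t)im0 << 32)
         | ((uint64_t)(uint16_t)re1 << 16)
         |  (uint64_t)(uint16_t)im1;
}
```

The casts through `uint16_t` keep negative fixed-point values from sign-extending across the other fields.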

  43. Coprocessor Architecture • Coprocessor is scalable! • Reduce the depth of the A-matrix subblock to reduce the number of MAC units needed Section 3 HW/SW System Implementation

  44. Coprocessor Architecture Section 3 HW/SW System Implementation

  45. MAC Unit Architecture Section 3 HW/SW System Implementation

  46. MAC Unit Architecture • Datapath labels from the diagram: input “B” value, “A” values, complex multiply-accumulate, BlockRAM storage for the current “C” value Section 3 HW/SW System Implementation

  47. Section 4 Co-design Flow and Methodology

  48. Design Flow • Reference C Algorithm -> (rectangular-block transformation) -> Optimized C Algorithm -> (manual partitioning) -> Driver C Algorithm + GEZEL Coprocessor -> (cosimulation) -> PPC Binary + VHDL -> (synthesis) -> XUP Board -> Performance Analysis Section 4 Co-design Flow and Methodology

  49. Simulation • Reference C Algorithm and Optimized C Algorithm: run on the workstation • Driver C Algorithm + GEZEL Coprocessor: cycle-based instruction-set cosimulator • PPC Binary + VHDL: FPGA on the XUP Board Section 4 Co-design Flow and Methodology

  50. Simulation • Simulation-based verification on three levels: • workstation (behavioral) • cycle-based ISS (functional model of coprocessor) • FPGA board (skipping VHDL simulation, since synthesis is swift and easy) • Drawback: simulations capture only behavior, not the architecture • Example: hard to estimate post-synthesis timing • Example: hard to reflect memory-bus behavior (DMA, DDR, ...) in a C simulation model Section 4 Co-design Flow and Methodology
