Hardware-Software Codesign Kermin Fleming Computer Science & Artificial Intelligence Lab


Presentation Transcript


  1. Hardware-Software Codesign. Kermin Fleming, Computer Science & Artificial Intelligence Lab, Massachusetts Institute of Technology. Many slides produced by: Arvind, Myron King, Man Cheuk Ng, Angshuman Parashar. http://csg.csail.mit.edu/6.375

  2. Hello, world!

     Hardware:

         module mkHello#(TOP_LEVEL_WIRES wires);
           CHANNEL_IFC channel <- mkChannel(wires); // has a software counterpart
           Reg#(Bit#(8)) count <- mkReg(0);
           Reg#(Bit#(5)) state <- mkReg(0);

           rule init (count == 0);
             count <- channel.recv();
             state <= 0;
           endrule

           rule hello (count != 0);
             case (state)
               0: channel.send('H');
               1: channel.send('e');
               2: channel.send('l');
               3: channel.send('l');
               ...
               16: count <= count - 1;
             endcase
             if (state != 16) state <= state + 1;
             else state <= 0;
           endrule
         endmodule

     Software:

         #include <stdio.h>
         #include <stdlib.h>

         int main (int argc, char* argv[]) {
           int n = atoi(argv[1]);
           for (int i = 0; i < n; i++) {
             printf("Hello, world!\n");
           }
           return 0;
         }

  3. Today’s Lecture
     • Case Study: IMDCT
     • Interfacing with HW
     • Extracting Parallelism
     • Automated Solutions
     • Bluespec Inc.: SCE-MI
     • Intel/MIT: LEAP RRR

  4. Ogg Vorbis Pipeline
     • Ogg Vorbis is an audio compression format roughly comparable to other compression formats, e.g. MP3, AAC, WMA.
     • Input is a stream of compressed bits
     • Parsed into frame residues and floor “predictions”
     • The summed frequency results are converted to time-valued sequences
     • Final frames are windowed to smooth out irregularities
     • IMDCT takes the most computation

     Figure (pipeline stages): Bits, Stream Parser, Residue Decoder and Floor Decoder, IMDCT, Windowing, PCM Output

  5. IMDCT

     Suppose we want to use hardware to accelerate FFT/IFFT computation.

         Array imdct(int N, Array vx){
           // preprocessing loop
           for(i = 0; i < N; i++){
             vin[i] = convertLo(i,N,vx[i]);
             vin[i+N] = convertHi(i,N,vx[i]);
           }
           // do the IFFT
           vifft = ifft(2*N, vin);
           // postprocessing loop
           for(i = 0; i < N; i++){
             int idx = bitReverse(i);
             vout[idx] = convertResult(i,N,vifft[i]);
           }
           return vout;
         }
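The postprocessing loop depends on a `bitReverse` helper that the slide does not define. A minimal sketch follows, assuming the index width is passed explicitly as `bits`; the slide's one-argument form presumably derives the width from the transform size, which is an assumption here.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: reverse the low `bits` bits of `i`.
// The slide's bitReverse(i) takes a single argument; the explicit width
// parameter is an assumption made so that this sketch is self-contained.
std::uint32_t bitReverse(std::uint32_t i, unsigned bits) {
    assert(bits <= 32);
    std::uint32_t r = 0;
    for (unsigned b = 0; b < bits; ++b) {
        r = (r << 1) | (i & 1u);  // append the current lowest bit of i to r
        i >>= 1;
    }
    return r;
}
```

For an 8-point transform (`bits` = 3), index 1 (binary 001) maps to 4 (binary 100), and index 6 (110) maps to 3 (011).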

  6. IMDCT

         Array imdct(int N, Array vx){
           // preprocessing loop
           for(i = 0; i < N; i++){
             vin[i] = convertLo(i,N,vx[i]);
             vin[i+N] = convertHi(i,N,vx[i]);
           }
           // call the hardware (replaces the software call: vifft = ifft(2*N, vin);)
           vifft = call_hw(2*N, vin);
           // postprocessing loop
           for(i = 0; i < N; i++){
             int idx = bitReverse(i);
             vout[idx] = convertResult(i,N,vifft[i]);
           }
           return vout;
         }

     • Implement or find a hardware IFFT
     • How will the HW/SW communication work?
     • How do we explore design alternatives?

  7. HW Accelerator in a system
     • Communication via bus
     • DMA transfer?
     • Accelerators are all multiplexed on bus
     • Possibly introduces conflicts
     • Fair sharing of bus bandwidth

     Figure labels: Software, CPU, Bus (PCI Express), HW IFFT Accelerator 1, HW IFFT Accelerator 2

  8. The HW Interface
     • SW calls turn into a set of memory-mapped calls through Bus
     • Three communication tasks
       • Set size of IFFT
       • Enter data stream
       • Take output out

     Figure labels: Bus (PCI Express), setSize, inputData, outputData

  9. Data Compatibility Issue

     IFFT takes Complex fixed point numbers. How do we represent such numbers in C and in RTL?

     C++:

         template <typename F, typename I>
         struct FixedPt{
           F fract;
           I integer;
         };

         template <typename T>
         struct Complex{
           T rel;
           T img;
         };

     Verilog:

         typedef struct {
           bit [31:0] fract;
           bit [31:0] integer;
         } FixedPt;

         typedef struct {
           FixedPt rel;
           FixedPt img;
         } Complex_FixedPt;

  10. Data Compatibility
      • Keeping the HW and SW representations consistent is tedious and error-prone
      • Issues of endianness (bit and byte)
      • Layout changes based on C compiler (gcc vs. icc vs. msvc++)
      • Some SW representations do not have a natural HW analog
        • What is a pointer? Do we disallow passing trees and lists directly?
      • Ideally translation should be automatically generated

      Let us assume that the data compatibility issues have been solved and focus on control issues.
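One common way to sidestep compiler-dependent struct layout and host byte order is to marshal each field explicitly into a byte stream with a fixed order. The sketch below assumes 32-bit fields, a little-endian wire order, and the field order `rel.fract`, `rel.integer`, `img.fract`, `img.integer`. None of these choices come from the slides; they are illustrative assumptions, and `ComplexFixedPt`, `pack`, and `unpack` are hypothetical names.

```cpp
#include <array>
#include <cstdint>

// Assumed concrete types: 32-bit fraction and integer parts.
struct FixedPt { std::uint32_t fract; std::uint32_t integer; };
struct ComplexFixedPt { FixedPt rel; FixedPt img; };

// Write a 32-bit value least-significant byte first, independent of host byte order.
inline void putLE32(std::uint8_t* out, std::uint32_t v) {
    for (int b = 0; b < 4; ++b) out[b] = static_cast<std::uint8_t>(v >> (8 * b));
}

// Read a 32-bit value that was written least-significant byte first.
inline std::uint32_t getLE32(const std::uint8_t* in) {
    std::uint32_t v = 0;
    for (int b = 0; b < 4; ++b) v |= static_cast<std::uint32_t>(in[b]) << (8 * b);
    return v;
}

// Assumed wire order: rel.fract, rel.integer, img.fract, img.integer (16 bytes).
std::array<std::uint8_t, 16> pack(const ComplexFixedPt& c) {
    std::array<std::uint8_t, 16> buf{};
    putLE32(&buf[0], c.rel.fract);
    putLE32(&buf[4], c.rel.integer);
    putLE32(&buf[8], c.img.fract);
    putLE32(&buf[12], c.img.integer);
    return buf;
}

ComplexFixedPt unpack(const std::array<std::uint8_t, 16>& buf) {
    return ComplexFixedPt{{getLE32(&buf[0]), getLE32(&buf[4])},
                          {getLE32(&buf[8]), getLE32(&buf[12])}};
}
```

Because every byte position is written by hand, the result does not depend on struct padding or on which compiler built the software side. The cost is exactly the tedium the slide describes, which is why generating such code automatically is attractive.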

  11. First Attempt at Acceleration

          Array imdct(int N, Array<Complex<FixedPt<int,int>>> vx){
            // preprocessing loop
            for(i = 0; i < N; i++){
              vin[i] = convertLo(i,N,vx[i]);
              vin[i+N] = convertHi(i,N,vx[i]);
            }
            pcie_ifc.setSize(2*N);        // sets size
            for(i = 0; i < 2*N; i++)
              pcie_ifc.put(vin[i]);       // sends 1 element
            for(i = 0; i < 2*N; i++)
              vifft[i] = pcie_ifc.get();  // gets 1 element; software blocks until response exists
            // postprocessing loop
            for(i = 0; i < N; i++){
              int idx = bitReverse(i);
              vout[idx] = convertResult(i,N,vifft[i]);
            }
            return vout;
          }

  12. Exposing more details

          //mem-mapped hw register
          volatile int* hw_flag = …
          //mem-mapped hw frame buffer
          volatile int* fbuffer = …

          Array imdct(int N, Array<Complex<FixedPt<int,int>>> vx){
            …
            assert(*hw_flag == IDLE);
            for(cnt = 0; cnt < n; cnt++)
              *(fbuffer + cnt) = frame[cnt];
            *hw_flag = GO;
            while(*hw_flag != IDLE) {;}
            for(cnt = 0; cnt < n*2; cnt++)
              frame[cnt] = *(fbuffer + cnt);
            …
          }

      What happens if SW has a cache?

  13. Issues
      • Are the internal hardware conditions exposed correctly by the hw_flag control register?
      • Blocking SW is problematic:
        • Prevents the processor from doing anything while the accelerator is in use
        • Hard to pipeline the accelerator
        • Does not handle variation in timing well

  14. Driving a Pipelined HW

          …
          int pid = fork();
          if(pid){ // producer process
            while(…) {
              …
              for(i = 0; i < 2*N; i++)
                pcie.put(vin[i]);
            }
          } else { // consumer process
            while(…){
              for(i = 0; i < 2*N; i++)
                v[i] = pcie.get();
              …
            }
          }

      • Multiple processes exploit pipeline parallelism in the IFFT accelerator.
      • How does the BSV exert back pressure on the producer thread?
      • How does the consumer thread exert back pressure on the BSV module?
      • What if our frames are really large, could the HW begin working before the entire frame is transmitted?
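The back-pressure questions can be explored purely in software by letting a bounded FIFO stand in for the accelerator channel: `put` blocks when the FIFO is full and `get` blocks when it is empty. This is a sketch under stated assumptions, not the course's PCIe interface. It uses `std::thread` rather than `fork()`, and `BoundedFifo`, `runPipeline`, and the capacity parameter are hypothetical.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for the HW channel: a FIFO with fixed capacity.
template <typename T>
class BoundedFifo {
public:
    explicit BoundedFifo(std::size_t capacity) : cap_(capacity) { assert(capacity > 0); }

    // Blocks while the FIFO is full: this is the back pressure on the producer.
    void put(T v) {
        std::unique_lock<std::mutex> lk(m_);
        notFull_.wait(lk, [this] { return q_.size() < cap_; });
        q_.push_back(std::move(v));
        notEmpty_.notify_one();
    }

    // Blocks while the FIFO is empty: the consumer waits for data.
    T get() {
        std::unique_lock<std::mutex> lk(m_);
        notEmpty_.wait(lk, [this] { return !q_.empty(); });
        T v = std::move(q_.front());
        q_.pop_front();
        notFull_.notify_one();
        return v;
    }

private:
    std::size_t cap_;
    std::deque<T> q_;
    std::mutex m_;
    std::condition_variable notFull_, notEmpty_;
};

// Producer and consumer run concurrently; returns what the consumer received.
std::vector<int> runPipeline(const std::vector<int>& input, std::size_t capacity) {
    BoundedFifo<int> fifo(capacity);
    std::vector<int> output;
    std::thread producer([&] { for (int v : input) fifo.put(v); });
    std::thread consumer([&] {
        for (std::size_t i = 0; i < input.size(); ++i) output.push_back(fifo.get());
    });
    producer.join();
    consumer.join();
    return output;
}
```

With a capacity smaller than the frame size, the producer is forced to stall until the consumer drains elements, so the two sides overlap instead of running one after the other. A slow consumer throttles the producer through the same mechanism.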

  15. Data Parallelism 1

          …
          SyncQueue<Complex<…>> workQ;
          int pid = fork();
          // both threads do same work
          while(…) {
            Complex<FixedPt>* vin = workQ.pop();
            …
            for(i = 0; i < 2*N; i++)
              pcie.put(vin[i]);
            for(i = 0; i < 2*N; i++)
              v[i] = pcie.get();
            …
          }

      • How do we isolate each thread’s use of the HW accelerator?
      • Do two synchronization points (workQ and the HW accelerator) cause our design to deadlock?
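One possible answer to the isolation question is to treat "send a whole frame, then receive a whole frame" as a single transaction guarded by a lock. The sketch below assumes a hypothetical `MockAccel` that doubles each element and returns results in FIFO order. It is not the real accelerator interface, and `transact` and `runWorkers` are illustrative names.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical accelerator model: returns each input doubled, in FIFO order.
struct MockAccel {
    std::deque<int> fifo;
    void put(int v) { fifo.push_back(2 * v); }
    int get() {
        assert(!fifo.empty());
        int v = fifo.front();
        fifo.pop_front();
        return v;
    }
};

// Holding the lock across all puts and gets keeps frames from interleaving.
std::vector<int> transact(MockAccel& hw, std::mutex& hwLock, const std::vector<int>& frame) {
    std::lock_guard<std::mutex> guard(hwLock);
    for (int v : frame) hw.put(v);
    std::vector<int> out;
    out.reserve(frame.size());
    for (std::size_t i = 0; i < frame.size(); ++i) out.push_back(hw.get());
    return out;
}

// Each worker thread submits one frame; results are stored per worker.
std::vector<std::vector<int>> runWorkers(const std::vector<std::vector<int>>& frames) {
    MockAccel hw;
    std::mutex hwLock;
    std::vector<std::vector<int>> results(frames.size());
    std::vector<std::thread> workers;
    for (std::size_t w = 0; w < frames.size(); ++w)
        workers.emplace_back([&, w] { results[w] = transact(hw, hwLock, frames[w]); });
    for (auto& t : workers) t.join();
    return results;
}
```

The trade-off is that the lock serializes whole frames, so only one frame is in flight at a time and any pipelining inside the accelerator goes unused. In this sketch the lock is never held while a thread waits on a work queue, which avoids one obvious lock-ordering deadlock between the two synchronization points.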

  16. Data Parallelism 2

          PCIE get_hw(int pid){
            if(pid==0) return pcieA;
            else return pcieB;
          }

          …
          SyncQueue<Complex<…>> workQ();
          int pid = fork();
          // both threads do same work
          while(…) {
            Complex<FixedPt>* vin = workQ.pop();
            …
            for(i = 0; i < 2*N; i++)
              get_hw(pid).put(vin[i]);
            for(i = 0; i < 2*N; i++)
              v[i] = get_hw(pid).get();
            …
          }

      • By giving each thread its own HW accelerator, we have further increased data parallelism
      • If the HW is not the bottleneck this could be a waste of resources.
      • Do we multiplex the use of the physical bus between the two threads?

  17. Multithreading without threads or processes

          int icnt = 0, ocnt = 0;
          Complex iframe[sz];
          Complex oframe[sz];
          …
          // IMDCT loop
          while(…){
            …
            // producer “thread”
            for(i = 0; i < 2 && icnt < n; i++)
              if(pcie.can_put())
                pcie.put(iframe[icnt++]);
            // consumer “thread”
            for(i = 0; i < 2 && ocnt < n*2; i++)
              if(pcie.can_get())
                oframe[ocnt++] = pcie.get();
            …
          }

      • Embedded execution environments often have little or no OS support, so multithreading must be emulated in user code
      • Getting the arbitration right is a complex task
      • All existing issues are compounded with the complexity of the duplicated states for each “thread”
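The polling pattern above can be exercised against a software model of the accelerator. The sketch below assumes a hypothetical `MockAccel` with small input and output FIFOs that negates each element and produces one output per input; the slide's device reads n elements and returns n*2, so the one-to-one ratio is a simplification. `step()` stands in for the hardware making progress between polls.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical accelerator model with bounded input and output FIFOs.
// Each step() moves at most one element from input to output, negated.
class MockAccel {
public:
    explicit MockAccel(std::size_t depth) : depth_(depth) { assert(depth > 0); }
    bool can_put() const { return in_.size() < depth_; }
    bool can_get() const { return !out_.empty(); }
    void put(int v) { assert(can_put()); in_.push_back(v); }
    int get() { assert(can_get()); int v = out_.front(); out_.pop_front(); return v; }
    void step() {
        if (!in_.empty() && out_.size() < depth_) {
            out_.push_back(-in_.front());
            in_.pop_front();
        }
    }
private:
    std::size_t depth_;
    std::deque<int> in_, out_;
};

// One loop plays both roles: it never blocks, it only polls.
std::vector<int> runFrame(MockAccel& hw, const std::vector<int>& iframe) {
    std::vector<int> oframe;
    std::size_t icnt = 0;
    while (oframe.size() < iframe.size()) {
        // producer "thread": send up to 2 elements if there is room
        for (int i = 0; i < 2 && icnt < iframe.size(); ++i)
            if (hw.can_put()) hw.put(iframe[icnt++]);
        // consumer "thread": drain up to 2 elements if any are ready
        for (int i = 0; i < 2; ++i)
            if (hw.can_get()) oframe.push_back(hw.get());
        hw.step();  // stand-in for the hardware making progress
    }
    return oframe;
}
```

The per-role counters (`icnt` and the size of `oframe`) are the duplicated "thread" state the slide warns about. The fixed budget of two attempts per role per iteration is the arbitration policy, and changing it shifts how bus time is split between sending and receiving.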

  18. The message
      • Writing SW which can safely exploit HW parallelism is difficult…
      • Particularly difficult if shared resources (e.g. bus) are involved

      We need automated solutions that do this job well.

  19. Today’s Lecture
      • Case Study: IMDCT
      • Interfacing with HW
      • Extracting Parallelism
      • Automated Solutions
      • Bluespec Inc.: SCE-MI
      • Intel/MIT: LEAP RRR

  20. Bluespec Co-design: SCE-MI
      • Circuit verification is difficult
      • Billions of cycles of gate-level simulation
      • How do we retain cycle accuracy?
      • Use SCE-MI

      *Target: WiFi Transceiver

  21. SCE-MI
      • Use gated clocks to preserve cycle-accuracy
      • Circuit internals run at “Model Clock”
      • “Model Clock” ticks only when inputs and outputs to the circuit stabilize
      • Another Co-design problem

  22. Bluespec SCE-MI
      • Used already in Lab
      • With a controlled clock on the FPGA
      • Bluespec has a rich SCE-MI library
      • Get/Put transactors provided
      • User provides C++ and HW transactors for exotic interfaces

  23. Intel/MIT: LEAP RRR
      • Asynchronous Remote Request-Response stack for FPGA
      • Uses common Client/Server paradigm
      • Similar in many respects to Bluespec SCE-MI
      • Constrained user interface
      • Open, many platforms supported

  24. Client/Server interfaces
      • Get/Put pairs are very common, and duals of each other, so the library defines Client/Server interface types for this purpose

          interface Client #(req_t, resp_t);
            interface Get#(req_t) request;
            interface Put#(resp_t) response;
          endinterface

          interface Server #(req_t, resp_t);
            interface Put#(req_t) request;
            interface Get#(resp_t) response;
          endinterface

      Figure labels: client and server, each with get and put ports (enable, ready, data); requests of type req_t flow from client to server and responses of type resp_t flow back
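The duality between Client and Server can also be mirrored on the software side. The following C++ sketch is an illustration only, not a Bluespec or LEAP API: `Get`, `Put`, `Fifo`, `Client`, `Server`, and `connect` are hypothetical names chosen to match the BSV interfaces above, and `ready()` is an added assumption that models the ready signal.

```cpp
#include <cassert>
#include <deque>

// C++ analogs of the BSV Get and Put interfaces (illustrative, not a library API).
template <typename T>
struct Get {
    virtual bool ready() const = 0;  // assumed: models the "ready" signal
    virtual T get() = 0;
    virtual ~Get() = default;
};

template <typename T>
struct Put {
    virtual void put(T v) = 0;
    virtual ~Put() = default;
};

// A FIFO offers both ends: values are put at the back and taken from the front.
template <typename T>
struct Fifo : Get<T>, Put<T> {
    std::deque<T> q;
    void put(T v) override { q.push_back(v); }
    bool ready() const override { return !q.empty(); }
    T get() override { assert(ready()); T v = q.front(); q.pop_front(); return v; }
};

// Client: requests come out, responses go in. Server: the exact dual.
template <typename Req, typename Resp>
struct Client { Get<Req>& request; Put<Resp>& response; };

template <typename Req, typename Resp>
struct Server { Put<Req>& request; Get<Resp>& response; };

// Forward everything currently available in each direction.
template <typename Req, typename Resp>
void connect(Client<Req, Resp> c, Server<Req, Resp> s) {
    while (c.request.ready()) s.request.put(c.request.get());
    while (s.response.ready()) c.response.put(s.response.get());
}
```

Because the client's `Get` lines up with the server's `Put` and vice versa, a single `connect` routine is enough to join any matching pair, which is the practical benefit of defining the two interfaces as duals.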

  25. RRR Specification Language

          // ----------------------------------------
          // create a new service called ISA_EMULATOR
          // ----------------------------------------
          service ISA_EMULATOR
          {
            // --------------------------------
            // declare services provided by CPU
            // --------------------------------
            server CPU <- FPGA;
            {
              method UpdateRegister(in REG_INDEX, in REG_VALUE);
              method Emulate(in INST_INFO, out INST_ADDR);
            };

            // ---------------------------------
            // declare services provided by FPGA
            // ---------------------------------
            server FPGA <- CPU;
            {
              method SyncRegister(in REG_INDEX, in REG_VALUE);
            };
          };

  26. LEAP Abstraction Layers: RRR

      Figure labels: Channel IO (on both the FPGA and CPU sides), FPGA Platform, Physical Devices, Kernel Driver, FPGA, CPU

  27. LEAP Abstraction Layers: RRR

      Figure labels: RRR specification files, Client Stub, Server Stub, RRR Client/Server Manager (on both sides), Channel IO (on both sides), FPGA Platform, Physical Devices, Kernel Driver, FPGA, CPU

  28. LEAP Abstraction Layers: RRR

      User Code (client side):

          ClientStubs.ISA_EMULATOR iemu;
          ...
          iemu.UpdateRegister.Request(REG_R27, regFile[REG_R27]);
          ...
          iemu.Emulate.Request(inst);
          ...
          tgtPC <- iemu.Emulate.Response();

      User Code (server side):

          ISA_EMULATOR::UpdateRegister(REG_INDEX i, REG_VALUE v)
          {
            regFile[i] = v;
          }

          ISA_EMULATOR::Emulate(INST_INFO inst)
          {
            // emulate the instruction
            return target_PC;
          }

      Figure labels: User Code (on both sides), Client Stub, Server Stub, RRR Client/Server Manager (on both sides), Channel IO (on both sides), FPGA Platform, Physical Devices, Kernel Driver, FPGA, CPU

  29. LEAP Abstraction Layers: RRR

      Figure labels: User Application, Stub (six instances), RRR Client/Server Manager (on both sides), Channel IO (on both sides), FPGA Platform, Physical Devices, Kernel Driver, FPGA, CPU

  30. Conclusion
      • Writing SW which can safely exploit HW parallelism is difficult…
      • Several automated tools are available
      • Development ongoing
