
RISC processor implementation using Bluespec part 2 - final presentation

30/3/2014. Performed by: Yahel Ben-Avraham and Yaron Rimmer. Instructor: Mony Orbach. Bi-semesterial, 2012-2014. Project goal: implementing and analyzing a RISC processor using Bluespec Verilog.


Presentation Transcript


  1. 30/3/2014 • Performed by: Yahel Ben-Avraham and Yaron Rimmer • Instructor: Mony Orbach • Bi-semesterial, 2012-2014 • RISC processor implementation using Bluespec part 2 - final presentation

  2. Project goals • Goal: Implementing and analyzing RISC Processor using Bluespec Verilog • Part A: • Studying the working environment, BSV language and the basic processor implementation. • Implementing a simple RISC processor. • Run a simple test bench on the FPGA system.

  3. Project goals • Goal: Implementing and analyzing RISC Processor using Bluespec Verilog • Part B: • Ramp up the design: • Wider instruction set • Branch prediction (and flushing) • Hazard detection unit and extended Data forwarding • Performance counters • Run the design on the FPGA system

  4. Pipeline Datapath • Stages: FETCH → DEC → EXE → MEM1 → MEM2 → WB • Other blocks in the diagram: Instruction Memory, Memory, Branch Predictor, Forwarding, Register File

  5. Fetch • Tag the instruction’s metadata (PC, cycle) • Fetch the requested instruction from the instruction memory • Update next PC • Get next PC’s branch prediction and branch address • Check for Jump command

  6. Decode • Fully parse the received instruction • Pre-fetch data from registers potentially in use

  7. Execute • According to the instruction’s opcode: • ALU instruction: compute the result • Memory instruction: calculate memory address to read / write to • Branch instruction: check if branch is taken and update branch resolution • Data forwarding

  8. Memory 1 • Send a read / write request to the BRAM • Write: data is immediately stored • Read: wait for the response in the next cycle • Otherwise, pass the incoming data

  9. Memory 2 (mem / skipmem) • Implemented in two rules: • For memory read: get BRAM response • Otherwise, pass the incoming struct
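
The split between the two memory stages follows from the BRAM's one-cycle read latency. A minimal C++ sketch of that behaviour is given here as an illustration only: `BramModel`, `ReadResponse` and the method names are hypothetical and are not taken from the project's BSV source.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Result of polling the memory in the second stage.
struct ReadResponse {
    bool valid;           // false when no read was outstanding
    std::uint32_t data;
};

// Hypothetical software model of a BRAM with one-cycle read latency.
class BramModel {
public:
    explicit BramModel(std::size_t words)
        : mem_(words, 0u), pending_(false), pendingData_(0) {}

    // First memory stage, write: the data is stored immediately.
    void write(std::size_t addr, std::uint32_t data) {
        assert(addr < mem_.size());
        mem_[addr] = data;
    }

    // First memory stage, read: only the request is issued here.
    void requestRead(std::size_t addr) {
        assert(addr < mem_.size());
        pending_ = true;
        pendingData_ = mem_[addr];
    }

    // Second memory stage: pick up the response one cycle later.
    // With no outstanding read this corresponds to the pass-through path.
    ReadResponse response() {
        ReadResponse r = {pending_, pendingData_};
        pending_ = false;
        return r;
    }

private:
    std::vector<std::uint32_t> mem_;
    bool pending_;
    std::uint32_t pendingData_;
};
```

A caller would invoke `write` or `requestRead` in one step and `response` in the next, which mirrors the Memory 1 / Memory 2 division.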

  10. Writeback • Save needed data to the register file • Register 0 – read only • Communication with the wrapper • Data and statistics
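
The read-only register 0 can be sketched in C++ as follows. This is an illustration under assumptions: the 32-entry size, the zero value held in register 0, and the `RegisterFile` name are not stated in the slides.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical register file; the register count (32) is an assumption.
class RegisterFile {
public:
    RegisterFile() { regs_.fill(0); }

    std::uint32_t read(std::size_t idx) const {
        assert(idx < regs_.size());
        return regs_[idx];
    }

    // Writeback: writes to register 0 are dropped, so it keeps its
    // initial value (assumed here to be zero).
    void write(std::size_t idx, std::uint32_t value) {
        assert(idx < regs_.size());
        if (idx != 0) {
            regs_[idx] = value;
        }
    }

private:
    std::array<std::uint32_t, 32> regs_;
};
```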

  11. Branch Prediction • 2-bit saturating local counter (initialized to WNT, weakly not taken) • Prediction is acquired in the Fetch stage • Stored and passed along the pipeline • Branch resolution is determined in the Exec stage • BP is updated accordingly • Wrong prediction? • Correct the PC • Flush Dec & Exe
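
A 2-bit saturating counter predictor can be sketched in C++ as below. The table size, the indexing by `pc % size`, and all names are assumptions made for illustration; only the 2-bit counter and the WNT initial state come from the slide.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Counter states: 0 = strongly not taken, 1 = weakly not taken (WNT),
// 2 = weakly taken, 3 = strongly taken.
class BranchPredictor {
public:
    // `entries` (table size) is an assumption; the slides do not give it.
    explicit BranchPredictor(std::size_t entries)
        : table_(entries, std::uint8_t(1)) {   // every counter starts at WNT
        assert(entries > 0);
    }

    // Fetch stage: predict taken when the counter is in the upper half.
    bool predict(std::uint32_t pc) const { return table_[index(pc)] >= 2; }

    // Exec stage: move the counter toward the resolved outcome, saturating
    // at 0 and 3.
    void update(std::uint32_t pc, bool taken) {
        std::uint8_t& c = table_[index(pc)];
        if (taken) {
            if (c < 3) ++c;
        } else {
            if (c > 0) --c;
        }
    }

private:
    // Hypothetical indexing scheme: low bits of the PC.
    std::size_t index(std::uint32_t pc) const { return pc % table_.size(); }

    std::vector<std::uint8_t> table_;
};
```

Because of the saturation, a single mispredicted iteration of a strongly taken loop branch moves the counter from 3 to 2 and the prediction stays "taken".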

  12. Forwarding • 4 global forwarding registers • Each contains (when valid) an address, a value and a cycle • Writing: at the end of the Exec stage • Reading: at the beginning of the Exec stage • Invalidating: by aging after the Exec stage
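
The forwarding registers can be modelled in C++ roughly as follows. Only the slot count (4) and the entry fields (address, value, cycle) come from the slide; the round-robin replacement, the age limit `kMaxAge`, and every name below are assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct FwdEntry {
    bool valid;
    std::uint8_t reg;       // destination register address
    std::uint32_t value;    // forwarded result
    std::uint64_t cycle;    // cycle in which the result was produced
};

class ForwardingUnit {
public:
    static const std::size_t kSlots = 4;     // from the slide
    static const std::uint64_t kMaxAge = 3;  // assumption, not from the slides

    ForwardingUnit() : next_(0) {
        for (std::size_t i = 0; i < kSlots; ++i) slots_[i].valid = false;
    }

    // End of Exec: record a result (round-robin replacement is an assumption).
    void write(std::uint8_t reg, std::uint32_t value, std::uint64_t cycle) {
        slots_[next_] = FwdEntry{true, reg, value, cycle};
        next_ = (next_ + 1) % kSlots;
    }

    // Start of Exec: return the newest valid value for `reg`, if any.
    bool lookup(std::uint8_t reg, std::uint32_t& out) const {
        bool found = false;
        std::uint64_t best = 0;
        for (std::size_t i = 0; i < kSlots; ++i) {
            const FwdEntry& e = slots_[i];
            if (e.valid && e.reg == reg && (!found || e.cycle >= best)) {
                found = true;
                best = e.cycle;
                out = e.value;
            }
        }
        return found;
    }

    // After Exec: invalidate entries that have reached the age limit.
    void age(std::uint64_t now) {
        for (std::size_t i = 0; i < kSlots; ++i) {
            if (slots_[i].valid && now - slots_[i].cycle >= kMaxAge) {
                slots_[i].valid = false;
            }
        }
    }

private:
    FwdEntry slots_[kSlots];
    std::size_t next_;
};
```

Storing the cycle with each entry lets the reader pick the most recent producer when the same register was written twice in quick succession.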

  13. Forwarding – cont. • Special case: register read after memory load • Stalling registers hold the address of the register the load will write to • If needed, stall the Exec stage by keeping the current command in the dec/exec FIFO
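
The load-use check can be expressed as a small predicate. The C++ sketch below is illustrative: the `Decoded` and `PendingLoad` structs and their fields are hypothetical stand-ins for whatever the BSV design actually stores.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical view of the instruction at the head of the dec/exec FIFO.
struct Decoded {
    std::uint8_t src1;
    std::uint8_t src2;
    bool usesSrc1;
    bool usesSrc2;
};

// Hypothetical "stalling register": destination of a load still in flight.
struct PendingLoad {
    bool valid;
    std::uint8_t dest;
};

// Exec must stall when the instruction reads a register whose value is
// still on its way back from memory.
bool mustStall(const Decoded& d, const PendingLoad& p) {
    if (!p.valid) {
        return false;
    }
    return (d.usesSrc1 && d.src1 == p.dest) ||
           (d.usesSrc2 && d.src2 == p.dest);
}
```

When `mustStall` returns true, the instruction would simply not be dequeued from the dec/exec FIFO in that cycle.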

  14. The working environment • Xilinx FPGA development board (Virtex 5 family) • Programming the FPGA using JTAG • Communication with the DUT using PCIE • The platform enables: • Synthesis of the design to the FPGA • Reading and writing to memories • Performance counters

  15. The platform

  16. SCEMI’s working methods • SCEMI: “Standard Co-Emulation Modeling Interface” • 2 working methods: • TCP/IP simulation • FPGA emulation • Establishes communication between a port on the SW end and a FIFO on the HW end • Parcels (data structs) are delivered in both directions

  17. System layers – PCIE emulation • Diagram labels: PC; Linux O.S.; Input files; C++ Executable: TB; PCIE; FPGA; SCEMI – DUT to PCIE; DUT: Wrapper; Datapath

  18. System layers – TCP/IP simulation • Diagram labels: PC; Linux O.S.; Input files; C++ Executable: TB; TCP/IP; DUT: Bsim_dut; DUT: Wrapper; Datapath; PCIE; FPGA; SCEMI – DUT to PCIE

  19. Our SCEMI platform – SW side • A compiled C++ program (TB) is loaded with input files • Sends and receives messages from the DUT using incoming / outgoing ports • We chose to use a “Stop & Wait” protocol • Performs the following actions: • Loads the DUT’s instruction memory • Loads the DUT’s register file • Signals the DUT to run • When done, collects the relevant information: • Register file • Run statistics
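
A "Stop & Wait" exchange means the test bench sends one message and waits for its acknowledgement before sending the next. The C++ sketch below illustrates the idea against an in-memory stand-in for the hardware. The message kinds, the `Msg` layout, `FakeDut` and `sendAll` are all hypothetical; the real SCEMI ports and parcel format are not shown in the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <queue>
#include <vector>

// Hypothetical message kinds and layout.
enum class Kind { LoadImem, LoadReg, Run, Ack };

struct Msg {
    Kind kind;
    std::uint32_t addr;
    std::uint32_t data;
};

// Stand-in for the hardware side: answers each request with one Ack.
class FakeDut {
public:
    void send(const Msg& m) { inbox_.push(m); }

    bool tryReceive(Msg& out) {
        if (inbox_.empty()) return false;
        Msg req = inbox_.front();
        inbox_.pop();
        ++handled_;
        out = Msg{Kind::Ack, req.addr, 0};
        return true;
    }

    std::size_t handled() const { return handled_; }

private:
    std::queue<Msg> inbox_;
    std::size_t handled_ = 0;
};

// Stop & Wait: message n+1 is never sent before message n is acknowledged.
// Returns the number of acknowledged messages.
std::size_t sendAll(FakeDut& dut, const std::vector<Msg>& msgs) {
    std::size_t acked = 0;
    for (std::size_t i = 0; i < msgs.size(); ++i) {
        dut.send(msgs[i]);
        Msg reply;
        while (!dut.tryReceive(reply)) {
            // Poll until the DUT answers.
        }
        assert(reply.kind == Kind::Ack);
        ++acked;
    }
    return acked;
}
```

The protocol trades throughput for simplicity: at most one message is in flight, so neither side needs sequence numbers or deep buffering.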

  20. Our SCEMI platform – HW side • Our top-level module (the Wrapper, which is our DUT) • Receives messages from and sends messages to the TB using FIFOs • Contains the Datapath itself as a black box • Performs commands from the TB • Loads the instruction memory and the register file • Initializes all the registers and starts / stops the run of the datapath • Receives data from the datapath (from the WB stage) and relays it back to the TB

  21. Putting the design to the test • As a concluding test, we wrote a bubble sort in assembly: it loads 10 unsorted numbers into memory, sorts them, and displays the result in the register file. • The code uses almost all of the instruction set and practically every feature in the design. • The algorithm:

      for (i = 0; i < length - 1; ++i) {
          for (j = 0; j < length - i - 1; ++j) {
              if (array[j] > array[j + 1]) {
                  int tmp = array[j];
                  array[j] = array[j + 1];
                  array[j + 1] = tmp;
              }
          }
      }

  22. Critical example – Bubble sort • The program works successfully in the BSV simulation and the TCP/IP simulation. • Results are incorrect in the PCIE emulation.

  23. Critical example – Bubble sort

  24. Isolating the problem • Trying to isolate the problem: store 4 numbers, then read them into the register file • 4 ADDI, 4 STORE, 4 LOAD • Encountered unexplained yet repeatable results • This is only one of many debugging attempts

  25. Isolating the problem • Expected result: consistent with simulation • FPGA result: • Padding with 1 NOP: between ADDI and ST • Padding with 2 or more NOPs:

  26. Further investigation • Dismissing possible issues • Design fault – works flawlessly in simulations • Clearing the design between runs • Investigating Xilinx compilation files • Place and route – margins are positive • No noteworthy warnings • Consulting with Danny Hofshi, Mony Orbach, Yuval H. Nacson • We were unable to solve the problem.

  27. Problem characterization • The FPGA differs in behavior from both the BSV and the TCP/IP simulation • Related to the Store command – storing into the BRAM memory • Occurs when performing multiple stores in a row • Xilinx reports show no timing warnings

  28. Project usage and integration • The project is designed modularly, so that it can be easily modified and enhanced in the future • “Black Box” design • Integration-oriented information and a step-by-step walkthrough for using the system are given in a designated section of the project’s final report

  29. Summary and conclusions • Fine line between high- and low-level implementation • Easy to write, modify and understand • Excellent simulation environment • Differences between simulation and FPGA • Automatic optimization – good and bad

  30. Thank you!
