

  1. Multi-Processor System – a two-semester project Final Presentation Project performed by: Naor Huri, Idan Shmuel Project supervised by: Ina Rivkin Technion – Israel Institute of Technology, Department of Electrical Engineering, High Speed Digital Systems Lab Winter 2008

  2. Background • This project is part of a much larger project dealing with signal-processing acceleration using hardware. In our project we built a hardware accelerator for a given algorithm and analyzed the advantages and disadvantages of such a system compared to a pure software implementation.

  3. Problem • Running the signal-processing algorithm in software on a standard PC takes too much time. Solution • A system designed especially for that task, using multiple processors and a management unit.

  4. The system • Simulator- a program running on the host PC, responsible for generating data packets, sending them for processing and retrieving the results. • Processing units, each running the same signal-processing program on the incoming packets. • Switch- responsible for the correct transfer of data between the host PC and the processing units and vice versa.

  5. Block diagram • The host's data-packages generator connects over the PCI bus to the Gidel ProcStar II board. On the Stratix II FPGA, the switch moves packets from FIFO_IN (in board memory) to the processing units (Processor I…N, each with its own on-chip memory) and returns the answers through FIFO_OUT (also in board memory).

  6. Project goals • Building the above system and understanding multiprocessor issues. • Learning the tools and techniques for building such complex systems. • Optimizing the system configuration, searching for the ideal NiosII type and number. • Finding the optimal configuration in which throughput is maximized. • Performance comparison between working with the PC and working with the system.

  7. Project levels • The project is an integration between two levels, software and hardware: • The software level is composed of Vitaly's packet-processing algorithm and the HOST program, which generates the data packets and retrieves the results. • The hardware level, implemented on the PROC board, includes the switch, the processing units and the Gidel IPs.

  8. Software level: Simulator- packet generator and retriever • This program generates vectors of Time Of Arrival (TOA) values, each made up of a basic chain with a specific period plus noise elements. • Every vector is wrapped with a header and tail used for identification, control signals and synchronization. • The packet structure (a hypothetical sketch follows):
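The slides show the packet structure only as a figure, so here is a minimal C sketch of a plausible generator. The struct layout, field names, capacity and end-marker value are all assumptions, not the project's real format; the %absent and %noise knobs mirror the parameters swept on slides 25-26.

    /* Hypothetical packet layout and generator, inferred from slide 8. */
    #include <stdint.h>
    #include <stdlib.h>

    #define MAX_TOA 512            /* assumed capacity; the test vectors were ~495 TOAs */

    typedef struct {
        uint32_t header;           /* packet ID + control/sync bits (assumed) */
        uint32_t num_toa;          /* number of valid TOA entries */
        uint32_t toa[MAX_TOA];     /* Time Of Arrival values, sorted ascending */
        uint32_t tail;             /* end-of-packet marker (assumed value) */
    } toa_packet_t;

    static int cmp_u32(const void *a, const void *b)
    {
        uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
        return (x > y) - (x < y);
    }

    void generate_packet(toa_packet_t *p, uint32_t id, uint32_t period,
                         int n_chain, int pct_absent, int pct_noise)
    {
        uint32_t n = 0;
        /* the basic chain: one arrival per period, some pulses dropped */
        for (int i = 0; i < n_chain && n < MAX_TOA; i++) {
            if (rand() % 100 < pct_absent)
                continue;                          /* "absent" pulse */
            p->toa[n++] = (uint32_t)i * period;
        }
        /* extra random "noise" arrivals mixed into the same time span */
        int n_noise = n_chain * pct_noise / 100;
        for (int i = 0; i < n_noise && n < MAX_TOA; i++)
            p->toa[n++] = rand() % (n_chain * period);
        qsort(p->toa, n, sizeof(uint32_t), cmp_u32);
        p->header  = id;
        p->num_toa = n;
        p->tail    = 0xFFFFFFFFu;                  /* assumed end marker */
    }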

  9. Software level: The algorithm • The algorithm's job is to recognize such basic chains in the incoming vectors and to associate each TOA element with its chain (a toy illustration follows). • The results are sent back to the simulator in the following packet structure:
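The actual algorithm is Vitaly's and is not reproduced in the slides. As a toy illustration only, the sketch below recovers a chain period by histogramming pairwise TOA differences, which is at least consistent with the histogram buffers of slide 16 and the O(n^2) computing time reported on slide 24.

    /* Toy period recovery: every pair of arrivals votes for its time
     * difference; the basic chain's period peaks in the histogram.
     * The n(n-1)/2 pairs give the O(n^2) behavior seen on slide 24. */
    #include <stdint.h>

    #define HIST_BINS 1024   /* 4 KB / 4 bytes per bin (assumed layout) */

    uint32_t find_period(const uint32_t *toa, int n)
    {
        static uint32_t hist[HIST_BINS];
        for (int i = 0; i < HIST_BINS; i++)
            hist[i] = 0;
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++) {
                uint32_t d = toa[j] - toa[i];
                if (d < HIST_BINS)
                    hist[d]++;
            }
        uint32_t best = 1;
        for (uint32_t d = 2; d < HIST_BINS; d++)
            if (hist[d] > hist[best])
                best = d;
        return best;
    }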

  10. Hardware level: The board • The hardware level is implemented on a Gidel PROCStar II board. • 4 powerful Stratix II FPGAs, each attached to 2 external DDRII memories. • The packets are sent to the processing units via the PCI bus. • The packets are stored in the 2 external memories, which are configured to act as FIFOs.

  11. Hardware level: the switch • The switch, designed by Oleg & Maxim, manages the data transfer between the host PC and the multiple processing units. • The switch is composed of the following main modules: • Input reader- reads packets from FIFO_IN to the processing units • Output writer- writes the answers from the processing units to FIFO_OUT • Main controller- as its name implies, issues all the control signals required for the correct transfer of data • Clusters- a wrapper around the processing units, used to give the system another abstraction layer.

  12. Switch top level block diagram

  13. Switch- main features • Management policy: FCFS (first come, first served) for input packets, RR (round robin) for output packets- a behavioral sketch follows • Statistics reporter • Error reporter • Up to 16 clusters.
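The slides name the policies but not their mechanics. This is a small software model of the assumed semantics, not the actual HDL of the main controller: inputs go to the lowest-numbered free cluster in arrival order, and outputs are drained round-robin so no cluster starves.

    #include <stdbool.h>

    #define NUM_CLUSTERS 16

    bool cluster_busy[NUM_CLUSTERS];
    bool cluster_done[NUM_CLUSTERS];   /* set when a unit's answer is ready */

    /* FCFS: hand the next input packet to the first free cluster */
    int dispatch_input(void)
    {
        for (int c = 0; c < NUM_CLUSTERS; c++)
            if (!cluster_busy[c]) {
                cluster_busy[c] = true;
                return c;
            }
        return -1;               /* all busy: the packet waits in FIFO_IN */
    }

    /* RR: scan clusters starting just after the last one served */
    int collect_output(void)
    {
        static int last = NUM_CLUSTERS - 1;
        for (int i = 1; i <= NUM_CLUSTERS; i++) {
            int c = (last + i) % NUM_CLUSTERS;
            if (cluster_done[c]) {
                cluster_done[c] = false;
                cluster_busy[c] = false;
                last = c;
                return c;        /* write this cluster's answer to FIFO_OUT */
            }
        }
        return -1;               /* nothing ready yet */
    }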

  14. Inside the switch • The switch has up to 16 clusters inside. • The same cluster is duplicated many times to create a multi-Nios system. • Every cluster has one processing unit, as seen in the next slide. • Switch ports and cluster ports:

  15. Cluster structure

  16. Hardware level: the processing unit • 1 NiosII CPU • 12 KB on-chip memory for code, stack and heap • 2 4 KB buffers used by the algorithm to build the histograms • 4 KB buffer for input packets- dual-port, also mastered by the switch • 1 KB buffer for output packets- dual-port, also mastered by the switch • Timer. • A sketch of a firmware loop over these buffers follows.
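The slides describe the unit's memories but not its firmware. A minimal sketch, assuming a simple "flag in word 0" handshake over the dual-port buffers; the real synchronization runs over the ack/req ports of slide 18, and the addresses and prototype below are invented for illustration.

    #include <stdint.h>

    /* Vitaly's algorithm, living in the 12 KB code memory (prototype only) */
    int process_packet(const volatile uint32_t *in, volatile uint32_t *out);

    /* Hypothetical base addresses; the real SOPC memory map is not shown */
    #define IN_BUF   ((volatile uint32_t *)0x00010000)  /* 4 KB input buffer  */
    #define OUT_BUF  ((volatile uint32_t *)0x00011000)  /* 1 KB output buffer */

    int main(void)
    {
        for (;;) {
            /* assumed convention: the switch sets word 0 when a packet is in */
            while (IN_BUF[0] == 0)
                ;
            int n = process_packet(IN_BUF + 1, OUT_BUF + 1);
            OUT_BUF[0] = (uint32_t)n;   /* assumed "result ready" flag */
            IN_BUF[0]  = 0;             /* re-arm for the next packet */
        }
    }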

  17. 1 processing unit

  18. Processing unit ports • Input vector and output vector- the connection to the switch. Without these ports, no ack/req protocols could be implemented. • The modules’ “export” signals are connected to the cluster, as seen in the “cluster structure” slide.

  19. Creating a multi-Nios system • Duplicating the clusters inside the switch creates a multi-Nios system. • The switch supports up to 16 clusters. • This example includes 14 Nios II processors. • Logic utilization is only 20%. • While almost all RAM blocks are used, memory utilization is only 33%, mainly because the M-RAM cells are used ineffectively.

  20. Hardware level: Gidel IPs- MegaFIFO and registers • Gidel IP- The MegaFIFO • Provides a simple and convenient way to transfer data to/from the Gidel PROC board. • In this system there are two FIFOs: FIFO_IN for the incoming packets and FIFO_OUT for the processed packets. • To access these memories, the host uses Gidel's predefined HAL functions while the hardware uses an ack/req protocol (an assumed host-side flow is sketched below). • Gidel IP- Register • Used to transfer data from hardware to software and vice versa. • In this system, registers are used for error and statistics reporting.
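A host-side sketch of the assumed flow. host_fifo_write(), host_fifo_read() and host_reg_read() are invented placeholders for Gidel's predefined HAL functions, whose real names are not given in the slides; treat everything here as illustrative only.

    #include <stddef.h>
    #include <stdint.h>

    /* invented stand-ins for the Gidel HAL calls, backed by the PCI bus */
    void     host_fifo_write(int fifo_id, const void *buf, size_t bytes);
    void     host_fifo_read(int fifo_id, void *buf, size_t bytes);
    uint32_t host_reg_read(int reg_id);

    enum { FIFO_IN = 0, FIFO_OUT = 1, REG_INFO = 0 };

    void send_and_receive(const void *pkt, size_t pkt_len,
                          void *res, size_t res_len)
    {
        host_fifo_write(FIFO_IN, pkt, pkt_len);   /* packet into FIFO_IN */
        while (host_reg_read(REG_INFO) != 0)      /* slide 23: info register  */
            ;                                     /* drops to 0 when all done */
        host_fifo_read(FIFO_OUT, res, res_len);   /* processed packet back */
    }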

  21. Design example • In the PROCWizard tool we define the top-level entity of the design. It generates the HDL code for the design and an H file for the host. • We can see here the definition of one IC (FPGA), two FIFOs, some registers and the LBS module.

  22. Now that we have the system- let’s check some numbers • Basic system: • 1 NiosII s ("s" for standard) system • 1 simulator (the one we built) • 1 algorithm

  23. Performance check: • 3 methods: • Timer module- inside the SOPC, used as a timestamp. Resolution: 10 us. • Statistics reporter module- counts packets entering and exiting the system. As long as the two counts are not identical, it counts clock cycles and the info register holds the value 128. Resolution: 0.01 us. • Software timer- run by the host, from the moment it writes the data to the moment the info register reads zero, indicating all packets have returned. Resolution: 1 us (a sketch follows). • Later we will demonstrate how the 3 methods converge.
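A sketch of the software-timer method, assuming a POSIX host for gettimeofday() (the lab PC would use an equivalent timer); host_reg_read() is the same invented placeholder as in the slide-20 sketch.

    #include <stdint.h>
    #include <sys/time.h>

    uint32_t host_reg_read(int reg_id);   /* invented stand-in, as above */
    enum { REG_INFO = 0 };

    /* Elapsed microseconds from sending the data until the info register
     * reads zero, matching the ~1 us resolution quoted on this slide. */
    double measure_run_us(void)
    {
        struct timeval t0, t1;
        gettimeofday(&t0, NULL);
        /* ... write all the packets to FIFO_IN here ... */
        while (host_reg_read(REG_INFO) != 0)  /* 0 => all packets returned */
            ;                                 /* busy-poll until done */
        gettimeofday(&t1, NULL);
        return (t1.tv_sec - t0.tv_sec) * 1e6
             + (double)(t1.tv_usec - t0.tv_usec);
    }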

  24. Performance check for the algorithm • Computing time as a function of the number of TOAs. • Computing time = O(n^2). • %absent = 0 • %noise = 0

  25. Performance check for the algorithm • Computing time as a function of %absent. • Around 6% absent, the algorithm finds more than 1 sequence, with double frequencies… • %noise = 0

  26. Performance check for the algorithm • Computing time as a function of %noise

  27. Testbench vector • Based on the above results, we chose an average vector for checking the different systems • Vector length: 495 • %absent = 4% • %noise = 25%

  28. Calculating time on different CPUs • In order to decide which Nios configuration to use, we checked each of them with the same vector. • The economy CPU needs little space on the FPGA but has poor performance. • The fast version has some advantage over the standard CPU but needs far more FPGA resources, so we chose the standard one.

  29. Calculating time on systems with several CPUs • The CPUs are independent, so doubling their number doubles the performance. • There is no major overhead for adding CPUs.

  30. Final results: PC vs FPGA • In order to reach final conclusions, we sent 10k random vectors to both the PC and the most powerful FPGA system. • The PC does the job 7.64 times faster.

  31. Does it mean there is no chance of getting good results? • No! • The Nios CPUs we used are no match for the PC’s Pentium CPU, but there are a few ways to get better performance: • 1. Increasing the system’s 100 MHz clock rate • 2. Adding an accelerator unit to each CPU • 3. Shrinking the code from 8.5 Kbytes to 8 Kbytes, thereby freeing more RAM cells for more CPUs • 4. Optimizing the utilization of the M-RAM cells.

  32. We would like to thank • Ina Rivkin • Lab staff- Eli Shoshan and Moni Orbach • Oleg and Maxim • Vitaly • Michael and Liran

  33. The end
