  1. Technion - Israel Institute of Technology, Department of Electrical Engineering, High Speed Digital System Laboratory (HS-DSL). Neural Network for Handwritten Digits Recognition. Roi Ben Haim & Omer Zimerman. Supervisor: Guy Revach. Winter 2013/2014

  2. Background: A neural network is a machine learning system designed for supervised learning from examples. Such a network can be used for handwritten digit recognition, but a software implementation is inefficient in time and resources. This project is the third part of a three-part project. Our goal is to implement an efficient hardware solution to the handwritten digit recognition problem. Implementing dedicated hardware for this task is part of a new trend in VLSI architecture called heterogeneous computing: designing a system on chip with many accelerators, each optimized for its intended task, to achieve a better performance/power ratio.

  3. Theoretical Background - The Network: Neural networks are based on the biological neural system. The basic units that construct the network are neurons and weights. Weight: the basic unit that connects neurons; it multiplies the data passing through it by the weight value. Neuron: connected to multiple inputs and outputs; its output is the result of an activation function applied to the sum of its inputs.
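
A minimal sketch of a single neuron in MATLAB (the language of the project's SW simulator); the values and variable names here are illustrative, not taken from the project:

    % A single neuron: multiply each input by its weight, sum, add a bias,
    % and apply the activation function (tanh, as used later in this network).
    x = [0.2; -0.7; 1.0];        % example inputs arriving through the weights
    w = [0.5; -0.3; 0.8];        % the weight values on those connections
    b = 0.1;                     % bias weight
    y = tanh(w.' * x + b);       % neuron output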

  4. Theoretical Background - The Network: From neurons and weights we can construct a neural network with as many layers as we like. Each layer contains a certain number of neurons, and a set of weights connects the layer to other layers. The complexity of the network is determined by the dimension of its inputs: the more complex and variable the input, the larger the network must be.

  5. Learning algorithm: The output of the network can consist of multiple neurons. Each output neuron can be represented mathematically as a function of multiple variables, treating the weights as the function's parameters. For each input X, we would like to minimize the average error between the network output Y and the desired vector D.
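
The slide's error formula is not reproduced in this transcript; a standard choice consistent with the description (assumed here) is the mean squared error over the N training examples:

\[
E(W) \;=\; \frac{1}{N}\sum_{n=1}^{N} \tfrac{1}{2}\,\bigl\| Y(X_n;W) - D_n \bigr\|^2
\]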

  6. Learning algorithm: The method we use to reach the minimum error is a gradient-based algorithm. For each example input we compute an update for every weight in every layer; this calculation walks us one small step towards a minimum of the error. An error function has many local minima, so the algorithm does not guarantee that we reach the global minimum.
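
The per-weight computation referred to above is missing from the transcript; in standard gradient-descent form (assumed here), with learning rate \(\eta\), it is:

\[
w_{ij} \;\leftarrow\; w_{ij} \;-\; \eta\,\frac{\partial E}{\partial w_{ij}}
\]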

  7. Network Description - Structure & Functionality: Input: a 29x29 grayscale image. Output: 10 neurons (+1 for the recognized digit, -1 for the other 9). The network consists of convolutional layers followed by fully-connected layers (layer 0 through layer 4).

  8. NN Structure & Functionality: [Diagram: the 29x29 input image (Layer #0, 841 neurons) feeds Layer #1, which produces six 13x13 feature maps (#0 through #5, 1014 neurons in total).]

  9. NN Structure & Functionality: [Diagram: the six 13x13 feature maps of Layer #1 feed Layer #2, which produces fifty 5x5 feature maps (map #0 through map #49, 1250 neurons); Layer #3 is fully connected with 100 neurons (n#0 through n#99); Layer #4 is the 10-neuron output layer (d#0 through d#9).]

  10. Network Description - Structure & Functionality: • Layer #0: The first layer. Its input is the 29x29-pixel image; the pixels can be seen as 841 neurons without weights, and they serve as the input to the next layer. • Layer #1: The first convolution layer, producing 6 feature maps of 13x13 pixels/neurons each. Each feature map is the result of a non-standard 2D masking between a 5x5 weight kernel (plus 1 bias weight, 26 weights in total; a different kernel for each of the 6 maps) and the 29x29 input neurons: at each position the 25 products are summed with the bias and passed through the activation function (tanh). The masking is non-standard because the 5x5 neuron sub-matrices are taken with shifts of 2 (instead of 1), both vertically and horizontally, starting from the 5x5 sub-matrix at the upper-left corner of the 29x29 input.
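
As a sketch of this computation (written here for illustration, not the project's actual simulator code), the layer-1 forward pass for one feature map is:

    % Layer-1 forward pass for one feature map (sketch).
    % img    : 29x29 input image (the Layer #0 neurons)
    % kernel : 5x5 weight kernel belonging to this feature map
    % bias   : the kernel's bias weight
    fmap = zeros(13, 13);
    for r = 1:13
        for c = 1:13
            % 5x5 input sub-matrix, shifted by 2 vertically and horizontally
            sub = img(2*r-1 : 2*r+3, 2*c-1 : 2*c+3);
            % 25 products summed with the bias, then the tanh activation
            fmap(r, c) = tanh(sum(sum(sub .* kernel)) + bias);
        end
    end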

  11. Network Description - Structure & Functionality: • Layer #2: The second convolution layer. Its output is 50 feature maps of 5x5 neurons each (a total of 1250 neurons). Each neuron is the result of a masking calculation similar to the previous layer's, except that each of the 50 feature maps is now the sum of six 2D mask operations; each masking has its own 5x5 weight kernel (+1 bias weight) and is performed between that kernel and its matching feature map of layer 1 (horizontal and vertical shifts are 2, as in the previous layer).
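
In equation form (notation introduced here for illustration), feature map m of layer 2 at output position (r, c) is:

\[
y_m(r,c) \;=\; \tanh\!\Bigl(\,\sum_{k=0}^{5}\Bigl[\,b_{m,k} \;+\; \sum_{i=0}^{4}\sum_{j=0}^{4} W_{m,k}(i,j)\, x_k(2r+i,\,2c+j)\Bigr]\Bigr),
\qquad m = 0,\dots,49,\;\; r,c = 0,\dots,4
\]

where \(x_k\) is feature map k of layer 1, \(W_{m,k}\) its 5x5 kernel and \(b_{m,k}\) its bias weight.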

  12. Network Description - Structure & Functionality: • Layer #3: A fully connected layer containing 100 neurons. Each neuron has 1250 entries (the output neurons of the previous layer) that are multiplied by 1250 corresponding weights, plus a bias, giving 125,100 weights in this layer (1251x100). • Layer #4: The last fully connected layer. It contains 10 output neurons, each connected to the previous layer's 100 neurons by a different weight vector. The 10 outputs represent the 10 possible recognition options; the neuron with the highest value corresponds to the recognized digit. There are 1010 weights in this layer (101x10).
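
A sketch of layers 3 and 4 as matrix-vector products (variable names are illustrative, not from the project code):

    % Fully connected layers 3 and 4 (sketch).
    % x2 : 1250x1 vector of layer-2 output neurons
    % W3 : 100x1250 weights, b3 : 100x1 biases  (125,100 weights in total)
    % W4 : 10x100  weights,  b4 : 10x1  biases  (1,010 weights in total)
    x3 = tanh(W3 * x2 + b3);      % layer-3 neurons
    x4 = tanh(W4 * x3 + b4);      % layer-4 output neurons
    [~, idx] = max(x4);           % highest-valued output neuron
    digit = idx - 1;              % map MATLAB's 1-based index to the digits 0..9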

  13. Network Description Structure & Functionality Summary Table: Total of 3215 neurons and 134,066 weights

  14. SW simulator implementation summary: In project A, we implemented the neural network described in the previous slides using MATLAB. The MATLAB implementation achieved a 98.5% correct digit recognition rate.

  15. Software simulation results usage: In the current project, we used the results of the previous software implementation as a reference point for the hardware implementation, both for the success rate and for performance. To achieve the same success rate, we simulated several fixed-point implementations of the network (as opposed to the previous MATLAB floating-point arithmetic) and chose the minimal format that achieved ~98.5%: the 3.5 fixed-point format described in the following slides. Another use of the SW simulation results is the network's weight parameters, which were produced by the learning-process implementation.

  16. Project Goals: Devise an efficient & scalable HW architecture for the algorithm. Implement dedicated HW for handwritten digit recognition, achieving the SW model's recognition rate (~98.5%). Major performance improvement compared to the SW simulator. Low cell count and low power consumption. A fully functional system: the NN HW implementation on an FPGA with a PC I/F, running a digit recognition application.

  17. Project Top Block Diagram

  18. Architecture aspects: This architecture tries to optimize the resources/throughput tradeoff. Neural networks have a strongly parallel nature, and our implementation tries to preserve this parallelism (expressed as high throughput at the output of the system). A fully parallel implementation would require 338,850 multipliers (one for each of the 338,850 multiplications needed for a single digit recognition), which is obviously not feasible. In our architecture we decided to use 150 multipliers. This number was chosen with careful attention to current FPGA technology: on the one hand we didn't want to use up all the multipliers of the FPGA, but on the other hand we did want to use a substantial number of them, in order to support the parallel nature of the algorithm.
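
A rough consistency check of this choice (our own back-of-the-envelope reading of the figures above and of the cycle count reported in the performance analysis):

\[
\frac{338{,}850 \ \text{multiplications}}{150 \ \text{multipliers}} \;\approx\; 2{,}259 \ \text{multiplier-cycles per recognition}
\]

which is in line with the ~3,000 clock cycles per digit reported later; the remaining cycles go to control and memory access.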

  19. Architecture aspects: Our target technology is the VIRTEX 6 XC6VLX240T, which offers 768 DSP slices, each containing (among other things) a 25x18-bit multiplier, meaning we utilize 150 of the 768 DSP blocks. In principle, future functionality can be added to the FPGA, as we are far from the resource limit. This was intentional, since modern FPGA DSP designs are usually systems that integrate many DSP modules.

  20. Memory aspects: Another important guideline for the architecture is memory capacity. The algorithm requires ~135,000 weights and ~3,250 neurons, each represented by 8 bits (fixed point 3.5 format, the minimum number of bits required to achieve the same success rate as the MATLAB double-precision model). This means that a minimum of 1.1 Mb (megabit) of memory is required. The VIRTEX 6 XC6VLX240T offers 416 RAM blocks of 36 kb (kilobit) each, totaling 14.625 Mb, so we utilize only 7.5% of the internal FPGA RAM.
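
Spelled out with the figures above:

\[
(135{,}000 + 3{,}250)\times 8 \ \text{bits} \;\approx\; 1.1\ \text{Mb},
\qquad
416 \times 36\ \text{kb} = 14.625\ \text{Mb},
\qquad
\frac{1.1}{14.625} \;\approx\; 7.5\%
\]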

  21. Micro-architecture implementation: • Memories: All RAM memories were generated using Coregen, specifically for the target technology (VIRTEX 6). Small memories (~10 kb) were implemented as distributed RAM, and large memories were implemented using block RAM. Overall, 4 memory blocks were generated: • Layer 0 neuron memory: single-port distributed RAM block of depth 32 and width 29*8=232. Total memory size ~9 kb. • Layer 1 neuron memory: single-port distributed RAM block of depth 16 and width 13*6*8=624. Total memory size ~10 kb. • Weights bias memory: single-port ROM block of depth 261 and width 6*8=48. Total memory size ~12 kb.

  22. Micro-architecture implementation: • Weights and layer 2 memory: dual-port block RAM. One port has a read & write width of 1200 (depth 970 each); the second port has a write width of 600 (depth 1940) and a read width of 1200 (depth 970). Total memory size ~1.15 Mb. The layer 2 neuron memory and the weights memory were combined into one large RAM block for better utilization of the memory architecture provided by the VIRTEX 6. • Layer 3 & Layer 4 neuron memory: implemented in registers (110 bytes).

  23. Micro-architecture implementation: • Mult_add_top: This unit receives 150 neurons, 150 weights & 6 bias weights, and returns the output described below. The arithmetic is implemented using 150 DSP blocks and 6 adder trees, each adder tree containing 15 signed adders (all adders generated using Coregen and implemented in fabric rather than DSP blocks), for a total of 90 adders.
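
The exact output expression from the slide is not reproduced in this transcript; given that the 150 products are split across 6 adder trees, a plausible reconstruction is six partial sums of 25 products each, plus a bias:

\[
S_j \;=\; b_j \;+\; \sum_{i=1}^{25} w_{j,i}\, x_{j,i}, \qquad j = 1,\dots,6
\]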

  24. Micro-architecture implementation: • Tanh: Implemented as a simple LUT – 8 input bits are mapped to 8 output bits (total LUT size is therefore 256 x 8).
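
A sketch of how such a LUT could be generated offline, assuming the 3.5 fixed-point format described on the next slide (this is an illustration, not the project's actual Coregen flow):

    % Generate a 256x8 tanh lookup table for 3.5 fixed-point values (value = code/32).
    lut = zeros(256, 1);
    for code = 0:255
        if code < 128
            x = code / 32;              % non-negative codes: 0 .. 3.96875
        else
            x = (code - 256) / 32;      % negative codes: -4 .. -0.03125
        end
        q = round(tanh(x) * 32);        % activation, quantized back to 3.5 format
        q = max(min(q, 127), -128);     % defensive saturation (tanh stays in range)
        lut(code + 1) = mod(q, 256);    % store as an unsigned 8-bit code
    end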

  25. Micro-architecture implementation: • Rounding & saturation unit: Logic to cope with the bit expansion caused by multiplication (which doubles the number of bits of the multiplicands) and addition (which expands the result by 1 bit relative to the added numbers). • Neurons are represented using 8 bits in 3.5 fixed-point format. This format was decided upon after simulating several fixed-point formats and finding the minimal number of bits needed to achieve 98.5% accurate digit recognition (equal to the success rate of MATLAB's floating point). • The rounding & saturation logic operates according to the following rules: if input < -4 then output = -4 (binary '100.00000'); else if input > 3.96875 then output = 3.96875 (binary '011.11111'); else output = round(input * 2^5) * 2^-5, where round() is the hardware implementation of MATLAB's round() function.
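
The rules above, transcribed directly as a MATLAB function in the style of the bit-accurate model (the function name is ours):

    % Round and saturate a real-valued intermediate result to 3.5 fixed point.
    function y = round_sat(x)
        if x < -4
            y = -4;                        % saturate low  ('100.00000')
        elseif x > 3.96875
            y = 3.96875;                   % saturate high ('011.11111')
        else
            y = round(x * 2^5) * 2^-5;     % round to the nearest 1/32 step
        end
    end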

  26. Micro-architecture implementation: • UART interface: The uart_if module receives the 29x29-byte image from the PC bit by bit (serial I/F). The RX module coalesces every 29 bytes into one memory word written to the L0 neuron memory. After all 29 memory words are written, the start signal rises and image processing begins. When image processing is done, the digit recognition unit outputs the result (a 10-byte bus, one byte per digit) to the UART TX, which sends the results out over the serial I/F.

  27. Resource utilization summary: As can be seen, the naïve implementation (a brute-force, fully parallel implementation of the network) is not feasible in hardware because of its impractical resource demands. Our architecture offers reasonable resource utilization while still improving performance substantially compared to the software implementation.

  28. Development Environment: SW development platform – MATLAB. HW development platforms: Editor – Eclipse; Simulation – ModelSim 10.1b; Synthesis – XST (ISE); FPGA – Virtex 6 XC6VLX240T.

  29. HDL implementation & verification: All of the modules described in the previous slides were successfully implemented in Verilog HDL. A testbench was created for each module, input vectors were created & injected, and simulation results were compared to a bit-accurate MATLAB model. Once the simulation results of all stimulus vectors were consistent with the bit-accurate model, the module was considered verified. After each module was individually verified, we connected the different modules and implemented a controller over the entire logic. A testbench & bit-accurate MATLAB models for all stages of the controller were created for the entire project.

  30. HDL implementation & verification: Here are ModelSim simulation results for recognition of the digit '9':

  31. Performance Analysis: As mentioned before, the purpose of implementing the system in hardware was to better exploit the parallel nature of the algorithm and thus increase performance. The previous SW implementation required ~5000 μs to perform a single digit recognition. The FPGA implementation requires ~3000 clock cycles per recognition; with a system clock frequency of 150 MHz, the total time required for a single digit recognition is ~20 μs. In summary, the hardware implementation achieves a performance speed-up of about 250x.

  32. FPGA resource utilization: Virtex 6 basic resources: Each Virtex-6 FPGA slice contains four LUTs and eight flip-flops. Each DSP48E1 slice contains a 25x18 multiplier, an adder, and an accumulator. Block RAMs are fundamentally 36 Kb in size; each block can also be used as two independent 18 Kb blocks. Utilization of the FPGA's basic resources in our implementation:

  33. User interface (GUI): In order to make the MATLAB implementation usable for future users, we designed a GUI (Graphical User Interface) that wraps the network's functions while allowing configurability. The GUI includes four modes of operation: training mode, verification mode, user mode and FPGA mode. The first three modes (training, verification and user) run the SW simulator from the project's first part. In FPGA mode the user sends a digit image via UART to the digit recognition unit implemented on the FPGA.

  34. Project challenges: • The main goal of the project, implementing a highly functional handwritten digit recognition system, proved very challenging. We can divide the challenges into 3 main categories: architecture, implementation and verification. • Architecture-oriented problems – Devising an efficient hardware architecture for the algorithm was one of the biggest challenges of the project. Neural networks can theoretically be implemented completely in parallel, but this solution is not practical resource-wise, so a lot of thought went into the tradeoff between parallelism and resource usage. • In order to allow a degree of scalability, we had to identify what the different layers have in common and devise an architecture that lets all layers use the same logic modules, instead of implementing each layer in a straightforward fashion. • Resource estimation – our target device is a Virtex 6 FPGA, so our architecture had to take that into consideration. We had a well-defined limit on the amount of memory & multipliers available to us, and therefore needed to devise an architecture that would not exceed these limits.

  35. Project challenges: • Implementation-oriented problems – • Our target device (Xilinx Virtex 6 FPGA) was unfamiliar to us, so we had to learn how to operate Xilinx's tools to implement logic modules compatible with the target technology. • We used fixed-point arithmetic for the first time and gained much experience in this area, including implementing hardware rounding & saturation logic. At first we implemented simple truncation rounding, but found that it lowered the success rate. We therefore implemented a more elaborate rounding method that imitates MATLAB's round() function. • Implementing such a modular & scalable architecture required a smart controller. Composing the control algorithm and then coding the controller proved very challenging.

  36. Project challenges: • Verification-oriented problems – • Verification of the system was probably the most challenging aspect. • As stated in the previous section, we were unfamiliar with Xilinx's tools, so after creating the desired logic (memories, multipliers, etc.) with these tools, we had to verify that it worked. Nothing worked at first, so this proved a long process, until we learned to use Xilinx's IPs properly. • Most challenging of all was achieving a successful verification of the entire system. Our system carries an extremely large amount of data (neurons, weights, partial results), so every small mistake in the controller produces a lot of wrong output data, and it is very difficult to pinpoint the origin of the mistake. For example, if we accidentally coded the controller such that a data_valid strobe arrives at a certain module one clock cycle earlier than it should, the entire data flow continues with data that is essentially garbage, and the origin of the mistake is hard to find. To overcome this, we had to produce a MATLAB bit-accurate model for each step of the design, not only for the final results.

  37. Thank you
