
Design and Implementation of a NoC-Based Cellular Computational System



Presentation Transcript


  1. Design and Implementation of a NoC-Based Cellular Computational System By: Shervin Vakili Supervisors: Dr. Sied Mehdi Fakhraie Dr. Siamak Mohammadi February 09, 2009

  2. Outline • Introduction and Motivations • Basics of Evolvable Multiprocessor System (EvoMP) • EvoMP Operational View • EvoMP Architectural View • Simulation and Synthesis Results • Summary 2

  3. Introduction and Motivations Basics of Evolvable Multiprocessor System (EvoMP) EvoMP Operational View EvoMP Architectural View Simulation and Synthesis Results Summary 3

  4. Introduction and Motivations (1) • Computing systems have played an important role in the advancement of human life over the last four decades. • The number and complexity of applications are continuously increasing, demanding more computational power. • Three main hardware design approaches: • ASIC (hardware realization) • Reconfigurable computing • Processor-based designs (software realization) • (Figure: flexibility vs. performance trade-off across the three approaches) 4

  5. Introduction and Motivations (2) • Microprocessors are the most popular approach. • Flexibility and reprogrammability • Low performance • Architectural techniques to improve processor performance: • Pipelining, out-of-order execution, superscalar, VLIW, etc. • These gains seem to have saturated in recent years. 5

  6. Introduction and Motivations (3) • Emerging trends aim to achieve: • More performance • Preserving the classical software development process. [1] 6

  7. Why Multi-Processor? • One of the main trends is to increase the number of processors. • Exploits thread-level parallelism (TLP) • Similarities to single-processor design: • Short time-to-market • Post-fabrication reusability • Flexibility and programmability • Moving toward large numbers of simple processors on a chip. 7

  8. Number of Processing Cores in Different Products [3] [3] 8

  9. MPSoC Development Challenges (1) • MP systems face some major challenges. • Programming models: • MP systems require concurrent software. • Concurrent software development requires two operations: • Decomposition of the program into tasks • Scheduling of the tasks among cooperating processors • Both are NP-complete problems • Both strongly affect performance 9

  10. MPSoC Development Challenges (2) • Two main solutions: 1. Software development using parallel programming libraries • e.g., MPI and OpenMP • Done manually by the programmer • Requires huge investment to re-develop existing software 2. Automatic parallelization at compile time • Does not require reprogramming, but requires re-compilation • The compiler performs both task decomposition and scheduling 10

  11. MPSoC Development Challenges (3) • Control and synchronization • To address inter-processor data dependencies • Debugging • Tracking concurrent execution is difficult. • Particularly in heterogeneous architectures with processors of different ISAs. 11

  12. MPSoC Development Challenges (4) • All MPSoCs can be divided into two categories: • Static scheduling • Task scheduling is performed before execution. • Predetermined number of contributing processors. • The scheduler has access to the entire program. • Dynamic scheduling • A run-time scheduler (in hardware or the OS) performs task scheduling. • Does not depend on the number of processors. • Only has access to pending tasks and available resources. 12

  13. Introduction and Motivations Basics of Evolvable Multiprocessor System EvoMP Operational View EvoMP Architectural View Simulation and Synthesis Results Summary 13

  14. Proposal of Evolvable Multi-Processor System (1) • This thesis introduces a novel MPSoC that: • Uses evolutionary strategies for run-time task decomposition and scheduling. • Is called EvoMP (Evolvable Multi-Processor system). • Features: • Can directly execute classical sequential code on an MP platform. • Uses a hardware evolutionary algorithm core to perform run-time task decomposition and scheduling. • Distributed control and computing • Flexibility • NoC-based, 2D mesh, and homogeneous 14

  15. Proposal of Evolvable Multi-Processor System (2) • All computational units hold one copy of the entire program. • The EvoMP architecture exploits a hardware evolutionary core to generate a bit-string (chromosome). • This bit-string determines which processor is in charge of executing each instruction. • The primary version of EvoMP uses a genetic algorithm core. 15

  16. Target Applications • Applications that perform a fixed computation on a stream of data, e.g.: • Digital signal processing • Packet processing in network applications • Processing of large volumes of sensory data • … 16

  17. Streaming Applications Code Style • Streaming programs have two main parts: • Initialization • Infinite (or semi-infinite) loop • Two-tap FIR filter example:
  ;Initial
  1- MOV R1, 0
  2- MOV R2, 0
  L1: ;Loop
  3- MOV R1, Input
  4- MUL R3, R1, Coe1
  5- MUL R4, R2, Coe2
  6- ADD R1, R3, R4
  7- MOV Output, R1
  8- MOV R1, R2
  9- GENETIC
  10- JUMP L1
  17
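As an illustration of the code style above, here is a hedged Python analogue of the two-tap FIR filter: initialization followed by a loop that reads an input, computes y[n] = Coe1·x[n] + Coe2·x[n-1], and writes an output. The coefficient values and the input stream are made up for illustration.

```python
def fir2_stream(samples, coe1=3, coe2=5):
    """Generator mirroring the slide's code style: an initialization
    part, then a (semi-)infinite loop over the input stream."""
    r1, r2 = 0, 0          # initialization (MOV R1,0 / MOV R2,0)
    for x in samples:      # loop body (MOV R1, Input each iteration)
        r3 = x * coe1      # MUL R3, R1, Coe1
        r4 = r2 * coe2     # MUL R4, R2, Coe2
        y = r3 + r4        # ADD R1, R3, R4
        yield y            # MOV Output, R1
        r2 = x             # MOV R1, R2 (shift the delay line)

print(list(fir2_stream([1, 2, 3])))  # [3, 11, 19]
```

In hardware the loop never terminates; the finite generator here simply stands in for the endless stream.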

  18. Introduction and Motivations Basics of Evolvable Multiprocessor System (EvoMP) EvoMP Operational View EvoMP Architectural View Simulation and Synthesis Results Summary 18

  19. EvoMP Top View • The genetic core produces a bit-string (chromosome). • It determines the location of execution of each instruction. • (Figure: 2×2 mesh of processors P-00, P-01, P-10, P-11 connected through switches SW00–SW11 to the genetic core; every processor holds an identical copy of the program below.)
  1- MOV R1, 0
  2- MOV R2, 0
  L1: ;Loop
  3- MOV R1, Input
  4- MUL R3, R1, Coe1
  5- MUL R4, R2, Coe2
  6- ADD R1, R3, R4
  7- MOV Output, R1
  8- MOV R1, R2
  9- JUMP L1
  Chromosome: 0110110…11
  19

  20. How EvoMP Works? (1) • The following process is repeated in each iteration: • At the beginning of each iteration, the genetic core generates and sends the bit-string (chromosome) to all processors. • The processors execute the iteration with the decomposition and scheduling scheme the chromosome determines. • A counter in the genetic core counts the number of clock cycles spent. • When all processors have reached the end of the loop, the genetic core uses the output of this counter as the fitness value. 20

  21. How EvoMP Works? (2) • Three main working states: • Initialize: only for the first population • The genetic core generates random chromosomes. • Evolution: uses recombination to produce new populations. • When the termination condition is met, the system goes to the final state. • Final: the best chromosome is used as the constant output of the genetic core. • When one of the processors becomes faulty, the system returns to the evolution state. • (State diagram: Initialize → Evolution → Final; Final → Evolution on fault detection.) 21
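The Initialize → Evolution → Final flow above can be sketched in software. This is a minimal, hedged sketch, not the hardware genetic core: the cycle counter is replaced by a made-up stand-in fitness (`cycles_for`), and the crossover/mutation details (one-point crossover, 10% mutation, selection from the top four) are illustrative assumptions; only the fixed elite count of two is taken from the slides.

```python
import random

def cycles_for(chrom):
    # Stand-in for the hardware cycle counter: here, fewer '1' bits
    # is arbitrarily treated as a faster schedule.
    return 100 + sum(chrom)

def evolve(n_bits=16, pop_size=8, generations=50, seed=0):
    rng = random.Random(seed)
    # Initialize state: a random first population.
    pop = [[rng.randint(0, 1) for _ in range(n_bits)]
           for _ in range(pop_size)]
    # Evolution state: recombine to produce new populations.
    for _ in range(generations):
        pop.sort(key=cycles_for)               # lower fitness = better
        elite = pop[:2]                        # elite count fixed at two
        children = []
        while len(children) < pop_size - 2:
            a, b = rng.sample(pop[:4], 2)      # parents from the best four
            cut = rng.randrange(1, n_bits)     # one-point crossover
            child = a[:cut] + b[cut:]
            if rng.random() < 0.1:             # occasional mutation
                child[rng.randrange(n_bits)] ^= 1
            children.append(child)
        pop = elite + children
    # Final state: the best chromosome becomes the constant output.
    return min(pop, key=cycles_for)

best = evolve()
print(cycles_for(best))
```

On a fault, the real system re-enters the Evolution state; in this sketch that would amount to calling `evolve` again with the faulty processor excluded.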

  22. How the Chromosome Encodes the Scheduling Data (1) • Each chromosome consists of small words (genes). • Each word contains two fields: • A processor number • A number of instructions 22

  23. How the Chromosome Encodes the Scheduling Data (2) • Assume a 2×2 mesh. • (Figure: the program below with its instructions partitioned among processors; each chromosome word pairs a 2-bit processor number with a 3-bit instruction count, e.g. Word1 = 10 001, Word2 = 01 010, Word3 = 11 000, Word4 = 10 101.)
  1- MOV R1, 0
  2- MOV R2, 0
  L1: ;Loop
  3- MOV R1, Input
  4- MUL R3, R1, Coe1
  5- MUL R4, R2, Coe2
  6- ADD R1, R3, R4
  7- MOV Output, R1
  8- MOV R1, R2
  9- GENETIC
  10- JUMP L1
  23
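A hedged sketch of how such a chromosome might be decoded into a schedule: each gene holds a processor number followed by an instruction count. The bit widths (2-bit processor id for a 2×2 mesh, 3-bit count) are assumptions read off the example figure, not a specification of the hardware format.

```python
def decode(chromosome, proc_bits=2, count_bits=3):
    """Split a bit-string into (processor, instruction-count) genes."""
    gene_len = proc_bits + count_bits
    schedule = []
    for i in range(0, len(chromosome), gene_len):
        gene = chromosome[i:i + gene_len]
        proc = int(gene[:proc_bits], 2)    # which processor executes
        count = int(gene[proc_bits:], 2)   # how many instructions it takes
        schedule.append((proc, count))
    return schedule

# "10 001" -> processor 2 runs 1 instruction; "01 010" -> processor 1 runs 2.
print(decode("1000101010"))  # [(2, 1), (1, 2)]
```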

  24. Data Dependency Problem • Data dependencies are the main challenge. • They must be detected dynamically at run-time. • The problem is addressed using: • A particular machine code style • Architectural techniques 24

  25. EvoMP Machine Code Style • Each source operand is replaced by the line number (ID) of the most recent instruction that changed it. • This enormously simplifies dependency detection. • Example:
  10. ADD R1, R2, R3 ; R3=R1+R2
  11. AND R2, R6, R7 ; R7=R2&R6
  12. SUB R7, R3, R4 ; R4=R7-R3
  becomes
  12. SUB (11), (10), R4
  25
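The renaming step above can be sketched as a single pass that tracks the last writer of each register. This is an illustrative reconstruction, not the thesis's actual tool: the three-address tuple format `(op, src1, src2, dest)` and the helper names are assumptions chosen to mirror the slide's example.

```python
def rename_operands(program):
    """Replace each source register with the line number (ID) of the
    most recent instruction that wrote it, as in the slide's example."""
    last_writer = {}                 # register -> line number of last write
    renamed = []
    for line_no, (op, src1, src2, dest) in program:
        def ref(reg):
            # A register written earlier becomes a line-number reference.
            return f"({last_writer[reg]})" if reg in last_writer else reg
        renamed.append((line_no, (op, ref(src1), ref(src2), dest)))
        last_writer[dest] = line_no  # this instruction now owns dest
    return renamed

prog = [(10, ("ADD", "R1", "R2", "R3")),
        (11, ("AND", "R2", "R6", "R7")),
        (12, ("SUB", "R7", "R3", "R4"))]
print(rename_operands(prog)[-1])  # (12, ('SUB', '(11)', '(10)', 'R4'))
```

With operands expressed as IDs, a processor can detect a dependency by comparing an incoming result's ID against the IDs its pending instructions are waiting for.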

  26. Introduction and Motivations Basics of Evolvable Multiprocessor System (EvoMP) EvoMP Operational View EvoMP Architectural View Simulation and Synthesis Results Summary 26

  27. Architecture of Each Processor • The number of FUs is configurable. • Homogeneous or heterogeneous policies can be used for the FUs. • Supports out-of-order execution. • The first free FU grabs the instruction from the Instr bus (daisy chain). 27

  28. Fetch_Issue Unit • PC1-Instr bus is used for executive instructions. • PC2-Invalidate_Instr bus is used for data dependency detection. 28

  29. Functional Unit • Can be configured to execute different operations: • Arithmetic Operations • Add • Sub • Shift/Rotate Right/Left • Multiply: Add and shift • Logical Operations 29

  30. Genetic Core • (Figure: genetic core attached to the 2×2 mesh of cells Cell-00–Cell-11 through switches SW00–SW11.) • Population size and mutation rate are configurable. • The elite count is fixed at two in order to reduce hardware complexity. 30

  31. EvoMP Challenges • The current version uses a centralized memory unit. • Located at address “00”. • This node does not contain computational circuits. • This is a major issue for scalability. • The search space of the genetic algorithm is very large. • It grows exponentially with a linear increase in the number of processors. 31

  32. PSO Core [8] 32

  33. Introduction and Motivations Basics of Evolvable Multiprocessor System (EvoMP) EvoMP Operational View EvoMP Architectural View Simulation and Synthesis Results Summary 33

  34. Configurable Parameters • There are some configurable parameters in EvoMP: • Word-length of the system • Size of the mesh (number of processors) • Flit length: bit-length of NoC switch links • Population size • Crossover rate 34

  35. Simulation Results • Two sets of applications are used for performance evaluation: • Some DSP programs • Some sample neural networks • Two other decomposition and scheduling methods were implemented to enable comparison: • Static Decomposition Genetic Scheduler (SDGS) • Decomposition is performed statically, i.e., tasks are predetermined manually. • The genetic core only specifies the scheduling scheme. • Static Decomposition First Free Scheduler (FF) • Assigns the first task in the job queue to the first free processor in the system. 35
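The FF baseline can be sketched as a greedy list scheduler. This is a hedged illustration of the stated rule only — first task in the job queue goes to the first free processor — so it ignores data dependencies and communication costs; the task durations and processor count are made-up values.

```python
import heapq
from collections import deque

def first_free_schedule(durations, n_procs):
    """Greedy FF rule: pop the first queued task, give it to the
    processor that frees up earliest. Returns (task, proc, start)
    triples and the overall makespan."""
    free_at = [(0, p) for p in range(n_procs)]  # (time proc frees, id)
    heapq.heapify(free_at)
    queue = deque(enumerate(durations))
    plan, makespan = [], 0
    while queue:
        task, dur = queue.popleft()             # first task in the queue
        t, proc = heapq.heappop(free_at)        # first processor to free
        plan.append((task, proc, t))
        heapq.heappush(free_at, (t + dur, proc))
        makespan = max(makespan, t + dur)
    return plan, makespan

plan, makespan = first_free_schedule([3, 2, 4, 1], n_procs=2)
print(makespan)  # 6
```

SDGS and full EvoMP differ from this baseline precisely in that the genetic core searches over assignments instead of committing greedily.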

  36. 16-Tap FIR Filter • Parameters: • 16-bit mode • Population size = 16 • Crossover rate = 8 • NoC connection width = 16 • Best fitness shows the number of clock cycles required to execute one iteration using the best chromosome found so far. • 74 instructions • 16 multiplications 36

  37. 8-Point DCT • Parameters: • 16-bit mode • Population size = 16 • Crossover rate = 8 • NoC connection width = 16 • 88 instructions • 32 multiplications 37

  38. 16-Point DCT • Parameters: • 16-bit mode • Population size = 16 • Crossover rate = 6 • NoC connection width = 16 • 320 instructions • 128 multiplications 38

  39. 5×5 Matrix Multiplication • Parameters: • 16-bit mode • Population size = 16 • Crossover rate = 6 • NoC connection width = 16 • 406 instructions • 125 multiplications 39

  40. 40

  41. 41

  42. Neural Network Case Study 42

  43. Fault Tolerance Results • When a fault is detected in a processor, the evolutionary core excludes that processor from contributing in subsequent iterations. • It also returns to the evolution state to find a suitable solution for the new situation. • The best obtained fitness in a 2×3 EvoMP running the 16-point DCT program is evaluated. • Faults are injected into processors 010, 001, and 101 at 1,000,000 µs, 2,000,000 µs, and 3,000,000 µs, respectively. 43

  44. Genetic vs. PSO • Population size in both experiments is 16 44

  45. Synthesis Results • Synthesis results on a Virtex-II (XC2V3000) FPGA using SynplifyPro. 45

  46. Introduction and Motivations Basics of Evolvable Multiprocessor System (EvoMP) EvoMP Operational View EvoMP Architectural View Simulation and Synthesis Results Summary 46

  47. Summary • EvoMP, a novel MPSoC system, was presented. • EvoMP exploits evolutionary strategies to perform run-time task decomposition and scheduling. • EvoMP does not require concurrent code because it parallelizes sequential code at run-time. • It exploits a particular and novel processor architecture to address the data dependency problem. • Experimental results confirm the applicability of EvoMP's novel ideas. 47

  48. Main References [1] N. S. Voros and K. Masselos, System Level Design of Reconfigurable Systems-on-Chip. Netherlands: Springer, 2005. [2] G. Martin, “Overview of the MPSoC design challenge,” Proc. Design and Automation Conf., July 2005, pp. 274-279. [3] S. Amarasinghe, “Multicore programming primer and programming competition,” class notes for 6.189, Computer Architecture Group, Massachusetts Institute of Technology, Available: www.cag.csail.mit.edu/ps3/lectures/6.189-lecture1-intro.pdf. [4] M. Hubner, K. Paulsson, and J. Becker, “Parallel and flexible multiprocessor system-on-chip for adaptive automotive applications based on Xilinx MicroBlaze soft-cores,” Proc. Intl. Symp. Parallel and Distributed Processing, 2005. [5] D. Gohringer, M. Hubner, V. Schatz, and J. Becker, “Runtime adaptive multi-processor system-on-chip: RAMPSoC,” Proc. Intl. Symp. Parallel and Distributed Processing, April 2008, pp. 1-7. [6] A. Klimm, L. Braun, and J. Becker, “An adaptive and scalable multiprocessor system for Xilinx FPGAs using minimal sized processor cores,” Proc. Symp. Parallel and Distributed Processing, April 2008, pp. 1-7. [7] Z. Y. Wen and Y. J. Gang, “A genetic algorithm for tasks scheduling in parallel multiprocessor systems,” Proc. Intl. Conf. Machine Learning and Cybernetics, Nov. 2003, pp. 1785-1790. [8] A. Farmahini-Farahani, S. Vakili, S. M. Fakhraie, S. Safari, and C. Lucas, “Parallel scalable hardware implementation of asynchronous discrete particle swarm optimization,” Elsevier J. of Engineering Applications of Artificial Intelligence, submitted for publication. 48

  49. Main References (2) [9] A. A. Jerraya and W. Wolf, Multiprocessor Systems-on-Chips. San Francisco: Morgan Kaufmann Publishers, 2005. [10] A. J. Page and T. J. Naughton, “Dynamic task scheduling using genetic algorithms for heterogeneous distributed computing,” Proc. Intl. Symp. Parallel and Distributed Processing, April 2005, pp. 189.1. [11] E. Carvalho, N. Calazans, and F. Moraes, “Heuristics for dynamic task mapping in NoC based heterogeneous MPSoCs,” Proc. Int. Rapid System Prototyping Workshop, pp. 34-40, 2007. [12] R. Canham and A. Tyrrell, “An embryonic array with improved efficiency and fault tolerance,” Proc. NASA/DoD Conf. on Evolvable Hardware, July 2003, pp. 265-272. [13] W. Barker, D. M. Halliday, Y. Thoma, E. Sanchez, G. Tempesti, and A. Tyrrell, “Fault tolerance using dynamic reconfiguration on the POEtic Tissue,” IEEE Trans. Evolutionary Computing, vol. 11, num. 5, Oct. 2007, pp. 666-684. 49

  50. Related Publications • Journal: • S. Vakili, S. M. Fakhraie, and S. Mohammadi, “EvoMP: a novel MPSoC architecture with evolvable task decomposition and scheduling,” Submitted to IET Comp. & Digital Tech. (under revision). • S. Vakili, S. M. Fakhraie, and S. Mohammadi, “Low-cost fault tolerance in evolvable multiprocessor system: a graceful degradation approach,” Submitted to Journal of Zhejiang University SCIENCE A (JZUS-A). • Conference: • S. Vakili, S. M. Fakhraie, and S. Mohammadi, “Designing an MPSoC architecture with run-time and evolvable task decomposition and scheduling,” Proc. 5th IEEE Intl. Conf. Innovations in Information Technology, Dec. 2008. • S. Vakili, S. M. Fakhraie, S. Mohammadi, and Ali Ahmadi, “Particle swarm optimization for run-time task decomposition and scheduling in evolvable MPSoC,” Proc. IEEE Intl. Conf. Computer Engineering and Technology, Jan. 2009. 50
