
Parallel Event Driven Simulation using GPU (CUDA)






Presentation Transcript


1. Cmpe 436 TERM PROJECT Parallel Event Driven Simulation using GPU (CUDA) M. Sancar Koyunlu & Ervin Domazet

2. LOGIC SIMULATION In Cycle Based Simulation*, the evaluation schedule of the gates in the design for each step of the simulation is determined once, at compilation time of the simulator. Event Based Simulation has a more complicated scheduling policy, where a gate is simulated only if at least one of its input values has changed. *Alper Şen, Barış Aksanlı, Murat Bozkurt, "Speeding Up Cycle Based Logic Simulation Using Graphic Processing Units", http://cseweb.ucsd.edu/~baksanli/ijpp11.pdf

3. What is CUDA? CUDA™ is a parallel computing platform and programming model invented by NVIDIA. It enables dramatic increases in computing performance by harnessing the power of the graphics processing unit (GPU). Computing languages and APIs: C, C++, CUDA x86, Fortran, OpenCL, DirectCompute. We started writing our code in C++, but C++ support is very limited right now, so we had to convert our entire code from C++ to C after finishing the sequential algorithm.

  4. Example Applications Identify hidden plaque in arteries: Heart attacks are the leading cause of death worldwide. Harvard Engineering, Harvard Medical School and Brigham & Women's Hospital have teamed up to use GPUs to simulate blood flow and identify hidden arterial plaque without invasive imaging techniques or exploratory surgery. Analyze air traffic flow: The National Airspace System manages the nationwide coordination of air traffic flow. Computer models help identify new ways to alleviate congestion and keep airplane traffic moving efficiently. Using the computational power of GPUs, a team at NASA obtained a large performance gain, reducing analysis time from ten minutes to three seconds.

  5. CUDA MOTIVATION

  6. The reason behind the discrepancy in floating-point capability between the CPU and the GPU is that the GPU is specialized for compute-intensive, highly parallel computation – exactly what graphics rendering is about – and therefore designed such that more transistors are devoted to data processing rather than data caching  and flow control, as schematically illustrated by Figure 1-2.

7. What are we using CUDA for? We set out to implement event-driven gate-level simulation on CUDA-capable GPUs.

8. CUDA BASICS A CUDA program consists of host code and device code. The host code runs on the CPU and launches a CUDA kernel with a configuration; the configuration determines the number of threads that will run the device code on the GPU.

9. CUDA BASICS How do we pass parameters to CUDA? Passing a single parameter by value is easy; it works like ordinary parameter passing to a function:

__global__ void VecAdd(float A, float B, float C)
{
    int i = threadIdx.x;
    // some calculations
}

int main()
{
    // Kernel invocation with N threads
    VecAdd<<<1, N>>>(A, B, C);
}

10. CUDA BASICS How do we pass parameters to CUDA? When passing pointers you have to be careful, because pointer values are addresses in your machine's memory, not on the GPU device. The cudaMalloc() function allocates memory on the CUDA device. You can pass struct arrays to the CUDA device, but any pointer inside them causes a problem. For example:

struct sampleStr {
    int *dsa;
};

You cannot initialize this struct on the CPU and send it to CUDA directly; a workaround is sketched below.
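A minimal sketch of one common workaround (the struct name comes from the slide, the rest is our own illustration): allocate the inner array on the device first, store the resulting device pointer inside the struct, and only then copy the struct over.

#include <cuda_runtime.h>

struct sampleStr {
    int *dsa;
};

int main(void)
{
    struct sampleStr host;
    // Allocate the inner array in device memory; host.dsa now holds
    // a device address, so a kernel can safely dereference it.
    cudaMalloc((void **)&host.dsa, sizeof(int) * 10);

    // Copy the struct itself to the device.
    struct sampleStr *dev;
    cudaMalloc((void **)&dev, sizeof(struct sampleStr));
    cudaMemcpy(dev, &host, sizeof(struct sampleStr), cudaMemcpyHostToDevice);
    return 0;
}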

11. How to send an array and fill it in CUDA?

__global__ void sample1(int *num)
{
    num[threadIdx.x] = threadIdx.x;
}

int *deviceInt;
cudaMalloc((void **)&deviceInt, sizeof(int) * 10);
int *hostInt = new int[10];
sample1<<<1, 10>>>(deviceInt);
cudaMemcpy(hostInt, deviceInt, sizeof(int) * 10, cudaMemcpyDeviceToHost);

12. Some important CUDA functions that we used __syncthreads(): synchronizes the threads within a block. On current GPUs, a thread block may contain up to 1024 threads. int atomicExch(int* address, int val): reads the 32-bit or 64-bit word old located at address address in global or shared memory and stores val back to memory at the same address. These two operations are performed in one atomic transaction; the function returns old. We used this function to implement a lock mechanism (test-and-set) in our project.
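A minimal sketch of such a test-and-set lock built on atomicExch (the names lock and unlock are our own):

__device__ void lock(int *mutex)
{
    // Spin until we swap a 0 (unlocked) out of the mutex.
    while (atomicExch(mutex, 1) != 0)
        ;
}

__device__ void unlock(int *mutex)
{
    atomicExch(mutex, 0);
}

Note that spinning like this among threads of the same warp can deadlock on some GPUs, so such locks have to be used with care.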

13. MORE ATOMIC TRANSACTIONS atomicAdd(), atomicSub(), atomicExch(), atomicMin(), atomicMax(), atomicInc(), atomicDec(), atomicCAS(), atomicAnd(), atomicOr(), atomicXor(). Details are in the CUDA C Programming Guide (Cuda_C_Programming_Guide.pdf).

14. THRUST LIBRARY In our device code we needed vectors (dynamic arrays). The C++ STL has a vector implementation, but it is missing in CUDA. At this stage we ran across the Thrust library (http://code.google.com/p/thrust/), which offers STL-like functionality for CUDA. It turned out not to be what we expected: it speeds up STL-style algorithms using CUDA rather than providing data structures that can be used inside device code. After that we had to mimic dynamic arrays using extra offset arrays, as sketched below.
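A minimal sketch of that offset-array trick (all names are our own): the per-gate fan-out lists are flattened into one array, and a second array records where each gate's list begins.

// outOffsets has numGates + 1 entries; the fan-out of gate g occupies
// outGates[outOffsets[g]] .. outGates[outOffsets[g + 1] - 1].
void forEachOutput(int g, const int *outOffsets, const int *outGates)
{
    for (int k = outOffsets[g]; k < outOffsets[g + 1]; ++k) {
        int target = outGates[k];
        // process the fan-out gate 'target' ...
        (void)target;
    }
}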

15. This is just a summary of what we have learned about CUDA. If you are interested, there are many documents, slides, webinars, etc. at http://developer.nvidia.com/nvidia-gpu-computing-documentation and http://developer.nvidia.com/category/zone/cuda-zone

  16. Sequential algorithm of Event Driven Simulation

17. Procedure: Get the logic circuit. Get the input list. Simulate the circuit with default values (0) at the input gates. Start the sequential algorithm. Show the output.

18. 1. Get the logic circuit We will get the overall logic circuit from a file with the following format:

<circuit>
  <gates>
    <gate>
      <name>AND1</name>
      <signal>0</signal>
      <delay>2</delay>
      <type>AND</type>
      <outGates>
        <name>OR1</name>
        <name>NOT1</name>
      </outGates>
      <inGates>
        <name>A</name>
        <name>B</name>
      </inGates>
    </gate>
    ...
    <gate>
      ...
    </gate>
  </gates>
</circuit>

The type of a gate can be: INPUT, AND, NAND, OR, NOR, XOR, XNOR, FLIPFLOP, NOT.

19. 2. Get the input list The input list will be taken from a file, which has the following format:

<inputs>
  <input>
    <name>B</name>
    <time>2</time>
    <value>1</value>
  </input>
  <input>
    <name>A</name>
    <time>3</time>
    <value>1</value>
  </input>
</inputs>

20. 3. Simulate the circuit with default values (0) at the input gates Our sequential algorithm assumes that the circuit with its current values is in a consistent state. In order to reach such a state, before calling the sequential algorithm we: assign boolean 0 to all gates; starting from the INPUT gates, iterate through all the affected gates recursively; at an affected gate X, find all its input gates, get their current output values, apply X's operation (according to its type) to those input values, and modify its output accordingly. The per-gate evaluation is sketched below.
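A minimal sketch of evaluating one gate from its input values, covering the gate types listed earlier (the function and type names are our own; INPUT and FLIPFLOP gates are handled separately by the simulator):

typedef enum { INPUT, AND, NAND, OR, NOR, XOR, XNOR, FLIPFLOP, NOT } GateType;

// in[0..n-1] are the current output values (0/1) of the fan-in gates.
int evaluate(GateType type, const int *in, int n)
{
    int acc = in[0];
    for (int i = 1; i < n; ++i) {
        switch (type) {
        case AND: case NAND: acc &= in[i]; break;
        case OR:  case NOR:  acc |= in[i]; break;
        case XOR: case XNOR: acc ^= in[i]; break;
        default: break;
        }
    }
    // Inverting types negate the accumulated value (NOT has n == 1).
    if (type == NAND || type == NOR || type == XNOR || type == NOT)
        acc = !acc;
    return acc;
}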

21. 4. Start the sequential algorithm The sequential algorithm uses a Future Event List (FEL) data structure, which helps us schedule events. The FutureEventList is basically an array of FutureEvent vectors. A future event holds: the index of the gate, the new value of the gate, and its time to change. The size of the Future Event List is found according to the following calculation: size = MaximumDelay / time_increment + 1, where time_increment = GCD(all delays). In addition to these data structures, the algorithm has a variable which keeps the current time. A sketch of the structures follows.
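A minimal sketch of those structures in C (all names are our own):

typedef struct {
    int gate;    // index of the gate to change
    int value;   // new output value of the gate
    int time;    // simulation time at which the change applies
} FutureEvent;

typedef struct {
    FutureEvent *events;  // growable vector of scheduled events
    int count;
    int capacity;
} EventVector;

// size = MaximumDelay / time_increment + 1, indexed by
// (time / time_increment) % size.
EventVector *futureEventList;
int current_time = 0;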

22. 4. Sequential algorithm procedure
1. If current_time == gate_change_time, get the input.
2. Find all the affected gates.
3. Iterate recursively over all of them and calculate their new output values.
4. If the new value != the old value, schedule an event in the (current_time + gate_delay)'th slot of the FEL.
5. Check whether there are events in FEL[current_time].
6. If so, get the next event, process it, and continue from point 2.
7. If not, print the current values and increment the current time.
8. If there has been no update in the FEL for one full cycle, the algorithm terminates.
9. If there is an input at the current time, continue from point 1.
10. If not, continue from point 5.
The driver loop below sketches this flow.
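A minimal sketch of the driver loop behind these points; quietForFullCycle, applyInputs, hasEvents, nextEvent, propagate and printCurrentValues are our own placeholder names, not the project's actual functions:

while (!quietForFullCycle) {             // point 8: termination test
    applyInputs(current_time);           // point 1: inputs due now
    while (hasEvents(futureEventList, current_time)) {            // point 5
        FutureEvent e = nextEvent(futureEventList, current_time); // point 6
        propagate(e);                    // points 2-4
    }
    printCurrentValues(current_time);    // point 7
    current_time += time_increment;
}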

23. 5. Show Output At every time increment the current values of the gates are printed. The output format is as follows:

Time: 3
  (A-1)
    (AND1-0)
      (OR1-0)
        (XOR1-1)
        (NOT2-1)
      (NOT1-1)
        (XOR1-1)
Time: 4
  (A-1)
    (AND1-0)
      (OR1-1)
        (XOR1-1)
        (NOT2-0)
      (NOT1-1)
        (XOR1-1)

  24. Parallel algorithm of Event Driven Simulation

25. Procedure: Get the logic circuit. Get the input list. Simulate the circuit with default values (0) at the input gates. Start the parallel algorithm in CUDA. Show the output.

26. Note: the only difference between the parallel and the sequential Event Driven Simulation is the fourth step of the procedure; the remaining steps are exactly the same as in the sequential part. We will now explain our solution for the parallel algorithm of Event Driven Simulation using CUDA.

27. Parallel algorithm in CUDA As we explained at the beginning, a single CUDA block may run at most 1024 threads. The algorithm starts with N+1 threads, where every single gate is assigned to a thread. Currently the algorithm assumes that the circuit has at most 1023 gates (including the inputs), since we are working in a single block.

28. Parallel algorithm in CUDA We pass the circuit from the host as a one-dimensional BaseGate array. The connections between gates are referenced as indexes rather than pointers. The CUDA kernel also takes some other variables, such as the number of gates, the number of input gates, and a bool array for debugging purposes. A possible interface is sketched below.
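A minimal sketch of such a kernel interface (the field and parameter names are our own assumptions; only the BaseGate name comes from the slide):

typedef struct {
    int type;                // gate type (AND, OR, ...)
    int value;               // current output value
    int delay;               // propagation delay
    int inStart,  inCount;   // index range into a flat fan-in array
    int outStart, outCount;  // index range into a flat fan-out array
} BaseGate;

__global__ void simulate(BaseGate *gates, int numGates, int numInputs,
                         bool *debug /* , FELs, locks, ... */);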

29. Parallel algorithm in CUDA The remaining thread acts as a "controlling thread", which makes the necessary changes to shared variables. Once again we have a future event list data structure: we initialize an array of N FELs, one unique to every gate. Each Future Event List contains a queue in which updates to the corresponding gate are scheduled.

30. Parallel algorithm in CUDA In addition to the queue, each FEL has an "update" flag, which is set to True if the corresponding gate has a scheduled update. There are two types of updates: updates to input gates, where the new value and the time to change are enqueued in the FEL's queue; and updates to the remaining gates, where only the time to change is enqueued.

31. Parallel algorithm in CUDA Moreover, every gate's FEL has a corresponding lock, since multiple threads may write to the same gate's FEL queue while processing an update. At every time unit we keep the current output values of the gates in a multi-dimensional boolean array of size (maxDelay + 1), so that no conflict occurs. The per-gate FEL is sketched below.
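A minimal sketch of one per-gate FEL entry as described on the last two slides (all names and the queue capacity are our own assumptions):

#define QUEUE_CAP 64             // assumed bound on pending updates per gate

typedef struct {
    int  times[QUEUE_CAP];       // scheduled change times
    int  values[QUEUE_CAP];      // new values (used for input gates only)
    int  head, tail;             // circular-queue indices
    bool update;                 // true if this gate has a pending update
    int  lock;                   // guarded with atomicExch (see slide 12)
} GateFEL;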

32. Parallel algorithm in CUDA In this manner, when we process an update, we subtract its delay from the current time and take the result modulo (maxDelay + 1). This gives us the values of all gates at time (current_time - delay).
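As a one-line sketch (variable names are ours):

// Snapshot ring of (maxDelay + 1) value arrays: read the gate outputs
// as they were (current_time - delay) time units ago.
bool past = values[(current_time - delay) % (maxDelay + 1)][gateIndex];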

33. Parallel algorithm in CUDA Besides this, we have a shared "change" variable, which is controlled by the controlling thread. At every time unit the controlling thread passes across all gates, and if at least one of them has an update, it sets change to 1. On the other hand, if there is no change in any of the gates, it sets change to 0 and the algorithm terminates. A fragment illustrating this check follows.
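A minimal sketch of that check from inside the kernel (fel, numGates and the thread layout are our own assumptions):

__shared__ int change;

if (threadIdx.x == numGates) {           // the extra controlling thread
    change = 0;
    for (int g = 0; g < numGates; ++g)
        if (fel[g].update) { change = 1; break; }
}
__syncthreads();                         // every thread sees the decision
if (change == 0) return;                 // no pending updates: terminate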

  34. Overall Logic

  35. Overall Logic Parallel algorithm in CUDA

36. Future Considerations: • Write a module which randomly creates big circuits. • Test the parallel algorithm on those circuits. • Make the necessary modifications in the parallel code so that we can make use of multiple blocks, removing the 1023-gate limit. • Memory optimizations.

37. Thank you for your attention! Cmpe 436 TERM PROJECT M. Sancar Koyunlu & Ervin Domazet
