
Parallelizing C Programs Using Cilk

Parallelizing C Programs Using Cilk. Mahdi Javadi. Cilk Language. Cilk is a language for multithreaded parallel programming based on C. The programmer does not need to worry about scheduling the computation to run efficiently. There are three additional keywords: cilk, spawn and sync.



Presentation Transcript


  1. Parallelizing C Programs Using Cilk Mahdi Javadi

  2. Cilk Language
     • Cilk is a language for multithreaded parallel programming based on C.
     • The programmer does not need to worry about scheduling the computation to run efficiently.
     • There are three additional keywords: cilk, spawn and sync.

  3. Example: Fibonacci

     Sequential C version:

         int fib (int n) {
           int x, y;
           if (n<2) return n;
           x = fib (n-1);
           y = fib (n-2);
           return x+y;
         }

     Cilk version:

         cilk int fib (int n) {
           int x, y;
           if (n<2) return n;
           x = spawn fib (n-1);
           y = spawn fib (n-2);
           sync;
           return x+y;
         }

  4. Performance Measures
     • TP = execution time on P processors.
     • T1 is called the work.
     • T∞ is called the span.
     • Obvious lower bounds: TP ≥ T1/P and TP ≥ T∞.
     • p = T1/T∞ is called the parallelism. Using more than p processors makes little sense.
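To make these measures concrete, the following is a minimal sketch in plain C that evaluates the work and span recurrences for the spawned fib of slide 3. It assumes that every call costs one unit of time, which is a simplification, and the names `fib_work` and `fib_span` are illustrative only; they are not part of Cilk.

```c
#include <stdint.h>

/* Work T1: total number of calls made by fib(n),
   assuming one unit of time per call (an assumption of this sketch). */
uint64_t fib_work(int n) {
    if (n < 2) return 1;
    return fib_work(n - 1) + fib_work(n - 2) + 1;
}

/* Span T∞: longest chain of dependent calls. The two spawned
   children can run in parallel, so only the longer one counts. */
uint64_t fib_span(int n) {
    if (n < 2) return 1;
    uint64_t a = fib_span(n - 1);
    uint64_t b = fib_span(n - 2);
    return (a > b ? a : b) + 1;
}
```

Under this cost model, `fib_work(30)` is 2692537 and `fib_span(30)` is 30, so the parallelism p = T1/T∞ is roughly 9 × 10^4.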

  5. Cilk Compiler
     • The file extension should be “.cilk”.
     • Example:
         > cilkc -O3 fib.cilk -o fib
     • To find the 30th Fibonacci number using 4 CPUs:
         > fib --nproc 4 30
     • To collect timings of each processor and compute the span (the instrumented program runs less efficiently):
         > cilkc -cilk-profile -cilk-span -O3 fib.cilk -o fib

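  6. Example: Matrix Multiplication
     • Suppose we want to multiply two n by n matrices, split into (n/2) by (n/2) blocks:

       $$\begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix} = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \cdot \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}$$

     • We can recursively formulate the problem:

       $$\begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix} = \begin{pmatrix} A_{11}B_{11} + A_{12}B_{21} & A_{11}B_{12} + A_{12}B_{22} \\ A_{21}B_{11} + A_{22}B_{21} & A_{21}B_{12} + A_{22}B_{22} \end{pmatrix}$$

     • i.e. one n by n matrix multiplication reduces to 8 multiplications and 4 additions of (n/2) by (n/2) submatrices.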

  7. Multiplication Procedure

         Mult(C, A, B, n)
           if (n = 1)
             C[1,1] = A[1,1].B[1,1]
           else {
             spawn Mult(C11,A11,B11,n/2);
             …
             spawn Mult(C22,A21,B12,n/2);
             spawn Mult(T11,A12,B21,n/2);
             …
             spawn Mult(T22,A22,B22,n/2);
             sync;
             Add(C,T,n);
           }

  8. Addition Procedure

         Add(C,T,n)
           if (n = 1)
             C[1,1] = C[1,1]+T[1,1];
           else {
             spawn Add(C11,T11,n/2);
             …
             spawn Add(C22,T22,n/2);
             sync;
           }

     • T1 (work) for addition = O(n^2).
     • T∞ (span) for addition = O(log(n)).
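The Mult and Add procedures can be sketched as ordinary sequential C, which is what remains when the spawn and sync keywords are dropped. This is only a sketch under stated assumptions: n is a power of two, matrices are stored row-major, and the stride parameters (`ldc`, `lda`, `ldb`, `ldt`) and the lowercase names `mult` and `add` are illustrative choices that do not appear on the slides.

```c
#include <stdlib.h>

/* C += T for n x n blocks stored row-major with row strides ldc and ldt. */
void add(double *C, int ldc, const double *T, int ldt, int n) {
    if (n == 1) { C[0] += T[0]; return; }
    int h = n / 2;
    /* In Cilk these four calls would be spawned, followed by a sync. */
    add(C,             ldc, T,             ldt, h);
    add(C + h,         ldc, T + h,         ldt, h);
    add(C + h*ldc,     ldc, T + h*ldt,     ldt, h);
    add(C + h*ldc + h, ldc, T + h*ldt + h, ldt, h);
}

/* C = A * B for n x n blocks; n is assumed to be a power of two. */
void mult(double *C, int ldc, const double *A, int lda,
          const double *B, int ldb, int n) {
    if (n == 1) { C[0] = A[0] * B[0]; return; }
    int h = n / 2;
    /* Temporary n x n matrix T holding the second product of each sum. */
    double *T = malloc((size_t)n * n * sizeof *T);
    if (!T) abort();
    const double *A11 = A, *A12 = A + h, *A21 = A + h*lda, *A22 = A + h*lda + h;
    const double *B11 = B, *B12 = B + h, *B21 = B + h*ldb, *B22 = B + h*ldb + h;
    double *C11 = C, *C12 = C + h, *C21 = C + h*ldc, *C22 = C + h*ldc + h;
    double *T11 = T, *T12 = T + h, *T21 = T + h*n,   *T22 = T + h*n + h;
    /* The eight half-size products; in Cilk each would be spawned. */
    mult(C11, ldc, A11, lda, B11, ldb, h);
    mult(C12, ldc, A11, lda, B12, ldb, h);
    mult(C21, ldc, A21, lda, B11, ldb, h);
    mult(C22, ldc, A21, lda, B12, ldb, h);
    mult(T11, n, A12, lda, B21, ldb, h);
    mult(T12, n, A12, lda, B22, ldb, h);
    mult(T21, n, A22, lda, B21, ldb, h);
    mult(T22, n, A22, lda, B22, ldb, h);
    /* A sync would go here before combining the two halves of each sum. */
    add(C, ldc, T, n, n);
    free(T);
}
```

Calling `mult(C, n, A, n, B, n, n)` on contiguous n by n arrays computes the full product; passing strides rather than copying lets each quadrant be addressed in place.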

  9. Complexity of Multiplication
     • We know that matrix multiplication is O(n^3), hence T1 (work) for multiplication = O(n^3).
     • T∞ (span): M∞(n) = M∞(n/2) + O(log(n)) = O(log^2(n)).
     • p = T1/T∞ = O(n^3 / log^2(n)).
     • To multiply two 1000 by 1000 matrices: p ≈ 10^7 (a lot of CPUs!)
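The span recurrence and the 1000 by 1000 estimate can be worked out by unrolling the recursion. The derivation below is a sketch that ignores the constants hidden in the O-notation, takes logarithms base 2, and uses the `align*` environment from amsmath.

```latex
\begin{align*}
M_\infty(n) &= M_\infty(n/2) + O(\log n) \\
            &= M_\infty(1) + \sum_{k=0}^{\log_2 n - 1} O\!\left(\log \frac{n}{2^k}\right) \\
            &= O\!\left(\sum_{j=1}^{\log_2 n} j\right) = O(\log^2 n), \\[4pt]
p = \frac{T_1}{T_\infty} &\approx \frac{1000^3}{(\log_2 1000)^2}
  \approx \frac{10^9}{10^2} = 10^7 .
\end{align*}
```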

  10. Discrete Fourier Transform

      Sequential version:

          DFT(n,w,p,…)
            ...
            t = w^2 mod p
            DFT(n/2,t,p,…);
            DFT(n/2,t,p,…);
            …
            w1 = 1;
            for (i = 0; i < n/2; i++) {
              …
              a[i] = …
              w1 = w1.w mod p;
            }

      Cilk version:

          cilk DFT(n,w,p,…)
            ...
            t = w^2 mod p
            spawn DFT(n/2,t,p,…);
            spawn DFT(n/2,t,p,…);
            sync;
            …
            spawn ParCom(n,a,p,1,…);

          cilk ParCom(n,a,p,m,…)
            if (n <= 512) …
            spawn ParCom(n/2,a,p,1,…);
            m’ = m . w^(n/2) mod p;
            spawn ParCom(n/2,a+n/2,p,m’,…);
            sync;
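Since the slide elides most of the DFT body, here is one way the sequential version could be filled in, as a hedged sketch rather than the presenter's actual code. It assumes a standard radix-2 split into even and odd coefficients, n a power of two, w a primitive n-th root of unity mod p, inputs already reduced mod p, and p below 2^32 so that products fit in 64 bits. The name `dft`, its argument order, and the temporary buffer are illustrative.

```c
#include <stdint.h>
#include <stdlib.h>

/* In-place DFT of a[0..n-1] over Z_p (assumptions listed in the lead-in). */
void dft(uint64_t *a, size_t n, uint64_t w, uint64_t p) {
    if (n == 1) return;
    size_t h = n / 2;
    uint64_t *tmp = malloc(n * sizeof *tmp);
    if (!tmp) abort();
    /* Split into even-indexed and odd-indexed coefficients. */
    for (size_t i = 0; i < h; i++) {
        tmp[i]     = a[2 * i];
        tmp[h + i] = a[2 * i + 1];
    }
    uint64_t t = w * w % p;      /* t = w^2 mod p */
    dft(tmp,     h, t, p);       /* would be spawned in the Cilk version */
    dft(tmp + h, h, t, p);       /* would be spawned, then a sync */
    /* Sequential combining step: n/2 multiplications by powers of w. */
    uint64_t w1 = 1;
    for (size_t i = 0; i < h; i++) {
        uint64_t u = tmp[i];
        uint64_t v = w1 * tmp[h + i] % p;
        a[i]     = (u + v) % p;
        a[i + h] = (u + p - v) % p;
        w1 = w1 * w % p;
    }
    free(tmp);
}
```

For example, with p = 17 and w = 4 (a primitive 4th root of unity mod 17), transforming {1, 2, 3, 4} yields {10, 7, 15, 6}. The combining loop is the part that ParCom splits into two parallel halves, with the second half starting from the multiplier w^(n/2).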

  11. Complexity of ParCom
      • The sequential combining step does n/2 multiplications.
      • T∞ (span) for ParCom: T∞(n) = T∞(n/2) + O(log(n)), so T∞(n) = O(log^2(n)).
      • p = O(n / log^2(n)).
      • We run the FFT on “stan”, which has 4 CPUs.
      • Thus parallelism beyond 4 does not help, so we cut off the parallelism at some level of recursion to speed up the program.

  12. Timings
      • Sequential FFT: 123789 ms
