
Design and Implementation of the CCC Parallel Programming Language



1. Design and Implementation of the CCC Parallel Programming Language
Nai-Wei Lin
Department of Computer Science and Information Engineering
National Chung Cheng University

2. Outline
• Introduction
• The CCC programming language
• The CCC compiler
• Performance evaluation
• Conclusions

3. Motivations
• Parallelism is the future trend
• Programming in parallel is much more difficult than programming in serial
• Parallel architectures are very diverse
• Parallel programming models are very diverse

4. Motivations
• Design a parallel programming language that uniformly integrates various parallel programming models
• Implement a retargetable compiler for this parallel programming language on various parallel architectures

5. Approaches to Parallelism
• Library approach
  – MPI (Message Passing Interface), pthread
• Compiler approach
  – HPF (High Performance Fortran), HPC++
• Language approach
  – Occam, Linda, CCC (Chung Cheng C)

6. Models of Parallel Architectures
• Control Model
  – SIMD: Single Instruction Multiple Data
  – MIMD: Multiple Instruction Multiple Data
• Data Model
  – Shared memory
  – Distributed memory

7. Models of Parallel Programming
• Concurrency
  – Control parallelism: simultaneously execute multiple threads of control
  – Data parallelism: simultaneously execute the same operations on multiple data
• Synchronization and communication
  – Shared variables
  – Message passing

8. Granularity of Parallelism
• Procedure-level parallelism
  – Concurrent execution of procedures on multiple processors
• Loop-level parallelism
  – Concurrent execution of iterations of loops on multiple processors
• Instruction-level parallelism
  – Concurrent execution of instructions on a single processor with multiple functional units

9. The CCC Programming Language
• CCC is a simple extension of C and supports both control and data parallelism
• A CCC program consists of a set of concurrent and cooperative tasks
• Control parallelism runs in MIMD mode and communicates via shared variables and/or message passing
• Data parallelism runs in SIMD mode and communicates via shared variables

10. Tasks in CCC Programs
[diagram: control-parallel tasks and data-parallel tasks in a CCC program]

11. Control Parallelism
• Concurrency
  – task
  – par and parfor
• Synchronization and communication
  – Shared variables: monitors
  – Message passing: channels

12. Monitors
• The monitor construct is a modular and efficient construct for synchronizing access to shared variables among concurrent tasks
• It provides data abstraction, mutual exclusion, and conditional synchronization
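For reference, the following C sketch shows the discipline a monitor automates when hand-coded with pthreads: every exported operation runs under a single mutex, and conditional synchronization becomes a wait loop on a condition variable. The counter example is purely illustrative; it is not the CCC compiler's generated code.

    /* Illustrative monitor pattern in C with pthreads: a counter whose
     * decrement blocks until the value is positive. */
    #include <pthread.h>

    typedef struct {
        pthread_mutex_t lock;     /* mutual exclusion for every entry */
        pthread_cond_t  nonzero;  /* condition: value > 0 */
        int             value;
    } counter_monitor;

    void counter_init(counter_monitor *m) {
        pthread_mutex_init(&m->lock, NULL);
        pthread_cond_init(&m->nonzero, NULL);
        m->value = 0;
    }

    /* Each "monitor procedure" acquires the lock on entry and releases
     * it on exit; wait() releases and reacquires it implicitly. */
    void counter_increment(counter_monitor *m) {
        pthread_mutex_lock(&m->lock);
        m->value += 1;
        pthread_cond_signal(&m->nonzero);
        pthread_mutex_unlock(&m->lock);
    }

    void counter_decrement(counter_monitor *m) {
        pthread_mutex_lock(&m->lock);
        while (m->value == 0)                 /* conditional synchronization */
            pthread_cond_wait(&m->nonzero, &m->lock);
        m->value -= 1;
        pthread_mutex_unlock(&m->lock);
    }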

13. An Example - Barber Shop
[diagram: a barber, a barber chair, and waiting customers]

14. An Example - Barber Shop

task::main( )
{
    monitor Barber_Shop bs;
    int i;

    par {
        barber( bs );
        parfor (i = 0; i < 10; i++)
            customer( bs );
    }
}

15. An Example - Barber Shop

task::barber(monitor Barber_Shop in bs)
{
    while (1) {
        bs.get_next_customer( );
        bs.finished_cut( );
    }
}

task::customer(monitor Barber_Shop in bs)
{
    bs.get_haircut( );
}

16. An Example - Barber Shop

monitor Barber_Shop {
    int barber, chair, open;
    cond barber_available, chair_occupied;
    cond door_open, customer_left;

    Barber_Shop( );
    void get_haircut( );
    void get_next_customer( );
    void finished_cut( );
};

17. An Example - Barber Shop

Barber_Shop( )
{
    barber = 0; chair = 0; open = 0;
}

void get_haircut( )
{
    while (barber == 0) wait(barber_available);
    barber -= 1;
    chair += 1;
    signal(chair_occupied);
    while (open == 0) wait(door_open);
    open -= 1;
    signal(customer_left);
}

18. An Example - Barber Shop

void get_next_customer( )
{
    barber += 1;
    signal(barber_available);
    while (chair == 0) wait(chair_occupied);
    chair -= 1;
}

void finished_cut( )
{
    open += 1;
    signal(door_open);
    while (open > 0) wait(customer_left);
}

19. Channels
• The channel construct is a modular and efficient construct for message passing among concurrent tasks
• Pipe: one to one
• Merger: many to one
• Spliter: one to many
• Multiplexer: many to many

20. Channels
• Communication structures among parallel tasks are more comprehensive
• The specification of communication structures is easier
• The implementation of communication structures is more efficient
• The static analysis of communication structures is more effective
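As a sketch of what the one-to-many spliter used in the next example abstracts, here is a bounded FIFO in C with pthreads from which competing consumers take items. The names, the fixed capacity, and the int payload are assumptions for illustration, not the CCC runtime's actual channel implementation.

    /* Illustrative one-to-many channel: one producer put()s, any number
     * of consumers compete in get(). */
    #include <pthread.h>

    #define CAP 64

    typedef struct {
        int buf[CAP];
        int head, tail, count;
        pthread_mutex_t lock;
        pthread_cond_t  not_empty, not_full;
    } spliter_chan;

    void chan_init(spliter_chan *c) {
        c->head = c->tail = c->count = 0;
        pthread_mutex_init(&c->lock, NULL);
        pthread_cond_init(&c->not_empty, NULL);
        pthread_cond_init(&c->not_full, NULL);
    }

    void chan_put(spliter_chan *c, int v) {        /* producer side */
        pthread_mutex_lock(&c->lock);
        while (c->count == CAP)
            pthread_cond_wait(&c->not_full, &c->lock);
        c->buf[c->tail] = v;
        c->tail = (c->tail + 1) % CAP;
        c->count++;
        pthread_cond_signal(&c->not_empty);
        pthread_mutex_unlock(&c->lock);
    }

    int chan_get(spliter_chan *c) {                /* any consumer */
        int v;
        pthread_mutex_lock(&c->lock);
        while (c->count == 0)
            pthread_cond_wait(&c->not_empty, &c->lock);
        v = c->buf[c->head];
        c->head = (c->head + 1) % CAP;
        c->count--;
        pthread_cond_signal(&c->not_full);
        pthread_mutex_unlock(&c->lock);
        return v;
    }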

21. An Example - Consumer-Producer
[diagram: one producer feeding multiple consumers through a spliter channel]

22. An Example - Consumer-Producer

task::main( )
{
    spliter int chan;
    int i;

    par {
        producer( chan );
        parfor (i = 0; i < 10; i++)
            consumer( chan );
    }
}

23. An Example - Consumer-Producer

task::producer(spliter in int chan)
{
    int i;

    for (i = 0; i < 100; i++)
        put(chan, i);
    for (i = 0; i < 10; i++)
        put(chan, END);
}

24. An Example - Consumer-Producer

task::consumer(spliter in int chan)
{
    int data;

    while ((data = get(chan)) != END)
        process(data);
}

25. Data Parallelism
• Concurrency
  – domain: an aggregate of synchronous tasks
• Synchronization and communication
  – domain variables in global name space

26. An Example – Matrix Multiplication
[diagram: matrix multiplication, A × B = C]

27. An Example – Matrix Multiplication

domain matrix_op[16] {
    int a[16], b[16], c[16];
    multiply(distribute in  int [16:block][16],
             distribute in  int [16][16:block],
             distribute out int [16:block][16]);
};

28. An Example – Matrix Multiplication

task::main( )
{
    int A[16][16], B[16][16], C[16][16];
    domain matrix_op m;

    read_array(A);
    read_array(B);
    m.multiply(A, B, C);
    print_array(C);
}

29. An Example – Matrix Multiplication

matrix_op::multiply(A, B, C)
    distribute in  int [16:block][16] A;
    distribute in  int [16][16:block] B;
    distribute out int [16:block][16] C;
{
    int i, j;

    a := A;
    b := B;
    for (i = 0; i < 16; i++)
        for (c[i] = 0, j = 0; j < 16; j++)
            c[i] += a[j] * matrix_op[i].b[j];
    C := c;
}
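To make the distributions concrete, here is a plain sequential C reading of what a single domain member computes under the [16:block] layouts above: member k holds row k of A and column k of B, and the remote reference matrix_op[i].b[j] denotes B[j][i]. The function name and shape are assumptions for illustration.

    #define N 16

    /* What domain member k of matrix_op computes, unrolled into
     * sequential C. a[] is member k's block (row k of A); the read
     * of matrix_op[i].b[j] corresponds to B[j][i], member i's column. */
    void member_multiply(int k, int A[N][N], int B[N][N], int C[N][N]) {
        int a[N], i, j;

        for (j = 0; j < N; j++)
            a[j] = A[k][j];                  /* [16:block] row distribution */

        for (i = 0; i < N; i++) {
            C[k][i] = 0;                     /* member k produces row k of C */
            for (j = 0; j < N; j++)
                C[k][i] += a[j] * B[j][i];   /* matrix_op[i].b[j] */
        }
    }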

30. Platforms for the CCC Compiler
• PCs and SMPs
  – Pthread: shared memory + dynamic thread creation
• PC clusters and SMP clusters
  – Millipede: distributed shared memory + dynamic remote thread creation
• The similarities between these two classes of machines enable a retargetable compiler implementation for CCC

31. Organization of the CCC Programming System
[layered diagram, top to bottom]
    CCC applications
    CCC compiler
    CCC runtime library
    Virtual shared memory machine interface
    Pthread / Millipede
    SMP / SMP cluster

32. The CCC Compiler
• Tasks → threads
• Monitors → mutex locks, read-write locks, and condition variables
• Channels → mutex locks and condition variables
• Domains → sets of synchronous threads
• Synchronous execution → barriers
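Since a domain becomes a set of synchronous threads and synchronous execution becomes barriers, a counting barrier built from a mutex and a condition variable is the standard realization. The sketch below is generic, not the CCC compiler's actual output; the generation counter keeps a fast thread from racing into the next phase while slower threads are still waiting.

    #include <pthread.h>

    typedef struct {
        pthread_mutex_t lock;
        pthread_cond_t  all_arrived;
        int      parties;     /* threads that must arrive each phase */
        int      waiting;     /* arrivals so far in this phase */
        unsigned generation;  /* phase number, guards against early wakeup */
    } sync_barrier;

    void barrier_init(sync_barrier *b, int parties) {
        pthread_mutex_init(&b->lock, NULL);
        pthread_cond_init(&b->all_arrived, NULL);
        b->parties = parties;
        b->waiting = 0;
        b->generation = 0;
    }

    void barrier_wait(sync_barrier *b) {
        unsigned gen;
        pthread_mutex_lock(&b->lock);
        gen = b->generation;
        if (++b->waiting == b->parties) {     /* last thread to arrive */
            b->waiting = 0;
            b->generation++;
            pthread_cond_broadcast(&b->all_arrived);
        } else {
            while (gen == b->generation)      /* sleep out this phase */
                pthread_cond_wait(&b->all_arrived, &b->lock);
        }
        pthread_mutex_unlock(&b->lock);
    }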

33. Virtual Shared Memory Machine Interface
• Processor management
• Thread management
• Shared memory allocation
• Mutex locks
• Read-write locks
• Condition variables
• Barriers
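As a purely hypothetical illustration, such an interface might be declared as a C header along these lines; every identifier below is an assumption, since the slides list only the categories of services.

    /* Hypothetical header for the virtual shared memory machine
     * interface; both the Pthread and Millipede back ends would
     * implement these entry points. */

    /* processor management */
    int vsm_num_processors(void);

    /* thread management */
    typedef struct vsm_thread vsm_thread;
    vsm_thread *vsm_thread_create(void (*fn)(void *), void *arg);
    void        vsm_thread_join(vsm_thread *t);

    /* shared memory allocation */
    void *vsm_shared_alloc(unsigned long nbytes);
    void  vsm_shared_free(void *p);

    /* mutex locks, read-write locks, condition variables, barriers */
    typedef struct vsm_mutex   vsm_mutex;
    typedef struct vsm_rwlock  vsm_rwlock;
    typedef struct vsm_cond    vsm_cond;
    typedef struct vsm_barrier vsm_barrier;

    void vsm_mutex_lock(vsm_mutex *m);
    void vsm_mutex_unlock(vsm_mutex *m);
    void vsm_rwlock_rdlock(vsm_rwlock *l);
    void vsm_rwlock_wrlock(vsm_rwlock *l);
    void vsm_rwlock_unlock(vsm_rwlock *l);
    void vsm_cond_wait(vsm_cond *c, vsm_mutex *m);
    void vsm_cond_signal(vsm_cond *c);
    void vsm_barrier_wait(vsm_barrier *b);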

34. The CCC Runtime Library
• The CCC runtime library contains a collection of functions that implements the salient abstractions of CCC on top of the virtual shared memory machine interface

35. Performance Evaluation
• SMPs
  – Hardware: an SMP machine with four CPUs; each CPU is an Intel Pentium II Xeon 450 MHz with 512 KB cache
  – Software: OS is Solaris 5.7; thread library is pthread 1.26
• SMP clusters
  – Hardware: four SMP machines, each with two CPUs; each CPU is an Intel Pentium III 500 MHz with 512 KB cache
  – Software: OS is Windows 2000; library is Millipede 4.0
  – Network: Fast Ethernet, 100 Mbps

36. Benchmarks
• Matrix multiplication (1024 × 1024)
• Warshall's transitive closure (1024 × 1024)
• Airshed simulation (5)

37. Matrix Multiplication (SMPs)
Times in seconds; each entry shows time (speedup, efficiency). Sequential time: 287.5.

Configuration   | 1 thread/CPU        | 2 threads/CPU       | 4 threads/CPU       | 8 threads/CPU
CCC (1 CPU)     | 295.05 (0.97, 0.97) | 264.24 (1.08, 1.08) | 250.45 (1.14, 1.14) | 275.32 (1.04, 1.04)
Pthread (1 CPU) | 292.42 (0.98, 0.98) | 257.45 (1.12, 1.12) | 244.24 (1.17, 1.17) | 266.20 (1.08, 1.08)
CCC (2 CPU)     | 152.29 (1.89, 0.94) | 110.54 (2.60, 1.30) | 98.32 (2.93, 1.46)  | 124.44 (2.31, 1.16)
Pthread (2 CPU) | 149.88 (1.91, 0.96) | 105.45 (2.72, 1.36) | 93.56 (3.07, 1.53)  | 119.42 (2.41, 1.20)
CCC (4 CPU)     | 76.39 (3.76, 0.94)  | 69.44 (4.14, 1.03)  | 64.44 (4.46, 1.11)  | 73.54 (3.90, 0.98)
Pthread (4 CPU) | 74.72 (3.85, 0.96)  | 65.42 (4.39, 1.09)  | 59.44 (4.83, 1.20)  | 69.88 (4.11, 1.02)

38. Matrix Multiplication (SMP clusters)
Times in seconds; each entry shows time (speedup, efficiency). Sequential time: 470.44.

Configuration              | 1 thread/CPU        | 2 threads/CPU       | 4 threads/CPU       | 8 threads/CPU
CCC (1 mach × 2 CPU)       | 253.12 (1.85, 0.93) | 201.23 (2.33, 1.16) | 158.31 (2.97, 1.48) | 234.46 (2.00, 1.00)
Millipede (1 mach × 2 CPU) | 248.11 (1.89, 0.95) | 196.33 (2.39, 1.19) | 154.22 (3.05, 1.53) | 224.95 (2.09, 1.05)
CCC (2 mach × 2 CPU)       | 136.34 (3.45, 0.86) | 102.25 (4.60, 1.15) | 96.25 (4.89, 1.22)  | 148.25 (3.17, 0.79)
Millipede (2 mach × 2 CPU) | 129.33 (3.63, 0.91) | 96.52 (4.87, 1.22)  | 91.45 (5.14, 1.27)  | 142.45 (3.31, 0.82)
CCC (4 mach × 2 CPU)       | 87.25 (5.39, 0.67)  | 62.33 (7.54, 0.94)  | 80.25 (5.45, 0.73)  | 102.45 (4.67, 0.58)
Millipede (4 mach × 2 CPU) | 78.37 (6.00, 0.75)  | 54.92 (8.56, 1.07)  | 75.98 (5.57, 0.75)  | 95.44 (4.87, 0.61)

39. Warshall's Transitive Closure (SMPs)
Times in seconds; each entry shows time (speedup, efficiency). Sequential time: 150.32.

Configuration   | 1 thread/CPU        | 2 threads/CPU       | 4 threads/CPU       | 8 threads/CPU
CCC (1 CPU)     | 152.88 (0.98, 0.98) | 138.44 (1.08, 1.08) | 143.54 (1.05, 1.05) | 154.33 (0.97, 0.97)
Pthread (1 CPU) | 151.25 (0.99, 0.99) | 135.45 (1.11, 1.11) | 139.21 (1.07, 1.07) | 152.44 (0.99, 0.99)
CCC (2 CPU)     | 83.36 (1.80, 0.90)  | 69.45 (2.16, 1.08)  | 78.54 (1.91, 0.96)  | 98.24 (1.53, 0.77)
Pthread (2 CPU) | 79.32 (1.90, 0.95)  | 66.85 (2.25, 1.12)  | 74.24 (2.02, 1.01)  | 93.44 (1.60, 0.80)
CCC (4 CPU)     | 49.43 (3.04, 0.76)  | 43.19 (3.48, 0.87)  | 58.44 (2.57, 0.64)  | 77.42 (1.94, 0.49)
Pthread (4 CPU) | 44.14 (3.40, 0.85)  | 40.89 (3.68, 0.91)  | 55.23 (2.72, 0.68)  | 74.21 (2.02, 0.51)

40. Warshall's Transitive Closure (SMP clusters)
Times in seconds; each entry shows time (speedup, efficiency). Sequential time: 305.35.

Configuration              | 1 thread/CPU        | 2 threads/CPU       | 4 threads/CPU       | 8 threads/CPU
CCC (1 mach × 2 CPU)       | 159.24 (1.91, 0.96) | 132.81 (2.29, 1.14) | 102.19 (2.98, 1.49) | 153.90 (1.98, 0.99)
Millipede (1 mach × 2 CPU) | 155.34 (1.96, 0.98) | 125.91 (2.42, 1.21) | 95.29 (3.20, 1.59)  | 144.53 (2.11, 1.06)
CCC (2 mach × 2 CPU)       | 100.03 (3.05, 0.76) | 82.40 (3.70, 0.92)  | 148.97 (2.04, 0.52) | 202.78 (1.50, 0.38)
Millipede (2 mach × 2 CPU) | 88.45 (3.45, 0.86)  | 75.91 (4.02, 1.00)  | 140.28 (2.17, 0.54) | 189.38 (1.61, 0.41)
CCC (4 mach × 2 CPU)       | 60.06 (5.08, 0.64)  | 54.56 (5.59, 0.70)  | 89.68 (3.40, 0.43)  | 138.76 (2.20, 0.27)
Millipede (4 mach × 2 CPU) | 54.05 (5.65, 0.71)  | 47.53 (6.42, 0.80)  | 81.28 (3.75, 0.46)  | 129.96 (2.36, 0.30)

41. Airshed simulation (SMPs)
[chart: execution time (sec) versus number of threads]

42. Airshed simulation (SMP clusters)
[chart: execution time (sec) versus number of threads]

43. Conclusions
• A high-level parallel programming language that uniformly integrates
  – both control and data parallelism
  – both shared variables and message passing
• A modular parallel programming language
• A retargetable compiler
