
Design and Implementation of the CCC Parallel Programming Language



1. Design and Implementation of the CCC Parallel Programming Language
Nai-Wei Lin
Department of Computer Science and Information Engineering
National Chung Cheng University

2. Outline
• Introduction
• The CCC programming language
• The CCC compiler
• Performance evaluation
• Conclusions

3. Motivations
• Parallelism is the future trend
• Programming in parallel is much more difficult than programming in serial
• Parallel architectures are very diverse
• Parallel programming models are very diverse

4. Motivations
• Design a parallel programming language that uniformly integrates various parallel programming models
• Implement a retargetable compiler for this parallel programming language on various parallel architectures

5. Approaches to Parallelism
• Library approach
  – MPI (Message Passing Interface), pthread
• Compiler approach
  – HPF (High Performance Fortran), HPC++
• Language approach
  – Occam, Linda, CCC (Chung Cheng C)

6. Models of Parallel Architectures
• Control Model
  – SIMD: Single Instruction Multiple Data
  – MIMD: Multiple Instruction Multiple Data
• Data Model
  – Shared memory
  – Distributed memory

7. Models of Parallel Programming
• Concurrency
  – Control parallelism: simultaneously execute multiple threads of control
  – Data parallelism: simultaneously execute the same operations on multiple data
• Synchronization and communication
  – Shared variables
  – Message passing

8. Granularity of Parallelism
• Procedure-level parallelism
  – Concurrent execution of procedures on multiple processors
• Loop-level parallelism
  – Concurrent execution of iterations of loops on multiple processors
• Instruction-level parallelism
  – Concurrent execution of instructions on a single processor with multiple functional units

9. The CCC Programming Language
• CCC is a simple extension of C and supports both control and data parallelism
• A CCC program consists of a set of concurrent and cooperative tasks
• Control parallelism runs in MIMD mode and communicates via shared variables and/or message passing
• Data parallelism runs in SIMD mode and communicates via shared variables

10. Tasks in CCC Programs
[diagram: control-parallel tasks and data-parallel tasks in a CCC program]

11. Control Parallelism
• Concurrency
  – task
  – par and parfor
• Synchronization and communication
  – Shared variables: monitors
  – Message passing: channels

12. Monitors
• The monitor construct is a modular and efficient construct for synchronizing access to shared variables among concurrent tasks
• It provides data abstraction, mutual exclusion, and conditional synchronization
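For reference, the following C sketch shows the discipline a monitor automates when hand-coded with pthreads: every exported operation runs under a single mutex, and conditional synchronization becomes a wait loop on a condition variable. The counter example is purely illustrative; it is not the CCC compiler's generated code.

    /* Illustrative monitor pattern in C with pthreads: a counter whose
     * decrement blocks until the value is positive. */
    #include <pthread.h>

    typedef struct {
        pthread_mutex_t lock;     /* mutual exclusion for every entry */
        pthread_cond_t  nonzero;  /* condition: value > 0 */
        int             value;
    } counter_monitor;

    void counter_init(counter_monitor *m) {
        pthread_mutex_init(&m->lock, NULL);
        pthread_cond_init(&m->nonzero, NULL);
        m->value = 0;
    }

    /* Each "monitor procedure" acquires the lock on entry and releases
     * it on exit; wait() releases and reacquires it implicitly. */
    void counter_increment(counter_monitor *m) {
        pthread_mutex_lock(&m->lock);
        m->value += 1;
        pthread_cond_signal(&m->nonzero);
        pthread_mutex_unlock(&m->lock);
    }

    void counter_decrement(counter_monitor *m) {
        pthread_mutex_lock(&m->lock);
        while (m->value == 0)                 /* conditional synchronization */
            pthread_cond_wait(&m->nonzero, &m->lock);
        m->value -= 1;
        pthread_mutex_unlock(&m->lock);
    }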

13. An Example - Barber Shop
[diagram: a barber, a barber chair, and waiting customers]

14. An Example - Barber Shop

task::main( )
{
    monitor Barber_Shop bs;
    int i;

    par {
        barber( bs );
        parfor (i = 0; i < 10; i++)
            customer( bs );
    }
}

15. An Example - Barber Shop

task::barber(monitor Barber_Shop in bs)
{
    while (1) {
        bs.get_next_customer( );
        bs.finished_cut( );
    }
}

task::customer(monitor Barber_Shop in bs)
{
    bs.get_haircut( );
}

16. An Example - Barber Shop

monitor Barber_Shop {
    int barber, chair, open;
    cond barber_available, chair_occupied;
    cond door_open, customer_left;

    Barber_Shop( );
    void get_haircut( );
    void get_next_customer( );
    void finished_cut( );
};

17. An Example - Barber Shop

Barber_Shop( )
{
    barber = 0; chair = 0; open = 0;
}

void get_haircut( )
{
    while (barber == 0) wait(barber_available);
    barber -= 1;
    chair += 1;
    signal(chair_occupied);
    while (open == 0) wait(door_open);
    open -= 1;
    signal(customer_left);
}

18. An Example - Barber Shop

void get_next_customer( )
{
    barber += 1;
    signal(barber_available);
    while (chair == 0) wait(chair_occupied);
    chair -= 1;
}

void finished_cut( )
{
    open += 1;
    signal(door_open);
    while (open > 0) wait(customer_left);
}

19. Channels
• The channel construct is a modular and efficient construct for message passing among concurrent tasks
• Pipe: one to one
• Merger: many to one
• Spliter: one to many
• Multiplexer: many to many

20. Channels
• Communication structures among parallel tasks are more comprehensive
• The specification of communication structures is easier
• The implementation of communication structures is more efficient
• The static analysis of communication structures is more effective
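As a sketch of what the one-to-many spliter used in the next example abstracts, here is a bounded FIFO in C with pthreads from which competing consumers take items. The names, the fixed capacity, and the int payload are assumptions for illustration, not the CCC runtime's actual channel implementation.

    /* Illustrative one-to-many channel: one producer put()s, any number
     * of consumers compete in get(). */
    #include <pthread.h>

    #define CAP 64

    typedef struct {
        int buf[CAP];
        int head, tail, count;
        pthread_mutex_t lock;
        pthread_cond_t  not_empty, not_full;
    } spliter_chan;

    void chan_init(spliter_chan *c) {
        c->head = c->tail = c->count = 0;
        pthread_mutex_init(&c->lock, NULL);
        pthread_cond_init(&c->not_empty, NULL);
        pthread_cond_init(&c->not_full, NULL);
    }

    void chan_put(spliter_chan *c, int v) {        /* producer side */
        pthread_mutex_lock(&c->lock);
        while (c->count == CAP)
            pthread_cond_wait(&c->not_full, &c->lock);
        c->buf[c->tail] = v;
        c->tail = (c->tail + 1) % CAP;
        c->count++;
        pthread_cond_signal(&c->not_empty);
        pthread_mutex_unlock(&c->lock);
    }

    int chan_get(spliter_chan *c) {                /* any consumer */
        int v;
        pthread_mutex_lock(&c->lock);
        while (c->count == 0)
            pthread_cond_wait(&c->not_empty, &c->lock);
        v = c->buf[c->head];
        c->head = (c->head + 1) % CAP;
        c->count--;
        pthread_cond_signal(&c->not_full);
        pthread_mutex_unlock(&c->lock);
        return v;
    }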

21. An Example - Consumer-Producer
[diagram: one producer feeding multiple consumers through a spliter channel]

22. An Example - Consumer-Producer

task::main( )
{
    spliter int chan;
    int i;

    par {
        producer( chan );
        parfor (i = 0; i < 10; i++)
            consumer( chan );
    }
}

23. An Example - Consumer-Producer

task::producer(spliter in int chan)
{
    int i;

    for (i = 0; i < 100; i++)
        put(chan, i);
    for (i = 0; i < 10; i++)
        put(chan, END);
}

24. An Example - Consumer-Producer

task::consumer(spliter in int chan)
{
    int data;

    while ((data = get(chan)) != END)
        process(data);
}

25. Data Parallelism
• Concurrency
  – domain: an aggregate of synchronous tasks
• Synchronization and communication
  – domain variables in global name space

26. An Example – Matrix Multiplication
[diagram: matrix multiplication, A × B = C]

27. An Example – Matrix Multiplication

domain matrix_op[16] {
    int a[16], b[16], c[16];
    multiply(distribute in  int [16:block][16],
             distribute in  int [16][16:block],
             distribute out int [16:block][16]);
};

28. An Example – Matrix Multiplication

task::main( )
{
    int A[16][16], B[16][16], C[16][16];
    domain matrix_op m;

    read_array(A);
    read_array(B);
    m.multiply(A, B, C);
    print_array(C);
}

29. An Example – Matrix Multiplication

matrix_op::multiply(A, B, C)
    distribute in  int [16:block][16] A;
    distribute in  int [16][16:block] B;
    distribute out int [16:block][16] C;
{
    int i, j;

    a := A;
    b := B;
    for (i = 0; i < 16; i++)
        for (c[i] = 0, j = 0; j < 16; j++)
            c[i] += a[j] * matrix_op[i].b[j];
    C := c;
}
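To make the distributions concrete, here is a plain sequential C reading of what a single domain member computes under the [16:block] layouts above: member k holds row k of A and column k of B, and the remote reference matrix_op[i].b[j] denotes B[j][i]. The function name and shape are assumptions for illustration.

    #define N 16

    /* What domain member k of matrix_op computes, unrolled into
     * sequential C. a[] is member k's block (row k of A); the read
     * of matrix_op[i].b[j] corresponds to B[j][i], member i's column. */
    void member_multiply(int k, int A[N][N], int B[N][N], int C[N][N]) {
        int a[N], i, j;

        for (j = 0; j < N; j++)
            a[j] = A[k][j];                  /* [16:block] row distribution */

        for (i = 0; i < N; i++) {
            C[k][i] = 0;                     /* member k produces row k of C */
            for (j = 0; j < N; j++)
                C[k][i] += a[j] * B[j][i];   /* matrix_op[i].b[j] */
        }
    }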

30. Platforms for the CCC Compiler
• PCs and SMPs
  – Pthread: shared memory + dynamic thread creation
• PC clusters and SMP clusters
  – Millipede: distributed shared memory + dynamic remote thread creation
• The similarities between these two classes of machines enable a retargetable compiler implementation for CCC

31. Organization of the CCC Programming System
[layered diagram, top to bottom]
    CCC applications
    CCC compiler
    CCC runtime library
    Virtual shared memory machine interface
    Pthread / Millipede
    SMP / SMP cluster

32. The CCC Compiler
• Tasks → threads
• Monitors → mutex locks, read-write locks, and condition variables
• Channels → mutex locks and condition variables
• Domains → sets of synchronous threads
• Synchronous execution → barriers
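Since a domain becomes a set of synchronous threads and synchronous execution becomes barriers, a counting barrier built from a mutex and a condition variable is the standard realization. The sketch below is generic, not the CCC compiler's actual output; the generation counter keeps a fast thread from racing into the next phase while slower threads are still waiting.

    #include <pthread.h>

    typedef struct {
        pthread_mutex_t lock;
        pthread_cond_t  all_arrived;
        int      parties;     /* threads that must arrive each phase */
        int      waiting;     /* arrivals so far in this phase */
        unsigned generation;  /* phase number, guards against early wakeup */
    } sync_barrier;

    void barrier_init(sync_barrier *b, int parties) {
        pthread_mutex_init(&b->lock, NULL);
        pthread_cond_init(&b->all_arrived, NULL);
        b->parties = parties;
        b->waiting = 0;
        b->generation = 0;
    }

    void barrier_wait(sync_barrier *b) {
        unsigned gen;
        pthread_mutex_lock(&b->lock);
        gen = b->generation;
        if (++b->waiting == b->parties) {     /* last thread to arrive */
            b->waiting = 0;
            b->generation++;
            pthread_cond_broadcast(&b->all_arrived);
        } else {
            while (gen == b->generation)      /* sleep out this phase */
                pthread_cond_wait(&b->all_arrived, &b->lock);
        }
        pthread_mutex_unlock(&b->lock);
    }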

33. Virtual Shared Memory Machine Interface
• Processor management
• Thread management
• Shared memory allocation
• Mutex locks
• Read-write locks
• Condition variables
• Barriers
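As a purely hypothetical illustration, such an interface might be declared as a C header along these lines; every identifier below is an assumption, since the slides list only the categories of services.

    /* Hypothetical header for the virtual shared memory machine
     * interface; both the Pthread and Millipede back ends would
     * implement these entry points. */

    /* processor management */
    int vsm_num_processors(void);

    /* thread management */
    typedef struct vsm_thread vsm_thread;
    vsm_thread *vsm_thread_create(void (*fn)(void *), void *arg);
    void        vsm_thread_join(vsm_thread *t);

    /* shared memory allocation */
    void *vsm_shared_alloc(unsigned long nbytes);
    void  vsm_shared_free(void *p);

    /* mutex locks, read-write locks, condition variables, barriers */
    typedef struct vsm_mutex   vsm_mutex;
    typedef struct vsm_rwlock  vsm_rwlock;
    typedef struct vsm_cond    vsm_cond;
    typedef struct vsm_barrier vsm_barrier;

    void vsm_mutex_lock(vsm_mutex *m);
    void vsm_mutex_unlock(vsm_mutex *m);
    void vsm_rwlock_rdlock(vsm_rwlock *l);
    void vsm_rwlock_wrlock(vsm_rwlock *l);
    void vsm_rwlock_unlock(vsm_rwlock *l);
    void vsm_cond_wait(vsm_cond *c, vsm_mutex *m);
    void vsm_cond_signal(vsm_cond *c);
    void vsm_barrier_wait(vsm_barrier *b);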

34. The CCC Runtime Library
• The CCC runtime library contains a collection of functions that implements the salient abstractions of CCC on top of the virtual shared memory machine interface

35. Performance Evaluation
• SMPs
  – Hardware: an SMP machine with four CPUs; each CPU is an Intel Pentium II Xeon 450 MHz with 512 KB cache
  – Software: OS is Solaris 5.7; thread library is pthread 1.26
• SMP clusters
  – Hardware: four SMP machines, each with two CPUs; each CPU is an Intel Pentium III 500 MHz with 512 KB cache
  – Software: OS is Windows 2000; library is Millipede 4.0
  – Network: Fast Ethernet, 100 Mbps

36. Benchmarks
• Matrix multiplication (1024 × 1024)
• Warshall's transitive closure (1024 × 1024)
• Airshed simulation (5)

37. Matrix Multiplication (SMPs)
Times in seconds; each entry shows time (speedup, efficiency). Sequential time: 287.5.

Configuration   | 1 thread/CPU        | 2 threads/CPU       | 4 threads/CPU       | 8 threads/CPU
CCC (1 CPU)     | 295.05 (0.97, 0.97) | 264.24 (1.08, 1.08) | 250.45 (1.14, 1.14) | 275.32 (1.04, 1.04)
Pthread (1 CPU) | 292.42 (0.98, 0.98) | 257.45 (1.12, 1.12) | 244.24 (1.17, 1.17) | 266.20 (1.08, 1.08)
CCC (2 CPU)     | 152.29 (1.89, 0.94) | 110.54 (2.60, 1.30) | 98.32 (2.93, 1.46)  | 124.44 (2.31, 1.16)
Pthread (2 CPU) | 149.88 (1.91, 0.96) | 105.45 (2.72, 1.36) | 93.56 (3.07, 1.53)  | 119.42 (2.41, 1.20)
CCC (4 CPU)     | 76.39 (3.76, 0.94)  | 69.44 (4.14, 1.03)  | 64.44 (4.46, 1.11)  | 73.54 (3.90, 0.98)
Pthread (4 CPU) | 74.72 (3.85, 0.96)  | 65.42 (4.39, 1.09)  | 59.44 (4.83, 1.20)  | 69.88 (4.11, 1.02)

38. Matrix Multiplication (SMP clusters)
Times in seconds; each entry shows time (speedup, efficiency). Sequential time: 470.44.

Configuration              | 1 thread/CPU        | 2 threads/CPU       | 4 threads/CPU       | 8 threads/CPU
CCC (1 mach × 2 CPU)       | 253.12 (1.85, 0.93) | 201.23 (2.33, 1.16) | 158.31 (2.97, 1.48) | 234.46 (2.00, 1.00)
Millipede (1 mach × 2 CPU) | 248.11 (1.89, 0.95) | 196.33 (2.39, 1.19) | 154.22 (3.05, 1.53) | 224.95 (2.09, 1.05)
CCC (2 mach × 2 CPU)       | 136.34 (3.45, 0.86) | 102.25 (4.60, 1.15) | 96.25 (4.89, 1.22)  | 148.25 (3.17, 0.79)
Millipede (2 mach × 2 CPU) | 129.33 (3.63, 0.91) | 96.52 (4.87, 1.22)  | 91.45 (5.14, 1.27)  | 142.45 (3.31, 0.82)
CCC (4 mach × 2 CPU)       | 87.25 (5.39, 0.67)  | 62.33 (7.54, 0.94)  | 80.25 (5.45, 0.73)  | 102.45 (4.67, 0.58)
Millipede (4 mach × 2 CPU) | 78.37 (6.00, 0.75)  | 54.92 (8.56, 1.07)  | 75.98 (5.57, 0.75)  | 95.44 (4.87, 0.61)

39. Warshall's Transitive Closure (SMPs)
Times in seconds; each entry shows time (speedup, efficiency). Sequential time: 150.32.

Configuration   | 1 thread/CPU        | 2 threads/CPU       | 4 threads/CPU       | 8 threads/CPU
CCC (1 CPU)     | 152.88 (0.98, 0.98) | 138.44 (1.08, 1.08) | 143.54 (1.05, 1.05) | 154.33 (0.97, 0.97)
Pthread (1 CPU) | 151.25 (0.99, 0.99) | 135.45 (1.11, 1.11) | 139.21 (1.07, 1.07) | 152.44 (0.99, 0.99)
CCC (2 CPU)     | 83.36 (1.80, 0.90)  | 69.45 (2.16, 1.08)  | 78.54 (1.91, 0.96)  | 98.24 (1.53, 0.77)
Pthread (2 CPU) | 79.32 (1.90, 0.95)  | 66.85 (2.25, 1.12)  | 74.24 (2.02, 1.01)  | 93.44 (1.60, 0.80)
CCC (4 CPU)     | 49.43 (3.04, 0.76)  | 43.19 (3.48, 0.87)  | 58.44 (2.57, 0.64)  | 77.42 (1.94, 0.49)
Pthread (4 CPU) | 44.14 (3.40, 0.85)  | 40.89 (3.68, 0.91)  | 55.23 (2.72, 0.68)  | 74.21 (2.02, 0.51)

40. Warshall's Transitive Closure (SMP clusters)
Times in seconds; each entry shows time (speedup, efficiency). Sequential time: 305.35.

Configuration              | 1 thread/CPU        | 2 threads/CPU       | 4 threads/CPU       | 8 threads/CPU
CCC (1 mach × 2 CPU)       | 159.24 (1.91, 0.96) | 132.81 (2.29, 1.14) | 102.19 (2.98, 1.49) | 153.90 (1.98, 0.99)
Millipede (1 mach × 2 CPU) | 155.34 (1.96, 0.98) | 125.91 (2.42, 1.21) | 95.29 (3.20, 1.59)  | 144.53 (2.11, 1.06)
CCC (2 mach × 2 CPU)       | 100.03 (3.05, 0.76) | 82.40 (3.70, 0.92)  | 148.97 (2.04, 0.52) | 202.78 (1.50, 0.38)
Millipede (2 mach × 2 CPU) | 88.45 (3.45, 0.86)  | 75.91 (4.02, 1.00)  | 140.28 (2.17, 0.54) | 189.38 (1.61, 0.41)
CCC (4 mach × 2 CPU)       | 60.06 (5.08, 0.64)  | 54.56 (5.59, 0.70)  | 89.68 (3.40, 0.43)  | 138.76 (2.20, 0.27)
Millipede (4 mach × 2 CPU) | 54.05 (5.65, 0.71)  | 47.53 (6.42, 0.80)  | 81.28 (3.75, 0.46)  | 129.96 (2.36, 0.30)

41. Airshed simulation (SMPs)
[chart: execution time (sec) versus number of threads]

42. Airshed simulation (SMP clusters)
[chart: execution time (sec) versus number of threads]

43. Conclusions
• A high-level parallel programming language that uniformly integrates
  – both control and data parallelism
  – both shared variables and message passing
• A modular parallel programming language
• A retargetable compiler
