Presentation Transcript

View-Oriented Parallel Programming for multi-core systems

Dr Zhiyi Huang

World 45

Univ of Otago

An age of CMT

  • CMT offers us the power of parallel computing

  • Harnessing that power relies on good parallel applications and competent parallel programmers

  • Sound parallel programming methodology is the key

Two camps

  • Message passing vs. shared memory

    • Message passing style is complex

    • Communication with shared memory is simple and easy, but…

Problems for SM-based PP (1)

  • Data race condition is a pain

    • Data race: there are concurrent accesses to the same memory location, and at least one of them is a write access

    • Debugging a data race is difficult since a parallel execution is normally not repeatable
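The lost update behind a data race can be replayed deterministically. The following is an illustrative Python sketch (not from the talk) that interleaves the read and write steps of two increments by hand:

```python
# Deterministic replay of the classic lost-update data race:
# two threads both execute "counter = counter + 1", but their
# read and write steps interleave.

counter = 0

# Step 1: both threads read the same old value.
tmp_t1 = counter   # thread 1 reads 0
tmp_t2 = counter   # thread 2 reads 0

# Step 2: both threads write back "their" incremented value.
counter = tmp_t1 + 1   # thread 1 writes 1
counter = tmp_t2 + 1   # thread 2 overwrites with 1

# One increment is lost: counter is 1, not 2.
print(counter)  # → 1
```

In a real run this interleaving happens only occasionally and depends on the scheduler, which is exactly why such bugs are hard to reproduce.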

Problems for SM-based PP (2)

  • Deadlock is another pain

    • Mutual exclusion primitives such as locks are required to prevent data races, but

    • they may result in deadlock, a situation where multiple threads/processes wait for each other while competing for locks

    • Mutual exclusion has complicated the mental model of parallel programming
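The circular wait can be sketched with two ordinary locks (a hypothetical Python demo; timeouts are added so the program terminates instead of hanging forever):

```python
import threading

# Two locks acquired in opposite orders by two threads: the classic
# circular wait. Events force the bad interleaving deterministically.
lock_a, lock_b = threading.Lock(), threading.Lock()
got = {}
t1_holds, t2_holds = threading.Event(), threading.Event()
both_done = threading.Barrier(2)   # keep outer locks held until both tried

def worker1():
    with lock_a:                       # holds A ...
        t1_holds.set()
        t2_holds.wait()                # ... and waits until B is held
        got["t1"] = lock_b.acquire(timeout=0.2)   # then wants B
        both_done.wait()

def worker2():
    with lock_b:                       # holds B ...
        t2_holds.set()
        t1_holds.wait()                # ... and waits until A is held
        got["t2"] = lock_a.acquire(timeout=0.2)   # then wants A
        both_done.wait()

threads = [threading.Thread(target=worker1), threading.Thread(target=worker2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Without the timeouts this is a deadlock: each thread waits forever
# for the lock the other one holds.
print(sorted(got.items()))  # → [('t1', False), ('t2', False)]
```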

Problems for SM-based PP (3)

  • Poor portability is yet another pain

    • Parallel applications are system dependent

    • Mutual exclusion primitives such as locks are not standardized

    • Synchronization primitives such as barrier are not standardized

    • Shared memory allocation is not standardized

View-Oriented Parallel Programming (VOPP)


  • A parallel programming style with the following features

    • Data race free

    • Mutual exclusion free

    • Deadlock free

    • Portable to any system with shared memory

What is a view?

  • Suppose M is the set of data objects in shared memory

  • A view is a group of data objects from the shared memory

    •  V, VM

  • Views must not overlap each other

    • ∀Vi, Vj, i ≠ j: Vi ∩ Vj = ∅

  • Suppose there are n views in shared memory

    • V1 ∪ V2 ∪ … ∪ Vn = M
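These conditions say the views partition the shared data. A quick Python check on a toy example (the object names and view identifiers are made up for illustration):

```python
# Toy shared memory: M is a set of data-object names.
M = {"A", "B", "C", "counter", "buffer"}

# A candidate partition of M into views (hypothetical identifiers).
views = {
    1: {"A", "B"},
    2: {"C"},
    3: {"counter", "buffer"},
}

# Condition 1: every view is a subset of M.
assert all(v <= M for v in views.values())

# Condition 2: views are pairwise disjoint.
ids = list(views)
assert all(views[i].isdisjoint(views[j])
           for i in ids for j in ids if i != j)

# Condition 3: the views together cover M (their union is M).
union = set().union(*views.values())
assert union == M
print("views form a partition of M")
```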

VOPP Requirements

  • The programmer should divide the shared data into a number of views according to the data flow of the parallel algorithm.

  • A view should consist of data objects that are always processed as an atomic set in a program.

  • Views can be created and destroyed anytime.

  • Each view has a unique view identifier

VOPP Requirements (cont.)

  • View primitives such as acquire_view and release_view must be used when a view is accessed.


    acquire_view(V_A);
    A = A + 1;
    release_view(V_A);


  • acquire_Rview and release_Rview can be used when a view is only read by a processor.
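The view primitives are not available as a library here, but their intended behaviour can be emulated. A minimal Python sketch that maps each view identifier to one mutex (only the names acquire_view/release_view come from the slides; everything else is assumed):

```python
import threading

# One mutex per view identifier (a toy emulation, not VODCA).
_view_locks = {}
_registry_lock = threading.Lock()

def acquire_view(view_id):
    with _registry_lock:                   # create the lock on first use
        lock = _view_locks.setdefault(view_id, threading.Lock())
    lock.acquire()

def release_view(view_id):
    _view_locks[view_id].release()

# Shared data guarded by a hypothetical view "V_A".
shared = {"A": 0}

def increment():
    for _ in range(10000):
        acquire_view("V_A")
        shared["A"] = shared["A"] + 1      # the slide's A = A + 1
        release_view("V_A")

threads = [threading.Thread(target=increment) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(shared["A"])  # → 40000: no lost updates
```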

VOPP Requirements (cont.)

  • When a process/thread accesses multiple views at the same time, only one acquiring primitive is used.

    acquire_3_views(V_A, V_B, V_C);

    C = A + B;

    release_3_views(V_A, V_B, V_C);
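Bundling all the needed views into one acquiring primitive is also what lets the system stay deadlock free: the underlying locks can always be taken in a fixed global order. A Python sketch of that idea (acquire_views/release_views are hypothetical names, ordering by view identifier):

```python
import threading

_locks = {"V_A": threading.Lock(), "V_B": threading.Lock(), "V_C": threading.Lock()}

def acquire_views(*view_ids):
    # Always lock in sorted order, so two threads can never
    # hold-and-wait in opposite orders (no circular wait).
    for vid in sorted(view_ids):
        _locks[vid].acquire()

def release_views(*view_ids):
    for vid in sorted(view_ids, reverse=True):
        _locks[vid].release()

data = {"A": 1, "B": 2, "C": 0}

def t1():
    for _ in range(1000):
        acquire_views("V_A", "V_B", "V_C")   # requested in one order
        data["C"] = data["A"] + data["B"]    # the slide's C = A + B
        release_views("V_A", "V_B", "V_C")

def t2():
    for _ in range(1000):
        acquire_views("V_C", "V_B", "V_A")   # requested in the opposite order
        data["C"] = data["A"] + data["B"]
        release_views("V_A", "V_B", "V_C")

threads = [threading.Thread(target=t1), threading.Thread(target=t2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(data["C"])  # → 3, and no deadlock despite the opposite request orders
```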



  • A VOPP program for a producer/consumer problem

if (prod_id == 0) {            /* producer (sketch) */
    acquire_view(V_buffer);
    produce(buffer);
    release_view(V_buffer);
} else {                       /* consumer (sketch) */
    acquire_view(V_buffer);
    consume(buffer);
    release_view(V_buffer);
}

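The producer/consumer pattern can be emulated in Python, with one lock standing in for acquire_view/release_view on a hypothetical buffer view:

```python
import threading

# One lock stands in for acquire_view/release_view on the buffer view.
buf_view = threading.Lock()
buffer, consumed = [], []

def producer():
    for i in range(5):
        with buf_view:               # acquire_view(V_buffer)
            buffer.append(i)         # produce one item
                                     # release_view(V_buffer) on exit

def consumer():
    while len(consumed) < 5:
        with buf_view:               # acquire_view(V_buffer)
            if buffer:
                consumed.append(buffer.pop(0))

threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(consumed)  # → [0, 1, 2, 3, 4]
```

The consumer here polls the view; a real VOPP system would block in the acquiring primitive instead.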

VOPP features

  • No concern about data races

    • The programmer is only concerned with views, not mutual exclusion

    • Mutual exclusion is implemented by the system, which also detects potential data races by checking view boundaries

  • Deadlock free

    • Mutual exclusion is implemented by the system and can be made data-race free and deadlock free

  • Portability?

    • By standardization of API
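How a system might check view boundaries can be sketched as follows (a toy Python emulation; the object-to-view map and helper names are made up, and the actual locking is elided):

```python
import threading

# Each shared object belongs to exactly one view (toy assignment);
# accesses are checked against the views the thread currently holds.
view_of = {"A": "V1", "B": "V2"}
store = {"A": 0, "B": 0}
_held = threading.local()

def acquire_view(vid):
    views = getattr(_held, "views", set())
    views.add(vid)
    _held.views = views          # (the real mutex is elided in this sketch)

def release_view(vid):
    _held.views.discard(vid)

def write(obj, value):
    # The boundary check: touching data outside any acquired view
    # is flagged as a potential data race.
    if view_of[obj] not in getattr(_held, "views", set()):
        raise RuntimeError("view boundary violation on " + obj)
    store[obj] = value

acquire_view("V1")
write("A", 42)                   # fine: A belongs to the acquired view V1
caught = False
try:
    write("B", 7)                # flagged: B is outside V1
except RuntimeError:
    caught = True
release_view("V1")
print(store, caught)  # → {'A': 42, 'B': 0} True
```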

Requirements for the system

  • Keep track of view locations

  • Capable of checking view boundaries

  • Guarantee deadlock freedom when implementing mutual exclusion

Advantages of VOPP

  • Keep the convenience of shared memory programming

  • Focus on data partitioning and data access instead of data race and mutual exclusion

    • View primitives automatically achieve mutual exclusion

    • View primitives are not an extra burden

  • The programmer can finely tune the parallel algorithm by careful view partitioning

Advantages of VOPP (cont.)

  • Implementation independent

    • View access can be based on mutual exclusion or Transactional Memory (TM)

    • TM is a memory system that checks access conflicts

  • Programming language independent

    • Can be implemented as a user space library

  • Performance advantage

    • Cache pre-fetching when a view is acquired

    • A view can stay cached until it is acquired by another thread/process

Philosophy of VOPP

  • Shared memory is a critical resource that needs to be used with care

    • If there is no need to use shared memory, don’t use it

    • Justification is required before a view is created

    • Compatible with Throughput Computing, which encourages running multiple independent threads on a chip

VOPP vs. MPI

  • Easier for programmers than MPI

    • For problems like task queues, programming with MPI is horrific.

  • Can mimic any finely-tuned MPI program

    • Shared message → view

    • Send/recv → acquire_view

  • Essential differences

    • View is location transparent

    • More barriers in VOPP

VODCA


  • VOPP is supported by our DSM system called VODCA

    • DSM: Distributed Shared Memory system provides a virtual shared memory on multi-computers

    • VODCA: View-Oriented, Distributed, Cluster-based Approach to parallel computing

  • VODCA version 1.0

    • Will be released as open-source software

    • A library that runs in user space

    • Its implementation will be published at DSM06


  • We used a cluster computer

    • The cluster, located at Tsinghua Univ., consists of 128 Itanium 2 processors running Linux 2.4, connected by InfiniBand. Each node has two 1.3 GHz processors and 4 GB of RAM. We ran two processes on each node.

  • We used four applications, Integer Sort (IS), Gauss, Successive Over-Relaxation (SOR), and Neural Network (NN).

Related systems

  • TreadMarks (TMK) is a state-of-the-art Distributed Shared Memory system based on traditional parallel programming.

  • Message Passing Interface (MPI) is a standard for message passing-based parallel programming. We used LAM/MPI.

Future work on VOPP

  • API for multi-core systems

  • Implementation on Niagara

  • More benchmarks/applications, especially telecommunication applications

  • Performance evaluation on CMT

  • A view-based debugger for VOPP