Processor-oblivious parallel algorithms and scheduling: illustration on parallel prefix

Jean-Louis Roch, Daouda Traore
INRIA-CNRS Moais team - LIG, Grenoble, France
Contents
• I. What is a processor-oblivious parallel algorithm?
• II. Work-stealing scheduling of parallel algorithms
• III. Processor-oblivious parallel prefix computation

Processor-oblivious algorithms
Dynamic architecture: non-fixed number of resources, variable speeds (e.g. a grid, but also an SMP server in multi-user mode)
=> motivates the design of a « processor-oblivious » parallel algorithm that:
+ is independent from the underlying architecture: no reference to p, nor to $\Pi_i(t)$ = speed of processor i at time t, nor …
+ on a given architecture, has performance guarantees: behaves as well as an optimal (off-line, non-oblivious) one
Problem: often, the larger the parallel degree, the larger the number of operations to perform!

Prefix computation
• Prefix problem:
  • input: $a_0, a_1, \ldots, a_n$
  • output: $\pi_0, \pi_1, \ldots, \pi_n$ with $\pi_i = a_0 * a_1 * \cdots * a_i$
• Sequential algorithm: π[0] = a[0]; for (i = 1; i <= n; i++) π[i] = π[i-1] * a[i]; performs $W_1 = W_\infty = n$ operations
• Fine-grain optimal parallel algorithm [Ladner-Fischer]: critical time $W_\infty = 2 \log n$, but performs $W_1 = 2n$ operations, i.e. twice as expensive as the sequential algorithm
[Figure: the inputs a0 … an are combined pairwise with *, a prefix of size n/2 is computed on the results, and the remaining outputs are then completed with one more layer of *]
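To make the two operation counts concrete, here is a minimal Python sketch. The function names are ours, not from the slides. It counts the applications of * in the sequential loop and in a recursive pairwise scheme of the kind the figure suggests (pair up, prefix of size n/2, fill in). The recursive version runs sequentially here and is only meant to count operations; it is not claimed to be the exact Ladner-Fischer circuit.

```python
from operator import add

def sequential_prefix(a, op=add):
    """Prefix of a non-empty list; returns (prefixes, number of op applications)."""
    out, ops = [a[0]], 0
    for x in a[1:]:
        out.append(op(out[-1], x))
        ops += 1
    return out, ops

def recursive_prefix(a, op=add):
    """Pairwise recursive prefix; only associativity of op is assumed."""
    n = len(a)
    if n == 1:
        return [a[0]], 0
    # Step 1: combine adjacent pairs (all independent, hence parallel).
    pairs = [op(a[i], a[i + 1]) for i in range(0, n - 1, 2)]
    ops = len(pairs)
    # Step 2: prefix of size about n/2 on the pair results.
    sub, sub_ops = recursive_prefix(pairs, op)
    ops += sub_ops
    # Step 3: odd positions are already final; even ones need one more op.
    out = [a[0]]
    for i in range(1, n):
        if i % 2 == 1:
            out.append(sub[i // 2])
        else:
            out.append(op(sub[i // 2 - 1], a[i]))
            ops += 1
    return out, ops
```

On 16 inputs the sequential loop applies the operator 15 times and the recursive scheme 26 times, which stays below the 2n figure quoted on the slide while being well above the sequential count.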

Prefix computation: an example where parallelism always costs
• Any parallel algorithm with critical time $W_\infty$ runs on p processors in time [bound not preserved in this transcript]
• Strict lower bound: block algorithm + pipeline [Nicolau & al. 1996]
• Question: how to design a generic parallel algorithm, independent from the architecture, that achieves optimal performance on any given architecture?
• => design a malleable algorithm whose scheduling adapts the number of operations performed to the architecture

Architecture model
- Heterogeneous processors with changing speeds: $\Pi_i(t)$ = instantaneous speed of processor i at time t, in operations per second
- Average speed per processor for a computation with duration T: [formula not preserved in this transcript]
- Lower bound for the time of prefix computation: [formula not preserved in this transcript]
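One natural way to write the average speed per processor, assuming it is averaged over both the p processors and the duration T. The symbols $\Pi_i(t)$ and $\Pi_{ave}$ are our notation for this sketch; the slide's own formula is not reproduced here.

```latex
\Pi_{ave} \;=\; \frac{1}{p\,T} \sum_{i=1}^{p} \int_{0}^{T} \Pi_i(t)\, dt
```

Under this assumption, $p \cdot \Pi_{ave} \cdot T$ is the total number of operations the machine can execute during the computation.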

Work-stealing (1/2)
« Work » $W_1$ = total number of operations performed
« Depth » $W_\infty$ = number of operations on a critical path (parallel time on an unbounded number of resources)
• Work-stealing = “greedy” schedule, but distributed and randomized
• Each processor manages locally the tasks it creates
• When idle, a processor steals the oldest ready task from a remote, non-idle victim processor (randomly chosen)
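The stealing rule above can be sketched with one double-ended queue per processor. This is an illustrative Python model, not the Cilk or Kaapi implementation: the names `Worker`, `pop_local`, `steal_from` and `pick_victim` are hypothetical, "non-idle" is approximated by "has a non-empty queue", and having the owner take its newest task is an assumption borrowed from common work-stealing runtimes (the slide only specifies that thieves take the oldest task).

```python
import random
from collections import deque

class Worker:
    """One processor together with its local queue of ready tasks."""

    def __init__(self, name):
        self.name = name
        self.tasks = deque()

    def push(self, task):
        # Tasks created locally go to the right end (newest).
        self.tasks.append(task)

    def pop_local(self):
        # Assumption: the owner resumes its newest task.
        return self.tasks.pop() if self.tasks else None

    def steal_from(self, victim):
        # A thief takes the oldest ready task of the victim.
        return victim.tasks.popleft() if victim.tasks else None

def pick_victim(thief, workers, rng):
    """Choose uniformly at random among the other workers that have tasks."""
    candidates = [w for w in workers if w is not thief and w.tasks]
    return rng.choice(candidates) if candidates else None
```

For example, if `p0` has pushed `t1`, `t2`, `t3` in that order, an idle `p1` steals `t1` while `p0` itself continues with `t3`.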

Work-stealing (2/2)
« Work » $W_1$ = total number of operations performed
« Depth » $W_\infty$ = number of operations on a critical path (parallel time on an unbounded number of resources)
• Interests:
  -> suited to heterogeneous architectures with slight modification [Bender-Rabin 02]
  -> if $W_\infty$ is small enough, near-optimal processor-oblivious schedule with good probability on p processors with average speed $\Pi_{ave}$
  NB: #successful steals = #task migrations < p $W_\infty$ [Blumofe 98, Narlikar 01, Bender 02]
• Implementation: work-first principle [Cilk series-parallel, Kaapi dataflow]
  -> move the scheduling overhead onto the steal operations (infrequent case)
  -> general case: “local parallelism” is implemented by a sequential function call

How to get both optimal work $W_1$ and small $W_\infty$?
• General approach: mix both
  • a sequential algorithm with optimal work $W_1$
  • and a fine-grain parallel algorithm with minimal critical time $W_\infty$
• Folk technique: parallel, then sequential
  • Parallel algorithm until a certain « grain »; then use the sequential one
  • Drawback: $W_\infty$ increases ;o) … and, also, the number of steals
• Work-preserving speed-up technique [Bini-Pan 94]: sequential, then parallel. Cascading [Jaja 92]: careful interplay of both algorithms to build one with both $W_\infty$ small and $W_1 = O(W_{seq})$
  • Use the work-optimal sequential algorithm to reduce the size
  • Then use the time-optimal parallel algorithm to decrease the time
  • Drawback: sequential at coarse grain and parallel at fine grain ;o(
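As an illustration of the "sequential to reduce the size, then parallel" idea applied to prefix, here is a minimal Python sketch of a static block decomposition. It is one common scheme, not necessarily the construction of [Bini-Pan 94] or [Jaja 92]; the function name `block_prefix` and the parameter `nblocks` are ours. The phases marked "parallel" are simulated by ordinary loops, and the sketch counts operator applications for a non-empty input.

```python
from operator import add

def block_prefix(a, nblocks, op=add):
    """Three-phase block prefix; returns (prefixes, number of op applications)."""
    n = len(a)
    size = -(-n // nblocks)  # ceiling division: elements per block
    blocks = [a[i:i + size] for i in range(0, n, size)]
    ops = 0
    # Phase 1 (parallel across blocks): sequential local prefix in each block.
    local = []
    for blk in blocks:
        pref = [blk[0]]
        for x in blk[1:]:
            pref.append(op(pref[-1], x))
            ops += 1
        local.append(pref)
    # Phase 2: prefix over the block totals (reduced problem, one value per block).
    totals = [local[0][-1]]
    for pref in local[1:]:
        totals.append(op(totals[-1], pref[-1]))
        ops += 1
    # Phase 3 (parallel across blocks): shift each block by the preceding total.
    out = list(local[0])
    for k in range(1, len(blocks)):
        for v in local[k][:-1]:
            out.append(op(totals[k - 1], v))
            ops += 1
        out.append(totals[k])  # last element of the block is already final
    return out, ops
```

With 16 inputs and 4 blocks this performs 24 operator applications against 15 for the sequential loop: the block size is fixed before execution, so the extra work is paid even if only one processor is actually available.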

Alternative: concurrently sequential and parallel
Based on the work-first principle: always execute a sequential algorithm to reduce parallelism overhead
• Use the parallel algorithm only if a processor becomes idle (i.e. steals), by extracting parallelism from the sequential computation
Hypothesis: two algorithms:
• 1 sequential: SeqCompute
• 1 parallel: LastPartComputation: at any time, it is possible to extract parallelism from the remaining computations of the sequential algorithm
[Figure labels: SeqCompute, Extract_par, LastPartComputation, SeqCompute]
• Self-adaptive granularity
• Examples:
  - iterated product [Vernizzi 05]
  - gzip / compression [Kerfali 04]
  - MPEG-4 / H264 [Bernard 06]
  - prefix computation [Traore 06]
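The hypothesis that parallelism can be extracted at any time from the remaining sequential work can be sketched as a shared work descriptor. The class below is a hypothetical Python model: `WorkInterval`, `extract_seq` and `extract_par` are illustrative names (only Extract_par appears on the slide), and splitting off the last half is an assumption, since the slide only says that the last part of the remaining work is taken.

```python
import threading

class WorkInterval:
    """Remaining work [lo, hi) of a sequential computation, visible to thieves."""

    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
        self._lock = threading.Lock()

    def extract_seq(self):
        """Owner side: take the next index, or None when nothing is left."""
        with self._lock:
            if self.lo >= self.hi:
                return None
            i = self.lo
            self.lo += 1
            return i

    def extract_par(self):
        """Thief side: split off the last half of what remains, or None."""
        with self._lock:
            if self.hi - self.lo < 2:
                return None
            mid = (self.lo + self.hi) // 2
            stolen = WorkInterval(mid, self.hi)
            self.hi = mid
            return stolen
```

The grain is never fixed in advance: the owner advances one element at a time, and a block is created only when a thief actually asks for one. This sketch takes the lock on every owner step for simplicity; an implementation following the work-first principle would try to keep the owner's path cheap and pay synchronization costs mainly on steals.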

Adaptive prefix on 3 processors
[Animation over six slides, inputs a1 … a12. The main sequential process starts computing the prefixes from a1. On steal requests, work-stealer 1 takes a5 … a8 and computes the local products a5 * … * ai, and work-stealer 2 takes a9 … a12 and computes the local products a9 * … * ai. After computing prefixes 1 to 4, the main process preempts work-stealer 1 and obtains prefix 8, then preempts work-stealer 2 and obtains prefix 11, and finally computes prefix 12 itself. In parallel, the work-stealers complete the remaining prefixes 5, 6, 7 and 9, 10 from their local products. Sequential: 1, 2, 3, 4, 8, 11, 12. Parallel: 5 to 11. The critical path is implicitly the one of the sequential process.]

Analysis of the algorithm
• Execution time: [bound not preserved in this transcript]
• Scheme of the proof:
  • Dynamic coupling of two algorithms that complete simultaneously:
    • Sequential: (optimal) number of operations S on one processor
    • Parallel: minimal time, but performs X operations on the other processors
      • dynamic splitting is always possible down to the finest grain, BUT local execution is sequential
      • critical path is small (e.g. log X)
      • each non-constant-time task can potentially be split (variable speeds)
  • The algorithmic scheme ensures $T_s = T_p + O(\log X)$ => this makes it possible to bound the total number X of operations performed, and the overhead of parallelism = (S + X) - #ops_optimal

Adaptive prefix: experiments 1
Prefix sum of 8·10^6 doubles on an 8-processor SMP (IA64 1.5 GHz / Linux), single-user context
[Plot: time (s) versus number of processors; curves: pure sequential, optimal off-line on p procs, oblivious]
Single-user context: processor-oblivious prefix achieves near-optimal performance:
- close to the lower bound both on 1 processor and on p processors
- less sensitive to system overhead: even better than the theoretically “optimal” off-line parallel algorithm on p processors

Adaptive prefix: experiments 2
Prefix sum of 8·10^6 doubles on an 8-processor SMP (IA64 1.5 GHz / Linux), with external load from (9-p) external processes
[Plot: time (s) versus number of processors; curves: off-line parallel algorithm for p processors, oblivious]
Multi-user context: additional external load: (9-p) additional external dummy processes are executed concurrently
Processor-oblivious prefix computation is always the fastest: 15% benefit over a parallel algorithm for p processors with off-line schedule

Conclusion
The interplay of an on-line parallel algorithm directed by a work-stealing schedule is useful for the design of processor-oblivious algorithms
Application to prefix computation:
- theoretically reaches the lower bound on heterogeneous processors with changing speeds
- practically, achieves near-optimal performance on multi-user SMPs
Generic adaptive scheme to implement parallel algorithms with provable performance
- work in progress: parallel 3D reconstruction [oct-tree scheme with deadline constraint]

Thank you

The Prefix race: sequential / parallel fixed / adaptive
[Figure: race between Adaptive 8 proc., Parallel 8 proc., Parallel 7 proc., Parallel 6 proc., Parallel 5 proc., Parallel 4 proc., Parallel 3 proc., Parallel 2 proc. and Sequential]
On each of the 10 executions, adaptive completes first

Adaptive prefix: some experiments
Prefix of 10000 elements on an 8-processor SMP (IA64 / Linux)
[Two plots: time (s) versus number of processors, curves Parallel and Adaptive; one in single-user context, one with external load]
• Single-user context: Adaptive is equivalent to:
  - sequential on 1 proc
  - optimal parallel-2 proc. on 2 processors
  - …
  - optimal parallel-8 proc. on 8 processors
• Multi-user context (external load): Adaptive is the fastest, with a 15% benefit over a static-grain algorithm

With * = double sum ( r[i] = r[i-1] + x[i] )
Finest “grain” limited to 1 page = 16384 bytes = 2048 doubles
[Plot panels: single user; processors with variable speeds]
Remark for n = 4,096,000 doubles:
- “pure” sequential: 0.20 s
- minimal “grain” = 100 doubles: 0.26 s on 1 proc and 0.175 s on 2 procs (close to lower bound)
