
  1. Laboratoire Informatique de Grenoble Parallel algorithms and scheduling: adaptive parallel programming and applications Bruno Raffin, Jean-Louis Roch, Denis Trystram Projet MOAIS [ INRIA / CNRS / INPG / UJF ] moais.imag.fr

  2. Parallel algorithms and scheduling: adaptive parallel programming and applications • Contents • I. Motivations for adaptation and examples. • II. Off-line scheduling and adaptation : moldable/malleable [Denis ?] • III. On-line work-stealing scheduling and parallel adaptation • IV. A first example : iterative product ; application to gzip • V. Processor-oblivious parallel prefix computation • VI. Adaptation to time-constraints : oct-tree computation [Bruno/Luciano] • VII. Bi-criteria latency/bandwidth [Bruno / Jean-Denis] • VIII. Adaptation to support fault-tolerance by work-stealing • Conclusion

  3. Why adaptive algorithms, and how? Choices in the algorithm : • sequential / parallel(s) • approximated / exact • in memory / out of core • … An algorithm is « hybrid » iff there is a high-level choice between at least two algorithms, each of which could solve the same problem. Resource availability is versatile and input data vary : measures on resources and measures on data drive the adaptation, to improve performance : • Scheduling : partitioning, load-balancing, work-stealing • Calibration : tuning parameters (block size / cache, choice of instructions, …), priority managing

  4. Parallel algorithms and scheduling: adaptive parallel programming and applications • Contents • I. Motivations for adaptation and examples. • II. Off-line scheduling and adaptation : moldable/malleable [Denis ?] • III On-line work-stealing scheduling and parallel adaptation • IV. A first example : iterative product ; application to gzip • V. Processor-oblivious parallel prefix computation • VI. Adaptation to time-constraints : oct-tree computation [Bruno/Luciano] • VII. Bi-criteria latency/bandwith [Bruno / Jean-Denis] • VIII. Adaptation to support fault-tolerance by work-stealing Conclusion

  5. Modeling a hybrid algorithm • Several algorithms solve the same problem f : • e.g. algo_f1, algo_f2(block size), …, algo_fk • each algo_fi being recursive : algo_fi ( n, … ) { … f ( n - 1, … ) ; … f ( n / 2, … ) ; … } • Adaptation = choosing algo_fj for each call to f • E.g. “practical” hybrids : • Atlas, Goto, FFPack • FFTW • cache-oblivious B-tree • any parallel program with scheduling support : Cilk, Athapascan/Kaapi, Nesl, TLib, …
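
A minimal C++ sketch of this model (hypothetical names, not from the slides): f is the problem entry point and, at every recursive call, a choice selects which concrete algorithm handles the sub-problem; the adaptation layer is whatever drives that choice (here, a toy size threshold).

#include <cstddef>

static long algo_f1(std::size_t n);   // e.g. a plain sequential variant
static long algo_f2(std::size_t n);   // e.g. a blocked or parallel variant

// The hybrid: each call to f selects one of the concrete algorithms.
// An adaptive runtime would base the choice on measures of the input
// or of the resources instead of this fixed threshold.
static long f(std::size_t n) {
    if (n < 2) return 1;                          // base case
    return (n < 1024) ? algo_f1(n) : algo_f2(n);
}

static long algo_f1(std::size_t n) { return 1 + f(n - 1); }   // recurses on n-1
static long algo_f2(std::size_t n) { return 2 * f(n / 2); }   // recurses on n/2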

  6. How to manage overhead due to choices ? • Classification 1/2 : • Simple hybrid iff O(1) choices [eg block size in Atlas, …] • Baroque hybrid iff an unbounded number of choices [eg recursive splitting factors in FFTW] • choices are either dynamic or pre-computed based on input properties.

  7. Classification 2/2 : choices may or may not be based on architecture parameters. A hybrid is : • Oblivious : the control flow depends neither on static properties of the resources nor on the input [e.g. cache-oblivious algorithms [Bender]] • Tuned : strategic choices are based on static parameters [e.g. block size w.r.t. cache, granularity, …] • Engineered tuned or self-tuned [e.g. ATLAS and GOTO libraries, FFTW, LinBox/FFLAS [Saunders&al]] • Adaptive : self-configuration of the algorithm, dynamic, based on input properties or resource circumstances discovered at run-time [e.g. idle processors, data properties, …] [e.g. TLib [Rauber&Rünger]]

  8. Examples • BLAS libraries • Atlas : simple tuned (self-tuned) • Goto : simple engineered (engineered tuned) • LinBox / FFLAS : simple self-tuned, adaptive [Saunders&al] • FFTW • Halving factor : baroque tuned • Stopping criterion : simple tuned

  9. Adaptation in parallel algorithms. Problem : compute f(a). Candidate algorithms : a sequential algorithm, a parallel one for P=2, one for P=100, …, one for P=max. Which algorithm to choose on a heterogeneous network, a multi-user SMP server, or a grid ? Dynamic architecture : non-fixed number of resources, variable speeds ; e.g. grid, … but not only : SMP server in multi-user mode.

  10. Processor-oblivious algorithms. Dynamic architecture : non-fixed number of resources, variable speeds ; e.g. grid, … but not only : SMP server in multi-user mode => motivates « processor-oblivious » parallel algorithms that : + are independent from the underlying architecture : no reference to p, nor to πi(t) = speed of processor i at time t, nor … + on a given architecture, have performance guarantees : behave as well as an optimal (off-line, non-oblivious) algorithm

  11. How to adapt the application ? • By minimizing communications • e.g. amortizing synchronizations in the simulation [Beaumont, Daoudi, Maillard, Manneback, Roch - PMAA 2004] : adaptive granularity • By controlling latency (interactivity constraints) : • FlowVR [Allard, Menier, Raffin] : overhead • By managing node failures and resilience [checkpoint/restart] [checkers] • FlowCert [Jafar, Krings, Leprevost; Roch, Varrette] • By adapting granularity • malleable tasks [Trystram, Mounié] • dataflow cactus-stack : Athapascan/Kaapi [Gautier] • recursive parallelism by « work-stealing » [Blumofe-Leiserson 98, Cilk, Athapascan, ...] • Self-adaptive grain algorithms • dynamic extraction of parallelism [Daoudi, Gautier, Revire, Roch - J. TSI 2004]

  12. Parallelism and efficiency. « Work » : sequential time T1 = #operations. « Depth » : parallel time on ∞ resources T∞ = #ops on a critical path. Problem : how to adapt the potential parallelism to the resources ? Scheduling = an efficient policy (close to optimal) + control of the policy (realization). The policy is difficult in general (coarse grain), but easy if T∞ is small (fine grain) : Tp ≤ T1/p + T∞ [greedy scheduling, Graham 69]. The control is expensive in general (fine grain), but has a small overhead at coarse grain. => the goal : have T∞ small while keeping coarse-grain control.
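
For reference, the bound quoted on the slide written out (Graham's greedy list-scheduling result, in the slide's notation); it also makes explicit why a small depth T∞ makes the easy policy near-optimal:

\[
  T_p \;\le\; \frac{T_1}{p} + T_\infty,
  \qquad\text{hence}\qquad
  T_p \;\le\; (1+\varepsilon)\,\frac{T_1}{p}
  \quad\text{whenever}\quad
  T_\infty \;\le\; \varepsilon\,\frac{T_1}{p}.
\]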

  13. Parallel algorithms and scheduling: adaptive parallel programming and applications • Contents • I. Motivations for adaptation and examples. • II. Off-line scheduling and adaptation : moldable/malleable [Denis ?] • III. On-line work-stealing scheduling and parallel adaptation • IV. A first example : iterative product ; application to gzip • V. Processor-oblivious parallel prefix computation • VI. Adaptation to time-constraints : oct-tree computation [Bruno/Luciano] • VII. Bi-criteria latency/bandwidth [Bruno / Jean-Denis] • VIII. Adaptation to support fault-tolerance by work-stealing • Conclusion

  14. Parallel algorithms and scheduling: adaptive parallel programming and applications. Bruno Raffin, Jean-Louis Roch, Denis Trystram, INRIA-CNRS Moais team - LIG Grenoble, France • Contents • I. Motivations for adaptation and examples. • II. Off-line scheduling and adaptation : moldable/malleable [Denis ?] • III. On-line work-stealing scheduling and parallel adaptation • IV. A first example : iterative product ; application to gzip • V. Processor-oblivious parallel prefix computation • VI. Adaptation to time-constraints : oct-tree computation • VII. Bi-criteria latency/bandwidth [Bruno / Jean-Denis] • Conclusion

  15. TO BE COMPLETED (Denis)

  16. Parallel algorithms and scheduling: adaptive parallel programming and applications • Contents • I. Motivations for adaptation and examples. • II. Off-line scheduling and adaptation : moldable/malleable [Denis ?] • III. On-line work-stealing scheduling and parallel adaptation • IV. A first example : iterative product ; application to gzip • V. Processor-oblivious parallel prefix computation • VI. Adaptation to time-constraints : oct-tree computation [Bruno/Luciano] • VII. Bi-criteria latency/bandwidth [Bruno / Jean-Denis] • VIII. Adaptation to support fault-tolerance by work-stealing • Conclusion

  17. Parallel algorithms and scheduling: adaptive parallel programming and applications. Bruno Raffin, Jean-Louis Roch, Denis Trystram, INRIA-CNRS Moais team - LIG Grenoble, France • Contents • I. Motivations for adaptation and examples. • II. Off-line scheduling and adaptation : moldable/malleable [Denis ?] • III. On-line work-stealing scheduling and parallel adaptation • IV. A first example : iterative product ; application to gzip • V. Processor-oblivious parallel prefix computation • VI. Adaptation to time-constraints : oct-tree computation • VII. Bi-criteria latency/bandwidth [Bruno / Jean-Denis] • Conclusion

  18. Work-stealing (1/2). « Work » W1 = #total operations performed. « Depth » W∞ = #ops on a critical path (parallel time on ∞ resources). • Work-stealing = “greedy” schedule, but distributed and randomized • Each processor manages locally the tasks it creates • When idle, a processor steals the oldest ready task from a remote - non-idle - victim processor (randomly chosen)

  19. Work-stealing (2/2). « Work » W1 = #total operations performed. « Depth » W∞ = #ops on a critical path (parallel time on ∞ resources). • Interests : -> suited to heterogeneous architectures, with a slight modification [Bender-Rabin 02] -> if W∞ is small enough : near-optimal processor-oblivious schedule, with good probability, on p processors with average speed πave. NB : #succeeded steals = #task migrations < p·W∞ [Blumofe 98, Narlikar 01, Bender 02] • Implementation : work-first principle [Cilk series-parallel, Kaapi dataflow] -> move the scheduling overhead onto the steal operations (the infrequent case) -> general case : “local parallelism” implemented by a sequential function call
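
A deliberately simplified C++ sketch of the two sides of work stealing described above (hypothetical types, not from the slides; production runtimes such as Cilk or Kaapi use lock-free deques and the work-first principle rather than this coarse locking). The owner works depth-first on its youngest task; an idle processor steals the oldest ready task of a random victim.

#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <optional>
#include <random>
#include <vector>

struct Worker {
    std::deque<std::function<void()>> tasks;   // locally created ready tasks
    std::mutex m;                              // coarse lock, for clarity only
};

// Owner side: push and pop at the front, i.e. depth-first, youngest task first.
void push_local(Worker& w, std::function<void()> t) {
    std::lock_guard<std::mutex> g(w.m);
    w.tasks.push_front(std::move(t));
}

std::optional<std::function<void()>> pop_local(Worker& w) {
    std::lock_guard<std::mutex> g(w.m);
    if (w.tasks.empty()) return std::nullopt;
    auto t = std::move(w.tasks.front());
    w.tasks.pop_front();
    return t;
}

// Thief side: when idle, pick a random victim (assumes at least one worker)
// and take its *oldest* ready task, i.e. the breadth-first end of the deque.
std::optional<std::function<void()>> steal(std::vector<Worker>& workers, std::mt19937& rng) {
    std::uniform_int_distribution<std::size_t> pick(0, workers.size() - 1);
    Worker& victim = workers[pick(rng)];
    std::lock_guard<std::mutex> g(victim.m);
    if (victim.tasks.empty()) return std::nullopt;
    auto t = std::move(victim.tasks.back());
    victim.tasks.pop_back();
    return t;
}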

  20. Work-stealing scheduling of a parallel recursive fine-grain algorithm • Work-stealing scheduling • an idle processor steals the oldest ready task • Interests : => #succeeded steals < p·T∞ [Blumofe 98, Narlikar 01, ....] => suited to heterogeneous architectures [Bender-Rabin 03, ....] • Hypotheses for efficient parallel executions : • the parallel algorithm is « work-optimal » • T∞ is very small (recursive parallelism) • a « sequential » execution of the parallel algorithm is valid • e.g. : search trees, Branch&Bound, ... • Implementation : work-first principle [Multilisp, Cilk, …] • overhead of task creation only upon a steal request : sequential degeneration of the parallel algorithm • cactus-stack management

  21. Implementation of work-stealing. Hypothesis : a sequential schedule is valid + non-preemptive execution of ready tasks. [Figure : processor P runs f1() { … fork f2 ; … } on its stack ; an idle processor P’ steals the forked task f2.] • Interest : statically fine grain, but dynamic control • Drawback : possible overhead of the parallel algorithm [e.g. prefix computation]

  22. Cost model : with high probability, on p identical processors, - execution time Tp ≤ W1/p + O(W∞) - number of steal requests = O(p·W∞)

  23. Experimentation : knary benchmark. Distributed architecture : iCluster (Athapascan). SMP architecture : Origin 3800, 32 procs (Cilk / Athapascan). Ts = 2397 s, T1 = 2435 s.

  24. How to obtain an efficient fine-grain algorithm ? • Hypotheses for efficiency of work-stealing : • the parallel algorithm is « work-optimal » • T∞ is very small (recursive parallelism) • Problem : • fine-grain (T∞ small) parallel algorithms may involve a large overhead with respect to an efficient sequential algorithm : • overhead due to parallelism creation and synchronization • but also arithmetic overhead

  25. Work-stealing and adaptability • Work-stealing ensures the allocation of processors to tasks transparently to the application, with provable performance • Support for the addition of new resources • Support for resilience of resources and fault-tolerance (crash faults, network, …) • checkpoint/restart mechanisms with provable performance [Porch, Kaapi, …] • “Baroque hybrid” adaptation : there is an - implicit - dynamic choice between two algorithms • a sequential (local) algorithm : depth-first (default choice) • a parallel algorithm : breadth-first • The choice is performed at runtime, depending on resource idleness • Well suited to applications where a fine-grain parallel algorithm is also a good sequential algorithm [Cilk] : • parallel Divide&Conquer computations • tree searching, Branch&X, … -> suited when both the sequential and the parallel algorithm perform (almost) the same number of operations
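
To make the implicit choice concrete, a small illustrative C++ sketch (not from the slides) of a fine-grain divide-and-conquer reduction whose sequential, depth-first execution is itself an efficient sequential algorithm, which is exactly the property required above. Unless a branch is stolen, everything runs as ordinary nested function calls (the serial elision); when a steal happens, the stolen branch proceeds breadth-first on the thief.

#include <cstddef>

// Fine-grain divide-and-conquer sum over x[0..n).  Executed sequentially,
// this is a plain depth-first reduction; under work stealing, one of the
// two recursive branches may be taken by an idle processor while the
// owner continues with the other.
static double dc_sum(const double* x, std::size_t n) {
    if (n <= 256) {                            // sequential grain at the leaves
        double s = 0.0;
        for (std::size_t i = 0; i < n; ++i) s += x[i];
        return s;
    }
    std::size_t h = n / 2;
    double left  = dc_sum(x, h);               // branch a thief could execute
    double right = dc_sum(x + h, n - h);       // branch executed locally
    return left + right;
}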

  26. Self-adaptive grain algorithm • Principle : save the parallelism overhead by favoring a sequential algorithm : => use the parallel algorithm only if a processor becomes idle, by extracting parallelism from a sequential computation • Hypothesis : two algorithms : • - 1 sequential : SeqCompute • - 1 parallel : LastPartComputation => at any time, it is possible to extract parallelism from the remaining computations of the sequential algorithm [Figure : SeqCompute proceeds sequentially ; Extract_par splits off a LastPartComputation when a steal occurs.]
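
A minimal C++ sketch of this coupling (hypothetical types, not the Kaapi interface; the names mirror the extractSeq/extractPar calls that appear in the code on slides 39-41): the owner runs its sequential loop through extract_seq, taking a small piece of the remaining range at a time, while a thief's LastPartComputation calls extract_par, which hands over the second half of whatever is left.

#include <cstddef>
#include <mutex>

struct WorkDesc {
    std::size_t begin = 0, end = 0;   // remaining iteration range [begin, end)
    std::mutex m;

    // Owner side ("nano-loop"): grab the next iteration, or fail if empty.
    bool extract_seq(std::size_t& i) {
        std::lock_guard<std::mutex> g(m);
        if (begin >= end) return false;
        i = begin++;                           // here: one iteration at a time
        return true;
    }

    // Thief side: split off the *last* half of the remaining range.
    bool extract_par(WorkDesc& stolen) {
        std::lock_guard<std::mutex> g(m);
        std::size_t remaining = end - begin;
        if (remaining < 2) return false;       // not worth splitting
        std::size_t mid = begin + remaining / 2;
        stolen.begin = mid;                    // thief gets [mid, end)
        stolen.end   = end;
        end = mid;                             // owner keeps [begin, mid)
        return true;
    }
};

// Owner: the sequential algorithm over its (possibly shrinking) range.
template <class F> void seq_compute(WorkDesc& w, F f) {
    std::size_t i;
    while (w.extract_seq(i)) f(i);
}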

  27. Generic self-adaptive grain algorithm

  28. Parallel algorithms and scheduling: adaptive parallel programming and applications • Contents • I. Motivations for adaptation and examples. • II. Off-line scheduling and adaptation : moldable/malleable [Denis ?] • III. On-line work-stealing scheduling and parallel adaptation • IV. A first example : iterative product ; application to gzip • V. Processor-oblivious parallel prefix computation • VI. Adaptation to time-constraints : oct-tree computation • VII. Bi-criteria latency/bandwidth [Bruno / Jean-Denis] • VIII. Adaptation for fault tolerance • Conclusion

  29. How to get both optimal work W1 and small depth W∞ ? • General approach : mix • a sequential algorithm with optimal work W1 • and a fine-grain parallel algorithm with minimal critical time W∞ • Folk technique : parallel, then sequential • parallel algorithm down to a certain « grain » ; then use the sequential one • drawback : W∞ increases ;o) … and so does the number of steals • Work-preserving speed-up technique [Bini-Pan 94] : sequential, then parallel. Cascading [Jaja 92] : careful interplay of both algorithms to build one with both W∞ small and W1 = O( Wseq ) • use the work-optimal sequential algorithm to reduce the size • then use the time-optimal parallel algorithm to decrease the time • drawback : sequential at coarse grain and parallel at fine grain ;o(

  30. Illustration : f(i), i=1..100. SeqComp(w) on CPU=A computes f(1) ; remaining work W=2..100 ; LastPart(w) is available for extraction.

  31. Illustration : f(i), i=1..100. SeqComp(w) on CPU=A has computed f(1);f(2) ; remaining work W=3..100 ; LastPart(w).

  32. Illustration : f(i), i=1..100. LastPart(w) is stolen and runs on CPU=B ; SeqComp(w) on CPU=A : f(1);f(2) ; W=3..100.

  33. Illustration : f(i), i=1..100. LastPart(w) on CPU=B splits the remaining work : W=3..51 stays with SeqComp(w) on CPU=A (f(1);f(2) done), W’=52..100 goes to SeqComp(w’) ; LastPart(w) and LastPart(w’) remain extractable.

  34. Illustration : f(i), i=1..100. SeqComp(w) on CPU=A (f(1);f(2) done) with W=3..51 and LastPart(w) ; SeqComp(w’) with W’=52..100 and LastPart(w’).

  35. Illustration : f(i), i=1..100. SeqComp(w) on CPU=A (f(1);f(2) done) with W=3..51 and LastPart(w) ; SeqComp(w’) on CPU=B computes f(52), W’=53..100, LastPart(w’).

  36. Experimentation : parallel <=> adaptive. Iterated product : sequential, parallel, adaptive [Davide Vernizzi] • Sequential : • Input : an array of n values • Output : • C/C++ code : for (i=0; i<n; i++) res += atoi(x[i]); • Parallel algorithm : • recursive computation by blocks (binary tree with merge) • block size = pagesize • Kaapi code : Athapascan API
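
A plain C++ sketch of the parallel algorithm just described, i.e. a recursive computation by blocks forming a binary tree with a merge at each inner node (illustrative only: the Kaapi/Athapascan version used in the experiment follows on the next slides, and std::async stands in here for the runtime's fork).

#include <cstddef>
#include <cstdlib>
#include <future>

// Binary-tree reduction: leaves sum one block sequentially, inner nodes
// fork the left half asynchronously and merge the two partial results.
static long block_sum(char** x, std::size_t lo, std::size_t hi, std::size_t block) {
    if (hi - lo <= block) {                    // leaf: one block, sequential
        long res = 0;
        for (std::size_t i = lo; i < hi; ++i) res += std::atoi(x[i]);
        return res;
    }
    std::size_t mid = lo + (hi - lo) / 2;
    auto left = std::async(std::launch::async, block_sum, x, lo, mid, block);
    long right = block_sum(x, mid, hi, block);
    return left.get() + right;                 // merge
}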

  37. Experimentation : - the parallel algorithm costs about twice as much as the sequential algorithm - the adaptive algorithm has an efficiency close to 1. Variant : sum of pages • Input : a set of n pages ; each page is an array of values • Output : one page where each element is the sum of the elements with the same index in the input pages • C/C++ code : for (i=0; i<n; i++) for (j=0; j<pageSize; j++) res [j] += f (pages[i][j]);

  38. Demonstration on ensibull. Script :

[vernizzd@ensibull demo]$ more go-tout.sh
#!/bin/sh
./spg /tmp/data &
./ppg /tmp/data 1 --a1 -thread.poolsize 3 &
./apg /tmp/data 1 --a1 -thread.poolsize 3 &

Result :

[vernizzd@ensibull demo]$ ./go-tout.sh
Page size: 4096
Memory allocated
Memory allocated
0:In main: th = 1, parallel
0: -----------------------------------------
0: res = -2.048e+07
0: time = 0.408178 s        ADAPTIVE (3 procs)
0: Threads created: 54
0: -----------------------------------------
0: res = -2.048e+07
0: time = 0.964014 s        PARALLEL (3 procs)
0: #fork = 7497
0: -----------------------------------------
 : res = -2.048e+07
 : time = 1.15204 s         SEQUENTIAL (1 proc)
 : -----------------------------------------

  39. Adaptive algorithm (1/3) • Hypothesis : non-preemptive scheduling, of work-stealing type • Coupling of the sequential algorithm with the adaptive extraction :

void Adaptative (a1::Shared_w<Page> *resLocal, DescWork dw) {
  // fork the extraction task that will serve steal requests on the remaining work
  a1::Shared<Page> resLPC;
  a1::Fork<LPC>() (resLPC, dw);
  // run the sequential algorithm on whatever work is not stolen
  Page resSeq (pageSize);
  AdaptSeq (dw, &resSeq);
  // merge the result of the extracted (parallel) part with the sequential result
  a1::Fork<Merge> () (resLPC, *resLocal, resSeq);
}

  40. Adaptive algorithm (2/3) • Sequential side :

void AdaptSeq (DescWork dw, Page *resSeq) {
  DescLocalWork w;
  Page resLoc (pageSize);
  double k;
  // nano-loop: extract and process the remaining sequential work piece by piece
  while (!dw.desc->extractSeq(&w)) {
    // accumulate input page w (from the global input buffer buff) into the local result
    for (int i=0; i<pageSize; i++) {
      k = resLoc.get (i) + (double) buff[w*pageSize+i];
      resLoc.put (i, k);
    }
  }
  *resSeq = resLoc;
}

  41. Adaptive algorithm (3/3) • Extraction side = parallel algorithm :

struct LPC {
  void operator () (a1::Shared_w<Page> resLPC, DescWork dw) {
    DescWork dw2;
    dw2.Allocate ();
    dw2.desc->l.initialize ();
    // try to split off the last part of the remaining work into dw2
    if (dw.desc->extractPar (&dw2)) {
      // recursively apply the adaptive algorithm to the extracted part ...
      a1::Shared<Page> res2;
      a1::Fork<AdaptativeMain>() (res2, dw2.desc->i, dw2.desc->j);
      // ... keep a LastPartComputation alive for the part that remains ...
      a1::Shared<Page> resLPCold;
      a1::Fork<LPC>() (resLPCold, dw);
      // ... and merge both results into resLPC
      a1::Fork<MergeLPC>() (resLPCold, res2, resLPC);
    }
  }
};
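
The Merge and MergeLPC task bodies are not shown on these slides; for the page-sum variant they presumably combine two partial result pages element-wise. A purely hypothetical sketch, using only the Page get/put accessors that appear in AdaptSeq above:

// Hypothetical merge body (not from the slides): accumulate one partial
// result page into another, element by element.
void merge_pages (Page& partial, Page *dest) {
  for (int i=0; i<pageSize; i++)
    dest->put (i, dest->get (i) + partial.get (i));
}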

  42. Application : gzip parallelization • Gzip : • widely used (web) and costly, although of linear complexity • source code : 10000 lines of C, complex data structures • principle : LZ77 + Huffman tree • Why gzip ? • a P-complete problem, but a practical parallelization is possible, similar to the iterated product • drawback : every (known) parallelization induces an overhead • -> loss of compression ratio

  43. How to parallelize gzip ? Input file -> partition into blocks (static or dynamic) -> parallel compression (vs. on-the-fly compression) -> compressed blocks -> compressed output file. An « easy » parallelization, 100% compatible with gzip/gunzip. Problems : loss of compression ratio, the grain depends on the machine, overhead.
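
To illustrate the block-wise scheme (this is not the actual gzip code), a C++ sketch using zlib's compress2. Note that compress2 produces zlib-framed deflate data rather than the real gzip container, so this only shows the structure; and compressing blocks independently is precisely what costs compression ratio, since the LZ77 window cannot reference earlier blocks.

#include <algorithm>
#include <cstddef>
#include <vector>
#include <zlib.h>

// Compress each block independently; under the adaptive scheme these calls
// would be the parallel tasks extracted from the on-the-fly compressor.
static std::vector<std::vector<unsigned char>>
compress_blocks(const unsigned char* data, std::size_t size, std::size_t block) {
    std::vector<std::vector<unsigned char>> out;
    for (std::size_t off = 0; off < size; off += block) {
        uLong in_len = static_cast<uLong>(std::min(block, size - off));
        uLongf out_len = compressBound(in_len);
        std::vector<unsigned char> buf(out_len);
        if (compress2(buf.data(), &out_len, data + off, in_len, Z_BEST_COMPRESSION) == Z_OK)
            buf.resize(out_len);
        else
            buf.clear();                      // compression failed for this block
        out.push_back(std::move(buf));
    }
    return out;
}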

  44. Adaptive-grain gzip parallelization. [Figure : SeqComp compresses the input file on the fly ; LastPartComputation dynamically partitions the remaining input into blocks for parallel compression ; the compressed blocks and the on-the-fly output are concatenated (cat) into the compressed output file.]

  45. Performance on SMP : Pentium 4x200 MHz

  46. Performance on a distributed architecture. Distributed search in 2 directories of the same size, each on a remote disk (NFS) • Sequential : Pentium 4x200 MHz • SMP : Pentium 4x200 MHz • Distributed architecture, Myrinet : Pentium 4x200 MHz + 2x333 MHz

  47. Overhead in compressed-file size ; gain in time.

  48. Performance : Pentium 4x200 MHz
