
1. Performing Tasks in Asynchronous Environments
Dariusz Kowalski, University of Connecticut & Warsaw University
Joint work with Alex Shvartsman, University of Connecticut & MIT

2. Do-All problem ([DHW] et al.)
The DA(p,t) problem abstracts the basic problem of cooperation in a distributed setting: p processors must perform t tasks, and at least one processor must know about it [Dwork Halpern Waarts 92/98].
Tasks are:
• known to every processor
• similar - each takes a similar number of local steps
• independent - may be performed in any order
• idempotent - may be performed concurrently

3. Do-All: synchronous model with crashes
Model: processors are synchronous and may fail by crashing.
Solutions: the problem is well understood; results are close to optimal.
• Shared-memory model -- communication by read/write
  • Kanellakis, P.C., Shvartsman, A.A.: Fault-tolerant parallel computation. Kluwer Academic Publishers (1997)
• Message-passing model -- communication by exchanging messages
  • Dwork, C., Halpern, J., Waarts, O.: Performing work efficiently in the presence of faults. SIAM Journal on Computing, 27 (1998)
  • De Prisco, R., Mayer, A., Yung, M.: Time-optimal message-efficient work performance in the presence of faults. Proc. of 13th PODC (1994)
  • Chlebus, B., De Prisco, R., Shvartsman, A.A.: Performing tasks on synchronous restartable message-passing processors. Distributed Computing, 14 (2001)

4. Do-All: asynchronous models
Models:
• Shared-memory model -- communication by read/write -- widely studied, but solutions far from optimal
  • Kanellakis, P.C., Shvartsman, A.A.: Fault-tolerant parallel computation. Kluwer Academic Publishers (1997)
  • Anderson, R.J., Woll, H.: Algorithms for the certified Write-All problem. SIAM Journal on Computing, 26 (1997)
  • Kedem, Z., Palem, K., Raghunathan, A., Spirakis, P.: Combining tentative and definite executions for very fast dependable parallel computing. Proc. of 23rd STOC (1991)
• Message-passing model -- communication by exchanging messages -- no interesting solutions until recently

5. Shared-Memory vs. Message-Passing
Shared-Memory (atomic registers):
• processors communicate by read/write in shared memory
• atomicity - guarantees that a read outputs the last written value
• one read/write operation per local clock cycle
• information propagates and information is persistent
Hence cooperation is always possible, although possibly delayed. Here processor scheduling is the major challenge.
Message-Passing:
• processors communicate by exchanging messages
• duration of a local step may be unbounded
• message delays may be unbounded
• information may not propagate -- send/recv depend on delay

6. Message-delay-sensitive approach
Even if message delays are bounded by d (the d-adversary), cooperation may be difficult.
Observation: If d = Ω(t), then work must be Ω(t·p): the adversary can withhold all messages until every processor has had to perform essentially all t tasks on its own.
This means that cooperation is difficult, and addressing scheduling alone is not enough -- algorithm design and analysis must be d-sensitive.
Message-delay-sensitive approach:
• Dwork, C., Lynch, N., Stockmeyer, L.: Consensus in the presence of partial synchrony. J. of the ACM, 35 (1988)

7. Measures of efficiency
Termination time: the first time when all tasks are done and at least one processor knows about it.
• Used only to define work and message complexity
• Not interesting on its own: if all processors but one are delayed, then time is trivially Ω(t)
Work: measures the sum, over all processors, of the number of local steps taken until termination time.
Message complexity (message-passing model): measures the number of all point-to-point messages sent until termination time.

8. Structure of the presentation
Part 1: Shared-memory model
• Model and bibliography
• Improving the AW algorithm in shared memory by better scheduling of processors (task load-balancing)
Part 2: Message-passing model
• Model: asynchrony, message delay, and modeling issues
• Delay-sensitive lower bounds for Do-All
• Progress-tree Do-All algorithms
  • Simulating shared memory and Anderson-Woll (AW)
  • Asynchronous message-passing progress-tree algorithm
• Permutation Do-All algorithms

9. Shared-Memory - model and goal
We consider the following model:
• p asynchronous processors with PIDs in {0,…,p-1}
• processors communicate by read/write in shared memory
• atomicity - a read outputs the last written value
• one read/write operation per local clock cycle
Write-All: write 1's into the t locations of a given array.
Goal: improve the scheduling of cooperating asynchronous processors, leading to better load-balancing with respect to tasks.

10. Write-All: Selected Bibliography
Introducing the Write-All problem:
• Kanellakis, P.C., Shvartsman, A.A.: Efficient parallel algorithms can be made robust. PODC (1989), Distributed Computing (1992)
AW algorithm with work O(t·p^ε):
• Anderson, R.J., Woll, H.: Algorithms for the certified Write-All problem. SIAM Journal on Computing, 26 (1997)
Randomized algorithm with work O(t + p log p):
• Martel, C., Subramonian, R.: On the complexity of Certified Write-All algorithms. J. Algorithms 16 (1994)
First work-optimal deterministic algorithm, for t = Ω(p⁴ log p):
• Malewicz, G.: A work-optimal deterministic algorithm for the asynchronous Certified Write-All problem. PODC (2003)

11. Progress tree algorithms [BKRS, AW]
• Shared memory
• p processors, t tasks (p = t)
• q permutations of [q]
• q-ary progress tree of depth log_q p
• nodes are binary completion bits
• permutations establish the order in which the children of a node are visited
• p processors traverse the tree and use the q-ary expansion of their PID to choose permutations
• [Anderson Woll]
(Figure: a q-ary progress tree; each internal node has children 1, 2, 3, …, q.)

12. Algorithm AWT [Anderson Woll]
• The progress tree data structure is stored in shared memory.
Example: p = t = 9, q = 3
Ψ : list of 3 schedules from S₃
T : ternary tree of 9 leaves (progress tree), with 0-1 values
PID(j) : j-th digit of the ternary representation of PID (e.g. 7 = 21₃)
(Figure: the ternary progress tree over tasks 1-9; processors are grouped as PID = 0,3,6 / 1,4,7 / 2,5,8 according to a digit of their ternary PID, each group starting with the same schedule.)
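The traversal can be made concrete with a short Python sketch. This is not the authors' code: the shared-memory progress tree is stood in for by a plain dictionary `done`, the schedules `psi` are an arbitrary example, and the processors run sequentially as a stand-in for one possible asynchronous execution.

```python
q = 3
depth = 2                    # ternary tree with 9 leaves
t = q ** depth               # one task per leaf; p = t here
psi = [[2, 0, 1], [0, 1, 2], [1, 2, 0]]   # 3 example schedules from S_3

done = {}                    # completion bit per node; a node is a digit path

def qary_digits(pid, width):
    """q-ary expansion of PID, most significant digit first (7 -> [2, 1])."""
    digits = []
    for _ in range(width):
        digits.append(pid % q)
        pid //= q
    return digits[::-1]

def traverse(pid, node=(), level=0):
    """Processor pid visits the subtree at node, performing tasks at
    not-yet-completed leaves; children are visited in the order given by
    the schedule selected by the level-th digit of pid."""
    if done.get(node):
        return                                   # subtree already complete
    if level == depth:                           # leaf: perform its task
        leaf = sum(d * q ** (depth - 1 - i) for i, d in enumerate(node))
        print(f"processor {pid} performs task {leaf}")
    else:
        perm = psi[qary_digits(pid, depth)[level]]
        for child in perm:
            traverse(pid, node + (child,), level + 1)
    done[node] = 1                               # record progress bottom-up

for pid in range(t):         # sequential stand-in for asynchronous runs
    traverse(pid)
```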

13. Contention of permutations
Sₙ - the group of all permutations on the set [n], with composition ∘ and identity idₙ
σ, π - permutations in Sₙ; Φ - a set of q permutations from Sₙ
• i is an lrm (left-to-right maximum) in π if π(i) > max_{j<i} π(j)
• LRM(π) - the number of lrm's in π [Knuth]
• Cont(Φ, σ) = Σ_{π∈Φ} LRM(σ⁻¹ ∘ π)
• Contention of Φ: Cont(Φ) = max_σ Cont(Φ, σ) [AW]
Example: the permutation 3 5 2 4 6 1 9 7 8 11 10 has 5 lrm's: 3, 5, 6, 9, 11.
Theorem [AW]: For any n > 0 there exists a set Φ of n permutations from Sₙ with Cont(Φ) ≤ 3nHₙ = Θ(n log n).
[Knuth] Knuth, D.E.: The art of computer programming, Vol. 3 (third edition). Addison-Wesley (1998)
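A minimal Python sketch of these definitions (helper names such as `lrm_count` and `contention` are mine; permutations are 0-indexed lists, and the brute-force maximum over all σ is exponential, for demonstration only):

```python
from itertools import permutations

def lrm_count(pi):
    """Number of left-to-right maxima in permutation pi."""
    best, count = -1, 0
    for x in pi:
        if x > best:
            best, count = x, count + 1
    return count

def inverse(sigma):
    """sigma^{-1}: inv[x] = i whenever sigma[i] = x."""
    inv = [0] * len(sigma)
    for i, x in enumerate(sigma):
        inv[x] = i
    return inv

def compose(sigma_inv, pi):
    """sigma^{-1} o pi as a list: position i holds sigma_inv[pi[i]]."""
    return [sigma_inv[x] for x in pi]

def contention(phi, n):
    """Cont(phi) = max over sigma in S_n of sum of LRM(sigma^{-1} o pi)."""
    return max(sum(lrm_count(compose(inverse(list(sigma)), pi)) for pi in phi)
               for sigma in permutations(range(n)))

# The slide's example, written 0-indexed:
print(lrm_count([2, 4, 1, 3, 5, 0, 8, 6, 7, 10, 9]))   # 5
```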

14. Procedure "Oblivious Do"
n - number of jobs and units
Ψ - list of n schedules from Sₙ
Procedure Oblivious:
  for all processors PID = 0 to n-1 in parallel:
    for i = 1 to n do: perform Job(π_PID(i))
An execution of Job(π_PID(i)) by processor PID is primary if job π_PID(i) has not been previously performed. (See the sketch below.)
Lemma [AW]: In algorithm Oblivious with n units, n jobs, and using the list Ψ of n permutations from Sₙ, the number of primary job executions is at most Cont(Ψ).
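The lemma can be sanity-checked with a small simulation. The sketch below is illustrative only: lock-step rounds stand in for one particular asynchronous schedule, whereas the lemma bounds the worst case over all schedules.

```python
def oblivious_primary_count(psi):
    """Run all n processors in lock-step rounds; in round i, processor
    PID performs job psi[PID][i]. An execution is primary if that job
    was not performed in any earlier round."""
    n = len(psi)
    performed = set()
    primary = 0
    for i in range(n):
        batch = [psi[pid][i] for pid in range(n)]   # round i executions
        for job in batch:
            if job not in performed:                # fresh job: primary
                primary += 1                        # (duplicates in one
        performed.update(batch)                     #  round all count)
    return primary

print(oblivious_primary_count([[0, 1, 2], [1, 2, 0], [2, 0, 1]]))  # 3
```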

15. AWT(q) - new progress tree traversal algorithm
• Instead of using q permutations on the set [q], we use q permutations on the set [n], where n = q² log q.
Example: p = 6, t = 16, q = 2, n = 4
Ψ : list of 2 schedules from S₄
T : 4-ary tree of 16 leaves (progress tree), with 0-1 values
PID(j) : j-th digit of the q-ary (here binary) representation of PID
(Figure: the 4-ary progress tree over tasks 1-16; even PIDs use one schedule from Ψ and odd PIDs the other.)

16. Main result
• Set n = q² log q and let Ψ be a list of q schedules from Sₙ.
• Define Cont(Ψ, Π) = max_{σ∈Π} Cont(Ψ, σ).
Lemma: For sufficiently large q and any set Π of at most exp(q² log² q) permutations on the set [q² log q], there is a list Ψ of q schedules from Sₙ such that Cont(Ψ, Π) ≤ q² log q + 6q log q.
• Take q = log p and apply the above Lemma.
Theorem: For every ε > 0, sufficiently large p, and t = Ω(p^{2+ε}), algorithm AWT(q) performs work O(t).

17. Message-Passing - model and goals
We consider the following model:
• p asynchronous processors with PIDs in {0,…,p-1}
• processors communicate by message passing
• in one local step each processor can send a message to any subset of processors
• messages incur delays between send and receive
• processing of all received messages can be done within one local step
Goal: understand the impact of message delay on the efficiency of algorithmic solutions for Do-All.

18. Lower bound - randomized algorithms
Theorem: Any randomized algorithm solving DA with t tasks using p asynchronous message-passing processors performs expected work Ω(t + pd·log_{d+1} t) against the d-adversary.
Proof (sketch): The adversary partitions the computation into stages, each containing d time units, and constructs the delay pattern stage by stage:
• it delays all messages sent in a stage to be received at the end of that stage
• it delays a linear number of processors (those that want to perform more than a (1-1/(3d)) fraction of the undone tasks) during the stage
• the selection is made on-line and has the required properties with high probability

19. Simulating shared-memory algorithms
Write-All algorithm AWT:
• Anderson, R.J., Woll, H.: Algorithms for the certified Write-All problem. SIAM Journal on Computing, 26 (1997)
Quorum systems & atomic memory services:
• Attiya, H., Bar-Noy, A., Dolev, D.: Sharing memory robustly in message passing systems. J. of the ACM, 42 (1996)
• Lynch, N., Shvartsman, A.: RAMBO: A Reconfigurable Atomic Memory Service. Proc. of 16th DISC (2002)
Emulating asynchronous shared-memory algorithms:
• Momenzadeh, M.: Emulating shared-memory Do-All in asynchronous message passing systems. Masters Thesis, CSE, University of Connecticut (2003)

20. Atomic memory is not required
• We use q-ary progress trees as the main data structure that is "written" and "read" -- note that atomicity is not required.
• If two writes of the entire tree occur, a subsequent read may obtain a third value that was never written.
(Figure: starting from tree 000, two writes and a read; the read may return a mixture of the two written trees.)
• Property of monotone progress:
  • 1 at a tree node i indicates that all tasks attached to the leaves of the sub-tree rooted at i have been performed
  • if 1 is written at a node i in the progress tree of a processor, it remains 1 forever
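This monotonicity is what makes atomicity unnecessary: merging any stale or partially updated view by a pointwise OR can only move bits from 0 to 1, and never undoes recorded progress. A minimal sketch, assuming trees are stored as flat 0/1 lists (an illustrative encoding):

```python
def merge_progress(local, received):
    """Pointwise OR of two progress trees: monotone, idempotent, and
    insensitive to message ordering, so non-atomic reads are harmless."""
    return [a | b for a, b in zip(local, received)]

mine   = [0, 1, 1, 0, 0, 1, 0]
theirs = [1, 0, 1, 0, 1, 0, 0]
print(merge_progress(mine, theirs))   # [1, 1, 1, 0, 1, 1, 0]
```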

21. Algorithm DAq - traverse the progress tree
• Instead of using shared memory, processors broadcast their progress trees as soon as local progress is recorded.
Example (as for AWT): p = t = 9, q = 3
Ψ : list of 3 schedules from S₃
T : ternary tree of 9 leaves (progress tree), with 0-1 values
PID(j) : j-th digit of the ternary representation of PID (e.g. 7 = 21₃)
(Figure: the same ternary progress tree as on slide 12, with processors grouped as PID = 0,3,6 / 1,4,7 / 2,5,8.)

22. Algorithm DAq - case p ≥ t

23. Procedure DOWORK

24. Algorithm DAq - analysis
Modification of algorithm DAq for p < t (see the sketch below):
• We partition the t tasks into p jobs of size ⌈t/p⌉ and let algorithm DAq work with these jobs.
• It takes a processor O(t/p) work (instead of constant) to process such a job (job unit).
• In each step a processor broadcasts at most one message to the p-1 other processors. We obtain:
Theorem 4: For any constant ε > 0 there is a constant q such that algorithm DAq has work W(p,t,d) = O(t·p^ε + p·d·⌈t/d⌉^ε) and message complexity O(p·W(p,t,d)) against any d-adversary (d = o(t)).
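A sketch of the task-to-job grouping assumed in the first bullet (the helper `make_jobs` is illustrative, not from the paper):

```python
def make_jobs(t, p):
    """Partition tasks 0..t-1 into at most p jobs of size <= ceil(t/p)."""
    size = -(-t // p)                       # ceiling division
    return [list(range(i, min(i + size, t))) for i in range(0, t, size)]

print(make_jobs(11, 3))   # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10]]
```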

25. Permutation algorithms - case p ≤ t
Algorithms proceed in a loop:
• select the next task using an ORDER+SELECT rule
• perform the selected task
• send messages, receive messages, and update local state
ORDER+SELECT rules (sketched in code below):
• PARAN1: initially processor PID permutes the tasks randomly; PID selects the first task remaining on its schedule
• PARAN2: no initial order; PID selects a task from the remaining set at random
• PADET: initially processor PID chooses schedule π_PID in Ψ; PID selects the first task remaining on schedule π_PID
Ψ - a list of p schedules from S_t
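A sketch of the three selection rules (illustrative helpers, not the paper's code; each processor is assumed to keep a nonempty local set `remaining` of tasks it has not yet learned are done):

```python
import random

def select_paran1(schedule, remaining):
    """PARAN1: schedule is PID's initial random permutation of the tasks;
    pick the first task on it that is still remaining."""
    return next(task for task in schedule if task in remaining)

def select_paran2(remaining):
    """PARAN2: no precomputed order; pick a remaining task at random."""
    return random.choice(sorted(remaining))

def select_padet(schedule, remaining):
    """PADET: schedule is the fixed permutation pi_PID from Psi;
    the selection itself is the same first-remaining rule as PARAN1."""
    return next(task for task in schedule if task in remaining)

remaining = {2, 5, 7}
print(select_paran1([4, 7, 1, 2, 5, 0, 3, 6], remaining))   # 7
```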

26. d-Contention of permutations
We introduce the notion of d-Contention:
• i is a d-lrm in π if |{j < i : π(i) < π(j)}| < d
• LRM_d(π) - the number of d-lrm's in π
• Cont_d(Ψ, σ) = Σ_{π∈Ψ} LRM_d(σ⁻¹ ∘ π)
• d-Contention of Ψ: Cont_d(Ψ) = max_σ Cont_d(Ψ, σ)
Example (d = 2): in the permutation 3 5 2 4 6 1 9 7 8 11 10, every element except 2 and 1 is a 2-lrm.
Theorem: For sufficiently large p and n, there is a list Ψ of p permutations from Sₙ such that, for every integer d > 1, Cont_d(Ψ) ≤ n log n + 5pd·ln(e + n/d). Moreover, a random Ψ is good with high probability.
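A short sketch of the d-lrm count (helper name is mine), checked against the slide's example permutation written 0-indexed:

```python
def lrm_d_count(pi, d):
    """Number of positions i such that fewer than d earlier elements
    of pi exceed pi[i]; d = 1 gives ordinary left-to-right maxima."""
    return sum(
        sum(1 for j in range(i) if pi[j] > pi[i]) < d
        for i in range(len(pi))
    )

pi = [2, 4, 1, 3, 5, 0, 8, 6, 7, 10, 9]   # the slide's 3 5 2 4 6 1 9 7 8 11 10
print(lrm_d_count(pi, 1))   # 5 : the ordinary lrm's
print(lrm_d_count(pi, 2))   # 9 : all positions except those of 2 and 1
```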

27. d-Contention and work
Lemma: For algorithms PADET and PARAN1, the respective worst-case work and expected work is at most Cont_d(Ψ) against any d-adversary.
Example: p = 2, t = 11, d = 2; order of tasks to perform: 1,2,3,4,5,6,7,8,9,10,11.
(Figure: the interleaved task executions of the two processors under delay d = 2, with the primary executions marked.)

28. Permutation algorithms - results
Theorem: Randomized algorithms PARAN1 and PARAN2 perform expected work O(t·log p + pd·log(t/d)) and have expected communication O(tp·log p + p²d·log(t/d)) against any d-adversary (d = o(t)).
Corollary: There exists a deterministic list of schedules Ψ such that algorithm PADET performs work O(t·log p + p·min{t,d}·log(2 + t/d)) and has communication O(tp·log p + p²·min{t,d}·log(2 + t/d)) when p ≤ t.

29. Conclusions and open problems
• Work-optimal Write-All algorithm for t = Ω(p^{2+ε})
• First message-delay-sensitive analysis of the Do-All problem for asynchronous processors in the message-passing model:
  • lower bounds for deterministic and randomized algorithms
  • deterministic and randomized algorithms with subquadratic (in p and t) work for any message delay d, as long as d = o(t)
• Among the interesting open questions are:
  • is there work-optimal scheduling for t = Θ(p log p)?
  • for algorithm PADET: how to construct the list Ψ of permutations efficiently?
  • closing the gap between the upper and the lower bounds
  • investigating algorithms that simultaneously control work and message complexity
