
Persistent Linda 3.0 Peter Wyckoff New York University

Presentation Transcript


  1. Persistent Linda 3.0 • Peter Wyckoff New York University

  2. Roadmap • Goals • Model • Simple example • Transactions • Tuning transactions • Checkpointing • Experiments • Summary

  3. Goals • Utilize networks of workstations for parallel applications • Low-cost fault tolerance mechanisms for parallel applications with state • Provably good algorithms for minimizing fault tolerance overhead • A robust system with “good” performance on real applications

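  4. Example: Matrix Multiplication (sample program)

         void Multiply(float A[size][size], float B[size][size], float C[size][size])
         {
             /* Start the helpers, then publish B and one task tuple per row of A. */
             for (int I = 0; I < numHelpers; I++)
                 create_helper("helper");
             output("B Matrix", B);
             for (int I = 0; I < size; I++)
                 output("Row of A", I, A[I]);
             /* Collect the result rows; ?C[I] is a formal filled in by the match. */
             for (int I = 0; I < size; I++)
                 input("Row of C", I, ?C[I]);
         }

         void helper()
         {
             float B[size][size];
             float A[size], C[size];
             int I;
             read("B Matrix", ?B);          /* non-destructive: every helper sees B */
             while (!Done) {
                 input("Row of A", ?I, ?A); /* take one task tuple */
                 mult(A, B, C);
                 output("Row of C", I, C);
             }
         }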
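The master/helper pattern on this slide can be tried out with a toy, in-memory tuple space. The following is only a sketch in Python rather than the slides' C-like PLinda notation: the `TupleSpace` class, the `ANY` wildcard (standing in for `?x` formals), the use of threads as helpers, and the `None` stop marker that replaces the undefined `Done` flag are all assumptions made for illustration, not part of PLinda.

```python
import threading

ANY = object()  # hypothetical wildcard, plays the role of a "?x" formal


class TupleSpace:
    """Toy in-memory tuple space (illustration only, not the PLinda server)."""

    def __init__(self):
        self._tuples = []
        self._cond = threading.Condition()

    def output(self, *tup):
        """Add a tuple and wake up any blocked readers."""
        with self._cond:
            self._tuples.append(tup)
            self._cond.notify_all()

    def _find(self, pattern):
        for tup in self._tuples:
            if len(tup) == len(pattern) and all(
                p is ANY or p == v for p, v in zip(pattern, tup)
            ):
                return tup
        return None

    def input(self, *pattern):
        """Block until a tuple matches, then remove and return it."""
        with self._cond:
            while (tup := self._find(pattern)) is None:
                self._cond.wait()
            self._tuples.remove(tup)
            return tup

    def read(self, *pattern):
        """Block until a tuple matches; return it without removing it."""
        with self._cond:
            while (tup := self._find(pattern)) is None:
                self._cond.wait()
            return tup


def helper(ts):
    _, b = ts.read("B Matrix", ANY)
    while True:
        _, i, row = ts.input("Row of A", ANY, ANY)
        if i is None:  # hypothetical stop marker, not in the slides
            return
        c_row = [sum(a * b[k][j] for k, a in enumerate(row))
                 for j in range(len(b[0]))]
        ts.output("Row of C", i, c_row)


def multiply(a, b, num_helpers=2):
    ts = TupleSpace()
    workers = [threading.Thread(target=helper, args=(ts,))
               for _ in range(num_helpers)]
    for w in workers:
        w.start()
    ts.output("B Matrix", b)
    for i, row in enumerate(a):
        ts.output("Row of A", i, row)
    c = [ts.input("Row of C", i, ANY)[2] for i in range(len(a))]
    for _ in workers:  # one stop marker per helper
        ts.output("Row of A", None, None)
    for w in workers:
        w.join()
    return c
```

Calling `multiply([[1, 2], [3, 4]], [[5, 6], [7, 8]])` distributes the two rows of A among the helpers and assembles C from the "Row of C" tuples. Because the row index travels inside each tuple, the result does not depend on which helper handles which row.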

  5. Fault Tolerance • Problems: process resiliency; process local consistency; global consistency

  6. Solution: Transactions • Good points: well-defined behavior; all or nothing; retains global consistency; durable • Bad points: does not address local consistency; does not address process resiliency; expensive
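The "all or nothing" property can be illustrated with a very small sketch. It is written in Python, and the `Store` and `Transaction` classes are hypothetical names invented here, not PLinda's API. The idea shown is only this: tuples taken by `input` are restored on abort, and tuples produced by `output` become visible only on commit.

```python
class Store:
    """Hypothetical stand-in for the shared tuple space."""

    def __init__(self):
        self.tuples = []


class Transaction:
    """Context manager: commit on normal exit, abort on exception."""

    def __init__(self, store):
        self.store = store
        self.removed = []  # tuples taken by input(), restored on abort
        self.pending = []  # tuples from output(), published only on commit

    def __enter__(self):
        return self

    def input(self, tag):
        for tup in self.store.tuples:
            if tup[0] == tag:
                self.store.tuples.remove(tup)
                self.removed.append(tup)
                return tup
        raise LookupError(tag)

    def output(self, *tup):
        self.pending.append(tup)

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:
            self.store.tuples.extend(self.pending)  # commit
        else:
            self.store.tuples.extend(self.removed)  # abort: undo the inputs
        return False  # let the exception propagate


store = Store()
store.tuples.append(("Row of A", 0, [1, 2]))

# A helper that fails halfway leaves the tuple space unchanged.
try:
    with Transaction(store) as t:
        _, i, row = t.input("Row of A")
        t.output("Row of C", i, [x * 2 for x in row])
        raise RuntimeError("simulated helper failure")
except RuntimeError:
    pass
after_abort = list(store.tuples)

# A helper that finishes replaces the task tuple with its result.
with Transaction(store) as t:
    _, i, row = t.input("Row of A")
    t.output("Row of C", i, [x * 2 for x in row])
after_commit = list(store.tuples)
```

After the aborted attempt the "Row of A" tuple is back and no partial "Row of C" tuple is visible, so another helper can pick up the task. This covers global consistency only; the failed helper's own variables are lost, which is the gap the slide lists under "bad points".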

  7. Example: Matrix Multiplication with Traditional Transactions (sample program)

         void Multiply(float A[size][size], float B[size][size], float C[size][size])
         {
             /* Transaction 1: create helpers and publish all the work. */
             beginTransaction;
             for (int I = 0; I < numHelpers; I++)
                 create_helper("helper");
             output("B Matrix", B);
             for (int I = 0; I < size; I++)
                 output("Row of A", I, A[I]);
             endTransaction;

             /* Transaction 2: collect all the result rows. */
             beginTransaction;
             for (int I = 0; I < size; I++)
                 input("Row of C", I, ?C[I]);
             endTransaction;
         }

         void helper()
         {
             float B[size][size];
             float A[size], C[size];
             int I;
             read("B Matrix", ?B);
             while (!Done) {
                 /* One transaction per row: take a task, compute, publish. */
                 beginTransaction;
                 input("Row of A", ?I, ?A);
                 mult(A, B, C);
                 output("Row of C", I, C);
                 endTransaction;
             }
         }

  8. Continuations • Encode process state, live variables, in an architecture dependent encoding • Save it to stable store at the end of each transaction • Use the transactional semantics on this information so that it is updated only if the transaction finishes
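A rough way to picture this: the live variables are serialized and handed to the server as part of the commit, so the saved state and the transaction's tuples change together or not at all. The sketch below is in Python and rests on several assumptions: `pickle` stands in for the encoding, a dictionary stands in for stable store, and the `Server`, `end_transaction`, and `recover_if_needed` names are hypothetical (loosely modeled on the `endTransaction` and `recoverIfNeeded` calls in the sample programs).

```python
import pickle


class Server:
    """Hypothetical server holding tuples and per-process continuations."""

    def __init__(self):
        self.tuples = []
        self.continuations = {}  # process id -> encoded (tran_number, live vars)

    def end_transaction(self, pid, tran_number, live_vars, outputs):
        # Encode first; if encoding fails, nothing below is applied.
        blob = pickle.dumps((tran_number, live_vars))
        # The tuples and the continuation are updated together at commit.
        self.tuples.extend(outputs)
        self.continuations[pid] = blob

    def recover_if_needed(self, pid, initial_vars):
        blob = self.continuations.get(pid)
        if blob is None:
            return 0, initial_vars  # fresh start
        return pickle.loads(blob)   # resume after the last commit


def worker(server, pid, items, crash_after=None):
    """Sum `items`, one transaction per item, resuming after a failure."""
    tran_number, state = server.recover_if_needed(pid, {"total": 0})
    for n in range(tran_number, len(items)):
        if crash_after is not None and n == crash_after:
            raise RuntimeError("simulated process failure")
        state["total"] += items[n]
        server.end_transaction(
            pid, n + 1, state, [("partial", n, state["total"])]
        )
    return state["total"]


server = Server()
try:
    worker(server, "p1", [1, 2, 3, 4], crash_after=2)  # fails in transaction 3
except RuntimeError:
    pass
total = worker(server, "p1", [1, 2, 3, 4])  # restarted process resumes
```

The restarted worker skips the two transactions that already committed and continues from the saved `total`, so no tuple is produced twice.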

  9. Matrix Multiplication with Transactions and Continuations (sample program)

         void Multiply(float A[size][size], float B[size][size], float C[size][size])
         {
             int tranNumber = 0;
             /* After a failure, restore the transaction number and live variables. */
             recoverIfNeeded(tranNumber, A, B, C);

             beginTransaction(0);
             for (int I = 0; I < numHelpers; I++)
                 create_helper("helper");
             output("B Matrix", B);
             for (int I = 0; I < size; I++)
                 output("Row of A", I, A[I]);
             /* Commit, saving the continuation (tranNumber, A, B, C) with it. */
             endTransaction(++tranNumber, A, B, C);

             beginTransaction(1);
             for (int I = 0; I < size; I++)
                 input("Row of C", I, ?C[I]);
             endTransaction(++tranNumber, A, B, C);
         }

         void helper()
         {
             float B[size][size];
             float A[size], C[size];
             int I;
             read("B Matrix", ?B);
             while (!Done) {
                 beginTransaction;
                 input("Row of A", ?I, ?A);
                 mult(A, B, C);
                 output("Row of C", I, C);
                 endTransaction;
             }
         }

  10. Runtime System with Durable Transactions

  11. Cost of Transactions • Do not address local consistency • Do not address process resiliency • Expensive: durability is achieved by committing to stable store; continuations are sent to the server(s) over the network

  12. Lite Transactions • Do not commit to stable store • Transactions are still durable from the client's perspective • Commit is to memory at the server and therefore fast! • Periodically checkpoint the server • On server failure (which is rare), roll back
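The trade-off behind lite transactions can be sketched as follows. This is an illustrative Python model only: the `LiteServer` class and its method names are hypothetical, and a second in-memory list stands in for stable store. Commits touch only server memory; a checkpoint copies memory to the stand-in stable store; a server failure loses whatever was committed since the last checkpoint.

```python
import copy


class LiteServer:
    """Hypothetical server: commits go to memory, checkpoints to 'disk'."""

    def __init__(self):
        self.memory = []  # committed tuples, lost if the server fails
        self.stable = []  # last checkpoint (stand-in for stable store)
        self.commits_since_checkpoint = 0

    def commit(self, outputs):
        # Fast path: no stable-store write on commit.
        self.memory.extend(outputs)
        self.commits_since_checkpoint += 1

    def checkpoint(self):
        # Slow path, run periodically rather than on every commit.
        self.stable = copy.deepcopy(self.memory)
        self.commits_since_checkpoint = 0

    def crash_and_recover(self):
        """Simulate a server failure; return how many commits were rolled back."""
        lost = self.commits_since_checkpoint
        self.memory = copy.deepcopy(self.stable)
        self.commits_since_checkpoint = 0
        return lost


server = LiteServer()
for i in range(3):
    server.commit([("Row of C", i)])
server.checkpoint()
for i in range(3, 5):
    server.commit([("Row of C", i)])
lost = server.crash_and_recover()
```

In this run the two commits made after the checkpoint are rolled back while the three earlier ones survive. The checkpoint interval therefore bounds how much work a rare server failure can undo.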

  13. Continuation Committing Revisited 1 • Does the continuation really need to be sent to the server at each and every transaction commit? • What if we only send the continuations to the server every n minutes and checkpoint at the same time? • On process failure, roll back to the previous checkpoint • Very low fault tolerance overhead • A single failure leads to rollback

  14. Continuation Committing Revisited 2 • What if we replicate each process? • Only send the continuation when checkpointing, or when the other process fails, in order to create another replica • Low overhead fault tolerance • Can quickly recover from one failure • Massive rollback for the general failure case

  15. Continuation Committing Mechanisms • Commit consistent • Periodic with replication • Periodic with message logging and replay • Periodic with undo log • Periodic only • Others?

  16. Use the Mode that Best Suits Each Process • Each mode has a different failure free versus recovery time tradeoff • Each mode is good for processes with different characteristics • Commit consistent is great for processes with fairly small state running on unstable machines • Message logging and replay is good for large state processes which don’t communicate too much • Can have the end user or programmer decide which mode an application should use • Use all modes at the same time and have algorithms which decide which mode to use for each transaction based on the process and machine characteristics
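As a purely illustrative sketch, the two rules of thumb on this slide can be written as a small selection function in Python. Everything beyond those two rules is an assumption: the `ProcessProfile` fields, both thresholds, and the fallback to "periodic only" are hypothetical and are not taken from the slides.

```python
from dataclasses import dataclass


@dataclass
class ProcessProfile:
    """Hypothetical per-process measurements."""
    state_bytes: int
    messages_per_second: float
    machine_is_unstable: bool


SMALL_STATE_BYTES = 1_000_000  # hypothetical threshold
LOW_MESSAGE_RATE = 10.0        # hypothetical threshold


def choose_mode(p: ProcessProfile) -> str:
    # Slide rule 1: small state on an unstable machine.
    if p.state_bytes <= SMALL_STATE_BYTES and p.machine_is_unstable:
        return "commit consistent"
    # Slide rule 2: large state and little communication.
    if (p.state_bytes > SMALL_STATE_BYTES
            and p.messages_per_second <= LOW_MESSAGE_RATE):
        return "periodic with message logging and replay"
    # Fallback chosen for this sketch only.
    return "periodic only"


small_flaky = ProcessProfile(50_000, 200.0, True)
big_quiet = ProcessProfile(500_000_000, 1.0, False)
big_chatty = ProcessProfile(500_000_000, 500.0, False)
modes = [choose_mode(p) for p in (small_flaky, big_quiet, big_chatty)]
```

A real selector would weigh failure-free overhead against recovery time for each mode, as the slide notes; this function only shows where such a decision would plug in.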

  17. Runtime System with Different Committing Modes

  18. Challenges • Checkpointing: algorithms exist for each mechanism in isolation; some processes are inconsistent; must not block clients • Choose the best mechanism for each process in an application at a particular time • Complicated to keep consistency when using different modes

  19. Single Server Checkpointing • Always keep the two latest checkpoints • Flush committed memory to stable store • Flush consistent process continuations to stable store • Request continuations from all inconsistent processes • Continue servicing all requests from consistent processes • Continue servicing all but commit requests from inconsistent processes • Provably correct
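One possible reading of these steps is sketched below in Python. It assumes that an "inconsistent" process is one whose latest continuation has not yet reached the server, and it uses a two-slot `deque` as a stand-in for stable store. The `CheckpointingServer` class and all method names are hypothetical; this is not the algorithm's actual implementation or its correctness proof.

```python
from collections import deque


class CheckpointingServer:
    def __init__(self):
        self.memory = []                    # committed tuples
        self.continuations = {}             # pid -> continuation held by server
        self.inconsistent = set()           # pids whose held continuation is stale
        self.checkpoints = deque(maxlen=2)  # always keep the two latest
        self._pending = None                # checkpoint being assembled

    def commit(self, pid, outputs, continuation=None):
        if self._pending is not None and pid in self._pending["waiting"]:
            return False  # commits from inconsistent processes must wait
        self.memory.extend(outputs)
        if continuation is None:
            self.inconsistent.add(pid)
        else:
            self.continuations[pid] = continuation
            self.inconsistent.discard(pid)
        return True

    def start_checkpoint(self):
        """Snapshot memory and consistent continuations; return pids to ask."""
        self._pending = {
            "memory": list(self.memory),
            "continuations": {p: c for p, c in self.continuations.items()
                              if p not in self.inconsistent},
            "waiting": set(self.inconsistent),
        }
        waiting = set(self._pending["waiting"])
        self._finish_if_ready()
        return waiting

    def supply_continuation(self, pid, continuation):
        self.continuations[pid] = continuation
        self.inconsistent.discard(pid)
        if self._pending is not None and pid in self._pending["waiting"]:
            self._pending["continuations"][pid] = continuation
            self._pending["waiting"].discard(pid)
            self._finish_if_ready()

    def _finish_if_ready(self):
        if self._pending is not None and not self._pending["waiting"]:
            self.checkpoints.append(
                (self._pending["memory"], self._pending["continuations"])
            )
            self._pending = None


server = CheckpointingServer()
server.commit("p1", [("t", 1)], continuation="p1@1")  # consistent
server.commit("p2", [("t", 2)])                       # inconsistent
asked = server.start_checkpoint()
p1_ok = server.commit("p1", [("t", 3)], continuation="p1@2")  # still served
p2_ok = server.commit("p2", [("t", 4)])                       # deferred
server.supply_continuation("p2", "p2@1")  # completes the checkpoint
```

Holding back only the commit requests of inconsistent processes keeps each supplied continuation in step with the memory snapshot, while consistent processes are never blocked. Commits that arrive after the snapshot simply belong to the next checkpoint.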

  20. Experiments

  21. Current and Future Work • How to replicate the server(s) to provide availability • Algorithms for minimizing fault tolerance overhead • Predicting idle times • Combining the flexibility of PLinda’s programming model with the ease of Calypso’s programming model
