
Fault-tolerant Stream Processing using a Distributed, Replicated File System



  1. Fault-tolerant Stream Processing using a Distributed, Replicated File System CMPE.516 Term Presentation Erkan Çetiner 2008700030

  2. OUTLINE • Brief Information • Introduction • Background • Stream Processing Engines • Distributed, Replicated File System • SGuard Overview • MMM Memory Manager • Peace Scheduler • Evaluation • Related Work • Conclusion

  3. Brief Information • SGuard – a new fault-tolerance technique for distributed stream processing engines running in clusters of commodity servers • SGuard is • Less disruptive to normal stream processing • Leaves more resources available for normal stream processing • Based on: ROLLBACK RECOVERY • It checkpoints the state of stream processing nodes periodically • Restarts failed nodes from their most recent checkpoints • Compared to previous proposals, SGuard • Performs checkpointing asynchronously • Operators continue processing streams during checkpoints • Reduces potential disruption due to checkpointing activity

  4. Introduction • Problem Definition • Today's Web & Internet services (e-mail, instant messaging, online games, etc.) must handle millions of users distributed across the world. • To manage these large-scale systems, service providers must continuously monitor the health of their infrastructures, the quality of service experienced by users, and any potential malicious activities. • SOLUTION -> Stream Processing Engines • Well suited to monitoring since they process data streams continuously and with low latencies • RESULTING PROBLEM -> Requires running these SPEs in clusters of commodity servers Important Challenge = Fault Tolerance

  5. Introduction • As the size of the cluster grows, one or more servers may fail • These failures may include both software crashes & hardware failures • A server may never recover and, worse, multiple servers may fail simultaneously • Failure -> causes a distributed SPE to block or to produce erroneous results • -> Negative impact on applications • To address this problem and create a highly-available SPE, several fault-tolerance techniques have been proposed. • These techniques are based on replication -> the state of stream processing operators is replicated on multiple servers

  6. Introduction • Two Proposed Systems • 1 - The replica servers all actively process the same input streams at the same time • 2 - Only one primary server performs processing while the others passively hold replicated state in memory until the primary fails • The replicated state is periodically refreshed through a process called checkpointing to ensure backup nodes hold a recent copy of the primary's state • These techniques yield: • Excellent protection from failures • At relatively high cost • When all processing is replicated -> normal stream processing is slowed down • When replicas only keep snapshots of the primary's state without performing any processing -> CPU overhead is lower, but the cluster loses at least half of its memory capacity

  7. Introduction • What does SGuard bring? • An SPE with fault-tolerance at much lower cost that maintains the efficiency of normal SPE operations • Based on 3 key innovations: 1. Memory Resources: • SPEs process streams in memory -> memory is a critical resource • To save memory resources -> SGuard saves checkpoints to disk • Stable storage was previously not considered, since saving to disk was thought to be too slow for streaming applications • SGuard partly overcomes this performance challenge by using a new generation of distributed and replicated file systems (DFS) such as the Google File System and Hadoop DFS. • These new DFSs are optimized for reads & writes of large data volumes and also for append-style workloads • They maintain high availability in the face of disk failures • Each SPE node now serves as both a stream processing node and a file system node • First Innovation = Use of a DFS as stable storage for SPE checkpoints

  8. Introduction 2. Resource Contention Problems: • When many nodes have large states to checkpoint -> they contend for disk and network bandwidth resources, slowing down individual checkpoints • To address the resource contention problem, SGuard extends the DFS master node with a new type of scheduler called "PEACE" • Given a queue of write requests, PEACE selects the replicas for each data chunk and schedules writes in a manner that significantly reduces the latency of individual writes, while only modestly increasing the completion time for all writes in the queue. • Scheduling only as many concurrent writes as there are available resources -> avoids resource contention

  9. Introduction 3. Transparent Checkpoints: • To make checkpoints transparent -> a new Memory Management Middleware (MMM) • It enables checkpointing the state of an SPE operator while the operator continues executing • MMM partitions the SPE memory into "application-level pages" and uses copy-on-write to perform asynchronous checkpoints • Pages are written to disk without requiring any expensive translation. • MMM checkpoints are more efficient & less disruptive to normal stream processing than previous schemes

  10. Background Stream Processing Engines • In an SPE • A query takes the form of a loop-free, directed graph of operators • Each operator processes data arriving on its input streams and produces data on its output streams • Processing is done in memory without going to disk • Query graphs are called "query diagrams" & can be distributed across multiple servers in LANs or WANs • Each server runs one instance of the SPE and is referred to as a processing node (a sketch of such an operator graph follows below)
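
A minimal C++ sketch of this idea: a query as a loop-free, directed graph of operators that consume and produce streams entirely in memory. Operator, Tuple, and Stream are illustrative names for exposition, not any specific SPE's API.

```cpp
// Illustrative query diagram: a loop-free, directed graph of operators.
#include <functional>
#include <iostream>
#include <vector>

struct Tuple { int key; double value; };
using Stream = std::vector<Tuple>;

struct Operator {
    std::function<Stream(const Stream&)> process;   // in-memory processing, no disk access
    std::vector<Operator*> downstream;              // edges of the query graph
};

int main() {
    // filter -> aggregate: a two-operator chain; when the diagram is
    // distributed, the two operators could run on different processing nodes.
    Operator aggregate{[](const Stream& in) {
        double sum = 0; for (const auto& t : in) sum += t.value;
        return Stream{{0, sum}};
    }, {}};
    Operator filter{[](const Stream& in) {
        Stream out; for (const auto& t : in) if (t.value > 1.0) out.push_back(t);
        return out;
    }, {&aggregate}};

    Stream input{{1, 0.5}, {2, 2.0}, {3, 3.0}};
    Stream result = aggregate.process(filter.process(input));
    std::cout << "aggregate output: " << result[0].value << "\n";   // 5
}
```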

  11. Background Stream Processing Engines • Goal: To develop a fault-tolerant scheme that does not require reserving half the memory in a cluster for fault tolerance -> save checkpoints to disk • To meet the challenge of high latency for writing to disks, SGuard extends the DFS concept Distributed, Replicated File System • To manage large data volumes, new types of large-scale file systems, or data stores in general, are emerging. • They bring two important properties:

  12. Background Distributed, Replicated File System • Support for Large Data Files • Designed to operate on large, multi-GB files • Optimized for bulk read operations • Different chunks are stored on different nodes • A single master node makes all chunk placement decisions • Fault-Tolerance through Replication • Assumes frequent failures • To protect against them, each data chunk is replicated on multiple nodes using synchronous replication -> the client sends data to the closest replica, which forwards it to the next closest replica in a chain, and so on (see the sketch below) • Once all replicas have the data, the client is notified that the write has completed • SGuard relies on this automatic replication to protect the distributed SPE against multiple simultaneous failures
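
A minimal sketch of the pipelined chain replication described above: the client sends the chunk to one replica, the data is forwarded down the chain, and the client is acknowledged only once every replica holds it. The names (Replica, write_chunk) are illustrative, not the actual GFS/HDFS interface.

```cpp
// Illustrative pipelined (chain) replication of one chunk write.
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

struct Replica {
    std::string node;                 // e.g. "svr1.2"
    std::vector<char> storage;        // where the chunk ends up
};

// The client sends the chunk only to the first (closest) replica; each
// replica stores the data and forwards it to the next one in the chain.
// The write completes only after every replica holds the chunk.
bool write_chunk(const std::vector<char>& chunk, std::vector<Replica>& chain) {
    for (std::size_t i = 0; i < chain.size(); ++i) {
        chain[i].storage = chunk;     // store locally
        // In a real system this forwarding is a network send to chain[i+1];
        // here the loop itself models the pipeline.
    }
    return !chain.empty();            // acknowledge the client
}

int main() {
    std::vector<Replica> chain = {{"svr1.2", {}}, {"svr2.2", {}}, {"svr2.3", {}}};
    std::vector<char> chunk(64 * 1024, 'x');   // one 64 KB chunk of checkpoint data
    if (write_chunk(chunk, chain))
        std::cout << "write acknowledged after " << chain.size() << " replicas\n";
}
```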

  13. Background

  14. SGuard Overview • Works as follows: • Each SPE node takes periodic checkpoints of its state and writes these checkpoints to stable storage • SGuard uses the DFS as stable storage • To take a checkpoint, a node suspends its processing and makes a copy of its state (avoided later by the MMM) • Between checkpoints, each node logs the output tuples it produces • When a failure occurs, a backup node recovers by reading the most recent checkpoint from stable storage and reprocessing all input tuples since then • A node sends a message to all its upstream neighbors to inform them of the checkpointed input tuples • Because nodes buffer output tuples, they can checkpoint their states independently and still ensure consistency in the face of a failure • Within a node, however, all interconnected operators must be suspended and checkpointed at the same time to capture a consistent snapshot of the SPE state • Interconnected groups of operators are called HA units (see the sketch below)
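
A minimal sketch of the per-node checkpoint cycle just described: suspend an HA unit, snapshot its state, resume processing, write the snapshot to the DFS, and notify upstream neighbors so they can trim their output logs. The types and calls are illustrative placeholders, not SGuard's real interface.

```cpp
// Illustrative checkpoint cycle for one HA unit.
#include <cstdint>
#include <iostream>
#include <vector>

struct HAUnit {
    std::vector<char> state;        // state of all interconnected operators
    uint64_t last_consumed_input;   // id of last input tuple reflected in the state
    void suspend()  { /* pause all operators in the unit */ }
    void resume()   { /* restart them */ }
};

std::vector<char> snapshot(const HAUnit& u) { return u.state; }      // copy (avoided by the MMM)
void write_to_dfs(const std::vector<char>& snap) { (void)snap; }     // replicated, durable write
void ack_upstream(uint64_t input_id) {
    std::cout << "upstream can trim its output log up to tuple " << input_id << "\n";
}

void checkpoint(HAUnit& unit) {
    unit.suspend();                          // all operators in the HA unit stop together
    auto snap = snapshot(unit);              // consistent copy of the unit's state
    uint64_t covered = unit.last_consumed_input;
    unit.resume();                           // processing continues while we write
    write_to_dfs(snap);                      // k+1 replicas tolerate k failures
    ack_upstream(covered);                   // upstream neighbors trim their buffers
}

int main() {
    HAUnit unit{std::vector<char>(1024, 0), 42};
    checkpoint(unit);                        // invoked periodically by each SPE node
}
```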

  15. SGuard Overview

  16. SGuard Overview • The level of replication of the DFS determines the level of fault-tolerance of SGuard -> k+1 replicas means k simultaneous failures are tolerated • Problem: Impact of checkpoints on the normal stream processing flow • Since operators must be suspended during checkpoints, each checkpoint introduces extra latency in the result stream • This latency will be increased by writing checkpoints to disk rather than keeping them in memory • To address these challenges, SGuard uses: • MMM (Memory Management Middleware) • PEACE Scheduler

  17. MMM Memory Manager Role of MMM • To checkpoint operators' states without serializing them first into a buffer • To enable concurrent checkpoints, where the state of an operator is copied to disk while the operator continues its execution Working Procedure of MMM • Partitions the SPE memory into a collection of "pages" – large fragments of memory allocated on the heap – • Operator states are stored inside these pages • To checkpoint the state of an operator, its pages are copied to disk • To enable an operator to execute during the checkpoint: • When the checkpoint begins, the operator is briefly suspended and all its pages are marked as read-only • The operator execution is then resumed and pages are written to disk in the background • If the operator touches a page that has not yet been written to disk -> it is briefly interrupted while the MMM makes a copy of the page (see the sketch below)
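
A minimal sketch of this copy-on-write procedure, assuming illustrative Page and PageManager types (not the MMM's real API): pages are marked read-only when the checkpoint begins, flushed in the background, and copied on demand if the operator touches them before they reach disk.

```cpp
// Illustrative copy-on-write checkpointing of application-level pages.
#include <cstddef>
#include <vector>

struct Page {
    std::vector<char> data;
    bool read_only = false;      // set when a checkpoint begins
    bool flushed   = false;      // set once the page reaches disk
};

struct PageManager {
    std::vector<Page> pages;
    std::vector<Page> shadow;    // copies made on write during a checkpoint

    void begin_checkpoint() {                 // operator briefly suspended here
        for (auto& p : pages) { p.read_only = true; p.flushed = false; }
    }
    // Called when the running operator is about to modify page i.
    Page& on_operator_write(std::size_t i) {
        Page& p = pages[i];
        if (p.read_only && !p.flushed) {      // page not yet on disk:
            shadow.push_back(p);              // copy the old version for the checkpoint
            p.read_only = false;              // operator may now modify the original
        }
        return p;
    }
    void background_flush() {                 // runs while the operator executes
        for (auto& p : shadow) { /* write p.data to the DFS */ p.flushed = true; }
        for (auto& p : pages)
            if (p.read_only) { /* write p.data to the DFS */ p.flushed = true; p.read_only = false; }
    }
};

int main() {
    PageManager pm;
    pm.pages.assign(4, Page{std::vector<char>(4096, 0)});
    pm.begin_checkpoint();          // operator briefly paused
    pm.on_operator_write(2);        // operator resumes and touches page 2 early
    pm.background_flush();          // remaining pages flushed in the background
}
```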

  18. MMM Memory Manager

  19. MMM Memory Manager • Page Manager: • Allocates, frees & checkpoints pages • Data Structures: • Implement data structure abstractions on top of the PM's page abstraction • Page Layout: • Simplifies the implementation of new data structures on top of the PM (the three layers are sketched below)
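
An illustrative sketch of how these three layers might fit together; all signatures are assumptions for exposition, not the MMM library's real interface.

```cpp
// Illustrative layering: Page Manager -> Page Layout -> paged data structures.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

using PageID = uint32_t;

// Page Manager: allocates, frees, and checkpoints application-level pages.
class PageManager {
    std::map<PageID, std::vector<char>> pages_;
    PageID next_ = 0;
public:
    PageID allocate(std::size_t bytes) { pages_[next_] = std::vector<char>(bytes); return next_++; }
    void   free(PageID id)             { pages_.erase(id); }
    char*  data(PageID id)             { return pages_.at(id).data(); }
    void   checkpoint(const std::vector<PageID>& ha_unit) { (void)ha_unit; /* copy-on-write flush */ }
};

// Page Layout: places fixed-size records inside a page, so new data
// structures do not manipulate raw bytes directly.
struct PageLayout {
    std::size_t record_size;
    char* record(PageManager& pm, PageID page, std::size_t slot) const {
        return pm.data(page) + slot * record_size;
    }
};

// Data Structures: e.g. an operator's window buffer whose entries live in
// pages, so checkpointing the pages checkpoints the operator's state.
struct PagedWindow {
    PageManager& pm; PageLayout layout; PageID page; std::size_t count = 0;
    void append(const char* tuple) {
        char* dst = layout.record(pm, page, count++);
        std::copy(tuple, tuple + layout.record_size, dst);
    }
};
```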

  20. PEACE Scheduler Problem: • When many nodes try to checkpoint HA units at the same time, they contend for network and disk resources, slowing down individual checkpoints PEACE's Solution • Peace runs inside the Coordinator at the DFS layer • Given a set of requests to write one or more data chunks: • Peace schedules the writes in a manner that reduces the time to write each set of chunks while keeping the total time for completing all writes small • By scheduling only as many concurrent writes as there are available resources • Scheduling all writes from the same set close together • Selecting the destination for each write in a manner that avoids resource contention Advantage: • Adding the scheduler at the DFS rather than the application layer means that the scheduler can better control and allocate resources

  21. PEACE Scheduler • Takes as input a model of the network in the form of a directed graph G = (V, E), where V is the set of vertices and E is the set of edges

  22. PEACE Scheduler • All write requests are scheduled using the following algorithm • Nodes submit write requests to the Coordinator in the form of triples (w, r, k) • The algorithm iterates over all requests • For each replica of each chunk, it selects the best destination node by solving a min-cost max-flow problem over the graph G -> extended with a source node s and a destination node d • s is connected to the writer node with capacity 1 • d is connected to all servers except the writer node with infinite capacity • The algorithm finds the minimum-cost, maximum-flow path from s to d • To ensure that different chunks & different replicas are written to different nodes, the algorithm selects the destination for one chunk replica at a time • All replicas of a chunk must be written to complete the whole write task (a sketch of this loop follows below)
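
An illustrative sketch of the destination-selection loop above: graph G is extended with a source s (edge to the writer, capacity 1) and a sink d (edges from every other server), and one unit of flow is routed per replica. A real implementation solves min-cost max-flow over the full switch topology; the stand-in solver below only considers direct writer-to-server edges, just enough to show the structure of the algorithm.

```cpp
// Illustrative per-replica destination selection over an extended graph G.
#include <iostream>
#include <vector>

struct Edge { int from, to, capacity, cost; };

struct Graph {
    int n = 0;
    std::vector<Edge> edges;
    void add_edge(int u, int v, int cap, int cost) { edges.push_back({u, v, cap, cost}); }
};

// Stand-in for a real min-cost max-flow solver: pick the cheapest
// writer->server edge whose server still has capacity on its edge to the
// sink d, and consume that capacity so later replicas go elsewhere.
int route_one_replica(Graph& g, int s, int d) {
    int writer = -1;
    for (const auto& e : g.edges)
        if (e.from == s && e.capacity > 0) writer = e.to;   // s -> writer, capacity 1
    int best = -1, best_cost = 1 << 30;
    Edge* sink_edge = nullptr;
    for (const auto& e : g.edges) {
        if (e.from != writer || e.capacity <= 0 || e.cost >= best_cost) continue;
        for (auto& f : g.edges)
            if (f.from == e.to && f.to == d && f.capacity > 0) {
                best = e.to; best_cost = e.cost; sink_edge = &f;
            }
    }
    if (sink_edge) sink_edge->capacity -= 1;                 // replica placed on this node
    return best;                                             // -1: no path this timestep
}

int main() {
    // Toy topology: writer = node 0, candidate servers = nodes 1..3.
    // Cost 1 = same rack as the writer, cost 2 = other rack.
    Graph g; g.n = 4;
    g.add_edge(0, 1, 1, 1);
    g.add_edge(0, 2, 1, 2);
    g.add_edge(0, 3, 1, 2);

    const int s = g.n, d = g.n + 1;        // extend G with source and sink
    g.add_edge(s, 0, 1, 0);                // only the writer emits data
    for (int srv = 1; srv <= 3; ++srv)
        g.add_edge(srv, d, 1, 0);          // each server takes at most one replica here

    for (int r = 0; r < 2; ++r)            // one replica at a time, so the two
        std::cout << "replica " << r << " -> node "   // replicas land on different nodes
                  << route_one_replica(g, s, d) << "\n";
}
```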

  23. PEACE Scheduler • 2 properties have to be satisfied: • To exploit network bandwidth resources efficiently, a node does not send a chunk directly to all replicas. Instead, it only sends it to one replica, which then transmits the data to the others in a pipeline • Second, to reduce correlations between failures, the DFS places replicas at different locations in the network (e.g., one copy on the same rack and another on a different rack) • All edges have capacity 1 • svr1.1 & svr2.1 request to write 1 chunk each with replication factor 2 • First, the request from svr1.1 is processed -> it assigns the first replica to svr1.2 and the second to svr2.2 – constraint: the replicas have to be on different racks – • Then, the request from svr2.1 is processed -> no path can be found for its first replica on the same rack – constraint: the first replica must be on the same rack – • Thus it is processed at time t+1

  24. PEACE Scheduler File-System Write Protocol • Each node submits write requests to the Coordinator indicating the number of chunks it needs to write • PEACE schedules the chunks • The Coordinator then uses callbacks to let the client know when & where to write each chunk • To keep track of the progress of writes, each node informs the Coordinator every time a write completes • Once a fraction of all scheduled writes has completed, PEACE declares the end of a timestep & moves on to the next timestep • If the schedule does not allow a client to start writing right away, the Coordinator returns an error message • In SGuard, when this message is received, the PM cancels the checkpoint by marking all pages as read-writable again • The state of the HA unit is checkpointed again when the node can finally start writing (the client side is sketched below)
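
A minimal sketch of the client side of this write protocol, with an assumed Coordinator reply type and cancellation path (not the actual SGuard/HDFS API): the node submits its chunk count, writes where the schedule tells it to, and cancels the checkpoint if it cannot start right away.

```cpp
// Illustrative client side of the scheduled write protocol.
#include <functional>
#include <iostream>
#include <vector>

struct WriteSlot { int chunk_index; int destination_node; };

// Possible shape of the Coordinator's reply: either a schedule of
// (chunk, destination) slots, or an error telling the client to retry later.
struct ScheduleReply {
    bool can_start_now;
    std::vector<WriteSlot> slots;
};

struct Page { bool read_only = true; };

void checkpoint_ha_unit(std::vector<Page>& pages, int num_chunks,
                        const std::function<ScheduleReply(int)>& submit_to_coordinator) {
    ScheduleReply reply = submit_to_coordinator(num_chunks);
    if (!reply.can_start_now) {
        // Coordinator cannot schedule the write right away: cancel the
        // checkpoint so the operator is not slowed down, and retry later.
        for (auto& p : pages) p.read_only = false;
        std::cout << "checkpoint canceled; will retry\n";
        return;
    }
    for (const auto& slot : reply.slots) {
        // Write the chunk to its assigned node, then report completion so the
        // Coordinator can track timestep progress (80% -> next timestep).
        std::cout << "chunk " << slot.chunk_index << " -> node "
                  << slot.destination_node << "\n";
    }
}

int main() {
    std::vector<Page> pages(4);
    auto coordinator = [](int chunks) {
        ScheduleReply r{true, {}};
        for (int i = 0; i < chunks; ++i) r.slots.push_back({i, i % 3});
        return r;
    };
    checkpoint_ha_unit(pages, 4, coordinator);
}
```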

  25. EVALUATION MMM Evaluation • MMM is implemented as a C++ library = 9K lines of code • The Borealis distributed SPE was modified to use it • To evaluate PEACE, HDFS was used • 500 lines of code added to Borealis • 3K lines of code added to HDFS

  26. EVALUATION MMM Evaluation Runtime Operator Overhead • Overhead of using the MMM to hold the state of an operator, compared with using a standard data structures library such as the STL • Study Join and Aggregate, the two most common stateful operators • Executed on a dual 2.66GHz quad-core machine with 16GB RAM • Running a 32-bit Linux kernel 2.6.22 and a single 500GB commodity SATA2 hard disk • JOIN does a self-join of a stream using an equality predicate on unique tuple identifiers • The overhead of using the MMM for the JOIN operator is within 3% for all 3 window sizes JOIN:

  27. EVALUATION MMM Evaluation Runtime Operator Overhead AGGREGATE: • AGGREGATE groups input tuples by a 4-byte group name & computes count, min, and max on a timestamp field • The MMM is more efficient for a large number of groups because it forces the allocation of larger memory chunks at a time, amortizing the per-group memory allocation cost • Negligible impact on operator performance • This required changing only 13 lines of JOIN operator code and 120 lines of AGGREGATE operator code

  28. EVALUATION MMM Evaluation Cost of Checkpoint Preparation • Before the state of an HA unit can be saved to disk, it must be prepared • This requires either serializing the state or copying all PageIDs and marking all pages as read-only JOIN: • Overhead is linear in the size of the checkpointed state • Avoiding serialization is beneficial

  29. EVALUATION MMM Evaluation Checkpoint Overhead • Overall runtime overhead is measured via the end-to-end stream processing latency while the SPE checkpoints the state of one operator • AGGREGATE is used -> it shows the worst-case performance since it touches pages randomly • Compare the performance of using the MMM against synchronously serializing state and asynchronously serializing state • Also compared with an off-the-shelf VM (VMware) • Fed 2.0K tuples/sec for 10 min while checkpointing every minute • The MMM's interruption is 2.84 times lower than its nearest competitor's

  30. EVALUATION MMM Evaluation Checkpoint Overhead • A common technique for reducing checkpointing overhead is to partition the state of an operator • Here, the Aggregate operator is split into 4 partitions

  31. EVALUATION MMM Evaluation Checkpoint Overhead • To hide the synchronous and asynchronous serialization overhead -> the operator must be split into 64 parts • This adds the overhead of managing all the partitions and may not always be possible • The MMM is the least disruptive to normal stream processing

  32. EVALUATION MMM Evaluation Recovery Performance • Once the coordinator detects a failure & selects a recovery node, recovery proceeds in 4 steps • The recovery node reads the checkpointed state from the DFS • It reconstructs the Page Manager state • It reconstructs any additional Data Structure state • It replays tuples logged at upstream HA units • Total Recovery Time = sum of the 4 steps + failure detection time + recovery node selection time • Overhead is negligible • MMM recovery imposes a low extra overhead once pages are read from disk (the sequence is sketched below)
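
A minimal sketch of this 4-step recovery sequence; all function names are illustrative placeholders for the corresponding SGuard components, not its real interfaces.

```cpp
// Illustrative 4-step recovery of a failed node's HA unit.
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

struct Checkpoint { std::vector<char> pages; uint64_t last_input_tuple; };

Checkpoint read_checkpoint_from_dfs(const std::string& path) {        // step 1
    (void)path; return {std::vector<char>(128, 0), 42};
}
void rebuild_page_manager(const Checkpoint& c)    { (void)c; }         // step 2
void rebuild_data_structures(const Checkpoint& c) { (void)c; }         // step 3
void replay_logged_tuples(uint64_t from_tuple) {                       // step 4
    std::cout << "replaying upstream-logged tuples after " << from_tuple << "\n";
}

// Invoked by the coordinator once a failure is detected and a recovery node
// has been selected; total recovery time also includes those two delays.
void recover(const std::string& checkpoint_path) {
    Checkpoint c = read_checkpoint_from_dfs(checkpoint_path);
    rebuild_page_manager(c);
    rebuild_data_structures(c);
    replay_logged_tuples(c.last_input_tuple);
}

int main() { recover("/dfs/checkpoints/ha_unit_7"); }
```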

  33. EVALUATION PEACE Evaluation • Evaluate the performance of reading checkpoints from the DFS and writing them to the DFS • Cluster of 17 machines in two racks (8 & 9 machines each) • All machines in a rack share a gigabit ethernet switch and run the HDFS data node program • The two racks are connected with a gigabit ethernet link • First measure the performance of an HDFS client writing 1GB of data to the HDFS cluster using varying degrees of parallelism • As the number of threads increases, the total throughput at the client also increases, until network capacity is reached with 4 threads • It takes more than one thread to saturate the network because the client computes a checksum for each data chunk that it writes • With 4 threads a client reads at 95 MB/s, so it can recover a 128 MB HA unit in about 2 seconds • When more clients access the same data node, performance drops significantly

  34. EVALUATION PEACE Evaluation • Measure the time to write checkpoints to the DFS when multiple nodes checkpoint their states simultaneously • The data to checkpoint is already prepared • Vary the number of concurrently checkpointing nodes from 8 to 16 and the size of each checkpoint from 128MB to 512MB • The replication level is set to 3 • Compare the performance against the original HDFS implementation • Peace waits until 80% of the writes in a timestep complete before starting the next one • Global completion time increases for larger aggregate volumes of I/O tasks • Individual tasks complete their I/O much faster than in the original HDFS

  35. EVALUATION PEACE Evaluation • Peace significantly reduces the latency of individual writes with only a small decrease in overall resource utilization • Since it now takes longer for all nodes to checkpoint their states, the maximum checkpoint frequency is reduced • This means longer recoveries, as more tuples need to be re-processed after a failure • This is the correct trade-off, since recovery with passive standby imposes a small delay on streams anyway

  36. Related Work • Semi-transparent approaches • The C3 application-level checkpointing • OODBMS Storage Manager • BerkeleyDB

  37. CONCLUSION • SGuard leverages the existence of a new type of DFS to provide efficient fault tolerance at a lower cost than previous proposals • SGuard extends the DFS with Peace, a new scheduler that reduces the time to write individual checkpoints in the face of high contention • SGuard also improves the transparency of SPE checkpoints through the Memory Management Middleware, which enables efficient asynchronous checkpointing • The performance of SGuard is promising • With Peace and the DFS, nodes in a 17-server cluster can each checkpoint 512MB of state in less than 20s • The MMM efficiently hides this checkpointing activity

  38. References • Y. Kwon, M. Balazinska, and A. Greenberg. Fault-Tolerant Stream Processing Using a Distributed, Replicated File System. PVLDB, 1(1):574–585, 2008. • M. Balazinska, H. Balakrishnan, S. R. Madden, and M. Stonebraker. Fault-Tolerance in the Borealis Distributed Stream Processing System. ACM, New York, NY, USA, ISSN 0362-5915, 2008.
