
Fault-tolerant Stream Processing using a Distributed, Replicated File System



  1. Fault-tolerant Stream Processing using a Distributed, Replicated File System CMPE.516 Term Presentation Erkan Çetiner 2008700030

  2. OUTLINE • Brief Information • Introduction • Background • Stream Processing Engines • Distributed, Replicated File System • SGuard Overview • MMM Memory Manager • Peace Scheduler • Evaluation • Related Work • Conclusion

  3. Brief Information • SGuard – a new fault-tolerance technique for distributed stream processing engines running in clusters of commodity servers • SGuard is • Less disruptive to normal stream processing • Leaves more resources available for normal stream processing • Based on: ROLLBACK RECOVERY • It checkpoints the state of stream processing nodes periodically • Restarts failed nodes from their most recent checkpoints • Compared to previous proposals, SGuard • Performs checkpointing asynchronously • Operators continue processing streams during checkpoints • Reduces potential disruption due to checkpointing activity

  4. Introduction • Problem Definition • Today's Web & Internet services (e-mail, instant messaging, online games, etc.) must handle millions of users distributed across the world. • To manage these large-scale systems, service providers must continuously monitor the health of their infrastructures, the quality of service experienced by users, and any potential malicious activities. • SOLUTION -> Stream Processing Engines • Well suited to monitoring since they process data streams continuously and with low latencies • RESULTING PROBLEM -> Requires running these SPEs in clusters of commodity servers Important Challenge = Fault Tolerance

  5. Introduction • As the size of the cluster grows, one or more servers may fail • These failures may include both software crashes & hardware failures • A server may never recover and, worse, multiple servers may fail simultaneously • Failure -> causes a distributed SPE to block or to produce erroneous results • -> Negative impact on applications • To address this problem and create a highly-available SPE, several fault-tolerance techniques have been proposed. • These techniques are based on replication -> the state of stream processing operators is replicated on multiple servers

  6. Introduction • Two Proposed Systems • 1 - The replica servers all actively process the same input streams at the same time • 2 - Only one primary server performs processing while the others passively hold replicated state in memory until the primary fails • The replicated state is periodically refreshed through a process called checkpointing to ensure backup nodes hold a recent copy of the primary's state • These techniques yield: • Excellent protection from failures • At relatively high cost • When all processing is replicated -> normal stream processing is slowed down • When replicas only keep snapshots of the primary's state without performing any processing -> CPU overhead is lower, but the cluster loses at least half of its memory capacity

  7. Introduction • What does SGuard bring? • An SPE with fault-tolerance at much lower cost that maintains the efficiency of normal SPE operations • Based on 3 key innovations: 1. Memory Resources: • SPEs process streams in memory -> memory is a critical resource • To save memory resources -> SGuard saves checkpoints to disk • Stable storage was previously not considered, since saving to disk was thought to be too slow for streaming applications • SGuard partly overcomes this performance challenge by using a new generation of distributed and replicated file systems (DFS) such as the Google File System and Hadoop DFS. • These new DFSs are optimized for reads & writes of large data volumes and also for append-style workloads • They maintain high availability in the face of disk failures • Each SPE node now serves as both a stream processing node and a file system node • First Innovation = Use of a DFS as stable storage for SPE checkpoints

  8. Introduction 2. Resource Contention Problems: • When many nodes have large states to checkpoint -> they contend for disk and network bandwidth resources, slowing down individual checkpoints • To address the resource contention problem, SGuard extends the DFS master node with a new type of scheduler called "PEACE" • Given a queue of write requests, PEACE selects the replicas for each data chunk and schedules writes in a manner that significantly reduces the latency of individual writes, while only modestly increasing the completion time for all writes in the queue. • Scheduling only as many concurrent writes as there are available resources -> avoids resource contention

  9. Introduction 3. Transparent Checkpoints: • To make checkpoints transparent -> a new Memory Management Middleware (MMM) • It enables checkpointing the state of an SPE operator while the operator continues executing • MMM partitions the SPE memory into "application-level pages" and uses copy-on-write to perform asynchronous checkpoints • Pages are written to disk without requiring any expensive translation. • MMM checkpoints are more efficient & less disruptive to normal stream processing than previous schemes

  10. Background Stream Processing Engines • In an SPE • A query takes the form of a loop-free, directed graph of operators • Each operator processes data arriving on its input streams and produces data on its output streams • Processing is done in memory without going to disk • Query graphs are called "query diagrams" & can be distributed across multiple servers in LANs or WANs • Each server runs one instance of the SPE and is referred to as a processing node (a sketch of such an operator graph follows below)
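
A minimal C++ sketch of this idea: a query as a loop-free, directed graph of operators that consume and produce streams entirely in memory. Operator, Tuple, and Stream are illustrative names for exposition, not any specific SPE's API.

```cpp
// Illustrative query diagram: a loop-free, directed graph of operators.
#include <functional>
#include <iostream>
#include <vector>

struct Tuple { int key; double value; };
using Stream = std::vector<Tuple>;

struct Operator {
    std::function<Stream(const Stream&)> process;   // in-memory processing, no disk access
    std::vector<Operator*> downstream;              // edges of the query graph
};

int main() {
    // filter -> aggregate: a two-operator chain; when the diagram is
    // distributed, the two operators could run on different processing nodes.
    Operator aggregate{[](const Stream& in) {
        double sum = 0; for (const auto& t : in) sum += t.value;
        return Stream{{0, sum}};
    }, {}};
    Operator filter{[](const Stream& in) {
        Stream out; for (const auto& t : in) if (t.value > 1.0) out.push_back(t);
        return out;
    }, {&aggregate}};

    Stream input{{1, 0.5}, {2, 2.0}, {3, 3.0}};
    Stream result = aggregate.process(filter.process(input));
    std::cout << "aggregate output: " << result[0].value << "\n";   // 5
}
```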

  11. Background Stream Processing Engines • Goal: To develop a fault-tolerant scheme that does not require reserving half the memory in a cluster for fault tolerance -> save checkpoints to disk • To meet the challenge of high latency for writing to disks, SGuard extends the DFS concept Distributed, Replicated File System • To manage large data volumes, new types of large-scale file systems, or data stores in general, are emerging. • They bring two important properties:

  12. Background Distributed, Replicated File System • Support for Large Data Files • Designed to operate on large, multi-GB files • Optimized for bulk read operations • Different chunks are stored on different nodes • A single master node makes all chunk placement decisions • Fault-Tolerance through Replication • Assumes frequent failures • To protect against them, each data chunk is replicated on multiple nodes using synchronous replication -> the client sends data to the closest replica, which forwards it to the next closest replica in a chain, and so on (see the sketch below) • Once all replicas have the data, the client is notified that the write has completed • SGuard relies on this automatic replication to protect the distributed SPE against multiple simultaneous failures
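
A minimal sketch of the pipelined chain replication described above: the client sends the chunk to one replica, the data is forwarded down the chain, and the client is acknowledged only once every replica holds it. The names (Replica, write_chunk) are illustrative, not the actual GFS/HDFS interface.

```cpp
// Illustrative pipelined (chain) replication of one chunk write.
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

struct Replica {
    std::string node;                 // e.g. "svr1.2"
    std::vector<char> storage;        // where the chunk ends up
};

// The client sends the chunk only to the first (closest) replica; each
// replica stores the data and forwards it to the next one in the chain.
// The write completes only after every replica holds the chunk.
bool write_chunk(const std::vector<char>& chunk, std::vector<Replica>& chain) {
    for (std::size_t i = 0; i < chain.size(); ++i) {
        chain[i].storage = chunk;     // store locally
        // In a real system this forwarding is a network send to chain[i+1];
        // here the loop itself models the pipeline.
    }
    return !chain.empty();            // acknowledge the client
}

int main() {
    std::vector<Replica> chain = {{"svr1.2", {}}, {"svr2.2", {}}, {"svr2.3", {}}};
    std::vector<char> chunk(64 * 1024, 'x');   // one 64 KB chunk of checkpoint data
    if (write_chunk(chunk, chain))
        std::cout << "write acknowledged after " << chain.size() << " replicas\n";
}
```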

  13. Background

  14. SGuard Overview • Works as follows: • Each SPE node takes periodic checkpoints of its state and writes these checkpoints to stable storage • SGuard uses the DFS as stable storage • To take a checkpoint, a node suspends its processing and makes a copy of its state (avoided later by the MMM) • Between checkpoints, each node logs the output tuples it produces • When a failure occurs, a backup node recovers by reading the most recent checkpoint from stable storage and reprocessing all input tuples since then • A node sends a message to all its upstream neighbors to inform them of the checkpointed input tuples • Because nodes buffer output tuples, they can checkpoint their states independently and still ensure consistency in the face of a failure • Within a node, however, all interconnected operators must be suspended and checkpointed at the same time to capture a consistent snapshot of the SPE state • Interconnected groups of operators are called HA units (see the sketch below)
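
A minimal sketch of the per-node checkpoint cycle just described: suspend an HA unit, snapshot its state, resume processing, write the snapshot to the DFS, and notify upstream neighbors so they can trim their output logs. The types and calls are illustrative placeholders, not SGuard's real interface.

```cpp
// Illustrative checkpoint cycle for one HA unit.
#include <cstdint>
#include <iostream>
#include <vector>

struct HAUnit {
    std::vector<char> state;        // state of all interconnected operators
    uint64_t last_consumed_input;   // id of last input tuple reflected in the state
    void suspend()  { /* pause all operators in the unit */ }
    void resume()   { /* restart them */ }
};

std::vector<char> snapshot(const HAUnit& u) { return u.state; }      // copy (avoided by the MMM)
void write_to_dfs(const std::vector<char>& snap) { (void)snap; }     // replicated, durable write
void ack_upstream(uint64_t input_id) {
    std::cout << "upstream can trim its output log up to tuple " << input_id << "\n";
}

void checkpoint(HAUnit& unit) {
    unit.suspend();                          // all operators in the HA unit stop together
    auto snap = snapshot(unit);              // consistent copy of the unit's state
    uint64_t covered = unit.last_consumed_input;
    unit.resume();                           // processing continues while we write
    write_to_dfs(snap);                      // k+1 replicas tolerate k failures
    ack_upstream(covered);                   // upstream neighbors trim their buffers
}

int main() {
    HAUnit unit{std::vector<char>(1024, 0), 42};
    checkpoint(unit);                        // invoked periodically by each SPE node
}
```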

  15. SGuard Overview

  16. SGuard Overview • The level of replication of the DFS determines the level of fault-tolerance of SGuard -> k+1 replicas means k simultaneous failures are tolerated • Problem: Impact of checkpoints on the normal stream processing flow • Since operators must be suspended during checkpoints, each checkpoint introduces extra latency in the result stream • This latency will be increased by writing checkpoints to disk rather than keeping them in memory • To address these challenges, SGuard uses: • MMM (Memory Management Middleware) • PEACE Scheduler

  17. MMM Memory Manager Role of MMM • To checkpoint operators' states without serializing them first into a buffer • To enable concurrent checkpoints, where the state of an operator is copied to disk while the operator continues its execution Working Procedure of MMM • Partitions the SPE memory into a collection of "pages" – large fragments of memory allocated on the heap – • Operator states are stored inside these pages • To checkpoint the state of an operator, its pages are copied to disk • To enable an operator to execute during the checkpoint: • When the checkpoint begins, the operator is briefly suspended and all its pages are marked as read-only • The operator execution is then resumed and pages are written to disk in the background • If the operator touches a page that has not yet been written to disk -> it is briefly interrupted while the MMM makes a copy of the page (see the sketch below)
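
A minimal sketch of this copy-on-write procedure, assuming illustrative Page and PageManager types (not the MMM's real API): pages are marked read-only when the checkpoint begins, flushed in the background, and copied on demand if the operator touches them before they reach disk.

```cpp
// Illustrative copy-on-write checkpointing of application-level pages.
#include <cstddef>
#include <vector>

struct Page {
    std::vector<char> data;
    bool read_only = false;      // set when a checkpoint begins
    bool flushed   = false;      // set once the page reaches disk
};

struct PageManager {
    std::vector<Page> pages;
    std::vector<Page> shadow;    // copies made on write during a checkpoint

    void begin_checkpoint() {                 // operator briefly suspended here
        for (auto& p : pages) { p.read_only = true; p.flushed = false; }
    }
    // Called when the running operator is about to modify page i.
    Page& on_operator_write(std::size_t i) {
        Page& p = pages[i];
        if (p.read_only && !p.flushed) {      // page not yet on disk:
            shadow.push_back(p);              // copy the old version for the checkpoint
            p.read_only = false;              // operator may now modify the original
        }
        return p;
    }
    void background_flush() {                 // runs while the operator executes
        for (auto& p : shadow) { /* write p.data to the DFS */ p.flushed = true; }
        for (auto& p : pages)
            if (p.read_only) { /* write p.data to the DFS */ p.flushed = true; p.read_only = false; }
    }
};

int main() {
    PageManager pm;
    pm.pages.assign(4, Page{std::vector<char>(4096, 0)});
    pm.begin_checkpoint();          // operator briefly paused
    pm.on_operator_write(2);        // operator resumes and touches page 2 early
    pm.background_flush();          // remaining pages flushed in the background
}
```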

  18. MMM Memory Manager

  19. MMM Memory Manager • Page Manager: • Allocates, frees & checkpoints pages • Data Structures: • Implement data structure abstractions on top of the PM's page abstraction • Page Layout: • Simplifies the implementation of new data structures on top of the PM (the three layers are sketched below)
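
An illustrative sketch of how these three layers might fit together; all signatures are assumptions for exposition, not the MMM library's real interface.

```cpp
// Illustrative layering: Page Manager -> Page Layout -> paged data structures.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

using PageID = uint32_t;

// Page Manager: allocates, frees, and checkpoints application-level pages.
class PageManager {
    std::map<PageID, std::vector<char>> pages_;
    PageID next_ = 0;
public:
    PageID allocate(std::size_t bytes) { pages_[next_] = std::vector<char>(bytes); return next_++; }
    void   free(PageID id)             { pages_.erase(id); }
    char*  data(PageID id)             { return pages_.at(id).data(); }
    void   checkpoint(const std::vector<PageID>& ha_unit) { (void)ha_unit; /* copy-on-write flush */ }
};

// Page Layout: places fixed-size records inside a page, so new data
// structures do not manipulate raw bytes directly.
struct PageLayout {
    std::size_t record_size;
    char* record(PageManager& pm, PageID page, std::size_t slot) const {
        return pm.data(page) + slot * record_size;
    }
};

// Data Structures: e.g. an operator's window buffer whose entries live in
// pages, so checkpointing the pages checkpoints the operator's state.
struct PagedWindow {
    PageManager& pm; PageLayout layout; PageID page; std::size_t count = 0;
    void append(const char* tuple) {
        char* dst = layout.record(pm, page, count++);
        std::copy(tuple, tuple + layout.record_size, dst);
    }
};
```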

  20. PEACE Scheduler Problem: • When many nodes try to checkpoint HA units at the same time, they contend for network and disk resources, slowing down individual checkpoints PEACE's Solution • Peace runs inside the Coordinator at the DFS layer • Given a set of requests to write one or more data chunks: • Peace schedules the writes in a manner that reduces the time to write each set of chunks while keeping the total time for completing all writes small • By scheduling only as many concurrent writes as there are available resources • Scheduling all writes from the same set close together • Selecting the destination for each write in a manner that avoids resource contention Advantage: • Adding the scheduler at the DFS rather than the application layer means that the scheduler can better control and allocate resources

  21. PEACE Scheduler • Takes as input a model of the network in the form of a directed graph G = (V, E), where V is the set of vertices and E is the set of edges

  22. PEACE Scheduler • All write requests are scheduled using the following algorithm • Nodes submit write requests to the Coordinator in the form of triples (w, r, k) • The algorithm iterates over all requests • For each replica of each chunk, it selects the best destination node by solving a min-cost max-flow problem over the graph G -> extended with a source node s and a destination node d • s is connected to the writer node with capacity 1 • d is connected to all servers except the writer node with infinite capacity • The algorithm finds the minimum-cost, maximum-flow path from s to d • To ensure that different chunks & different replicas are written to different nodes, the algorithm selects the destination for one chunk replica at a time • All replicas of a chunk must be written to complete the whole write task (a sketch of this loop follows below)
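
An illustrative sketch of the destination-selection loop above: graph G is extended with a source s (edge to the writer, capacity 1) and a sink d (edges from every other server), and one unit of flow is routed per replica. A real implementation solves min-cost max-flow over the full switch topology; the stand-in solver below only considers direct writer-to-server edges, just enough to show the structure of the algorithm.

```cpp
// Illustrative per-replica destination selection over an extended graph G.
#include <iostream>
#include <vector>

struct Edge { int from, to, capacity, cost; };

struct Graph {
    int n = 0;
    std::vector<Edge> edges;
    void add_edge(int u, int v, int cap, int cost) { edges.push_back({u, v, cap, cost}); }
};

// Stand-in for a real min-cost max-flow solver: pick the cheapest
// writer->server edge whose server still has capacity on its edge to the
// sink d, and consume that capacity so later replicas go elsewhere.
int route_one_replica(Graph& g, int s, int d) {
    int writer = -1;
    for (const auto& e : g.edges)
        if (e.from == s && e.capacity > 0) writer = e.to;   // s -> writer, capacity 1
    int best = -1, best_cost = 1 << 30;
    Edge* sink_edge = nullptr;
    for (const auto& e : g.edges) {
        if (e.from != writer || e.capacity <= 0 || e.cost >= best_cost) continue;
        for (auto& f : g.edges)
            if (f.from == e.to && f.to == d && f.capacity > 0) {
                best = e.to; best_cost = e.cost; sink_edge = &f;
            }
    }
    if (sink_edge) sink_edge->capacity -= 1;                 // replica placed on this node
    return best;                                             // -1: no path this timestep
}

int main() {
    // Toy topology: writer = node 0, candidate servers = nodes 1..3.
    // Cost 1 = same rack as the writer, cost 2 = other rack.
    Graph g; g.n = 4;
    g.add_edge(0, 1, 1, 1);
    g.add_edge(0, 2, 1, 2);
    g.add_edge(0, 3, 1, 2);

    const int s = g.n, d = g.n + 1;        // extend G with source and sink
    g.add_edge(s, 0, 1, 0);                // only the writer emits data
    for (int srv = 1; srv <= 3; ++srv)
        g.add_edge(srv, d, 1, 0);          // each server takes at most one replica here

    for (int r = 0; r < 2; ++r)            // one replica at a time, so the two
        std::cout << "replica " << r << " -> node "   // replicas land on different nodes
                  << route_one_replica(g, s, d) << "\n";
}
```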

  23. PEACE Scheduler • 2 properties have to be satisfied: • To exploit network bandwidth resources efficiently, a node does not send a chunk directly to all replicas. Instead, it only sends it to one replica, which then transmits the data to the others in a pipeline • Second, to reduce correlations between failures, the DFS places replicas at different locations in the network (e.g., one copy on the same rack and another on a different rack) • All edges have capacity 1 • svr1.1 & svr2.1 request to write 1 chunk each with replication factor 2 • First, the request from svr1.1 is processed -> it assigns the first replica to svr1.2 and the second to svr2.2 – constraint: the replicas have to be on different racks – • Then, the request from svr2.1 is processed -> no path can be found for its first replica on the same rack – constraint: the first replica must be on the same rack – • Thus it is processed at time t+1

  24. PEACE Scheduler File-System Write Protocol • Each node submits write requests to the Coordinator indicating the number of chunks it needs to write • PEACE schedules the chunks • The Coordinator then uses callbacks to let the client know when & where to write each chunk • To keep track of the progress of writes, each node informs the Coordinator every time a write completes • Once a fraction of all scheduled writes has completed, PEACE declares the end of a timestep & moves on to the next timestep • If the schedule does not allow a client to start writing right away, the Coordinator returns an error message • In SGuard, when this message is received, the PM cancels the checkpoint by marking all pages as read-writable again • The state of the HA unit is checkpointed again when the node can finally start writing (the client side is sketched below)
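
A minimal sketch of the client side of this write protocol, with an assumed Coordinator reply type and cancellation path (not the actual SGuard/HDFS API): the node submits its chunk count, writes where the schedule tells it to, and cancels the checkpoint if it cannot start right away.

```cpp
// Illustrative client side of the scheduled write protocol.
#include <functional>
#include <iostream>
#include <vector>

struct WriteSlot { int chunk_index; int destination_node; };

// Possible shape of the Coordinator's reply: either a schedule of
// (chunk, destination) slots, or an error telling the client to retry later.
struct ScheduleReply {
    bool can_start_now;
    std::vector<WriteSlot> slots;
};

struct Page { bool read_only = true; };

void checkpoint_ha_unit(std::vector<Page>& pages, int num_chunks,
                        const std::function<ScheduleReply(int)>& submit_to_coordinator) {
    ScheduleReply reply = submit_to_coordinator(num_chunks);
    if (!reply.can_start_now) {
        // Coordinator cannot schedule the write right away: cancel the
        // checkpoint so the operator is not slowed down, and retry later.
        for (auto& p : pages) p.read_only = false;
        std::cout << "checkpoint canceled; will retry\n";
        return;
    }
    for (const auto& slot : reply.slots) {
        // Write the chunk to its assigned node, then report completion so the
        // Coordinator can track timestep progress (80% -> next timestep).
        std::cout << "chunk " << slot.chunk_index << " -> node "
                  << slot.destination_node << "\n";
    }
}

int main() {
    std::vector<Page> pages(4);
    auto coordinator = [](int chunks) {
        ScheduleReply r{true, {}};
        for (int i = 0; i < chunks; ++i) r.slots.push_back({i, i % 3});
        return r;
    };
    checkpoint_ha_unit(pages, 4, coordinator);
}
```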

  25. EVALUATION MMM Evaluation • MMM is implemented as a C++ library = 9K lines of code • The Borealis distributed SPE was modified to use it • To evaluate PEACE, HDFS was used • 500 lines of code added to Borealis • 3K lines of code added to HDFS

  26. EVALUATION MMM Evaluation Runtime Operator Overhead • Overhead of using the MMM to hold the state of an operator, compared with using a standard data structures library such as the STL • Study Join and Aggregate, the two most common stateful operators • Executed on a dual 2.66GHz quad-core machine with 16GB RAM • Running a 32-bit Linux kernel 2.6.22 and a single 500GB commodity SATA2 hard disk • JOIN does a self-join of a stream using an equality predicate on unique tuple identifiers • The overhead of using the MMM for the JOIN operator is within 3% for all 3 window sizes JOIN:

  27. EVALUATION MMM Evaluation Runtime Operator Overhead AGGREGATE: • AGGREGATE groups input tuples by a 4-byte group name & computes count, min, and max on a timestamp field • The MMM is more efficient for a large number of groups because it forces the allocation of larger memory chunks at a time, amortizing the per-group memory allocation cost • Negligible impact on operator performance • This required changing only 13 lines of JOIN operator code and 120 lines of AGGREGATE operator code

  28. EVALUATION MMM Evaluation Cost of Checkpoint Preparation • Before the state of an HA unit can be saved to disk, it must be prepared • This requires either serializing the state or copying all PageIDs and marking all pages as read-only JOIN: • Overhead is linear in the size of the checkpointed state • Avoiding serialization is beneficial

  29. EVALUATION MMM Evaluation Checkpoint Overhead • Overall runtime overhead is measured via the end-to-end stream processing latency while the SPE checkpoints the state of one operator • AGGREGATE is used -> it shows the worst-case performance since it touches pages randomly • Compare the performance of using the MMM against synchronously serializing state and asynchronously serializing state • Also compared with an off-the-shelf VM (VMware) • Fed 2.0K tuples/sec for 10 min while checkpointing every minute • The MMM's interruption is 2.84 times lower than its nearest competitor's

  30. EVALUATION MMM Evaluation Checkpoint Overhead • A common technique for reducing checkpointing overhead is to partition the state of an operator • Here, the Aggregate operator is split into 4 partitions

  31. EVALUATION MMM Evaluation Checkpoint Overhead • To hide the synchronous and asynchronous serialization overhead -> the operator must be split into 64 parts • This adds the overhead of managing all the partitions and may not always be possible • The MMM is the least disruptive to normal stream processing

  32. EVALUATION MMM Evaluation Recovery Performance • Once the coordinator detects a failure & selects a recovery node, recovery proceeds in 4 steps • The recovery node reads the checkpointed state from the DFS • It reconstructs the Page Manager state • It reconstructs any additional Data Structure state • It replays tuples logged at upstream HA units • Total Recovery Time = sum of the 4 steps + failure detection time + recovery node selection time • Overhead is negligible • MMM recovery imposes a low extra overhead once pages are read from disk (the sequence is sketched below)
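
A minimal sketch of this 4-step recovery sequence; all function names are illustrative placeholders for the corresponding SGuard components, not its real interfaces.

```cpp
// Illustrative 4-step recovery of a failed node's HA unit.
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

struct Checkpoint { std::vector<char> pages; uint64_t last_input_tuple; };

Checkpoint read_checkpoint_from_dfs(const std::string& path) {        // step 1
    (void)path; return {std::vector<char>(128, 0), 42};
}
void rebuild_page_manager(const Checkpoint& c)    { (void)c; }         // step 2
void rebuild_data_structures(const Checkpoint& c) { (void)c; }         // step 3
void replay_logged_tuples(uint64_t from_tuple) {                       // step 4
    std::cout << "replaying upstream-logged tuples after " << from_tuple << "\n";
}

// Invoked by the coordinator once a failure is detected and a recovery node
// has been selected; total recovery time also includes those two delays.
void recover(const std::string& checkpoint_path) {
    Checkpoint c = read_checkpoint_from_dfs(checkpoint_path);
    rebuild_page_manager(c);
    rebuild_data_structures(c);
    replay_logged_tuples(c.last_input_tuple);
}

int main() { recover("/dfs/checkpoints/ha_unit_7"); }
```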

  33. EVALUATION PEACE Evaluation • Evaluate the performance of reading checkpoints from the DFS and writing them to the DFS • Cluster of 17 machines in two racks (8 & 9 machines each) • All machines in a rack share a gigabit ethernet switch and run the HDFS data node program • The two racks are connected with a gigabit ethernet link • First measure the performance of an HDFS client writing 1GB of data to the HDFS cluster using varying degrees of parallelism • As the number of threads increases, the total throughput at the client also increases, until network capacity is reached with 4 threads • It takes more than one thread to saturate the network because the client computes a checksum for each data chunk that it writes • With 4 threads a client reads at 95 MB/s, so it can recover a 128 MB HA unit in about 2 seconds • When more clients access the same data node, performance drops significantly

  34. EVALUATION PEACE Evaluation • Measure the time to write checkpoints to the DFS when multiple nodes checkpoint their states simultaneously • The data to checkpoint is already prepared • Vary the number of concurrently checkpointing nodes from 8 to 16 and the size of each checkpoint from 128MB to 512MB • The replication level is set to 3 • Compare the performance against the original HDFS implementation • Peace waits until 80% of the writes in a timestep complete before starting the next one • Global completion time increases for larger aggregate volumes of I/O tasks • Individual tasks complete their I/O much faster than in the original HDFS

  35. EVALUATION PEACE Evaluation • Peace significantly reduces the latency of individual writes with only a small decrease in overall resource utilization • Since it now takes longer for all nodes to checkpoint their states, the maximum checkpoint frequency is reduced • This means longer recoveries, as more tuples need to be re-processed after a failure • This is the correct trade-off, since recovery with passive standby imposes a small delay on streams anyway

  36. Related Work • Semi-transparent approaches • The C3 application-level checkpointing • OODBMS Storage Manager • BerkeleyDB

  37. CONCLUSION • SGuard leverages the existence of a new type of DFS to provide efficient fault tolerance at a lower cost than previous proposals • SGuard extends the DFS with Peace, a new scheduler that reduces the time to write individual checkpoints in the face of high contention • SGuard also improves the transparency of SPE checkpoints through the Memory Management Middleware, which enables efficient asynchronous checkpointing • The performance of SGuard is promising • With Peace and the DFS, nodes in a 17-server cluster can each checkpoint 512MB of state in less than 20s • The MMM efficiently hides this checkpointing activity

  38. References • Y. Kwon, M. Balazinska, and A. Greenberg. Fault-Tolerant Stream Processing Using a Distributed, Replicated File System. PVLDB, 1(1):574–585, 2008. • M. Balazinska, H. Balakrishnan, S. R. Madden, and M. Stonebraker. Fault-Tolerance in the Borealis Distributed Stream Processing System. ACM, New York, NY, USA, ISSN 0362-5915, 2008.
