StreamScope, S-Store

StreamScope, S-Store Akshun Gupta, Karthik Bala

What is Stream Processing? • “Stream processing is designed to analyze and act on real-time streaming data, using ‘continuous queries’.” (InfoQ, Stream Processing) • Difference from batch processing: “Ability to process potentially infinite input events continuously with delays in seconds and minutes, rather than processing a static data set in hours and days.” (StreamS paper)

Applications of Stream Processing • Twitter uses stream processing to show trending tweets • Algorithmic Trading or High Frequency Trading • Surveillance using sensors • Realtime Analytics • And many more!

Stream Processing: Challenges • Continuous infinite amounts of data • Need to deal with failures and planned maintenance • Latency sensitive • Need for high throughput All of this makes stream applications hard to develop, debug, and deploy!

StreamScope: Continuous Reliable Distributed Processing of Big Data Streams Microsoft Research Presented by Akshun Gupta

StreamScope - General Information • Paper came out of Microsoft Research • Has been deployed in a shared 20k server production cluster at Microsoft • Runs Microsoft’s core online advertisement service - created to handle business critical applications - supposed to give strong guarantees

Motivation Want to design a streaming computation engine that can • Process each event exactly once despite server failures and message loss • Handle large amounts of load • Scale well • Travel back in time • Continue operation during maintenance • Make distributed streaming programming easy

Key Contributions • StreamS shows that a streaming computation engine does not need to unnaturally convert streaming computation into a series of mini-batch jobs (as, e.g., Apache Spark does). • Introduction of two abstractions, rVertex and rStream, to simplify creating, debugging, and understanding streaming computation engines. • Proven system: deployed in production, running business-critical applications while coping with failures and variations.

StreamS Abstractions - DAG • Execution of program modeled as a DAG • Vertex performs local computation • Instreams and OutStreams

StreamS Abstractions - rStream • Abstraction to decouple upstream and downstream vertices with failure recovery mechanisms. • Maintains sequence of events and sequence numbers. • Provides API calls Write, Read, GarbageCollect • Maintains the following properties: • Uniqueness: Unique value for each sequence number • Validity: If a Read happens for seq, a Write for seq is guaranteed to have happened • Reliability: For any Write(seq, e), Read(seq) will return e
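The three rStream properties above can be illustrated with a toy in-memory stream. This is only a sketch: the class `RStream`, its method names, and the choice to silently ignore writes below the garbage-collection point are assumptions made for illustration, not StreamScope's actual API or implementation.

```python
class RStream:
    """Toy in-memory stand-in for an rStream (hypothetical, for illustration)."""

    def __init__(self):
        self._events = {}    # sequence number -> event
        self._gc_upto = -1   # highest sequence number already collected

    def write(self, seq, event):
        # Assumption: writes at or below the collected point are ignored.
        if seq <= self._gc_upto:
            return
        # Uniqueness: a sequence number maps to exactly one value.
        # Rewriting the same value (e.g. after an upstream restart) is a no-op.
        if seq in self._events and self._events[seq] != event:
            raise ValueError(f"conflicting write for seq {seq}")
        self._events[seq] = event

    def read(self, seq):
        # Validity: a value is returned only if a write for seq happened.
        # Reliability: the value returned is exactly the one written.
        if seq <= self._gc_upto:
            raise KeyError(f"seq {seq} was garbage collected")
        return self._events.get(seq)  # None means "not written yet"

    def garbage_collect(self, seq):
        # Drop every event with a sequence number <= seq.
        for s in [k for k in self._events if k <= seq]:
            del self._events[s]
        self._gc_upto = max(self._gc_upto, seq)


stream = RStream()
stream.write(0, "click:a")
stream.write(1, "click:b")
stream.write(1, "click:b")      # duplicate after a retry is accepted
assert stream.read(1) == "click:b"
stream.garbage_collect(0)       # the downstream vertex no longer needs seq 0
```

Because a duplicate write of the same value is harmless, an upstream vertex can be restarted and resend events without the downstream vertex noticing.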

StreamS Abstractions - rVertex • Vertex can save state with snapshots • If Vertex fails, it can be restarted with Load(s). s is a saved snapshot. • rVertex guarantees determinism • Running Execute() on the same snapshot will produce the same result • Determinism ensures correctness. • Requires user defined functions to behave deterministically
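A minimal sketch of the snapshot, load, and execute cycle, assuming a hypothetical `CountingVertex` whose user-defined logic is deterministic. The class and method names are illustrative only and are not taken from the paper.

```python
import copy


class CountingVertex:
    """Hypothetical vertex that counts how often each event value was seen."""

    def __init__(self):
        self.state = {"processed": 0, "counts": {}}

    def execute(self, event):
        # Deterministic: the output depends only on the state and the input.
        counts = self.state["counts"]
        counts[event] = counts.get(event, 0) + 1
        self.state["processed"] += 1
        return (event, counts[event])

    def snapshot(self):
        # Deep copy so that later executions cannot mutate the snapshot.
        return copy.deepcopy(self.state)

    def load(self, s):
        self.state = copy.deepcopy(s)


def run(vertex, events):
    return [vertex.execute(e) for e in events]


vertex = CountingVertex()
run(vertex, ["a", "b"])
saved = vertex.snapshot()
first_try = run(vertex, ["a", "c"])

# Simulate a failure: a fresh instance loads the snapshot and re-executes.
replacement = CountingVertex()
replacement.load(saved)
second_try = run(replacement, ["a", "c"])

assert first_try == second_try
```

If `execute` consulted a clock or a random number generator, the two runs could diverge, which is why the user-defined functions must be deterministic.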

Architecture

Failure Recovery Strategies • Checkpoint-based recovery • Not performant when vertices hold large internal state • Replay-based recovery • Rebuild state by replaying the most recent window of events (e.g., the last 5 minutes) • Deterministic execution property comes in handy • Might have to reload a large window, but checkpoints are needed less frequently • Replication-based recovery • Multiple instances of the same vertex run at the same time • Determinism ensures that instances of the same vertex on different machines produce the same output • Overhead of extra resources
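The difference between checkpoint-based and replay-based recovery can be sketched for a vertex whose state is a sliding window. The example assumes a count-based window of three events rather than a time window, and all function names are hypothetical.

```python
from collections import deque

WINDOW = 3  # assumed window size, in events (the slides use a time window)


def apply(window, event):
    """Add one event to the sliding window and return the current window sum."""
    window.append(event)
    if len(window) > WINDOW:
        window.popleft()
    return sum(window)


def recover_from_checkpoint(checkpoint, events_after_checkpoint):
    # Checkpoint-based: load the saved state, then apply only later events.
    window = deque(checkpoint)
    for e in events_after_checkpoint:
        apply(window, e)
    return window


def recover_by_replay(recent_events):
    # Replay-based: start from empty state and replay the most recent window.
    window = deque()
    for e in recent_events[-WINDOW:]:
        apply(window, e)
    return window


events = [4, 1, 7, 2, 9]
checkpoint = [4, 1, 7]          # window contents saved after the third event

from_checkpoint = recover_from_checkpoint(checkpoint, events[3:])
from_replay = recover_by_replay(events)
assert list(from_checkpoint) == list(from_replay)
```

Both paths arrive at the same window because `apply` is deterministic. The trade-off is in what must be stored and reloaded: a saved state plus a short tail of events, or no saved state and a full window of events.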

Evaluation • Detect fraud clicks of online transactions • 3220 Vertices • 9.5 TB of events processed • 180 TB I/O • 21.3 TB aggregate memory usage • 7 day evaluation period

Evaluation - Failure Impact on Latency* A: Failed machines had high in-memory state → Latency increased for small number of failures B: Large number of failures but vertices did not have high in-memory state C: Unscheduled mass outage of machines → significant increase in latency D: scheduled maintenance → graceful transition and no significant increase in latency *End-to-end latency

Evaluation - Scalability X Axis: Degree of Parallelism Y Axis: Maximum throughput sustained under a 1-second latency bound.

Comparing Failure Recovery Strategies • No effect on latency when using the replication strategy • Replay causes a longer latency delay than checkpointing because, in the common case, the state in a checkpoint is more condensed than the events that must be replayed • Within the company, 25% use replay-based recovery and the others use checkpointing

Comments • The paper does not compare StreamScope with other streaming systems such as Spark or Storm. • No outlook is given on whether the system will be offered as PaaS or made open source. • The restriction to deterministic applications is significant.

Key Takeaways • Introduction of abstractions rStream and rVertex • A new way to design streaming systems • Decoupling upstream and downstream vertices • Valuable engineering advice • Good comparison between failover strategies • Checkpointing • Replay Based • Replication Based • Proven system under production load • Business critical application • 20k+ nodes used • Scaling is robust

S-Store Presented by Karthik Bala

Streaming Meets Transaction Processing • Streaming: handle large amounts of data, but... • Transaction Processing: ACID guarantees, but... Challenge: Build a streaming system which provides shared mutable state

Guarantees • Transactions are stored procedures with input parameters • “Recall that it is the data that is sent to the query in streaming systems in contrast to the standard DBMS model of sending the query to the data” • OLTP transaction: can access public tables, “pull based” • Streaming transaction: can access public tables, windows, and streams, “push based”

Contributions • Start with traditional OLTP database system (H-Store) and add streaming transactions • streams and windows represented as time-varying state • triggers to enable push-based processing over such state • a streaming scheduler that ensures correct transaction ordering • a variant on H-Store’s recovery scheme that ensures exactly-once processing for streams

Transaction Execution s: stream b: atomic batch w: window (difference?) T: transaction

Transaction Execution • ACID: Wait until T commits to make its writes public • Valid orderings? • For an ordering to be correct • Must follow the topological ordering of the dataflow graph (relaxed if the graph has multiple orderings) • All batches must be processed in order
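The two ordering rules above can be written as a small schedule checker. This is a sketch under assumptions: a schedule is modeled as a list of `(transaction, batch)` pairs, the dataflow graph as a list of `(upstream, downstream)` edges, and the function name is hypothetical.

```python
def is_valid_schedule(schedule, edges):
    """Check a list of (transaction, batch) steps against both ordering rules."""
    position = {step: i for i, step in enumerate(schedule)}
    if len(position) != len(schedule):
        return False  # a (transaction, batch) pair ran more than once

    # Rule 1: for each batch, an upstream transaction must run before
    # every downstream transaction that consumes its output.
    for (txn, batch), pos in position.items():
        for upstream, downstream in edges:
            if downstream == txn:
                up_pos = position.get((upstream, batch))
                if up_pos is None or up_pos > pos:
                    return False

    # Rule 2: each transaction must process its batches in increasing order.
    last_batch = {}
    for txn, batch in schedule:
        if txn in last_batch and batch <= last_batch[txn]:
            return False
        last_batch[txn] = batch
    return True


EDGES = [("T1", "T2")]  # dataflow graph: T1 feeds T2

interleaved = [("T1", 1), ("T2", 1), ("T1", 2), ("T2", 2)]
pipelined = [("T1", 1), ("T1", 2), ("T2", 1), ("T2", 2)]
wrong_dataflow_order = [("T2", 1), ("T1", 1)]
wrong_batch_order = [("T1", 2), ("T1", 1), ("T2", 1), ("T2", 2)]

for name, s in [("interleaved", interleaved), ("pipelined", pipelined),
                ("wrong_dataflow_order", wrong_dataflow_order),
                ("wrong_batch_order", wrong_batch_order)]:
    print(name, is_valid_schedule(s, EDGES))
```

The first two schedules satisfy both rules, which shows why more than one ordering can be valid for the same dataflow graph.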

Hybrid Schedules, Nested Transactions • Any OLTP transaction can interleave between any pair of streaming transactions (in a valid TE schedule) • Nested transactions : two or more transactions which execute like a block • No transaction can interleave between nested transactions

H-Store Architecture • Commit Log, Checkpointing • Layers

S-Store Extensions • Streams: time varying H-Store tables • Persistent, recoverable • Triggers • Attached to tables, activate when tuples added • PE/EE triggers • Window Tables
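Push-based processing with triggers can be sketched as a table that calls registered callbacks whenever tuples are inserted. The `StreamTable` class and the vote-counting procedure are hypothetical illustrations and do not reflect S-Store's actual interfaces or the distinction between PE and EE triggers.

```python
class StreamTable:
    """Hypothetical stream modeled as a table with insert triggers."""

    def __init__(self, name):
        self.name = name
        self.rows = []
        self._triggers = []

    def on_insert(self, callback):
        # Attach a trigger that fires whenever a batch of tuples is added.
        self._triggers.append(callback)

    def insert(self, batch):
        self.rows.extend(batch)
        # Push-based: the arriving data activates the downstream processing.
        for callback in self._triggers:
            callback(batch)


votes = StreamTable("votes")
totals = {}  # stands in for a public table updated by the downstream step


def count_votes(batch):
    for contestant in batch:
        totals[contestant] = totals.get(contestant, 0) + 1


votes.on_insert(count_votes)
votes.insert(["alice", "bob"])   # first atomic batch
votes.insert(["alice"])          # second atomic batch
print(totals)
```

No client has to poll `votes` for new tuples. The insert itself drives the downstream step, which is the contrast with pull-based OLTP access.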

Fault Tolerance • Goal: Exactly once processing • Even if a failure happens, state must be as if transaction T occurred exactly once! • Weak recovery: correct but nondeterministic results

Recovery • Strong Recovery • Use H-Store’s commit log from latest snapshot + disable PE triggers (why?) • Weak Recovery • Apply Snapshot • Start at the inputs of dataflow graph (cached) • Leave PE triggers as is! Need interior transactions that were not logged to be re-executed • Finally, replay the log

Performance and Evaluation

Performance and Evaluation (2)

Performance and Evaluation (3)

Key Takeaways • Ordering • Push-based processing (triggers!) • Weak vs. strong recovery • ACID guarantees

Discussion • S-Store: >1 node?! • S-Store evaluation methods okay? • Implementation of different failure strategies for each vertex not given in the paper. • No details on how the optimizer works - how does it know the cost of running the application before deploying? • Job Manager fault tolerance not talked about in the paper. If not replicated, it is a single point of failure • Lack of custom DAG creation - probably because they have optimized for their own workload and applications
