
Spark Streaming Preview



Presentation Transcript


  1. Spark Streaming Preview: Fault-Tolerant Stream Processing at Scale. Matei Zaharia, Tathagata Das, Haoyuan Li, Scott Shenker, Ion Stoica. UC Berkeley

  2. Motivation • Many important applications need to process large data streams arriving in real time • User activity statistics (e.g. Facebook’s Puma) • Spam detection • Traffic estimation • Network intrusion detection • Our target: large-scale apps that need to run on tens to hundreds of nodes with O(1 sec) latency

  3. System Goals • Simple programming interface • Automatic fault recovery (including state) • Automatic straggler recovery • Integration with batch & ad-hoc queries (we want one API for all your data analysis)

  4. Traditional Streaming Systems • “Record-at-a-time” processing model • Each node has mutable state • Event-driven API: for each record, update state and send out new records [Figure: input records pushed between nodes 1-3, each node holding mutable state]
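
As a hedged illustration of this model (plain Scala; the `Record` and `RecordAtATimeNode` types are hypothetical, not any real system's API), an event-driven node might look like this: each incoming record mutates node-local state and may push new records downstream.

```scala
import scala.collection.mutable

// Hypothetical sketch of a record-at-a-time streaming node: each record
// mutates node-local state and may emit new records downstream.
case class Record(key: String, value: Long)

class RecordAtATimeNode(downstream: Record => Unit) {
  private val state = mutable.Map.empty[String, Long] // mutable, node-local state

  // Event-driven API: called once per incoming record.
  def onRecord(rec: Record): Unit = {
    val updated = state.getOrElse(rec.key, 0L) + rec.value
    state(rec.key) = updated
    downstream(Record(rec.key, updated)) // push the new value to the next node
  }
}
```

Losing this node loses `state`, which is why such systems must replicate it or replay from upstream backups, as the next slide discusses.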

  5. Challenges with Traditional Systems • Fault tolerance • Either replicate the whole system (costly) or use upstream backup (slow to recover) • Stragglers (typically not handled) • Consistency (few guarantees across nodes) • Hard to unify with batch processing

  6. Our Model: “Discretized Streams” • Run each streaming computation as a series of very small, deterministic batch jobs • E.g. a MapReduce every second to count tweets • Keep state in memory across jobs • New Spark operators allow “stateful” processing • Recover from faults/stragglers in the same way as MapReduce (by rerunning tasks in parallel); see the sketch below
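
A minimal sketch of this execution model, assuming a toy in-memory representation (plain Scala, not Spark's API): each interval's input is processed by a small deterministic batch job whose result is merged into state carried across intervals.

```scala
// Toy model of discretized execution: one deterministic batch job per
// interval, with state (running word counts) kept across intervals.
def runDiscretized(
    intervals: Iterator[Seq[String]],      // one batch of input records per interval
    initial: Map[String, Long] = Map.empty
): Map[String, Long] =
  intervals.foldLeft(initial) { (state, batch) =>
    // The per-interval "MapReduce": count words in this batch.
    val counts = batch.groupBy(identity).map { case (w, ws) => w -> ws.size.toLong }
    // Deterministically merge the interval's counts into the running state.
    counts.foldLeft(state) { case (s, (w, c)) => s.updated(w, s.getOrElse(w, 0L) + c) }
  }
```

Because each interval is a pure function of its input and the prior state, a lost result can be recomputed by rerunning the batch, which is exactly the recovery story on slide 9.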

  7. Discretized Streams in Action [Figure: for each input stream, the records arriving in each interval (t = 1, t = 2, …) are stored reliably as an immutable dataset; a batch operation then produces an immutable output/state dataset, kept in memory as a Spark RDD]

  8. Example: View Count • Keep a running count of views to each webpage
    views = readStream("http://...", "1s")
    ones = views.map(ev => (ev.url, 1))
    counts = ones.runningReduce(_ + _)
  [Figure: at each interval (t = 1, t = 2, …), a map step turns views into ones and a reduce step folds them into counts; the legend distinguishes datasets from their partitions]
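
The semantics of `runningReduce` can be sketched in plain Scala (`readStream` and `runningReduce` are the preview API shown on the slide; the helper below is a hypothetical model, not Spark code): the counts at interval t are the per-interval partial counts folded cumulatively into the previous state.

```scala
// Plain-Scala model of runningReduce(_ + _) per key: a cumulative fold of
// each interval's partial (url -> count) map into the prior state.
def runningReduceModel(
    perInterval: Seq[Map[String, Int]]   // one (url -> count) map per interval
): Seq[Map[String, Int]] =
  perInterval.scanLeft(Map.empty[String, Int]) { (acc, cur) =>
    cur.foldLeft(acc) { case (m, (url, n)) => m.updated(url, m.getOrElse(url, 0) + n) }
  }.tail   // drop the empty seed; element t holds the counts after interval t
```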

  9. Fault Recovery • Checkpoint state datasets periodically • If a node fails or straggles, rebuild its data in parallel on other nodes using the dependency graph [Figure: input dataset feeding a map that produces the output dataset] • Fast recovery without the cost of full replication
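
A toy sketch of lineage-based recovery (hypothetical types, not Spark's internals): each dataset remembers its parent and the deterministic function that produced it, so a lost partition is recomputed from the nearest checkpoint rather than restored from a full replica.

```scala
// Toy lineage model: a dataset is either checkpointed data or a deterministic
// transformation of a parent; recovery recomputes from the nearest checkpoint.
sealed trait Dataset[A] { def compute(): Seq[A] }

case class Checkpointed[A](data: Seq[A]) extends Dataset[A] {
  def compute(): Seq[A] = data                    // read back from reliable storage
}

case class Mapped[A, B](parent: Dataset[A], f: A => B) extends Dataset[B] {
  def compute(): Seq[B] = parent.compute().map(f) // rerun the deterministic task
}

// Losing Mapped's output costs one parallel recomputation, not a full replica:
//   val recovered = Mapped(Checkpointed(input), mapFn).compute()
```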

  10. How Fast Can It Go? • Currently handles 4 GB/s of data (42 million records/s) on 100 nodes at sub-second latency • Recovers from failures/stragglers within 1 sec

  11. Outline • Introduction • Programming interface • Implementation • Early results • Future development

  12. D-Streams • A discretized stream is a sequence of immutable, partitioned datasets • Specifically, each dataset is an RDD (resilient distributed dataset), the storage abstraction in Spark • Each RDD remembers how it was created, and can recover if any part of the data is lost

  13. D-Streams • D-Streams can be created… • either from live streaming data • or by transforming other D-streams • Programming with D-Streams is very similar to programming with RDDs in Spark

  14. D-Stream Operators • Transformations • Build new streams from existing streams • Include existing Spark operators, which act on each interval in isolation, plus new “stateful” operators • Output operators • Send data to outside world (save results to external storage, print to screen, etc.)

  15. Example 1 • Count the words received every second
    words = readStream("http://...", Seconds(1))
    counts = words.count()
  [Figure: the words D-Stream yields one RDD per interval (time 0-1, 1-2, 2-3), and the count transformation produces a count for each]

  16. Demo • Setup • 10 EC2 m1.xlarge instances • Each instance receiving a stream of sentences at a rate of 1 MB/s (10 MB/s in total) • Spark Streaming receives the sentences and processes them

  17. Example 2 • Count frequency of words received every second
    words = readStream("http://...", Seconds(1))
    ones = words.map(w => (w, 1))    // w => (w, 1) is a Scala function literal
    freqs = ones.reduceByKey(_ + _)
  [Figure: at each interval (time 0-1, 1-2, 2-3), map turns words into ones and reduce produces freqs]

  18. Demo

  19. Example 3 • Count frequency of words received in the last minute
    ones = words.map(w => (w, 1))
    freqs = ones.reduceByKey(_ + _)
    freqs_60s = freqs.window(Seconds(60), Seconds(1))   // sliding window operator: window length, window movement
                     .reduceByKey(_ + _)
  [Figure: map produces ones, reduce produces freqs, and the window plus reduce steps combine the last 60 intervals into freqs_60s]
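
The window operator's meaning can be modeled in a few lines of plain Scala (a toy representation where a D-Stream is a sequence of per-interval datasets, with length and slide given in intervals, e.g. 60 and 1 for a 60 s window over 1 s batches):

```scala
// Toy model of window(length, slide): each emitted dataset is the union of
// the last `length` per-interval datasets, emitted every `slide` intervals.
def window[A](stream: Seq[Seq[A]], length: Int, slide: Int): Seq[Seq[A]] =
  (0 until stream.size by slide).map { t =>
    val from = math.max(0, t - length + 1)  // earliest interval still in the window
    stream.slice(from, t + 1).flatten       // union of the covered intervals
  }
```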

  20. Simpler running reduce • The window-then-reduce pattern
    freqs = ones.reduceByKey(_ + _)
    freqs_60s = freqs.window(Seconds(60), Seconds(1)).reduceByKey(_ + _)
  can be written with a single operator:
    freqs = ones.reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(1))

  21. Demo

  22. “Incremental” window operators [Figure: two ways to compute freqs_60s from words/freqs across intervals t-1 … t+4: with a plain aggregation function (+), each window re-reduces every interval it covers; with an invertible aggregation function (+ and -), each window is derived from the previous one by adding the newest interval and subtracting the oldest]
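
A sketch of the invertible-function optimization in plain Scala (a toy model over per-interval partial counts, with hypothetical helper names): instead of re-reducing the whole window, update the previous window's result by adding the interval that enters and subtracting the one that leaves.

```scala
// Incremental sliding-window counts: derive each window from the previous one
// by adding the newest interval (+) and subtracting the oldest (-).
def incrementalWindow(
    perInterval: Seq[Map[String, Long]],  // partial counts, one map per interval
    length: Int                           // window length in intervals
): Seq[Map[String, Long]] = {
  def merge(a: Map[String, Long], b: Map[String, Long], sign: Long) =
    b.foldLeft(a) { case (m, (k, v)) => m.updated(k, m.getOrElse(k, 0L) + sign * v) }

  perInterval.indices.foldLeft(List.empty[Map[String, Long]]) { (acc, t) =>
    val prev  = acc.headOption.getOrElse(Map.empty[String, Long])
    val added = merge(prev, perInterval(t), 1L)                        // the _ + _ step
    val next  =
      if (t - length >= 0) merge(added, perInterval(t - length), -1L)  // the _ - _ step
      else added
    next :: acc
  }.reverse
}
```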

  23. Smarter running reduce • The earlier formulations
    freqs = ones.reduceByKey(_ + _)
    freqs_60s = freqs.window(Seconds(60), Seconds(1)).reduceByKey(_ + _)
    freqs = ones.reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(1))
  can instead pass an inverse function so each window is updated incrementally:
    freqs = ones.reduceByKeyAndWindow(_ + _, _ - _, Seconds(60), Seconds(1))

  24. Output Operators • save: write results to any Hadoop-compatible storage system (e.g. HDFS, HBase) • foreachRDD: run a Spark function on each RDD
    freqs.save("hdfs://...")
    words.foreachRDD(wordsRDD => {
      // any Spark/Scala processing, maybe save to a database
    })

  25. Live + Batch + Interactive • Combining D-Streams with historical datasets:
    pageViews.join(historicCounts).map(...)
  • Interactive queries on stream state from the Spark interpreter:
    pageViews.slice("21:00", "21:05").topK(10)
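
A hedged model of what `slice` might do (`slice` and `topK` appear only on this slide; the representation below, a sequence of interval-stamped datasets, is hypothetical): pick the RDDs whose intervals fall in the requested range and query their union.

```scala
// Toy model: a D-Stream as (intervalStartMillis -> dataset) pairs.
def slice[A](stream: Seq[(Long, Seq[A])], from: Long, to: Long): Seq[A] =
  stream.collect { case (start, data) if start >= from && start < to => data }.flatten

// e.g. the 10 most frequent items in the sliced range:
def topK(data: Seq[String], k: Int): Seq[(String, Int)] =
  data.groupBy(identity).map { case (x, xs) => (x, xs.size) }
      .toSeq.sortBy(-_._2).take(k)
```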

  26. Outline • Introduction • Programming interface • Implementation • Early results • Future development

  27. System Architecture • Built on an optimized version of Spark [Figure: a Master holding the D-stream lineage, task scheduler, and block tracker coordinates Workers, each with an input receiver, task execution, and a block manager; Clients feed input, and input & checkpoint RDDs are replicated across workers]

  28. Implementation Optimizations on current Spark: • New block store • APIs: Put(key, value, storage level), Get(key) • Optimized scheduling for <100ms tasks • Bypass Mesos cluster scheduler (tens of ms) • Fast NIO communication library • Pipelining of jobs from different time intervals
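
The new block store's Put/Get API might look roughly like the following Scala interface (a hypothetical sketch of the calls named on the slide, not the actual implementation):

```scala
// Hypothetical sketch of the block store interface from this slide.
sealed trait StorageLevel
case object MemoryOnly       extends StorageLevel
case object MemoryAndDisk    extends StorageLevel
case object MemoryReplicated extends StorageLevel // replicate input/checkpoint blocks

trait BlockStore {
  // Store a block under `key` at the requested storage level.
  def put(key: String, value: Array[Byte], level: StorageLevel): Unit
  // Retrieve a block if this node (or a replica) still holds it.
  def get(key: String): Option[Array[Byte]]
}
```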

  29. Evaluation • Ran on up to 100 “m1.xlarge” machines on EC2 • 4 cores, 15 GB RAM each • Three applications: • Grep: count lines matching a pattern • Sliding word count • Sliding top K words

  30. Scalability [Figure: maximum throughput attainable with 1 s or 2 s latency on 100-byte records: 100K-500K records/s/node]

  31. Performance vs Storm and S4 • Storm limited to 10,000 records/s/node • Also tried S4: 7,000 records/s/node • Commercial systems report on the order of 100K records/s in aggregate

  32. Fault Recovery • Recovers from failures within 1 second [Figure: recovery time for sliding WordCount on 10 nodes with a 30 s checkpoint interval]

  33. Fault Recovery [Figures: recovery behavior under node failures and under stragglers]

  34. Interactive Ad-Hoc Queries

  35. Outline • Introduction • Programming interface • Implementation • Early results • Future development

  36. Future Development • An alpha of discretized streams will go into Spark by the end of the summer • Engine improvements from the Spark Streaming project are already there (in the “dev” branch) • Together, these make Spark a powerful platform for both batch and near-real-time analytics

  37. Future Development • Other things we’re working on/thinking of: • Easier deployment options (standalone & YARN) • Hadoop-based deployment (run as Hadoop job)? • Run Hadoop mappers/reducers on Spark? • Java API? • Need your feedback to prioritize these!

  38. More Details • You can find more about Spark Streaming in our paper: http://tinyurl.com/dstreams

  39. Related Work • Bulk incremental processing (CBP, Comet) • Periodic (~5 min) batch jobs on Hadoop/Dryad • On-disk, replicated FS for storage instead of RDDs • Hadoop Online • Does not recover stateful ops or allow multi-stage jobs • Streaming databases • Record-at-a-time processing, generally replication for FT • Approximate query processing, load shedding • Do not support the loss of arbitrary nodes • Different math because drop rate is known exactly • Parallel recovery (MapReduce, GFS, RAMCloud, etc)

  40. Timing Considerations • D-streams group input into intervals based on when records arrive at the system • For apps that need to group by an “external” time and tolerate network delays, support: • Slack time: delay starting a batch for a short fixed time to give records a chance to arrive • Application-level correction: e.g. give a result for time t at time t+1, then use later records to update incrementally at time t+5
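
A small sketch of the slack-time idea (plain Scala; `Event` and `SlackBuckets` are hypothetical names): records are grouped by their external timestamp, and an interval's batch is only closed once the clock passes its end plus the slack, giving delayed records a chance to arrive.

```scala
// Toy slack-time bucketing: group records by external-time interval and only
// emit a batch once now >= intervalEnd + slack.
case class Event(externalTime: Long, payload: String)

class SlackBuckets(intervalMs: Long, slackMs: Long) {
  private var buckets = Map.empty[Long, Vector[Event]] // interval start -> events

  def add(e: Event): Unit = {
    val start = (e.externalTime / intervalMs) * intervalMs
    buckets = buckets.updated(start, buckets.getOrElse(start, Vector.empty) :+ e)
  }

  // Close and return every interval whose end + slack has passed.
  def closeReady(now: Long): Seq[(Long, Vector[Event])] = {
    val (ready, rest) = buckets.partition { case (start, _) => now >= start + intervalMs + slackMs }
    buckets = rest
    ready.toSeq.sortBy(_._1)
  }
}
```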
