Spark Streaming Preview

Fault-Tolerant Stream Processing at Scale

Matei Zaharia, Tathagata Das, Haoyuan Li, Scott Shenker, Ion Stoica

UC BERKELEY


Motivation

  • Many important applications need to process large data streams arriving in real time

    • User activity statistics (e.g. Facebook’s Puma)

    • Spam detection

    • Traffic estimation

    • Network intrusion detection

  • Our target: large-scale apps that need to run on tens to hundreds of nodes with O(1 sec) latency


System Goals

  • Simple programming interface

  • Automatic fault recovery (including state)

  • Automatic straggler recovery

  • Integration with batch & ad-hoc queries (want one API for all your data analysis)


Traditional Streaming Systems

  • “Record-at-a-time” processing model

    • Each node has mutable state

    • Event-driven API: for each record, update state and send out new records

[Diagram: input records are pushed among nodes 1–3, each holding mutable state]

Challenges with Traditional Systems

  • Fault tolerance

    • Either replicate the whole system (costly) or use upstream backup (slow to recover)

  • Stragglers (typically not handled)

  • Consistency (few guarantees across nodes)

  • Hard to unify with batch processing


Our Model: “Discretized Streams”

  • Run each streaming computation as a series of very small, deterministic batch jobs

    • E.g. a MapReduce every second to count tweets

  • Keep state in memory across jobs

    • New Spark operators allow “stateful” processing

  • Recover from faults/stragglers in same way as MapReduce (by rerunning tasks in parallel)
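The model above can be illustrated with a minimal Python sketch (a hypothetical simulation, not the Spark API): the stream is chopped into per-interval batches, each interval runs a small deterministic job, and state is carried across jobs as an immutable running total.

```python
# Hypothetical sketch: a stream discretized into per-interval batches,
# with state carried across the small deterministic batch jobs.
from typing import List

def run_batch(batch: List[str], prev_count: int) -> int:
    """One small, deterministic batch job: count this interval's records
    and fold the result into the state from the previous interval."""
    return prev_count + len(batch)

# Each inner list is the input that arrived during one interval.
intervals = [["t1", "t2"], ["t3"], ["t4", "t5", "t6"]]

state = 0
counts_per_interval = []
for batch in intervals:
    state = run_batch(batch, state)   # deterministic, hence replayable
    counts_per_interval.append(state)

print(counts_per_interval)  # running totals: [2, 3, 6]
```

Because each job is deterministic over immutable inputs, re-running it after a failure reproduces the same state, which is the basis of the recovery story below.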


Discretized Streams in Action

[Diagram: at each interval (t = 1, t = 2, …), the input for streams 1 and 2 arrives as an immutable dataset (stored reliably); a batch operation turns it into an immutable output/state dataset, stored in memory as a Spark RDD]


Example: View Count

  • Keep a running count of views to each webpage

    views = readStream("http:...", "1s")

    ones = views.map(ev => (ev.url, 1))

    counts = ones.runningReduce(_ + _)

[Diagram: at each interval (t = 1, t = 2, …), views is mapped to ones and reduced into counts; boxes denote datasets and their segments denote partitions]
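The semantics of this example can be simulated in a few lines of Python (a rough stand-in for the Scala snippet above, not the Spark API): each interval's events are mapped to (url, 1) pairs, and a running reduce folds them into a cumulative per-key count.

```python
# Hypothetical simulation of views.map(...).runningReduce(_ + _).
from collections import defaultdict

def running_reduce(batches):
    """Fold each interval's (url, 1) pairs into a cumulative per-key
    count, emitting one snapshot of the counts per interval."""
    counts = defaultdict(int)
    snapshots = []
    for events in batches:
        ones = [(url, 1) for url in events]  # views.map(ev => (ev.url, 1))
        for url, one in ones:
            counts[url] += one               # runningReduce(_ + _)
        snapshots.append(dict(counts))
    return snapshots

snaps = running_reduce([["a", "b"], ["a"], ["b", "b"]])
print(snaps[-1])  # cumulative counts after 3 intervals: {'a': 2, 'b': 3}
```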


Fault Recovery

  • Checkpoint state datasets periodically

  • If a node fails/straggles, build its data in parallel on other nodes using dependency graph

[Diagram: a lost partition of the output dataset is rebuilt by re-running map on the corresponding partition of the input dataset]

Fast recovery without the cost of full replication
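A minimal sketch of the recovery idea (hypothetical names, not Spark internals): each output partition's lineage is just its input partition plus the deterministic function that produced it, so a lost partition can be recomputed rather than restored from a replica.

```python
# Hypothetical sketch: recover a lost partition from its lineage
# (input partition + deterministic function) instead of a replica.
input_partitions = {0: [1, 2], 1: [3, 4], 2: [5, 6]}
fn = lambda xs: [x * 10 for x in xs]  # the recorded map function

output = {i: fn(p) for i, p in input_partitions.items()}
del output[1]  # simulate losing partition 1 when a node fails

# Recovery: recompute only the missing partitions; in a cluster these
# recomputations run as independent tasks, in parallel on other nodes.
missing = set(input_partitions) - set(output)
for i in missing:
    output[i] = fn(input_partitions[i])

assert output[1] == [30, 40]  # identical to the lost data
```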


How Fast Can It Go?

  • Currently handles 4 GB/s of data (42 million records/s) on 100 nodes at sub-second latency

  • Recovers from failures/stragglers within 1 sec


Outline

  • Introduction

  • Programming interface

  • Implementation

  • Early results

  • Future development


D-Streams

  • A discretized stream is a sequence of immutable, partitioned datasets

    • Specifically, each dataset is an RDD (resilient distributed dataset), the storage abstraction in Spark

    • Each RDD remembers how it was created, and can recover if any part of the data is lost


D-Streams

  • D-Streams can be created…

    • either from live streaming data

    • or by transforming other D-streams

  • Programming with D-Streams is very similar to programming with RDDs in Spark


D-Stream Operators

  • Transformations

    • Build new streams from existing streams

    • Include existing Spark operators, which act on each interval in isolation, plus new “stateful” operators

  • Output operators

    • Send data to the outside world (save results to external storage, print to screen, etc.)


Example 1

Count the words received every second

words = readStream("http://...", Seconds(1))

counts = words.count()

[Diagram: for each interval (time = 0–1, 1–2, 2–3), the count transformation turns an RDD of the words D-stream into an RDD of the counts D-stream]


Demo

  • Setup

    • 10 EC2 m1.xlarge instances

    • Each instance receives a stream of sentences at 1 MB/s, for a total of 10 MB/s

  • Spark Streaming receives the sentences and processes them


Example 2

Count frequency of words received every second

words = readStream("http://...", Seconds(1))

ones = words.map(w => (w, 1))

freqs = ones.reduceByKey(_ + _)

(w => (w, 1) is a Scala function literal.)

[Diagram: for each interval (time = 0–1, 1–2, 2–3), words is mapped to ones and reduced into freqs]


Demo


Example 3

Count frequency of words received in last minute

ones = words.map(w => (w, 1))

freqs = ones.reduceByKey(_ + _)

freqs_60s = freqs.window(Seconds(60), Seconds(1))
                 .reduceByKey(_ + _)

(window is the sliding window operator; Seconds(60) is the window length and Seconds(1) is how far the window moves each step.)

[Diagram: at each interval, map and reduce produce freqs from words; a window over the last 60 intervals, followed by a second reduce, produces freqs_60s]


Simpler running reduce

freqs = ones.reduceByKey(_ + _)

freqs_60s = freqs.window(Seconds(60), Seconds(1))
                 .reduceByKey(_ + _)

freqs = ones.reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(1))
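The windowed reduce can be simulated in Python (a hypothetical stand-in for the Scala API; Counter plays the role of a per-interval (word, count) map, and the window length is in intervals rather than seconds):

```python
# Hypothetical simulation of reduceByKeyAndWindow(_ + _, window, slide):
# for each interval, re-reduce the per-interval counts that fall inside
# the last `window_len` intervals.
from collections import Counter

def reduce_by_key_and_window(per_interval_freqs, window_len):
    out = []
    for t in range(len(per_interval_freqs)):
        window = per_interval_freqs[max(0, t - window_len + 1): t + 1]
        out.append(sum(window, Counter()))  # merge counts key by key
    return out

freqs = [Counter(a=1), Counter(a=2, b=1), Counter(b=3)]
windowed = reduce_by_key_and_window(freqs, window_len=2)
print(windowed[2])  # counts over intervals t = 1..2: a -> 2, b -> 4
```

Note that this naive version re-reduces the whole window every interval, which motivates the incremental operators below.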


Demo


“Incremental” window operators

[Diagram: with a plain aggregation function (+), each freqs_60s result re-reduces all intervals currently in the window; with an invertible aggregation function, the previous window result is updated incrementally by adding the entering interval and subtracting the leaving one]


Smarter running reduce

freqs = ones.reduceByKey(_ + _)

freqs_60s = freqs.window(Seconds(60), Seconds(1))
                 .reduceByKey(_ + _)

freqs = ones.reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(1))

freqs = ones.reduceByKeyAndWindow(
    _ + _, _ - _, Seconds(60), Seconds(1))
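The invertible-function optimization can be sketched in Python (hypothetical helper names; Counter stands in for a per-interval (word, count) map, and Counter's `-` drops non-positive entries, which is safe here because counts stay positive): instead of re-reducing the whole window, each step adds the entering interval and subtracts the leaving one.

```python
# Hypothetical sketch of reduceByKeyAndWindow(_ + _, _ - _, window, slide):
# maintain the window aggregate incrementally using the inverse function.
from collections import Counter

def incremental_window(per_interval, window_len, add, sub):
    state = Counter()
    out = []
    for t, new in enumerate(per_interval):
        state = add(state, new)  # _ + _ applied to the entering interval
        if t - window_len >= 0:
            # _ - _ applied to the interval that just left the window
            state = sub(state, per_interval[t - window_len])
        out.append(Counter(state))
    return out

freqs = [Counter(a=1), Counter(a=2, b=1), Counter(b=3)]
result = incremental_window(
    freqs, window_len=2,
    add=lambda s, x: s + x,
    sub=lambda s, x: s - x)
print(result[2])  # counts over intervals t = 1..2: a -> 2, b -> 4
```

Each step now does O(1) reduces regardless of window length, yet produces the same answers as recomputing the window from scratch.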


Output Operators

  • save: write results to any Hadoop-compatible storage system (e.g. HDFS, HBase)

  • foreachRDD: run a Spark function on each RDD

freqs.save("hdfs://...")

words.foreachRDD(wordsRDD => {
  // any Spark/Scala processing, maybe save to a database
})


Live + Batch + Interactive

  • Combining D-streams with historical datasets

    pageViews.join(historicCounts).map(...)

  • Interactive queries on stream state from the Spark interpreter

    pageViews.slice("21:00", "21:05").topK(10)
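Because stream state is just a series of datasets, an ad-hoc query can slice out a time range and run a batch operator over it. A rough Python simulation (the per-interval snapshots, string timestamps, and `slice_top_k` helper are all assumptions for illustration, not the Spark API):

```python
# Hypothetical sketch of pageViews.slice("21:00", "21:05").topK(10).
from collections import Counter

# Per-interval page-view counts, keyed by interval start time.
snapshots = {
    "21:00": Counter({"/home": 5, "/about": 1}),
    "21:01": Counter({"/home": 2, "/blog": 4}),
    "21:06": Counter({"/home": 9}),  # outside the queried range
}

def slice_top_k(snapshots, start, end, k):
    """Merge the interval datasets in [start, end], then take the k
    most-viewed pages (HH:MM strings compare correctly as text)."""
    merged = Counter()
    for t, counts in snapshots.items():
        if start <= t <= end:
            merged += counts
    return merged.most_common(k)

print(slice_top_k(snapshots, "21:00", "21:05", k=2))
```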


Outline

  • Introduction

  • Programming interface

  • Implementation

  • Early results

  • Future development


System Architecture

Built on an optimized version of Spark

[Diagram: clients send input to workers; each worker runs an input receiver, task execution, and a block manager; the master holds the D-stream lineage, task scheduler, and block tracker; input and checkpoint RDDs are replicated across workers]


Implementation

Optimizations on current Spark:

  • New block store

    • APIs: Put(key, value, storage level), Get(key)

  • Optimized scheduling for <100ms tasks

    • Bypass Mesos cluster scheduler (tens of ms)

  • Fast NIO communication library

  • Pipelining of jobs from different time intervals


Evaluation

  • Ran on up to 100 “m1.xlarge” machines on EC2

    • 4 cores, 15 GB RAM each

  • Three applications:

    • Grep: count lines matching a pattern

    • Sliding word count

    • Sliding top K words


Scalability

[Chart: maximum throughput attainable with 1s or 2s latency, using 100-byte records (100K–500K records/s/node)]


Performance vs Storm and S4

  • Storm limited to 10,000 records/s/node

  • Also tried S4: 7000 records/s/node

  • Commercial systems report ~100K records/s in aggregate


Fault Recovery

  • Recovers from failures within 1 second

Sliding WordCount on 10 nodes with 30s checkpoint interval


Fault Recovery

[Charts: recovery time under failures and under stragglers]


Interactive Ad-Hoc Queries


Outline

  • Introduction

  • Programming interface

  • Implementation

  • Early results

  • Future development


Future Development

  • An alpha of discretized streams will go into Spark by the end of the summer

  • Engine improvements from Spark Streaming project are already there (“dev” branch)

  • Together, these make Spark a powerful platform for both batch and near-real-time analytics


Future Development

  • Other things we’re working on/thinking of:

    • Easier deployment options (standalone & YARN)

    • Hadoop-based deployment (run as Hadoop job)?

    • Run Hadoop mappers/reducers on Spark?

    • Java API?

  • Need your feedback to prioritize these!


More Details

  • You can find more about Spark Streaming in our paper: http://tinyurl.com/dstreams


Related Work

  • Bulk incremental processing (CBP, Comet)

    • Periodic (~5 min) batch jobs on Hadoop/Dryad

    • On-disk, replicated FS for storage instead of RDDs

  • Hadoop Online

    • Does not recover stateful ops or allow multi-stage jobs

  • Streaming databases

    • Record-at-a-time processing, generally replication for FT

  • Approximate query processing, load shedding

    • Do not support the loss of arbitrary nodes

    • Different math because drop rate is known exactly

  • Parallel recovery (MapReduce, GFS, RAMCloud, etc)


Timing Considerations

  • D-streams group input into intervals based on when records arrive at the system

  • For apps that need to group by an “external” time and tolerate network delays, support:

    • Slack time: delay starting a batch for a short fixed time to give records a chance to arrive

    • Application-level correction: e.g. give a result for time t at time t+1, then use later records to update incrementally at time t+5
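The slack-time idea can be sketched in Python (hypothetical names and record layout, for illustration only): a batch for interval t stays open until t + interval + slack, so records delayed in the network can still join the interval their external timestamp belongs to.

```python
# Hypothetical sketch of "slack time" batching by external timestamps.
def assign_batches(records, interval, slack):
    """records: (arrival_time, event_time) pairs. A record joins the
    batch for its event_time's interval if it arrives before that
    batch closes at interval_start + interval + slack."""
    batches, late = {}, []
    for arrival, event in records:
        start = (event // interval) * interval
        if arrival <= start + interval + slack:
            batches.setdefault(start, []).append(event)
        else:
            late.append(event)  # left for application-level correction

    return batches, late

records = [(0.2, 0.1),   # on time
           (1.3, 0.9),   # delayed, but arrives within the slack
           (2.5, 0.8)]   # too late: handled by a later correction
batches, late = assign_batches(records, interval=1.0, slack=0.5)
print(batches, late)  # {0.0: [0.1, 0.9]} [0.8]
```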

