
Presentation Transcript


  1. Spark Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing Presentation by Antonio Lupher [Thanks to Matei for diagrams & several of the nicer slides!] October 26, 2011

  2. The world today… • Most current cluster programming models are based on acyclic data flow from stable storage to stable storage [Diagram: Input → Map tasks → Reduce tasks → Output]

  3. The world today… • Most current cluster programming models are based on acyclic data flow from stable storage to stable storage • Benefits: decide at runtime where to run tasks and automatically recover from failures

  4. … but • Inefficient for applications that repeatedly reuse working set of data: • Iterative machine learning, graph algorithms • PageRank, k-means, logistic regression, etc. • Interactive data mining tools (R, Excel, Python) • Multiple queries on the same subset of data • Reload data from disk on each query/stage of execution

  5. Goal: Keep Working Set in RAM [Diagram: one-time processing loads the Input into distributed memory; iterations 1, 2, 3, … then reuse the in-memory working set instead of reloading the Input each time]

  6. Requirements • Distributed memory abstraction must be • Fault-tolerant • Efficient in large commodity clusters • How to provide fault tolerance efficiently?

  7. Requirements • Existing distributed storage abstractions offer an interface based on fine-grained updates • Reads and writes to cells in a table • E.g. key-value stores, databases, distributed memory • Have to replicate data or logs across nodes for fault tolerance • Expensive for data-intensive apps, large datasets

  8. Resilient Distributed Datasets (RDDs) • Immutable, partitioned collection of records • Interface based on coarse-grainedtransformations (e.g. map, groupBy, join) • Efficient fault recovery using lineage • Log one operation to apply to all elements • Re-compute lost partitions of dataset on failure • No cost if nothing fails
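
  For concreteness, a minimal sketch (assuming the current Spark API, run in local mode; the log path is hypothetical) of how coarse-grained transformations build lineage that can be replayed after a failure:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("rdd-sketch").setMaster("local[*]"))
    val lines   = sc.textFile("hdfs://namenode/logs/events.log")  // hypothetical path; RDD backed by stable storage
    val records = lines.map(_.split("\t"))                        // coarse-grained: applied to every element
    val grouped = records.groupBy(fields => fields(0))            // lineage so far: textFile -> map -> groupBy
    // If a partition of `grouped` is lost, Spark recomputes just that partition
    // from its parents using the recorded lineage; no eager replication is needed.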

  9. RDDs, cont’d • Control persistence (in RAM vs. on disk) • Tunable via persistence priority: user specifies which RDDs should spill to disk first • Control partitioning of data • Hash data to place it in convenient locations for subsequent operations • Fine-grained reads
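
  A hedged sketch of these two knobs, reusing `sc` from above: the prototype's persistence priority is approximated here by today's storage levels, and the paths, key layout, and partition count are invented for illustration.

    import org.apache.spark.HashPartitioner
    import org.apache.spark.storage.StorageLevel

    // Hypothetical (key, line) pair RDDs keyed by the first tab-separated column.
    val pages = sc.textFile("hdfs://namenode/pages").map(l => (l.split("\t")(0), l))
    val links = sc.textFile("hdfs://namenode/links").map(l => (l.split("\t")(0), l))

    // Persistence control: keep one dataset strictly in RAM, let the other spill to disk.
    pages.persist(StorageLevel.MEMORY_ONLY)
    links.persist(StorageLevel.MEMORY_AND_DISK)

    // Partitioning control: hash both RDDs the same way so a later join finds
    // co-partitioned data and avoids a full shuffle.
    val part   = new HashPartitioner(100)
    val joined = pages.partitionBy(part).join(links.partitionBy(part))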

  10. Implementation • Spark runs on Mesos => shares resources with Hadoop & other apps • Can read from any Hadoop input source (HDFS, S3, …) • Language-integrated API in Scala • ~10,000 lines of code, no changes to Scala • Can be used interactively from the interpreter [Diagram: Spark, Hadoop, MPI, … running on Mesos across cluster nodes]

  11. Spark Operations • Transformations • Create new RDD by transforming data in stable storage using data flow operators • Map, filter, groupBy, etc. • Lazy: don’t need to be materialized at all times • Lineage information is enough to compute partitions from data in storage when needed
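
  A small sketch of that laziness (dataset name and path are illustrative; `sc` is the SparkContext as above): each line below only records a new RDD and its lineage, and no cluster work happens yet.

    val words  = sc.textFile("hdfs://namenode/corpus").flatMap(_.split(" "))
    val longer = words.filter(_.length > 5)
    val pairs  = longer.map(w => (w, 1))
    // `pairs` is still just a description (lineage) at this point; its partitions
    // are materialized only when an action asks for them.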

  12. Spark Operations • Actions • Return a value to application or export to storage • count, collect, save, etc. • Require a value to be computed from the elements in the RDD => execution plan
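
  Continuing the sketch above, actions are what force evaluation; each call below makes the scheduler build an execution plan from the lineage and run it (the output path is illustrative).

    val n      = pairs.count()                           // returns a number to the driver
    val sample = pairs.take(10)                          // ships a few elements back
    pairs.reduceByKey(_ + _)                             // still lazy...
         .saveAsTextFile("hdfs://namenode/word-counts")  // ...until this export runs the job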

  13. Spark Operations

  14. RDD Representation • Common interface: • Set of partitions • Preferred locations for each partition • List of parent RDDs • Function to compute a partition given parents • Optional partitioning info (order, etc.) • Capture a wide range of transformations • Scheduler doesn’t need to know what each op does • Users can easily add new transformations • Most transformations can be implemented in ≤ 20 lines
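
  A simplified Scala paraphrase of that common interface (a sketch for exposition, not Spark's actual RDD class):

    trait Partition { def index: Int }

    trait SimpleRDD[T] {
      def partitions: Seq[Partition]                      // how the dataset is split
      def preferredLocations(p: Partition): Seq[String]   // e.g. hosts holding the HDFS block
      def dependencies: Seq[SimpleRDD[_]]                 // parent RDDs, i.e. the lineage
      def compute(p: Partition): Iterator[T]              // derive one partition from the parents
      def partitioner: Option[AnyRef] = None              // optional partitioning info
    }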

  15. RDD Representation • Lineage & Dependencies • Narrow dependencies • Each partition of parent RDD is used by at most one partition of child RDD • e.g. map, filter • Allow pipelined execution

  16. RDD Representation • Lineage & Dependencies • Wide dependencies • Multiple child partitions may depend on each parent partition • e.g. join • Require data from all parent partitions & a shuffle
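
  A short sketch of the two dependency shapes (the data here is synthetic): filter and map are narrow and can run back-to-back in one task, while groupByKey, like join, is wide and forces a shuffle.

    val nums  = sc.parallelize(1 to 1000000, numSlices = 8)

    // Narrow: each output partition reads exactly one input partition.
    val evens = nums.filter(_ % 2 == 0).map(_ * 10)

    // Wide: every output partition may need records from every input partition.
    val byMod = evens.map(n => (n % 100, n)).groupByKey()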

  17. Scheduler • Task DAG (like Dryad) • Pipelines functions within a stage • Reuses previously computed data • Partitioning-aware to avoid shuffles [Diagram: example DAG of RDDs A–G; Stage 1 computes B from A via groupBy, Stage 2 computes F from C, D, and E via map and union, Stage 3 joins B and F into G; shaded boxes mark previously computed partitions]
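
  To make the stage cutting concrete, a hedged sketch (paths and key choices invented): the narrow maps are pipelined inside their stages, while the groupBy and the join each end a stage with a shuffle.

    val a = sc.textFile("hdfs://namenode/a").map(line => (line.take(1), 1))
    val b = a.groupByKey()                                    // shuffle: end of stage 1
    val c = sc.textFile("hdfs://namenode/c").map(line => (line.take(1), line))
    val g = c.join(b)                                         // shuffle: end of stage 2
    val out = g.map { case (k, (v, counts)) => k + "\t" + counts.size }  // pipelined in stage 3
    out.saveAsTextFile("hdfs://namenode/out")                 // action: runs the whole DAG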

  18. RDD Recovery • What happens if a task fails? • Exploit coarse-grained operations • Deterministic, affect all elements of collection • Just re-run the task on another node if parents available • Easy to regenerate RDDs given parent RDDs + lineage • Avoids checkpointing and replication • but you might still want to (and can) checkpoint: • long lineage => expensive to recompute • intermediate results may have disappeared, need to regenerate • Use REPLICATE flag to persist
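
  The prototype's REPLICATE flag does not exist under that name in today's API; as a rough equivalent (an assumption, not the slide's exact mechanism), a replicated storage level plus checkpointing can cap recomputation cost when lineage gets long. Paths and the update rule below are purely illustrative.

    import org.apache.spark.storage.StorageLevel

    sc.setCheckpointDir("hdfs://namenode/checkpoints")
    var ranks = sc.textFile("hdfs://namenode/init").map(l => (l, 1.0))
    for (i <- 1 to 30) {
      ranks = ranks.mapValues(_ * 0.85 + 0.15)         // illustrative iterative update
      if (i % 10 == 0) {
        ranks.persist(StorageLevel.MEMORY_ONLY_2)      // keep two in-memory replicas
        ranks.checkpoint()                             // write to stable storage, truncating lineage
      }
    }
    ranks.count()                                      // action that materializes the loop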

  19. Example: Log Mining • Load error messages from a log into memory, then interactively search for various patterns
  lines = spark.textFile("hdfs://...")            // base RDD
  errors = lines.filter(_.startsWith("ERROR"))    // transformed RDD
  messages = errors.map(_.split('\t')(2))
  messages.persist()
  messages.filter(_.contains("foo")).count        // action
  messages.filter(_.contains("bar")).count
  [Diagram: the driver ships tasks to workers; each worker reads its block (Block 1–3), caches its Msgs. partition in RAM, and returns results]
  • Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
  • Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)

  20. Fault Recovery Results [Chart: fault recovery results for k-means]

  21. Performance • Outperforms Hadoop by up to 20x • Avoiding I/O and Java object [de]serialization costs • Some apps see 40x speedups (Conviva) • Query a 1 TB dataset with 5-7 sec latencies

  22. PageRank Results

  23. Behavior with Not Enough RAM

  24. Example: Logistic Regression • Goal: find best line separating two sets of points [Diagram: two classes of points (+ and –); a random initial line is refined over iterations toward the target separating line]

  25. Logistic Regression Code
  val points = spark.textFile(...).map(parsePoint).persist()
  var w = Vector.random(D)
  for (i <- 1 to ITERATIONS) {
    val gradient = points.map(p =>
      (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
    ).reduce((a, b) => a + b)
    w -= gradient
  }
  println("Final w: " + w)
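
  The slide omits the pieces its loop depends on: a Vector type with dot/+/-/scaling, scalar-on-the-left multiplication, parsePoint, and values for D and ITERATIONS. Below is a self-contained sketch of those assumed helpers, written in shell/script style to match the slide and meant to be pasted before the loop; all names and values are illustrative, and this local Vector deliberately shadows Scala's built-in one, much as the early spark.util.Vector did.

    import scala.math.exp
    import scala.util.Random

    case class Vector(v: Array[Double]) {
      def dot(o: Vector): Double = v.zip(o.v).map { case (a, b) => a * b }.sum
      def +(o: Vector): Vector   = Vector(v.zip(o.v).map { case (a, b) => a + b })
      def -(o: Vector): Vector   = Vector(v.zip(o.v).map { case (a, b) => a - b })
      def *(s: Double): Vector   = Vector(v.map(_ * s))
      override def toString      = v.mkString("(", ", ", ")")
    }
    object Vector { def random(d: Int): Vector = Vector(Array.fill(d)(Random.nextDouble())) }
    // Lets the slide write `scalar * p.x` with the scalar on the left.
    implicit class Scale(s: Double) { def *(vec: Vector): Vector = vec * s }

    case class Point(x: Vector, y: Double)    // y is the label, +1 or -1
    def parsePoint(line: String): Point = {   // expects "label f1 f2 ..." per line
      val cols = line.split(" ").map(_.toDouble)
      Point(Vector(cols.tail), cols.head)
    }

    val D = 10            // feature dimension (illustrative)
    val ITERATIONS = 20   // illustrative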

  26. Logistic Regression Performance • Hadoop: 127 s / iteration • Spark: 174 s for the first iteration, 6 s for further iterations

  27. More Applications • EM alg. for traffic prediction (Mobile Millennium) • In-memory OLAP & anomaly detection (Conviva) • Twitter spam classification (Monarch) • Pregel on Spark (Bagel) • Alternating least squares matrix factorization

  28. Mobile Millennium Estimate traffic using GPS on taxis

  29. Conviva GeoReport • Aggregations on many keys w/ same WHERE clause • 40× gain comes from: • Not re-reading unused columns or filtered records • Avoiding repeated decompression • In-memory storage of deserialized objects [Chart: report time in hours]

  30. SPARK • Use transformations on RDDs instead of Hadoop jobs • Cache RDDs for similar future queries • Many queries re-use subsets of data • Drill-down, etc. • Scala makes integration with Hive (Java) easy… or easier • (Cliff, Antonio, Reynold)

  31. Comparisons • DryadLINQ, FlumeJava • Similar language-integrated “distributed collection” API, but cannot reuse datasets efficiently across queries • Piccolo, DSM, key-value stores (e.g. RAMCloud) • Fine-grained writes but more complex fault recovery • Iterative MapReduce (e.g. Twister, HaLoop), Pregel • Implicit data sharing for a fixed computation pattern • Relational databases • Lineage/provenance, logical logging, materialized views • Caching systems (e.g. Nectar) • Store data in files, no explicit control over what is cached

  32. Comparisons: RDDs vs DSM

  33. Summary • Simple & efficient model, widely applicable • Efficiently express models that previously required a new framework, with the same optimizations • Achieve fault tolerance efficiently by providing coarse-grained operations and tracking lineage • Exploit persistent in-memory storage + smart partitioning for speed

  34. Thoughts: Tradeoffs • No fine-grain modifications of elements in collection • Not the right tool for all applications • E.g. storage system for web site, web crawler, anything where you need incremental/fine-grain writes • Scala-based implementation • Probably won’t see Microsoft use it anytime soon • But concept of RDDs is not language-specific (abstraction doesn’t even require functional language)

  35. Thoughts: Influence • Factors that could promote adoption • Inherent advantages • in-memory = fast, RDDs = fault-tolerant • Easy to use & extend • Already supports MapReduce, Pregel (Bagel) • Used widely at Berkeley, more projects coming soon • Used at Conviva, Twitter • Scala means easy integration with existing Java applications • (subjective opinion) More pleasant to use than Java

  36. Verdict Should spark enthusiasm in cloud crowds
