
What’s New in Spark 0.6 and Shark 0.2


Presentation Transcript


  1. What’s New in Spark 0.6 and Shark 0.2 • November 5, 2012 • www.spark-project.org • UC Berkeley

  2. Agenda • Intro & Spark 0.6 tour (Matei Zaharia) • Standalone deploy mode (Denny Britz) • Shark 0.2 (Reynold Xin) • Q & A

  3. What Are Spark & Shark? • Spark: fast cluster computing engine based on general operators & in-memory computing • Shark: Hive-compatible data warehouse system built on Spark • Both are open source projects from the UC Berkeley AMP Lab
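  To make “general operators & in-memory computing” concrete, here is a minimal word-count sketch in Scala against the Spark 0.6-era API (package spark, not org.apache.spark); the local master URL, app name, and input path are illustrative assumptions, not from the slides:

      import spark.SparkContext
      import spark.SparkContext._  // implicit conversions that enable pair-RDD operations

      // "local" runs Spark in-process; "WordCount" and "input.txt" are placeholders.
      val sc = new SparkContext("local", "WordCount")

      val lines = sc.textFile("input.txt").cache()  // keep the dataset in memory for reuse

      val counts = lines.flatMap(_.split(" "))      // general operators compose freely
                        .map(word => (word, 1))
                        .reduceByKey(_ + _)

      counts.collect().foreach(println)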

  4. What is the AMP Lab? • 60-person lab focusing on big data • Funded by NSF, DARPA, 18 companies • Goal: build an open-source, next-generation analytics stack [Stack diagram: Shark, Streaming, Learning, Graph, and other applications on top of Spark, alongside Hadoop and MPI, all running over Mesos]

  5. Some Exciting News • Recently, three full-time developers joined AMP to work on these projects • We also encourage outside contributions! • This release: Shark server (Yahoo!), improved accumulators (Quantifind)
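  For readers new to accumulators (the feature Quantifind’s contribution improved), here is a minimal sketch of the basic accumulator API in Scala, reusing the SparkContext sc from the sketch above; the log file name is an illustrative assumption:

      // Accumulators are shared variables that tasks can only add to and
      // the driver can read; a typical use is counting events across a job.
      val badRecords = sc.accumulator(0)  // starts at zero on the driver

      sc.textFile("events.log").foreach { line =>
        if (line.contains("ERROR")) badRecords += 1  // per-task updates are merged on the driver
      }

      println("bad records: " + badRecords.value)  // .value is only readable on the driver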

  6. Spark 0.6 Release • Biggest release so far in terms of features • Biggest in terms of developers (18 total, 12 new) • Focus areas: ease-of-use and performance

  7. Ease-of-Use • Spark already had good traction despite two fairly researchy aspects • Scala language • Requirement to run on Mesos • A big goal was to improve these: • Java API (and upcoming API in Python) • Simpler deployment (standalone mode, YARN)

  8. Java API • Scala: lines.filter(_.contains("error")).count() • The Java equivalent:

      JavaRDD<String> lines = sc.textFile(...);
      lines.filter(new Function<String, Boolean>() {
        public Boolean call(String s) { return s.contains("error"); }
      }).count();

  9. Java API Features • Supports all existing Spark features • RDDs, accumulators, broadcast variables • Retains type safety through specific classes for RDDs of special types • E.g. JavaPairRDD<K, V> for key-value pairs

  10. Using Key-Value Pairs •

      import scala.Tuple2;

      JavaRDD<String> words = ...;

      JavaPairRDD<String, Integer> ones = words.map(
        new PairFunction<String, String, Integer>() {
          public Tuple2<String, Integer> call(String s) {
            return new Tuple2<String, Integer>(s, 1);
          }
        });

      // Can now call ones.reduceByKey(), groupByKey(), etc.

  More info: spark-project.org/docs/0.6.0/

  11. Coming Next: PySpark •

      lines = sc.textFile(sys.argv[1])
      counts = lines.flatMap(lambda x: x.split(' ')) \
                    .map(lambda x: (x, 1)) \
                    .reduceByKey(lambda x, y: x + y)

  12. Simpler Deployment • Refactored Spark’s scheduler to allow running on different cluster managers • Denny will talk about the standalone mode…

  13. Other Ease-of-Use Work • Documentation • Big effort to improve Spark’s help and Scaladoc • Debugging hints (pointers to user code in logs) • Maven Central artifacts • More info: spark-project.org/documentation.html
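  Since the slide only mentions that artifacts are on Maven Central, here is a hedged sbt snippet showing how a project would declare the dependency; the exact coordinates (organization org.spark-project, Scala 2.9.2 build) are my reading of the 0.6.0 release and should be verified against the linked documentation:

      // build.sbt — assumed coordinates for the Spark 0.6.0 artifacts on Maven Central
      scalaVersion := "2.9.2"  // Spark 0.6 was built against Scala 2.9.x

      libraryDependencies += "org.spark-project" %% "spark-core" % "0.6.0"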

  14. Performance • New ConnectionManager and BlockManager • Replace simple HTTP shuffle with faster, async NIO • Faster control-plane (task scheduling & launch) • Per-RDD control of storage level

  15. Some Graphs [Performance graphs: Wikipedia Search Demo; Large User App (2000 maps / 1000 reduces)]

  16. Per-RDD Storage Level •

      import spark.storage.StorageLevel

      val data = file.map(...)

      // Keep in memory, recompute when out of space
      // (default behavior with cache())
      data.persist(StorageLevel.MEMORY_ONLY)

      // Drop to disk instead of recomputing
      data.persist(StorageLevel.MEMORY_AND_DISK)

      // Serialize in-memory data
      data.persist(StorageLevel.MEMORY_ONLY_SER)

  17. Compatibility • We’ve always strived to stay source-compatible! • Only change in this release is in configuration: spark.cache.class replaced with per-RDD levels
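  A sketch of what that configuration change looks like in practice; the old property value below (spark.DiskSpillingCache) is one of the pre-0.6 cache classes and is shown as an illustrative assumption:

      // Before 0.6: one global cache class for every RDD, set as a system property
      System.setProperty("spark.cache.class", "spark.DiskSpillingCache")
      rdd.cache()

      // From 0.6 on: pick a storage level per RDD instead
      import spark.storage.StorageLevel
      rdd.persist(StorageLevel.MEMORY_AND_DISK)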

  18. Shark 0.2 • Hive compatibility improvements • Thrift server mode • Performance improvements • Simpler deployment (comes with Spark 0.6)

  19. Hive Compatibility • Hive 0.9 support • Full UDF/UDAF support • ADD FILE support for running scripts • User-supplied jars using ADD JAR

  20. Thrift Server • Contributed by Yahoo!, compatible with the Hive Thrift server • Enables multiple clients to share cached tables • BI tool integration (e.g. Tableau)

  21. Performance [Performance graphs: Join (1B join 150M); Group By (1B items, 150M distinct)]

  22. Shark 0.3 Preview • In-memory columnar compression (dictionary encoding, run-length encoding, etc.) • Map pruning • JVM bytecode generation for expression evaluation • Persist cached-table metadata across sessions
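  To illustrate what run-length encoding buys for an in-memory column (a generic Scala sketch, not Shark’s actual implementation): consecutive duplicates collapse into (value, runLength) pairs, which is very effective on low-cardinality columns:

      // Generic run-length encoding: consecutive duplicates in a column
      // become a single (value, runLength) pair.
      def runLengthEncode[T](column: Seq[T]): Seq[(T, Int)] =
        column.foldLeft(List.empty[(T, Int)]) {
          case ((v, n) :: rest, x) if v == x => (v, n + 1) :: rest
          case (acc, x)                      => (x, 1) :: acc
        }.reverse

      val col = Seq("US", "US", "US", "FR", "FR", "US")
      println(runLengthEncode(col))  // List((US,3), (FR,2), (US,1))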

  23. Spark 0.7+ • Spark Streaming • PySpark: Python API for Spark • Memory monitoring dashboard
