
Spark 1.1 and Beyond


Presentation Transcript


  1. Spark 1.1 and Beyond Patrick Wendell

  2. About Me: Work at Databricks, leading the Spark team. Spark 1.1 release manager. Committer on Spark since the AMPLab days.

  3. This Talk: Spark 1.1 (and a bit about 1.2). A few notes on performance. Q&A with myself, Tathagata Das, and Josh Rosen.

  4. A Bit about Spark… [Stack diagram] On top of the core Spark RDD API sit Spark SQL (RDD-based tables), Spark Streaming (real-time; DStreams are streams of RDDs), GraphX (graphs, alpha; RDD-based graphs), and MLlib (machine learning; RDD-based matrices). Storage: HDFS, S3, Cassandra. Cluster managers: YARN, Mesos, Standalone.

  5. Spark Release Process: ~3-month release cycle, time-scoped: 2 months of feature development, then 1 month of QA. Older branches are maintained with bug fixes. Upcoming release: 1.1.0 (previous was 1.0.2).

  6. [Branch diagram] master has more features; release branches are more stable: branch-1.0 (v1.0.0, v1.0.1) and branch-1.1 (v1.1.0). For any POC or non-production cluster, we always recommend running off of the head of a release branch.

  7. Spark 1.1: 1,297 patches, 200+ contributors (still counting), dozens of organizations. To get updates, join our dev list: e-mail dev-subscribe@spark.apache.org.

  8. Roadmap: Spark 1.1 and 1.2 have similar themes. Spark core: usability, stability, and performance. MLlib/SQL/Streaming: expanded feature set and performance. Around 40% of mailing-list traffic is about these libraries.

  9. Spark Core in 1.1. Performance "out of the box": sort-based shuffle, efficient broadcasts, disk spilling in Python, YARN usability improvements. Usability: task progress and user-defined counters, and better UI behavior for failing or large jobs.
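A minimal sketch of a user-defined counter via Spark's accumulator API; the name argument is what surfaces the counter in the web UI, and the input path is a stand-in:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("CounterDemo"))
    // Named accumulators appear in the web UI alongside task progress.
    val blankLines = sc.accumulator(0, "blankLines")
    sc.textFile("data.txt").foreach { line =>   // hypothetical input path
      if (line.trim.isEmpty) blankLines += 1
    }
    println("Blank lines seen: " + blankLines.value)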

  10. Spark SQL in 1.1: 1.0 was the first "preview" release; 1.1 provides an upgrade path for Shark. Spark SQL replaced Shark in our benchmarks with 2-3X performance gains, and optimizations can be implemented with 10-100X less effort than in Hive.
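As an illustration of that upgrade path, a minimal sketch using HiveContext, which reuses Hive's metastore so existing Shark/Hive tables keep working (the table and query are hypothetical):

    import org.apache.spark.sql.hive.HiveContext

    // HiveContext speaks HiveQL and reads Hive's existing metastore.
    val hiveContext = new HiveContext(sc)
    hiveContext.sql("SELECT key, value FROM src LIMIT 10").collect().foreach(println)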

  11. Turning an RDD into a Relation
      // Define the schema using a case class.
      case class Person(name: String, age: Int)
      // Create an RDD of Person objects, register it as a table.
      val people = sc.textFile("examples/src/main/resources/people.txt")
        .map(_.split(","))
        .map(p => Person(p(0), p(1).trim.toInt))
      people.registerAsTable("people")

  12. Querying using SQL
      // SQL statements can be run directly on RDDs.
      val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
      // The results of SQL queries are SchemaRDDs and support normal RDD operations.
      val nameList = teenagers.map(t => "Name: " + t(0)).collect()
      // Alternative: language-integrated queries (a la LINQ).
      val teenagers2 = people.where('age >= 10).where('age <= 19).select('name)

  13. Spark SQL in 1.1: JDBC server for multi-tenant access and BI tools. Native JSON support. Public types API: "make your own" SchemaRDDs. Improved operator performance. Native Parquet support and optimizations.
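A minimal sketch of the native JSON support, assuming a SQLContext and a hypothetical people.json file with one JSON object per line:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    // jsonFile infers a schema from the records and returns a SchemaRDD.
    val people = sqlContext.jsonFile("people.json")   // hypothetical path
    people.printSchema()
    people.registerAsTable("jsonPeople")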

  14. Spark Streaming: Stability improvements across the board. Amazon Kinesis support. Rate limiting for streams. Support for polling Flume streams. Streaming + ML: streaming linear regression.
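A minimal sketch of stream rate limiting, controlled in 1.1 by the spark.streaming.receiver.maxRate setting (the socket source host and port are hypothetical):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Cap each receiver at 10,000 records per second.
    val conf = new SparkConf()
      .setAppName("RateLimitedStream")
      .set("spark.streaming.receiver.maxRate", "10000")
    val ssc = new StreamingContext(conf, Seconds(1))
    val lines = ssc.socketTextStream("localhost", 9999)   // hypothetical source
    lines.count().print()
    ssc.start()
    ssc.awaitTermination()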

  15. What’s new in MLlib v1.1 • Contributors: 40 (v1.0) -> 68 • Algorithms: SVD via Lanczos, multiclass support in decision tree, logistic regression with L-BFGS, nonnegative matrix factorization, streaming linear regression • Feature extraction and transformation: scaling, normalization, tf-idf, Word2Vec • Statistics: sampling (core), correlations, hypothesis testing, random data generation • Performance and scalability: major improvement to decision tree, tree aggregation • Python API: decision tree, statistics, linear methods
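A minimal sketch of one of the new feature transformers, Word2Vec, assuming a hypothetical corpus file of whitespace-tokenized sentences:

    import org.apache.spark.mllib.feature.Word2Vec

    // Each input record is a sequence of tokens (one sentence per line here).
    val sentences = sc.textFile("corpus.txt").map(_.split(" ").toSeq)   // hypothetical path
    val model = new Word2Vec().fit(sentences)
    // The five words whose learned vectors are closest to "spark".
    model.findSynonyms("spark", 5).foreach { case (word, score) => println(word + ": " + score) }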

  16. Performance (v1.0 vs. v1.1)

  17. Sort-based Shuffle. Old shuffle: each mapper opens a file for each reducer and writes to all of them simultaneously, so files = # mappers * # reducers. New shuffle: each mapper buffers reduce output in memory, spills to disk, then sort-merges the on-disk data.
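A minimal sketch of opting in (sort-based shuffle is not yet the default in 1.1; it is selected via spark.shuffle.manager):

    import org.apache.spark.{SparkConf, SparkContext}

    // Switch from the default hash-based shuffle to the new sort-based shuffle.
    val conf = new SparkConf()
      .setAppName("SortShuffleDemo")
      .set("spark.shuffle.manager", "SORT")
    val sc = new SparkContext(conf)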

  18. GroupBy Operator: Spark groupByKey != SQL groupBy.
      NO: people.map(p => (p.zipCode, p.getIncome)).groupByKey().map { case (zip, incomes) => (zip, incomes.sum) }
      YES: people.map(p => (p.zipCode, p.getIncome)).reduceByKey(_ + _)
      (groupByKey ships every raw record across the network before summing; reduceByKey combines values map-side first, so far less data is shuffled.)

  19. GroupBy Operator: Spark groupByKey != SQL groupBy.
      NO: people.map(p => (p.zipCode, p.getIncome)).groupByKey().map { case (zip, incomes) => (zip, incomes.sum) }
      YES: people.groupBy('zipCode).select(sum('income))

  20. GroupBy Operator: Spark groupByKey != SQL groupBy.
      NO: people.map(p => (p.zipCode, p.getIncome)).groupByKey().map { case (zip, incomes) => (zip, incomes.sum) }
      YES: SELECT sum(income) FROM people GROUP BY zipCode;

  21. Other efforts: Ooyala Job Server, Hive on Spark, Pig on Spark. [Same stack diagram as slide 4: Spark SQL, Spark Streaming (real-time), GraphX (alpha), and MLlib on the Spark RDD API, over HDFS, S3, Cassandra and YARN, Mesos, Standalone.]

  22. Looking Ahead to 1.2+. [Core] Scala 2.11 support; debugging tools (task progress, visualization); Netty-based communication layer. [SQL] Portability across Hive versions; performance optimizations (TPC-DS and Parquet); planner integration with Cassandra and other sources.

  23. Looking Ahead to 1.2+. [Streaming] Python support; lower-level Kafka API with recoverability. [MLlib] Multi-model training; many new algorithms; faster internal linear solver.

  24. Q and A: Tathagata Das (Spark Streaming lead), Josh Rosen (PySpark and Spark Core).
