
Presentation Transcript


  1. Introduction to Apache Spark • Matei Zaharia, Pat McDonough • spark.apache.org

  2. What is Apache Spark? • Fast and general cluster computing system, interoperable with Hadoop • Improves efficiency through: • In-memory computing primitives • General computation graphs • Improves usability through: • Rich APIs in Scala, Java, Python • Interactive shell • Up to 100× faster (2-10× on disk) • 2-5× less code

  3. Project History • Started in 2009, open sourced 2010 • 30+ companies now contributing code • Databricks, Yahoo!, Intel, Adobe, Cloudera, Bizo, … • One of the largest communities in big data

  4. A General Stack • Spark at the core, with higher-level components on top: • Shark (SQL) • Spark Streaming (real-time) • GraphX (graph) • MLlib (machine learning) • …

  5. This Talk • Spark introduction & use cases • Other stack projects • The power of unification • Demo

  6. Why a New Programming Model? • MapReduce greatly simplified big data analysis • But once started, users wanted more: • More complex, multi-pass analytics (e.g. ML, graph) • More interactive ad-hoc queries • More real-time stream processing • All 3 need faster data sharing in parallel apps

  7. Data Sharing in MapReduce • [Diagram: each iteration (iter. 1, iter. 2, …) and each ad-hoc query (query 1-3) round-trips through an HDFS read and HDFS write] • Slow due to replication, serialization, and disk I/O

  8. What We'd Like • [Diagram: one-time processing of the input, then iterations and queries 1-3 share data through distributed memory] • 10-100× faster than network and disk

  9. Spark Model • Write programs in terms of transformations on distributed datasets • Resilient Distributed Datasets (RDDs) • Collections of objects that can be stored in memory or disk across a cluster • Built via parallel transformations (map, filter, …) • Automatically rebuilt on failure
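
A minimal PySpark sketch of this model (dataset and names are illustrative): transformations build the RDD lazily, cache() marks it for in-memory storage, and actions trigger the actual computation.

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "RDDModelSketch")

    nums = sc.parallelize(range(1, 1001))        # base RDD from a local collection
    evens = nums.filter(lambda x: x % 2 == 0)    # transformation (lazy)
    squares = evens.map(lambda x: x * x)         # transformation (lazy)
    squares.cache()                              # keep in memory once computed

    print(squares.count())   # action: triggers the computation (500)
    print(squares.sum())     # action: reuses the cached data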

  10. Example: Log Mining • Load error messages from a log into memory, then interactively search for various patterns:

    lines = spark.textFile("hdfs://...")                      # base RDD
    errors = lines.filter(lambda x: x.startswith("ERROR"))    # transformed RDD
    messages = errors.map(lambda x: x.split('\t')[2])
    messages.cache()

    messages.filter(lambda x: "foo" in x).count()             # action
    messages.filter(lambda x: "bar" in x).count()

  • [Diagram: the driver ships tasks to workers; each worker scans its block, caches its partition of messages, and returns results] • Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data) • Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
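
A self-contained, runnable version of the same pipeline. The path and record layout are assumptions (tab-separated lines with the message in the third field, as the slide's split('\t')[2] implies), and the slide's `spark` variable is an older name for the SparkContext.

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "LogMining")

    # Assumed path; any HDFS or local log file with tab-separated fields works.
    lines = sc.textFile("hdfs://namenode:9000/logs/app.log")
    errors = lines.filter(lambda x: x.startswith("ERROR"))
    messages = errors.map(lambda x: x.split("\t")[2])
    messages.cache()   # populated on the first action, reused afterwards

    print(messages.filter(lambda x: "foo" in x).count())
    print(messages.filter(lambda x: "bar" in x).count())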

  11. Fault Tolerance • RDDs track lineage info to rebuild lost data:

    file.map(lambda rec: (rec.type, 1))
        .reduceByKey(lambda x, y: x + y)
        .filter(lambda (type, count): count > 10)

  • [Diagram: input file → map → reduce → filter lineage chain]
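
The lineage Spark replays after a failure can be inspected directly. A small sketch with made-up records (the slide's `lambda (type, count)` is Python 2 tuple unpacking; the Python 3 form indexes the tuple instead):

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "LineageDemo")

    # Hypothetical records with a type tag in the first field.
    records = sc.parallelize(["ERROR disk full", "WARN gc pause"] * 20)
    counts = (records.map(lambda rec: (rec.split(" ")[0], 1))
                     .reduceByKey(lambda x, y: x + y)
                     .filter(lambda tc: tc[1] > 10))

    # toDebugString() shows the lineage used to rebuild lost partitions.
    print(counts.toDebugString())
    print(counts.collect())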

  13. Example: Logistic Regression • [Chart: 110 s / iteration on Hadoop; with Spark, the first iteration takes 80 s and further iterations 1 s]
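
A rough PySpark sketch of why the numbers look this way: the training set is cached after the first pass, so later iterations of gradient descent avoid disk entirely. The toy data, step size, and label encoding (labels in {-1, +1}) are all assumptions.

    from pyspark import SparkContext
    import numpy as np

    sc = SparkContext("local[*]", "LogisticRegressionSketch")

    # Toy (features, label) pairs; a real job would load a large file once
    # and cache it, which is what makes iterations after the first cheap.
    data = [(np.array([1.0, 2.0]), 1.0), (np.array([-1.5, -0.5]), -1.0)] * 100
    points = sc.parallelize(data).cache()

    w = np.zeros(2)
    for _ in range(10):
        # Gradient of the logistic loss, summed across the cluster.
        grad = points.map(
            lambda p: (1.0 / (1.0 + np.exp(-p[1] * w.dot(p[0]))) - 1.0) * p[1] * p[0]
        ).reduce(lambda a, b: a + b)
        w -= 0.1 * grad

    print(w)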

  14. Behavior with Less RAM

  15. Spark in Scala and Java

    // Scala:
    val lines = sc.textFile(...)
    lines.filter(x => x.contains("ERROR")).count()

    // Java:
    JavaRDD<String> lines = sc.textFile(...);
    lines.filter(new Function<String, Boolean>() {
      Boolean call(String s) { return s.contains("error"); }
    }).count();

  16. Supported Operators • map • filter • groupBy • sort • union • join • leftOuterJoin • rightOuterJoin • reduce • count • fold • reduceByKey • groupByKey • cogroup • cross • zip • sample • take • first • partitionBy • mapWith • pipe • save • …
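
A few of these operators in one hedged PySpark sketch (the data is made up):

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "OperatorTour")

    users = sc.parallelize([(1, "alice"), (2, "bob"), (3, "carol")])
    clicks = sc.parallelize([(1, "home"), (1, "search"), (3, "cart")])

    print(users.join(clicks).collect())             # inner join on key
    print(users.leftOuterJoin(clicks).collect())    # bob pairs with None
    print(clicks.groupByKey().mapValues(list).collect())
    print(clicks.map(lambda kv: (kv[0], 1))
                .reduceByKey(lambda a, b: a + b).collect())
    print(clicks.sample(False, 0.5).count())        # sample w/o replacement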

  17. Spark Community • One of the largest open source projects in big data • 150+ developers contributing • 30+ companies contributing • [Chart: contributors in the past year]

  18. Community Growth • Spark 0.6: 17 contributors (Oct '12) • Spark 0.7: 31 contributors (Feb '13) • Spark 0.8: 67 contributors (Sept '13) • Spark 0.9: 83 contributors (Feb '14)

  19. This Talk • Spark introduction & use cases • Other stack projects • The power of unification • Demo

  20. Shark: Hive on Spark • Columnar SQL analytics engine • Both SQL and complex analytics • Up to 100x faster than Hive • Compatible with Apache Hive • HiveQL, UDFs, SerDes, scripts • Existing Hive warehouses • In use at Yahoo! for BI

  21. Spark Integration • Unified system for SQL, graphs, machine learning • All share the same set of workers and caches

  22. Spark Streaming • Stateful, fault-tolerant stream processing with the same API as batch jobs:

    sc.twitterStream(...)
      .flatMap(tweet => tweet.text.split(" "))
      .map(word => (word, 1))
      .reduceByWindow("5s", _ + _)
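
The slide's twitterStream is Scala-only; a rough PySpark analogue of the same windowed count using a socket source instead (host, port, and window sizes are placeholders, and the Python streaming API arrived after this deck, so treat this as a sketch):

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext("local[2]", "WindowedWordCount")  # 2 threads: receiver + work
    ssc = StreamingContext(sc, 1)                       # 1-second batches

    lines = ssc.socketTextStream("localhost", 9999)     # placeholder source
    counts = (lines.flatMap(lambda text: text.split(" "))
                   .map(lambda word: (word, 1))
                   .reduceByKeyAndWindow(lambda a, b: a + b, None, 5, 5))  # 5 s window
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()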

  23. MLlib • Built-in library of machine learning algorithms • K-means clustering • Alternating least squares • Linear regression (with L1 / L2 reg.) • Logistic regression (with L1 / L2 reg.) • Naïve Bayes

    val points = sc.textFile(...).map(parsePoint)
    val model = KMeans.train(points, 10)
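
The same K-means call in PySpark, assuming a build that ships pyspark.mllib; the toy 2-D vectors stand in for whatever the slide's parsePoint would produce:

    from pyspark import SparkContext
    from pyspark.mllib.clustering import KMeans

    sc = SparkContext("local[*]", "KMeansSketch")

    # Toy 2-D points forming two obvious clusters.
    points = sc.parallelize([[0.0, 0.0], [0.1, 0.1], [9.0, 9.0], [9.1, 9.2]])
    model = KMeans.train(points, 2, maxIterations=10)

    print(model.clusterCenters)
    print(model.predict([0.05, 0.05]))   # cluster id for a new point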

  24. Others • GraphX: Pregel-like graph processing and algorithm library, integrated directly in Spark • BlinkDB: approximate queries for Shark • SparkR: R API and library

  25. This Talk • Spark introduction & use cases • Other stack projects • The power of unification • Demo

  26. Big Data Systems Today • [Diagram: MapReduce as the general batch-processing engine, surrounded by specialized systems for iterative, interactive, and streaming apps: Pregel, Giraph, Dremel, Drill, Tez, Impala, GraphLab, Storm, S4, …]

  27. Spark's Approach • Instead of specializing, generalize MapReduce to support new apps in the same engine • Two changes (general task DAG & data sharing) are enough to express the previous models! • Unification has big benefits • For the engine • For users • [Diagram: Shark, Streaming, GraphX, MLbase, … all on one Spark engine]

  28. Code Size • [Chart: non-test, non-example source lines for Spark vs other systems]

  29. Code Size • [Chart: the previous chart, adding Spark Streaming]

  30. Code Size • [Chart: adding Shark* (* also calls into Hive)]

  31. Code Size • [Chart: adding GraphX]

  32. Performance • [Charts: Spark's streaming, SQL, and graph performance compared against specialized systems]

  33. What it Means for Users • Separate frameworks: each stage (ETL, train, query) does its own HDFS read and HDFS write • Spark: one HDFS read, then ETL, train, query, and interactive analysis all share data in memory
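
A rough PySpark sketch of that unified pipeline, where ETL, training, and ad-hoc queries reuse one cached dataset instead of round-tripping through HDFS; the path, schema, and k are made up:

    from pyspark import SparkContext
    from pyspark.mllib.clustering import KMeans

    sc = SparkContext("local[*]", "UnifiedPipeline")

    # ETL: one HDFS read (assumed path and layout), then keep it in memory.
    raw = sc.textFile("hdfs://namenode:9000/data/events.csv")
    features = (raw.map(lambda line: line.split(","))
                   .filter(lambda f: len(f) == 2)
                   .map(lambda f: [float(f[0]), float(f[1])])
                   .cache())

    # Train: consumes the cached RDD, no second read from disk.
    model = KMeans.train(features, 5)

    # Query: interactive analysis over the same in-memory data.
    print(features.count())
    print(model.predict([1.0, 2.0]))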

  34. Combining Processing Types

    val points = sc.runSql[Double, Double](
      "select latitude, longitude from historic_tweets")
    val model = KMeans.train(points, 10)
    sc.twitterStream(...)
      .map(t => (model.closestCenter(t.location), 1))
      .reduceByWindow("5s", _ + _)

  35. Demo

  36. Get Started • Visit spark.apache.org for videos & tutorials • Download the Spark bundle for CDH • Easy to run on just your laptop • Free training talks and hands-on exercises: spark-summit.org

  37. Conclusion • Big data analytics is evolving to include: • More complex analytics (e.g. machine learning) • More interactive ad-hoc queries • More real-time stream processing • Spark is a fast platform that unifies these apps • Join us at Spark Summit 2014! June 30 - July 2, San Francisco
