the state of spark n.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
The State of Spark PowerPoint Presentation
Download Presentation
The State of Spark

Loading in 2 Seconds...

play fullscreen
1 / 31

The State of Spark - PowerPoint PPT Presentation


  • 83 Views
  • Uploaded on

The State of Spark. And Where We’re Going Next. Matei Zaharia. Community Growth. Project History. Spark started as research project in 2009 Open sourced in 2010 1 st version was 1600 LOC, could run Wikipedia demo Growing community since Entered Apache Incubator in June 2013.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'The State of Spark' - lethia


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
the state of spark

The State of Spark

And Where We’re Going Next

Matei Zaharia

project history
Project History
  • Spark started as research project in 2009
  • Open sourced in 2010
    • 1st version was 1600 LOC, could run Wikipedia demo
  • Growing community since
  • Entered Apache Incubator in June 2013
development community
Development Community
  • With over 100 developers and 25 companies, one of the most active communities in big data

Comparison: Storm (48), Giraph (52), Drill (18), Tez (12)

Past 6 months: more active devs than HadoopMapReduce!

development community1
Development Community
  • Healthy across the whole ecosystem
release growth
Release Growth
  • Spark 0.8:
  • YARN, MLlib, monitoring UI
  • 67 contributors
  • Spark 0.7:
  • Python API, Spark Streaming
  • 31 contributors
  • Spark 0.6:
  • Java API, Maven, standalone mode
  • 17 contributors

Oct ‘12

Feb ‘13

Sept ‘13

some community contributions
Some Community Contributions
  • YARN support (Yahoo!)
  • Columnar compression in Shark (Yahoo!)
  • Fair scheduling (Intel)
  • Metrics reporting (Intel, Quantifind)
  • New RDD operators (Bizo, ClearStory)
  • Scala 2.10 support (Imaginea)
projects built on spark
Projects Built on Spark

Spark Streaming

(real-time)

MLbase

(machine learning)

BlinkDB

GraphX

(graph)

Shark(SQL)

Spark

our view
Our View
  • While big data tools have advanced a lot, they are still far too difficult to tune and use
  • Goal: design big data systems that are as powerful & seamless as those for small data
current priorities
Current Priorities
  • Standard libraries
  • Deployment
  • Out-of-the-box usability
  • Enterprise use

+

standard libraries
Standard Libraries
  • While writing K-means in 30 lines is great, it’s even better to call it from a library!
  • Spark’s MLlib and GraphX will be standard libraries supported by core developers
    • MLlib in Spark 0.8 with 7 algorithms
    • GraphX coming soon
    • Both operate directly on RDDs
standard libraries1
Standard Libraries
  • valrdd: RDD[Array[Double]] = ...val model = KMeans.train(rdd, k = 10)val graph = Graph(vertexRDD, edgeRDD)val ranks = PageRank.run(graph, iters = 10)
standard libraries2
Standard Libraries
  • Beyond these libraries, Databricks is investing heavily in higher-level projects
  • Spark Streaming:easier 24/7 operation and optimizations coming in 0.9
  • Shark:calling Spark libs (e.g. MLlib), optimizer, Hive 0.11 & 0.12

Goal: a complete and interoperable stack

deployment
Deployment
  • Want Spark to easily run anywhere
  • Spark 0.8: much improved YARN, EC2 support
  • Spark 0.8.1: support for YARN 2.2
  • SIMR: launch Spark inMapReduce clusters as a Hadoop job (no installation needed!)
    • For experimenting; see talk by Ahir
ease of use
Ease of Use
  • Monitoring and metrics (0.8)
  • Better support for large # of tasks (0.8.1)
  • High availability for standalone mode (0.8.1)
  • External hashing & sorting (0.9)

Long-term: remove need to tune beyond defaults

next releases
Next Releases
  • Spark 0.8.1 (this month)
    • YARN 2.2, standalone mode HA, optimized shuffle, broadcast & result fetching
  • Spark 0.9 (Jan 2014)
    • Scala 2.10 support, configuration system, Spark Streaming improvements
big data systems today
Big Data Systems Today

Giraph

Pregel

Drill

Dremel

Tez

MapReduce

Impala

GraphLab

Storm

S4

General batchprocessing

Specialized systems

(iterative, interactive andstreaming apps)

spark s approach
Spark’s Approach
  • Instead of specializing, generalize MapReduceto support new apps in same engine
  • Two changes (general task DAG & data sharing) are enough to express previous models!
  • Unification has big benefits
    • For the engine
    • For users

Shark

Streaming

GraphX

MLbase

Spark

code size
Code Size

non-test, non-example source lines

code size1
Code Size

Streaming

non-test, non-example source lines

code size2
Code Size

Shark*

Streaming

non-test, non-example source lines

* also calls into Hive

code size3
Code Size

GraphX

Shark*

Streaming

non-test, non-example source lines

* also calls into Hive

performance
Performance

Streaming

SQL

Graph

what it means for users
What it Means for Users
  • Separate frameworks:

HDFS read

HDFS read

HDFS read

ETL

train

query

HDFS write

HDFS write

HDFS write

Spark:

Interactiveanalysis

train

query

HDFS read

ETL

HDFS

combining processing types
Combining Processing Types
  • val points = sc.runSql[Double, Double](“select latitude, longitude from historic_tweets”)val model = KMeans.train(points, 10)sc.twitterStream(...) .map(t => (model.closestCenter(t.location), 1)) .reduceByWindow(“5s”, _ + _)

From Scala:

combining processing types1
Combining Processing Types
  • GENERATEKMeans(tweet_locations) AS TABLE tweet_clusters// Scalatable generating function (TGF):objectKMeans {@Schema(spec = “x double, y double, cluster int”)def apply(points: RDD[(Double, Double)]) = { ... }}

From SQL (in Shark 0.8.1):

conclusion
Conclusion
  • Next challenge in big data will be complex and low-latency applications
  • Spark offers a unified engine to tackle and combine these apps
  • Best strength is the community: enjoy Spark Summit!
contributors
Contributors
  • Aaron Davidson
  • Alexander Pivovarov
  • Ali Ghodsi
  • AmeetTalwalkar
  • Andre Shumacher
  • Andrew Ash
  • Andrew Psaltis
  • Andrew Xia
  • AndreyKouznetsov
  • Andy Feng
  • Andy Konwinski
  • Ankur Dave
  • Antonio Lupher
  • Benjamin Hindman
  • Bill Zhao
  • Charles Reiss
  • Chris Mattmann
  • ChristophGrothaus
  • Christopher Nguyen
  • Chu Tong
  • Cliff Engle
  • Cody Koeninger
  • David McCauley
  • Denny Britz
  • DmitriyLyubimov
  • Edison Tung
  • Eric Zhang
  • Erik van Oosten
  • Ethan Jewett
  • Evan Chan
  • Evan Sparks
  • EwenCheslack-Postava
  • Fabrizio Milo
  • FernandPajot
  • Frank Dai
  • Gavin Li
  • Ginger Smith
  • Giovanni Delussu
  • Grace Huang
  • Haitao Yao
  • Haoyuan Li
  • Harold Lim
  • Harvey Feng
  • Henry Milner
  • Henry Saputra
  • Hiral Patel
  • Holden Karau
  • Ian Buss
  • Imran Rashid
  • Ismael Juma
  • James Phillpotts
  • Jason Dai
  • Jerry Shao
  • JeyKottalam
  • Joseph E. Gonzalez
  • Josh Rosen
  • Justin Ma
  • Kalpit Shah
  • Karen Feng
  • KarthikTunga
  • Kay Ousterhout
  • KodyKoeniger
  • Konstantin Boudnik
  • Lee Moon Soo
  • Lian Cheng
  • Liang-Chi Hsieh
  • Marc Mercer
  • MarekKolodziej
  • Mark Hamstra
  • Matei Zaharia
  • Matthew Taylor
  • Michael Heuer
  • Mike Potts
  • Mikhail Bautin
  • Mingfei Shi
  • MosharafChowdhury
  • MridulMuralidharan
  • Nathan Howell
  • Neal Wiggins
  • Nick Pentreath
  • Olivier Grisel
  • Patrick Wendell
  • Paul Cavallaro
  • Paul Ruan
  • Peter Sankauskas
  • Pierre Borckmans
  • Prabeesh K.
  • Prashant Sharma
  • Ram Sriharsha
  • Ravi Pandya
  • Ray Racine
  • ReynoldXin
  • Richard Benkovsky
  • Richard McKinley
  • RohitRai
  • Roman Tkalenko
  • Ryan LeCompte
  • S. Kumar
  • Sean McNamara
  • Shane Huang
  • ShivaramVenkataraman
  • Stephen Haberman
  • Tathagata Das
  • Thomas Dudziak
  • Thomas Graves
  • Timothy Hunter
  • Tyson Hamilton
  • VadimChekan
  • Wu Zeming
  • Xinghao Pan