
Intro to Spark 0.7: PySpark and Streaming

Intro to Spark 0.7: PySpark and Streaming. February 21, 2013. www.spark-project.org. UC Berkeley. Agenda: Introduction (Matei Zaharia), Python API (Josh Rosen), Spark Streaming (Tathagata Das). What is Spark? A fast, next-generation data analysis platform started at UC Berkeley.




Presentation Transcript


  1. Intro to Spark 0.7: PySpark and Streaming • February 21, 2013 • www.spark-project.org • UC Berkeley

  2. Agenda • Introduction (Matei Zaharia) • Python API (Josh Rosen) • Spark Streaming (Tathagata Das)

  3. What is Spark? • Fast, next-generation data analysis platform started at UC Berkeley • Multiple emerging workloads: batch, interactive, streaming • Easy APIs in multiple languages: Java, Scala, Python • Growing higher-level stack (e.g. Shark) • [Diagram labels: Shark (SQL), Learning, Graph, Streaming, ...; Spark; Mesos, YARN, EC2]

  4. Spark 0.7: Statistics • Biggest release yet in terms of contributors • 30 people contributed (19 non-Berkeley) • 12 companies contributed code • 28K lines of patches, 700 commits

  5. Spark 0.7: Contributors • Mikhail Bautin* • Denny Britz • Paul Cavallaro* • Tathagata Das • Thomas Dudziak* • Harvey Feng • Stephen Haberman* • Tyson Hamilton* • Mark Hamstra* • Michael Heuer* • Shane Huang* • Andy Konwinski • Ryan LeCompte* • Haoyuan Li • Richard McKinley* • Sean McNamara* • Lee Moon Soo* • Fernand Pajot* • Nick Pentreath* • Andrew Psaltis* • Imran Rashid* • Charles Reiss • Josh Rosen • Peter Sankauskas* • Prashant Sharma* • Shivaram Venkataraman • Patrick Wendell • Reynold Xin • Matei Zaharia • Eric Zhang* (* = non-Berkeley)

  6. Spark 0.7: Features • Two major additions: Python API (PySpark), Spark Streaming alpha • Many smaller ones: memory monitoring dashboard, Maven build & Debian packages, RDD checkpointing, metadata cleanup (TTL), shuffle speedups, improved EC2 scripts, … • Expect the release in 3-4 days!
