
AMPCamp Introduction to Berkeley Data Analytics Systems (BDAS)

AMPCamp Introduction to Berkeley Data Analytics Systems (BDAS). March 15, 2013 AMPCamp @ ECNU, Shanghai, China. Algorithms, Machines, People Lab (AMPLab). 8 CS Faculty: Michael Franklin, Ion Stoica, Michael Jordan, David Patterson, Randy Katz, ... ~40 students, 3 software engineers



Presentation Transcript


  1. AMPCamp Introduction to Berkeley Data Analytics Systems (BDAS) • March 15, 2013 • AMPCamp @ ECNU, Shanghai, China

  2. Algorithms, Machines, People Lab (AMPLab) • 8 CS Faculty: Michael Franklin, Ion Stoica, Michael Jordan, David Patterson, Randy Katz, ... • ~40 students, 3 software engineers • NSF + DARPA + 21 industrial sponsors • Goal: Next Generation of Analytics Data Stack for Industry & Research: • Berkeley Data Analytics Stack (BDAS) • Release as Open Source

  3. Previous AMPCamps • Aug 2012 at UC Berkeley • 150 people from industry and academia • 5000 signed up online to watch live stream • Feb 2013 at Strata Conference, Silicon Valley

  4. AMPCamp @ ECNU • Friday morning • Introduction • Parallel Programming with Spark • Shark • Machine Learning with Spark (K-Means) • Friday afternoon • Hands-on exercise • Saturday • Exercises + discussions

  5. Myself • Reynold Xin, PhD student at UC Berkeley AMPLab • 辛湜 shi2 • Work/intern experience at Google Research, IBM DB2, Altera

  6. Today’s Open Analytics Stack… • …mostly focused on large on-disk datasets: great for batch but slow [Figure: stack diagram with layers labeled Application, Data Processing, Storage, Infrastructure]

  7. Goals • One stack to rule them all: Batch, Interactive, Streaming • Easy to combine batch, streaming, and interactive computations • Easy to develop sophisticated algorithms • Compatible with existing open source ecosystem (Hadoop/HDFS)

  8. Our Approach: Support Interactive and Streaming Computations • Aggressive use of memory • Why? • Memory transfer rates >> disk or even SSDs • Gap is growing, especially w.r.t. disk • Many datasets already fit into memory • The inputs of over 90% of jobs in Facebook, Yahoo!, and Bing clusters fit into memory • E.g., 1TB = 1 billion records @ 1 KB each • Memory density (still) grows with Moore’s law • RAM/SSD hybrid memories on the horizon [Figure: high-end datacenter node, with labels 10Gbps, 128-512GB, 1-4GB/s (x4 disks), 0.2-1GB/s (x10 disks), 40-60GB/s, 16 cores, 10-30TB, 1-4TB]
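The "1TB = 1 billion records @ 1 KB each" figure on this slide can be checked with a few lines of arithmetic. The sketch below also estimates how many nodes would be needed to hold such a dataset entirely in memory, under the assumption that the 128-512GB label in the node figure refers to per-node RAM. The function name `nodes_needed` is hypothetical, decimal units are assumed, and the estimate ignores replication and runtime object overhead.

```python
import math

# Figures taken from the slide (decimal units assumed: 1 KB = 1,000 bytes).
RECORDS = 1_000_000_000   # 1 billion records
RECORD_BYTES = 1_000      # 1 KB each

dataset_bytes = RECORDS * RECORD_BYTES
dataset_tb = dataset_bytes / 10**12


def nodes_needed(total_bytes, ram_gb_per_node):
    """Nodes required to hold total_bytes in RAM (hypothetical helper).

    Assumption: all of a node's RAM is usable for data, which is optimistic.
    """
    return math.ceil(total_bytes / (ram_gb_per_node * 10**9))


print(dataset_tb)                        # 1.0
print(nodes_needed(dataset_bytes, 128))  # 8
print(nodes_needed(dataset_bytes, 512))  # 2
```

Under these assumptions, a 1TB dataset fits in the aggregate memory of a handful of nodes, which is the point the slide makes about many datasets already fitting into memory.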

  9. Our Approach • Easy to combine batch, streaming, and interactive computations • Single execution model that supports all computation models • Easy to develop sophisticated algorithms • Powerful Python and Scala shells • High-level abstractions for graph-based and ML algorithms • Compatible with existing open source ecosystem (Hadoop/HDFS) • Interoperate with existing storage and input formats (e.g., HDFS, Hive, Flume, ...) • Support existing execution models (e.g., Hive, GraphLab)

  10. Berkeley Data Analytics Stack (BDAS) • Existing stack components… [Figure: stack diagram with layers Data Processing, Data Management, and Resource Management; components shown: Hive, Pig, HBase, Storm, MPI, …, Hadoop, HDFS]

  11. Mesos [Released, v0.9] • Management platform that allows multiple frameworks to share a cluster • Compatible with existing open analytics stack • Deployed in production at Twitter on 3,500+ servers [Figure: stack diagram with Mesos as the Resource Management layer]

  12. Spark [Released, v0.7] • In-memory framework for interactive and iterative computations • Resilient Distributed Dataset (RDD): fault-tolerant, in-memory storage abstraction • Scala interface, Java and Python APIs [Figure: stack diagram with Spark added to the Data Processing layer]
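One way to picture how an RDD can be both in-memory and fault-tolerant is lineage: each dataset remembers which parent it was derived from and by what transformation, so a lost in-memory partition can be recomputed rather than restored from a replica. The sketch below is a toy, single-process illustration of that idea only. `ToyRDD` and its methods are hypothetical names, not the Spark API, and it covers only per-partition transformations such as map and filter.

```python
class ToyRDD:
    """Toy single-process stand-in for an RDD (hypothetical, not the Spark API)."""

    def __init__(self, partitions=None, parent=None, fn=None):
        self.source = partitions  # raw data; only set on the base dataset
        self.parent = parent      # lineage: the dataset this one derives from
        self.fn = fn              # lineage: per-partition transformation
        self.cache = {}           # in-memory copies of computed partitions
        if partitions is not None:
            self.num_partitions = len(partitions)
        else:
            self.num_partitions = parent.num_partitions

    def map(self, f):
        return ToyRDD(parent=self, fn=lambda part: [f(x) for x in part])

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda part: [x for x in part if pred(x)])

    def compute(self, i):
        """Return partition i, recomputing it from lineage if it is not cached."""
        if i in self.cache:
            return self.cache[i]
        if self.parent is None:
            result = list(self.source[i])
        else:
            result = self.fn(self.parent.compute(i))
        self.cache[i] = result
        return result

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.compute(i)]

    def lose_partition(self, i):
        """Simulate losing the in-memory copy of partition i."""
        self.cache.pop(i, None)


base = ToyRDD(partitions=[[1, 2, 3], [4, 5, 6]])
evens_squared = base.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

first = evens_squared.collect()
evens_squared.lose_partition(1)   # drop the cached copy of one partition
second = evens_squared.collect()  # rebuilt from the parent via lineage
print(first, second)              # [4, 16, 36] [4, 16, 36]
```

Only the lost partition is recomputed here; the other partition is served from the in-memory cache.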

  13. Spark Community • 3000 people attended online training in August • 500+ meetup members • 14 companies contributing • 31 contributors in the last release • spark-project.org

  14. Spark Streaming [Alpha Release] • Large scale streaming computation • Ensures exactly-once semantics • Integrated with Spark → unifies batch, interactive, and streaming computations! [Figure: stack diagram with Spark Streaming added on top of Spark]
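The unification of batch and streaming can be pictured with a micro-batch model: the stream is cut into small time windows and each window is handled by an ordinary batch computation. The sketch below is a toy illustration under that assumption. The names `micro_batches` and `word_count` are hypothetical and not part of Spark Streaming, and windows with no events are simply skipped.

```python
from collections import Counter


def micro_batches(events, batch_seconds):
    """Group (timestamp, value) events into consecutive fixed-width time windows."""
    batches = {}
    for ts, value in events:
        batches.setdefault(int(ts // batch_seconds), []).append(value)
    return [batches[k] for k in sorted(batches)]


def word_count(batch):
    """An ordinary batch computation, reused unchanged for every micro-batch."""
    return Counter(batch)


events = [
    (0.2, "spark"), (0.7, "shark"),
    (1.1, "spark"), (1.9, "spark"),
    (2.5, "mesos"),
]

running = Counter()
for batch in micro_batches(events, batch_seconds=1):
    running += word_count(batch)  # same function a batch job would call

print(running["spark"], running["shark"], running["mesos"])  # 3 1 1
```

Because `word_count` knows nothing about streaming, the same code could run once over a stored dataset or repeatedly over incoming windows.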

  15. Shark [Released, v0.2] • Hive over Spark: SQL-like interface (supports Hive 0.9) • Up to 100x faster for in-memory data, and 5-10x for disk • In tests on a cluster of hundreds of nodes at … [Figure: stack diagram with Shark added on top of Spark]

  16. Tachyon [Alpha Release, this Spring] • High-throughput, fault-tolerant in-memory storage • Interface compatible with HDFS • Support for Spark and Hadoop [Figure: stack diagram with Tachyon added to the Data Management layer alongside HDFS]

  17. BlinkDB [Alpha Release, this Spring] • Large scale approximate query engine • Allows users to specify error or time bounds • Preliminary prototype starting to be tested at Facebook [Figure: stack diagram with BlinkDB added to the Data Processing layer]
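The idea of answering a query approximately within a user-specified error bound can be sketched with plain uniform sampling: estimate the spread from a small pilot sample, choose a sample size expected to meet the bound, and report the estimate with its error. This is a generic statistics sketch, not BlinkDB's actual method, which the slide does not describe. The function names and the synthetic `latencies` data are hypothetical.

```python
import math
import random
import statistics


def approximate_mean(values, sample_size, z=1.96, seed=0):
    """Estimate the mean from a uniform random sample with a ~95% error bound."""
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)
    mean = statistics.fmean(sample)
    stderr = statistics.stdev(sample) / math.sqrt(sample_size)
    return mean, z * stderr


def sample_size_for_error(values, target_error, pilot_size=1000, z=1.96, seed=0):
    """Pick a sample size expected to meet target_error, using a pilot sample."""
    rng = random.Random(seed)
    pilot = rng.sample(values, min(pilot_size, len(values)))
    n = math.ceil((z * statistics.stdev(pilot) / target_error) ** 2)
    return min(max(n, 2), len(values))


# Synthetic data standing in for a large table column (assumption).
data_rng = random.Random(42)
latencies = [data_rng.uniform(0, 100) for _ in range(200_000)]

n = sample_size_for_error(latencies, target_error=1.0)
estimate, bound = approximate_mean(latencies, n)
exact = statistics.fmean(latencies)

print(f"sampled {n} of {len(latencies)} rows")
print(f"estimate = {estimate:.2f} +/- {bound:.2f}, exact = {exact:.2f}")
```

The trade-off is the one the slide names: a looser error bound needs fewer rows, so the query touches less data and returns sooner.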

  18. SparkGraph [Alpha Release, this Spring] • GraphLab API and Toolkits on top of Spark • Fault tolerance by leveraging Spark [Figure: stack diagram with Spark Graph added on top of Spark]

  19. MLbase [In development] • Declarative approach to ML • Develop scalable ML algorithms • Make ML accessible to non-experts [Figure: stack diagram with MLbase added to the Data Processing layer]

  20. Compatible with Open Source Ecosystem • Support existing interfaces whenever possible • GraphLab API • Hive interface and shell • Compatibility layer for Hadoop, Storm, MPI, etc. to run over Mesos • HDFS API [Figure: full BDAS stack diagram annotated with these callouts]

  21. Compatible with Open Source Ecosystem • Use existing interfaces whenever possible • Accept inputs from Kafka, Flume, Twitter, TCP sockets, … • Support Hive API • Support HDFS API, S3 API, and Hive metadata [Figure: full BDAS stack diagram annotated with these callouts]

  22. Summary • Holistic approach to address next generation of Big Data challenges! • Support interactive and streaming computations • In-memory, fault-tolerant storage abstraction, low-latency scheduling, ... • Easy to combine batch, streaming, and interactive computations • Spark execution engine supports all computation models • Easy to develop sophisticated algorithms • Scala interface, APIs for Java, Python, Hive QL, … • New frameworks targeted to graph-based and ML algorithms • Compatible with existing open source ecosystem • Open source (Apache/BSD) and fully committed to releasing high quality software • Three-person software engineering team led by Matt Massie (creator of Ganglia, 5th Cloudera engineer) [Figure: Spark at the center of Batch, Interactive, and Streaming]
