CS 239 – Big Data Systems Fall 2019

Presentation Transcript


  1. CS 239 – Big Data Systems, Fall 2019. Harry Xu, UCLA

  2. My Research Background
     • Compilers and systems
       • Static and dynamic program analysis
       • Compiler
       • Runtime/operating systems
     • Big Data Analytics
       • Dataflow systems
       • Graph systems
       • Machine learning systems
     • Some industrial experience
       • Microsoft – created and developed an optimizing compiler for Cosmos/Scope that improved the overall performance of production jobs by up to 3X
       • IBM – created and developed a series of profiling tools for large-scale systems
     [Diagram labels: Big Data; system support for scalable program analysis; system support for scalable analytics]

  3. [Figure; labels: BigDatalog, Application Circle, Infrastructure Circle]

  4. This Course: Big Data Systems
     • What it is about
       • Low-level infrastructures
       • Programming models
       • Runtimes
       • Scalability and efficiency
     • What it is NOT about
       • High-level applications
       • Workloads
       • Data collection and usage
     • An example
       • We are going to discuss some papers on machine learning systems
       • We are NOT going to discuss learning models and algorithms

  5. Industrial Relevance
     • Many papers came directly from industry
       • GFS, MapReduce, Bigtable, Spanner, TensorFlow (Google)
       • HDFS (Yahoo)
       • Azure, Trill, Dryad, Naiad (Microsoft)
       • Spark, Tachyon (Databricks)
     • Applications vs. systems
       • Many people can develop applications
       • Few people can develop systems
       • Applications are specific to domains, while the skills required to build infrastructures are generic

  6. Goals to Achieve
     • Understand what systems are available for data analytics
     • Understand fundamental challenges in system design
     • Understand how to design a customized system for a certain workload
     • Gain experience with system development by proposing and implementing a new idea

  7. What This Course Is Related To
     • Distributed systems
     • Database systems
     • Computer architecture
     • Networking
     • Storage (memory, disk, file system, etc.)
     • Graph algorithms
     • Statistics
     • Machine learning

  8. Aspects of Big Data Processing
     • Where to put data?
     • How to process data at scale?
     • How to process different types of data?
       • Structured data
       • Unstructured data
       • Streaming data
       • Graph data
       • Data for model training
     • How to take advantage of technological advances?
     • How to make processing efficient?

  9. Topics Covered (I)
     • Distributed storage systems
       • HDFS, GFS, Bigtable, Spanner, and Azure storage
     • Dataflow engines
       • MapReduce, Dryad, AsterixDB, Spark
     • Batch processing
       • Hive, Spark SQL, and SCOPE
     • Resource management
       • Mesos, YARN, LATE, Borg, Sparrow

  10. Topics Covered (II)
     • Stream processing
       • Storm, Flink, Kafka, Naiad, Trill, SVE, Drizzle
     • Graph processing
       • Pregel, Ligra, GraphChi, X-Stream, GridGraph
     • Machine learning
       • TensorFlow, Parameter Servers, Project Adam

  11. Why Do We Need Those Systems?
     • Enablers
     • Better performance
     • Scalability
     • Efficiency
     • Energy
     • Easy/flexible programmability

  12. Course Structure
     • Paper critiques
       • Due before each presentation day
     • Presentation
       • 20-25 mins
     • Participation in active discussion
     • Project
       • 2-3 students form a group, working on an innovative idea in system development

  13. Things about Presentations/Critiques
     • Reuse slides as much as possible
     • A good rule of thumb is to follow this order
       • What problems does the paper solve?
       • Why are they (serious) problems?
       • Why aren’t they already solved?
       • What are the main challenges?
       • How did the authors overcome them?
       • What evidence did the authors show that the problems are solved?
       • Questions, concerns, opportunities for improvement
