Accelerating Machine Learning Applications using Delite

Anand Atreya, Kevin Brown, George Rossin. Stanford University, CS315A. 1 June 2010.


Presentation Transcript


  1. Accelerating Machine Learning Applications using Delite
     Anand Atreya, Kevin Brown, George Rossin
     Stanford University, CS315A
     1 June 2010

  2. What is Machine Learning?
     • Learning patterns from data
       • Regression
       • Inference (e.g. Loopy Belief Propagation)
       • Adaptive control (e.g. Reinforcement Learning)
       • Neural networks (e.g. Restricted Boltzmann Machine)
     • A good domain for studying parallelism
       • Both throughput and latency are important
       • Many applications exhibit both data and task parallelism
         • Often at varying granularities
       • At the core of many emerging applications (speech recognition, robotics, data mining, etc.)
       • Many optimizations specific to the domain
         • e.g., sacrificing accuracy for performance

  3. Domain-Specific Languages
     • A language or library that exploits domain knowledge for productivity and efficiency
     • Widely used in many application areas
       • MATLAB, Verilog, OpenGL
     • Raises the level of abstraction above that of general-purpose languages
       • The programmer describes what to do rather than how to do it
     • Allows for an implicitly parallel environment

  4. OptiML: A DSL for ML
     • Provides a familiar (MATLAB-like) language and API for writing ML applications
     • Embedded in Scala
     • Encodes common ML kernels as implicitly parallel operations
       • Matrix multiply, dot product, etc.
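To illustrate what "implicitly parallel" can mean for a kernel such as the dot product, here is a minimal Python sketch. OptiML itself is embedded in Scala; the `Vector` class, its `workers` parameter, and the thread-pool strategy below are all hypothetical and are not OptiML's API. The caller writes an ordinary `dot` call, and the chunking happens inside the library.

```python
from concurrent.futures import ThreadPoolExecutor


class Vector:
    """Hypothetical DSL vector type (illustration only, not OptiML's API)."""

    def __init__(self, values, workers=4):
        self.values = list(values)
        self.workers = workers  # assumed tuning knob, hidden from typical callers

    def dot(self, other):
        if len(self.values) != len(other.values):
            raise ValueError("vectors must have the same length")
        n = len(self.values)
        # Ceiling division so that at most `workers` chunks are created.
        chunk = max(1, -(-n // self.workers))
        ranges = [(lo, min(lo + chunk, n)) for lo in range(0, n, chunk)]

        def partial(bounds):
            lo, hi = bounds
            return sum(a * b for a, b in zip(self.values[lo:hi], other.values[lo:hi]))

        # Each chunk is reduced independently; the partial sums are then combined.
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            return sum(pool.map(partial, ranges))
```

Calling `Vector([1, 2, 3]).dot(Vector([4, 5, 6]))` looks sequential to the programmer. Note that CPython threads will not speed up this pure-Python arithmetic; the sketch only shows where a runtime could introduce the parallelism.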

  5. What is Delite?
     • A dynamic parallel runtime
       • Domain Extracted Locality Informed Task Execution
     • Executes a task graph on parallel, heterogeneous hardware
       • CPUs, GPUs, etc.
     • Performs both static and dynamic scheduling
     • Integrates task and data parallelism in a single environment
     • Can apply dynamic domain-specific optimizations provided by a Domain-Specific Language

  6. Delite Execution Model
     Diagram labels, in order:
     • Calls Matrix DSL methods
     • DSL defers OP execution to Delite
     • Delite applies generic & domain transformations and generates mapping
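The deferral step in this execution model can be sketched as follows. This is a simplified Python illustration, not Delite's implementation: the `Op` and `Runtime` classes and the `submit` and `force` methods are hypothetical names, and the transformation and mapping stages mentioned on the slide are omitted. DSL calls only record OPs in a task graph, and work happens when a result is actually demanded.

```python
class Op:
    """A deferred operation: a node in the task graph (hypothetical type)."""

    def __init__(self, name, fn, inputs=()):
        self.name = name
        self.fn = fn
        self.inputs = tuple(inputs)  # other Ops or plain values
        self.result = None
        self.done = False


class Runtime:
    """Hypothetical stand-in for the runtime that receives deferred OPs."""

    def __init__(self):
        self.graph = []  # OPs in submission order

    def submit(self, name, fn, *inputs):
        # Record the OP instead of running it; the caller gets a handle back.
        op = Op(name, fn, inputs)
        self.graph.append(op)
        return op

    def force(self, op):
        # Execute an OP (and, recursively, its dependencies) on demand.
        if not op.done:
            args = [self.force(i) if isinstance(i, Op) else i for i in op.inputs]
            op.result = op.fn(*args)
            op.done = True
        return op.result


rt = Runtime()
a = rt.submit("add", lambda x, y: x + y, 1, 2)
b = rt.submit("mul", lambda x, y: x * y, a, 10)
ran_before_force = a.done  # nothing has executed yet
value = rt.force(b)        # runs "add", then "mul"
```

Because the runtime sees the whole graph before anything runs, it has the opportunity to rewrite or reschedule OPs, which is where the generic and domain transformations would fit.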

  7. Scheduling
     • An NP-hard problem in general
     • Very simple local clustering algorithm for general-purpose scheduling
       • Checks for dependency on previous M OPs to minimize communication
     • Control flow hints
       • Allow an efficient parallel for-loop schedule when the loop iterations are independent, without an explicit parallelFor construct
     • Data-parallel operations
       • Each OP is split into N chunks for N threads
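The two scheduling ideas above, local clustering over a window of the previous M OPs and splitting a data-parallel OP into N chunks, can be sketched in Python as below. The slides do not give the actual algorithm, so the details are assumptions: `cluster_schedule`, `split_into_chunks`, the "follow the most recent dependency" rule, and the least-loaded fallback are all hypothetical.

```python
def cluster_schedule(deps, num_workers, window):
    """Assign each OP to a worker.

    deps[i] is the set of earlier OP indices that OP i reads from.
    If OP i depends on one of the previous `window` OPs, it is placed on
    that OP's worker to avoid communication; otherwise it goes to the
    least-loaded worker (an assumed fallback policy).
    """
    assignment = []
    load = [0] * num_workers
    for i, needed in enumerate(deps):
        recent = [j for j in range(max(0, i - window), i) if j in needed]
        if recent:
            worker = assignment[recent[-1]]  # follow the most recent dependency
        else:
            worker = load.index(min(load))
        assignment.append(worker)
        load[worker] += 1
    return assignment


def split_into_chunks(length, n):
    """Split the index range [0, length) into n contiguous, near-equal chunks."""
    base, extra = divmod(length, n)
    bounds, start = [], 0
    for k in range(n):
        end = start + base + (1 if k < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


# Five OPs: 2 reads 0, 3 reads 1, 4 reads both 2 and 3.
plan = cluster_schedule([set(), set(), {0}, {1}, {2, 3}], num_workers=2, window=2)
chunks = split_into_chunks(10, 3)
```

A greedy pass like this is linear in the number of OPs for a fixed window, which is one plausible reason to prefer it at runtime over searching for an optimal schedule.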

  8. Integrating the GPU(s)
     • The portion of the task graph to be executed on the GPU is sent to a dedicated GPU scheduler
     • The GPU scheduler identifies the OP and sends the appropriate CUDA kernel to the GPU
     • Manages the GPU memory
       • Shipped data remains on the GPU for fast reuse until memory overflows or the CPU requests the data
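The memory policy described above (data stays on the GPU until memory overflows or the CPU asks for it) can be modeled with simple bookkeeping. The Python sketch below performs no real GPU transfers. The `DeviceCache` class and its method names are hypothetical, and least-recently-used eviction is an assumption, since the slides do not say which data is dropped on overflow.

```python
from collections import OrderedDict


class DeviceCache:
    """Bookkeeping model of GPU-resident data (hypothetical, no real transfers)."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.resident = OrderedDict()  # name -> size, oldest use first

    def ensure_on_device(self, name, size):
        """Return True if a host-to-device copy would be needed."""
        if name in self.resident:
            self.resident.move_to_end(name)  # mark as recently used
            return False
        if size > self.capacity:
            raise MemoryError(f"{name} does not fit on the device")
        # On overflow, evict least-recently-used data (assumed policy).
        while self.used + size > self.capacity:
            _, freed = self.resident.popitem(last=False)
            self.used -= freed
        self.resident[name] = size
        self.used += size
        return True

    def request_on_host(self, name):
        """The CPU asks for the data: release it from the device if present."""
        size = self.resident.pop(name, 0)
        self.used -= size
        return size > 0


cache = DeviceCache(capacity_bytes=100)
first = cache.ensure_on_device("A", 60)   # copy needed
second = cache.ensure_on_device("A", 60)  # reused, no copy
third = cache.ensure_on_device("B", 60)   # overflow: "A" is evicted
```

The payoff is in the second call: repeated kernels over the same matrix skip the host-to-device copy entirely.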

  9. Experimental Results
     • Performed using ML applications written in OptiML and using Delite
     • The application and Delite scheduler run in a single thread, plus
       • either N CPU worker threads
       • or 1 GPU

  10. ML Kernel Tests
      • 3 application kernels
        • Gaussian Discriminant Analysis
        • Naïve Bayes
        • Weighted Linear Regression
      • System 1: Multi-core CPU & GPU tests
        • Intel Nehalem
          • 2 sockets, 8 cores, 16 threads
          • 24 GB DRAM
        • NVIDIA GTX 275 GPU
      • System 2: Scalability tests
        • Sun Niagara T2+
          • 4 sockets, 32 cores, 256 threads
          • 128 GB DRAM

  11. Gaussian Discriminant Analysis
      [Chart of speedups; bar labels: 2.4x, 2.6x, 3.4x, 3.9x, 13.1x, 18.7x. Normalized to execution time for 1 CPU.]
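For reference, "normalized to execution time for 1 CPU" is read here as the usual speedup ratio. The symbols $S_c$, $T_{1\,\mathrm{CPU}}$, and $T_c$ are notation introduced for this note, not taken from the slides:

$$
S_c = \frac{T_{1\,\mathrm{CPU}}}{T_c}
$$

where $T_c$ is the execution time under hardware configuration $c$. Under that reading, a label of 18.7x corresponds to a run finishing in about $1/18.7 \approx 5.3\%$ of the single-CPU time. The transcript does not preserve which configuration each label belongs to.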

  12. Naïve Bayes
      [Chart of speedups; bar labels: 2.2x, 3.5x, 5.6x, 7.6x.]

  13. Weighted Linear Regression
      [Chart of speedups; bar labels: 1.1x, 2.5x, 3.3x, 3.9x, 4.3x, 5.5x.]

  14. Multi-Core Scalability

  15. Overheads: GDA

  16. Deep Belief Networks (DBNs)
      • Very promising algorithms
        • Learn complex features
        • Show great potential in solving difficult problems
        • Researched by Andrew Ng
      • Research is limited by compute power
        • Computation scales quadratically
        • Algorithm dominated by serial matrix multiplications

  17. DBN Current Results
      [Chart of speedups; bar labels: 3.1x, 22.3x.]

  18. Conclusions
      • Domain knowledge facilitates implicit coarse-grained parallelism
      • Delite targets heterogeneous hardware automatically
      • Hits the sweet spot of ease-of-programming and scalable performance

  19. Future Work
      • Hardware scheduling acceleration
        • Dataflow processing could become more feasible due to the natural expression of coarse-grained tasks in Delite
      • Static analysis of task graph
        • Allows intelligent scheduling before runtime
        • Task graph optimizations

  20. Thank You!
      • Questions?
      • Thanks to Hassan Chafi, Arvind Sujeeth, HyoukJoong Lee, Nathan Bronson, and Kunle Olukotun
