
Accelerating Machine Learning Applications on Graphics Processors


Presentation Transcript


  1. Accelerating Machine Learning Applications on Graphics Processors Narayanan Sundaram and Bryan Catanzaro Presented by Narayanan Sundaram

  2. Big Picture [layered stack diagram]
• Searcher / consumer: face search (CBIR) application
• Face search developer: feature extraction & classifier application, built on an application framework (application framework developer)
• Map Reduce programming framework developer: map-reduce programming pattern and pattern language, provided as software infrastructure
• CUDA framework developer: CUDA computation & communication framework, with barrier/reduction computation & communication patterns
• Hardware architect: NVIDIA G80 platform

  3. GPUs as proxy for manycore
• GPUs are interesting architectures to program
• They are transitioning from highly specialized pipelines to general-purpose designs
• The only way to get performance from a GPU is through parallelism (no caching, branch prediction, prefetching, etc. to fall back on)
• Millions of threads can be launched in one call

  4. GPUs are not for everyone
• Memory coalescing is really important
• Irregular memory accesses, even to local stores, are discouraged: local memory bank conflicts cost up to 30% of performance on some applications
• Cannot forget that it is a SIMD machine
• Memory consistency is non-existent and inter-SM synchronization is absent
• Threads are hardware scheduled
• Each kernel call carries about 20 µs of overhead (roughly 20,000 instructions at 1 GHz)
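The bank-conflict penalty mentioned above comes from how threads index shared memory. A minimal sketch, not taken from the talk, with illustrative names and a 32-thread block:

```cuda
// Illustrative sketch: shared-memory ("local store") bank conflicts on
// G80-class hardware. The chip has 16 banks; stride-1 indexing by the
// threads of a half-warp is conflict-free, while a large power-of-two
// stride maps every thread to the same bank and serializes the accesses.
__global__ void bank_conflict_demo(const float *in, float *out)
{
    __shared__ float tile[32 * 32];          // 4 KB of shared memory
    int tid = threadIdx.x;                   // launch with blockDim.x == 32

    // Stage data into shared memory with stride-1 (conflict-free).
    for (int i = tid; i < 32 * 32; i += 32)
        tile[i] = in[i];                     // assumes in[] has >= 1024 elements
    __syncthreads();

    float conflict_free = tile[tid];         // consecutive banks
    float conflicted    = tile[tid * 32];    // every thread hits bank 0
    out[tid] = conflict_free + conflicted;
}
```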

  5. NVIDIA G80 Architecture

  6. NVIDIA GeForce 8800 GTX specifications [table; * denotes measured values]

  7. GPU programming - CUDA
• Each block can have up to 512 threads that can synchronize with one another
• Millions of blocks can be issued
• No synchronization between blocks
• No control over scheduling
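A minimal sketch of this execution model (array length, block size, and kernel name are arbitrary choices for illustration): one launch creates about a million threads, threads within a block share a barrier, and nothing synchronizes across blocks.

```cuda
#include <cuda_runtime.h>

// Each thread scales one element. __syncthreads() is a barrier over the
// threads of this block only; there is no barrier across blocks.
__global__ void scale(float *data, float alpha, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= alpha;
    __syncthreads();
}

int main()
{
    const int n = 1 << 20;                        // ~1M elements
    float *d_data;
    cudaMalloc((void **)&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    int threadsPerBlock = 256;                    // must be <= 512 on G80
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale<<<blocks, threadsPerBlock>>>(d_data, 2.0f, n);  // one call, ~1M threads
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```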

  8. Support Vector Machines
• A hugely popular machine learning technique for classification
• Tries to find a hyperplane separating the classes with maximum margin
• Non-linear decision surfaces can be generated through non-linear kernel functions
• Training uses Quadratic Programming (the specific constraint structure admits a wide variety of solution techniques)

  9. SVM Training
• Quadratic program (a reconstruction is sketched below)
• Some kernel functions (examples below)
• Variables: α - weight for each training point (determines the classifier)
• Data: l - number of training points; C - trades off error on the training set against generalization performance; y - label (±1) for each training point; x - training points
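The standard SVM dual problem and a few common kernel functions, written out with the variable names defined above; this is a reconstruction of textbook formulas, not a transcription of the slide.

```latex
\max_{\alpha}\;\; \sum_{i=1}^{l} \alpha_i
  - \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}
    \alpha_i \alpha_j\, y_i y_j\, K(x_i, x_j)
\qquad \text{subject to}\quad 0 \le \alpha_i \le C, \quad \sum_{i=1}^{l} \alpha_i y_i = 0

% Common kernel functions:
K_{\mathrm{linear}}(x, z) = x \cdot z, \qquad
K_{\mathrm{poly}}(x, z) = (a\, x \cdot z + r)^d, \qquad
K_{\mathrm{gaussian}}(x, z) = \exp\!\big(-\gamma \lVert x - z \rVert^2\big), \qquad
K_{\mathrm{sigmoid}}(x, z) = \tanh(a\, x \cdot z + r)
```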

  10. Choice of parallel algorithm (among chunking algorithms): Sequential Minimal Optimization (SMO)
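For context, the SMO step in its standard textbook form (not transcribed from the slide): each iteration analytically optimizes the dual objective over just two multipliers, so the per-iteration state is tiny. Here L and H are the clipping bounds implied by the box and equality constraints.

```latex
E_k = \sum_{m=1}^{l} \alpha_m y_m K(x_m, x_k) - y_k,
\qquad
\eta = K(x_i, x_i) + K(x_j, x_j) - 2\,K(x_i, x_j)

\alpha_j^{\mathrm{new}} = \operatorname{clip}\!\Big(\alpha_j + \frac{y_j\,(E_i - E_j)}{\eta},\; L,\; H\Big),
\qquad
\alpha_i^{\mathrm{new}} = \alpha_i + y_i y_j\,\big(\alpha_j - \alpha_j^{\mathrm{new}}\big)
```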

  11. Fitting SMO on a GPU
• The GPU's shared memory constraints suit the algorithm, since only two vectors need to be shared among all the threads
• Performance is strongly dependent on the choice of the working set
• Several selection heuristics have been proposed; two are popular (1st and 2nd order; the 1st order rule is sketched below)
• The 2nd order heuristic is almost twice as costly per iteration, but saves on the number of iterations
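A sketch of the 1st order (maximal-violating-pair) selection rule in the style of Keerthi et al.; the exact index sets are an assumption here, reconstructed from the standard literature rather than from the slide.

```latex
f_i = \sum_{j=1}^{l} \alpha_j y_j K(x_i, x_j) - y_i

I_{\mathrm{high}} = \{ i : 0 < \alpha_i < C \}
  \cup \{ i : y_i = +1,\ \alpha_i = 0 \}
  \cup \{ i : y_i = -1,\ \alpha_i = C \}

I_{\mathrm{low}} = \{ i : 0 < \alpha_i < C \}
  \cup \{ i : y_i = +1,\ \alpha_i = C \}
  \cup \{ i : y_i = -1,\ \alpha_i = 0 \}

b_{\mathrm{high}} = \min_{i \in I_{\mathrm{high}}} f_i, \qquad
b_{\mathrm{low}}  = \max_{i \in I_{\mathrm{low}}} f_i, \qquad
\text{stop when } b_{\mathrm{low}} \le b_{\mathrm{high}} + 2\tau
```

The two selected points are the arg min and arg max of these reductions, which is exactly the map-reduce structure exploited on the next slide.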

  12. Adaptive heuristic
• Both heuristics can be expressed as a series of map-reduce stages (a reduction stage is sketched below)
• A map-reduce code generator was used to generate the code
• Sample periodically and adapt, switching to whichever heuristic is converging fastest at any given time
• Tightly coupled map-reduce stages are essential for machine learning algorithms
• The overhead of a general library call cannot be afforded when it is invoked millions of times
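A minimal sketch of what one such map-reduce stage could look like (the kernel name, block size, and eligibility mask are illustrative assumptions, not the generated code): each thread maps one optimality indicator, a shared-memory tree reduction produces a per-block maximum, and a second small pass would combine the per-block results.

```cuda
#include <cfloat>

// Map-reduce stage: find the maximum f[i] (and its index) over points
// eligible for the working set. Launch with blockDim.x == 256.
__global__ void argmax_f(const float *f, const int *eligible, int n,
                         float *block_max, int *block_arg)
{
    __shared__ float smax[256];
    __shared__ int   sarg[256];

    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    // Map: each thread reads one candidate (ineligible points get -FLT_MAX).
    smax[tid] = (i < n && eligible[i]) ? f[i] : -FLT_MAX;
    sarg[tid] = i;
    __syncthreads();

    // Reduce: shared-memory tree reduction within the block.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s && smax[tid + s] > smax[tid]) {
            smax[tid] = smax[tid + s];
            sarg[tid] = sarg[tid + s];
        }
        __syncthreads();
    }

    // One result per block; a second pass combines them.
    if (tid == 0) {
        block_max[blockIdx.x] = smax[0];
        block_arg[blockIdx.x] = sarg[0];
    }
}
```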

  13. Results (normalized to the 1st order heuristic)

  14. Overall speedup compared to LIBSVM

  15. SVM Classification
• The SVM classification task involves finding which side of the hyperplane a point lies on
• Specifically, evaluate the sign of f(z) = Σ_i α_i y_i K(x_i, z) + b over the support vectors
• Insight: instead of doing this serially for every test point, the kernel evaluations for all test points against all support vectors can be batched together (restructured on the next slide)

  16. Restructuring the classification problem [diagram]
• Instead of pairing each test point with the support vectors one at a time, all test data and all support vectors enter one batched computation that produces every output at once
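In equations, assuming a dot-product-based kernel such as the Gaussian (a reconstruction, with κ standing for the kernel's scalar nonlinearity): the dot products between every test point and every support vector come from a single dense matrix multiply, and the decision values then follow from a matrix-vector product.

```latex
f(z_k) = \sum_{i \in SV} \alpha_i y_i\, K(x_i, z_k) + b
\quad \text{for all test points } z_k \text{ at once:}

D = Z X_{SV}^{\top} \;\;\text{(one dense matrix multiply)},
\qquad
K_{ki} = \kappa\big(D_{ki},\, \lVert z_k \rVert^2,\, \lVert x_i \rVert^2\big),
\qquad
f = K\,(\alpha \odot y) + b\,\mathbf{1}

% e.g. Gaussian kernel: ||z_k - x_i||^2 = ||z_k||^2 + ||x_i||^2 - 2 D_{ki}
```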

  17. Results

  18. Results

  19. Is this compute or memory bound?
• GPUs are better for memory bound jobs (observed 7 GB/s vs 1 GB/s for other streaming-like apps)

  20. Importance of memory coalescing
• To avoid non-coalesced memory accesses, both Data and DataT were carried in GPU memory (illustrated below)
• Letting just 0.05% of memory accesses be non-coalesced led to a 21% drop in performance for one case
• Well-written code should scale with GPU size (parallelism should be limited by problem size, not machine size)
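An illustrative sketch (hypothetical kernel and array names, not the paper's code) of why both layouts are kept resident: whichever direction a kernel walks the matrix, one of the two copies lets consecutive threads touch consecutive addresses.

```cuda
// With data stored row-major as data[point][feature], consecutive threads
// reading feature f of consecutive points stride by nfeatures floats
// (non-coalesced). Reading the transposed copy dataT[feature][point]
// makes those same accesses contiguous, hence coalesced.
__global__ void gather_feature(const float *data, const float *dataT,
                               float *out, int npoints, int nfeatures, int f)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= npoints) return;

    float slow = data[p * nfeatures + f];   // stride of nfeatures: non-coalesced
    float fast = dataT[f * npoints + p];    // stride of 1: coalesced
    out[p] = slow + fast;                   // same value, read both ways
}
```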

  21. Is SIMD becoming ubiquitous?
• SIMD is already important for performance on uniprocessor systems
• Task vs data parallelism
• Intel's new GPU has wide SIMD
• CUDA lesson: runtime SIMD binding is easier for programmers
• Non-SIMD code incurs a performance penalty rather than producing incorrect programs, which prevents premature optimization and keeps code flexible

  22. Conclusion
• GPUs and manycore CPUs are on a collision course
• Data parallelism on GPUs vs task parallelism on CPUs
• Rethink serial control and data structures
• Sequential optimizations may harm parallelism
• Machine learning can use a lot of parallel hardware if the software is engineered properly
