Holistic Approach to DNN Training Efficiency: Analysis and Optimizations

This article presents a holistic approach to improving the efficiency of deep neural network (DNN) training. It includes analysis techniques, optimizations, and benchmarking tools to identify and address performance bottlenecks. The focus is on diverse benchmarking with state-of-the-art models and key performance metrics.

Presentation Transcript


  1. Holistic Approach to DNN Training Efficiency: Analysis and Optimizations • Gennady Pekhimenko, Assistant Professor • Computer Systems and Networking Group (CSNG) • EcoSystem Group

  2. Overview • Machine Learning Benchmarking and Analysis • Gist: Efficient Data Encoding for DNN Training • EcoRNN: Efficient Training of LSTM RNNs on GPUs • Priority-based Parameter Propagation for Distributed DNN Training

  3. 1. Machine Learning Benchmarking and Analysis

  4. An ML researcher, waiting for hours or days: • Try a new framework? (TF, MXNet, PyTorch, …) • Change hyper-parameters? • Try a new library? • Buy a new GPU? (V100, P100, 1080 Ti, Titan Xp, …) • Add/remove a layer? • Or… never mind, you have to pay this much time anyway?

  5. A diverse benchmark suite with state-of-the-art models Understand performance bottlenecks in DNN Training Pin-pointing tools Key performance metrics

  6. A diverse benchmark suite with state-of-the-art models Understand performance bottlenecks in DNN Training Pin-pointing tools Key performance metrics

  7. Why Do We Need ML Benchmark Suite? Lack of a standard diverse benchmark set with the state-of-the-art models for DNN training • How training is different from inference: Our focus is on training

  8. Need for Benchmark Diversity Lack of a standard diverse benchmark set with state-of-the-art models for DNN training • Need for benchmark diversity: • DNNs have been widely successful • Most research used only image classification and CNN models • Performance characteristics are different for different DNNs

  9. State-of-the-art Models • AlexNet (2012), VGG (2013), GoogleNet (2014), ResNet (2015), ? • RCNN (2014), Fast RCNN (2015), Faster RCNN (2015), YOLO (2016), YOLO v2 (2017), ? • State-of-the-art models are constantly evolving • Old models can quickly become outdated

  10. Training Benchmarks for DNNs (TBD) (Footnotes indicate available implementation: T for TensorFlow, M for MXNet, C for CNTK, P for PyTorch) https://github.com/tbd-ai/tbd-suite

  11. TBD vs. Prior Work • Comparison against other DNN benchmark suites • We aimed (back in late 2016) for a standard DNN benchmark suite like SPEC

  12. Our Focus: Benchmarking and Analysis • http://tbd-suite.ai: building tools to analyze ML performance/efficiency • https://mlperf.org/: industry/academia de facto standard • Our group owns the reference model implementation for speech recognition (inference): DeepSpeech2 from UofT

  13. MLPerf Results Announced (Dec. 12)

  14. Our Community Involvement • ASPLOS 2019 and ISCA 2019 tutorials • We lead ISCA 2019 • SysML 2019 demo • MLPerf Academics/Researchers group • Co-chairing

  15. A diverse benchmark suite with state-of-the-art models Understand performance bottlenecks in DNN Training Pin-pointing tools Key performance metrics

  16. Performance Metrics • Throughput: # of data samples processed per second • Compute Utilization: GPU busy time over elapsed time • FP16/FP32/TensorCore Utilization: total FP32 instructions over maximum FP32 instructions • Memory Breakdown: which data structures occupy how much memory

  17. Throughput: # of data samples processed per second • Time-to-accuracy: this is the metric that people truly care about, but it is too expensive to measure, and hyper-parameter tuning plays a big role • Throughput: easy to measure, but needs to handle samples of varying sizes • We assume that there exists a hyper-parameter configuration that guarantees training quality

  18. Compute Utilization: GPU busy time over elapsed time • Indicates how well the non-GPU workloads overlap with GPU computation: • Data loading • Communication (PCIe, networking) • … • Compute Utilization = (t1 + t2) / t3, where t1 and t2 are the GPU-busy intervals and t3 is the elapsed time
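
The ratio on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `compute_utilization` and the interval representation are assumptions made here, not part of the TBD toolchain, which obtains the busy time from nvprof traces.

```python
from typing import List, Tuple


def compute_utilization(busy_intervals: List[Tuple[float, float]],
                        elapsed: float) -> float:
    """Fraction of the elapsed time during which the GPU was busy.

    busy_intervals holds (start, end) timestamps in seconds; overlapping
    intervals are counted only once.
    """
    if elapsed <= 0:
        raise ValueError("elapsed must be positive")
    busy = 0.0
    last_end = float("-inf")
    for start, end in sorted(busy_intervals):
        start = max(start, last_end)  # skip time that was already counted
        if end > start:
            busy += end - start
            last_end = end
    return busy / elapsed


# Hypothetical trace: t1 = 3 s and t2 = 5 s of GPU work in a t3 = 10 s window.
util = compute_utilization([(0.0, 3.0), (4.0, 9.0)], elapsed=10.0)
print(f"compute utilization: {util:.2f}")
```

A low value means the GPU sits idle while data loading or communication runs, which is what the slide uses this metric to expose.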

  19. FP32/FP16/TensorCore Utilization • Average # of instructions executed per cycle over maximum instructions per cycle (provided by nvprof) • When the GPU is busy, how well are the GPU cores utilized? • Indicates speed-up potential at the kernel level • Helps identify the "straggler" kernels (usually not MatMul or CNN kernels) • Most models are trained with single-precision floats
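
As a rough sketch of how this ratio can be evaluated per kernel, the snippet below uses made-up counters. The kernel names, the counter values, and the peak of 10 FP32 instructions per cycle are all hypothetical; in the talk the underlying numbers come from nvprof.

```python
def fp32_utilization(fp32_instructions: int, busy_cycles: int,
                     peak_fp32_per_cycle: int) -> float:
    """Executed FP32 instructions over the maximum the GPU could have issued
    during the cycles in which it was busy."""
    if busy_cycles <= 0 or peak_fp32_per_cycle <= 0:
        raise ValueError("busy_cycles and peak_fp32_per_cycle must be positive")
    return fp32_instructions / (busy_cycles * peak_fp32_per_cycle)


# Hypothetical counters: (kernel name, FP32 instructions, busy cycles).
PEAK_FP32_PER_CYCLE = 10  # assumed hardware peak, for illustration only
kernels = [
    ("matmul", 9_000_000, 1_000_000),
    ("conv", 8_000_000, 1_000_000),
    ("elementwise_add", 500_000, 1_000_000),
]

per_kernel = {
    name: fp32_utilization(instr, cycles, PEAK_FP32_PER_CYCLE)
    for name, instr, cycles in kernels
}
# The kernel with the lowest utilization is a candidate "straggler".
straggler = min(per_kernel, key=per_kernel.get)

for name, value in per_kernel.items():
    print(f"{name:16s} {value:.2f}")
print("lowest utilization:", straggler)
```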

  20. Memory Breakdown • Goal: understand which data structures contribute how much to the total memory consumption • Memory usage can be broken down along two dimensions: • Data structures: Weights, Gradients, Activations, Workspace (allocated before training starts), Dynamic (allocated and released during training) • Layer types: Conv, Recurrent, LSTM, Fully-connected, …
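
The two-dimensional breakdown can be illustrated with a small aggregation over allocation records. The record layout `(layer_type, data_structure, size_in_bytes)` and all sizes below are invented for this sketch; a real memory profiler would supply them.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical record: (layer_type, data_structure, size_in_bytes)
Allocation = Tuple[str, str, int]


def memory_breakdown(
    allocations: Iterable[Allocation],
) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Sum allocation sizes along both dimensions: data structure and layer type."""
    by_structure: Dict[str, int] = defaultdict(int)
    by_layer: Dict[str, int] = defaultdict(int)
    for layer_type, structure, nbytes in allocations:
        by_structure[structure] += nbytes
        by_layer[layer_type] += nbytes
    return dict(by_structure), dict(by_layer)


MIB = 1024 * 1024
allocations = [  # made-up sizes, for illustration only
    ("conv", "weights", 10 * MIB),
    ("conv", "activations", 300 * MIB),
    ("fully-connected", "weights", 100 * MIB),
    ("fully-connected", "gradients", 100 * MIB),
    ("lstm", "workspace", 50 * MIB),
]

by_structure, by_layer = memory_breakdown(allocations)
total = sum(by_structure.values())
for name, nbytes in sorted(by_structure.items(), key=lambda kv: -kv[1]):
    print(f"{name:12s} {nbytes / total:6.1%}")
```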

  21. A diverse benchmark suite with state-of-the-art models Understand performance bottlenecks in DNN Training Pin-pointing tools Key performance metrics

  22. Toolchain: How to get the required metrics?

  23. Toolchain: sampling, setup, warmup • Sampling • Fully training a DNN takes days or weeks • The training algorithm is iterative; each iteration follows the same logic • Setup • Need to verify training accuracy • Different frameworks may use different hyper-parameters for the same models • Skipping warmup • Before training stably, a framework needs to initialize the dataflow, allocate memory, and auto-tune
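
The sampling and warm-up points above can be sketched as a small timing harness: run a few iterations that are not measured, then time a short sample of iterations and report samples per second. The helper `measure_throughput` and the stand-in `fake_train_step` are hypothetical names for this example, not functions from the TBD tools.

```python
import time
from typing import Callable


def measure_throughput(train_step: Callable[[], None], batch_size: int,
                       warmup_iters: int, sample_iters: int) -> float:
    """Return samples per second over a short sampled window.

    The first warmup_iters iterations (initialization, memory allocation,
    auto-tuning) are executed but excluded from the measurement.
    """
    for _ in range(warmup_iters):
        train_step()
    start = time.perf_counter()
    for _ in range(sample_iters):
        train_step()
    elapsed = time.perf_counter() - start
    return batch_size * sample_iters / elapsed


calls = []


def fake_train_step() -> None:
    """Stand-in for one training iteration (assumption for this sketch)."""
    calls.append(1)
    time.sleep(0.001)


throughput = measure_throughput(fake_train_step, batch_size=32,
                                warmup_iters=2, sample_iters=5)
print(f"{throughput:.0f} samples/s")
```

Because every iteration follows the same logic, a short sampled window like this is taken as representative of the full run; the number of warm-up iterations to skip depends on the framework.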

  24. Toolchain: Overview [Diagram] DNN model implementation • Setup: make implementations comparable • Short training period: warm-up & auto-tuning (excluded from data collection), then sampling • Metrics: training throughput (training logs); compute utilization (nvprof .nvvp file, Visual Profiler); FP32/FP16/TensorCore utilization (nvprof .nvvp file, post processing); memory consumption (memory profiler)

  25. A diverse benchmark suite with state-of-the-art models Understand performance bottlenecks in DNN Training Pin-pointing tools Key performance metrics

  26. Experimental setup • All experiments are carried out in a single-machine, single-GPU environment • OS: Ubuntu 16.04 • Libraries: CUDA 9, cuDNN 7 • Frameworks: TensorFlow v1.8, MXNet v1.1.0, PyTorch v0.4.0, CNTK v2.0 • GPUs: Quadro P4000, 1080 Ti, Titan Xp, P100, 2080 Ti, Titan V, V100 • CPU: 28-core Intel Xeon E5-2680 • Networking: 1Gb/s Ethernet, 100Gb/s InfiniBand, 12GB/s PCIe

  27. Results: Training Quality Expected training accuracy reached

  28. Results: Throughput Mini-batch size matters for training throughput Performance improves with larger mini-batches

  29. Results: Throughput Diversity Performance of RNN-based models does not saturate within GPU memory budget

  30. Results Analysis: GPU Compute Utilization Mini-batch size should be large enough to keep GPU busy GPU compute utilization is low for LSTM-based models

  31. Results Analysis: GPU FP32 Utilization Mini-batch size should be large enough to have high FP utilization

  32. Hardware Sensitivity Better GPU does NOT always mean better performance and utilization We need better system designs and libraries

  33. GPU Memory Profiling Feature maps are the dominant GPU memory consumers

  34. Results: Distributed Training Training ResNet-50 on MXNet (left: multi-machine; right: multi-GPU on single machine) Ethernet (eth) bw = 1Gb/s; InfiniBand (ib) bw = 100Gb/s; PCIe bw = 16GB/s Networking BW should be large enough for weight/gradient updates
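
A back-of-the-envelope calculation shows why the link bandwidth matters here. It assumes ResNet-50 has roughly 25.6 million parameters stored as 4-byte FP32 values, and it ignores protocol overhead, latency, and the choice of synchronization scheme, so the results are lower bounds for moving one full set of gradients across each link.

```python
PARAMS = 25_600_000   # approximate ResNet-50 parameter count (assumption)
BYTES_PER_PARAM = 4   # FP32

# Link bandwidths from the slide, converted to bytes per second.
links = {
    "ethernet 1 Gb/s": 1e9 / 8,
    "infiniband 100 Gb/s": 100e9 / 8,
    "pcie 16 GB/s": 16e9,
}


def transfer_seconds(num_bytes: float, bytes_per_second: float) -> float:
    """Ideal time to move num_bytes over a link with the given bandwidth."""
    return num_bytes / bytes_per_second


payload = PARAMS * BYTES_PER_PARAM
times = {name: transfer_seconds(payload, bw) for name, bw in links.items()}

for name, seconds in times.items():
    print(f"{name:22s} {seconds * 1000:8.1f} ms per gradient exchange")
```

Under these assumptions a single exchange takes roughly 0.8 s over 1 Gb/s Ethernet but under 10 ms over InfiniBand or PCIe, which is consistent with the slide's conclusion.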

  35. Project Status Github repo: github.com/tbd-ai TBD project website is live: tbd-suite.ai

  36. TBD Summary • A new benchmark suite for DNN training • Currently, 7 application domains, 9 state-of-the-art models • Comes with tools to analyze: • performance, efficiency, memory, and network consumption • Part of the community effort (MLPerf) to standardize benchmarking for machine learning

  37. 2. Gist: Efficient Data Encoding for Deep Neural Network Training 

  38. DNN Training vs. Inference Step 1 - Forward Pass (makes a prediction) Step 2 - Backward Pass (calculates error gradients) L1 L2 L3 L4 Ln Intermediate layer outputs Feature maps Generated in the forward pass Used in the backward pass DNN training requires stashing feature maps for the backward pass (not required in Inference)

  39. Training Deeper Networks • Train larger networks on a single GPU by reducing memory footprint • Feature maps are a major consumer of GPU memory • Larger minibatch size → potential crash/out-of-memory

  40. Limitations of Prior Work • Focus on DNN inference, i.e., weights • Apply pruning, quantization and Huffman encoding • However, weights are a small fraction of memory footprint • Additionally, techniques are not well suited for training • Training requires frequent weight updates • Map poorly on the GPU HW

  41. Our Insight • Timeline across layers Lx, Ly, Lz: a feature map is generated and used for the 1st time in the forward pass, then used for the 2nd time in the backward pass • Large temporal gap between the 2 uses • Baseline: feature map stored in FP32 format • Our approach: Encode() into a smaller format between the 2 uses, then Decode()

  42. Layer-Specific Encodings • Key Idea: • Use layer-specific compression • Can be both fast and efficient • Can be even lossless • Usually difficult for FP32

  43. ReLU Importance • A significant share of the memory footprint is due to ReLU layers (CNTK profiling)

  44. ReLU -> Pool • ReLU backward propagation • Binarize: 1-bit representation (lossless)
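
The binarization idea can be sketched as follows: the ReLU backward pass only needs to know whether each output was positive, so one bit per element is enough in place of a 32-bit value. This is a plain-Python illustration, assuming the following layer's backward pass does not need the ReLU output values themselves; it is not the Gist implementation, and the helper names are made up for the example.

```python
from typing import List, Tuple


def pack_bits(flags: List[bool]) -> bytes:
    """Pack booleans into bytes, 8 flags per byte, least significant bit first."""
    out = bytearray((len(flags) + 7) // 8)
    for i, flag in enumerate(flags):
        if flag:
            out[i >> 3] |= 1 << (i & 7)
    return bytes(out)


def relu_forward(x: List[float]) -> Tuple[List[float], bytes]:
    """Return the ReLU output and a 1-bit-per-element stash for the backward pass."""
    y = [v if v > 0.0 else 0.0 for v in x]
    return y, pack_bits([v > 0.0 for v in x])


def relu_backward(grad_out: List[float], mask: bytes) -> List[float]:
    """Pass the gradient through where the forward output was positive."""
    return [g if (mask[i >> 3] >> (i & 7)) & 1 else 0.0
            for i, g in enumerate(grad_out)]


x = [-1.0, 2.0, 0.0, 3.5, -0.5, 1.0, 4.0, -2.0, 0.1]
y, mask = relu_forward(x)
grad_in = relu_backward([1.0] * len(x), mask)

# Reference gradient computed from the full-precision input.
reference = [1.0 if v > 0.0 else 0.0 for v in x]
print(grad_in == reference, f"{len(mask)} bytes instead of {4 * len(x)}")
```

The stash shrinks from 4 bytes to 1 bit per element, and the gradient matches the one computed from the full-precision values, which is why the slide labels this encoding lossless.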

  45. ReLU/Pool -> Conv • Sparse storage, dense compute (lossless)
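
A minimal sketch of "sparse storage, dense compute": ReLU outputs contain many zeros, so the stashed feature map can keep only the non-zero values and their positions, and be expanded back to a dense array right before the backward computation. The simple index/value layout below is chosen for readability and is an assumption of this example; the actual storage format used by Gist may differ.

```python
from typing import List, Tuple

SparseMap = Tuple[int, List[int], List[float]]  # (length, indices, values)


def encode_sparse(dense: List[float]) -> SparseMap:
    """Keep only the non-zero entries and their positions."""
    indices = [i for i, v in enumerate(dense) if v != 0.0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values


def decode_sparse(encoded: SparseMap) -> List[float]:
    """Rebuild the dense feature map so the backward pass can use dense kernels."""
    length, indices, values = encoded
    dense = [0.0] * length
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


# Hypothetical ReLU output with mostly zero entries.
feature_map = [0.0, 0.0, 1.25, 0.0, 0.0, 0.0, 3.0, 0.0, 0.0, 0.5, 0.0, 0.0]
encoded = encode_sparse(feature_map)
restored = decode_sparse(encoded)

print(restored == feature_map, f"{len(encoded[2])} of {len(feature_map)} values stored")
```

The saving depends on the sparsity of the feature map and on how compactly the indices are stored; with few zeros, the index overhead can outweigh the benefit.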

  46. Opportunity for Lossy Encoding • [Diagram: precision reduction error across layers L1, L2, L3, L4 in the forward and backward passes] • Precision reduction in the forward pass quickly degrades accuracy (AlexNet: 16-bit doesn't train) • Restricting precision reduction to the 2nd use results in aggressive bit savings with no effect on accuracy

  47. Delayed Precision Reduction (Lossy) • [Figure panels: Training with Reduced Precision; Delayed Precision Reduction]
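
The ordering behind delayed precision reduction can be sketched like this: the forward pass consumes the feature map at full precision, and only the copy stashed for the backward pass is stored in a smaller format. FP16 (via the standard `struct` half-precision format) is used here purely for illustration; the bit widths Gist actually uses, and the stand-in "next layer", are assumptions of this example.

```python
import struct
from typing import List


def encode_fp16(values: List[float]) -> bytes:
    """Store values as IEEE half precision, 2 bytes each (lossy).

    struct raises OverflowError for magnitudes beyond the FP16 range.
    """
    return struct.pack(f"<{len(values)}e", *values)


def decode_fp16(blob: bytes) -> List[float]:
    """Expand the half-precision stash back to Python floats."""
    return list(struct.unpack(f"<{len(blob) // 2}e", blob))


feature_map = [0.1, 1.5, 3.14159, 100.0]

# 1st use (forward pass): the next layer sees the full-precision values.
next_layer_input = [2.0 * v for v in feature_map]  # hypothetical next layer

# Only after the 1st use is the stashed copy reduced in precision.
stash = encode_fp16(feature_map)

# 2nd use (backward pass): decode the reduced-precision copy.
restored = decode_fp16(stash)

for original, approx in zip(feature_map, restored):
    print(f"{original:10.5f} -> {approx:10.5f}")
```

The forward computation is unaffected by the encoding, so rounding error enters only through the backward-pass use of the stashed values.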

  48. Proposed System Architecture: Gist • Input: DNN execution graph • Gist identifies encoding opportunities and produces a modified execution graph • Memory allocation for new data structures • Efficient memory sharing

  49. Compression Ratio Up to 2X compression ratio With minimal performance overhead

  50. Gist Summary • Systematic memory breakdown analysis for image classification • Layer-specific lossless encodings • Binarization and sparse storage/dense compute • Aggressive lossy encodings • With delayed precision reduction • Footprint reduction measured on real systems: • Up to 2X reduction with only 4% performance overhead • Further optimizations: more than 4X reduction
