
Replicable Parts for Large-scale Deep Learning (大规模深度机器学习中的组件)

Learn about the components needed for large-scale deep learning and how they can be combined to build a concise, flexible, and efficient solution. Presented by Tianqi Chen with the DMLC Team.





Presentation Transcript


  1. Replicable Parts for Large-scale Deep Learning (大规模深度机器学习中的组件) Presenter: Tianqi Chen with the DMLC Team • dmlc.github.io

  2. Introduction: DMLC Projects • A community developing portable, scalable and reliable libraries for distributed machine learning • Contributors come from (in alphabetical order) Baidu, CMU, HKUST, Microsoft, NYU, PKU, UW, … • Deep-learning-related DMLC projects • MShadow: CPU/GPU tensor library • mshadow-ps: unified asynchronous communication interface • CXXNet: concise and fast distributed deepnet • Minerva: dataflow-based tensor operation engine

  3. The Topic Today: Replicable Parts • Quotes from the game Civilization (Civ 5, Civ 4) • What are the replicable parts of machine learning (deep learning)? • How can these parts come together to build a concise, flexible and efficient solution?

  4. Different Design Goals of Existing Toolkits • A zoo of deep learning toolkits • Optimized for speed and memory efficiency: cuda-convnet, caffe, cxxnet • Optimized for flexibility: Theano, Torch • Others, optimized for concurrency scheduling: Minerva, Purine2 (elaborated later) • What design choices did these tools make for their specific optimization goals?

  5. Outline • From the beginning • Layers, kernels and BLAS • CXXNet: concise and scale up • Tensors and expression templates • mshadow-ps: asynchronous parameter server • The shining piece in minerva and purine2 • Operators, scheduler and concurrency

  6. From the Beginning: Layers, BLAS and Kernels • Compose neural nets from layers • Layers connect nodes (blobs) • Each layer implements Forward and Backprop • Implement a CPU and a GPU variant of each layer • Either implement the GPU kernel, or call BLAS (cuBLAS) • Layers are the replicable parts of neural nets: Layer.Forward, Layer.Backprop (see the sketch below)
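
A minimal sketch of the layer abstraction described above, with illustrative names (real toolkits such as caffe or cxxnet differ in detail): layers connect blobs, and each layer exposes Forward and Backprop.

    // Illustrative Layer interface: each layer transforms an input blob into
    // an output blob (Forward) and propagates gradients back (Backprop).
    #include <vector>

    struct Blob {                  // a node: data plus its gradient
      std::vector<float> data;
      std::vector<float> grad;
    };

    class Layer {
     public:
      virtual ~Layer() {}
      virtual void Forward(const Blob& in, Blob* out) = 0;   // compute output from input
      virtual void Backprop(const Blob& out, Blob* in) = 0;  // push gradient back to input
    };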

  7. Sigmoid Layer: Hand-Crafted Kernel • First time: I write my first GPU program! • Need to do this again for ReLU, pReLU, tanh, …, in both CUDA and C++ (see the sketch below)
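
To make the duplication concrete, here is a CPU-only sketch with illustrative names of what the hand-crafted approach looks like: one loop per activation, and each would also need a matching backward pass and a CUDA kernel.

    #include <cmath>
    #include <cstddef>

    void SigmoidForward(const float* in, float* out, size_t n) {
      for (size_t i = 0; i < n; ++i) out[i] = 1.0f / (1.0f + std::exp(-in[i]));
    }
    void ReluForward(const float* in, float* out, size_t n) {
      for (size_t i = 0; i < n; ++i) out[i] = in[i] > 0.0f ? in[i] : 0.0f;
    }
    void TanhForward(const float* in, float* out, size_t n) {
      for (size_t i = 0; i < n; ++i) out[i] = std::tanh(in[i]);
    }
    // ...and the same again for every backward pass, and once more in CUDA.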

  8. Sigmoid Layer: Use a Math Library • One line in numpy: out = 1.0 / (1.0 + np.exp(-in)) • What is wrong with this approach? • There are efficiency and memory issues: A = np.exp(-in); B = 1.0 + A; C = 1.0 / B, so every step allocates a temporary • The operation could have been chained, with no temp space: for (int i = 0; i < size; ++i) { out[i] = 1.0 / (1.0 + exp(-in[i])); } • Go back to hand-crafted code again…

  9. The Typical Way of a Vector (Tensor) Library • Implementing vector plus this way causes a temporary memory allocation on every operator (see the sketch below)
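
An illustrative sketch (not any particular library) of the operator-overloading approach criticized here: each operator allocates and returns a fresh result, so a compound expression creates one temporary per step.

    #include <cstddef>
    #include <vector>

    struct Vec {
      std::vector<float> d;
      explicit Vec(size_t n) : d(n) {}
    };

    Vec operator+(const Vec& a, const Vec& b) {
      Vec c(a.d.size());                                   // temporary allocated here
      for (size_t i = 0; i < a.d.size(); ++i) c.d[i] = a.d[i] + b.d[i];
      return c;
    }
    // exp(), operator/ and friends would each allocate yet another temporary.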

  10. Efficiency vs Simplicity • Expressions on a tensor library, e.g. out = 1.0 / (1 + exp(-in)): simpler, easier to program, easier to extend • Handcrafted kernels: more efficient

  11. Outline • From the beginning • Layers, kernels and BLAS • CXXNet: concise and scale up • Tensors and expression templates • mshadow-ps: asynchronous parameter server • The shining piece in minerva and purine2 • Operators, scheduler and concurrency

  12. One Step Forward: Lazy Evaluation • Plus returns an abstract syntax tree • Evaluation happens at assignment • No temp memory is needed (see the sketch below)
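
A minimal sketch of lazy evaluation with illustrative names (not mshadow's): operator+ only records its operands in a small AST node, and the single fused loop runs at assignment, with no temporary buffer.

    #include <cstddef>
    #include <vector>

    struct Vec;

    struct AddExp {                     // AST node recording "lhs + rhs"
      const Vec& lhs;
      const Vec& rhs;
    };

    struct Vec {
      std::vector<float> d;
      explicit Vec(size_t n) : d(n) {}
      Vec& operator=(const AddExp& e);  // evaluation happens here
    };

    inline AddExp operator+(const Vec& a, const Vec& b) { return AddExp{a, b}; }

    Vec& Vec::operator=(const AddExp& e) {
      for (size_t i = 0; i < d.size(); ++i) d[i] = e.lhs.d[i] + e.rhs.d[i];
      return *this;
    }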

  13. Template and Composition Pattern • Return a recursive abstract syntax tree • Recursive evaluation function • The code rolls out at compile time: Template, Plus, inline • Only possible in a statically typed, templated language (C++) • A sketch of the pattern follows below
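
A sketch of the recursive, templated form of the same idea, the classic expression-template pattern that mshadow builds on (illustrative code, not mshadow itself): expressions nest at compile time and evaluation inlines into one fused loop.

    #include <cstddef>
    #include <vector>

    template <typename SubType>
    struct Exp {                                   // CRTP base for all expressions
      const SubType& self() const { return *static_cast<const SubType*>(this); }
    };

    template <typename TLhs, typename TRhs>
    struct AddExp : public Exp<AddExp<TLhs, TRhs>> {
      const TLhs& lhs; const TRhs& rhs;
      AddExp(const TLhs& l, const TRhs& r) : lhs(l), rhs(r) {}
      float Eval(size_t i) const { return lhs.Eval(i) + rhs.Eval(i); }
    };

    struct Vec : public Exp<Vec> {
      std::vector<float> d;
      explicit Vec(size_t n) : d(n) {}
      float Eval(size_t i) const { return d[i]; }
      template <typename E>
      Vec& operator=(const Exp<E>& e) {            // one loop, rolled out at compile time
        for (size_t i = 0; i < d.size(); ++i) d[i] = e.self().Eval(i);
        return *this;
      }
    };

    template <typename TLhs, typename TRhs>
    AddExp<TLhs, TRhs> operator+(const Exp<TLhs>& a, const Exp<TRhs>& b) {
      return AddExp<TLhs, TRhs>(a.self(), b.self());
    }
    // d = a + b + c builds AddExp<AddExp<Vec, Vec>, Vec> and evaluates in one pass.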

  14. Mshadow: Expression Template Library • Define the sigmoid operator in mshadow (see the sketch below) • Device invariant: write one piece of code for both CPU and GPU • Auto-generated kernels: write expressions, and they run as efficiently as hand-crafted kernels • For more, see https://github.com/dmlc/mshadow
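
A sketch of how the sigmoid operator can be written in mshadow, following the custom-operator pattern in its guide (https://github.com/dmlc/mshadow); treat the exact macro, namespace and type names as assumptions that may differ across versions.

    #include <cmath>
    #include "mshadow/tensor.h"

    // user-defined unary operator: one scalar mapping, reused on CPU and GPU
    struct sigmoid {
      MSHADOW_XINLINE static float Map(float a) {
        return 1.0f / (1.0f + expf(-a));
      }
    };

    template <typename xpu>   // xpu is mshadow::cpu or mshadow::gpu
    void SigmoidForward(mshadow::Tensor<xpu, 2> out, mshadow::Tensor<xpu, 2> in) {
      using namespace mshadow::expr;
      out = F<sigmoid>(in);   // the expression template generates the fused kernel
    }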

  15. Efficiency vs Simplicity: Can Have Both! • Expression templates: simpler, easier to program, easier to extend • And as efficient as a handcrafted kernel!

  16. Concise Code = Faster Development • Code from the batch normalization layer of CXXNet (an illustrative sketch follows below)
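
The slide shows the actual CXXNet source as an image; the sketch below is only an illustration of how a batch-normalization forward pass can read with mshadow-style expressions. The helpers sumall_except_dim, broadcast and F follow mshadow's documented API, while the square/square_root_eps operators, shapes and the fixed epsilon are assumptions.

    #include <cmath>
    #include "mshadow/tensor.h"

    struct square { MSHADOW_XINLINE static float Map(float a) { return a * a; } };
    struct square_root_eps {                      // epsilon folded in for simplicity
      MSHADOW_XINLINE static float Map(float a) { return sqrtf(a + 1e-5f); }
    };

    template <typename xpu>
    void BatchNormForward(mshadow::Tensor<xpu, 2> out, mshadow::Tensor<xpu, 2> in,
                          mshadow::Tensor<xpu, 1> mean, mshadow::Tensor<xpu, 1> var) {
      using namespace mshadow::expr;
      float scale = 1.0f / in.size(0);
      mean = scale * sumall_except_dim<1>(in);
      var  = scale * sumall_except_dim<1>(F<square>(in - broadcast<1>(mean, in.shape_)));
      out  = (in - broadcast<1>(mean, in.shape_)) /
             F<square_root_eps>(broadcast<1>(var, in.shape_));
    }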

  17. Outline • From the beginning • Layers, kernels and BLAS • CXXNet: concise and scale up • Tensors and expression templates • mshadow-ps: asynchronous parameter server • The shining piece in minerva and purine2 • Operators, scheduler and concurrency

  18. Problem for Scale-up: Synchronization • One possible way to do multi-GPU convnet training: • Split the data into small batches • Compute the gradient on each GPU using net.Forward/Backprop • Sum the gradients together (communication: costs time) • Do the update on the summed gradient • Result: you pay a lot of communication time, and gain only a little speedup in practice

  19. Asynchronous Communication • During Backprop at iteration k, as soon as layer i's gradient is ready, push it (push grad[i]) and request the summed gradient back (pull sumg[i]); the update of w[i] happens when the pull completes • During Forward at iteration k+1, wait only for the layer about to be used (wait pull i) • All push and pull requests are handled by a background thread • This overlaps communication with computation

  20. Mshadow-PS: Async PS Interface on GPU • Based on three functions: Push, PullReq and PullWait • Once we get a layer's gradient, we call Push and PullReq • We call PullWait before the next time we need that weight in the Forward pass • (Same Backprop-at-iteration-k / Forward-at-iteration-k+1 timeline as on the previous slide; a sketch of the call pattern follows below)
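
A sketch of the three-call pattern named above. The interface below is hypothetical and only mirrors the function names from the slide; mshadow-ps's real class and signatures differ, see https://github.com/dmlc/mshadow.

    #include <vector>

    struct Tensor { std::vector<float> d; };   // stand-in for a weight or gradient tensor

    class IParamServer {                        // hypothetical interface, not mshadow-ps's
     public:
      virtual ~IParamServer() {}
      virtual void Push(const Tensor& grad, int key) = 0;  // send gradient, non-blocking
      virtual void PullReq(Tensor* weight, int key) = 0;   // request updated weight, non-blocking
      virtual void PullWait(int key) = 0;                  // block until the pull for key finishes
    };

    // Backprop at iteration k: push each layer's gradient as soon as it is ready and
    // immediately request the updated weight; communication overlaps with the
    // backprop of the remaining layers.
    void BackpropStep(IParamServer* ps, std::vector<Tensor>& grad, std::vector<Tensor>& weight) {
      for (int i = static_cast<int>(grad.size()) - 1; i >= 0; --i) {
        // ... compute grad[i] ...
        ps->Push(grad[i], i);
        ps->PullReq(&weight[i], i);
      }
    }

    // Forward at iteration k+1: wait only for the layer we are about to use.
    void ForwardStep(IParamServer* ps, std::vector<Tensor>& weight) {
      for (int i = 0; i < static_cast<int>(weight.size()); ++i) {
        ps->PullWait(i);
        // ... forward with weight[i] ...
      }
    }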

  21. Speed-up Test: Using 3 GeForce GTX 980 Cards • Good speed-up on VGGNet • Linear speed-up on GoogLeNet

  22. Mshadow-ps: Unified Interface for All • Two-level parameter server • Synchronized communication within a machine • Asynchronous communication across machines • Unified interface for all communication protocols • Try https://github.com/dmlc/cxxnet/tree/master/example/multi-machine

  23. Outline • From the beginning • Layers, kernels and BLAS • CXXNet: concise and scale up • Tensors and expression templates • mshadow-ps: asynchronous parameter server • The shining piece in minerva and purine2 • Operators, scheduler and concurrency

  24. Dependency Graph and Concurrency • Consider the following series of tensor operations: B = A + 1; C = A - 2; D = B * C • The first two operations depend only on A and can run concurrently • (The slide shows the dependency graph / dataflow: A feeds B and C, which both feed D)

  25. Concurrency in Neural Nets • There are several examples of concurrency in deepnets: • Mixed convolution via split-merge • Model-parallelism pipelines • Data parallelism with concurrent communication and computation • To exploit this concurrency manually: • Manually create concurrent streams • Use an asynchronous API (mshadow-ps) • Or use a DAG scheduler (minerva, purine2): this is more general and easier, once we have the engine

  26. General DAG Scheduler • Run each operation once its dependencies are satisfied • Two general types of DAG engine: • Plan statically before execution (purine2, theano): like a compiled language (C++), allows more optimization space • Plan dynamically as we run (minerva): more like a JIT compiler or interpreter (python), more flexible • (The slide steps through the graph from slide 24: B and C run concurrently once A is ready, then D) • A minimal scheduler sketch follows below
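
A minimal sketch of a dependency-counting executor for such a DAG (illustrative, not the minerva or purine2 engine): each operation runs once all of its producers have finished, so independent operations like B = A + 1 and C = A - 2 become ready together, and a real engine would dispatch them to different worker threads.

    #include <functional>
    #include <queue>
    #include <vector>

    struct OpNode {
      std::function<void()> run;   // the tensor operation to execute
      std::vector<int> consumers;  // indices of ops that consume this op's output
      int pending = 0;             // number of unfinished producers
    };

    void Execute(std::vector<OpNode>& dag) {
      std::queue<int> ready;
      for (int i = 0; i < static_cast<int>(dag.size()); ++i)
        if (dag[i].pending == 0) ready.push(i);
      while (!ready.empty()) {
        int id = ready.front(); ready.pop();
        dag[id].run();                     // a real engine hands this to a worker thread
        for (int c : dag[id].consumers)
          if (--dag[c].pending == 0) ready.push(c);
      }
    }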

  27. Compare: Layers (cxxnet) vs DAG (minerva) • Layers (with expression templates): • A layer contains one bulk, efficient kernel • Forward/Backprop mutate pre-allocated space • Memory efficient • Concurrency must be exploited manually • Overall: not too hard and more memory efficient, but no automatic concurrency • Tensor operations driven by a DAG engine, e.g. out = 1 / (1 + exp(-in)): • Composed of small operations • The states need to be immutable • May require a dynamic memory allocator • Concurrency is exploited automatically • Overall: simpler and more flexible, but gets back the small-operation issues of a tensor library

  28. The Conflicted Choice • Layers (with expression templates): make static optimization easy; concise and efficient code • Tensor operations driven by a DAG engine, e.g. out = 1 / (1 + exp(-in)): very flexible, but hard to optimize; doing so would mean writing a compiler or JIT for CUDA • Can we combine both ends together?

  29. MXNet: Combine What We Have Learnt so Far • Designed by the authors of cxxnet, minerva and purine2 • In progress at DMLC • Tries to combine the good nature of both approaches • Allow easily extending new layers with Python scripting • Reuse the well-optimized existing net structures • Use the DAG scheduler to schedule optimized static layer code • i.e. CXXNet layer operations • plus tensor-based flexible operations

  30. MXNet: Combine Things Together • Hybrid scheduling of bulk and small operations • Layers (with expression templates): out = sigmoid_layer.forward(in); grad = out - label; in_grad = sigmoid_layer.back(grad) • Expose these operations to the DAG engine • Being more open: a statically optimized net plus some dynamic components creates new challenges for the scheduler

  31. Take-Home Message • Some useful components for large-scale deep learning: • Write tensor code with expression templates • Use asynchronous parameter communication • A smart scheduler can possibly make your life easier • Identifying and using these parts gives flexible, fast and concise code • Share your wisdom and create other parts for large-scale ML • Contribute to DMLC

  32. Outline • From the beginning • Layers, kernels and BLAS • CXXNet: concise and scale up • Tensors and expression templates • mshadow-ps: asynchronous parameter server • The shining piece in minerva and purine2 • Operators, scheduler and concurrency • Appendix: distributed data loading API • How we can benefit from DMLC projects (example)

  33. How DMLC Projects Can Help • DMLC projects provide useful libraries for building distributed machine learning programs • These are common replicable parts • Using these APIs lets your program directly read data from distributed storage • Common libraries for external-memory prefetching, thread buffering and more… • Contribute back your piece of wisdom to the project • Build concise, replicable and scalable systems together

  34. Data IO Module for Distributed Training • Compressed, as small as possible • Same experience as single-machine training • Simply change the file path to hdfs://data/image-net, and you are done • No need to manually copy data to the local machine • Hides IO time cost via prefetching and on-the-fly augmentation • Reduces RAM consumption

  35. Data Preparation in Parallel • (The slide shows packing running in parallel: process 0 writes Imagenet.rec.000, process 1 writes Imagenet.rec.001, process 2 writes Imagenet.rec.002)

  36. Simple API to Read from a Part of the Data (a sketch follows below)
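
The slide shows the API as an image; below is a sketch of reading one partition with dmlc-core's InputSplit (matching the InputSplit("hdfs://mydata", myrank, nworker) call quoted on slide 39). The Create/NextRecord/Blob names follow dmlc-core's dmlc/io.h, but treat the exact signatures as assumptions to verify.

    #include <cstdio>
    #include <dmlc/io.h>

    int main() {
      int myrank = 0, nworker = 4;   // normally obtained from the job launcher
      dmlc::InputSplit* split = dmlc::InputSplit::Create(
          "hdfs://mydata/imagenet.rec", myrank, nworker, "recordio");
      dmlc::InputSplit::Blob rec;
      size_t nrec = 0;
      while (split->NextRecord(&rec)) {   // each blob is one packed record
        ++nrec;                           // ... decode the image and feed the trainer ...
      }
      std::printf("worker %d read %zu records\n", myrank, nrec);
      delete split;
      return 0;
    }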

  37. RecordIO: Splittable Binary Data Format • Easily locate the start of a record by matching kMagic • Allows records of different lengths • (A conceptual sketch follows below)
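
A conceptual sketch of why the format is splittable: every record begins with a magic marker plus a length, so a reader dropped into the middle of a partition can scan forward to the next marker and resynchronize. The constant and header layout below are placeholders, not dmlc-core's exact on-disk format.

    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <vector>

    static const uint32_t kMagic = 0xDEADBEEF;   // placeholder marker value

    struct RecordHeader {
      uint32_t magic;    // always kMagic; marks the start of a record
      uint32_t length;   // payload size in bytes, so records may vary in length
    };

    // Scan a byte buffer for the next record boundary by matching kMagic.
    ptrdiff_t FindNextRecord(const std::vector<uint8_t>& buf, size_t start) {
      for (size_t i = start; i + sizeof(uint32_t) <= buf.size(); ++i) {
        uint32_t word;
        std::memcpy(&word, &buf[i], sizeof(word));
        if (word == kMagic) return static_cast<ptrdiff_t>(i);
      }
      return -1;   // no further record start in this chunk
    }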

  38. Image RecordIO • Pipeline: resize the image, add a header, and pack the records compactly

  39. What You Get by Using the DMLC API • Distributed data-partition reading from HDFS/S3/NFS • mydata = InputSplit("hdfs://mydata", myrank, nworker) • Automatic thread-based prefetching and pipelining • ImageNet reading performance (shown as a chart on the slide) • ImageNet packing size: • ImageNet 1k: 20-30 GB • ImageNet-full: 240 GB

  40. The Layout of the Projects (Entire Stack) • Algorithms: CNN, DNN, RNN, LSTM, LDA, GBDT, LBFGS, … • Local execution engines: CXXNet, Minerva (later merged into MXNet) • Communication layer: PS, Rabit, MPI • Process management: Yarn, MPI • DMLC-core • File systems: HDFS, Lustre, S3, local • Our goal: being open, building useful parts and assembling them together

  41. Acknowledgement • The work comes from all contributors of DMLC
