


  1. Tips and Tricks for Developing Smaller Neural Nets FORREST IANDOLA Co-founder and CEO, DeepScale Thanks to: Sammy Sidhu, Paras Jain, Paden Tomasello, Matt Moskewicz, Ben Landen, Kurt Keutzer, Amir Gholami, Kiseok Kwon, and Bichen Wu deepscale.ai

  2. TECHNIQUES FOR • Creating fast & energy-efficient DNNs • Original Net Design (this talk) • Distillation [diagram: a teacher network supervises a smaller student network via a loss on the training data] • Model Compression • New Layer Types • Efficient Implementation • Design Space Exploration

  3. Outline • Why develop Small Deep Neural Nets? • Tips and tricks for designing Small DNNs • Proof-points on the benefits of Small DNNs • Why Small DNNs will only become more important

  4. Why develop small neural nets?

  5. Smaller DNNs are more energy efficient • Memory accesses require much more energy than computation • So, if you can develop a small neural net that fits in on-chip cache, you save a ton of energy • Energy per operation [1]: addition 0.18 pJ (1x), multiplication 0.62 pJ (3.4x), on-chip cache access 8 pJ (44x), main (off-chip) memory access 640 pJ (3600x) [1] Ardavan Pedram, Stephen Richardson, Sameh Galal, Shahar Kvatinsky, and Mark Horowitz. Dark Memory and Accelerator-Rich System Optimization in the Dark Silicon Era. IEEE Design & Test, 2017.

  6. THE SERVER SIDE • Deep Learning Processors have arrived! Uh-oh… Processors are improving much faster than Memory. [1] https://www.nvidia.com/content/PDF/kepler/Tesla-K20-Passive-BD-06455-001-v05.pdf [2] http://www.nvidia.com/content/PDF/Volta-Datasheet.pdf (PCIe version)

  7. MOBILE PLATFORMS • Deep Learning Processors have arrived! [1] https://indico.cern.ch/event/319744/contributions/1698147/attachments/616065/847693/gdb_110215_cesini.pdf [2] https://www.androidauthority.com/huawei-announces-kirin-970-797788 [3] https://blogs.nvidia.com/blog/2018/01/07/drive-xavier-processor/ [4] https://developer.nvidia.com/jetson-xavier

  8. MOBILE PLATFORMS • Deep Learning Processors have arrived! • Don't want to spend the rest of your life waiting for memory to load? Small neural nets to the rescue! [1] https://indico.cern.ch/event/319744/contributions/1698147/attachments/616065/847693/gdb_110215_cesini.pdf [2] https://www.androidauthority.com/huawei-announces-kirin-970-797788 [3] https://blogs.nvidia.com/blog/2018/01/07/drive-xavier-processor/ [4] https://developer.nvidia.com/jetson-xavier

  9. Tips and tricks for developing small DNNs • Inspired by the design choices of MobileNet, SqueezeNet, SqueezeNext, etc. • The focus of this talk: original net design

  10. 1. Replace Fully-Connected Layers with Convolutions • In AlexNet and VGG, the majority of the parameters are in the FC layers. • The FC7 layer in AlexNet has 4096 input channels and 4096 filters → 67 MB of params. • The mere presence of fully-connected layers is not the culprit for the high model size; the problem is that some FC layers of VGG and AlexNet have a huge number of channels and filters.
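
To make the savings concrete, here is a minimal PyTorch sketch (PyTorch is my choice here, not the speaker's) comparing an AlexNet-style fully-connected head against a SqueezeNet-style head built from a 1x1 convolution plus global average pooling. The channel counts (256-channel 6x6 feature map, 4096-wide FC layers, 1000 classes) follow AlexNet; the module names are illustrative.

```python
import torch
import torch.nn as nn

# AlexNet-style head: flatten a 256x6x6 feature map, then two 4096-wide FC layers.
# FC7 alone has 4096*4096 ≈ 16.8M weights, i.e. ~67 MB in fp32.
fc_head = nn.Sequential(
    nn.Flatten(),
    nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),
    nn.Linear(4096, 4096), nn.ReLU(inplace=True),
    nn.Linear(4096, 1000),
)

# SqueezeNet-style head: a 1x1 conv producing one channel per class,
# followed by global average pooling. Roughly 0.26M parameters.
conv_head = nn.Sequential(
    nn.Conv2d(256, 1000, kernel_size=1), nn.ReLU(inplace=True),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)

x = torch.randn(1, 256, 6, 6)                            # dummy feature map
print(conv_head(x).shape)                                # torch.Size([1, 1000])
print(sum(p.numel() for p in fc_head.parameters()))      # ~58.6M
print(sum(p.numel() for p in conv_head.parameters()))    # ~0.26M
```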

  11. 2. Kernel Reduction: reducing the height and width of filters [diagram: 3x3 x channels filters replaced by 1x1 x channels filters, x numFilt] • While 1x1 filters cannot see outside of a 1-pixel radius, they retain the ability to combine and reorganize information across channels. • SqueezeNet (2016): we found that we could replace half of the 3x3 filters with 1x1s without diminishing accuracy. • SqueezeNext (2018): eliminate most of the 3x3 filters; we use a mix of 1x1, 3x1, and 1x3 filters (and still retain accuracy).
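
A rough PyTorch sketch of both ideas (simplified from the published architectures; channel counts and names are illustrative): a Fire-module-style block that spends only half of its expand filters on 3x3 kernels, and a SqueezeNext-style 3x1 + 1x3 pair that covers a 3x3 receptive field with two thirds of the parameters of a full 3x3 convolution.

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    """Minimal Fire-module-style block: a 1x1 'squeeze' layer feeds parallel 1x1
    and 3x3 'expand' layers, so only half the expand filters pay the 9x
    parameter cost of a 3x3 kernel."""
    def __init__(self, in_ch, squeeze_ch, expand_ch):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.squeeze(x))
        return torch.cat([self.relu(self.expand1x1(x)),
                          self.relu(self.expand3x3(x))], dim=1)

def separable_3x3(channels):
    # SqueezeNext-style replacement for one 3x3 conv: a 3x1 followed by a 1x3
    # conv keeps the 3x3 receptive field with 6*C*C weights instead of 9*C*C.
    return nn.Sequential(
        nn.Conv2d(channels, channels, kernel_size=(3, 1), padding=(1, 0)),
        nn.Conv2d(channels, channels, kernel_size=(1, 3), padding=(0, 1)),
    )

x = torch.randn(1, 96, 55, 55)
print(Fire(96, 16, 64)(x).shape)     # torch.Size([1, 128, 55, 55])
print(separable_3x3(96)(x).shape)    # torch.Size([1, 96, 55, 55])
```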

  12. 3. Channel Reduction: reducing the number of filters and channels [diagram: old layer Li+1 with 3x3x256 filters vs. new layer Li+1 with 3x3x128 filters, x numFilt] • If we halve the number of filters in layer Li, this halves the number of input channels in layer Li+1 → up to a 4x reduction in the number of parameters (2x from the halved input channels, and another 2x if we also halve the filters in Li+1).
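
A quick back-of-the-envelope check of the 4x claim, using 3x3 convolutions and the 256 → 128 channel counts shown on the slide (the helper function below is just for illustration):

```python
def conv_params(in_ch, out_ch, k=3):
    # Parameter count of a k x k convolution layer, ignoring bias terms.
    return in_ch * out_ch * k * k

# Layer Li+1 before channel reduction: 256 input channels, 256 filters.
before = conv_params(256, 256)   # 589,824

# Halving the filters in Li halves Li+1's input channels (256 -> 128);
# halving Li+1's own filter count as well yields the full 4x savings.
after = conv_params(128, 128)    # 147,456

print(before / after)            # 4.0
```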

  13. 4. Depthwise Separable Convolutions (also called "group convolutions" or "cardinality") [diagram: 3x3x256 filters replaced by 3x3x1 filters, x numFilt] • Each 3x3 filter has 1 channel. • Each filter gets applied to a different channel of the input. • Used in recent papers such as MobileNets and ResNeXt.
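
A hedged PyTorch sketch of this technique (channel counts are illustrative): the depthwise step is simply a grouped convolution with groups equal to the number of input channels, and MobileNets pair it with a 1x1 pointwise convolution so that information can still flow across channels.

```python
import torch.nn as nn

def depthwise_separable(in_ch, out_ch):
    """Depthwise-separable block in the MobileNet style: a 3x3 depthwise conv
    (groups=in_ch, so each 3x3 filter sees exactly one input channel) followed
    by a 1x1 pointwise conv that mixes information across channels."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch),  # depthwise
        nn.Conv2d(in_ch, out_ch, kernel_size=1),                          # pointwise
    )

standard = nn.Conv2d(256, 256, kernel_size=3, padding=1)
separable = depthwise_separable(256, 256)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard))    # ~590K params: 256*256*3*3 weights (+ biases)
print(count(separable))   # ~68K params: 256*3*3 depthwise + 256*256 pointwise (+ biases)
```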

  14. Proof-points for small DNNs

  15. SMALL DNNs WITH AlexNet/SqueezeNet-level accuracy • https://github.com/DeepScale/SqueezeNet • https://github.com/amirgholami/SqueezeNext [1] A. Krizhevsky, I. Sutskever, G.E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. NIPS, 2012. [2] F.N. Iandola, M. Moskewicz, K. Ashraf, S. Han, W. Dally, K. Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size. arXiv, 2016. [3] A. Gholami, K. Kwon, B. Wu, Z. Tai, X. Yue, P. Jin, S. Zhao, K. Keutzer. SqueezeNext: Hardware-Aware Neural Network Design. CVPR ECV Workshop, 2018.

  16. SMALL DNNs WITH MobileNet/SqueezeNext-level accuracy • How is it that SqueezeNext has more computation (MMACs) than MobileNet, yet SqueezeNext is slightly faster? • SqueezeNext has fewer parameters than MobileNet, which reduces off-chip memory accesses. • Unlike MobileNet, SqueezeNext doesn't use fully-grouped convolutions (so SqueezeNext spends less time waiting on memory accesses). [1] A.G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, H. Adam. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv, 2017. [2] A. Gholami, K. Kwon, B. Wu, Z. Tai, X. Yue, P. Jin, S. Zhao, K. Keutzer. SqueezeNext: Hardware-Aware Neural Network Design. CVPR ECV Workshop, 2018.

  17. Object Detection with small DNNs [1] L. HengFui, Y. Nimmagadda, W. YengLiong. Fire SSD: Wide Fire Modules based Single Shot Detector on Edge Device. arXiv, June 2018.

  18. Not Hotdog, using SqueezeNet and MobileNets • SqueezeNet powers Version 2 of the Not Hotdog app from the Silicon Valley TV show. • A variant of MobileNets powers Version 3.

  19. Style Transfer using the Shift operator • 6x savings in parameters by replacing 3x3 convolutions with the Shift operator [figure: style-transfer results from the original network vs. the Shift-based network] [1] B. Wu, A. Wan, X. Yue, P. Jin, S. Zhao, N. Golmant, A. Gholaminejad, J. Gonzalez, K. Keutzer. Shift: A Zero FLOP, Zero Parameter Alternative to Spatial Convolutions. CVPR, 2018. [2] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual Losses for Real-Time Style Transfer and Super-Resolution. ECCV, 2016.
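
For intuition, a rough PyTorch sketch in the spirit of the Shift paper (not the authors' implementation; the class name and channel counts are mine): channels are split into 9 groups, each group is shifted toward a different neighbor in a 3x3 window at zero FLOPs and zero parameters, and a 1x1 convolution then mixes the shifted channels. torch.roll wraps around at the image borders, which is a simplification of the paper's zero-padded shift.

```python
import torch
import torch.nn as nn

class ShiftConv(nn.Module):
    """Shift-style block: zero-parameter spatial shifts followed by a 1x1 conv."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        assert in_ch % 9 == 0, "kept simple: one equal channel group per offset"
        # The 9 offsets of a 3x3 neighborhood, including the (0, 0) "no shift" case.
        self.offsets = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        groups = x.chunk(9, dim=1)
        shifted = [torch.roll(g, shifts=off, dims=(2, 3))
                   for g, off in zip(groups, self.offsets)]
        return self.pointwise(torch.cat(shifted, dim=1))

x = torch.randn(1, 72, 32, 32)
print(ShiftConv(72, 72)(x).shape)   # torch.Size([1, 72, 32, 32])
```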

  20. One last thought: the "right" DNN is increasingly platform-dependent [1] https://www.androidauthority.com/huawei-announces-kirin-970-797788 [2] https://www.anandtech.com/show/11815/huawei-mate-10-and-mate-10-pro-launch-on-october-16th-more-kirin-970-details [3] http://www.nvidia.com/content/PDF/Volta-Datasheet.pdf (PCIe version) [4] https://www.nvidia.com/en-us/data-center/tensorcore [5] https://www.anandtech.com/show/12429/google-cloud-announces-cloud-tpu-beta-availability

  21. The future of fast & efficient DNNs • Old conventional wisdom: reduce FLOPS, and speedups will happen • New conventional wisdom: FLOPS are cheap, and memory accesses are expensive • Next-generation conventional wisdom: some types of FLOPS (e.g. 4x4 matmul) are super cheap, other FLOPS are pretty cheap, memory accesses are expensive • And remember… DNNs comprise an infinite design space. Never stop exploring!

  22. THANK YOU • For more info: fi@deepscale.ai • We're hiring: http://jobs.deepscale.ai • deepscale.ai
