


  1. Tips and Tricks for Developing Smaller Neural Nets FORREST IANDOLA Co-founder and CEO, DeepScale Thanks to: Sammy Sidhu, Paras Jain, Paden Tomasello, Matt Moskewicz, Ben Landen, Kurt Keutzer, Amir Gholami, Kiseok Kwon, and Bichen Wu deepscale.ai

  2. TECHNIQUES FOR • Creating fast & energy-efficient DNNs • Original Net Design (this talk) • Distillation [diagram: a teacher network supervises a smaller student network via a loss on the training data] • Model Compression • New Layer Types • Efficient Implementation • Design Space Exploration

  3. Outline • Why develop Small Deep Neural Nets? • Tips and tricks for designing Small DNNs • Proof-points on the benefits of Small DNNs • Why Small DNNs will only become more important

  4. Why develop small neural nets?

  5. Smaller DNNs are more energy efficient • Memory accesses require much more energy than computation • So, if you can develop a small neural net that fits in on-chip cache, you save a ton of energy • Energy per operation [1]: addition 0.18 pJ (1x), multiplication 0.62 pJ (3.4x), on-chip cache access 8 pJ (44x), main (off-chip) memory access 640 pJ (3600x) [1] Ardavan Pedram, Stephen Richardson, Sameh Galal, Shahar Kvatinsky, and Mark Horowitz. Dark Memory and Accelerator-Rich System Optimization in the Dark Silicon Era. IEEE Design & Test, 2017.

  6. THE SERVER SIDE • Deep Learning Processors have arrived! Uh-oh… Processors are improving much faster than Memory. [1] https://www.nvidia.com/content/PDF/kepler/Tesla-K20-Passive-BD-06455-001-v05.pdf [2] http://www.nvidia.com/content/PDF/Volta-Datasheet.pdf (PCIe version)

  7. MOBILE PLATFORMS • Deep Learning Processors have arrived! [1] https://indico.cern.ch/event/319744/contributions/1698147/attachments/616065/847693/gdb_110215_cesini.pdf [2] https://www.androidauthority.com/huawei-announces-kirin-970-797788 [3] https://blogs.nvidia.com/blog/2018/01/07/drive-xavier-processor/ [4] https://developer.nvidia.com/jetson-xavier

  8. MOBILE PLATFORMS • Deep Learning Processors have arrived! • Don't want to spend the rest of your life waiting for memory to load? Small neural nets to the rescue! [1] https://indico.cern.ch/event/319744/contributions/1698147/attachments/616065/847693/gdb_110215_cesini.pdf [2] https://www.androidauthority.com/huawei-announces-kirin-970-797788 [3] https://blogs.nvidia.com/blog/2018/01/07/drive-xavier-processor/ [4] https://developer.nvidia.com/jetson-xavier

  9. Tips and tricks for developing small DNNs • Inspired by the design choices of MobileNet, SqueezeNet, SqueezeNext, etc. • The focus of this talk: original net design

  10. 1. Replace Fully-Connected Layers with Convolutions • In AlexNet and VGG, the majority of the parameters are in the FC layers. • The FC7 layer in AlexNet has 4096 input channels and 4096 filters → 67 MB of params. • The mere presence of fully-connected layers is not the culprit for the high model size; the problem is that some FC layers of VGG and AlexNet have a huge number of channels and filters.
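
To make the savings concrete, here is a minimal PyTorch sketch (PyTorch is my choice here, not the speaker's) comparing an AlexNet-style fully-connected head against a SqueezeNet-style head built from a 1x1 convolution plus global average pooling. The channel counts (256-channel 6x6 feature map, 4096-wide FC layers, 1000 classes) follow AlexNet; the module names are illustrative.

```python
import torch
import torch.nn as nn

# AlexNet-style head: flatten a 256x6x6 feature map, then two 4096-wide FC layers.
# FC7 alone has 4096*4096 ≈ 16.8M weights, i.e. ~67 MB in fp32.
fc_head = nn.Sequential(
    nn.Flatten(),
    nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),
    nn.Linear(4096, 4096), nn.ReLU(inplace=True),
    nn.Linear(4096, 1000),
)

# SqueezeNet-style head: a 1x1 conv producing one channel per class,
# followed by global average pooling. Roughly 0.26M parameters.
conv_head = nn.Sequential(
    nn.Conv2d(256, 1000, kernel_size=1), nn.ReLU(inplace=True),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)

x = torch.randn(1, 256, 6, 6)                            # dummy feature map
print(conv_head(x).shape)                                # torch.Size([1, 1000])
print(sum(p.numel() for p in fc_head.parameters()))      # ~58.6M
print(sum(p.numel() for p in conv_head.parameters()))    # ~0.26M
```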

  11. 2. Kernel Reduction: reducing the height and width of filters [diagram: 3x3 x channels filters replaced by 1x1 x channels filters, x numFilt] • While 1x1 filters cannot see outside of a 1-pixel radius, they retain the ability to combine and reorganize information across channels. • SqueezeNet (2016): we found that we could replace half of the 3x3 filters with 1x1s without diminishing accuracy. • SqueezeNext (2018): eliminate most of the 3x3 filters; we use a mix of 1x1, 3x1, and 1x3 filters (and still retain accuracy).
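
A rough PyTorch sketch of both ideas (simplified from the published architectures; channel counts and names are illustrative): a Fire-module-style block that spends only half of its expand filters on 3x3 kernels, and a SqueezeNext-style 3x1 + 1x3 pair that covers a 3x3 receptive field with two thirds of the parameters of a full 3x3 convolution.

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    """Minimal Fire-module-style block: a 1x1 'squeeze' layer feeds parallel 1x1
    and 3x3 'expand' layers, so only half the expand filters pay the 9x
    parameter cost of a 3x3 kernel."""
    def __init__(self, in_ch, squeeze_ch, expand_ch):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.squeeze(x))
        return torch.cat([self.relu(self.expand1x1(x)),
                          self.relu(self.expand3x3(x))], dim=1)

def separable_3x3(channels):
    # SqueezeNext-style replacement for one 3x3 conv: a 3x1 followed by a 1x3
    # conv keeps the 3x3 receptive field with 6*C*C weights instead of 9*C*C.
    return nn.Sequential(
        nn.Conv2d(channels, channels, kernel_size=(3, 1), padding=(1, 0)),
        nn.Conv2d(channels, channels, kernel_size=(1, 3), padding=(0, 1)),
    )

x = torch.randn(1, 96, 55, 55)
print(Fire(96, 16, 64)(x).shape)     # torch.Size([1, 128, 55, 55])
print(separable_3x3(96)(x).shape)    # torch.Size([1, 96, 55, 55])
```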

  12. 3. Channel Reduction: reducing the number of filters and channels [diagram: old layer Li+1 with 3x3x256 filters vs. new layer Li+1 with 3x3x128 filters, x numFilt] • If we halve the number of filters in layer Li, this halves the number of input channels in layer Li+1 → up to a 4x reduction in the number of parameters (2x from the halved input channels, and another 2x if we also halve the filters in Li+1).
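
A quick back-of-the-envelope check of the 4x claim, using 3x3 convolutions and the 256 → 128 channel counts shown on the slide (the helper function below is just for illustration):

```python
def conv_params(in_ch, out_ch, k=3):
    # Parameter count of a k x k convolution layer, ignoring bias terms.
    return in_ch * out_ch * k * k

# Layer Li+1 before channel reduction: 256 input channels, 256 filters.
before = conv_params(256, 256)   # 589,824

# Halving the filters in Li halves Li+1's input channels (256 -> 128);
# halving Li+1's own filter count as well yields the full 4x savings.
after = conv_params(128, 128)    # 147,456

print(before / after)            # 4.0
```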

  13. 4. Depthwise Separable Convolutions (also called "group convolutions" or "cardinality") [diagram: 3x3x256 filters replaced by 3x3x1 filters, x numFilt] • Each 3x3 filter has 1 channel. • Each filter gets applied to a different channel of the input. • Used in recent papers such as MobileNets and ResNeXt.
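
A hedged PyTorch sketch of this technique (channel counts are illustrative): the depthwise step is simply a grouped convolution with groups equal to the number of input channels, and MobileNets pair it with a 1x1 pointwise convolution so that information can still flow across channels.

```python
import torch.nn as nn

def depthwise_separable(in_ch, out_ch):
    """Depthwise-separable block in the MobileNet style: a 3x3 depthwise conv
    (groups=in_ch, so each 3x3 filter sees exactly one input channel) followed
    by a 1x1 pointwise conv that mixes information across channels."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch),  # depthwise
        nn.Conv2d(in_ch, out_ch, kernel_size=1),                          # pointwise
    )

standard = nn.Conv2d(256, 256, kernel_size=3, padding=1)
separable = depthwise_separable(256, 256)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard))    # ~590K params: 256*256*3*3 weights (+ biases)
print(count(separable))   # ~68K params: 256*3*3 depthwise + 256*256 pointwise (+ biases)
```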

  14. Proof-points for small DNNs

  15. SMALL DNNs WITH AlexNet/SqueezeNet-level accuracy • https://github.com/DeepScale/SqueezeNet • https://github.com/amirgholami/SqueezeNext [1] A. Krizhevsky, I. Sutskever, G.E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. NIPS, 2012. [2] F.N. Iandola, M. Moskewicz, K. Ashraf, S. Han, W. Dally, K. Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size. arXiv, 2016. [3] A. Gholami, K. Kwon, B. Wu, Z. Tai, X. Yue, P. Jin, S. Zhao, K. Keutzer. SqueezeNext: Hardware-Aware Neural Network Design. CVPR ECV Workshop, 2018.

  16. SMALL DNNs WITH MobileNet/SqueezeNext-level accuracy • How is it that SqueezeNext has more computation (MMACs) than MobileNet, yet SqueezeNext is slightly faster? • SqueezeNext has fewer parameters than MobileNet, which reduces off-chip memory accesses. • Unlike MobileNet, SqueezeNext doesn't use fully-grouped convolutions (so SqueezeNext spends less time waiting on memory accesses). [1] A.G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, H. Adam. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv, 2017. [2] A. Gholami, K. Kwon, B. Wu, Z. Tai, X. Yue, P. Jin, S. Zhao, K. Keutzer. SqueezeNext: Hardware-Aware Neural Network Design. CVPR ECV Workshop, 2018.

  17. Object Detection with small DNNs [1] L. HengFui, Y. Nimmagadda, W. YengLiong. Fire SSD: Wide Fire Modules based Single Shot Detector on Edge Device. arXiv, June 2018.

  18. Not Hotdog, using SqueezeNet and MobileNets • SqueezeNet powers Version 2 of the Not Hotdog app from the Silicon Valley TV show. • A variant of MobileNets powers Version 3.

  19. Style Transfer using the Shift operator • 6x savings in parameters by replacing 3x3 convolutions with the Shift operator [figure: style-transfer results from the original network vs. the Shift-based network] [1] B. Wu, A. Wan, X. Yue, P. Jin, S. Zhao, N. Golmant, A. Gholaminejad, J. Gonzalez, K. Keutzer. Shift: A Zero FLOP, Zero Parameter Alternative to Spatial Convolutions. CVPR, 2018. [2] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual Losses for Real-Time Style Transfer and Super-Resolution. ECCV, 2016.
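
For intuition, a rough PyTorch sketch in the spirit of the Shift paper (not the authors' implementation; the class name and channel counts are mine): channels are split into 9 groups, each group is shifted toward a different neighbor in a 3x3 window at zero FLOPs and zero parameters, and a 1x1 convolution then mixes the shifted channels. torch.roll wraps around at the image borders, which is a simplification of the paper's zero-padded shift.

```python
import torch
import torch.nn as nn

class ShiftConv(nn.Module):
    """Shift-style block: zero-parameter spatial shifts followed by a 1x1 conv."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        assert in_ch % 9 == 0, "kept simple: one equal channel group per offset"
        # The 9 offsets of a 3x3 neighborhood, including the (0, 0) "no shift" case.
        self.offsets = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        groups = x.chunk(9, dim=1)
        shifted = [torch.roll(g, shifts=off, dims=(2, 3))
                   for g, off in zip(groups, self.offsets)]
        return self.pointwise(torch.cat(shifted, dim=1))

x = torch.randn(1, 72, 32, 32)
print(ShiftConv(72, 72)(x).shape)   # torch.Size([1, 72, 32, 32])
```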

  20. One last thought: the "right" DNN is increasingly platform-dependent [1] https://www.androidauthority.com/huawei-announces-kirin-970-797788 [2] https://www.anandtech.com/show/11815/huawei-mate-10-and-mate-10-pro-launch-on-october-16th-more-kirin-970-details [3] http://www.nvidia.com/content/PDF/Volta-Datasheet.pdf (PCIe version) [4] https://www.nvidia.com/en-us/data-center/tensorcore [5] https://www.anandtech.com/show/12429/google-cloud-announces-cloud-tpu-beta-availability

  21. The future of fast & efficient DNNs • Old conventional wisdom: reduce FLOPS, and speedups will happen • New conventional wisdom: FLOPS are cheap, and memory accesses are expensive • Next-generation conventional wisdom: some types of FLOPS (e.g. 4x4 matmul) are super cheap, other FLOPS are pretty cheap, memory accesses are expensive • And remember… DNNs comprise an infinite design space. Never stop exploring!

  22. THANK YOU • For more info: fi@deepscale.ai • We're hiring: http://jobs.deepscale.ai • deepscale.ai
