
FloatPIM : In-Memory Acceleration of Deep Neural Network Training with High Precision


Presentation Transcript


  1. FloatPIM: In-Memory Acceleration of Deep Neural Network Training with High Precision Mohsen Imani, Saransh Gupta, Yeseong Kim, Tajana Rosing University of California San Diego System Energy Efficiency Lab.

  2. Deep Learning: Deep learning is the state-of-the-art approach for video analysis. Videos make up 70% of today's internet traffic, over 300 hours of video are uploaded to YouTube every minute, and over 500 million hours of video surveillance are collected every day. "Training a single AI model can emit as much carbon as five cars in their lifetimes" (MIT Technology Review). Slide from: V. Sze's presentation, MIT '17.

  3. Computing Challenges: Data movement is very expensive! Slide from: V. Sze et al., "Hardware for Machine Learning: Challenges and Opportunities," 2017.

  4. DNN Challenges in Training: DNN/CNN training requires (1) a highly parallel architecture and (2) high-precision computation, and (3) it involves large data movement. Existing accelerators (Nervana, Hawaii NPU, Apple AI, TFLite) don't support full training due to energy inefficiency. How about using existing PIM architectures?

  5. Digital-based Processing In-Memory: Operations: bitwise (NOR, AND, XOR, ...), arithmetic (addition, multiplication), and search-based (exact/nearest search). Advantages of the memory architecture: works on digital data with no ADC/DAC; eliminates data movement by computing in place where the big data is stored; high parallelism through simultaneous computation in all memory blocks; flexible operations, supporting both fixed- and floating-point.

  6. Digital PIM Operations: Arithmetic operations are built from the in-memory NOR(A,B) primitive: row-parallel addition (C = A + B) and row-parallel multiplication (C = A × B), driven by the row driver and sensed by a detector. Search-based operations perform exact search for a query Q across the memory rows.
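A minimal Python sketch (an illustration, not FloatPIM's in-memory circuit) of why the NOR primitive is enough: NOR is functionally complete, so a 1-bit full adder, and from it N-bit addition, can be composed from NOR alone. The bit-widths and the ripple-carry structure below are illustrative choices.

# Illustration only: composing addition from the NOR primitive that
# digital PIM executes row-parallel inside the crossbar.
def nor(a, b):
    return 1 - (a | b)

def full_adder_nor(a, b, cin):
    # NOT, OR, AND, XOR all derived from NOR (functional completeness).
    not_ = lambda x: nor(x, x)
    or_  = lambda x, y: not_(nor(x, y))
    and_ = lambda x, y: nor(not_(x), not_(y))
    xor_ = lambda x, y: or_(and_(x, not_(y)), and_(not_(x), y))
    s = xor_(xor_(a, b), cin)
    cout = or_(and_(a, b), and_(cin, xor_(a, b)))
    return s, cout

def add_nor(a_bits, b_bits):
    # Ripple-carry addition of two little-endian bit vectors.
    s, carry = [], 0
    for a, b in zip(a_bits, b_bits):
        bit, carry = full_adder_nor(a, b, carry)
        s.append(bit)
    return s + [carry]

# 3 + 5 = 8, with 4-bit little-endian operands
print(add_nor([1, 1, 0, 0], [1, 0, 1, 0]))  # -> [0, 0, 0, 1, 0] == 8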

  7. Digital PIM Operations (cont.): a step-through of the same primitives: row-parallel addition (C = A + B), row-parallel multiplication (C = A × B), and exact search for a query Q using the detector and row driver.

  8. Neural Networks: In the feed-forward pass, each neuron computes a_j = Σ_i W_ij · Z_i from the weight matrix and the previous layer's activations Z_i, then applies the activation function to obtain Z_j = g(a_j). The derivative activation g'(a_j) is kept for back propagation.
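As a concrete reference for the notation on this slide, here is a minimal NumPy sketch of one feed-forward layer (a_j = Σ_i W_ij · Z_i, Z_j = g(a_j)). The layer sizes and the sigmoid choice for g are illustrative assumptions, not fixed by FloatPIM.

import numpy as np

# One feed-forward layer: a_j = sum_i W_ij * Z_i, then Z_j = g(a_j).
def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_prime(a):
    s = sigmoid(a)
    return s * (1.0 - s)

rng = np.random.default_rng(0)
Z_i = rng.standard_normal(4)        # activations of the previous layer
W_ij = rng.standard_normal((4, 3))  # weight matrix (inputs x outputs)

a_j = Z_i @ W_ij                    # pre-activations
Z_j = sigmoid(a_j)                  # activations passed forward
g_prime = sigmoid_prime(a_j)        # derivative g'(a_j), stored for back propagation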

  9. Vector-Matrix Multiplication: A straightforward layout doesn't support row-level addition. Instead, FloatPIM stores the transposed weight matrix and performs a row-parallel copy of the transposed input (a1..a4) into every row; each row then does an element-wise multiplication with its weights, and a row-wise addition reduces the products into the output, as sketched below.
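A small NumPy sketch of the data layout this slide describes: transposed weights, the input replicated across rows (row-parallel copy), element-wise multiplication per row, and a row-wise addition. The shapes are illustrative.

import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(4)          # input vector (a1..a4)
W = rng.standard_normal((4, 3))     # weight matrix (4 inputs -> 3 outputs)

# PIM-style layout: transposed weights, input copied into every row.
W_T = W.T                           # one output neuron per row
A = np.tile(a, (W_T.shape[0], 1))   # row-parallel copy of the input

products = W_T * A                  # row-parallel element-wise multiplication
y_pim = products.sum(axis=1)        # row-wise addition inside each row

assert np.allclose(y_pim, a @ W)    # matches a standard vector-matrix product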

  10. Neural Network: Convolution Layer: Convolution slides a weight window (w1..w4) over the input (Z1..Z9), but physically moving the convolution window in memory would require rewrites, and writing in memory is too slow. Instead, the weights are expanded (replicated) in memory and a shifter aligns the input, so the window positions can be handled with row-parallel multiplication followed by additions; a sketch of the idea follows.
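A minimal NumPy sketch of the shift-and-accumulate idea: rather than rewriting the input to slide the window, each kernel weight multiplies a shifted view of the input and the partial products are accumulated. A "valid" 2-D cross-correlation (as used in DNNs) with a 2x2 kernel on a 3x3 input is assumed for illustration.

import numpy as np

def conv2d_shift(x, w):
    # 'Valid' cross-correlation by shifting the input instead of moving
    # the window: each weight scales a shifted view of x, and the partial
    # products are accumulated (the PIM addition step).
    kh, kw = w.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(kh):
        for j in range(kw):
            out += w[i, j] * x[i:i + oh, j:j + ow]  # shifted view times weight
    return out

x = np.arange(9.0).reshape(3, 3)    # Z1..Z9
w = np.array([[1.0, 2.0],           # w1 w2
              [3.0, 4.0]])          # w3 w4
print(conv2d_shift(x, w))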

  11. Neural Network: Back Propagation: After the feed-forward pass (a_j = Σ_i W_ij · Z_i, Z_j = g(a_j), with g'(a_j) stored), back propagation has two steps: error backward, which propagates the next layer's error δ_k through the weights W_jk and the derivative activation g'(a_j) to obtain δ_j, and weight update, which adjusts each weight W_ij by η δ_j Z_i using the learning rate η.
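Continuing the sketch from slide 8, a minimal NumPy version of the two steps on this slide: error backward (δ_j = g'(a_j) · Σ_k W_jk δ_k) and weight update (W_ij ← W_ij − η δ_j Z_i). The learning rate, shapes, and the tanh-derivative example for g' are illustrative assumptions.

import numpy as np

eta = 0.1                                   # learning rate (illustrative)
rng = np.random.default_rng(0)
Z_i = rng.standard_normal(4)                # stored during feed-forward
a_j = rng.standard_normal(3)                # pre-activations of this layer
W_ij = rng.standard_normal((4, 3))          # weights into this layer
W_jk = rng.standard_normal((3, 2))          # weights into the next layer
delta_k = rng.standard_normal(2)            # error already computed for the next layer

g_prime = lambda a: 1.0 - np.tanh(a) ** 2   # example derivative activation g'

# Error backward: delta_j = g'(a_j) * sum_k W_jk * delta_k
delta_j = g_prime(a_j) * (W_jk @ delta_k)

# Weight update: W_ij <- W_ij - eta * delta_j * Z_i  (outer product)
W_ij -= eta * np.outer(Z_i, delta_j)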

  12. Memory Layout: Back Propagation: Each block keeps the values stored during feed-forward (η Z_j and g'(a_j), and for the layer below η Z_i and g'(a_i)) next to a PIM-reserved region. Copies of the next layer's error δ_k together with the transposed weights W^T_jk produce δ_j and update the next layer's weights; the same layout, with copies of δ_j and W^T_ij, then produces δ_i. A switch passes the data between blocks.

  13. Digital PIM Architecture: How does data move between the blocks? In the example network, memory blocks (Block 1..4) alternate between a computing mode and a data-transfer mode, and switches between adjacent blocks pass the activations (z, g) from one block to the next without leaving the memory.

  14. FloatPIM Parallelism: comparison of serialized computation versus parallel computation across the memory blocks.

  15. FloatPIM Architecture: 32 tiles, 256 blocks/tile, 1K×1K block size. • Crossbar arrays (1K×1K): 99% of area, 89% of power. • Controller per tile: 11.5% of area, 9.7% of power. • 6-level barrel shifter: 0.5% of area, ~10% of power. • Switches: 6.3% of area, 0.9% of power.
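For a sense of scale, a quick back-of-the-envelope on the listed configuration, assuming one bit per crossbar cell (my assumption; the slide does not state the cell type):

tiles, blocks_per_tile, rows, cols = 32, 256, 1024, 1024
total_bits = tiles * blocks_per_tile * rows * cols        # assumes 1 bit per cell
print(total_bits / 8 / 2**30, "GiB")                      # 8 Gbit = 1 GiB of storage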

  16. Deep Learning Acceleration: Evaluated on four popular networks over the large-scale ImageNet dataset.

  17. FloatPIM: Fixed vs. Floating Point: FloatPIM efficiency using bFloat, as compared to: • Float-32: 2.9× speedup and 2.5× energy savings. • Fixed-32: 1.5× speedup and 1.42× energy savings.
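Assuming "bFloat" refers to the bfloat16 format (1 sign, 8 exponent, 7 mantissa bits: the same exponent range as float32 with a truncated mantissa; this reading is my assumption), a minimal sketch of the conversion by mantissa truncation:

import struct

def float32_to_bfloat16_bits(x):
    # Truncate a float32 to bfloat16 by keeping its top 16 bits
    # (sign + 8-bit exponent + 7-bit mantissa); rounding is ignored here.
    bits32 = struct.unpack("<I", struct.pack("<f", x))[0]
    return bits32 >> 16

def bfloat16_bits_to_float32(bits16):
    return struct.unpack("<f", struct.pack("<I", bits16 << 16))[0]

x = 3.14159
xb = bfloat16_bits_to_float32(float32_to_bfloat16_bits(x))
print(x, "->", xb)   # small precision loss, same dynamic range as float32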

  18. FloatPIM Efficiency: Compared to an NVIDIA GTX 1080 GPU, FloatPIM is 303× faster and 48× more energy efficient; compared to the analog PIM PipeLayer [HPCA'17], it is 4.3× faster and 16× more energy efficient. FloatPIM's efficiency comes from: • higher density, • lower data movement, • faster computation at a lower bitwidth.

  19. Conclusion: • Several challenges exist in analog-based computing with today's PIM technology. • Proposed a digital-based PIM architecture that exploits the analog characteristics of NVMs to support row-parallel NOR operations and extends them to row-parallel arithmetic (addition/multiplication). • Maps the entire DNN training/inference flow to crossbar memory with minimal changes to the memory. • Results: 302× faster and 48× more energy efficient than an NVIDIA GTX 1080 GPU; 4.3× faster and 16× more energy efficient than analog PIM [HPCA'17].
