
FloatPIM : In-Memory Acceleration of Deep Neural Network Training with High Precision


Presentation Transcript


  1. FloatPIM: In-Memory Acceleration of Deep Neural Network Training with High Precision Mohsen Imani, Saransh Gupta, Yeseong Kim, Tajana Rosing University of California San Diego System Energy Efficiency Lab.

  2. Deep Learning: Deep learning is the state-of-the-art approach for video analysis. Videos make up 70% of today's internet traffic, over 300 hours of video are uploaded to YouTube every minute, and over 500 million hours of video surveillance are collected every day. "Training a single AI model can emit as much carbon as five cars in their lifetimes" (MIT Technology Review). Slide from: V. Sze's presentation, MIT '17.

  3. Computing Challenges: Data movement is very expensive! Slide from: V. Sze et al., "Hardware for Machine Learning: Challenges and Opportunities," 2017.

  4. DNN Challenges in Training: DNN/CNN training requires (1) a highly parallel architecture and (2) high-precision computation, and (3) it involves large data movement. Existing accelerators (Nervana, Hawaii NPU, Apple AI, TFLite) don't support full training due to energy inefficiency. How about using existing PIM architectures?

  5. Digital-based Processing In-Memory: Operations: bitwise (NOR, AND, XOR, ...), arithmetic (addition, multiplication), and search-based (exact/nearest search). Advantages of the memory architecture: works on digital data with no ADC/DAC; eliminates data movement by computing in place where the big data is stored; high parallelism through simultaneous computation in all memory blocks; flexible operations, supporting both fixed- and floating-point.

  6. Digital PIM Operations: Arithmetic operations are built from the in-memory NOR(A,B) primitive: row-parallel addition (C = A + B) and row-parallel multiplication (C = A × B), driven by the row driver and sensed by a detector. Search-based operations perform exact search for a query Q across the memory rows.
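A minimal Python sketch (an illustration, not FloatPIM's in-memory circuit) of why the NOR primitive is enough: NOR is functionally complete, so a 1-bit full adder, and from it N-bit addition, can be composed from NOR alone. The bit-widths and the ripple-carry structure below are illustrative choices.

# Illustration only: composing addition from the NOR primitive that
# digital PIM executes row-parallel inside the crossbar.
def nor(a, b):
    return 1 - (a | b)

def full_adder_nor(a, b, cin):
    # NOT, OR, AND, XOR all derived from NOR (functional completeness).
    not_ = lambda x: nor(x, x)
    or_  = lambda x, y: not_(nor(x, y))
    and_ = lambda x, y: nor(not_(x), not_(y))
    xor_ = lambda x, y: or_(and_(x, not_(y)), and_(not_(x), y))
    s = xor_(xor_(a, b), cin)
    cout = or_(and_(a, b), and_(cin, xor_(a, b)))
    return s, cout

def add_nor(a_bits, b_bits):
    # Ripple-carry addition of two little-endian bit vectors.
    s, carry = [], 0
    for a, b in zip(a_bits, b_bits):
        bit, carry = full_adder_nor(a, b, carry)
        s.append(bit)
    return s + [carry]

# 3 + 5 = 8, with 4-bit little-endian operands
print(add_nor([1, 1, 0, 0], [1, 0, 1, 0]))  # -> [0, 0, 0, 1, 0] == 8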

  7. Digital PIM Operations (cont.): a step-through of the same primitives: row-parallel addition (C = A + B), row-parallel multiplication (C = A × B), and exact search for a query Q using the detector and row driver.

  8. Neural Networks: In the feed-forward pass, each neuron computes a_j = Σ_i W_ij · Z_i from the weight matrix and the previous layer's activations Z_i, then applies the activation function to obtain Z_j = g(a_j). The derivative activation g'(a_j) is kept for back propagation.
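As a concrete reference for the notation on this slide, here is a minimal NumPy sketch of one feed-forward layer (a_j = Σ_i W_ij · Z_i, Z_j = g(a_j)). The layer sizes and the sigmoid choice for g are illustrative assumptions, not fixed by FloatPIM.

import numpy as np

# One feed-forward layer: a_j = sum_i W_ij * Z_i, then Z_j = g(a_j).
def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_prime(a):
    s = sigmoid(a)
    return s * (1.0 - s)

rng = np.random.default_rng(0)
Z_i = rng.standard_normal(4)        # activations of the previous layer
W_ij = rng.standard_normal((4, 3))  # weight matrix (inputs x outputs)

a_j = Z_i @ W_ij                    # pre-activations
Z_j = sigmoid(a_j)                  # activations passed forward
g_prime = sigmoid_prime(a_j)        # derivative g'(a_j), stored for back propagation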

  9. Vector-Matrix Multiplication: A straightforward layout doesn't support row-level addition. Instead, FloatPIM stores the transposed weight matrix and performs a row-parallel copy of the transposed input (a1..a4) into every row; each row then does an element-wise multiplication with its weights, and a row-wise addition reduces the products into the output, as sketched below.
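A small NumPy sketch of the data layout this slide describes: transposed weights, the input replicated across rows (row-parallel copy), element-wise multiplication per row, and a row-wise addition. The shapes are illustrative.

import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(4)          # input vector (a1..a4)
W = rng.standard_normal((4, 3))     # weight matrix (4 inputs -> 3 outputs)

# PIM-style layout: transposed weights, input copied into every row.
W_T = W.T                           # one output neuron per row
A = np.tile(a, (W_T.shape[0], 1))   # row-parallel copy of the input

products = W_T * A                  # row-parallel element-wise multiplication
y_pim = products.sum(axis=1)        # row-wise addition inside each row

assert np.allclose(y_pim, a @ W)    # matches a standard vector-matrix product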

  10. Neural Network: Convolution Layer: Convolution slides a weight window (w1..w4) over the input (Z1..Z9), but physically moving the convolution window in memory would require rewrites, and writing in memory is too slow. Instead, the weights are expanded (replicated) in memory and a shifter aligns the input, so the window positions can be handled with row-parallel multiplication followed by additions; a sketch of the idea follows.
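A minimal NumPy sketch of the shift-and-accumulate idea: rather than rewriting the input to slide the window, each kernel weight multiplies a shifted view of the input and the partial products are accumulated. A "valid" 2-D cross-correlation (as used in DNNs) with a 2x2 kernel on a 3x3 input is assumed for illustration.

import numpy as np

def conv2d_shift(x, w):
    # 'Valid' cross-correlation by shifting the input instead of moving
    # the window: each weight scales a shifted view of x, and the partial
    # products are accumulated (the PIM addition step).
    kh, kw = w.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(kh):
        for j in range(kw):
            out += w[i, j] * x[i:i + oh, j:j + ow]  # shifted view times weight
    return out

x = np.arange(9.0).reshape(3, 3)    # Z1..Z9
w = np.array([[1.0, 2.0],           # w1 w2
              [3.0, 4.0]])          # w3 w4
print(conv2d_shift(x, w))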

  11. Neural Network: Back Propagation: After the feed-forward pass (a_j = Σ_i W_ij · Z_i, Z_j = g(a_j), with g'(a_j) stored), back propagation has two steps: error backward, which propagates the next layer's error δ_k through the weights W_jk and the derivative activation g'(a_j) to obtain δ_j, and weight update, which adjusts each weight W_ij by η δ_j Z_i using the learning rate η.
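Continuing the sketch from slide 8, a minimal NumPy version of the two steps on this slide: error backward (δ_j = g'(a_j) · Σ_k W_jk δ_k) and weight update (W_ij ← W_ij − η δ_j Z_i). The learning rate, shapes, and the tanh-derivative example for g' are illustrative assumptions.

import numpy as np

eta = 0.1                                   # learning rate (illustrative)
rng = np.random.default_rng(0)
Z_i = rng.standard_normal(4)                # stored during feed-forward
a_j = rng.standard_normal(3)                # pre-activations of this layer
W_ij = rng.standard_normal((4, 3))          # weights into this layer
W_jk = rng.standard_normal((3, 2))          # weights into the next layer
delta_k = rng.standard_normal(2)            # error already computed for the next layer

g_prime = lambda a: 1.0 - np.tanh(a) ** 2   # example derivative activation g'

# Error backward: delta_j = g'(a_j) * sum_k W_jk * delta_k
delta_j = g_prime(a_j) * (W_jk @ delta_k)

# Weight update: W_ij <- W_ij - eta * delta_j * Z_i  (outer product)
W_ij -= eta * np.outer(Z_i, delta_j)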

  12. Memory Layout: Back Propagation: Each block keeps the values stored during feed-forward (η Z_j and g'(a_j), and for the layer below η Z_i and g'(a_i)) next to a PIM-reserved region. Copies of the next layer's error δ_k together with the transposed weights W^T_jk produce δ_j and update the next layer's weights; the same layout, with copies of δ_j and W^T_ij, then produces δ_i. A switch passes the data between blocks.

  13. Digital PIM Architecture: How does data move between the blocks? In the example network, memory blocks (Block 1..4) alternate between a computing mode and a data-transfer mode, and switches between adjacent blocks pass the activations (z, g) from one block to the next without leaving the memory.

  14. FloatPIM Parallelism: comparison of serialized computation versus parallel computation across the memory blocks.

  15. FloatPIM Architecture: 32 tiles, 256 blocks/tile, 1K×1K block size. • Crossbar arrays (1K×1K): 99% of area, 89% of power. • Controller per tile: 11.5% of area, 9.7% of power. • 6-level barrel shifter: 0.5% of area, ~10% of power. • Switches: 6.3% of area, 0.9% of power.
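For a sense of scale, a quick back-of-the-envelope on the listed configuration, assuming one bit per crossbar cell (my assumption; the slide does not state the cell type):

tiles, blocks_per_tile, rows, cols = 32, 256, 1024, 1024
total_bits = tiles * blocks_per_tile * rows * cols        # assumes 1 bit per cell
print(total_bits / 8 / 2**30, "GiB")                      # 8 Gbit = 1 GiB of storage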

  16. Deep Learning Acceleration: Evaluated on four popular networks over the large-scale ImageNet dataset.

  17. FloatPIM: Fixed vs. Floating Point: FloatPIM efficiency using bFloat, as compared to: • Float-32: 2.9× speedup and 2.5× energy savings. • Fixed-32: 1.5× speedup and 1.42× energy savings.
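Assuming "bFloat" refers to the bfloat16 format (1 sign, 8 exponent, 7 mantissa bits: the same exponent range as float32 with a truncated mantissa; this reading is my assumption), a minimal sketch of the conversion by mantissa truncation:

import struct

def float32_to_bfloat16_bits(x):
    # Truncate a float32 to bfloat16 by keeping its top 16 bits
    # (sign + 8-bit exponent + 7-bit mantissa); rounding is ignored here.
    bits32 = struct.unpack("<I", struct.pack("<f", x))[0]
    return bits32 >> 16

def bfloat16_bits_to_float32(bits16):
    return struct.unpack("<f", struct.pack("<I", bits16 << 16))[0]

x = 3.14159
xb = bfloat16_bits_to_float32(float32_to_bfloat16_bits(x))
print(x, "->", xb)   # small precision loss, same dynamic range as float32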

  18. FloatPIM Efficiency: Compared to an NVIDIA GTX 1080 GPU, FloatPIM is 303× faster and 48× more energy efficient; compared to the analog PIM PipeLayer [HPCA'17], it is 4.3× faster and 16× more energy efficient. FloatPIM's efficiency comes from: • higher density, • lower data movement, • faster computation at a lower bitwidth.

  19. Conclusion: • Several challenges exist in analog-based computing with today's PIM technology. • Proposed a digital-based PIM architecture that exploits the analog characteristics of NVMs to support row-parallel NOR operations and extends them to row-parallel arithmetic (addition/multiplication). • Maps the entire DNN training/inference flow to crossbar memory with minimal changes to the memory. • Results: 302× faster and 48× more energy efficient than an NVIDIA GTX 1080 GPU; 4.3× faster and 16× more energy efficient than analog PIM [HPCA'17].
