
Accelerating Multiplication and Parallelizing Operations in Non-Volatile Memory


Presentation Transcript


  1. Accelerating Multiplication and Parallelizing Operations in Non-Volatile Memory. Mohsen Imani, Saransh Gupta, Tajana S. Rosing. University of California San Diego, System Energy Efficiency Lab

  2. Big Data Processing • Internet of Things (IoT): billions to trillions of interconnected devices • Critical requirement of IoT applications: Big Data processing, e.g., signal processing, machine learning, graph processing • Can today’s systems process Big Data? Moving data between general-purpose processor cores and the large memory holding Big Data is inefficient, causing large energy consumption and performance degradation

  3. Cost of Operations • A DRAM access consumes 170x more energy than an FPU multiply. Ref: Dally, Tutorial, NIPS’15

  4. Processing In Memory • Processing In Memory (PIM): performing a part of the computation tasks inside the memory • Computational logic is placed inside the large memory that holds the Big Data, instead of moving all data to the general-purpose processor cores

  5. Supporting In-Memory Operations • Bitwise: OR, AND, XOR • Addition/multiplication: multiple-row addition, matrix multiplication • Search: exact search / nearest search • Example applications: HD computing, graph processing, query processing, deep learning, security, multimedia, classification, clustering, databases

  6. PIM for Addition/Multiplication • First work to support in-memory multiplication • Enables in-memory addition/multiplication using emerging NVM technology (memristor devices) • Does not require changing the memory sense amplifiers • Significantly speeds up in-memory processing • Works in both precise and approximate modes Ref: Imani et al. DAC’17

  7. Crossbar NOR Operation • Out = NOR(in1, in2, …, inn), e.g., Z = NOR(W, X, Y) • Encoding: 0 = high resistance (ROFF ≈ ∞), 1 = low resistance (RON ≈ 0) • Applying a voltage V0 across the selected devices executes the NOR in place Ref: Kvatinsky et al. TCASII’14
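As a sanity check of the idea on this slide, here is a toy Python model of a crossbar NOR (a sketch, not the exact circuit from Kvatinsky et al.): inputs are memristor resistances, the output device starts at logic 1 (RON), and the voltage divider formed under an applied V0 flips it to 0 whenever any input is 1. The RON/ROFF values come from the VTEAM setup on slide 13; V0 and the switching threshold are assumed here.

```python
# Toy model of a crossbar (MAGIC-style) NOR: a sketch, not the real circuit.
R_ON, R_OFF = 10e3, 10e6   # device resistances from the talk's VTEAM setup
V0 = 1.0                   # applied row voltage (assumed)
V_TH = 0.3 * V0            # output switching threshold (assumed)

def crossbar_nor(*bits):
    """Return NOR of the input bits via the resistive voltage divider."""
    r_in = [R_ON if b else R_OFF for b in bits]
    r_par = 1.0 / sum(1.0 / r for r in r_in)   # input devices in parallel
    r_out = R_ON                               # output initialized to '1'
    v_out = V0 * r_out / (r_out + r_par)       # voltage across output device
    # If any input is '1' (low resistance), v_out exceeds the threshold
    # and the output device switches to '0'.
    return 0 if v_out > V_TH else 1

for w, x, y in [(0, 0, 0), (1, 0, 0), (0, 1, 1), (1, 1, 1)]:
    assert crossbar_nor(w, x, y) == int(not (w or x or y))
```

With all inputs at ROFF the divider leaves almost no voltage across the output, so it stays at 1; a single RON input pulls roughly V0/2 across it, which is enough to switch.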

  8. NOR-based Addition • Crossbar memory supports the NOR operation • Can we implement a 1-bit full adder using only NOR operations? • Cout: 4 NORs • S: 3 NOTs, 5 NORs • A NOT is implemented as a NOR with a single input • 12N + 1 cycles to add two N-bit numbers Ref: Talati et al. TNano’16
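To make the NOR-only claim concrete, here is one possible NOR decomposition of a 1-bit full adder in Python, a sketch in the spirit of Talati et al.; the paper's exact 12-operation MAGIC schedule differs slightly from this one. NOT is realized as a 1-input NOR.

```python
# NOR-only full adder sketch. Not the exact gate schedule from the paper,
# but the same principle: every gate below is a NOR (NOT = 1-input NOR).
def nor(*xs):
    return int(not any(xs))

def full_adder(a, b, cin):
    t1 = nor(a, b)
    t2 = nor(a, t1)               # = (not a) and b
    t3 = nor(b, t1)               # = a and (not b)
    x  = nor(t2, t3)              # = XNOR(a, b)
    u1 = nor(x, cin)
    u2 = nor(x, u1)               # = (a XOR b) and cin
    u3 = nor(cin, u1)
    s  = nor(u2, u3)              # = a XOR b XOR cin
    ab = nor(nor(a), nor(b))      # = a and b
    cout = nor(t1, nor(cin, ab))  # = majority(a, b, cin)
    return s, cout

def ripple_add(a, b, n):
    """Add two n-bit numbers bit by bit, as a serial in-memory adder would."""
    s, c = 0, 0
    for i in range(n):
        bit, c = full_adder((a >> i) & 1, (b >> i) & 1, c)
        s |= bit << i
    return s | (c << n)

assert ripple_add(13, 10, 4) == 23
```

Because each bit costs a constant number of NOR cycles and the carry ripples serially, total latency grows linearly in N, matching the 12N + 1 cycle count on the slide.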

  9. In-Memory Multiplication • N×N multiplication has three steps: • Partial product generation: creates the partial products of the multiplication • Fast addition: reduces the N partial products to 2 numbers • Product generation: adds the two numbers produced by the fast adder and outputs the product of the N×N multiplication

  10. Fast Addition • Carry Save Adder (CSA): makes additions independent and parallel, with no carry propagation • Carry is propagated only at the last stage • A 3-inputs-to-2-outputs (3:2) reduction takes the same latency as a 1-bit addition (13 cycles) • The last stage depends on N and runs in 12N+1 cycles
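The three multiplication steps from slides 9-10 can be sketched in plain Python: partial product generation, 3:2 carry-save reduction until two operands remain, and a single carry-propagating add at the end. This models the dataflow, not the in-memory cycle schedule.

```python
def csa_3to2(x, y, z):
    """3:2 reduction: three operands -> (sum, carry), no carry ripple.
    All bit positions are computed independently, hence in parallel."""
    s = x ^ y ^ z                              # column-wise sum bits
    c = ((x & y) | (x & z) | (y & z)) << 1     # saved carries, shifted left
    return s, c

def csa_multiply(a, b, n):
    # Step 1: partial product generation (AND a with each bit of b, shifted).
    pps = [(a if (b >> i) & 1 else 0) << i for i in range(n)]
    # Step 2: fast addition -- reduce the n partial products to 2 operands.
    while len(pps) > 2:
        s, c = csa_3to2(pps[0], pps[1], pps[2])
        pps = pps[3:] + [s, c]
    # Step 3: product generation -- one final carry-propagating addition.
    return sum(pps)

assert csa_multiply(7, 9, 4) == 63
```

The invariant behind the 3:2 step is that `x + y + z == s + c`, so each reduction preserves the running total while avoiding the slow ripple until the very last addition.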

  11. Configurable Interconnect • Problem with CSA: Too many shift operations • Divide the crossbar into multiple blocks connected via configurable interconnects • Use interconnects to perform shift operations

  12. Product Generation • Propagates the carry in order to generate the final answer • The final operands are 2N bits long • Requires 13×2N cycles to compute the result: this stage dominates latency • Approximate product generation dramatically speeds it up when a fully accurate result is not required
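The talk does not spell out the approximation, so the sketch below shows one common style of approximate final addition, labeled as an assumption: split the 2N-bit carry-propagating add into independent k-bit segments with carry-in 0, so the segments run in parallel and the long ripple disappears, at the cost of occasionally dropping a carry at a segment boundary.

```python
# Hypothetical segmented approximate adder (the slide's exact scheme is
# not given). Each seg-bit segment adds independently with carry-in 0.
def approx_add(a, b, width, seg=8):
    total = 0
    mask = (1 << seg) - 1
    for lo in range(0, width, seg):
        part = ((a >> lo) & mask) + ((b >> lo) & mask)
        total |= (part & mask) << lo   # carry out of the segment is dropped
    return total

# Exact when no carry crosses a segment boundary, approximate otherwise:
print(approx_add(0x1234, 0x0101, 16) == 0x1234 + 0x0101)  # True
print(approx_add(0x00FF, 0x0001, 16) == 0x00FF + 0x0001)  # False: carry dropped
```

The latency of each segment is fixed regardless of the operand width, which is why this style of approximation removes the 13×2N-cycle bottleneck.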

  13. Experimental Setup • C++ cycle-accurate simulator to model the APIM functionality • Circuit-level simulation in 45 nm CMOS technology using Cadence Virtuoso • VTEAM memristor model [*] for simulation of our memory design, with RON and ROFF of 10kΩ and 10MΩ respectively • Six general OpenCL applications: Sobel, Robert, Fast Fourier Transform (FFT), DwtHaar1D, Sharpen, Quasi Random • Compared with a state-of-the-art AMD Radeon R9 390 GPU with 8GB memory • Hioki 3334 power meter to measure the power consumption of the GPU [*] Kvatinsky et al. TCASII’15

  14. APIM Efficiency • Average improvement over the six applications as compared to the GPU: 28× energy efficiency and 4.8× speedup • Performance speedup comes from reducing data movement; energy improvement comes from both data movement and computation efficiency • Robert filter: 34.5× less energy, 4.6× speedup • DwtHaar1D: 24.7× less energy, 3.6× speedup

  15. Supporting In-Memory Operations • Bitwise: OR, AND, XOR • Addition/multiplication: multiple-row addition, matrix multiplication • Search: exact search / nearest search • Example applications: HD computing, graph processing, query processing, deep learning, security, multimedia, classification, clustering, databases

  16. Nearest Search In-Memory • Conventional content-addressable memories (CAMs) support only exact matches • They cannot implement even simple queries like min/max • We enable nearest-distance search in a standard crossbar memory • Our new CAM supports: • Hamming distance search: hyperdimensional computing • Absolute distance search: kNN, kmeans, query processing Ref: Imani et al. HPCA’17, ISLPED’17, TCAD’18
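A functional Python stand-in for the two nearest-search modes on this slide, assuming integer-encoded rows; the real hardware computes these distances in the analog crossbar rather than row by row.

```python
# Software model of the nearest-search CAM primitives (not the analog circuit).
def nearest_hamming(query, rows, width):
    """Row with the fewest differing bits -- used by HD computing."""
    mask = (1 << width) - 1
    return min(rows, key=lambda r: bin((r ^ query) & mask).count("1"))

def nearest_absolute(query, values):
    """Row with the smallest |value - query| -- used by kNN/kmeans/queries."""
    return min(values, key=lambda v: abs(v - query))

rows = [0b1010, 0b0111, 0b1110]
print(bin(nearest_hamming(0b1011, rows, 4)))  # 0b1010 differs in only 1 bit
print(nearest_absolute(42, [10, 40, 90]))     # 40
```

Note how min/max queries fall out for free: `nearest_absolute` with a query of 0 (or a very large value) returns the minimum (or maximum) stored row, which an exact-match CAM cannot do.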

  17. In-Memory Computing Accelerators • Classification: hyperdimensional classification [HPCA’17] (supports both training and testing), Adaboost [ICCAD’17], DNN/CNN [DATE’17], decision tree, kNN [ICRC’17] • Clustering: kmeans, hyperdimensional clustering • Database: query processing [ISLPED’17][TCAD’18] • Graph processing

  18. Neural Network PIM (NNPIM) • Uses simple crossbar memory and 2-level memristor devices rather than multi-level memory cells • Supports all neural network operations in-memory, including: • Weighted accumulation • Activation function • Pooling • Software support (weight sharing) reduces computation • Achieves on average 4.9x energy efficiency and 5.7x speedup as compared to state-of-the-art accelerators
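The weight-sharing idea can be illustrated with a hypothetical dot product: if the weights are clustered to a few shared values, the inputs are first accumulated per cluster (cheap in-memory additions), then multiplied once per cluster value instead of once per weight. The cluster assignments below are hand-picked for illustration; NNPIM would derive them from the trained weights.

```python
# Weight-sharing sketch: one multiply per cluster instead of per weight.
def shared_weight_dot(inputs, cluster_ids, cluster_values):
    acc = [0.0] * len(cluster_values)
    for x, c in zip(inputs, cluster_ids):
        acc[c] += x                    # accumulate inputs per weight cluster
    # One multiplication per distinct weight value:
    return sum(v * a for v, a in zip(cluster_values, acc))

inputs = [1.0, 2.0, 3.0, 4.0]
cluster_ids = [0, 1, 0, 1]             # stands for weights [0.5, -1.0, 0.5, -1.0]
cluster_values = [0.5, -1.0]
print(shared_weight_dot(inputs, cluster_ids, cluster_values))  # -4.0
```

With k shared values, a layer with n weights needs only k multiplications, so the expensive in-memory multiplies shrink from O(n) to O(k) while additions stay cheap.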

  19. Query Processing in NVMs • A novel query processing accelerator (NVQuery) • Uses a memristor-based memory to process queries including comparison, aggregation, and prediction functions, among others • Provides 49.3x performance speedup and 32.9x energy savings as compared to a traditional processor. Ref: Imani et al. ISLPED’17, TCAD’18

  20. Hyperdimensional Computing • Training: encode each class (e.g., cat, dog) into a high-dimensional hypervector of ±1 elements • Testing: encode the query into a hypervector and check its similarity to the stored class hypervectors, much as the brain checks for similarity • In-memory implementation provides 746x EDP improvement as compared to an ASIC implementation Ref: Imani et al. HPCA’17
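A minimal sketch of the train/test flow on this slide, with a hypothetical random ±1 encoder standing in for the talk's actual encoding; similarity here is a plain dot product, which is what the in-memory Hamming search above approximates for ±1 vectors.

```python
import random

D = 1000  # hypervector dimensionality (assumed)

def encode(sample):
    # Hypothetical encoder: a reproducible random +/-1 hypervector per sample.
    r = random.Random(sample)
    return [r.choice((-1, 1)) for _ in range(D)]

def train(samples_by_class):
    # Bundle: elementwise-add all sample hypervectors of a class, take the sign.
    models = {}
    for label, samples in samples_by_class.items():
        acc = [0] * D
        for s in samples:
            for i, v in enumerate(encode(s)):
                acc[i] += v
        models[label] = [1 if a >= 0 else -1 for a in acc]
    return models

def classify(sample, models):
    # Similarity check: dot product against each class hypervector.
    q = encode(sample)
    return max(models, key=lambda lb: sum(a * b for a, b in zip(models[lb], q)))

models = train({"cat": [1, 2, 3], "dog": [7, 8, 9]})
print(classify(2, models))  # a training sample maps back to its class: cat
```

Because a query's hypervector agrees with its own class bundle in roughly three quarters of the dimensions but only half of an unrelated one, the dot product separates the classes by a wide margin at high dimensionality.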

  21. Conclusion • We are working to accelerate a wide range of applications in memory • At the circuit level, we are working to support more operations in memory and to make the existing operations more efficient • At the architecture and system level, we are designing new application-specific accelerators • At the application level, we are designing libraries that provide an interface for programmers to accelerate their applications using PIM

  22. References • Dally. Tutorial, NIPS’15. • Mohsen Imani, Saransh Gupta, and Tajana Rosing. "Ultra-efficient processing in-memory for data intensive applications." In Proceedings of the 54th Annual Design Automation Conference 2017, p. 6. ACM, 2017. • Shahar Kvatinsky, Dmitry Belousov, Slavik Liman, Guy Satat, Nimrod Wald, Eby G. Friedman, Avinoam Kolodny, and Uri C. Weiser. "MAGIC—Memristor-aided logic." IEEE Transactions on Circuits and Systems II: Express Briefs 61, no. 11 (2014): 895-899. • Nishil Talati, Saransh Gupta, Pravin Mane, and Shahar Kvatinsky. "Logic design within memristive memories using memristor-aided loGIC (MAGIC)." IEEE Transactions on Nanotechnology 15, no. 4 (2016): 635-650. • Shahar Kvatinsky, Misbah Ramadan, Eby G. Friedman, and Avinoam Kolodny. "VTEAM: A general model for voltage-controlled memristors." IEEE Transactions on Circuits and Systems II: Express Briefs 62, no. 8 (2015): 786-790. • Mohsen Imani, Abbas Rahimi, Deqian Kong, Tajana Rosing, and Jan M. Rabaey. "Exploring hyperdimensional associative memory." In High Performance Computer Architecture (HPCA), 2017 IEEE International Symposium on, pp. 445-456. IEEE, 2017. • Mohsen Imani, Saransh Gupta, Atl Arredondo, and Tajana Rosing. "Efficient query processing in crossbar memory." In Low Power Electronics and Design (ISLPED), 2017 IEEE/ACM International Symposium on, pp. 1-6. IEEE, 2017.

  23. References • Mohsen Imani, Saransh Gupta, Sahil Sharma, and Tajana Rosing. "NVQuery: Efficient query processing in non-volatile memory." IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (2018), in press. • Yeseong Kim, Mohsen Imani, and Tajana Rosing. "Orchard: Visual object recognition accelerator based on approximate in-memory processing." In Computer-Aided Design (ICCAD), 2017 IEEE/ACM International Conference on, pp. 25-32. IEEE, 2017. • Mohsen Imani, Daniel Peroni, Yeseong Kim, Abbas Rahimi, and Tajana Rosing. "Efficient neural network acceleration on GPGPU using content addressable memory." In 2017 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 1026-1031. IEEE, 2017. • Mohammad Samragh Razlighi, Mohsen Imani, Farinaz Koushanfar, and Tajana Rosing. "LookNN: Neural network with no multiplication." In 2017 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 1775-1780. IEEE, 2017. • Mohsen Imani, Yeseong Kim, and Tajana Rosing. "NNgine: Ultra-efficient nearest neighbor accelerator based on in-memory computing." In Rebooting Computing (ICRC), 2017 IEEE International Conference on, pp. 1-8. IEEE, 2017. • Mohsen Imani, Deqian Kong, Abbas Rahimi, and Tajana Rosing. "VoiceHD: Hyperdimensional computing for efficient speech recognition." In Rebooting Computing (ICRC), 2017 IEEE International Conference on, pp. 1-8. IEEE, 2017.
