Performance and Power Optimization through Data Compression in Network-on-Chip Architectures

Reetuparna Das, Asit K. Mishra, Chrysostomos Nicopoulos, Dongkook Park, Vijaykrishnan Narayanan, Ravi Iyer*, Mazin S. Yousif*, Chita R. Das

Why On-Chip Networks?

[Figure: relative delays based on the 2005 ITRS, comparing gate delay with global wiring delay from the 250 nm to the 32 nm node; global wiring delay dominates.]

  • Global interconnect delays are not scaling…
  • Global wire traversal no longer takes a single cycle!
  • The march to multicores…
  • We need a controlled, structured, low-power communication fabric


What is Network-on-Chip?

[Figure: a 2D mesh NoC, a grid of routers (R) with one router per node.]

Is Network Latency Critical?
  • in-net: on-chip interconnect
  • off-chip: main memory access
  • other: cache access and queuing

[Chart: average memory response time breakdown for network-intensive, memory-intensive, and balanced workloads.]

Up to 50% of memory latency can come from the network!

NUCA-Specific Network-on-Chip Design Challenges
  • NoCs have high bandwidth demands
  • NoCs for NUCA are latency critical
    • Latency directly affects memory access time
  • NoCs consume significant system power
  • NoC design is area constrained
    • Need to reduce buffering requirements
  • Goal: use compression to minimize latency, power, bandwidth, and buffer requirements
Compression?
  • Is compression a viable solution for NoCs?
    • Application profiling shows extensive value locality in NUCA data traffic (e.g., > 50% zero patterns!)
  • Associated overheads:
    • Compression/decompression latency
    • Increased cache access latency to load/store variable-length cache blocks
Outline
  • Network-on-Chip Background
  • Compression
    • Algorithm
    • Cache Compression (CC) Scheme
    • NIC Compression (NC) Scheme
  • Results
    • Network Results
    • CPI/Memory Response Time
    • Scalability
  • Conclusion


Compressing Frequent Data Patterns
  • One 32-bit segment of a cache block is compressed at a time
    • Fewer bits encode a frequently occurring pattern (see the sketch below)
  • Variable-length encoding
    • Each pattern => unique prefix + significant bits
  • Compression completes in a single cycle
    • Parallel compressor circuit
  • Decompression takes five cycles
    • Serialized due to the variable-length encoding

Alameldeen and Wood, ISCA 2004
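A minimal sketch of the single-segment compression step, assuming a simplified three-entry pattern table; the actual scheme has more patterns, and the prefix values here are illustrative:

```python
# Illustrative frequent-pattern compression of one 32-bit segment.
# The pattern table and prefix assignments are assumptions, not the
# exact table from the paper.

def compress_segment(word: int) -> tuple[int, int, int]:
    """Return (prefix, data_bit_count, data) for one 32-bit segment.

    In hardware, all pattern matches are evaluated in parallel,
    which is why compression fits in a single cycle.
    """
    word &= 0xFFFFFFFF
    signed = word - (1 << 32) if word & 0x80000000 else word
    if word == 0:
        return (0b000, 0, 0)               # all-zero segment: prefix only
    if -128 <= signed <= 127:
        return (0b001, 8, word & 0xFF)     # sign-extended byte
    if -32768 <= signed <= 32767:
        return (0b010, 16, word & 0xFFFF)  # sign-extended halfword
    return (0b111, 32, word)               # no match: stored uncompressed
```

A 512-bit block is processed as sixteen such segments, each shrinking to a short prefix plus its significant bits.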

Compressing Frequent Data Patterns

[Figure: a 512-bit cache block / network packet is divided into 32-bit segments; each segment is pattern-matched and emitted either as a compressed segment of 3-37 bits (a 3-bit prefix plus 0-32 data bits) or as a 32-bit uncompressed segment.]

Compressing Frequent Data Patterns

[Figure: pattern matching over a cache block / network packet is fully parallel, so compression takes 1 cycle; the compressed block is decoded by an optimized five-stage decompression pipeline, taking 5 cycles.]

  • The prefix uniquely identifies each pattern.
  • Decompression cannot be made parallel: with variable-length segments, the starting address of each compressed segment is known only after the preceding segments are decoded.
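A toy illustration of that serialization, reusing the illustrative prefix table from the earlier sketch: each segment's starting bit offset depends on the decoded lengths of all earlier segments.

```python
# Computing each compressed segment's starting bit offset requires the
# decoded lengths of all earlier segments, so the scan is inherently serial.

PREFIX_BITS = 3
DATA_BITS = {0b000: 0, 0b001: 8, 0b010: 16, 0b111: 32}  # illustrative table

def segment_offsets(prefixes: list[int]) -> list[int]:
    """Starting bit offset of each compressed segment within a block."""
    offsets, pos = [], 0
    for p in prefixes:
        offsets.append(pos)
        pos += PREFIX_BITS + DATA_BITS[p]  # must decode p before the next offset
    return offsets

print(segment_offsets([0b000, 0b001, 0b111]))  # [0, 3, 14]
```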

Frequent Data Patterns
  • Frequent value locality
  • Zhang et al., ASPLOS 2000; MICRO 2000
Outline
  • Network-on-Chip Background
  • Compression
    • Algorithm
    • Cache Compression (CC) Scheme
    • NIC Compression (NC) Scheme
  • Results
    • Network Results
    • CPI/Memory Response Time
    • Scalability
  • Conclusion


NIC Compression (NC) and Cache Compression (CC)

[Figure: where compression sits in the CMP. Tiles of CPU + L1 and L2 banks communicate over the NoC; NC compresses at the network interface only, while CC also keeps compressed data in the L2 banks.]
Cache Compression (CC) vs. NIC Compression (NC)
  • NIC Compression
    • (+) Sends compressed data over the network
    • (-) Overheads: (1) compressor, (2) decompressor
  • Cache Compression
    • (+) Sends compressed data over the network
    • (+) Stores compressed data in the L2 cache, increasing effective cache capacity
    • (-) Overheads: (1) compressor, (2) decompressor, (3) variable-line cache architecture

Cache Compression (CC)

[Figure: a mesh of routers (R) connecting L2 bank tiles and a CPU tile (CPU + L1); compression/decompression logic sits in the NIC, and the L2 banks store compressed, variable-length cache blocks.]
Cache Compression (CC)
  • Compressor penalty on L1 write-backs (1 cycle)
  • The network communicates compressed data
  • The L2 cache stores compressed, variable-length cache blocks
    • 2 cycles of additional L2 hit latency to store/load them
  • Decompression penalty on every L1 miss (5 cycles); a rough latency accounting follows

[Figure: CPU node (CPU, L1, NIC with compressor/decompressor, router) exchanging compressed data over the network with an L2 cache bank node (router, NIC, variable-length cache blocks).]
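Putting those per-step numbers together, a rough sketch of the round-trip cost of an L1 miss served by a remote L2 bank under CC; only the 2-cycle and 5-cycle penalties come from the slide, the other numbers are hypothetical placeholders:

```python
# Rough CC-scheme latency accounting for an L1 miss. Only
# VARIABLE_LINE_PENALTY and DECOMPRESS come from the slide; the network
# and baseline L2 hit latencies are hypothetical placeholders.

VARIABLE_LINE_PENALTY = 2  # extra L2 hit cycles for variable-length lines
DECOMPRESS = 5             # five-stage decompression at the CPU node

def cc_l1_miss_latency(net_one_way: int, base_l2_hit: int) -> int:
    """Cycles from L1 miss to data ready at the CPU."""
    return (net_one_way                            # request to the L2 bank
            + base_l2_hit + VARIABLE_LINE_PENALTY  # compressed L2 access
            + net_one_way                          # reply (fewer flits when compressed)
            + DECOMPRESS)                          # decompress at the CPU node

print(cc_l1_miss_latency(net_one_way=12, base_l2_hit=6))  # 37 with these numbers
```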

Variable Line Cache (CC Scheme)

[Figure: an uncompressed cache set holds fixed-size 64-byte lines (LINE 0, LINE 1, ...); a compressed set packs variable-length lines contiguously.]

Fixed (64-byte lines): ADDR(LINE 2) = BASE + 0x80
Variable: ADDR(LINE 2) = BASE + LENGTH(LINE 0) + LENGTH(LINE 1)

  • All lines must stay contiguous, or the address calculation breaks (see the sketch below).
  • Compaction is needed on evictions and on "fat writes" (writes that grow a line).
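The address calculation above, sketched as code (the helper name is mine, not the paper's):

```python
# Variable-line address calculation: a line's offset within its set is
# the sum of the lengths of the lines packed before it.

def line_offset(lengths: list[int], way: int) -> int:
    """Byte offset of line `way` within a compressed cache set."""
    return sum(lengths[:way])

# Fixed 64-byte lines reduce to the familiar BASE + way * 64:
assert line_offset([64] * 8, 2) == 0x80
# With compressed lines of 24 and 40 bytes, line 2 starts at byte 64:
assert line_offset([24, 40, 64], 2) == 64
```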

Variable Line Cache (CC Scheme)

An 8-way set ≡ 16x8 segments per set (each 64-byte line holds sixteen 32-bit segments).
Variable Line Cache (CC Scheme)

[Figure: variable-line cache access path; the added logic is kept off the critical path. Overhead: 2 cycles added to the hit latency.]
Outline
  • Network-on-Chip Background
  • Compression
    • Algorithm
    • Cache Compression (CC) Scheme
    • NIC Compression (NC) Scheme
  • Results
    • Network Results
    • CPI/Memory Response Time
    • Scalability
  • Conclusion


NIC Compression (NC)

[Figure: a mesh of routers (R); under NC, data is compressed and decompressed only at the network interfaces.]
NIC Compression (NC)
  • No modifications to the L2 cache; it stores uncompressed data
  • The network communicates compressed data
  • Decompression penalty on each data network transfer (5 cycles)
  • Compression penalty on each data network transfer (1 cycle)

[Figure: the CPU node (CPU, L1) and the L2 cache bank node both hold uncompressed data; compressor/decompressor logic in each NIC converts to and from compressed data on the network.]
NC Decompression Optimization

[Figure: example timelines for a five-flit message. Without overlap, the five decompression cycles start only after the last flit arrives (data ready in cycle 9); with overlap, the precomputation stages (stages 1, 2, and 3) run while flits are still arriving, leaving only stages 4 and 5 after the last flit (data ready in cycle 6), a saving of 3 cycles per message.]

Overlapping decompression with communication latency reduces the effective decompression latency from 5 cycles to 2; the arithmetic is checked below.
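The timeline arithmetic as a small check, with flit arrival cycles as on the slide and the 3-stage/2-stage split per the optimization:

```python
# Timeline arithmetic for the five-flit example: flit i arrives in cycle i.
# Without overlap, all 5 decompression stages run after the last flit;
# with overlap, stages 1-3 are precomputed while flits are still arriving.

FLITS, STAGES, HIDEABLE = 5, 5, 3

def data_ready_cycle(overlap: bool) -> int:
    last_flit_cycle = FLITS - 1                      # last flit arrives in cycle 4
    remaining = STAGES - (HIDEABLE if overlap else 0)
    return last_flit_cycle + remaining

assert data_ready_cycle(False) == 9  # cycles 0-9, the unoptimized timeline
assert data_ready_cycle(True) == 6   # cycles 0-6: 3 cycles saved per message
```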

Outline
  • Network-on-Chip Background
  • Compression
    • Algorithm
    • Cache Compression (CC) Scheme
    • NIC Compression (NC) Scheme
  • Results
    • Network Results
    • CPI/Memory Response Time
    • Scalability
  • Conclusion


Experimental Platform
  • Detailed trace-driven, cycle-accurate hybrid NoC/cache simulator for CMP architectures
  • MESI-based coherence protocol with distributed directories
  • The router model has a two-stage pipeline, wormhole flow control, finite input buffering, and deterministic X-Y routing
  • Interconnect power estimates from synthesis with Synopsys Design Compiler using a TSMC 90 nm standard cell library
  • Memory traces generated with the Simics full-system simulator
Workloads
  • Commercial
    • TPC-W
    • SPECJBB
    • Static web serving: Apache and Zeus
  • SPECOMP Benchmarks
  • SPLASH-2
  • MediaBench-II
Compressed Packet Length

Packet compression ratio of up to 60%!

Packet Latency

Interconnect latency reduction of up to 33%

Network Latency Breakdown

[Charts: interconnect latency breakdown.]

Major reductions in queuing latency and blocking latency; average reduction of 21%

Network Power

Network power reduction of up to 22%

Buffer Utilization

Normalized Router Buffer Utilization

Memory Response Time

Memory response time reduction of up to 20%

System Performance

Normalized CPI reduction of up to 15%

Scalability Study

All scalability results are for SPECJBB.

Conclusions
  • Compression is a simple and useful technique for optimizing the performance and power of on-chip networks (OCNs)
  • The CC (NC) scheme provides an average 21% (20%) reduction in network latency, with maximum savings of 33% (32%)
  • Network power consumption is reduced by an average of 7% (23% maximum)
  • Reducing network latency yields an average 7% reduction in CPI over the baseline
Thank You!

Questions?

Basics: Router Architecture

[Figure: a canonical virtual-channel router. Each input port (from East, West, North, South, and the local PE) has buffered virtual channels (VC 0, VC 1, VC 2) with a VC identifier; control logic comprises the routing unit (RC), the VC allocator (VA), and the switch allocator (SA); a 5x5 crossbar connects the input ports to the output ports (to East, West, North, South, and the PE). A sketch of the per-flit flow follows.]
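A schematic sketch of how flits move through such a router; stage names follow the figure, and an aggressive implementation like the two-stage pipeline modeled earlier overlaps these steps into fewer cycles:

```python
# Schematic flit flow through the canonical VC router shown above.
# Head flits perform route computation and VC allocation; body and tail
# flits reuse the route and VC their head flit acquired.

from enum import Enum, auto

class Stage(Enum):
    RC = auto()  # route computation (e.g., deterministic X-Y routing)
    VA = auto()  # virtual-channel allocation at the downstream router
    SA = auto()  # switch allocation: arbitrate for a crossbar time slot
    ST = auto()  # switch traversal through the 5x5 crossbar

def stages_for(flit_kind: str) -> list[Stage]:
    """Pipeline stages a flit of the given kind must complete."""
    if flit_kind == "head":
        return [Stage.RC, Stage.VA, Stage.SA, Stage.ST]
    return [Stage.SA, Stage.ST]  # body/tail flits inherit route and VC

print([s.name for s in stages_for("head")])  # ['RC', 'VA', 'SA', 'ST']
```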

Scalability with Flit Width

All scalability results are for SPECJBB.

Scalability with System Size

All scalability results are for SPECJBB.

Scalability with System Size

  • 16p CMP: the larger network is detrimental for compression.
  • 16p CMP: more processors mean a higher injection rate/load on the network, so compression should help.

Network sizes: 8p CMPs with 24, 40, and 72 nodes; 16p CMPs with 32, 48, and 80 nodes.

Scalability with System Size

CC relative to NC does better with more processors: cache compression helps more.