Performance and Power Optimization through Data Compression in Network-on-Chip Architectures

Reetuparna Das, Asit K. Mishra, Chrysostomos Nicopoulos, Dongkook Park, Vijaykrishnan Narayanan, Ravi Iyer*, Mazin S. Yousif*, Chita R. Das

Why On-Chip Networks?

[Figure: relative delays based on the 2005 ITRS, comparing gate delay with global wiring delay from the 250 nm to the 32 nm node; global wiring delay dominates.]

  • Global interconnect delays are not scaling…
  • Global wire traversal no longer takes a single cycle!
  • The march to multicores…
  • We need a controlled, structured, low-power communication fabric


What is Network-on-Chip?

[Figure: a 2D mesh NoC, a grid of routers (R) with one router per node.]

Is Network Latency Critical?
  • in-net: on-chip interconnect
  • off-chip: main memory access
  • other: cache access and queuing

[Chart: average memory response time breakdown for network-intensive, memory-intensive, and balanced workloads.]

Up to 50% of memory latency can come from the network!

NUCA-Specific Network-on-Chip Design Challenges
  • NoCs have high bandwidth demands
  • NoCs for NUCA are latency critical
    • Latency directly affects memory access time
  • NoCs consume significant system power
  • NoC design is area constrained
    • Need to reduce buffering requirements
  • Goal: use compression to minimize latency, power, bandwidth, and buffer requirements
Compression?
  • Is compression a viable solution for NoCs?
    • Application profiling shows extensive value locality in NUCA data traffic (e.g., > 50% zero patterns!)
  • Associated overheads:
    • Compression/decompression latency
    • Increased cache access latency to load/store variable-length cache blocks
Outline
  • Network-on-Chip Background
  • Compression
    • Algorithm
    • Cache Compression (CC) Scheme
    • NIC Compression (NC) Scheme
  • Results
    • Network Results
    • CPI/Memory Response Time
    • Scalability
  • Conclusion


Compressing Frequent Data Patterns
  • One 32-bit segment of a cache block is compressed at a time
    • Fewer bits encode a frequently occurring pattern (see the sketch below)
  • Variable-length encoding
    • Each pattern => unique prefix + significant bits
  • Compression completes in a single cycle
    • Parallel compressor circuit
  • Decompression takes five cycles
    • Serialized due to the variable-length encoding

Alameldeen and Wood, ISCA 2004
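A minimal sketch of the single-segment compression step, assuming a simplified three-entry pattern table; the actual scheme has more patterns, and the prefix values here are illustrative:

```python
# Illustrative frequent-pattern compression of one 32-bit segment.
# The pattern table and prefix assignments are assumptions, not the
# exact table from the paper.

def compress_segment(word: int) -> tuple[int, int, int]:
    """Return (prefix, data_bit_count, data) for one 32-bit segment.

    In hardware, all pattern matches are evaluated in parallel,
    which is why compression fits in a single cycle.
    """
    word &= 0xFFFFFFFF
    signed = word - (1 << 32) if word & 0x80000000 else word
    if word == 0:
        return (0b000, 0, 0)               # all-zero segment: prefix only
    if -128 <= signed <= 127:
        return (0b001, 8, word & 0xFF)     # sign-extended byte
    if -32768 <= signed <= 32767:
        return (0b010, 16, word & 0xFFFF)  # sign-extended halfword
    return (0b111, 32, word)               # no match: stored uncompressed
```

A 512-bit block is processed as sixteen such segments, each shrinking to a short prefix plus its significant bits.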

Compressing Frequent Data Patterns

[Figure: a 512-bit cache block / network packet is divided into 32-bit segments; each segment is pattern-matched and emitted either as a compressed segment of 3-37 bits (a 3-bit prefix plus 0-32 data bits) or as a 32-bit uncompressed segment.]

Compressing Frequent Data Patterns

[Figure: pattern matching over a cache block / network packet is fully parallel, so compression takes 1 cycle; the compressed block is decoded by an optimized five-stage decompression pipeline, taking 5 cycles.]

  • The prefix uniquely identifies each pattern.
  • Decompression cannot be made parallel: with variable-length segments, the starting address of each compressed segment is known only after the preceding segments are decoded.
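A toy illustration of that serialization, reusing the illustrative prefix table from the earlier sketch: each segment's starting bit offset depends on the decoded lengths of all earlier segments.

```python
# Computing each compressed segment's starting bit offset requires the
# decoded lengths of all earlier segments, so the scan is inherently serial.

PREFIX_BITS = 3
DATA_BITS = {0b000: 0, 0b001: 8, 0b010: 16, 0b111: 32}  # illustrative table

def segment_offsets(prefixes: list[int]) -> list[int]:
    """Starting bit offset of each compressed segment within a block."""
    offsets, pos = [], 0
    for p in prefixes:
        offsets.append(pos)
        pos += PREFIX_BITS + DATA_BITS[p]  # must decode p before the next offset
    return offsets

print(segment_offsets([0b000, 0b001, 0b111]))  # [0, 3, 14]
```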

Frequent Data Patterns
  • Frequent value locality
  • Zhang et al., ASPLOS 2000; MICRO 2000
Outline
  • Network-on-Chip Background
  • Compression
    • Algorithm
    • Cache Compression (CC) Scheme
    • NIC Compression (NC) Scheme
  • Results
    • Network Results
    • CPI/Memory Response Time
    • Scalability
  • Conclusion


NIC Compression (NC) and Cache Compression (CC)

[Figure: where compression sits in the CMP. Tiles of CPU + L1 and L2 banks communicate over the NoC; NC compresses at the network interface only, while CC also keeps compressed data in the L2 banks.]
Cache Compression (CC) vs. NIC Compression (NC)
  • NIC Compression
    • (+) Sends compressed data over the network
    • (-) Overheads: (1) compressor, (2) decompressor
  • Cache Compression
    • (+) Sends compressed data over the network
    • (+) Stores compressed data in the L2 cache, increasing effective cache capacity
    • (-) Overheads: (1) compressor, (2) decompressor, (3) variable-line cache architecture

Cache Compression (CC)

[Figure: a mesh of routers (R) connecting L2 bank tiles and a CPU tile (CPU + L1); compression/decompression logic sits in the NIC, and the L2 banks store compressed, variable-length cache blocks.]
Cache Compression (CC)
  • Compressor penalty on L1 write-backs (1 cycle)
  • The network communicates compressed data
  • The L2 cache stores compressed, variable-length cache blocks
    • 2 cycles of additional L2 hit latency to store/load them
  • Decompression penalty on every L1 miss (5 cycles); a rough latency accounting follows

[Figure: CPU node (CPU, L1, NIC with compressor/decompressor, router) exchanging compressed data over the network with an L2 cache bank node (router, NIC, variable-length cache blocks).]
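Putting those per-step numbers together, a rough sketch of the round-trip cost of an L1 miss served by a remote L2 bank under CC; only the 2-cycle and 5-cycle penalties come from the slide, the other numbers are hypothetical placeholders:

```python
# Rough CC-scheme latency accounting for an L1 miss. Only
# VARIABLE_LINE_PENALTY and DECOMPRESS come from the slide; the network
# and baseline L2 hit latencies are hypothetical placeholders.

VARIABLE_LINE_PENALTY = 2  # extra L2 hit cycles for variable-length lines
DECOMPRESS = 5             # five-stage decompression at the CPU node

def cc_l1_miss_latency(net_one_way: int, base_l2_hit: int) -> int:
    """Cycles from L1 miss to data ready at the CPU."""
    return (net_one_way                            # request to the L2 bank
            + base_l2_hit + VARIABLE_LINE_PENALTY  # compressed L2 access
            + net_one_way                          # reply (fewer flits when compressed)
            + DECOMPRESS)                          # decompress at the CPU node

print(cc_l1_miss_latency(net_one_way=12, base_l2_hit=6))  # 37 with these numbers
```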

Variable Line Cache (CC Scheme)

[Figure: an uncompressed cache set holds fixed-size 64-byte lines (LINE 0, LINE 1, ...); a compressed set packs variable-length lines contiguously.]

Fixed (64-byte lines): ADDR(LINE 2) = BASE + 0x80
Variable: ADDR(LINE 2) = BASE + LENGTH(LINE 0) + LENGTH(LINE 1)

  • All lines must stay contiguous, or the address calculation breaks (see the sketch below).
  • Compaction is needed on evictions and on "fat writes" (writes that grow a line).
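The address calculation above, sketched as code (the helper name is mine, not the paper's):

```python
# Variable-line address calculation: a line's offset within its set is
# the sum of the lengths of the lines packed before it.

def line_offset(lengths: list[int], way: int) -> int:
    """Byte offset of line `way` within a compressed cache set."""
    return sum(lengths[:way])

# Fixed 64-byte lines reduce to the familiar BASE + way * 64:
assert line_offset([64] * 8, 2) == 0x80
# With compressed lines of 24 and 40 bytes, line 2 starts at byte 64:
assert line_offset([24, 40, 64], 2) == 64
```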

Variable Line Cache (CC Scheme)

An 8-way set ≡ 16x8 segments per set (each 64-byte line holds sixteen 32-bit segments).
Variable Line Cache (CC Scheme)

[Figure: variable-line cache access path; the added logic is kept off the critical path. Overhead: 2 cycles added to the hit latency.]
Outline
  • Network-on-Chip Background
  • Compression
    • Algorithm
    • Cache Compression (CC) Scheme
    • NIC Compression (NC) Scheme
  • Results
    • Network Results
    • CPI/Memory Response Time
    • Scalability
  • Conclusion


NIC Compression (NC)

[Figure: a mesh of routers (R); under NC, data is compressed and decompressed only at the network interfaces.]
NIC Compression (NC)
  • No modifications to the L2 cache; it stores uncompressed data
  • The network communicates compressed data
  • Decompression penalty on each data network transfer (5 cycles)
  • Compression penalty on each data network transfer (1 cycle)

[Figure: the CPU node (CPU, L1) and the L2 cache bank node both hold uncompressed data; compressor/decompressor logic in each NIC converts to and from compressed data on the network.]
NC Decompression Optimization

[Figure: example timelines for a five-flit message. Without overlap, the five decompression cycles start only after the last flit arrives (data ready in cycle 9); with overlap, the precomputation stages (stages 1, 2, and 3) run while flits are still arriving, leaving only stages 4 and 5 after the last flit (data ready in cycle 6), a saving of 3 cycles per message.]

Overlapping decompression with communication latency reduces the effective decompression latency from 5 cycles to 2; the arithmetic is checked below.
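The timeline arithmetic as a small check, with flit arrival cycles as on the slide and the 3-stage/2-stage split per the optimization:

```python
# Timeline arithmetic for the five-flit example: flit i arrives in cycle i.
# Without overlap, all 5 decompression stages run after the last flit;
# with overlap, stages 1-3 are precomputed while flits are still arriving.

FLITS, STAGES, HIDEABLE = 5, 5, 3

def data_ready_cycle(overlap: bool) -> int:
    last_flit_cycle = FLITS - 1                      # last flit arrives in cycle 4
    remaining = STAGES - (HIDEABLE if overlap else 0)
    return last_flit_cycle + remaining

assert data_ready_cycle(False) == 9  # cycles 0-9, the unoptimized timeline
assert data_ready_cycle(True) == 6   # cycles 0-6: 3 cycles saved per message
```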

Outline
  • Network-on-Chip Background
  • Compression
    • Algorithm
    • Cache Compression (CC) Scheme
    • NIC Compression (NC) Scheme
  • Results
    • Network Results
    • CPI/Memory Response Time
    • Scalability
  • Conclusion


Experimental Platform
  • Detailed trace-driven, cycle-accurate hybrid NoC/cache simulator for CMP architectures
  • MESI-based coherence protocol with distributed directories
  • The router model has a two-stage pipeline, wormhole flow control, finite input buffering, and deterministic X-Y routing
  • Interconnect power estimates from synthesis with Synopsys Design Compiler using a TSMC 90 nm standard cell library
  • Memory traces generated with the Simics full-system simulator
Workloads
  • Commercial
    • TPC-W
    • SPECJBB
    • Static web serving: Apache and Zeus
  • SPECOMP Benchmarks
  • SPLASH-2
  • MediaBench-II
Compressed Packet Length

Packet compression ratio of up to 60%!

Packet Latency

Interconnect latency reduction of up to 33%

Network Latency Breakdown

[Charts: interconnect latency breakdown.]

Major reductions in queuing latency and blocking latency; average reduction of 21%

Network Power

Network power reduction of up to 22%

Buffer Utilization

Normalized Router Buffer Utilization

Memory Response Time

Memory response time reduction of up to 20%

System Performance

Normalized CPI reduction of up to 15%

Scalability Study

All scalability results are for SPECJBB.

Conclusions
  • Compression is a simple and useful technique for optimizing the performance and power of on-chip networks (OCNs)
  • The CC (NC) scheme provides an average 21% (20%) reduction in network latency, with maximum savings of 33% (32%)
  • Network power consumption is reduced by an average of 7% (23% maximum)
  • Reducing network latency yields an average 7% reduction in CPI over the baseline
Thank You!

Questions?

Basics: Router Architecture

[Figure: a canonical virtual-channel router. Each input port (from East, West, North, South, and the local PE) has buffered virtual channels (VC 0, VC 1, VC 2) with a VC identifier; control logic comprises the routing unit (RC), the VC allocator (VA), and the switch allocator (SA); a 5x5 crossbar connects the input ports to the output ports (to East, West, North, South, and the PE). A sketch of the per-flit flow follows.]
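A schematic sketch of how flits move through such a router; stage names follow the figure, and an aggressive implementation like the two-stage pipeline modeled earlier overlaps these steps into fewer cycles:

```python
# Schematic flit flow through the canonical VC router shown above.
# Head flits perform route computation and VC allocation; body and tail
# flits reuse the route and VC their head flit acquired.

from enum import Enum, auto

class Stage(Enum):
    RC = auto()  # route computation (e.g., deterministic X-Y routing)
    VA = auto()  # virtual-channel allocation at the downstream router
    SA = auto()  # switch allocation: arbitrate for a crossbar time slot
    ST = auto()  # switch traversal through the 5x5 crossbar

def stages_for(flit_kind: str) -> list[Stage]:
    """Pipeline stages a flit of the given kind must complete."""
    if flit_kind == "head":
        return [Stage.RC, Stage.VA, Stage.SA, Stage.ST]
    return [Stage.SA, Stage.ST]  # body/tail flits inherit route and VC

print([s.name for s in stages_for("head")])  # ['RC', 'VA', 'SA', 'ST']
```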

Scalability with Flit Width

All scalability results are for SPECJBB.

Scalability with System Size

All scalability results are for SPECJBB.

Scalability with System Size

  • 16p CMP: the larger network is detrimental for compression.
  • 16p CMP: more processors mean a higher injection rate/load on the network, so compression should help.

Network sizes: 8p CMPs with 24, 40, and 72 nodes; 16p CMPs with 32, 48, and 80 nodes.

Scalability with System Size

CC relative to NC does better with more processors: cache compression helps more.