Using Partial Tag Comparison in Low-Power Snoop-based Chip Multiprocessors


Using Partial Tag Comparison in Low-Power Snoop-based Chip Multiprocessors. Ali Shafiee Narges Shahidi Amirali Baniasadi Sharif University of Technology University of Victoria. This Work: Improving Snoop Coherency. Goal: Improving energy efficiency in snoop-based CMPs.


Ali Shafiee Narges Shahidi Amirali Baniasadi

Sharif University of Technology

University of Victoria

This Work: Improving Snoop Coherency

Goal: Improving energy efficiency in snoop-based CMPs.

Motivation: Broadcasting and processing the entire tag is inefficient.

Our Solution: Using Partial Tag Comparison (PTC) prior to snoop.

Key Results
  • Performance improvement: 2.9%
  • Tag array power: 52%
  • Bandwidth utilization: 78.5%

Our Solution (PTC) vs. Conventional

[Figure: two CMP organizations side by side, each with several D$ units connected through an interconnect to an upper-level cache.]
  • Conventional: Fast (+); Power & Bandwidth (−)
  • Our solution: Fast (++, early miss detection); Power & Bandwidth Efficient (+)

Conventional Snooping

[Figure: four CPUs, each with a D$, connected to a controller through the Address Bus, Snoop Bus, and Command Bus; the steps of a snoop are numbered 1 to 5.]

Redundant (miss): ~70%

Snoop Filters

Goal: Eliminate redundant snoop requests.

Examples: RegionScout (ISCA’05), CGCT (ISCA’05), SSP (ASPLOS’08)

PTC:

(1) Early miss detection using a subset of the tag bits.

(2) Once a miss is detected, snoop is avoided.

How often is that possible?

How often are n bits enough to detect a miss?

More than 95% of misses can be detected using only 8 tag bits.
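The idea of detecting a miss from a few tag bits can be sketched as follows. This is only an illustration, not the authors' hardware: the function names, the tag values, and the choice to ignore valid bits are assumptions made for the example. The one property it relies on is that a mismatch in the low bits proves a mismatch in the full tag, while a match in the low bits is only a possible hit.

```python
def partial_tag(tag: int, n_bits: int = 8) -> int:
    """Return the n least-significant bits of a tag."""
    return tag & ((1 << n_bits) - 1)


def early_miss(request_tag: int, set_tags: list[int], n_bits: int = 8) -> bool:
    """True when no stored tag shares the request's low n bits.

    A partial mismatch on every way proves a full-tag miss. A partial
    match is only a *possible* hit and still needs the full comparison.
    """
    wanted = partial_tag(request_tag, n_bits)
    return all(partial_tag(t, n_bits) != wanted for t in set_tags)


# Hypothetical 4-way set; the tag values are made up for illustration.
ways = [0x1A2B3, 0x0FF10, 0x7C4D5, 0x33301]

print(early_miss(0x5E6F7, ways))  # low byte 0xF7 matches no way
print(early_miss(0x99910, ways))  # low byte 0x10 matches 0x0FF10
```

The second request is a full-tag miss that the 8-bit comparison cannot rule out, which is the case the remaining few percent of misses fall into.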

PTC-Filter

[Figure: a D$ with a PTC-Filter that holds tag LSBs. An address from the Address Bus is compared against the stored LSBs. Miss: avoid the snoop and access the upper level. Hit: snoop the potential targets.]

PTC-Filter

[Figure: four cores (0 to 3), each with a 4-way D$ and a PTC-Filter. The PTC-Filter shown contains one filter per remote core (Core1’s LSB, Core2’s LSB, Core3’s LSB). Each filter entry holds an 8-bit LSB field plus D and V bits.]
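One way to picture a filter that keeps a copy of each remote core's tag LSBs is the sketch below. The class names, the number of sets, and the table layout are hypothetical; only the 8-bit LSB field, the V bit, the 4-way organization, and the per-remote-core copies come from the slides. The D bit is left out because the lookup does not need it.

```python
from dataclasses import dataclass, field

N_BITS = 8  # the slides use the 8 least-significant tag bits


@dataclass
class FilterEntry:
    lsb: int = 0         # low N_BITS of the tag held by the remote cache
    valid: bool = False  # V bit from the slide


@dataclass
class PTCFilter:
    """Per-core copy of the other cores' partial tags (hypothetical layout)."""
    remote_cores: list[int]
    num_sets: int
    num_ways: int = 4
    table: dict = field(default_factory=dict)

    def __post_init__(self):
        for core in self.remote_cores:
            self.table[core] = [
                [FilterEntry() for _ in range(self.num_ways)]
                for _ in range(self.num_sets)
            ]

    def snoop_targets(self, set_index: int, tag: int) -> list[int]:
        """Remote cores whose filter copy has a matching valid LSB."""
        lsb = tag & ((1 << N_BITS) - 1)
        return [
            core for core in self.remote_cores
            if any(e.valid and e.lsb == lsb for e in self.table[core][set_index])
        ]


# Core 0's filter tracks cores 1, 2 and 3 (set count chosen arbitrarily).
f = PTCFilter(remote_cores=[1, 2, 3], num_sets=64)
f.table[2][5][1] = FilterEntry(lsb=0xCD, valid=True)

print(f.snoop_targets(set_index=5, tag=0xABCD))  # only core 2 can hold it
print(f.snoop_targets(set_index=5, tag=0xAB00))  # empty: skip the snoop
```

An empty result corresponds to the filter-miss case (avoid the snoop and go to the upper level); a non-empty result lists the potential snoop targets.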

PTC: Filter Miss

[Figure: four CPUs, each with a D$, connected to a controller through the Address Bus, Snoop Bus, and Command Bus; the filter-miss sequence is numbered 1 to 3.]

PTC: Filter Hit

[Figure: four CPUs, each with a D$, connected to a controller through the Address Bus, Snoop Bus, and Command Bus; the filter-hit sequence is numbered 1 to 6.]

Filter Maintenance

[Figure: Core 0 and Core i. The CPU issues Request = A to the Snoop Controller, which contains a Pending Request Table and the PTC-Filter; the Command Bus and Address Bus connect the cores. Steps are numbered 1 to 6. Annotations in the figure:]
  • Miss A. Place it in the position of tag F.
  • Place A, insert in Way 1 of core 0.
  • Address Bus message: {Address=A, C=0, W=1, D=1}
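A rough sketch of how a remote core might apply a maintenance message such as {Address=A, C=0, W=1, D=1} to its copy of the filter is given below. Everything beyond the four message fields is an assumption: the line size, the number of sets, the dictionary used as the filter table, and the names `UpdateMsg`, `split`, and `apply_update` are all hypothetical. The slides do not spell out what D means, so the sketch simply stores it.

```python
from dataclasses import dataclass

OFFSET_BITS = 6  # assumed 64-byte cache lines
INDEX_BITS = 6   # assumed 64 sets
N_BITS = 8       # partial-tag width used in the slides


@dataclass
class UpdateMsg:
    address: int
    core: int  # C: core that placed the block
    way: int   # W: way the block was placed in
    d: int     # D: stored as-is; its meaning is not given in the slides


def split(address: int) -> tuple[int, int]:
    """Split an address into (set index, tag) under the assumed geometry."""
    index = (address >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = address >> (OFFSET_BITS + INDEX_BITS)
    return index, tag


def apply_update(filter_table: dict, msg: UpdateMsg) -> None:
    """Overwrite the filter entry that mirrors (core, set, way)."""
    index, tag = split(msg.address)
    filter_table[(msg.core, index, msg.way)] = {
        "lsb": tag & ((1 << N_BITS) - 1),
        "d": msg.d,
        "v": 1,
    }


# Core i receives the message that core 0 placed block A in way 1.
table: dict = {}
A = 0xABCDE40  # made-up address
apply_update(table, UpdateMsg(address=A, core=0, way=1, d=1))
print(table)
```

Because the message carries the way, the receiver can overwrite the stale partial tag of the evicted block in place, which keeps its copy consistent without a separate invalidation.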

Methodology
  • SESC simulator 4-way CMP
  • SPLASH-2 benchmarks
  • CACTI 6.0
Performance

Average: 2.9%

Bandwidth

Average: 78.5%

Tag Power

Average: 52%

Discussion
  • Why do benchmarks show different performance improvement?
    • Different cache miss frequency
    • Different early miss detection frequency
    • Not all cache misses are on the critical path
  • Filter overhead:
    • Timing: 1 cycle
    • Power: 78.5% of single tag array access
Summary
  • PTC:
      • Using subset of tag bits to improve bandwidth/power efficiency.
  • Results:
      • Performance: 2.9%
      • Tag Power: 52%
      • Bandwidth: 78.5%
Global vs. Local Miss

[Figure: caches are asked "Have B?" and answer YES or NO.]

  • Local miss detection → better power/bandwidth profile
  • Remote miss detection (source-based approach) vs. local miss detection (destination-based filter)

[Figure: two CMP diagrams, labeled Global Miss and Local Miss, each showing several D$ units connected through an interconnect to an upper-level cache.]