Using Partial Tag Comparison in Low-Power Snoop-based Chip Multiprocessors

Ali Shafiee, Narges Shahidi, Amirali Baniasadi

Sharif University of Technology

University of Victoria


This Work: Improving Snoop Coherency

Goal: Improving energy efficiency in snoop-based CMPs.

Motivation: Broadcasting/processing entire tag is inefficient.

Our Solution: Using Partial Tag Comparison (PTC) prior to snoop.

Key Results (averages):

  • Performance improvement: 2.9%

  • Tag array power reduction: 52%

  • Bandwidth utilization reduction: 78.5%


Our Solution (PTC) vs. Conventional

[Figure: two block diagrams, "Conventional" and "Our solution", each showing several D$s connected through an interconnect to an upper level cache.]

  • Conventional: Fast +, Power & Bandwidth −

  • Our solution: Fast ++ (early miss detection), Power & Bandwidth Efficient +


Conventional Snooping

[Figure: four CPUs, each with a private D$, connected by an address bus, a snoop bus, and a command bus through a controller; numbered steps 1 to 5 trace a snoop request.]

Redundant (miss) snoops: ~70%


Snoop Filters

Goal: Eliminate redundant snoop requests.

Examples: RegionScout (ISCA’05), CGCT (ISCA’05), SSP (ASPLOS’08)

PTC:

(1) Early miss detection using subset of tag bits.

(2) Once a miss is detected, snoop is avoided.
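The two PTC steps can be illustrated with a minimal sketch. It assumes the "subset of tag bits" is the low-order bits of the tag and models one cache set as a plain list of full tags; the function names `partial_tag` and `early_miss` are hypothetical and not taken from the presentation.

```python
def partial_tag(tag: int, n_bits: int = 8) -> int:
    """Return the n least-significant bits of a full tag."""
    return tag & ((1 << n_bits) - 1)


def early_miss(requested_tag: int, set_tags: list[int], n_bits: int = 8) -> bool:
    """Return True when the partial comparison alone proves a miss.

    If no way in the set shares the requested tag's low n bits, the full
    tags cannot match either, so the snoop can be avoided. A partial match
    is only a potential hit and still needs the full comparison.
    """
    wanted = partial_tag(requested_tag, n_bits)
    return all(partial_tag(t, n_bits) != wanted for t in set_tags)


# Hypothetical 4-way set holding full tags.
remote_set = [0x1A2B3, 0x0FF10, 0x33C07, 0x7E4D9]
print(early_miss(0x55555, remote_set))  # no way ends in 0x55: miss proven
print(early_miss(0x999B3, remote_set))  # one way ends in 0xB3: must snoop
```

The second call shows the cost of using fewer bits: the request does not actually hit, but the partial comparison cannot rule it out, so the snoop is still sent.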

How often is that possible?


How often using n bits is enough to detect a miss?

95+% of misses can be detected using 8 bits.
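One way such a figure could be measured is sketched below. This is an assumed methodology, not the authors' tooling: given a list of requests that are known to miss, count how many are already exposed as misses by comparing only the low n tag bits. The data is synthetic and chosen only to make the arithmetic easy to follow; it says nothing about the 95+% result.

```python
def early_miss(requested_tag, set_tags, n_bits):
    """True when no tag in the set shares the requested tag's low n bits."""
    mask = (1 << n_bits) - 1
    return all((t & mask) != (requested_tag & mask) for t in set_tags)


def detection_rate(misses, n_bits):
    """Fraction of known misses that an n-bit partial comparison detects.

    `misses` holds (requested_tag, tags_in_remote_set) pairs for which the
    full-tag comparison is already known to miss.
    """
    misses = list(misses)
    if not misses:
        return 0.0
    detected = sum(early_miss(tag, tags, n_bits) for tag, tags in misses)
    return detected / len(misses)


# Synthetic full-tag misses (illustrative only).
SYNTHETIC_MISSES = [
    (0x155, [0x255, 0x0AA]),  # low 8 bits collide with 0x255
    (0x101, [0x102, 0x203]),
    (0x3F0, [0x3F1]),
    (0x2AA, [0x1AA]),         # low 8 bits collide with 0x1AA
]

for n in (1, 8, 12):
    print(n, detection_rate(SYNTHETIC_MISSES, n))
```

Sweeping `n_bits` over a real trace would give the kind of curve this slide summarizes.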


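PTC-Filter

[Figure: a D$ with its PTC-Filter; the LSBs of the requested tag are compared against the LSBs held in the filter.]

  • Filter miss: avoid snoop, access upper level.

  • Filter hit: snoop potential targets over the address bus.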


PTC-Filter

[Figure: four cores (0 to 3), each with a 4-way D$ and a PTC-Filter. The PTC-Filter shown contains one filter per remote core (Core1’s LSB, Core2’s LSB, Core3’s LSB); each entry holds an 8-bit LSB field plus D and V bits.]
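The filter organization on this slide can be approximated by the following sketch. It assumes that D and V stand for dirty and valid bits and that the filter mirrors each remote core's sets and ways; the names `FilterEntry`, `PTCFilter`, `record`, and `potential_targets` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FilterEntry:
    lsb: int = 0          # low-order tag bits (8 bits on the slide)
    dirty: bool = False   # "D" bit, assumed to mean dirty
    valid: bool = False   # "V" bit, assumed to mean valid


class PTCFilter:
    """Per-core filter mirroring the partial tags of the remote caches."""

    def __init__(self, remote_cores, num_sets, ways=4, lsb_bits=8):
        self.mask = (1 << lsb_bits) - 1
        self.entries = {
            core: [[FilterEntry() for _ in range(ways)] for _ in range(num_sets)]
            for core in remote_cores
        }

    def record(self, core, set_index, way, tag, dirty=False):
        """Note that `core` now holds `tag` in the given set and way."""
        self.entries[core][set_index][way] = FilterEntry(tag & self.mask, dirty, True)

    def potential_targets(self, set_index, tag):
        """Remote cores whose set may hold `tag`; empty list means no snoop."""
        wanted = tag & self.mask
        return [
            core
            for core, sets in self.entries.items()
            if any(e.valid and e.lsb == wanted for e in sets[set_index])
        ]


# Core 0's filter tracks cores 1 to 3.
f = PTCFilter(remote_cores=[1, 2, 3], num_sets=4)
f.record(core=2, set_index=1, way=0, tag=0xABC12)
print(f.potential_targets(1, 0x77712))  # core 2 shares the low 8 bits
print(f.potential_targets(1, 0x77713))  # no match: go to the upper level
```

An empty result corresponds to the filter-miss case (avoid the snoop and access the upper level); a non-empty result lists the caches that still need a full-tag snoop.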


PTC: Filter Miss

[Figure: four CPUs with private D$s, an address bus, a snoop bus, a controller, and a command bus; steps 1 to 3 mark the path of a request that misses in the PTC-Filter.]


PTC: Filter Hit

[Figure: four CPUs with private D$s, an address bus, a snoop bus, a controller, and a command bus; steps 1 to 6 mark the path of a request that hits in the PTC-Filter.]


Filter Maintenance

[Figure: Core 0 (CPU, snoop controller with pending request table, PTC-Filter) and Core i, connected by the command bus and address bus; steps 1 to 6 trace a request for block A.]

  • Request = A

  • Miss on A: place it in the position of tag F.

  • Place A: insert in way 1 of core 0.

  • Bus message: {Address=A, C=0, W=1, D=1}
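How a remote core might apply such a bus message to its filter is sketched below. Several things here are assumptions rather than details from the slides: C, W, and D are read as core, way, and dirty; the cache geometry constants are invented for illustration; and the filter is modeled as a dictionary keyed by (core, set, way).

```python
from typing import NamedTuple

# Hypothetical cache geometry, chosen only for this example.
OFFSET_BITS = 6   # 64-byte blocks
INDEX_BITS = 7    # 128 sets
LSB_BITS = 8      # partial tag width


class UpdateMsg(NamedTuple):
    address: int
    core: int     # "C": core that installed the block (assumed meaning)
    way: int      # "W": way it was installed in (assumed meaning)
    dirty: bool   # "D": assumed to be a dirty bit


def split(address):
    """Return (set index, partial tag) for an address."""
    index = (address >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = address >> (OFFSET_BITS + INDEX_BITS)
    return index, tag & ((1 << LSB_BITS) - 1)


def apply_update(filter_table, msg):
    """Overwrite the entry for (core, set, way) with the new partial tag."""
    index, lsb = split(msg.address)
    filter_table[(msg.core, index, msg.way)] = (lsb, msg.dirty, True)


A = 0x12345678
index_a, lsb_a = split(A)

# Core i's filter before the update: way 1 of core 0 still records old tag F.
table = {(0, index_a, 1): (0xF0, False, True)}
apply_update(table, UpdateMsg(address=A, core=0, way=1, dirty=True))
print(table[(0, index_a, 1)])  # entry now carries A's partial tag
```

Because the message names the core and way, the receiver can overwrite the stale entry for tag F directly, without searching its filter.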


Methodology

  • SESC simulator, 4-way CMP

  • SPLASH-2 benchmarks

  • CACTI 6.0


Performance

Average: 2.9%


Bandwidth

Average: 78.5%


Tag Power

Average: 52%


Discussion

  • Why do benchmarks show different performance improvement?

    • Different cache miss frequency

    • Different early miss detection frequency

    • Not all cache misses are on the critical path

  • Filter overhead:

    • Timing: 1 cycle

    • Power: 78.5% of single tag array access


Summary

  • PTC:

    • Using subset of tag bits to improve bandwidth/power efficiency.

  • Results:

    • Performance: 2.9%

    • Tag Power: 52%

    • Bandwidth: 78.5%


Global vs. Local Miss

[Figure: two block diagrams, labeled "Global Miss" and "Local Miss", each showing several D$s connected through an interconnect to an upper level cache; caches are asked "Have B?" and answer NO or YES.]

  • Local miss detection: better power/bandwidth profile

  • Remote miss detection (source-based approach) vs. destination-based filter


