Design Exploration of an Instruction-Based Shared Markov Table on CMPs

Karthik Ramachandran & Lixin Su

Outline
  • Motivation
    • Multiple cores on single chip
    • Commercial workloads
  • Our study
    • Start with instruction sharing pattern analysis
      • Our experiments
    • Move on to instruction cache miss pattern analysis
      • Our experiments
  • Conclusions
Motivation
  • Technology push: CMPs
    • Lower access latency to other processors
  • Application pull: Commercial workloads
    • OS behavior
    • Database applications
  • Opportunities for shared structures
    • Markov-based sharing structure
    • Address large instruction footprints vs. small, fast I-caches
Instruction Sharing Analysis
    • How might instruction sharing occur?
    • OS: multiple processes, scheduling
    • DB: concurrent transactions, repeated queries, multiple threads
  • How can CMPs benefit from instruction sharing?
    • Snoop/grab instructions from other cores
    • Shared structures
  • Let’s investigate it.
Methodology
  • Two-step approach
    • Experiment I
      • Targets instruction trace analysis
      • How much sharing occurs?
    • Experiment II
      • Targets I-cache miss stream analysis
      • Examine the potential of a shared Markov structure
Experiment I
  • Add instrumentation code to analyze committed instructions
  • Focus on repeated sequences of 2, 3, 4, and 5 instructions across 16 processors (16P)
  • Histogram-based approach

How do we count?

[Counting example: the sequence {A,B} occurs on processors P1-P4 four, three, two, and one times respectively (total: 10 times); the per-processor counts shown are P1: 3 times, P2: 1 time, P3: 0 times, P4: 2 times.]
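A minimal sketch of the histogram-based counting, assuming per-processor traces of committed instruction PCs are available as Python lists; the function name, the pooled histogram, and the toy {A,B} traces below are illustrative assumptions, not the actual instrumentation code.

```python
from collections import Counter

def count_repeated_sequences(traces, seq_len):
    """Histogram sliding windows of seq_len committed instructions.

    traces  : dict mapping processor id -> list of committed instruction PCs
    seq_len : window length (2, 3, 4, or 5 in Experiment I)
    Returns a Counter keyed by the instruction sequence (a tuple of PCs).
    """
    hist = Counter()
    for pcs in traces.values():
        for i in range(len(pcs) - seq_len + 1):
            hist[tuple(pcs[i:i + seq_len])] += 1
    return hist

# Toy traces mirroring the counting example: {A,B} executed 4, 3, 2, and 1
# times on P1-P4, so the pair (A, B) is seen 10 times in total.
traces = {
    "P1": [0xA, 0xB] * 4,
    "P2": [0xA, 0xB] * 3,
    "P3": [0xA, 0xB] * 2,
    "P4": [0xA, 0xB] * 1,
}
print(count_repeated_sequences(traces, 2)[(0xA, 0xB)])   # -> 10
```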

Results - Experiment I

Q.) Is there any instruction sharing?

A.) Maybe; observe the number of times the sequences of length 2-5 repeat (~13,000 to ~17,000).

Q.) But why do the numbers for a sequence pattern of 5 instructions not differ much from a sequence pattern of 2 instructions?

A.) Spin loops!

For the non-warm-up case: 50%

For the warm-up case: 30%

Experiment II
  • Focus on instruction cache misses
    • Is there sharing involved here too?
    • What is the upper-bound performance benefit of a shared Markov table?
  • Experiment setup (a minimal sketch in code follows this list)
    • 16K-entry fully associative shared Markov table
    • Each entry holds two consecutive misses from the same processor
    • Atomic lookup and hit/miss counter update when a processor has two consecutive I$ misses
    • On a miss, insert a new entry at the LRU head
    • On a hit, record the distance from the LRU head and move the hit entry to the LRU head
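A minimal sketch of this setup, under stated assumptions: the table is keyed by (previous miss, current miss) address pairs, an OrderedDict stands in for the fully associative LRU list, and the class and method names are illustrative rather than the simulator's actual code.

```python
from collections import OrderedDict

class MarkovTable:
    """Sketch of the shared Markov table modelled in Experiment II.

    Each entry pairs two consecutive I-cache miss addresses from the same
    processor.  An OrderedDict stands in for the fully associative LRU list,
    with index 0 acting as the LRU head (most recently used position).
    """

    def __init__(self, capacity=16 * 1024):
        self.capacity = capacity
        self.entries = OrderedDict()   # key: (prev_miss, curr_miss)
        self.prev_miss = {}            # per-processor previous miss address
        self.hits = 0
        self.misses = 0
        self.hit_distances = []        # distance from the LRU head on each hit

    def access(self, cpu, miss_addr):
        """Called on every I-cache miss from any core sharing the table."""
        prev = self.prev_miss.get(cpu)
        self.prev_miss[cpu] = miss_addr
        if prev is None:
            return                     # need two consecutive misses to form a key
        key = (prev, miss_addr)
        if key in self.entries:
            self.hits += 1
            # Record the distance from the LRU head, then move the entry there.
            self.hit_distances.append(list(self.entries).index(key))
            self.entries.move_to_end(key, last=False)
        else:
            self.misses += 1
            self.entries[key] = None
            self.entries.move_to_end(key, last=False)    # insert at the LRU head
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=True)          # evict the LRU tail
```

Because all cores call access() on the same object, a miss pair recorded by one processor can later hit for another, which is exactly the sharing the experiment measures.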
Design Block Diagram
  • Small fast shared Markov table
  • Prefetch when I$ miss occurs

[Block diagram: processors (P), each with a private I$, a shared Markov table, and the shared L2 $.]
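A hedged sketch of the prefetch path in the diagram, assuming the table has been condensed into a mapping from a miss address to the predicted next miss address; every name and parameter here (markov_next, icache_lines, fetch_from_l2) is hypothetical, introduced only for illustration.

```python
def handle_icache_miss(miss_addr, markov_next, icache_lines, fetch_from_l2):
    """Hypothetical I$ miss handler: consult the shared Markov table and
    prefetch the predicted next line from the L2 into this core's I$.

    markov_next   : dict mapping a miss address -> predicted next miss address
    icache_lines  : set of line addresses currently held in this core's I$
    fetch_from_l2 : callable that reads a line from the shared L2 $
    """
    predicted = markov_next.get(miss_addr)
    if predicted is not None and predicted not in icache_lines:
        fetch_from_l2(predicted)       # issue the prefetch
        icache_lines.add(predicted)    # install the prefetched line
    data = fetch_from_l2(miss_addr)    # the demand fetch proceeds as usual
    icache_lines.add(miss_addr)
    return data
```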

Table Lookup Hit Ratio

Q1.) Is there a lot of miss sharing?

Q2.) Does a constructive interference pattern exist to help a CMP?

Q3.) Do equal opportunities exist for all the processors?

Let’s Answer the Questions

A1.) Yes, of course.

A2.) Definitely; a constructive interference pattern exists, as the figure shows.

A3.) Yes. The hit/miss ratio remains fairly stable across processors despite variance in the number of I-cache misses.

How Big Should the Table Be ?
  • About 60% of hits fall within 4K entries of the LRU head (a sketch follows this list).
  • A shared Markov table can make fair use of I-cache miss sharing.
  • What about snooping and grabbing instructions from other I caches?
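The 60%-within-4K observation corresponds to a simple cumulative fraction over the recorded hit distances; a small sketch, assuming the hit_distances list gathered by the MarkovTable sketch above (the threshold and function name are illustrative).

```python
def fraction_within(hit_distances, max_distance=4 * 1024):
    """Fraction of table hits whose distance from the LRU head is below
    max_distance; ~0.6 at 4K entries would reproduce the observation above."""
    if not hit_distances:
        return 0.0
    return sum(d < max_distance for d in hit_distances) / len(hit_distances)
```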
Real Design Issues
  • Associativity and size of the table
  • Choosing the right path when multiple paths exist
  • Separating the address directory from the data entries and providing multiple address directories (a sketch follows this list)
  • What if a sequential prefetcher exists?
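One possible reading of the directory/data split, sketched under assumed parameters: a set-associative address directory that holds only tags (and could be replicated per core), kept separate from a data array that holds the predicted next-miss addresses. This is an illustration of the idea, not the evaluated design.

```python
class SetAssociativeMarkovTable:
    """Sketch: set-associative table with the address directory kept apart
    from the data entries (tags in `directory`, payloads in `data`)."""

    def __init__(self, num_sets=1024, ways=16):
        self.num_sets = num_sets
        self.ways = ways
        self.directory = [[] for _ in range(num_sets)]   # per-set tag lists
        self.data = {}                                   # (set, tag) -> next miss

    def lookup(self, prev_miss):
        s = prev_miss % self.num_sets
        if prev_miss in self.directory[s]:
            return self.data[(s, prev_miss)]
        return None

    def insert(self, prev_miss, next_miss):
        s = prev_miss % self.num_sets
        tags = self.directory[s]
        if prev_miss not in tags:
            if len(tags) >= self.ways:
                victim = tags.pop(0)                     # evict the oldest way
                del self.data[(s, victim)]
            tags.append(prev_miss)
        self.data[(s, prev_miss)] = next_miss
```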
Conclusions
  • Instruction sharing on CMPs exists. Spin loops occur frequently with current workloads.
  • A Markov-based structure for storing I-cache misses may be helpful on CMPs.
Comparison with Real Markov Prefetching

[Figure: the shared table with a Cnt column (example values 5, 2, 3), ordered from the LRU head to the LRU tail, accessed by the processors P.]

  • Real Markov prefetching: miss to A and prefetch along A, B & C.
  • This study: misses to A & C, then look up in the table; update the hit/miss counters and change/record the LRU position.
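The difference can be stated in code: a real Markov prefetcher walks the predicted successor chain on a miss (A, then B, then C), while the scheme studied here only records and scores consecutive miss pairs. A short hedged sketch of the chain walk, with an assumed successor mapping and depth:

```python
def markov_prefetch_chain(successor, miss_addr, depth=3):
    """Real Markov prefetching (sketch): on a miss to miss_addr, follow the
    predicted successor chain and return the addresses to prefetch."""
    chain, addr = [], miss_addr
    for _ in range(depth):
        addr = successor.get(addr)
        if addr is None:
            break
        chain.append(addr)
    return chain

# e.g. successor = {"A": "B", "B": "C"}; markov_prefetch_chain(successor, "A")
# returns ["B", "C"] after the demand miss to A.
```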
Lookup Example I

[Figure: a processor P looks up the table; the LRU head and LRU tail positions are shown before and after the lookup.]
Lookup Example II

[Figure: a processor P looks up the table; the LRU head and LRU tail positions are shown before and after the lookup.]