
Design Exploration of an Instruction-Based Shared Markov Table on CMPs



  1. Design Exploration of an Instruction-Based Shared Markov Table on CMPs Karthik Ramachandran & Lixin Su

  2. Outline
  • Motivation
    • Multiple cores on a single chip
    • Commercial workloads
  • Our study
    • Start with instruction sharing pattern analysis: our experiments
    • Move on to instruction cache miss pattern analysis: our experiments
  • Conclusions

  3. Motivation
  • Technology push: CMPs
    • Lower access latency to other processors
  • Application pull: commercial workloads
    • OS behavior
    • Database applications
  • Opportunities for shared structures
    • A Markov-based sharing structure
    • Addresses large instruction footprints vs. small, fast I-caches

  4. Instruction Sharing Analysis
  • How may instruction sharing occur?
    • OS: multiple processes, scheduling
    • DB: concurrent transactions, repeated queries, multiple threads
  • How can CMPs benefit from instruction sharing?
    • Snoop/grab instructions from other cores
    • Shared structures
  • Let’s investigate.

  5. Methodology
  • Two-step approach
    • Experiment I: instruction trace analysis
      • How much sharing occurs?
    • Experiment II: I-cache miss stream analysis
      • Examines the potential of a shared Markov structure

  6. Experiment I
  • Add instrumentation code to analyze committed instructions
  • Focus on repeated sequences of 2, 3, 4, and 5 instructions across 16 processors
  • Histogram-based approach
  • How do we count? Example: the sequence {A,B} appears 10 times in total across P1 through P4; each processor's repeat count is its number of appearances beyond the first, so P1: 3 times, P2: 1 time, P3: 0 times, P4: 2 times.
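The histogram-based counting above can be sketched as follows. This is a minimal sketch, not the authors' instrumentation: the trace contents and the function name are assumptions, and the convention that a window seen n times "repeats" n - 1 times follows the example on the slide.

```python
from collections import Counter

def count_repeats(trace, k):
    """Count repeats of each length-k instruction sequence in one
    processor's committed-instruction trace: a window that appears
    n times repeats n - 1 times (appearances beyond the first)."""
    windows = Counter(tuple(trace[i:i + k]) for i in range(len(trace) - k + 1))
    return {seq: n - 1 for seq, n in windows.items() if n > 1}

# Hypothetical trace of instruction addresses for one processor
p1 = ["A", "B", "A", "B", "C", "A", "B", "A", "B"]
print(count_repeats(p1, 2))  # {A,B} repeats 3 times here
```

Running this per processor and summing the per-sequence counts gives the cross-processor histogram described above.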

  7. Results - Experiment I
  Q.) Is there any instruction sharing?
  A.) Maybe. Observe the number of times sequences of length 2-5 repeat (~13,000-17,000).
  Q.) But why do the numbers for a 5-instruction sequence pattern not differ much from those for a 2-instruction pattern?
  A.) Spin loops! Non-warm-up case: 50%; warm-up case: 30%.

  8. Experiment II
  • Focus on instruction cache misses
    • Is there sharing involved here too?
    • What is the upper bound on the performance benefit of a shared Markov table?
  • Experiment setup
    • 16K-entry, fully associative shared Markov table
    • Each entry holds two consecutive misses from the same processor
    • Atomic lookup and hit/miss counter update when a processor has two consecutive I-cache misses
    • On a miss, insert a new entry at the LRU head
    • On a hit, record the distance from the LRU head and move the hit entry to the LRU head
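The table behavior in the setup above can be sketched like this. It is a minimal sketch under the rules stated on the slide; the class and method names are hypothetical, and "LRU head" is taken to mean the most-recently-used end of the list.

```python
from collections import OrderedDict

class SharedMarkovTable:
    """Fully associative table keyed by pairs of consecutive I-cache
    misses, with LRU replacement and hit-distance recording."""

    def __init__(self, capacity=16 * 1024):
        self.capacity = capacity
        self.table = OrderedDict()   # insertion order doubles as LRU order
        self.hits = 0
        self.misses = 0
        self.hit_distances = []      # distance from the LRU head on each hit

    def access(self, prev_miss, curr_miss):
        """Look up two consecutive misses from the same processor."""
        key = (prev_miss, curr_miss)
        if key in self.table:
            self.hits += 1
            # distance from the LRU head (most recently used entry)
            mru_first = list(reversed(self.table))
            self.hit_distances.append(mru_first.index(key))
            self.table.move_to_end(key)          # move hit entry to LRU head
        else:
            self.misses += 1
            if len(self.table) >= self.capacity:
                self.table.popitem(last=False)   # evict from the LRU tail
            self.table[key] = None               # insert at LRU head
```

Feeding the miss streams of all cores into one instance models the shared structure; the recorded hit distances are what the later slide on table sizing summarizes.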

  9. Design Block Diagram
  • Small, fast shared Markov table
  • Prefetch when an I-cache miss occurs
  [Diagram: two processors, each with a private I-cache, sharing a Markov table in front of the L2 cache]

  10. Table Lookup Hit Ratio
  Q1.) Is there a lot of miss sharing?
  Q2.) Does a constructive interference pattern exist to help a CMP?
  Q3.) Do equal opportunities exist for all the processors?

  11. Let’s Answer the Questions
  A1.) Yes, of course.
  A2.) A constructive interference pattern clearly exists, as the figure shows.
  A3.) Yes. The hit/miss ratio remains fairly stable across processors despite the variance in the number of I-cache misses.

  12. How Big Should the Table Be?
  • About 60% of hits fall within 4K entries of the LRU head.
  • A shared Markov table can fairly utilize I-cache miss sharing.
  • What about snooping and grabbing instructions from other I-caches?

  13. Real Design Issues
  • Associativity and size of the table
  • Choosing the right path when multiple paths exist
  • Separating the address directory from the table's data entries, with multiple address directories
  • What if a sequential prefetcher exists?

  14. Conclusions
  • Instruction sharing on CMPs exists; spin loops occur frequently in current workloads.
  • A Markov-based structure for storing I-cache misses may be helpful on CMPs.

  15. Questions?

  16. Comparison with Real Markov Prefetching
  • Real Markov prefetching: on a miss to A, prefetch along A's recorded successors B and C
  • Our scheme: on misses to A and then C, look up the pair in the table
  • Update the hit/miss counters and change/record the LRU position
  [Diagram: table entries with per-entry counters (e.g. 5, 2, 3), ordered from LRU head to LRU tail]
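For contrast, the classic Markov prefetcher referenced above, keyed on single miss addresses, can be sketched as follows. This is a simplified sketch: the class name and the prefetch degree are assumptions, and real designs also bound the table size and keep LRU state per entry.

```python
from collections import defaultdict

class MarkovPrefetcher:
    """Classic Markov prefetching: for each miss address, remember
    which addresses tended to miss next, with frequency counters."""

    def __init__(self, degree=2):
        self.degree = degree                     # prefetches issued per miss
        self.successors = defaultdict(lambda: defaultdict(int))
        self.last_miss = None

    def on_miss(self, addr):
        """Record the transition from the previous miss and return
        up to `degree` predicted next-miss addresses to prefetch."""
        if self.last_miss is not None:
            self.successors[self.last_miss][addr] += 1
        self.last_miss = addr
        succ = self.successors[addr]
        return sorted(succ, key=succ.get, reverse=True)[:self.degree]
```

After observing the miss sequence A, B, A, C, a later miss to A would prefetch both B and C, matching the "miss to A and prefetch along A, B & C" behavior on the slide.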

  17. Lookup Example I
  [Diagram: a processor's table lookup, showing the LRU head and tail before and after the access]

  18. Lookup Example II
  [Diagram: a second table lookup, showing the LRU head and tail before and after the access]
