
Instruction Prefetching in an SMT (Simultaneous Multithreading) System and Its Impact on Performance


Presentation Transcript


  1. Instruction Prefetching in an SMT (Simultaneous Multithreading) System and Its Impact on Performance by Choi, Jun-Shik and Park, Joo Hyung

  2. Contents • Purpose • Background • Theory • Simulation • Results • Conclusion

  3. 1. Purpose • To speed up execution. • To take advantage of ILP (instruction-level parallelism) and TLP (thread-level parallelism), SMT has been considered. • To reduce the cache miss penalty and use memory bandwidth efficiently, prefetching has been used.

  4. 2. Background • Traditional Processor • Out-of-order Execution • Cache Prefetching

  5. Traditional Processor • A traditional processor stalls for the entire memory latency, from the time a data miss happens until the data arrives. [Timeline figure: L1 miss → stall for the whole memory latency → data arrival]

  6. Out-of-order Execution • Because data and control dependencies must be observed, the processor will still stall at some point if the memory latency is long. [Timeline figure: L1 miss → independent instructions overlap part of the memory latency → stall → data arrival → dependent instructions]

  7. Cache Prefetching • Cache prefetching overcomes this restriction by bringing data into the L1 cache or an on-chip buffer ahead of time, hiding as much of the cache miss penalty as possible. [Timeline figure: prefetch issued early → data arrives around the time of the L1 miss → dependent instructions proceed]

  8. 3. Theory • Simultaneous Multithreading (SMT) • Plenty of resources • Instruction-level parallelism • Thread-level parallelism • Markov prefetcher

  9. Prefetch Methods • Stride prefetcher: predicts memory references separated by a constant stride (a minimal sketch follows this slide). • Recursive prefetcher: designed for linked data structures, where the pointer chain forms the access pattern. • Markov prefetcher: based on the miss-address history. • Etc…
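As a concrete illustration of the first method, here is a minimal stride-prefetcher sketch in C. It is not from the slides or from ss_smt-1.0; the table size, field names, and confirmation policy are our own assumptions.

```c
/* Minimal stride-prefetcher sketch (illustrative only): each load PC
 * tracks its last address and last stride; when the same stride is
 * seen twice in a row, the next address is predicted. */
#include <stdint.h>

typedef struct {
    uint64_t last_addr;   /* previous address seen for this PC */
    int64_t  stride;      /* last observed stride              */
    int      confirmed;   /* stride repeated on the last access? */
} stride_entry_t;

#define TABLE_SIZE 256
static stride_entry_t table[TABLE_SIZE];

/* Returns the predicted next address, or 0 if no confident prediction. */
uint64_t stride_predict(uint64_t pc, uint64_t addr)
{
    stride_entry_t *e = &table[pc % TABLE_SIZE];
    int64_t s = (int64_t)(addr - e->last_addr);

    e->confirmed = (s == e->stride);   /* same stride as last time? */
    e->stride    = s;
    e->last_addr = addr;

    return e->confirmed ? addr + s : 0;
}
```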

  10. Basic Markov <Example> Miss-address history: 1, 2, 3, 4, 3, 5, 1, 3, 6, 6, 5, 1, 1, 2, 3, 4, 5, 1, 2, 3, 4, 3 [State-transition diagram built from this history; for example, 1 → 2 (60%), 1 → 3 (20%), 2 → 3 (100%), 3 → 6 (20%), 5 → 1 (100%)]
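The percentages on this slide can be reproduced by counting successor pairs in the miss-address history. A small self-contained C sketch (ours, not from the slides):

```c
/* Derive the Markov transition percentages by counting (current, next)
 * pairs in the example miss-address history from the slide. */
#include <stdio.h>

int main(void)
{
    int hist[] = {1,2,3,4,3,5,1,3,6,6,5,1,1,2,3,4,5,1,2,3,4,3};
    int n = sizeof hist / sizeof hist[0];
    int count[7][7] = {0}, total[7] = {0};   /* addresses are 1..6 */

    for (int i = 0; i + 1 < n; i++) {
        count[hist[i]][hist[i + 1]]++;       /* tally each transition */
        total[hist[i]]++;
    }
    for (int s = 1; s <= 6; s++)
        for (int t = 1; t <= 6; t++)
            if (count[s][t])
                printf("%d -> %d : %3.0f%%\n", s, t,
                       100.0 * count[s][t] / total[s]);
    return 0;    /* prints e.g. "1 -> 2 :  60%", "2 -> 3 : 100%" */
}
```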

  11. The Address Sequence in Prefetch • The miss-address (IL1-cache miss) stream is used as the prediction source. • The full CPU demand stream would require too much bandwidth to record. • The L1 cache filters that stream, so miss addresses arrive much less frequently.

  12. Problem in Realizing Pure Markov Prediction • Programs reference millions of addresses, so it is impossible to record every reference in a single table.

  13. Prefetch Table

      State (1-history)   1st pred.   2nd pred.   3rd pred.   4th pred.
      1                   2           1           3           -
      2                   3           -           -           -
      3                   4           5           -           -
      4                   3           5           -           -
      5                   1           -           -           -
      6                   5           6           -           -
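A minimal C sketch of one such table entry, assuming four prediction slots kept in most-recently-used order (the slides do not specify the replacement policy; the names are ours):

```c
/* One 1-history prefetch-table entry with four prediction slots.
 * On each observed (miss, next-miss) pair, the new successor is moved
 * to the front; the oldest prediction is evicted when the entry is full. */
#include <stdint.h>

#define NUM_PRED 4

typedef struct {
    uint64_t state;             /* previous miss address (1-history) */
    uint64_t pred[NUM_PRED];    /* predicted next miss addresses, MRU first */
} pt_entry_t;

/* Record that `next` followed `e->state`, keeping predictions MRU-ordered. */
void pt_update(pt_entry_t *e, uint64_t next)
{
    int i;
    for (i = 0; i < NUM_PRED; i++)
        if (e->pred[i] == next)
            break;                    /* already present: promote it   */
    if (i == NUM_PRED)
        i = NUM_PRED - 1;             /* not present: evict last slot  */
    for (; i > 0; i--)
        e->pred[i] = e->pred[i - 1];  /* shift older predictions down  */
    e->pred[0] = next;
}
```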

  14. Prefetch Diagram [Block diagram: the CPU's address request goes to the L1 cache and the prefetch buffer; on a miss, the miss address is passed to the prefetcher, which consults the prefetch table and brings predicted lines from the L2 cache / memory into the prefetch buffer.]

  15. Prefetch Algorithm [Flowchart: a CPU address request misses in L1 → look up the miss address in the prefetch table. Y (matched): store the prefetched data from L2 in the prefetch buffer. N (not matched): demand fetch only. In both cases: transfer the data from L2 to the CPU, then update or insert the corresponding information in the prefetch table/buffer.]
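A rough C sketch of this per-miss flow. The helper functions are stubs invented to keep the example self-contained; they are not ss_smt-1.0 APIs.

```c
/* Rough sketch of the per-L1-miss flow above. The helpers are stubs
 * made up for illustration; they are not ss_smt-1.0 functions. */
#include <stdint.h>
#include <stdio.h>

static uint64_t prev_miss;  /* 1-history: the previous miss address */

/* Stub: would search the prefetch table (e.g., the pt_entry_t above). */
static const uint64_t *pt_lookup(uint64_t addr) { (void)addr; return 0; }
/* Stub: would update or insert the (prev, next) pair in the table. */
static void pt_record(uint64_t prev, uint64_t next) { (void)prev; (void)next; }
static void prefetch_from_l2(uint64_t a) { printf("prefetch 0x%llx\n", (unsigned long long)a); }
static void l2_to_cpu(uint64_t a)        { printf("demand   0x%llx\n", (unsigned long long)a); }

void on_l1_miss(uint64_t miss_addr)
{
    const uint64_t *pred = pt_lookup(miss_addr);  /* look up prefetch table */

    if (pred)                        /* Y (matched): fill the prefetch buffer */
        prefetch_from_l2(*pred);
    /* N (not matched): demand fetch only */

    l2_to_cpu(miss_addr);            /* transfer the missing line to the CPU */
    pt_record(prev_miss, miss_addr); /* update or insert the table entry     */
    prev_miss = miss_addr;
}
```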

  16. 4. Simulation • Modified code: ss_smt-1.0 • Specification • Threads: 2 • Cache: L1 (64 KB), L2 • Number of instructions: 100 million • Benchmarks used • MCF (integer) and ART (floating point) • GCC (integer) and MESA (floating point)

  17. 5. Results • Test benches for the two threads • MCF and ART • L1 miss rate = 0.0794, 0.0921 • Number of L1 misses = number of accesses to the PFB: 23, 7 • GCC and MESA • L1 miss rate = 0.0010, 0.0009 • Number of L1 misses = number of accesses to the PFB: 15, 13

  18. Benchmark Reference 1 • The following benchmarks grow quickly to their target sizes (expressed in megabytes) and then stay there: ART, MCF

              max rsz   max vsz   num obs   num unchanged   stable?
      art     3.7       4.3       157       37              x

  19. Benchmark Reference 2 • These benchmarks change size over time: GCC, MESA

              max rsz   max vsz   num obs   num unchanged   stable?
      mesa    9.4       23.1      132       131             stable

  20. 6. Conclusion • A prefetcher using the Markov algorithm has been simulated. • To make the Markov prefetcher efficient, the system must give it enough training time and enough L1 misses, because the prefetcher operates on the history of the L1 miss-address sequence. • Disadvantages of the Markov prefetcher • High hardware cost; not a good stand-alone prefetcher
