Prefetching Using a Global History Buffer

Presentation Transcript


  1. Prefetching Using a Global History Buffer Kyle J. Nesbit and James E. Smith

  2. Outline • Motivation • Related Work • Global History Buffer Prefetching • Results • Conclusion

  3. Motivation • D-Cache misses to main memory are of increasing importance • Main memory is getting farther away (in clock cycles) • Many demanding, memory intensive workloads • Computation is inexpensive compared to data accesses • Good opportunity to reevaluate prefetching data structures • Simple computation can supplement table information • We consider prefetches from main memory to lowest level cache (L2 cache in this study)

  4. Markov Prefetching • Markov prefetching forms address correlations • Joseph and Grunwald (ISCA ‘97) • Uses global memory addresses as states in the Markov graph • Correlation Table approximates Markov graph [Figure: miss address stream A B C A B C B C …, the Markov graph it induces, and the correlation table storing a 1st and 2nd predicted miss address per miss address]
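
As a rough illustration of the correlation table sketched on this slide, the following minimal Python model (not part of the original presentation; the two-successor width and the replacement policy are assumptions) records, for each miss address, the addresses that most recently followed it and predicts them on the next miss to that address.

    class MarkovPrefetcher:
        """Toy address-correlation (Markov) prefetcher: each table entry maps
        a miss address to the successor addresses seen most recently after it."""

        def __init__(self, width=2):
            self.width = width        # predicted successors kept per entry
            self.table = {}           # miss address -> [1st prediction, 2nd prediction, ...]
            self.last_miss = None

        def on_miss(self, addr):
            # Learn: record addr as the most recent successor of the previous miss.
            if self.last_miss is not None:
                succ = self.table.setdefault(self.last_miss, [])
                if addr in succ:
                    succ.remove(addr)
                succ.insert(0, addr)       # most recent first
                del succ[self.width:]      # keep at most `width` successors
            self.last_miss = addr
            # Predict: return the recorded successors of addr as prefetch candidates.
            return list(self.table.get(addr, []))

    prefetcher = MarkovPrefetcher()
    for a in ["A", "B", "C", "A", "B", "C", "B", "C"]:   # miss stream from the slide
        print(a, "->", prefetcher.on_miss(a))

After this stream the table predicts B after A, C after B, and B or A after C, consistent with the Markov graph on the slide.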

  5. Correlation Prefetching • Distance Prefetching forms delta correlations • Kandiraju and Sivasubramaniam (ISCA ‘02) • Delta-based prefetching leads to a much smaller table than “classical” Markov Prefetching • Delta-based prefetching can remove compulsory misses [Figure: side-by-side comparison of Markov Prefetching, whose table is keyed by miss address (stream 27 28 29 27 28 29 28 29), and Distance Prefetching, whose table is keyed by global delta (delta stream 1 1 -2 1 1 -1 1), each with 1st and 2nd predictions per entry]
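
In the same hypothetical style, a delta-correlation (distance) prefetcher can be sketched by keying the table on the difference between consecutive global miss addresses instead of on the addresses themselves; the structure below is illustrative, not the authors' implementation.

    class DistancePrefetcher:
        """Toy distance (global delta correlation) prefetcher: the table is
        keyed by deltas, so it is much smaller than an address-keyed table
        and can predict addresses that have never missed before."""

        def __init__(self, width=2):
            self.width = width
            self.table = {}            # delta -> [1st predicted next delta, 2nd, ...]
            self.last_addr = None
            self.last_delta = None

        def on_miss(self, addr):
            prefetches = []
            if self.last_addr is not None:
                delta = addr - self.last_addr
                # Learn: the previous delta was followed by this delta.
                if self.last_delta is not None:
                    succ = self.table.setdefault(self.last_delta, [])
                    if delta in succ:
                        succ.remove(delta)
                    succ.insert(0, delta)
                    del succ[self.width:]
                # Predict: apply the deltas that usually follow this delta.
                prefetches = [addr + d for d in self.table.get(delta, [])]
                self.last_delta = delta
            self.last_addr = addr
            return prefetches

    pf = DistancePrefetcher()
    for a in [27, 28, 29, 27, 28, 29, 28, 29]:   # miss stream from the slide
        print(a, "->", pf.on_miss(a))

The global delta stream here is 1 1 -2 1 1 -1 1, as on the slide, so the table only ever holds three keys (1, -2, -1).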

  6. Global History Buffer (GHB) • Holds miss address history in FIFO order • Linked lists within GHB connect related addresses • Same static load • Same global miss address • Same global delta • Linked list walk is short compared with L2 miss latency [Diagram: an index table (keyed, e.g., by load PC) pointing into the FIFO Global History Buffer of miss addresses]
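
The GHB pairs a small index table with an n-entry FIFO of miss addresses. The Python model below is a software sketch of that organization (the entry count, key choice, and stale-link check are illustrative assumptions): the index table points at the newest GHB entry for each key, and link pointers chain older entries with the same key.

    class GHB:
        """Illustrative FIFO Global History Buffer with an index table."""

        def __init__(self, size=256):
            self.size = size
            self.addrs = [None] * size     # miss address stored in each GHB slot
            self.links = [None] * size     # position of the previous entry with the same key
            self.count = 0                 # total misses inserted so far
            self.index_table = {}          # key -> position of the newest entry for that key

        def insert(self, key, addr):
            pos = self.count
            self.addrs[pos % self.size] = addr
            self.links[pos % self.size] = self.index_table.get(key)
            self.index_table[key] = pos
            self.count += 1                # FIFO: new entries overwrite the oldest slots

        def walk(self, key, max_steps=8):
            """Follow the linked list for `key`; returns positions, newest first."""
            chain, pos = [], self.index_table.get(key)
            # Stop at links that point to entries already overwritten by the FIFO.
            while (pos is not None and pos >= self.count - self.size
                   and len(chain) < max_steps):
                chain.append(pos)
                pos = self.links[pos % self.size]
            return chain

    # Example in the spirit of the next slide: key the GHB by global miss address
    # and prefetch whatever followed earlier misses to the current address.
    ghb = GHB(size=16)
    for a in [27, 28, 29, 27, 28, 29, 28]:
        ghb.insert(key=a, addr=a)
    current = 29
    candidates = [ghb.addrs[(p + 1) % ghb.size]
                  for p in ghb.walk(current) if p + 1 < ghb.count]
    print(candidates)                      # [28, 27]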

  7. GHB - Example [Example diagram: miss address stream 27 28 29 27 28 29 28 with current miss 29; the index table is keyed by global miss address and points to the head of each address's linked list in the GHB, and walking the list for the current miss address yields the prefetch candidates]

  8. GHB – Deltas [Example diagram: miss address stream 27 28 36 44 45 49 53 54 62 70 with current miss 71 and global delta stream 1 8 8 1 4 4 1 8 8 1; the delta Markov graph (edge probabilities of roughly .7 and .3) drives three policies: width predicts 71 + 8 => 79 and 71 + 4 => 75, depth follows the most likely chain to 79 + 8 => 87, and the hybrid combines both]

  9. GHB – Hybrid Delta • Width prefetching suffers from poor accuracy and short look-ahead • Depth prefetching has good look-ahead, but may miss prefetch opportunities when a number of “next” addresses have similar probability • The hybrid method combines depth and width
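
To make the width/depth/hybrid distinction concrete, here is a hypothetical sketch in the same style. It assumes a table mapping each delta to its likely successor deltas (most likely first, as a distance prefetcher would learn), and the way the hybrid splits its prefetch degree between width and depth is an illustrative guess, not necessarily the paper's exact policy.

    def width_prefetches(table, addr, delta, degree):
        """Width: apply each likely next delta directly to the current address."""
        return [addr + d for d in table.get(delta, [])][:degree]

    def depth_prefetches(table, addr, delta, degree):
        """Depth: repeatedly follow the single most likely next delta."""
        out = []
        while len(out) < degree and table.get(delta):
            delta = table[delta][0]      # most likely successor delta
            addr += delta
            out.append(addr)
        return out

    def hybrid_prefetches(table, addr, delta, degree, depth_per_branch=1):
        """Hybrid (one possible combination): width at the first level,
        then a short depth run continuing from each width candidate."""
        out = []
        for d in table.get(delta, []):
            out.append(addr + d)
            out += depth_prefetches(table, addr + d, d, depth_per_branch)
        return out[:degree]

    # Successor table roughly matching the delta Markov graph of slide 8.
    table = {1: [8, 4], 8: [8, 1], 4: [4, 1]}
    print(width_prefetches(table, 71, 1, 2))    # [79, 75]
    print(depth_prefetches(table, 71, 1, 2))    # [79, 87]
    print(hybrid_prefetches(table, 71, 1, 4))   # [79, 87, 75, 79]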

  10. GHB - Hybrid Example [Example diagram: same miss address stream as slide 8 (current miss 71, current delta 1); here the index table is keyed by global delta, and walking the linked list for delta 1 in the GHB yields the hybrid prefetches 71 + 8 => 79, 71 + 4 => 75, 79 + 8 => 87, and 75 + 4 => 79]

  11. Simulation Methodology • Simulated SPEC CPU2000 benchmarks • Fast-forwarded 1 billion instructions and simulated 1 billion instructions • Used peak binaries compiled with -O4 optimization • Results include all benchmarks that have at least a 5% IPC improvement with an ideal L2 cache

  12. Simulation Methodology • Table walk - one cycle per access • Index Table (IT) size reduces table conflicts • GHB size reflects the prefetch history working set • In general, GHB prefetching requires less history

  13. Results • Our results compare: • IPC Improvement (harmonic mean) vs. Prefetch Degree • Increase in Memory Traffic per instruction (arithmetic mean) vs. Prefetch Degree • Prefetch Accuracy – The percent of prefetches that are used by the program

  14. Distance Prefetching (Performance) [Chart: IPC improvement (harmonic mean), y-axis 5%–35%, vs. prefetch degree 1, 2, 4, 8, 16 for Table (width), GHB (width), GHB (depth), and GHB (hybrid)]

  15. Distance Prefetching (Performance) [Chart: per-benchmark IPC improvement, y-axis -10% to 110%, for wupwise, swim, mgrid, applu, galgel, art, lucas, apsi, gap, vpr, mcf, parser, twolf, bzip2, ammp, and the harmonic mean, comparing Table (width) (one result off the scale at ~300%), GHB (width), GHB (depth), and GHB (hybrid)]

  16. Distance Prefetching (Memory Traffic) [Chart: increase in memory traffic, y-axis 0%–180%, vs. prefetch degree 1, 2, 4, 8, 16 for Table (width), GHB (width), GHB (depth), and GHB (hybrid)]

  17. Distance Prefetching (Memory Traffic) [Chart: as on the previous slide, increase in memory traffic, y-axis 0%–180%, vs. prefetch degree 1, 2, 4, 8, 16 for Table (width), GHB (width), GHB (depth), and GHB (hybrid)]

  18. Conclusions • More complete picture of history • Allows width, depth, and hybrid • Also can improve other prefetching methods (covered in depth in the paper) • Eliminates stale history in a natural way • FIFO discards old history to make room for new history • In a conventional table, old history can remain for a very long time and trigger inaccurate prefetches

  19. Acknowledgements • This research was funded by: • An Intel Undergraduate Research scholarship. • A University of Wisconsin Hilldale Undergraduate Research fellowship. • The National Science Foundation under grants CCR-0311361 and EIA-0071924.

  20. Backup Slides

  21. Prefetching Metrics • Accuracy is the percent of prefetches that are actually used. • Coverage is the percent of memory references prefetched rather than demand fetched. • Timeliness indicates if prefetched data arrives early enough to prevent the processor from stalling.
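
For reference, accuracy and coverage reduce to two simple ratios; the helper below is a trivial illustration with made-up counter names. Timeliness cannot be expressed this way, since it depends on when each prefetched block arrives relative to its demand access.

    def prefetch_metrics(prefetches_issued, prefetches_used,
                         misses_eliminated, misses_without_prefetching):
        """Accuracy and coverage as percentages, per the definitions above."""
        accuracy = 100.0 * prefetches_used / max(prefetches_issued, 1)
        coverage = 100.0 * misses_eliminated / max(misses_without_prefetching, 1)
        return accuracy, coverage

    # Hypothetical numbers: 80 of 200 prefetches used, removing 80 of 400 misses.
    print(prefetch_metrics(200, 80, 80, 400))   # (40.0, 20.0)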

  22. GHB – Deltas [Backup diagram: the miss address stream 27 28 36 44 45 49 53 54 62 70 (current miss 71), its global delta stream 1 8 8 1 4 4 1 8 8 1, and the corresponding delta Markov graph with edge probabilities of roughly .7 and .3]

  23. Prefetch Taxonomy • To simplify the discussion and illustrate the relationship between prefetching methods, we introduce a consistent naming convention. • Each name is an X/Y pair. • X is the key used to localize the address stream. • Y is the method used to detect address patterns.

  24. Prefetch Taxonomy • We study two localizing methods • No localization or global (G) • Program Counter (PC) • And three pattern detection methods • Address Correlation • Delta Correlation • Constant Stride

  25. Prefetch Taxonomy • Markov Prefetching - G/AC • Distance Prefetching - G/DC • Stride Prefetching - PC/CS

  26. Stride Prefetching • Table tracks the local history of loads. • If a constant stride is detected in a load’s local history, then n + s, n + 2s, …, n + ds are prefetched. • n is the current target address • s is the detected stride • d is the prefetch degree or aggressiveness of the prefetching.
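
A minimal sketch of this idea (a simplified reference prediction table without the state machine shown on the next slide; the class and field names are made up):

    class StridePrefetcher:
        """Toy PC-localized constant-stride prefetcher."""

        def __init__(self, degree=4):
            self.degree = degree       # d: prefetch degree (aggressiveness)
            self.table = {}            # load PC -> (last target address, last stride)

        def on_access(self, pc, addr):
            prefetches = []
            if pc in self.table:
                last_addr, last_stride = self.table[pc]
                stride = addr - last_addr
                if stride == last_stride and stride != 0:
                    # Constant stride detected: prefetch n + s, n + 2s, ..., n + ds.
                    prefetches = [addr + k * stride
                                  for k in range(1, self.degree + 1)]
                self.table[pc] = (addr, stride)
            else:
                self.table[pc] = (addr, None)
            return prefetches

    sp = StridePrefetcher(degree=3)
    for addr in (100, 104, 108, 112):            # one load streaming with stride 4
        print(addr, "->", sp.on_access(pc=0x400123, addr=addr))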

  27. Stride Prefetching [Diagram: a Reference Prediction Table indexed by the load PC, with tag, last address, stride, and state fields; the current target address minus the last address gives the stride, which is added to the target address to form the prefetch address]

  28. GHB – Stride Prefetching • GHB-Stride uses the PC to access the index table. • The linked lists contain the local history of each load. • Compare the last two local strides. If they are the same, prefetch n + s, n + 2s, …, n + ds. [Diagram: the index table is keyed by load PC; walking a load's linked list in the GHB yields its local miss address history, from which the last two strides are computed and compared]
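
Combining this with the GHB sketch from slide 6, the stride check itself might look like the following (the function name, the newest-first ordering of the walked history, and the exact form of the "last two strides" test are assumptions based on this slide):

    def ghb_stride_prefetches(local_history, degree=4):
        """GHB/PC constant-stride sketch. `local_history` is one load's recent
        miss addresses, newest first, e.g. obtained by walking the load's
        linked list in the GHB. If the last two local strides match,
        prefetch n + s, n + 2s, ..., n + ds."""
        if len(local_history) < 3:
            return []
        n = local_history[0]
        stride_1 = local_history[0] - local_history[1]   # most recent stride
        stride_2 = local_history[1] - local_history[2]   # stride before that
        if stride_1 != stride_2 or stride_1 == 0:
            return []
        return [n + k * stride_1 for k in range(1, degree + 1)]

    # Local history of one load (newest first), missing with a constant stride of 4.
    print(ghb_stride_prefetches([108, 104, 100], degree=4))   # [112, 116, 120, 124]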

  29. GHB – Local Delta Correlation • Form delta correlations within each load’s local history. • For example, consider the local miss address stream:
