
Meeting Midway: Improving CMP Performance with Memory-Side Prefetching







  1. Meeting Midway: Improving CMP Performance with Memory-Side Prefetching Praveen Yedlapalli, Jagadish Kotra, Emre Kultursay, Mahmut Kandemir, Chita R. Das and Anand Sivasubramaniam The Pennsylvania State University

  2. Summary • Problem: In modern multi-core systems, an increasing number of cores share common resources • "Memory Wall" • Application/core contention causes interference • Proposal: A novel memory-side prefetching scheme • Mitigates interference while exploiting row buffer locality • Result: Average 10% improvement in application performance

  3. Outline • Background • Motivation • Memory-Side Prefetching • Evaluation • Conclusion

  4. Network On-Chip based CMP • [Figure: a mesh NoC connecting cores (C) and their L1/L2 caches through routers (R) to memory controllers MC0–MC3; request and response messages traverse the network]

  5. Memory Controller • [Figure: CPU → MC → DRAM with two banks; a request to the currently open row is a row buffer hit, while a request to a different row is a row buffer conflict, requiring a precharge of row A and an activate of row B before the access]
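The hit/conflict distinction on this slide can be sketched as a tiny timing model of one DRAM bank. The timing values and the class below are illustrative assumptions, not numbers from the talk: a hit reuses the latched row and pays only the column access, while a conflict must first precharge the old row and activate the new one.

```python
# Hypothetical DRAM bank timing model (all latencies are assumed, illustrative values).
T_CAS = 15   # column access (ns)
T_RP  = 15   # precharge (ns)
T_RCD = 15   # activate / row-to-column delay (ns)

class Bank:
    def __init__(self):
        self.open_row = None  # row currently latched in the row buffer

    def access(self, row):
        """Return (latency_ns, outcome) for a request to `row`."""
        if self.open_row == row:
            return T_CAS, "hit"              # row buffer hit: column access only
        lat, outcome = T_CAS + T_RCD, "miss" # closed bank: activate + access
        if self.open_row is not None:
            lat += T_RP                      # conflict: precharge the old row first
            outcome = "conflict"
        self.open_row = row
        return lat, outcome

bank = Bank()
print(bank.access("A"))  # (30, 'miss'): first access activates row A
print(bank.access("A"))  # (15, 'hit'): row A is already open
print(bank.access("B"))  # (45, 'conflict'): precharge A, activate B, then access
```

The 3x latency gap between a hit and a conflict in this sketch is what makes preserving row buffer locality worthwhile.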

  6. Outline • Background • Motivation • Memory-Side Prefetching • Evaluation • Conclusion

  7. Impact of Interference • How to handle this negative interference?

  8. Latency Breakdown of an L2 Miss • Off-chip latency is the major component • [Figure: L2 miss latency split into on-chip, off-chip queueing, and off-chip access]

  9. Observations • Memory requests from multiple cores interleave at the memory controllers • Row buffer locality of individual applications is lost • Off-chip latency is the major component of a memory access • The on-chip network and caches are critical • We cannot afford to pollute them

  10. What about Cache Prefetching? • Not effective for large CMPs • Agnostic to memory state • Gap between caches and memory (62% latency increase) • On-chip resource pollution • Both caches and network (22% network latency increase) • Difficulty of stream detection in S-NUCA • Each L2 bank caters to only a portion of the address space • Each L2 bank gets requests from multiple L1s • Our memory-side prefetching scheme can work along with core-side prefetching

  11. Outline • Background • Motivation • Memory-Side Prefetching • Evaluation • Conclusion

  12. Memory-Side Prefetching • Objective 1 • Reduce off-chip access latency • Objective 2 • Without increasing on-chip resource contention

  13. Memory-Side Prefetching What to Prefetch? When to Prefetch? Where to Prefetch?

  14. What to Prefetch? • Prefetch from an open row • Minimizes overhead • We analyzed the cache-line access patterns within a row

  15. What to Prefetch?

  16. What to Prefetch? In general, next-line locality is good
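The "prefetch from the open row, next line is a good bet" policy of the last three slides can be sketched as a candidate generator. The row size and prefetch degree below are assumed parameters for illustration; the one constraint from the talk is that candidates never leave the open row, so every prefetch is a cheap row buffer hit.

```python
# Sketch of next-line candidate selection within the open DRAM row.
# LINES_PER_ROW and DEGREE are assumptions, not values from the talk.
LINES_PER_ROW = 128   # cache lines per DRAM row (assumed)
DEGREE = 4            # prefetch degree (assumed)

def next_line_candidates(hit_line, lines_per_row=LINES_PER_ROW, degree=DEGREE):
    """Lines to prefetch after a row buffer hit, clipped to the open row."""
    return [hit_line + i for i in range(1, degree + 1)
            if hit_line + i < lines_per_row]

print(next_line_candidates(10))    # [11, 12, 13, 14]
print(next_line_candidates(126))   # [127] -- clipped at the row boundary
```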

  17. When to Prefetch?

  18. Where to Prefetch? • Should be stored on-chip • Prefetch buffers in the memory controllers • To avoid on-chip resource pollution • Organization • Per-core • Shared
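The per-core prefetch buffer described above can be sketched as a small FIFO-evicted store at the memory controller. Capacity, eviction policy, and the consume-on-hit behavior below are assumptions for illustration; the point from the slide is that prefetched lines live at the MC rather than in the caches, so a demand request probes its core's buffer before going off-chip.

```python
from collections import OrderedDict

# Sketch of a per-core prefetch buffer at the memory controller.
# Capacity and FIFO eviction are assumed details, not from the talk.
class PrefetchBuffer:
    def __init__(self, capacity=8):
        self.capacity = capacity
        self.lines = OrderedDict()           # address -> data, insertion order

    def insert(self, addr, data):
        if addr in self.lines:
            return
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)   # evict the oldest prefetch
        self.lines[addr] = data

    def lookup(self, addr):
        return self.lines.pop(addr, None)    # a hit consumes the line

# Per-core organization: one buffer per core at each memory controller.
buffers = {core: PrefetchBuffer() for core in range(4)}
buffers[0].insert(0xA11, "line data")
print(buffers[0].lookup(0xA11) is not None)  # True: served at the MC, no off-chip access
print(buffers[0].lookup(0xA11) is not None)  # False: the line was consumed
```

A shared organization would simply replace the per-core dictionary with one larger buffer tagged by core id.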

  19. Memory-Side Prefetching Optimizations • Applications vary in memory behavior • Prefetch Throttling • Feedback-driven • Precharge on Prefetch • A prefetched row is less likely to get another request • Avert Costly Prefetches • Prioritize waiting demand requests
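The feedback-driven throttling bullet can be sketched as an accuracy counter per application: track how many prefetched lines were actually used over an interval, and lower the prefetch degree when accuracy is poor. The thresholds and interval mechanics below are assumptions for illustration, not the paper's tuned values.

```python
# Sketch of feedback-driven prefetch throttling (thresholds are assumed).
class Throttler:
    def __init__(self, max_degree=4):
        self.max_degree = max_degree
        self.degree = max_degree
        self.issued = 0   # prefetches issued this interval
        self.used = 0     # prefetches consumed by demand requests

    def on_prefetch(self):
        self.issued += 1

    def on_use(self):
        self.used += 1

    def adjust(self):
        """Recompute the degree from observed accuracy, then reset counters."""
        if self.issued == 0:
            return self.degree
        accuracy = self.used / self.issued
        if accuracy < 0.25:
            self.degree = 0                        # stop prefetching for this app
        elif accuracy < 0.5:
            self.degree = max(1, self.degree // 2) # back off
        else:
            self.degree = self.max_degree          # restore full degree
        self.issued = self.used = 0                # new sampling interval
        return self.degree

t = Throttler()
for _ in range(100):
    t.on_prefetch()
for _ in range(10):
    t.on_use()
print(t.adjust())  # 0: only 10% of prefetches were useful, so throttle fully
```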

  20. Memory-Side Prefetching: Example • [Figure: a row buffer hit to line A10 in open row A triggers prefetches of the subsequent lines A11, A12, A13, A14 from the same open row]

  21. Memory-Side Prefetching: Comparison

  22. Implementation • Prefetch Buffer Implementation • Organized as n per-core prefetch buffers • 256 KB per Memory Controller (<3% compared to LLC) • < 1% Area and Power overhead • Prefetch Request Timing • Prefetch requests are generated internally by the memory controller when a read request hits in the row buffer
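The "<3% compared to LLC" claim can be checked against the evaluation platform's configuration (32 cores with 1 MB of L2 each, 4 memory channels, 256 KB of buffer per memory controller); the arithmetic below is just that check.

```python
# Storage-overhead check using the configuration stated on the slides.
KB = 1024
buffer_per_mc = 256 * KB          # prefetch buffer per memory controller
num_mcs = 4                       # 4 memory channels
llc_total = 32 * 1024 * KB        # 32 cores x 1 MB L2 (shared S-NUCA LLC)

total_buffer = buffer_per_mc * num_mcs
ratio = total_buffer / llc_total
print(f"total buffer = {total_buffer // KB} KB")   # 1024 KB
print(f"fraction of LLC = {ratio:.1%}")            # 3.1%, i.e. roughly the ~3% quoted
```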

  23. Outline • Background • Motivation • Memory-Side Prefetching • Evaluation • Conclusion

  24. Evaluation Platform • Cores: 32 at 2.4 GHz • Network: 8x4 2D mesh • Caches: 32KB L1I; 32KB L1D; 1MB L2 per core • Memory: 16GB DDR3-1600 with 4 Memory Channels • GEMS simulator with GARNET

  25. Evaluation Methodology • Benchmarks: • Multi-programmed: SPEC 2006 (WL1 to WL5) • Multi-threaded: SPECOMP 2001 (WL6 & WL7) • Metrics: • Harmonic IPC • Off-chip and On-chip Latencies
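The evaluation reports harmonic-mean IPC, which, unlike the arithmetic mean, is not inflated by one fast application in a multi-programmed mix. The per-core IPC values below are made up for illustration; only the metric itself is from the slides.

```python
# Harmonic-mean IPC, the throughput metric used in the evaluation.
def harmonic_mean(values):
    return len(values) / sum(1.0 / v for v in values)

ipcs = [0.5, 1.0, 2.0]                # hypothetical per-core IPCs
print(round(harmonic_mean(ipcs), 4))  # 0.8571 (arithmetic mean would be 1.1667)
```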

  26. IPC • [Figure: harmonic IPC across workloads; average 10% improvement]

  27. Latency • [Figure: off-chip latency reduced by 48.5% on average]

  28. Latency • [Figure: latency results across workloads]

  29. L2 Hit Rate • [Figure: L2 hit rate across workloads]

  30. Row Buffer Hit Rate • [Figure: row buffer hit rate across workloads]

  31. Outline • Background • Motivation • Memory-Side Prefetching • Evaluation • Conclusion

  32. Conclusion • Proposed a new memory-side prefetcher • Opportunistic • Instantaneous knowledge of memory state • Prefetching Midway • Doesn’t pollute on-chip resources • Reduces the off-chip latency by 48.5% and improves performance by 6.2% on average • Our technique can be combined with core-side prefetching to amplify the benefits

  33. Thank You • Questions?
