
Adaptive History-Based Memory Schedulers




  1. Adaptive History-Based Memory Schedulers • Ibrahim Hur and Calvin Lin • IBM Austin and The University of Texas at Austin

  2. Memory Bottleneck • Memory system performance is not increasing as fast as CPU performance • Latency: use caches, prefetching, … • Bandwidth: use parallelism inside the memory system

  3. How to Increase Memory Command Parallelism? • Similar to instruction scheduling, commands can be reordered for higher bandwidth • [Diagram: DRAM with Banks 0–3; over time, the order Read Bank 0, Read Bank 0, Read Bank 1, Read Bank 1 causes a bank conflict, while the better order Read Bank 0, Read Bank 1, Read Bank 0, Read Bank 1 interleaves the banks]
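The reordering idea on this slide can be sketched in a few lines. This is a toy greedy reorderer, not the Power5 arbiter: commands are represented as hypothetical `(op, bank)` tuples, and the only rule applied is "avoid issuing to the same bank twice in a row".

```python
# Toy sketch: reorder DRAM commands so consecutive commands hit
# different banks, avoiding back-to-back bank conflicts.
# A command is a hypothetical (op, bank) tuple for illustration.

def reorder_for_bank_parallelism(commands):
    """Greedily pick the oldest command whose bank differs from the
    previously issued one; fall back to the queue head if none exists."""
    pending = list(commands)
    issued = []
    last_bank = None
    while pending:
        pick = next((c for c in pending if c[1] != last_bank), pending[0])
        pending.remove(pick)
        issued.append(pick)
        last_bank = pick[1]
    return issued

# The conflicting order from the slide: two back-to-back reads to bank 0.
naive = [("read", 0), ("read", 0), ("read", 1), ("read", 1)]
print(reorder_for_bank_parallelism(naive))
# -> [('read', 0), ('read', 1), ('read', 0), ('read', 1)]
```

As on the slide, the interleaved order lets reads to banks 0 and 1 overlap instead of serializing on one bank.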

  4. Inside the Memory System • [Diagram: caches feed a Read Queue and a Write Queue (not FIFO) inside the Memory Controller; the arbiter moves commands into a FIFO Memory Queue that drives the DRAM] • The arbiter schedules memory operations

  5. Our Work • Study memory command scheduling in the context of the IBM Power5 • Present new memory arbiters • 20% increased bandwidth • Very little cost: 0.04% increase in chip area

  6. Outline • The Problem • Characteristics of DRAM • Previous Scheduling Methods • Our approach • History-based schedulers • Adaptive history-based schedulers • Results • Conclusions

  7. Understanding the Problem: Characteristics of DRAM • Multi-dimensional structure • Banks, rows, and columns • IBM Power5: ranks and ports as well • Access time is not uniform • Bank-to-bank conflicts • Read-after-Write to the same rank conflict • Write-after-Read to a different port conflict • …
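The non-uniform access times listed above can be captured as a cost function over adjacent command pairs. The sketch below is illustrative only: the penalty values and the dict-based command format are hypothetical, not the Power5's actual timings.

```python
# Hypothetical contention-cost model: the cost of issuing a command
# next depends on the command issued just before it. Penalty values
# are made up for illustration.

def contention_cost(prev, cmd):
    """prev and cmd are dicts with 'op', 'bank', 'rank', and 'port'."""
    cost = 0
    if cmd["bank"] == prev["bank"]:
        cost += 4  # bank-to-bank conflict
    if prev["op"] == "write" and cmd["op"] == "read" and cmd["rank"] == prev["rank"]:
        cost += 2  # Read after Write to the same rank
    if prev["op"] == "read" and cmd["op"] == "write" and cmd["port"] != prev["port"]:
        cost += 2  # Write after Read to a different port
    return cost

prev = {"op": "write", "bank": 0, "rank": 0, "port": 0}
cmd = {"op": "read", "bank": 0, "rank": 0, "port": 1}
print(contention_cost(prev, cmd))  # 6: bank conflict plus read-after-write to the same rank
```

A scheduler that knows this function can prefer the candidate command with the lowest cost given what was just issued.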

  8. Previous Scheduling Approaches: FIFO Scheduling • [Diagram: caches feed the Read Queue and Write Queue; the arbiter drains them in FIFO order into the Memory Queue (FIFO), which feeds the DRAM]

  9. Memoryless Scheduling • Adapted from Rixner et al., ISCA 2000 • [Diagram: same pipeline as FIFO scheduling, but commands incur a long delay in the reorder queues before the arbiter forwards them to the Memory Queue (FIFO)]

  10. What We Really Want • Keep the pipeline full; don’t hold commands in the reorder queues until conflicts are totally resolved • Forward them to the memory queue in an order that minimizes future conflicts • To do this we need to know the history of the commands • [Diagram: the arbiter compares candidate commands A, B, C, D from the Read/Write Queues; issuing D (costs 3, 7) is better than C (costs 5, 8)]

  11. Another Goal: Match the Application’s Memory Command Behavior • The arbiter should select commands from the queues roughly in the ratio in which the application generates them • Otherwise, the read or write queue may become congested • Command history is useful here too

  12. Our Approach: History-Based Memory Schedulers • Benefits: • Minimize contention costs • Consider multiple constraints • Match the application’s memory access behavior • 2 Reads per Write? • 1 Read per Write? • … • The result: a less congested memory system, i.e., more bandwidth

  13. How does it work? • Use a Finite State Machine (FSM) • Each state in the FSM represents one possible history • Transitions out of a state are prioritized • At any state, the scheduler selects the available command with the highest priority • The FSM is generated at design time
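The FSM mechanics can be sketched as follows. This is a toy two-state machine with a history of length one and made-up priority lists, purely to show the lookup-and-transition shape; the real design-time tables for the Power5 are not reproduced here.

```python
# Toy history-based FSM arbiter. Each state is the recent command
# history; each state carries a fixed, design-time priority list.
# At run time the arbiter picks the highest-priority available command.

# Hypothetical length-1 history: state = type of the last command issued.
PRIORITIES = {
    "read":  ["write", "read"],   # after a read, prefer a write (toy choice)
    "write": ["read", "write"],   # after a write, prefer a read (toy choice)
}

def schedule(state, available):
    """Return (chosen command, next state); the next state is simply
    the command just issued, since the history has length one."""
    for cmd in PRIORITIES[state]:
        if cmd in available:
            return cmd, cmd
    return None, state  # nothing available; state is unchanged
```

For example, `schedule("read", {"read", "write"})` picks `"write"` because it is the highest-priority transition out of the `"read"` state that is actually available.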

  14. An Example • [Diagram: given the current state and the commands available in the reorder queues, the FSM’s prioritized transitions (First through Fourth Preference) pick the most appropriate command to send to memory and determine the next state]

  15. How to determine priorities? • Two criteria: • A: Minimize contention costs • B: Satisfy the program’s Read/Write command mix • First method: use A, break ties with B • Second method: use B, break ties with A • Which method to use? • Combine the two methods probabilistically (details in the paper)
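The probabilistic combination of the two methods can be sketched with lexicographic sort keys. The scoring functions and the mixing probability `p` below are hypothetical stand-ins (the paper gives the real details); lower scores are assumed better for both criteria.

```python
import random

# Sketch: with probability p, rank candidates by contention cost and
# break ties by read/write-mix fit (Method 1); otherwise rank by mix
# fit and break ties by cost (Method 2). All names are illustrative.

def pick_command(candidates, cost, mix_penalty, p=0.5, rng=random):
    """cost and mix_penalty map a candidate to a lower-is-better score."""
    if rng.random() < p:
        key = lambda c: (cost(c), mix_penalty(c))   # Method 1: A, then B
    else:
        key = lambda c: (mix_penalty(c), cost(c))   # Method 2: B, then A
    return min(candidates, key=key)
```

Setting `p=1.0` or `p=0.0` degenerates to using one method exclusively, which makes the two pure strategies easy to compare against the blend.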

  16. Limitation of the History-Based Approach • Designed for one particular mix of Read/Writes • Solution: Adaptive History-Based Schedulers • Create multiple state machines: one for each Read/Write mix • Periodically select most appropriate state machine

  17. Adaptive History-Based Schedulers • [Diagram: three arbiters tuned for 2R:1W, 1R:1W, and 1R:2W command mixes; Read, Write, and Cycle Counters feed Arbiter Selection Logic, which selects the arbiter matching the observed mix]
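The selection logic on this slide can be sketched as a nearest-ratio match. The arbiter names, the counter-sampling scheme, and the use of a simple absolute-distance metric are all assumptions for illustration.

```python
# Sketch of adaptive arbiter selection: read/write counters are sampled
# periodically (driven by a cycle counter in the real design), and the
# arbiter whose design-time Read/Write ratio is closest to the observed
# ratio is selected. Names and the distance metric are hypothetical.

ARBITERS = {2.0: "arbiter_2R_1W", 1.0: "arbiter_1R_1W", 0.5: "arbiter_1R_2W"}

def select_arbiter(read_count, write_count):
    observed = read_count / max(write_count, 1)
    closest = min(ARBITERS, key=lambda ratio: abs(ratio - observed))
    return ARBITERS[closest]

print(select_arbiter(200, 100))  # observed 2R:1W mix selects 'arbiter_2R_1W'
```

Each state machine thus stays specialized for one mix, and adaptivity comes entirely from this cheap outer selection loop.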

  18. Evaluation • Used a cycle-accurate simulator for the IBM Power5 • 1.6 GHz, 266-DDR2, 4-rank, 4-bank, 2-port • Evaluated our approach against previous approaches on data-intensive applications: Stream, NAS, and microbenchmarks

  19. The IBM Power5 • 2 cores on a chip • SMT capability • Large on-chip L2 cache • Hardware prefetching • 276 million transistors • Memory Controller occupies 1.6% of chip area

  20. Results 1: Stream Benchmarks

  21. Results 2: NAS Benchmarks (1 core active)

  22. Results 3: Microbenchmarks

  23. 12 Concurrent Commands • [Diagram: the memory system pipeline (caches → Read/Write Queues → arbiter → Memory Queue (FIFO) → DRAM) supports 12 concurrent commands]

  24. DRAM Utilization • [Chart: number of active commands in DRAM, memoryless approach vs. our approach]

  25. Why does it work? • Detailed analysis in the paper • [Diagram: Memory Controller with Read/Write Queues, arbiter, Memory Queue, and DRAM, annotated: Full Reorder Queues, Low Occupancy in Reorder Queues, Full Memory Queue, Busy Memory System]

  26. Other Results • We obtain >95% of the performance of a perfect DRAM configuration (no conflicts) • Results at higher frequency and without data prefetching are in the paper • A history size of 2 works well

  27. Conclusions • Introduced adaptive history-based schedulers • Evaluated on a highly tuned system, the IBM Power5 • Performance improvement over FIFO: Stream 63%, NAS 11% • Over Memoryless: Stream 19%, NAS 5% • Little cost: 0.04% chip area increase

  28. Conclusions (cont.) • Similar arbiters can be used in other places as well, e.g. cache controllers • Can optimize for other criteria, e.g. power or power+performance.

  29. Thank you
