ATLAS: A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers


  1. ATLAS: A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers Yoongu Kim, Dongsu Han, Onur Mutlu, Mor Harchol-Balter

  2. Motivation • Modern multi-core systems employ multiple memory controllers • Applications contend with each other in multiple controllers • How to perform memory scheduling for multiple controllers?

  3. Desired Properties of Memory Scheduling Algorithm • Maximize system performance • Without starving any cores • Configurable by system software • To enforce thread priorities and QoS/fairness policies • Scalable to a large number of controllers • Should not require significant coordination between controllers • No previous scheduling algorithm satisfies all these requirements

  4. Multiple Memory Controllers [Diagram: a single-MC system (core, MC, memory) vs. a multiple-MC system (cores sharing several MCs and memories)] • Difference? The need for coordination

  5. Thread Ranking in Single-MC • Assume all requests are to the same bank • # of requests: Thread 1 < Thread 2 → Thread 1 is the shorter job • Thread ranking: Thread 1 > Thread 2 → Thread 1 is assigned the higher rank • Ranking the shorter job higher yields the optimal average stall time: 2T [Diagram: memory service and execution timelines for Thread 1 and Thread 2]

  6. Thread Ranking in Multiple-MC • Uncoordinated: MC 1's shorter job is Thread 1, so MC 1 incorrectly assigns the higher rank to Thread 1 even though the globally shorter job is Thread 2 → average stall time: 3T • Coordinated: MC 1 knows the globally shorter job is Thread 2 and correctly assigns it the higher rank → average stall time: 2.5T (saved cycles!) • Coordination → better scheduling decisions

  7. Coordination Limits Scalability • Coordination, whether MC-to-MC or through a centralized meta-MC, consumes bandwidth • To be scalable, coordination should: • exchange little information • occur infrequently

  8. The Problem and Our Goal Problem: • Previous memory scheduling algorithms are not scalable to many controllers • Not designed for multiple MCs • Require significant coordination Our Goal: • Fundamentally redesign the memory scheduling algorithm such that it • Provides high system throughput • Requires little or no coordination among MCs

  9. Outline • Motivation • Rethinking Memory Scheduling • Minimizing Memory Episode Time • ATLAS • Least Attained Service Memory Scheduling • Thread Ranking • Request Prioritization Rules • Coordination • Evaluation • Conclusion

  10. Rethinking Memory Scheduling A thread alternates between two states (episodes) • Compute episode: zero outstanding memory requests → high IPC • Memory episode: non-zero outstanding memory requests → low IPC [Plot: outstanding memory requests over time, alternating between memory episodes and compute episodes] • Goal: Minimize time spent in memory episodes
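
To make the episode model concrete, here is a minimal Python sketch (not from the paper; the trace values are made up) that splits a per-cycle count of outstanding memory requests into memory and compute episodes:

```python
def split_into_episodes(outstanding):
    """Split a per-cycle trace of outstanding memory requests into episodes.

    A memory episode is a maximal run of cycles with > 0 outstanding requests;
    a compute episode is a maximal run with 0 outstanding requests.
    Returns a list of (kind, length_in_cycles) tuples.
    """
    episodes = []
    for count in outstanding:
        kind = "memory" if count > 0 else "compute"
        if episodes and episodes[-1][0] == kind:
            episodes[-1] = (kind, episodes[-1][1] + 1)
        else:
            episodes.append((kind, 1))
    return episodes


# Hypothetical trace: 3 memory cycles, 4 compute cycles, 1 memory cycle, 1 compute cycle.
trace = [1, 2, 1, 0, 0, 0, 0, 2, 0]
print(split_into_episodes(trace))
# [('memory', 3), ('compute', 4), ('memory', 1), ('compute', 1)]
```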

  11. How to Minimize Memory Episode Time Prioritize the thread whose memory episode will end the soonest • Minimizes time spent in memory episodes across all threads • Supported by queueing theory: Shortest-Remaining-Processing-Time scheduling is optimal in a single-server queue • But what is the remaining length of a memory episode? How much longer will it last? [Plot: outstanding memory requests over time]

  12. Predicting Memory Episode Lengths We discovered: the past is an excellent predictor for the future • Large attained service → Large expected remaining service • Q: Why? A: Memory episode lengths are Pareto distributed… [Plot: outstanding memory requests over time, divided into attained service (PAST) and remaining service (FUTURE)]

  13. Pareto Distribution of Memory Episode Lengths • Memory episode lengths of SPEC benchmarks follow a Pareto distribution [Plot for 401.bzip2: Pr{Mem. episode > x} vs. x (cycles)] • The longer an episode has lasted → the longer it will last • Attained service correlates with remaining service • Favoring the least-attained-service memory episode = favoring the memory episode which will end the soonest
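
As an illustration of why the heavy Pareto tail matters, the following Monte-Carlo sketch uses hypothetical shape and scale parameters (chosen only for illustration, not taken from the paper): for Pareto-distributed episode lengths the expected remaining length grows with the service already attained, whereas for a memoryless (exponential) distribution it stays roughly constant.

```python
import random

def mean_remaining(samples, attained):
    """Average remaining length of episodes that have already lasted `attained` cycles."""
    survivors = [s - attained for s in samples if s > attained]
    return sum(survivors) / len(survivors) if survivors else float("nan")

random.seed(0)
# Hypothetical Pareto-distributed episode lengths (shape 1.5, scale 1000 cycles).
pareto = [1000 * random.paretovariate(1.5) for _ in range(200_000)]
# Memoryless baseline with a comparable scale (mean 3000 cycles), for contrast.
expo = [random.expovariate(1 / 3000) for _ in range(200_000)]

for attained in (1_000, 5_000, 20_000):
    print(attained,
          round(mean_remaining(pareto, attained)),
          round(mean_remaining(expo, attained)))
# The Pareto mean remaining length keeps growing with attained service;
# the exponential one stays roughly constant around its mean.
```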

  14. Outline • Motivation • Rethinking Memory Scheduling • Minimizing Memory Episode Time • ATLAS • Least Attained Service Memory Scheduling • Thread Ranking • Request Prioritization Rules • Coordination • Evaluation • Conclusion

  15. Least Attained Service (LAS) Memory Scheduling • Queueing theory: prioritize the job with shortest-remaining-processing-time (provably optimal) • Our approach: prioritize the memory episode with least-remaining-service • Remaining service: correlates with attained service • Attained service: tracked by a per-thread counter • Least-attained-service (LAS) scheduling: prioritize the memory episode with least-attained-service → minimize memory episode time • However, LAS does not consider long-term thread behavior

  16. Long-Term Thread Behavior • Thread 1: long memory episode • Thread 2: short memory episode followed by long compute episodes • Ranking based only on short-term behavior (the current memory episode) can differ from ranking based on long-term behavior • Prioritizing Thread 2 is more beneficial: it results in very long stretches of compute episodes

  17. Quantum-Based Attained Service of a Thread • Attained service measured within a single memory episode captures only short-term thread behavior • We divide time into large, fixed-length intervals: quanta (millions of cycles) • Attained service accumulated over a quantum captures long-term thread behavior [Plots: outstanding memory requests over time, with attained service measured per episode vs. per quantum]

  18. Outline • Motivation • Rethinking Memory Scheduling • Minimizing Memory Episode Time • ATLAS • Least Attained Service Memory Scheduling • Thread Ranking • Request Prioritization Rules • Coordination • Evaluation • Conclusion

  19. LAS Thread Ranking • During a quantum: each thread's attained service (AS) is tracked by the MCs; ASi = a thread's AS during only the i-th quantum • End of a quantum: each thread's TotalAS is computed as TotalASi = α · TotalASi-1 + (1 − α) · ASi, where a higher α means more bias towards history; threads are ranked, favoring threads with lower TotalAS • Next quantum: threads are serviced according to their ranking (a sketch of the ranking step follows below)
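
A minimal sketch of this end-of-quantum ranking step, with the per-thread counters represented as a plain dictionary; α defaults to the 0.875 history weight reported later in the deck, and the example values are made up.

```python
def rank_threads(total_as, quantum_as, alpha=0.875):
    """End-of-quantum ranking: update each thread's TotalAS and rank threads.

    TotalAS_i = alpha * TotalAS_{i-1} + (1 - alpha) * AS_i
    Threads with lower TotalAS (less attained service) get higher rank.
    """
    for thread, as_i in quantum_as.items():
        total_as[thread] = alpha * total_as.get(thread, 0.0) + (1 - alpha) * as_i
    # Lower TotalAS -> earlier in the list -> higher priority next quantum.
    return sorted(total_as, key=total_as.get)


# Example: thread "B" attained far less service this quantum, so it is ranked first.
total_as = {"A": 5000.0, "B": 4000.0}
print(rank_threads(total_as, {"A": 9000, "B": 500}))   # ['B', 'A']
```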

  20. Outline • Motivation • Rethinking Memory Scheduling • Minimizing Memory Episode Time • ATLAS • Least Attained Service Memory Scheduling • Thread Ranking • Request Prioritization Rules • Coordination • Evaluation • Conclusion

  21. ATLAS Scheduling Algorithm ATLAS • Adaptive per-Thread Least Attained Service • Request prioritization order 1. Prevent starvation: Over threshold request 2. Maximize performance: Higher LAS rank 3. Exploit locality: Row-hit request 4. Tie-breaker: Oldest request How to coordinate MCs to agree upon a consistent ranking?
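
The four rules can be viewed as a single lexicographic sort key. Below is a hedged Python sketch, not the hardware implementation; the request fields are hypothetical, and the 100K-cycle default threshold comes from the parameter slide later in this deck.

```python
from dataclasses import dataclass

@dataclass
class Request:
    thread_rank: int      # LAS rank of the issuing thread (0 = highest rank)
    row_hit: bool         # would this request hit the currently open row?
    arrival_cycle: int    # cycle at which the request entered the buffer

def atlas_priority_key(req, now, threshold=100_000):
    """Sort key implementing the ATLAS prioritization order (lower key = served first).

    1. Over-threshold requests first (starvation prevention)
    2. Higher LAS rank (lower rank number) next
    3. Row-hit requests next (locality)
    4. Oldest request as the tie-breaker
    """
    over_threshold = (now - req.arrival_cycle) > threshold
    return (not over_threshold, req.thread_rank, not req.row_hit, req.arrival_cycle)


# Example: pick the next request to service from a small buffer.
buffer = [Request(1, True, 900), Request(0, False, 950), Request(2, True, 100)]
next_req = min(buffer, key=lambda r: atlas_priority_key(r, now=1_000))
print(next_req)   # the rank-0 thread's request wins: nothing is over threshold yet
```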

  22. Outline • Motivation • Rethinking Memory Scheduling • Minimizing Memory Episode Time • ATLAS • Least Attained Service Memory Scheduling • Thread Ranking • Request Prioritization Rules • Coordination • Evaluation • Conclusion

  23. ATLAS Coordination Mechanism During a quantum: • Each MC increments the local AS of each thread End of a quantum: • Each MC sends local AS of each thread to centralized meta-MC • Meta-MC accumulates local AS and calculates ranking • Meta-MC broadcasts ranking to all MCs  Consistent thread ranking
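
A minimal sketch of this end-of-quantum exchange, assuming each MC reports its local per-thread attained service as a dictionary and a meta-MC function produces the consistent ranking; the function and variable names are illustrative only.

```python
def meta_mc_rank(per_mc_attained_service, total_as, alpha=0.875):
    """Meta-MC step at the end of a quantum.

    Each MC sends its local per-thread attained service; the meta-MC sums them,
    folds the sums into the long-term TotalAS, and returns the consistent
    ranking that is broadcast back to every MC.
    """
    quantum_as = {}
    for local_as in per_mc_attained_service:           # one dict per controller
        for thread, service in local_as.items():
            quantum_as[thread] = quantum_as.get(thread, 0) + service
    for thread, as_i in quantum_as.items():
        total_as[thread] = alpha * total_as.get(thread, 0.0) + (1 - alpha) * as_i
    return sorted(total_as, key=total_as.get)          # lower TotalAS ranked first


# Two controllers report local attained service for threads A and B.
ranking = meta_mc_rank([{"A": 3000, "B": 200}, {"A": 2500, "B": 400}], total_as={})
print(ranking)   # ['B', 'A'] -- thread B attained less service, so it ranks higher
```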

  24. Coordination Cost in ATLAS How costly is coordination in ATLAS?

  25. Properties of ATLAS • Maximize system performance → LAS ranking, bank-level parallelism, row-buffer locality • Scalable to a large number of controllers → very infrequent coordination • Configurable by system software → scale attained service with thread weight (in paper) • Low complexity: attained service requires a single counter per thread in each MC

  26. Outline • Motivation • Rethinking Memory Scheduling • Minimizing Memory Episode Time • ATLAS • Least Attained Service Memory Scheduling • Thread Ranking • Request Prioritization Rules • Coordination • Evaluation • Conclusion

  27. Evaluation Methodology • 4, 8, 16, 24, 32-core systems • 5 GHz processor, 128-entry instruction window • 512 Kbyte per-core private L2 caches • 1, 2, 4, 8, 16-MC systems • 128-entry memory request buffer • 4 banks, 2Kbyte row buffer • 40ns (200 cycles) row-hit round-trip latency • 80ns (400 cycles) row-conflict round-trip latency • Workloads • Multiprogrammed SPEC CPU2006 applications • 32 program combinations for 4, 8, 16, 24, 32-core experiments

  28. Comparison to Previous Scheduling Algorithms • FCFS, FR-FCFS [Rixner+, ISCA00] • Oldest-first, row-hit first • Low multi-core performance → do not distinguish between threads • Network Fair Queueing [Nesbit+, MICRO06] • Partitions memory bandwidth equally among threads • Low system performance → bank-level parallelism, locality not exploited • Stall-Time Fair Memory Scheduler [Mutlu+, MICRO07] • Balances thread slowdowns relative to when run alone • High coordination costs → requires heavy cycle-by-cycle coordination • Parallelism-Aware Batch Scheduler [Mutlu+, ISCA08] • Batches requests and performs thread ranking to preserve bank-level parallelism • High coordination costs → batch duration is very short

  29. System Throughput: 24-Core System • System throughput = ∑ Speedup [Plot: system throughput vs. # of memory controllers, with ATLAS improvements of 3.5%, 5.9%, 8.4%, 9.8%, and 17.0%] • ATLAS consistently provides higher system throughput than all previous scheduling algorithms

  30. System Throughput: 4-MC System • As the # of cores increases, the ATLAS performance benefit increases [Plot: system throughput vs. # of cores, with ATLAS improvements of 1.1%, 3.5%, 4.0%, 8.4%, and 10.8%]

  31. Other Evaluations In Paper • System software support • ATLAS effectively enforces thread weights • Workload analysis • ATLAS performs best for mixed-intensity workloads • Effect of ATLAS on fairness • Sensitivity to algorithmic parameters • Sensitivity to system parameters • Memory address mapping, cache size, memory latency

  32. Outline • Motivation • Rethinking Memory Scheduling • Minimizing Memory Episode Time • ATLAS • Least Attained Service Memory Scheduling • Thread Ranking • Request Prioritization Rules • Coordination • Evaluation • Conclusion

  33. Conclusions • Multiple memory controllers require coordination • Need to agree upon a consistent ranking of threads • ATLAS is a fundamentally new approach to memory scheduling • Scalable: thread ranking decisions at coarse-grained intervals • High-performance: minimizes system time spent in memory episodes (Least Attained Service scheduling principle) • Configurable: enforces thread priorities • ATLAS provides the highest system throughput compared to five previous scheduling algorithms • Performance benefit increases as the number of cores increases

  34. Thank you. Questions?

  35. ATLAS: A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers Yoongu Kim, Dongsu Han, Onur Mutlu, Mor Harchol-Balter

  36. Hardware Cost • Additional hardware storage: • For a 24-core, 4-MC system: 9kb • Not on critical path of execution

  37. System Configuration • DDR2-800 memory timing

  38. Benchmark Characteristics • Benchmarks have wildly varying levels of memory intensity • Orders of magnitude difference • Prioritizing the “light” threads over “heavy” threads is critical

  39. Workload-by-workload Result • Workloads • Results

  40. Result on 24-Core System • 24-core system with varying number of memory controllers • PAR-BSc: ideally coordinated version of PAR-BS with instantaneous knowledge of every other controller • ATLAS outperforms even this unrealistic version (PAR-BSc)

  41. Result on X-Core, Y-MC System • Systems where both the number of cores and the number of controllers are varied • PAR-BSc: ideally coordinated version of PAR-BS with instantaneous knowledge of every other controller • ATLAS outperforms even this unrealistic version (PAR-BSc)

  42. Varying Memory Intensity • Workloads composed of varying numbers of memory-intensive threads • ATLAS performs best for a balanced-mix workload • Distinguishes memory-non-intensive threads and prioritizes them over memory-intensive threads

  43. ATLAS with and w/o Coordination • ATLAS provides ~1% improvement over ATLASu (uncoordinated ATLAS) • ATLAS is specifically designed to operate with little coordination • PAR-BSc provides ~3% improvement over PAR-BS

  44. ATLAS Algorithmic Parameters • Empirically determined quantum length: 10M cycles • When quantum length is large: • Infrequent coordination • Even long memory episodes can fit within the quantum • Thread ranking is stable

  45. ATLAS Algorithmic Parameters • Empirically determined history weight: 0.875 • When the history weight is large: • Attained service is retained for longer • TotalAS changes slowly over time • Better distinguishes memory-intensive threads

  46. ATLAS Algorithmic Parameters • Empirically determined starvation threshold T: 100K cycles • When T is too low: • System performance is degraded (similar to FCFS) • When T is too high: • Starvation-freedom guarantee is lost

  47. Multi-threaded Applications • In our evaluations, all workloads consist of single-threaded applications • For multi-threaded applications, we can generalize the notion of attained service • Attained service of a multi-threaded application = sum or average of its threads’ attained service • Within-application ranking • Future work

  48. Unfairness in ATLAS • Assuming threads have equal priority, ATLAS maximizes system throughput by prioritizing shorter jobs at the expense of longer jobs • ATLAS increases unfairness • Maximum slowdown: 20.3% • Harmonic speedup: -10.3% • If threads have different priorities, system software can ensure fairness/QoS by appropriately configuring the thread priorities supported by ATLAS

  49. System Software Support • ATLAS enforces system priorities, or thread weights. • Linear relationship between thread weight and speedup.

  50. System Parameters • ATLAS performance on systems with varying cache sizes and memory timing • ATLAS performance benefit increases as contention for memory increases
