Stall-Time Fair Memory Access Scheduling

Stall-Time Fair Memory Access Scheduling Onur Mutlu and Thomas Moscibroda Computer Architecture Group Microsoft Research

Multi-Core Systems Multi-Core Chip CORE 0 CORE 1 CORE 2 CORE 3 L2 CACHE L2 CACHE L2 CACHE L2 CACHE unfairness Shared DRAM Memory System DRAM MEMORY CONTROLLER . . . DRAM Bank 0 DRAM Bank 1 DRAM Bank 2 DRAM Bank 7

DRAM Bank Operation Access Address (Row 0, Column 0) Access Address (Row 1, Column 0) Access Address (Row 0, Column 9) Access Address (Row 0, Column 1) Columns Row address 0 Row address 1 Row decoder Rows Row Buffer CONFLICT ! HIT HIT Row 1 Row 0 Empty Column address 0 Column address 9 Column address 1 Column decoder Column address 0 Data

DRAM Controllers • A row-conflict memory access takes significantly longer than a row-hit access • Current controllers take advantage of the row buffer • Commonly used scheduling policy (FR-FCFS) [Rixner, ISCA’00] (1) Row-hit (column) first: Service row-hit memory accesses first (2) Oldest-first: Then service older accesses first • This scheduling policy aims to maximize DRAM throughput • But, it is unfair when multiple threads share the DRAM system

Outline • The Problem • Unfair DRAM Scheduling • Stall-Time Fair Memory Scheduling • Fairness definition • Algorithm • Implementation • System software support • Experimental Evaluation • Conclusions

The Problem • Multiple threads share the DRAM controller • DRAM controllers are designed to maximize DRAM throughput • DRAM scheduling policies are thread-unaware and unfair • Row-hit first: unfairly prioritizes threads with high row buffer locality • Streaming threads • Threads that keep on accessing the same row • Oldest-first: unfairly prioritizes memory-intensive threads

The Problem Row decoder T0: Row 0 T0: Row 0 T0: Row 0 T0: Row 0 T0: Row 0 T0: Row 0 T0: Row 0 T1: Row 5 T1: Row 111 T0: Row 0 T1: Row 16 T0: Row 0 Request Buffer Row Buffer Row 0 Row 0 Column decoder Row size: 8KB, cache block size: 64B 128 requests of T0 serviced before T1 T0: streaming thread T1: non-streaming thread Data

Consequences of Unfairness in DRAM DRAM is the only shared resource 7.74 • Vulnerability to denial of service [Moscibroda & Mutlu, Usenix Security’07] • System throughput loss • Priority inversion at the system/OS level • Poor performance predictability 4.72 1.85 1.05

Fairness in Shared DRAM Systems • A thread’s DRAM performance dependent on its inherent • Row-buffer locality • Bank parallelism • Interference between threads can destroy either or both • A fair DRAM scheduler should take into account all factors affecting each thread’s DRAM performance • Not solely bandwidth or solely request latency • Observation: A thread’s performance degradation due to interference in DRAM mainly characterized by the extra memory-related stall-time due to contention with other threads

Stall-Time Fairness in Shared DRAM Systems • A DRAM system is fair if it slows down equal-priority threads equally • Compared to when each thread is run alone on the same system • Fairness notion similar to SMT[Cazorla, IEEE Micro’04][Luo, ISPASS’01], SoEMT [Gabor, Micro’06],and shared caches[Kim, PACT’04] • Tshared: DRAM-related stall-time when the thread is running with other threads • Talone: DRAM-related stall-time when the thread is running alone • Memory-slowdown = Tshared/Talone • The goal of the Stall-Time Fair Memory scheduler (STFM) is to equalize Memory-slowdown for all threads, without sacrificing performance • Considers inherent DRAM performance of each thread

STFM Scheduling Algorithm (1) • During each time interval, for each thread, DRAM controller • Tracks Tshared • Estimates Talone • At the beginning of a scheduling cycle, DRAM controller • Computes Slowdown = Tshared/Talone for each thread with an outstanding legal request • Computes unfairness = MAX Slowdown / MIN Slowdown • If unfairness <  • Use DRAM throughput oriented baseline scheduling policy • (1) row-hit first • (2) oldest-first

STFM Scheduling Algorithm (2) • If unfairness ≥  • Use fairness-oriented scheduling policy • (1) requests from thread with MAX Slowdown first • (2) row-hit first • (3) oldest-first • Maximizes DRAM throughput if it cannot improve fairness • Does NOT waste useful bandwidth to improve fairness • If a request does not interfere with any other, it is scheduled

How Does STFM Prevent Unfairness? T0: Row 0 T1: Row 5 T0: Row 0 T0: Row 0 T1: Row 111 T0: Row 0 T0: Row 0 T0: Row 0 T0: Row 0 T1: Row 16 Row Buffer Row 16 Row 111 Row 0 Row 0 T0 Slowdown 1.04 1.07 1.04 1.00 1.10 1.03 T1 Slowdown 1.14 1.08 1.03 1.06 1.11 1.06 1.00 Unfairness 1.06 1.03 1.03 1.04 1.06 1.04 1.03 1.00 Data  1.05

Implementation • Tracking Tshared • Relatively easy • The processor increases a counter if the thread cannot commit instructions because the oldest instruction requires DRAM access • Estimating Talone • Moreinvolvedbecause thread is not running alone • Difficult to estimate directly • Observation: • Talone = Tshared - Tinterference • Estimate Tinterference: Extra stall-time due to interference

Extra Bank Access Latency Tinterference(C) += # Banks Servicing C’s Requests Estimating Tinterference(1) • When a DRAM request from thread C is scheduled • Thread C can incur extra stall time: • The request’s row buffer hit status might be affected by interference • Estimate the row that would have been in the row buffer if the thread were running alone • Estimate the extra bank access latency the request incurs Extra latency amortized across outstanding accesses of thread C (memory level parallelism)

Bus Transfer Latency of Scheduled Request Tinterference(C’) += Estimating Tinterference(2) • When a DRAM request from thread C is scheduled • Any other thread C’ with outstanding requests incurs extra stall time • Interference in the DRAM data bus • Interference in the DRAM bank (see paper) Bank Access Latency of Scheduled Request Tinterference(C’) += * K # Banks Needed by C’ Requests

Hardware Cost • <2KB storage cost for • 8-core system with 128-entry memory request buffer • Arithmetic operations approximated • Fixed point arithmetic • Divisions using lookup tables • Not on the critical path • Scheduler makes a decision only every DRAM cycle • More details in paper

Support for System Software • Supporting system-level thread weights/priorities • Thread weights communicated to the memory controller • Larger-weight threads should be slowed down less • Each thread’s slowdown is scaled by its weight • Weighted slowdown used for scheduling • Favors threads with larger weights • OS can choose thread weights to satisfy QoS requirements • : Maximum tolerable unfairness set by system software • Don’t need fairness? Set  large. • Need strict fairness? Set  close to 1. • Other values of : trade-off fairness and throughput

Evaluation Methodology • 2-, 4-, 8-, 16-core systems • x86 processor model based on Intel Pentium M • 4 GHz processor, 128-entry instruction window • 512 Kbyte per core private L2 caches • Detailed DRAM model based on Micron DDR2-800 • 128-entry memory request buffer • 8 banks, 2Kbyte row buffer • Row-hit round-trip latency: 35ns (140 cycles) • Row-conflict latency: 70ns (280 cycles) • Benchmarks • SPEC CPU2006 and some Windows Desktop applications • 256, 32, 3 benchmark combinations for 4-, 8-, 16-core experiments

Comparison with Related Work • Baseline FR-FCFS [Rixner et al., ISCA’00] • Unfairly penalizes non-intensive threads with low-row-buffer locality • FCFS • Low DRAM throughput • Unfairly penalizes non-intensive threads • FR-FCFS+Cap • Static cap on how many younger row-hits can bypass older accesses • Unfairly penalizes non-intensive threads • Network Fair Queueing (NFQ)[Nesbit et al., Micro’06] • Per-thread virtual-time based scheduling • A thread’s private virtual-time increases when its request is scheduled • Prioritizes requests from thread with the earliest virtual-time • Equalizes bandwidth across equal-priority threads • Does not consider inherent performance of each thread • Unfairly prioritizes threads with non-bursty access patterns (idleness problem) • Unfairly penalizes threads with unbalanced bank usage (in paper)

Serviced Serviced Serviced Serviced Idleness/Burstiness Problem in Fair Queueing Only Thread 2 serviced in interval [t1,t2] since its virtual time is smaller than Thread 1’s Only Thread 3 serviced in interval [t2,t3] since its virtual time is smaller than Thread 1’s Only Thread 4 serviced in interval [t3,t4] since its virtual time is smaller than Thread 1’s Thread 1’s virtual time increases even though no other thread needs DRAM Non-bursty thread suffers large performance loss even though it fairly utilized DRAM when no other thread needed it

Unfairness on 4-, 8-, 16-core Systems Unfairness = MAX Memory Slowdown / MIN Memory Slowdown 1.26X 1.27X 1.81X

5.8% 4.1% 4.6% System Performance

Hmean-speedup (Throughput-Fairness Balance) 10.8% 9.5% 11.2%

Conclusions • A new definition of DRAM fairness: stall-time fairness • Equal-priority threads should experience equal memory-related slowdowns • Takes into account inherent memory performance of threads • New DRAM scheduling algorithm enforces this definition • Flexible and configurable fairness substrate • Supports system-level thread priorities/weights  QoS policies • Results across a wide range of workloads and systems show: • Improving DRAM fairness also improves system throughput • STFM provides better fairness and system performance than previously-proposed DRAM schedulers

Thank you. Questions?

Stall-Time Fair Memory Access Scheduling Onur Mutlu and Thomas Moscibroda Computer Architecture Group Microsoft Research

Backup

Structure of the STFM Controller

Comparison using NFQ QoS Metrics • Nesbit et al. [MICRO’06] proposed the following target for quality of service: • A thread that is allocated 1/Nthof the memory system bandwidth will run no slower than the same thread on a private memory system running at 1/Nth of the frequency of the shared physical memory system • Baseline with memory bandwidth scaled down by N • We compared different DRAM schedulers’ effectiveness using this metric • Number of violations of the above QoS target • Harmonic mean of IPC normalized to the above baseline

Violations of the NFQ QoS Target

Hmean Normalized IPC using NFQ Baseline 7.3% 5.9% 5.1% 10.3% 7.8% 9.1%

Shortcomings of the NFQ QoS Target • Low baseline (easily achievable target) for equal-priority threads • N equal-priority threads  a thread should do better than on a system with 1/Nth of the memory bandwidth • This target is usually very easy to achieve • Especially when N is large • Unachievable target in some cases • Consider two threads always accessing the same bank in an interleaved fashion  too much interference • Baseline performance very difficult to determine in a real system • Cannot scale memory frequency arbitrarily • Not knowing baseline performance makes it difficult to set thread priorities (how much bandwidth to assign to each thread)

A Case Study Unfairness: 7.28 2.07 2.08 1.87 1.27 Memory Slowdown

Windows Desktop Workloads

Enforcing Thread Weights

Effect of 

Effect of Banks and Row Buffer Size

Stall-Time Fair Memory Access Scheduling

Stall-Time Fair Memory Access Scheduling

Presentation Transcript

Stall-Time Fair Memory Access Scheduling

Memory Access Scheduling

Fair Share Scheduling

Scalable Transactional Memory Scheduling

Parallel Application Memory Scheduling

Staged Memory Scheduling

Thread Cluster Memory Scheduling : Exploiting Differences in Memory Access Behavior

Scheduling Memory Transactions

Scalable Transactional Memory Scheduling

Real-Time Scheduling

Surplus Fair Scheduling

Time-Memory Scheduling and Code Generation of Real-Time Embedded Software

Memory Access Scheduling and Binding Considering Energy Minimization in Multi-Bank Memory Systems

Back-End: Instruction Scheduling, Memory Access Instructions, and Clusters

Fair Share Scheduling

Fair Share Scheduling

Fair Access

Fair Share Scheduling

Fair Share Scheduling

Scheduling Memory Transactions

Time-Memory Scheduling and Code Generation of Real-Time Embedded Software

Wireless Fair Scheduling