
Stall-Time Fair Memory Access Scheduling

Onur Mutlu and Thomas Moscibroda

Computer Architecture Group

Microsoft Research

Multi-Core Systems

Figure: A multi-core chip with four cores (CORE 0–3), each with a private L2 cache, sharing the DRAM memory system: a DRAM memory controller in front of DRAM banks 0–7. The shared DRAM system is where unfairness arises.

DRAM Bank Operation

Figure: A DRAM bank is a two-dimensional array addressed by row and column. The row decoder first loads the addressed row into the row buffer; the column decoder then selects the requested column and returns the data. Accesses to the currently open row (e.g., Row 0, Columns 0, 1, and 9) are row-buffer hits; an access to a different row (e.g., Row 1, Column 0) is a row conflict, which must close the open row and activate the new one.

DRAM Controllers
  • A row-conflict memory access takes significantly longer than a row-hit access
  • Current controllers take advantage of the row buffer
  • Commonly used scheduling policy (FR-FCFS) [Rixner, ISCA’00]

(1) Row-hit (column) first: Service row-hit memory accesses first

(2) Oldest-first: Then service older accesses first

  • This scheduling policy aims to maximize DRAM throughput
    • But, it is unfair when multiple threads share the DRAM system
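
The FR-FCFS priority order described above can be sketched in a few lines. This is an illustrative Python model, not the controller hardware or the authors' code; the `Request` fields and example values are invented for the example.

```python
from dataclasses import dataclass

@dataclass
class Request:
    thread_id: int
    bank: int
    row: int
    arrival_time: int  # smaller = older

def fr_fcfs_pick(requests, open_rows):
    """Pick the next request: row-buffer hits first, then oldest first.

    open_rows maps bank -> row currently held in that bank's row buffer.
    """
    def priority(req):
        row_hit = open_rows.get(req.bank) == req.row
        # Sort key: row hits sort before misses, then older requests first.
        return (not row_hit, req.arrival_time)
    return min(requests, key=priority) if requests else None

# Example: the younger row-hit request is chosen over an older row-conflict one.
reqs = [Request(thread_id=0, bank=0, row=1, arrival_time=5),
        Request(thread_id=1, bank=0, row=0, arrival_time=9)]
print(fr_fcfs_pick(reqs, open_rows={0: 0}).thread_id)   # -> 1 (row hit wins)
```
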
Outline
  • The Problem
    • Unfair DRAM Scheduling
  • Stall-Time Fair Memory Scheduling
    • Fairness definition
    • Algorithm
    • Implementation
    • System software support
  • Experimental Evaluation
  • Conclusions
The Problem
  • Multiple threads share the DRAM controller
  • DRAM controllers are designed to maximize DRAM throughput
  • DRAM scheduling policies are thread-unaware and unfair
    • Row-hit first: unfairly prioritizes threads with high row buffer locality
      • Streaming threads
      • Threads that keep on accessing the same row
    • Oldest-first: unfairly prioritizes memory-intensive threads
The Problem

Figure: Request buffer and row buffer contents for two threads sharing a DRAM bank. T0 is a streaming thread whose requests all hit the open Row 0; T1 is a non-streaming thread with requests to Rows 5, 111, and 16. With an 8KB row and 64B cache blocks, row-hit-first scheduling can service 128 requests of T0 before a single request of T1.

Consequences of Unfairness in DRAM
  • Vulnerability to denial of service [Moscibroda & Mutlu, Usenix Security’07]
  • System throughput loss
  • Priority inversion at the system/OS level
  • Poor performance predictability

Figure: Example workload in which DRAM is the only shared resource; the threads’ memory slowdowns are 7.74, 4.72, 1.85, and 1.05.

Fairness in Shared DRAM Systems
  • A thread’s DRAM performance depends on its inherent
    • Row-buffer locality
    • Bank parallelism
  • Interference between threads can destroy either or both
  • A fair DRAM scheduler should take into account all factors affecting each thread’s DRAM performance
    • Not solely bandwidth or solely request latency
  • Observation: A thread’s performance degradation due to DRAM interference is mainly characterized by the extra memory-related stall-time it incurs due to contention with other threads
Stall-Time Fairness in Shared DRAM Systems
  • A DRAM system is fair if it slows down equal-priority threads equally
    • Compared to when each thread is run alone on the same system
    • Fairness notion similar to SMT [Cazorla, IEEE Micro’04] [Luo, ISPASS’01], SoEMT [Gabor, Micro’06], and shared caches [Kim, PACT’04]
  • Tshared: DRAM-related stall-time when the thread is running with other threads
  • Talone: DRAM-related stall-time when the thread is running alone
  • Memory-slowdown = Tshared/Talone
  • The goal of the Stall-Time Fair Memory scheduler (STFM) is to equalize Memory-slowdown for all threads, without sacrificing performance
    • Considers inherent DRAM performance of each thread
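
A minimal sketch of the fairness metric just defined, using invented stall-time values; `memory_slowdown` and `unfairness` are hypothetical helper names, not from the paper.

```python
def memory_slowdown(t_shared, t_alone):
    # Memory-slowdown = T_shared / T_alone for one thread.
    return t_shared / t_alone

def unfairness(slowdowns):
    # Unfairness of the system = max slowdown / min slowdown.
    return max(slowdowns) / min(slowdowns)

# Example with invented DRAM-related stall times (in cycles):
t_shared = {"T0": 1_200_000, "T1": 2_600_000}
t_alone  = {"T0": 1_000_000, "T1": 1_300_000}
s = {t: memory_slowdown(t_shared[t], t_alone[t]) for t in t_shared}
print(s)                        # {'T0': 1.2, 'T1': 2.0}
print(unfairness(s.values()))   # ~1.67: T1 is slowed down much more than T0
```
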
STFM Scheduling Algorithm (1)
  • During each time interval, for each thread, DRAM controller
    • Tracks Tshared
    • Estimates Talone
  • At the beginning of a scheduling cycle, DRAM controller
    • Computes Slowdown = Tshared/Talone for each thread with an outstanding legal request
    • Computes unfairness = MAX Slowdown / MIN Slowdown
  • If unfairness < α
    • Use the DRAM throughput-oriented baseline scheduling policy
      • (1) row-hit first
      • (2) oldest-first
STFM Scheduling Algorithm (2)
  • If unfairness ≥ α
    • Use fairness-oriented scheduling policy
      • (1) requests from thread with MAX Slowdown first
      • (2) row-hit first
      • (3) oldest-first
  • Maximizes DRAM throughput if it cannot improve fairness
  • Does NOT waste useful bandwidth to improve fairness
    • If a request does not interfere with any other, it is scheduled
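
The two-mode decision described on these slides can be sketched as follows. This is an illustrative Python model that assumes per-thread slowdown estimates are already available for every thread with outstanding requests; the threshold value `ALPHA = 1.10` is an arbitrary example, and the `Request` fields mirror the earlier FR-FCFS sketch.

```python
from collections import namedtuple

# Same request fields as in the FR-FCFS sketch earlier.
Request = namedtuple("Request", "thread_id bank row arrival_time")

ALPHA = 1.10  # maximum tolerable unfairness (the threshold called alpha in the paper)

def stfm_pick(requests, open_rows, slowdowns):
    """Choose the next DRAM request.

    slowdowns: thread id -> current slowdown estimate, for threads with
    outstanding requests; open_rows: bank -> currently open row.
    """
    if not requests:
        return None
    current_unfairness = max(slowdowns.values()) / min(slowdowns.values())
    row_hit = lambda r: open_rows.get(r.bank) == r.row

    if current_unfairness < ALPHA:
        # Baseline throughput-oriented order: row-hit first, then oldest first.
        key = lambda r: (not row_hit(r), r.arrival_time)
    else:
        # Fairness-oriented order: requests of the most-slowed-down thread
        # first, then row-hit first, then oldest first.
        slowest = max(slowdowns, key=slowdowns.get)
        key = lambda r: (r.thread_id != slowest, not row_hit(r), r.arrival_time)
    return min(requests, key=key)
```
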
How Does STFM Prevent Unfairness?

Figure: Step-by-step example with the same two threads as before (T0 streaming to Row 0; T1 accessing Rows 5, 111, and 16). As each request is serviced, the controller updates T0’s and T1’s slowdowns (ranging from 1.00 to about 1.14) and the resulting unfairness (about 1.00 to 1.06). Once unfairness grows, the fairness-oriented policy services T1’s row-conflict requests instead of letting T0’s row hits starve them.

Implementation
  • Tracking Tshared
    • Relatively easy
    • The processor increments a counter each cycle the thread cannot commit instructions because its oldest instruction requires DRAM access
  • Estimating Talone
    • More involved because the thread is not running alone
    • Difficult to estimate directly
    • Observation:
      • Talone = Tshared - Tinterference
    • Estimate Tinterference: Extra stall-time due to interference
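
A minimal sketch of this bookkeeping, assuming one counter structure per hardware thread. The class and method names are hypothetical, and the real design uses hardware counters rather than software objects.

```python
class ThreadStats:
    """Per-thread counters kept by the processor / memory controller (sketch)."""

    def __init__(self):
        self.t_shared = 0        # DRAM-related stall cycles while sharing the system
        self.t_interference = 0  # estimated extra stall cycles caused by other threads

    def tick(self, stalled_on_dram: bool):
        # Called every processor cycle: T_shared counts cycles in which the
        # thread cannot commit because its oldest instruction waits for DRAM.
        if stalled_on_dram:
            self.t_shared += 1

    @property
    def t_alone(self):
        # T_alone is not measured directly; it is estimated as
        # T_shared - T_interference (clamped to stay positive).
        return max(1, self.t_shared - self.t_interference)

    @property
    def slowdown(self):
        return self.t_shared / self.t_alone
```
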
Estimating Tinterference (1)
  • When a DRAM request from thread C is scheduled
    • Thread C can incur extra stall time:
      • The request’s row buffer hit status might be affected by interference
    • Estimate the row that would have been in the row buffer if the thread were running alone
    • Estimate the extra bank access latency the request incurs

Tinterference(C) += Extra Bank Access Latency / (# Banks Servicing C’s Requests)

  • The extra latency is amortized across the outstanding accesses of thread C (its memory-level parallelism)

Estimating Tinterference (2)
  • When a DRAM request from thread C is scheduled
    • Any other thread C’ with outstanding requests incurs extra stall time
    • Interference in the DRAM data bus:

Tinterference(C’) += Bus Transfer Latency of Scheduled Request

    • Interference in the DRAM bank (see paper):

Tinterference(C’) += K × Bank Access Latency of Scheduled Request / (# Banks Needed by C’ Requests)
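
Putting the two slides together, the interference updates on each scheduled request might look like the sketch below. It tracks the same T_interference quantity as the previous sketch; the latency inputs and the scaling constant `k` are placeholders rather than the paper's values, and `outstanding` is assumed to map each thread to the set of banks its pending requests need.

```python
def on_request_scheduled(scheduled_thread, outstanding, t_interference,
                         extra_bank_latency, bus_latency, bank_latency, k=0.5):
    """Update interference estimates when one request is scheduled (sketch).

    outstanding: thread id -> set of banks its pending requests need
    t_interference: thread id -> accumulated interference estimate (cycles)
    """
    c = scheduled_thread
    # (1) The scheduled thread itself: extra bank access latency caused by
    #     interference, amortized over the number of banks servicing its
    #     requests (its memory-level parallelism).
    t_interference[c] += extra_bank_latency / max(1, len(outstanding.get(c, ())))

    # (2) Every other thread with outstanding requests is delayed by this
    #     request: data-bus interference plus scaled bank interference.
    for other, banks in outstanding.items():
        if other == c or not banks:
            continue
        t_interference[other] += bus_latency
        t_interference[other] += k * bank_latency / len(banks)

# Example with invented latencies (in DRAM cycles):
t_int = {"T0": 0.0, "T1": 0.0}
pending = {"T0": {0, 1}, "T1": {1}}
on_request_scheduled("T0", pending, t_int,
                     extra_bank_latency=120, bus_latency=8, bank_latency=80)
print(t_int)   # {'T0': 60.0, 'T1': 48.0}
```
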

Hardware Cost
  • <2KB storage cost for
    • 8-core system with 128-entry memory request buffer
  • Arithmetic operations approximated
    • Fixed point arithmetic
    • Divisions using lookup tables
  • Not on the critical path
    • Scheduler makes a decision only every DRAM cycle
  • More details in paper
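
As one illustration of the approximation techniques listed above (fixed-point arithmetic, division via lookup tables), here is a sketch of a division approximated with a small reciprocal table. The table size, precision, and Q8 format are arbitrary choices for the example, not the parameters of the actual design.

```python
FRAC_BITS = 8    # Q8 fixed-point result (8 fractional bits)
TABLE_BITS = 6   # reciprocal table indexed by the divisor's top 6 bits

# RECIP[i] ~ floor(2^FRAC_BITS / i); entry 0 is unused.
RECIP = [0] + [(1 << FRAC_BITS) // i for i in range(1, 1 << TABLE_BITS)]

def approx_div_q8(numerator, denominator):
    """Approximate (numerator / denominator) in Q8 fixed point, no divider."""
    shift = max(0, denominator.bit_length() - TABLE_BITS)
    idx = max(1, denominator >> shift)          # quantized divisor
    return (numerator * RECIP[idx]) >> shift    # result keeps FRAC_BITS fraction bits

# Example: a slowdown of ~1.2 computed with a table lookup and shifts only.
q = approx_div_q8(1_200_000, 1_000_000)
print(q / (1 << FRAC_BITS))   # ~1.14 (coarse table; exact value is 1.2)
```
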
Support for System Software
  • Supporting system-level thread weights/priorities
    • Thread weights communicated to the memory controller
    • Larger-weight threads should be slowed down less
    • Each thread’s slowdown is scaled by its weight
    • Weighted slowdown used for scheduling
      • Favors threads with larger weights
    • OS can choose thread weights to satisfy QoS requirements
  • α: Maximum tolerable unfairness set by system software
    • Don’t need fairness? Set α large.
    • Need strict fairness? Set α close to 1.
    • Other values of α: trade off fairness and throughput
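
A sketch of how weighted slowdowns and the α threshold could be combined. The slide does not spell out the scaling function; multiplying each measured slowdown by the thread's weight is one simple interpretation, so treat the formula below as illustrative rather than the paper's exact definition, and the weight/α values as arbitrary examples.

```python
WEIGHTS = {"T0": 1.0, "T1": 2.0}  # chosen by the OS; larger weight = higher priority
ALPHA = 1.10                      # maximum tolerable unfairness, also set by the OS

def weighted_slowdowns(slowdowns, weights=WEIGHTS):
    # Scale each thread's measured slowdown by its weight so that, for the
    # same raw slowdown, a higher-weight thread looks more slowed down and is
    # prioritized sooner by the fairness-oriented policy.
    return {t: s * weights.get(t, 1.0) for t, s in slowdowns.items()}

# With equal raw slowdowns, T1 (weight 2) dominates the weighted comparison:
print(weighted_slowdowns({"T0": 1.3, "T1": 1.3}))   # {'T0': 1.3, 'T1': 2.6}
```
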
Evaluation Methodology
  • 2-, 4-, 8-, 16-core systems
    • x86 processor model based on Intel Pentium M
    • 4 GHz processor, 128-entry instruction window
    • 512 Kbyte per core private L2 caches
  • Detailed DRAM model based on Micron DDR2-800
    • 128-entry memory request buffer
    • 8 banks, 2Kbyte row buffer
    • Row-hit round-trip latency: 35ns (140 cycles)
    • Row-conflict latency: 70ns (280 cycles)
  • Benchmarks
    • SPEC CPU2006 and some Windows Desktop applications
    • 256, 32, 3 benchmark combinations for 4-, 8-, 16-core experiments
Comparison with Related Work
  • Baseline FR-FCFS [Rixner et al., ISCA’00]
    • Unfairly penalizes non-intensive threads with low row-buffer locality
  • FCFS
    • Low DRAM throughput
    • Unfairly penalizes non-intensive threads
  • FR-FCFS+Cap
    • Static cap on how many younger row-hits can bypass older accesses
    • Unfairly penalizes non-intensive threads
  • Network Fair Queueing (NFQ) [Nesbit et al., Micro’06]
    • Per-thread virtual-time based scheduling
      • A thread’s private virtual-time increases when its request is scheduled
      • Prioritizes requests from thread with the earliest virtual-time
      • Equalizes bandwidth across equal-priority threads
      • Does not consider inherent performance of each thread
    • Unfairly prioritizes threads with bursty access patterns (idleness problem)
    • Unfairly penalizes threads with unbalanced bank usage (in paper)
Idleness/Burstiness Problem in Fair Queueing

Figure: Service timeline for four threads under virtual-time fair queueing.

  • Only Thread 2 is serviced in interval [t1,t2] since its virtual time is smaller than Thread 1’s
  • Only Thread 3 is serviced in interval [t2,t3] since its virtual time is smaller than Thread 1’s
  • Only Thread 4 is serviced in interval [t3,t4] since its virtual time is smaller than Thread 1’s
  • Thread 1’s virtual time increases even though no other thread needs DRAM
  • The non-bursty thread suffers a large performance loss, even though it fairly utilized DRAM when no other thread needed it
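
The idleness problem can be reproduced with a toy model of the virtual-time rule described on the previous slide (a thread's virtual time advances only when its request is serviced; the earliest virtual time is prioritized). This is a deliberately simplified illustration, not the actual NFQ scheduler, and all numbers are made up.

```python
def virtual_time_toy(num_services, pending, vtime):
    """pending: thread -> # of outstanding requests; vtime: thread -> virtual time."""
    served = []
    for _ in range(num_services):
        ready = [t for t, n in pending.items() if n > 0]
        if not ready:
            break
        t = min(ready, key=lambda x: vtime[x])  # earliest virtual time first
        vtime[t] += 1       # virtual time advances only when a request is serviced
        pending[t] -= 1
        served.append(t)
    return served

# Thread 1 used DRAM alone earlier, so its virtual time is already large (50).
# Bursty threads 2-4 arrive with virtual time 0 and monopolize the next interval.
vtime = {1: 50, 2: 0, 3: 0, 4: 0}
pending = {1: 20, 2: 10, 3: 10, 4: 10}
served = virtual_time_toy(30, pending, vtime)
print(served.count(1))   # 0 -- the non-bursty thread is starved in this window
```
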

Unfairness on 4-, 8-, 16-core Systems

Unfairness = MAX Memory Slowdown / MIN Memory Slowdown

Figure: Unfairness results across 4-, 8-, and 16-core systems; the annotated values are 1.26X, 1.27X, and 1.81X, respectively.

Conclusions
  • A new definition of DRAM fairness: stall-time fairness
    • Equal-priority threads should experience equal memory-related slowdowns
    • Takes into account inherent memory performance of threads
  • New DRAM scheduling algorithm enforces this definition
    • Flexible and configurable fairness substrate
    • Supports system-level thread priorities/weights → enables QoS policies
  • Results across a wide range of workloads and systems show:
    • Improving DRAM fairness also improves system throughput
    • STFM provides better fairness and system performance than previously-proposed DRAM schedulers

Comparison using NFQ QoS Metrics
  • Nesbit et al. [MICRO’06] proposed the following target for quality of service:
    • A thread that is allocated 1/Nth of the memory system bandwidth will run no slower than the same thread on a private memory system running at 1/Nth of the frequency of the shared physical memory system
    • Baseline with memory bandwidth scaled down by N
  • We compared different DRAM schedulers’ effectiveness using this metric
    • Number of violations of the above QoS target
    • Harmonic mean of IPC normalized to the above baseline
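
A small sketch of how these two metrics could be computed from per-thread IPC values; all IPC numbers are invented and the helper names are hypothetical.

```python
def harmonic_mean(values):
    vals = list(values)
    return len(vals) / sum(1.0 / v for v in vals)

def nfq_qos_summary(shared_ipc, baseline_ipc):
    """baseline_ipc: per-thread IPC on a private memory system at 1/Nth bandwidth."""
    normalized = {t: shared_ipc[t] / baseline_ipc[t] for t in shared_ipc}
    violations = sum(1 for v in normalized.values() if v < 1.0)  # below QoS target
    return violations, harmonic_mean(normalized.values())

# Example with invented IPC values for two threads:
print(nfq_qos_summary({"T0": 0.9, "T1": 1.4}, {"T0": 1.0, "T1": 1.0}))
# -> (1, ~1.10): one QoS-target violation; harmonic mean of normalized IPC ~1.10
```
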
Shortcomings of the NFQ QoS Target
  • Low baseline (easily achievable target) for equal-priority threads
    • With N equal-priority threads, each thread should do better than on a system with 1/Nth of the memory bandwidth
    • This target is usually very easy to achieve
      • Especially when N is large
  • Unachievable target in some cases
    • Consider two threads always accessing the same bank in an interleaved fashion → too much interference to meet the target
  • Baseline performance very difficult to determine in a real system
    • Cannot scale memory frequency arbitrarily
    • Not knowing baseline performance makes it difficult to set thread priorities (how much bandwidth to assign to each thread)
A Case Study

Figure: Memory slowdown of each thread in the case-study workload under the evaluated schedulers; the corresponding unfairness values are 7.28, 2.07, 2.08, 1.87, and 1.27.