Multi-Core Systems

[Figure: a multi-core chip with four cores (CORE 0, CORE 1, CORE 2, CORE 3), each with its own L2 cache, all sharing a DRAM memory system: a single DRAM memory controller in front of DRAM Bank 0 through DRAM Bank 7. Unfairness arises in this shared memory system.]

DRAM Bank Operation

[Figure: DRAM bank operation. A row decoder selects one of the rows; the selected row is loaded into the row buffer; a column decoder then reads the requested column out of the buffer onto the data bus. Animated access sequence: (Row 0, Column 0) finds the row buffer empty and loads Row 0; (Row 0, Column 1) and (Row 0, Column 9) are row-buffer HITs; (Row 1, Column 0) is a row CONFLICT, which must replace Row 0 with Row 1 before the data can be read.]
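
The hit/conflict behavior in the figure is easy to capture in a few lines. Below is a minimal, illustrative Python model of a single bank and its row buffer; the class and variable names are mine, and the 50ns/100ns latencies are the ones quoted later in the talk.

```python
# Minimal model of one DRAM bank with a single row buffer.
# Simplification: a first access to an empty buffer is charged
# like a conflict here; real DRAM makes it slightly cheaper.

class DramBank:
    ROW_HIT_NS = 50        # requested row already in the row buffer
    ROW_CONFLICT_NS = 100  # close the open row, then open the new one

    def __init__(self):
        self.open_row = None  # row buffer starts empty

    def access(self, row, column):
        """Service one access; return (outcome, latency in ns)."""
        if self.open_row == row:
            return "HIT", self.ROW_HIT_NS
        self.open_row = row  # activate the requested row
        return "CONFLICT", self.ROW_CONFLICT_NS

bank = DramBank()
for row, col in [(0, 0), (0, 1), (0, 9), (1, 0)]:
    print(f"(Row {row}, Column {col}) -> {bank.access(row, col)}")
# The first access loads Row 0 (charged as a conflict here),
# the next two HIT, and (Row 1, Column 0) CONFLICTs.
```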

DRAM Controllers
  • A row-conflict memory access takes much longer than a row-hit access (100ns vs. 50ns)
  • Current controllers take advantage of the row buffer
  • Commonly used scheduling policy (FR-FCFS; sketched in code after this list):
    • Row-hit first: Service row-hit memory accesses first
    • Oldest-first: Then service older accesses first
  • This scheduling policy aims to
    • Maximize DRAM throughput
    • But it opens the door to a new form of denial of service when multiple threads share the memory system
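
As a concrete reference, here is a small Python sketch of the FR-FCFS selection rule described above; the Request type and field names are illustrative, not a hardware interface.

```python
from dataclasses import dataclass

@dataclass
class Request:
    thread: int   # issuing thread
    row: int      # DRAM row the request targets
    arrival: int  # cycle the request entered the buffer

def fr_fcfs(request_buffer, open_row):
    """FR-FCFS: row-hit requests first, then oldest first."""
    hits = [r for r in request_buffer if r.row == open_row]
    candidates = hits if hits else request_buffer
    return min(candidates, key=lambda r: r.arrival)

buf = [Request(thread=1, row=5, arrival=0),   # oldest, but a row conflict
       Request(thread=0, row=0, arrival=1),   # row hit
       Request(thread=0, row=0, arrival=2)]   # row hit, newer
print(fr_fcfs(buf, open_row=0))  # -> thread 0's request from cycle 1
```

Note that thread 1's request, although oldest, is passed over as long as any row hit is buffered; this is exactly the property a memory performance hog exploits.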
Memory Performance Attacks: Denial of Memory Service in Multi-Core Systems

Thomas Moscibroda and Onur Mutlu

Computer Architecture Group

Microsoft Research

Outline
  • The Problem
    • Memory Performance Hogs
  • Preventing Memory Performance Hogs
  • Our Solution
    • Fair Memory Scheduling
  • Experimental Evaluation
  • Conclusions
The Problem
  • Multiple threads share the DRAM controller
  • DRAM controllers are designed to maximize DRAM throughput
  • DRAM scheduling policies are thread-unaware and unfair
    • Row-hit first: unfairly prioritizes threads with high row buffer locality
      • Streaming threads
      • Threads that keep on accessing the same row
    • Oldest-first: unfairly prioritizes memory-intensive threads
  • Vulnerable to denial of service attacks
Memory Performance Hog (MPH)
  • A thread that exploits unfairness in the DRAM controller
    • to deny memory service (for long time periods) to threads co-scheduled on the same chip
  • Can severely degrade other threads’ performance
  • While maintaining most of its own performance!
  • Easy to write an MPH
    • An MPH can be designed intentionally (malicious)
    • Or, unintentionally!
    • Implications in commodity multi-core systems:
      • widespread performance degradation and unpredictability
      • user discomfort and loss of productivity
      • vulnerable hardware could lose market share
An Example Memory Performance Hog

STREAM

  • Sequential memory access
  • Very high row buffer locality (96% hit rate)
  • Memory intensive
A Co-Scheduled Application

RDARRAY

  • Random memory access
  • Very low row buffer locality (3% hit rate)
  • Similarly memory intensive (both access patterns are sketched below)
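
The contrast between the two access patterns can be sketched in a few lines. The following Python snippet is illustrative only (the originals are native microbenchmarks): it generates the two address streams and measures their row-buffer locality under the 8KB rows used later in the talk, so the exact hit rates differ slightly from the measured 96%/3%.

```python
import random

ROW_BYTES, ACCESS_BYTES, N_ROWS = 8192, 64, 1024

def stream_addresses(n):
    """Sequential 64B accesses: very high row-buffer locality."""
    return [i * ACCESS_BYTES for i in range(n)]

def rdarray_addresses(n):
    """Random accesses: almost no row-buffer locality."""
    return [random.randrange(N_ROWS * ROW_BYTES) for _ in range(n)]

def row_hit_rate(addresses):
    rows = [a // ROW_BYTES for a in addresses]
    hits = sum(cur == prev for prev, cur in zip(rows, rows[1:]))
    return hits / (len(rows) - 1)

print(f"stream  row hit rate: {row_hit_rate(stream_addresses(10_000)):.1%}")
print(f"rdarray row hit rate: {row_hit_rate(rdarray_addresses(10_000)):.1%}")
```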
What Does the MPH Do?

[Figure: the memory controller's request buffer filling with T0's (stream's) requests, all to Row 0, while T1's (rdarray's) requests to Rows 5, 111, and 16 wait behind them. Row 0 sits in the row buffer, so every T0 request is a row hit and is serviced first; each T1 request is a row conflict. Row size: 8KB, data size: 64B, so 8192 B / 64 B = 128 requests of stream are serviced before a single request of rdarray.]

Already a Problem in Real Dual-Core Systems

[Figure: measured slowdowns when stream and rdarray run together: 2.82X slowdown for rdarray versus only 1.18X for stream. Results on an Intel Pentium D running Windows XP; similar results for Intel Core Duo and AMD Turion, and on Fedora.]

More Severe Problem with 4 and 8 Cores
  • Simulated config:
    • 512 KB private L2 caches
    • 3-wide processors
    • 128-entry instruction window
Outline
  • The Problem
    • Memory Performance Hogs
  • Preventing Memory Performance Hogs
  • Our Solution
    • Fair Memory Scheduling
  • Experimental Evaluation
  • Conclusions
Preventing Memory Performance Hogs
  • MPHs cannot be prevented in software
    • Fundamentally hard to distinguish between malicious and unintentional MPHs
  • Software has no direct control over DRAM controller’s scheduling
    • OS/VMM can adjust applications’ virtual→physical page mappings
    • No control over DRAM controller’s hardwired page→bank mappings
  • MPHs should be prevented in hardware
    • Hardware also cannot distinguish malicious vs unintentional
    • Goal: Contain and limit MPHs by providing fair memory scheduling
Fairness in Shared DRAM Systems
  • A thread’s DRAM performance depends on its inherent
    • Row-buffer locality
    • Bank parallelism
  • Interference between threads can destroy either or both
  • Well-studied notions of fairness are inadequate
  • Fair scheduling cannot simply
    • Equalize memory bandwidth among threads
      • Unfairly penalizes threads with good bandwidth utilization
      • DRAM scheduling is not a pure bandwidth allocation problem
        • Unlike a network wire, DRAM is a stateful system
    • Equalize the latency or number of requests among threads
      • Unfairly penalizes threads with good row-buffer locality
        • Row-hit requests are less costly than row-conflict requests

[Figure: equalizing bandwidth, latency, or request counts does not equalize performance; fairness must be defined in terms of slowdown.]

Our Definition of DRAM Fairness
  • A DRAM system is fair if it slows down each thread equally
    • Compared to when the thread is run alone
  • Experienced Latency = EL(i) = Cumulative DRAM latency of thread i running with other threads
  • Ideal Latency= IL(i) = Cumulative DRAM latency of thread i if it were run alone on the same hardware resources
  • DRAM-Slowdown(i) = EL(i)/IL(i)
  • The goal of a fair scheduling policy is to equalize DRAM-Slowdown(i) for all threads i (see the sketch after this list)
    • Considers inherent DRAM performance of each thread
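
In code, the metric is just two ratios. This is a direct transcription of the definitions above; the latency numbers in the example are hypothetical.

```python
def dram_slowdown(el_ns, il_ns):
    """DRAM-Slowdown(i) = EL(i) / IL(i)."""
    return el_ns / il_ns

def unfairness(slowdowns):
    """Unfairness index: MAX slowdown over MIN slowdown."""
    return max(slowdowns) / min(slowdowns)

# Hypothetical cumulative DRAM latencies for two co-scheduled threads:
s0 = dram_slowdown(el_ns=120_000, il_ns=100_000)   # 1.2
s1 = dram_slowdown(el_ns=300_000, il_ns=100_000)   # 3.0
print(unfairness([s0, s1]))  # 2.5: far from the fair value of 1.0
```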
Outline
  • The Problem
    • Memory Performance Hogs
  • Preventing Memory Performance Hogs
  • Our Solution
    • Fair Memory Scheduling
  • Experimental Evaluation
  • Conclusions
Fair Memory Scheduling Algorithm (1)
  • During each time interval of β cycles, for each thread, the DRAM controller
    • Tracks EL (Experienced Latency)
    • Estimates IL (Ideal Latency)
  • At the beginning of a scheduling cycle, DRAM controller
    • Computes Slowdown = EL/IL for each thread
    • Computes unfairness index = MAX Slowdown / MIN Slowdown
  • If unfairness < α
    • Use baseline scheduling policy to maximize DRAM throughput
      • (1) row-hit first
      • (2) oldest-first
Fair Memory Scheduling Algorithm (2)
  • If unfairness ≥ α
    • Use fairness-oriented scheduling policy
      • (1) requests from thread with MAX Slowdown first (if any)
      • (2) row-hit first
      • (3) oldest-first
    • Maximizes DRAM throughput if fairness cannot be improved (see the sketch after this slide)
  • α: Maximum tolerable unfairness
  • β: Lever between short-term versus long-term fairness
  • System software determines α and β
    • Conveyed to the hardware via privileged instructions
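
Putting the two modes together, here is an illustrative, self-contained Python sketch of the selection rule from these two slides. Requests are (thread, row, arrival) tuples; the function and variable names are mine, and a real implementation would of course be hardware.

```python
def fr_fcfs(requests, open_row):
    """Baseline: row-hit first, then oldest first."""
    hits = [r for r in requests if r[1] == open_row]
    return min(hits or requests, key=lambda r: r[2])

def fairmem_select(requests, open_row, slowdowns, alpha):
    """One scheduling decision of the FairMem policy."""
    if max(slowdowns.values()) / min(slowdowns.values()) < alpha:
        return fr_fcfs(requests, open_row)         # throughput mode
    victim = max(slowdowns, key=slowdowns.get)     # MAX-slowdown thread
    preferred = [r for r in requests if r[0] == victim]
    # If the victim has nothing buffered, fall back to all requests,
    # maximizing DRAM throughput when fairness cannot be improved.
    return fr_fcfs(preferred or requests, open_row)

buf = [(0, 0, 1),   # MPH: row hit, newer
       (1, 5, 0)]   # victim: row conflict, older
print(fairmem_select(buf, open_row=0,
                     slowdowns={0: 1.1, 1: 2.4}, alpha=1.5))
# -> (1, 5, 0): the slowed-down thread is serviced despite T0's row hit
```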
How Does FairMem Prevent an MPH?

[Figure: the same request buffer, now scheduled by FairMem. Service alternates between T0's Row 0 hits and T1's conflicting requests to Rows 5, 111, and 16: whenever T1's tracked slowdown pulls ahead, its request is serviced next. The per-step T0 slowdowns range over 1.00 to 1.10, T1's over 1.00 to 1.14, and the unfairness index stays between 1.00 and 1.06 throughout.]

Hardware Implementation
  • Tracking Experienced Latency
    • Relatively easy
    • Update a per-thread latency counter as long as there is an outstanding memory request for a thread
  • Estimating Ideal Latency
    • More involved because the thread is not running alone
    • Need to estimate inherent row buffer locality
      • Keep track of the row that would have been in the row buffer if thread were running alone
      • For each memory request, update a per-thread latency counter based on “would-have-been” row buffer state
  • Efficient implementation possible – details in paper (a software sketch of the shadow-row idea follows)
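
The "would-have-been" bookkeeping can be sketched in software as a per-thread shadow of the row buffer. This only illustrates the idea in the bullets above; the names and structure are mine, and the paper's hardware implementation differs.

```python
ROW_HIT_NS, ROW_CONFLICT_NS = 50, 100  # latencies from the methodology

class IdealLatencyEstimator:
    """Estimate IL(i) using a per-thread shadow row buffer."""

    def __init__(self):
        self.shadow_row = {}  # thread -> row it would have open if alone
        self.il_ns = {}       # thread -> estimated ideal latency so far

    def on_request(self, thread, row):
        # Would this access have been a row hit, had the thread run alone?
        hit = self.shadow_row.get(thread) == row
        self.il_ns[thread] = (self.il_ns.get(thread, 0)
                              + (ROW_HIT_NS if hit else ROW_CONFLICT_NS))
        self.shadow_row[thread] = row  # alone, this row would stay open

est = IdealLatencyEstimator()
for thread, row in [(0, 0), (1, 5), (0, 0), (1, 9)]:
    est.on_request(thread, row)
print(est.il_ns)  # {0: 150, 1: 200}: thread 0's second access is a hit
```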
Outline
  • The Problem
    • Memory Performance Hogs
  • Preventing Memory Performance Hogs
  • Our Solution
    • Fair Memory Scheduling
  • Experimental Evaluation
  • Conclusions
Evaluation Methodology
  • Instruction-level performance simulation
  • Detailed x86 processor model based on Intel Pentium M
  • 4 GHz processor, 128-entry instruction window
  • 512 KB per-core private L2 caches
  • Detailed DRAM model
    • 8 banks, 2 KB row buffer
    • Row-hit latency: 50ns (200 cycles)
    • Row-conflict latency: 100ns (400 cycles)
  • Benchmarks
    • Stream (MPH) and rdarray
    • SPEC CPU 2000 and Olden benchmark suites
Dual-core Results

  • MPH with non-intensive VPR
    • Fairness improvement: 7.9X (unfairness reduced to 1.11)
    • System throughput improvement: 4.5X
  • MPH with intensive MCF
    • Fairness improvement: 4.8X (unfairness reduced to 1.08)
    • System throughput improvement: 1.9X

4-Core Results
  • Non-memory-intensive applications suffer the most (VPR)
  • Low row-buffer locality hurts (MCF)
  • Fairness improvement: 6.6X
  • System throughput improvement: 2.2X
8-Core Results
  • Fairness improvement: 10.1X
  • System throughput improvement: 1.97X
Effect of Row Buffer Size
  • Vulnerability increases with row buffer size
  • Also with memory latency (in paper)
Outline
  • The Problem
    • Memory Performance Hogs
  • Preventing Memory Performance Hogs
  • Our Solution
    • Fair Memory Scheduling
  • Experimental Evaluation
  • Conclusions
Conclusions
  • Introduced the notion of memory performance attacks in shared DRAM memory systems
  • Unfair DRAM scheduling is the cause of the vulnerability
  • More severe problem in future many-core systems
  • We provide a novel definition of DRAM fairness
    • Threads should experience equal slowdowns
  • New DRAM scheduling algorithm enforces this definition
    • Effectively prevents memory performance attacks
  • Preventing attacks also improves system throughput!