On the Interaction Between Commercial Workloads and Memory Systems in High-Performance Servers

Per Stenström

Department of Computer Engineering, Chalmers,

Göteborg, Sweden

http://www.ce.chalmers.se/~pers

Fredrik Dahlgren, Magnus Karlsson, and Jim Nilsson

in collaboration with

Sun Microsystems and Ericsson Research

Motivation
  • Database applications dominate (32%)
  • Yet, major focus is on scientific/eng. apps (16%)
Project Objective
  • Design principles for high-performance memory systems for emerging applications
  • Systems considered:
    • high-performance compute nodes
    • SMP and DSM systems built out of them
  • Applications considered:
    • Decision support and on-line transaction processing
    • Emerging applications
      • Computer graphics
      • video/sound coding/decoding
      • handwriting recognition
      • ...
Outline
  • Experimental platform
  • Memory system issues studied
    • Working set size in DSS workloads
    • Prefetch approaches for pointer-intensive workloads (such as in OLTP)
    • Coherence issues in OLTP workloads
  • Concluding remarks
Experimental Platform

[Figure: full-system simulation platform: the application and operating system (Linux) run on simulated devices (Interrupt, Ethernet, TTY, SCSI) and a Sparc V8 CPU with cache ($) and memory (M); single- and multiprocessor system models.]

  • Platform enables:
    • analysis of commercial workloads
    • analysis of OS effects
    • tracking architectural events to the OS or application level
Outline
  • Experimental platform
  • Memory system issues studied
    • Working set size in DSS workloads
    • Prefetch approaches for pointer-intensive workloads (such as in OLTP)
    • Coherence issues in OLTP workloads
  • Concluding remarks
Decision-Support Systems (DSS)

[Figure: a DSS query plan: scan nodes feed nested join nodes from level 1 up to level i.]

Compile a list of matching entries in several database relations.

Will moderately sized caches suffice for huge databases?

Our Findings

[Figure: cache miss rate versus cache size: the miss rate drops sharply once MWS fits in the cache, then in smaller steps as DWS 1, DWS 2, ..., DWS i fit.]
  • MWS: footprint of instructions and private data to access a single tuple
    • typically small (< 1 Mbyte) and not affected by database size
  • DWS: footprint of database data (tuples) accessed across consecutive invocations of same scan node
    • typically small impact (~0.1%) on overall miss rate
Methodological Approach

Challenges:

  • Not feasible to simulate huge databases
  • Need source code: we used PostgreSQL and MySQL

Approach:

  • Analytical model using
    • parameters that describe the query
    • parameters measured on downscaled query executions
    • system parameters
Footprints and Reuse Characteristics in DSS

[Figure: the query tree annotated with footprints per tuple access: MWS and DWS1 at the leaf scan, up to MWS and DWSi at level i.]
  • MWS: instructions, private, and metadata
    • can be measured on downscaled simulation
  • DWS: all tuples accessed at lower levels
    • can be computed based on query composition and prob. of match
Analytical Model: An Overview
  • Goal: predict the miss rate versus cache size for fully associative caches with LRU replacement in single-processor systems
  • Number of cold misses: size of footprint / block size
    • |MWS| is measured
    • |DWSi| is computed from parameters describing the query (size of relations, probability of matching a search criterion, index versus sequential scan, etc.)
  • Number of capacity misses for a tuple access at level i (C = cache size):
    • CM0 · (1 − (C − C0) / (|MWS| − C0)) if C0 < C < |MWS|
    • size of tuple / block size if |MWS| <= C < |MWS| + |DWSi|
  • Number of accesses per tuple: measured
  • Total number of misses and accesses: computed
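As a sketch, the per-tuple capacity-miss formula can be written out as a function. The parameter names are my own labels for the slide's symbols (CM0, C0, |MWS|, |DWSi|), all sizes are in bytes, and the behavior below C0 is an assumption, since the slide only gives the two middle regions:

```c
/* Sketch of the per-tuple capacity-miss formula for a fully associative
 * LRU cache. cm0, c0, mws, dws are my labels for the slide's CM0, C0,
 * |MWS|, and |DWSi|; all sizes in bytes. */
double capacity_misses(double cache, double cm0, double c0,
                       double mws, double dws,
                       double tuple_size, double block_size)
{
    if (cache <= c0)
        return cm0; /* assumed: below C0 every access in CM0 misses */
    if (cache < mws)
        /* misses fall linearly from CM0 to 0 as the cache grows
         * from C0 toward |MWS| */
        return cm0 * (1.0 - (cache - c0) / (mws - c0));
    if (cache < mws + dws)
        /* MWS fits: each tuple access misses once per block of the tuple */
        return tuple_size / block_size;
    return 0.0; /* cache holds both MWS and DWSi: no capacity misses */
}
```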
Model Validation

[Figure: miss ratio versus cache size (Kbytes) for Q3 on PostgreSQL: 3 levels, 1 sequential scan, 2 index scans, 2 nested-loop joins.]

Goal:

  • Prediction accuracy for queries with different compositions
    • Q3, Q6, and Q10 from TPC-D
  • Prediction accuracy when scaling up the database
    • parameters measured on a 5-Mbyte database used to predict 200-Mbyte databases
  • Robustness across database engines
    • Two engines: PostgreSQL and MySQL
Model Predictions: Miss rates for Huge Databases
  • Miss-rate components for instructions, private, and meta data decay rapidly (by 128 Kbytes)
  • Miss-rate component for database data is small

What’s in the tail?

Outline
  • Experimental platform
  • Memory system issues studied
    • Working set size in DSS workloads
    • Prefetch approaches for pointer-intensive workloads (such as in OLTP)
    • Coherence issues in OLTP workloads
  • Concluding remarks
Cache Issues for Linked Data Structures
  • Pointer-chasing shows up in many interesting applications:
    • 35% of the misses in OLTP (TPC-B)
    • 32% of the misses in an expert system
    • 21% of the misses in Raytrace
  • Traversal of lists may exhibit poor temporal locality
  • Results in chains of data-dependent loads, called pointer-chasing
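The chain of data-dependent loads can be seen in a minimal list traversal (a sketch; the node layout is illustrative, not from the talk):

```c
#include <stddef.h>

/* Minimal pointer-chasing traversal: the address of node i+1 is the
 * result of loading node i, so the cache misses form a serial chain
 * that out-of-order hardware cannot overlap. */
struct node { int key; struct node *next; };

int list_sum(const struct node *n)
{
    int sum = 0;
    while (n) {
        sum += n->key;  /* use the data of the current node */
        n = n->next;    /* data-dependent load: serializes the misses */
    }
    return sum;
}
```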
SW Prefetch Techniques to Attack Pointer-Chasing
  • Greedy Prefetching (G): limited when computation per node < latency
  • Jump-Pointer Prefetching (J): limited for short lists or when the traversal is not known a priori
  • Prefetch Arrays (P.S/P.H): a generalization of G and J that addresses the above shortcomings
    • trades memory space and bandwidth for more latency tolerance
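A minimal sketch of greedy (G) and jump-pointer (J) prefetching on a list, assuming GCC/Clang's `__builtin_prefetch`; the `jump` field and node layout are illustrative, not from the talk:

```c
#include <stddef.h>

/* The jump field is the extra pointer (set up to point a few nodes
 * ahead) that jump-pointer prefetching stores in each node. */
struct jnode { int key; struct jnode *next; struct jnode *jump; };

int sum_greedy(const struct jnode *n)
{
    int sum = 0;
    while (n) {
        if (n->next)
            __builtin_prefetch(n->next); /* G: prefetch one node ahead */
        sum += n->key;
        n = n->next;
    }
    return sum;
}

int sum_jump(const struct jnode *n)
{
    int sum = 0;
    while (n) {
        if (n->jump)
            __builtin_prefetch(n->jump); /* J: prefetch several nodes ahead */
        sum += n->key;
        n = n->next;
    }
    return sum;
}
```

A prefetch array generalizes this by keeping pointers to the first few nodes in an array at the head of the list and prefetching them all before the loop starts, covering the early nodes that jump pointers cannot.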

Results: Hash Tables and Lists in Olden

[Figure: results for HEALTH and MST under baseline (B), greedy (G), jump-pointer (J), and software/hardware prefetch arrays (P.S, P.H).]

Prefetch Arrays do better because:

  • MST has short lists and little computation per node
  • They prefetch data for the first nodes in HEALTH, unlike jump-pointer prefetching
Results: Tree Traversals in OLTP and Olden

[Figure: results for DB.tree and Tree.add under B, G, J, P.S, and P.H.]

Hardware-based prefetch arrays do better because:

  • Traversal path not known in DB.tree (depth first search)
  • Data for the first nodes prefetched in Tree.add
other results in brief
Other Results in Brief
  • Impact of longer memory latencies:
    • Robust for lists
    • For trees, prefetch arrays may cause severe cache pollution
  • Impact of memory bandwidth
    • Performance improvements sustained for bandwidths of typical high-end servers (2.4 Gbytes/s)
    • Prefetch arrays may suffer for trees: severe contention was observed on low-bandwidth systems (640 Mbytes/s)
  • Node insertion and deletion for jump pointers and prefetch arrays
    • results in instruction overhead (−); however, insertion/deletion is itself sped up by prefetching (+)
Outline
  • Experimental platform
  • Memory system issues studied
    • Working set size in DSS workloads
    • Prefetch approaches for pointer-intensive workloads (such as in OLTP)
    • Coherence issues in OLTP workloads
  • Concluding remarks
Coherence Issues in OLTP

[Figure: a DSM system built from SMP nodes; each SMP node couples processors (P) with caches ($) and memories (M) over a bus.]
  • Favorite protocol: write-invalidate
  • Ownership overhead: invalidations cause write stall and inval. traffic
Ownership Overhead in OLTP

Simulation setup:

  • CC-NUMA with 4 nodes
  • MySQL, TPC-B, 600 MB database
  • 40% of all ownership transactions stem from load/store sequences
Techniques to Attack Ownership Overhead
  • Dynamic detection of migratory sharing
    • detects two load/store sequences by different processors
    • covers only a subset of all load/store sequences (~40% in OLTP)
  • Static detection of load/store sequences
    • compiler algorithms tag a load followed by a store and bring the block into the cache in exclusive state
    • poses problems in TPC-B
New Protocol Extension

Criterion: on a load miss from processor i followed by a global store from i, tag the block as Load/Store.
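A toy sketch of this criterion as a per-block state machine; all names and fields are my own, and a real directory protocol tracks far more state:

```c
/* A load miss from processor i followed by a global store from the same
 * i tags the block Load/Store, so later load misses can fetch it
 * exclusively and avoid a separate ownership transaction. */
enum ls_tag { TAG_PLAIN, TAG_LOADSTORE };

struct block_state {
    enum ls_tag tag;
    int last_load_miss_cpu; /* -1 when no load miss is pending */
};

void on_load_miss(struct block_state *b, int cpu)
{
    b->last_load_miss_cpu = cpu;
}

void on_global_store(struct block_state *b, int cpu)
{
    if (b->last_load_miss_cpu == cpu)
        b->tag = TAG_LOADSTORE; /* load/store sequence detected */
    b->last_load_miss_cpu = -1;
}

/* Should the next load miss fetch the block in exclusive state? */
int fetch_exclusive(const struct block_state *b)
{
    return b->tag == TAG_LOADSTORE;
}
```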

Concluding Remarks
  • Focus on DSS and OLTP has revealed challenges not exposed by traditional applications
    • Pointer-chasing
    • Load/store optimizations
  • Application scaling not fully understood
    • Our work on combining simulation with analytical modeling shows some promise