
A Performance Comparison of DRAM Memory System Optimizations for SMT Processors

Presentation Transcript


  1. A Performance Comparison of DRAM Memory System Optimizations for SMT Processors
  Zhichun Zhu, ECE Department, Univ. Illinois at Chicago
  Zhao Zhang, ECE Department, Iowa State Univ.

  2. DRAM Memory Optimizations
  Optimizations at the DRAM side can make a big difference on single-threaded processors:
  • Enhancement of the chip interface/interconnect
  • Access scheduling [Hong et al. HPCA’99, Mathew et al. HPCA’00, Rixner et al. ISCA’00]
  • DRAM-side locality [Cuppu et al. ISCA’99, ISCA’01, Zhang et al. MICRO’00, Lin et al. HPCA’01]

  3. How Does SMT Impact the Memory Hierarchy?
  • Less performance loss per cache miss to DRAM memory
    – Lower benefit from DRAM-side optimizations?
  • But more cache misses due to cache contention
    – Much more pressure on main memory
  • Is DRAM memory design more important or not?

  4. Outline
  • Motivation
  • Memory optimization techniques
  • Thread-aware memory access scheduling
    • Outstanding request-based
    • Resource occupancy-based
  • Methodology
  • Memory performance analysis on SMT systems
    • Effectiveness of single-thread techniques
    • Effectiveness of thread-aware schemes
  • Conclusion

  5. Memory Optimization Techniques
  • Page modes (see the sketch below)
    • Open page: good for programs with good locality
    • Close page: good for programs with poor locality
  • Mapping schemes
    • Exploitation of concurrency (multiple channels, chips, banks)
    • Row-buffer conflicts
  • Memory access scheduling
    • Reordering of concurrent accesses
    • Reduces average latency and improves bandwidth utilization
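
A minimal sketch of the open-page vs. close-page trade-off mentioned above. The latency constants, class, and method names here are illustrative assumptions, not the parameters or code of the paper's simulator; they only show why locality favors open page and a scattered access stream favors close page.

```python
# Hypothetical, illustrative timing constants (in memory-bus cycles);
# these are NOT the parameters used in the paper's simulator.
T_ACT, T_CAS, T_PRE = 3, 3, 3

class Bank:
    """One DRAM bank with a single row buffer (illustrative model)."""
    def __init__(self, policy):
        self.policy = policy      # "open" or "close"
        self.open_row = None      # row currently latched in the row buffer

    def access(self, row):
        """Return the latency of one column access under the chosen page policy."""
        if self.policy == "open":
            if self.open_row == row:              # row-buffer hit
                latency = T_CAS
            elif self.open_row is None:           # bank idle/precharged
                latency = T_ACT + T_CAS
            else:                                 # row-buffer conflict
                latency = T_PRE + T_ACT + T_CAS
            self.open_row = row                   # leave the row open
        else:                                     # close-page policy
            latency = T_ACT + T_CAS               # every access activates its row
            self.open_row = None                  # auto-precharge afterwards
        return latency

# A stream with good locality (same row) favors open page;
# a stream that keeps switching rows favors close page.
local, scattered = [0, 0, 0, 0], [0, 5, 2, 7]
for trace in (local, scattered):
    for policy in ("open", "close"):
        bank = Bank(policy)
        print(policy, trace, sum(bank.access(r) for r in trace))
```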

  6. Memory Access Scheduling for Single-Threaded Systems
  • Hit-first: a row-buffer hit has a higher priority than a row-buffer miss
  • Read-first: a read has a higher priority than a write
  • Age-based: an older request has a higher priority than a newer one
  • Criticality-based: a critical request has a higher priority than a non-critical one
  (A sketch combining the first three priorities follows below.)
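
A minimal sketch of how the hit-first, read-first, and age-based priorities on this slide can be combined into one scheduling decision. The `Request` fields and the `pick_next` helper are assumptions for illustration, not the paper's implementation; criticality-based priority is omitted.

```python
from dataclasses import dataclass

@dataclass
class Request:
    row: int            # DRAM row targeted by the request
    is_read: bool       # reads are favored over writes
    arrival_cycle: int  # used for the age-based tie-break

def pick_next(pending, open_row):
    """Select the next request for a bank whose row buffer holds `open_row`:
    row-buffer hits before misses, reads before writes, oldest first."""
    return min(
        pending,
        key=lambda r: (
            r.row != open_row,   # False (0) = hit, True (1) = miss -> hit-first
            not r.is_read,       # read-first
            r.arrival_cycle,     # age-based: older requests win ties
        ),
    )

# Example: a younger read hit beats an older write miss.
queue = [Request(row=7, is_read=False, arrival_cycle=10),
         Request(row=3, is_read=True,  arrival_cycle=50)]
assert pick_next(queue, open_row=3).row == 3
```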

  7. Memory Access Concurrency with Multithreaded Processors
  [Diagram: requests flowing from processor to memory, contrasting the single-threaded and multi-threaded cases]

  8. Thread-Aware Memory Scheduling
  • New dimension in memory scheduling for SMT systems: considering the current state of each thread
  • States related to memory accesses
    • Number of outstanding requests
    • Number of processor resources occupied

  9. Outstanding Request-Based Scheme
  • Request-based: a request generated by a thread with fewer pending requests has a higher priority
  [Timeline: requests HA1 HA2 HB1 HA3 HB2 HA4 from threads A and B are reordered to HB1 HB2 HA1 HA2 HA3 HA4, so thread B, which has fewer pending requests, is served first]

  10. Outstanding Request-Based Scheme
  • Request-based, with hit-first and read-first applied on top (see the sketch below)
  • For SMT processors, sustained memory bandwidth is more important than the latency of an individual access
  [Timeline: requests HA1 HA2 MB1 HA3 MB2 HA4 are reordered to HA1 HA2 HA3 HA4 MB1 MB2, serving thread A's row-buffer hits before thread B's misses]
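
Reading slides 9 and 10 together, a minimal sketch of the outstanding request-based scheme: hit-first and read-first are applied on top, and among otherwise equal requests the thread with fewer pending memory requests wins. The priority order and data structures are inferred from the slide examples and are assumptions, not the authors' code; `Request` is the dataclass from the previous sketch extended with a `thread_id` field.

```python
from dataclasses import dataclass

@dataclass
class Request:
    thread_id: int
    row: int
    is_read: bool
    arrival_cycle: int

def pick_next_request_based(pending, open_row, outstanding):
    """`outstanding[t]` is the number of memory requests thread t currently
    has pending. Hit-first and read-first are applied on top; among equal
    requests, the thread with FEWER outstanding requests is served first,
    so lightly loaded threads are not starved by memory-intensive ones."""
    return min(
        pending,
        key=lambda r: (
            r.row != open_row,          # hit-first
            not r.is_read,              # read-first
            outstanding[r.thread_id],   # fewer pending requests -> higher priority
            r.arrival_cycle,            # age as the final tie-break
        ),
    )

# Slide 9's example: all requests are hits, thread B has 2 pending vs.
# thread A's 4, so B's requests are scheduled ahead of A's.
```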

  11. Resource Occupancy-Based Scheme
  • ROB-based: higher priority to requests from threads holding more ROB entries
  • IQ-based: higher priority to requests from threads holding more IQ entries
  • Hit-first and read-first are applied on top (see the sketch below)
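
A minimal sketch of the resource occupancy-based idea: requests from threads that tie up more ROB (or issue-queue) entries are favored so those entries are released sooner, with hit-first and read-first again applied on top. The occupancy dictionary and priority tuple are illustrative assumptions; `Request` is the same dataclass as in the previous sketch.

```python
def pick_next_occupancy_based(pending, open_row, rob_entries):
    """`rob_entries[t]` is the number of reorder-buffer entries thread t
    currently occupies (an IQ-based variant would pass issue-queue counts
    instead). Hit-first and read-first are applied on top; among equals,
    the request whose thread holds MORE ROB entries is served first."""
    return min(
        pending,
        key=lambda r: (
            r.row != open_row,            # hit-first
            not r.is_read,                # read-first
            -rob_entries[r.thread_id],    # more occupied entries -> higher priority
            r.arrival_cycle,              # age as the final tie-break
        ),
    )
```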

  12. Methodology
  • Simulator
    • SMT extension of sim-Alpha
    • Event-driven memory simulator (DDR SDRAM and Direct Rambus DRAM)
  • Workload
    • Mixtures of SPEC 2000 applications
    • 2-, 4-, and 8-thread workloads
    • “ILP”, “MIX”, and “MEM” workload mixes

  13. Simulation Parameters

  14. Workload Mixes

  15. Performance Loss Due to Memory Access

  16. Memory Access Concurrency

  17. Memory Channel Configurations

  18. Memory Channel Configurations

  19. Mapping Schemes

  20. Memory Access Concurrency

  21. Thread-Aware Schemes

  22. Conclusion
  DRAM optimizations have a significant impact on the performance of SMT (and likely CMP) processors:
  • Most effective when a workload mix includes some memory-intensive programs
  • Performance is sensitive to the memory channel organization
  • DRAM-side locality is harder to exploit due to contention
  • Thread-aware access scheduling schemes do bring good performance
