
Writeback-Aware Bandwidth Partitioning for Multi-core Systems with PCM


Presentation Transcript


  1. Writeback-Aware Bandwidth Partitioning for Multi-core Systems with PCM Miao Zhou, Yu Du, Bruce Childers, Rami Melhem, Daniel Mossé University of Pittsburgh http://www.cs.pitt.edu/PCM

  2. Introduction • DRAM memory is not energy efficient • Data centers are energy hungry; DRAM memory consumes 20-40% of their energy • Apply PCM as main memory: energy efficient, but slower reads, much slower writes, and a shorter lifetime • Hybrid memory: add a DRAM cache • Improves performance (↓ LLC miss rate) • Extends lifetime (↓ LLC writeback rate) • How to manage the shared resources? [Figure: 4-core CMP with private L1/L2 caches, a DRAM LLC, and hybrid PCM + DRAM main memory]

  3. Shared Resource Management • CMP systems share resources: the last-level cache and memory bandwidth • Unmanaged resources → interference → poor performance • Partitioning resources ↓ interference and ↑ performance [Figure: 4-core CMP with shared LLC and memory] Prior work vs. this work: • Cache partitioning, DRAM main memory: Utility-based Cache Partitioning (UCP) [Qureshi et al., MICRO 39] tracks utility (LLC hits/misses) and minimizes overall LLC misses • Cache partitioning, hybrid main memory: Writeback-aware Cache Partitioning (WCP) [Zhou et al., HiPEAC'12] tracks and minimizes LLC misses & writebacks • Bandwidth partitioning, DRAM main memory: Read-only Bandwidth Partitioning (RBP) [Liu et al., HPCA'10] partitions the bus bandwidth based on LLC miss information • Bandwidth partitioning, hybrid main memory: this work. Questions: 1. Is read-only (LLC miss) information enough? 2. Is bus bandwidth still the bottleneck?

  4. Bandwidth Partitioning • Analytic model guides the run-time partitioning • Use queuing theory to model delay • Monitor performance to estimate the parameters of the model • Find the partition that maximizes the system's performance • Enforce the partition at run time • DRAM vs. hybrid main memory: PCM writes are extremely slow and power hungry • Issues specific to hybrid main memory • Is the bottleneck the bus bandwidth or the device bandwidth? • Can we ignore the bandwidth consumed by LLC writebacks?
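A minimal sketch of the "find the partition" step described above, not the paper's actual implementation: it assumes an M/M/1-style queuing delay and a coarse exhaustive search (the names cpi_alone, cpi_inf, and lam are illustrative).

```python
# Illustrative sketch of the model-driven search step (not the paper's exact
# algorithm): estimate each app's CPI from an M/M/1-style queuing delay and
# pick the coarse-grained bandwidth split that maximizes weighted speedup.
from itertools import product

def predicted_cpi(cpi_inf, lam, mu):
    """CPI = CPI_LLCinf + miss_rate * memory_service_time, where the service
    time grows as the allocated bandwidth mu approaches the miss rate lam."""
    if mu <= lam:
        return float("inf")          # allocation cannot sustain the load
    return cpi_inf + lam / (mu - lam)

def best_partition(apps, total_bw, steps=10):
    """Try every split of total_bw at 1/steps granularity and return the one
    with the highest weighted speedup (sum of CPI_alone / CPI_shared).
    An online implementation would use a much cheaper search."""
    fractions = [i / steps for i in range(1, steps)]
    best_shares, best_ws = None, -1.0
    for split in product(fractions, repeat=len(apps) - 1):
        shares = list(split) + [1.0 - sum(split)]
        if min(shares) <= 0:
            continue
        ws = sum(a["cpi_alone"] / predicted_cpi(a["cpi_inf"], a["lam"], s * total_bw)
                 for a, s in zip(apps, shares))
        if ws > best_ws:
            best_shares, best_ws = shares, ws
    return best_shares
```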

  5. Device Bandwidth Utilization [Chart: device bandwidth utilization, DRAM memory vs. hybrid memory] DRAM memory: 1. Low device bandwidth utilization 2. Memory reads (LLC misses) dominate. Hybrid memory: 1. High device bandwidth utilization 2. Memory writes (LLC writebacks) often dominate

  6. RBP on Hybrid Main Memory [Chart: RBP vs. SHARE, with workloads grouped by the percentage of device bandwidth consumed by PCM writes (LLC writebacks), from 10% to 90%] 1. RBP outperforms SHARE for workloads dominated by PCM reads (LLC misses) 2. RBP loses to SHARE for workloads dominated by PCM writes (LLC writebacks). A new bandwidth partitioning scheme is necessary for hybrid memory

  7. Writeback-Aware Bandwidth Partitioning • Focuses on the collective bandwidth of the PCM devices • Considers LLC writeback information • Token bucket algorithm: device service units = tokens; allocate tokens among applications every epoch (5 million cycles) • Analytic model: maximize weighted speedup; model bandwidth contention as queuing delay • Difficulty: a write is blocking only when the write queue is full
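To make the token-bucket idea concrete, here is an illustrative token-accounting sketch. The per-service-unit cost and channel count are assumptions; the slides only fix the epoch length at 5 million cycles.

```python
# Illustrative token accounting for the epoch-based allocation described above.
EPOCH_CYCLES = 5_000_000              # allocation epoch from the slide
CYCLES_PER_SERVICE_UNIT = 1_000       # assumed cost of one PCM device service unit

def tokens_per_epoch(shares, channels=4):
    """Treat device service units as tokens and split one epoch's worth of
    them among applications according to their bandwidth shares."""
    total_tokens = channels * EPOCH_CYCLES // CYCLES_PER_SERVICE_UNIT
    return [int(s * total_tokens) for s in shares]
```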

  8. Analytic Model for Bandwidth Partitioning • For a single core • Additive CPI formula: CPI = CPI_LLC∞ + LLC miss freq. × LLC miss penalty (CPI with an infinite LLC plus CPI due to LLC misses) • Memory ≈ queue, memory service time ≈ queuing delay • For a CMP: each core i issues LLC misses at rate λm,i to a memory queue served with bandwidth αi; choose the partition (α1, …, αN) that maximizes weighted speedup [Diagram: per-core LLC miss streams feeding the shared memory]
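Written out, the model on this slide takes the following form. The concrete queuing-delay expression 1/(αi − λm,i) is an M/M/1-style assumption used for illustration; the slide only states that memory service time is modeled as queuing delay.

```latex
% Per-core additive CPI: base CPI with an infinite LLC plus time serving misses.
\[
  CPI_i \;=\; \underbrace{CPI_i^{LLC_\infty}}_{\text{CPI with an infinite LLC}}
        \;+\; \underbrace{\lambda_{m,i}\, T_{mem}(\alpha_i)}_{\text{CPI due to LLC misses}},
  \qquad
  T_{mem}(\alpha_i) \;\approx\; \frac{1}{\alpha_i - \lambda_{m,i}}
\]
% For a CMP, choose the bandwidth partition that maximizes weighted speedup:
\[
  \max_{\alpha_1,\dots,\alpha_N} \sum_{i=1}^{N} \frac{CPI_i^{\,alone}}{CPI_i}
  \quad \text{s.t.} \quad \sum_{i=1}^{N} \alpha_i \le \alpha_{\text{total}}
\]
```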

  9. Analytic Model for WBP • Takes LLC writebacks into account • CPI = CPI_LLC∞ + LLC miss freq. × LLC miss penalty + P × LLC writeback freq. × LLC writeback penalty • P: probability that writebacks are on the critical path (a write blocks the core only when the write queue is full) • How to determine P? [Diagram: per-core LLC miss streams (rate λm,i) served from a read queue with bandwidth αi, and writeback streams (rate λw,i) served from a write queue with bandwidth βi; the partition maximizes weighted speedup]
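The writeback-aware formula, reconstructed in the same notation (the read/write delay terms follow the queuing-delay form assumed above):

```latex
% Writeback-aware per-core CPI: writebacks add a delay term weighted by P,
% the probability that a writeback is on the critical path (a write blocks
% the core only when the write queue is full).
\[
  CPI_i \;=\; CPI_i^{LLC_\infty}
        \;+\; \lambda_{m,i}\, T_{read}(\alpha_i)
        \;+\; P \cdot \lambda_{w,i}\, T_{write}(\beta_i)
\]
% alpha_i / beta_i   : read / write PCM bandwidth allocated to core i
% lambda_{m,i} / lambda_{w,i} : LLC miss / writeback rates of core i
% As before, the partition (alpha, beta) is chosen to maximize weighted speedup.
```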

  10. Dynamic Weight Adjustment • Choose P based on the expected number of executed instructions (EEI) • Bandwidth Utilization ratio (BU): utilized bandwidth : allocated bandwidth [Diagram: candidate weights p1 … pm each yield a WBP partition (α, β) and an expected EEI (EEI1 … EEIm) from the per-app miss rates λm,i, writeback rates λw,i, and utilization ratios BU1 … BUN; the selected P is the one whose expected EEI best tracks the actual EEI]
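One plausible reading of this slide, expressed as a sketch: evaluate a few candidate weights P, estimate the EEI that each resulting WBP partition would deliver, and keep the best one. The candidate set, the EEI estimate (epoch_cycles / CPI scaled by BU), and the wbp callable are all assumptions, not the authors' exact mechanism.

```python
# Hypothetical sketch of Dynamic Weight Adjustment (DWA).
CANDIDATE_P = [0.0, 0.25, 0.5, 0.75, 1.0]

def choose_weight(apps, total_read_bw, total_write_bw, epoch_cycles, wbp):
    """Return the candidate P whose WBP partition yields the largest
    expected number of executed instructions (EEI)."""
    best_p, best_eei = None, -1.0
    for p in CANDIDATE_P:
        alphas, betas = wbp(apps, total_read_bw, total_write_bw, p)
        eei = 0.0
        for a, alpha, beta in zip(apps, alphas, betas):
            cpi = (a["cpi_inf"]
                   + a["lam_m"] / max(alpha - a["lam_m"], 1e-9)
                   + p * a["lam_w"] / max(beta - a["lam_w"], 1e-9))
            eei += a["bu"] * epoch_cycles / cpi   # scale by utilization ratio
        if eei > best_eei:
            best_p, best_eei = p, eei
    return best_p
```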

  11. Architecture Overview • BUMon tracks info during an epoch • DWA and WBP compute bandwidth partition for the next epoch • Bandwidth Regulator enforces the configuration
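A minimal epoch loop matching the components named on this slide; the interfaces (read_and_reset, choose_weight, partition, configure) are assumed names, not the paper's API.

```python
# BUMon collects per-app miss/writeback rates and bandwidth-utilization
# ratios during an epoch; DWA picks the writeback weight P; WBP computes
# the next partition; the bandwidth regulator is then reprogrammed.
def on_epoch_end(bumon, dwa, wbp, regulator, epoch_cycles=5_000_000):
    stats = bumon.read_and_reset()               # rates and BU per app
    p = dwa.choose_weight(stats)                 # writeback weight for next epoch
    alphas, betas = wbp.partition(stats, p)      # read/write bandwidth shares
    regulator.configure(alphas, betas, epoch_cycles)
```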

  12. Enforcing Bandwidth Partitioning
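The slide gives only the title, so the following is one possible enforcement mechanism consistent with the token-bucket scheme described earlier, with assumed token costs and stall policy: each app gets a token budget per epoch, a PCM request spends tokens, and an app with an empty bucket waits for the next refill.

```python
# Sketch of a per-app token-bucket bandwidth regulator (illustrative).
class TokenBucketRegulator:
    def __init__(self, num_apps):
        self.budget = [0] * num_apps      # tokens granted for this epoch
        self.balance = [0] * num_apps     # tokens remaining

    def configure(self, tokens_per_app):
        """Install the token budgets computed for the next epoch."""
        self.budget = list(tokens_per_app)
        self.balance = list(tokens_per_app)

    def try_issue(self, app, cost):
        """Issue a PCM request costing `cost` tokens, or refuse (stall)."""
        if self.balance[app] >= cost:
            self.balance[app] -= cost
            return True
        return False                      # request waits for the next epoch
```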

  13. Simulation Setup • Configurations • 8-core CMP, 168-entry instruction window • Private 4-way 64KB L1, private 8-way 2MB L2 • Partitioned 32MB DRAM LLC, 12.5 ns latency • 64GB PCM, 4 channels of 2 ranks each, 50 ns read latency, 1000 ns write latency • Benchmarks • SPEC CPU2006 • Classified into 3 types (W, R, RW) based on whether PCM reads/writes dominate bandwidth consumption • Combined into 15 workloads (Light, High) • Sensitivity study on write latency, #channels and #cores

  14. Effective Read Latency 1. Different workloads favor different policies (partitioning weights) 2. WBP+DWA can match the best static policy (partitioning weight) 3. WBP+DWA reduces the effective read latency by 31.9% over RBP

  15. Throughput 1. The best weight varies across workloads (writeback weight) 2. WBP+DWA achieves comparable performance to the best static weight 3. WBP+DWA improves throughput by 24.2% over RBP

  16. Fairness (Harmonic IPC) WBP+DWA improves fairness by an average of 16.7% over RBP

  17. Conclusions • PCM device bandwidth is the bottleneck in hybrid memory • Writeback information is important: LLC writebacks consume a substantial portion of memory bandwidth • WBP better partitions the PCM bandwidth • WBP outperforms RBP by an average of 24.9% in terms of weighted speedup

  18. Thank you. Questions?
