
DMA Cache: Architecturally Separate I/O Data from CPU Data for Improving I/O Performance


Presentation Transcript


  1. DMA Cache: Architecturally Separate I/O Data from CPU Data for Improving I/O Performance Dang Tang, Yungang Bao, Weiwu Hu, Mingyu Chen 2010.1 Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS)

  2. The role of I/O • I/O is ubiquitous • Load binary files: Disk → Memory • Browse the web, stream media: Network → Memory… • I/O is significant • Many commercial applications are I/O intensive: • Databases etc.

  3. State-of-the-Art I/O Technologies • I/O Bus: 20GB/s • PCI-Express 2.0 • HyperTransport 3.0 • QuickPath Interconnect • I/O Devices • SSD RAID: 1.2GB/s • 10GE: 1.25GB/s • Fusion-io: 8GB/s, 1M IOPS (2KB random 70/30 read/write mix)

  4. Direct Memory Access (DMA) • DMA is used for I/O operations in all modern computers • DMA allows I/O subsystems to access system memory independently of the CPU •  Many I/O devices have DMA engines • Including disk drive controllers, graphics cards, network cards, sound cards and GPUs

  5. Outline • Revisiting I/O • DMA Cache Design • Evaluations • Conclusions

  6. An Example of Disk Read: DMA Receiving Operation • [Figure: CPU, memory (DMA descriptor, driver buffer, kernel buffer) and DMA engine, with the four steps ①–④ of a DMA receive] • Cache Access Latency: ~20 Cycles • Memory Access Latency: ~200 Cycles
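
A minimal sketch of the four-step flow implied by that figure, assuming hypothetical descriptor fields and a polled completion flag (the slides do not specify the driver/device interface):

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical DMA descriptor: tells the engine where to place the incoming data. */
struct dma_desc {
    uint64_t buf_addr;   /* physical address of the driver buffer */
    uint32_t len;        /* bytes to transfer */
    uint32_t done;       /* set by the DMA engine on completion */
};

/* One plausible reading of steps ①–④:
 *   ① the driver fills a descriptor in memory,
 *   ② the DMA engine reads the descriptor,
 *   ③ the DMA engine writes the disk data into the driver buffer,
 *   ④ the CPU copies the data from the driver buffer into the kernel buffer.
 * Every reference in ③ and ④ either hits on chip (~20 cycles) or goes to
 * memory (~200 cycles), which is why placement of I/O data matters. */
void dma_receive(volatile struct dma_desc *desc,
                 void *driver_buf, void *kernel_buf, uint32_t len)
{
    desc->buf_addr = (uint64_t)(uintptr_t)driver_buf;   /* step ① */
    desc->len  = len;
    desc->done = 0;
    /* steps ② and ③ are performed by the DMA engine, not by the CPU */
    while (!desc->done)
        ;                                               /* poll for completion */
    memcpy(kernel_buf, driver_buf, len);                /* step ④ */
}
```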

  7. Direct Cache Access [Ram-ISCA05] • [Figure: the same DMA receive flow (①–④), with the I/O data placed in the CPU's shared cache rather than only in memory] • Prefetch-Hint Approach [Kumar-Micro07] • This is a typical Shared-Cache Scheme

  8. Problems of the Shared-Cache Scheme • Cache Pollution • Cache Thrashing • Not suitable for other I/O • Degrades performance when DMA requests are large (>100KB), e.g., for the “Oracle + TPC-H” application • To address this problem deeply, we need to investigate the I/O data characteristics.

  9. I/O Data vs. CPU Data • [Figure: CPU data and I/O data both pass through the memory controller (MemCtrl); HMTT captures the combined I/O data + CPU data reference stream]

  10. A Short Ad for HMTT [Bao-Sigmetrics08] • A Hardware/Software Hybrid Memory Trace Tool • Can support the DDR2 DIMM interface on multiple platforms • Can collect full-system off-chip memory traces • Can provide traces with semantic information, e.g., • virtual address • process ID • I/O operation • Can collect traces of commercial applications, e.g., • Oracle • Web server • [Figure: The HMTT System]

  11. Characteristics of I/O Data (1) • % of Memory References to I/O data • % of References of various I/O types

  12. Characteristics of I/O Data (2) • I/O request size distribution?

  13. Characteristics of I/O Data (3) • Sequential access in I/O data • Compared with CPU data, I/O data is very regular

  14. Characteristics of I/O Data (4) • Reuse Distance (RD), measured as LRU stack distance • [Figure: CDF of reuse distance, i.e., x% of references have RD <= n]
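
As a minimal sketch of how reuse distance can be computed from a block-address trace (the textbook LRU stack distance definition, not the paper's tooling): the distance of a reference is the depth at which its block is found in an LRU stack of previously referenced blocks, i.e., the number of distinct blocks touched since the previous reference to the same block.

```c
#include <stdio.h>

#define MAX_BLOCKS 1024          /* toy capacity for the LRU stack */

/* Returns the LRU stack distance of `block`, or -1 on a cold (first) reference,
 * and updates the stack so `block` becomes most recently used (index 0). */
static int stack_distance(unsigned long *stack, int *n, unsigned long block)
{
    int depth = -1;
    for (int i = 0; i < *n; i++) {
        if (stack[i] == block) { depth = i; break; }
    }
    if (depth < 0) {             /* cold miss: grow the stack and push on top */
        if (*n < MAX_BLOCKS) (*n)++;
        depth = *n - 1;
        for (int i = depth; i > 0; i--) stack[i] = stack[i - 1];
        stack[0] = block;
        return -1;
    }
    for (int i = depth; i > 0; i--) stack[i] = stack[i - 1];
    stack[0] = block;            /* move to top (most recently used) */
    return depth;
}

int main(void)
{
    /* A tiny block-address trace; real input would be an HMTT trace. */
    unsigned long trace[] = {1, 2, 3, 1, 2, 2, 4, 1};
    unsigned long stack[MAX_BLOCKS];
    int n = 0;

    for (size_t i = 0; i < sizeof trace / sizeof trace[0]; i++) {
        int rd = stack_distance(stack, &n, trace[i]);
        printf("block %lu: RD = %d\n", trace[i], rd);
    }
    return 0;
}
```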

  15. Characteristics of I/O Data (5) • [Figure: reference timelines labeled DMA-W / CPU-R and CPU-W / DMA-R for I/O data, versus CPU-RW for CPU data]

  16. Rethink I/O & DMA Operation • 20~40% of memory references are for I/O data in I/O-intensive applications. • Characteristics of I/O data are different from CPU data • An explicit produce-consume relationship for I/O data • Reuse distance of I/O data is smaller than CPU data • References to I/O data are primarily sequential •  Separating I/O data and CPU data

  17. Separating I/O data and CPU data • [Figure: the cache hierarchy before separating and after separating I/O data from CPU data]

  18. Outline • Revisiting I/O • DMA Cache Design • Evaluations • Conclusions

  19. DMA Cache Design Issues • Write Policy • Cache Coherence • Replacement Policy • Prefetching • [Figure: Dedicated DMA Cache (DDC) organization]

  20. DMA Cache Design Issues • Write Policy: adopt a Write-Allocate policy; both Write-Back and Write-Through policies are available (sketched below) • Cache Coherence • Replacement Policy • Prefetching
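
A small sketch of what those two choices mean for a DMA write, using an invented one-line cache structure and placeholder memory helpers (left as comments); this is an illustration, not the paper's RTL:

```c
#include <stdint.h>
#include <stdbool.h>

enum write_policy { WRITE_BACK, WRITE_THROUGH };

/* Hypothetical single-line view of the DMA cache, just enough to show the policy. */
struct line { uint64_t tag; bool valid; bool dirty; uint8_t data[64]; };

/* Write-allocate: a DMA write that misses first brings the block into the
 * DMA cache, then updates it there.  Under write-through the update is also
 * sent to memory immediately; under write-back it stays dirty until eviction. */
void dma_cache_write(struct line *l, uint64_t tag, unsigned off, uint8_t byte,
                     enum write_policy policy)
{
    if (!l->valid || l->tag != tag) {        /* write miss */
        /* fetch_block_from_memory(l->data, tag);   -- allocate on write */
        l->tag = tag;
        l->valid = true;
        l->dirty = false;
    }
    l->data[off] = byte;                     /* update the cached copy */
    if (policy == WRITE_THROUGH) {
        /* write_byte_to_memory(tag, off, byte);    -- propagate immediately */
    } else {
        l->dirty = true;                     /* defer until the line is evicted */
    }
}

int main(void)
{
    struct line l = {0};
    dma_cache_write(&l, 0x40, 0, 0xAB, WRITE_BACK);   /* miss, allocate, mark dirty */
    return l.dirty ? 0 : 1;
}
```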

  21. DMA Cache Design Issues • Write Policy • Cache Coherence: IO-ESI protocol for the WT policy, IO-MOESI protocol for the WB policy • Replacement Policy • Prefetching • The only difference between IO-MOESI/IO-ESI and the original protocols is exchanging the local source and the probe source of state transitions

  22. A Big Issue • How to prove the correctness of integrating heterogeneous cache coherence protocols in one system?

  23. A Global State Method for Heterogeneous Cache Coherence Protocols [Pong-SPAA93, Pong-JACM98] • [Figure: the per-line states of the DMA cache and the CPU caches are combined into global states (e.g., S+I+, EI+, MI+, MS+I+, OS+I+), whose transitions under read/write events are checked, with legal states marked √ and conflict states marked X]

  24. Global State Cache Coherence Theorem • Given N (N>1) well-defined cache protocols, they do not conflict if and only if there does not exist any Conflict Global State in the global state transition machine. • 5 Global States: • S+I+ • EI* • I* • MI* • OS*I*
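
The following toy check only illustrates the flavor of the method, not the paper's heterogeneous IO-MOESI + IO-ESI proof: it brute-forces every reachable global state for two textbook MESI caches sharing one line and flags any global state in which a writable (M or E) copy coexists with another valid copy.

```c
#include <stdio.h>

enum state { I, S, E, M };
enum op { RD, WR };
static const char *name[] = {"I", "S", "E", "M"};

/* Textbook MESI next state.  local=1: request from this cache's own master;
 * local=0: probe observed from the other cache.  shared: other cache is valid. */
static enum state next(enum state s, enum op op, int local, int shared)
{
    if (local)
        return op == WR ? M : (s == I ? (shared ? S : E) : s);
    if (op == WR) return I;               /* remote write invalidates us */
    return (s == M || s == E) ? S : s;    /* remote read downgrades M/E */
}

int main(void)
{
    int seen[4][4] = {{0}};
    enum state queue[32][2];
    int head = 0, tail = 0;

    queue[tail][0] = I; queue[tail][1] = I; tail++;   /* start: both invalid */
    seen[I][I] = 1;

    while (head < tail) {                 /* BFS over reachable global states */
        enum state a = queue[head][0], b = queue[head][1]; head++;
        for (int who = 0; who < 2; who++) {
            for (int op = RD; op <= WR; op++) {
                enum state s0 = who == 0 ? a : b, s1 = who == 0 ? b : a;
                enum state n0 = next(s0, op, 1, s1 != I);   /* requester   */
                enum state n1 = next(s1, op, 0, s0 != I);   /* probed side */
                enum state ga = who == 0 ? n0 : n1, gb = who == 0 ? n1 : n0;
                if (!seen[ga][gb]) {
                    seen[ga][gb] = 1;
                    queue[tail][0] = ga; queue[tail][1] = gb; tail++;
                }
            }
        }
    }

    for (int a = 0; a < 4; a++)
        for (int b = 0; b < 4; b++)
            if (seen[a][b]) {
                /* conflict: a writable copy coexisting with another valid copy */
                int bad = ((a == M || a == E) && b != I) ||
                          ((b == M || b == E) && a != I);
                printf("global state (%s,%s)%s\n", name[a], name[b],
                       bad ? "  <-- conflict!" : "");
            }
    return 0;
}
```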

  25. MOESI + ESI • 6 Global States: • S+I+ • ECI* • I* • MCI* • EDI* • OCS*I*

  26. DMA Cache Design Issues • Write Policy • Cache Coherence • Replacement Policy • Prefetching • An LRU-like Replacement Policy (sketched below) • Invalid • Shared • Owned • Exclusive • Modified
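
One plausible reading of that slide, sketched below as an assumption rather than the paper's exact policy: choose the victim by coherence state in the listed order (Invalid first, Modified last), breaking ties within a state by LRU.

```c
enum coh_state { INVALID, SHARED, OWNED, EXCLUSIVE, MODIFIED };

struct way {
    enum coh_state state;
    unsigned long  last_use;     /* smaller = older, used for LRU tie-breaking */
};

/* Choose a victim among `n` ways: cheapest coherence state first
 * (Invalid < Shared < Owned < Exclusive < Modified), oldest within a state. */
int choose_victim(const struct way *ways, int n)
{
    int victim = 0;
    for (int i = 1; i < n; i++) {
        if (ways[i].state < ways[victim].state ||
            (ways[i].state == ways[victim].state &&
             ways[i].last_use < ways[victim].last_use))
            victim = i;
    }
    return victim;
}

int main(void)
{
    struct way set[4] = {
        { MODIFIED, 10 }, { SHARED, 5 }, { EXCLUSIVE, 1 }, { SHARED, 2 }
    };
    return choose_victim(set, 4) == 3 ? 0 : 1;   /* oldest Shared way wins */
}
```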

  27. DMA Cache Design Issues • Write Policy • Cache Coherence • Replacement Policy • Prefetching • Adopt straightforward sequential prefetching (sketched below) • Prefetching triggered by a cache miss • Fetch 4 blocks at a time
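
A minimal sketch of that policy as stated (miss-triggered, sequential, four blocks at a time); the fill routine is a placeholder, not a real interface, and "four blocks" is taken to mean the demanded block plus its successors:

```c
#include <stdio.h>
#include <stdint.h>

#define PREFETCH_BLOCKS 4        /* "fetch 4 blocks at a time" */
#define BLOCK_SIZE      64

/* Placeholder for the real fill path: here it only logs the fetch. */
static void dma_cache_fill(uint64_t block_addr)
{
    printf("fill block at 0x%llx\n", (unsigned long long)block_addr);
}

/* On a DMA-cache miss, fetch the demanded block and its sequential
 * successors; the slides report this is accurate because I/O data is
 * streamed through the buffers almost purely sequentially. */
static void on_dma_cache_miss(uint64_t miss_addr)
{
    uint64_t block = miss_addr / BLOCK_SIZE;
    for (int i = 0; i < PREFETCH_BLOCKS; i++)
        dma_cache_fill((block + i) * BLOCK_SIZE);
}

int main(void)
{
    on_dma_cache_miss(0x1234);   /* toy demand miss */
    return 0;
}
```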

  28. Design Complexity vs. Design Cost • [Figure: the Dedicated DMA Cache (DDC) and Partition-Based DMA Cache (PBDC) organizations]

  29. Outline • Revisiting I/O • DMA Cache Design • Evaluations • Conclusions

  30. Speedup of Dedicated DMA Cache

  31. % of Valid Prefetched Blocks • DMA caches can exhibit impressively high prefetching accuracy • This is because I/O data has a very regular access pattern.

  32. Performance Comparisons • Although PBDC does not require additional on-chip storage, it can achieve about 80% of DDC's performance improvement.

  33. Outline • Revisiting I/O • DMA Cache Design • Evaluations • Conclusions

  34. Conclusions • We have proposed a DMA cache technique to separate I/O data from CPU data • We adopt a Global State Method for integrating heterogeneous cache coherence protocols • Experimental results show that DMA cache schemes are better than the existing approaches that use unified, shared caches for I/O data and CPU data • Still open problems, e.g., • Can I/O data go directly to the L1 cache? • How to design heterogeneous caches for different types of data? • How to optimize the memory controller with awareness of I/O?

  35. Thanks! & Questions?

  36. RTL Emulation Platform • LLC and DMA cache models from Loongson-2F • DDR2 memory controller from Loongson-2F • DDR2 DIMM model from Micron Technology • [Figure: a memory trace drives the LL Cache and DMA Cache models, which feed the MemCtrl and DDR2 DIMM models]
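
A minimal sketch of the trace-driven flow in that figure, with an invented record format and stub cache/memory models (the real HMTT trace format and Loongson-2F models are not given in the slides): each reference is routed to the DMA cache if it came from the DMA engine, otherwise to the LLC, and misses go on to the DDR2 model.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Invented trace record: the real HMTT format is richer (virtual address,
 * process ID, I/O operation tags, ...). */
struct trace_rec {
    uint64_t addr;
    bool     is_write;
    bool     is_dma;     /* reference issued by the DMA engine, i.e. I/O data */
};

/* Stand-ins for the cache and DDR2 models; they return/record nothing useful. */
static bool llc_access(uint64_t addr, bool w)       { (void)addr; (void)w; return false; }
static bool dma_cache_access(uint64_t addr, bool w) { (void)addr; (void)w; return false; }
static void ddr2_access(uint64_t addr, bool w)      { (void)addr; (void)w; }

/* Replay one record: route it to the right cache, then to memory on a miss. */
static void replay(const struct trace_rec *r, unsigned long *misses)
{
    bool hit = r->is_dma ? dma_cache_access(r->addr, r->is_write)
                         : llc_access(r->addr, r->is_write);
    if (!hit) {
        (*misses)++;
        ddr2_access(r->addr, r->is_write);
    }
}

int main(void)
{
    struct trace_rec trace[] = {
        {0x1000, true,  true },   /* DMA engine writes I/O data      */
        {0x1000, false, false},   /* CPU later reads the same line   */
    };
    unsigned long misses = 0;
    for (size_t i = 0; i < sizeof trace / sizeof trace[0]; i++)
        replay(&trace[i], &misses);
    printf("misses: %lu\n", misses);
    return 0;
}
```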

  37. Parameters • DDR2-666

  38. Normalized Speedup for WB • Baseline is the snoop cache scheme • DMA cache schemes exhibit better performance than the others

  39. DMA Write & CPU Read Hit Rate • Both the shared cache and the DMA cache exhibit high hit rates • Then, where do the cycles go in the shared-cache scheme?

  40. Breakdown of Normalized Total Cycles

  41. Design Complexity of PBDC

  42. More References on Cache Coherence Protocol Verification • Fong Pong, Michel Dubois. Formal verification of complex coherence protocols using symbolic state models. Journal of the ACM (JACM), 45(4):557-587, July 1998. • Fong Pong, Michel Dubois. Verification techniques for cache coherence protocols. ACM Computing Surveys (CSUR), 29(1):82-126, March 1997.
