
DMA Cache: Architecturally Separate I/O Data from CPU Data for Improving I/O Performance


Presentation Transcript


  1. DMA Cache: Architecturally Separate I/O Data from CPU Data for Improving I/O Performance Dang Tang, Yungang Bao, Weiwu Hu, Mingyu Chen 2010.1 Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS)

  2. The role of I/O • I/O is ubiquitous • Load binary files: Disk → Memory • Browse the web, stream media: Network → Memory… • I/O is significant • Many commercial applications are I/O intensive: • Databases etc.

  3. State-of-the-Art I/O Technologies • I/O Bus: 20GB/s • PCI-Express 2.0 • HyperTransport 3.0 • QuickPath Interconnect • I/O Devices • SSD RAID: 1.2GB/s • 10GE: 1.25GB/s • Fusion-io: 8GB/s, 1M IOPS (2KB random 70/30 read/write mix)

  4. Direct Memory Access (DMA) • DMA is used for I/O operations in all modern computers • DMA allows I/O subsystems to access system memory independently of the CPU •  Many I/O devices have DMA engines • Including disk drive controllers, graphics cards, network cards, sound cards and GPUs

  5. Outline • Revisiting I/O • DMA Cache Design • Evaluations • Conclusions

  6. An Example of Disk Read: DMA Receiving Operation • [Figure: CPU, memory (DMA descriptor, driver buffer, kernel buffer) and DMA engine, with the four steps ①–④ of a DMA receive] • Cache Access Latency: ~20 Cycles • Memory Access Latency: ~200 Cycles
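
A minimal sketch of the four-step flow implied by that figure, assuming hypothetical descriptor fields and a polled completion flag (the slides do not specify the driver/device interface):

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical DMA descriptor: tells the engine where to place the incoming data. */
struct dma_desc {
    uint64_t buf_addr;   /* physical address of the driver buffer */
    uint32_t len;        /* bytes to transfer */
    uint32_t done;       /* set by the DMA engine on completion */
};

/* One plausible reading of steps ①–④:
 *   ① the driver fills a descriptor in memory,
 *   ② the DMA engine reads the descriptor,
 *   ③ the DMA engine writes the disk data into the driver buffer,
 *   ④ the CPU copies the data from the driver buffer into the kernel buffer.
 * Every reference in ③ and ④ either hits on chip (~20 cycles) or goes to
 * memory (~200 cycles), which is why placement of I/O data matters. */
void dma_receive(volatile struct dma_desc *desc,
                 void *driver_buf, void *kernel_buf, uint32_t len)
{
    desc->buf_addr = (uint64_t)(uintptr_t)driver_buf;   /* step ① */
    desc->len  = len;
    desc->done = 0;
    /* steps ② and ③ are performed by the DMA engine, not by the CPU */
    while (!desc->done)
        ;                                               /* poll for completion */
    memcpy(kernel_buf, driver_buf, len);                /* step ④ */
}
```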

  7. Direct Cache Access [Ram-ISCA05] • [Figure: the same DMA receive flow (①–④), with the I/O data placed in the CPU's shared cache rather than only in memory] • Prefetch-Hint Approach [Kumar-Micro07] • This is a typical Shared-Cache Scheme

  8. Problems of the Shared-Cache Scheme • Cache Pollution • Cache Thrashing • Not suitable for other I/O • Degrades performance when DMA requests are large (>100KB), e.g., for the “Oracle + TPC-H” application • To address this problem deeply, we need to investigate the I/O data characteristics.

  9. I/O Data vs. CPU Data • [Figure: CPU data and I/O data both pass through the memory controller (MemCtrl); HMTT captures the combined I/O data + CPU data reference stream]

  10. A Short Ad for HMTT [Bao-Sigmetrics08] • A Hardware/Software Hybrid Memory Trace Tool • Can support the DDR2 DIMM interface on multiple platforms • Can collect full-system off-chip memory traces • Can provide traces with semantic information, e.g., • virtual address • process ID • I/O operation • Can collect traces of commercial applications, e.g., • Oracle • Web server • [Figure: The HMTT System]

  11. Characteristics of I/O Data (1) • % of Memory References to I/O data • % of References of various I/O types

  12. Characteristics of I/O Data (2) • I/O request size distribution?

  13. Characteristics of I/O Data (3) • Sequential access in I/O data • Compared with CPU data, I/O data is very regular

  14. Characteristics of I/O Data (4) • Reuse Distance (RD), measured as LRU stack distance • [Figure: CDF of reuse distance, i.e., x% of references have RD <= n]
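
As a minimal sketch of how reuse distance can be computed from a block-address trace (the textbook LRU stack distance definition, not the paper's tooling): the distance of a reference is the depth at which its block is found in an LRU stack of previously referenced blocks, i.e., the number of distinct blocks touched since the previous reference to the same block.

```c
#include <stdio.h>

#define MAX_BLOCKS 1024          /* toy capacity for the LRU stack */

/* Returns the LRU stack distance of `block`, or -1 on a cold (first) reference,
 * and updates the stack so `block` becomes most recently used (index 0). */
static int stack_distance(unsigned long *stack, int *n, unsigned long block)
{
    int depth = -1;
    for (int i = 0; i < *n; i++) {
        if (stack[i] == block) { depth = i; break; }
    }
    if (depth < 0) {             /* cold miss: grow the stack and push on top */
        if (*n < MAX_BLOCKS) (*n)++;
        depth = *n - 1;
        for (int i = depth; i > 0; i--) stack[i] = stack[i - 1];
        stack[0] = block;
        return -1;
    }
    for (int i = depth; i > 0; i--) stack[i] = stack[i - 1];
    stack[0] = block;            /* move to top (most recently used) */
    return depth;
}

int main(void)
{
    /* A tiny block-address trace; real input would be an HMTT trace. */
    unsigned long trace[] = {1, 2, 3, 1, 2, 2, 4, 1};
    unsigned long stack[MAX_BLOCKS];
    int n = 0;

    for (size_t i = 0; i < sizeof trace / sizeof trace[0]; i++) {
        int rd = stack_distance(stack, &n, trace[i]);
        printf("block %lu: RD = %d\n", trace[i], rd);
    }
    return 0;
}
```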

  15. Characteristics of I/O Data (5) • [Figure: reference timelines labeled DMA-W / CPU-R and CPU-W / DMA-R for I/O data, versus CPU-RW for CPU data]

  16. Rethink I/O & DMA Operation • 20~40% of memory references are for I/O data in I/O-intensive applications. • Characteristics of I/O data are different from CPU data • An explicit produce-consume relationship for I/O data • Reuse distance of I/O data is smaller than CPU data • References to I/O data are primarily sequential •  Separating I/O data and CPU data

  17. Separating I/O data and CPU data • [Figure: the cache hierarchy before separating and after separating I/O data from CPU data]

  18. Outline • Revisiting I/O • DMA Cache Design • Evaluations • Conclusions

  19. DMA Cache Design Issues • Write Policy • Cache Coherence • Replacement Policy • Prefetching • [Figure: Dedicated DMA Cache (DDC) organization]

  20. DMA Cache Design Issues • Write Policy: adopt a Write-Allocate policy; both Write-Back and Write-Through policies are available (sketched below) • Cache Coherence • Replacement Policy • Prefetching
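
A small sketch of what those two choices mean for a DMA write, using an invented one-line cache structure and placeholder memory helpers (left as comments); this is an illustration, not the paper's RTL:

```c
#include <stdint.h>
#include <stdbool.h>

enum write_policy { WRITE_BACK, WRITE_THROUGH };

/* Hypothetical single-line view of the DMA cache, just enough to show the policy. */
struct line { uint64_t tag; bool valid; bool dirty; uint8_t data[64]; };

/* Write-allocate: a DMA write that misses first brings the block into the
 * DMA cache, then updates it there.  Under write-through the update is also
 * sent to memory immediately; under write-back it stays dirty until eviction. */
void dma_cache_write(struct line *l, uint64_t tag, unsigned off, uint8_t byte,
                     enum write_policy policy)
{
    if (!l->valid || l->tag != tag) {        /* write miss */
        /* fetch_block_from_memory(l->data, tag);   -- allocate on write */
        l->tag = tag;
        l->valid = true;
        l->dirty = false;
    }
    l->data[off] = byte;                     /* update the cached copy */
    if (policy == WRITE_THROUGH) {
        /* write_byte_to_memory(tag, off, byte);    -- propagate immediately */
    } else {
        l->dirty = true;                     /* defer until the line is evicted */
    }
}

int main(void)
{
    struct line l = {0};
    dma_cache_write(&l, 0x40, 0, 0xAB, WRITE_BACK);   /* miss, allocate, mark dirty */
    return l.dirty ? 0 : 1;
}
```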

  21. DMA Cache Design Issues • Write Policy • Cache Coherence: IO-ESI protocol for the WT policy, IO-MOESI protocol for the WB policy • Replacement Policy • Prefetching • The only difference between IO-MOESI/IO-ESI and the original protocols is exchanging the local source and the probe source of state transitions

  22. A Big Issue • How to prove the correctness of integrating heterogeneous cache coherence protocols in one system?

  23. A Global State Method for Heterogeneous Cache Coherence Protocols [Pong-SPAA93, Pong-JACM98] • [Figure: the per-line states of the DMA cache and the CPU caches are combined into global states (e.g., S+I+, EI+, MI+, MS+I+, OS+I+), whose transitions under read/write events are checked, with legal states marked √ and conflict states marked X]

  24. Global State Cache Coherence Theorem • Given N (N>1) well-defined cache protocols, they do not conflict if and only if there does not exist any Conflict Global State in the global state transition machine. • 5 Global States: • S+I+ • EI* • I* • MI* • OS*I*
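
The following toy check only illustrates the flavor of the method, not the paper's heterogeneous IO-MOESI + IO-ESI proof: it brute-forces every reachable global state for two textbook MESI caches sharing one line and flags any global state in which a writable (M or E) copy coexists with another valid copy.

```c
#include <stdio.h>

enum state { I, S, E, M };
enum op { RD, WR };
static const char *name[] = {"I", "S", "E", "M"};

/* Textbook MESI next state.  local=1: request from this cache's own master;
 * local=0: probe observed from the other cache.  shared: other cache is valid. */
static enum state next(enum state s, enum op op, int local, int shared)
{
    if (local)
        return op == WR ? M : (s == I ? (shared ? S : E) : s);
    if (op == WR) return I;               /* remote write invalidates us */
    return (s == M || s == E) ? S : s;    /* remote read downgrades M/E */
}

int main(void)
{
    int seen[4][4] = {{0}};
    enum state queue[32][2];
    int head = 0, tail = 0;

    queue[tail][0] = I; queue[tail][1] = I; tail++;   /* start: both invalid */
    seen[I][I] = 1;

    while (head < tail) {                 /* BFS over reachable global states */
        enum state a = queue[head][0], b = queue[head][1]; head++;
        for (int who = 0; who < 2; who++) {
            for (int op = RD; op <= WR; op++) {
                enum state s0 = who == 0 ? a : b, s1 = who == 0 ? b : a;
                enum state n0 = next(s0, op, 1, s1 != I);   /* requester   */
                enum state n1 = next(s1, op, 0, s0 != I);   /* probed side */
                enum state ga = who == 0 ? n0 : n1, gb = who == 0 ? n1 : n0;
                if (!seen[ga][gb]) {
                    seen[ga][gb] = 1;
                    queue[tail][0] = ga; queue[tail][1] = gb; tail++;
                }
            }
        }
    }

    for (int a = 0; a < 4; a++)
        for (int b = 0; b < 4; b++)
            if (seen[a][b]) {
                /* conflict: a writable copy coexisting with another valid copy */
                int bad = ((a == M || a == E) && b != I) ||
                          ((b == M || b == E) && a != I);
                printf("global state (%s,%s)%s\n", name[a], name[b],
                       bad ? "  <-- conflict!" : "");
            }
    return 0;
}
```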

  25. MOESI + ESI • 6 Global States: • S+I+ • ECI* • I* • MCI* • EDI* • OCS*I*

  26. DMA Cache Design Issues • Write Policy • Cache Coherence • Replacement Policy • Prefetching • An LRU-like Replacement Policy (sketched below) • Invalid • Shared • Owned • Exclusive • Modified
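
One plausible reading of that slide, sketched below as an assumption rather than the paper's exact policy: choose the victim by coherence state in the listed order (Invalid first, Modified last), breaking ties within a state by LRU.

```c
enum coh_state { INVALID, SHARED, OWNED, EXCLUSIVE, MODIFIED };

struct way {
    enum coh_state state;
    unsigned long  last_use;     /* smaller = older, used for LRU tie-breaking */
};

/* Choose a victim among `n` ways: cheapest coherence state first
 * (Invalid < Shared < Owned < Exclusive < Modified), oldest within a state. */
int choose_victim(const struct way *ways, int n)
{
    int victim = 0;
    for (int i = 1; i < n; i++) {
        if (ways[i].state < ways[victim].state ||
            (ways[i].state == ways[victim].state &&
             ways[i].last_use < ways[victim].last_use))
            victim = i;
    }
    return victim;
}

int main(void)
{
    struct way set[4] = {
        { MODIFIED, 10 }, { SHARED, 5 }, { EXCLUSIVE, 1 }, { SHARED, 2 }
    };
    return choose_victim(set, 4) == 3 ? 0 : 1;   /* oldest Shared way wins */
}
```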

  27. DMA Cache Design Issues • Write Policy • Cache Coherence • Replacement Policy • Prefetching • Adopt straightforward sequential prefetching (sketched below) • Prefetching triggered by a cache miss • Fetch 4 blocks at a time
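
A minimal sketch of that policy as stated (miss-triggered, sequential, four blocks at a time); the fill routine is a placeholder, not a real interface, and "four blocks" is taken to mean the demanded block plus its successors:

```c
#include <stdio.h>
#include <stdint.h>

#define PREFETCH_BLOCKS 4        /* "fetch 4 blocks at a time" */
#define BLOCK_SIZE      64

/* Placeholder for the real fill path: here it only logs the fetch. */
static void dma_cache_fill(uint64_t block_addr)
{
    printf("fill block at 0x%llx\n", (unsigned long long)block_addr);
}

/* On a DMA-cache miss, fetch the demanded block and its sequential
 * successors; the slides report this is accurate because I/O data is
 * streamed through the buffers almost purely sequentially. */
static void on_dma_cache_miss(uint64_t miss_addr)
{
    uint64_t block = miss_addr / BLOCK_SIZE;
    for (int i = 0; i < PREFETCH_BLOCKS; i++)
        dma_cache_fill((block + i) * BLOCK_SIZE);
}

int main(void)
{
    on_dma_cache_miss(0x1234);   /* toy demand miss */
    return 0;
}
```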

  28. Design Complexity vs. Design Cost • [Figure: the Dedicated DMA Cache (DDC) and Partition-Based DMA Cache (PBDC) organizations]

  29. Outline • Revisiting I/O • DMA Cache Design • Evaluations • Conclusions

  30. Speedup of Dedicated DMA Cache

  31. % of Valid Prefetched Blocks • DMA caches can exhibit impressively high prefetching accuracy • This is because I/O data has a very regular access pattern.

  32. Performance Comparisons • Although PBDC does not require additional on-chip storage, it can achieve about 80% of DDC's performance improvement.

  33. Outline • Revisiting I/O • DMA Cache Design • Evaluations • Conclusions

  34. Conclusions • We have proposed a DMA cache technique to separate I/O data from CPU data • We adopt a Global State Method for integrating heterogeneous cache coherence protocols • Experimental results show that DMA cache schemes are better than the existing approaches that use unified, shared caches for I/O data and CPU data • Still open problems, e.g., • Can I/O data go directly to the L1 cache? • How to design heterogeneous caches for different types of data? • How to optimize the memory controller with awareness of I/O?

  35. Thanks! & Questions?

  36. RTL Emulation Platform • LLC and DMA cache models from Loongson-2F • DDR2 memory controller from Loongson-2F • DDR2 DIMM model from Micron Technology • [Figure: a memory trace drives the LL Cache and DMA Cache models, which feed the MemCtrl and DDR2 DIMM models]
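
A minimal sketch of the trace-driven flow in that figure, with an invented record format and stub cache/memory models (the real HMTT trace format and Loongson-2F models are not given in the slides): each reference is routed to the DMA cache if it came from the DMA engine, otherwise to the LLC, and misses go on to the DDR2 model.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Invented trace record: the real HMTT format is richer (virtual address,
 * process ID, I/O operation tags, ...). */
struct trace_rec {
    uint64_t addr;
    bool     is_write;
    bool     is_dma;     /* reference issued by the DMA engine, i.e. I/O data */
};

/* Stand-ins for the cache and DDR2 models; they return/record nothing useful. */
static bool llc_access(uint64_t addr, bool w)       { (void)addr; (void)w; return false; }
static bool dma_cache_access(uint64_t addr, bool w) { (void)addr; (void)w; return false; }
static void ddr2_access(uint64_t addr, bool w)      { (void)addr; (void)w; }

/* Replay one record: route it to the right cache, then to memory on a miss. */
static void replay(const struct trace_rec *r, unsigned long *misses)
{
    bool hit = r->is_dma ? dma_cache_access(r->addr, r->is_write)
                         : llc_access(r->addr, r->is_write);
    if (!hit) {
        (*misses)++;
        ddr2_access(r->addr, r->is_write);
    }
}

int main(void)
{
    struct trace_rec trace[] = {
        {0x1000, true,  true },   /* DMA engine writes I/O data      */
        {0x1000, false, false},   /* CPU later reads the same line   */
    };
    unsigned long misses = 0;
    for (size_t i = 0; i < sizeof trace / sizeof trace[0]; i++)
        replay(&trace[i], &misses);
    printf("misses: %lu\n", misses);
    return 0;
}
```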

  37. Parameters • DDR2-666

  38. Normalized Speedup for WB • Baseline is the snoop cache scheme • DMA cache schemes exhibit better performance than the others

  39. DMA Write & CPU Read Hit Rate • Both the shared cache and the DMA cache exhibit high hit rates • Then, where do the cycles go in the shared-cache scheme?

  40. Breakdown of Normalized Total Cycles

  41. Design Complexity of PBDC

  42. More References on Cache Coherence Protocol Verification • Fong Pong, Michel Dubois. Formal verification of complex coherence protocols using symbolic state models. Journal of the ACM (JACM), 45(4):557-587, July 1998. • Fong Pong, Michel Dubois. Verification techniques for cache coherence protocols. ACM Computing Surveys (CSUR), 29(1):82-126, March 1997.
