
Latency Reduction Techniques for Remote Memory Access in ANEMONE



  1. Latency Reduction Techniques for Remote Memory Access in ANEMONE Mark Lewandowski Department of Computer Science Florida State University

  2. Outline • Introduction • Architecture / Implementation • Adaptive NEtwork MemOry engiNE (ANEMONE) • Reliable Memory Access Protocol (RMAP) • Two Level LRU Caching • Early Acknowledgments • Experimental Results • Future Work • Related Work • Conclusions

  3. Introduction • Virtual memory performance is bound by slow disks • The state of computers today lends itself to shared memory: Gigabit Ethernet is common, and machines on a LAN have lots of free memory • Improvements to ANEMONE yield higher performance than both disk and the original ANEMONE system • [Figure: memory hierarchy: Registers, Cache, Memory, ANEMONE, Disk]

  4. Contributions • Pseudo Block Device (PBD) • Reliable Memory Access Protocol • Replace NFS • Early Acknowledgments • Shortcut Communication Path • Two Level LRU-Based Caching • Client • Memory Engine

  5. ANEMONE Architecture

  6. Architecture • [Figure: Client Module, RMAP Protocol, Engine Cache]

  7. Pseudo Block Device • Provides a transparent interface for swap daemon and ANEMONE • Is not a kernel modification • Begins handling READ/WRITE requests in order of arrival • No expensive elevator algorithm
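The in-order request handling described on this slide can be modeled in a few lines. This is a user-space sketch, not ANEMONE's kernel code; the class and method names are illustrative:

```python
from collections import deque

class PseudoBlockDevice:
    """Illustrative model: READ/WRITE requests are served strictly in
    arrival order (FIFO), with no elevator-style sector reordering."""

    def __init__(self):
        self.queue = deque()   # pending requests, in arrival order
        self.store = {}        # block number -> page payload

    def submit(self, op, block, data=None):
        self.queue.append((op, block, data))

    def run(self):
        results = []
        while self.queue:
            op, block, data = self.queue.popleft()  # strict FIFO
            if op == "WRITE":
                self.store[block] = data
                results.append(("WRITE", block, "ok"))
            else:
                results.append(("READ", block, self.store.get(block)))
        return results
```

Skipping the elevator algorithm makes sense here: seek time is a disk cost, and remote memory has no seek penalty to amortize.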

  8. Reliable Memory Access Protocol (RMAP) • Lightweight, reliable flow-control protocol • Sits next to the IP layer to give the swap daemon quick access to pages • [Figure: protocol stack: Application / Swap Daemon, Transport, IP / RMAP, Ethernet]

  9. RMAP • Window-based protocol • Requests are served as they arrive • Messages: • REG/UNREG – register/unregister the client with the ANEMONE cluster • READ/WRITE – receive/send page data from/to ANEMONE • STAT – retrieves statistics from the ANEMONE cluster
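The window-based flow control on this slide can be sketched as a bound on outstanding (sent but unacknowledged) requests. This is a toy model, not the RMAP implementation; the window size and class names are assumptions:

```python
class RMAPClient:
    """Toy model of window-based flow control: at most `window`
    requests may be outstanding at once; an ACK frees a slot."""

    MESSAGES = {"REG", "UNREG", "READ", "WRITE", "STAT"}

    def __init__(self, window=8):
        self.window = window
        self.outstanding = set()  # sequence numbers awaiting ACK
        self.next_seq = 0

    def send(self, msg_type):
        assert msg_type in self.MESSAGES
        if len(self.outstanding) >= self.window:
            return None           # window full: caller must wait
        seq = self.next_seq
        self.next_seq += 1
        self.outstanding.add(seq)
        return seq

    def ack(self, seq):
        self.outstanding.discard(seq)  # ACK frees a window slot
```

This also illustrates the interaction noted on the early-acknowledgments slide: the sooner an ACK arrives, the sooner a window slot frees and the next request can be sent.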

  10. Why do we need cache? • It is the natural analogue of on-disk buffers • Caching reduces network traffic • Decreases latency • Write latencies benefit the most • Buffers requests before they are sent over the wire

  11. Basic Cache Structure • FIFO Queue is used to keep track of LRU page • Hashtable is used for fast page lookups
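The structure on this slide (a hash table for fast lookups plus a queue ordered by recency) can be sketched with Python's `OrderedDict`, which plays both roles at once. A minimal sketch, not ANEMONE's kernel implementation:

```python
from collections import OrderedDict

class PageCache:
    """Hash table for O(1) page lookup plus an LRU-ordered queue
    for eviction, combined in one OrderedDict (oldest entry first)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()    # block -> page data

    def get(self, block):
        if block not in self.pages:
            return None               # miss: caller fetches remotely
        self.pages.move_to_end(block) # a hit refreshes recency
        return self.pages[block]

    def put(self, block, data):
        if block in self.pages:
            self.pages.move_to_end(block)
        self.pages[block] = data
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)  # evict least recently used
```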

  12. ANEMONE Cache Details • Client Cache • 16 MB • Write-Back • Memory allocation at load time • Engine Cache • 80 MB • Write-Through • Partial memory allocation at load time • sk_buffs are copied when they arrive at the Engine
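The write-back / write-through distinction on this slide comes down to when a write reaches the backing store. A schematic sketch of the two policies (the classes are illustrative, not ANEMONE code):

```python
class WriteBackCache:
    """Client-side policy sketch: writes stay in the cache and are
    marked dirty; they reach the backing store only on flush/eviction."""
    def __init__(self, backing):
        self.backing = backing
        self.cache, self.dirty = {}, set()

    def write(self, block, data):
        self.cache[block] = data
        self.dirty.add(block)         # deferred: nothing sent yet

    def flush(self):
        for block in sorted(self.dirty):
            self.backing[block] = self.cache[block]
        self.dirty.clear()


class WriteThroughCache:
    """Engine-side policy sketch: every write updates the cache and
    the backing store immediately."""
    def __init__(self, backing):
        self.backing = backing
        self.cache = {}

    def write(self, block, data):
        self.cache[block] = data
        self.backing[block] = data    # propagated on every write
```

Write-back on the client is what lets writes complete at local-memory speed; write-through on the engine keeps the authoritative copy consistent.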

  13. Early Acknowledgments • Reduce client wait time • Can reduce write latency by up to 200 µs per write request • Early ACK performance is limited by the small RMAP window size • A small pool (~200) of sk_buffs is maintained for early ACKing
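The idea above can be sketched as follows: the engine ACKs a WRITE as soon as the request is buffered, before the page is committed, so the client stops waiting sooner. A simplified model (buffer-pool accounting and names are illustrative; the pool size mirrors the slide's ~200 reserved sk_buffs):

```python
class EarlyAckEngine:
    """Sketch of early acknowledgment: a WRITE is ACKed once copied
    into a preallocated buffer, before it is committed to memory."""

    def __init__(self, pool_size=200):
        self.pool_free = pool_size  # buffers reserved for early ACKs
        self.memory = {}            # committed pages
        self.pending = []           # buffered and ACKed, not yet committed

    def on_write(self, block, data):
        if self.pool_free == 0:
            self.commit_all()       # pool exhausted: drain first
        self.pool_free -= 1
        self.pending.append((block, data))
        return "ACK"                # sent before the page is committed

    def commit_all(self):
        for block, data in self.pending:
            self.memory[block] = data
        self.pool_free += len(self.pending)
        self.pending.clear()
```

The small-window limitation noted above follows from this model: an early ACK only helps if the freed window slot lets the client send another request, so a tiny window caps the benefit.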

  14. Experimental Testbed • Experimental testbed configured with 400,000 blocks (4 KB pages) of memory (~1.6 GB)

  15. Experimental Description • Latency • 100,000 Read/Write requests • Sequential/Random • Application Run Times • Quicksort / POV-Ray • Single/Multiple Processes • Execution Times • Cache Performance • Measured cache hit rates • Client / Engine

  16. Sequential Read

  17. Sequential Write

  18. Random Read

  19. Random Write

  20. Single Process Performance • Increase single-process size by 100 MB for each iteration • Quicksort: 298% performance increase over disk, 226% increase over the original ANEMONE • POV-Ray: 370% performance increase over disk, 263% increase over the original ANEMONE

  21. Multiple Process Performance • Increase the number of 100 MB processes by 1 for each iteration • Quicksort: 710% increase over disk, 117% increase over the original ANEMONE • POV-Ray: 835% increase over disk, 115% increase over the original ANEMONE

  22. Client Cache Performance • Hits save ~500 µs each • POV-Ray hit rate saves ~270 seconds for the 1200 MB test • Quicksort hit rate saves ~45 seconds for the 1200 MB test • Swap-daemon prefetching interferes with cache hit rates

  23. Engine Cache Performance • Cache hit rate levels out at ~10% • POV-Ray does not exceed 10% because it performs over 3x the number of page swaps that Quicksort does • Engine cache saves up to 1000 seconds for the 1200 MB POV-Ray test

  24. Future Work • More extensive testing • Aggressive caching algorithms • Data Compression • Page Fragmentation • P2P • RDMA over Ethernet • Scalability and Fault tolerance

  25. Related Work • Global Memory System [feeley95] • Implements a global memory management algorithm over ATM • Does not directly address Virtual Memory • Reliable Remote Memory Pager [markatos96], Network RAM Disk [flouris99] • TCP Sockets • Samson [stark03] • Myrinet • Does not perform caching • Remote Memory Model [comer91] • Implements custom protocol • Guarantees in-order delivery

  26. Conclusions • ANEMONE does not modify the client OS or applications • Performance increases by up to 263% for single processes • Performance increases by up to 117% for multiple processes • Improved caching is a promising line of research, but more aggressive algorithms are required

  27. Questions?

  28. Appendix A: Quicksort Memory Access Patterns

  29. Appendix B: POV-Ray Memory Access Patterns
