
A Measurement Based Memory Performance Evaluation of High Throughput Servers

This study evaluates the memory performance of high-throughput servers in streaming media, web, and IP-forwarding applications, observing the impact of the OS on memory performance and identifying bottlenecks.



Presentation Transcript


  1. A Measurement Based Memory Performance Evaluation of High Throughput Servers
  Garba Isa Yau, Department of Computer Engineering, King Fahd University of Petroleum & Minerals, Dhahran, Saudi Arabia
  April 14, 2003

  2. Motivation
  • CPU–memory speed gap
    • CPU speed doubles in about 18 months (Moore's Law)
    • Memory access time improves by only one-third in 10 years
  • Network bandwidth has improved significantly
    • Gigabit per second already deployed on LANs
    • NICs operate at up to 10 Gbps
    • Ethernet switches are also available in that range
  • Hierarchical memory architecture was introduced to alleviate the CPU–memory speed gap
  • It works on locality of reference of data
    • temporal locality
    • spatial locality
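The two kinds of locality above can be made concrete with a small, hypothetical traversal (not from the talk): in a compiled language, the row-major loop benefits from spatial locality because consecutive accesses touch adjacent memory, while the column-major loop jumps a full row between accesses.

```python
# Minimal sketch of spatial locality: two traversals of the same matrix.
# (Illustrative only; in CPython the cache effect is muted by interpreter
# overhead, but the access patterns are what a compiled loop would issue.)
N = 512
matrix = [[1] * N for _ in range(N)]

# Row-major: consecutive accesses hit adjacent elements (good spatial locality).
row_major_sum = sum(matrix[i][j] for i in range(N) for j in range(N))

# Column-major: each access jumps to a different row (poor spatial locality).
col_major_sum = sum(matrix[i][j] for j in range(N) for i in range(N))

assert row_major_sum == col_major_sum == N * N
```

Temporal locality, by contrast, would mean revisiting the same small block of elements repeatedly — exactly what streaming data (next section) fails to do.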

  3. Motivation
  • Do all applications benefit from the memory hierarchy?
    • some data have poor temporal locality (continuous data)
    • the working set might be too large to fit into the cache, even if the data has good spatial locality
    • some data are never reused
  • For applications using these types of data, the hierarchical memory architecture becomes ineffective
  SO WHERE EXACTLY IS THE BOTTLENECK?

  4. Streaming Media Servers
  • Streaming media content is continuous data
    • the working set is normally large and cannot fit into the cache
    • it has very poor temporal locality (data reuse is poor)
  • A typical scenario of a streaming media transaction (RTSP)
  [Diagram: RTSP, RTP and RTCP exchanges between client and server, with memory accesses on the server side]

  5. Typical data flow in streaming using RTP
  • The transaction has:
    • stringent timing requirements
    • high bandwidth requirements
    • CPU-intensive processing
    • high memory requirements
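The "stringent timing" point can be quantified with a back-of-the-envelope sketch (the payload size is an assumption, not a figure from the study): at the 300 kbps encoding rate used in the experiments, with roughly 1 KB RTP payloads, the server must emit a packet about every 27 ms — per stream — or the client stalls.

```python
# Illustrative arithmetic: per-stream packet pacing for a streaming server.
# The 1000-byte payload is an assumed value, not taken from the experiments.
encoding_rate_bps = 300_000   # 300 kbps stream, as in the study's workload
payload_bytes = 1_000         # assumed RTP payload size

packets_per_second = encoding_rate_bps / (payload_bytes * 8)
inter_packet_gap_ms = 1000 / packets_per_second

print(f"{packets_per_second:.1f} packets/s, one every {inter_packet_gap_ms:.1f} ms")
```

With hundreds of concurrent streams, these deadlines multiply, which is why the page-fault stalls shown later translate directly into client timeouts.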

  6. Web Servers
  • Web content is normally a set of small files that make up a web document
    • the working set is normally composed of small files (average aggregate size is 10 KB)
    • poor temporal locality
    • little or no data reuse
  • Web transaction (HTTP)
  [Diagram: HTTP exchange between client and server, with memory accesses]

  7. Typical data flow in an HTTP transaction
  • The transaction has:
    • relaxed timing requirements
    • but still high bandwidth requirements
    • a high connection rate (connections are established and torn down within a short time under HTTP/1.0)
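The relationship between file size, transaction rate, and throughput that the later web-server slides measure can be sketched with illustrative numbers (these are assumptions for the arithmetic, not the study's measurements): aggregate throughput is simply file size times transaction rate, so small files need a very high connection rate to saturate the link.

```python
# Sketch of the size/rate trade-off: throughput = file size x transaction rate.
# All numbers below are illustrative assumptions.
def throughput_mbps(file_size_kb: float, transactions_per_sec: float) -> float:
    """Aggregate server throughput in megabits per second."""
    return file_size_kb * 1024 * 8 * transactions_per_sec / 1e6

# Many small files: high connection rate, modest throughput.
small = throughput_mbps(10, 2000)   # avg 10 KB documents at 2000 txn/s
# Few large files: low connection rate, higher throughput.
large = throughput_mbps(1024, 50)   # 1 MB documents at 50 txn/s

assert large > small
```

This is why HTTP/1.0's per-request connection setup and teardown dominates the cost for small documents: the server does far more connection work per byte delivered.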

  8. IP Forwarding
  • IP packets are generally small (the maximum is 65,535 bytes)
  • Due to datagram fragmentation by routers, packets are typically less than 1.5 KB (MTU issue)
  • packets are just forwarded; no data associated with any packet is reused
  • apart from the need for high speed, no strict timing has to be maintained
  • At high throughput, a lot of memory copying is involved: moving a lot of data (IP headers) into the cache for processing
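The fragmentation point above can be illustrated with a small helper (a sketch, not code from the study): IPv4 fragment payloads (except the last) must be multiples of 8 bytes, so a standard 1500-byte Ethernet MTU carries at most 1480 payload bytes per fragment.

```python
# Sketch: how many fragments an IPv4 datagram payload needs for a given MTU.
IP_HEADER = 20  # bytes, assuming a header with no options

def fragment_count(datagram_payload: int, mtu: int = 1500) -> int:
    # Largest 8-byte-aligned payload that fits in one fragment.
    per_fragment = (mtu - IP_HEADER) // 8 * 8
    return -(-datagram_payload // per_fragment)  # ceiling division

assert fragment_count(1480) == 1   # fits in a single Ethernet frame
assert fragment_count(4000) == 3   # 1480 + 1480 + 1040 bytes
```

Since each fragment is forwarded independently and its data is never reused, a router sees a stream of small, cache-unfriendly units of work.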

  9. Typical data flow in IP forwarding

  10. Server Platform
  • Pentium 4 processor (2.0 GHz):
    • L1 cache: 8 KB
    • L2 cache: 512 KB
  • Peripherals:
    • 1 Gbps NIC
    • 40 GB EIDE hard drive (Western Digital WD400)
  • Main memory: 256 MB
  • Operating systems:
    • Linux Red Hat 7.2 (kernel 2.4.7-10)
    • Windows 2000 Server
  • Network (LAN):
    • 1 Gbps layer-2 switch

  11. Memory Transfer Test
  • ECT (Extended Copy Transfer): characterizes memory performance to observe the impact of the OS on memory performance
  • Locality of reference:
    • temporal locality – varying working-set size (block size)
    • spatial locality – varying access pattern (strides)
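A stride sweep of this kind can be sketched as follows (in the spirit of ECT, not the actual benchmark): fix a buffer much larger than the caches, then time accesses as the stride grows. In a compiled language, per-access cost rises sharply once the stride exceeds a cache line, exposing the spatial-locality cliff the test is after.

```python
# Sketch of an ECT-style stride test (illustrative; CPython's interpreter
# overhead dominates, but the access pattern matches what ECT varies).
import array
import time

buf = array.array('l', range(1 << 20))  # ~1M machine longs, far larger than L1/L2

def touch(stride: int):
    """Sum every stride-th element; return (elapsed seconds, checksum)."""
    start = time.perf_counter()
    total = 0
    for i in range(0, len(buf), stride):
        total += buf[i]
    return time.perf_counter() - start, total

for stride in (1, 16, 256):
    elapsed, _ = touch(stride)
    print(f"stride {stride:4d}: {elapsed * 1e3:.2f} ms")
```

Varying the buffer size instead of the stride probes temporal locality: once the working set exceeds a cache level, repeat passes stop getting cheaper.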

  12. Performance of streaming media servers
  • L1 Cache Performance
  [Charts: L1 cache misses at 56 kbps and at 300 kbps]
  • L1 cache misses are mostly influenced by the number of streams
  • Worst-case performance occurs when the number of streams is high: 300 kbps encoding rate and multiple media contents requested by clients

  13. Performance of streaming media servers
  • Memory Performance and Throughput
  [Charts: page fault rate and throughput at 300 kbps]
  • Requests for a unique media object do not incur many page faults, since the object can easily be served from memory
  • Requests for multiple objects lead to a high page fault rate, since many data blocks have to be fetched from disk
  • A high page fault rate leads to client timeouts due to long delays
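Page-fault counts of the kind plotted here can be observed from a process itself; a minimal sketch (Unix-only, using the standard `resource` module — not the study's instrumentation, which used system-wide metrics) touches a run of fresh pages and reads the minor-fault counter before and after:

```python
# Sketch: watching a process's own minor page faults via getrusage (Unix-only).
import resource

before = resource.getrusage(resource.RUSAGE_SELF).ru_minflt

data = bytearray(16 * 1024 * 1024)           # 16 MB of freshly mapped pages
data[::4096] = b'\x01' * len(data[::4096])   # touch one byte in each 4 KB page

after = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
print(f"minor page faults incurred: {after - before}")
```

Major faults (`ru_majflt`), which require a disk read, are the expensive kind driving the client timeouts described above.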

  14. Performance of Web servers
  • Transactions and Throughput
  • Smaller files are transferred within a short time, hence more connections are established and released at a high rate
  [Chart: number of transactions per second]
  • For larger files, throughput is high even though transactions/sec are low (fewer connections made)
  [Chart: throughput in Mbytes/sec]

  15. Performance of Web servers
  • Cache Performance
  • L1 and L2 cache performance is poor when the document size is small. WHY?
  [Charts: L1 cache misses and L2 cache misses]

  16. Performance of Web servers
  • Page Faults and Latency
  • Unlike a small file, a large file has to be continuously fetched from disk, leading to more page faults
  [Chart: page fault rate]
  • Large files significantly increase the average latency of the server. As clients wait too long, they may time out
  [Chart: latency]

  17. Performance of IP forwarding
  • Experimental setup
  • Routing (creating and updating the routing table) is done by 'routed'
  • IP forwarding is done in Linux kernel space

  18. Performance of IP forwarding
  • Routing configurations:
    • 1 and 2: 1-1 communication (simplex and duplex)
    • 3 and 4: double 1-1 communication (simplex and duplex)
    • 7 and 8: ring communication (simplex and duplex)
    • 5 and 6: 1-4 communication (simplex and duplex)

  19. Performance of IP forwarding • Bandwidth

  20. Performance of IP forwarding
  • Maximum bandwidth: 449 Mbps
    • at configuration 2 – only two NICs involved in the router
    • CPU utilization (system) – a mere 19.04%
    • context switches – 1312 (only two NICs switched)
    • Active pages – 1006.48 (highest observed)
  • A very small packet size (64 bytes) degrades performance
    • It accounts for the highest context switching
  • Fairly uniformly distributed active-page figures indicate that memory activity is not very intensive

  21. Performance of IP forwarding • Other metrics

  22. Conclusion
  Streaming servers:
  • Performance is highly degraded due to cache misses and page faults
  • Streaming uses continuous data with a large working set and poor temporal locality (no data reuse)
  Web servers:
  • A small working set does not help much, as frequent connection setup and teardown degrades performance significantly
  • When the document is large, the server delay becomes unacceptably high, leading to client timeouts
  • A large document size also leads to a high page fault rate

  23. Conclusion
  IP forwarding:
  • Memory performance is not the main factor in the overall performance of IP forwarding in the Linux kernel
  • Context-switching overhead is highly significant and a key factor in performance degradation. The more interfaces involved in forwarding packets, the greater the contention for resources (bus contention)
  • All CPU activity (kernel space only) is below 100%. If bus contention is resolved, more throughput can be obtained

  24. Thank you
