
Evaluating the Impact of Simultaneous Multithreading on Network Servers Using Real Hardware



Presentation Transcript


  1. Evaluating the Impact of Simultaneous Multithreading on Network Servers Using Real Hardware Yaoping Ruan, Princeton University; Vivek Pai, Princeton University; Erich Nahum, IBM T.J. Watson; John Tracey, IBM T.J. Watson

  2. Motivation • Network servers • Throughput matters • Hardware intensive • Simultaneous Multithreading (SMT) • Processor support for high throughput • Simulated since mid-90s • Now - Intel Xeon/Pentium 4 (Hyper-Threading), IBM POWER5 available http://www.cs.princeton.edu/~yruan

  3. How Does SMT Work? • Simultaneous execution of multiple jobs • Higher utilization of functional units [Figure: pipeline diagrams comparing Job 1 on Processor 1 and Job 2 on Processor 2 against Jobs 1 & 2 sharing one SMT processor; colored blocks are functional units currently in use, shown across cycles in the direction of data flow]

  4. SMT Architecture • Appears as a multiprocessor to the OS and applications • Duplicated resource: architectural state registers (#1 and #2) • Shared resources: pipeline, execution units, cache hierarchy, system bus, main memory

  5. Contributions • Detailed analysis of multiple real hardware platforms and server packages • Includes previously ignored OS overheads • Micro-architectural performance analysis • Demonstrates dominance of the memory hierarchy • Comparison with simulation studies • Explains why SMT provides relatively small benefits on real hardware • Overly aggressive memory simulation yielded higher expected benefits

  6. Outline • Background • Measurement methodology • Throughput & improvement • Micro-architectural performance • Discussion

  7. Measurements Overview • Metrics • Server throughput • Throughput improvements (relative speedups) • Architectural features (CPI, miss ratio, etc.) • Multiple configurations • Hardware platforms (clock speed, cache, etc.) • Server software (Apache, Flash, TUX, etc.) • Kernel configuration (uniprocessor and multiprocessor)

  8. Hardware Platforms • Three models of Xeon processors, differing in clock rate and cache: 2.0 GHz, 3.06 GHz, and 3.06 GHz with L3 cache [Table: clock rate and cache configuration per model]

  9. Web Servers • 5 Web server packages • Apache-MP: multi-process • Apache-MT: multi-thread • Flash: event-driven • TUX: in-kernel • Haboob: Java server, staged multi-thread model • Benchmark • SPECweb96 and SPECweb99

  10. System Configuration • 5 configuration labels, defined by # CPUs, SMT on/off, and kernel type (T – # threads, P – # processors): 1P-UP and 1P-MP (1 CPU, SMT off, uniprocessor vs. multiprocessor kernel), 2T (1 CPU, SMT on, multiprocessor kernel), 2P (2 CPUs, SMT off), 4T (2 CPUs, SMT on)

  11. Outline • Background • Measurement methodology • Throughput & improvement • Single processor • Dual-processor • Micro-architectural performance • Discussion

  12. Throughput Evaluation [Chart: Apache-MP at 3.06 GHz; throughput (Mb/s, 0–1200) for the single-processor configurations 1P-UP, 1P-MP, and 2T (SMT) and the dual-processor configurations 2P and 4T (SMT); marked comparisons: 2T vs. 1P-UP, 2T vs. 1P-MP, and 4T vs. 2P]

  13. Improvement on Single Processor • 2T: 2 threads, multiprocessor kernel • 1P-MP: 1 thread, multiprocessor kernel [Chart: throughput improvement (%) of 2T vs. 1P-MP, on a -10% to 40% scale, for Apache-MP, Apache-MT, Flash, TUX, and Haboob on the 2.0 GHz, 3.06 GHz, and 3.06 GHz L3 platforms]
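The improvement percentages plotted in these charts are relative speedups of one configuration's throughput over another's. A minimal sketch of that arithmetic, using made-up throughput numbers rather than the paper's measurements:

```python
def relative_speedup(throughput_new, throughput_base):
    """Throughput improvement of one configuration over another, in percent."""
    return (throughput_new - throughput_base) / throughput_base * 100.0

# Hypothetical throughputs in Mb/s (illustrative only, not measured values).
smt_2t = 1100.0   # SMT enabled, 2 threads
base_1p = 1000.0  # single processor, SMT disabled

print(relative_speedup(smt_2t, base_1p))  # 10.0
```

A negative result (as seen for some servers in the charts) simply means the SMT configuration ran slower than its baseline.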

  14. Improvement on Single Processor • 2T: 2 threads, multiprocessor kernel • 1P-UP: 1 thread, uniprocessor kernel • Gap versus the previous comparison shows the kernel overhead [Chart: throughput improvement (%) of 2T vs. 1P-UP, on a -10% to 40% scale, for Apache-MP, Apache-MT, Flash, TUX, and Haboob on the 2.0 GHz, 3.06 GHz, and 3.06 GHz L3 platforms]

  15. Improvement on Dual-processor • 4T: 4 threads (2 processors, 2T/processor) • 2P: 2 physical processors (SMT disabled) • 2.0 GHz & 3.06 GHz with L3 are better • Memory is still the bottleneck [Chart: throughput improvement (%) of 4T vs. 2P, on a -20% to 40% scale, for Apache-MP, Apache-MT, Flash, TUX, and Haboob on the 2.0 GHz, 3.06 GHz, and 3.06 GHz L3 platforms]

  16. Micro-architectural Analysis • Use OProfile • In-house patch to measure extra events • About 25 performance events • Cache miss/hit • TLB miss/hit • Branches • Pipeline stall, clear, etc. • Bus utilization

  17. L1 Instruction Cache Miss Rate [Chart: miss rates from 0% to 20% for Apache-MP, Apache-MT, Flash, TUX, and Haboob under 1P-UP, 1P-MP, 2T (SMT), 2P, and 4T (SMT)]

  18. L2 Cache Miss Rate • Instruction & data unified • Lower rate in SMT due to higher L1 misses [Chart: miss rates from 0% to 10% for Apache-MP, Apache-MT, Flash, TUX, and Haboob under 1P-UP, 1P-MP, 2T (SMT), 2P, and 4T (SMT)]
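The second bullet is easy to miss: the L2 miss *rate* here is local (L2 misses divided by L2 accesses, and L2 accesses are roughly L1 misses), so if SMT inflates L1 misses faster than it adds L2 misses, the L2 rate falls even though absolute misses rise. A sketch with hypothetical event counts:

```python
def l2_miss_rate(l1_misses, l2_misses):
    # L2 accesses are (roughly) L1 misses, so this is the local L2 miss rate.
    return l2_misses / l1_misses

# Hypothetical event counts, illustrative only (not measured values).
single = l2_miss_rate(l1_misses=1_000_000, l2_misses=80_000)   # 0.08
smt    = l2_miss_rate(l1_misses=2_000_000, l2_misses=100_000)  # 0.05

# SMT's rate is lower even though its absolute L2 miss count is higher.
assert smt < single
```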

  19. Putting Events Together • Cycles per instruction (CPI), broken down into work, L1 miss, L2 miss, ITLB, DTLB, branch, clear, and buffer components [Chart: stacked CPI bars for Apache-MP, 0–16 cycles, under 1P-UP, 1P-MP, 2T, 2P, and 4T; the L1/L2 miss components are the largest]

  20. Non-overlapped CPI • L1/L2 miss penalty dominates
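One way to read the non-overlapped breakdown: attribute cycles to each event class by multiplying its per-instruction frequency by a fixed penalty, ignoring overlap between events. A sketch of that accounting; the penalties and event counts below are hypothetical stand-ins, not the paper's calibrated values:

```python
# Assumed per-event penalties in cycles (hypothetical, for illustration).
PENALTIES = {"l1_miss": 18, "l2_miss": 350, "branch_clear": 20}

def cpi_breakdown(instructions, base_cpi, events):
    """Non-overlapped CPI: base 'work' CPI plus per-event stall contributions."""
    breakdown = {"work": base_cpi}
    for name, count in events.items():
        # Contribution = events per instruction * penalty per event.
        breakdown[name] = count / instructions * PENALTIES[name]
    breakdown["total"] = sum(breakdown.values())
    return breakdown

b = cpi_breakdown(
    instructions=1_000_000,
    base_cpi=1.0,
    events={"l1_miss": 50_000, "l2_miss": 20_000, "branch_clear": 5_000},
)
# Even a 2% L2 miss ratio contributes 0.02 * 350 = 7 cycles per instruction,
# dwarfing the other components - the "miss penalty dominates" effect.
```

Because overlap is ignored, the components can sum to more than the measured CPI; the chart's value is in ranking the contributors, not in exact totals.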

  21. Measuring Bus Utilization • Event: FSB_DATA_ACTIVITY • CPU cycles when the bus is busy • Normalized to CPU speed • Comparable across all CPU clock rates
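The normalization step is simple arithmetic: dividing bus-busy cycles by total CPU cycles over the same interval yields a fraction that can be compared across the 2.0 GHz and 3.06 GHz machines. A sketch with hypothetical sample numbers:

```python
def bus_utilization(bus_busy_cycles, elapsed_seconds, cpu_hz):
    """Fraction of CPU cycles during which the front-side bus carried data."""
    total_cycles = elapsed_seconds * cpu_hz
    return bus_busy_cycles / total_cycles

# Hypothetical sample: 3.0e8 busy cycles over 1 s on a 3.06 GHz Xeon.
u = bus_utilization(bus_busy_cycles=3.0e8, elapsed_seconds=1.0, cpu_hz=3.06e9)
# u is roughly 0.098, i.e. the bus was busy ~10% of the time.
```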

  22. Bus Utilization Results • 2.0 GHz & 3.06 GHz L3 spend fewer cycles on data transfer • Lower memory latency at 2.0 GHz & 3.06 GHz with L3 • Coefficient of correlation between bus utilization & speedups: 0.62 ~ 0.95 [Chart: Apache-MP bus utilization (%), 0–20, for 1P-UP, 1P-MP, 2T, 2P, and 4T on the 2.0 GHz, 3.06 GHz, and 3.06 GHz L3 platforms]
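The reported coefficient is a standard Pearson correlation between per-configuration bus utilization and SMT speedup. A self-contained sketch, with made-up data points chosen only to illustrate the computation:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical (bus utilization %, SMT speedup %) pairs, illustrative only.
util    = [5.0, 8.0, 12.0, 15.0, 20.0]
speedup = [2.0, 6.0, 9.0, 11.0, 14.0]
r = pearson(util, speedup)  # strong positive correlation for this made-up data
```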

  23. Outline • Background • Measurement parameters • Throughput speedup • Micro-architectural performance • Discussion • Compare to simulation • Other Web workloads

  24. SMT Performance on Web Servers [Chart: throughput improvement (%) on a -10% to 100% scale, contrasting simulation results with measured improvements under the uniprocessor kernel, the multiprocessor kernel, and dual processors; simulated improvements are far larger than any measured ones]

  25. Compare to Simulation

  26. Processor Development Trend • Simulated models: 1996 – 62-cycle mem, 32 KB L1, 256 KB L2; 2000 – 90-cycle mem, 64 KB L1, 16384 KB L2; 2003 – 90-cycle mem, 128 KB L1, 16384 KB L2 • Actual processors: 1996 – 74-cycle mem, 16 KB L1, 256 KB L2; 2000 – 94-cycle mem, 16 KB L1, 512 KB L2; 2003 – 350-cycle mem, 8-12 KB L1, 512 KB L2

  27. SMT on SPECweb99 • SPECweb99 results in paper • Dynamic + static • Multiple programs • CGI requests, user profile logging, etc. • Speedup very close to static-only workloads • No more negative speedups in Flash • May be due to better resource sharing among different programs

  28. Summary • More realistic speedup evaluation of SMT • 3 processors, 5 servers, 2 kernels • Exposed factors not previously examined • 5~15% speedup in our best cases • Detailed analysis of memory hierarchy impact on SMT performance • All other architecture overheads secondary • Reasons why simulation results were overly optimistic

  29. Thank you

  30. Future Work • Ways of improving Simultaneous Multithreading performance • Server performance on POWER5 • Using execution-driven simulation for deeper understanding • Studying Chip Multiprocessors (CMP) • Intel, AMD, and IBM

  31. Pipeline Clears (per Byte) • Conditions when the whole pipeline needs to be flushed [Chart: pipeline clears per byte, 0.00–0.30, for Apache-MP, Apache-MT, Flash, TUX, and Haboob under 1T-UP, 1T-MP, 2T, 2P, and 4T]
