
Design and Evaluation of Architectures for Commercial Applications



  1. Design and Evaluation of Architectures for Commercial Applications Part III: architecture studies Luiz André Barroso

  2. Overview (3) • Day III: architecture studies • Memory system characterization • Impact of out-of-order processors • Simultaneous multithreading • Final remarks UPC, February 1999

  3. Memory system performance studies • Collaboration with Kourosh Gharachorloo and Edouard Bugnion • Presented at ISCA’98

  4. Motivations • Market shift for high-performance systems • yesterday: technical/numerical applications • today: databases, Web servers, e-mail services, etc. • Bottleneck shift in commercial applications • yesterday: I/O • today: memory system • Lack of data on the behavior of commercial workloads • Re-evaluate memory system design trade-offs

  5. Bottleneck Shift • Just a few years back [Thakkar&Sweiger90] I/O was the only important bottleneck • Since then, several improvements: • better DB engines can tolerate I/O latencies • better OSes do more efficient I/O operations and are more scalable • better parallelism in the disk subsystem (RAIDs) provides more bandwidth • … and memory keeps getting “slower” • faster processors • bigger machines • Result: the memory system is a primary factor today

  6. Workloads • OLTP (on-line transaction processing) • modeled after TPC-B, using Oracle7 DB engine • short transactions, intense process communication & context switching • multiple transactions in-transit • DSS (decision support systems) • modeled after TPC-D, using Oracle7 • long running transactions, low process communication • parallelized queries • AltaVista • Web index search application using custom threads package • medium sized transactions, low process communication • multiple transactions in-transit

  7. Methodology: Platform • AlphaServer4100 5/300 • 4x 300 MHz processors (8KB/8KB I/D caches, 96KB L2 cache) • 2MB board-level cache • 2GB main memory • latencies: 1:7:21:80/125 cycles • 3-channel HSZ disk array controller • Digital Unix 4.0B

  8. Methodology: Tools • Monitoring tools: • IPROBE • DCPI • ATOM • Simulation tools: • tracing: preliminary user-level studies • SimOS-Alpha: full system simulation, including OS

  9. Scaling • Workload sizes make these applications difficult to study • Scaling the problem size is critical • Validation criterion: memory system behavior similar to larger runs • Requires a good understanding of the workload • make sure the system is well tuned • keep the SGA many times larger than the hardware caches (1GB) • use the same number of servers per processor as audit-sized runs (4-8/CPU)

  10. CPU Cycle Breakdown • Very high CPI for OLTP • Instruction and data related stalls are equally important
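The stall breakdown above can be sketched with a toy CPI model. The per-instruction miss rates below are hypothetical, chosen only for illustration; the latency tiers (7/21/80 cycles) echo the AlphaServer 4100 numbers given earlier.

```python
# Toy CPI decomposition in the spirit of the stall breakdown above.
# The per-instruction miss rates are made up for illustration; only the
# latency tiers (7/21/80 cycles) echo the AlphaServer 4100 description.

def cpi(base_cpi, components):
    """Total CPI = base CPI + sum of (misses per instruction * miss penalty)."""
    return base_cpi + sum(rate * penalty for rate, penalty in components)

oltp_components = [
    (0.10, 7),   # L1 miss served by the 96KB on-chip L2 (hypothetical rate)
    (0.04, 21),  # L2 miss served by the 2MB board cache (hypothetical rate)
    (0.01, 80),  # board-cache miss to local memory (hypothetical rate)
]

print(round(cpi(1.0, oltp_components), 2))  # high CPI even with an ideal core
```

Even modest miss rates at each level add stall cycles that dwarf the base CPI, which is the effect the slide reports for OLTP.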

  11. Cache behavior

  12. Stall Cycle Breakdown • OLTP dominated by non-primary cache and memory stalls • DSS and AltaVista stalls are mostly Scache hits

  13. Impact of On-Chip Cache Size P=4; 2MB, 2-way off-chip cache • 64KB on-chip caches are enough for DSS

  14. OLTP: Effect of Off-Chip Cache Organization P=4 • Significant benefits from large off-chip caches (up to 8MB)

  15. OLTP: Impact of System Size P=4; 2MB, 2-way off-chip cache • Communication misses become dominant for larger systems

  16. OLTP: Contribution of Dirty Misses P=4, 8MB Bcache • Shared metadata is the important region • 80% of off-chip misses • 95% of dirty misses • The fraction of dirty misses increases with cache and system size

  17. OLTP: Impact of Off-Chip Cache Line Size P=4; 2MB, 2-way off-chip cache • Good spatial locality on communication for OLTP • Very little false sharing in Oracle itself
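The spatial-locality effect of larger lines can be illustrated with a minimal cold-miss model; the sizes and access pattern here are assumptions for the sketch, not the study's parameters.

```python
def cold_misses(addresses, line_size):
    """Compulsory misses in an idealized infinite cache: the first touch
    of each cache line is the only miss."""
    seen, misses = set(), 0
    for a in addresses:
        line = a // line_size
        if line not in seen:
            seen.add(line)
            misses += 1
    return misses

# 8-byte sequential accesses over a 4KB region (hypothetical pattern):
seq = list(range(0, 4096, 8))
print(cold_misses(seq, 64), cold_misses(seq, 256))
```

With good spatial locality, quadrupling the line size cuts misses by about 4x; with false sharing or sparse access the larger line would buy much less, which is why the finding that Oracle exhibits little false sharing matters.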

  18. Summary of Results • On-chip cache • 64KB I/D sufficient for DSS & AltaVista • Off-chip cache • OLTP benefits from larger caches (up to 8MB) • Dirty misses • Can become dominant for OLTP

  19. Conclusion • Memory system is the current challenge in DB performance • Careful scaling enables detailed studies • Combination of monitoring and simulation is very powerful • Diverging memory system designs • OLTP benefits from large off-chip caches, fast communication • DSS & AltaVista may perform better without an off-chip cache

  20. Impact of out-of-order processors • Collaboration with: • Kourosh Gharachorloo (Compaq) • Parthasarathy Ranganathan and Sarita Adve (Rice) • Presented at ASPLOS’98

  21. Motivation • Databases are the fastest-growing market for shared-memory servers • Online transaction processing (OLTP) • Decision-support systems (DSS) • But current systems are optimized for engineering/scientific workloads • Aggressive use of Instruction-Level Parallelism (ILP) • Multiple issue, out-of-order issue, non-blocking loads, speculative execution • Need to re-evaluate system design for database workloads

  22. Contributions • Detailed simulation study of Oracle with ILP processors • Is ILP design complexity warranted for database workloads? • Improve performance (1.5X OLTP, 2.6X DSS) • Reduce performance gap between consistency models • How can we improve performance for OLTP workloads? • OLTP limited by instruction and migratory data misses • Small stream buffer close to perfect instruction cache • Prefetching/flush appear promising

  23. Simulation Environment - Workloads • Oracle 7.3.2 commercial DBMS engine • Database workloads • Online transaction processing (OLTP) - TPC-B-like • Day-to-day business operations • Decision-support system (DSS) - TPC-D/Query 6 • Offline business analysis

  24. Simulation Environment - Methodology • Used RSIM - the Rice Simulator for ILP Multiprocessors • Detailed simulation of processor, memory, and network • But simulating a commercial-grade database engine is hard • Some simplifications • Similar to Lo et al. and Barroso et al., ISCA’98

  25. Simulation Methodology - Simplifications • Trace-driven simulation • OS/system-call simulation • OS not a large component • Model only key effects • Page-mapping, TLB misses, process scheduling • System-call and I/O time dilation effects • Multiple processes per processor to hide I/O latency • Database scaling

  26. Simulated Environment - Hardware • 4-processor shared-memory system - 8 processes per processor • Directory-based MESI protocol with invalidations • Next-generation processing nodes • Aggressive ILP processor • 128KB 2-way separate instruction and data L1 caches • 8MB 4-way unified L2 cache • Representative miss latencies

  27. Outline • Motivation • Simulation Environment • Impact of ILP on Database Workloads • Multiple issue and OOO issue for OLTP • Multiple outstanding misses for OLTP • ILP techniques for DSS • ILP-enabled consistency optimizations • Improving Performance of OLTP • Conclusions

  28. Multiple Issue and OOO Issue for OLTP • Multiple issue and OOO improve performance by 1.5X • But 4-way issue and a 64-element window are enough • Instruction misses and dirty misses are key bottlenecks • [chart: normalized execution times 100.0, 92.1, 90.1, 88.8, 86.8, 74.3, 68.4, 67.8 across in-order and out-of-order processor configurations]

  29. Multiple Outstanding Misses for OLTP • Support for two distinct outstanding misses is enough • Limited by data-dependent computation • [chart: normalized execution times 100.0, 83.2, 79.4, 79.4 for increasing numbers of outstanding misses]

  30. Impact of ILP Techniques for DSS • Multiple issue and OOO improve performance by 2.6X • 4-way issue, a 64-element window, and 4 outstanding misses are enough • Memory is not a bottleneck • [chart: normalized execution times 100.0, 89.2, 74.1, 68.1, 68.4, 52.1, 39.7, 39.0 across in-order and out-of-order processor configurations]

  31. ILP-Enabled Consistency Optimizations • Memory consistency model of a shared-memory system • Specifies ordering and overlap of memory operations • Performance/programmability tradeoff • Sequential consistency (SC) • Processor consistency (PC) • Release consistency (RC) • ILP-enabled consistency optimizations • Hardware prefetching, speculative loads • Impact on database workloads?
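Why relaxed models historically helped can be seen in a toy latency model: plain SC serializes memory operations, while RC may overlap independent ones. The latencies are assumptions, and real hardware (especially with the ILP-enabled optimizations) lands between these two extremes.

```python
# Four independent cache misses of 60 cycles each (assumed latency).
miss_latencies = [60, 60, 60, 60]

sc_cycles = sum(miss_latencies)  # plain SC: each miss completes before the next issues
rc_cycles = max(miss_latencies)  # RC best case: all four misses fully overlap

print(sc_cycles, rc_cycles)
```

Hardware prefetching and speculative loads let an SC implementation recover much of this overlap, which is why the gap between models shrinks in the results that follow.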

  32. ILP-Enabled Consistency Optimizations • SC: sequential consistency; PC: processor consistency; RC: release consistency • [chart: normalized execution times 100, 88, 74, 72, 68, 68 for SC/PC/RC without and with optimizations] • ILP-enabled optimizations: • OLTP: RC only 1.1X better than SC (was 1.4X) • DSS: RC only 1.18X better than SC (was 1.85X) • Consistency model choice in hardware is less important

  33. Outline • Motivation • Simulation Environment • Impact of ILP on Database Workloads • Improving Performance of OLTP • Improving OLTP - Instruction Misses • Improving OLTP - Dirty Misses • Conclusions

  34. Improving OLTP - Instruction Misses • 4-element instruction cache stream buffer • Hardware prefetching of instructions • 1.21X performance improvement • Simple and effective for database servers • [chart: normalized execution times 100, 83, 71]
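A minimal sketch of a sequential instruction stream buffer is below. It is a simplified stand-in for the 4-element buffer on the slide, not its exact design: on a miss it restarts and prefetches the next sequential cache lines, so straight-line fetch hits in the buffer.

```python
from collections import deque

class StreamBuffer:
    """Minimal sequential stream buffer sketch (a simplified stand-in
    for the slide's 4-element design)."""

    def __init__(self, depth=4):
        self.depth = depth
        self.buf = deque(maxlen=depth)

    def access(self, line):
        """Return True if the fetched cache line was found in the buffer."""
        hit = line in self.buf
        if not hit:
            self.buf.clear()  # a miss restarts the prefetch stream
        # (re)fill with the next sequential lines
        self.buf.extend(range(line + 1, line + 1 + self.depth))
        return hit

sb = StreamBuffer()
fetch = [0, 1, 2, 3, 10, 11]  # a sequential run, then a taken branch
hits = sum(sb.access(line) for line in fetch)
print(hits)
```

Only the first fetch of each sequential run misses; database code's long straight-line fetch runs are what make such a small buffer so effective.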

  35. Improving OLTP - Dirty Misses • Dirty misses • Mostly to migratory data • Due to few instructions in critical sections • Solutions for migratory reads • Software prefetching + producer-initiated flushes • Preliminary results without access to source code • 1.14X performance improvement
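Why migratory data hurts can be seen in a tiny ownership model; this is a sketch of the migratory-sharing pattern, not the simulated coherence protocol.

```python
def dirty_misses(access_sequence):
    """Count accesses that find the line modified in another CPU's cache
    (a sketch of migratory sharing, not the simulated MESI protocol)."""
    owner, misses = None, 0
    for cpu in access_sequence:
        if owner is not None and owner != cpu:
            misses += 1  # cache-to-cache transfer of the dirty line
        owner = cpu      # the accessing CPU becomes the new owner
    return misses

# A lock-protected counter updated in round-robin critical sections on 4 CPUs:
seq = [cpu for _ in range(4) for cpu in range(4)]
print(dirty_misses(seq))
```

Every handoff between critical sections is a dirty miss, so prefetching the line early or having the producer flush it after its short critical section directly attacks this pattern.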

  36. Summary • Detailed simulation study of Oracle with out-of-order processors • Impact of ILP techniques on database workloads • Improve performance (1.5X OLTP, 2.6X DSS) • Reduce performance gap between consistency models • Improving performance of OLTP • OLTP limited by instruction and migratory data misses • Small stream buffer close to perfect instruction cache • Prefetching/flush appear promising

  37. Simultaneous Multithreading (SMT) • Collaboration with: • Kourosh Gharachorloo (Compaq) • Jack Lo, Susan Eggers, Hank Levy, Sujay Parekh (U. Washington) • Exploit the multithreaded nature of commercial applications • Aggressive wide-issue OOO superscalars saturate at 4 issue slots • Potential to increase utilization of issue slots • Potential to exploit parallelism in the memory system

  38. SMT: what is it? • SMT enables multiple threads to issue instructions to multiple functional units in a single cycle • SMT exploits instruction-level & thread-level parallelism • Hides long latencies • Increases resource utilization and instruction throughput • [figure: issue-slot diagrams comparing fine-grain multithreading, superscalar, and SMT across four threads]
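The issue-slot argument reduces to simple arithmetic; the per-thread ILP limit of 3 below is an assumption for the sketch, not a measured figure.

```python
# Toy issue-slot arithmetic (the per-thread ILP limit of 3 is an assumption):
WIDTH = 8           # issue slots per cycle on the simulated machine
PER_THREAD_ILP = 3  # ready instructions a single thread can supply per cycle
THREADS = 4

superscalar_issue = min(WIDTH, PER_THREAD_ILP)           # one thread fills the slots
smt_issue = min(WIDTH, PER_THREAD_ILP * THREADS)         # all threads share the slots

print(superscalar_issue, smt_issue)
```

A single thread's limited ILP leaves most of a wide machine idle; drawing ready instructions from several threads in the same cycle is exactly what fills the remaining slots.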

  39. SMT and database workloads • Pro: a good match: the workloads’ low throughput and high cache miss rates give SMT’s multithreading HW latency to hide • Con: fine-grain interleaving can cause cache interference • What software techniques can help avoid interference?

  40. SMT studies: methodology • Trace-driven simulation • Same traces used in previous ILP study • New front-end to SMT simulator • Used OLTP and DSS workloads

  41. SMT Configuration • 21264-like superscalar base, augmented with: • up to 8 hardware contexts • 8-wide superscalar • 128KB, 2-way set-associative I and D L1 caches, 2-cycle access • 16MB, direct-mapped L2 cache, 12-cycle access • 80-cycle memory latency • 10 functional units (6 integer (4 ld/st), 4 FP) • 100 additional integer & FP renaming registers • integer and FP instruction queues, 32 entries each

  42. OLTP Characterization • Memory behavior (1 context, 16 server processes) • High miss rates & large footprints

  43. Cache interference (16 server processes) • With 8-context SMT, many conflict misses • DSS data set fits in L2$

  44. Where are the misses? • [charts: percent of L2 cache references for OLTP and DSS, broken into misses to the PGA, instructions, metadata, and buffer cache; 16 server processes, 8-context SMT] • L1 and L2 misses dominated by PGA references • Misses result from unnecessary address conflicts

  45. L2$ conflicts: page mapping • Page coloring can be augmented with a random first seed
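A minimal sketch of page coloring with a per-process seed follows. The cache and page sizes match the SMT configuration above; the seed value is fixed only to keep the example deterministic, where a real policy would randomize it at process creation.

```python
L2_SIZE = 16 * 2**20      # 16MB direct-mapped L2 (the SMT configuration above)
PAGE = 8 * 2**10          # 8KB pages (Alpha page size)
COLORS = L2_SIZE // PAGE  # distinct page colors in the L2

def page_color(vpn, seed=0):
    """Page coloring: the physical color follows the virtual page number,
    shifted by a per-address-space seed (randomized at process creation;
    fixed below only to keep the example deterministic)."""
    return (vpn + seed) % COLORS

# Two identically laid-out server processes touch virtual page 0: with the
# same seed they collide on the same L2 region, with distinct seeds they don't.
print(page_color(0, seed=0), page_color(0, seed=517))
```

Because Oracle server processes have nearly identical virtual layouts, plain page coloring maps their hot pages to the same L2 sets; the random first seed breaks that alignment without giving up coloring's locality benefits.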

  46. Results for different page mapping schemes • 16MB, direct-mapped L2 cache, 16 server processes • [charts: global L2 cache miss rate (0 to 10) vs. number of contexts (1, 2, 4, 8) for OLTP and DSS, comparing bin hopping, page coloring, and page coloring with seed]

  47. Why the steady L2$ miss rates? • Not all of the footprint has temporal locality • Critical working sets are being cached • 87% of instruction refs are to 31% of the I-footprint • 41% of metadata refs are to 26KB • SMT and superscalar cache misses comparable • SMT changes the interleaving, not the total footprint • With proper global policies, working sets still fit in the caches: SMT is effective

  48. L1$ conflicts: application-level offsetting • Base of each thread’s PGA is at the same virtual address • Causes unnecessary conflicts in a virtually-indexed cache • Address offsets can avoid the interference • Offset by thread id * 8KB
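The set-index arithmetic behind the offsetting trick can be sketched as below. The L1 size matches the SMT configuration above; the 32-byte line size and the PGA base address are assumptions for the sketch.

```python
L1_SIZE = 128 * 2**10  # 128KB, 2-way L1 (the SMT configuration above)
WAYS, LINE = 2, 32     # 32-byte lines are an assumption
SETS = L1_SIZE // WAYS // LINE

def l1_set(vaddr):
    """Set index of a virtually-indexed L1 for a given virtual address."""
    return (vaddr // LINE) % SETS

PGA_BASE = 0x4000_0000  # hypothetical common PGA base virtual address

# Without offsetting, all 8 threads' PGA bases map to a single L1 set:
print(len({l1_set(PGA_BASE) for tid in range(8)}))

# Offsetting each thread's PGA by tid * 8KB spreads the bases over 8 sets:
print(len({l1_set(PGA_BASE + tid * 8192) for tid in range(8)}))
```

Shifting each thread's private area by a thread-dependent amount costs nothing at runtime yet removes the systematic conflicts that fine-grain interleaving exposes.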

  49. Offsetting results • 128KB, 2-way set associative L1 cache • [charts: L1 data cache miss rate (0 to 30) vs. number of contexts (1, 2, 4, 8) for OLTP and DSS, comparing bin hopping with no offset and with offset]

  50. SMT: constructive interference • Cache interference can also be beneficial • Instruction segment is shared • SMT exploits instruction sharing • Improves I-cache locality • Reduces I-cache miss rate (OLTP): 14% with superscalar to 9% with 8-context SMT
