
System Architecture: Big Iron (NUMA)


Presentation Transcript


  1. System Architecture: Big Iron (NUMA) Joe Chang jchang6@yahoo.com www.qdpma.com

  2. About Joe Chang • SQL Server Execution Plan Cost Model • True cost structure by system architecture • Decoding statblob (distribution statistics) • SQL Clone – statistics-only database • Tools • ExecStats – cross-references index use by SQL execution plan • Performance Monitoring • Profiler/Trace aggregation

  3. Scaling SQL on NUMA Topics • OLTP – Thomas Kejser session • “Designing High Scale OLTP Systems” • Data Warehouse • Ongoing Database Development • Bulk Load – SQL CAT paper + TK session • “The Data Loading Performance Guide” Other Sessions with common coverage: Monitoring and Tuning Parallel Query Execution II, R Meyyappan (SQLBits 6) Inside the SQL Server Query Optimizer, Conor Cunningham Notes from the field: High Performance Storage, John Langford SQL Server Storage – 1000GB Level, Brent Ozar

  4. Server Systems and Architecture

  5. Symmetric Multi-Processing [Diagram: four CPUs on a shared system bus to the MCH, with PXH and ICH I/O hubs] In SMP, processors are not dedicated to specific tasks (as in ASMP): single OS image, and each processor can access all memory. Note that SMP makes no reference to memory architecture. Not to be confused with Simultaneous Multi-Threading (SMT); Intel calls SMT Hyper-Threading (HT), which is not to be confused with AMD Hyper-Transport (also HT).

  6. Non-Uniform Memory Access [Diagram: four nodes of four CPUs each, each node with its own memory controller and node controller, joined by a shared bus or crossbar] • NUMA architecture – path to memory is not uniform • Node: processors, memory, separate or combined memory + node controllers • Nodes connected by shared bus, cross-bar, or ring • Traditionally 8-way+ systems. Local memory latency ~150ns, remote node memory ~300-400ns; can cause erratic behavior if OS/code is not NUMA-aware.
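
On a NUMA system it is worth checking the node layout SQL Server itself detects. A minimal sketch, assuming SQL Server 2008 or later (where these DMV columns exist); the internal DAC node also appears in the output and can be ignored:

  -- CPU nodes and the memory node each maps to, as seen by SQLOS
  SELECT node_id,
         memory_node_id,
         cpu_affinity_mask,
         online_scheduler_count,
         node_state_desc
  FROM sys.dm_os_nodes;

On a SUMA (interleaved) system this typically reports a single node; on a NUMA-aware configuration it reports one row per node.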

  7. AMD Opteron [Diagram: four Opterons linked by Hyper-Transport, with HT2100 and HT1100 I/O hubs] Technically, Opteron is NUMA, but remote node memory latency is low: no negative impact or erratic behavior! For practical purposes it behaves like an SMP system. Local memory latency ~50ns, 1 hop ~100ns, two hops ~150ns? Actual behavior is more complicated because of snooping (cache coherency traffic).

  8. 8-way Opteron System Architecture [Diagram: CPUs 0-7 in an 8-way Hyper-Transport mesh] The Opteron processor (prior to Magny-Cours) has 3 Hyper-Transport links. Note that in the 8-way layout the top and bottom right processors use 2 HT links to connect to other processors and the 3rd HT for IO; CPU 1 & 7 require 3 hops to reach each other.

  9. http://www.techpowerup.com/img/09-08-26/17d.jpg

  10. Nehalem System Architecture Intel Nehalem generation processors have Quick Path Interconnect (QPI): Xeon 5500/5600 series have 2 QPI links, Xeon 7500 series have 4. 8-way glue-less is possible.

  11. NUMA Local and Remote Memory • Local memory is closer than remote • Physical access time is shorter • What is actual access time? • With cache coherency requirement!
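
Whether the remote-node penalty actually bites depends on where SQL Server's memory sits. Per-node allocations are visible directly; a sketch, assuming SQL Server 2008 or later column names:

  -- virtual address space reserved/committed per memory node
  SELECT memory_node_id,
         virtual_address_space_reserved_kb,
         virtual_address_space_committed_kb
  FROM sys.dm_os_memory_nodes;

A heavy skew toward one node (often Node 0) means threads on the other nodes pay the remote-access cost on most references.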

  12. HT Assist – Probe Filter: part of the L3 cache is used as a directory cache (image source: ZDNet)

  13. Source Snoop Coherency From the HP PREMA Architecture whitepaper: all reads result in snoops to all other caches, … the memory controller cannot return the data until it has collected all the snoop responses and is sure that no cache provided a more recent copy of the memory line.

  14. DL980G7 From the HP PREMA Architecture whitepaper: each node controller stores information about* all data in the processor caches, minimizes inter-processor coherency communication, and reduces latency to local memory (*only cache tags, not cache data)

  15. HP ProLiant DL980 Architecture Node controllers reduce effective memory latency

  16. Superdome 2 – Itanium, sx3000 Agent: Remote Ownership Tag + L4 cache tags; 64M eDRAM L4 cache data

  17. IBM x3850 X5 (glue-less) Connects two 4-socket nodes to make an 8-way system

  18. Fujitsu R900 4 IOH; 14 x8 PCI-E slots, 2 x4, 1 x8 internal

  19. OS Memory Models [Diagram residue: a scrambled page-number-to-node grid across eight nodes] SUMA: Sufficiently Uniform Memory Access – memory is interleaved across nodes. NUMA: memory is first interleaved within a node; the memory stripe is then spanned across nodes.

  20. OS Memory Models [Diagram: 32 pages across Nodes 0-3 under the two models – SUMA interleaves consecutive pages round-robin across nodes (pages 0-1 on Node 0, 2-3 on Node 1, …); NUMA keeps each node's memory contiguous (pages 0-7 on Node 0, 8-15 on Node 1, …)] SUMA: Sufficiently Uniform Memory Access – memory is interleaved across nodes. NUMA: memory is first interleaved within a node; the memory stripe is then spanned across nodes.

  21. Windows OS NUMA Support [Diagram: the same page-to-node maps as the previous slide] • Memory models • SUMA – Sufficiently Uniform Memory Access: memory is striped across NUMA nodes • NUMA – separate memory pools by node
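
As an idealized summary of the two maps (my notation, not from the deck): with N nodes, stripe size s, and total memory M, SUMA places address a on node floor(a/s) mod N, so stripes rotate round-robin and a thread's accesses are local only about 1/N of the time; the NUMA model places a on node floor(a/(M/N)), so each node's range is contiguous and the OS can back a thread's allocations with local memory.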

  22. Memory Model Example: 4 Nodes • SUMA Memory Model • memory access uniformly distributed • 25% of memory accesses local, 75% remote • NUMA Memory Model • Goal is better than 25% local node access • True local access time also needs to be faster • Cache coherency may increase local access time

  23. Architecting for NUMA – End to End Affinity [Diagram: regions (Pacific NW, California, Mountain, Texas, Central, South East, Mid Atlantic, North East) mapped end to end: App Server → TCP Port → CPU/NUMA node → Memory → Table] Web tier determines the port for each user by group (but should not be by geography!). Affinitize each port to a NUMA node; each node accesses localized data (partition?). OS may allocate a substantial chunk from Node 0? Port map: North East 1440, Mid Atlantic 1441, South East 1442, Central 1443, Texas 1444, Mountain 1445, California 1446, Pacific NW 1447.

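One concrete way to wire up port-to-node affinity on Windows is soft-NUMA plus port mapping. A sketch, not from the deck: the registry path assumes a SQL Server 2008 instance (version key 100), and the CPU masks are purely illustrative:

  Windows Registry Editor Version 5.00

  ; hypothetical soft-NUMA nodes: CPUs 0-3 -> Node0, CPUs 4-7 -> Node1
  [HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\100\NodeConfiguration\Node0]
  "CPUMask"=dword:0000000f

  [HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\100\NodeConfiguration\Node1]
  "CPUMask"=dword:000000f0

In SQL Server Configuration Manager, the TCP port property then accepts a node bitmask in brackets after each port, e.g. 1440[0x1],1441[0x2], which ties connections arriving on port 1440 to node 0 and port 1441 to node 1.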

  25. HP-UX LORA • HP-UX – not Microsoft Windows • Locality-Optimized Resource Alignment • 12.5% interleaved memory • 87.5% NUMA-node-local memory

  26. System Tech Specs

  Processors          Cores/socket  DIMMs  PCI-E G2      Total cores  Max memory  Base price
  2 x Xeon X56x0      6             18     5 x8+, 1 x4   12           192G*       $7K
  4 x Opteron 6100    12            32     5 x8, 1 x4    48           512G        $14K
  4 x Xeon X7560      8             64     4 x8, 6 x4†   32           1TB         $30K
  8 x Xeon X7560      8             128    9 x8, 5 x4‡   64           2TB         $100K

  Memory pricing: 8GB DIMMs $400 ea (18 x 8G = 144GB, $7,200; 64 x 8G = 512GB, $26K); 16GB DIMMs $1,100 ea (12 x 16G = 192GB, $13K; 64 x 16G = 1TB, $70K).
  * Max memory for 2-way Xeon 5600 is 12 x 16 = 192GB
  † Dell R910 and HP DL580G7 have different PCI-E
  ‡ ProLiant DL980G7 can have 3 IOH for additional PCI-E slots

  27. Software Stack

  28. Operating System • Windows Server 2003 RTM, SP1 • Network limitations (default) • Scalable Networking Pack (KB 912222) • Windows Server 2008 • Windows Server 2008 R2 (64-bit only) • Breaks the 64-logical-processor limit • NUMA IO enhancements? Impacts OLTP (search: MSI-X). Do not bother trying to do DW on a 32-bit OS or 32-bit SQL Server. Don't try to do DW on SQL Server 2000.

  29. SQL Server version • SQL Server 2000 • Serious disk IO limitations (1GB/sec ?) • Problematic parallel execution plans • SQL Server 2005 (fixed most S2K problems) • 64-bit on X64 (Opteron and Xeon) • SP2 – performance improvement 10%(?) • SQL Server 2008 & R2 • Compression, Filtered Indexes, etc • Star join, Parallel query to partitioned table

  30. Configuration • SQL Server startup parameter: -E • Trace flags: 834, 836, 2301 • Auto_Date_Correlation • Order date < A, Ship date > A • Implied: Order date > A-C, Ship date < A+C • Port affinity – mostly OLTP • Dedicated processor? • for the log writer?
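
The startup items above go into the service startup parameters (e.g. -E;-T834;-T836;-T2301, set via Configuration Manager), and Auto_Date_Correlation corresponds to a database option. A verification sketch, assuming SQL Server 2005 or later; the database name is hypothetical:

  -- check which of the slide's trace flags are active globally
  DBCC TRACESTATUS(834, 836, 2301, -1);

  -- the option behind the Auto_Date_Correlation bullet
  ALTER DATABASE MyDatabase
  SET DATE_CORRELATION_OPTIMIZATION ON;

Trace flag 834 (large-page buffer pool allocations) takes effect only at startup, which is why it belongs in the startup parameters rather than in DBCC TRACEON.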
