High Scale OLTP: Lessons Learned from SQLCAT Performance Labs
Presentation Transcript

  1. High Scale OLTP: Lessons Learned from SQLCAT Performance Labs • Ewan Fairweather: Program Manager, Microsoft

  2. Session Objectives and Takeaways • Session Objectives: • Learn about SQL Server capabilities and challenges experienced by some of our extreme OLTP customer scenarios. • Insight into diagnosing and architecting around issues with Tier-1, mission critical workloads. • Key Takeaways • SQL Server can meet the needs of many of the most challenging OLTP scenarios in the world. • There are a number of new challenges when designing for high end OLTP systems.

  3. Laying the foundation and tuning for OLTP

  4. Laying the foundation and tuning for OLTP workloads: • Understand the goals and attributes of the workload • Performance requirements • Machine-born data vs. user-driven solution • Read/write ratio • HA/DR requirements which may have an impact • Apply configuration and best-practices guidance • Database and data file considerations • Transaction log sizing and placement • Configuring the SQL Server tempdb database • Optimizing memory configuration • Be familiar with common performance methodologies, toolsets and common OLTP/scaling performance pain points • Know your environment – understanding the hardware is key

  5. Hardware Setup – Database files • Database files • # should be at least 25% of CPU cores • This alleviates PFS contention – PAGELATCH_UP • There is no significant point of diminishing returns up to 100% of CPU cores • But manageability is an issue... • Though Windows 2008R2 is much easier • TempDb • PFS contention is a larger problem here as it’s an instance-wide resource • Deallocations and allocations, RCSI – version store, triggers, temp tables • # files should be exactly 100% of CPU threads • Presize at 2 x physical memory • Data files and TempDb on same LUNs • It’s all random anyway – don’t sub-optimize • IOPS is a global resource for the machine. Goal is to avoid PAGEIOLATCH on any data file • Key Takeaway: Script it! At this scale, manual work WILL drive you insane
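The tempdb guidance above ("script it") can be sketched as T-SQL. This is a minimal sketch, not the deck's actual script: the 8-thread count, file sizes, and the T:\ path are all assumptions to adjust for your server.

```sql
-- Sketch: one equally sized tempdb data file per CPU thread (8 threads assumed).
-- File paths, sizes, and the thread count are placeholders, not recommendations.
USE master;
ALTER DATABASE tempdb MODIFY FILE (NAME = tempdev,  SIZE = 8GB, FILEGROWTH = 0);
ALTER DATABASE tempdb ADD FILE (NAME = tempdev2, FILENAME = 'T:\tempdb2.ndf', SIZE = 8GB, FILEGROWTH = 0);
ALTER DATABASE tempdb ADD FILE (NAME = tempdev3, FILENAME = 'T:\tempdb3.ndf', SIZE = 8GB, FILEGROWTH = 0);
-- ...repeat through tempdev8, one file per thread; generate these
-- statements in a loop rather than typing them by hand.
```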

  6. Special Consideration: Transaction Log • The transaction log is a set of 127 linked buffers with a max of 32 outstanding I/Os • Each buffer is 60KB • Multiple transactions can fit in one buffer • BUT: a buffer must flush before the log manager can signal a commit OK • Pre-allocate the log file • Use DBCC LOGINFO for existing systems • Example: transaction log throughput was ~80MB/sec • But we consistently got <1ms latency, no spikes! • Initial setup: 2 x HBA on a dedicated storage port on RAID10 with 4+4 • When tuning for peak: SSD on the internal PCI bus (latency: a few µs) • Key Takeaway: For the transaction log, dedicate storage components and optimize for low latency
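Pre-allocating the log as advised above might look like the following sketch; the database name, logical file name, and 64GB size are placeholders, not figures from the lab.

```sql
-- Sketch: pre-grow the log to its working size in one operation instead of
-- relying on many small auto-growth events (which multiply VLFs).
ALTER DATABASE [MyOltpDb]
MODIFY FILE (NAME = MyOltpDb_log, SIZE = 64GB);

-- Then inspect the resulting VLF layout (one row per VLF):
DBCC LOGINFO ('MyOltpDb');
```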

  7. SQL Server Memory Setup • For a large CPU/memory box, Lock Pages in Memory really matters • We saw more than double the performance • Use gpedit.msc to grant it to the SQL Server service account • Consider TF834 (large page allocations) • On Windows 2008R2 previous issues with this TF are fixed • Around 5-10% throughput increase • Increases startup time • Beware of NUMA node memory distribution • Set max memory close to box max if a dedicated box is available
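Setting max server memory "close to box max" can be sketched as below. The ~980GB figure (for a 1TB box, leaving headroom for the OS) is an illustrative assumption, not a number from the slides.

```sql
-- Sketch: cap max server memory just below physical RAM on a dedicated box.
-- 1003520 MB (~980 GB) assumes a 1 TB server; size the OS headroom for your box.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max server memory (MB)', 1003520;
RECONFIGURE;
```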

  8. How we think about tuning • Let the workload access patterns guide you • Observe resource consumption and resource waits • http://sqlcat.com/whitepapers/archive/2007/11/19/sql-server-2005-waits-and-queues.aspx • http://sqlcat.com/whitepapers/archive/2009/04/14/troubleshooting-performance-problems-in-sql-server-2008.aspx • Standard tuning always applies (indexes, TSQL, etc…) • On these systems we always watch for concurrency-related bottlenecks and key components which affect throughput • Locking, latching, spinlocks, log latency, etc. • The focus of tuning depends on the workload; foundation areas can bubble to the top • Focus on the 20% of issues that will give 80% of the optimization • In this talk we will focus on the unique challenges we face with high concurrency and applications requiring low latency

  9. Laying the foundations for OLTP Performance • The hardware plays a big role… It is critical to understand the theoretical capabilities of the system in order to succeed • Understand server architecture (NUMA, PCI layout, etc…) • Nehalem-EX: every socket is a NUMA node • How fast is your interconnect? Sysinternals CoreInfo • Network card tuning is often needed for throughput-intensive workloads • Storage – never go in blind! Knowing only “it’s a SAN” will lead to disaster • Understand and document all components in the path from the server to the disk (HBAs, PCI, network, connectivity on the array, disk configuration, whether the resources are shared, etc…) • Test the storage before running the SQL workload

  10. Upping the Limits • Previously (before 2008R2) Windows was limited to 64 cores • Kernel tuned for this config • With Windows Server 2008R2 this limit is upped to 256 cores (plumbing for 1024 cores) • New concept: Kernel Groups • A bit like NUMA, but an extra layer in the hierarchy • SQL Server generally follows suit – but for now, 256 cores is the limit on R2 • Example x64 machines: HP DL980 (64 cores, 128 with Hyper-Threading), IBM 3950 (up to 256 cores) • And the largest IA-64 is 256 hyperthreads (at 128 cores)

  11. The Path to the Sockets • (Diagram: hardware-to-Windows mapping – each CPU socket holds multiple cores, each core two hyperthreads; the cores map to NUMA nodes 0–31, which Windows organizes into four kernel groups (Kernel Group 0–3) of eight NUMA nodes each)

  12. SQL Server Today: Capabilities and Challenges with real customer workloads

  13. Case Study: Large Healthcare Application • Application: Patient care application (workflow, EMR, etc…) • Performance: • Sustain 9,500 concurrent “application” users with acceptable response time & total CPU utilization – 15,000 planned for March/April 2011 with ultimate goal of 25,000+ • Workload Characteristics: • 6,000-7,000 batches/sec with a Read/Write ratio of about 80/20 • Highly normalized schema, lots of relatively complex queries (heavy on loop joins), heavy use of temporary objects (table valued functions), use of BLOBs, transactional and storage based replication • Hardware/Deployment Configuration (Benchmark): • 24 Application Servers, 12 Load generators (LoadRunner) • Database servers: DL980 and IBM 3950 (2 node single SQL Server failover cluster instance) • 64 Physical cores, 1TB RAM, Hyper-threading is used to get to 128 logical cores

  14. Case Study: Large Healthcare Application (cont.) • Other solution requirements: • Require zero data loss (patient data) • Use synchronous IO SAN-based replication for DR • This means we have to tolerate some transaction log overhead (3-5ms) • Application connections must run with the lowest privileges possible • Application audits all access to patient data • Near real-time reporting required (transactional replication used to “scale out”) • Observation • x64 servers provide >2x per-core processing over previous IA64 CPUs

  15. HealthCare Application - Technical Challenges

  16. NUMA latencies • Sysinternals CoreInfo http://technet.microsoft.com/en-us/sysinternals/cc835722.aspx • Nehalem-EX • Every socket is a NUMA node • How fast is your interconnect?

  17. Log Growth and Virtual Log File Fragmentation • The SQL Server physical transaction log is comprised of Virtual Log Files (VLFs) • Each auto-growth/growth event adds additional VLFs • Frequent auto-growths can introduce a large number of VLFs, which can have a negative effect on log performance due to: • Overhead of the additional VLFs • File system fragmentation • Additional information can be found here • Consider rebuilding the log if you find hundreds or thousands of VLFs • DBCC LOGINFO can be used to report on this; example output (one row per VLF):
FileId  FileSize      StartOffset     FSeqNo  Status  Parity  CreateLSN
2       253952        8192            48141   0       64      0
…
2       427556864     74398826496     0       0       128     22970000047327200649
2       427950080     74826383360     0       0       128     22970000047327200649
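Counting the VLFs that DBCC LOGINFO reports can be scripted; this is a sketch appropriate to the SQL Server 2008 R2 era the deck covers (on SQL Server 2016 SP2 and later, sys.dm_db_log_info is the simpler route). The temp-table column list matches the DBCC LOGINFO output shape shown above.

```sql
-- Sketch: capture DBCC LOGINFO into a temp table and count VLFs for the
-- current database. Hundreds or thousands of rows suggest rebuilding the log.
CREATE TABLE #loginfo (
    FileId      int,
    FileSize    bigint,
    StartOffset bigint,
    FSeqNo      int,
    Status      int,
    Parity      int,
    CreateLSN   numeric(25, 0)
);
INSERT INTO #loginfo EXEC ('DBCC LOGINFO');
SELECT COUNT(*) AS vlf_count FROM #loginfo;
DROP TABLE #loginfo;
```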

  18. Spinlocks • Lightweight synchronization primitives used to protect access to data structures • Used to protect structures in SQL such as lock hash tables (LOCK_HASH), security caches (SOS_CACHESTORE) and more • Used when it is expected that resources will be held for a very short duration • Why not yield? • It would be more expensive to yield and context switch than to spin to acquire the resource • (Diagram: LOCK_HASH – a thread attempting to obtain a lock (row, page, database, etc.) hashes the lock into the lock manager’s hash table; threads accessing the same hash bucket are synchronized)

  19. Spinlock Diagnosis • select * from sys.dm_os_spinlock_stats order by spins desc • These symptoms may indicate spinlock contention: • 1. A high number of spins is reported for a particular spinlock type • AND • 2. The system is experiencing heavy CPU utilization • AND • 3. The system has a high amount of concurrency

  20. Spinlock Diagnosis Walk Through • 1. Much higher CPU with a drop in throughput (at this point many SQL threads are “spinning”) • 2. Confirmed the theory via dm_os_spinlock_stats – observed this type with the highest “spins” & “backoffs”; high backoffs = contention • 3. Extended events capture the “backoff events” over a 1 min interval & provide the code paths of the contention – security-check related • Not a resolution, but we know where to start

  21. Spinlock Walkthrough – Extended Events Script
--Get the type value for any given spinlock type
select map_value, map_key, name
from sys.dm_xe_map_values
where map_value IN ('SOS_CACHESTORE')

--Create the event session that will capture the callstacks to a bucketizer
create event session spin_lock_backoff on server
add event sqlos.spinlock_backoff (action (package0.callstack)
    where type = 144 --SOS_CACHESTORE
)
add target package0.asynchronous_bucketizer (
    set filtering_event_name = 'sqlos.spinlock_backoff',
    source_type = 1, source = 'package0.callstack')
with (MAX_MEMORY = 50MB, MEMORY_PARTITION_MODE = PER_NODE)

--Ensure the session was created
select * from sys.dm_xe_sessions where name = 'spin_lock_backoff'

--Run this section to measure the contention
alter event session spin_lock_backoff on server state = start

--Wait to measure the number of backoffs over a 1 minute period
waitfor delay '00:01:00'

--To view the data
--1. Ensure the sqlservr.pdb is in the same directory as the sqlservr.exe
--2. Enable this trace flag to turn on symbol resolution
DBCC traceon (3656, -1)

--Get the callstacks from the bucketizer target
select event_session_address, target_name, execution_count, cast(target_data as XML)
from sys.dm_xe_session_targets xst
inner join sys.dm_xe_sessions xs on (xst.event_session_address = xs.address)
where xs.name = 'spin_lock_backoff'

--Clean up the session
alter event session spin_lock_backoff on server state = stop
drop event session spin_lock_backoff on server
A complete walkthrough of the technique can be found here: http://sqlcat.com/msdnmirror/archive/2010/05/11/resolving-dtc-related-waits-and-tuning-scalability-of-dtc.aspx

  22. Regeneration of Security Tokens Results in High SOS_CACHESTORE Spins • 1. At random times CPU spikes, then almost all sessions wait on LCK_M_X • 2. Huge increase in the number of spins & backoffs associated with SOS_CACHESTORE • 3. Observation: it is counterintuitive to have high wait times (LCK_M_X) correlate with heavy CPU – this is the symptom, not the cause • 4. Approach: use extended events to profile the code path with the spinlock contention (i.e. where there is a high number of backoffs) • 5. Root cause: regeneration of security tokens exposes contention in code paths for access permission checks • Workaround/problem isolation: run with sysadmin rights • Long-term change required: SQL Server fix

  23. Fully Qualified Calls To Stored Procedures • Developer uses EXEC myproc instead of EXEC dbo.myproc • SQL acquires an exclusive lock LCK_M_X and prepares to compile the procedure; this includes calculating the object ID • dm_exec_requests revealed almost all the sessions were waiting on LCK_M_X to compile a stored procedure • Workaround: make the app user db_owner
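The fix the slide title implies is simply schema-qualifying the call. A sketch with a hypothetical procedure name (`myproc` stands in for any application procedure):

```sql
-- Unqualified: SQL Server probes the caller's default schema before dbo,
-- which is the name-resolution path behind the LCK_M_X compile-lock waits above.
EXEC myproc;

-- Schema-qualified: the object resolves directly, avoiding the contention
-- without granting the app user db_owner.
EXEC dbo.myproc;
```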

  24. Case Study: Point of Sale (POS) System • Application: Point of Sale application supporting sales at 8,000 stores • Performance: • Sustain expected peak load of ~230 business transactions (“checks”) per second • Workload Characteristics: • 230 business transactions = ~50,000 batches/sec • Heavy insert into a few tables, periodic range scans of newly added data • Heavy network utilization due to inserts and use of BLOB data • Hardware/Deployment Configuration: • Custom test harness, 12 Load Generators, 5 Application servers • Database servers: HP DL 785 • 48 Physical cores, 256GB RAM

  25. Case Study: Point of Sale (POS) System (cont.) • Other Solution Requirements: • Mission critical to the business in terms of performance and availability. • Strict uptime requirements. • SQL Server Failover Clustering for local (within datacenter) availability • Storage based replication (EMC SRDF) for disaster recovery • Quick recovery time for failover is a priority. • Observation • Initial tests showed low overall system utilization • Long duration for insert statements • High waits on buffer pages (PAGELATCH_EX/PAGELATCH_SH) • Network bottlenecks once the latch waits were resolved • Recovery times (failure to DB online) after failover under full load were between 45 seconds and 3 minutes for unplanned node failures

  26. POS Benchmark Configuration • 12 x load drivers: 2-proc (quad core) x64, 32+ GB memory (Dell R900s, R805s) • 5 x app servers: HP BL460 blades, 2-proc (quad core) 32-bit, 32 GB memory • Transaction DB server: 1 x DL785, 8P (quad core), 2.3 GHz, 256 GB RAM • Reporting DB server: 1 x DL585, 4P (dual core), 2.6 GHz, 32 GB RAM • Active/active failover cluster • SAN: CX-960 (240 drives, 15K, 300 GB) via Brocade 4900 SAN switch (32 ports active)

  27. Technical Challenges and Architecting for Extreme OLTP

  28. Hot Latches! • We observed very high waits for PAGELATCH_EX • High = more than 1ms; we observed greater than 20 ms • Be careful drawing conclusions just on “averages” • What are we contending on? • Latch – a lightweight semaphore • Locks are logical (transactional consistency) • Latches are physical (memory consistency) • Because rows are small (many fit on a page), multiple threads accessing a single page may compete for one PAGELATCH – even if there is no lock blocking • (Diagram: two sessions INSERT rows destined for the same 8K page; the second waits on EX_LATCH while the first holds it)

  29. Waits & Latches • Dig into details with: • sys.dm_os_wait_stats • sys.dm_os_latch_waits

  30. Waits & Latches – Server Level • sys.dm_os_wait_stats
select *
    , wait_time_ms / waiting_tasks_count [avg_wait_time]
    , signal_wait_time_ms / waiting_tasks_count [avg_signal_wait_time]
from sys.dm_os_wait_stats
where wait_time_ms > 0
  and wait_type like '%PAGELATCH%'
order by wait_time_ms desc

  31. Waits & Latches – Index Level • sys.dm_db_index_operational_stats
/* latch waits */
select top 20 database_id, object_id, index_id, count(partition_number) [num partitions]
    , sum(leaf_insert_count) [leaf_insert_count], sum(leaf_delete_count) [leaf_delete_count]
    , sum(leaf_update_count) [leaf_update_count]
    , sum(singleton_lookup_count) [singleton_lookup_count]
    , sum(range_scan_count) [range_scan_count]
    , sum(page_latch_wait_in_ms) [page_latch_wait_in_ms]
    , sum(page_latch_wait_count) [page_latch_wait_count]
    , sum(page_latch_wait_in_ms) / sum(page_latch_wait_count) [avg_page_latch_wait]
    , sum(tree_page_latch_wait_in_ms) [tree_page_latch_wait_ms]
    , sum(tree_page_latch_wait_count) [tree_page_latch_wait_count]
    , case when (sum(tree_page_latch_wait_count) = 0) then 0
           else sum(tree_page_latch_wait_in_ms) / sum(tree_page_latch_wait_count)
      end [avg_tree_page_latch_wait]
from sys.dm_db_index_operational_stats (null, null, null, null) os
where page_latch_wait_count > 0
group by database_id, object_id, index_id
order by sum(page_latch_wait_in_ms) desc

  32. Hot Latches - Last Page Insert Contention • Most common for indexes which have monotonically increasing key values (i.e. datetime, identity, etc…) • Our scenario • Two tables were insert-heavy, by far receiving the highest number of inserts • Mainly INSERTs; however, a background process reads off ranges of the newly added data • And don’t forget – we have to obtain latches on the non-leaf B-tree pages as well • Page latch waits vs. tree page latch waits (sys.dm_db_index_operational_stats) • (Diagram: a B-tree with tree pages above leaf data pages, in the logical key order of the monotonically increasing index; many threads insert into the end of the range – we call this “last page insert contention”) • Expect: PAGELATCH_EX/SH waits • And this is the observation

  33. How to Solve the INSERT Hotspot • Option #1: Hash partition the table • Based on a “hash” of a column (commonly a modulo) • Creates multiple B-trees (each partition is a B-tree) • Round-robin between the B-trees creates more resources and less contention • Option #2: Do not use a sequential key • Distribute the inserts all over the B-tree • (Diagram – Before: threads inserting into the end of the range contend on the “last page”; After: a hash-partitioned table/index, e.g. ranges 0-1000, 1001-2000, 2001-3000, 3001-4000, with threads inserting into the end of the range within each partition) • Hash partitioning reference: http://sqlcat.com/technicalnotes/archive/2009/09/22/resolving-pagelatch-contention-on-highly-concurrent-insert-workloads-part-1.aspx

  34. Example: Before Hash Partitioning • Latch waits of approximately 36 ms at a baseline of 99 checks/sec

  35. Example: After Hash Partitioning* • Latch waits of approximately 0.6 ms at the highest throughput of 249 checks/sec • *Other optimizations were applied; hash partitioning was responsible for a 2.5x improvement in insert throughput

  36. Table Partitioning Example
--Create the partition function and scheme
CREATE PARTITION FUNCTION [pf_hash16] (tinyint)
AS RANGE LEFT FOR VALUES (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)

CREATE PARTITION SCHEME [ps_hash16]
AS PARTITION [pf_hash16] ALL TO ([ALL_DATA])

--Add the computed column to the existing table (this is an OFFLINE operation
--if done the simple way) - consider using bulk loading techniques to speed it up
ALTER TABLE [dbo].[Transaction]
ADD [HashValue] AS (CONVERT(tinyint, ABS(BINARY_CHECKSUM([uidMessageID]) % 16))) PERSISTED NOT NULL

--Create the index on the new partition scheme
CREATE UNIQUE CLUSTERED INDEX [IX_Transaction_ID]
ON [dbo].[Transaction] ([Transaction_ID], [HashValue])
ON ps_hash16(HashValue)
• Note: Requires application changes • Ensure SELECT/UPDATE/DELETE have appropriate partition elimination

  37. Network Cards – Rule of Thumb • At scale, network traffic will generate a LOT of interrupts for the CPU • These must be handled by CPU Cores • Must distribute packets to cores for processing

  38. Tuning a Single NIC Card – POS system • Enable RSS so that multiple CPUs can process receive indications: http://www.microsoft.com/whdc/device/network/NDIS_RSS.mspx • The next step was to disable the Base Filtering Service in Windows and explicitly enable TCP Chimney Offload • Turning off the Base Filtering Service gave a huge reduction in CPU – may not be suitable for all production environments • Be careful with Chimney Offload as per KB 942861

  39. Before and After Tuning Single NIC • Before any network changes the workload was CPU bound on CPU0 (single 1 Gb/s NIC) • After tuning RSS, disabling the Base Filtering Service and explicitly enabling TCP Chimney Offload, CPU time on CPU0 was reduced; the base CPU for RSS successfully moved from CPU0 to another CPU

  40. To DTC or not to DTC: POS System • COM+ transactional applications are still prevalent today • This results in all database calls enlisting in a DTC transaction – a 45% performance overhead • The scenario in the lab involved two resource managers, MSMQ and SQL • Tuning approaches • Optimize DTC TM configuration (transparent to the app) • Remove DTC transactions (requires app changes) • Utilize System.Transactions, which will only promote to DTC if more than one RM is involved • See Lightweight Transactions: http://msdn.microsoft.com/en-us/magazine/cc163847.aspx#S5

  41. Optimizing DTC Configuration • Default application servers use local TM (MSDTC Coordinator) • Introduces RPC communication between SQL TM and App Server TM • App virtualization layer incurs ‘some’ delay • Configuring application servers to use remote coordinator removes RPC communication • See Mike Ruthruff’s paper on SQLCAT.COM: • http://sqlcat.com/msdnmirror/archive/2010/05/11/resolving-dtc-related-waits-and-tuning-scalability-of-dtc.aspx

  42. Recap: Session Objectives and Takeaways • Session Objectives: • Learn about SQL Server capabilities and challenges experienced by some of our extreme OLTP customer scenarios. • Insight into diagnosing and architecting around issues with Tier-1, mission critical workloads. • Key Takeaways • SQL Server can meet the needs of many of the most challenging OLTP scenarios in the world. • There are a number of new challenges when designing for high end OLTP systems.

  43. Applied Architecture Patterns on the Microsoft Platform

  44. Agenda • Windows Server 2008R2 and SQL Server 2008R2 improvements • Scale architecture • Customer Requirements • Hardware setup • Transaction log essentials • Getting the code right • Application Server Essentials • Database Design • Tuning Data Modification • UPDATE statements • INSERT statements • Management of LOB data • The problem with NUMA and what to do about it • Final results and Thoughts

  45. Top statistics

  46. Upping the Limits • Previously (before 2008R2) Windows was limited to 64 cores • Kernel tuned for this config • With Windows Server 2008R2 this limit is upped to 256 cores (plumbing for 1024 cores) • New concept: Kernel Groups • A bit like NUMA, but an extra layer in the hierarchy • SQL Server generally follows suit – but for now, 256 cores is the limit on R2 • Example x64 machines: HP DL980 (64 cores, 128 with Hyper-Threading), IBM 3950 (up to 256 cores) • And the largest IA-64 is 256 hyperthreads (at 128 cores)

  47. The Path to the Sockets • (Diagram: hardware-to-Windows mapping – each CPU socket holds multiple cores, each core two hyperthreads; the cores map to NUMA nodes 0–31, which Windows organizes into four kernel groups (Kernel Group 0–3) of eight NUMA nodes each)

  48. And we measure it like this • Sysinternals CoreInfo http://technet.microsoft.com/en-us/sysinternals/cc835722.aspx • Nehalem-EX • Every socket is a NUMA node • How fast is your interconnect?