
System Architecture: Big Iron (NUMA)

Joe Chang

[email protected]

www.qdpma.com


About joe chang

About Joe Chang

  • SQL Server Execution Plan Cost Model

  • True cost structure by system architecture

  • Decoding statblob (distribution statistics)

    • SQL Clone – statistics-only database

  • Tools

    • ExecStats – cross-reference index use by SQL-execution plan

    • Performance Monitoring,

    • Profiler/Trace aggregation


Scaling sql on numa topics

Scaling SQL on NUMA Topics

  • OLTP – Thomas Kejser session

    • “Designing High Scale OLTP Systems”

  • Data Warehouse

  • Ongoing Database Development

  • Bulk Load – SQL CAT paper + TK session

    • “The Data Loading Performance Guide”

Other Sessions with common coverage:

Monitoring and Tuning Parallel Query Execution II, R Meyyappan

(SQLBits 6) Inside the SQL Server Query Optimizer, Conor Cunningham

Notes from the field: High Performance Storage, John Langford

SQL Server Storage – 1000GB Level, Brent Ozar


Server systems and architecture

Server Systems and Architecture


Symmetric multi processing

Symmetric Multi-Processing

[Diagram: four CPUs on a shared system bus to the MCH (memory controller hub), with two PXH PCI-X hubs and an ICH I/O hub]

In SMP (unlike ASMP), processors are not dedicated to specific tasks; a single OS image runs across all processors, and each processor can access all memory.

SMP makes no reference to memory architecture?

Not to be confused with Simultaneous Multi-Threading (SMT)

Intel calls SMT Hyper-Threading (HT), which is not to be confused with AMD Hyper-Transport (also HT)


Non uniform memory access

Non-Uniform Memory Access

[Diagram: four nodes, each with four CPUs and a memory controller, joined by node controllers over a shared bus or crossbar]

  • NUMA Architecture - Path to memory is not uniform

  • Node: Processors, Memory, Separate or combined Memory + Node Controllers

  • Nodes connected by shared bus, cross-bar, ring

  • Traditionally, 8-way+ systems

Local memory latency ~150ns, remote node memory ~300-400ns, can cause erratic behavior if OS/code is not NUMA aware


Amd opteron

AMD Opteron

[Diagram: four Opteron sockets connected by HyperTransport, with HT2100 and HT1100 I/O bridge chips]

Technically, Opteron is NUMA, but remote node memory latency is low, so there is no negative impact or erratic behavior. For practical purposes it behaves like an SMP system.

Local memory latency ~50ns, 1 hop ~100ns, 2 hops ~150ns?

Actual: more complicated because of snooping (cache coherency traffic)


8 way opteron sys architecture

8-way Opteron System Architecture

[Diagram: eight Opteron sockets (CPU 0-7) linked by HyperTransport]

Opteron processors prior to Magny-Cours have 3 HyperTransport links. In this 8-way layout, the top and bottom-right processors use 2 HT links to connect to other processors and the 3rd HT link for IO; CPUs 1 and 7 are 3 hops away from each other.


System architecture big iron numa

http://www.techpowerup.com/img/09-08-26/17d.jpg


Nehalem system architecture

Nehalem System Architecture

Intel Nehalem-generation processors have Quick Path Interconnect (QPI): the Xeon 5500/5600 series has 2 QPI links, the Xeon 7500 series has 4.

8-way Glue-less is possible


Numa local and remote memory

NUMA Local and Remote Memory

  • Local memory is closer than remote

  • Physical access time is shorter

  • What is actual access time?

    • With cache coherency requirement!


Ht assist probe filter

HT Assist – Probe Filter

Part of the L3 cache is used as a directory cache

ZDNET


Source snoop coherency

Source Snoop Coherency

From HP PREMA Architecture whitepaper:

All reads result in snoops to all other caches, …

Memory controller cannot return the data until it has collected all the snoop responses and is sure that no cache provided a more recent copy of the memory line


Dl980g7

DL980G7

From HP PREMA Architecture whitepaper:

Each node controller stores information about* all data in the processor caches, which minimizes inter-processor coherency communication and reduces latency to local memory

(*only cache tags, not cache data)


Hp proliant dl980 architecture

HP ProLiant DL980 Architecture

Node controllers reduce effective memory latency


Superdome 2 itanium sx3000

Superdome 2 – Itanium, sx3000

Agent – Remote Ownership Tag

+ L4 cache tags

64M eDRAM L4 cache data


Ibm x3850 x5 glue less

IBM x3850 X5 (Glue-less)

Connect two 4-socket Nodes to make 8-way system


Os memory models

OS Memory Models

[Diagram: 32 logical processors across Node 0-3, shown under two memory layouts - memory contiguous per node versus memory interleaved across nodes]

SUMA: Sufficiently Uniform Memory Access - memory is interleaved across nodes

NUMA: memory is first interleaved within a node; the memory stripe is then spanned across nodes


Windows os numa support

Windows OS NUMA Support

[Diagram: the same 32 logical processors across Node 0-3, contrasting the SUMA interleaved layout with per-node NUMA memory pools]

  • Memory models

    • SUMA – Sufficiently Uniform Memory Access

    • NUMA – separate memory pools by Node

Memory is striped across NUMA nodes
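
As a quick check of which model a given server ended up with, a minimal sketch using the standard DMVs (SQL Server 2008 and later); if the hardware or OS presents interleaved (SUMA-style) memory, only a single memory node is reported:

    -- How SQL Server sees the NUMA layout: one row per scheduler node.
    SELECT node_id, node_state_desc, memory_node_id, online_scheduler_count
    FROM sys.dm_os_nodes
    WHERE node_state_desc NOT LIKE '%DAC%';

    -- Memory nodes and the address space reserved on each.
    SELECT memory_node_id, virtual_address_space_reserved_kb
    FROM sys.dm_os_memory_nodes;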


Memory model example 4 nodes

Memory Model Example: 4 Nodes

  • SUMA Memory Model

    • memory access uniformly distributed

    • 25% of memory accesses local, 75% remote

  • NUMA Memory Model

    • Goal is better than 25% local node access

    • True local access time also needs to be faster

    • Cache coherency may increase local access time


Architecting for numa

Architecting for NUMA

End to End Affinity

[Diagram: eight geographic user groups (PNW, Cal, Mnt, Tex, Cen, MidA, SE, NE) mapped end to end - app server to TCP port to CPU/NUMA node (Node 0-7) to memory to table]

The web tier determines the port for each user by group (but should not be by geography!). Affinitize each port to a NUMA node, so each node accesses localized data (a partition?). The OS may allocate a substantial chunk of memory from Node 0?

[Diagram detail: TCP ports 1440-1447, one per region - North East, Mid Atlantic, South East, Central, Texas, Mountain, California, Pacific NW]
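
A hedged sketch of the port-to-node mapping this implies. The port numbers and node masks below are illustrative; the bracketed-mask form is the documented mechanism for mapping TCP ports to NUMA nodes (see the Books Online topic "How to: Map TCP/IP Ports to NUMA Nodes"):

    -- In SQL Server Configuration Manager, TCP/IP properties, IP Addresses tab,
    -- the TCP Port value can carry a NUMA node affinity mask in brackets:
    --
    --   TCP Port:  1440[0x1],1441[0x2],1442[0x4],1443[0x8]
    --
    -- Connections to port 1440 are handled by node 0, 1441 by node 1, and so on.
    -- The app tier then picks the port per user group, so each group's work stays
    -- on one node and, ideally, on that node's locally allocated data.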



Hp ux lora

HP-UX LORA

  • HP-UX – Not Microsoft Windows

  • Locality-Optimized Resource Alignment

    • 12.5% Interleaved Memory

    • 87.5% NUMA node Local Memory


System tech specs

System Tech Specs

Processors        Cores  DIMM  PCI-E G2      Total Cores  Max memory  Base
2 x Xeon X56x0      6     18   5 x8+, 1 x4       12          192G*      $7K
4 x Opteron 6100   12     32   5 x8, 1 x4        48          512G      $14K
4 x Xeon X7560      8     64   4 x8, 6 x4†       32          1TB       $30K
8 x Xeon X7560      8    128   9 x8, 5 x4‡       64          2TB      $100K

8GB DIMMs ~$400 each: 18 x 8GB = 144GB ($7,200); 64 x 8GB = 512GB (~$26K)

16GB DIMMs ~$1,100 each: 12 x 16GB = 192GB (~$13K); 64 x 16GB = 1TB (~$70K)

* Max memory for 2-way Xeon 5600 is 12 x 16GB = 192GB

† Dell R910 and HP DL580G7 have different PCI-E configurations

‡ ProLiant DL980G7 can have 3 IOHs for additional PCI-E slots


Software stack

Software Stack


Operating system

Operating System

  • Windows Server 2003 RTM, SP1

    • Network limitations (default)

    • Scalable Networking Pack (912222)

  • Windows Server 2008

  • Windows Server 2008 R2 (64-bit only)

    • Breaks 64 logical processor limit

    • NUMA IO enhancements?

Impacts OLTP

Search: MSI-X

Do not bother trying to do DW on 32-bit OS or 32-bit SQL Server

Don’t try to do DW on SQL Server 2000


Sql server version

SQL Server version

  • SQL Server 2000

    • Serious disk IO limitations (1GB/sec ?)

    • Problematic parallel execution plans

  • SQL Server 2005 (fixed most S2K problems)

    • 64-bit on X64 (Opteron and Xeon)

    • SP2 – performance improvement 10%(?)

  • SQL Server 2008 & R2

    • Compression, Filtered Indexes, etc

    • Star join, Parallel query to partitioned table


Configuration

Configuration

  • SQL Server Startup Parameter: -E (startup options sketched below)

    • Trace Flags 834, 836, 2301

  • Auto_Date_Correlation

    • Order date < A, Ship date > A

      • Implied: Order date > A-C, Ship date < A+C

  • Port Affinity – mostly OLTP

  • Dedicated processor ?

    • for log writer ?
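
A hedged sketch of how these settings are typically applied; the flag descriptions follow the slide and common usage, and whether -E or any trace flag actually helps depends on the workload, so treat the values as illustrative:

    -- Startup parameters (added in SQL Server Configuration Manager):
    --   -E       allocate larger extent runs per file (scan-friendly for DW)
    --   -T834    large-page buffer pool allocations (64-bit, locked pages)
    --   -T836    per the slide; buffer pool sizing behavior at startup
    --   -T2301   advanced decision-support optimizations
    -- Trace flags can also be enabled for testing from a session:
    DBCC TRACEON (2301, -1);

    -- Date correlation between Orders and LineItem-style date columns
    -- ('TPCH' is a hypothetical database name):
    ALTER DATABASE TPCH SET DATE_CORRELATION_OPTIMIZATION ON;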


Storage performance for data warehousing

Storage Performance for Data Warehousing

Joe Chang

[email protected]

www.qdpma.com


About joe chang1

About Joe Chang

  • SQL Server Execution Plan Cost Model

  • True cost structure by system architecture

  • Decoding statblob (distribution statistics)

    • SQL Clone – statistics-only database

  • Tools

    • ExecStats – cross-reference index use by SQL-execution plan

    • Performance Monitoring,

    • Profiler/Trace aggregation


Storage

Storage


Organization structure

Organization Structure

  • In many large IT departments

    • DB and Storage are in separate groups

  • Storage usually has own objectives

    • Bring all storage into one big system under full management (read: control)

  • Storage as a Service, in the Cloud

    • One size fits all needs

  • Usually have zero DB knowledge

Of course we do high bandwidth, 600MB/sec good enough for you?


Data warehouse storage

Data Warehouse Storage

  • OLTP – Throughput with Fast Response

  • DW – Flood the queues for maximum throughput

Do not use shared storage for data warehouse!

Storage system vendors like to give the impression the SAN is a magical, immensely powerful box that can meet all your needs.

Just tell us how much capacity you need and don’t worry about anything else.

My advice: stay away from shared storage, controlled by different team.


Nominal and net bandwidth

Nominal and Net Bandwidth

  • PCI-E Gen 2 – 5 Gbit/sec signaling

    • x8 = 5GB/s, net BW 4GB/s, x4 = 2GB/s net

  • SAS 6Gbit/s – 6 Gbit/s

    • x4 port: 3GB/s nominal, 2.2GB/sec net?

  • Fibre Channel 8 Gbit/s nominal

    • 780MB/s point-to-point,

    • 680MB/s from host to SAN to back-end loop

  • SAS RAID Controller, x8 PCI-E G2, 2 x4 6G

    • 2.8GB/s

Depends on the controller, will change!
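
As a rough worked example of where these net figures come from (all of these links use 8b/10b encoding, so usable bandwidth is about 80% of the signaling rate, before protocol overhead):

    PCI-E Gen 2 x8:  8 lanes x 5 Gbit/s = 40 Gbit/s signaling (~5 GB/s nominal);
                     8b/10b leaves 32 Gbit/s = 4 GB/s net.
    SAS 6G x4 port:  4 x 6 Gbit/s = 24 Gbit/s (~3 GB/s nominal);
                     8b/10b leaves ~2.4 GB/s, ~2.2 GB/s seen in practice.
    8 Gbit/s FC:     8.5 GBaud signaling, 8b/10b leaves ~800 MB/s per direction;
                     ~780 MB/s point-to-point, ~680 MB/s host-to-SAN-to-loop.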


Storage sas direct attach

Storage – SAS Direct-Attach

[Diagram: multiple RAID controllers in PCI-E x8 slots, each with two x4 SAS ports fanning out to disk enclosures, plus dual 10GbE adapters in PCI-E x4 slots]

Many Fat Pipes

Very Many Disks

Option A:

24-disks in one enclosure for each x4 SAS port. Two x4 SAS ports per controller

Option B: Split enclosure over 2 x4 SAS ports, 1 controller

Balance by pipe bandwidth

Don’t forget fat network pipes


Storage fc san

Storage – FC/SAN

[Diagram: multiple quad-port and dual-port 8Gb FC HBAs in PCI-E x8 and x4 slots, each port running to the SAN, plus dual 10GbE adapters in PCI-E x4 slots]

PCI-E x8 Gen 2 Slot with quad-port 8Gb FC

If 8Gb quad-port is not supported, consider system with many x4 slots, or consider SAS!

SAN systems typically offer 3.5in 15-disk enclosures. Difficult to get high spindle count with density.

1-2 15-disk enclosures per 8Gb FC port, 20-30MB/s per disk?


Storage ssd hdd hybrid

Storage – SSD / HDD Hybrid

[Diagram: SAS controllers in PCI-E x8 slots driving enclosures that mix SSDs and HDDs, one RAID controller in a PCI-E x4 slot for logs, plus dual 10GbE adapters]

No RAID w/SSD?

Storage enclosures typically 12 disks per channel. Can only support bandwidth of a few SSD. Use remaining bays for extra storage with HDD. No point expending valuable SSD space for backups and flat files

Log:

Single DB – HDD, unless rollbacks or T-log backups disrupt log writes.

Multi DB – SSD, otherwise too many RAID1 pairs are needed for logs


System architecture big iron numa

SSD

  • Current: mostly 3Gbps SAS/SATA SSD

    • Some 6Gbps SATA SSD

  • Fusion IO – direct PCI-E Gen2 interface

    • 320GB-1.2TB capacity, 200K IOPS, 1.5GB/s

  • No RAID ?

    • HDD is fundamentally a single point failure

    • SSD could be built with redundant components

  • HP report problems with SSD on RAID controllers, Fujitsu did not?


Big dw storage iscsi

Big DW Storage – iSCSI

  • Are you nuts?


Storage configuration arrays

Storage Configuration - Arrays

Shown:

two 12-disk Arrays per 24-disk enclosure

Options: between 6-16 disks per array

SAN systems may recommend R10 4+4 or R5 7+1

Very Many Spindles

Comment on Meta LUN


Data consumption rate xeon

Data Consumption Rate: Xeon

Processors     GHz   Total Cores  Mem GB  SQL    Q1 sec   SF    Total MB/s  MB/s per core
2 Xeon 5355    2.66       8          64   5sp2     85.4    100    1,165.5      145.7   (Conroe)
2 Xeon 5570    2.93       8         144   8sp1     42.2    100    2,073.5      259.2   (Nehalem)
2 Xeon 5680    3.33      12         192   8r2      21.0    100    4,166.7      347.2   (Westmere)
4 Xeon 7560    2.26      32         640   8r2      37.2    300    7,056.5      220.5   (Nehalem-EX)
8 Xeon 7560    2.26      64         512   8r2     183.8   3000   14,282        223.2   (Nehalem-EX)

TPC-H Query 1 Lineitem scan, SF1 1GB, 2k8 875M

Data consumption rate is much higher for current generation Nehalem and Westmere processors than Core 2 referenced in Microsoft FTDW document. TPC-H Q1 is more compute intensive than the FTDW light query.
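
For reference, the Total MB/s column is just the Line Item data size divided by the Query 1 time (about 875MB per unit of scale factor on SQL Server 2008, per the note above), and MB/s per core divides that by the core count; for the 2 Xeon 5570 row:

    Total MB/s    = 100 (SF) x 875 MB / 42.2 sec  ≈ 2,073 MB/s
    MB/s per core = 2,073 MB/s / 8 cores          ≈ 259 MB/s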


Data consumption rate opteron

Data Consumption Rate: Opteron

TPC-H Query 1 Lineitem scan, SF1 1GB, 2k8 875M

Processors    GHz   Total Cores  Mem GB  SQL    Q1 sec   SF    Total MB/s  MB/s per core
4 Opt 8220    2.8        8         128   5rtm    309.7    300      868.7      121.1
8 Opt 8360    2.5       32         256   8rtm     91.4    300    2,872.0       89.7   (Barcelona)
8 Opt 8384    2.7       32         256   8rtm     72.5    300    3,620.7      113.2   (Shanghai)
8 Opt 8439    2.8       48         512   8rtm    166.9   1000    5,242.7      109.2   (Istanbul)
8 Opt 8439    2.8       48         256   8sp1     49.0    300    5,357.1      111.6   (Istanbul)
2 Opt 6176    2.3       24         192   8r2      20.2    100    4,331.7      180.5   (Magny-Cours)
4 Opt 6176    2.3       48         512   8r2      31.8    300    8,254.7      172.0   (Magny-Cours)

Istanbul was expected to have better performance per core than Shanghai due to HT Assist. Magny-Cours has much better performance per core (at 2.3GHz versus 2.8GHz for Istanbul) - or is this due to Win/SQL 2K8 R2?


Data consumption rate

Data Consumption Rate

TPC-H Query 1 Lineitem scan, SF1 1GB, 2k8 875M

Processors     GHz   Total Cores  Mem GB  SQL    Q1 sec   SF    Total MB/s  MB/s per core
2 Xeon 5355    2.66      8          64    5sp2     85.4    100    1,165.5      145.7
2 Xeon 5570    2.93      8         144    8sp1     42.2    100    2,073.5      259.2
2 Xeon 5680    3.33     12         192    8r2      21.0    100    4,166.7      347.2
2 Opt 6176     2.3      24         192    8r2      20.2    100    4,331.7      180.5
4 Opt 8220     2.8       8         128    5rtm    309.7    300      868.7      121.1
8 Opt 8360     2.5      32         256    8rtm     91.4    300    2,872.0       89.7   (Barcelona)
8 Opt 8384     2.7      32         256    8rtm     72.5    300    3,620.7      113.2   (Shanghai)
8 Opt 8439     2.8      48         256    8sp1     49.0    300    5,357.1      111.6   (Istanbul)
4 Opt 6176     2.3      48         512    8r2      31.8    300    8,254.7      172.0   (Magny-Cours)
8 Xeon 7560    2.26     64         512    8r2     183.8   3000   14,282        223.2


Storage targets

Storage Targets

Processors      BW/Core  Total Cores  Target MB/s  PCI-E x8 - x4  SAS HBA  Storage Units - Disks  Actual Bandwidth  Storage Units - Disks
2 Xeon X5680       350        12          4,200        5 - 1         2           2 - 48                5 GB/s             4 - 96
4 Opt 6176         175        48          8,400        5 - 1         4           4 - 96               10 GB/s             8 - 192
4 Xeon X7560       250        32          8,000        6 - 4         6           6 - 144              15 GB/s            12 - 288
8 Xeon X7560       225        64         14,400        9 - 5        11†         10 - 240              26 GB/s            20 - 480

2U disk enclosure

24 x 73GB

15K 2.5in disks

$14K, $600 per disk

† 8-way :

9 controllers in x8 slots, 24 disks per x4 SAS port

2 controllers in x4 slots, 12 disk

24 15K disks per enclosure. 12 disks per x4 SAS port requires 100MB/sec per disk - possible but not always practical. 24 disks per x4 SAS port requires 50MB/sec - more achievable in practice.

Think: Shortest path to metal (iron-oxide)


Your storage and the optimizer

Your Storage and the Optimizer

Model      Disks   BW (KB/s)   Sequential IOPS   "Random" IOPS   Seq/Rand IO ratio
Optimizer    -        10,800         1,350               320          4.22
SAS 2x4     24     2,800,000       350,000             9,600         36.5
SAS 2x4     48     2,800,000       350,000            19,200         18.2
FC 4G       30       360,000        45,000            12,000          3.75
SSD          8     2,800,000       350,000           280,000          1.25

Assumptions

2.8GB/sec per SAS 2 x4 Adapter, Could be 3.2GB/sec per PCI-E G2 x8

HDD 400 IOPS per disk – Big query key lookup, loop join at high queue, and short-stroked, possible skip-seek. SSD 35,000 IOPS

The SQL Server query optimizer makes key lookup versus table scan decisions based on a 4.22 sequential-to-random IO ratio.

A DW-configured storage system has an 18-36 ratio; 30 disks per 4G FC roughly matches the optimizer's assumption; SSD is skewed in the other direction.


System architecture big iron numa

Data Consumption Rates

TPC-H SF100

Query 1, 9, 13, 21

TPC-H SF300

Query 1, 9, 13, 21


Fast track reference architecture

Fast Track Reference Architecture

My Complaints

  • Several Expensive SAN systems (11 disks)

    • Each must be configured independently

    • $1,500-2,000 amortized per disk

  • Too many 2-disk Arrays

    • 2 LUN per Array, too many data files

  • Build Indexes with MAXDOP 1

    • Is this brain dead?

  • Designed around 100MB/sec per disk

    • Not all DW is single scan, or sequential

Scripting?


Fragmentation

Fragmentation

[Diagram: storage stack - Table, File, Partition, LUN, Disk]

Weak Storage System

1) Fragmentation could degrade IO performance,

2) Defragmenting very large table on a weak storage system could render the database marginally to completely non-functional for a very long time.

Powerful Storage System

3) Fragmentation has very little impact.

4) Defragmenting has mild impact, and completes within night time window.

What is the correct conclusion?

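
One way to ground that question is to measure fragmentation before acting on it; a minimal sketch using the standard DMV (the table name is hypothetical, and SAMPLED mode avoids reading every leaf page):

    -- Logical fragmentation and page density for one large table (SQL 2005+).
    SELECT index_id, index_type_desc, avg_fragmentation_in_percent,
           avg_page_space_used_in_percent, page_count
    FROM sys.dm_db_index_physical_stats(
           DB_ID(), OBJECT_ID('dbo.LINEITEM'), NULL, NULL, 'SAMPLED');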


Operating system view of storage

Operating System View of Storage


Operating system disk view

Operating System Disk View

[Diagram: Disk Management view - Disks 2-7, each Basic, 396GB, Online, spread across Controller 1-3, Ports 0-1]

Additional disks not shown, Disk 0 is boot drive, 1 – install source?


File layout

File Layout

[Diagram: Disks 2-7 each carved into partitions; each partition holds one file from each file group - the big-table file group, the small file group, tempdb, and backup/load - so every file group is spread across all data disks]

Each File Group is distributed across all data disks

Log disks not shown; tempdb shares a common pool with data


File groups and files

File Groups and Files

  • Dedicated File Group for largest table

    • Never defragment

  • One file group for all other regular tables

  • Load file group?

    • Rebuild indexes to different file group
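
A minimal sketch of this layout; database, file, filegroup, and path names are hypothetical, and a real deployment would spread more files over more LUNs:

    -- Dedicated filegroup for the largest table, spread over several LUNs.
    ALTER DATABASE TPCH ADD FILEGROUP FG_LINEITEM;
    ALTER DATABASE TPCH ADD FILE
        (NAME = lineitem_1, FILENAME = 'E:\sql\lineitem_1.ndf', SIZE = 50GB),
        (NAME = lineitem_2, FILENAME = 'F:\sql\lineitem_2.ndf', SIZE = 50GB)
    TO FILEGROUP FG_LINEITEM;

    -- One filegroup for all other regular tables.
    ALTER DATABASE TPCH ADD FILEGROUP FG_OTHER;
    ALTER DATABASE TPCH ADD FILE
        (NAME = other_1, FILENAME = 'E:\sql\other_1.ndf', SIZE = 10GB)
    TO FILEGROUP FG_OTHER;

    -- Place the big table on its dedicated filegroup via its clustered index
    -- (assumes the table does not already have a clustered index).
    CREATE CLUSTERED INDEX cx_lineitem
    ON dbo.LINEITEM (l_orderkey, l_linenumber) ON FG_LINEITEM;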


Partitioning pitfalls

Partitioning - Pitfalls

[Diagram: Table Partitions 1-6 each mapped to its own File Group (1-6), each file group on a different disk (Disks 2-7)]

Common Partitioning Strategy

Partition Scheme maps partitions to File Groups

What happens in a table scan?

Read first from Part 1

then 2, then 3, … ?

SQL 2008 HF to read from each partition in parallel?

What if partitions have disparate sizes?
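
A hedged sketch of the partition-function / partition-scheme mapping being described; the boundary dates, filegroup names, and column are illustrative:

    -- Range partitioning by order date, one filegroup per partition.
    CREATE PARTITION FUNCTION pf_orders_by_year (date)
    AS RANGE RIGHT FOR VALUES ('1993-01-01','1994-01-01','1995-01-01',
                               '1996-01-01','1997-01-01');

    -- 5 boundary values produce 6 partitions, mapped to File Groups 1-6.
    CREATE PARTITION SCHEME ps_orders_by_year
    AS PARTITION pf_orders_by_year
    TO (FG1, FG2, FG3, FG4, FG5, FG6);

    CREATE CLUSTERED INDEX cx_orders
    ON dbo.ORDERS (o_orderdate) ON ps_orders_by_year (o_orderdate);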


Parallel execution plans

Parallel Execution Plans

Joe Chang

[email protected]

www.qdpma.com


About joe chang2

About Joe Chang

  • SQL Server Execution Plan Cost Model

  • True cost structure by system architecture

  • Decoding statblob (distribution statistics)

    • SQL Clone – statistics-only database

  • Tools

    • ExecStats – cross-reference index use by SQL-execution plan

    • Performance Monitoring,

    • Profiler/Trace aggregation


So you bought a 64 core box

So you bought a 64+ core box

Now

  • Learn all about Parallel Execution

    • All guns (cores) blazing

    • Negative scaling

    • Super-scaling

    • High degree of parallelism & small SQL

    • Anomalies, execution plan changes etc

  • Compression

  • Partitioning

Yes, this can happen, how will you know

No I have not been smoking pot

How much in CPU do I pay for this?

Great management tool, what else?


Parallel execution plans1

Parallel Execution Plans

Reference: Adam Machanic PASS


Execution plan quickie

Execution Plan Quickie

I/O and CPU Cost components

F4

Estimated Execution Plan

Cost is duration in seconds on some reference platform

IO Cost for scan: 1 = 10,800KB/s, 810 implies 8,748,000KB

IO in Nested Loops Join: 1 = 320/s, multiple of 0.003125


Index key lookup scan

Index + Key Lookup - Scan

Actual CPUTime (Data in memory)

LU19191919

Scan87368727

(926.67- 323655 * 0.0001581) / 0.003125 = 280160 (86.6%)

True cross-over approx 1,400,000 rows

1 row : page

1,093,729 pages/1350 = 810.17 (8,748MB)


Index key lookup scan1

Index + Key Lookup - Scan

Actual CPUTime

LU 2138321

Scan18622658

(817- 280326 * 0.0001581) / 0.003125 = 247259 (88%)

8748000KB/8/1350 = 810


Actual execution plan

Actual Execution Plan

Estimated

Actual

Note Actual Number

of Rows, Rebinds,

Rewinds

Estimated

Actual


Row count and executions

Row Count and Executions

Outer

Inner

Source

For Loop Join inner source and Key Lookup,

Actual Num Rows = Num of Exec × Num of Rows


Parallel plans

Parallel Plans


Parallelism operations

Parallelism Operations

  • Distribute Streams

    • Non-parallel source, parallel destination

  • Repartition Streams

    • Parallel source and destination

  • Gather Streams

    • Destination is non-parallel


Parallel execution plans2

Parallel Execution Plans

Note: gold circle with double arrow, and parallelism operations


Parallel scan and index seek

Parallel Scan (and Index Seek)

2X

IO Cost same

CPU cost is reduced in proportion to the degree of parallelism, except there is no further reduction at DOP 16

DOP 1

DOP 2

8X

4X

IO contributes most of cost!

DOP 8

DOP 4


Parallel scan 2

Parallel Scan 2

DOP 16


Hash match aggregate

Hash Match Aggregate

CPU cost only reduces by 2X


Parallel scan

Parallel Scan

  • IO Cost is the same

  • CPU cost reduced in proportion to degree of parallelism, last 2X excluded?

On a weak storage system, a single thread can saturate the IO channel,

Additional threads will not increase IO (reduce IO duration).

A very powerful storage system can provide IO proportional to the number of threads. It might be nice if this was optimizer option?

The IO component can be a very large portion of the overall plan cost

Not reducing IO cost in parallel plan may inhibit generating favorable plan,

i.e., not sufficient to offset the contribution from the Parallelism operations.

A parallel execution plan is more likely on larger systems (-P to fake it?)


Actual execution plan parallel

Actual Execution Plan - Parallel


More parallel plan details

More Parallel Plan Details


Parallel plan actual

Parallel Plan - Actual


Parallelism hash joins

Parallelism – Hash Joins


Hash join cost

Hash Join Cost

DOP 4

DOP 1

DOP 2

Search: Understanding Hash Joins

For In-memory, Grace, Recursive

DOP 8


Hash join cost1

Hash Join Cost

CPU Cost is linear with number of rows, outer and inner source

See BOL on Hash Joins for In-Memory, Grace, Recursive

IO Cost is zero for small intermediate data size,

beyond set point proportional to server memory(?)

IO is proportional to excess data (beyond in-memory limit)

Parallel Plan: Memory allocation is per thread!

Summary: Hash Join plan cost depends on memory if the IO component is not zero, in which case it is disproportionately lower with parallel plans. Does not reflect real cost?


Parallelism repartition streams

Parallelism Repartition Streams

DOP 8

DOP 2

DOP 4


Bitmap

Bitmap

BOL: Optimizing Data Warehouse Query Performance Through Bitmap Filtering

A bitmap filter uses a compact representation of a set of values from a table in one part of the operator tree to filter rows from a second table in another part of the tree. Essentially, the filter performs a semi-join reduction; that is, only the rows in the second table that qualify for the join to the first table are processed.

SQL Server uses the Bitmap operator to implement bitmap filtering in parallel query plans. Bitmap filtering speeds up query execution by eliminating rows with key values that cannot produce any join records before passing rows through another operator such as the Parallelism operator. A bitmap filter uses a compact representation of a set of values from a table in one part of the operator tree to filter rows from a second table in another part of the tree. By removing unnecessary rows early in the query, subsequent operators have fewer rows to work with, and the overall performance of the query improves. The optimizer determines when a bitmap is selective enough to be useful and in which operators to apply the filter. For more information, see Optimizing Data Warehouse Query Performance Through Bitmap Filtering.


Parallel execution plan summary

Parallel Execution Plan Summary

  • Queries with high IO cost may show little plan cost reduction on parallel execution

  • Plans with high portion hash or sort cost show large parallel plan cost reduction

  • Parallel plans may be inhibited by high row count in Parallelism Repartition Streams

  • Watch out for (Parallel) Merge Joins!


Scaling theory

Scaling Theory


Parallel execution strategy

Parallel Execution Strategy

  • Partition work into little pieces

    • Ensures each thread has same amount

    • High overhead to coordinate

  • Partition into big pieces

    • May have uneven distribution between threads

    • Small table join to big table

      • Thread for each row from small table

  • Partitioned table options


What should scale

What Should Scale?

3

2

2

Trivially parallelizable:

1) Split large chunk of work among threads,

2) Each thread works independently,

3) Small amount of coordination to consolidate threads


More difficult

More Difficult?

4

3

2

3

2

Parallelizable:

1) Split large chunk of work among threads,

2) Each thread works on first stage

3) Large coordination effort between threads

4) More work

Consolidate


Partitioned tables

Partitioned Tables

No Repartition Streams

Regular Table

Partitioned Tables

No Repartition Streams operations!


Scaling reality

Scaling Reality

8-way Quad-Core Opteron

Windows Server 2008 R2

SQL Server 2008 SP1 + HF 27


Test queries

Test Queries

  • TPC-H SF 10 database

    • Standard, Compressed, Partitioned (30)

  • Line Item Table SUM, 59M rows, 8.75GB

  • Orders Table 15M rows


Cpu sec

CPU-sec

Standard

CPU-sec to SUM 1 or 2 columns in Line Item

Compressed


Speed up

Speed Up

Standard

Compressed


Line item sum 1 column

Line Item sum 1 column

CPU-sec

Speed up relative to DOP 1


Line item sum w group by

Line Item Sum w/Group By

CPU-sec

Speedup


Hash join

Hash Join

CPU-sec

Speedup


Key lookup and table scan

Key Lookup and Table Scan

CPU-sec

1.4M rows

Speedup


Parallel execution summary

Parallel Execution Summary

  • Contention in queries w/low cost per page

    • Simple scan,

  • High Cost per Page – improves scaling!

    • Multiple Aggregates, Hash Join, Compression

  • Table Partitioning –

    • alternative query plans

  • Loop Joins – broken at high DOP

  • Merge Join – seriously broken (parallel)


Scaling dw summary

Scaling DW Summary

  • Massive IO bandwidth

  • Parallel options for data load, updates etc

  • Investigate Parallel Execution Plans

    • Scaling from DOP 1, 2, 4, 8, 16, 32 etc

    • Scaling with and w/o HT

  • Strategy for limiting DOP with multiple users


Fixes from microsoft needed

Fixes from Microsoft Needed

  • Contention issues in parallel execution

    • Table scan, Nested Loops

  • Better plan cost model for scaling

    • Back-off on parallelism if gain is negligible

  • Fix throughput degradation with multiple users running big DW queries

    • Sybase and Oracle, Throughput is close to Power or better


Test systems

Test Systems


Test systems1

Test Systems

  • 2-way quad-core Xeon 5430 2.66GHz

    • Windows Server 2008 R2, SQL 2008 R2

  • 8-way dual-core Opteron 2.8GHz

    • Windows Server 2008 SP1, SQL 2008 SP1

  • 8-way quad-core Opteron 2.7GHz Barcelona

    • Windows Server 2008 R2, SQL 2008 SP1

Build 2789

8-way systems were configured for AD- not good!


Test methodology

Test Methodology

  • Boot with all processors

    • Run queries at MAXDOP 1, 2, 4, 8, etc

  • Not the same as running on 1-way, 2-way, 4-way server

  • Interpret results with caution
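
For reference, a sketch of the kind of test statement this implies; the aggregate is a stand-in for the Line Item sums used in the deck, and the hint simply caps the degree of parallelism for that run:

    SET STATISTICS TIME ON;

    -- Repeat with MAXDOP 1, 2, 4, 8, 16, 32 and compare CPU vs elapsed time.
    SELECT SUM(l_extendedprice * (1 - l_discount)) AS revenue
    FROM dbo.LINEITEM
    OPTION (MAXDOP 4);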


References

References

  • Search Adam Machanic PASS


Sql server scaling on big iron numa systems

TPC-H

SQL Server Scaling on Big Iron (NUMA) Systems

Joe Chang

[email protected]

www.qdpma.com


About joe chang3

About Joe Chang

  • SQL Server Execution Plan Cost Model

  • True cost structure by system architecture

  • Decoding statblob (distribution statistics)

    • SQL Clone – statistics-only database

  • Tools

    • ExecStats – cross-reference index use by SQL-execution plan

    • Performance Monitoring,

    • Profiler/Trace aggregation


Tpc h

TPC-H


Tpc h1

TPC-H

  • DSS – 22 queries, geometric mean

    • 60X range plan cost, comparable actual range

  • Power – single stream

    • Tests ability to scale parallel execution plans

  • Throughput – multiple streams

  • Scale Factor 1 – Line item data is 1GB

    • 875MB with DATE instead of DATETIME

  • Only single column indexes allowed, Ad-hoc


Observed scaling behaviors

Observed Scaling Behaviors

  • Good scaling, leveling off at high DOP

  • Perfect Scaling ???

  • Super Scaling

  • Negative Scaling

    • especially at high DOP

  • Execution Plan change

    • Completely different behavior


Tpc h published results

TPC-H Published Results


Tpc h sf 100gb

TPC-H SF 100GB

2-way Xeon 5355, 5570, 5680, Opt 6176

Among the 2-way Xeon 5570 results, all are close: HDD has the best throughput, SATA SSD the best composite, and Fusion-IO the best power.

Westmere and Magny-Cours, both 192GB memory, are very close


Tpc h sf 300gb

TPC-H SF 300GB

8x QC/6C & 4x12C Opt,

6C Istanbul improved over 4C Shanghai by 45% Power, 73% Throughput, 59% overall.

4x12C 2.3GHz improved 17% over 8x6C 2.8GHz


Tpc h sf 1000

TPC-H SF 1000


Tpc h sf 3tb

TPC-H SF 3TB

X7460 & X7560

Nehalem-EX 64 cores better than 96 Core 2.


Tpc h sf 100gb 300gb 3tb

TPC-H SF 100GB, 300GB & 3TB

SF100 2-way

Westmere and Magny-Cours are very close

Among the 2-way Xeon 5570 results, all are close: HDD has the best throughput, SATA SSD the best composite, and Fusion-IO the best power

SF300 8x QC/6C & 4x12C

6C Istanbul improved over 4C Shanghai by 45% Power, 73% Throughput, 59% overall.

4x12C 2.3GHz improved 17% over 8x6C 2.8GHz

SF 3TB X7460 & X7560

Nehalem-EX 64 cores better than 96 Core 2.


Tpc h published results1

TPC-H Published Results

  • SQL Server excels in Power

    • Limited by Geometric mean, anomalies

  • Trails in Throughput

    • Other DBMS get better throughput than power

    • SQL Server throughput below Power

      • by wide margin

    • Speculation – SQL Server does not throttle back parallelism with load?


Tpc h sf100

TPC-H SF100

Processors                GHz   Total Cores  Mem GB  SQL    SF     Power      Throughput    QphH
2 Xeon 5355               2.66       8          64   5sp2   100   23,378.0    13,381.0    17,686.7
2 Xeon 5570 (Fusion-IO)   2.93       8         144   8sp1   100   72,110.5    36,190.8    51,085.6
2 Xeon 5570 (SATA SSD)    2.93       8         144   8sp1   100   70,048.5    37,749.1    51,422.4
2 Xeon 5570 (HDD)         2.93       8         144   8sp1   100   67,712.9    38,019.1    50,738.4
2 Xeon 5680               3.33      12         192   8r2    100   99,426.3    55,038.2    73,974.6
2 Opt 6176                2.3       24         192   8r2    100   94,761.5    53,855.6    71,438.3


Tpc h sf300

TPC-H SF300

Processors     GHz   Total Cores  Mem GB  SQL    SF     Power       Throughput    QphH
4 Opt 8220     2.8        8         128   5rtm   300    25,206.4    13,283.8    18,298.5
8 Opt 8360     2.5       32         256   8rtm   300    67,287.4    41,526.4    52,860.2
8 Opt 8384     2.7       32         256   8rtm   300    75,161.2    44,271.9    57,684.7
8 Opt 8439     2.8       48         256   8sp1   300   109,067.1    76,869.0    91,558.2
4 Xeon 7560    2.26      32         640   8r2    300   152,453.1    96,585.4   121,345.6
4 Opt 6176     2.3       48         512   8r2    300   129,198.3    89,547.7   107,561.2

All of the above are HP results? A Sun result with Opteron 8384, sp1: Power 67,095.6, Throughput 45,343.5, QphH 55,157.5


Tpc h 1tb

TPC-H 1TB

The transcript scrambles this table; the recoverable content is:

Systems: an Itanium 9350 (1.73GHz, 64 cores) and two 8 x Opteron 8439 (2.8GHz, 48 cores) configurations, with 384GB and 512GB memory, running Sybase ASE, Oracle 11g R2, and SQL Server 2008 R2(?), all at SF 1000.

Results (Power / Throughput / QphH): 139,181.0 / 141,188.1 / 140,181.1; 108,436.8 / 96,652.7 / 102,375.3; 95,789.1 / 69,367.6 / 81,367.6.


Tpc h 3tb

TPC-H 3TB

Processors      GHz   Total Cores  Mem GB  SQL      SF     Power       Throughput    QphH
8 Xeon 7560     2.26      64         512   8r2     3000   185,297.7    142,685.6   162,601.7
16 Xeon 7460    2.66      96               8r2     3000
POWER6          5.0       64               Sybase
Itanium 9350    1.73      64               Sybase

(The transcript scrambles and duplicates the remaining values: Power 142,790.7 and 120,254.8, Throughput 87,841.4 and 171,607.4, QphH 102,254.8 and 156,537.3, memory of 1024GB and 512GB, and one SF 1000 entry.)


Tpc h published results2

TPC-H Published Results

Processors     GHz   Total Cores  Mem GB  SQL    SF     Power      Throughput    QphH
2 Xeon 5355    2.66      8          64    5sp2   100    23,378.0    13,381.0    17,686.7
2 Xeon 5570    2.93      8         144    8sp1   100    72,110.5    36,190.8    51,085.6
2 Xeon 5680    3.33     12         192    8r2    100    99,426.3    55,038.2    73,974.6
2 Opt 6176     2.3      24         192    8r2    100    94,761.5    53,855.6    71,438.3
4 Opt 8220     2.8       8         128    5rtm   300    25,206.4    13,283.8    18,298.5
8 Opt 8360     2.5      32         256    8rtm   300    67,287.4    41,526.4    52,860.2
8 Opt 8384     2.7      32         256    8rtm   300    75,161.2    44,271.9    57,684.7
8 Opt 8439     2.8      48         256    8sp1   300   109,067.1    76,869.0    91,558.2
4 Opt 6176     2.3      48         512    8r2    300   129,198.3    89,547.7   107,561.2
8 Xeon 7560    2.26     64         512    8r2   3000   185,297.7   142,685.6   162,601.7


Sf100 2 way big queries sec

SF100 2-way Big Queries (sec)

Query time in sec

Xeon 5570 with SATA SSD poor on Q9, reason unknown

Both Xeon 5680 and Opteron 6176 big improvement over Xeon 5570


Sf100 middle q

SF100 Middle Q

Query time in sec

Xeon 5570-HDD and 5680-SSD poor on Q12, reason unknown

Opteron 6176 poor on Q11


Sf100 small queries

SF100 Small Queries

Query time in sec

Xeon 5680 and Opteron poor on Q20

Note limited scaling on Q2, & 17


Sf300 32 cores big queries

SF300 32+ cores Big Queries

Query time in sec

Opteron 6176 poor relative to 8439 on Q9 & 13,

same number of total cores


Sf300 middle q

SF300 Middle Q

Query time in sec

Opteron 6176 much better than 8439 on Q11 & 19

Worse on Q12


Sf300 small q

SF300 Small Q

Query time in sec

Opteron 6176 much better on Q2, even with 8439 on others


Sf1000

SF1000


Sf10001

SF1000


Sf10002

SF1000


Sf10003

SF1000


Sf1000 itanium superdome

SF1000 Itanium - Superdome


Sf 3tb 8 7560 versus 16 7460

SF 3TB – 8×7560 versus 16×7460

5.6X

Broadly 50% faster overall, 5X+ on one, slower on 2, comparable on 3


64 cores 7560 relative to pwr6

64 cores, 7560 relative to PWR6


Tpc h summary

TPC-H Summary

  • Scaling is impressive on some SQL

  • Limited ability (value) is scaling small Q

  • Anomalies, negative scaling


Tpc h queries

TPC-H Queries


Q1 pricing summary report

Q1 Pricing Summary Report


Query 2 minimum cost supplier

Query 2 Minimum Cost Supplier

Wordy, but only touches the small tables, second lowest plan cost (Q15)


System architecture big iron numa

Q3


Q6 forecasting revenue change

Q6 Forecasting Revenue Change


Q7 volume shipping

Q7 Volume Shipping


Q8 national market share

Q8 National Market Share


Q9 product type profit measure

Q9 Product Type Profit Measure


Q11 important stock identification

Q11 Important Stock Identification

Parallel

Non-Parallel


Q12 random io

Q12 Random IO?


System architecture big iron numa

Q13

Why does Q13 have perfect scaling?


Q17 small quantity order revenue

Q17 Small Quantity Order Revenue


Q18 large volume customer

Q18 Large Volume Customer

Non-Parallel

Parallel


System architecture big iron numa

Q19


System architecture big iron numa

Q20?

Date functions are usually written as shown on the slide, because the Line Item date columns are of "date" type. CAST helps the DOP 1 plan, but gets a bad plan for parallel execution.

This query may get a poor execution plan
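
The slide's exact expression is not in the transcript; as a purely hypothetical stand-in for the pattern being discussed, the boundary can be written once as a plain literal cast to date and once built with DATEADD:

    -- Hypothetical illustration only - not the slide's query.
    SELECT COUNT(*)
    FROM dbo.LINEITEM
    WHERE l_shipdate >= CAST('1994-01-01' AS date)
      AND l_shipdate <  CAST(DATEADD(year, 1, '1994-01-01') AS date);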


Q21 suppliers who kept orders waiting

Q21 Suppliers Who Kept Orders Waiting

Note 3 references to Line Item


System architecture big iron numa

Q22


Tpc h studies

TPC-H Studies

Joe Chang

[email protected]

www.qdpma.com


About joe chang4

About Joe Chang

  • SQL Server Execution Plan Cost Model

  • True cost structure by system architecture

  • Decoding statblob (distribution statistics)

    • SQL Clone – statistics-only database

  • Tools

    • ExecStats – cross-reference index use by SQL-execution plan

    • Performance Monitoring,

    • Profiler/Trace aggregation


Tpc h2

TPC-H


Tpc h3

TPC-H

  • DSS – 22 queries, geometric mean

    • 60X range plan cost, comparable actual range

  • Power – single stream

    • Tests ability to scale parallel execution plans

  • Throughput – multiple streams

  • Scale Factor 1 – Line item data is 1GB

    • 875MB with DATE instead of DATETIME

  • Only single column indexes allowed, Ad-hoc


Sf 10 test studies

SF 10, test studies

  • Not valid for publication

  • Auto-Statistics enabled,

    • Excludes compile time

  • Big Queries – Line Item Scan

  • Super Scaling – Mission Impossible

  • Small Queries & High Parallelism

  • Other queries, negative scaling

Did not apply T2301, or disallow page locks


Big q plan cost vs actual

Big Q: Plan Cost vs Actual

memory affects Hash IO onset

Plan Cost reduction from DOP 1 to 16/32: Q1 28%, Q9 44%, Q18 70%, Q21 20%

Plan Cost @ 10GB

Plan Cost says scaling is poor except for Q18,

Q18 & Q21 plan cost is more than 3X that of Q1, Q9

Actual Query time

In seconds

Plan Cost is poor indicator of true parallelism scaling


Big query speed up and cpu

Big Query: Speed Up and CPU

Holy Grail

Q13 has slightly better than perfect scaling?

In general, excellent scaling to DOP 8-24, weak afterwards

Speed up relative to DOP 1

CPU time

In seconds


Super scaling

Super Scaling

  • Suppose at DOP 1, a query runs for 100 seconds, with one CPU fully pegged

    • CPU time = 100 sec, elapse time = 100 sec

  • What is best case for DOP 2?

    • Assuming nearly zero Repartition Threads cost

    • CPU time = 100 sec, elapsed time = 50?

Super Scaling: CPU time decreases going from Non-Parallel to Parallel plan!

No, I have not started drinking, yet


Super scaling1

Super Scaling

CPU normalized to DOP 1

CPU-sec goes down from DOP 1 to 2 and higher (typically 8)

3.5X speedup from DOP 1 to 2 (Normalized to DOP 1)

Speed up relative to DOP 1


Cpu and query time in seconds

CPU and Query time in seconds

CPU time

Query time


Super scaling summary

Super Scaling Summary

  • Most probable cause

    • Bitmap Operator in Parallel Plan

  • Bitmap Filters are great,

    • Question for Microsoft:

    • Can I use Bitmap Filters in OLTP systems with non-parallel plans?


Small queries plan cost vs act

Small Queries – Plan Cost vs Act

Query 3 and 16 have lower plan cost than Q17, but not included

Plan Cost

Q4,6,17 great scaling to DOP 4, then weak

Negative scaling also occurs

Query time


Small queries cpu speedup

Small Queries CPU & Speedup

CPU time

What did I get for all that extra CPU?, Interpretation: sharp jump in CPU means poor scaling, disproportionate means negative scaling

Query 2 negative at DOP 2, Q4 is good, Q6 get speedup, but at CPU premium, Q17 and 20 negative after DOP 8

Speed up


High parallelism small queries

High Parallelism – Small Queries

  • Why? Almost No value

  • TPC-H geometric mean scoring

    • Small queries have as much impact as large

      • A linear sum would weight large queries more heavily

  • OLTP with 32, 64+ cores

    • Parallelism good if super-scaling

    • Default max degree of parallelism 0

      • Seriously bad news, especially for small Q

    • Increase cost threshold for parallelism?

Sometimes you do get lucky
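
A hedged sketch of the two server-level knobs being referred to; the values are illustrative starting points rather than recommendations:

    -- Cap the default degree of parallelism and raise the plan-cost bar a query
    -- must exceed before a parallel plan is considered.
    EXEC sp_configure 'show advanced options', 1;  RECONFIGURE;
    EXEC sp_configure 'max degree of parallelism', 8;       -- e.g. one NUMA node
    EXEC sp_configure 'cost threshold for parallelism', 25;
    RECONFIGURE;

    -- Individual big DW queries can still override the cap:
    -- SELECT ... OPTION (MAXDOP 32);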


Q that go negative

Q that go Negative

Query time

“Speedup”


System architecture big iron numa

CPU


Other queries cpu speedup

Other Queries – CPU & Speedup

CPU time

Q3 has problems beyond DOP 2

Speedup


Other query time seconds

Other - Query Time seconds

Query time


Scaling summary

Scaling Summary

  • Some queries show excellent scaling

  • Super-scaling, better than 2X

  • Sharp CPU jump on last DOP doubling

  • Need strategy to cap DOP

    • To limit negative scaling

    • Especially for some smaller queries?

  • Other anomalies


Compression

Compression

PAGE


Compression overhead overall

Compression Overhead - Overall

Query time compressed relative to uncompressed

40% overhead for compression at low DOP,

10% overhead at max DOP???

CPU time compressed relative to uncompressed


System architecture big iron numa

Query time compressed relative to uncompressed

CPU time compressed relative to uncompressed


Compressed table

Compressed Table

LINEITEM – real data may be more compressible

Uncompressed: 8,749,760KB, Average Bytes per row: 149

Compressed: 4,819,592KB, Average Bytes per row: 82
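
For reference, a sketch of applying page compression to such a table and estimating the savings first; the table name is illustrative:

    -- Estimate the savings without rebuilding anything.
    EXEC sp_estimate_data_compression_savings
         @schema_name = 'dbo', @object_name = 'LINEITEM',
         @index_id = NULL, @partition_number = NULL, @data_compression = 'PAGE';

    -- Rebuild the table (and its clustered index) with PAGE compression.
    ALTER TABLE dbo.LINEITEM REBUILD WITH (DATA_COMPRESSION = PAGE);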


Partitioning

Partitioning

Orders and Line Item on Order Key


Partitioning impact overall

Partitioning Impact - Overall

Query time partitioned relative to not partitioned

CPU time partitioned relative to not partitioned


System architecture big iron numa

Query time partitioned relative to not partitioned

CPU time partitioned relative to not partitioned


Plan for partitioned tables

Plan for Partitioned Tables


Scaling dw summary1

Scaling DW Summary

  • Massive IO bandwidth

  • Parallel options for data load, updates etc

  • Investigate Parallel Execution Plans

    • Scaling from DOP 1, 2, 4, 8, 16, 32 etc

    • Scaling with and w/o HT

  • Strategy for limiting DOP with multiple users


Fixes from microsoft needed1

Fixes from Microsoft Needed

  • Contention issues in parallel execution

    • Table scan, Nested Loops

  • Better plan cost model for scaling

    • Back-off on parallelism if gain is negligible

  • Fix throughput degradation with multiple users running big DW queries

    • Sybase and Oracle, Throughput is close to Power or better


Query plans

Query Plans


Big queries

Big Queries


Q1 pricing summary report1

Q1 Pricing Summary Report


Q1 plan

Q1 Plan

Non-Parallel

Parallel plan 28% lower than scalar, IO is 70%, no parallel plan cost reduction

Parallel


Q9 product type profit measure1

Q9 Product Type Profit Measure

IO from 4 tables contribute 58% of plan cost, parallel plan is 39% lower

Non-Parallel

Parallel


Q9 non parallel plan

Q9 Non-Parallel Plan

Table/Index Scans comprise 64%, IO from 4 tables contribute 58% of plan cost

Join sequence: Supplier, (Part, PartSupp), Line Item, Orders


Q9 parallel plan

Q9 Parallel Plan

Non-Parallel: (Supplier), (Part, PartSupp), Line Item, Orders

Parallel: Nation, Supplier, (Part, Line Item), Orders, PartSupp


Q9 non parallel plan details

Q9 Non-Parallel Plan details

Table Scans comprise 64%,

IO from 4 tables contribute 58% of plan cost


Q9 parallel reg vs partitioned

Q9 Parallel reg vs Partitioned


System architecture big iron numa

Q13

Why does Q13 have perfect scaling?


Q18 large volume customer1

Q18 Large Volume Customer

Non-Parallel

Parallel


Q18 graphical plan

Q18 Graphical Plan

Non-Parallel Plan: 66% of cost in Hash Match, reduced to 5% in Parallel Plan


Q18 plan details

Q18 Plan Details

Non-Parallel

Parallel

Non-Parallel Plan Hash Match cost is 1245 IO, 494.6 CPU

DOP 16/32: size is below IO threshold, CPU reduced by >10X


Q21 suppliers who kept orders waiting1

Q21 Suppliers Who Kept Orders Waiting

Parallel

Non-Parallel

Note 3 references to Line Item


Q21 non parallel plan

Q21 Non-Parallel Plan

H3

H1

H2

H2

H1

H3


Q21 parallel

Q21 Parallel


System architecture big iron numa

Q21

  • 3 full Line Item clustered index scans

  • Plan cost is approx 3X Q1, single “scan”


Super scaling2

Super Scaling


Q7 volume shipping1

Q7 Volume Shipping

Non-Parallel

Parallel


Q7 non parallel plan

Q7 Non-Parallel Plan

Join sequence: Nation, Customer, Orders, Line Item


Q7 parallel plan

Q7 Parallel Plan

Join sequence: Nation, Customer, Orders, Line Item


Q8 national market share1

Q8 National Market Share

Non-Parallel

Parallel


Q8 non parallel plan

Q8 Non-Parallel Plan

Join sequence: Part, Line Item, Orders, Customer


Q8 parallel plan

Q8 Parallel Plan

Join sequence: Part, Line Item, Orders, Customer


Q11 important stock identification1

Q11 Important Stock Identification

Parallel

Non-Parallel


System architecture big iron numa

Q11

Join sequence: A) Nation, Supplier, PartSupp, B) Nation, Supplier, PartSupp


System architecture big iron numa

Q11

Join sequence: A) Nation, Supplier, PartSupp, B) Nation, Supplier, PartSupp


Small queries

Small Queries


Query 2 minimum cost supplier1

Query 2 Minimum Cost Supplier

Wordy, but only touches the small tables, second lowest plan cost (Q15)


System architecture big iron numa

Q2

Clustered Index Scan on Part and PartSupp have highest cost (48%+42%)


System architecture big iron numa

Q2

PartSupp is now Index Scan + Key Lookup


Q6 forecasting revenue change1

Q6 Forecasting Revenue Change

Note sure why this blows CPU

Scalar values are pre-computed, pre-converted


System architecture big iron numa

Q20?

Date functions are usually written as shown on the slide, because the Line Item date columns are of "date" type. CAST helps the DOP 1 plan, but gets a bad plan for parallel execution.

This query may get a poor execution plan


System architecture big iron numa

Q20


System architecture big iron numa

Q20


Q20 alternate parallel

Q20 alternate - parallel

Statistics estimation error here

Penalty for mistake

applied here


Other queries

Other Queries


System architecture big iron numa

Q3


System architecture big iron numa

Q3


Q12 random io1

Q12 Random IO?

Will this generate random IO?


Query 12 plans

Query 12 Plans

Non-Parallel

Parallel


Queries that go negative

Queries that go Negative


Q17 small quantity order revenue1

Q17 Small Quantity Order Revenue


System architecture big iron numa

Q17

Table Spool is concern


System architecture big iron numa

Q17

the usual suspects


System architecture big iron numa

Q19


System architecture big iron numa

Q19


System architecture big iron numa

Q22


System architecture big iron numa

Q22


System architecture big iron numa

[Charts: speedup from DOP 1 in query time, and CPU relative to DOP 1]

