System architecture big iron numa

System Architecture: Big Iron (NUMA)

Joe Chang

[email protected]

www.qdpma.com


About joe chang
About Joe Chang

  • SQL Server Execution Plan Cost Model

  • True cost structure by system architecture

  • Decoding statblob (distribution statistics)

    • SQL Clone – statistics-only database

  • Tools

    • ExecStats – cross-reference index use by SQL-execution plan

    • Performance Monitoring,

    • Profiler/Trace aggregation


Scaling sql on numa topics
Scaling SQL on NUMA Topics

  • OLTP – Thomas Kejser session

    • “Designing High Scale OLTP Systems”

  • Data Warehouse

  • Ongoing Database Development

  • Bulk Load – SQL CAT paper + TK session

    • “The Data Loading Performance Guide”

Other Sessions with common coverage:

Monitoring and Tuning Parallel Query Execution II, R Meyyappan

(SQLBits 6) Inside the SQL Server Query Optimizer, Conor Cunningham

Notes from the field: High Performance Storage, John Langford

SQL Server Storage – 1000GB Level, Brent Ozar



Symmetric multi processing

Symmetric Multi-Processing

[Diagram: four CPUs on a shared system bus connected to the MCH (memory controller hub), with PXH and ICH I/O hubs]

In SMP, processors are not dedicated to specific tasks (unlike ASMP); there is a single OS image and each processor can access all memory

SMP makes no reference to memory architecture?

Not to be confused with Simultaneous Multi-Threading (SMT)

Intel calls SMT Hyper-Threading (HT), which is not to be confused with AMD Hyper-Transport (also HT)


Non uniform memory access
Non-Uniform Memory Access

[Diagram: four nodes, each with four CPUs and a memory controller, joined by node controllers over a shared bus or crossbar]

  • NUMA Architecture - Path to memory is not uniform

  • Node: Processors, Memory, Separate or combined Memory + Node Controllers

  • Nodes connected by shared bus, cross-bar, ring

  • Traditionally, 8-way+ systems

Local memory latency ~150ns, remote node memory ~300-400ns, can cause erratic behavior if OS/code is not NUMA aware
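As a quick sanity check of how SQL Server sees this layout, the SQLOS node DMVs can be queried; a minimal sketch, assuming SQL Server 2008 or later where sys.dm_os_nodes and sys.dm_os_memory_nodes are available:

-- One row per SQLOS node (plus a hidden node for the DAC); with default settings
-- each hardware NUMA node maps to one SQLOS node.
SELECT node_id, node_state_desc, memory_node_id,
       online_scheduler_count, active_worker_count
FROM sys.dm_os_nodes;

-- Memory reserved/committed per memory node.
SELECT memory_node_id, virtual_address_space_reserved_kb,
       virtual_address_space_committed_kb
FROM sys.dm_os_memory_nodes;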


Amd opteron

[Diagram: four Opteron processors connected by HyperTransport, with HT2100 and HT1100 I/O hubs]

AMD Opteron

Technically, Opteron is NUMA, but remote node memory latency is low, with no negative impact or erratic behavior!

For practical purposes it behaves like an SMP system

Local memory latency ~50ns, 1 hop ~100ns, two hop 150ns?

Actual: more complicated because of snooping (cache coherency traffic)


8 way opteron sys architecture

[Diagram: eight Opteron processors, CPU 0 through CPU 7, connected by HyperTransport links]
8-way Opteron Sys Architecture

Opteron processors (prior to Magny-Cours) have 3 HyperTransport links. Note that in the 8-way layout the top and bottom-right processors use 2 HT links to connect to other processors and the 3rd HT link for IO; CPUs 1 and 7 are 3 hops away from each other



Nehalem system architecture
Nehalem System Architecture

Intel Nehalem-generation processors have QuickPath Interconnect (QPI): Xeon 5500/5600 series have 2 QPI links, Xeon 7500 series have 4

8-way Glue-less is possible


Numa local and remote memory
NUMA Local and Remote Memory

  • Local memory is closer than remote

  • Physical access time is shorter

  • What is actual access time?

    • With cache coherency requirement!


Ht assist probe filter
HT Assist – Probe Filter

part of L3 cache used as directory cache

(Image source: ZDNet)


Source snoop coherency
Source Snoop Coherency

From HP PREMA Architecture whitepaper:

All reads result in snoops to all other caches, …

Memory controller cannot return the data until it has collected all the snoop responses and is sure that no cache provided a more recent copy of the memory line


Dl980g7
DL980G7

From HP PREMA Architecture whitepaper:

Each node controller stores information about* all data in the processor caches, minimizes inter-processor coherency communication, reduces latency to local memory

(*only cache tags, not cache data)


Hp proliant dl980 architecture
HP ProLiant DL980 Architecture

Node Controllers reduces effective memory latency


Superdome 2 itanium sx3000
Superdome 2 – Itanium, sx3000

Agent – Remote Ownership Tag

+ L4 cache tags

64M eDRAM L4 cache data


Ibm x3850 x5 glue less
IBM x3850 X5 (Glue-less)

Connect two 4-socket Nodes to make 8-way system


Os memory models

OS Memory Models

[Diagram: 32 logical processors arranged as 4 nodes (Node 0-3) of 8 processors each, shown under the two memory layouts]

SUMA: Sufficiently Uniform Memory Access – memory is interleaved across nodes

NUMA: memory is first interleaved within a node, then the memory stripe is spanned across nodes


Windows os numa support

[Diagram: 32 logical processors across 4 nodes, shown under the SUMA and NUMA memory layouts]

Windows OS NUMA Support

  • Memory models

    • SUMA – Sufficiently Uniform Memory Access

    • NUMA – separate memory pools by Node

Memory is striped across NUMA nodes


Memory model example 4 nodes
Memory Model Example: 4 Nodes

  • SUMA Memory Model

    • memory access uniformly distributed

    • 25% of memory accesses local, 75% remote

  • NUMA Memory Model

    • Goal is better than 25% local node access

    • True local access time also needs to be faster

    • Cache Coherency may increase local access
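A back-of-envelope comparison, using the latency figures quoted earlier (~150ns local, ~300-400ns remote) purely as illustrative numbers:

\[
\bar{t}_{\mathrm{SUMA}} \approx 0.25\,t_{\mathrm{local}} + 0.75\,t_{\mathrm{remote}} \approx 0.25(150) + 0.75(350) \approx 300\ \mathrm{ns}
\]
\[
\bar{t}_{\mathrm{NUMA}} = f_{\mathrm{local}}\,t_{\mathrm{local}} + (1 - f_{\mathrm{local}})\,t_{\mathrm{remote}}
\]

so the payoff of NUMA-aware placement is raising the local fraction \(f_{\mathrm{local}}\) well above 0.25, pulling the average access time from ~300 ns back toward the ~150 ns local figure, before cache-coherency traffic is counted.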


Architecting for numa

Architecting for NUMA

End to End Affinity: App Server → TCP Port → CPU (NUMA node) → Memory → Table

[Diagram: eight geographic regions mapped to TCP ports 1440-1447 and to NUMA nodes 0-7, one region per node: North East → 1440, Mid Atlantic → 1441, South East → 1442, Central → 1443, Texas → 1444, Mountain → 1445, California → 1446, Pacific NW → 1447; North East on Node 0, South East on Node 2, Texas on Node 4, California on Node 6, and so on]

Web determines port for each user by group (but should not be by geography!)

Affinitize port to NUMA node

Each node accesses localized data (partition?)

OS may allocate substantial chunk from Node 0?


Hp ux lora
HP-UX LORA

  • HP-UX – Not Microsoft Windows

  • Locality-Optimized Resource Alignment (LORA)

    • 12.5% Interleaved Memory

    • 87.5% NUMA node Local Memory


System tech specs
System Tech Specs

Processors | Cores per socket | DIMM slots | PCI-E G2 | Total Cores | Max memory | Base
2 x Xeon X56x0 | 6 | 18 | 5 x8+, 1 x4 | 12 | 192G* | $7K
4 x Opteron 6100 | 12 | 32 | 5 x8, 1 x4 | 48 | 512G | $14K
4 x Xeon X7560 | 8 | 64 | 4 x8, 6 x4† | 32 | 1TB | $30K
8 x Xeon X7560 | 8 | 128 | 9 x8, 5 x4‡ | 64 | 2TB | $100K

8GB DIMMs $400 ea: 18 x 8G = 144GB for $7,200; 64 x 8G = 512GB for $26K

16GB DIMMs $1,100 ea: 12 x 16G = 192GB for $13K; 64 x 16G = 1TB for $70K

* Max memory for 2-way Xeon 5600 is 12 x 16GB = 192GB

† Dell R910 and HP DL580G7 have different PCI-E

‡ ProLiant DL980G7 can have 3 IOH for additional PCI-E slots



Operating system
Operating System

  • Windows Server 2003 RTM, SP1

    • Network limitations (default)

    • Scalable Networking Pack (912222)

  • Windows Server 2008

  • Windows Server 2008 R2 (64-bit only)

    • Breaks 64 logical processor limit

    • NUMA IO enhancements?

Impacts OLTP

Search: MSI-X

Do not bother trying to do DW on 32-bit OS or 32-bit SQL Server

Don’t try to do DW on SQL Server 2000


Sql server version
SQL Server version

  • SQL Server 2000

    • Serious disk IO limitations (1GB/sec ?)

    • Problematic parallel execution plans

  • SQL Server 2005 (fixed most S2K problems)

    • 64-bit on X64 (Opteron and Xeon)

    • SP2 – performance improvement 10%(?)

  • SQL Server 2008 & R2

    • Compression, Filtered Indexes, etc

    • Star join, Parallel query to partitioned table


Configuration
Configuration

  • SQL Server Startup Parameter: E

    • Trace Flags 834, 836, 2301

  • Auto_Date_Correlation

    • Order date < A, Ship date > A

      • Implied: Order date > A-C, Ship date < A+C

  • Port Affinity – mostly OLTP

  • Dedicated processor ?

    • for log writer ?
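For reference, a sketch of where these settings live (the Auto_Date_Correlation bullet corresponds to the DATE_CORRELATION_OPTIMIZATION database option); the database name tpch is a placeholder, and the flags are the ones listed above, to be tested rather than applied blindly:

-- Startup options (SQL Server Configuration Manager -> service -> startup parameters):
--   -E                      larger extent allocation runs per data file
--   -T834 -T836 -T2301      the trace flags listed above (834 = large-page buffer pool)

-- Date correlation between the Orders and Line Item date columns:
ALTER DATABASE tpch SET DATE_CORRELATION_OPTIMIZATION ON;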


Storage performance for data warehousing

Storage Performance for Data Warehousing

Joe Chang

[email protected]

www.qdpma.com


About joe chang1
About Joe Chang

  • SQL Server Execution Plan Cost Model

  • True cost structure by system architecture

  • Decoding statblob (distribution statistics)

    • SQL Clone – statistics-only database

  • Tools

    • ExecStats – cross-reference index use by SQL-execution plan

    • Performance Monitoring,

    • Profiler/Trace aggregation



Organization structure
Organization Structure

  • In many large IT departments

    • DB and Storage are in separate groups

  • Storage usually has own objectives

    • Bring all storage into one big system under full management (read: control)

  • Storage as a Service, in the Cloud

    • One size fits all needs

  • Usually have zero DB knowledge

Of course we do high bandwidth, 600MB/sec good enough for you?


Data warehouse storage
Data Warehouse Storage

  • OLTP – Throughput with Fast Response

  • DW – Flood the queues for maximum throughput

Do not use shared storage for data warehouse!

Storage system vendors like to give the impression the SAN is a magical, immensely powerful box that can meet all your needs.

Just tell us how much capacity you need and don’t worry about anything else.

My advice: stay away from shared storage, controlled by different team.


Nominal and net bandwidth
Nominal and Net Bandwidth

  • PCI-E Gen 2 – 5 Gbit/sec signaling

    • x8 = 5GB/s, net BW 4GB/s, x4 = 2GB/s net

  • SAS 6Gbit/s – 6 Gbit/s

    • x4 port: 3GB/s nominal, 2.2GB/sec net?

  • Fibre Channel 8 Gbit/s nominal

    • 780MB/s point-to-point,

    • 680MB/s from host to SAN to back-end loop

  • SAS RAID Controller, x8 PCI-E G2, 2 x4 6G

    • 2.8GB/s

Depends on the controller, will change!
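The net figures follow from the 8b/10b encoding overhead; a worked example for the PCI-E Gen 2 and SAS numbers above:

\[
\text{PCI-E G2 x8: } 8 \times 5\ \mathrm{Gbit/s} = 40\ \mathrm{Gbit/s}\ (5\ \mathrm{GB/s\ nominal}),\qquad 40 \times \tfrac{8}{10} \div 8 = 4\ \mathrm{GB/s\ net}
\]
\[
\text{SAS 6G x4: } 4 \times 6\ \mathrm{Gbit/s} = 24\ \mathrm{Gbit/s}\ (3\ \mathrm{GB/s\ nominal}),\qquad 24 \times \tfrac{8}{10} \div 8 = 2.4\ \mathrm{GB/s},\ \text{about 2.2 GB/s in practice}
\]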


Storage sas direct attach

Storage – SAS Direct-Attach

[Diagram: several RAID controllers, each in a PCI-E x8 (or x4) slot with two x4 SAS ports fanning out to disk enclosures, plus dual 10GbE NICs in PCI-E x4 slots]

Many Fat Pipes

Very Many Disks

Option A:

24-disks in one enclosure for each x4 SAS port. Two x4 SAS ports per controller

Option B: Split enclosure over 2 x4 SAS ports, 1 controller

Balance by pipe bandwidth

Don’t forget fat network pipes


Storage fc san

Storage – FC/SAN

[Diagram: several quad-port and dual-port 8Gb FC HBAs in PCI-E x8 and x4 slots, each FC port connecting to disk enclosures on the SAN, plus dual 10GbE NICs in PCI-E x4 slots]

PCI-E x8 Gen 2 Slot with quad-port 8Gb FC

If 8Gb quad-port is not supported, consider system with many x4 slots, or consider SAS!

SAN systems typically offer 3.5in 15-disk enclosures. Difficult to get high spindle count with density.

1-2 15-disk enclosures per 8Gb FC port, 20-30MB/s per disk?


Storage ssd hdd hybrid

Storage – SSD / HDD Hybrid

[Diagram: SAS controllers in PCI-E x8 slots, each x4 SAS port connecting to enclosures populated with SSDs plus remaining HDD bays, plus dual 10GbE NICs in PCI-E x4 slots]

No RAID w/SSD?

Storage enclosures typically hold 12 disks per channel and can only support the bandwidth of a few SSDs. Use the remaining bays for extra storage with HDD; there is no point expending valuable SSD space on backups and flat files

Log:

Single DB – HDD, unless rollbacks or T-log backups disrupt log writes.

Multi DB – SSD, otherwise too many RAID1 pairs are needed for logs


SSD

  • Current: mostly 3Gbps SAS/SATA SSD

    • Some 6Gbps SATA SSD

  • Fusion IO – direct PCI-E Gen2 interface

    • 320GB-1.2TB capacity, 200K IOPS, 1.5GB/s

  • No RAID ?

    • HDD is fundamentally a single point failure

    • SSD could be built with redundant components

  • HP report problems with SSD on RAID controllers, Fujitsu did not?



Storage configuration arrays
Storage Configuration - Arrays

Shown:

two 12-disk Arrays per 24-disk enclosure

Options: between 6-16 disks per array

SAN systems may recommend R10 4+4 or R5 7+1

Very Many Spindles

Comment on Meta LUN


Data consumption rate xeon

Data Consumption Rate: Xeon

Processors | GHz | Total Cores | Mem GB | SQL | Q1 sec | SF | Total MB/s | MB/s per core | Architecture
2 Xeon 5355 | 2.66 | 8 | 64 | 5sp2 | 85.4 | 100 | 1,165.5 | 145.7 | Conroe
2 Xeon 5570 | 2.93 | 8 | 144 | 8sp1 | 42.2 | 100 | 2,073.5 | 259.2 | Nehalem
2 Xeon 5680 | 3.33 | 12 | 192 | 8r2 | 21.0 | 100 | 4,166.7 | 347.2 | Westmere
4 Xeon 7560 | 2.26 | 32 | 640 | 8r2 | 37.2 | 300 | 7,056.5 | 220.5 | Neh.-EX
8 Xeon 7560 | 2.26 | 64 | 512 | 8r2 | 183.8 | 3000 | 14,282 | 223.2 | Neh.-EX

TPC-H Query 1 Lineitem scan, SF1 1GB, 2k8 875M

Data consumption rate is much higher for current generation Nehalem and Westmere processors than Core 2 referenced in Microsoft FTDW document. TPC-H Q1 is more compute intensive than the FTDW light query.


Data consumption rate opteron

Data Consumption Rate: Opteron

Processors | GHz | Total Cores | Mem GB | SQL | Q1 sec | SF | Total MB/s | MB/s per core | Architecture
4 Opt 8220 | 2.8 | 8 | 128 | 5rtm | 309.7 | 300 | 868.7 | 121.1 | -
8 Opt 8360 | 2.5 | 32 | 256 | 8rtm | 91.4 | 300 | 2,872.0 | 89.7 | Barcelona
8 Opt 8384 | 2.7 | 32 | 256 | 8rtm | 72.5 | 300 | 3,620.7 | 113.2 | Shanghai
8 Opt 8439 | 2.8 | 48 | 512 | 8rtm | 166.9 | 1000 | 5,242.7 | 109.2 | Istanbul
8 Opt 8439 | 2.8 | 48 | 256 | 8sp1 | 49.0 | 300 | 5,357.1 | 111.6 | Istanbul
2 Opt 6176 | 2.3 | 24 | 192 | 8r2 | 20.2 | 100 | 4,331.7 | 180.5 | Magny-Cours
4 Opt 6176 | 2.3 | 48 | 512 | 8r2 | 31.8 | 300 | 8,254.7 | 172.0 | Magny-Cours

TPC-H Query 1 Lineitem scan, SF1 1GB, 2k8 875M

Expected Istanbul to have better performance per core than Shanghai due to HT Assist. Magny-Cours has much better performance per core! (at 2.3GHz versus 2.8 for Istanbul), or is this Win/SQL 2K8 R2?


Data consumption rate

Data Consumption Rate

Processors | GHz | Total Cores | Mem GB | SQL | Q1 sec | SF | Total MB/s | MB/s per core | Architecture
2 Xeon 5355 | 2.66 | 8 | 64 | 5sp2 | 85.4 | 100 | 1165.5 | 145.7 | -
2 Xeon 5570 | 2.93 | 8 | 144 | 8sp1 | 42.2 | 100 | 2073.5 | 259.2 | -
2 Xeon 5680 | 3.33 | 12 | 192 | 8r2 | 21.0 | 100 | 4166.7 | 347.2 | -
2 Opt 6176 | 2.3 | 24 | 192 | 8r2 | 20.2 | 100 | 4331.7 | 180.5 | Magny-Cours
4 Opt 8220 | 2.8 | 8 | 128 | 5rtm | 309.7 | 300 | 868.7 | 121.1 | -
8 Opt 8360 | 2.5 | 32 | 256 | 8rtm | 91.4 | 300 | 2872.0 | 89.7 | Barcelona
8 Opt 8384 | 2.7 | 32 | 256 | 8rtm | 72.5 | 300 | 3620.7 | 113.2 | Shanghai
8 Opt 8439 | 2.8 | 48 | 256 | 8sp1 | 49.0 | 300 | 5357.1 | 111.6 | Istanbul
4 Opt 6176 | 2.3 | 48 | 512 | 8r2 | 31.8 | 300 | 8254.7 | 172.0 | Magny-Cours
8 Xeon 7560 | 2.26 | 64 | 512 | 8r2 | 183.8 | 3000 | 14282 | 223.2 | -

TPC-H Query 1 Lineitem scan, SF1 1GB, 2k8 875M


Storage targets

Storage Targets

Processors | BW/Core MB/s | Total Cores | Target MB/s | PCI-E x8 - x4 | SAS HBA | Storage Units - Disks | Actual Bandwidth | Storage Units - Disks
2 Xeon X5680 | 350 | 12 | 4,200 | 5 - 1 | 2 | 2 - 48 | 5 GB/s | 4 - 96
4 Opt 6176 | 175 | 48 | 8,400 | 5 - 1 | 4 | 4 - 96 | 10 GB/s | 8 - 192
4 Xeon X7560 | 250 | 32 | 8,000 | 6 - 4 | 6 | 6 - 144 | 15 GB/s | 12 - 288
8 Xeon X7560 | 225 | 64 | 14,400 | 9 - 5 | 11† | 10 - 240 | 26 GB/s | 20 - 480

2U disk enclosure: 24 x 73GB 15K 2.5in disks, $14K ($600 per disk)

† 8-way: 9 controllers in x8 slots with 24 disks per x4 SAS port, plus 2 controllers in x4 slots with 12 disks

With 24 15K disks per enclosure: 12 disks per x4 SAS port requires 100MB/sec per disk, possible but not always practical; 24 disks per x4 SAS port requires 50MB/sec, more achievable in practice

Think: Shortest path to metal (iron-oxide)
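The per-disk figures in the note follow from the controller bandwidth; a rough calculation assuming the ~2.8GB/s per SAS controller quoted earlier, split across its two x4 ports:

\[
\frac{2.8\ \mathrm{GB/s}}{2\ \text{ports}} = 1.4\ \mathrm{GB/s\ per\ x4\ port},\qquad
\frac{1.4\ \mathrm{GB/s}}{12\ \text{disks}} \approx 117\ \mathrm{MB/s\ per\ disk},\qquad
\frac{1.4\ \mathrm{GB/s}}{24\ \text{disks}} \approx 58\ \mathrm{MB/s\ per\ disk}
\]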


Your storage and the optimizer

Your Storage and the Optimizer

Model | Disks | BW (KB/s) | Sequential IOPS | "Random" IOPS | Sequential-Random IO ratio
Optimizer | - | 10,800 | 1,350 | 320 | 4.22
SAS 2x4 | 24 | 2,800,000 | 350,000 | 9,600 | 36.5
SAS 2x4 | 48 | 2,800,000 | 350,000 | 19,200 | 18.2
FC 4G | 30 | 360,000 | 45,000 | 12,000 | 3.75
SSD | 8 | 2,800,000 | 350,000 | 280,000 | 1.25

Assumptions

2.8GB/sec per SAS 2 x4 Adapter, Could be 3.2GB/sec per PCI-E G2 x8

HDD 400 IOPS per disk – Big query key lookup, loop join at high queue, and short-stroked, possible skip-seek. SSD 35,000 IOPS

The SQL Server Query Optimizer makes key lookup versus table scan decisions based on a 4.22 sequential-to-random IO ratio

A DW-configured storage system has an 18-36 ratio; 30 disks per 4G FC roughly matches the QO, and SSD is in the other direction
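The ratios in the table are simply the sequential rate divided by the random rate: the optimizer's cost model assumes 10,800KB/s sequential (1,350 pages of 8KB per second) against 320 random IOs per second, and the storage rows use the assumed 400 IOPS per HDD and 35,000 IOPS per SSD:

\[
\frac{1350}{320} \approx 4.22,\qquad
\frac{350{,}000}{24 \times 400} \approx 36.5,\qquad
\frac{350{,}000}{48 \times 400} \approx 18.2,\qquad
\frac{45{,}000}{30 \times 400} = 3.75,\qquad
\frac{350{,}000}{8 \times 35{,}000} = 1.25
\]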


Data Consumption Rates

TPC-H SF100

Query 1, 9, 13, 21

TPC-H SF300

Query 1, 9, 13, 21


Fast track reference architecture
Fast Track Reference Architecture

My Complaints

  • Several Expensive SAN systems (11 disks)

    • Each must be configured independently

    • $1,500-2,000 amortized per disk

  • Too many 2-disk Arrays

    • 2 LUN per Array, too many data files

  • Build Indexes with MAXDOP 1

    • Is this brain dead?

  • Designed around 100MB/sec per disk

    • Not all DW is single scan, or sequential

Scripting?


Fragmentation
Fragmentation

Table

Weak Storage System

1) Fragmentation could degrade IO performance,

2) Defragmenting very large table on a weak storage system could render the database marginally to completely non-functional for a very long time.

Powerful Storage System

3) Fragmentation has very little impact.

4) Defragmenting has mild impact, and completes within night time window.

What is the correct conclusion?

[Diagram: Table → File → Partition → LUN → Disk]



Operating system disk view

Operating System Disk View

[Diagram: Disks 2 through 7, each Basic, 396GB, Online, spread across Controller 1/2/3, Ports 0/1]

Additional disks not shown, Disk 0 is boot drive, 1 – install source?


File layout

File Layout

[Diagram: Disks 2 through 7, each with partitions holding one file from each group: the file group for the big table, the small file group, the file group for all others, tempdb, and backup/load]

Each File Group is distributed across all data disks

Log disks not shown; tempdb shares a common pool with data


File groups and files
File Groups and Files

  • Dedicated File Group for largest table

    • Never defragment

  • One file group for all other regular tables

  • Load file group?

    • Rebuild indexes to different file group


Partitioning pitfalls

Partitioning - Pitfalls

[Diagram: Table Partitions 1-6 mapped one-to-one to File Groups 1-6, each on its own disk (Disks 2-7)]

Common Partitioning Strategy

Partition Scheme maps partitions to File Groups

What happens in a table scan?

Read first from Part 1

then 2, then 3, … ?

SQL 2008 HF to read from each partition in parallel?

What if partitions have disparate sizes?
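The strategy in the diagram corresponds to something like the following sketch; the boundary values, file group names, and column types are placeholders matching the picture, not a recommendation:

-- Range partition function producing 6 partitions on the order key
CREATE PARTITION FUNCTION pf_orderkey (bigint)
AS RANGE RIGHT FOR VALUES (10000000, 20000000, 30000000, 40000000, 50000000);

-- Partition scheme mapping each partition to its own file group
CREATE PARTITION SCHEME ps_orderkey
AS PARTITION pf_orderkey
TO (FileGroup1, FileGroup2, FileGroup3, FileGroup4, FileGroup5, FileGroup6);

-- The table (or its clustered index) is then created on the scheme
CREATE TABLE dbo.LINEITEM_part
(   l_orderkey   bigint NOT NULL,
    l_linenumber int    NOT NULL
    -- ... remaining columns ...
) ON ps_orderkey (l_orderkey);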


Parallel execution plans

Parallel Execution Plans

Joe Chang

[email protected]

www.qdpma.com


About joe chang2
About Joe Chang

  • SQL Server Execution Plan Cost Model

  • True cost structure by system architecture

  • Decoding statblob (distribution statistics)

    • SQL Clone – statistics-only database

  • Tools

    • ExecStats – cross-reference index use by SQL-execution plan

    • Performance Monitoring,

    • Profiler/Trace aggregation


So you bought a 64 core box
So you bought a 64+ core box

Now

  • Learn all about Parallel Execution

    • All guns (cores) blazing

    • Negative scaling

    • Super-scaling

    • High degree of parallelism & small SQL

    • Anomalies, execution plan changes etc

  • Compression

  • Partitioning

Yes, this can happen, how will you know

No I have not been smoking pot

How much in CPU do I pay for this?

Great management tool, what else?


Parallel execution plans1

Parallel Execution Plans

Reference: Adam Machanic PASS


Execution plan quickie
Execution Plan Quickie

I/O and CPU Cost components

F4

Estimated Execution Plan

Cost is duration in seconds on some reference platform

IO Cost for scan: 1 = 10,800KB/s, 810 implies 8,748,000KB

IO in Nested Loops Join: 1 = 320/s, multiple of 0.003125


Index key lookup scan
Index + Key Lookup - Scan

Actual CPU Time (Data in memory)

LU 1919 1919

Scan 8736 8727

(926.67- 323655 * 0.0001581) / 0.003125 = 280160 (86.6%)

True cross-over approx 1,400,000 rows

1 row : page

1,093,729 pages/1350 = 810.17 (8,748MB)


Index key lookup scan1
Index + Key Lookup - Scan

Actual CPU Time

LU 2138 321

Scan 18622 658

(817- 280326 * 0.0001581) / 0.003125 = 247259 (88%)

8748000KB/8/1350 = 810


Actual execution plan
Actual Execution Plan

Estimated

Actual

Note Actual Number

of Rows, Rebinds,

Rewinds

Estimated

Actual


Row count and executions
Row Count and Executions

Outer

Inner

Source

For Loop Join inner source and Key Lookup,

Actual Num Rows = Num of Exec × Num of Rows



Parallelism operations
Parallelism Operations

  • Distribute Streams

    • Non-parallel source, parallel destination

  • Repartition Streams

    • Parallel source and destination

  • Gather Streams

    • Destination is non-parallel


Parallel execution plans2
Parallel Execution Plans

Note: gold circle with double arrow, and parallelism operations


Parallel scan and index seek
Parallel Scan (and Index Seek)

[Plan cost shown at DOP 1, 2, 4, and 8; labels 2X, 4X, 8X]

IO Cost is the same; CPU cost is reduced by the degree of parallelism, except no reduction for DOP 16

IO contributes most of the cost!



Hash match aggregate
Hash Match Aggregate

CPU cost only reduces by 2X


Parallel scan
Parallel Scan

  • IO Cost is the same

  • CPU cost reduced in proportion to degree of parallelism, last 2X excluded?

On a weak storage system, a single thread can saturate the IO channel,

Additional threads will not increase IO (reduce IO duration).

A very powerful storage system can provide IO proportional to the number of threads. It might be nice if this was optimizer option?

The IO component can be a very large portion of the overall plan cost

Not reducing IO cost in parallel plan may inhibit generating favorable plan,

i.e., not sufficient to offset the contribution from the Parallelism operations.

A parallel execution plan is more likely on larger systems (-P to fake it?)






Hash join cost
Hash Join Cost

DOP 4

DOP 1

DOP 2

Search: Understanding Hash Joins

For In-memory, Grace, Recursive

DOP 8


Hash join cost1
Hash Join Cost

CPU Cost is linear with number of rows, outer and inner source

See BOL on Hash Joins for In-Memory, Grace, Recursive

IO Cost is zero for small intermediate data size,

beyond set point proportional to server memory(?)

IO is proportional to excess data (beyond in-memory limit)

Parallel Plan: Memory allocation is per thread!

Summary: Hash Join plan cost depends on memory if the IO component is not zero, in which case it is disproportionately lower with parallel plans. Does not reflect real cost?



Bitmap
Bitmap

BOL: Optimizing Data Warehouse Query Performance Through Bitmap Filtering

A bitmap filter uses a compact representation of a set of values from a table in one part of the operator tree to filter rows from a second table in another part of the tree. Essentially, the filter performs a semi-join reduction; that is, only the rows in the second table that qualify for the join to the first table are processed.

SQL Server uses the Bitmap operator to implement bitmap filtering in parallel query plans. Bitmap filtering speeds up query execution by eliminating rows with key values that cannot produce any join records before passing rows through another operator such as the Parallelism operator. A bitmap filter uses a compact representation of a set of values from a table in one part of the operator tree to filter rows from a second table in another part of the tree. By removing unnecessary rows early in the query, subsequent operators have fewer rows to work with, and the overall performance of the query improves. The optimizer determines when a bitmap is selective enough to be useful and in which operators to apply the filter. For more information, see Optimizing Data Warehouse Query Performance Through Bitmap Filtering.


Parallel execution plan summary
Parallel Execution Plan Summary

  • Queries with high IO cost may show little plan cost reduction on parallel execution

  • Plans with high portion hash or sort cost show large parallel plan cost reduction

  • Parallel plans may be inhibited by high row count in Parallelism Repartition Streams

  • Watch out for (Parallel) Merge Joins!



Parallel execution strategy
Parallel Execution Strategy

  • Partition work into little pieces

    • Ensures each thread has same amount

    • High overhead to coordinate

  • Partition into big pieces

    • May have uneven distribution between threads

    • Small table join to big table

      • Thread for each row from small table

  • Partitioned table options


What should scale
What Should Scale?


Trivially parallelizable:

1) Split large chunk of work among threads,

2) Each thread works independently,

3) Small amount of coordination to consolidate threads


More difficult
More Difficult?


Parallelizable:

1) Split large chunk of work among threads,

2) Each thread works on first stage

3) Large coordination effort between threads

4) More work

Consolidate


Partitioned tables
Partitioned Tables

No Repartition Streams

Regular Table

Partitioned Tables

No Repartition Streams operations!


Scaling reality

Scaling Reality

8-way Quad-Core Opteron

Windows Server 2008 R2

SQL Server 2008 SP1 + HF 27


Test queries
Test Queries

  • TPC-H SF 10 database

    • Standard, Compressed, Partitioned (30)

  • Line Item Table SUM, 59M rows, 8.75GB

  • Orders Table 15M rows
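The test queries are of this general shape (a sketch of the pattern, not the exact statements used); MAXDOP is varied per run as described under Test Methodology:

-- Single-column aggregate over the 59M-row Line Item table
SELECT SUM(l_extendedprice)
FROM dbo.LINEITEM
OPTION (MAXDOP 8);

-- Aggregate with GROUP BY, plus a hash join to the Orders table
SELECT o.o_orderpriority, SUM(l.l_extendedprice) AS total_price
FROM dbo.ORDERS AS o
JOIN dbo.LINEITEM AS l ON l.l_orderkey = o.o_orderkey
GROUP BY o.o_orderpriority
OPTION (MAXDOP 8);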


Cpu sec
CPU-sec

Standard

CPU-sec to SUM 1 or 2 columns in Line Item

Compressed


Speed up
Speed Up

Standard

Compressed


Line item sum 1 column
Line Item sum 1 column

CPU-sec

Speed up relative to DOP 1


Line item sum w group by
Line Item Sum w/Group By

CPU-sec

Speedup


Hash join
Hash Join

CPU-sec

Speedup


Key lookup and table scan
Key Lookup and Table Scan

CPU-sec

1.4M rows

Speedup


Parallel execution summary
Parallel Execution Summary

  • Contention in queries w/low cost per page

    • Simple scan,

  • High Cost per Page – improves scaling!

    • Multiple Aggregates, Hash Join, Compression

  • Table Partitioning –

    • alternative query plans

  • Loop Joins – broken at high DOP

  • Merge Join – seriously broken (parallel)


Scaling dw summary
Scaling DW Summary

  • Massive IO bandwidth

  • Parallel options for data load, updates etc

  • Investigate Parallel Execution Plans

    • Scaling from DOP 1, 2, 4, 8, 16, 32 etc

    • Scaling with and w/o HT

  • Strategy for limiting DOP with multiple users


Fixes from microsoft needed
Fixes from Microsoft Needed

  • Contention issues in parallel execution

    • Table scan, Nested Loops

  • Better plan cost model for scaling

    • Back-off on parallelism if gain is negligible

  • Fix throughput degradation with multiple users running big DW queries

    • Sybase and Oracle, Throughput is close to Power or better



Test systems1
Test Systems

  • 2-way quad-core Xeon 5430 2.66GHz

    • Windows Server 2008 R2, SQL 2008 R2

  • 8-way dual-core Opteron 2.8GHz

    • Windows Server 2008 SP1, SQL 2008 SP1

  • 8-way quad-core Opteron 2.7GHz Barcelona

    • Windows Server 2008 R2, SQL 2008 SP1

Build 2789

8-way systems were configured for AD- not good!


Test methodology
Test Methodology

  • Boot with all processors

    • Run queries at MAXDOP 1, 2, 4, 8, etc

  • Not the same as running on 1-way, 2-way, 4-way server

  • Interpret results with caution


References
References

  • Search Adam Machanic PASS


Sql server scaling on big iron numa systems

TPC-H

SQL Server Scaling on Big Iron (NUMA) Systems

Joe Chang

[email protected]

www.qdpma.com


About joe chang3
About Joe Chang

  • SQL Server Execution Plan Cost Model

  • True cost structure by system architecture

  • Decoding statblob (distribution statistics)

    • SQL Clone – statistics-only database

  • Tools

    • ExecStats – cross-reference index use by SQL-execution plan

    • Performance Monitoring,

    • Profiler/Trace aggregation



Tpc h1
TPC-H

  • DSS – 22 queries, geometric mean

    • 60X range plan cost, comparable actual range

  • Power – single stream

    • Tests ability to scale parallel execution plans

  • Throughput – multiple streams

  • Scale Factor 1 – Line item data is 1GB

    • 875MB with DATE instead of DATETIME

  • Only single column indexes allowed, Ad-hoc


Observed scaling behaviors
Observed Scaling Behaviors

  • Good scaling, leveling off at high DOP

  • Perfect Scaling ???

  • Super Scaling

  • Negative Scaling

    • especially at high DOP

  • Execution Plan change

    • Completely different behavior



Tpc h sf 100gb
TPC-H SF 100GB

2-way Xeon 5355, 5570, 5680, Opt 6176

Among the 2-way Xeon 5570 configurations, all are close: HDD has the best throughput, SATA SSD the best composite, and Fusion-IO the best power.

Westmere and Magny-Cours, both 192GB memory, are very close


Tpc h sf 300gb
TPC-H SF 300GB

8x QC/6C & 4x12C Opt,

6C Istanbul improved over 4C Shanghai by 45% Power, 73% Throughput, 59% overall.

4x12C 2.3GHz improved 17% over 8x6C 2.8GHz



Tpc h sf 3tb
TPC-H SF 3TB

X7460 & X7560

Nehalem-EX 64 cores better than 96 Core 2.


Tpc h sf 100gb 300gb 3tb
TPC-H SF 100GB, 300GB & 3TB

SF100 2-way

Westmere and Magny-Cours are very close

Among the 2-way Xeon 5570 configurations, all are close: HDD has the best throughput, SATA SSD the best composite, and Fusion-IO the best power

SF300 8x QC/6C & 4x12C

6C Istanbul improved over 4C Shanghai by 45% Power, 73% Throughput, 59% overall.

4x12C 2.3GHz improved 17% over 8x6C 2.8GHz

SF 3TB X7460 & X7560

Nehalem-EX 64 cores better than 96 Core 2.


Tpc h published results1
TPC-H Published Results

  • SQL Server excels in Power

    • Limited by Geometric mean, anomalies

  • Trails in Throughput

    • Other DBMS get better throughput than power

    • SQL Server throughput below Power

      • by wide margin

    • Speculation – SQL Server does not throttle back parallelism with load?


Tpc h sf100

TPC-H SF100

Processors | GHz | Total Cores | Mem GB | SQL | SF | Power | Throughput | QphH
2 Xeon 5355 | 2.66 | 8 | 64 | 5sp2 | 100 | 23,378.0 | 13,381.0 | 17,686.7
2 Xeon 5570 (Fusion-IO) | 2.93 | 8 | 144 | 8sp1 | 100 | 72,110.5 | 36,190.8 | 51,085.6
2 Xeon 5570 (SSD) | 2.93 | 8 | 144 | 8sp1 | 100 | 70,048.5 | 37,749.1 | 51,422.4
2 Xeon 5570 (HDD) | 2.93 | 8 | 144 | 8sp1 | 100 | 67,712.9 | 38,019.1 | 50,738.4
2 Xeon 5680 | 3.33 | 12 | 192 | 8r2 | 100 | 99,426.3 | 55,038.2 | 73,974.6
2 Opt 6176 | 2.3 | 24 | 192 | 8r2 | 100 | 94,761.5 | 53,855.6 | 71,438.3


Tpc h sf300

TPC-H SF300

Processors | GHz | Total Cores | Mem GB | SQL | SF | Power | Throughput | QphH
4 Opt 8220 | 2.8 | 8 | 128 | 5rtm | 300 | 25,206.4 | 13,283.8 | 18,298.5
8 Opt 8360 | 2.5 | 32 | 256 | 8rtm | 300 | 67,287.4 | 41,526.4 | 52,860.2
8 Opt 8384 | 2.7 | 32 | 256 | 8rtm | 300 | 75,161.2 | 44,271.9 | 57,684.7
8 Opt 8439 | 2.8 | 48 | 256 | 8sp1 | 300 | 109,067.1 | 76,869.0 | 91,558.2
4 Xeon 7560 | 2.26 | 32 | 640 | 8r2 | 300 | 152,453.1 | 96,585.4 | 121,345.6
4 Opt 6176 | 2.3 | 48 | 512 | 8r2 | 300 | 129,198.3 | 89,547.7 | 107,561.2

All of the above are HP results? Sun result for Opt 8384, sp1: Power 67,095.6, Throughput 45,343.5, QphH 55,157.5


Tpc h 1tb

TPC-H 1TB

Processors | GHz | Total Cores | Mem GB | SQL | SF | Power | Throughput | QphH
8 Opt 8439 | 2.8 | 48 | 512 | ASE | 1000 | 108,436.8 | 96,652.7 | 102,375.3
Itanium 9350 | 1.73 | 64 | 384 | O11R2 | 1000 | 139,181.0 | 141,188.1 | 140,181.1
8 Opt 8439 | 2.8 | 48 | 512 | 8R2? | 1000 | 95,789.1 | 69,367.6 | 81,367.6


Tpc h 3tb

TPC-H 3TB

Processors | GHz | Total Cores | Mem GB | SQL | SF | Power | Throughput | QphH
8 Xeon 7560 | 2.26 | 64 | 512 | 8r2 | 3000 | 185,297.7 | 142,685.6 | 162,601.7
16 Xeon 7460 | 2.66 | 96 | 1024 | 8r2 | 3000 | 120,254.8 | 87,841.4 | 102,254.8
POWER6 | 5.0 | 64 | 512 | Sybase | 1000 | 142,790.7 | 171,607.4 | 156,537.3
Itanium 9350 | 1.73 | 64 | 512 | Sybase | 3000 | 142,790.7 | 171,607.4 | 156,537.3


Tpc h published results2
TPC-H Published Results

Processors | GHz | Total Cores | Mem GB | SQL | SF | Power | Throughput | QphH
2 Xeon 5355 | 2.66 | 8 | 64 | 5sp2 | 100 | 23,378 | 13,381 | 17,686.7
2 Xeon 5570 | 2.93 | 8 | 144 | 8sp1 | 100 | 72,110.5 | 36,190.8 | 51,085.6
2 Xeon 5680 | 3.33 | 12 | 192 | 8r2 | 100 | 99,426.3 | 55,038.2 | 73,974.6
2 Opt 6176 | 2.3 | 24 | 192 | 8r2 | 100 | 94,761.5 | 53,855.6 | 71,438.3
4 Opt 8220 | 2.8 | 8 | 128 | 5rtm | 300 | 25,206.4 | 13,283.8 | 18,298.5
8 Opt 8360 | 2.5 | 32 | 256 | 8rtm | 300 | 67,287.4 | 41,526.4 | 52,860.2
8 Opt 8384 | 2.7 | 32 | 256 | 8rtm | 300 | 75,161.2 | 44,271.9 | 57,684.7
8 Opt 8439 | 2.8 | 48 | 256 | 8sp1 | 300 | 109,067.1 | 76,869.0 | 91,558.2
4 Opt 6176 | 2.3 | 48 | 512 | 8r2 | 300 | 129,198.3 | 89,547.7 | 107,561.2
8 Xeon 7560 | 2.26 | 64 | 512 | 8r2 | 3000 | 185,297.7 | 142,685.6 | 162,601.7


Sf100 2 way big queries sec
SF100 2-way Big Queries (sec)

Query time in sec

Xeon 5570 with SATA SSD poor on Q9, reason unknown

Both Xeon 5680 and Opteron 6176 big improvement over Xeon 5570


Sf100 middle q
SF100 Middle Q

Query time in sec

Xeon 5570-HDD and 5680-SSD poor on Q12, reason unknown

Opteron 6176 poor on Q11


Sf100 small queries
SF100 Small Queries

Query time in sec

Xeon 5680 and Opteron poor on Q20

Note limited scaling on Q2, & 17


Sf300 32 cores big queries
SF300 32+ cores Big Queries

Query time in sec

Opteron 6176 poor relative to 8439 on Q9 & 13,

same number of total cores


Sf300 middle q
SF300 Middle Q

Query time in sec

Opteron 6176 much better than 8439 on Q11 & 19

Worse on Q12


Sf300 small q
SF300 Small Q

Query time in sec

Opteron 6176 much better on Q2, even with 8439 on others







Sf 3tb 8 7560 versus 16 7460
SF 3TB – 8×7560 versus 16×7460

5.6X

Broadly 50% faster overall, 5X+ on one, slower on 2, comparable on 3



Tpc h summary
TPC-H Summary

  • Scaling is impressive on some SQL

  • Limited ability (value) is scaling small Q

  • Anomalies, negative scaling




Query 2 minimum cost supplier
Query 2 Minimum Cost Supplier

Wordy, but only touches the small tables, second lowest plan cost (Q15)







Q11 important stock identification
Q11 Important Stock Identification

Parallel

Non-Parallel



Q13

Why does Q13 have perfect scaling?



Q18 large volume customer
Q18 Large Volume Customer

Non-Parallel

Parallel



Q20?

Date functions are usually written as

because Line Item date columns are “date” type

CAST helps DOP 1 plan, but get bad plan for parallel

This query may get a poor execution plan


Q21 suppliers who kept orders waiting
Q21 Suppliers Who Kept Orders Waiting

Note 3 references to Line Item



Tpc h studies

TPC-H Studies

Joe Chang

[email protected]

www.qdpma.com


About joe chang4
About Joe Chang

  • SQL Server Execution Plan Cost Model

  • True cost structure by system architecture

  • Decoding statblob (distribution statistics)

    • SQL Clone – statistics-only database

  • Tools

    • ExecStats – cross-reference index use by SQL-execution plan

    • Performance Monitoring,

    • Profiler/Trace aggregation



Tpc h3
TPC-H

  • DSS – 22 queries, geometric mean

    • 60X range plan cost, comparable actual range

  • Power – single stream

    • Tests ability to scale parallel execution plans

  • Throughput – multiple streams

  • Scale Factor 1 – Line item data is 1GB

    • 875MB with DATE instead of DATETIME

  • Only single column indexes allowed, Ad-hoc


Sf 10 test studies
SF 10, test studies

  • Not valid for publication

  • Auto-Statistics enabled,

    • Excludes compile time

  • Big Queries – Line Item Scan

  • Super Scaling – Mission Impossible

  • Small Queries & High Parallelism

  • Other queries, negative scaling

Did not apply T2301, or disallow page locks


Big q plan cost vs actual
Big Q: Plan Cost vs Actual

memory affects Hash IO onset

Plan Cost reduction from DOP1 to 16/32

Q1 28%

Q9 44%

Q18 70%

Q21 20%

Plan Cost @ 10GB

Plan Cost says scaling is poor except for Q18,

Q18 & Q 21 > 3X Q1, Q9

Actual Query time

In seconds

Plan Cost is poor indicator of true parallelism scaling


Big query speed up and cpu
Big Query: Speed Up and CPU

Holy Grail

Q13 has slightly better than perfect scaling?

In general, excellent scaling to DOP 8-24, weak afterwards

Speed up relative to DOP 1

CPU time

In seconds


Super scaling
Super Scaling

  • Suppose at DOP 1, a query runs for 100 seconds, with one CPU fully pegged

    • CPU time = 100 sec, elapse time = 100 sec

  • What is best case for DOP 2?

    • Assuming nearly zero Repartition Threads cost

    • CPU time = 100 sec, elapsed time = 50?

Super Scaling: CPU time decreases going from Non-Parallel to Parallel plan!

No, I have not started drinking, yet


Super scaling1
Super Scaling

CPU normalized to DOP 1

CPU-sec goes down from DOP 1 to 2 and higher (typically 8)

3.5X speedup from DOP 1 to 2 (Normalized to DOP 1)

Speed up relative to DOP 1


Cpu and query time in seconds
CPU and Query time in seconds

CPU time

Query time


Super scaling summary
Super Scaling Summary

  • Most probable cause

    • Bitmap Operator in Parallel Plan

  • Bitmap Filters are great,

    • Question for Microsoft:

    • Can I use Bitmap Filters in OLTP systems with non-parallel plans?


Small queries plan cost vs act
Small Queries – Plan Cost vs Act

Query 3 and 16 have lower plan cost than Q17, but not included

Plan Cost

Q4,6,17 great scaling to DOP 4, then weak

Negative scaling also occurs

Query time


Small queries cpu speedup
Small Queries CPU & Speedup

CPU time

What did I get for all that extra CPU?, Interpretation: sharp jump in CPU means poor scaling, disproportionate means negative scaling

Query 2 goes negative at DOP 2, Q4 is good, Q6 gets a speedup but at a CPU premium, Q17 and Q20 go negative after DOP 8

Speed up


High parallelism small queries
High Parallelism – Small Queries

  • Why? Almost No value

  • TPC-H geometric mean scoring

    • Small queries have as much impact as large

      • A linear sum would weight large queries more heavily

  • OLTP with 32, 64+ cores

    • Parallelism good if super-scaling

    • Default max degree of parallelism 0

      • Seriously bad news, especially for small Q

    • Increase cost threshold for parallelism?

Sometimes you do get lucky
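A sketch of the server-level knobs mentioned here, plus the per-workload cap available through Resource Governor in SQL Server 2008; the values are placeholders to tune, not recommendations:

-- Server-wide defaults
EXEC sp_configure 'show advanced options', 1;  RECONFIGURE;
EXEC sp_configure 'max degree of parallelism', 8;        -- instead of the default 0 (all cores)
EXEC sp_configure 'cost threshold for parallelism', 25;  -- keep small queries on serial plans
RECONFIGURE;

-- Per-workload DOP cap (a classifier function is still needed to route sessions here)
CREATE WORKLOAD GROUP small_queries WITH (MAX_DOP = 4);
ALTER RESOURCE GOVERNOR RECONFIGURE;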


Q that go negative
Q that go Negative

Query time

“Speedup”



Other queries cpu speedup
Other Queries – CPU & Speedup

CPU time

Q3 has problems beyond DOP 2

Speedup



Scaling summary
Scaling Summary

  • Some queries show excellent scaling

  • Super-scaling, better than 2X

  • Sharp CPU jump on last DOP doubling

  • Need strategy to cap DOP

    • To limit negative scaling

    • Especially for some smaller queries?

  • Other anomalies



Compression overhead overall
Compression Overhead - Overall

Query time compressed relative to uncompressed

40% overhead for compression at low DOP,

10% overhead at max DOP???

CPU time compressed relative to uncompressed




Compressed table
Compressed Table

LINEITEM – real data may be more compressible

Uncompressed: 8,749,760KB, Average Bytes per row: 149

Compressed: 4,819,592KB, Average Bytes per row: 82
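Page compression of the kind measured here is enabled per table or index; a minimal sketch for SQL Server 2008, with the table name taken from the test database:

-- Rebuild the Line Item table with page compression
ALTER TABLE dbo.LINEITEM REBUILD WITH (DATA_COMPRESSION = PAGE);

-- Estimate the space saving before committing to the rebuild
EXEC sp_estimate_data_compression_savings
     @schema_name = 'dbo', @object_name = 'LINEITEM',
     @index_id = NULL, @partition_number = NULL, @data_compression = 'PAGE';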


Partitioning

Partitioning

Orders and Line Item on Order Key


Partitioning impact overall
Partitioning Impact - Overall

Query time partitioned relative to not partitioned

CPU time partitioned relative to not partitioned





Scaling dw summary1
Scaling DW Summary

  • Massive IO bandwidth

  • Parallel options for data load, updates etc

  • Investigate Parallel Execution Plans

    • Scaling from DOP 1, 2, 4, 8, 16, 32 etc

    • Scaling with and w/o HT

  • Strategy for limiting DOP with multiple users


Fixes from microsoft needed1
Fixes from Microsoft Needed

  • Contention issues in parallel execution

    • Table scan, Nested Loops

  • Better plan cost model for scaling

    • Back-off on parallelism if gain is negligible

  • Fix throughput degradation with multiple users running big DW queries

    • Sybase and Oracle, Throughput is close to Power or better





Q1 plan
Q1 Plan

Non-Parallel

Parallel plan 28% lower than scalar, IO is 70%, no parallel plan cost reduction

Parallel


Q9 product type profit measure1
Q9 Product Type Profit Measure

IO from 4 tables contribute 58% of plan cost, parallel plan is 39% lower

Non-Parallel

Parallel


Q9 non parallel plan
Q9 Non-Parallel Plan

Table/Index Scans comprise 64%, IO from 4 tables contribute 58% of plan cost

Join sequence: Supplier, (Part, PartSupp), Line Item, Orders


Q9 parallel plan
Q9 Parallel Plan

Non-Parallel: (Supplier), (Part, PartSupp), Line Item, Orders

Parallel: Nation, Supplier, (Part, Line Item), Orders, PartSupp


Q9 non parallel plan details
Q9 Non-Parallel Plan details

Table Scans comprise 64%,

IO from 4 tables contribute 58% of plan cost



Q13

Why does Q13 have perfect scaling?


Q18 large volume customer1
Q18 Large Volume Customer

Non-Parallel

Parallel


Q18 graphical plan
Q18 Graphical Plan

Non-Parallel Plan: 66% of cost in Hash Match, reduced to 5% in Parallel Plan


Q18 plan details
Q18 Plan Details

Non-Parallel

Parallel

Non-Parallel Plan Hash Match cost is 1245 IO, 494.6 CPU

DOP 16/32: size is below IO threshold, CPU reduced by >10X


Q21 suppliers who kept orders waiting1
Q21 Suppliers Who Kept Orders Waiting

Parallel

Non-Parallel

Note 3 references to Line Item


Q21 non parallel plan
Q21 Non-Parallel Plan

[Plan diagram with the three hash joins labeled H1, H2, H3]



Q21

  • 3 full Line Item clustered index scans

  • Plan cost is approx 3X Q1, single “scan”



Q7 volume shipping1
Q7 Volume Shipping

Non-Parallel

Parallel


Q7 non parallel plan
Q7 Non-Parallel Plan

Join sequence: Nation, Customer, Orders, Line Item


Q7 parallel plan
Q7 Parallel Plan

Join sequence: Nation, Customer, Orders, Line Item


Q8 national market share1
Q8 National Market Share

Non-Parallel

Parallel


Q8 non parallel plan
Q8 Non-Parallel Plan

Join sequence: Part, Line Item, Orders, Customer


Q8 parallel plan
Q8 Parallel Plan

Join sequence: Part, Line Item, Orders, Customer


Q11 important stock identification1
Q11 Important Stock Identification

Parallel

Non-Parallel


Q11

Join sequence: A) Nation, Supplier, PartSupp, B) Nation, Supplier, PartSupp


Q11

Join sequence: A) Nation, Supplier, PartSupp, B) Nation, Supplier, PartSupp



Query 2 minimum cost supplier1
Query 2 Minimum Cost Supplier

Wordy, but only touches the small tables, second lowest plan cost (Q15)


Q2

Clustered Index Scan on Part and PartSupp have highest cost (48%+42%)


Q2

PartSupp is now Index Scan + Key Lookup


Q6 forecasting revenue change1
Q6 Forecasting Revenue Change

Not sure why this blows CPU

Scalar values are pre-computed, pre-converted


Q20?

Date functions are usually written as

because Line Item date columns are “date” type

CAST helps DOP 1 plan, but get bad plan for parallel

This query may get a poor execution plan




Q20 alternate parallel
Q20 alternate - parallel

Statistics estimation error here

Penalty for mistake

applied here





Q12 random io1
Q12 Random IO?

Will this generate random IO?


Query 12 plans
Query 12 Plans

Non-Parallel

Parallel




Q17

Table Spool is concern


Q17

the usual suspects






Speedup from DOP 1 query time

CPU relative to DOP 1

