Parallel Execution Plans - PowerPoint PPT Presentation

Parallel execution plans l.jpg
Download
1 / 48

Parallel Execution Plans. Joe Chang jchang6@yahoo.com www.qdpma.com. About Joe Chang. SQL Server Execution Plan Cost Model True cost structure by system architecture Decoding statblob (distribution statistics) SQL Clone – statistics-only database Tools

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.

Download Presentation

Parallel Execution Plans

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Parallel execution plans l.jpg

Parallel Execution Plans

Joe Chang

jchang6@yahoo.com

www.qdpma.com


About joe chang l.jpg

About Joe Chang

  • SQL Server Execution Plan Cost Model

  • True cost structure by system architecture

  • Decoding statblob (distribution statistics)

    • SQL Clone – statistics-only database

  • Tools

    • ExecStats – cross-reference index use by SQL-execution plan

    • Performance Monitoring,

    • Profiler/Trace aggregation


So you bought a 64 core box l.jpg

So you bought a 64+ core box

Now

  • Learn all about Parallel Execution

    • All guns (cores) blazing

    • Negative scaling

    • Super-scaling

    • High degree of parallelism & small SQL

    • Anomalies, execution plan changes etc

  • Compression

  • Partitioning

Yes, this can happen, how will you know

No I have not been smoking pot

How much in CPU do I pay for this?

Great management tool, what else?


Parallel execution plans4 l.jpg

Parallel Execution Plans

Reference: Adam Machanic PASS


Execution plan quickie l.jpg

Execution Plan Quickie

I/O and CPU Cost components

F4

Estimated Execution Plan

Cost is duration in seconds on some reference platform

IO Cost for scan: 1 = 10,800KB/s, 810 implies 8,748,000KB

IO in Nested Loops Join: 1 = 320/s, multiple of 0.003125


Index key lookup scan l.jpg

Index + Key Lookup - Scan

Actual CPUTime (Data in memory)

LU19191919

Scan87368727

(926.67- 323655 * 0.0001581) / 0.003125 = 280160 (86.6%)

True cross-over approx 1,400,000 rows

1 row : page

1,093,729 pages/1350 = 810.17 (8,748MB)


Index key lookup scan7 l.jpg

Index + Key Lookup - Scan

Actual CPUTime

LU 2138321

Scan18622658

(817- 280326 * 0.0001581) / 0.003125 = 247259 (88%)

8748000KB/8/1350 = 810


Actual execution plan l.jpg

Actual Execution Plan

Estimated

Actual

Note Actual Number

of Rows, Rebinds,

Rewinds

Estimated

Actual


Row count and executions l.jpg

Row Count and Executions

Outer

Inner

Source

For Loop Join inner source and Key Lookup,

Actual Num Rows = Num of Exec × Num of Rows


Parallel plans l.jpg

Parallel Plans


Parallelism operations l.jpg

Parallelism Operations

  • Distribute Streams

    • Non-parallel source, parallel destination

  • Repartition Streams

    • Parallel source and destination

  • Gather Streams

    • Destination is non-parallel


Parallel execution plans13 l.jpg

Parallel Execution Plans

Note: gold circle with double arrow, and parallelism operations


Parallel scan and index seek l.jpg

Parallel Scan (and Index Seek)

2X

IO Cost same

CPU reduce by degree of parallelism, except no reduction for DOP 16

DOP 1

DOP 2

8X

4X

IO contributes most of cost!

DOP 8

DOP 4


Parallel scan 2 l.jpg

Parallel Scan 2

DOP 16


Hash match aggregate l.jpg

Hash Match Aggregate

CPU cost only reduces

By 2X,


Parallel scan l.jpg

Parallel Scan

  • IO Cost is the same

  • CPU cost reduced in proportion to degree of parallelism, last 2X excluded?

On a weak storage system, a single thread can saturate the IO channel,

Additional threads will not increase IO (reduce IO duration).

A very powerful storage system can provide IO proportional to the number of threads. It might be nice if this was optimizer option?

The IO component can be a very large portion of the overall plan cost

Not reducing IO cost in parallel plan may inhibit generating favorable plan,

i.e., not sufficient to offset the contribution from the Parallelism operations.

A parallel execution plan is more likely on larger systems (-P to fake it?)


Actual execution plan parallel l.jpg

Actual Execution Plan - Parallel


More parallel plan details l.jpg

More Parallel Plan Details


Parallel plan actual l.jpg

Parallel Plan - Actual


Parallelism hash joins l.jpg

Parallelism – Hash Joins


Hash join cost l.jpg

Hash Join Cost

DOP 4

DOP 1

DOP 2

Search: Understanding Hash Joins

For In-memory, Grace, Recursive

DOP 8


Hash join cost23 l.jpg

Hash Join Cost

CPU Cost is linear with number of rows, outer and inner source

See BOL on Hash Joins for In-Memory, Grace, Recursive

IO Cost is zero for small intermediate data size,

beyond set point proportional to server memory(?)

IO is proportional to excess data (beyond in-memory limit)

Parallel Plan: Memory allocation is per thread!

Summary: Hash Join plan cost depends on memory if IO component is not zero, in which case is disproportionately lower with parallel plans. Does not reflect real cost?


Parallelism repartition streams l.jpg

Parallelism Repartition Streams

DOP 8

DOP 2

DOP 4


Bitmap l.jpg

Bitmap

BOL: Optimizing Data Warehouse Query Performance Through Bitmap Filtering

A bitmap filter uses a compact representation of a set of values from a table in one part of the operator tree to filter rows from a second table in another part of the tree. Essentially, the filter performs a semi-join reduction; that is, only the rows in the second table that qualify for the join to the first table are processed.

SQL Server uses the Bitmap operator to implement bitmap filtering in parallel query plans. Bitmap filtering speeds up query execution by eliminating rows with key values that cannot produce any join records before passing rows through another operator such as the Parallelism operator. A bitmap filter uses a compact representation of a set of values from a table in one part of the operator tree to filter rows from a second table in another part of the tree. By removing unnecessary rows early in the query, subsequent operators have fewer rows to work with, and the overall performance of the query improves. The optimizer determines when a bitmap is selective enough to be useful and in which operators to apply the filter. For more information, see Optimizing Data Warehouse Query Performance Through Bitmap Filtering.


Parallel execution plan summary l.jpg

Parallel Execution Plan Summary

  • Queries with high IO cost may show little plan cost reduction on parallel execution

  • Plans with high portion hash or sort cost show large parallel plan cost reduction

  • Parallel plans may be inhibited by high row count in Parallelism Repartition Streams

  • Watch out for (Parallel) Merge Joins!


Scaling theory l.jpg

Scaling Theory


Parallel execution strategy l.jpg

Parallel Execution Strategy

  • Partition work into little pieces

    • Ensures each thread has same amount

    • High overhead to coordinate

  • Partition into big pieces

    • May have uneven distribution between threads

    • Small table join to big table

      • Thread for each row from small table

  • Partitioned table options


What should scale l.jpg

What Should Scale?

3

2

2

Trivially parallelizable:

1) Split large chunk of work among threads,

2) Each thread works independently,

3) Small amount of coordination to consolidate threads


More difficult l.jpg

More Difficult?

4

3

2

3

2

Parallelizable:

1) Split large chunk of work among threads,

2) Each thread works on first stage

3) Large coordination effort between threads

4) More work

Consolidate


Partitioned tables l.jpg

Partitioned Tables

No Repartition Streams

Regular Table

Partitioned Tables

No Repartition Streams operations!


Scaling reality l.jpg

Scaling Reality

8-way Quad-Core Opteron

Windows Server 2008 R2

SQL Server 2008 SP1 + HF 27


Test queries l.jpg

Test Queries

  • TPC-H SF 10 database

    • Standard, Compressed, Partitioned (30)

  • Line Item Table SUM, 59M rows, 8.75GB

  • Orders Table 15M rows


Cpu sec l.jpg

CPU-sec

Standard

CPU-sec to SUM 1 or 2 columns in Line Item

Compressed


Speed up l.jpg

Speed Up

Standard

Compressed


Line item sum 1 column l.jpg

Line Item sum 1 column

CPU-sec

Speed up relative to DOP 1


Line item sum w group by l.jpg

Line Item Sum w/Group By

CPU-sec

Speedup


Hash join l.jpg

Hash Join

CPU-sec

Speedup


Key lookup and table scan l.jpg

Key Lookup and Table Scan

CPU-sec

1.4M rows

Speedup


Parallel execution summary l.jpg

Parallel Execution Summary

  • Contention in queries w/low cost per page

    • Simple scan,

  • High Cost per Page – improves scaling!

    • Multiple Aggregates, Hash Join, Compression

  • Table Partitioning –

    • alternative query plans

  • Loop Joins – broken at high DOP

  • Merge Join – seriously broken (parallel)


Scaling dw summary l.jpg

Scaling DW Summary

  • Massive IO bandwidth

  • Parallel options for data load, updates etc

  • Investigate Parallel Execution Plans

    • Scaling from DOP 1, 2, 4, 8, 16, 32 etc

    • Scaling with and w/o HT

  • Strategy for limiting DOP with multiple users


Fixes from microsoft needed l.jpg

Fixes from Microsoft Needed

  • Contention issues in parallel execution

    • Table scan, Nested Loops

  • Better plan cost model for scaling

    • Back-off on parallelism if gain is negligible

  • Fix throughput degradation with multiple users running big DW queries

    • Sybase and Oracle, Throughput is close to Power or better


Test systems l.jpg

Test Systems


Test systems46 l.jpg

Test Systems

  • 2-way quad-core Xeon 5430 2.66GHz

    • Windows Server 2008 R2, SQL 2008 R2

  • 8-way dual-core Opteron 2.8GHz

    • Windows Server 2008 SP1, SQL 2008 SP1

  • 8-way quad-core Opteron 2.7GHz Barcelona

    • Windows Server 2008 R2, SQL 2008 SP1

Build 2789

8-way systems were configured for AD- not good!


Test methodology l.jpg

Test Methodology

  • Boot with all processors

    • Run queries at MAXDOP 1, 2, 4, 8, etc

  • Not the same as running on 1-way, 2-way, 4-way server

  • Interpret results with caution


References l.jpg

References

  • Search Adam Machanic PASS


  • Login