making data warehouse easy n.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
Making Data Warehouse Easy PowerPoint Presentation
Download Presentation
Making Data Warehouse Easy

Loading in 2 Seconds...

play fullscreen
1 / 23

Making Data Warehouse Easy - PowerPoint PPT Presentation


  • 103 Views
  • Uploaded on

Making Data Warehouse Easy. Conor Cunningham – Principal Architect Thomas Kejser – Principal PM. Introduction. We build and implement Data Warehouses (and the engines that run them) We also fix DWs that others build This talk covers the key patterns we use

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

Making Data Warehouse Easy


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
making data warehouse easy

Making Data Warehouse Easy

Conor Cunningham – Principal Architect

Thomas Kejser – Principal PM

introduction
Introduction

We build and implement Data Warehouses (and the engines that run them)

We also fix DWs that others build

This talk covers the key patterns we use

We will also show you how you can make your life easier with Microsoft’s SQL technologies

what world do you live in
What World do you Live in?

Hardware should be

bought when I

know the details

I can’t wait for you to

figure all that out

Do it, NOW!

I need to know my

hardware CAPEX before I

decide to invest

sketch a rough model
Sketch a Rough Model
  • Define Roughly on Business Problem
  • Decide on Dimensions
    • Dim columns can wait
  • Build Dimension/Fact Matrix
estimate storage
Estimate Storage

≈ 4B

≈ 8B

≈ 1/3 or sp_estimate_compression

why integer keys are cheaper
Why Integer Keys are Cheaper

Smaller row sizes

More rows/page= more compression

Faster to join

Faster in column stores

pick standard hw configuration
Pick Standard HW Configuration
  • Small (GB to low TB) : Business Decision Appliance
  • Medium (up to 80TB): Fast Track
  • Large (100s of TB): PDW
    • Note: Elastic scale plus for lower sizes too!
  • Careful with sizes, some are listed pre-compression
server config file layout
Server Config / File Layout

Follow FT Guidance!

You probably don’t need to do anything else

why does fast track pdw work
Why does Fast Track/PDW Work?
  • Warehouses are I/O hungry
    • GB/sec
    • This is high (in a SAN terms)
  • We did the HW testing for you
  • Guidance on data layout
implement prototype model
Implement Prototype Model
  • Design schema
  • Analyse data quality with DQS/Excel
    • Probably not what you expected to find!
  • Start with small data samples!
schema tool discussion
Schema Tool Discussion!

SSMS with Schema Designer

SQL Server Data Tools

prototyping hints
Prototyping Hints

Customer Type 1

Customer Type 2

Generate INTEGER keys out of strings keys with hash

Focus on Type 1 Dimensions

PowerPivot/Excel to show data fast

Drive conversation with end users!

prototype what users will teach you
Prototype: What users will teach you
  • They will change/refocus their mind when they see the actual data
  • You have probably forgotten some dimension data
  • You may have misestimated data sizes
schema design hints
Schema Design Hints
  • Build Star Schema
  • Beginners may want to avoid snowflakes (most of our users just use star)
  • Implement a Date Table (use INT key in YYYYMMDD format)
    • Fact.MyDate BETWEEN 20000000 AND 20009999
    • Fact.MyDate BETWEEN ‘2000-01-01’ and ‘2000-12-31’
    • YEAR(Fact.MyDate ) = 2000
  • Identity, Sequences
  • Usually you can validate PK/FK Constraints during load and avoid them in the model
  • Fact Table – fixed sized columns, declared NOT NULL (if possible)
  • For ColumnStore, data types need to be the basic ones…
why facts dimensions
Why Facts/Dimensions?
  • Optimizers have a tough job
  • Our QO generates star joins early in search
  • We look for the star join pattern to do this
    • 1 big table, dimensions joined to it…
  • Following this pattern will help you
    • Reduced compilation time
    • Better plan quality (average)
  • You can look at the plans and see whether the optimizer got the “right” shape
    • Wrong Plan  your query is non-standard OR perhaps QO messed up!
partition index the model
Partition/Index the Model
  • Partition fact by load window
  • Fact cluster/heap?
    • Cluster fact on seek key
    • Cluster fact on date column (if cardinality > partitions)
    • Leave as heap
  • Column Store index on
    • All columns of fact
    • Columns of large dim
  • Cluster the Dim on Key
if followedpattern expect
If(followedpattern) {expect …}
  • Star Join Shape
  • << get plan .bmp>>
  • Properties:
    • Usually all Hash Joins
    • Parallelism
    • Bitmaps
    • Join dimensions together, then scan Fact
    • Indexes on filtered Dim columns helpful if they are covering
the approximate plan
The Approximate Plan
  • Hash

Stream

Aggregate

Partial

Aggregate

Hash

Join

Batch

Build

Dim Seek

Dim Scan

Hash

Join

Batch

Build

Fact CSI Scan

column store plan shapes
Column Store Plan Shapes
  • For ColumnStore, it’s the same shape 
  • Minor differences
    • Batch mode (Not Row Mode)
    • Parallelism works differently
    • Converts to row mode above the star join shape
  • If you don’t get a batch mode plan, performance is likely to be much slower (usually this implies a schema design issue or a plan costing issue)
  • Partitioning Sliding Window works well with ColumnStore (especially since the table must be is readonly)
data maintenance
Data Maintenance
  • Statistics
    • Add manually on Correlated Columns
    • Update fact statistics after ETL load
    • Leave Dim to auto update
  • Rebuilding indexes?
    • Probably not needed
    • If needed, make part of ETL load
  • Switch out old partitions and drop switch target
    • Automate this
serve the data
Serve the Data
  • Self Service
    • Tabular / Dimensional Cubes
    • Excel / PowerPivot / PowerView
  • Fixed Reports
    • Reporting Services
    • PowerView
  • Don’t clean data in “serving engines”
    • Materialisepost-cleaned data as column in relational source
slide22

!

?