what is parallel data warehouse pdw and where does it fit n.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
What is Parallel Data Warehouse (PDW) and where does it fit? PowerPoint Presentation
Download Presentation
What is Parallel Data Warehouse (PDW) and where does it fit?

Loading in 2 Seconds...

play fullscreen
1 / 50

What is Parallel Data Warehouse (PDW) and where does it fit? - PowerPoint PPT Presentation


  • 167 Views
  • Uploaded on

What is Parallel Data Warehouse (PDW) and where does it fit? . Mike Lampa Director – Business Analytic Solutions. Agenda. What is Big Data and Where does PDW Fit? PDW Architecture on Dell hardware MPP and Shared Nothing concepts Data distribution and re-distribution

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'What is Parallel Data Warehouse (PDW) and where does it fit?' - dunne


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
what is parallel data warehouse pdw and where does it fit

What is Parallel Data Warehouse (PDW) and where does it fit?

Mike Lampa

Director – Business Analytic Solutions

agenda
Agenda
  • What is Big Data and Where does PDW Fit?
  • PDW Architecture on Dell hardware
  • MPP and Shared Nothing concepts
  • Data distribution and re-distribution
  • EDW Reference Architecture
  • Data Modeling Guidelines
  • ETL Guidelines
  • Resource Skilling Considerations

Feel free to ask questions as we move along – goal is to make this presentation as interactive as possible!

Confidential

slide4

The DataExplosion

  • 988 Exabytes of information in 2010.
  • Everyday 2.5 quintillions of data is created.
  • Volume of data across enterprise doubling every 3 years
  • 80% of enterprise data is unstructured

DELL CONFIDENTIAL

storm rising in data analytics
Storm Rising in Data Analytics

Major technology developments are driving three key mega-trends driving new opportunities for the industry and our customers

Real-time BI and Analytics

Scalable Database Architectures

“Big Data”

Unstructured

NoSQL/NewSQL

Self-Service

Structured

In-Memory

Hadoop

Pig/Hive

Visualization

Columnar/MPP

Cloud BI

Socianalytics

slide6

Big Data complements Analytics (DW & BI)

Data Sources

Processing Infrastructure

Knowledge Capture

Business Value

Models and Production Analytic Applications

Shared Infrastructure

Crawlers

Sensors

Apps

Devices

Bots

Un-Structured

Exploratory Analytics

New IP Creation

BI User

BI Tools

DW

ERP

CRM

Well-defined processing

Data-enriched tools

Structured

PDW Data MgmtSvcs

IT

Professional

Domain

Specialist

App

Developer

DELL CONFIDENTIAL

slide9

What is the PDW Appliance?

Microsoft Parallel Data Warehouse

  • Microsoft software running on Dell hardware
  • High-end data warehouse, scales to 100’s of terabytes
  • Massively Parallel Processing (MPP) for high performance
  • Architected with redundancy throughout the system
  • Based on proven MS SQL Server 2008 R2 platform
  • Low cost of ownership with industry standard Dell hardware
  • Sold as an appliance with software preloaded
  • Extensive consulting and application services available
  • Microsoft and Dell representatives work together to serve the customer

Confidential

slide10

Microsoft PDW ArchitectureScales for Resilience and High Performance, with a Low Cost of Entry

Control Rack

Data Racks (up to 4)

Scale by adding data rack(s)

PowerEdge R610

Database Servers

MD3620f

Storage Nodes

1. PDW Engine

2. Admin Console

3. Metadata

4. Workspace

Control Nodes (R710)

Active / Passive

Client Drivers

Active Directory/DNS

HPC

Setup/patching

Management Servers (R610)

Data Center Monitoring

Dual Fiber Channel

Dual Infiniband

1. SSIS Instance

2. Loader Tool

3. File Staging

Landing Zone (R510)

ETL Load Interface

File store for backups

Std. SQL Backups

Full and differential

Backup Node (R710 and MD3600f w/MD1200’s)

Corporate Backup Solution

Spare Database Server

Corporate Network

Private Network

Confidential

slide11

Rack Configuration for Dell MD Appliance

Control Rack

Data Rack 1

Data Rack 2

Data Rack 3

Data Rack 4

Confidential

pdw core concepts
PDW Core Concepts
  • Distributed Relational Database
    • 10 DBMS servers per Data Rack
    • Data distributed across multiple DBMS instances
  • Massively parallel processing
    • Multiple concurrent resources resolve SQL set operations against Distributed data
      • Compute Node architecture supports 10 parallel instances of DBMS per Data Rack.
      • Each DBMS instance works in parallel on its own “distribution” of a single user query.
  • Shared nothing computing
    • Resource and data independence are maintained within each DBMS instance
      • Each Compute Node reserves its shared resources (CPU, Memory, Disk) for only its distribution of system data
    • Managed by MPP server (Control node)
    • Converting schema & metadata from shared nothing to a common logical view
  • Configured for high redundancy

Confidential

data distribution
Data Distribution
  • Distributed:A table structure with evenly distributed records across multiple shared nothing databases.
    • Distribution Key: A single column in a Distributed table that is used for hash distribution of records across multiple shared nothing databases.
  • Replicated:A table structure that exists as a full copy on each shared nothing database.
  • Ultra Shared Nothing: Design database schema with a mix of replicated and distributed tables to minimize data movement between nodes.
    • Dimensions are replicated
    • Facts are distributed
    • Redistribute rows at run time when distribution incompatibility is encountered in SQL set operation.

Confidential

ultra shared nothing example
Ultra Shared Nothing Example

TD

PD

SF

CD

MD

TD

PD

SF

CD

MD

TD

PD

SF

CD

MD

TD

PD

SF

CD

MD

Confidential

redistribution
Redistribution
  • Redistribution:The movement of data between shared nothing database instances to answer distribution incompatible SQL queries within a PDW Appliance.
    • Shuffle:A redistribution technique that leverages Inifiniband™ network to create temporary distribution compatible data sets.
    • At least one table in the query plan uses a Distribution Key in its join criteria.
    • Any table that is not joined on it’s Distribution Key is targeted for Shuffle first. Leftmost table is chosen if multiple tables meet this criteria.
    • Replication:A redistribution technique that is used to create a temporary full copy of a data set.

Confidential

smp vs mpp
SMP vs MPP

MPP

  • Dedicated resources
  • Scales to PB
  • Applicable > 20 TB
  • Built in HA and redundancy
  • High Concurrency for complex workloads

SMP

  • Shared Resources
  • Limited scaling
  • Applicable < 20TB
  • HA must be architected in

Confidential

edw information layers
EDW Information Layers

Data Rack

Landing Zone

Confidential

data flow
Data Flow

Integration Layer

Data In

Source

Stage

Replication

ETL

LRF

Base & Package Layer

Data Store

  • PDW Load Scripts
  • Load Scheduling

PDWPRD01

Base & Package

100 TB

Package Copy

for Presentation

Data Presentation Layer

Data Store

PDWPRD03

Presentation

60 TB

Dell network

Infiniband

Consumption Layer

Data out SAS, BO, MSAS, BI Tools

Confidential

enabling packages hub and spoke
Enabling Packages - Hub and Spoke
  • Physical Data Marts (Packages) May make sense from consumption perspective.
  • Primary Considerations:
    • Business Function
    • Type of BI Workload
  • Secondary Considerations:
    • User Size, Data Volumes & Performance
    • Security & Sensitivity

Confidential

pdw table geometries
PDW Table “Geometries”
  • Replicated: A table structure that exists as a full copy within each PDW Data Node
  • Distributed: A table structure that is hashed and distributed as evenly as possible across all PDW Data Nodes on the appliance
pdw table geometry example

Date Dim

Item Dim

Date Dim ID

Calendar Year

Calendar Qtr

Calendar Mo

Calendar Day

Prod Dim ID

Prod Category

Prod Sub Cat

Prod Desc

Store Dim

Store Dim ID

Store Name

Store Mgr

Store Size

PDW Table Geometry Example

PDW

Storage Nodes

Compute Nodes

Source System

DD

ID

SF

1

SD

PD

DD

ID

SF

2

SD

PD

Sales Fact

Date Dim ID

Store Dim ID

Prod Dim ID

Mktg Camp Id

Qty Sold

Dollars Sold

DD

ID

SF

3

SD

PD

Promo Dim

DD

ID

SF

4

SD

PD

Mktg Camp ID

Camp Name

Camp Mgr

Camp Start

Camp End

DD

ID

SF

5

SD

PD

Confidential

distribute or replicate
Distribute or Replicate?
  • Use distributed tables when:
    • The table is large – generally > 5GB
    • For fact/detail tables
    • Full table scans do not provide acceptable performance
  • Use replicated tables when:
    • The table is small – generally < 5GB
    • For dimension/lookup tables
    • Multiple foreign keys exist and foreign key joins are common

Confidential

partitioning distributed tables
Partitioning Distributed Tables
  • Distributed tables are already segmented by hashed distributions
  • Will further partition rows within a distribution, based on a partition function (eg: Time_Dim Quarter or Year)
  • Allows for operations efficiency when adding, loading, dropping, and switching partitions
  • Good for fast loading of an unused partition and then switching it in after loading
  • Partition for manageability
    • Typically on a date key (or integer surrogate)
    • Typically same as clustered index key
    • SWITCH partitions OUT for fast delete of history or IN to modify or add a specific historical slice

Confidential

colocation
Colocation

Tables must be designed for performance from the beginning. Performance optimization is not just a DBA thing after development is complete!

    • Colocation: Within a PDW appliance, two individual records with identical keys will always belong to the same Distribution.
    • Single GREATEST performance consideration
    • Beneficial for Join and Aggregation performance (eg Parent & Child join)
    • Distribution Compatibility
  • Choosing the right Distribution Key
    • Identify Commonly used join keys and/or aggregations
    • Choose a single column that limits skew to < 40% (High Domain Cardinality & low Distribution SKEW)
    • Distribution Key should be the first column declared for Distributed Table DDL
    • Consider Surrogate Keys when “business key” is compound

Confidential

colocation example
Colocation Example
  • Ensure that all compatible Distribution Keys are identical data types.
    • [customer.customer_id integer] = [customer_hist.customer_id integer]
    • [customer.customer_id integer] <> [customer_hist.customer_id char(10)]
  • Colocation and Distribution Key Example

Distribution Key

Confidential

handling large dimensions
Handling Large Dimensions

Large Dimensions (>5GB uncompressed)

  • Distribute/Normalize
  • Distribute:
    • If possible, distribute dimension on same key as fact surrogate key.
    • If distribution compatibility not possible, “shuffle” dimension data on the fly (at Query time)
  • Normalize:
    • Normalize large dimension into smaller tables and replicate the core dimension (more manageable replication size)
    • Look at usage pattern, if only a few columns from dimension are used in most of the major queries, separate high use columns from low use columns into separate Dim_tables. Both, have the same surrogate key.
      • Core dimension is replicated which is joined locally to fact table
      • Create a view to combine data from core and outrigger to insulate complexity from users (CAUTION: check the performance)

Confidential

multi level partitioning
Multi Level Partitioning
  • Partitioning
    • Partitioning is a method of distributing a table’s rows among a number of sub-tables (partitions).
    • Partitioning is applied within each Distribution.
  • Multi-Level Partition Support
    • Any combination of up to four total Range, Hash, Or List partition schemes.
    • Each new partition level generates new partitions at a multiple of the previous level.
    • Partition Values: [Range 4 x List 6] will generate 24 partition files per Distribution.

Confidential

benefits of partitioning
Benefits of Partitioning
  • Reduce Table Scans
    • This is the most common use case for partitioning.
    • Relies on query restrictions aligned with: Range, Hash, or List qualifiers.
    • Practice: Partition on commonly restricted fields (query based).
  • Minimize Memory Utilization
    • Join Operations
      • Reduces memory requirements per join.
      • Reduces disk spill if session or operation limits are reached.
    • Aggregation
      • Reduces memory requirements to build result set.
      • Reduces disk spill if session or operation limits are reached.
    • Practice: Hash Partition on join key.

Confidential

multi level partitioning example
Multi Level Partitioning Example

DDL of Multi-Level Partitioned Table

CREATE TABLE member (

memberID BIGINT NOT NULL,

memberType SMALLINT NOT NULL,

lastName VARCHAR(50) NOT NULL,

activeStatus CHAR(1) NOT NULL,

salesTotal FLOAT,

lastLogin DATE NOT NULL)

WITH distribute_on (memberID),

text compressed,

IIpartition=((range on lastLogin

partition p01 values < '2007_01_01',

partition p02 values < '2007_04_01',

partition p03 values < '2007_07_01',

partition p04 values < '2007_10_01',

partition p05 values < '2008_01_01',

partition p06 values >= '2008_01_01')

SUBpartition (hash on memberID 5 partitions)

SUBpartition (list on activeStatus

PARTITION p101 VALUES ('n'),

PARTITION p102 VALUES ('a'),

PARTITION p103 VALUES (default)));

Confidential

pdw data loading design goals
PDW Data Loading Design Goals
  • Load data efficiently and non-obtrusively, respecting concurrent queries and loads
  • Reduce table fragmentation as much as possible
  • Provide system recovery capabilities in the event of data load failure that have minimal impact on concurrent queries
  • Provide multiple load/ETL options for PDW customers
data movement service dms with pdw
Data Movement Service (DMS) with PDW
  • Runs on the following nodes as a Windows service:
      • Control
      • Compute
      • Landing Zone
  • Used to quickly move data in parallel between nodes by using Infiniband network in PDW
  • Uses ADO.NET
      • Uses SqlClient namespace to select data from SQL Server
      • Uses SqlBulkCopy to insert data into Compute nodes
  • Two protocols/networks used by DMS:
    • Data transfer network to move data between nodes
    • Message network to send command and status messages to nodes from Manager
  • DMS closely interacts with the primary PDW Engine Service
  • DMS is used for both loading and querying data

Confidential

etl loading options in pdw
ETL Loading Options in PDW
  • DWLoader Utility
  • SQL Server Integration Services (SSIS)
  • CREATE TABLE AS SELECT (CTAS)
  • Standard SQL DML statements: INSERT/SELECT

Confidential

pdw distributed table load step 1
PDW Distributed Table Load – Step 1

Control Rack

Data Rack

DMS

DMS

Control Node

Compute Nodes

Storage Nodes

Converter

Converter

Sender

Sender

(1) DWLoader invoked/

SSIS

Writer

Writer

Receiver

Receiver

(4) Each row is converted for bulk insert and hashed based on the distribution column

SQL

Server

DMS Ser er

(2) Load Manager creates staging tables

PDW Engine

Infiniband

(5) Hashed row is sent to appropriate node receiver for loading

DMS

Load Manager

DMS

Manager

(3) DMS reads load data and buffers records to send to Compute Nodes round-robin

Landing Zone

(6) Row is bulk inserted into staging table

Load

File/SSIS

SSIS API

Distributor

DMS

Load Client

Confidential

pdw distributed table load step 2
PDW Distributed Table Load – Step 2

STEP 1: DWloader creates topology equivalent staging table

and moves data from LZ file into staging tables using DMS

Staging DB

Destination DB

2nd step process

DWloader uses SQL commandsto move from staging to destination tables

NOTE: distributions of a table are written in parallel when the multi-transactions option is set to true.

Confidential

data loading dwloader
Data Loading – DWloader
  • Command-line utility invoked on the Landing Zone
  • Integrated with DMS
    • Streamlines I/O and minimizes data-loading times through powerful parallel loading functionality against a single text file
    • Optimize data load speeds while maintaining a performance balance so as not to seriously degrade concurrently running queries
  • Characteristics of Dwloader
    • Accommodate initial data loads of large files over 300 GB
    • Achieve data load speeds of up to 2 TB per hour
    • Accommodate multiple and concurrent incremental loads
    • Has settings for canceling and showing status of loads
    • Input file must reside on the Landing Zone
    • Max. concurrency 10, queues up subsequent load

Confidential

data loading ssis
Data Loading – SSIS
  • The SQL Server PDW Destination is an SSIS component that lets you load data into SQL Server PDW by using an SSIS .dtsx package.
  • In the package workflow for SQL Server PDW, you can load and merge data from multiple sources and load data to multiple destinations.
  • The loads occur in parallel, both within a package and among multiple packages running concurrently
  • SQL Server 2008 R2 SSIS includes:
    • SQL Server Parallel Data Warehouse Connection Manager
    • SQL Server Parallel Data Warehouse Destination
  • Similar to dwloader, SSIS leverages DMS for parallel load operations.
  • SSIS can run either on the Landing Zone or on a server outside the PDW appliance.

Confidential

ssis and pdw data types
SSIS and PDW Data Types

When using SSIS to load data from a data source to a SQL Server PDW database:

  • Data is first mapped from the source data to SSIS data types.
  • This allows data from multiple data sources to map to a common set of data types.
  • Then the data is mapped from SSIS to SQL Server PDW data types.

Confidential

leading practices data loading
Leading Practices – Data Loading
  • Minimize page breaks (fragmentation) by designing “partition-friendly” loads.
  • If necessary, drop non-clustered indexes before loading and re-index after all loads are complete.
  • There is no benefit to sorting data before hitting the Landing Zone.

Confidential

leading practices staging databases
Leading Practices – Staging Databases
  • Historic PDW load jobs tend to be the largest. The staging database may be reduced in size for subsequent incremental loads.
  • When creating the staging database, use the following guidelines:
        • Replicated table size should be the estimated size per Compute Node of all the replicated tables that will load concurrently.
        • Distributed table size should be the estimated size per appliance of all the distributed tables that will load concurrently.
        • Log size is typically similar to the replicated table size.

Confidential

leading practices ssis
Leading Practices - SSIS
  • For good PDW loading throughput, it is important to keep a steady stream with minimal starts and stops.
  • PDW connections and queries are very costly to initiate. Use fewer to do more.
  • Data type conversion in the PDW destination adapter is very expensive. Be sure the input types match the destination types, especially for strings and decimals.
  • Consider performing data transformations after loading into the staging database (ELT instead of ETL).

Confidential

etl guidelines1
ETL Guidelines
  • Grouping: Determine the largest set of data that is distribution compatible within the query. This will break queries in multiple compatible steps.

Confidential

etl guidelines2
ETL Guidelines
  • Joining two distribution incompatible tables
    • Scenario where changing the structure of either table is not possible
    • Usually encountered while populating fact tables from underlying BASE tables
    • Create a temporary table with required columns from driver table distributed on a key which makes distribution compatibility possible
    • More controlled “Shuffle”
    • Temp table or Join output can be reused by multiple queries
    • ETL BEST PRACTICE: BREAK YOUR WORKLOAD IN MULTIPLE, MANAGEABLE DISTRIBUTION COMPATIBLE SET OF QUERIES

Confidential

skills consideration
Skills Consideration
  • Platform Skills - Moving from SQL to PDW
    • Retain much of your SQL skills (SQL Server Data Architecture & DBA, SSIS, etc)
    • Heterogeneous platform as you scale from GB to PB!
  • Design Skills:
    • MPP Data Architecture differs from SMP
      • Think in Terms of Distribution Keys vsPrimary_Key and Foreign_Key
      • Think in Terms of Distribution Compatibility vs Indexes for Performance
      • Surrogate Keys lend themselves to Distribution Keys
    • MPP ETL Architecture differs from SMP
      • More use of Load Ready Files with Upsert Logic vs Dynamic Lookup
      • Staging environment strategies simulate CDC Key Lookups
      • Surrogate Keys add complexity to CDC Lookups and Key Generation

Confidential

summary
Summary
  • PDW is an MPP appliance from Microsoft on Dell Hardware
  • Keep in mind MPP and Shared Nothing concepts while designing your EDW on PDW.
  • Traditional SMP concepts are neither sufficient nor applicable.
  • Break your workload in manageable distribution compatible chunks.
  • PDW supports both Normalized and Star schemas.
  • Consider grouping data in logical information layers.
  • Use combination of Dwloader & SSIS depending on unit-of-work
  • Retain your core technology platform skills, augment your DW design skills

Confidential