SQL Server 2012 Data Warehousing Deep Dive

Learn about various topics related to data warehousing, including bitmap filtered hash joins, table partitioning, filtered indexes, indexed views, data compression, window functions, and columnstore indexes.


Presentation Transcript


  1. SQL Server 2012 Data Warehousing Deep Dive Dejan Sarka, SolidQ dsarka@solidq.com

  2. Agenda • DW Problems • Bitmap Filtered Hash Joins • Table Partitioning • Filtered Indexes • Indexed Views • Data Compression • Window Functions • Columnstore Indexes

  3. Algorithms Complexity Forever* = about 40 billion billion years!

  4. SSAS Dimensional Addressing Axis(1) Axis(1).Position(3) Axis(1).Position(1).Members(2) Every cell has an address

  5. SSAS Tabular Problems • SSAS address space: m^n cells • Maximum number of possible combinations: 200 * 5000 * 1095 = 1,095,000,000 • SSAS address space grows exponentially! • Can run out of address space – limited scalability

  6. RDBMS Joins • Merge: complexity ~ O(n) • Needs sorted inputs, equijoin • Hash: complexity ~ O(n) / ~ O(n²) • Needs equijoin • Nested Loops: complexity ~ O(n) (indexed), ~ O(n²) (not indexed) • Works always, can become quadratic • Non-equijoins are frequently quadratic • E.g., running totals
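
The running-totals case can be sketched as a self-join with a non-equijoin predicate. This is a minimal illustration, assuming the Sales.EmpOrders table used on the Window Functions slide (columns empid, ordermonth, qty) with one row per employee and month. Each row is matched with all earlier rows of the same employee, which is where the quadratic growth comes from.

```sql
-- Running total per employee via a non-equijoin self-join.
-- Assumption: (empid, ordermonth) is unique in Sales.EmpOrders.
SELECT o1.empid,
       o1.ordermonth,
       o1.qty,
       SUM(o2.qty) AS runqty
FROM Sales.EmpOrders AS o1
JOIN Sales.EmpOrders AS o2
  ON  o2.empid = o1.empid
  AND o2.ordermonth <= o1.ordermonth   -- non-equijoin: every earlier month matches
GROUP BY o1.empid, o1.ordermonth, o1.qty
ORDER BY o1.empid, o1.ordermonth;
```

For an employee with n months of data, the join produces about n(n+1)/2 intermediate rows before aggregation.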

  7. Linearize Joins

  8. Bitmap Filtered Star Joins • Optimized bitmap filtering for star schema joins • Bitmap representation of a set of values from a dim table to pre-filter rows to join from a fact table • Enables filtering rows early in the plan, allowing subsequent operators to operate on fewer rows
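
A query shape that can qualify for bitmap filtering might look like the sketch below. The tables dbo.FactSalesStarDemo and dbo.DimDateStarDemo are hypothetical and exist only for this illustration. Whether the optimizer actually adds a bitmap filter depends on table sizes, statistics, and whether the plan is parallel, so treat this as a pattern to test rather than a guarantee.

```sql
-- Hypothetical star schema fragment (names are assumptions for this sketch).
CREATE TABLE dbo.DimDateStarDemo
(
  DateKey      int      NOT NULL PRIMARY KEY,
  CalendarYear smallint NOT NULL
);

CREATE TABLE dbo.FactSalesStarDemo
(
  DateKey     int   NOT NULL,
  ProductKey  int   NOT NULL,
  SalesAmount money NOT NULL
);

-- Selective filter on the dimension, equijoin to the fact table:
-- the dimension side builds the hash table (and possibly a bitmap),
-- and the fact table scan can discard non-matching rows early.
SELECT d.CalendarYear,
       SUM(f.SalesAmount) AS TotalSales
FROM dbo.FactSalesStarDemo AS f
JOIN dbo.DimDateStarDemo  AS d
  ON d.DateKey = f.DateKey
WHERE d.CalendarYear = 2011
GROUP BY d.CalendarYear;
```

In the actual execution plan, a Bitmap operator on the dimension branch and a PROBE predicate on the fact table scan indicate that the filter was applied.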

  9. Bloom Filter (1)* • Bloom filter is a bit array of m bits • Start with all bits set to 0 • k different hash functions defined • Each of which maps some set element to one of the m positions with a uniform random distribution • To add an element, feed it to each of the k hash functions to get k array positions • Set the bits at all these positions to 1 Source: Wikipedia

  10. Bloom Filter (2) • To test whether an element is in the set, feed it to each of the k hash functions to get k array positions • If any of the bits at these positions is 0, the element is not in the set • If all are 1, then either the element is in the set, or the bits have been set to 1 during the insertion of other elements
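
The last bullet describes a false positive. Under the usual assumption of independent, uniformly distributed hash functions, its probability can be approximated as follows. Here n, the number of inserted elements, is a symbol introduced for this note and does not appear on the slides.

```latex
p \approx \left(1 - e^{-kn/m}\right)^{k},
\qquad
k_{\text{opt}} = \frac{m}{n}\,\ln 2
```

For example, with m = 10n bits and k = 7 hash functions, p is roughly 0.8%. False negatives never occur, which is why a bitmap built this way is safe to use as a pre-filter.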

  11. Table Partitioning • Partition function • Partition scheme • Aligned indexes • Partition elimination • Partition switching
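
The five concepts above can be sketched together in one short script. All object names (pfOrderYear, psOrderYear, dbo.FactOrdersPart, dbo.FactOrdersStage) are hypothetical, and every partition is mapped to the PRIMARY filegroup to keep the example small.

```sql
-- Partition function: RANGE RIGHT gives 4 partitions:
-- < 2010, [2010, 2011), [2011, 2012), >= 2012
CREATE PARTITION FUNCTION pfOrderYear (int)
AS RANGE RIGHT FOR VALUES (2010, 2011, 2012);

-- Partition scheme: maps partitions to filegroups
CREATE PARTITION SCHEME psOrderYear
AS PARTITION pfOrderYear ALL TO ([PRIMARY]);

CREATE TABLE dbo.FactOrdersPart
(
  OrderId   int   NOT NULL,
  OrderYear int   NOT NULL,
  Amount    money NOT NULL,
  CONSTRAINT PK_FactOrdersPart PRIMARY KEY CLUSTERED (OrderYear, OrderId)
) ON psOrderYear (OrderYear);

-- Aligned index: created on the same scheme and partitioning column
CREATE NONCLUSTERED INDEX IX_FactOrdersPart_Amount
ON dbo.FactOrdersPart (Amount)
ON psOrderYear (OrderYear);

-- Staging table: same structure and filegroup, plus a CHECK constraint
-- proving that all rows belong to the target partition
CREATE TABLE dbo.FactOrdersStage
(
  OrderId   int   NOT NULL,
  OrderYear int   NOT NULL,
  Amount    money NOT NULL,
  CONSTRAINT PK_FactOrdersStage PRIMARY KEY CLUSTERED (OrderYear, OrderId),
  CONSTRAINT CK_FactOrdersStage_Year CHECK (OrderYear >= 2012)
) ON [PRIMARY];

CREATE NONCLUSTERED INDEX IX_FactOrdersStage_Amount
ON dbo.FactOrdersStage (Amount);

INSERT INTO dbo.FactOrdersStage (OrderId, OrderYear, Amount)
VALUES (1, 2012, 100.00), (2, 2012, 250.00);

-- Partition switching: a metadata-only operation
ALTER TABLE dbo.FactOrdersStage
SWITCH TO dbo.FactOrdersPart PARTITION 4;

-- Partition elimination: only the partition holding 2012 needs to be read
SELECT SUM(Amount) AS Total2012
FROM dbo.FactOrdersPart
WHERE OrderYear = 2012;
```

The switch succeeds only when the target partition is empty and the staging table matches the partitioned table in columns, indexes, and filegroup.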

  12. Filtered Indexes • Where clause in the Create Index statement • Small B-trees on subset of data only • Useful when some values are selective, while others dense • Index on selective values only
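
A minimal sketch of the idea, using a hypothetical dbo.OrdersFilteredDemo table in which unshipped orders are assumed to be rare compared with shipped ones:

```sql
-- Hypothetical table: most rows have a ShippedDate, few are still NULL.
CREATE TABLE dbo.OrdersFilteredDemo
(
  OrderId     int  NOT NULL PRIMARY KEY,
  CustomerId  int  NOT NULL,
  ShippedDate date NULL
);

-- The WHERE clause restricts the B-tree to the selective subset.
CREATE NONCLUSTERED INDEX IX_OrdersFilteredDemo_Unshipped
ON dbo.OrdersFilteredDemo (CustomerId)
WHERE ShippedDate IS NULL;

-- A query whose predicate matches the filter can use the small index.
SELECT OrderId, CustomerId
FROM dbo.OrdersFilteredDemo
WHERE ShippedDate IS NULL
  AND CustomerId = 42;
```

The index stays small because shipped orders, the dense part of the data, are never added to it.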

  13. Indexed Views • Useful for queries that aggregate data • Can also reduce number of joins • Depending on edition of SQL Server can be used automatically • No need to change reporting queries • Many limitations
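
An aggregating indexed view might be defined as in the sketch below. The table dbo.SalesIVDemo and the view name are hypothetical. GO is the batch separator used by SSMS and sqlcmd, and the required SET options (such as ANSI_NULLS and QUOTED_IDENTIFIER ON) are assumed to be at their usual client defaults.

```sql
CREATE TABLE dbo.SalesIVDemo
(
  SaleId    int   NOT NULL PRIMARY KEY,
  ProductId int   NOT NULL,
  Amount    money NOT NULL      -- SUM in an indexed view needs a non-nullable expression
);
GO

-- SCHEMABINDING and COUNT_BIG(*) are required for an indexed view with GROUP BY.
CREATE VIEW dbo.vSalesByProductDemo
WITH SCHEMABINDING
AS
SELECT ProductId,
       SUM(Amount)  AS TotalAmount,
       COUNT_BIG(*) AS RowCnt
FROM dbo.SalesIVDemo
GROUP BY ProductId;
GO

-- The unique clustered index materializes the view.
CREATE UNIQUE CLUSTERED INDEX IX_vSalesByProductDemo
ON dbo.vSalesByProductDemo (ProductId);
GO

-- On editions without automatic matching, reference the view with NOEXPAND.
SELECT ProductId, TotalAmount
FROM dbo.vSalesByProductDemo WITH (NOEXPAND);
```

Where automatic matching is available, a query that aggregates dbo.SalesIVDemo by ProductId can be answered from the view without being rewritten.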

  14. Data Compression • Pre-SQL 2005: variable-length data types • SQL 2005: vardecimal • SQL 2008 • Row compression • Page compression • SQL 2008 R2 • Unicode compression

  15. SQL 2008 Compression • Row compression • Fixed-width data type values stored in variable format • Page compression • Prefix compression • Dictionary compression

  16. Unicode Compression • Works on nchar(n) and nvarchar(n) • Automatically with row or page compression • Savings depends on language • Up to 50% in English, German • Only 15% in Japanese • Very low performance penalty
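
Row and page compression are enabled per table, index, or partition. The sketch below uses a hypothetical dbo.FactSalesCompDemo table, first estimating the savings and then rebuilding with page compression. It assumes an edition that supports data compression.

```sql
CREATE TABLE dbo.FactSalesCompDemo
(
  SalesKey     int           NOT NULL PRIMARY KEY,
  CustomerName nvarchar(100) NOT NULL,   -- nvarchar(n): eligible for Unicode compression
  Quantity     int           NOT NULL,
  SalesAmount  money         NOT NULL
);

-- Estimate the effect before rebuilding anything.
EXEC sys.sp_estimate_data_compression_savings
     @schema_name      = N'dbo',
     @object_name      = N'FactSalesCompDemo',
     @index_id         = NULL,
     @partition_number = NULL,
     @data_compression = N'PAGE';

-- Page compression includes row compression, then adds prefix and dictionary compression.
ALTER TABLE dbo.FactSalesCompDemo
REBUILD WITH (DATA_COMPRESSION = PAGE);
```

Unicode compression needs no separate setting. It is applied to nchar(n) and nvarchar(n) columns whenever row or page compression is on.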

  17. Window Functions • Functions operating on a window (set) of rows defined by an OVER clause • Types of functions: • Ranking • Aggregate • Distribution
      SELECT empid, ordermonth, qty,
        SUM(qty) OVER(PARTITION BY empid
                      ORDER BY ordermonth
                      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS runqty
      FROM Sales.EmpOrders;

  18. Window Functions in SQL Server • SQL Server 2005: • Ranking calculations • Aggregates with only window partitioning • SQL Server 2012: • Aggregates with also window ordering and framing • Offset functions: LAG, LEAD, FIRST_VALUE, LAST_VALUE • Distribution functions: PERCENT_RANK, CUME_DIST, PERCENTILE_CONT, PERCENTILE_DISC
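
The SQL Server 2012 additions can be tried against the same Sales.EmpOrders table as on the previous slide. This is an illustrative sketch, and the column aliases are made up for the example.

```sql
SELECT empid,
       ordermonth,
       qty,
       -- Offset functions: look at neighboring rows without a self-join
       LAG(qty)  OVER (PARTITION BY empid ORDER BY ordermonth) AS prevqty,
       LEAD(qty) OVER (PARTITION BY empid ORDER BY ordermonth) AS nextqty,
       FIRST_VALUE(qty) OVER (PARTITION BY empid
                              ORDER BY ordermonth
                              ROWS BETWEEN UNBOUNDED PRECEDING
                                       AND CURRENT ROW)        AS firstqty,
       -- Distribution functions: relative position of qty within the employee's rows
       PERCENT_RANK() OVER (PARTITION BY empid ORDER BY qty)   AS pctrank,
       CUME_DIST()    OVER (PARTITION BY empid ORDER BY qty)   AS cumedist
FROM Sales.EmpOrders
ORDER BY empid, ordermonth;
```

LAG and LEAD return NULL on the first and last row of each partition unless a default value is supplied as the third argument.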

  19. SQL Server DW / OLAP Offerings VertiPaq • Personal and team level • PowerPivot for Excel (client) • PowerPivot for SharePoint (server) • Corporate level • SQL Server • SSAS Tabular • SSAS Dimensional • Fast Track Data Warehouse • Parallel Data Warehouse

  20. Trans-Relational Model • Not “beyond” relational • Transformation between logical and physical layer • Steve Tarin, Required Technologies Inc. (1999) • All columns stored in sorted order • All joins become merge joins • Can condense storage • Of course, updates suffer • Logically, this is a pure relational model • SQL Server uses own variant • Order of columns not preserved – optimized for compression • Leverages parallel hash joins rather than merge joins

  21. Columnar Storage (1)

  22. Columnar Storage (2)

  23. Row Reconstruction Table

  24. SQL Server Solution (1)* • Converting rows to column segments Source: SQL Server Column Store Indexes by Per-Åke Larson, et al., Microsoft, SIGMOD’11, June 12–16, 2011

  25. SQL Server Solution (2) • Storing column segments as BLOBs • Leverages existing BLOB storage • Additional segment metadata • Multiple compression algorithms

  26. Columnstore Compression • Encoding values to 32-bit or 64-bit integer • Dictionary-based encoding • Value-based (prefix) encoding • Optimal row ordering with VertiPaq™ algorithm to rearrange rows • Optimal ordering for Run-Length Encoding (RLE) for best overall compression • Compression • RLE: data stored as <value, count> pairs • Bit-pack: use the minimum number of bits for a value
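
The <value, count> idea behind RLE can be illustrated in T-SQL itself. This sketch only mimics the encoding on a tiny table variable and is not how the storage engine implements it. The names @col, pos, and val are made up, and pos is assumed to hold consecutive integers that represent row order within a segment.

```sql
DECLARE @col TABLE (pos int NOT NULL PRIMARY KEY, val varchar(10) NOT NULL);

INSERT INTO @col (pos, val)
VALUES (1, 'A'), (2, 'A'), (3, 'A'), (4, 'B'), (5, 'B'), (6, 'A');

-- Rows of the same value that are adjacent share the same (pos - row number) difference,
-- so grouping by that difference yields one output row per run.
WITH runs AS
(
  SELECT pos,
         val,
         pos - ROW_NUMBER() OVER (PARTITION BY val ORDER BY pos) AS grp
  FROM @col
)
SELECT val,
       COUNT(*) AS run_length,
       MIN(pos) AS first_pos
FROM runs
GROUP BY val, grp
ORDER BY first_pos;
```

The six input rows collapse to three pairs: (A, 3), (B, 2), (A, 1). Had the rows been reordered so that all A values were adjacent, two pairs would suffice, which is why row ordering matters for RLE.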

  27. Result: Reduced I/O • Fetches only needed columns from disk • Columns are compressed • Less IO • Better buffer hit rates • Example query: SELECT region, sum (sales) … [Figure: columns C1 to C6]

  28. Result: Reading Segments • Column segment contains values from one column for a set of about 1M rows • Column segment is unit of transfer from disk • Storage engine can eliminate segments early in the process • Because of additional column segment metadata [Figure: columns C1 to C6 split into column segments, each covering a set of about 1M rows]
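
The segment metadata mentioned above is exposed through the sys.column_store_segments catalog view. The query below is a sketch for inspecting it in the current database, and it returns rows only if a columnstore index exists.

```sql
SELECT OBJECT_NAME(i.object_id) AS table_name,
       i.name                   AS index_name,
       s.column_id,
       s.segment_id,
       s.row_count,
       s.min_data_id,            -- minimum and maximum encoded value in the segment:
       s.max_data_id,            -- the basis for skipping segments early
       s.on_disk_size
FROM sys.column_store_segments AS s
JOIN sys.partitions AS p
  ON p.hobt_id = s.hobt_id
JOIN sys.indexes AS i
  ON  i.object_id = p.object_id
  AND i.index_id  = p.index_id
WHERE i.type_desc = N'NONCLUSTERED COLUMNSTORE'
ORDER BY table_name, s.column_id, s.segment_id;
```

A segment whose minimum-to-maximum range cannot satisfy a query predicate does not need to be read at all.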

  29. Reducing CPU Usage • Columnstore indexes reduce disk IO • Bitmap-filtered hash joins can be executed in parallel • Problem: CPU becomes a bottleneck • Solution: reduce CPU usage by processing large numbers of rows • Iterators that do not process row-at-a-time • Process batch-at-a-time

  30. Batch Processing • Orthogonal to columnstore indices • Can support other storage • However, best results with columnstore indices • Sometimes can perform batch operations directly on compressed data • Can mix batch and row operators • Can dynamically switch from batch to row mode

  31. Batch Operators • The following operators support batch mode processing: • Filter • Project • Scan • Local hash (partial) aggregation • Hash inner join • Batch hash table build Source: http://social.technet.microsoft.com/wiki/contents/articles/sql-server-columnstore-index-faq.aspx#Batch_mode_processing

  32. Columnstore Indexes Constraints • Base table must be clustered B-tree or heap • Columnstore index: • Nonclustered • One per table • Must be partition-aligned • Not allowed on indexed view • Can’t be a filtered index
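
Within these constraints, creating a columnstore index in SQL Server 2012 might look like the sketch below. The table dbo.FactSalesCSDemo and the index name are hypothetical, and the example assumes an edition that supports columnstore indexes.

```sql
-- Base table: clustered B-tree (via the primary key); a heap would also work.
CREATE TABLE dbo.FactSalesCSDemo
(
  SalesKey    int   NOT NULL PRIMARY KEY,
  DateKey     int   NOT NULL,
  ProductKey  int   NOT NULL,
  Quantity    int   NOT NULL,
  SalesAmount money NOT NULL
);

-- One nonclustered columnstore index per table; typically it lists
-- every column that reporting queries touch.
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_FactSalesCSDemo
ON dbo.FactSalesCSDemo (SalesKey, DateKey, ProductKey, Quantity, SalesAmount);
```

The column order in the list does not matter for query performance, because each column is stored in its own segments.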

  33. Data Type Restrictions • Unsupported types • Decimal > 18 digits • Binary • BLOB • (n)varchar(max) • Uniqueidentifier • Date/time types > 8 bytes • CLR

  34. Query Performance Restrictions • Outer joins • Unions • Consider modifying queries to hit “sweet spot” • Inner joins • Star joins • Aggregation

  35. Loading New Data • Columnstore index makes table read-only • Partition switching allowed • INSERT, UPDATE, DELETE, and MERGE not allowed • Two recommended methods for loading data • Disable, update, rebuild • Partition switching
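
The first method, disable, update, rebuild, can be sketched as follows. The object names are hypothetical, and the pattern applies to the read-only nonclustered columnstore indexes of SQL Server 2012.

```sql
CREATE TABLE dbo.FactLoadDemo
(
  SalesKey    int   NOT NULL PRIMARY KEY,
  DateKey     int   NOT NULL,
  SalesAmount money NOT NULL
);

CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_FactLoadDemo
ON dbo.FactLoadDemo (SalesKey, DateKey, SalesAmount);

-- 1. Disable the columnstore index so the table becomes writable again.
ALTER INDEX NCCI_FactLoadDemo ON dbo.FactLoadDemo DISABLE;

-- 2. Load or modify data while the index is disabled.
INSERT INTO dbo.FactLoadDemo (SalesKey, DateKey, SalesAmount)
VALUES (1, 20120101, 100.00),
       (2, 20120102, 250.00);

-- 3. Rebuild the index; queries can use it again afterwards.
ALTER INDEX NCCI_FactLoadDemo ON dbo.FactLoadDemo REBUILD;
```

The rebuild processes the whole table (or partition), so for large fact tables the second method, loading a staging table, indexing it, and switching it in as a partition, usually fits better.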

  36. Columnstore Indexes Usage • Use when: • Read-mostly workload • Most updates are appending new data • Workflow permits partitioning or index drop/rebuild • Queries often scan & aggregate lots of data • Use on fact tables (and large dimension tables) • Do not use when: • Frequent updates • Partition switching or rebuilding index doesn’t fit workflow • Frequent small lookup queries • VertiPaq cannot handle your data model

  37. Review • DW Problems • Bitmap Filtered Hash Joins • Table Partitioning • Filtered Indexes • Indexed Views • Data Compression • Window Functions • Columnstore Indexes

  38. Q & A • Questions? • Thank you for coming to this conference… • …and this presentation!

  39. References • Books: • SQL Server Books Online • Dejan Sarka, Grega Jerkič, and Matija Lah: MCTS Self-Paced Training Kit (Exam 70-463): Building Data Warehouses with Microsoft SQL Server 2012 • Courses and Seminars • SQL Server 2012 and SharePoint BI Immersion • Advanced Transact-SQL