DAT353 Analysis Service: Server Internals

Presentation Transcript


  1. DAT353 Analysis Service: Server Internals Tom Conlon, Program Manager, SQL Server Business Intelligence Unit, Microsoft Corporation

  2. Purpose of this Session • Remove some of the mystery • Explain how it is that we do some things so much better than our competitors • Things are easier to understand when the internals are understood • Requirements: • You already know the basics – this is for the experienced

  3. Agenda • Architecture Review • Aggregations • Data and Dimension Storage • Processing • Querying • Server Memory Management • Distinct Count

  4. Architecture – Single Server [diagram] An Application uses ADO MD and OLE DB for OLAP through the PivotTable Service to query the Analysis Server (OLAP Store); Analysis Manager administers the server through DSO; processing loads the OLAP Store from the SQL Server Data Warehouse and other OLE DB providers over OLE DB.

  5. Component Architecture – Query [diagram] The server (MSMDSRV.EXE) and the client provider (MSOLAP80.DLL) each contain a Metadata Manager, a Formula Engine and a Cache; the server also hosts the Storage Engine and an Agent. MDX queries flow from the client components to the server.

  6. Component Architecture – Management [diagram] The query picture plus the management components MSMDGD80.DLL and MSMDCB80.DLL (DCube, with a DCube parser): DDL is sent to the server alongside the MDX query path between MSOLAP80.DLL and MSMDSRV.EXE.

  7. Component Architecture – Distributed [diagram] The same components as the management picture, plus a second MSMDSRV.EXE instance with its own Server Storage Engine, Metadata Manager and Cache for distributed operation.

  8. Agenda • Architecture Review • Aggregations • Data and Dimension Storage • Processing • Querying • Server Memory Management • Distinct Count

  9. Why Aggregations? • Aggregations can result in orders of magnitude improvement in performance • Don’t have to access every fact table record to determine query result • Further savings with data compression • Biggest savings: reduce disk scan

  10. Aggregations – Overview [diagram] Two dimension hierarchies: Customers (All Customers > Country > State > City > Name) and Product (All Products > Category > Brand Name > SKU). Aggregations range from the highest-level aggregation at the top, through intermediate aggregations, down to the facts.

  11. Partial Aggregation • Don’t want to create all possible aggregations • Data explosion! • What if a query is made to a combination of levels where no aggregation exists? • Can compute from lower level aggregations • Don’t need to compute every possible aggregation • Example (see the sketch below): queries including a combination of Country and Brand can be answered if the aggregation Country by Name exists (Customers: All Customers > Country > State > City > Name; Product: All Products > Category > Brand Name > SKU)
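
A minimal sketch of the idea above, using hypothetical member names, sales figures, and a name-to-brand mapping: a query at Country by Brand is answered by rolling up the finer Country by Name aggregation instead of rescanning the fact table.

```python
# Illustrative sketch only (not the actual server code): answering a
# Country-by-Brand query from an existing Country-by-Name aggregation.
from collections import defaultdict

# Existing aggregation: (country, product_name) -> sales (hypothetical figures)
country_by_name_agg = {
    ("USA", "Laptop 15"): 1200,
    ("USA", "Laptop 17"): 800,
    ("Canada", "Laptop 15"): 300,
    ("Canada", "Inkjet 100"): 150,
}

# Product hierarchy: name -> brand, one level up (hypothetical)
name_to_brand = {
    "Laptop 15": "Contoso",
    "Laptop 17": "Contoso",
    "Inkjet 100": "Fabrikam",
}

def roll_up_to_brand(agg):
    """Derive the coarser Country-by-Brand cells by summing the finer cells."""
    result = defaultdict(int)
    for (country, name), sales in agg.items():
        result[(country, name_to_brand[name])] += sales
    return dict(result)

print(roll_up_to_brand(country_by_name_agg))
# {('USA', 'Contoso'): 2000, ('Canada', 'Contoso'): 300, ('Canada', 'Fabrikam'): 150}
```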

  12. Aggregations [diagram] A pyramid with the fact table at the bottom, the most detailed aggregations (m,m,m,…) above it, and the highest level of aggregation (1,1,1,1,…) at the top, which answers “Show me all sales for all products for all …”

  13. Partial Aggregation [diagram] The same pyramid – highest level of aggregation, most detailed aggregations, fact table, with “Show me all sales for all products for all …” at the top – illustrating partial aggregation, where only a subset of the possible aggregations is built.

  14. Aggregation Design [diagram] The fact table with candidate aggregations at combinations of levels such as Product Family by Quarter, Product Family by Month, Product by Quarter, and Product by Month.

  15. Aggregation Design Results • Result: aggregations are designed in waves from the top of the pyramid down toward the fact table • At 100% aggregation the ‘waves’ all touch: overkill • 20–30% is generally adequate (0% for the smaller cubes)

  16. Aggregation Design • Which aggregations are more important than others? • All are equal • From a design perspective, select the ones that result in overall improved query performance • Usage-Based Optimization: weightings on each aggregation based on usage frequency

  17. Flexible and Rigid Aggregations • Changing dimensions allow members to be moved, added and deleted • After members move, only an incremental process of the dimension is required • ‘Flexible’ aggregations are deleted when a changing dimension is incrementally processed • ‘Rigid’ aggregations remain valid • [diagram] When member X is moved from a child of A to a child of C, all aggregations involving A or C are invalidated

  18. Aggregation Data Storage • No impact on fact data or rigid aggregation data when a changing dimension is incrementally processed • Flexible aggregations are invalidated when a changing dimension is incrementally processed • Data is in three files: partitionName.fact.data, partitionName.agg.rigid.data, [partitionName.agg.flex.data] • [diagram] Aggregations including the (All) level are rigid (if all other levels in the agg are rigid); aggregations including the leaf level (X) are rigid (if all other levels in the agg are rigid); aggregations including an intermediate level (A, B, C) of a changing dimension are always flexible • See the classification sketch below
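
A minimal sketch of that classification rule, with hypothetical dimension and level names: an aggregation becomes flexible as soon as it uses an intermediate level of a changing dimension, while aggregations that touch only the (All) or leaf level of changing dimensions stay rigid.

```python
# Illustrative sketch (not the server's actual rules): classify an aggregation
# as rigid or flexible from the levels it uses. Names are hypothetical.

# Levels of each changing dimension, ordered from (All) down to the leaf.
CHANGING_DIM_LEVELS = {
    "Customers": ["(All)", "Country", "State", "City", "Name"],
}

def is_flexible(aggregation_levels):
    """aggregation_levels maps dimension name -> level used in this aggregation."""
    for dim, level in aggregation_levels.items():
        levels = CHANGING_DIM_LEVELS.get(dim)
        if levels is None:
            continue  # not a changing dimension: never forces flexibility
        if level not in ("(All)", levels[-1]):
            return True  # intermediate level of a changing dimension
    return False

print(is_flexible({"Customers": "State"}))   # True  -> flexible aggregation
print(is_flexible({"Customers": "Name"}))    # False -> rigid
print(is_flexible({"Customers": "(All)"}))   # False -> rigid
```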

  19. Incremental Dimension Processing (Changing Dimension) • Potential resource competition during lazy processing after a changing dimension is incrementally processed • Fewer aggregations! • Result: query performance degradation • [diagram] Steps: query and process the dimension data (keys, member names, member properties); then, for each cube using this dimension, delete flexible aggregations and indexes, start lazy aggregating, and start lazy indexing

  20. Demo: Flexible Aggregations

  21. Agenda • Architecture Review • Aggregations • Data and Dimension Storage • Processing • Querying • Server Memory Management • Distinct Count

  22. Data Storage • No data stored for empty member combinations • With compression – data storage is approximately 1/3 of the space required in the RDBMS source • Data is stored by record in pages • Each record contains all measures at an intersection of dimension members • [diagram] A page holds Record 1 through Record 256, each of the form: mbr_d1, mbr_d2, … mbr_dn, m1, m2, … mn (one member per dimension, followed by the measures)

  23. Data Structures • Partition data is stored in a file divided into segments • Each segment contains 256 pages (each with 256 records) = 64K records • Only the last segment has fewer than 256 pages • [diagram] Data file → Segment 1, Segment 2, … Segment n; each segment → Page 1, Page 2, … Page 256 • See the addressing sketch below
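
A minimal sketch of what that fixed layout implies, assuming records are addressed by their ordinal position in the partition file (the function and constant names are mine):

```python
# Illustrative sketch (not the actual on-disk format): map a record's ordinal
# position in a partition file to (segment, page, slot) under the
# 256 records/page x 256 pages/segment layout described above.

RECORDS_PER_PAGE = 256
PAGES_PER_SEGMENT = 256
RECORDS_PER_SEGMENT = RECORDS_PER_PAGE * PAGES_PER_SEGMENT  # 65,536 = 64K

def locate(record_ordinal):
    """Return 0-based (segment, page within segment, slot within page)."""
    segment, within_segment = divmod(record_ordinal, RECORDS_PER_SEGMENT)
    page, slot = divmod(within_segment, RECORDS_PER_PAGE)
    return segment, page, slot

print(locate(0))        # (0, 0, 0)     -> first record of the first page
print(locate(65_535))   # (0, 255, 255) -> last record of segment 0
print(locate(65_536))   # (1, 0, 0)     -> first record of segment 1
```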

  24. Clustering • Physical order of the records in each page and segment is organized to improve performance • Keeps records with the same or close members together • Similar in concept to a SQL clustered index, where data is sorted by key values • Tries to minimize the distribution of records with the same member across segments and pages • Optimized, but no algorithm can keep all records for the same member together (unless the cube contains a single dimension) • Similarly – SQL can only have a single clustered index • Records with identical dimension members can be in multiple segments • Data is read and processed in chunks (more on this later…)

  25. Indexing • How is the data retrieved? • Cubes can be in the terabyte range • Scanning data files not an option • Need an index by dimension member • Answers question “Where is the data associated with this combination of dimension members?” • Map files provide this

  26. Map Files • There is a map for each dimension which indicates the pages where the member is included in a data record • To resolve a query containing a member from each dimension, get the list of pages containing all of the members (see the sketch below) • [diagram] Segment 1 with Page 1 through Page 256, plus a Dimension 1 Map and a Dimension 2 Map pointing at the pages that contain each member
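
A minimal sketch of that lookup, with hypothetical members and page numbers: each dimension map yields the set of pages containing a member, and the query reads only the intersection of those sets.

```python
# Illustrative sketch (not the actual map-file format): per-dimension maps from
# member to the pages whose records include that member, intersected to find
# the pages worth scanning for a query.

dimension_maps = {
    "Customers": {"USA": {1, 2, 5}, "Canada": {3, 5}},
    "Product":   {"Contoso": {1, 3, 5}, "Fabrikam": {2, 4}},
}

def pages_to_scan(query_members):
    """query_members maps dimension name -> requested member."""
    candidate_pages = None
    for dim, member in query_members.items():
        pages = dimension_maps[dim].get(member, set())
        candidate_pages = pages if candidate_pages is None else candidate_pages & pages
    return candidate_pages or set()

# Only pages listed for both USA and Contoso need to be read from disk.
print(pages_to_scan({"Customers": "USA", "Product": "Contoso"}))  # {1, 5}
```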

  27. Other Approaches • Array based • Normally allocates a cell for every combination • Result: data explosion – much more disk space and longer processing times • Mix of record and array • ‘Dense’ dimensions are record-like • ‘Sparse’ dimensions are array-like • A bit is used per empty cell – sparsity explodes database sizes • The user decides whether a dimension is dense or sparse

  28. Agenda • Architecture Review • Aggregations • Data and Dimension Storage • Processing • Querying • Server Memory Management • Distinct Count

  29. Processing Buffer Memory Settings • ‘Read-Ahead Buffer Size’ is the buffer containing data read from the source database • Defined in the Server Properties dialog. Default: 4 MB • Rarely important – little effect when changed • Data is processed in chunks of ‘Process Buffer Size’ • Also defined in the Server Properties dialog • Data is clustered within the Process Buffer • The bigger the Process Buffer Size the better – make it as big as possible (see the sizing sketch below) • Data for dimension members is clustered to keep data for ‘close’ members close together • The larger these memory settings are, the more effective the clustering
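
A back-of-the-envelope sizing sketch, not a documented formula: the number of fact records that fit in the Process Buffer (and can therefore be clustered as one chunk) depends on a bytes-per-record figure you would estimate for your own cube from its dimension keys and measures; the 40-byte record below is an assumption.

```python
# Rough sizing sketch only; the bytes-per-record value is an assumption.

def records_per_process_buffer(process_buffer_mb, bytes_per_record):
    """How many records can be clustered together in one processing chunk."""
    return (process_buffer_mb * 1024 * 1024) // bytes_per_record

# Example: a hypothetical 40-byte record with a 32 MB Process Buffer Size.
print(records_per_process_buffer(32, 40))   # 838860 records per chunk
```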

  30. Incremental Processing • Two-step process • First, a partition is created with the incremental data • Second, that partition is merged with the original • Complete segments of both partitions are left intact – incomplete ones are merged (see the merge sketch below) • After many incremental processes the data becomes distributed: degraded performance • Reprocessing (if you have a large Process Buffer Size) can provide improved performance • [diagram] Original partition + incremental partition → merged partition
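
A minimal sketch of that merge step, with segments modeled as plain lists of records (the real format is the segment/page structure described earlier): complete segments are carried over untouched, and only the records from incomplete segments are pooled and re-packed.

```python
# Illustrative sketch (not the server's merge code) of merging an incremental
# partition into the original one.

RECORDS_PER_SEGMENT = 65_536  # 256 pages x 256 records

def merge_partitions(original_segments, incremental_segments):
    """Each segment is modeled as a list of records."""
    merged, leftovers = [], []
    for seg in original_segments + incremental_segments:
        if len(seg) == RECORDS_PER_SEGMENT:
            merged.append(seg)        # complete segment: left intact
        else:
            leftovers.extend(seg)     # incomplete segment: records pooled
    # Re-pack the pooled records into new segments.
    for i in range(0, len(leftovers), RECORDS_PER_SEGMENT):
        merged.append(leftovers[i:i + RECORDS_PER_SEGMENT])
    return merged
```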

  31. Agenda • Architecture Review • Aggregations • Data and Dimension Storage • Processing • Querying • Server Memory Management • Distinct Count

  32. Querying a Cube • CLIENT • Select {[North America],[USA],[Canada]} on rows, Measures.members on columns from myCube • Need two things: • getting the dimension members – the axes • getting the data

  33. Resolve Axis • Dimension members are cached on the client in the ‘Client Member Cache’ • Levels with #members < Large Level Threshold are sent as a group • Levels with #members > Large Level Threshold are retrieved as needed (see the sketch below) • Large Level Threshold default value: 1000; can be changed in a server property and in the connection string • Where members are not cached, members and their descendants are retrieved to the client until the needed member is retrieved • Levels with members that have 1000s of siblings result in degraded performance • The member cache is not cleaned except on disconnect or when the cube structure changes • [diagram] ‘Christmas trees’ of cached members surrounding the requested member, with non-cached members outside
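
A minimal sketch of that decision, with a hypothetical ClientMemberCache class and callback names of my own; it only illustrates the small-level versus large-level behavior driven by the Large Level Threshold.

```python
# Illustrative sketch (not the PivotTable Service implementation).

LARGE_LEVEL_THRESHOLD = 1000  # default; configurable via server property / connection string

class ClientMemberCache:
    def __init__(self, fetch_level, fetch_member):
        self.members = {}                 # member name -> member data
        self.fetch_level = fetch_level    # callback: level name -> list of members
        self.fetch_member = fetch_member  # callback: member name -> one member

    def resolve(self, level_name, level_member_count, member_name):
        if member_name in self.members:
            return self.members[member_name]                 # cache hit
        if level_member_count < LARGE_LEVEL_THRESHOLD:
            # Small level: pull the whole level across in one round trip.
            for m in self.fetch_level(level_name):
                self.members[m["name"]] = m
        else:
            # Large level: retrieve only the member that is actually needed.
            self.members[member_name] = self.fetch_member(member_name)
        return self.members.get(member_name)
```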

  34. Client Data Cache • Client retains data of previous queries in client data cache • Client Cache Size property controls how much data is in the client cache • When 0: unlimited • 1-99 (inclusive), percent of physical memory • >99 use up to the value in KB • Default value: 25 • When exceeded, client cache is cleaned at cube granularity
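
A minimal sketch of how the Client Cache Size value maps to an actual limit, following the three ranges above; the function name and the physical-memory parameter are mine.

```python
# Illustrative sketch (not documented product code) of interpreting the
# Client Cache Size property.

def client_cache_limit_kb(client_cache_size, physical_memory_kb):
    if client_cache_size == 0:
        return None                                            # unlimited
    if 1 <= client_cache_size <= 99:
        return physical_memory_kb * client_cache_size // 100   # percent of physical memory
    return client_cache_size                                   # > 99: limit in KB

# Default setting of 25 on a machine with 512 MB of physical memory:
print(client_cache_limit_kb(25, 512 * 1024))       # 131072 KB = 25% of physical memory
print(client_cache_limit_kb(0, 512 * 1024))        # None -> unlimited
print(client_cache_limit_kb(200_000, 512 * 1024))  # 200000 KB cap
```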

  35. How Cubes Are Queried [diagram] The client, with its Client Data Cache, sends queries to the service's Query Processor; the service answers from Dimension Memory and Cache Memory where possible and otherwise reads data on disk from the partitions (Canada, Mexico, USA), each made up of Segments 1–6.

  36. Agenda • Architecture Review • Aggregations • Data and Dimension Storage • Processing • Querying • Server Memory Management • Distinct Count

  37. Service Start Up • Minimum Allocated Memory defines the amount of memory completely dedicated to the server • All dimensions in the database are retained in memory • Tip: invalidate a dimension if it is not used in a cube • Dimension requirements: ~125 bytes per member plus member properties • 1M members: 125 MB • With a 25-character member property (e.g., Address): 175 MB (see the sizing sketch below) • Large dimensions can migrate to a separate process space • [diagram] Dimension Memory sits within the Minimum Allocated Memory region
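
A minimal sizing sketch based on the figures above; the 2-bytes-per-character assumption for member properties is mine, chosen so that a 25-character property adds roughly 50 MB per million members, matching the 125 MB to 175 MB jump on the slide.

```python
# Rough dimension-memory estimate (sketch, not a documented formula).

BYTES_PER_MEMBER = 125
BYTES_PER_PROPERTY_CHAR = 2   # assumption: Unicode character storage

def dimension_memory_mb(member_count, property_chars_per_member=0):
    per_member = BYTES_PER_MEMBER + property_chars_per_member * BYTES_PER_PROPERTY_CHAR
    return member_count * per_member / 1_000_000   # decimal MB, as on the slide

print(dimension_memory_mb(1_000_000))       # 125.0 -> ~125 MB for 1M members
print(dimension_memory_mb(1_000_000, 25))   # 175.0 -> ~175 MB with a 25-char property
```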

  38. During Processing • Shadow dimensions: 2 copies of dimensions are stored in memory while processing • Processing buffers: Read-Ahead Buffer Size and Process Buffer Size • If the dimension and processing-buffer memory requirements exceed the Memory Conservation Threshold, there is no room for the data cache • [diagram] Memory layout between Minimum Allocated Memory and the Memory Conservation Threshold: Dimension Memory, Shadow Dimensions, Processing Buffers, and the remaining Available Cache

  39. During Querying • The data cache stores query data for reuse • Faster than retrieving from storage • If Dimension Memory requirements > Memory Conservation Threshold, there is no data cache • A ‘Cleaner’ wakes up periodically to reclaim memory from the data cache • BackgroundInterval registry setting; default value: 30 seconds • Memory usage <= 0.5 * (Minimum Allocated Memory + Memory Conservation Threshold): no cleaning • Between 0.5 * (Minimum Allocated Memory + Memory Conservation Threshold) and the Memory Conservation Threshold: mild cleaning • Above the Memory Conservation Threshold: aggressive cleaning (see the sketch below) • [diagram] Dimension Memory and Available Cache shown between Minimum Allocated Memory and the Memory Conservation Threshold
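
A minimal sketch of those three cleaning bands; the function and parameter names are mine, and everything is expressed in bytes.

```python
# Illustrative sketch (not the actual cleaner) of the cleaning policy above.

def cleaning_level(memory_in_use, minimum_allocated_memory, memory_conservation_threshold):
    midpoint = 0.5 * (minimum_allocated_memory + memory_conservation_threshold)
    if memory_in_use <= midpoint:
        return "no cleaning"
    if memory_in_use < memory_conservation_threshold:
        return "mild cleaning"
    return "aggressive cleaning"

# Example with hypothetical settings: 64 MB minimum, 128 MB conservation threshold.
MB = 1024 * 1024
print(cleaning_level(80 * MB, 64 * MB, 128 * MB))    # no cleaning (below the 96 MB midpoint)
print(cleaning_level(110 * MB, 64 * MB, 128 * MB))   # mild cleaning
print(cleaning_level(200 * MB, 64 * MB, 128 * MB))   # aggressive cleaning
```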

  40. Demo: Setting Server Properties

  41. Agenda • Architecture Review • Aggregations • Data and Dimension Storage • Processing • Querying • Server Memory Management • Distinct Count

  42. Distinct Count • Business problem: the Sales Manager wants to know: “How many customers are buying Computers?” and “How many active customers do I have?”

                    Sales    Number of Customers
      All products   8000    200
      Hardware       3300     80
      Computers      2000     70
      Monitors        800     60
      Printers        500     30
      Software       4700    150
      Home           1500    100
      Business       2500    100
      Games           700     80

  43. Distinct Count: Changes to Data Structure • The DC measure is stored with each fact and aggregation record • Just like a new dimension • Data is ordered by the DC measure • “Order by” is included in the SQL statement during processing • The number of records can be increased by orders of magnitude • Dependent on the number of distinct values per record (see the sketch below) • [diagram] A sample aggregate record without Distinct Count versus the same aggregate with a distinct count on customers: the number of records increases
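
A minimal sketch of why the record count grows, using hypothetical fact rows: a plain aggregation keeps one record per cell, but a distinct count on Customer needs one record per (cell, customer) so that distinct customers can still be counted and combined correctly.

```python
# Illustrative sketch (not the storage engine) of aggregation records with and
# without a distinct count measure.
from collections import defaultdict

facts = [  # (product_category, customer, sales) -- hypothetical
    ("Computers", "C1", 900), ("Computers", "C2", 600),
    ("Computers", "C1", 500), ("Monitors",  "C3", 800),
]

# Plain aggregation: one record per category.
plain = defaultdict(int)
for category, customer, sales in facts:
    plain[category] += sales
print(dict(plain))                      # {'Computers': 2000, 'Monitors': 800}

# With distinct count on Customer: one record per (category, customer),
# kept ordered by the DC measure within each cell.
with_dc = defaultdict(int)
for category, customer, sales in facts:
    with_dc[(category, customer)] += sales
print(sorted(with_dc))                  # 3 records now cover the same 2 cells
print({c: len({cust for (cat, cust) in with_dc if cat == c}) for c in plain})
# {'Computers': 2, 'Monitors': 1}  -> the distinct customer counts
```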

  44. Distinct Count: Changes to Query • Single thread per partition instead of per segment • Unlike regular cubes, cannot do a single aggregation of results from each segment as a single value of the DC measure can cross segments • Consequently – performance impact • Dimension slice requires much more disk scan than before • Segments clustered by DC measure • Expensive

  45. Distinct Count Tips • Keep DC measures in their own cube • All measures are retrieved on query – even those not asked for • Create a virtual cube to merge DC with other measures • Incremental processing of DC cubes is very expensive • Segments are restructured and reordered to keep records ordered by the DC measure • Time and memory intensive

  46. Distinct Count Tips • Unlike regular cubes, it is best to distribute DC values evenly across each partition • Most effective use of multiple threads for query processing • If you have a dimension that corresponds to the distinct count measure, aggregations are recommended only on the lowest level • (Example: Customer dimension in the cube, Customer as the distinct count measure)

  47. Summary • Architecture Review • Aggregations • Data and Dimension Storage • Processing • Querying • Server Memory Management • Distinct Count

  48. Don’t forget to complete the on-line Session Feedback form on the Attendee Web site: https://web.mseventseurope.com/teched/
