
Developing a Characterization of Business Intelligence Workloads for Sizing New Database Systems


Presentation Transcript


  1. Developing a Characterization of Business Intelligence Workloads for Sizing New Database Systems Ted J. Wasserman (IBM Corp. / Queen’s University) Pat Martin (Queen’s University) David B. Skillicorn (Queen’s University) Haider Rizvi (IBM Canada)

  2. Outline • Background and Motivation • Workload Characterization Analysis • Results • Future work

  3. What is Sizing? • Estimating the amount of physical computing resources needed to support a new workload • Processor (CPU), disk, memory • Use of simplifying assumptions, extrapolations, estimations, projections, rules of thumb, prior experience, etc. • Due to the lack of available information about the new application, sizing is more of an art than a science

  4. Our Sizing Approach • Collect the required high-level input data from the customer • Cross-check and verify input data, making assumptions and estimates if needed • Determine the required system resource demands for each workload class and type • Aggregate the different workload types and classes’ resource demands to determine the overall requirements • Determine which hardware configurations will meet the required resource demands • Produce a ranked list of hardware configurations * For more details, see: Wasserman, T.J., Martin, P., Rizvi, H. Sizing DB2 Servers for Business Intelligence Workloads. In Proc. of CASCON2004, October 2004, Toronto, Canada.

  5. Motivation • Problem: How does a customer describe the workload of their new application? • May not know the exact queries yet • No production-level measurements available • Only vague, high-level information • Solution: Study the characteristics of a proxy workload (TPC-H) and have the customer describe their workload in terms of the approximate performance goals and mix of the different classes of queries inherent in the proxy workload

  6. Workload Characterization Analysis • Partition queries into a few general classes based on their resource usage • Need to keep the classes simple so that customers can understand and relate to the partitions • Each class comprises the queries that are similar to each other based on resource usage and other characteristics

  7. Data Collection • Data from 5 recent TPC-H benchmark power runs used (*pre-audited runs) • Hardware configurations and database scales varied across benchmarks • Balanced system configurations were used • Data collected using standard OS monitoring tools at 5-second intervals for each query

  8. Parameter Selection • Query parameters monitored and used in analysis: • Query response time (seconds) • Average (user) CPU utilization • Average MB/second rate • Average IO/second rate • Size of largest n-way table join • The above set was sufficient for our analysis

  9. Data Normalization • Within each benchmark run, data normalized • Each benchmark data set transformed to one with a 0-mean and std. dev. of 1 • Normalized query data from each benchmark combined and used for analysis
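[As an illustration of the z-score normalization the slide describes, a minimal NumPy sketch; the metric values below are invented, not benchmark data:]

```python
import numpy as np

# Toy per-benchmark matrix: rows = queries, columns = monitored metrics
# (e.g. response time, CPU %, MB/s). Values are made up for illustration.
data = np.array([
    [120.0, 85.0, 300.0],
    [ 15.0, 40.0,  50.0],
    [600.0, 95.0, 900.0],
])

# z-score each column: subtract the column mean and divide by the column
# standard deviation, giving each metric 0 mean and std. dev. of 1.
normalized = (data - data.mean(axis=0)) / data.std(axis=0)

print(normalized.mean(axis=0))  # each column mean is ~0
print(normalized.std(axis=0))   # each column std. dev. is ~1
```

[Normalizing within each benchmark run, as the slide notes, keeps differently scaled hardware configurations comparable before the runs are combined.]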

  10. Partitioning Techniques • Goal: Partition workload into classes or clusters so that objects within a cluster are similar to each other, but are dissimilar to objects in other clusters • Singular Value Decomposition (SVD) & SemiDiscrete Decomposition (SDD) • Matrix decomposition techniques • Unsupervised data mining • Good at revealing underlying or ‘hidden’ factors in data typical of real-world processes

  11. Singular Value Decomposition (SVD) • An n x m matrix, A, can be decomposed as: A = U S V^T where U is n x r, S is an r x r diagonal matrix whose entries are decreasing (the singular values), V is m x r, and the columns of U and V are orthonormal • The singular values indicate how important each new dimension is in representing the structure of A

  12. Singular Value Decomposition (2) • Can be regarded as transforming the original space to new axes such that as much variation as possible is expressed along the first axis, as much as possible of what remains along the second, and so on
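[A small NumPy sketch of the decomposition just described; the query-by-metric matrix here is an invented example, not data from the benchmark runs:]

```python
import numpy as np

# Toy normalized "query x metric" matrix A (rows = queries, columns =
# metrics such as CPU utilization, MB/s, IO/s). Values are illustrative.
A = np.array([
    [ 1.2,  0.9, -0.3],
    [ 1.0,  1.1, -0.2],
    [-0.8, -1.0,  1.4],
    [-0.9, -0.7,  1.2],
])

# Thin SVD: A = U @ np.diag(S) @ Vt
U, S, Vt = np.linalg.svd(A, full_matrices=False)

# NumPy returns the singular values in decreasing order: the first axis
# expresses as much of the variation in A as possible, the second as
# much of what remains, and so on.
assert np.all(np.diff(S) <= 0)

# Reconstruction check: the product recovers A (up to float error).
assert np.allclose(U @ np.diag(S) @ Vt, A)

# Projecting queries onto the leading left singular vectors gives the
# low-dimensional coordinates typically plotted to look for clusters.
coords = U[:, :2] * S[:2]
print(coords)
```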

  13. SemiDiscrete Decomposition (SDD) • The SDD of A approximates it as A ≈ X D Y where X is n x k, D is a k x k diagonal matrix, and Y is k x m (for arbitrary k) • The matrices X and Y have entries that are only -1, 0, or +1

  14. SemiDiscrete Decomposition (2) • For any decomposition, the product of the ith column of X, the ith diagonal element of D, and the ith row of Y is a matrix of the same shape as A • In SDD, each of these layer matrices describes a "bump" in the data, a region (not necessarily contiguous) of large magnitude
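[A brute-force sketch of finding one such layer on a tiny matrix may make the "bump" idea concrete. The exhaustive search below is for illustration only; practical SDD implementations use a much faster greedy heuristic, and the matrix is invented:]

```python
import itertools
import numpy as np

def best_sdd_layer(A):
    """Exhaustively find the best single SDD layer d * outer(x, y),
    with x and y restricted to {-1, 0, +1} entries, minimizing the
    Frobenius norm of A - d * outer(x, y). Tiny matrices only."""
    n, m = A.shape
    best_err = np.linalg.norm(A)            # error of the all-zero layer
    best = (0.0, np.zeros(n), np.zeros(m))
    for xs in itertools.product((-1, 0, 1), repeat=n):
        for ys in itertools.product((-1, 0, 1), repeat=m):
            x, y = np.array(xs, float), np.array(ys, float)
            layer = np.outer(x, y)
            weight = layer.ravel() @ layer.ravel()
            if weight == 0:
                continue
            d = (A.ravel() @ layer.ravel()) / weight  # least-squares d
            err = np.linalg.norm(A - d * layer)
            if err < best_err:
                best_err, best = err, (d, x, y)
    return best

# A has one obvious "bump": the 2x2 block of 5s in the top-left corner.
A = np.array([[5.0, 5.0, 0.0],
              [5.0, 5.0, 0.0],
              [0.0, 0.0, 1.0]])
d, x, y = best_sdd_layer(A)
print(d, x, y)  # the layer picks out the block: d = 5, |x| = |y| = (1, 1, 0)
```

[The nonzero pattern of x and y marks which rows and columns belong to the bump, which is what makes SDD layers readable as query groupings.]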

  15. Analysis Results [figure: queries plotted in the space of the leading SVD dimensions, showing the four clusters C1, C2, C3, C4]

  16. Results • Four clusters of queries • Cluster 1: Q1, Q3, Q4, Q5, Q6, Q11, Q14, Q19 • “Moderate Complexity” • Cluster 2: Q2, Q20 • “Simple Complexity” • Cluster 3: Q7, Q8, Q9, Q18, Q21 • “High Complexity” • Cluster 4: Q10, Q13, Q15, Q16, Q22 • “Trivial Complexity”

  17. Results (2) • Queries appear to scale well across different system architectures and database sizes • Attempt to understand the meaning of the new “dimensions” of the SVD analysis • U1 – CPU-bound vs. IO-bound queries • U2 – Query response times • U3 – Sequential-IO-intensive vs. random-IO-intensive queries

  18. Future Work • Perform analysis on larger set of data • Use more robust/representative workload • Extend to other workload types (e.g. OLTP)

  19. Fin Thank you.
