
Auto administration of databases based on clustering








  1. Auto administration of databases based on clustering Mujiba Zaman Jyotsna Surabattula Le Gruenwald School of Computer Science The University of Oklahoma Norman, Oklahoma, 73019, USA {mujiba, jyotsna, ggruenwald}@ou.edu

  2. Overview of the Presentation
  • Motivation for auto-indexing
  • Existing techniques
  • Proposed index selection technique
  • Algorithm for the proposed index selection technique
  • Proposed re-indexing technique
  • Experiments
  • Results
  • Conclusions
  • Future work

  3. Motivation for Auto-Indexing
  • Index selection is an important part of physical database design
  • For large databases it is difficult for DBAs to analyze the data and find an optimal set of indexes
  • The goals of auto-indexing are to:
  • Analyze the workload automatically to identify a good set of indexes
  • Create those indexes
  • Automatically evaluate the current indexes and re-index when necessary

  4. Existing Techniques
  The index selection problem (ISP) has been approached in two different ways to build index selection tools:
  External tools:
  • Some basic assumptions are made
  • Cost functions are formulated based on those assumptions
  • The cost function is minimized using combinatorial optimization techniques or some heuristic method
  Optimizer-based tools:
  • Utilize the query optimizer to give cost estimates for various index configurations and suggest the configuration with the lowest cost estimate

  5. Advantages and Disadvantages of Existing Techniques
  External tools:
  Advantages:
  • Applications using the DBMS will not suffer processing delays
  Disadvantages:
  • Disconnected from the optimizer: a suggested index may not be used by the optimizer
  • Become obsolete if the optimizer changes

  6. Advantages and Disadvantages of Existing Techniques
  Optimizer-based tools:
  Advantages:
  • Suggested indexes will be used by the optimizer
  Disadvantages:
  • Perform the expensive operation of invoking the optimizer
  • Longer processing times for other applications using the DBMS while indexes are being suggested

  7. Proposed Index Selection Technique
  • Combines both approaches
  • Based on the intuition that attributes occurring commonly and frequently in a group of similar queries are likely to be useful for indexing
  • Use a data mining clustering technique to group queries that are similar in terms of their use of attributes
  • Extract the indexable attributes shared by all the queries in each group as indexes
  • These indexes can be single-column or multi-column

  8. Proposed Index Selection Technique
  • For multi-column indexes, the order of the columns is determined by assigning weights to attributes based on where they are used in the queries
  • A clustered index is also chosen by assigning weights to the attributes depending on where they occur in the queries
  • Extracted indexes are then submitted to the query optimizer for final selection for the given workload
  • The indexes not selected by the optimizer are eliminated; the remaining indexes are the final indexes suggested by our tool
  • Re-index if the total table scan time using the current indexes is higher than that using new indexes

  9. Algorithm for the Proposed Technique
  1. Input is a workload of queries
  2. Parse the workload to find all the indexable attributes and build two matrices:
  • Query-attribute matrix (1: presence of an attribute; 0 otherwise)
  • Attribute-frequency matrix (the number indicates the frequency of the attribute)
  Consider the following queries:
  Q1. Select T1.A, T1.B, Sum(T1.C) from T1, T3, T4 where T1.A=T4.K and T1.C=T3.H and T1.B<=30 group by T1.C
  Q2. Select Avg(G) from T3 where F like 'this%' and G between 10 and 20 order by G
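A minimal Python sketch of step 2, assuming the indexable attributes have already been extracted from each query (a full SQL parser is omitted, so the extraction results for Q1 and Q2 are hardcoded):

    # Build the query-attribute and attribute-frequency matrices from
    # pre-extracted indexable attributes. An attribute may occur more
    # than once in a query, e.g. in both the WHERE and GROUP BY clauses.
    from collections import Counter

    workload = {
        "Q1": ["T1.A", "T4.K", "T1.C", "T3.H", "T1.B", "T1.C"],
        "Q2": ["T3.F", "T3.G", "T3.G"],
    }

    attributes = sorted({a for attrs in workload.values() for a in attrs})

    # Query-attribute matrix: 1 if the attribute appears in the query.
    qa_matrix = {q: [1 if a in attrs else 0 for a in attributes]
                 for q, attrs in workload.items()}

    # Attribute-frequency matrix: occurrence count of each attribute.
    freq_matrix = {q: [Counter(attrs)[a] for a in attributes]
                   for q, attrs in workload.items()}

    for q in workload:
        print(q, qa_matrix[q], freq_matrix[q])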

  10. Algorithm for the Proposed Technique
  • A corresponding query-attribute matrix with several other queries could be as follows:
  Table 1. Query-Attribute Matrix

  11. Algorithm for the Proposed Technique
  • A corresponding attribute-frequency matrix could be as follows:
  Table 2. Attribute-Frequency Matrix

  12. Candidate Index Selection
  3. Choose the candidate index set: an attribute is a candidate if Freq > threshold1 OR Freq * T > threshold2, where T = table size / constant
  Example: table sizes T1 = 5000, T2 = 2000, T3 = 100, T4 = 20; constant = 100, so T = table size / 100; threshold1 = 5, threshold2 = 30
  Table 3. Attribute-Frequency Matrix
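A sketch of the candidate test in step 3, with illustrative frequencies (the real values come from the attribute-frequency matrix in Table 3):

    # An attribute is a candidate if its frequency is high on its own, or
    # high relative to the size of the table it belongs to.
    TABLE_SIZES = {"T1": 5000, "T2": 2000, "T3": 100, "T4": 20}
    THRESHOLD1, THRESHOLD2, CONSTANT = 5, 30, 100

    def is_candidate(attr, freq):
        t = TABLE_SIZES[attr.split(".")[0]] / CONSTANT
        return freq > THRESHOLD1 or freq * t > THRESHOLD2

    freqs = {"T1.A": 4, "T1.B": 6, "T3.F": 2, "T4.K": 3}  # example values
    print({a: f for a, f in freqs.items() if is_candidate(a, f)})
    # {'T1.A': 4, 'T1.B': 6} -- T1.A qualifies via 4 * (5000/100) = 200 > 30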

  13. Non-Clustered Index Selection
  4a. Assign weights to the candidate attribute set as follows:
  WHERE clause: 3
  GROUP BY / ORDER BY: 2
  Aggregate functions: 1
  Table 4. Candidate Attribute Set

  14. Non-Clustered Index Selection
  4b. Order the columns in the candidate attribute set of each table in descending order of their weights to determine the non-clustered indexes.
  Table 5. Ordered Candidate Attribute Set
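A small Python sketch of steps 4a-4b; the clause occurrences below are illustrative stand-ins for what the parser would record (the actual weights come from Table 4):

    # Weight each candidate attribute by where it occurs in the workload's
    # queries, then order each table's attributes by descending weight.
    WEIGHTS = {"where": 3, "group_order_by": 2, "aggregate": 1}

    occurrences = {  # attribute -> clauses it appears in
        "T1.A": ["where"],
        "T1.B": ["where", "group_order_by"],
        "T1.C": ["where", "group_order_by", "aggregate"],
    }

    weight = {a: sum(WEIGHTS[c] for c in kinds)
              for a, kinds in occurrences.items()}
    ordered = sorted(weight, key=weight.get, reverse=True)
    print(ordered)  # ['T1.C', 'T1.B', 'T1.A'] with these example weights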

  15. Non-Clustered Index Selection
  4c. Build a query-attribute matrix with the ordered columns
  Table 6. Query-Attribute Matrix with the ordered candidate index set

  16. Non-Clustered Index Selection
  5. Use a data mining clustering technique on the query-attribute matrix to group similar queries. A possible clustering result is [Q1 Q4] [Q2 Q3 Q5] (Table 6)
  6. Extract the index set for each table from the clusters obtained. These are the attributes common to all the queries clustered together.
  For the cluster [Q1 Q4], the index is (T1.C, T1.B, T1.A), in that order
  For the cluster [Q2 Q3 Q5], the index is (T3.F, T3.G), in that order
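A sketch of step 6, using hypothetical per-query attribute sets chosen so that the extraction reproduces the result above:

    # For each cluster, the suggested index per table consists of the
    # attributes common to every query in the cluster, kept in the
    # weight order computed in step 4b.
    clusters = [["Q1", "Q4"], ["Q2", "Q3", "Q5"]]
    query_attrs = {
        "Q1": {"T1.A", "T1.B", "T1.C"},
        "Q4": {"T1.A", "T1.B", "T1.C", "T1.D"},
        "Q2": {"T3.F", "T3.G"},
        "Q3": {"T3.F", "T3.G"},
        "Q5": {"T3.F", "T3.G", "T3.H"},
    }
    ordered = ["T1.C", "T1.B", "T1.A", "T1.D", "T3.F", "T3.G", "T3.H"]

    for cluster in clusters:
        common = set.intersection(*(query_attrs[q] for q in cluster))
        print(cluster, "->", [a for a in ordered if a in common])
    # ['Q1', 'Q4'] -> ['T1.C', 'T1.B', 'T1.A']
    # ['Q2', 'Q3', 'Q5'] -> ['T3.F', 'T3.G']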

  17. Clustered Index Selection
  6. Choose a single-column clustered index as follows:
  a. During parsing, assign the following weights to the attributes:
  Range queries: 2
  Join: 1
  GROUP BY / ORDER BY: 1
  b. Rank the attributes according to their weight
  Table 7. Ranking according to weight for the clustered index

  18. Clustered Index Selection
  c. Determine the selectivity of the columns and rank them according to selectivity for each table.
  Table 8. Ranking according to selectivity for the clustered index

  19. Clustered Index Selection
  d. For each column, find the sum of its rank by selectivity and its rank by weight. The indexable attribute with the largest sum is suggested as the clustered index for the table. If more than one indexable attribute has the highest sum, select the attribute with the higher rank by weight.
  Table 9. Sum of ranks by weight and selectivity for the clustered index
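A sketch of the combined rank-sum rule in steps a-d, with made-up weights and selectivities for table T1:

    # Sum each attribute's rank by weight and rank by selectivity; the
    # largest sum wins, and ties go to the higher rank by weight.
    attrs = {  # attribute -> (weight, selectivity)
        "T1.A": (1, 0.2),
        "T1.B": (3, 0.9),
        "T1.C": (4, 0.5),
    }

    def ranks(values):  # the highest value receives the highest rank
        return {a: i + 1 for i, a in enumerate(sorted(values, key=values.get))}

    w_rank = ranks({a: w for a, (w, _) in attrs.items()})
    s_rank = ranks({a: s for a, (_, s) in attrs.items()})

    best = max(attrs, key=lambda a: (w_rank[a] + s_rank[a], w_rank[a]))
    print(best)  # T1.C: B and C tie at sum 5, and C has the higher weight rank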

  20. Algorithm for the Proposed Technique
  7. Provide both the clustered and non-clustered indexes to the query optimizer
  8. Let the optimizer display the estimated execution plan for executing the workload on the database
  9. Select the indexes used by the optimizer as the final suggested indexes

  21. Proposed Re-Indexing Technique
  • An auto-indexing tool should be capable of re-indexing whenever the current indexes are no longer good
  • The DBMS can periodically monitor the total table scan cost for a particular size of workload
  • When this value exceeds a limit, the DBMS triggers the index selection tool to suggest a new set of indexes
  • The limit that triggers the tool can be determined from the relationship between the increase in table scan cost and the performance gain due to re-indexing; it can be set by the DBA

  22. Proposed Re-Indexing Technique
  Produce a chart showing "re-indexing performance gain vs. table scan cost" => guidelines for the DBA, by doing the following:
  • Assume the current workload is Wc and the current index set is INc; compute the total current table scan cost TSc
  • Obtain a set of different workload samples over a long history: W2, W3, ..., Wn
  • Compute the total table scan cost for each workload sample using the current index set: TS2c, TS3c, ..., TSnc

  23. Proposed Re-Indexing Technique (Cont.)
  • Run our index selection tool on W2, W3, ..., Wn to get the corresponding recommended sets of indexes: IN2r, IN3r, ..., INnr
  • Compute the total table scan cost for each workload using its new recommended set of indexes: TS2r, TS3r, ..., TSnr
  • Compute the percentage of performance improvement due to re-indexing for each workload Wj: [(TSjc - TSjr) / TSjc] * 100 for j = 2, ..., n
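A sketch of how these quantities combine into the chart's data points, with placeholder scan costs:

    # For each sampled workload Wj, pair the scan-cost increase under the
    # current indexes with the gain that re-indexing would buy.
    ts_c = 100.0                                     # TSc: cost of Wc under INc
    ts_jc = {"W2": 120.0, "W3": 150.0, "W4": 210.0}  # TSjc: cost of Wj under INc
    ts_jr = {"W2": 110.0, "W3": 100.0, "W4": 90.0}   # TSjr: cost of Wj under INjr

    for w in ts_jc:
        increase = (ts_jc[w] - ts_c) / ts_c * 100
        gain = (ts_jc[w] - ts_jr[w]) / ts_jc[w] * 100
        print(f"{w}: scan cost up {increase:.0f}%, re-indexing gain {gain:.1f}%")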

  24. Proposed Re-Indexing Technique (chart: re-indexing performance gain vs. table scan cost)

  25. Proposed Re-Indexing Technique
  • If the DBA has the chart in advance, the DBA can set the limit at which he/she wants the DBMS to trigger the index selection tool
  • The existing index set is then compared with the new index set
  • Indexes in the new set but not in the existing set are created; those in the existing set but not in the new set are dropped; those in the intersection remain
  • The process of dropping and creating indexes in the system follows a methodology similar to Oracle's Automated Index-Rebuild System, and can be done either online or offline

  26. Experiments
  • Performance metric: average query response time = the time taken to execute the workload in minutes divided by the total number of queries in the workload
  • All experiments were conducted on an Intel Pentium 4-M system, 2.0 GHz CPU, 512 MB RAM
  • Experiments were conducted on the TPC-R benchmark with its 22 read-only queries
  • Experiments were conducted on Microsoft SQL Server 2000

  27. Experiments
  • Clustering algorithms used in the experiments:
  1. MacQueen's k-means clustering algorithm, using Hamming distance as the distance function
  2. KEROUAC (Knowledge Explicit, Rapid, Off-beat and User-Centered Algorithm for Clustering)
  • k-means is a well-established data clustering algorithm
  • KEROUAC has a practical advantage: it does not require the final number of clusters to be set in advance
  • Both algorithms have low computational cost

  28. Results with k-means (Chart 1)

  29. Results with KEROUAC (Chart 2)

  30. Results with Index Suggestion Time

  31. Conclusions
  • The performance of the tool critically depends on the choice of α (for KEROUAC), k (for k-means), and the threshold value
  • Increasing α (k) groups queries with higher similarity to each other into a cluster
  • Increasing α (k) beyond some point has no effect (all identical queries are already in the same cluster)
  • We can therefore achieve the desired results by choosing α (k) = workload size
  • Increasing the threshold means more attributes are eliminated from consideration
  • Experiments show that a threshold value of about workload size / 4 works well
  • Our tool chooses these parameters to operate in the best performance range

  32. Conclusions
  • We compared our results with Microsoft SQL Server's Index Selection Tool (an optimizer-based tool) and with Frequent Itemsets Mining (an external tool)
  • Best performance improvement using k-means clustering compared with Frequent Itemsets Mining: 71.43%
  • Best performance improvement using k-means clustering compared with the Microsoft Index Selection Tool: 16.2%
  • Best performance improvement using KEROUAC clustering compared with Frequent Itemsets Mining: 73.26%
  • Best performance improvement using KEROUAC clustering compared with the Microsoft Index Selection Tool: 21.5%
  • The index suggestion time of the Microsoft Index Selection Tool was 4 times higher than that of our tool for a workload size of 240

  33. Future Work • Test the dependence of the technique on different clustering algorithms • Test with different sizes of workload • Test with update queries in the workload • Test with index elimination technique

  34. Thanks! Questions?

  35. Indexable Attributes
  Indexable attributes/columns:
  • Columns that appear in WHERE, GROUP BY, or ORDER BY clauses
  • Columns used with the operators {=, <, >, <=, >=, BETWEEN, IN}
  Example: in
  SELECT * FROM table1, table2 WHERE table1.column1 = table2.column1 AND (table2.column2 BETWEEN 0 AND 1000)
  the indexable attributes are: table1.column1, table2.column1, table2.column2
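A rough Python sketch of the extraction; a real tool would use a proper SQL parser, while this regex version only handles simple qualified column names like those in the example:

    import re

    query = ("SELECT * FROM table1, table2 "
             "WHERE table1.column1 = table2.column1 "
             "AND (table2.column2 BETWEEN 0 AND 1000)")

    # Take everything after WHERE and collect table.column references.
    where = re.search(r"WHERE\s+(.*)", query, re.IGNORECASE).group(1)
    print(sorted(set(re.findall(r"\b\w+\.\w+\b", where))))
    # ['table1.column1', 'table2.column1', 'table2.column2']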

  36. Selectivity of a Column
  Selectivity:
  • The selectivity ratio of a column/index = the number of unique values in the column/index divided by the total number of rows in the table
  SELECT COUNT(DISTINCT(column_name)) FROM table_name;
  • If a column/index has high selectivity, it is more useful to the optimizer and has a greater chance of being picked by the optimizer when executing a query
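A self-contained sketch of the selectivity computation, using an in-memory SQLite table as a stand-in database:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (major TEXT, minor TEXT)")
    conn.executemany("INSERT INTO t VALUES (?, ?)",
                     [("cs", "math"), ("cs", "bio"), ("ee", "math")])

    def selectivity(conn, table, column):
        # number of unique values divided by the total number of rows
        distinct = conn.execute(
            f"SELECT COUNT(DISTINCT {column}) FROM {table}").fetchone()[0]
        total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
        return distinct / total

    print(selectivity(conn, "t", "major"))  # 2 unique / 3 rows ≈ 0.67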

  37. Multicolumn Indexes
  Multicolumn indexes:
  • Columns involved in a multicolumn index should be joined with AND, not with OR
  • The order of the columns in a multicolumn index is important: it should be based on selectivity, and the first column should be the one used most often in queries
  • An index on (a, b, c) is used by queries involving (a, b, c), (a, b), or (a), but not by any other combination
  Example: an index on (major, minor) is suitable for the following query:
  SELECT name FROM test WHERE major = constant AND minor = constant;

  38. Frequent Itemset Mining Technique
  • Input is a workload of queries
  • Extract the indexable attributes
  • Create a query-attribute matrix (Table 1)
  Q1: SELECT * FROM T1, T2 WHERE A BETWEEN 1 AND 10 AND C = D
  Q2: SELECT * FROM T1, T2 WHERE B LIKE '%this%' AND C=5 AND E<100
  Q3: SELECT * FROM T1, T2 WHERE A=30 AND B>3 GROUP BY C HAVING SUM(E)>2
  Q4: SELECT * FROM T1 WHERE B>2 AND E IN (3,2,5)
  Q5: SELECT * FROM T1, T2 WHERE A=30 AND B>3 GROUP BY C HAVING SUM(E)>2
  Q6: SELECT * FROM T1, T2 WHERE B>3 GROUP BY C HAVING SUM(E)>2

  39. Frequent Itemset Mining Technique
  • "A closed itemset is a maximal set of items (attributes) that are common to a set of transactions (queries)"
  • The candidate indexes selected with minimum support greater than or equal to 2/6 in the example are as follows [1]: {(AC, 3/6), (BE, 5/6), (ABCE, 2/6), (BCE, 4/6)} (Table 2)
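A brute-force sketch of closed frequent itemset mining; the per-query attribute sets are reconstructed from the example queries, so the exact supports are illustrative (real tools use dedicated algorithms rather than this exhaustive search):

    from itertools import combinations

    queries = {
        "Q1": {"A", "C"}, "Q2": {"B", "C", "E"}, "Q3": {"A", "B", "C", "E"},
        "Q4": {"B", "E"}, "Q5": {"A", "B", "C", "E"}, "Q6": {"B", "C", "E"},
    }
    items = sorted(set().union(*queries.values()))
    n, min_support = len(queries), 2 / 6

    def support(s):
        return sum(s <= attrs for attrs in queries.values()) / n

    frequent = [set(c) for r in range(1, len(items) + 1)
                for c in combinations(items, r)
                if support(set(c)) >= min_support]

    # Closed: no proper superset has the same support.
    closed = [s for s in frequent
              if not any(s < t and support(s) == support(t) for t in frequent)]
    print(sorted(("".join(sorted(s)), support(s)) for s in closed))
    # includes AC 3/6, BE 5/6, BCE 4/6, ABCE 2/6 (plus the singleton C)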

  40. SQL-Server Index Selection Tool (Figure 1)

  41. SQL-Server Index Selection Tool
  Candidate index selection:
  • For a given workload W that consists of n queries, n workloads W1, ..., Wn, each consisting of one query, are generated
  • For each workload Wi, the set Ii of indexable columns of its query forms the starting candidate indexes
  • Let Ci be the configuration picked by the index selection tool for Wi
  • The candidate index set for W is the union of all Ci

  42. SQL-Server Index Selection Tool
  Configuration enumeration:
  Problem: pick k indexes from a set of n candidate indexes.
  Algorithm:
  1. Let S = the best m-index configuration found using the naïve enumeration algorithm. If m = k then exit.
  2. Pick a new index I such that Cost(S U I) < Cost(S U I') for any choice of I' != I.
  3. If Cost(S U I) >= Cost(S) then exit, else S = S U I.
  4. If |S| = k then exit.
  5. Go to 2.
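A Python sketch of this greedy step, assuming a cost(configuration) oracle (in the real tool, the optimizer's estimated cost of the workload) and a seed configuration from the naïve enumeration:

    def greedy_enumerate(candidates, seed, k, cost):
        s = set(seed)
        while len(s) < k:
            remaining = candidates - s
            if not remaining:
                break
            # the index whose addition yields the cheapest configuration
            best = min(remaining, key=lambda i: cost(s | {i}))
            if cost(s | {best}) >= cost(s):
                break  # adding any index no longer helps
            s.add(best)
        return s

    # Toy cost model: each index shaves a fixed benefit off a base cost.
    benefit = {"i1": 40, "i2": 25, "i3": 1}
    cost = lambda cfg: max(10, 100 - sum(benefit[i] for i in cfg))
    print(greedy_enumerate(set(benefit), {"i1"}, 2, cost))  # {'i1', 'i2'}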

  43. SQL-Server Index Selection Tool
  Cost evaluation:
  • Reduce the number of optimizer invocations by deriving costs from already-evaluated costs
  • The cost of a non-atomic configuration can be derived from the costs of atomic configurations
  • "A configuration C is atomic for a workload if for some query in the workload there is a possible execution of the query by the query engine that uses all indexes in C"
  • Not every atomic configuration needs to be evaluated for every single query in the workload

  44. SQL-Server Index Selection Tool
  Multicolumn index generation:
  • For K given columns, K! multicolumn indexes are possible
  • An iterative approach is used
  • In the first iteration, single-column indexes are considered
  • Only the single-column indexes selected in the first iteration are input to the generation of two-column indexes in the second iteration
  • This set of two-column indexes, along with the single-column ones, is input to the third iteration, and so on

  45. Clustered Index
  Clustered indexes:
  • "A page allocated to an index is called a data page."
  • For tables having a clustered index, the data rows of each data page are stored in order and the data pages are linked together in a doubly-linked list
  • For tables having no clustered index, the data rows are not stored in any particular order
  Figure 2

  46. Clustered Index
  Accessing data with a clustered index (Figure 3)

  47. Criteria to Choose a Clustered Index
  • Queries that return large result sets
  • Columns with many duplicate values that are searched frequently
  • Columns other than the primary key that are frequently used in join clauses
  • Columns searched within a range of values
  • Columns used in ORDER BY or GROUP BY queries

  48. MacQueen's k-Means Clustering Algorithm
  1) The first k data units are chosen as clusters of one member each
  2) The remaining m - k data units are assigned to the cluster whose centroid is nearest to the data unit under consideration; the centroid is recomputed after every addition to the cluster
  3) Iterate through the data set, assigning each data unit to its nearest cluster while taking the existing cluster centroids as fixed seed points, until a convergence criterion is reached

  49. MacQueen's k-Means Clustering Algorithm
  • The Hamming distance is the number of positions in which two binary words differ
  • For k = 3, Q1, Q2, and Q3 are the initial clusters
  • The Hamming distance of Q1 and Q4 is 5, of Q2 and Q4 is 1, and of Q3 and Q4 is 2
  • The clusters at step 2 are cluster 1: [Q1], cluster 2: [Q2, Q4, Q6], cluster 3: [Q3, Q5]
  Table 3
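A compact Python sketch of this run over binary query-attribute rows; the vectors below are hypothetical but chosen to reproduce the distances and clusters quoted above:

    rows = {
        "Q1": [1, 1, 1, 0, 0], "Q2": [0, 0, 1, 1, 1], "Q3": [0, 1, 1, 1, 1],
        "Q4": [0, 0, 0, 1, 1], "Q5": [0, 1, 1, 1, 1], "Q6": [0, 0, 1, 1, 1],
    }

    def hamming(u, v):
        return sum(a != b for a, b in zip(u, v))

    k = 3
    names = list(rows)
    clusters = [[q] for q in names[:k]]            # step 1: first k rows seed
    centroids = [list(rows[q]) for q in names[:k]]

    for q in names[k:]:                            # step 2: assign the rest
        j = min(range(k), key=lambda i: hamming(rows[q], centroids[i]))
        clusters[j].append(q)
        members = [rows[m] for m in clusters[j]]   # recompute the centroid
        centroids[j] = [1 if 2 * sum(col) >= len(members) else 0
                        for col in zip(*members)]

    print(clusters)  # [['Q1'], ['Q2', 'Q4', 'Q6'], ['Q3', 'Q5']]
    # (the final fixed-centroid pass of step 3 is omitted for brevity)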

  50. KEROUAC Clustering Algorithm
  KEROUAC: Knowledge Explicit, Rapid, Off-beat and User-Centered Algorithm for Clustering
  New Condorcet Criterion (NCC): NCC(Pz) = Σ Sim(Ei, Ej) + α × Dissim(Ei)
  NCC represents the degree of dissimilarity of objects belonging to the same cluster and the degree of similarity between clusters; it is to be minimized. α is called the granularity factor.
