CSIS7101: Course summary

CSIS7101: Course summary Spatial data Spatiotemporal data Multimedia and Time-series data Data mining I (association rules and sequence patterns) Data mining II (clustering and classification) Data warehousing and OLAP Strings and biological data Semi-structured and XML data Storage and query processing on modern machines Cache conscious indexes Advanced Database Technologies

Spatial Data • The R-tree and the R*-tree • What are they? • Dynamic, balanced trees that index Minimum Bounding Rectangles • Each entry in a directory node is an <MBR,ptr> pair. Each entry in a data node is an <MBR,oid> pair • How are they constructed/updated? • They use special insert/split algorithms • The entry MBRs in directory nodes should have (i) minimal margins – more “square-like”, (ii) minimal overlap – good search behavior, (iii) minimal area – small “dead space”. Advanced Database Technologies

Spatial Data (cont’d) • Spatial Joins Using R-trees • How is the spatial join performed? • By synchronous traversal, following recursively directory node entries that overlap. • How is the computational cost minimized? • The space restriction technique • The plane sweep heuristic • How is the I/O cost minimized? • Using some ordering and pinning techniques (optional reading) Advanced Database Technologies

Spatial Data (cont’d) • Nearest Neighbor Search Using R-trees • Which is the optimal algorithm for NN search? • INN method in the “distance browsing” paper. • How does it work? • It uses a priority queue that organizes <dist, ptr> pairs according to their (smallest) distance from the query object. • Initially all entries of the root are placed in the heap • At each step, the pair with the minimum distance is retrieved. • If it is an object, it is output • If it is an object MBR, the actual object is fetched an inserted in the queue • entry at a directory level, the entries in the node pointed by it are loaded and inserted in the queue. Advanced Database Technologies

Spatiotemporal Data • Two types of problems: • Indexing the current positions and movements of objects and querying their anticipated future positions. • Indexing and querying the past movements of mobile objects. On Indexing Mobile Objects Indexing the Positions of Continuously Moving Objects MV3R-Tree (optional reading) 1 2 Advanced Database Technologies

Spatiotemporal Data (cont’d) • Indexing current/future locations mobile objects (problem 1): • Dual transformation • Transform the line trajectories to a dual representation and index the resulting points • How is the 1D problem solved in the dual plane? • The TPR-tree • Like the R-tree, but the MBRs are time-parameterized to conservative bounding intervals (CBI). • How are the CBI computed? What is the best way to group objects into a CBI? • By minimizing an objective function (e.g., overlap) over the time the TPR-tree is valid. • How do we answer queries using the TPR-tree? Advanced Database Technologies

Multimedia and time-series data • When is Nearest Neighbor Search Meaningful? • Consider a query object q, with distance dist(q, NN) from its nearest neighbor NN and with distance dist(q, FN) from its farthest neighbor FN. • If dist(q,NN)/dist(q,FN) converges to 1 with the increase of dimensionality then NN search is not meaningful. • In other words, if the distance of the nearest neighbor is not statistically different to the distance of the farthest neighbor, then NN search is meaningless. • If the data form clusters in the high dimensional space, then NN search is meaningful. Also, if the data actually lie in a (hidden) embedded space, which is low dimensional then NN search is meaningful. Advanced Database Technologies

Multimedia and time-series data • What is FASTMAP? • A technique to map objects from a high-dimensional (unknown) space to points in a low dimensional space, so that the original distance between any pair of objects is preserved as much as possible after the transformation. • How does FASTMAP work? • It picks the furthest pair of objects (Oa and Ob) at a time, called pivots, and the coordinates of all points in that dimension is defined by their projection on the line segment OaOb. Advanced Database Technologies

Multimedia and time-series data • Indexing Time-Series Data (optional) • The similarity between two time-series is computed using an expensive dynamic time-warping technique. • In order to index the data and reduce the similarity search cost, the series are approximated by an envelope that bounds it and at the same time makes DTW applicable on it. • The lower-upper bounds of the envelope are in turn approximated by MBRs and indexed using a high-dimensional index (e.g., R-tree). • An algorithm like INN is applied on the R-tree to filter fast series that may not be similar to the query time-series, in a branch-and-bound way. Advanced Database Technologies

Knowledge Data Mining • Data mining topics covered: • Mining association rules: • If A and B is bought then also C is bought. • Mining frequent sequence patterns: • If in a sequence of event transmissions B is close and after A, then C will appear soon. • Classification: • Given a training sample, create a classifier, that predicts the class labels of unclassified data. • Clustering: • Classify a dataset into k clusters (of unknown class label), such that the distance between tuples (objects) in the same cluster is small, and the distance between tuples in different clusters is large. focus Advanced Database Technologies

Mining Association Rules • First find, frequent itemsets, then create association rules (how?). • How does the Apriori Algorithm work? • Finds frequent itemsets level-by-level • First, frequent items are found (i.e., frequent sets with only one item) • Frequent itemsets of level l are joined to generate candidate itemsets of level l+1. • Candidates of level l+1 with a subset not appearing in the frequent itemsets of level l are pruned. The frequency of the remaining ones is counted by scanning the database. Advanced Database Technologies

Mining Association Rules • How does the FP-tree method work? • First finds the items with support>s and sorts them in decreasing support value. • Then the FP-tree is constructed for these items by considering the tuples found in the database as paths of the tree. Finally, the nodes of the tree with the same label (corresponding to the same item) are linked using a linked list. • Then the frequent itemsets are found from the FP-tree, for each item individually by accessing the paths containing the item. Advanced Database Technologies

Clustering • How does the basic k-means clustering algorithm work? • First the objects are distributed (randomly) to arbitrary clusters. • The center of each cluster is computed (by averaging the values in each dimension). • Rellocate each object to the cluster whose center is closest to it. • Reapply 2-3 until the clusters do not change. Advanced Database Technologies

Clustering • How do k-medoids clustering algorithm work? • They initially pick randomly k medoids (representatives). • Each object is assigned to the cluster defined by the nearest medoid. • Find the medoid-object pair, such that if they are swapped (the object takes the place of the medoid) the benefit is maximized. • Repeat 2-3 until there is no change. Advanced Database Technologies

Hierarchical Clustering • What is hierarchical clustering? • Bottom-up or agglomerative: Initially each object is a cluster. Then the closest clusters are merged iteratively until only k clusters remain. • Top-down or divisive: Initially all objects are in one cluster. Then at each step the cluster with the best split is iteratively split until k clusters remain. Advanced Database Technologies

CURE • How does CURE work? • It is an agglomerative algorithm, that creates the clusters bottom-up. • Medoid-based, but each cluster has more than one representatives. • The representatives are scattered, thus (i) non-spherical clusters can be discovered, (ii) clusters linked weakly are not merged (as a density-based method would do), (iii) outliers are not included in the clusters. • Random sampling is used to avoid the excessive O(n2) cost of applying the algorithm on the whole dataset. Advanced Database Technologies

PROCLUS • What is PROCLUS? • An algorithm that discovers clusters in a subspace of the original high-dimensional space. • Why is PROCLUS useful? • In many cases, clustering (and nearest-neighbor search) is not meaningful in a high dimensional space, because there are no natural clusters there. • However, clusters could be found in dimensional subspaces, where not all dimensions are relevant to the cluster. Advanced Database Technologies

PROCLUS • How does PROCLUS work? • Use a greedy method to find a set M of >k medoids • Pick k medoids at random • Approximate the optimal set of dimensions for each medoid • Assign points to clusters and evaluate the cluster goodness • Replace “bad” medoids with random ones from M • Repeat 3-5 until only good enough medoids remain. • Approximate the optimal set of dimensions for each medoid and assign the points to clusters. Advanced Database Technologies

12 34 8 98 11 Data Warehousing and OLAP • What is a data warehouse? • a large collection of heterogeneous data which have been cleaned, integrated and consolidated. The warehouse contains old, historical data which are useful for data analysis tasks. • What is On-Line Analytical Processing? • Data analysis on Data Warehouses and decision making. • What is the Data cube? • Models all combinations of multidimensional aggregate views of the warehouse. • How are the multidimensional views related? • A lattice models the partial order of the views and shows if a view can be computed from another. Advanced Database Technologies

12 34 8 98 11 Data Warehousing and OLAP • What defines the hierarchical relationship between two views: • View A can be computed by view B if the dimensions of A are a subset of the dimensions of B. • Example: We can compute the cuboid (product,city) from the cuboid (product, city, time). • View A can be computed by view B if each dimension in A is at the same or higher hierarchical conceptual level than the corresponding dimension in B. • Example: We can compute the cuboid (product,country) from the cuboid (product, city). Advanced Database Technologies

12 34 8 98 11 View Selection in Data Warehouses • What is the view selection problem? • If we compute and materialize the whole cube we can answer efficiently every OLAP query. • However: • The size of the cube can be very large, thus we may not be able to store all views due to space constraints. • When the warehouse is updated, we may not have enough time to update all materialized views (maintenance cost constraint). • Therefore, given space and update time constraints we have to choose which set of views to materialize based on their benefit in answering queries Advanced Database Technologies

12 34 8 98 11 View Selection in Data Warehouses • Two approaches in view selection: • Static view selection: Given specific query statistics, space and time constraints select the best views to materialize (paper: Implementing Data Cubes Efficiently). • Dynamic view materialization: The set materialized views is determined and dynamically based on (i) the frequency with which they are retrieved, (ii) their benefit to answer other queries, (iii) their space, (iv) their maintenance cost (paper: DynaMat). Advanced Database Technologies

12 34 8 98 11 View Selection in Data Warehouses • Static view selection: • How are the views selected? • Using a greedy algorithm. • Dynamic view materialization: • How are the views replaced in the buffer pool? • Using a replacement policy that considers the benefit of a materialized result dynamically. • Which views are maintained during an update? • Those which can be maintained fast and provide a larger benefit. Advanced Database Technologies

Strings and Biological Data • What is approximate string matching? • Given a text t, a query q and a distance threshold k, find all substrings in t whose distance from q is at most k. • What distance function is used? • The edit distance, which counts the minimum number of operations (insert, delete, replace) to transform one string to the other. • How is the problem solved exactly? • Using an (expensive) dynamic programming algorithm. Advanced Database Technologies

Strings and Biological Data • How does the dynamic programming algorithm work? • Problem: find the edit distance between strings x and y. • Create a (|x|+1)(|y|+1) matrix C, where Ci,j represents the minimum number of operations to match x1..i with y1..j. The matrix is constructed as follows. • Ci,0 = i • C0,j = j • Ci,j = • Ci-1,Cj-1 if xi=yi, • 1+min(Ci-1,Cj, Ci,Cj-1, Ci-1,Cj-1), else. • How is it adapted for substring matching? • The difference is that we set C0,j=0 for all j, since any text position could be the potential start of a match. • If the similarity distance bound is k, we report all positions, where Cm ≤k (m is the last row – m = |q|). Advanced Database Technologies

Strings and Biological Data • What can we do to reduce the cost of DP? • We can use cheap filtering techniques that avoid applying DP at each position of t. • They do not provide solutions, but aim to reduce the search effort. • They are based on the idea that it is most probable for a text area not to match a query rather than to match it. • They work in two steps. • First a cheap heuristic is used to determine whether an area of the text t could match with the query q. • If not search is abandoned for this area, otherwise an expensive search algorithm (i.e., dynamic programming) is applied Advanced Database Technologies

Strings and Biological Data • Other ways to perform substring matching • Build a suffix-tree for t and use it for approximate and exact queries. • If t is large the suffix tree cannot be built in memory. • We can solve the problem, by splitting the possible substrings into groups and build a suffix tree in memory for each of them. • Paper: A Database Index to Large Biological Sequences (optional reading) Advanced Database Technologies

Strings and Biological Data • Other ways to perform substring matching • All substrings are transformed to high dimensional points and indexed. • A query is then applied on the indexes using a lower bound of the edit distance. • Given an alphabet Σ = {α1,α2,...,αd}, we can transform a string to a d-dimensional point, called frequency vector, which stores the global frequence of each character in the substring. • The idea is extended to capture also the local frequencies of the characters in the substring, using wavelet transforms. Advanced Database Technologies

Semi-structured Data <A> xxx <C> <D>yyy</D> <D>zzz</D> </C> </A> • What are semi-structured data? • In various application domains, the data are semi-structured; the database schema is loose-defined. • What is XML? • A language like HTML, where tags describe the data itself. Tags are called elements in XML. XML can be used to describe semi-structured data. • Why do we need special management techniques for semi-structured data? • There is no well-defined schema, so we cannot store the data in relational tables in an efficient way. Advanced Database Technologies

Semi-structured Data <A> xxx <C> <D>yyy</D> <D>zzz</D> </C> </A> • How do we store semi-structured data in a database? • Solution 1: • Use specialized storage methods, query languages and query evaluation techniques for semi-structured data. • Solution 2: • Represent XML data in relational tables, transform queries to SQL, and use the mature relational DB technology. Advanced Database Technologies

Semi-structured Data <A> xxx <C> <D>yyy</D> <D>zzz</D> </C> </A> • How can we model semi-structured data? • An XML document can be represented as a node-labeled graph. • The labels of the graph are element tags, attribute names and values. • Most documents can be represented by trees. The edges that transform a tree to a graph come from ID references. Advanced Database Technologies

Semi-structured Data <A> xxx <C> <D>yyy</D> <D>zzz</D> </C> </A> • What types of queries apply on XML data? • absolute path expressions. • Example: book/author/name/lastname/Smith • simple path expressions. • Example: //author/name/lastname/Smith • regular path expressions. • Example: //author//Smith • expressionsof complex structure. • Example: book[/title/XML][//author//Smith] Advanced Database Technologies

Indexing Semi-structured Data <A> xxx <C> <D>yyy</D> <D>zzz</D> </C> </A> • Several storage schemes and indexes have been proposed for XML queries. • Some of them index the paths or subgraphs of the XML structures. • Example: the A(k)-index • Some decompose the information and flatten it into relational DB tables. • Example: The method described in the Structural Joins paper Advanced Database Technologies

Indexing Semi-structured Data <A> xxx <C> <D>yyy</D> <D>zzz</D> </C> </A> • The A(k)-index: • Indexes simple paths. Useful for simple path queries. • Creates a structural summary of the XML graphs. All the paths up to length k in the documents are preserved into the summary graph. • Based on the notion of k-bisimilarity. Advanced Database Technologies

Flattening XML data to relational tables <A> xxx <C> <D>yyy</D> <D>zzz</D> </C> </A> • The position of each element/attribute occurrence is represented as a 3-tuple (Document-id, StartPos:EndPos, LevelNum) • Values (text) is encoded using (Document-id, StartPos, LevelNum): • Document-id is the id of the document that contains the element • StartPos is the number of words from the beginning of the document until the start of the element • EndPos is the number of words from the beginning of the document until the end of the element • LevelNum is the nesting depth of the element Advanced Database Technologies

book title author Smith XML XML query processing <A> xxx <C> <D>yyy</D> <D>zzz</D> </C> </A> • The query is broken into binary parent-child or ancestor descendant relationships. • Example: • book[/title/XML][//author//Smith] • Broken to: • book/title • title/XML • book//author • author//Smith Advanced Database Technologies

XML query processing <A> xxx <C> <D>yyy</D> <D>zzz</D> </C> </A> • Each binary query is executed as a join, and their results are “stitched” together to formulate the results of the whole query. • Thus the “heart” of XML query processing is the algorithm that joins the elements table to retrieve the results for each individual query component. • A simple, tree merge join may perform many passes to the “inner” table, one for each element in the outer table that matches the elements there. In order to avoid this a stack-tree join algorithm is proposed. Advanced Database Technologies

XML query processing <A> xxx <C> <D>yyy</D> <D>zzz</D> </C> </A> • How does the stack-tree join algorithm work? • It keeps a stack to keep nested AList elements which are in the same path as the current element from DList. • When a qualifying element in DList is found, all elements of AList in the stack are output. Advanced Database Technologies

DBMSs on Modern Machines • Memory and disk latency are not improving as fast as CPU processing speed, memory bandwidth and disk bandwidth: CPU(kHz) mem.lat(ns) • The new bottleneck is memory access and DB operators have to be tuned 100 disk.lat(ms) mem.bw 10 • Also random accesses become very expensive and indexes less efficient disk.bw 1 1985 1990 1995 2000 • DBMSs have to be redesigned for the characteristics of the new machines Advanced Database Technologies

The new bottleneck: Memory Access • Memory is now hierarchical: two levels of caching CPU L1 cache-line CPU die L1 cache L2 cache-line L2 cache Memory page Main memory • Memory-latency: the time needed to transfer 1 byte from the main memory to the L2 cache. • Cache (L2) miss: if the requested data is not in cache and needs to be fetched from main memory • Cache-line: The transfer unit from main memory to cache (e.g., L2 cache-line = 128 bytes) Advanced Database Technologies

The new bottleneck: Memory Access • Why is there memory latency? • Accessing a specific address from memory loads also the information next to it in the cache-line (in parallel over a wide bus). A cache-line at a time is accessed from main memory. • If the requested information is already in cache (e.g., the gender of the second tuple), main memory is not accessed. Otherwise, a cache-miss occurs. • “trash” may be loaded to cache together with the useful information. Therefore data which are likely to be accessed together by a program should be placed close in main memory. Advanced Database Technologies

Storage schemes • How are database relations stored in memory/disk? • The Normal Storage Model stores the tuples sequentially. • The Decomposed Storage Model breaks the relation into binary tables <id,att1>, <id,att2>, ... , <id,attn>. • The Partition Attributes Accross (PAX) model stores the relation like NSM, but decomposes the information in the pages, in order to reduce the cache misses (optional reading). • What is the impact of these storage schemes in query processing • How many cache misses will occur during a selection on a relation if it stored by NSM, or DSM? What if we use unary tables? • There is a trade-off between fetching garbage to trash (NSM) and joining decomposed data (DSM). Advanced Database Technologies

Performing Joins in Main Memory • Join operators (which may require multiple passes on the data) should be implemented with care in main memory. • What techniques can we use to speed-up join processing? • Preprocess the data to get rid of irrelevant information to the join. • Use blocking and partitioning to reduce the problem into small problems that can be performed in cache. • Use radix-clustering to perform hashing in multiple passes that reduce the overall hash-join cost. Advanced Database Technologies

Database Indexing in Main Memory • In future databases all data but few large tables will be memory-resident. • Therefore is it important to build efficient main-memory indexes. • These indexes should consider the hierarchical memories and the memory-access bottleneck. • Indexes like the B+-tree and the R-tree are also suitable for main memory, if the node size is set to the cache-linesize. Advanced Database Technologies

Database Indexing in Main Memory • What techniques are used to improve the performance of these structures in memory? • Pointer elimination. Pointers are eliminated and the children of a specific node are allocated in sequential memory blocks. • How is the size of the structure and the search cost affected if pointer elimination is used? • The capacity of each node (nearly) doubles. Thus the height of the tree becomes shorter and queries are processed faster. • Hard-wiring binary search. The binary search code is replaced by if-else statements. We may choose to leave some key-slots empty in order to reduce the depth of the search tree. • Quantization and compression (for R-trees). We replace exact MBR coordinates by relative coordinates to the MBR of the containing node. We can compress further these coordinates by approximating them in a quantized grid. Advanced Database Technologies

The format of the final exam • It consists of simple problems which require that you have digested the material • There will be problems from (almost) all topics covered in the course. • The problems do not require knowledge of the technical details. • However, you should be able to apply the simple versions of the algorithms if you are asked to for simple problems: • Example 1: I may give you an AList and a DList, and I want you to show me the steps (and stack) of the structural join algorithm (topic: semi-structured data). • Example 2: I may give you the data warehouse settings and ask you to find the views selected for materialization if the greedy algorithm is applied (topic: data warehousing). Advanced Database Technologies

CSIS7101: Course summary