Chapter 8. Storage Management and Indexing Techniques

Seoul National University, Department of Computer Engineering, OOPSLA Lab.

Table of Contents • Storage Techniques for Relational DBMS • Storage Techniques for Objects • Clustering Techniques • Indexing Techniques for OODBMS • Object Identifiers • Swizzling

Storage Techniques for Relational DBMS • Disk Organization • Storing Records in RDBMS • Addressing Records with a Slot Vector

Disk Organization • Disk → partitions → segments → pages/blocks • Disk header • # of partitions • the address and the size of each partition • log for recovery in case of a system crash • Page addresses for each segment are stored in tables • Page = page header + offsets of objects + objects

[Figure: disk layout. The disk holds a header, a log, and partitions 1 to n; each partition holds segments 1 to m; each segment holds pages. Labels on a page: header, array of offsets, total free space, adjacent free space.]

Storing Records in RDBMS • Fixed length records • normally stored contiguously on the disk • all the records of a relation can be stored in a single file • Variable length records • stored directly on the disk with an ID • the structure of the ID strongly affects retrieval speed • Structure of ID in System R • high order bits for the segment and the page of the file • low order bits for a record within a page
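The high-order/low-order split described above can be sketched as simple bit packing. This is a minimal illustration only: the 8-bit slot width and the function names are assumptions, not the actual System R record format.

```python
# Hypothetical layout: the bit width below is an assumption for illustration,
# not the real System R format.
SLOT_BITS = 8                          # low-order bits: record (slot) within a page
SLOT_MASK = (1 << SLOT_BITS) - 1


def make_rid(page_no, slot_no):
    """Pack a page number (high-order bits) and a slot number (low-order bits)."""
    if not 0 <= slot_no <= SLOT_MASK:
        raise ValueError("slot number does not fit in the low-order bits")
    return (page_no << SLOT_BITS) | slot_no


def split_rid(rid):
    """Recover (page_no, slot_no) from a packed record ID."""
    return rid >> SLOT_BITS, rid & SLOT_MASK


rid = make_rid(page_no=1234, slot_no=7)
page_no, slot_no = split_rid(rid)
```

Because the page number can be read straight out of the ID, locating a record needs no extra lookup table: one page fetch, then one step inside the page.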

Addressing Records with a Slot Vector • Advantages • as fast as using the complete address of a record • the length of records can be changed • the records can be relocated • often faster than using a purely logical ID
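A toy slotted page makes these advantages concrete. The sketch below is an assumption-laden illustration (the class and method names are hypothetical, and a real page has a fixed size and a header): records are addressed by slot number, so they can grow or be moved inside the page without changing their external address.

```python
class SlottedPage:
    """Toy page: a slot vector maps slot numbers to (offset, length) pairs."""

    def __init__(self):
        self.data = bytearray()   # record bytes, in physical order
        self.slots = []           # slot number -> (offset, length)

    def insert(self, record):
        self.slots.append((len(self.data), len(record)))
        self.data += record
        return len(self.slots) - 1            # the slot number never changes

    def read(self, slot_no):
        offset, length = self.slots[slot_no]
        return bytes(self.data[offset:offset + length])

    def update(self, slot_no, record):
        offset, length = self.slots[slot_no]
        if len(record) <= length:
            # Fits in place; any leftover bytes become a hole until compact().
            self.data[offset:offset + len(record)] = record
            self.slots[slot_no] = (offset, len(record))
        else:
            # Record grew: relocate it to the end and repoint the slot.
            self.slots[slot_no] = (len(self.data), len(record))
            self.data += record

    def compact(self):
        """Squeeze out holes; only the slot vector is rewritten."""
        new_data = bytearray()
        for slot_no, (offset, length) in enumerate(self.slots):
            self.slots[slot_no] = (len(new_data), length)
            new_data += self.data[offset:offset + length]
        self.data = new_data


page = SlottedPage()
a = page.insert(b"alice")
b = page.insert(b"bob")
page.update(b, b"robert")    # the record grows and is relocated
page.compact()               # records move, slot numbers stay valid
```

Callers keep using slot numbers `a` and `b` throughout; neither the update nor the compaction invalidates them.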

Storage Techniques for Objects • Structure of Objects • Access Patterns to Objects • Approaches to Storage Organization for Objects • Storage and Variable Length and Large Attributes • Storage and Inheritance Hierarchy

Structure of objects • Storage/memory organization must support • objects with both atomic and complex attributes • objects with multi-valued attributes • objects with variant attributes • objects with long field attributes such as multimedia information, texts, images, voice, etc • Efficiency of storage organization depends on • structure of objects and their relations • access pattern which is the way in which the application programs access the objects

Categories of Access Patterns • Access based on the whole object • for applications which execute complex manipulations of objects by means of specialized programs • the whole object is copied into the application's memory • direct model • Access based on the attributes of the object • appropriate when large objects need to be accessed • used to retrieve attributes of objects along the aggregation hierarchy • normalized model

Direct Model of Storage Organization(1) • Objects are stored in the same way in which they are defined in the conceptual schema • storage unit = semantic unit • objects of the same class are stored in the same file • Advantages • simplest, and the same as the one used in RDBMS • transferring a whole object is very efficient • Disadvantages • accesses to a set of attributes of an object can be very expensive

Direct Model of Storage Organization(2) • Situations where direct model is inefficient • variable length attribute • new attributes • the majority of attributes have the null value

Normalized Model of Storage Organization • Decompose an object into atomic components • Each component is stored in a different file • Relations between the components are maintained by OIDs

Intermediate Approach • Complex objects are decomposed • Components are grouped together according to access patterns to be stored in the same file • Problem • efficiency depends on having prior knowledge of the exact access pattern for applications

Variable-length and Large Attributes • Normalized method • Property list method • Stream (or demand-page) mechanism • portions of the object can be transferred in increments

Property List(1) • Sequence of triples < identifier, size, value > • identifier : which attribute of the object is stored • size : # of bytes stored • value : the attribute's value (of varying size)

Property List(2) • Advantages • handles variable length attributes • objects can have different sets of attributes • handles sparse attributes • attributes can be stored in different physical locations • Disadvantages • the whole property list must be scanned to find the desired attribute • the property list must be transformed into the proper format for the application programming language
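The triple format and the scan cost can be sketched in a few lines. This is a minimal illustration: the 2-byte identifier and 4-byte size fields are assumed widths, and the attribute numbers are hypothetical.

```python
import struct

# Assumed layout: 2-byte attribute identifier, 4-byte size, then the value bytes.
HEADER = struct.Struct(">HI")


def encode(attrs):
    """Serialize {attribute_id: value_bytes} as <identifier, size, value> triples."""
    out = bytearray()
    for attr_id, value in attrs.items():
        out += HEADER.pack(attr_id, len(value)) + value
    return bytes(out)


def find(buf, wanted_id):
    """Scan the triples in order until the wanted attribute is found."""
    pos = 0
    while pos < len(buf):
        attr_id, size = HEADER.unpack_from(buf, pos)
        pos += HEADER.size
        if attr_id == wanted_id:
            return buf[pos:pos + size]
        pos += size                      # skip this value using its size field
    return None                          # attribute absent (sparse or null)


# Only attributes 1 and 7 are present; absent attributes cost no space.
buf = encode({1: b"Jones", 7: b"\x00" * 10})
```

Absent attributes take no space, which is why the format suits sparse objects; the price is the linear scan in `find`.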

Storage and Inheritance Hierarchy • Attributes of the superclass should be stored • Single inheritance • store the attributes of the superclass first, then those of the subclass • variable length attributes are stored alongside, using the property list • Multiple inheritance • property lists • storing objects separately • each of the above contains the fields for the superclass, and the parts are linked to one another

Clustering Techniques • Clustering in DBMS • Clustering in RDBMS • Clustering in OODBMS • Static Clustering • Dynamic Clustering • Clustering for Multiple Relations

Clustering in DBMS • Focus • partitioning objects in the database • placing these partitions on disk • Aim • reduce the number of I/O operations on disk • Consideration • structure of the objects • access pattern of applications

Clustering Techniques for RDBMS • Tuples of a relation in the same page segment • on the basis of the value of an attribute or of a combination of attributes in a relation • Tuples of more than one relation in the same segment • one or more attributes in common with the same values • efficient for processing queries with join operation

Clustering Techniques for OODBMS • New considerations compared with RDBMS • complex objects • single or multiple inheritance • methods • Linear clustering sequence for complex object • all the descendant nodes of each node p in the hierarchy are stored immediately after p in depth-first order • efficient on retrieval of an object and all its descendants
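The linear clustering sequence is a pre-order depth-first walk of the composition hierarchy. A minimal sketch, assuming the complex object is a tree (no shared sub-objects); the `children` mapping and the object `Division[m]` are hypothetical illustration data:

```python
def clustering_sequence(root, children):
    """Return objects in depth-first order: every node is immediately
    followed by all of its descendants."""
    order = []
    stack = [root]
    while stack:
        node = stack.pop()
        order.append(node)
        # Push children reversed so the first child is visited first.
        stack.extend(reversed(children.get(node, [])))
    return order


# Hypothetical complex object: a project, its company, two divisions, one head.
children = {
    "Project[i]": ["Company[k]"],
    "Company[k]": ["Division[k]", "Division[m]"],
    "Division[k]": ["Person[x]"],
}
sequence = clustering_sequence("Project[i]", children)
```

Writing the objects to disk in this order means an object and all its descendants occupy a contiguous run, so retrieving the whole subtree is one sequential read.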

Basic Options for Clustering for OODBMS • Proposed by Won Kim in 1990 • both clustering techniques used in RDBMS • clustering all the instances of classes which belong to an aggregation hierarchy • clustering all the instances of classes which belong to the inheritance hierarchy • combination of the two previous strategies • The clustering strategies above are static

Static Clustering • Unchangeable at run-time • Problems • no consideration of the dynamic evolution of objects • objects can be shared among several objects • the clustering scheme is based on a single access pattern

Dynamic Clustering • The sequence of creation of objects would NOT be the same as the desired clustering sequence • Reorganizing and recompacting pages in a cluster • Types of file reorganization • on-line : finding the optimal reorganization is an NP-complete problem • off-line : the open question is when the reorganization should be done • On-line reorganization technique by Chen, Hurson • chunks (sets of pages) as the unit of clustering • cost model • ratio between the read and write operations

Clustering for Multiple Relations • Certain relationships can be used more frequently • Directed graph • nodes for objects • arcs for relationships • weights for ordering relationships • Clustering algorithm with levels by Chen, Hurson • arranges all the nodes of the graph in a linear sequence • nodes connected by heavier arcs are placed nearer to each other than others • access time is around half that for randomly placed objects
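The idea of placing heavily connected nodes close together can be illustrated with a simple greedy ordering. This is NOT the Chen and Hurson level-based algorithm, only a stand-in heuristic; the graph, the weights, and treating arcs as undirected are all assumptions made for the sketch.

```python
def greedy_sequence(nodes, weights, start):
    """Order nodes linearly, always appending the unplaced node that has the
    heaviest arc to the most recently placed node.

    weights maps frozenset({a, b}) -> arc weight (arcs treated as undirected
    for simplicity; missing arcs count as weight 0)."""
    placed = [start]
    remaining = [n for n in nodes if n != start]
    while remaining:
        last = placed[-1]
        nxt = max(remaining, key=lambda n: weights.get(frozenset((last, n)), 0))
        placed.append(nxt)
        remaining.remove(nxt)
    return placed


# Hypothetical object graph: higher weight = relationship traversed more often.
weights = {
    frozenset(("A", "B")): 5,
    frozenset(("A", "C")): 1,
    frozenset(("B", "D")): 4,
    frozenset(("C", "D")): 2,
}
order = greedy_sequence(["A", "B", "C", "D"], weights, start="A")
```

The heavy arcs A-B and B-D end up between neighbours in the sequence, while the light arc A-C spans the whole sequence.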

Indexing Techniques for OODBMS • Indexing Techniques for Aggregation Hierarchy • Index Structures and Operations • Comparison of Index Organization • Indexing Techniques for Inheritance Hierarchy • Precomputing and Caching

Preliminary Definitions • Path • a branch in an aggregation hierarchy • Path instantiation • a sequence of objects obtained by instantiating the path • Nested index • an index for a direct connection between the starting object and the ending object of the path instantiation • Path index • an index for storing instantiations of a path • same index key as the nested index

Example of Aggregation Hierarchy • [Figure: aggregation hierarchy over the classes Project, Company, Division, Person]

Definition of Path • Given an aggregation hierarchy H, a path P is defined as C1.A1.A2…..An (n ≥ 1) where • C1 is a class in H • A1 is an attribute of class C1 • Ai is an attribute of class Ci in H, such that Ci is the domain of the attribute Ai-1 of class Ci-1 (1 < i ≤ n) • length(P) : the length of the path • classes(P) : the set of classes along the path • dom(P) : the domain of attribute An of class Cn

Examples of Path • P1 : Project.main_contracting_company.divisions.head.name • length(P1) = 4 • classes(P1) = { Project, Company, Division, Person } • dom(P1) = STRING • P2 : Company.divisions.city • length(P2) = 2 • classes(P2) = { Company, Division } • dom(P2) = STRING

Definition of Complete Instantiation • A complete instantiation (CI) is a sequence of objects along a path • Given the path P = C1.A1.A2…..An, a CI is denoted as O1.O2…..On+1, where • O1 is an instance of class C1 • Oi is the value of the attribute Ai-1 of object Oi-1 • Oi = Oi-1.Ai-1 or Oi ∈ Oi-1.Ai-1 (1 < i ≤ n+1) • Examples of CI, where the path is given as P1 • Project[i].Company[k].Division[k].Person[x].Jones • Project[j].Company[i].Division[h].Person[y].Smith

Definition of Partial Instantiation • A partial instantiation (PI) is a part of a CI that ends at the last object of the CI • Given a path P = C1.A1.A2…..An, a PI is denoted as O1.O2…..Oj (j < n+1), where • O1 is an instance of class Ck in classes(P) such that k+j-1 = n+1 • Oi is the value of attribute Ai-1 of an object Oi-1 • Examples of PI, where the path is given as P1 • Division[k].Person[x].Jones • Division[h].Person[y].Smith

Definition of Redundancy • Given a PI O1.O2…..Oj, it is not redundant • if there is no CI or PI O'1.O'2…..O'k, where k > j and Oi = O'k-j+i (i = 1,...,j) • Examples of redundant PI • Division[k].Person[x].Jones is redundant with respect to Project[i].Company[k].Division[k].Person[x].Jones • Division[h].Person[y].Smith is redundant with respect to Project[j].Company[i].Division[h].Person[y].Smith

Definition of Projection of Path • A projection of a path is the part of a CI or PI that begins at its first object • <m>(p) denotes a projection of p with a length m • P = C1.A1.A2…..An • as a PI (or CI) of P, p = O1.O2.O3…..Oj (j ≤ n+1) • <m>(p) = O1.O2…..Om (m < j) • Example • <2>(Project[i].Company[k].Division[k].Person[x].Jones) = Project[i].Company[k]
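With instantiations represented as tuples, projection is a prefix and redundancy is a suffix test. A minimal sketch, assuming that tuple encoding; the helper names `projection` and `is_redundant` are hypothetical.

```python
def projection(p, m):
    """<m>(p): the first m objects of an instantiation p."""
    return p[:m]


def is_redundant(pi, others):
    """A partial instantiation is redundant if some longer instantiation
    ends with exactly the same objects."""
    return any(len(o) > len(pi) and o[-len(pi):] == pi for o in others)


ci = ("Project[i]", "Company[k]", "Division[k]", "Person[x]", "Jones")
pi = ("Division[k]", "Person[x]", "Jones")      # ends at the last object of ci
other = ("Division[h]", "Person[y]", "Smith")   # no longer instantiation given

prefix = projection(ci, 2)
```

Here `pi` is redundant because `ci` already contains it as a suffix, while `other` is not, since no longer instantiation in the list ends with it.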

Multi-index • An index on each of the classes constituting the path • A multi-index is a set of n simple indices I1, I2,…,In • given a path P = C1.A1.A2…..An • Ii is an index defined on Ci.Ai, 1 ≤ i ≤ n • Solving a nested predicate scans n indices • first scanning the last index In on the path • the results of the scan using Ii are used as keys for Ii-1 • Supports only reverse traversal scanning strategies • Low updating cost

Examples of Multi-index • First index I1 on Project.main_contracting_company • (Company[k], {Project[i]}) • (Company[i], {Project[j], Project[l]}) • Second index I2 on Company.divisions • (Division[h], {Company[i]}) • (Division[i], {Company[i]}) • (Division[k], {Company[k]}) • Third index I3 on Division.city • (Boston, {Division[h]}) • (New York, {Division[i]}) • (Los Angeles, {Division[k]})

Example of Using Multi-index • Select all the projects with a main contracting company which has a division in Los Angeles • Scanning index I3 with the key-value = Los Angeles • {Division[k]} • Scanning index I2 with the key-value = Division[k] • {Company[k]} • Scanning index I1 with the key-value = Company[k] • {Project[i]} • Result: {Project[i]}
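The reverse traversal above can be replayed in code using the three example indices. A minimal sketch: the index contents are the ones from the slides, while the dictionary-of-sets encoding and the `solve` helper are assumptions.

```python
# Example indices from the slides: key -> set of objects that reference the key.
I1 = {"Company[k]": {"Project[i]"},
      "Company[i]": {"Project[j]", "Project[l]"}}
I2 = {"Division[h]": {"Company[i]"},
      "Division[i]": {"Company[i]"},
      "Division[k]": {"Company[k]"}}
I3 = {"Boston": {"Division[h]"},
      "New York": {"Division[i]"},
      "Los Angeles": {"Division[k]"}}


def solve(indices, key):
    """Scan the last index first; the results of scanning Ii become the
    keys used to scan Ii-1."""
    keys = {key}
    for index in reversed(indices):
        keys = set().union(*(index.get(k, set()) for k in keys))
    return keys


projects = solve([I1, I2, I3], "Los Angeles")
```

One lookup per key is issued at every level, so a predicate on a path of length n costs n index scans.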

Join Index • Designed to perform joins efficiently in the relational model • Binary join index (BJI) for a binary relation (r, s) • one index clustered on r • the other index clustered on s • A BJI can be used in a multi-index organization • supports reverse traversal • gives faster forward traversal when access to objects is expensive, since the objects themselves need not be fetched from the database • more suitable for complex queries

Nested Index • Direct association between the ending object and the starting object in a path • Given a path P = C1.A1.A2…..An, a nested index on P is defined as a set of pairs (O,S) • S = {O' such that there is a CI O1.O2…..On+1 where O' = O1 and O = On+1} • Examples • (Boston, {Project[j]}) • (New York, {Project[j], Project[k], Project[l]}) • (Los Angeles, {Project[i]})

Properties of Nested Index • Retrieval is fast because only one index is scanned • Problem on update operation • requires access to several objects • forward traversal to determine the value of the indexed attribute • reverse traversal to determine all instances at the beginning of the path ==> inverse references
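A nested index keeps only the two ends of each complete instantiation. A minimal sketch, assuming instantiations are available as tuples; the five tuples are assembled from the example objects used in the slides, and `build_nested_index` is a hypothetical helper.

```python
from collections import defaultdict

# Complete instantiations of Project.main_contracting_company.divisions.city,
# assembled from the example objects in the slides.
COMPLETE_INSTANTIATIONS = [
    ("Project[i]", "Company[k]", "Division[k]", "Los Angeles"),
    ("Project[j]", "Company[i]", "Division[h]", "Boston"),
    ("Project[j]", "Company[i]", "Division[i]", "New York"),
    ("Project[k]", "Company[m]", "Division[j]", "New York"),
    ("Project[l]", "Company[i]", "Division[i]", "New York"),
]


def build_nested_index(instantiations):
    """Map the ending object (key) to the set of starting objects."""
    index = defaultdict(set)
    for ci in instantiations:
        index[ci[-1]].add(ci[0])      # the intermediate objects are dropped
    return dict(index)


nested = build_nested_index(COMPLETE_INSTANTIATIONS)
```

Retrieval is a single dictionary lookup. The dropped intermediate objects are exactly what an update has to rediscover by traversing the path, which is the source of the update cost.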

Path Index • Given a key, all the path instantiations ending in that key are stored • Given a path P = C1.A1.A2…..An, a path index on P is defined as a set of pairs (O,S) where • S = {<j-1>(pi) such that pi = O1.O2…..Oj (1 ≤ j ≤ n+1) is a CI or non-redundant PI of P, and Oj = O} • Examples • (Boston, {Project[j].Company[i].Division[h]}) • (New York, {Project[j].Company[i].Division[i], Project[k].Company[m].Division[j], Project[l].Company[i].Division[i]})

Properties of Path Index • Can solve nested predicates for all classes along the path • Updates of a path index • only forward traversals are required • Identical to the nested index when n = 1
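A path index stores the whole instantiation under each key, so one scan can answer a predicate for any class on the path. A minimal sketch under assumptions: tuples stand in for instantiations, only complete instantiations are handled (no partial ones), positions are zero-based, and both helpers are hypothetical.

```python
from collections import defaultdict

# Complete instantiations assembled from the example objects in the slides.
COMPLETE_INSTANTIATIONS = [
    ("Project[j]", "Company[i]", "Division[h]", "Boston"),
    ("Project[j]", "Company[i]", "Division[i]", "New York"),
    ("Project[k]", "Company[m]", "Division[j]", "New York"),
    ("Project[l]", "Company[i]", "Division[i]", "New York"),
]


def build_path_index(instantiations):
    """Map each key to the set of instantiations that end in it,
    stored without the key itself (the projection <j-1>)."""
    index = defaultdict(set)
    for ci in instantiations:
        index[ci[-1]].add(ci[:-1])
    return dict(index)


def lookup(index, key, position):
    """One index scan, then extract the objects at the given position
    (zero-based: 0 = Project, 1 = Company, 2 = Division)."""
    return {p[position] for p in index.get(key, set())}


path_index = build_path_index(COMPLETE_INSTANTIATIONS)
companies = lookup(path_index, "New York", 1)
```

The same scan that answers "which projects" also answers "which companies" or "which divisions", which a nested index cannot do.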

Access Relations • Similar to path indices • storing all instantiations along a path in a relation • Examples • <Project[i], Company[k], Division[k], Los Angeles> • <Project[j], Company[i], Division[h], Boston> • <Project[j], Company[i], Division[i], New York> • <Project[k], Company[m], Division[j], New York> • <Project[l], Company[i], Division[h], Boston> • <Project[l], Company[i], Division[i], New York> • Several subpaths to different relations

Index Structures using B+tree • Structure of the internal node • n records of <key-length, key, pointer> • A record of a leaf node in a nested index • record-length • key-length, key-value • # of OIDs associated with the key • list of OIDs • A record of a leaf node in a path index • record-length • key-length, key-value • # of the path instantiations associated with the key • list of path instantiations

Operations with Nested Index • To solve a predicate against a nested attribute An of class C1 • single index scan • same cost as solving a predicate on a simple attribute of C1 • For update operation • one forward traversal to find the old key value • a second forward traversal to find the new key value • one reverse traversal to find the OID of the associated object

Operations with Path Index • To solve a predicate against the nested attribute An of class Ci (1 ≤ i ≤ n) • one index scan • determine the PIs or CIs associated with the key value • extract the OIDs occupying the i-th position in them • For update operation • one forward traversal to find the old path instantiation • a second forward traversal to find the new path instantiation

Comparisons of Index Organizations(1) • Degree of reference sharing • important in evaluating an index organization • a reference is shared when two or more objects refer to the same object • Retrieval operation • the nested index has the lowest cost, so it outperforms the path index • the path index has a lower cost than the multi-index • the path index can solve predicates for all the classes along a path; the nested index cannot

Comparisons of Index Organizations(2) • Update operation • the multi-index has the lowest cost • for paths of length 2 • the nested index has a slightly lower cost than the path index • for paths of length greater than 2 • the nested index has a slightly lower cost than the path index if the updates are executed on the first two classes • in other cases, the nested index involves a significantly higher cost

Indexing Techniques for Inheritance Hierarchies • Scope of a query • only a given class C • the class C and the inheritance hierarchy rooted in C • Solution based on conventional indices • construct an index on an attribute for each of the classes of the subgraph • scan all these indices • take the union of their results
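The conventional-index solution amounts to one lookup per class in scope, followed by a union. A minimal sketch: the class names, the objects, and the indexed attribute (a city) are all hypothetical illustration data, and the hierarchy is assumed to use single inheritance.

```python
# Hypothetical inheritance hierarchy: class -> direct subclasses.
HIERARCHY = {"Person": ["Employee", "Student"], "Employee": ["Manager"]}

# One conventional index per class on the same attribute: key -> OIDs.
INDICES = {
    "Person": {"Boston": {"Person[1]"}},
    "Employee": {"Boston": {"Employee[4]"}, "Seoul": {"Employee[5]"}},
    "Manager": {"Boston": {"Manager[9]"}},
    "Student": {},
}


def classes_in_scope(root):
    """The class itself plus every class in the hierarchy rooted in it."""
    scope, stack = [], [root]
    while stack:
        cls = stack.pop()
        scope.append(cls)
        stack.extend(HIERARCHY.get(cls, []))
    return scope


def query(root, key, whole_hierarchy=True):
    """Scan one index per class in scope and take the union of the results."""
    scope = classes_in_scope(root) if whole_hierarchy else [root]
    result = set()
    for cls in scope:
        result |= INDICES.get(cls, {}).get(key, set())
    return result


everyone_in_boston = query("Person", "Boston")
```

The number of index scans grows with the number of classes in the subgraph, which is the cost this solution pays for reusing ordinary single-class indices.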
