1 / 137

CMPT 454

File Storage and Indexing. CMPT 454. Files and Indexing. File Organizations Indices Types of index Tree based indexing Hash based indexing. Finding Records. Accessing a table Consider SELECT * FROM Customer To make access to a table efficient Store the table on adjacent blocks

donat
Download Presentation

CMPT 454

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. File Storage and Indexing CMPT 454

  2. Files and Indexing • File Organizations • Indices • Types of index • Tree based indexing • Hash based indexing

  3. Finding Records • Accessing a table • Consider SELECT * FROM Customer • To make access to a table efficient • Store the table on adjacent blocks • On the same cylinder, or • Adjacent cylinders • But many queries include where clauses • File organizations should support the efficient retrieval of records within a file

  4. File Organizations

  5. File Organizations • Data in a database consists of collections of records, or data files • Each file consists of one or more blocks on a disk • Afile organization is a method of arranging records in a file • File organizations make some operations more efficient • In addition, files can be indexed, to provide multiple ways to access records efficiently • Index files contain search key values and references to records

  6. External Storage - Reminder • Database data must be persistent, so must be stored on secondary memory, such as a hard disk • Disk access is relatively inefficient, in the order of 10 to 15 milliseconds to access a single page • Hundreds of thousands of times more than an equivalent access to a main memory location • The cost of disk I/O dominates the cost of database operations • The unit that data is read from or written to disk is a block, typically 8 kilobytes • Reading several pages in sequence from a disk takes much less time than reading several random pages

  7. File Organizations • The minimum set of file operations is: • Create and destroy files • Insert and delete records • Scan an entire file • A scan brings all of the records in the file into main memory • A single DB table is usually stored as a single file • Every record in a file has a unique record ID, or rid • A rid consists of a block address, and a slot number • The simplest file structure is an unordered file, or heap file

  8. Heap Files • Heap files support insertion and deletion of records and file scans • Because the entire file can be scanned, individual records or collections of records can be found • New records are inserted where there is room • Either in slots that contained previously deleted records, or • At the end of the file • When a record is deleted, no other records are affected • That is, there is no need to reorganize remaining records

  9. Sorted Files • The records in a sorted file, or sequential file, are stored in order • Based on the sort key of the file • The attribute of the record that the file is sorted on • The basic organization assumes that file pages are full, to conserve space • Pages should be maintained in sequence • To allow for more efficient disk access • Insertions result in records being shuffled up • Deletions result in records being shuffled down

  10. Sorted Files in Practice • To avoid inefficiencies involved in inserting and deleting records from sorted files • Pages have only partial occupancy • i.e. space for future insertions is left in each page • Overflow pages can be attached (by pointers) to pages that become full • Records can be locally reorganized in adjacent pages • Sorted files may need to be periodically re-ordered

  11. Indices (or Indexes)

  12. Indexes • An index is a data structure that organizes data to optimize the retrieval of records on some criteria • An index supports efficient retrieval of records based on the search key of the index • An index can be created for a file to speed up searches that are not efficiently supported by the file's organization • A file can have more than one index • An index is a collection of data entries which must contain: • A search key value, k, and • Information to find data records with that search key value

  13. Primary Indexes • An index on a sequential file • The search key of the index is the same as the sort key of the file • Primary indexes can be either dense or sparse There are different kinds of keys Primary key Candidate key Superkey Sort key Search key

  14. Dense Indexes • A dense index is a sequence of blocks containing search key : rid pairs • The rid contains addresses to records with the search key • The blocks of the index are in the same order as the file • Searching an index is faster than searching the file • The index is smaller • The index is sorted so binary search can be used • The index may be small enough to fit in main memory • If so, once the index has been read, records can be found with one disk I/O

  15. Dense Index Two data records or four index records fit on one block file index

  16. Sparse Indexes • A sparse index usually contains one data entry for each blockof records in a data file • It is only possible to use a sparse index if the data file is sorted by the search key of the index • Sparse indexes are smaller than dense indexes • Sparse indexes are searched in much the same way as dense indexes • Except that the index is searched for the largest key less than or equal to the target value • The rid is then followed to a block of the data file

  17. Sparse Index Two data records or four index records fit on one block file index

  18. Multiple Level Indices • An index on a large data file can cover many blocks • Even using binary search, multiple disk I/Os may be needed to find a record • An alternative is to build a multiple level index • The first level of the index may be dense or sparse • Subsequent levels of the index are sparse indexes on the preceding level of the index

  19. Multiple Level Index multiple level index file

  20. Secondary Indexes • We define a primary index is an index whose search key is the same as the sort key of a sequential file • There can only be one primary index for a file • Terminology – a primary index sometimes refers to an index where the search key includes the primary key, we do not use this definition • More terminology – primary indices are also referred to as clustered • A secondary indexis an index whose search key is not the sort key of the file • Secondary indexes must be dense • Why? • Secondary indices are also referred to as unclustered

  21. Secondary Index file index

  22. Using Secondary Indexes • The pointers in a block of a secondary indexes may go to many blocks of the data file • Making secondary indexes less efficient than primary indexes for retrieval of a range of records • Heap (data) files are not ordered so require secondary indexes

  23. Clustered Files • A clustered file contains the records of two tables • Consider two tables in a one to many relationship • Where queries containing a join between the two tables are made frequently • Note the unfortunate re-use of the word clustered ... SELECT pet_name, species FROM Owner, Pet WHERE Owner.sin = 111 AND Owner.sin = Pet.sin

  24. Clustered File Organization Pets of owner1 Pets of owner2 Pets of owner3 Pets of owner4

  25. Clustered Files and Indexes • Consider a primary index on Owner • The file is sorted by owner • Individual owners can be retrieve rapidly • Pets of owners can also be retrieved rapidly • Efficiency is reduced when retrieving a range of owner data • Any index on attributes of Pets would be a secondary index

  26. Indirection • A secondary index may waste space if search key values are repeated • Each record ID is paired with a search key • Create a bucket for each set of rids associated with a search key • Follow a pointer to the bucket, then • Follow the rids in the bucket to the records • Saves space if search key values are larger than record IDs • And each key appears at least twice on average

  27. Indirection file index buckets

  28. Multiple Secondary Indexes • Multiple secondary indexes using indirection can improve efficiency on queries with complex criteria • Collect all the rids from the buckets that meet each of the criteria • Then intersect them • And only retrieve records using the result • This avoids retrieving records that match some, but not all, of the criteria

  29. Document Retrieval • There are issues related to the storage and efficient retrieval of documents • Keywords are used to identify documents • More and more documents are maintained on the web • A document can be considered as a record in a table • The record can be thought of as having Boolean attributes for each possible word in the document • An attribute is true if the word is in the document and false otherwise

  30. Inverted Indexes • There is a secondary index on each attribute (word) of a document • Only those records where an attribute is true is contained in the index • So the index leads to documents with particular words • Indices for all attributes (words) are combined into a single index • Known as an inverted index • The index uses indirect buckets for space efficiency

  31. Matching Documents • Inverted indexes of words in multiple documents can be used to satisfy queries • Using intersection documents that contain multiple target words can be found • Additional information can be maintained to match words in sections of documents • Titles, headers, and other sections • Number of occurrences of words • ...

  32. Composite Indices • An index’s search key can contain several fields • Such search keys are referred to as composite search keys or concatenated keys • e.g. {fName, lName} • For searches on equality the values for each field in the search key must match the values in record • e.g. 'Joe Smith' does not match 'Joe Jones' or 'Fred Smith' • For range queries, ranges may be specified for any fields in the search key. • If no values are specified for a field it implies that any value is acceptable for that field

  33. Tree Based Indexes

  34. Tree Index Introduction • Multiple level indexes can be very useful in speeding up queries • There is a general data structure that is used in commercial DBMSs • Known as B trees • We will look at B+ trees, a commonly used variant • B trees have two desirable properties • They keep as many levels as are required for the file being indexed • Space on tree blocks is managed so that each block is at least ½ full

  35. B Tree Structure • B trees are balanced structures • All paths from the root to a leaf have the same length • Most B trees have three levels • But any number of levels is possible • B trees are similar to binary search trees • Except that B tree nodes contain more than two children • That is, they have greater fan-out • B tree node size is chosen to be the same as a disk block

  36. B+ Tree Node Structure • The number of data entries in a node is determined by the size of the search key • Up to nsearch key values and n + 1 pointers • The value n is chosen to be as large as possible and still allow n search keys and n + 1 pointers to fit on a block • Example • If block size is 4,096, and the keys are 4 byte integers and pointers are 8 byte addresses • Find the largest value n such that 4n + 8(n + 1) ≤ 4,096 • n = 340

  37. Leaf Nodes • Search keys in leaf nodes are copies of the keys in the data file • The leaf nodes contain the keys in order • The left most n pointers point to records in the data file • A leaf node must use at least (n + 1)/2 of these pointers • The right most pointer points to the next leaf next leaf in the tree record with key 12 record with key 24 record with key 29

  38. Interior Nodes • In interior nodes, pointers point to next level nodes • Label the search keys K1 to Kn, and pointers p0 to pn • Pointer p0 points to nodes whose search key values are less than K1 • Other pointers, pi, point to nodes with search keys greater than or equal to Kiand less than Ki+1 • An interior node must use at least (n + 1)/2 pointers K < 12 12 ≤ K < 24 24 ≤ K < 29 K ≥ 29

  39. Example B+ Tree in this example n = 3 note that (n+1)/2 = 2

  40. B+ Tree Applications • B+ trees can be used to build many different indexes • The B tree could be a sparse index on a sorted data file, or • A dense index on a data file • We will assume for now that there are no duplicate values for search keys • That is the search key is a candidate key for the relation • The meaning of interior nodes changes slightly if there are duplicate search key values

  41. Searching a B Tree • The B+ tree search algorithm is similar to a BST • To search for a value K start at the root and end at a leaf • If the node is a leaf and the ith key has the value K then follow the ith pointer to the record • If the node is an interior node follow the appropriate pointer to the next (interior or leaf) node • Searching a B+ tree index requires a number of disk I/O operatuions equal to the height of the tree • Plus one I/O to retrieve the record • However the root of the tree may be in main memory

  42. B+ Tree Searches which nodes are visited in a search for 22? which nodes are visited in a search for 16?

  43. Range Queries • B trees are useful for processing range queries • A range query typically has a WHERE clause that specifies a range of values • Assume query specifies values from x to y • Search the tree for the leaf that should contain value x • Follow the leaf pointers until a key greater than y is found • The tree can also be used to satisfy queries that have no lower bound or no upper bound

  44. B+ Tree Insertions • Insert the record in the in the data file • Retaining the ridand the search key value, K • Insert the entry in the appropriate place in a leaf • Use the search algorithm to find the leaf node • Insert a data entry, if it fits the process is complete • If the target leaf node is full then split it • The first (n + 1) / 2 entries stay in the original node • Create a new node with the remaining (n + 1) / 2 entries to the right of the original node • Insert an entry with the first search key value from the new leaf in its parent node that points to the new leaf

  45. B +Tree Insertions • Adding an entry to an interior node may cause it to split • After inserting a new entry there should be n + 1 keys (and n + 2 pointers) • The first (n + 2) / 2 pointers stay in the original node • Create a new node with the remaining (n + 2) / 2 pointers to the right of the original node • Leave the first n / 2 keys in the original node and move the last n / 2 keys to the new node • The remaining key 's value falls between the values in the original and new node • This left over key is inserted into the parent of the node along with a pointer to the new interior node

  46. B+ Tree Insertions continued • Moving a value to a higher, interior level of the tree, may again cause a split • The same process is repeated until no further splits are required, or until a new root node has been created • If a new root is created it will initially have just one key and two children • So will be less than half full • This is permitted for the root (only)

  47. B+ Tree Insertion Example insert 2, 21 and 11 the values are maintained in order in the index pages data file

  48. B+ Tree Insertion Example insert 8 create new root with the first value of the new leaf node create a new node with the last ½ of the values chain the new node to the original node

  49. B+ Tree Insertion Example insert 64, then 5 insert 23 ... both leaf nodes are now full ...

  50. B+ Tree Insertion Example ... inserting 23 ...

More Related