

  1. Indexing CS157B Lecture 9

  2. Contents • Basic Concepts • Ordered Indices • B+-Tree Index Files • B-Tree Index Files

  3. Basic Concepts

  4. Index Evaluation Metrics

  5. Ordered Indices

  6. B+-Tree Index Files

  7. B+-Tree Node Structure • Typical node • Ki are the search-key values • Pi are pointers to children (for non-leaf nodes) or pointers to records or buckets of records (for leaf nodes) • The search keys in a node are ordered: K1 < K2 < K3 < … < Kn-1
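Not from the slides, but a minimal Python sketch of that node layout (the class and field names are illustrative):

```python
# Illustrative sketch of the node layout described above.
# A node holds up to n pointers P1..Pn and up to n-1 sorted search keys K1..Kn-1.
class BPlusTreeNode:
    def __init__(self, n, is_leaf):
        self.n = n                # maximum number of pointers per node
        self.is_leaf = is_leaf
        self.keys = []            # K1 < K2 < ... < Kn-1, kept in sorted order
        self.pointers = []        # children (non-leaf) or record/bucket pointers (leaf)
        self.next_leaf = None     # leaves are typically chained for range scans
```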

  8. Example of a B+-tree

  9. Example of B+-tree • Leaf nodes must have between 2 and 4 values (⌈(n–1)/2⌉ and n–1, with n = 5) • Non-leaf nodes other than the root must have between 3 and 5 children (⌈n/2⌉ and n, with n = 5) • The root must have at least 2 children.

  10. Queries on B+-Trees

  11. Queries on B+-Trees (Cont.) The above difference is significant, since every node access may need a disk I/O, costing around 20 milliseconds!
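As a rough sketch (assuming the node class above, which is not from the slides), a lookup follows exactly one pointer per level, so the number of node accesses (and potential disk I/Os) equals the height of the tree:

```python
from bisect import bisect_right

def bplus_find(root, key):
    """Descend one level per step; each node visited may cost one disk I/O."""
    node = root
    while not node.is_leaf:
        i = bisect_right(node.keys, key)   # which child subtree to follow
        node = node.pointers[i]
    # search within the leaf (leaf keys and record pointers are parallel lists)
    for k, ptr in zip(node.keys, node.pointers):
        if k == key:
            return ptr                     # pointer to the record or bucket
    return None
```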

  12. B+-Tree File Organization • Index file degradation problem is solved by using B+-Tree indices. Data file degradation problem is solved by using B+-Tree File Organization. • The leaf nodes in a B+-tree file organization store records, instead of pointers. • Since records are larger than pointers, the maximum number of records that can be stored in a leaf node is less than the number of pointers in a nonleaf node. • Leaf nodes are still required to be half full. • Insertion and deletion are handled in the same way as insertion and deletion of entries in a B+-tree index.

  13. B+-Tree File Organization (Cont.) Example of B+-tree File Organization

  14. B-Tree Index Files • Similar to a B+-tree, but a B-tree allows search-key values to appear only once; this eliminates redundant storage of search keys. • Search keys in nonleaf nodes appear nowhere else in the B-tree; an additional pointer field for each search key in a nonleaf node must be included. • Generalized B-tree leaf node and nonleaf node – the pointers Bi are the bucket or file-record pointers.
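A sketch of what that nonleaf layout implies (illustrative Python, not from the slides): each key in a nonleaf node carries its own record/bucket pointer Bi in addition to the child pointers.

```python
# Illustrative: in a B-tree a key appears only once, so a nonleaf node must
# carry a record/bucket pointer Bi for each key Ki alongside its child pointers.
class BTreeNode:
    def __init__(self, is_leaf):
        self.is_leaf = is_leaf
        self.keys = []            # K1 < K2 < ... (sorted)
        self.record_ptrs = []     # Bi: bucket or file-record pointer for Ki
        self.children = []        # Pi: child pointers (nonleaf nodes only)
```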

  15. B-Tree Index File Example

  16. B-Tree Index Files (Cont.) Advantages of B-Tree indices: • May use fewer tree nodes than a corresponding B+-Tree. • Sometimes possible to find a search-key value before reaching a leaf node. Disadvantages of B-Tree indices: • Only a small fraction of all search-key values are found early • Non-leaf nodes are larger, so fan-out is reduced; thus B-Trees typically have greater depth than the corresponding B+-Tree • Insertion and deletion are more complicated than in B+-Trees • Implementation is harder than for B+-Trees. Typically, the advantages of B-Trees do not outweigh the disadvantages.

  17. Index Sequential [Figure] Data file: Block 1 = {Adams, Becker, Dumpling}, Block 2 = {Getta, Harty}, Block 3 = {Mobile, Sunoci, Texaci}. Sparse index (actual value → block number): Dumpling → 1, Harty → 2, Texaci → 3.
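A small sketch (not from the slides) of how such a sparse, block-level index is used: find the first index entry whose key is at least the search key, then scan only that block. The data values are the ones from the figure above.

```python
from bisect import bisect_left

# Sparse index from the figure: highest key in each block -> block number.
index = [("Dumpling", 1), ("Harty", 2), ("Texaci", 3)]
blocks = {1: ["Adams", "Becker", "Dumpling"],
          2: ["Getta", "Harty"],
          3: ["Mobile", "Sunoci", "Texaci"]}

def lookup(key):
    # First index entry whose key is >= the search key tells us which block to read.
    i = bisect_left([k for k, _ in index], key)
    if i == len(index):
        return None                       # key is beyond the last block
    return key if key in blocks[index[i][1]] else None

print(lookup("Getta"))   # Getta (found by scanning block 2 only)
```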

  18. Indexed Sequential: Two Levels [Figure] A two-level index: the top-level index (385 → 7, 678 → 8, 805 → 9) points to second-level index blocks, whose entries (150 → 1, 385 → 2, 536 → 3, 678 → 4, 785 → 5, 805 → 6) give the highest key in each data block; the data blocks hold sorted runs of keys (001-150, 251-385, 455-536, 605-678, 705-785, 791-805).

  19. Indexed Random • Key values of the physical records are not necessarily in logical sequence • The index may be stored and accessed with the Indexed Sequential Access Method • The index has an entry for every database record, and these entries are in ascending order; the index keys are in logical sequence, while the database records are not necessarily in ascending sequence • The access method may be used for storage and retrieval

  20. Indexed Random [Figure] Dense index (actual value → block number): Adams → 2, Becker → 1, Dumpling → 3, Getta → 2, Harty → 1. Data file: Block 1 = {Becker, Harty}, Block 2 = {Adams, Getta}, Block 3 = {Dumpling}.

  21. Btree [Figure] Root node: (F | P | Z); second-level nodes: (B | D | F), (H | L | P), (R | S | Z); leaf entries: Aces, Boilers, Cars, Devils, Flyers, Hawkeyes, Hoosiers, Minors, Panthers, Seminoles.

  22. Inverted • Key values of the physical records are not necessarily in logical sequence • Access Method is better used for retrieval • An index for every field to be inverted may be built • Access efficiency depends on number of database records, levels of index, and storage allocated for index

  23. Inverted [Figure] Student records (Adams, Becker, Dumpling, Getta, Harty, Mobile) each carry a course number (CH145, CS201, CH145, CH145, CS623, CS623). Course-number index: CH 145 → 1, CS 201 → 2, CS 623 → 3, PH 345 → … Inverted lists (course → record addresses): CH 145 → 101, 103, 104; CS 201 → 102; CS 623 → 105, 106.
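A sketch (not from the slides) of building that inverted index from the student records in the figure; the record addresses 101-106 are the ones shown there.

```python
from collections import defaultdict

# Student records from the figure: (address, name, course number).
records = [(101, "Adams", "CH145"), (102, "Becker", "CS201"),
           (103, "Dumpling", "CH145"), (104, "Getta", "CH145"),
           (105, "Harty", "CS623"), (106, "Mobile", "CS623")]

# Invert on the course-number field: course value -> list of record addresses.
inverted = defaultdict(list)
for addr, _name, course in records:
    inverted[course].append(addr)

print(inverted["CH145"])   # [101, 103, 104]
```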

  24. Direct • Key values of the physical records are not necessarily in logical sequence • There is a one-to-one correspondence between a record key and the physical address of the record • May be used for storage and retrieval • Access efficiency always 1 • Storage efficiency depends on density of keys • No duplicate keys permitted
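A toy sketch of direct organization (not from the slides; the base address, record size, and mapping are hypothetical): the key alone determines the physical address, so access efficiency is always 1.

```python
# Hypothetical direct organization: the key maps one-to-one to a physical address.
BASE_ADDRESS = 1000   # assumed start of the file on disk
RECORD_SIZE = 64      # assumed fixed record size in bytes

def address_of(key):
    # One access per retrieval; duplicate keys would break the one-to-one mapping.
    return BASE_ADDRESS + key * RECORD_SIZE

print(address_of(42))   # 3688
```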

  25. So Far • So far, when we do runtime analysis, we give each operation one time unit • Actually, we've been assuming that they are close enough in run time that every operation can be considered the same • This, of course, is not realistic, as things like hard drives and networking are much slower than anything we can calculate in the computer

  26. Why the Processor and Main Memory Are Good • A processor can do about 2.5 billion instructions per second on a higher-end home PC these days • Data stored in main memory is accessible at a speed that matches the processor's • Imagine that we are storing a tree of 100 million elements (say, the number of bank transactions for a given month or year)

  27. Continued • Even if it takes 20 CPU instructions (too many) to traverse a single node of a binary search tree (accessing its data and processing it), we can still access 125 million nodes per second • That is every element of a completely linear tree, 1.25 times per second • Imagine that it takes 32 bytes to represent a key into the tree (what we order on) and 1 KB to store the data; then we need to store roughly 100,000,000,000 bytes of data, or about 100 GB of RAM, to run this procedure

  28. Continued • Of course, a home PC does not have anywhere near that much RAM, and even if it did, we would still need memory left over to run everything else (like the OS, perhaps) • So the processor/main-memory route is fast but not very practical; storing 100 GB of data on a hard drive, however, is nothing these days

  29. Hard Drives • While processor speeds go up rapidly, hard drive capacity goes up too • But what about access speed? • Most drives today run at 7,200 RPM • To get data, we have to wait on average half a rotation, about 4.1 ms • So we can do about 250 accesses per second • Remember the processor? That was 125 million node accesses per second

  30. Rough Comparisons • Based on our rough comparisons, a piece of data in main memory can be accessed about 500,000 times faster than data on a hard drive • But the reality is that main memory is very, very expensive, and hard drives are cheap with lots of storage • So we want to go with hard drives, but we need a way to improve the parts of the runtimes that are slow for hard drives
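The arithmetic behind that factor, using the numbers from the last few slides:

```python
# Rough comparison of main-memory node accesses vs. disk accesses per second.
cpu_ops_per_sec = 2.5e9                               # instructions per second
node_accesses_per_sec = cpu_ops_per_sec / 20          # ~125 million (20 instructions per node)

ms_per_half_rotation = 60_000 / 7200 / 2              # ~4.17 ms average at 7,200 RPM
disk_accesses_per_sec = 1000 / ms_per_half_rotation   # ~240

print(node_accesses_per_sec / disk_accesses_per_sec)  # ~520,000, i.e. "about 500,000x"
```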

  31. Past BSTs • If you look at our past BSTs, even the good balanced trees, we'd have to do, at best, an average of O(log n) calculations to find an element • log2 100,000,000 ≈ 26 • So, in a balanced BST, we would have to do 26 node accesses in the worst case • This is a nearly immeasurable fraction of time if we are using only main memory, but it's about 1/10th of a second from a hard drive

  32. Goal • Our goal is to make a tree such that the number of accesses to find a node is greatly reduced • If we can do a lot of heavy, fast calculations to do very few disk operations, we will have improved the runtime greatly • It's okay to do a lot of calculations -- in the time it takes for one disk access, we can execute on the order of 10 million instructions in the processor

  33. Height • The biggest problem is tree height • Think about what happens in a BST... access a node, decide which way to go, repeat in the chosen direction • We have to do this for every node in the path, which can cause problems • We want a tree with a smaller height; at each level, we'll access the disk only once • To reduce height, we have a relatively simple solution: increase branching

  34. The Good Ol' N-ary Tree • We mostly always talk about binary trees, for a good reason: in main memory, who cares what the height is? We have plenty of speed, and O(log n) is good enough • There is a detail hidden in that: log n is really log2 n • In a trinary tree, the runtime is also O(log n), but (as with constants) we left out the base, and it really runs in log3 n time • This may not make a huge difference for main memory, but for disk access it's about 10 fewer accesses, or roughly 0.04 seconds saved (a big improvement)
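A quick check of how the branching factor changes the number of node accesses, and therefore disk reads, for the 100 million element example:

```python
import math

n = 100_000_000
for branching in (2, 3, 200):
    print(branching, math.ceil(math.log(n, branching)))
# 2 -> 27, 3 -> 17, 200 -> 4 levels; each level costs roughly one ~4 ms disk read
```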

  35. The B-Tree • A B-tree is an n-ary tree with all of its data collected at the leaf level • An n-ary tree works like a BST, only we do a slightly more complicated calculation to figure out which path to take • We also need to guarantee that the n-ary tree stays balanced, or it may degenerate toward a regular BST; this is the first property of a B-Tree: good balancing

  36. B-Tree Properties • Like a red-black tree, we have certain rules to follow in our n-ary tree • 1) All data is stored at the leaf level 2) Non-leaf nodes store up to M-1 keys to help decide which path to search 3) The root is a leaf or has 2 to M children 4) All non-leaf nodes (except the root) have between M/2 and M children 5) All leaves are at the same depth and hold between M/2 and M data items, except if the root is a leaf

  37. What It Looks Like [Figure] A 4-ary tree whose leaves hold {1, 10, 12}, {16, 21}, {30, 41, 42, 43}, and {50, 51}. Note that even though we have 11 elements in the tree, a search involves only 2 node accesses, versus the deeper path a binary tree over the same elements would need.

  38. Better? • This is much better • Now we've decreased the number of nodes to be accessed and decreased the height, making everything more shallow • 4-ary isn't really the way to go here; what ends up happening is usually a much bigger branching factor, say 200-ary, but either way it's still better than binary • Note: we have to assume that all the information in one node is stored contiguously, so the hard drive doesn't have to reposition itself (slow!) just for one unfinished node

  39. B-Tree • Now we're ready to get into B-Trees • We need to know how to add to and remove from the B-Tree, and as always, we'll focus on insertion first

  40. Summary of Why • Writing to the hard disk is an expensive operation • In a large system, we'll have to read and write from the disk often, and we want to minimize the number of expensive operations for a good runtime • We need a structure that will take into account the cost of accessing a disk and do that as little as possible while using the structure

  41. Summary of How • When the HD is accessed, a fixed-size block of memory is read • We'll make each node of a tree one of these blocks • For efficiency's sake, we want to maximize the amount of data in these nodes/blocks • With this setup, we can define quite a few algorithms to deal with large-scale blocks of data, such as the one governing the structure of a B-tree
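A back-of-the-envelope sketch of how the block size fixes the branching factor; the 4 KB block, 32-byte key, and 8-byte pointer sizes here are assumptions for illustration, not numbers from the lecture.

```python
# Hypothetical sizes -- the point is that the disk block size dictates the fan-out.
BLOCK_SIZE = 4096    # bytes read per disk access
KEY_SIZE = 32        # bytes per search key
POINTER_SIZE = 8     # bytes per child pointer

# A non-leaf node with M children stores M pointers and M - 1 keys:
#   M * POINTER_SIZE + (M - 1) * KEY_SIZE <= BLOCK_SIZE
M = (BLOCK_SIZE + KEY_SIZE) // (KEY_SIZE + POINTER_SIZE)
print(M)   # 103 children per node with these numbers
```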

  42. The Rules 1) All data is stored at the leaf level. 2) Non-leaf nodes store keys that guide the search along the path to the leaf holding the element. 3) The root is a leaf or has 2 to M children. 4) All non-leaf nodes have between M/2 and M children (at least half full). 5) All leaves are at the same level. 6) All leaves are always at least half full, unless the root is itself a leaf.

  43. Starting Out • So what do we really have if we have an empty B-tree? • Essentially, we have one empty storage container, which is the size of a block on the HD (say, b), and can store b/n elements of size n • So, the root starts out as a leaf, which just stores data

  44. Inserting the First Elements • The root/leaf has a set amount of space that we can keep adding elements to • We may want to keep it sorted for easier searching • In fact, is there any reason not to keep a leaf sorted? • No, because the time it takes us to order the elements is minuscule compared to how long it takes to write the result to the hard drive • Okay, so while we have space in the leaf, we add each new element in sorted order
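A sketch of that step (not from the slides); the leaf capacity is an assumed parameter standing in for how many pairs fit in one block.

```python
from bisect import insort

LEAF_CAPACITY = 4   # assumed: how many (key, object) pairs fit in one disk block

def leaf_insert(leaf_keys, key):
    """Insert into a sorted leaf while it still has room; return True on success."""
    if len(leaf_keys) >= LEAF_CAPACITY:
        return False           # full: the caller must split (see the later slides)
    insort(leaf_keys, key)     # keeping the leaf sorted is cheap next to a disk write
    return True
```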

  45. Notes: Element? • This is more of a databases topic, but what is normally stored in the B-Tree is a reference to another object or piece of data, which may be, for instance, a location somewhere else on the disk • We also need to store the key that is being searched on • We are keeping, then, a {key, object} pair, where the key is simply a search item (like an ID number) and the object is just a pointer to something else • We'll learn lots more about {key, object} pairs when we start talking about hash tables

  46. Uh Oh! Full! • It works okay to keep adding elements, but as we said, each block is a set amount of space; what happens when we fill it up with key-object pairs? • Recall one of the rules of a B-tree: all leaves are always at least half full • If we know this, we know we can take that original, full leaf node and split it into two nodes

  47. Notes: Split? • Every block on an HD is the same size, and each element (key, object) occupies the same amount of space • If we split the elements down the middle, we'll create two half-full nodes

  48. Notes: Full? • Just when does a node become full, or, more importantly, when do we split a leaf node? • The popular and common way: whenever you try to add an element to a completely full node, you split the node and then add the element again • Another way I like is to check whether a split is needed after an addition; this way, you don't have to keep track of the element to be added during the split -- it's already there

  49. The Split • When the node gets full, we'll split it into two nodes • But what happens then? This is a tree; we need connections to our leaf nodes • In the first case of a split, that is, when the root is the leaf and it splits, we end up with two leaves, which gives us reason to make a new, non-leaf node that connects the two leaf nodes together
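A sketch of that first split under the simple list-based leaf above (not from the slides): the full leaf is cut down the middle, and a new non-leaf root is created holding one separating key and pointers to the two new leaves.

```python
def split_root_leaf(leaf_keys):
    """Split a full root-leaf into two half-full leaves under a new non-leaf root."""
    mid = len(leaf_keys) // 2
    left, right = leaf_keys[:mid], leaf_keys[mid:]
    # The new root stores a key separating the halves plus links to both leaves.
    root = {"keys": [right[0]], "children": [left, right]}
    return root

print(split_root_leaf([5, 12, 30, 47]))
# {'keys': [30], 'children': [[5, 12], [30, 47]]}
```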

  50. Notes: The Non-leaf Node • A leaf node stores key-object pairs in sorted order... what would the non-leaf node store? • One of our rules: non-leaf nodes store keys that guide the search along the path to the leaf holding the element • So, the non-leaf nodes will store keys; they also need to store links to other nodes
