
External Sorting and Searching

External Sorting and Searching. B-Trees, etc. m-Way Search Trees. In a binary search tree, there is one key value per node and two children. There is no reason why I couldn’t have (at most) m-1 key values per node and m children. Such trees are called m-way search trees.

gotzon

Presentation Transcript


  1. External Sorting and Searching B-Trees, etc.

  2. m-Way Search Trees • In a binary search tree, there is one key value per node and two children. • There is no reason why I couldn’t have (at most) m-1 key values per node and m children. • Such trees are called m-way search trees.

  3. m-Way Search Tree Example • Here is a 3-way search tree; each node has a maximum of 3 children. [Tree: root (120, 240); children (97), (200), (360, 440)]
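
As a rough illustration of how a search walks such a tree, here is a minimal Python sketch. The names `MWayNode` and `search` are made up for this example and do not come from the slides; the tree built at the bottom is the 3-way tree with root (120, 240) and children (97), (200), (360, 440).

```python
from bisect import bisect_left


class MWayNode:
    """Hypothetical node of an m-way search tree: sorted keys plus len(keys)+1 child slots."""

    def __init__(self, keys, children=None):
        self.keys = list(keys)
        # A missing subtree is represented by None.
        self.children = list(children) if children else [None] * (len(self.keys) + 1)


def search(node, key):
    """Return True if key is stored somewhere in the subtree rooted at node."""
    while node is not None:
        i = bisect_left(node.keys, key)  # first position with keys[i] >= key
        if i < len(node.keys) and node.keys[i] == key:
            return True
        node = node.children[i]          # descend into the subtree between keys[i-1] and keys[i]
    return False


root = MWayNode([120, 240], [MWayNode([97]), MWayNode([200]), MWayNode([360, 440])])
print(search(root, 200), search(root, 300))  # True False
```

Within a node the sketch uses binary search (`bisect_left`); a linear scan over the keys would work just as well for small m.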

  4. m-Way Search Tree Example II • Here is another one. [Tree diagram with nodes (97), (120, 240), (360, 440), (500)]

  5. m-Way Time Complexity • Without balancing, the worst-case search and insert time for an m-way search tree is still O(n). • The number of nodes visited is O(n/m). • For each node, we must look at up to m-1 key values. • We could search those values in O(log2(m)) time with a binary search, yielding at best O(n/m * log2(m)). • Of course, as n gets much larger than m, this is still O(n).

  6. B-Trees • What I want is a height-balanced m-way search tree to achieve the best search time. • These are called B-Trees. • As with height-balanced BSTs, we will have a re-balancing algorithm to run after every insert and delete.

  7. B-Tree Properties • The root may have between 2 and m children. • All other nodes must have between ⌈m/2⌉ and m children. • A node that has k children will have k-1 key values. • Thus, the root may have only 2 children; all other nodes must be at least half full.

  8. B-Tree Properties II • If a B-Tree node has k children ($T_0, T_1, \ldots, T_{k-1}$) and k-1 ordered key values ($D_1, D_2, \ldots, D_{k-1}$), then all the key values in $T_i$ are greater than $D_i$ but less than $D_{i+1}$ for $i = 1, \ldots, k-2$. • All the key values in $T_0$ are less than $D_1$. • All the key values in $T_{k-1}$ are greater than $D_{k-1}$. • This simply means it is a search tree.
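
The properties on the last two slides can be turned into a small checker. This is only a sketch under assumptions of my own: a node is represented as a `(keys, children)` tuple with an empty child list at the terminal level, keys are distinct numbers, and the names `check` and `leaf` are invented for the example.

```python
from math import ceil, inf


def check(node, m, lo=-inf, hi=inf, is_root=True):
    """Return the number of levels if the subtree obeys the B-tree rules, else raise ValueError."""
    keys, children = node
    k = len(keys) + 1                               # a node with k-1 keys has k children
    min_children = 2 if is_root else ceil(m / 2)
    if not min_children <= k <= m:
        raise ValueError(f"node {keys} would have {k} children")
    bounds = [lo] + keys + [hi]
    if any(a >= b for a, b in zip(bounds, bounds[1:])):
        raise ValueError(f"keys {keys} are out of order or out of range")
    if not children:                                # terminal node
        return 1
    if len(children) != k:
        raise ValueError(f"node {keys} has the wrong number of children")
    # Child i may only hold keys strictly between bounds[i] and bounds[i + 1].
    heights = {check(c, m, bounds[i], bounds[i + 1], False) for i, c in enumerate(children)}
    if len(heights) != 1:
        raise ValueError("terminal nodes are not all on the same level")
    return heights.pop() + 1


def leaf(*keys):
    return (list(keys), [])


tree = ([240], [([120], [leaf(97), leaf(200)]), ([360], [leaf(280), leaf(440)])])
print(check(tree, 3))  # 3
```

The same-level test is not spelled out on the property slides; it reflects the requirement that a B-Tree is height-balanced.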

  9. B-Tree Insertion • All insertions are done at the terminal level. • First, search for the terminal-level node to insert the new key value into. • If the node now holds no more than m-1 key values (that is, it would have no more than m children), stop. • If it now holds m key values, it has too many and must be split...

  10. B-Tree Node Splitting • Split this node into two nodes: • Take the middle value out. • Create one node with the lower half of the key values and one with the upper half. • Insert middle value into the parent node. • Continue recursively until either the node can hold the new key value, or you split the root.

  11. B-Tree Insert Example • A B-Tree of order 3 (i.e. m=3) is the smallest possible. • It is also the easiest to draw, so we’ll use this order for our example. • This is also called a “2-3 Tree” because each node may have a maximum of 2 key values and 3 children.

  12. B-Tree Example Key values left to insert: 360, 240, 200, 97, 440, 280 • Insert 120. A new root node is created and this value is placed into it. [Tree: root (120)]

  13. B-Tree Example Key values left to insert: 240, 200, 97, 440, 280 • Insert 360. It goes into the root. No further action is required. [Tree: root (120, 360)]

  14. B-Tree Example Key values left to insert: 200, 97, 440, 280 • Insert 240. It goes into the root. Since this node has 3 values, it must be split. [Tree: root (120, 240, 360)]

  15. B-Tree Example Key values left to insert: 200, 97, 440, 280 • This shows the result of the split. 120 and 360 go into nodes by themselves, and 240 is placed into a new root node. [Tree: root (240); children (120), (360)]

  16. B-Tree Example Key values left to insert: 97, 440, 280 • Insert value 200. It goes into the node with 120. No further action is required. [Tree: root (240); children (120, 200), (360)]

  17. B-Tree Example Key values left to insert: 440, 280 • Insert value 97. It goes into the node with 120 and 200. Since this node contains too many values, it must be split. [Tree: root (240); children (97, 120, 200), (360)]

  18. B-Tree Example Key values left to insert: 440, 280 • This shows the result of the split. 97 and 200 are placed into their own nodes, and 120 is moved up to the parent. The parent node is OK. [Tree: root (120, 240); children (97), (200), (360)]

  19. B-Tree Example Key values left to insert: 280 • Insert 440. It goes into the node with 360. No further action is required. [Tree: root (120, 240); children (97), (200), (360, 440)]

  20. B-Tree Example Key values left to insert: none • Insert the value 280. It goes into the node with 360 and 440. Since this node has 3 values, it must be split. [Tree: root (120, 240); children (97), (200), (280, 360, 440)]

  21. B-Tree Example • This shows the result of the split. 280 and 440 go into nodes by themselves, and 360 is moved up to the parent node. [Tree: root (120, 240, 360); children (97), (200), (280), (440)]

  22. B-Tree Example • The parent node must be split as well. Because it is the root, we must create a new root node. [Tree: root (240); level 2: (120), (360); leaves: (97), (200) | (280), (440)]
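
The walkthrough above can be replayed in code. What follows is a minimal in-memory sketch, not the presenter's implementation: the class names `Node` and `BTree` and the helper `levels` are invented here, keys are assumed to be distinct, and nothing is written to disk.

```python
from bisect import bisect_left


class Node:
    def __init__(self, keys=None, children=None):
        self.keys = keys or []
        self.children = children or []      # empty list means a terminal node


class BTree:
    def __init__(self, m):
        self.m = m                           # order: at most m children, m-1 keys
        self.root = Node()

    def insert(self, key):
        split = self._insert(self.root, key)
        if split:                            # the root itself was split
            middle, right = split
            self.root = Node([middle], [self.root, right])

    def _insert(self, node, key):
        i = bisect_left(node.keys, key)
        if node.children:                    # internal node: insert further down
            split = self._insert(node.children[i], key)
            if split is None:
                return None
            middle, right = split            # child was split: absorb its middle key
            node.keys.insert(i, middle)
            node.children.insert(i + 1, right)
        else:                                # terminal node: place the key here
            node.keys.insert(i, key)
        if len(node.keys) < self.m:          # at most m-1 keys, nothing more to do
            return None
        mid = len(node.keys) // 2            # take the middle value out
        middle = node.keys[mid]
        right = Node(node.keys[mid + 1:], node.children[mid + 1:])
        node.keys = node.keys[:mid]
        node.children = node.children[:mid + 1]
        return middle, right                 # the caller inserts middle into the parent


def levels(tree):
    """List the keys of every node, one list per level, left to right."""
    out, level = [], [tree.root]
    while level:
        out.append([n.keys for n in level])
        level = [c for n in level for c in n.children]
    return out


tree = BTree(3)
for key in [120, 360, 240, 200, 97, 440, 280]:
    tree.insert(key)
print(levels(tree))  # [[[240]], [[120], [360]], [[97], [200], [280], [440]]]
```

Returning the (middle, right) pair from the recursive call is one convenient way to express "insert the middle value into the parent" without storing parent pointers.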

  23. Time Complexity • What is the order of a B-tree search? To answer this, we need to determine the worst-case number of levels in a B-Tree of order m that has n key values. • Let’s look at the minimum number of nodes per level: • Level 1 (the root) has 1 node; • Level 2 must have at least 2 nodes; • Level 3 must have at least $2\lceil m/2 \rceil$ nodes; • Level 4 must have at least $2\lceil m/2 \rceil^{2}$ nodes; • Level L must have at least $2\lceil m/2 \rceil^{L-2}$ nodes.

  24. Time Complexity II • Observation: in any list of n elements, there are n+1 ways for the search to fail. • In a B-tree, all the ways to fail are at level L+1 (these are sometimes called Failure Nodes). • Thus, counting the failure nodes gives a relationship between the number of key values and the height of the tree:

  25. Time Complexity III • Because the previous analysis is a worst case, the minimum number of nodes at level L+1 must be less than or equal to N+1: • $2\lceil m/2 \rceil^{L-1} \le N+1$ • $\lceil m/2 \rceil^{L-1} \le (N+1)/2$ • $L-1 \le \log_{\lceil m/2 \rceil}[(N+1)/2]$ • $L \le \log_{\lceil m/2 \rceil}[(N+1)/2] + 1$
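
To get a feel for the bound, it can be evaluated numerically. The sketch below finds the largest L satisfying the first inequality using integer arithmetic only, which avoids floating-point logarithms; the function name `max_levels` and the order 200 used in the second call are illustrative assumptions, not values from the slides.

```python
from math import ceil


def max_levels(n_keys, m):
    """Largest L with 2 * ceil(m/2)**(L-1) <= n_keys + 1 (requires n_keys >= 1)."""
    half = ceil(m / 2)
    levels = 1
    while 2 * half ** levels <= n_keys + 1:
        levels += 1
    return levels


print(max_levels(7, 3))            # 3
print(max_levels(1_000_000, 200))  # 3
```

The first call matches the seven-key 2-3 tree built in the insertion example, which ended with three levels.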

  26. Time Complexity IV • One node at each level must be accessed, so L gives the number of nodes to access. • Each node contains at least $\lceil m/2 \rceil - 1$ key values, which can be searched with a binary search, so the total number of comparisons is • $\{\log_{\lceil m/2 \rceil}[(N+1)/2] + 1\} \cdot \log_2(\lceil m/2 \rceil - 1)$

  27. Fun With Math • Removing the constants, we may say this search is • $O(\log_{\lceil m/2 \rceil}(N) \cdot \log_2\lceil m/2 \rceil)$ • $O\big((\log_2(N) / \log_2\lceil m/2 \rceil) \cdot \log_2\lceil m/2 \rceil\big)$ • $O(\log_2(N))$

  28. Summing it up: • WHAT??? ALL THIS WORK FOR THE SAME ORDER AS AN AVL-TREE!!! • What’s going on here???

  29. What Really Happens • Remember this is external sorting, so accessing the information and doing comparisons have very different costs: a disk access is far slower than an in-memory comparison. • Each node in the B-tree is stored in a “block” on the disk; a “block” is the minimum amount of information which can be retrieved with one disk access.

  30. What Really Happens II • Thus, the number of disk accesses is the bottleneck; this is given by L. • A B-tree is built on a field of a data file to speed access to that field. • A “Clustered” or “Primary” B-tree stores the entire record of the file in the B-Tree. • An “Unclustered” or “Secondary” B-tree stores the field’s value and the record number in the node.

  31. What Really Happens III • It is the secondary B-trees that one usually means when one says “B-tree”. • Thus, to do a search for a record on a field which has a B-tree: • Search the B-tree for the key value. • When found, retrieve its associated record number. • Retrieve that record from the data file.
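
The three search steps can be sketched as follows. Everything concrete here is an assumption made for illustration: the record layout (`RECORD`), the record contents, the in-memory `data_file`, and the use of a sorted list searched with `bisect_left` as a stand-in for the secondary B-tree. Only the two-step pattern (index lookup, then record retrieval by record number) comes from the slides.

```python
import io
import struct
from bisect import bisect_left

# Hypothetical fixed-length record: a 4-byte integer key plus a 16-byte label.
RECORD = struct.Struct("<i16s")

# A small in-memory stand-in for the data file; records are NOT in key order.
keys_in_file_order = [120, 23, 210, 45]
data_file = io.BytesIO()
for k in keys_in_file_order:
    data_file.write(RECORD.pack(k, f"record-{k}".encode()))

# Stand-in for the secondary B-tree: sorted (key value, record number) pairs.
index = sorted((k, recno) for recno, k in enumerate(keys_in_file_order))


def lookup(key):
    """Search the index for key, then fetch the matching record from the data file."""
    i = bisect_left(index, (key,))                 # step 1: search the index
    if i == len(index) or index[i][0] != key:
        return None
    recno = index[i][1]                            # step 2: associated record number
    data_file.seek(recno * RECORD.size)            # step 3: retrieve that record
    k, label = RECORD.unpack(data_file.read(RECORD.size))
    return k, label.rstrip(b"\0").decode()


print(lookup(210))  # (210, 'record-210')
print(lookup(99))   # None
```

With a real B-tree, step 1 would cost about L disk accesses and step 3 one more.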

  32. A Real Example. • What follows is a real example of how a B-tree is used.

  33. Sample Data File

  34. B-Tree on Schedule# • This is the way we would normally view it: [Tree: root (100); level 2: (45), (120); leaves: (23), (46) | (110), (140, 210)]

  35. B-Tree on Schedule# • This is how it really looks in a file:

  36. Deleting in a B-tree • To delete from a B-Tree, first locate the key value with the normal search routine. • If the key value is not located in a terminal node, replace it with its in-order successor and then delete the in-order successor from its terminal node. • Thus, all deletes which reduce the number of key values occur at the terminal level.

  37. Deleting From the Terminal Level • Good news: because there are no children to worry about, we can just remove it from the list. • Bad news: what if this removal reduces the number of children below ⌈m/2⌉? • Reality: at some point we will need to reduce the number of nodes...

  38. The “Borrow” Algorithm • When a node is reduced below ⌈m/2⌉ children, first try to borrow a key value from one of its neighbors. • If a neighbor has more than the minimum, then rotate the appropriate key from the neighbor up to the parent and the appropriate key from the parent down to the reduced child.

  39. Borrow Example • Suppose I want to delete 200 from this B-tree of order 3. • To do so, rotate 240 into the middle child, and 360 up to the root: [Tree: root (120, 240); children (97), (200), (360, 440)]

  40. Borrow Example • This shows the result. • Problem: what if I now want to delete 240? • Borrowing won’t work... [Tree: root (120, 360); children (97), (240), (440)]

  41. Combining Nodes • When borrowing won’t work, combine the node with the key value from the parent AND the neighbor node with minimum children. • Repeat the deletion algorithm from the parent, looking first to borrow if possible. • Now, let’s delete 240...

  42. Combining Example • First, remove 240. [Tree: root (120, 360); children (97), (240), (440)]

  43. Combining Example • Next, attempt to borrow. • Borrowing fails. • Combine empty node with 360 and 440. [Tree: root (120, 360); children (97), <empty>, (440)]

  44. Combining Example • This shows the result. • The parent is OK, so we are done... [Tree: root (120); children (97), (360, 440)]
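
The borrow and combine steps from the last few slides can be sketched for a single parent and its terminal children. This is an illustrative sketch only: `Node`, `fix_underflow`, and `delete_from_leaf` are names invented here, the order in which neighbors are tried is my own choice (picked so that the output matches the slides), and the sketch does not repeat the repair at the parent's own parent, which a full implementation must do after a combine.

```python
from math import ceil


class Node:
    def __init__(self, keys, children=None):
        self.keys = list(keys)
        self.children = list(children or [])


def fix_underflow(parent, i, m):
    """Repair parent.children[i] after it fell below the minimum; return what was done."""
    min_keys = ceil(m / 2) - 1
    child = parent.children[i]
    left = parent.children[i - 1] if i > 0 else None
    right = parent.children[i + 1] if i + 1 < len(parent.children) else None
    if left is not None and len(left.keys) > min_keys:
        # Borrow: parent key rotates down, left neighbor's largest key rotates up.
        child.keys.insert(0, parent.keys[i - 1])
        parent.keys[i - 1] = left.keys.pop()
        if left.children:
            child.children.insert(0, left.children.pop())
        return "borrow"
    if right is not None and len(right.keys) > min_keys:
        # Borrow: parent key rotates down, right neighbor's smallest key rotates up.
        child.keys.append(parent.keys[i])
        parent.keys[i] = right.keys.pop(0)
        if right.children:
            child.children.append(right.children.pop(0))
        return "borrow"
    # Combine: merge the node, the separating parent key, and one neighbor.
    if right is None:
        i -= 1                          # no right neighbor, so merge with the left one
    a, b = parent.children[i], parent.children[i + 1]
    a.keys += [parent.keys.pop(i)] + b.keys
    a.children += b.children
    del parent.children[i + 1]
    return "combine"                    # the caller must now check the parent itself


def delete_from_leaf(parent, i, key, m=3):
    leaf = parent.children[i]
    leaf.keys.remove(key)
    if len(leaf.keys) < ceil(m / 2) - 1:
        return fix_underflow(parent, i, m)
    return "ok"


root = Node([120, 240], [Node([97]), Node([200]), Node([360, 440])])
print(delete_from_leaf(root, 1, 200), root.keys, [c.keys for c in root.children])
# borrow [120, 360] [[97], [240], [440]]
print(delete_from_leaf(root, 1, 240), root.keys, [c.keys for c in root.children])
# combine [120] [[97], [360, 440]]
```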

  45. A Larger Example • Delete 280 • This is a “borrow” case: [Tree: root (260); level 2: (120, 180), (360); leaves: (97), (150), (200) | (280), (440, 500)]

  46. A Larger Example • Delete 360 • This is a “combine” case: [Tree: root (260); level 2: (120, 180), (440); leaves: (97), (150), (200) | (360), (500)]

  47. A Larger Example • First, remove 360... [Tree: root (260); level 2: (120, 180), (440); leaves: (97), (150), (200) | <empty>, (500)]

  48. A Larger Example • Next combine node with its neighbor (500) and 440 from the parent... [Tree: root (260); level 2: (120, 180), (440); leaves: (97), (150), (200) | <empty>, (500)]

  49. A Larger Example • Parent now has a problem... • This is a borrow case: [Tree: root (260); level 2: (120, 180), <empty>; leaves: (97), (150), (200) | (440, 500)]

  50. A Larger Example • Children must now be considered. What do I do with the node with 200? [Tree: root (180); level 2: (120), (260); leaves: (97), (150), (200), (440, 500)]
