COMPSCI220 Review & Questions



  1. COMPSCI220 Review & Questions Ulrich Speidel (ulrich@cs.auckland.ac.nz)

  2. Topics requested • Divide-and-conquer principle • Recurrence relations: formulating, solving, telescoping, recognizing • Predicting running times based on algorithm complexity • Big-Oh & Co. • Trees: m-ary, red-black, B-trees

  3. Divide and conquer • Basic idea: You have a big and potentially messy problem (such as sorting a list of ten million numbers, or traversing a large graph that I don’t know in advance, or…). Lots of possible strategies and solutions. • Divide and conquer: Identify how to break up the big problem into (progressively) smaller ones that you can solve easily. • Examples: • Mergesort: Divide up large list recursively into smaller ones that you can sort easily once they reach length 2 or 1. Then conquer by combining (merging) your small sorted lists into larger sorted lists. Quicksort is another similar example here. • DFS: Divide problem by conquering a graph one tree at a time. In each tree divide problem further by performing the same sub-algorithm on each node you visit (mark node gray, check for white outneighbours, visit them recursively, mark node black). BFS is another example in this respect.
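
To make the mergesort example above concrete, here is a minimal Python sketch of the divide-and-conquer structure (illustrative only; the function names are not from the course's reference code):

    # Divide: split the list until the pieces have length 0 or 1 (trivially sorted).
    # Conquer: merge the sorted halves back into larger sorted lists.
    def merge_sort(items):
        if len(items) <= 1:
            return items
        mid = len(items) // 2
        left = merge_sort(items[:mid])    # solve the smaller subproblems...
        right = merge_sort(items[mid:])
        return merge(left, right)         # ...then combine their solutions

    def merge(left, right):
        merged, i, j = [], 0, 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                merged.append(left[i]); i += 1
            else:
                merged.append(right[j]); j += 1
        merged.extend(left[i:])           # at most one of these is non-empty
        merged.extend(right[j:])
        return merged

    print(merge_sort([28, 9, 69, 4, 15, 45, 73]))   # [4, 9, 15, 28, 45, 69, 73]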

  4. Recurrence relations • Can’t formulate every problem as a recurrence relation • Even if we can, there is no guarantee that we can solve it (=find a closed-form asymptotic solution for its running time formula) • Idea: formulate running time T(n) for n items in terms of the running time of a lower number of items (typically n-1 or n/2 or somesuch) • Can also include other terms that depend on n

  5. Example: formulating a recurrence relation • Quicksort “worst case”: one of the sublists is empty at each level due to the pivot choice: T(n) = T(n-1) + cn, where cn is the time taken to inspect and move elements to the correct side of the pivot, and T(n-1) is the time taken to sort the remaining sublist of length n-1
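
A quick way to see what this worst case costs (a short check, assuming a constant base case such as T(1) = c): telescoping gives T(n) = cn + c(n-1) + c(n-2) + … + c·1 = c·n(n+1)/2, which is Θ(n²), the familiar quadratic worst case of quicksort.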

  6. Exercise: formulating a recurrence relation

  7. Solving recurrence relations • …means: Getting rid of the T() on the right hand side • Three strategies • Recognise as a known formula: There are many known ones and if you happen to recognise, e.g., that your recurrence is the same as that for insertion sort… you have a solution! Hint: Try and remember the recurrences from the course for the exam – this includes the ones for the various sorting functions! • Telescoping (with various tricks) • Guessing: Try a candidate function and see whether it “fits”

  8. Solving a recurrence relation by recognition • Remember the recurrence relation for quicksort? Identify c = 5 and then you know that the average case time complexity for quicksort is…?

  9. Solving a recurrence relation by telescoping • Example recurrence: T(n) = T(n/2) + n² • Try telescoping: T(n) = n² + (n/2)² + (n/4)² + (n/8)² + … = n² · (1 + 1/4 + 1/16 + 1/64 + …) • The bracketed sum is always > 1 and always < 2 (its actual value converges to 4/3), so for our purposes it is a constant: T(n) is Θ(n²)

  10. Telescoping • Try and find some sort of regularity by substituting the recurrence relation for smaller n • Being good at maths (or knowing someone who is) helps • No guarantees!

  11. Solving a recurrence relation by guessing • Guess that T(n) is 4n²/3 plus lower-order terms • How do you check? Plug the suspected asymptotic function (n², with its constant) into both sides of the recurrence: 4n²/3 = 4(n/2)²/3 + n², which gives 4n²/3 = n²/3 + n² • If need be, divide the LHS of the equation by the RHS and ask what happens as n becomes really large. If you’ve guessed right, the result should converge to 1.
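
One way to build extra confidence in the guess (a numerical sanity check, not part of the original slides; it assumes the base case T(1) = 1) is to unroll T(n) = T(n/2) + n² in Python and watch T(n)/n² approach 4/3:

    # Unroll T(n) = T(n/2) + n^2 for powers of two and compare against the guess.
    def T(n):
        if n <= 1:
            return 1              # assumed base case
        return T(n // 2) + n * n

    for n in (2 ** k for k in range(1, 11)):
        print(n, T(n) / (n * n))  # ratio tends towards 4/3 ≈ 1.333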

  12. Predicting running times based on empirical data • Questions of the form “If quadratic/cubic/exponential/… algorithm X takes time t (or uses x elementary ops) to do 10 items, how long does it take to do 100 items?” • Worked example: a linearithmic algorithm, i.e. O(n log n), uses about 30 elementary ops for 10 items. With log to base 2: 10 log 10 ≈ 10 * 3 = 30, so there’s seemingly very little going on in terms of constant factors or additive constants. For 1000 items: 1000 log 1000 ≈ 1000 * 10 = 10000, and (of the answer choices) 9000 comes closest to this number.

  13. Predicting running times based on empirical data - exercise • We have a cubic time algorithm that takes 1 day to process a list of 100 input items. How long – roughly – does it take to process 400 input items?

  14. Predicting running times based on empirical data - exercise • We have a cubic time algorithm that takes 1 day to process a list of 100 input items. How long – roughly – does it take to process 400 input items? Choices: one week, two months, six months, one year, three years • Cubic = Θ(n³) • 400³/100³ = 4³ = 64, so 64 days, i.e. roughly 2 months
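
The same scaling argument works for any complexity class: multiply the known time by f(n_new)/f(n_old). A small Python sketch of this rule of thumb (the function name predict is just illustrative):

    import math

    # Predict a running time from one measurement, assuming time ≈ c * f(n).
    def predict(t_known, n_known, n_new, f):
        return t_known * f(n_new) / f(n_known)

    # Cubic exercise above: 1 day for 100 items, how long for 400 items?
    print(predict(1, 100, 400, lambda n: n ** 3))              # 64.0 days

    # Linearithmic example above: ~30 ops for 10 items, how many for 1000 items?
    print(predict(30, 10, 1000, lambda n: n * math.log2(n)))   # roughly 9000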

  15. Useful things to know in running time prediction • One day has 86400 seconds • One year has roughly 400 days (not quite, but close enough) • Three years have roughly 1000 days (or 100,000,000 seconds) Know how to make rough calculations like these

  16. Predicting running times based on empirical data - exercise

  17. Coming to grips with T(n) • We can potentially characterise an algorithm in three ways: • By a function f(n) describing the maximum time within which the algorithm completes regardless of input. • By a function f(n) and two multiplicative constants c1 and c2 describing the time in which the algorithm always completes, such that the actual running time is between c1f(n) and c2f(n). • By a function f(n) describing the minimum time the algorithm needs to complete regardless of input. • These three concepts are formalised as O(f(n)) (aka "big-Oh"), ϴ(f(n)) (aka "big-Theta"), and Ω(f(n)) (aka "big-Omega"), respectively

  18. "Big-Oh" notation • Consider two functions g(n) and f(n) such that g(n) >0 and f(n) >0 for all positive n • The function g(n) is said to be O(f(n)) (read "g(n) is Big-Oh of f(n)") if and only if we can find a positive real constant c and a positive integer n0 such that g(n) ≤ c f(n) for all n>n0 • Big-Oh notation is the most frequently used time complexity measure as it documents the worst-case scenario • Colloquially, we often drop the “big” and say “oh of f(n)”

  19. Notes on Big-Oh • Big-Oh describes an upper limit on asymptotic behaviour, that is, a property that g(n) has for large n (n that are larger than some n0). As such it defines a class of functions that all exhibit this behaviour. • It's not unique: for example, a linear algorithm is always O(n), but it is also O(n²), O(n³), O(eⁿ), etc. - O(f(n)) for any function f(n) that features a more rapid growth rate than n, in fact. Typically, we’ll be trying to quote the function with the lowest growth rate (i.e., O(n) for a linear algorithm) • Multiplicative constants in f(n) are omitted, e.g., we normally do not write O(3n²) as it is the same as O(n²). • Similarly, additive terms of a lower order are omitted: O(3n² + 55n) is still just O(n²). • Question to test your understanding: Why can we write O(n log n) without giving the base of the logarithm? (Hint: log_a(x) = log_b(x)/log_b(a).)

  20. Big-Oh examples • The following functions are all O(n): log(n), sqrt(n), 2n+1, 2n+log(n) • The following functions are not O(n): n log(n), n·sqrt(n), n³, 2n + 0.0001n² • The following functions are O(n³log(n)): n·sqrt(n), n³, 2n + 0.0001n², 5(n+5)³-(n-2)³+8 • The following functions are not O(n⁵): n⁵log²(n), n⁶
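
A quick, informal way to check claims like the ones above (a numerical sketch, not a proof; the function name ratios is illustrative) is to look at g(n)/f(n) for growing n: if g(n) is O(f(n)) the ratio stays bounded, otherwise it keeps growing.

    import math

    # Print g(n)/f(n) for growing n; a bounded ratio is consistent with g = O(f).
    def ratios(g, f, ns=(10, 100, 1000, 10_000, 100_000)):
        return [round(g(n) / f(n), 3) for n in ns]

    f = lambda n: n
    print(ratios(lambda n: 2 * n + math.log(n), f))   # stays near 2: O(n)
    print(ratios(lambda n: n * math.log(n), f))       # keeps growing: not O(n)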

  21. Hierarchy of function growth rates • A function g(n) is O(f(n)) if it has the same or a lower rate of growth as f(n) (except for constant multiplicative factors and any additive terms of lower growth rate than f(n)). • We can use the following list to rank growth rates, from lowest growth rate to highest (obviously by no means a complete list!): • Constant: 1 • Logarithmic: log(n) • Log-squared: log(n)·log(n) = log²(n) • Square root: sqrt(n) • Linear: n • Linearithmic: n log(n) • Quadratic: n² • Cubic: n³ • Exponential: 2ⁿ

  22. Big-Oh: alternative definition • Sometimes, an alternative definition of Big-Oh is used • The function g(n) is said to be O(f(n)) (read "g(n) is Big-Oh of f(n)") if and only if we can find two positive real constants c1 and c2 such that g(n) ≤ c1f(n) + c2 for all n • This definition is equivalent to the previous one

  23. “Big-Theta” notation • Again consider two functions g(n) and f(n) such that g(n) >0 and f(n) >0 for all positive n • The function g(n) is said to be ϴ(f(n)) (read "g(n) is Big-Theta of f(n)") if and only if we can find two positive real constants c1 and c2 and a positive integer n0 such that g(n) ≤ c1f(n) and g(n) ≥ c2f(n) for all n>n0

  24. Notes on Big-Theta • Big-Theta notation aims to capture the actual running time of an algorithm – not just the worst case scenario. It defines a class of functions that are equivalent under the Big-Theta concept • For some algorithms, there is no Big Theta class to which they belong. • Multiplicative constants and additive lower-order terms aside, the f(n) in Big Theta notation is unique. • If g(n) is ϴ(f(n)), it means that g(n) has the same rate of growth as f(n) (except for the usual constant multiplicative factor and lower-order additive terms) • If g(n) is ϴ(f(n)), then f(n) is ϴ(g(n))

  25. “Big-Omega” notation • The Big-Omega notation is the counterpart to Big-Oh: it describes a bound on the best case • Again consider two functions g(n) and f(n) such that g(n) >0 and f(n) >0 for all positive n • The function g(n) is said to be Ω(f(n)) (read "g(n) is Big-Omega of f(n)") if and only if we can find a positive real constant c and a positive integer n0 such that g(n) ≥ c f(n) for all n>n0

  26. Notes on Big-Omega • Big-Omega is not always unique, e.g., our original “find pairs of numbers adding up to 10” has basically the same best and worst case running time of O(n²) and Ω(n²), respectively. • But if it is Ω(n²), it is also by implication Ω(n). • If a function g(n) is ϴ(f(n)), then it is by implication also O(f(n)) and Ω(f(n)).

  27. Balanced search trees • AVL trees: BST with ϴ(log n) operations in worst case, but memory hungry and computationally expensive • Red-black trees: BST with ϴ(log n) operations • M-ary balanced search trees

  28. AVL trees • Named after their inventors, Adelson-Velskii and Landis • AVL trees have the property that for any node, the difference in height of the left and right subtree is at most 1 • This gives a certain amount of balance. • Can show that the height of an AVL tree with n nodes is ϴ(log n), which ensures our operations can all be done in ϴ(log n) • If the AVL condition is violated by an insertion or deletion, it can be restored with a type of operation called a rotation

  29. AVL trees • Our now-familiar BST qualifies as an AVL tree: [tree diagram with keys 28, 9, 69, 4, 15, 45, 73, 9, 17, 66, 142]

  30. AVL trees • This tree isn't an AVL tree: [tree diagram with keys 28, 9, 69, 4, 45, 73, 9, 66, 142] • For which node(s) is the AVL tree condition not met?

  31. AVL trees • This tree isn't an AVL tree either: [tree diagram with keys 28, 9, 69, 4, 15, 9] • For which node(s) is the AVL tree condition not met?
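
A short Python sketch of how the AVL condition asked about above can be checked (the Node class and function names are illustrative, not from the course code):

    # A node of a plain BST; an AVL tree additionally requires that at every
    # node the heights of the left and right subtrees differ by at most 1.
    class Node:
        def __init__(self, key, left=None, right=None):
            self.key, self.left, self.right = key, left, right

    def height(node):
        if node is None:
            return -1                     # conventional height of an empty subtree
        return 1 + max(height(node.left), height(node.right))

    def is_avl(node):
        if node is None:
            return True
        balanced = abs(height(node.left) - height(node.right)) <= 1
        return balanced and is_avl(node.left) and is_avl(node.right)

    # Example: left subtree has height 1, right subtree is empty (height -1).
    print(is_avl(Node(28, Node(9, Node(4)), None)))   # False: difference is 2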

  32. Red-black trees • A red-black BST assigns colours red and black to each node of the BST • The root itself is always black • Every child of a red node must be black • Every path from the root to a leaf has exactly b black nodes • This restricts the tree to a height of at most 2b and forces rebalancing before extending • Search, insert and delete operations are all O(log n) • Rebalancing on insert is performed by repainting of individual nodes and/or a variety of rotations similar to those in AVL trees. What exactly happens depends on the actual scenario encountered.
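
The invariants above can be written down almost literally as a checking function. A hedged Python sketch (nil children are represented as None and counted as black; the class and names are illustrative, not the course's implementation):

    RED, BLACK = "red", "black"

    class RBNode:
        def __init__(self, key, colour, left=None, right=None):
            self.key, self.colour, self.left, self.right = key, colour, left, right

    def black_height(node):
        # Number of black nodes on every path down from this node, or None if
        # the paths disagree or a red node has a red child.
        if node is None:
            return 1                      # nil leaves count as black
        lh, rh = black_height(node.left), black_height(node.right)
        if lh is None or rh is None or lh != rh:
            return None
        if node.colour == RED:
            for child in (node.left, node.right):
                if child is not None and child.colour == RED:
                    return None           # red node with a red child: not allowed
            return lh
        return lh + 1                     # a black node adds one to the black height

    def is_red_black(root):
        return (root is None or root.colour == BLACK) and black_height(root) is not None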

  33. m-ary search trees • Binary search trees don't work too well in very large data sets where every node needs to be retrieved via disk I/O • In this case, even log₂ n comparisons in a search are too many because each comparison translates into a time-consuming disk I/O operation (which in its own right could be an O(log n) operation). • We'd like to minimise disk I/O – if we turn the binary into an m-ary (with m > 2) search tree, each internal node can have up to m branches and the longest paths to leaf nodes in our trees will become shorter • Each node now contains up to m-1 keys to support all branches, and all data items are stored in the leaf nodes • Price to pay: We now need to make up to m-1 comparisons between the search key and the keys in each node
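
A sketch of the search step this describes (Python; the node layout and the convention that keys equal to a separator go to the right-hand branch are assumptions for illustration): in each internal node, up to m-1 key comparisons pick one of the m branches, and a leaf is then scanned for the item.

    # One node of an m-ary search tree: at most m-1 sorted keys; internal nodes
    # have up to m children, leaf nodes hold the actual data items.
    class MNode:
        def __init__(self, keys, children=None, items=None):
            self.keys = keys
            self.children = children      # None for leaves
            self.items = items            # None for internal nodes

    def search(node, key):
        if node.children is None:         # leaf: scan the stored items
            return key in node.items
        i = 0
        while i < len(node.keys) and key >= node.keys[i]:
            i += 1                        # up to m-1 comparisons per node
        return search(node.children[i], key)

    # Toy 4-ary example (flattened, not the tree on the next slide).
    root = MNode([5, 19, 24],
                 children=[MNode([], items=[0, 1, 4]),
                           MNode([], items=[6, 11, 14, 16]),
                           MNode([], items=[19, 22, 23]),
                           MNode([], items=[26, 29, 42])])
    print(search(root, 22), search(root, 13))   # True False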

  34. m-ary search trees • [Example diagram: root with keys 5, 19, 24; child nodes with keys 1,4 / 10,12,16 / 21,23 / 27,35; data items stored in the leaves]

  35. B-trees • B-trees are m-ary search trees that store multiple data items in each leaf • The items in each leaf are stored as a list of items with keys smaller than the next higher key in the parent node • Lists are retrieved in one disk I/O process • List lengths vary at most by a factor of two – when a list becomes too long or too short, the tree is rebalanced

  36. B-trees • [Example diagram: root with keys 50, 190, 246; internal nodes with keys 10,42 / 101,120,162 / 211,235 / 279,352; each leaf holds a sorted list of data items retrievable in one disk I/O]
