The Discrepancy Method: An Introduction Using Minimum Spanning Trees
Did you know? “Recursion comes from the verb recur. There is no verb recurse.” http://www.cse.ucsc.edu/~larrabee/ce185/reader/node191.html
Did you know? If m/n = Ω(log^O(1) n), then α(m, n) = O(1). Linear-Time Pointer-Machine Algorithms for Least Common Ancestors, MST Verification and Dominators by Buchsbaum, Kaplan, Rogers and Westbrook
Credit The Discrepancy Method by Bernard Chazelle
Credit Finding MST in O(m α(m,n)) Time by Seth Pettie
The Discrepancy Method Minimum Finding
Minimum Finding For boys and girls: Given an unsorted array A of n unique integers, find the minimum.
One Possible Algorithm Divide A into 100 parts. For each part, recursively find the minimum within that part. Find the minimum among the 100 minimum elements from each part.
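For concreteness, here is a minimal Python sketch of this scheme; the function name and the base-case cutoff are my own choices, not from the slides.

```python
# A minimal sketch of the 100-way recursive minimum (names and the
# base-case cutoff are illustrative).
def find_min(A):
    if len(A) <= 100:                # small enough: scan directly
        return min(A)
    step = (len(A) + 99) // 100      # split A into (at most) 100 parts
    minima = [find_min(A[i:i + step]) for i in range(0, len(A), step)]
    return min(minima)               # minimum of the per-part minima
```

Of course this computes nothing a single scan would not; the point is the shape of the recursion: the per-part minima are the subset the final step works with.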
A Term To Introduce Those 100 elements obtained from recursion form a low discrepancy subset.
"Definition" Roughly speaking, a low discrepancy subset is a subset that is representative, i.e. (we hope) the solution of the problem using this subset is “close to” the solution using the original set.
About Size How big should the low discrepancy subset be? We could have divided A into • 100 parts • log(n) parts • n/5 parts
Let's Try Sampling Sample 1% of the elements uniformly at random. Recursively find their minimum x. Reject all elements larger than x. What is the expected number of elements that remain? With n = 100k elements and a sample of size k, the sample minimum has expected rank about n/(k+1), so about 100k/(k+1) ≈ 100 elements remain.
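A quick simulation of the sampling step (the 1% rate and all names are illustrative; the recursive call is replaced by a direct min for brevity):

```python
import random

# Sample ~1% of A uniformly at random, take the sample minimum x, and
# reject everything larger than x (min() stands in for the recursive
# minimum-finding call).
def sample_and_reject(A):
    k = max(1, len(A) // 100)
    x = min(random.sample(A, k))
    return [a for a in A if a <= x]

A = random.sample(range(10**6), 100_000)
print(len(sample_and_reject(A)))     # about 100 in expectation
```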
"Close To" ??? These 1% of the elements also form a low discrepancy subset. Their minimum is “close to” the true minimum.
The Discrepancy Method Median Finding
Median Finding For adults: Given an unsorted array A of n unique integers, find the median.
A Randomized Algorithm Pick a pivot element such that it splits A into a (¼, ¾) split (or even better). Recur on the correct side.
A Deterministic Algorithm Divide A into groups of 5. Find the median of each group. Recursively find the median x of these medians and use x to pivot, then recur. • How far away is x from the true median? • What is the low discrepancy subset here?
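A hedged Python sketch of this selection scheme (the function names and the small-input cutoff are mine; assumes distinct elements):

```python
# Median of medians: the group medians are the low discrepancy subset,
# and their median x is the pivot. Assumes the elements of A are distinct.
def select(A, k):
    """Return the element of rank k (0-based) in A."""
    if len(A) <= 5:
        return sorted(A)[k]
    groups = [A[i:i + 5] for i in range(0, len(A), 5)]
    medians = [sorted(g)[len(g) // 2] for g in groups]  # low discrepancy subset
    x = select(medians, len(medians) // 2)              # median of the medians
    lo = [a for a in A if a < x]
    hi = [a for a in A if a > x]
    if k < len(lo):
        return select(lo, k)                            # recur on the correct side
    if k == len(lo):
        return x
    return select(hi, k - len(lo) - 1)

def median(A):
    return select(A, (len(A) - 1) // 2)
```

The pivot x is guaranteed to have rank between roughly 3n/10 and 7n/10, which answers the first question: not far.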
The Discrepancy Method Identify a low discrepancy subset. Solve the problem using it (probably with divide-and-conquer). Patch the almost-right solution to obtain the surely-right solution.
About Size How big should the low discrepancy subset be? We could have divided A into • 100 parts • log(n) parts • n/5 parts
Trade-off A bigger subset has lower discrepancy: less work to patch the solution, but more time to get a solution in the first place. Doesn't seem to have an easy answer. Be creative!
A Lossy Data Structure Soft Heaps
Soft Heaps Items with keys from a total order. Amortized costs: • create(S): O(1) • delete(S, x): O(1) • findmin(S): O(1) • meld(S, S'): O(1) • insert(S, x): O(log 1/ε)
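To make the interface concrete, here is a hypothetical stand-in in Python, backed by an exact binary heap; it matches the operation names above but none of the amortized bounds, and it never corrupts keys (effectively ε = 0):

```python
import heapq

# A stand-in with the soft heap interface; exact, so no corruption, and
# the costs below do NOT match the soft heap's bounds. Names are mine.
class SoftHeapStandIn:
    def __init__(self, eps=0.5):
        self.eps = eps               # a real soft heap corrupts at most eps*n keys
        self._h = []

    def insert(self, x):             # soft heap: O(log 1/eps) amortized
        heapq.heappush(self._h, x)

    def findmin(self):               # soft heap: O(1) amortized (may return a corrupted ckey)
        return self._h[0]

    def delete(self, x):             # soft heap: O(1) amortized; this stand-in is O(n)
        self._h.remove(x)
        heapq.heapify(self._h)

    def meld(self, other):           # soft heap: O(1) amortized
        self._h += other._h
        heapq.heapify(self._h)
        other._h = []
```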
The Catch Soft heaps can corrupt the keys! Consider a mixed sequence of operations that includes n inserts. For any error rate 0 < ε ≤ ½, a soft heap can contain at most εn corrupted keys at a time.
Key Corruption The values of certain keys can be increased at the sole discretion of the soft heap. Once corrupted, a key will remain corrupted.
Even Worse When you delete a (corrupted) key, some other keys may be corrupted during the deletion. It is possible for all keys to become corrupted over the lifetime of the soft heap.
The Worst Because of deletions, the proportion of corrupted items inside a soft heap could be much greater than ε. (The theorem says "in a mixed sequence of operations that includes n inserts…": the bound εn is relative to the number of inserts, not to the current size.)
Median Once More Note its online nature. 1. Pick ε = ¼. 2. Insert the n integers. 3. Do n/2 findmins, each followed by a deletion. • Among the keys deleted, how far away is the largest original key from the true median?
Answer: Not Far We have n/2 elements left. At most n/4 of them are corrupted. The worst case is when those n/4 elements were small in the beginning. So the largest original key we deleted has rank at most 3n/4. Now pivot!
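A sketch of this pivot-finding pass, using an exact heap as a stand-in (so the pivot it returns is the true median; with a real soft heap at ε = ¼, max(deleted) is only guaranteed to have rank at most 3n/4):

```python
import heapq

# Insert the n integers, then do n/2 findmins each followed by a deletion.
# With an exact heap the largest deleted key is the true median; a soft
# heap with eps = 1/4 only guarantees rank <= 3n/4, still good enough to
# pivot on.
def median_pivot(A):
    h = list(A)
    heapq.heapify(h)
    deleted = [heapq.heappop(h) for _ in range(len(A) // 2)]
    return max(deleted)              # the pivot
```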
Structure of Soft Heaps A binomial tree of rank k has 2^k nodes. A soft heap is a sequence of modified binomial trees of distinct rank, called soft queues.
Soft Queues • A binomial tree with possibly a few sub-trees pruned. • The rank of a node is the number of children it had in the original tree. (Hence it is an upper bound on its current number of children.) • Rank invariant: the root has at least ⌊rank(root)/2⌋ children.
Item Lists • A node contains an item list. • ckey is the common value assigned to all keys in the list (an upper bound on all of them). • A soft queue is heap-ordered w.r.t. ckeys. • Let r = 2⌈log 1/ε⌉ + 2. We require that all corrupted items be stored at nodes of rank > r.
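A minimal sketch of the node structure these slides imply (the Python field names are my own):

```python
from dataclasses import dataclass, field

# A hypothetical soft-queue node holding only the fields named on the
# slides: an item list, its common ckey, the rank, and the children.
@dataclass
class SoftQueueNode:
    ckey: int                                      # upper bound on every key in the item list
    items: list = field(default_factory=list)      # items whose keys share this ckey
    rank: int = 0                                  # children count in the original binomial tree
    children: list = field(default_factory=list)   # remaining (unpruned) children
```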
Example Soft Heap [figure: the result of melding two soft queues of rank 2; each node shows its item list and its ckey]
Sift sift(S) 1. If S has one node, done. 2. v = child of root with smallest key. 3. Move the key of v to the root. 4. sift(sub-tree rooted at v). 5. If the height of S is now odd, go to step 1.
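Read literally, the steps above can be sketched in Python on a toy tree of single-key nodes; this ignores item lists, ckeys, and the rank invariant that a real soft-queue sift must maintain, and all names are mine:

```python
# A toy transcription of the sift pseudocode on bare single-key nodes
# (a real sift concatenates item lists and updates ckeys instead).
class Node:
    def __init__(self, key, children=None):
        self.key = key
        self.children = children or []

def height(v):
    return 1 + max((height(c) for c in v.children), default=0)

def sift(S):
    while S.children:                             # step 1: a single node is done
        v = min(S.children, key=lambda u: u.key)  # step 2: child with smallest key
        S.key = v.key                             # step 3: move its key to the root
        if v.children:
            sift(v)                               # step 4: refill v from below
        else:
            S.children.remove(v)                  # an emptied leaf is pruned
        if height(S) % 2 == 0:                    # step 5: repeat only while odd
            break
```

In the real structure it is this occasional double sift that makes several keys travel together under one ckey, which is where corruption comes from.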
Final Word About Soft Heaps Optimality
Minimum Spanning Tree Overview
Graph Model • G = (V, E) • n vertices • m edges; multiple edges allowed, but no self-loops • edge cost of e is c(e); WLOG assume all costs distinct
A Brief History of MST
1926 Borůvka O(m log n)
1930 Jarník ..
1956 Kruskal ..
1957 Prim ..
1974 Tarjan (unpublished) O(m (log n)^(1/2))
1975 Yao O(m log log n)
1976 Cheriton and Tarjan O(m log log_d n), d = max(2, m/n)
1986 Fredman and Tarjan O(m β(m,n))
1986 Gabow et al. O(m log β(m,n))
1990 Fredman and Willard O(m), RAM
1994 Klein and Tarjan O(m), randomized
1997 Chazelle O(m α(m,n) log α(m,n))
1999 Chazelle, Pettie O(m α(m,n))
Two Rules • Cut: The cheapest edge crossing any cut is in the MST. • Cycle: The costliest edge on any cycle is not in the MST.
Cycle Rule Cycle: The costliest edge on any cycle is not in the MST. That also means that if an edge is not in the MST, there must be a cycle that witnesses this fact. (What is that cycle? The edge itself plus the MST path between its endpoints.)
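For illustration, a small Python sketch of a cycle-rule certificate: an edge (u, v) with cost c is witnessed as non-MST when c exceeds every edge cost on the tree path from u to v (the graph representation and names are mine):

```python
# The witnessing cycle for a non-tree edge (u, v, c) is the edge itself
# plus the tree path from u to v; the edge is certifiably non-MST iff c is
# the costliest edge on that cycle. Tree: {vertex: [(neighbor, cost), ...]}.
def max_cost_on_tree_path(tree, u, v):
    stack = [(u, None, 0)]                 # (vertex, parent, max cost so far)
    while stack:
        w, parent, best = stack.pop()
        if w == v:
            return best
        for x, c in tree[w]:
            if x != parent:
                stack.append((x, w, max(best, c)))
    raise ValueError("u and v are not connected in the tree")

def witnessed_non_mst(tree, u, v, c):
    return c > max_cost_on_tree_path(tree, u, v)

tree = {1: [(2, 4)], 2: [(1, 4), (3, 2)], 3: [(2, 2)]}
print(witnessed_non_mst(tree, 1, 3, 9))    # True: 9 > max(4, 2)
```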
The Sampling Advantage Most methods use divide-and-conquer by splitting up the graph using either • the distribution of the edge costs, or • the combinatorial structure, but rarely both. Sampling allows us to do both at once.
Deterministic Sampling Find a low discrepancy subgraph whose own MST bears witness to the non-MST status of many edges. (Still remember the cycle rule?)
Two Notations • G*R: the graph derived from G by raising the costs of all edges in R ⊆ E. • G\C: the graph derived from G by contracting the subgraph C into a single vertex c.
Contraction [figure: a graph G with edge costs; contracting the subgraph C into a single vertex yields G\C]
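A small sketch of contraction on an edge-list representation (names are mine; self-loops created by the merge are dropped, and parallel edges are kept, matching the multigraph model above):

```python
# G\C on an edge list [(u, v, cost), ...]: every vertex of C is renamed to
# a single new vertex c; edges inside C become self-loops and are dropped,
# while parallel edges survive (the model allows multigraphs).
def contract(edges, C, c="c"):
    def rename(v):
        return c if v in C else v
    out = []
    for u, v, cost in edges:
        u2, v2 = rename(u), rename(v)
        if u2 != v2:                       # drop self-loops created by the merge
            out.append((u2, v2, cost))
    return out
```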
The Past There is a high degree of freedom in choosing the contractions. (That’s why we have so many different algorithms.) But these algorithms all confront the same dilemma…
Dilemma Many MST algorithms identify the cheapest edge crossing a cut by maintaining all eligible edges in a heap. But as the graph gets contracted, vertex degrees tend to grow. So finding the cheapest edge becomes more and more difficult.