
CSE 326: Data Structures: Sorting


Presentation Transcript


  1. CSE 326: Data Structures: Sorting Lecture 13: Wednesday, Feb 5, 2003

  2. Today • Finish extensible hash tables • Sorting • Will take several lectures • Read Chapter 7! • Except Shellsort (7.4)

  3. Hash Tables on Secondary Storage (Disks) Main differences: • One bucket = one block, hence it may hold multiple keys • Open hashing (separate chaining): use overflow blocks when needed • Closed hashing (open addressing) is never used

  4. Hash Table Example • Assume 1 bucket (block) stores 2 keys + pointers • h(e)=0 • h(b)=h(f)=1 • h(g)=2 • h(a)=h(c)=3 [Figure: four buckets, labeled 0–3]

  5. Searching in a Hash Table • Search for a: • Compute h(a)=3 • Read bucket 3 • 1 disk access [Figure: buckets 0–3]

  6. Insertion in Hash Table • Place in the right bucket, if there is space • E.g. h(d)=2 [Figure: buckets 0–3]

  7. Insertion in Hash Table • Create an overflow block, if there is no space • E.g. h(k)=1 • More overflow blocks may be needed [Figure: buckets 0–3, with an overflow block chained to bucket 1]

  8. Hash Table Performance • Excellent, if there are no overflow blocks • Degrades considerably when the number of keys exceeds the number of buckets (i.e., many overflow blocks).
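A minimal in-memory C sketch of the bucket layout described above (the struct and names are mine; a real implementation reads and writes disk blocks, and the capacity of 2 keys per block matches the running example):

    #include <stdlib.h>

    #define KEYS_PER_BLOCK 2      /* one block holds 2 keys, as in the example */

    /* One bucket = one disk block; overflow blocks are chained. */
    typedef struct Block {
        int           nkeys;
        int           keys[KEYS_PER_BLOCK];
        struct Block *overflow;   /* next overflow block, or NULL */
    } Block;

    /* Insert into a bucket chain, allocating overflow blocks as needed. */
    void bucket_insert(Block *b, int key) {
        while (b->nkeys == KEYS_PER_BLOCK) {
            if (b->overflow == NULL)
                b->overflow = calloc(1, sizeof(Block));
            b = b->overflow;
        }
        b->keys[b->nkeys++] = key;
    }

In the on-disk setting, each overflow link followed is one more disk I/O, which is why performance degrades as chains grow.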

  9. Extensible Hash Table • Allows the hash table to grow, to avoid performance degradation • Assume a hash function h that returns numbers in {0, …, 2^k – 1} • Start with n = 2^i << 2^k buckets; only look at the i most significant bits of the hash

  10. Extensible Hash Table • E.g. i=1, n=2^i=2, k=4 • Note: we only look at the first bit (0 or 1) [Figure: directory with entries 0 and 1, each pointing to a bucket of 4-bit keys]
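The directory lookup is a single shift; a sketch under the slide's conventions (the function name is mine, and h is assumed to return a k-bit value):

    /* With a k-bit hash and a directory of 2^i entries, the bucket
       index is the i most significant bits of the hash. */
    unsigned bucket_index(unsigned hash, int k, int i) {
        return hash >> (k - i);
    }
    /* Slide's example: k = 4, i = 1, so hash 1110 maps to entry 1. */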

  11. Insertion in Extensible Hash Table • Insert 1110 [Figure: 1110 goes to the bucket for directory entry 1]

  12. Insertion in Extensible Hash Table • Now insert 1010 • Need to extend the table, split blocks • i becomes 2 [Figure: the bucket for directory entry 1 is full]

  13. Insertion in Extensible Hash Table [Figure: directory doubled to entries 00, 01, 10, 11; the 0-bucket keeps local depth 1, the two split 1-buckets have local depth 2]

  14. Insertion in Extensible Hash Table • Now insert 0000, then 0101 • Need to split the block [Figure: directory entries 00, 01, 10, 11; the shared 0-bucket with local depth 1 overflows]

  15. Insertion in Extensible Hash Table • After splitting the block [Figure: directory entries 00, 01, 10, 11; all buckets now have local depth 2]
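The doubling-and-splitting steps walked through on the last few slides, as one in-memory C sketch (all names are mine; buckets store raw hash values rather than records for brevity, and the corner case where every key shares the same k-bit hash is ignored):

    #include <stdlib.h>

    #define BLOCK_KEYS 2              /* capacity from the running example */

    typedef struct {
        int      depth;               /* local depth: bits this bucket uses */
        int      nkeys;
        unsigned keys[BLOCK_KEYS];
    } Bucket;

    typedef struct {
        int      i, k;                /* global depth, hash width in bits */
        Bucket **dir;                 /* directory of 2^i bucket pointers */
    } ExtHash;

    static void insert(ExtHash *h, unsigned hash) {
        Bucket *b = h->dir[hash >> (h->k - h->i)];
        if (b->nkeys < BLOCK_KEYS) { b->keys[b->nkeys++] = hash; return; }

        if (b->depth == h->i) {       /* bucket full and already uses all i
                                         bits: double the directory, i++ */
            int n = 1 << h->i;
            Bucket **nd = malloc(2 * n * sizeof *nd);
            for (int j = 0; j < n; j++)
                nd[2*j] = nd[2*j + 1] = h->dir[j];
            free(h->dir);
            h->dir = nd;
            h->i++;
        }

        /* Split the full bucket on its next bit and redistribute its keys. */
        Bucket *b0 = calloc(1, sizeof *b0), *b1 = calloc(1, sizeof *b1);
        b0->depth = b1->depth = b->depth + 1;
        for (int j = 0; j < (1 << h->i); j++)      /* repoint directory slots */
            if (h->dir[j] == b)
                h->dir[j] = ((j >> (h->i - b0->depth)) & 1) ? b1 : b0;
        for (int j = 0; j < b->nkeys; j++) {
            unsigned x = b->keys[j];
            Bucket *t = ((x >> (h->k - b0->depth)) & 1) ? b1 : b0;
            t->keys[t->nkeys++] = x;
        }
        free(b);
        insert(h, hash);              /* retry; may trigger another split */
    }

Note how a plain insertion touches only a bucket or two, while a doubling rewrites every directory entry (compare the questions on the next slide).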

  16. Extensible Hash Table • How many buckets (blocks) do we need to touch after an insertion? • How many entries in the hash table do we need to touch after an insertion?

  17. Performance of Extensible Hash Tables • No overflow blocks: access is always O(1) • More precisely: exactly one disk I/O • BUT: • Extensions can be costly and disruptive • After an extension, the table may no longer fit in memory

  18. Sorting • Perhaps the most common operation in programs • The authoritative text: • D. Knuth, The Art of Computer Programming, Vol. 3

  19. Material to be Covered • Sorting by comparison: • Bubble Sort • Selection Sort • Merge Sort • QuickSort • Efficient list-based implementations • Formal analysis • Theoretical limitations on sorting by comparison • Sorting without comparing elements • Sorting and the memory hierarchy

  20. Bubble Sort Idea • We want A[1] ≤ A[2] ≤ … ≤ A[N] • Bubble sort idea: • If A[i-1] > A[i] then swap A[i-1] and A[i] • Do this for i = 1, …, N-1 • Repeat this until it's sorted

  21. Bubble Sort
  procedure BubbleSort (Array A, int N)
    repeat {
      isSorted = true;
      for (i = 1 to N-1) {
        if ( A[i-1] > A[i] ) {
          swap( A[i-1], A[i] );
          isSorted = false;
        }
      }
    } until isSorted

  22. Bubble Sort Improvements • After the 1st iteration: • largest element → A[N-1] • After the 2nd iteration: • second largest element → A[N-2] • Question: what is the max number of iterations, and hence the worst-case running time? • Improvement: stop the iterations earlier: • for (i=1 to N-1) • for (i=1 to N-2) • ... • for (i=1 to 1) • In fact we may be lucky and be able to decrease i more aggressively

  23. Bubble Sort
  procedure BubbleSort (Array A, int N)
    m = N;
    repeat {
      newM = 1;
      for (i = 1 to m-1) {
        if ( A[i-1] > A[i] ) {
          swap( A[i-1], A[i] );
          newM = i;   /* last swap position; everything beyond it is final */
        }
      }
      m = newM;
    } while m > 1
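A compilable C version of the improved pseudocode above (the function name and test values are mine):

    #include <stdio.h>

    /* Bubble sort with the last-swap optimization: after each pass,
       everything from the last swap position onward is final. */
    void bubble_sort(int a[], int n) {
        int m = n;
        while (m > 1) {
            int new_m = 1;
            for (int i = 1; i < m; i++) {
                if (a[i-1] > a[i]) {
                    int t = a[i-1]; a[i-1] = a[i]; a[i] = t;
                    new_m = i;
                }
            }
            m = new_m;
        }
    }

    int main(void) {
        int a[] = {5, 2, 6, 3, 7, 9, 8};
        bubble_sort(a, 7);
        for (int i = 0; i < 7; i++) printf("%d ", a[i]);  /* 2 3 5 6 7 8 9 */
        return 0;
    }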

  24. Bubble Sort • So the worst-case running time is T(n) = O(n^2) • Is the worst-case running time also Ω(n^2)? • You need to find a worst-case input of size n for which the running time is Ω(n^2).

  25. Selection Sort
  procedure SelectSort (Array A, int N)
    for (i = 0 to N-2) {
      /* find the minimum among A[i], ..., A[N-1] */
      /* place it in A[i] */
      m = i;
      for (j = i+1 to N-1)
        if ( A[m] > A[j] ) m = j;
      swap(A[i], A[m]);
    }
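The same procedure as directly compilable C (the function name is mine):

    /* Selection sort: repeatedly select the minimum of the unsorted
       suffix a[i..n-1] and swap it into position i. */
    void select_sort(int a[], int n) {
        for (int i = 0; i < n - 1; i++) {
            int m = i;
            for (int j = i + 1; j < n; j++)
                if (a[m] > a[j]) m = j;
            int t = a[i]; a[i] = a[m]; a[m] = t;
        }
    }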

  26. Selection Sort • Worst case running time: • T(n) = O( ?? ) • T(n) = Ω( ?? )

  27. Insertion Sort
  procedure InsertSort (Array A, int N)
    for (i = 1 to N-1) {
      /* A[0], A[1], ..., A[i-1] is sorted */
      /* now insert A[i] in the right place */
      x = A[i];
      for (j = i-1; j >= 0 && A[j] > x; j--)
        A[j+1] = A[j];
      A[j+1] = x;
    }
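And the corresponding compilable C (the function name is mine):

    /* Insertion sort: grow a sorted prefix a[0..i-1], inserting a[i]
       by shifting larger elements one slot to the right. */
    void insert_sort(int a[], int n) {
        for (int i = 1; i < n; i++) {
            int x = a[i];
            int j;
            for (j = i - 1; j >= 0 && a[j] > x; j--)
                a[j+1] = a[j];
            a[j+1] = x;               /* j+1 is the insertion point */
        }
    }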

  28. Insertion Sort • Worst case running time: • T(n) = O( ?? ) • T(n) = Ω( ?? )

  29. Merge Sort
  The Merge operation: given two sorted sequences
    A[0] ≤ A[1] ≤ ... ≤ A[m-1]
    B[0] ≤ B[1] ≤ ... ≤ B[n-1]
  construct another sorted sequence that is their union.
    Merge (A[0..m-1], B[0..n-1])
      i1 = 0, i2 = 0
      While i1 < m and i2 < n
        If A[i1] < B[i2]
          Next is A[i1]; i1++
        Else
          Next is B[i2]; i2++
        End If
      End While
      Copy whatever remains of A or B
  Analogy: merging cars by key [aggressiveness of driver]; the most aggressive goes first. Photo from http://www.nrma.com.au/inside-nrma/m-h-m/road-rage.html

  30. Merge Sort
  Function MergeSort (Array A[0..n-1])
    if n ≤ 1 return A
    return Merge( MergeSort(A[0..n/2-1]), MergeSort(A[n/2..n-1]) )
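A compilable array version of the above (names are mine); note the scratch buffer, which is exactly the 'in situ' problem slide 32 raises below:

    #include <stdlib.h>
    #include <string.h>

    /* Merge the sorted runs a[lo..mid-1] and a[mid..hi-1] via tmp. */
    static void merge(int a[], int tmp[], int lo, int mid, int hi) {
        int i = lo, j = mid, k = lo;
        while (i < mid && j < hi)
            tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
        while (i < mid) tmp[k++] = a[i++];   /* copy leftovers */
        while (j < hi)  tmp[k++] = a[j++];
        memcpy(a + lo, tmp + lo, (hi - lo) * sizeof(int));
    }

    static void msort(int a[], int tmp[], int lo, int hi) {
        if (hi - lo <= 1) return;
        int mid = lo + (hi - lo) / 2;
        msort(a, tmp, lo, mid);
        msort(a, tmp, mid, hi);
        merge(a, tmp, lo, mid, hi);
    }

    void merge_sort(int a[], int n) {
        int *tmp = malloc(n * sizeof(int));  /* the scratch array */
        msort(a, tmp, 0, n);
        free(tmp);
    }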

  31. Merge Sort Running Time
  Any difference best / worst case?
    T(1) = b
    T(n) = 2T(n/2) + cn   for n > 1
  Unrolling the recurrence:
    T(n) = 2T(n/2) + cn
         = 4T(n/4) + cn + cn            (substitute)
         = 8T(n/8) + cn + cn + cn       (substitute)
         = 2^k T(n/2^k) + kcn           (inductive leap)
         = nT(1) + cn log n             (select k = log n)
         = Θ(n log n)                   (simplify)

  32. Merge Sort • Works great with lists or files • Problems with arrays: • We need a scratch array; we cannot sort 'in situ'

  33. Heap Sort • Recall: a heap is a tree where the min is at the root • A heap is stored in an array A[1], ..., A[n]

  34. Heap Sort • Start with an unsorted array A[1], ..., A[n] • Build a heap • How much time does it take? • Get the minimum, store it in the output array; repeat n times

  35. Heap Sort • But then we need an extra array! • How can we do it 'in situ'?

  36. Heap Sort • Input: unordered array A[1..N] • Build a max heap (largest element is A[1]) • For i = 1 to N-1: A[N-i+1] = Delete_Max()
  Example (N = 10):
    Input:                 7 50 22 15 4 40 20 10 35 25
    After Build_heap:      50 40 20 25 35 15 10 22 4 7
    After 1st Delete_Max:  40 35 20 25 7 15 10 22 4 | 50
    After 2nd Delete_Max:  35 25 20 22 7 15 10 4 | 40 50

  37. Properties of Heap Sort • Worst case time complexity O(n log n) • Build_heap O(n) • n Delete_Max’s for O(n log n) • In-place sort – only constant storage beyond the array is needed
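An in-place, 0-indexed C sketch of this scheme (the slides use 1-based A[1..N]; names are mine):

    /* Sift a[i] down within a[0..n-1] to restore the max-heap property. */
    static void sift_down(int a[], int n, int i) {
        for (;;) {
            int largest = i, l = 2*i + 1, r = 2*i + 2;
            if (l < n && a[l] > a[largest]) largest = l;
            if (r < n && a[r] > a[largest]) largest = r;
            if (largest == i) return;
            int t = a[i]; a[i] = a[largest]; a[largest] = t;
            i = largest;
        }
    }

    void heap_sort(int a[], int n) {
        /* Build_heap: bottom-up sift-downs, O(n) total. */
        for (int i = n/2 - 1; i >= 0; i--)
            sift_down(a, n, i);
        /* n-1 Delete_Max's: swap the max to the end, shrink the heap. */
        for (int i = n - 1; i > 0; i--) {
            int t = a[0]; a[0] = a[i]; a[i] = t;
            sift_down(a, i, 0);
        }
    }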

  38. QuickSort [Picture from PhotoDisc.com] • Pick a "pivot". • Divide the list into two lists: • One with elements less than or equal to the pivot value • One with elements greater than the pivot • Sort each sub-problem recursively • The answer is the concatenation of the two solutions

  39. QuickSort: Array-Based Version • Pick the pivot • Partition with two cursors, < and > [Figure: the element 2 goes to the less-than side]

  40. QuickSort Partition (cont'd) [Figure: 6 and 8 are swapped across the less-than/greater-than cursors; 3 and 5 end up on the less-than side, 9 on the greater-than side] Partition done.

  41. QuickSort Partition (cont'd) • Put the pivot into its final position. [Figure: example array 5 2 6 3 7 9 8] • Recursively sort each side: 2 3 5 6 7 8 9
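One way to write the two-cursor partition sketched in these pictures, as compilable C (the pivot choice and names are mine, and details may differ from the original figures):

    /* Hoare-style partition: scan from both ends, swapping misplaced
       pairs across the pivot, then recurse on each side. */
    void quick_sort(int a[], int lo, int hi) {
        if (lo >= hi) return;
        int pivot = a[(lo + hi) / 2];   /* illustrative pivot choice */
        int i = lo, j = hi;
        while (i <= j) {
            while (a[i] < pivot) i++;   /* left cursor: skip "less-than"s */
            while (a[j] > pivot) j--;   /* right cursor: skip "greater-than"s */
            if (i <= j) {
                int t = a[i]; a[i] = a[j]; a[j] = t;
                i++; j--;
            }
        }
        quick_sort(a, lo, j);
        quick_sort(a, i, hi);
    }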

  42. QuickSort Complexity • QuickSort is fast in practice, but has Θ(N^2) worst-case complexity • Friday we will see why
