Sorting Data

• Considerations
  • Average, best, and worst case complexity
    • For swaps and compares
  • Is extra memory required?
  • How difficult is it to program?
  • Stability of equal keys
• What is the fastest possible sort using comparisons?

Elementary Sorting Methods
Complexity O(n^2); bubble sort and insertion sort are stable, but selection sort implemented with swaps is not
• Bubble Sort (N(N-1)/2 comparisons, about N(N-1)/4 swaps)
• Selection Sort (N(N-1)/2 comparisons, N-1 swaps)
  • Minimizes the number of swaps
  • Worst case equals average case
• Insertion Sort (about N(N-1)/4 comparisons and copies on average)
  • Good for lists that are nearly sorted (O(N) best case)

Bubble Sort
Pairwise compares of adjacent elements; swap where necessary

    pass = 0;
    swaps = true;
    while (pass < n && swaps) {
        swaps = false;
        // Each pass leaves the largest remaining element in place, so the
        // scan stops one slot earlier; index+1 must stay below n.
        for (index = 0; index < n - 1 - pass; index++) {
            if (sortArray[index] > sortArray[index + 1]) {
                swap(sortArray, index, index + 1);
                swaps = true;
            }
        }
        pass++;
    }

Selection Sort
Find the minimum n-1 times

    for (i = 0; i < n - 1; i++) {
        minimum = i;
        for (j = i + 1; j < n; j++) {
            if (sortArray[j] < sortArray[minimum]) minimum = j;
        }
        swap(sortArray, i, minimum);
    }

Insertion Sort
Insert the next entry into a growing sorted table

    for (i = 1; i < n; i++) {
        j = i;
        save = sortArray[i];
        // Shift larger entries one slot right to open a gap for save.
        while (j > 0 && save < sortArray[j - 1]) {
            sortArray[j] = sortArray[j - 1];
            j--;
        }
        sortArray[j] = save;
    }

Proof by Induction
• Select the base case (n = 1)
• State the hypothesis (assume for n = k)
• State what is to be proved (prove for n = k+1)
• Example:
  Base case: For n = 1, 1 = 1 * 2 / 2 = 1
  Hypothesis: Assume for n = k, 1 + 2 + … + k = k * (k+1)/2
  To prove: 1 + 2 + … + (k+1) = (k+1) * (k+2)/2
  1 + 2 + … + (k+1) = 1 + 2 + … + k + (k+1)
  By the hypothesis, this equals
    k * (k+1)/2 + (k+1)
    = (k+1)(k/2 + 1)
    = (k+1)(k/2 + 2/2)
    = (k+1)(k+2)/2
  Therefore, by induction, the relationship holds for all n >= 1

Recursion
Useful for advanced sorts and for divide & conquer algorithms
• Relationship to mathematical induction
• Key design principles
  • Relationship between algorithm(n) and algorithm(m) where m < n
  • Base case (How does it stop?)
• When is it useful? What is the overhead?
  • Relationship between n and m
  • Tail recursion with a single recursive call
  • Replace by manually creating stacks
• Examples
  • Simple loop, factorial, gcd, binary search, Towers of Hanoi

Recursion Examples
• Factorial: 5! = 5 * 4!
• Greatest Common Divisor: gcd(x,y) = gcd(y%x, x) if x < y, with gcd(0,y) = y as the base case
• Binary Search

    int binSearch(double[] array, int low, int high, double value) {
        if (low > high) return -1;              // Base case: empty range, value not found
        int middle = (low + high) / 2;
        if (value < array[middle])
            return binSearch(array, low, middle - 1, value);
        else if (value > array[middle])
            return binSearch(array, middle + 1, high, value);
        else
            return middle;
    }
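The factorial and gcd relations on this slide can be written out directly. The following is a minimal sketch, not taken from the slides: the class name RecursionSketch and the method names are assumptions for illustration, and inputs are assumed to be non-negative. The loop version shows how a single tail call can be replaced by iteration.

```java
// Illustrative sketch; class and method names are hypothetical.
class RecursionSketch {
    // n! = n * (n-1)!, with n <= 1 as the base case. Overflows long for n > 20.
    static long factorial(int n) {
        if (n <= 1) return 1;
        return n * factorial(n - 1);
    }

    // gcd(x, y) = gcd(y % x, x); the recursion stops when x reaches 0.
    static int gcd(int x, int y) {
        if (x == 0) return y;
        return gcd(y % x, x);
    }

    // Same computation with the tail call replaced by a loop.
    static int gcdLoop(int x, int y) {
        while (x != 0) {
            int r = y % x;
            y = x;
            x = r;
        }
        return y;
    }

    public static void main(String[] args) {
        System.out.println(factorial(5));     // 120
        System.out.println(gcd(12, 18));      // 6
        System.out.println(gcdLoop(18, 12));  // 6
    }
}
```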

Breaking the O(N^2) Barrier
Based on either bubble or insertion sort
Complexity from O(N^(7/6)) to O(N^(3/2)) based on gap selection
• Shell Sort (based on insertion)

    while (gap > 0) {
        for (index = gap; index < n; index++) {
            temp = sortArray[index];
            compareIndex = index;
            while (compareIndex >= gap && sortArray[compareIndex - gap] >= temp) {
                sortArray[compareIndex] = sortArray[compareIndex - gap];
                compareIndex -= gap;
            }
            sortArray[compareIndex] = temp;
        }
        gap = adjustGap(gap);   // different patterns: gap/2, (gap-1)/3, (gap+1)/2
    }

Shell Sort (based on bubble)

    int index;
    while (gap > 0) {
        swaps = true;
        while (swaps) {
            swaps = false;
            // Compare every pair that is gap apart; index+gap must stay below n.
            for (index = 0; index + gap < n; index++) {
                if (sort[index] > sort[index + gap]) {
                    swap(sort, index, index + gap);
                    swaps = true;
                }
            }
        }
        gap = adjustGap(gap);
    }
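The Shell sort slides leave the starting gap and adjustGap unspecified. Below is a minimal sketch of the insertion-based version, assuming the (gap-1)/3 pattern, which walks the sequence …, 40, 13, 4, 1. The class name ShellSortSketch and the helper initialGap are hypothetical, and other gap patterns would slot into the same two helpers.

```java
// Illustrative sketch; ShellSortSketch and initialGap are hypothetical names.
class ShellSortSketch {
    // Largest gap in the sequence 1, 4, 13, 40, ... (gap = 3*gap + 1)
    // that the loop condition allows for an array of length n.
    static int initialGap(int n) {
        int gap = 1;
        while (gap < n / 3) gap = 3 * gap + 1;
        return gap;
    }

    // Inverse of gap = 3*gap + 1; returns 0 after gap 1, which ends the sort.
    static int adjustGap(int gap) {
        return (gap - 1) / 3;
    }

    static void shellSort(double[] sortArray) {
        int n = sortArray.length;
        for (int gap = initialGap(n); gap > 0; gap = adjustGap(gap)) {
            // Gapped insertion sort: each element is inserted into the
            // already-sorted run of elements that are gap apart.
            for (int index = gap; index < n; index++) {
                double temp = sortArray[index];
                int compareIndex = index;
                while (compareIndex >= gap && sortArray[compareIndex - gap] > temp) {
                    sortArray[compareIndex] = sortArray[compareIndex - gap];
                    compareIndex -= gap;
                }
                sortArray[compareIndex] = temp;
            }
        }
    }

    public static void main(String[] args) {
        double[] data = {9, 3, 7, 1, 8, 2, 6, 4, 5, 0};
        shellSort(data);
        System.out.println(java.util.Arrays.toString(data));
    }
}
```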

Merge Sort
Always O(N lg N), but needs more memory
• Merge Sort

    void mergeSort(double[] sortArray, int low, int high) {
        int mid = (low + high) / 2;
        if (low == high) return;
        mergeSort(sortArray, low, mid);
        mergeSort(sortArray, mid + 1, high);
        merge(sortArray, low, mid, high);
    }

• Merge method must:
  • Allocate an array for copying
  • Merge two sorted arrays together
  • Copy back to original array

Merge method

    void merge(double[] sort, int low, int middle, int high) {
        int n = high - low + 1;
        int lowPtr = low, highPtr = middle + 1, spot = 0;
        double[] work = new double[n];
        // Take the smaller head of the two sorted runs; <= keeps equal keys in order.
        while (lowPtr <= middle && highPtr <= high) {
            if (sort[lowPtr] <= sort[highPtr]) work[spot++] = sort[lowPtr++];
            else work[spot++] = sort[highPtr++];
        }
        while (lowPtr <= middle) work[spot++] = sort[lowPtr++];
        while (highPtr <= high) work[spot++] = sort[highPtr++];
        // Copy the merged run back to the original array.
        lowPtr = low;
        for (spot = 0; spot < n; spot++) sort[lowPtr++] = work[spot];
    }

Analysis of Merge Sort
Sizes of the subarrays at each level for N = 16:

    16
    8 8
    4 4 4 4
    2 2 2 2 2 2 2 2

Work at each level totals 16; lg 16 = 4 levels; complexity = 16 lg 16
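The level-by-level tally can be checked by instrumenting a merge sort. This is a rough sketch under stated assumptions: the class MergeWorkCount and its copies counter are hypothetical, and "work" is taken to mean the number of elements written into the work array during merging.

```java
// Illustrative sketch; MergeWorkCount and the copies counter are hypothetical.
class MergeWorkCount {
    static long copies = 0;   // elements written into work arrays so far

    static void mergeSort(double[] a, int low, int high) {
        if (low >= high) return;
        int mid = (low + high) / 2;
        mergeSort(a, low, mid);
        mergeSort(a, mid + 1, high);
        merge(a, low, mid, high);
    }

    static void merge(double[] a, int low, int mid, int high) {
        double[] work = new double[high - low + 1];
        int lowPtr = low, highPtr = mid + 1, spot = 0;
        while (lowPtr <= mid && highPtr <= high)
            work[spot++] = (a[lowPtr] <= a[highPtr]) ? a[lowPtr++] : a[highPtr++];
        while (lowPtr <= mid) work[spot++] = a[lowPtr++];
        while (highPtr <= high) work[spot++] = a[highPtr++];
        System.arraycopy(work, 0, a, low, work.length);
        copies += work.length;   // every element of this run was merged once
    }

    public static void main(String[] args) {
        double[] data = new double[16];
        for (int i = 0; i < data.length; i++) data[i] = data.length - i;
        copies = 0;
        mergeSort(data, 0, data.length - 1);
        System.out.println(copies);   // 64 = 16 * lg 16
    }
}
```

For N = 16 the merges handle 8 runs of 2, 4 runs of 4, 2 runs of 8, and 1 run of 16, which is 16 elements per level over 4 levels.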

Quick Sort
O(N lg N) average case, in place

    void quickSort(double[] sortArray, int left, int right) {
        if (right <= left) return;
        double pivot = sortArray[right];
        int middle = partition(sortArray, left, right, pivot);
        quickSort(sortArray, left, middle - 1);
        quickSort(sortArray, middle + 1, right);
    }

• Refinements to avoid the O(N^2) worst case and speed up
  • Choice of pivot
  • Combining with insertion sort
• Other uses (find the kth biggest number)
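Finding the kth biggest number reuses partitioning but follows only the side that contains the answer. The sketch below is one possible way to do it, not the slides' method: it assumes a simpler single-scan partition with the rightmost element as pivot, and the names QuickSelectSketch and kthLargest are hypothetical. Note that it reorders the array.

```java
// Illustrative sketch; QuickSelectSketch and kthLargest are hypothetical names.
class QuickSelectSketch {
    // Returns the k-th largest value (k = 1 is the maximum). Reorders the array.
    static double kthLargest(double[] a, int k) {
        if (k < 1 || k > a.length) throw new IllegalArgumentException("k out of range");
        int target = a.length - k;          // index of the answer in ascending order
        int left = 0, right = a.length - 1;
        while (left < right) {
            int p = partition(a, left, right);
            if (p == target) return a[p];
            if (p < target) left = p + 1;   // answer lies in the right part
            else right = p - 1;             // answer lies in the left part
        }
        return a[left];
    }

    // Single-scan partition: values below the pivot move to the front,
    // then the pivot is swapped into its final position, which is returned.
    static int partition(double[] a, int left, int right) {
        double pivot = a[right];
        int store = left;
        for (int i = left; i < right; i++) {
            if (a[i] < pivot) {
                swap(a, i, store);
                store++;
            }
        }
        swap(a, store, right);
        return store;
    }

    static void swap(double[] a, int i, int j) {
        double t = a[i];
        a[i] = a[j];
        a[j] = t;
    }

    public static void main(String[] args) {
        double[] data = {7, 2, 9, 4, 1};
        System.out.println(kthLargest(data, 2));   // 7.0
    }
}
```

Only one side is processed after each partition, so the average cost is O(N) rather than O(N lg N), while the worst case remains O(N^2) for unlucky pivots.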

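Quick Sort Partitioning

    int partition(double[] sortArray, int left, int right, double pivot) {
        int origLeft = left, origRight = right;
        left -= 1;
        for (;;) {
            // Scan right for an element >= pivot; the pivot at origRight stops the scan.
            while (sortArray[++left] < pivot);
            // Scan left for an element <= pivot without running past the start.
            while (right > origLeft && sortArray[--right] > pivot);
            if (left >= right) break;
            swap(sortArray, left, right);
        }
        swap(sortArray, left, origRight);   // put the pivot in its final position
        return left;
    }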

Radix Sort (First Version)
• Choose the number of buckets (b)
• Drop the next significant part of the data into buckets, starting with the least significant
• Gather data from buckets back into original array
• Repeat the above two steps, finishing at the most significant piece of data
• Notes
  • Maximum memory needed for each bucket (any bucket may receive all n elements)
  • Complexity: O(p * 2n) where
    • p = number of passes = number of base-b digits in the largest value (Max)
    • 2n because dropping and gathering touches each element twice
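The drop and gather steps can be sketched as follows. This is a minimal illustration, assuming non-negative int keys and b >= 2; the class name BucketRadixSketch is hypothetical. Each bucket is allocated with room for all n values, which is the memory cost noted on the slide.

```java
// Illustrative sketch; BucketRadixSketch is a hypothetical name.
// Assumes non-negative keys and at least two buckets.
class BucketRadixSketch {
    static void radixSort(int[] data, int b) {
        int n = data.length;
        int max = 0;
        for (int x : data) max = Math.max(max, x);

        int[][] buckets = new int[b][n];   // worst case: one bucket receives everything
        int[] sizes = new int[b];

        // One pass per base-b digit of the largest value, least significant first.
        for (long place = 1; max / place > 0; place *= b) {
            java.util.Arrays.fill(sizes, 0);
            for (int x : data) {                       // drop
                int digit = (int) ((x / place) % b);
                buckets[digit][sizes[digit]++] = x;
            }
            int spot = 0;
            for (int d = 0; d < b; d++)                // gather, bucket 0 first
                for (int i = 0; i < sizes[d]; i++)
                    data[spot++] = buckets[d][i];
        }
    }

    public static void main(String[] args) {
        int[] data = {170, 45, 75, 90, 802, 24, 2, 66};
        radixSort(data, 10);
        System.out.println(java.util.Arrays.toString(data));
    }
}
```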

Radix Sort Example
Notes:
• Each pass is a digit of the data: (x / 10^(pass - 1)) % 10
• Two passes because the largest number < number of buckets squared
• Complexity is O(2pn) = O(pn), where p is the number of passes
• In this case, only two elements are in each bucket, but we couldn't depend on that in the general case

Refined Radix Sort
• Create and initialize an array (Counts) of size buckets + 1
• Initialize the array to zeroes
• Store actual bucket sizes into Counts array (starting index = 1)
• Perform a prefix sum on Counts array to compute starting offsets
• Use Counts array to drop elements into a second array of numbers
• Advantages:
  • Use alternating arrays to avoid the gather operations
  • Only two times the memory is needed
• Complexity: O(p(2n + 2b)) = O(p(n+b))
• Notes:
  • Increased buckets can reduce the number of passes, but prefix sum overhead will limit performance benefits.
  • Radix sort does no comparisons, so the O(n lg n) limitation doesn't apply.

Refined Radix Example
• Dump from original array to alternate array
• No gather operation needed
• Index to store into count array is one bigger than the bucket. Count array index 5 in the above example has a count of 1 because it corresponds to bucket 4.
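The counting and prefix-sum steps of the refined version can be sketched like this. It is an illustration under assumptions, not the slides' code: keys are non-negative ints, b >= 2, and the names CountingRadixSketch, from, and to are hypothetical. Bucket d is counted at index d + 1, so after the prefix sum counts[d] holds the starting offset of bucket d.

```java
// Illustrative sketch; CountingRadixSketch is a hypothetical name.
// Assumes non-negative keys and at least two buckets.
class CountingRadixSketch {
    // Returns a sorted copy; the input array is left unchanged.
    static int[] radixSort(int[] data, int b) {
        int n = data.length;
        int max = 0;
        for (int x : data) max = Math.max(max, x);

        int[] from = data.clone();
        int[] to = new int[n];                 // the second (alternate) array

        for (long place = 1; max / place > 0; place *= b) {
            int[] counts = new int[b + 1];     // Java zero-fills new arrays
            for (int x : from)                 // size of bucket d goes to index d + 1
                counts[(int) ((x / place) % b) + 1]++;
            for (int d = 1; d <= b; d++)       // prefix sum: counts[d] = start of bucket d
                counts[d] += counts[d - 1];
            for (int x : from)                 // drop directly into final position
                to[counts[(int) ((x / place) % b)]++] = x;
            int[] tmp = from;                  // alternate arrays instead of gathering
            from = to;
            to = tmp;
        }
        return from;
    }

    public static void main(String[] args) {
        int[] sorted = radixSort(new int[]{170, 45, 75, 90, 802, 24, 2, 66}, 10);
        System.out.println(java.util.Arrays.toString(sorted));
    }
}
```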

Optimal Comparison Sort
• There are n! possible orderings of n keys, so the decision tree needs at least n! leaves
• All comparison sorts can be modeled with a decision tree
• Optimal sort will be completely balanced
• Depth of the balanced decision tree is O(lg(n!))
[Figure: decision tree in which each node is a compare with a "<=" branch and a ">" branch]

Proof that the optimal comparison sort is O(n lg n)
• Optimal comparison sort <= O(n lg n)
    lg(n!) = lg(n) + lg(n-1) + lg(n-2) + … + lg(1)
           < lg n + lg n + lg n + … + lg n
           = n lg n = O(n lg n)
• Optimal comparison sort >= O(n lg n)
    lg(n!) = [lg(n) + … + lg(n/2+1)] + [lg(n/2) + … + lg(n/4+1)] + [lg(n/4) + … + lg(n/8+1)] + …
           > n/2 lg(n/2) + n/4 lg(n/4) + n/8 lg(n/8) + …
           = n/2 lg n - n/2 lg 2 + n/4 lg n - n/4 lg 4 + n/8 lg n - n/8 lg 8 + …
           ≈ n lg n - 1/2 n - 2/4 n - 3/8 n - 4/16 n - …
           = n lg n - n (1/2 + 2/4 + 3/8 + 4/16 + …)
           = n lg n - 2n
    But O(n lg n - 2n) = O(n lg n)
• Therefore optimal sort = O(n lg n)
• The series is well known: S = 1/2 + 2/4 + 3/8 + … = ∑ k/2^k = 2
• Proof: 2S - S = S = (1 + 1 + 6/8 + 8/16 + 10/32 + …) - (1/2 + 2/4 + 3/8 + 4/16 + …) = 1 + 1/2 + 1/4 + 1/8 + … = 2
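The two bounds on lg(n!) and the value of the series can be spot-checked numerically. This is only a sanity check under floating-point arithmetic, not a substitute for the proof; the class LgFactorialCheck and its method names are hypothetical.

```java
// Illustrative sketch; LgFactorialCheck and its methods are hypothetical names.
class LgFactorialCheck {
    static double lg(double x) {
        return Math.log(x) / Math.log(2);
    }

    // lg(n!) computed as lg(2) + lg(3) + ... + lg(n) to avoid overflowing n!.
    static double lgFactorial(int n) {
        double sum = 0;
        for (int i = 2; i <= n; i++) sum += lg(i);
        return sum;
    }

    // Partial sum of 1/2 + 2/4 + 3/8 + ... over the given number of terms.
    static double seriesSum(int terms) {
        double sum = 0, power = 1;
        for (int k = 1; k <= terms; k++) {
            power *= 2;
            sum += k / power;
        }
        return sum;
    }

    public static void main(String[] args) {
        for (int n : new int[]{16, 1024, 1 << 20}) {
            double upper = n * lg(n);
            double lower = upper - 2.0 * n;
            System.out.println(n + ": " + lower + " < " + lgFactorial(n) + " < " + upper);
        }
        System.out.println("series partial sum: " + seriesSum(60));
    }
}
```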
