Chapter 12—Searching and Sorting

The Art and Science of Java: An Introduction to Computer Science
Eric S. Roberts
Chapter 12: Searching and Sorting

“I weep for you,” the Walrus said, “I deeply sympathize.”
With sobs and tears he sorted out
Those of the largest size
    (Lewis Carroll, Through the Looking Glass, 1872)

12.1 Searching
12.2 Sorting
12.3 Assessing algorithmic efficiency
12.4 Using data files

Searching
• This chapter looks at two operations on arrays, searching and sorting, both of which turn out to be important in a wide range of practical applications.
• The simpler of these two operations is searching, which is the process of finding a particular element in an array or some other kind of sequence. Typically, a method that implements searching will return the index at which a particular element appears, or -1 if that element does not appear at all. The element you’re searching for is called the key.
• The goal of Chapter 12, however, is not simply to introduce searching and sorting but rather to use these operations to talk about algorithms and efficiency. Many different algorithms exist for both searching and sorting; choosing the right algorithm for a particular application can have a profound effect on how efficiently that application runs.

Linear Search
• The simplest strategy for searching is to start at the beginning of the array and look at each element in turn. This algorithm is called linear search.
• Linear search is straightforward to implement, as illustrated in the following method that returns the first index at which the value key appears in array, or -1 if it does not appear at all:

    private int linearSearch(int key, int[] array) {
        for (int i = 0; i < array.length; i++) {
            if (key == array[i]) return i;
        }
        return -1;
    }

Simulating Linear Search

    public void run() {
        int[] primes = { 2, 3, 5, 7, 11, 13, 17, 19, 23, 29 };
        println("linearSearch(17) -> " + linearSearch(17, primes));
        println("linearSearch(27) -> " + linearSearch(27, primes));
    }

Console output:

    linearSearch(17) -> 6
    linearSearch(27) -> -1

[Animation: the simulation steps the index i through the primes array (indexes 0 to 9), comparing each element with key.]

A Larger Example
• To illustrate the efficiency of linear search, it is useful to work with a somewhat larger example.
• The example on the next slide works with an array containing the 286 telephone area codes assigned to the United States.
• The specific task in this example is to search this list to find the area code for the Silicon Valley area, which is 650.
• The linear search algorithm must examine the elements one at a time until it finds the matching value. As the array gets larger, the number of steps required for linear search grows in the same proportion.
• As you watch the slow process of searching for 650 on the next slide, try to think of a more efficient way in which you might search this particular array for a given area code.

Linear Search (Area Code Example)

Key: 650. The array holds the 286 area codes in ascending order:

    indexes   0-21:   201 202 203 205 206 207 208 209 210 212 213 214 215 216 217 218 219 224 225 228 229 231
    indexes  22-43:   234 239 240 248 251 252 253 254 256 260 262 267 269 270 276 281 283 301 302 303 304 305
    indexes  44-65:   307 308 309 310 312 313 314 315 316 317 318 319 320 321 323 325 330 331 334 336 337 339
    indexes  66-87:   347 351 352 360 361 364 385 386 401 402 404 405 406 407 408 409 410 412 413 414 415 416
    indexes  88-109:  417 419 423 424 425 430 432 434 435 440 443 445 469 470 475 478 479 480 484 501 502 503
    indexes 110-131:  504 505 507 508 509 510 512 513 515 516 517 518 520 530 540 541 551 559 561 562 563 564
    indexes 132-153:  567 570 571 573 574 575 580 585 586 601 602 603 605 606 607 608 609 610 612 614 615 616
    indexes 154-175:  617 618 619 620 623 626 630 631 636 641 646 650 651 660 661 662 678 682 701 702 703 704
    indexes 176-197:  706 707 708 712 713 714 715 716 717 718 719 720 724 727 731 732 734 740 754 757 760 762
    indexes 198-219:  763 765 769 770 772 773 774 775 779 781 785 786 801 802 803 804 805 806 808 810 812 813
    indexes 220-241:  814 815 816 817 818 828 830 831 832 835 843 845 847 848 850 856 857 858 859 860 862 863
    indexes 242-263:  864 865 870 878 901 903 904 906 907 908 909 910 912 913 914 915 916 917 918 919 920 925
    indexes 264-285:  928 931 936 937 940 941 947 949 951 952 954 956 959 970 971 972 973 978 979 980 985 989

The Idea of Binary Search
• The fact that the area code array is in ascending order makes it possible to find a particular value much more efficiently.
• The key insight is that you get more information by starting at the middle element than you do by starting at the beginning.
• When you look at the middle element in relation to the value you’re searching for, there are three possibilities:
    • If the value you are searching for is greater than the middle element, you can discount every element in the first half of the array.
    • If the value you are searching for is less than the middle element, you can discount every element in the second half of the array.
    • If the value you are searching for is equal to the middle element, you can stop because you’ve found the value you’re looking for.
• You can repeat this process on the elements that remain after each cycle. Because this algorithm proceeds by dividing the list in half each time, it is called binary search.

Binary Search (Area Code Example)

Key: 650. The array is the same 286-element area code array (indexes 0 to 285). All divisions are integer divisions.

1. Start with the element at index (0 + 285) / 2, which is the 602 at index 142. The key 650 is greater than 602, so discard the first half.
2. Continue with element (143 + 285) / 2, which is the 805 at index 214. The key 650 is less than 805, so discard the second half.
3. Continue with element (143 + 213) / 2, which is the 708 at index 178. The key 650 is less than 708, so discard the second half.
4. Continue with element (143 + 177) / 2, which is the 630 at index 160. The key 650 is greater than 630, so discard the first half.
5. Continue with element (161 + 177) / 2, which is the 662 at index 169. The key 650 is less than 662, so discard the second half.
6. Continue with element (161 + 168) / 2, which is the 646 at index 164. The key 650 is greater than 646, so discard the first half.
7. Continue with element (165 + 168) / 2, which is the 651 at index 166. The key 650 is less than 651, so discard the second half.
8. Continue with element (165 + 165) / 2, which is the 650 at index 165. The key 650 is equal to 650, so the process is finished.

Binary search needs to look at only eight elements to find 650.

Implementing Binary Search
• The following method implements the binary search algorithm for an integer array.

    private int binarySearch(int key, int[] array) {
        int lh = 0;
        int rh = array.length - 1;
        while (lh <= rh) {
            int mid = (lh + rh) / 2;
            if (key == array[mid]) return mid;
            if (key < array[mid]) {
                rh = mid - 1;
            } else {
                lh = mid + 1;
            }
        }
        return -1;
    }

• The text contains a similar implementation of binarySearch that operates on strings. The algorithm is the same.
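To make the difference between the two algorithms concrete, the sketch below runs both searches on the primes array from the earlier simulation and counts how many elements each one examines. The class name SearchSteps and the steps counter are assumptions added for illustration; they are not part of the slides. The midpoint is written as lh + (rh - lh) / 2, which gives the same value as (lh + rh) / 2 but cannot overflow for very large arrays.

```java
// Illustrative sketch: linear and binary search with a step counter.
class SearchSteps {
    static int steps;  // hypothetical counter: elements examined by the last search

    static int linearSearch(int key, int[] array) {
        steps = 0;
        for (int i = 0; i < array.length; i++) {
            steps++;
            if (key == array[i]) return i;
        }
        return -1;
    }

    static int binarySearch(int key, int[] array) {
        steps = 0;
        int lh = 0;
        int rh = array.length - 1;
        while (lh <= rh) {
            steps++;
            int mid = lh + (rh - lh) / 2;  // same value as (lh + rh) / 2
            if (key == array[mid]) return mid;
            if (key < array[mid]) {
                rh = mid - 1;   // discard the second half
            } else {
                lh = mid + 1;   // discard the first half
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] primes = { 2, 3, 5, 7, 11, 13, 17, 19, 23, 29 };
        System.out.println("linear 17 -> " + linearSearch(17, primes) + " after " + steps + " steps");
        System.out.println("binary 17 -> " + binarySearch(17, primes) + " after " + steps + " steps");
        System.out.println("linear 27 -> " + linearSearch(27, primes) + " after " + steps + " steps");
        System.out.println("binary 27 -> " + binarySearch(27, primes) + " after " + steps + " steps");
    }
}
```

Even on ten elements the gap is visible: finding 17 takes seven comparisons with linear search and four with binary search.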

Efficiency of Linear Search
• As the area code example makes clear, the running time of the linear search algorithm depends on the size of the array.
• The idea that the time required to search a list of values depends on how many values there are is not at all surprising. The running time of most algorithms depends on the size of the problem to which that algorithm is applied.
• In many applications, it is easy to come up with a numeric value that specifies the problem size, which is generally denoted by the letter N. For most array applications, the problem size is simply the size of the array.
• In the worst case, which occurs when the value you’re searching for comes at the end of the array or does not appear at all, linear search requires N steps. On average, it takes approximately half that time.

Efficiency of Binary Search
• The running time of binary search also depends on the number of elements, but in a profoundly different way.
• On each step in the process, the binary search algorithm rules out half of the remaining possibilities. In the worst case, the number of steps required is equal to the number of times you can divide the original size of the array in half until there is only one element remaining. In other words, what you need to find is the value of k that satisfies the following equation:

    $1 = N \underbrace{/\,2\,/\,2\,/\,2\,/\,2 \cdots /\,2}_{k \text{ times}}$

• You can simplify this formula using basic mathematics:

    $1 = N / 2^k$
    $2^k = N$
    $k = \log_2 N$

Comparing Search Efficiencies
• The difference in the number of steps required for the two search algorithms is illustrated by the following table, which compares the values of N and the closest integer to $\log_2 N$:

    N                 log2 N
    10                3
    100               7
    1000              10
    1,000,000         20
    1,000,000,000     30

• For large values of N, the difference in the number of steps required is enormous. If you had to search through a list of a million elements, binary search would need roughly 50,000 times fewer steps than linear search in the worst case (1,000,000 versus 20). If there were a billion elements, that factor would grow to about 33,000,000.
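The table entries and the two ratios quoted above can be reproduced with a few lines of Java. This is only a sketch: the class name LogTable and the helper roundedLog2 are assumptions introduced here, and the logarithm is computed in floating point with Math.log.

```java
// Illustrative sketch: closest integer to log2(N) and the worst-case step ratio.
class LogTable {
    // Closest integer to log2(n), using the identity log2(n) = ln(n) / ln(2).
    static long roundedLog2(long n) {
        return Math.round(Math.log(n) / Math.log(2));
    }

    public static void main(String[] args) {
        long[] sizes = { 10, 100, 1000, 1_000_000, 1_000_000_000 };
        for (long n : sizes) {
            long k = roundedLog2(n);
            // n steps for linear search versus k for binary search (worst case)
            System.out.println("N = " + n + "  log2 N = " + k + "  ratio = " + (n / k));
        }
    }
}
```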

Sorting
• Binary search works only on arrays in which the elements are arranged in order. The process of putting the elements of an array in order is called sorting.
• There are many algorithms that one can use to sort an array. As with searching, these algorithms can vary substantially in their efficiency, particularly as the arrays become large.
• Of all the algorithms presented in this text, sorting is by far the most important in terms of its practical applications. Alphabetizing a telephone directory, arranging library records by catalogue number, and organizing a bulk mailing by ZIP code are all examples of sorting that involve reasonably large collections of data.

The Selection Sort Algorithm
• Of the many sorting algorithms, the easiest one to describe is selection sort, which is implemented by the following code:

    private void sort(int[] array) {
        for (int lh = 0; lh < array.length; lh++) {
            int rh = findSmallest(array, lh, array.length);
            swapElements(array, lh, rh);
        }
    }

  The variables lh and rh indicate the positions of the left and right hands if you were to carry out this process manually. The left hand points to each position in turn; the right hand points to the smallest value in the rest of the array.
• The method findSmallest(array, p1, p2) returns the index of the smallest value in the array from position p1 up to but not including p2. The method swapElements(array, p1, p2) exchanges the elements at the specified positions.

Simulating Selection Sort

    public void run() {
        int[] test = { 809, 503, 946, 367, 987, 838, 259, 236, 659, 361 };
        sort(test);
    }

    private void sort(int[] array) {
        for (int lh = 0; lh < array.length; lh++) {
            int rh = findSmallest(array, lh, array.length);
            swapElements(array, lh, rh);
        }
    }

    private int findSmallest(int[] array, int p1, int p2) {
        int smallestIndex = p1;
        for (int i = p1 + 1; i < p2; i++) {
            if (array[i] < array[smallestIndex]) smallestIndex = i;
        }
        return smallestIndex;
    }

[Animation: the simulation traces lh and rh in sort, and p1, p2, smallestIndex, and i in findSmallest, over the ten-element test array (indexes 0 to 9).]
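The slides describe swapElements but do not show its body. The following self-contained sketch fills it in with the usual three-assignment swap through a temporary variable, which is an assumption about how the method is written; the class name SelectionSortDemo is likewise hypothetical. The methods are static here so that the example runs without the acm library.

```java
import java.util.Arrays;

// Illustrative sketch: a complete, runnable selection sort.
class SelectionSortDemo {
    static void sort(int[] array) {
        for (int lh = 0; lh < array.length; lh++) {
            int rh = findSmallest(array, lh, array.length);
            swapElements(array, lh, rh);
        }
    }

    // Returns the index of the smallest value in array[p1] .. array[p2 - 1].
    static int findSmallest(int[] array, int p1, int p2) {
        int smallestIndex = p1;
        for (int i = p1 + 1; i < p2; i++) {
            if (array[i] < array[smallestIndex]) smallestIndex = i;
        }
        return smallestIndex;
    }

    // Assumed implementation: exchanges the elements at positions p1 and p2.
    static void swapElements(int[] array, int p1, int p2) {
        int temp = array[p1];
        array[p1] = array[p2];
        array[p2] = temp;
    }

    public static void main(String[] args) {
        int[] test = { 809, 503, 946, 367, 987, 838, 259, 236, 659, 361 };
        sort(test);
        System.out.println(Arrays.toString(test));
    }
}
```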

Efficiency of Selection Sort
• As with the search algorithms presented on earlier slides, it is useful to understand how the running time of selection sort depends on the size of the array.
• One strategy is to measure the actual time it takes to run for arrays of different sizes. In Java, you can measure elapsed time by calling System.currentTimeMillis before and after some operation and noting the difference. Using this strategy, however, requires some care:
    • If an algorithm runs very quickly, the system clock will not be precise enough for accurate measurement. In such cases, you may need to run the algorithm several times and divide by the number of repetitions.
    • Most algorithms show some variability depending on the data. To avoid having this variability distort the results, it makes sense to run several independent trials with different data and average the results.
    • It is important to take account of the fact that Java will sometimes need to take actions, most notably garbage collection, that can take large amounts of time but are not actually relevant to the algorithm.
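The measurement strategy described above might be sketched as follows. The class name SortTimer, the method averageMillis, and the chosen sizes and repetition counts are all assumptions made for illustration. Each repetition sorts fresh random data, only the sort itself is timed, and the total is divided by the number of repetitions.

```java
import java.util.Random;

// Illustrative sketch: timing selection sort with System.currentTimeMillis.
class SortTimer {
    static void selectionSort(int[] array) {
        for (int lh = 0; lh < array.length; lh++) {
            int rh = lh;
            for (int i = lh + 1; i < array.length; i++) {
                if (array[i] < array[rh]) rh = i;
            }
            int temp = array[lh];
            array[lh] = array[rh];
            array[rh] = temp;
        }
    }

    // Average time in milliseconds to sort n random values, over reps repetitions.
    static double averageMillis(int n, int reps, long seed) {
        Random rgen = new Random(seed);
        long total = 0;
        for (int r = 0; r < reps; r++) {
            int[] data = new int[n];
            for (int i = 0; i < n; i++) data[i] = rgen.nextInt();  // new data for each trial
            long start = System.currentTimeMillis();
            selectionSort(data);
            total += System.currentTimeMillis() - start;          // time the sort only
        }
        return (double) total / reps;
    }

    public static void main(String[] args) {
        int[] sizes = { 100, 1000, 2000 };
        for (int n : sizes) {
            System.out.println("N = " + n + ": " + averageMillis(n, 20, 1L) + " ms");
        }
    }
}
```

The numbers printed will vary from machine to machine and from run to run. For very short operations, System.nanoTime offers finer resolution than System.currentTimeMillis.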

Measuring Sort Timings
• Because timing measurements are subject to various inaccuracies, it is best to run several trials and then to use statistics to interpret the results.
• The table below shows the actual running time for the selection sort algorithm for several different values of N, along with the mean (μ) and standard deviation (σ). The times appear to be in milliseconds.
• The entries marked with an asterisk (shown in red on the original slide) are timing measurements that differ by more than two standard deviations from the average of the other trials (trial #8 for 1000 elements, for example, is more than five times larger than any other trial). Because these outliers probably include extraneous operations, it is best to discard them.
• The μ and σ columns show the values after removing the outlying trials. The column labeled μ (the Greek letter mu, which is the standard statistical symbol for the mean) is a reasonably good estimate of running time.

    N       T1     T2      T3     T4     T5     T6     T7     T8      T9     T10    μ      σ
    10      .0021  .0025   .0022  .0026  .0020  .0030  .0022  .0023   .0022  .0025  .0024  .00029
    20      .006   .007    .008   .007   .007   .011*  .007   .007    .007   .007   .007   .00036
    30      .014   .014    .014   .015   .014   .014   .014   .014    .014   .014   .014   .00013
    40      .028   .024    .025   .026   .023   .025   .025   .026    .025   .027   .025   .0014
    50      .039   .037    .036   .041   .042   .039   .140*  .039    .034   .038   .039   .0025
    100     .187   .152    .168   .176   .146   .146   .165   .146    .178   .154   .162   .0151
    500     3.94   3.63    4.06   3.76   4.11   3.51   3.48   3.64    3.31   3.45   3.69   0.272
    1000    13.40  12.90   13.80  17.60  12.90  14.10  12.70  81.60*  16.00  15.50  14.32  1.69
    5000    322.5  355.9   391.7  321.6  388.3  321.3  321.3  398.7   322.1  321.3  346.4  33.83
    10000   1319.  1388.*  1327.  1318.  1331.  1336.  1318.  1335.   1325.  1319.  1326.  7.50

  Values before the outliers were discarded: N = 20: μ = .007, σ = .00139; N = 50: μ = .049, σ = .0323; N = 1000: μ = 21.05, σ = 21.33; N = 10000: μ = 1332., σ = 20.96.
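The outlier rule can be expressed directly in code. The sketch below applies it to the ten trials reported for N = 1000; the class name TrialStats and its helper methods are assumptions introduced for illustration, and the standard deviation is the sample standard deviation (dividing by n - 1), which is one plausible reading of the slide.

```java
// Illustrative sketch: discarding trials more than two standard deviations
// from the mean of the other trials, then averaging what remains.
class TrialStats {
    // Mean of the values, ignoring index skip (pass -1 to use every value).
    static double mean(double[] x, int skip) {
        double sum = 0;
        int n = 0;
        for (int i = 0; i < x.length; i++) {
            if (i != skip) { sum += x[i]; n++; }
        }
        return sum / n;
    }

    // Sample standard deviation of the values, ignoring index skip.
    static double stdDev(double[] x, int skip) {
        double m = mean(x, skip);
        double sumSq = 0;
        int n = 0;
        for (int i = 0; i < x.length; i++) {
            if (i != skip) { sumSq += (x[i] - m) * (x[i] - m); n++; }
        }
        return Math.sqrt(sumSq / (n - 1));
    }

    // True if trial i is more than two standard deviations from the mean of the others.
    static boolean isOutlier(double[] x, int i) {
        return Math.abs(x[i] - mean(x, i)) > 2 * stdDev(x, i);
    }

    // Mean of the trials that are not flagged as outliers.
    static double trimmedMean(double[] x) {
        double sum = 0;
        int n = 0;
        for (int i = 0; i < x.length; i++) {
            if (!isOutlier(x, i)) { sum += x[i]; n++; }
        }
        return sum / n;
    }

    public static void main(String[] args) {
        // Trials 1 through 10 for N = 1000, taken from the timing table.
        double[] trials = { 13.40, 12.90, 13.80, 17.60, 12.90, 14.10, 12.70, 81.60, 16.00, 15.50 };
        System.out.println("raw mean     = " + mean(trials, -1));
        System.out.println("trimmed mean = " + trimmedMean(trials));
    }
}
```

With these inputs only trial 8 is flagged, and the mean of the remaining nine trials is about 14.32, matching the table.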

Selection Sort Running Times
• The linear search algorithm has the property that the running time of the algorithm is proportional to the size of the array. If you multiply the number of values by ten, you would expect the linear search algorithm to take ten times as long.
• As the running times on the preceding slide make clear, the situation for selection sort is very different. The following table shows the average running time when selection sort is applied to 10, 100, 1000, and 10000 values.

    N        time
    10       .0024
    100      0.162
    1000     14.32
    10000    1332.

• As a rough approximation, particularly as you work with larger values of N, it appears that every ten-fold increase in the size of the array means that selection sort takes about 100 times as long.

Counting Operations
• Another way to estimate the running time is to count how many operations are required to sort an array of size N.
• In the selection sort implementation, the section of code that is executed most frequently (and therefore contributes the most to the running time) is the body of the findSmallest method. The number of operations involved in each call to findSmallest changes as the algorithm proceeds:
    • N values are considered on the first call to findSmallest.
    • N-1 values are considered on the second call.
    • N-2 values are considered on the third call, and so on.
• In mathematical notation, the number of values considered in findSmallest can be expressed as a summation, which can then be transformed into a simple formula:

    $\sum_{i=1}^{N} i = 1 + 2 + 3 + \cdots + (N-1) + N = \frac{N(N+1)}{2}$

A Geometric Insight
• You can convince yourself that

    $1 + 2 + 3 + \cdots + (N-2) + (N-1) + N = \frac{N(N+1)}{2}$

  by thinking about the problem geometrically.
• The terms on the left side of the formula can be arranged into a triangle, as shown at the bottom of this slide for N = 6.
• If you duplicate the triangle and rotate it by 180˚, you get a rectangle that in this case contains 6 x 7 dots, half of which belong to each triangle.

[Figure: a triangle of dots for N = 6.]
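The same argument can be written algebraically. The derivation below is a sketch of the standard pairing argument, in which the sum is written forward and backward and the two copies are added term by term; the symbol S is introduced here for the sum and does not appear on the slide.

```latex
\begin{aligned}
S  &= 1 + 2 + \cdots + (N-1) + N \\
S  &= N + (N-1) + \cdots + 2 + 1 \\
2S &= \underbrace{(N+1) + (N+1) + \cdots + (N+1)}_{N \text{ terms}} = N(N+1) \\
S  &= \frac{N(N+1)}{2}
\end{aligned}
```

For N = 6 this gives 2S = 6 x 7 = 42, so S = 21, which corresponds to the two triangles that make up the 6 x 7 rectangle.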

Quadratic Growth
• The reason behind the rapid growth in the running time of selection sort becomes clear if you make a table showing the value of $\frac{N(N+1)}{2}$ for various values of N:

    N        N(N+1)/2
    10       55
    100      5050
    1000     500,500
    10000    50,005,000

• The growth pattern in the right column is similar to that of the measured running time of the selection sort algorithm. As the value of N increases by a factor of 10, the value of $\frac{N(N+1)}{2}$ increases by a factor of around 100, which is $10^2$. Algorithms whose running times increase in proportion to the square of the problem size are said to be quadratic.

Finding a More Efficient Strategy
• As long as arrays are small, selection sort is a perfectly workable strategy. Even for 10,000 elements, the average running time of selection sort is just over a second.
• The quadratic behavior of selection sort, however, makes it less attractive for the very large arrays that one encounters in commercial applications. Assuming that the quadratic growth pattern continues beyond the timings reported in the table, sorting 100,000 values would require about two minutes, and sorting 1,000,000 values would take more than three hours.
• As it turns out, there are sorting strategies that are vastly more efficient for large arrays than selection sort. Most of these algorithms, unfortunately, use programming techniques beyond the scope of this text. The next few slides, however, resurrect a sorting algorithm that was popular in the early days of computing to show that massive improvements in efficiency are possible.

Sorting Punched Cards
• From the 1890 census onward, information was often stored on punched cards like the one shown at the right, in which the number 236 has been punched in the first three columns.
• Computer companies built machines to sort stacks of punched cards, such as the IBM 083 sorter on the left. The stack of cards was loaded in a large hopper at the right end of the machine, and the cards would then be distributed into the various bins on the front of the sorter according to what value was punched in a particular column.

[Figures: an 80-column punched card with rows for the digits 0 to 9 and the number 236 punched in the first three columns; the IBM 083 Sorter.]

The Radix Sort Algorithm
• The IBM 083 sorter was a significant commercial success because it made it possible to sort large sets of punched cards quickly and efficiently.
• The algorithm used by the IBM 083 is called radix sort and consists of the following steps:
    1. Set the machine so that it sorts on the last digit of the number.
    2. Put the entire stack of cards in the hopper.
    3. Run the machine so that the cards are distributed into the bins.
    4. Put the cards from the bins back in the hopper, making sure that the cards from the 0 bin are on the bottom, the cards from the 1 bin come on top of those, and so on.
    5. Reset the machine so that it sorts on the preceding digit.
    6. Repeat steps 3 through 5 until all the digits are processed.
• The next slide illustrates this process for a set of three-digit numbers.

Simulating Radix Sort

The ten cards in the simulation carry the numbers 809, 503, 946, 367, 987, 838, 259, 236, 659, and 361.

Step 1a. Sort the cards using the last digit.
Step 1b. Refill the hopper by emptying the bins in order. Note that the list is now sorted by the last digit.
Step 2a. Sort the cards using the middle digit.
Step 2b. Again refill the hopper by emptying the bins. The list is now sorted by the last two digits.
Step 3a. Sort the cards using the first digit.
Step 3b. Refill the hopper one last time. The list is now completely sorted.

[Animation: the cards move from the hopper into bins 0 through 9 and back again on each pass.]
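The card-sorting procedure translates naturally into code. The sketch below is one possible rendering, assuming non-negative integers with a known maximum number of decimal digits; the class name RadixSortDemo and the use of lists as bins are choices made here for illustration, and the physical details of how cards are stacked in the hopper are not modeled. Each pass distributes the values into ten bins by one digit and then collects the bins in order from 0 to 9, preserving the order of values within each bin.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch: radix sort on decimal digits, last digit first.
class RadixSortDemo {
    // Sorts non-negative integers that have at most the given number of decimal digits.
    static void radixSort(int[] array, int digits) {
        int divisor = 1;  // selects the digit: 1 = last, 10 = middle, 100 = first, ...
        for (int d = 0; d < digits; d++) {
            List<List<Integer>> bins = new ArrayList<>();
            for (int b = 0; b < 10; b++) bins.add(new ArrayList<>());
            // Distribute the values into the bins according to the current digit.
            for (int value : array) {
                bins.get((value / divisor) % 10).add(value);
            }
            // Refill the array by emptying the bins in order.
            int i = 0;
            for (List<Integer> bin : bins) {
                for (int value : bin) array[i++] = value;
            }
            System.out.println("after pass " + (d + 1) + ": " + Arrays.toString(array));
            divisor *= 10;
        }
    }

    public static void main(String[] args) {
        int[] cards = { 809, 503, 946, 367, 987, 838, 259, 236, 659, 361 };
        radixSort(cards, 3);
    }
}
```

The algorithm depends on each pass keeping values with the same digit in their existing order; that is what allows the ordering established by earlier passes to survive later ones.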

Efficiency of Radix Sort
• The advantage of radix sort over selection sort is that the time required to sort a stack of cards no longer grows in proportion to the square of the number of cards. The running time is instead proportional to $N \times D$, where N is the number of cards and D is the number of digits in the numeric field on which the sort is performed.
• To take account of the fact that larger data sets tend to use longer sort keys, analyses of sorting algorithms assume that the range of the values being sorted is proportional to the number of values. Because the number of digits in a number N is proportional to the logarithm of N, computer scientists say that the performance of radix sort is proportional to $N \log N$.

Assessing Algorithmic Efficiency
• The discussion of the efficiency of the various searching and sorting algorithms illustrates a fundamental computer science technique called algorithmic analysis.
• The primary purpose of algorithmic analysis is to make qualitative assessments about the efficiency of an algorithm rather than to offer precise quantitative predictions about running time. The running time of a program depends, for example, on the speed of the hardware and the extent to which the compiler is able to optimize the code. The qualitative performance of the algorithm itself is largely independent of such considerations.
• One of the most important problems in algorithmic analysis is deducing the computational complexity of an algorithm, which is the relationship between the size of the problem and the expected running time.

Big-O Notation
• The most common way to express computational complexity is to use big-O notation, which was introduced by the German mathematician Paul Bachmann in 1892.
• Big-O notation consists of the letter O followed by a formula that offers a qualitative assessment of running time as a function of the problem size, traditionally denoted as N. For example, the computational complexity of linear search is $O(N)$ and the computational complexity of radix sort is $O(N \log N)$.
• If you read these formulas aloud, you would pronounce them as “big-O of N” and “big-O of N log N” respectively.

Common Simplifications of Big-O
• Given that big-O notation is designed to provide a qualitative assessment, it is important to make the formula inside the parentheses as simple as possible.
• When you write a big-O expression, you should always make the following simplifications:
    1. Eliminate any term whose contribution to the running time ceases to be significant as N becomes large.
    2. Eliminate any constant factors.
• The computational complexity of selection sort is therefore $O(N^2)$ and not $O\left(\frac{N(N+1)}{2}\right)$.
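Applying the two simplification rules to the selection sort formula can be worked through step by step. The derivation below is a sketch; the numeric check uses N = 10000, the largest size in the timing table.

```latex
\begin{aligned}
\frac{N(N+1)}{2} &= \tfrac{1}{2}N^2 + \tfrac{1}{2}N
  && \text{expand the product} \\
&\approx \tfrac{1}{2}N^2
  && \text{rule 1: } \tfrac{1}{2}N \text{ becomes insignificant as } N \text{ grows} \\
&\;\Rightarrow\; O(N^2)
  && \text{rule 2: drop the constant factor } \tfrac{1}{2}
\end{aligned}
```

For N = 10000 the two terms are 50,000,000 and 5,000, so the discarded term accounts for only about 0.01% of the total of 50,005,000.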

Deducing Complexity from the Code
• In many cases, you can deduce the computational complexity of a program directly from the structure of the code.
• The standard approach to doing this type of analysis begins with looking for any section of code that is executed more often than other parts of the program. As long as the individual operations involved in an algorithm take roughly the same amount of time, the operations that are executed most often will come to dominate the overall running time.
• In the selection sort implementation, for example, the most commonly executed statement is the if statement inside the findSmallest method. This statement is part of two for loops, one in findSmallest itself and one in sort. The total number of executions is

    $1 + 2 + 3 + \cdots + (N-1) + N$

  which is $O(N^2)$.

Exercise: Computational Complexity

Assuming that none of the steps in the body of the following for loops depend on the problem size stored in the variable n, what is the computational complexity of each of the following examples?

a)  for (int i = 0; i < n; i++) {
        for (int j = 0; j < i; j++) {
            . . . loop body . . .
        }
    }

    Answer: $O(N^2)$. This loop follows the pattern from selection sort.

b)  for (int k = 1; k <= n; k *= 2) {
        . . . loop body . . .
    }

    Answer: $O(\log N)$. This loop follows the pattern from binary search.

c)  for (int i = 0; i < 100; i++) {
        for (int j = 0; j < i; j++) {
            . . . loop body . . .
        }
    }

    Answer: $O(1)$. This loop does not depend on the value of n at all.
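One way to check answers like these empirically is to replace each loop body with a counter and watch how the count changes as n grows. The sketch below does this for the three loops in the exercise; the class name LoopCounts and the method names are assumptions made for illustration.

```java
// Illustrative sketch: counting loop-body executions for the three exercise loops.
class LoopCounts {
    // Loop (a): nested loops where the inner bound depends on the outer index.
    static long countA(int n) {
        long count = 0;
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < i; j++) {
                count++;
            }
        }
        return count;
    }

    // Loop (b): the index doubles on every cycle.
    static long countB(int n) {
        long count = 0;
        for (int k = 1; k <= n; k *= 2) {
            count++;
        }
        return count;
    }

    // Loop (c): the bounds are constants, so n is never used.
    static long countC(int n) {
        long count = 0;
        for (int i = 0; i < 100; i++) {
            for (int j = 0; j < i; j++) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int[] sizes = { 10, 100, 1000 };
        for (int n : sizes) {
            System.out.println("n = " + n + ": a = " + countA(n)
                               + ", b = " + countB(n) + ", c = " + countC(n));
        }
    }
}
```

Multiplying n by ten multiplies the count for loop (a) by roughly one hundred, adds only three or four to the count for loop (b), and leaves the count for loop (c) unchanged.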

Using Data Files
• Applications that involve searching and sorting typically work with data collections that are too large to enter by hand. For this reason, section 12.4 includes a brief discussion of data files, which make it possible to work with stored data.
• A file is the generic name for any named collection of data maintained on the various types of permanent storage media attached to a computer. In most cases, a file is stored on a hard disk, but it can also be stored on a removable medium, such as a CD or flash memory drive.
• Files can contain information of many different types. When you compile a Java program, for example, the compiler stores its output in a set of class files, each of which contains the binary data associated with a class. The most common type of file, however, is a text file, which contains character data of the sort you find in a string.

Text Files vs. Strings

Although text files and strings both contain character data, it is important to keep in mind the following differences between them:

1. The information stored in a file is permanent. The value of a string variable persists only as long as the variable does. Local variables disappear when the method returns, and instance variables disappear when the object goes away, which typically does not occur until the program exits. Information stored in a file exists until the file is deleted.
2. Files are usually read sequentially. When you read data from a file, you usually start at the beginning and read the characters in order, either individually or in groups that are most commonly individual lines. Once you have read one set of characters, you then move on to the next set of characters until you reach the end of the file.

Reading Text Files
• When you want to read data from a text file as part of a Java program, you need to take the following steps:
    1. Construct a new BufferedReader object that is tied to the data in the file. This phase of the process is called opening the file.
    2. Call the methods provided by the BufferedReader class to read data from the file in sequential order.
    3. Break the association between the reader and the file by calling the reader’s close method, which is called closing the file.
• Java makes it possible to read data from files in several different ways:
    • You can read individual characters by calling the read method.
    • You can read complete lines by calling the readLine method.
    • You can read individual tokens by using the Scanner class.
  Each of these strategies is described in a subsequent slide.

Standard Reader Subclasses
• The java.io package defines several different subclasses of the generic Reader class that are useful in different contexts. To read text files, you need to use the following subclasses:
  • The FileReader class, which allows you to create a simple reader by supplying the name of the file.
  • The BufferedReader class, which makes all operations more efficient and enables the strategy of reading individual lines.
• The standard idiom for opening a text file is to call the constructors for each of these classes in a single statement:

    BufferedReader rd = new BufferedReader(new FileReader(filename));

  The FileReader constructor takes the file name and creates a simple reader. That reader is then passed along to the BufferedReader constructor.
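The three steps of opening, reading, and closing a file can be put together in a short program. What follows is only a minimal sketch: the class name OpenReadClose and the method firstLine are hypothetical, and the method declares throws IOException instead of catching the exception so that the three steps stay visible.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

/** Hypothetical demo class: opens a text file, reads one line, and closes it. */
class OpenReadClose {

    /** Returns the first line of the named file, or null if the file is empty. */
    static String firstLine(String filename) throws IOException {
        // Step 1: open the file by wrapping a FileReader in a BufferedReader.
        BufferedReader rd = new BufferedReader(new FileReader(filename));
        // Step 2: read data in sequential order (here, just the first line).
        String line = rd.readLine();
        // Step 3: close the file, breaking the link between reader and file.
        rd.close();
        return line;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java OpenReadClose <filename>");
            return;
        }
        System.out.println(firstLine(args[0]));
    }
}
```

If readLine fails in this sketch, the call to close is skipped; a production program would guard against that case.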

Reading Characters from a File
• Once you have created the BufferedReader object as shown on the preceding slide, you can then read individual characters from the file by calling the read method.
• The read method conceptually returns the next character in a file, although it does so as an int. If there are any characters remaining in the file, calling read returns the Unicode value of the next one. If all the characters have been exhausted, read returns -1.
• The following code fragment, for example, counts the number of letters in a BufferedReader named rd:

    int nLetters = 0;
    while (true) {
       int ch = rd.read();
       if (ch == -1) break;
       if (Character.isLetter(ch)) nLetters++;
    }
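To try the letter-counting loop without creating a file first, the reader can be wrapped around a string. The sketch below assumes a hypothetical class CountLetters and uses StringReader from java.io as a stand-in for FileReader; the loop itself is the one from the slide.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

/** Hypothetical demo class: counts the letters available from a reader. */
class CountLetters {

    static int countLetters(Reader source) throws IOException {
        BufferedReader rd = new BufferedReader(source);
        int nLetters = 0;
        while (true) {
            int ch = rd.read();       // next character as an int, or -1
            if (ch == -1) break;      // -1 signals that no characters remain
            if (Character.isLetter(ch)) nLetters++;
        }
        rd.close();
        return nLetters;
    }

    public static void main(String[] args) throws IOException {
        // StringReader stands in for a FileReader so no file is needed.
        System.out.println(countLetters(new StringReader("Hello, world 42!")));  // prints 10
    }
}
```

Returning an int is what lets read report -1 without colliding with any real character value; when the character itself is needed, the result can be cast with (char) ch.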

Reading Lines from a File
• You can also read entire lines from a text file by calling the readLine method, which returns the next line of data from the file as a string after discarding any end-of-line characters. If no lines remain in the file, readLine returns null.
• Using the readLine method makes programs more portable because it eliminates the need to think about the end-of-line characters, which differ from system to system.
• The following code fragment uses the readLine method to determine the length of the longest line in the reader rd:

    int maxLength = 0;
    while (true) {
       String line = rd.readLine();
       if (line == null) break;
       maxLength = Math.max(maxLength, line.length());
    }
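The portability point can be seen by feeding readLine input that mixes end-of-line conventions. This is a small sketch only: the class LongestLine is hypothetical, and a StringReader replaces the file so the example runs on its own.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

/** Hypothetical demo class: finds the length of the longest line in a reader. */
class LongestLine {

    static int longestLineLength(Reader source) throws IOException {
        BufferedReader rd = new BufferedReader(source);
        int maxLength = 0;
        while (true) {
            String line = rd.readLine();   // end-of-line characters are discarded
            if (line == null) break;       // null signals that no lines remain
            maxLength = Math.max(maxLength, line.length());
        }
        rd.close();
        return maxLength;
    }

    public static void main(String[] args) throws IOException {
        // The sample text mixes "\r\n", "\n", and "\r" line endings on purpose.
        String text = "short\r\na longer line\nmid\r";
        System.out.println(longestLineLength(new StringReader(text)));  // prints 13
    }
}
```

Because readLine strips whichever terminator it finds, the three lines come back as "short", "a longer line", and "mid" regardless of the convention used.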

Exception Handling
• Unfortunately, the process of reading data from a file is not quite as simple as the previous slides suggest. When you work with the classes in the java.io package, you must ordinarily indicate what happens if an operation fails. In the case of opening a file, for example, you need to specify what the program should do if the requested file does not exist.
• Java’s library classes often respond to such conditions by throwing an exception, which is one of the strategies Java methods can use to report an unexpected condition. If the FileReader constructor, for example, cannot find the requested file, it throws a FileNotFoundException, which is a kind of IOException, to signal that fact.
• When Java throws an exception, it stops whatever it is doing and looks back through its execution history to see if any method has indicated an interest in “catching” that exception by including a try statement as described on the next slide.

The try Statement
Java uses the try statement to indicate an interest in catching an exception. In its simplest form, the try statement syntax is

    try {
       code in which an exception might occur
    } catch (type identifier) {
       code to respond to the exception
    }

where type is the name of some exception class and identifier is the name of a variable used to hold the exception itself.
The range of statements in which the exception can be caught includes not only the statements explicitly enclosed in the try body but also any methods those statements call. If the exception occurs inside some other method, any subsequent stack frames are removed until control returns to the try statement itself.
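The point that a try statement also covers the methods its body calls can be illustrated with a short sketch. The class TryDemo and its methods open and describe are hypothetical names; the exception is raised inside open but caught by the try statement in describe.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

/** Hypothetical demo class: catches an exception raised in a called method. */
class TryDemo {

    /** Opens the file; the FileReader constructor throws if it cannot be found. */
    static BufferedReader open(String filename) throws IOException {
        return new BufferedReader(new FileReader(filename));
    }

    /** Reports whether the named file could be opened. */
    static String describe(String filename) {
        try {
            BufferedReader rd = open(filename);   // the exception arises inside open
            rd.close();
            return "opened " + filename;
        } catch (IOException ex) {
            // Control arrives here after the stack frame for open is removed.
            return "cannot open " + filename;
        }
    }

    public static void main(String[] args) {
        // Assumes no directory with this name exists in the working directory.
        System.out.println(describe("no-such-dir-12345/missing.txt"));
    }
}
```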

The try Statement
• The design of the java.io package forces you to use try statements to catch any exceptions that might occur. For example, if you open a file without checking for exceptions, the Java compiler will report an error in the program.
• To take account of these conditions, you need to enclose calls to constructors and methods in the various java.io classes inside try statements that check for IOExceptions.
• The ReverseFile program on the next few slides illustrates the use of the try statement in two different contexts:
  • Inside the openFileReader method, the program uses a try statement to detect whether the file exists. If it doesn’t, the catch clause prints a message to the user explaining the failure and then asks the user for a new file name.
  • Inside the readLineArray method, the code uses a try statement to detect whether an I/O error has occurred.

The ReverseFile Program

    import acm.program.*;
    import acm.util.*;
    import java.io.*;
    import java.util.*;

    /** This program prints the lines from a file in reverse order */
    public class ReverseFile extends ConsoleProgram {

       public void run() {
          println("This program reverses the lines in a file.");
          BufferedReader rd = openFileReader("Enter input file: ");
          String[] lines = readLineArray(rd);
          for (int i = lines.length - 1; i >= 0; i--) {
             println(lines[i]);
          }
       }

    /*
     * Implementation note: The readLineArray method on the next slide
     * uses an ArrayList internally because doing so makes it possible
     * for the list of lines to grow dynamically.  The code converts
     * the ArrayList to an array before returning it to the client.
     */

    /*
     * Reads all available lines from the specified reader and returns
     * an array containing those lines.  This method closes the reader
     * at the end of the file.
     */
       private String[] readLineArray(BufferedReader rd) {
          ArrayList<String> lineList = new ArrayList<String>();
          try {
             while (true) {
                String line = rd.readLine();
                if (line == null) break;
                lineList.add(line);
             }
             rd.close();
          } catch (IOException ex) {
             throw new ErrorException(ex);
          }
          String[] result = new String[lineList.size()];
          for (int i = 0; i < result.length; i++) {
             result[i] = lineList.get(i);
          }
          return result;
       }

    /*
     * Requests the name of an input file from the user and then opens
     * that file to obtain a BufferedReader.  If the file does not
     * exist, the user is given a chance to reenter the file name.
     */
       private BufferedReader openFileReader(String prompt) {
          BufferedReader rd = null;
          while (rd == null) {
             try {
                String name = readLine(prompt);
                rd = new BufferedReader(new FileReader(name));
             } catch (IOException ex) {
                println("Can't open that file.");
             }
          }
          return rd;
       }
    }
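The readLineArray method closes the reader only when every call to readLine succeeds. One possible variation, which is not the version used in the ReverseFile program, is the try-with-resources form available since Java 7, which closes the reader even when an exception occurs. The sketch below uses only the standard library: the class name ReverseLinesSketch is hypothetical, UncheckedIOException stands in for the acm library's ErrorException, and a StringReader stands in for the file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: a standard-library variation on readLineArray. */
class ReverseLinesSketch {

    /** Reads all lines from the source and closes it, even on failure. */
    static String[] readLineArray(Reader source) {
        List<String> lineList = new ArrayList<String>();
        // The reader declared in the parentheses is closed automatically.
        try (BufferedReader rd = new BufferedReader(source)) {
            while (true) {
                String line = rd.readLine();
                if (line == null) break;
                lineList.add(line);
            }
        } catch (IOException ex) {
            // Stand-in for ErrorException: rethrow as an unchecked exception.
            throw new UncheckedIOException(ex);
        }
        // toArray replaces the explicit copy loop of the original method.
        return lineList.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] lines = readLineArray(new StringReader("one\ntwo\nthree\n"));
        for (int i = lines.length - 1; i >= 0; i--) {
            System.out.println(lines[i]);   // prints three, two, one
        }
    }
}
```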

Selecting Files Interactively
• The Java libraries also make it possible to select an input file interactively using a dialog box. To do so, you need to use the JFileChooser class from the javax.swing package.
• The JFileChooser constructor is usually called like this:

    JFileChooser chooser = new JFileChooser();

• You can then ask the user to select a file by calling

    int result = chooser.showOpenDialog(this);

  where this indicates the program issuing this call. The return value will be JFileChooser.APPROVE_OPTION or JFileChooser.CANCEL_OPTION depending on whether the user clicks the Open or Cancel button. The caller can obtain the selected file by calling chooser.getSelectedFile().
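The calls described on this slide might be combined as follows. This is a sketch rather than a complete program: the class ChooseFileSketch and its two methods are hypothetical, and chooseFile needs a graphical display in order to show the dialog.

```java
import java.awt.Component;
import java.io.File;
import javax.swing.JFileChooser;

/** Hypothetical sketch: asks the user to pick a file with a dialog box. */
class ChooseFileSketch {

    /** Maps a dialog result to the chosen file, or null if nothing was chosen. */
    static File fileFor(int result, JFileChooser chooser) {
        if (result == JFileChooser.APPROVE_OPTION) {
            return chooser.getSelectedFile();   // the user clicked Open
        }
        return null;                            // the user did not approve a file
    }

    /** Shows an Open dialog; parent may be null if there is no window. */
    static File chooseFile(Component parent) {
        JFileChooser chooser = new JFileChooser();
        int result = chooser.showOpenDialog(parent);
        return fileFor(result, chooser);
    }
}
```

The File that comes back can be passed directly to the FileReader constructor, which accepts a File as well as a file name.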

Using the Scanner Class
Although it is not used in the examples in the text, it is also possible to divide a text file into individual tokens by using the Scanner class from the java.util package. The most useful methods in the Scanner class are shown in the following table:

    new Scanner(reader)   Creates a new Scanner object from the reader.
    next()                Returns the next whitespace-delimited token as a string.
    nextInt()             Reads the next integer and returns it as an int.
    nextDouble()          Reads the next number and returns it as a double.
    nextBoolean()         Reads the next Boolean value (true or false).
    hasNext()             Returns true if the scanner has any more tokens.
    hasNextInt()          Returns true if the next token scans as an integer.
    hasNextDouble()       Returns true if the next token scans as a number.
    hasNextBoolean()      Returns true if the next token is either true or false.
    close()               Closes the scanner and the underlying reader.
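As a brief illustration of how the methods in the table fit together, the sketch below adds up the integer tokens in its input and skips everything else. The class name SumIntegers is hypothetical, and a StringReader takes the place of a file reader.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.Scanner;

/** Hypothetical demo class: sums the integer tokens found in a reader. */
class SumIntegers {

    static int sumIntegers(Reader source) {
        Scanner sc = new Scanner(source);
        int total = 0;
        while (sc.hasNext()) {            // true while any token remains
            if (sc.hasNextInt()) {
                total += sc.nextInt();    // consume the token as an int
            } else {
                sc.next();                // consume and ignore any other token
            }
        }
        sc.close();                       // also closes the underlying reader
        return total;
    }

    public static void main(String[] args) {
        Reader rd = new StringReader("3 apples and 4 pears\n10");
        System.out.println(sumIntegers(rd));   // prints 17
    }
}
```

Checking hasNextInt before calling nextInt matters here, because nextInt throws an exception when the next token is not an integer.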

Using Files for Output
• The java.io package also makes it possible to create new text files by using the appropriate subclasses of the generic Writer class.
• When you write data to a file, the most common approach is to create a PrintWriter like this:

    PrintWriter wr = new PrintWriter(new FileWriter(filename));

  As in the reader example, this nested constructor first creates a FileWriter for the specified file name and then passes the result along to the PrintWriter constructor.
• Once you have a PrintWriter object, you can write data to that writer using the println and print methods you have used all along with console programs.
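A short sketch of the output side follows. The class WriteSquares and its methods are hypothetical; writeSquares accepts any Writer so that the same code can send its output to a file through FileWriter or, as in main, to an in-memory StringWriter.

```java
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.StringWriter;
import java.io.Writer;

/** Hypothetical demo class: writes a small table of squares to a writer. */
class WriteSquares {

    static void writeSquares(Writer target, int n) {
        PrintWriter wr = new PrintWriter(target);
        for (int i = 1; i <= n; i++) {
            wr.println(i + " squared is " + (i * i));
        }
        wr.close();   // flushes pending output and closes the underlying writer
    }

    /** Same output, sent to a new text file with the given name. */
    static void writeSquaresToFile(String filename, int n) throws IOException {
        writeSquares(new FileWriter(filename), n);
    }

    public static void main(String[] args) {
        // StringWriter collects the output in memory so no file is created.
        StringWriter buffer = new StringWriter();
        writeSquares(buffer, 3);
        System.out.print(buffer);
    }
}
```

Closing the writer is the step most easily forgotten: output that is still buffered when the program exits may never reach the file.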
