This paper assesses the efficiency of linear search methods in information systems and their implications for decision-making. It explores the balance between search costs and the benefits of sorting information. We discuss average search costs using the Pareto Principle, indicating how distribution patterns affect search efficiency. The evaluation also includes simulation studies showing the impact of usage concentration on search savings. By quantifying search costs, we aim to provide insight into optimizing information retrieval processes and highlight applications for assessing market monopolies.
Linear Search Efficiency Assessment
P. Pete Chong, Gonzaga University, Spokane, WA 99258-0009, chong@gonzaga.edu
Why Search Efficiency? • Information systems help users obtain the “right” information for better decision making, which requires search • Better organization reduces the time needed to find this “right” info, which requires sorting • Is the savings in search worth the cost of the sort?
Search Cost • If we assume random access, then the average cost is (n+1)/2 probes. That is, for 1000 records, the average search cost is approximately 1000/2 = 500. • For large n, we may use n/2 to simplify the calculation
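As an illustration (our own sketch, not code from the paper; the function name and trial count are arbitrary), a few lines of Python estimate this average empirically:

```python
import random

def average_linear_search_cost(n, trials=100_000):
    """Estimate the average number of probes to find a uniformly
    random target by linear search in a list of n records."""
    total = 0
    for _ in range(trials):
        target = random.randint(1, n)  # position of the sought record
        total += target                # linear search costs 'position' probes
    return total / trials

print(average_linear_search_cost(1000))  # ~500.5, i.e. (n + 1) / 2
```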
Search Cost • In reality, the usage pattern is not random • For payroll, for example, every record is accessed once and only once; in this case, sorting has no effect on search efficiency • Most of the time, the distribution follows the 80/20 Rule (the Pareto Principle)
Author/paper distribution data:

| i | ni (papers) | f(ni) (authors) | ni·f(ni) | Cum. authors | Cum. papers | xi | qi |
|---|---|---|---|---|---|---|---|
| 26 | 242 | 1 | 242 | 1 | 242 | 0.003 | 0.137 |
| 25 | 114 | 1 | 114 | 2 | 356 | 0.005 | 0.202 |
| 24 | 102 | 1 | 102 | 3 | 458 | 0.008 | 0.260 |
| 23 | 95 | 1 | 95 | 4 | 553 | 0.011 | 0.314 |
| 22 | 58 | 1 | 58 | 5 | 611 | 0.014 | 0.347 |
| 21 | 49 | 1 | 49 | 6 | 660 | 0.016 | 0.374 |
| 20 | 34 | 1 | 34 | 7 | 694 | 0.019 | 0.394 |
| 19 | 22 | 2 | 44 | 9 | 738 | 0.024 | 0.419 |
| 18 | 21 | 2 | 42 | 11 | 780 | 0.030 | 0.442 |
| 17 | 20 | 2 | 40 | 13 | 820 | 0.035 | 0.465 |
| 16 | 18 | 1 | 18 | 14 | 838 | 0.038 | 0.475 |
| 15 | 16 | 4 | 64 | 18 | 902 | 0.049 | 0.512 |
| 14 | 15 | 2 | 30 | 20 | 932 | 0.054 | 0.529 |
| 13 | 14 | 1 | 14 | 21 | 946 | 0.057 | 0.537 |
| 12 | 12 | 2 | 24 | 23 | 970 | 0.062 | 0.550 |
| 11 | 11 | 5 | 55 | 28 | 1025 | 0.076 | 0.581 |
| 10 | 10 | 3 | 30 | 31 | 1055 | 0.084 | 0.598 |
| 9 | 9 | 4 | 36 | 35 | 1091 | 0.095 | 0.619 |
| 8 | 8 | 8 | 64 | 43 | 1155 | 0.116 | 0.655 |
| 7 | 7 | 8 | 56 | 51 | 1211 | 0.138 | 0.687 |
| 6 | 6 | 6 | 36 | 57 | 1247 | 0.154 | 0.707 |
| 5 | 5 | 10 | 50 | 67 | 1297 | 0.181 | 0.736 |
| 4 | 4 | 17 | 68 | 84 | 1365 | 0.227 | 0.774 |
| 3 | 3 | 29 | 87 | 113 | 1452 | 0.305 | 0.824 |
| 2 | 2 | 54 | 108 | 167 | 1560 | 0.451 | 0.885 |
| 1 | 1 | 203 | 203 | 370 | 1763 | 1.000 | 1.000 |

Total number of groups: 26. Average number of papers per author (m = R/T): 1763/370 = 4.7649.
Formulate the Pareto Curve Chen et al. (1994) define f(ni) = the number of authors with ni papers, T = Σ f(ni) = the total number of authors, R = Σ ni f(ni) = the total number of papers, and m = R/T = the average number of published papers per author
Formulate the Pareto Curve For each index level i, let xi be the cumulative fraction of the total number of authors and qi be the cumulative fraction of the total papers published; then xi = (Σj≥i f(nj))/T and qi = (Σj≥i nj f(nj))/R.
Formulate the Pareto Curve Plugging the values above into (qi − qi+1)/(xi − xi+1), Chen et al. derive the slope formula: si = [ni f(ni)/R] / [f(ni)/T] = ni T/R = ni/m. When ni = 1, si = 1/m = T/R; let us call this particular slope a.
Revisit the Pareto Curve For the publication data above, a = T/R = 370/1763 = 0.21
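As a sanity check (our own sketch, not from the paper), the following Python reproduces T, R, m, and a from the (ni, f(ni)) pairs in the table above:

```python
# (n_i, f(n_i)) pairs taken from the author/paper table above
data = [(242, 1), (114, 1), (102, 1), (95, 1), (58, 1), (49, 1), (34, 1),
        (22, 2), (21, 2), (20, 2), (18, 1), (16, 4), (15, 2), (14, 1),
        (12, 2), (11, 5), (10, 3), (9, 4), (8, 8), (7, 8), (6, 6),
        (5, 10), (4, 17), (3, 29), (2, 54), (1, 203)]

T = sum(f for _, f in data)        # total number of authors: 370
R = sum(n * f for n, f in data)    # total number of papers: 1763
m = R / T                          # average papers per author: 4.7649
a = T / R                          # slope at n_i = 1, i.e. 1/m: 0.21
print(T, R, round(m, 4), round(a, 2))

# cumulative proportions x_i and q_i, accumulating from the top group down
cum_f, cum_nf = 0, 0
for n, f in data:
    cum_f, cum_nf = cum_f + f, cum_nf + n * f
    print(n, round(cum_f / T, 3), round(cum_nf / R, 3))
```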
The Significance • We now have a quick way to quantify different usage concentrations • Simulation shows that in most situations a moderate sample size is sufficient to assess the usage concentration • The inverse of the average usage (a) is easy to calculate
Search Cost Calculation • The average search cost for a randomly distributed list is n/2. Thus, for 1000 records, the search cost is 500. • For a list with an 80/20 distribution, 80% of accesses hit the first 200 records and 20% fall through to the rest, so the search cost is (200/2)(80%) + [(200+1000)/2](20%) = 200, a saving of 60%
Search Cost Calculation Let the first number in the 80/20 pair be a and the second number be b. Since these two numbers are actually fractions, we have a + b = 1. Thus, the expected search cost for a list of n records is the weighted average: (bn/2)(a) + [(bn + n)/2](b) = (bn/2)(a) + (bn/2)(b + 1) = (bn/2)(a + b + 1) = (bn/2)(2) = bn
Search Cost Calculation • Thus, b indicates the cost of search as a fraction of the records in the list, and bn represents an upper bound on the number of probes; a sketch of this calculation appears below. • For a fully sorted list (by usage) with an 80/20 distribution, Knuth (1973) has shown that the average search cost C(n) is only 0.122n.
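A minimal Python sketch of the two-block weighted average above (our own illustration; the function name and defaults are assumptions) confirms that the expected cost collapses to bn:

```python
def two_block_search_cost(n, a=0.8, b=0.2):
    """Expected linear-search probes for a usage-sorted list in which a
    fraction a of all accesses hit the first b*n records (a + b = 1)."""
    hot = (b * n) / 2          # average probe depth inside the hot block
    cold = (b * n + n) / 2     # average probe depth past the hot block
    return a * hot + b * cold  # algebraically equal to b * n

print(two_block_search_cost(1000))  # 200.0, a 60% saving over n/2 = 500
```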
Search Cost Estimate Regression analyses yield: b = 0.15 + 0.359a for 0.2 < a < 1.0; b = 0.034 + 0.984a for 0 < a < 0.2; and C(n) = 0.02 + 0.49a (with b and C(n) expressed as fractions of n).
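These fits are easy to package as helpers (our own hypothetical code, assuming, per the slides' usage, that b and C(n) are fractions of n, and assigning the boundary a = 0.2 to the first branch):

```python
def estimate_b(a):
    """Regression estimate of b, the two-block (upper-bound) cost fraction."""
    if 0 < a < 0.2:
        return 0.034 + 0.984 * a
    if 0.2 <= a < 1.0:
        return 0.15 + 0.359 * a
    raise ValueError("a must lie in (0, 1)")

def estimate_c(a):
    """Regression estimate of the fully sorted cost fraction C(n)/n."""
    return 0.02 + 0.49 * a

a = 0.21                 # concentration from the publication data
print(estimate_b(a))     # ~0.225: upper-bound cost fraction
print(estimate_c(a))     # ~0.123: close to Knuth's 0.122 for 80/20
```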
Conclusion • The true search cost lies between the estimates of b and C(n) • We may use C(n) ≈ 0.5a to quickly estimate the search cost of a fully sorted list. • That is, take a moderate sample of usage; the search cost will be roughly half the inverse of the average usage, times the total number of records.
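Applied to the publication data (our own worked example), the rule of thumb gives:

```python
# Quick estimate: fully sorted search cost ~ 0.5 * a * n probes,
# where a is the inverse of the average usage from a moderate sample.
a, n = 0.21, 1000
print(0.5 * a * n)  # ~105 probes, versus ~500 for an unsorted list
```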
“Far-fetched” (?) Applications • Define and assess the degree of monopoly? What is the effect of monopoly? Note the gap between b and C(n) (ideal). • Gini Index?