Combining Fuzzy Information: an Overview

Author: Ronald Fagin. Presented by: Bill Eberle.

Overview • Introduction • Model • Algorithms • Turning TA into an Approximation Algorithm • Restricting Sorted Access • Restricting Random Access

Introduction • Be able to access data from a variety of data repositories. • Such data is inherently “fuzzy” (ex. the color “red”, where there are degrees of redness). • Result is a “graded” set, or set of pairs (x,g), where x is an object and g is the grade, a real number in [0,1]. • Scoring (aggregation) function is used to handle compound queries (ex. redness and roundness). • If x1,…,xm are the grades of an object R under each of the m attributes, then t(x1,…,xm) is the overall grade of R. • A scoring function t is monotone if t(x1,…,xm) <= t(x’1,…,x’m) whenever xi <= x’i for every i. In other words: if for every attribute, the grade of object R’ is at least as high as that of object R, then we would expect the overall grade of R’ to be at least as high as that of R. (Discussion is restricted to monotone aggregation functions.)
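To make the monotonicity condition concrete, here is a minimal Python sketch that brute-force checks it on a small grid of grade vectors. The helper name `is_monotone_on_grid` and the grid resolution are illustrative assumptions, not something from the presentation; min and average are used as the usual examples of monotone scoring functions, and a difference of grades as a counterexample.

```python
from itertools import product

def t_min(grades):
    # "Fuzzy AND": overall grade is the weakest attribute grade.
    return min(grades)

def t_avg(grades):
    # Average of the attribute grades.
    return sum(grades) / len(grades)

def is_monotone_on_grid(t, m, steps=4):
    """Check t(x) <= t(y) whenever x <= y componentwise, on a finite grid.

    Passing this check does not prove monotonicity in general; failing it
    does show that t is not monotone.
    """
    levels = [i / steps for i in range(steps + 1)]
    points = list(product(levels, repeat=m))
    for x in points:
        for y in points:
            if all(a <= b for a, b in zip(x, y)) and t(x) > t(y):
                return False
    return True

print(is_monotone_on_grid(t_min, 2))                      # True
print(is_monotone_on_grid(t_avg, 2))                      # True
print(is_monotone_on_grid(lambda g: g[0] - g[1], 2))      # False
```

The last function fails because raising the second grade lowers the overall grade, which is exactly what monotonicity rules out.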

Introduction (continued) • Middleware: system “on top of” various subsystems with the purpose of integrating results from the subsystems. • Random access: request the grade under a given attribute for any given object. • Sorted Access: request the top k objects in sorted order, each along with its grade. • Simplistic middleware cost = total number of objects obtained from the database under sorted access + total number of objects obtained from the database under random access (times some positive constants). • This paper discusses and compares algorithms for finding the top k objects. In other words, obtain k objects with the highest grades on a query, along with their grades.

The Model • N is the number of objects. • Each object R has m fields x1,…,xm, where xi is in [0,1] for each i. • Database consists of m sorted lists L1,…,Lm, each of length N. • Each entry of Li is of the form (R,xi), where xi is the i th field of R, and the list Li is sorted in descending order by the xi value. • The model takes only access costs into account and ignores internal computation costs.

The Naive Algorithm • Under sorted access, looks at every entry in each of the m sorted lists, computes (using t) the overall grade of every object, and returns the top k answers. • Linear middleware cost (linear in the database size), and thus not efficient for a large database.
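The model and the naive algorithm can be sketched in a few lines of Python. The tiny database `db`, the helper names, and the choice of min as the scoring function are assumptions made for illustration only.

```python
import heapq

def build_sorted_lists(db):
    """Turn {object: (x1,...,xm)} into m lists of (object, grade), each descending."""
    m = len(next(iter(db.values())))
    return [sorted(((obj, g[i]) for obj, g in db.items()),
                   key=lambda e: e[1], reverse=True)
            for i in range(m)]

def naive_top_k(lists, t, k):
    """Read every entry of every list under sorted access, then grade everything."""
    fields = {}
    sorted_accesses = 0
    for i, Li in enumerate(lists):
        for obj, grade in Li:
            sorted_accesses += 1
            fields.setdefault(obj, [None] * len(lists))[i] = grade
    graded = [(t(g), obj) for obj, g in fields.items()]
    top = heapq.nlargest(k, graded)
    return [(obj, grade) for grade, obj in top], sorted_accesses

# Hypothetical database: N = 4 objects, m = 2 attributes.
db = {"a": (0.9, 0.3), "b": (0.8, 0.7), "c": (0.4, 0.9), "d": (0.2, 0.1)}
lists = build_sorted_lists(db)
top, cost = naive_top_k(lists, min, 2)
print(top, cost)   # [('b', 0.7), ('c', 0.4)] 8
```

The access count is always m times N (here 2 x 4 = 8), which is the linear middleware cost mentioned on the slide.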

Fagin’s Algorithm (FA) • Algorithm: • Do sorted access in parallel to each of the m sorted lists until there is a set H of at least k objects such that each of these objects has been seen in each of the m lists. • For each object R that has been seen, do random access to each of the lists Li to find the i th field xi of R. • Compute the grade t(R) = t(x1,…,xm) for each object R that has been seen, and let Y be the set containing the k objects that have been seen with the highest grades. • FA is correct for monotone scoring functions. • Middleware cost of FA (if there are N objects in the database and the orderings in the sorted lists are probabilistically independent): O(N^((m-1)/m) k^(1/m)), with arbitrarily high probability.
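The three steps of FA can be sketched as follows. This is a minimal illustration, not the paper's code: random access is simulated by turning each list into a dictionary, and the two hypothetical lists `L1` and `L2` are made up for the example.

```python
def fagin_algorithm(lists, t, k):
    m, n = len(lists), len(lists[0])
    random_access = [dict(Li) for Li in lists]   # stand-in for random access
    seen_in = {}            # object -> set of list indices where it was seen
    matches = 0             # objects seen in all m lists
    sorted_accesses = 0
    depth = 0
    # Step 1: sorted access in parallel until k objects are seen in every list.
    while matches < k and depth < n:
        for i in range(m):
            obj, _ = lists[i][depth]
            sorted_accesses += 1
            s = seen_in.setdefault(obj, set())
            s.add(i)
            if len(s) == m:
                matches += 1
        depth += 1
    # Steps 2 and 3: random access for missing fields, then grade every seen object.
    random_accesses = 0
    graded = []
    for obj, s in seen_in.items():
        random_accesses += m - len(s)
        grades = [random_access[i][obj] for i in range(m)]
        graded.append((t(grades), obj))
    graded.sort(reverse=True)
    return [(obj, g) for g, obj in graded[:k]], sorted_accesses, random_accesses

# Hypothetical sorted lists (descending by grade).
L1 = [("a", 0.9), ("b", 0.8), ("c", 0.4), ("d", 0.2)]
L2 = [("c", 0.9), ("b", 0.7), ("a", 0.3), ("d", 0.1)]
result = fagin_algorithm([L1, L2], min, 1)
print(result)   # ([('b', 0.7)], 4, 2)
```

Note that `seen_in` holds every object seen so far, which is the unbounded buffer that the comparison with TA refers to.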

Threshold Algorithm (TA) • Algorithm: • Do sorted access in parallel to each of the m sorted lists Li. As an object R is seen under sorted access in some list, do random access to the other lists to find the grade xi of object R in every list Li. Then compute the grade t(R) = t(x1,…,xm) of object R. If this grade is one of the k highest seen, then remember object R and its grade t(R). • For each list Li, let xi be the grade of the last object seen under sorted access. Define the threshold value T to be t(x1,…,xm). As soon as at least k objects have been seen whose grade is at least equal to T, halt. • Let Y be the set containing the k objects that have been seen with the highest grades. • Why the stopping rule for TA is always reached at least as early as the stopping rule for FA: • In FA, if R is an object that has appeared under sorted access in every list, then by monotonicity, the grade of R is at least equal to the threshold value. Therefore, when there are at least k objects, each of which has appeared under sorted access in every list (the stopping rule for FA), there are at least k objects whose grade is at least equal to the threshold value (the stopping rule for TA). • Advantages of TA over FA: • FA is optimal in a high-probability worst-case sense under certain assumptions; TA is instance optimal, which intuitively means it is optimal on every instance, as opposed to just in the worst case or the average case. • FA requires buffers that grow arbitrarily large as the database grows, since it must remember every object it has seen under sorted access in order to check for matching objects in the various lists; TA requires only bounded buffers, whose size is independent of the size of the database.
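A minimal Python sketch of TA follows, under the same assumptions as before: random access is simulated with dictionaries, the lists are hypothetical, and the threshold is checked once per round of parallel sorted access. The buffer `top` never holds more than k objects, which illustrates the bounded-buffer property.

```python
def threshold_algorithm(lists, t, k):
    m, n = len(lists), len(lists[0])
    random_access = [dict(Li) for Li in lists]   # stand-in for random access
    top = {}                                     # bounded buffer: at most k objects
    sorted_accesses = random_accesses = 0
    for depth in range(n):
        last_seen = []                           # grade last seen in each list
        for i in range(m):
            obj, grade = lists[i][depth]
            sorted_accesses += 1
            last_seen.append(grade)
            # Random access to the other m - 1 lists for this object.
            grades = [random_access[j][obj] for j in range(m)]
            random_accesses += m - 1
            overall = t(grades)
            if obj in top:
                continue
            if len(top) < k:
                top[obj] = overall
            elif overall > min(top.values()):
                del top[min(top, key=top.get)]   # evict the current worst
                top[obj] = overall
        threshold = t(last_seen)
        # Halt once k remembered objects have grade >= threshold.
        if len(top) == k and min(top.values()) >= threshold:
            break
    ranked = sorted(top.items(), key=lambda e: e[1], reverse=True)
    return ranked, sorted_accesses, random_accesses

# Hypothetical sorted lists (descending by grade).
L1 = [("a", 0.9), ("b", 0.8), ("c", 0.4), ("d", 0.2)]
L2 = [("c", 0.9), ("b", 0.7), ("a", 0.3), ("d", 0.1)]
result = threshold_algorithm([L1, L2], min, 1)
print(result)   # ([('b', 0.7)], 4, 4)
```

In this run the threshold drops from 0.9 to 0.7 after the second round, at which point object b (grade 0.7) meets it and the algorithm halts.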

Turning TA into an Approximation Algorithm • TA can easily be modified to be an approximation algorithm, for cases where we care only about the approximately top k answers. • For θ > 1, define a θ-approximation to the top k answers (for t over database D) to be a collection of k objects (and their grades) such that for each y among these k objects and each z not among these k objects, θ·t(y) >= t(z). • To find a θ-approximation to the top k answers, modify the stopping rule of TA to be: • As soon as at least k objects have been seen whose grade is at least equal to T/θ, halt. • If θ > 1 and the aggregation function t is monotone, then this version of TA correctly finds a θ-approximation to the top k answers for t. • Also suggests an interactive version.
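The modified stopping rule can be sketched by adding a parameter `theta` to a TA implementation. This is an illustrative sketch with hypothetical data; `theta = 1` gives exact TA, and a larger `theta` lets the algorithm halt earlier with a weaker guarantee.

```python
def ta_theta(lists, t, k, theta=1.0):
    m, n = len(lists), len(lists[0])
    random_access = [dict(Li) for Li in lists]   # stand-in for random access
    top = {}                                     # at most k remembered objects
    sorted_accesses = 0
    for depth in range(n):
        last_seen = []
        for i in range(m):
            obj, grade = lists[i][depth]
            sorted_accesses += 1
            last_seen.append(grade)
            overall = t([random_access[j][obj] for j in range(m)])
            if obj in top:
                continue
            if len(top) < k:
                top[obj] = overall
            elif overall > min(top.values()):
                del top[min(top, key=top.get)]
                top[obj] = overall
        # Relaxed stopping rule: compare against T / theta instead of T.
        if len(top) == k and min(top.values()) >= t(last_seen) / theta:
            break
    ranked = sorted(top.items(), key=lambda e: e[1], reverse=True)
    return ranked, sorted_accesses

# Hypothetical sorted lists (descending by grade).
L1 = [("a", 0.9), ("b", 0.8), ("c", 0.4), ("d", 0.2)]
L2 = [("c", 0.9), ("b", 0.7), ("a", 0.3), ("d", 0.1)]
exact = ta_theta([L1, L2], min, 1)
approx = ta_theta([L1, L2], min, 1, theta=2.5)
print(exact)    # ([('b', 0.7)], 4)
print(approx)   # ([('c', 0.4)], 2)
```

With `theta = 2.5` the answer c (grade 0.4) is returned after only two sorted accesses; it is a valid 2.5-approximation because 2.5 x 0.4 = 1.0 is at least the grade of every other object.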

Restricting Sorted Access • Sometimes it is not possible to access certain of the lists under sorted access. • Example: the Zagat-Review web site gives ratings of restaurants, the NYT-Review web site gives prices, and the MapQuest web site gives distances; however, only the Zagat-Review web site can be accessed under sorted access. • Let Z be the set of indices i of those lists Li that can be accessed under sorted access (assume that there is at least one such list). • We take m’ to be the cardinality |Z| of Z (and m is still the total number of sorted lists). • Modification to TA to deal with this restriction (TAZ): • Do sorted access in parallel to each of the m’ sorted lists Li with i in Z. As an object R is seen under sorted access in some list, do random access to the other lists to find the grade xi of object R in every list Li. Then compute the grade t(R) = t(x1,…,xm) of object R. If this grade is one of the k highest seen, then remember object R and its grade t(R). • For each list Li with i in Z, let xi be the grade of the last object seen under sorted access. For each list Li with i not in Z, let xi = 1. Define the threshold value T to be t(x1,…,xm). As soon as at least k objects have been seen whose grade is at least equal to T, halt. • Let Y be the set containing the k objects that have been seen with the highest grades.
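The TAZ modification can be sketched as below. The restaurant grades are invented for illustration, indices are 0-based (so Z = {0} stands for "only the first list supports sorted access"), and random access is again simulated with dictionaries.

```python
def taz(sorted_lists, random_access, t, k):
    """sorted_lists: {i: list of (object, grade)} for i in Z only.
    random_access: one {object: grade} dict per attribute (all m of them)."""
    m = len(random_access)
    z = sorted(sorted_lists)
    n = len(sorted_lists[z[0]])
    top = {}
    sorted_accesses = random_accesses = 0
    for depth in range(n):
        bottom = [1.0] * m            # x_i = 1 for every list outside Z
        for i in z:
            obj, grade = sorted_lists[i][depth]
            sorted_accesses += 1
            bottom[i] = grade
            overall = t([random_access[j][obj] for j in range(m)])
            random_accesses += m - 1
            if obj in top:
                continue
            if len(top) < k:
                top[obj] = overall
            elif overall > min(top.values()):
                del top[min(top, key=top.get)]
                top[obj] = overall
        if len(top) == k and min(top.values()) >= t(bottom):
            break
    ranked = sorted(top.items(), key=lambda e: e[1], reverse=True)
    return ranked, sorted_accesses, random_accesses

# Hypothetical grades: (rating, price score, distance score) per restaurant.
grades = {"r1": (0.9, 0.5, 0.8), "r2": (0.8, 0.9, 0.7),
          "r3": (0.6, 0.9, 0.9), "r4": (0.3, 1.0, 1.0)}
random_access = [{o: g[i] for o, g in grades.items()} for i in range(3)]
by_rating = sorted(random_access[0].items(), key=lambda e: e[1], reverse=True)
result = taz({0: by_rating}, random_access, min, 1)
print(result)   # ([('r2', 0.7)], 3, 6)
```

Because the lists outside Z contribute the constant 1 to the threshold, the threshold is more conservative than in plain TA, so TAZ may need to read deeper into the accessible lists before halting.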

Restricting Sorted Access - Example • Assume there are only 3 sorted lists L1, L2 and L3, and that only L1 may be accessed under sorted access (Z={1}). • Let t be the aggregation function where t(x,y,z) = min{x,y} if z = 1, and t(x,y,z) = (min{x,y,z})/2 if z ≠ 1. • Assume we want to find the top answer (i.e. k = 1). • Looking at the tables, t(R) = 0.6, and t(R’) <= 0.5 for every other object R’ (by the distinctness property). • Thus, R is the top object.

Restricting Random Access • Sometimes it is not possible to access certain of the lists under random access. • Example: the middleware system is a text retrieval system, and the subsystems are search engines. There is no way to ask a major search engine on the web for its internal score on some document of our choice under a query. • Sometimes random access is not impossible, but very expensive (ex. when the costs correspond to disk access). • For these scenarios, the desired output changes to just the top k objects, without their grades. • Lower bound on the overall grade an object can attain: • Define WS(R) to be the minimum (or worst) value the aggregation function t can attain for object R, given the fields of R known so far. When t is monotone, this minimum is obtained by substituting the value 0 for each missing field and applying t to the result. • Upper bound on the overall grade an object can attain: • The best value an object can attain depends on the other information we have. • Use only the bottom value of each list: the bottom value xi of list Li is the last (smallest) grade seen so far under sorted access in Li, so an unknown field i of R cannot exceed it. Suppose the known fields of R have values xi1,xi2,…,xil. • Define BS(R) to be the maximum (or best) value the aggregation function t can attain for object R. When t is monotone, this maximum is obtained by substituting for each missing field i the bottom value xi, and applying t to the result.
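The two bounds can be computed directly from the known fields and the current bottom values. The sketch below uses exact fractions and the average as a monotone aggregation function; the function names and the sample values are assumptions for illustration.

```python
from fractions import Fraction

def avg(grades):
    return sum(grades) / len(grades)

def worst_score(t, known, m):
    """WS(R): substitute 0 for every missing field. known maps field index -> grade."""
    return t([known.get(i, Fraction(0)) for i in range(m)])

def best_score(t, known, bottom):
    """BS(R): substitute the list's bottom value for every missing field."""
    return t([known.get(i, bottom[i]) for i in range(len(bottom))])

# Hypothetical state: R has been seen only in the first of m = 3 lists.
known = {0: Fraction(4, 5)}
bottom = [Fraction(7, 10), Fraction(1, 2), Fraction(3, 5)]
W = worst_score(avg, known, 3)      # (4/5 + 0 + 0) / 3
B = best_score(avg, known, bottom)  # (4/5 + 1/2 + 3/5) / 3
print(W, B)   # 4/15 19/30
```

The true grade of R always lies between W and B, and the interval shrinks as more fields become known or the bottom values decrease.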

No Random Access (NRA) • Goal is to obtain enough partial information about grades to know that an object is in the top k objects without knowing its exact grade. • Example: • Aggregation function is average. • k = 1 (only top object) • Two sorted lists L1 and L2, and the grade of every object in both L1 and L2 is 1/3, except that object R has grade 1 in L1 and grade 0 in L2. • After two sorted accesses to L1 and one sorted access to L2, there is enough information to know that object R is the top object (its average grade is at least 1/2, and every other object has average grade at most 1/3). • If the top k objects are wanted in sorted order, this can be determined by finding the top object, then the top 2 objects, etc.
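The example can be replayed with a small NRA sketch that uses only sorted access and maintains worst-case and best-case bounds per object. This is a simplified illustration rather than the paper's exact procedure: it reads the lists in lockstep, so it uses four sorted accesses here instead of the three that suffice in the slide's argument. The object names A, B, C are hypothetical.

```python
from fractions import Fraction

def avg(grades):
    return sum(grades) / len(grades)

def nra(lists, t, k):
    m, n = len(lists), len(lists[0])
    known = {}                       # object -> {list index: grade}
    accesses = 0
    for depth in range(n):
        bottom = []                  # last grade seen in each list
        for i in range(m):
            obj, grade = lists[i][depth]
            accesses += 1
            known.setdefault(obj, {})[i] = grade
            bottom.append(grade)
        worst = {o: t([f.get(i, 0) for i in range(m)]) for o, f in known.items()}
        best = {o: t([f.get(i, bottom[i]) for i in range(m)]) for o, f in known.items()}
        ranked = sorted(known, key=lambda o: (worst[o], best[o]), reverse=True)
        top, rest = ranked[:k], ranked[k:]
        if len(top) == k:
            cutoff = min(worst[o] for o in top)
            # No seen or unseen object outside the top k can beat the cutoff.
            if t(bottom) <= cutoff and all(best[o] <= cutoff for o in rest):
                return top, accesses
    return top, accesses

third = Fraction(1, 3)
L1 = [("R", Fraction(1)), ("A", third), ("B", third), ("C", third)]
L2 = [("A", third), ("B", third), ("C", third), ("R", Fraction(0))]
result = nra([L1, L2], avg, 1)
print(result)   # (['R'], 4)
```

When the sketch halts, R's grade in L2 is still unknown, so only the bounds 1/2 <= t(R) <= 2/3 are available; that is why NRA returns the top objects without their grades.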

Combined Algorithm (CA) • Uses random accesses, but takes their cost (relative to sorted access) into account. • Let cS be the cost of a sorted access, and cR be the cost of a random access. • Middleware cost of an algorithm that makes s sorted accesses and r random accesses is s·cS + r·cR. • The optimality ratio of TA depends on the relative cost of a random access to a sorted access: cR/cS. • Goal is to find an algorithm that is instance optimal and whose optimality ratio is independent of cR/cS.
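As a quick arithmetic illustration of the cost formula, the sketch below compares two hypothetical access profiles. The access counts and unit costs are made-up numbers, chosen only to show how expensive random access can dominate the total.

```python
def middleware_cost(s, r, c_s, c_r):
    """Cost of s sorted accesses and r random accesses: s*cS + r*cR."""
    return s * c_s + r * c_r

c_s, c_r = 1, 10                     # assumed unit costs: random access 10x sorted
cost_with_random = middleware_cost(4, 4, c_s, c_r)   # few sorted, some random
cost_sorted_only = middleware_cost(6, 0, c_s, c_r)   # more sorted, no random
print(cost_with_random, cost_sorted_only)   # 44 6
print(c_r / c_s)                            # 10.0
```

With these numbers the profile that avoids random access is cheaper even though it reads more entries, which is the trade-off CA is designed to balance.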

Combined Algorithm (continued) • CA merges TA (which is instance optimal) with NRA. • Let h = ⌊cR/cS⌋. Assume that cR >= cS, so that h >= 1. • The idea of CA is to run NRA, but every h steps to run a random access phase and update the information (the upper and lower bounds B and W) accordingly.
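The idea can be sketched as NRA plus a periodic random access phase. This is a simplified illustration, not the paper's exact procedure: in each phase it simply resolves all missing fields of the incomplete object with the highest upper bound B, random access is simulated with dictionaries, and the data and object names are hypothetical.

```python
from fractions import Fraction

def avg(grades):
    return sum(grades) / len(grades)

def combined_algorithm(lists, t, k, h):
    m, n = len(lists), len(lists[0])
    random_access = [dict(Li) for Li in lists]   # stand-in for random access
    known = {}                                   # object -> {list index: grade}
    s = r = 0
    top = []
    for depth in range(n):
        bottom = []
        for i in range(m):                       # NRA-style sorted access round
            obj, grade = lists[i][depth]
            s += 1
            known.setdefault(obj, {})[i] = grade
            bottom.append(grade)

        def best(o):                             # upper bound B
            return t([known[o].get(i, bottom[i]) for i in range(m)])

        def worst(o):                            # lower bound W
            return t([known[o].get(i, 0) for i in range(m)])

        if (depth + 1) % h == 0:                 # random access phase every h rounds
            incomplete = [o for o in known if len(known[o]) < m]
            if incomplete:
                target = max(incomplete, key=best)
                for i in range(m):
                    if i not in known[target]:
                        known[target][i] = random_access[i][target]
                        r += 1

        ranked = sorted(known, key=lambda o: (worst(o), best(o)), reverse=True)
        top, rest = ranked[:k], ranked[k:]
        if len(top) == k:
            cutoff = min(worst(o) for o in top)
            if t(bottom) <= cutoff and all(best(o) <= cutoff for o in rest):
                return top, s, r
    return top, s, r

third = Fraction(1, 3)
L1 = [("R", Fraction(1)), ("A", third), ("B", third), ("C", third)]
L2 = [("A", third), ("B", third), ("C", third), ("R", Fraction(0))]
print(combined_algorithm([L1, L2], avg, 1, h=1))   # (['R'], 4, 2)
print(combined_algorithm([L1, L2], avg, 1, h=2))   # (['R'], 4, 1)
```

A larger h (that is, more expensive random access) makes the sketch behave more like NRA, while h = 1 spends a random access phase after every round of sorted access.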
