Searching and Integrating Information on the Web

Seminar 4: Ranking Queries and Data Privacy
Professor Chen Li, UC Irvine

Outline and readings
• Ranking Queries
  - Fagin, R., Combining Fuzzy Information from Multiple Systems, PODS 1996.
  - Fagin et al., Optimal Aggregation Algorithms for Middleware, PODS 2001.
• Data privacy:
  - Database-as-service
    - Executing SQL over Encrypted Data in the Database-Service-Provider Model. Hakan Hacigumus, Bala Iyer, Chen Li, and Sharad Mehrotra. SIGMOD 2002.
  - XML Data publishing
    - Secure XML Publishing without Information Leakage in the Presence of Data Inference. Xiaochun Yang and Chen Li. To appear in VLDB'04.
Seminar 3

Outline
• Ranking Queries
• Data privacy:
  - XML Data publishing
  - Database-as-service

Top-k queries
• Finding the k multi-attribute tuples with the highest scores
• Scoring function: aggregates the scores on the attributes, e.g., w1*A1 + … + wn*An, where wi is the weight for attribute Ai.
• Monotone aggregation functions: if tuple A's grade is at least as high as tuple B's on every attribute, then A's overall grade is at least as high as B's.
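A minimal sketch of the weighted-sum scoring function above. The attribute grades and weights are made-up values for illustration, and the function name `weighted_sum` is hypothetical. Note that a weighted sum is monotone only when all weights are non-negative.

```python
from typing import Sequence


def weighted_sum(grades: Sequence[float], weights: Sequence[float]) -> float:
    """Aggregate per-attribute grades as w1*A1 + ... + wn*An."""
    if len(grades) != len(weights):
        raise ValueError("grades and weights must have the same length")
    return sum(w * g for w, g in zip(weights, grades))


# Hypothetical grades in [0, 1] for (year, mileage, price).
weights = (0.5, 0.3, 0.2)
car_a = (0.9, 0.8, 0.7)
car_b = (0.6, 0.8, 0.4)  # no better than car_a on any attribute

score_a = weighted_sum(car_a, weights)
score_b = weighted_sum(car_b, weights)

# Monotonicity: car_a dominates car_b attribute by attribute,
# so its overall score cannot be lower.
print(score_a >= score_b)
```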

Applications
• Multimedia databases
• Web search queries:
  - Restaurants
  - Houses
  - Cars
  - …

Modes of Data Access (Fagin)
Underlying middleware (e.g., search engines, Garlic, QBIC) supports 2 modes:
1. Sorted access:
  - Each attribute Ai (column) forms a list Li sorted by the score on Ai.
  - The entries of the list are returned one at a time, in sorted order.
2. Random access:
  - Ask the system for the grade of any given object.
Goal: minimize the total cost to get the top-k results
[Figure: sorted lists. year: b, e, f, …; mileage: a, d, e, …; price: a, c, e, …]

FA: Fagin’s algorithm [PODS96]
• Do sorted access in parallel to each of the m sorted lists Li. Wait until there is a set H of at least k objects such that each of these objects has been seen in each of the m lists.
• For each object R that has been seen, do random access as needed to each of the lists Li to find the i-th field xi of R.
• Compute the aggregate grade of each seen object and return the k objects with the highest grades.

Example
• Suppose k = 1. Given the three partial lists retrieved so far, ‘e’ appears in all of them, so the top-1 tuple must be in {a, b, c, d, e, f}.
• Reason: since the function is monotone, tuple ‘e’ “blocks” all tuples below the cut-off line, since they can only have an overall grade no higher than that of ‘e’.
• The algorithm does random access for the other 5 tuples to get their missing grades (the grades of ‘e’ are already known from sorted access), and picks the top 1.
• Notice that we cannot say ‘e’ must be the top 1, since other tuples (e.g., ‘a’) may still have a higher overall score.
• Minor point, one possible improvement: ‘f’ can never be better than ‘e’.
[Figure: partial sorted lists with a cut-off line below the entries shown. year: b, e, f, …; mileage: a, d, e, …; price: a, c, e, …]

General case
• Once k tuples have appeared in all the partial lists, halt.
• Reason: these k tuples block all the tuples below, which cannot be better than these k tuples.
• Do random access for the retrieved tuples to get their overall grades, and find the top-k.
[Figure: the year, mileage, and price lists, each containing the k blocking tuples above the cut-off line]
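The FA steps on the preceding slides can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the function name `fagin_algorithm`, the integer grades, and the in-memory dictionaries standing in for random access are all assumptions. The list orders match the running example (year: b, e, f; mileage: a, d, e; price: a, c, e), and every list is assumed to contain every object.

```python
def fagin_algorithm(lists, k, agg):
    """Return (top-k [(score, object)], number of rows read per list).

    lists: m lists of (object, grade) pairs, each sorted by grade descending.
    agg:   monotone aggregation function over a list of m grades.
    """
    m = len(lists)
    lookup = [dict(lst) for lst in lists]  # stands in for random access
    seen_in = {}                           # object -> indices of lists it was seen in
    depth = 0
    while depth < len(lists[0]):
        # One round of sorted access: read the next entry of every list.
        for i, lst in enumerate(lists):
            obj, _ = lst[depth]
            seen_in.setdefault(obj, set()).add(i)
        depth += 1
        # Halt once at least k objects have been seen in all m lists.
        if sum(1 for s in seen_in.values() if len(s) == m) >= k:
            break
    # Random access: fetch the full grade vector of every seen object.
    scored = [(agg([lookup[i][obj] for i in range(m)]), obj) for obj in seen_in]
    scored.sort(key=lambda t: (-t[0], t[1]))
    return scored[:k], depth


# Hypothetical grades (0-100) consistent with the slide's list orders.
YEAR = [("b", 90), ("e", 80), ("f", 70), ("a", 60), ("c", 50), ("d", 40)]
MILEAGE = [("a", 90), ("d", 80), ("e", 70), ("b", 30), ("c", 20), ("f", 10)]
PRICE = [("a", 95), ("c", 85), ("e", 75), ("b", 30), ("d", 20), ("f", 10)]

fa_top1, fa_rows = fagin_algorithm([YEAR, MILEAGE, PRICE], 1, sum)
print(fa_top1, fa_rows)
```

With these assumed grades, FA stops after three rows because ‘e’ has then been seen in all lists, yet the winner is ‘a’, which illustrates why ‘e’ cannot simply be declared the top 1.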

FA’s Properties
• Can correctly find top-k results for monotone aggregation functions
• Cost on a database with N objects: O(N^((m-1)/m) * k^(1/m)) with arbitrarily high probability, assuming the m sorted lists are independent.
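To get a feel for this bound, here is a back-of-the-envelope evaluation. The values N = 10^6 and k = 10 are assumptions chosen for illustration, and constant factors are ignored.

```latex
\begin{align*}
m = 2:\quad & N^{1/2}\, k^{1/2} = \sqrt{10^{6} \cdot 10} \approx 3.2 \times 10^{3} \\
m = 3:\quad & N^{2/3}\, k^{1/3} = 10^{4} \cdot 10^{1/3} \approx 2.2 \times 10^{4}
\end{align*}
```

Both values are well below N, but the bound moves toward N as the number of lists m grows.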

FA’s Drawbacks
• The number of sorted accesses is still large.
• Since all seen tuples should be buffered, the required buffer size is unbounded.
• Does not exploit the bound given by the aggregation function to determine when to stop sorted access.

TA: Threshold Algorithm [PODS2001]
• Do sorted access in parallel to each of the m sorted lists. As an object R is seen under sorted access in some list, do random access to the other lists to find the grade xi of object R in every other list. Then compute the aggregate grade t(x1, …, xm) of this object R. If it is one of the k highest seen so far, remember it; otherwise discard it.
• For each list Li, let xi be the grade of the last object seen under sorted access. Define the threshold value T to be t(x1, …, xm). As soon as at least k objects have been seen whose grade is at least equal to T, halt.
• Return the k objects that have been seen with the highest grades.

Example
• A buffer keeps the top-k tuples that have been found so far.
• For any tuple in a sorted list, do random access to get its overall grade. Compare it with the tuples in the buffer, and decide whether to insert it or discard it.
• The threshold window (the m most recently read records, one per list) represents the “best” result we could still see, assuming we can combine the best values from different tuples.
• Notice that this window may not be “horizontal” if we use different speeds to access different lists.
• This window helps us decide when to stop: once we find k tuples whose grade is at least equal to that of the window tuple, we halt.
[Figure: buffer for top-k and a threshold window over the sorted lists. year: b, e, f, …; mileage: a, d, e, …; price: a, c, e, …]
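The threshold algorithm described above can be sketched as follows. As with any such sketch, the names (`threshold_algorithm`, `buffered`) and the integer grades are assumptions made for illustration; random access is simulated with dictionaries, and all lists are read at the same speed (a "horizontal" window).

```python
import heapq


def threshold_algorithm(lists, k, agg):
    """Return (top-k [(score, object)], number of rows read per list).

    lists: m lists of (object, grade) pairs, each sorted by grade descending.
    agg:   monotone aggregation function over a list of m grades.
    """
    m = len(lists)
    lookup = [dict(lst) for lst in lists]  # stands in for random access
    top = []                               # min-heap holding at most k (score, object)
    buffered = set()                       # objects currently in the heap
    rows_read = 0
    for depth in range(len(lists[0])):
        last_seen = []                     # grades in the threshold window
        for lst in lists:
            obj, grade = lst[depth]
            last_seen.append(grade)
            if obj in buffered:
                continue
            # Random access to all lists gives the overall grade right away.
            score = agg([lookup[j][obj] for j in range(m)])
            if len(top) < k:
                heapq.heappush(top, (score, obj))
                buffered.add(obj)
            elif score > top[0][0]:
                _, evicted = heapq.heapreplace(top, (score, obj))
                buffered.discard(evicted)
                buffered.add(obj)
        rows_read = depth + 1
        # Halt when k buffered objects are at least as good as the threshold T.
        if len(top) == k and top[0][0] >= agg(last_seen):
            break
    return sorted(top, key=lambda t: (-t[0], t[1])), rows_read


# Hypothetical grades (0-100) consistent with the slide's list orders.
YEAR = [("b", 90), ("e", 80), ("f", 70), ("a", 60), ("c", 50), ("d", 40)]
MILEAGE = [("a", 90), ("d", 80), ("e", 70), ("b", 30), ("c", 20), ("f", 10)]
PRICE = [("a", 95), ("c", 85), ("e", 75), ("b", 30), ("d", 20), ("f", 10)]

ta_top1, ta_rows = threshold_algorithm([YEAR, MILEAGE, PRICE], 1, sum)
print(ta_top1, ta_rows)
```

On this assumed data the threshold drops to 245 after the second row, which equals the grade of ‘a’, so TA halts one row before ‘e’ has appeared in every list. The buffer never holds more than k entries.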

TA’s Properties
• TA is instance optimal for all monotone aggregation functions and over every database.
• Compared to FA, TA requires only a small, bounded buffer (the current top k).
• TA allows early stopping.
• Can show TA never stops later than FA. (Why?)
• There are times when the user is satisfied with an approximate top-k list. TA can be modified to give such an approximation.
• TA can be modified for the case where random access is impossible.

Instance Optimality
• Algorithm b is instance optimal over an algorithm set A and a database instance set D if b is in A and, for every algorithm a in A and every instance d in D, we have: cost(b, d) = O(cost(a, d)).
• Similar to “competitive ratio”
• Essentially: b is the best algorithm in A, up to a constant factor, on every instance.
• Stronger than worst-case optimality
• TA is instance optimal among all “correct algorithms” (nondeterministic algorithms).
[Figure: algorithm set A containing algorithms a and b]

Variations of TA
• NRA: When no random access is possible
  - Example: Web search engines, which typically do not allow you to enter a URL and get its ranking
• TAZ: When no sorted access is possible for some predicates
  - Example: Find good restaurants near location x (sorted and random access for restaurant ratings, random access only for distances from a mapping site)
• CA: When the relative costs of random and sorted accesses matter
• TAθ: When only approximate answers are needed
  - Example: Web search, with lots of good quality answers

Outline
• Ranking Queries
• Data privacy:
  - XML Data publishing
  - Database-as-service

Motivation
• Privacy in publishing XML data
• Applications:
  - Web publishing
  - Data sharing and exchange, e.g., in P2P systems

Example: Hospital XML data
• Goal: hide Alice’s disease
• Common knowledge: patients in the same ward have the same disease
[Figure: XML tree rooted at hospital, with patient and physician subtrees. Patient children include pname, ward, disease, and treat; physician children include phname. Values shown include Alice, Betty, Cathy, Tom, W305, W403, leukemia, cancer, Smith, and Walker.]

Problem
Given:
• An XML document to be published
• Sensitive data in the document
• Common knowledge that public users can use for data inference
Find:
• A partial document to be released so that users cannot infer the sensitive data

Research challenges
• How to model data inference using common knowledge?
• How to compute all possible inferred data?
• How to compute a partial document to be published without leaking sensitive information?

Roadmap
• Information Leakage (current topic)
  - Defining sensitive data
  - Describing common knowledge
  - Computing inferred documents
• Prevent information leakage

Defining sensitive data
• Using an XQuery, called a “regulating query”
• A special node marked “*” indicates the sensitive data
[Figure: regulating queries A1 (labels: patient, Alice, disease, “*”) and A2 (labels: hospital, “*”, pname, Cathy)]

Example 1
• Map the query to the XML tree.
• For each mapping, the target of the “*” node is sensitive.
[Figure: regulating query A1 (labels: patient, Alice, disease, “*”) next to the hospital tree with three patients: pname values Alice, Betty, Cathy; ward values W305; disease values leukemia]
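A small sketch of mapping a regulating query onto an XML tree, using Python's standard `xml.etree.ElementTree`. The document below is a simplified, assumed version of the hospital example (three patients, one value per field), and `sensitive_nodes_a1` is a hypothetical helper that hard-codes query A1 rather than evaluating a general XQuery.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the hospital document on the slide (an assumption).
HOSPITAL_XML = """
<hospital>
  <patient><pname>Alice</pname><ward>W305</ward><disease>leukemia</disease></patient>
  <patient><pname>Betty</pname><ward>W305</ward><disease>leukemia</disease></patient>
  <patient><pname>Cathy</pname><ward>W305</ward><disease>leukemia</disease></patient>
</hospital>
"""


def sensitive_nodes_a1(root):
    """Query A1: the '*' node is the disease of every patient named Alice."""
    hits = []
    for patient in root.iter("patient"):
        if any(p.text == "Alice" for p in patient.findall("pname")):
            hits.extend(patient.findall("disease"))
    return hits


root = ET.fromstring(HOSPITAL_XML)
hits = sensitive_nodes_a1(root)
print([node.text for node in hits])
```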

Example 2
[Figure: regulating query A2 (labels: hospital, “*”, pname, Cathy) next to the hospital tree with three patients: pname values Alice, Betty, Cathy; ward values W305; disease values leukemia]

Common Knowledge
• Represented as XML constraints
• Could be obtained in various ways, e.g.,
  - possible schema
  - analysis from the published data

Common Constraints
• Child constraints: //p → //p/c
  - Example: //patient → //patient/pname
• Descendant constraints: //p → //p//d
  - Example: //patient → //patient//disease
• Functional dependencies: //p/a → //p/b
  - Example: //patient/ward → //patient/disease
  - If w1 = w2 (value equal), then d1 = d2
[Figure: patient nodes illustrating each constraint: a patient with a pname child, a patient with a disease descendant, and two patients with wards w1, w2 and diseases d1, d2]
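One way to read the three constraint types is as checks that a complete document must pass. The sketch below does this with `xml.etree.ElementTree`; the helper names (`holds_child`, `holds_descendant`, `holds_fd`) and the sample document are assumptions, and each element is assumed to carry at most one value per field.

```python
import xml.etree.ElementTree as ET


def holds_child(root, p, c):
    """//p -> //p/c : every p element has at least one c child."""
    return all(node.find(c) is not None for node in root.iter(p))


def holds_descendant(root, p, d):
    """//p -> //p//d : every p element has at least one d descendant."""
    return all(node.find(".//" + d) is not None for node in root.iter(p))


def holds_fd(root, p, a, b):
    """//p/a -> //p/b : p elements with equal a values have equal b values."""
    b_for_a = {}
    for node in root.iter(p):
        a_val, b_val = node.findtext(a), node.findtext(b)
        if a_val is None or b_val is None:
            continue  # the dependency only speaks about elements that have both
        if b_for_a.setdefault(a_val, b_val) != b_val:
            return False
    return True


# Hypothetical sample: the second patient has no pname, and its disease
# is nested under treat rather than being a direct child.
sample = ET.fromstring(
    "<hospital>"
    "<patient><pname>Alice</pname><ward>W305</ward><disease>leukemia</disease></patient>"
    "<patient><ward>W305</ward><treat><disease>leukemia</disease></treat></patient>"
    "</hospital>"
)

print(holds_child(sample, "patient", "pname"))
print(holds_descendant(sample, "patient", "disease"))
print(holds_fd(sample, "patient", "ward", "disease"))
```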

Modify partial document using constraints
• C1: //patient → //patient/pname
• C2: //patient → //patient//disease
• C3: //patient/ward → //patient/disease
[Figure: partial document P. A hospital with two patients; the nodes shown are one pname, one disease (leukemia), and two wards (W305, W305)]

Apply C1 on document P
• C1: //patient → //patient/pname
[Figure: C1(P). Document P with a second pname node added, so that both patients have a pname child]

Apply C2 on document P
• C2: //patient → //patient//disease
• Floating branch: exact location unknown
[Figure: C2(P). Document P with a disease node added as a floating branch]

Apply C3 on document P
• C3: //patient/ward → //patient/disease
[Figure: C3(P). Document P with a disease node of value leukemia added, inferred from the equal ward values W305]

Apply a sequence of constraints: <C2,C3>
• C2: //patient → //patient//disease
• C3: //patient/ward → //patient/disease
[Figure: result of applying C2 and then C3 to P; the tree contains both the floating disease branch and the inferred disease node with value leukemia]

Another user applies a different sequence of constraints: <C3,C2>
• C2: //patient → //patient//disease
• C3: //patient/ward → //patient/disease
• After applying C3, we cannot use C2 to expand the tree.
• No more floating branch!
[Figure: result of applying C3 and then C2 to P; the tree contains the inferred disease node with value leukemia and no floating branch]

P1: result of <C2,C3>. P2: result of <C3,C2>. They look different!
• P1 is “m-contained” in P2:
  - There is a mapping from P1 to P2.
  - A floating branch can be mapped to a path.
• The m-containing document P2 has more information.
• P2 is also “m-contained” in P1.
• Thus they are “m-equivalent”!
[Figure: documents P1 and P2 side by side]

What documents can users infer?
• Different users can use different sequences of constraints to do inference.
• Thus they can infer different documents.
• Questions:
  - Can an inference process terminate?
  - What inferred document should we consider to prevent leakage of sensitive data?

Theorem
• Given a partial document P of an XML document D and a set of constraints C = {C1, …, Ck}, there is a document M that can be inferred from P using a sequence of constraints, such that:
  - for any sequence of constraints, its resulting document is m-contained in M.
• M can be computed using a greedy approach.
• Such a document is unique under m-equivalence.

Information leakage
• For a partial document P, if there exists a regulating query A such that the maximal inferred document M produces a non-empty answer to the query A, then we say “P causes information leakage.”
[Figure: partial document P, inference, and regulating query A]

Roadmap
• Information Leakage
• Prevent information leakage (current topic)

Formal Problem
• Given an XML document D, a regulating query A, and common knowledge represented as constraints C1, …, Ck:
• How to find a partial document P without information leakage?
  - Called a valid partial document
  - The empty document is a trivial one
  - We want the published document to have as much data as possible

An algorithm
• We develop an algorithm for solving this problem.
• We use the running example to illustrate the algorithm.

Example
• Regulating query A (labels: patient, Alice, disease, “*”)
• Functional dependency: //patient/ward → //patient/disease
[Figure: hospital tree with three patients: pname values Alice, Betty, Cathy; ward values W305; disease values leukemia]

Remove sensitive data A(D)
• Remaining document: D - A(D)
[Figure: hospital tree with the answer A(D) to the regulating query marked for removal]

Compute the maximal inferred document M of D - A(D)
[Figure: maximal inferred document M, shown next to regulating query A]

Testing Information Leakage
• There is a mapping from A to M, the maximal inferred document of D - A(D), so information is leaked.
[Figure: regulating query A mapped onto the inferred hospital tree]
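The three steps just shown (remove A(D), compute the inferred document, test the regulating query) can be put together in a short sketch. It makes several simplifying assumptions that are not in the slides: only the functional dependency //patient/ward → //patient/disease is applied, floating branches are not modeled, the document is a reduced three-patient version of the example, and all function names are hypothetical.

```python
import copy
import xml.etree.ElementTree as ET

# Simplified stand-in for the hospital document D (an assumption).
D = ET.fromstring(
    "<hospital>"
    "<patient><pname>Alice</pname><ward>W305</ward><disease>leukemia</disease></patient>"
    "<patient><pname>Betty</pname><ward>W305</ward><disease>leukemia</disease></patient>"
    "<patient><pname>Cathy</pname><ward>W305</ward><disease>leukemia</disease></patient>"
    "</hospital>"
)


def query_a(root):
    """Regulating query A: (patient, disease) pairs for patients named Alice."""
    return [(p, d) for p in root.iter("patient")
            if p.findtext("pname") == "Alice"
            for d in p.findall("disease")]


def remove_sensitive(root):
    """Return D - A(D): a copy with the nodes matched by '*' removed."""
    partial = copy.deepcopy(root)
    for patient, disease in query_a(partial):
        patient.remove(disease)
    return partial


def infer(root):
    """Apply //patient/ward -> //patient/disease until nothing changes."""
    inferred = copy.deepcopy(root)
    changed = True
    while changed:
        changed = False
        known = {p.findtext("ward"): p.findtext("disease")
                 for p in inferred.iter("patient")
                 if p.find("ward") is not None and p.find("disease") is not None}
        for p in inferred.iter("patient"):
            ward = p.findtext("ward")
            if ward in known and p.find("disease") is None:
                ET.SubElement(p, "disease").text = known[ward]
                changed = True
    return inferred


def leaks(partial):
    """True if the regulating query has an answer on the inferred document."""
    return bool(query_a(infer(partial)))


partial = remove_sensitive(D)
print(leaks(partial))
```

Under these assumptions, removing only Alice's disease is not enough: her ward W305 together with Betty's or Cathy's record lets the dependency restore it. Also removing Alice's ward from the partial document breaks this particular inference.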

Computing a valid partial document
• How to break the mappings?
• How to chase back the inference steps?
[Figure: starting from D - A(D) and A(D), inference steps lead to mappings from the regulating query A; arrows labeled “break mapping” and “chase back” run in the reverse direction]

AND/OR Graphs
• A structure representing how a goal can be reached by solving subproblems.
• We use such graphs to formulate the process of finding a valid partial document.

AND/OR graph: START
• Consider mapping images of the leaf nodes in A.
• An “OR” connector shows that solving any of the subproblems can solve the parent problem.
[Figure: START node with an OR connector to the nodes (1) Alice and (1) leukemia, shown next to regulating query A and the hospital tree]

AND/OR graph: expanding a node
• Multiple ways to infer the sensitive data.
• An “AND” connector shows that solving ALL the subproblems can solve the parent problem.
[Figure: the node (1) leukemia has an AND connector to two OR connectors, whose children are ward nodes with value W305 and disease nodes with value leukemia of patients (1), (2), and (3)]

AND/OR graph: further expansion
• Continue expanding the AND/OR graph.
[Figure: the graph from START expanded further, with additional AND and OR connectors below the disease nodes of the other patients]

AND/OR Graphs (cont)
• A special START node represents the goal of computing a valid partial document.
• The graph has nodes corresponding to nodes in the maximal inferred document M.
• Such a node represents the subproblem of hiding its corresponding node n in M:
  - This node n should be removed from M.
  - It cannot be inferred using the constraints and other nodes in M.
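To make the AND/OR reading concrete, here is a toy evaluator over a hand-built graph for the running example. Everything in it is an assumption made for illustration: the node labels such as `"ward(1)"`, the tuple encoding of connectors, and the function `cheapest_removal`. It treats every listed node as a leaf, whereas in the slides hiding a node such as the disease of patient (2) would itself be expanded further. Choosing the smallest option per OR connector is a simple heuristic and is not guaranteed to be globally minimal when subproblems share nodes.

```python
def cheapest_removal(node):
    """Return a set of document nodes whose removal solves the subproblem.

    A node is either a string (a leaf: remove this document node) or a
    pair (kind, children) with kind "OR" or "AND".
    """
    if isinstance(node, str):
        return frozenset([node])
    kind, children = node
    options = [cheapest_removal(child) for child in children]
    if kind == "OR":
        # Solving any one subproblem is enough: take the smallest option.
        return min(options, key=len)
    if kind == "AND":
        # All subproblems must be solved: combine their removals.
        return frozenset().union(*options)
    raise ValueError(f"unknown connector: {kind}")


# Hiding Alice's disease: remove the node itself AND break each way of
# inferring it through the ward -> disease dependency.
hide_disease = ("AND", [
    "disease(1)",
    ("OR", ["ward(1)", "ward(2)", "disease(2)"]),  # inference via patient (2)
    ("OR", ["ward(1)", "ward(3)", "disease(3)"]),  # inference via patient (3)
])

# START: break the mapping at either leaf image of the regulating query.
start = ("OR", ["pname(1)", hide_disease])

print(sorted(cheapest_removal(hide_disease)))
print(sorted(cheapest_removal(start)))
```

In this toy graph, removing Alice's ward breaks both inference paths at once, which is why it appears in the solution for hiding her disease.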
