
On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases

Presented by Xi Zhang, February 8th, 2008.


Presentation Transcript


  1. On the Semantics and Evaluation of Top-k Queries in Probabilistic Databases • Presented by Xi Zhang • February 8th, 2008

  2. Outline • Background • Motivation Examples • Top-k Queries in Probabilistic Databases • Conclusion

  3. Outline • Background • Probabilistic database model • Top-k queries & scoring functions • Motivation Examples • Top-k Queries in Probabilistic Databases • Conclusion

  4. Probabilistic Databases • Motivation • Uncertainty/vagueness/imprecision in data • History • Incomplete information in relational DB [Imielinski & Lipski 1984] • Probabilistic DB model [Cavallo & Pittarelli 1987] • Probabilistic Relational Algebra [Fuhr & Rölleke 1997 etc.] • Comeback • Proliferation of uncertain data in real-world applications • Examples: WWW, biological data, sensor networks, etc.

  5. Probabilistic Database Model [Fuhr & Rölleke 1997] • Probabilistic Database Model • A generalization of relational DB • Probabilistic Relational Algebra (PRA) • A generalization of standard relational algebra

  6. A Table in Probabilistic Database DocTerm: Event expression Independent events

  7. Probabilistic Relational Algebra • Just like in Relational Algebra… • Selection • Projection • Join • Union • Difference -

  8. Probabilistic Relational Algebra • Just like in Relational Algebra… • Selection • Projection • Join • Union • Difference -

  9. In derived table Selection DocTerm: Propositional expression of basic events

  10. Projection DocTerm:

  11. Join DocAu: DocTerm:

  12. Join + Projection • Input tables: DocAu, DocTerm • Result rows: IR, DB • Prob: 0.81 * 0.21 = 0.1701; 0.56 * 0.91 = 0.5096; 0.4368

  13. Join + Projection • Input tables: DocAu, DocTerm • Result rows: IR, DB • Intensional Semantics vs. Extensional Semantics • Prob: 0.81 * 0.21 = 0.1701; 0.56 * 0.91 = 0.5096; 0.4368

  14. Intensional vs. Extensional • Intensional Semantics • Assumes data independence of base tables • Keeps track of data dependence during the evaluation • Extensional Semantics • Assumes data independence during the evaluation • Can compute WRONG probabilities!

  15. When Intensional = Extensional? • No identical underlying basic events in the event expression • Figure annotation: 0.4368, identical basic event
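The gap between the two semantics can be sketched in a few lines of Python. The basic events `a`, `b`, `c` and their probabilities are made up for illustration and are not the values from the slides' tables. The expression (a AND b) OR (a AND c) contains the basic event `a` twice, so the extensional shortcut, which treats the two conjuncts as independent, is expected to differ from the exact intensional value obtained by summing over all truth assignments.

```python
from itertools import product

# Hypothetical independent basic events and their probabilities.
P = {"a": 0.8, "b": 0.5, "c": 0.5}

def expr(world):
    # Event expression (a AND b) OR (a AND c); 'a' occurs in both conjuncts.
    return (world["a"] and world["b"]) or (world["a"] and world["c"])

def intensional(expr, probs):
    """Exact probability: add up the weight of every truth assignment that satisfies expr."""
    names = list(probs)
    total = 0.0
    for values in product([True, False], repeat=len(names)):
        world = dict(zip(names, values))
        weight = 1.0
        for name in names:
            weight *= probs[name] if world[name] else 1.0 - probs[name]
        if expr(world):
            total += weight
    return total

def extensional(probs):
    """Shortcut: treat (a AND b) and (a AND c) as if they were independent."""
    p_ab = probs["a"] * probs["b"]
    p_ac = probs["a"] * probs["c"]
    return 1.0 - (1.0 - p_ab) * (1.0 - p_ac)

print(round(intensional(expr, P), 4))  # exact value
print(round(extensional(P), 4))        # overestimates because 'a' is shared
```

If the shared event is removed, for example by evaluating (a AND b) OR (c AND d) over four distinct basic events, the two computations agree, which is the condition stated on the slide.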

  16. Fuhr & Rölleke 1997 • Summary • Probabilistic DB Model • Concept of event • Basic vs. complex event • Event expression • Probabilistic Relational Algebra • Just like in Relational Algebra… • Computation of event probabilities • Intensional vs. extensional semantics • Yield the same result when there is NO data dependence in event expressions

  17. Outline • Background • Probabilistic database model • Top-k queries & scoring functions • Motivation Examples • Top-k Queries in Probabilistic Databases • Semantics • Query Evaluation • Conclusion

  18. Top-k Queries • Traditionally, given • Objects: o1, o2, …, on • A non-negative integer: k • A scoring function s • Question: What are the k objects with the highest score? • Have been studied in Web, XML, Relational Databases, and more recently in Probabilistic Databases.

  19. Scoring Function • A scoring function s over a deterministic relation R is • For any ti and tj from R,

  20. Outline • Background • Motivation Examples • Smart Environment Example • Sensor Network Example • Top-k Queries in Probabilistic Databases • Conclusion

  21. Motivating Example I • Smart Environment • Sample Question • “Who were the two visitors in the lab last Saturday night?” • Data • Biometric data from sensors • How well those data match the profile of each candidate gives a scoring function • Historical statistics • e.g. probability of a certain candidate being in the lab on Saturday nights

  22. Motivating Example I (cont.) • Table columns: Personnel; Biometrics with score(…); Probability of being in lab on Saturday nights • Probabilities listed: 0.3, 0.9, 0.4 • Question: Find two people in the lab last Saturday night • This is a Top-2 query over the above probabilistic database under the above scoring function

  23. Motivating Example II • Sensor Network in a Habitat • Sample Question • “What is the temperature of the warmest spot?” • Data • Sensor readings from different sensors • At a sampling time, only one “real” reading from a sensor • Each sensor reading comes with a confidence value

  24. Motivating Example II (cont.) • C1 (from Sensor 1): readings with Prob 0.6 and 0.4 • C2 (from Sensor 2): readings with Prob 0.1 and 0.6 • Question: What is the temperature of the warmest spot? • This is a Top-1 query over the above probabilistic database under the scoring function proportional to temperature

  25. Outline • Background • Motivation Examples • Top-k Queries in Probabilistic Databases • Semantics • Query Evaluation • Conclusion

  26. Models • A probabilistic relation Rp=<R, p, C > • R: the support deterministic relation • p: probability function • C : a partition of R, such that the probabilities of the tuples in each part sum to at most 1 • Simple vs. General probabilistic relation • Simple • Assume tuple independence, i.e. |C |=|R| • E.g. smart environment example • General • Tuples can be independent or exclusive, i.e. |C |<|R| • E.g. sensor network example
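The model can be written down as a small data structure. This is a minimal sketch, not the authors' implementation: the class name `ProbRelation`, its field names, and the sensor tuple names `r1` to `r4` are invented for illustration, and the check assumes the condition on C is that the probabilities inside each part sum to at most 1 (tuples in the same part being mutually exclusive).

```python
from dataclasses import dataclass

@dataclass
class ProbRelation:
    """Sketch of R^p = <R, p, C>; all names here are illustrative."""
    tuples: list  # R: identifiers of the tuples in the support relation
    prob: dict    # p: tuple identifier -> probability
    parts: list   # C: list of parts, each a list of mutually exclusive tuples

    def validate(self):
        # C must cover every tuple of R exactly once (a partition).
        flat = [t for part in self.parts for t in part]
        if sorted(flat) != sorted(self.tuples):
            raise ValueError("C is not a partition of R")
        # Assumed condition: probabilities within a part sum to at most 1.
        for part in self.parts:
            if sum(self.prob[t] for t in part) > 1.0 + 1e-9:
                raise ValueError("part probabilities exceed 1")

    def is_simple(self):
        # Simple relation: every tuple is its own part, i.e. |C| = |R|.
        return len(self.parts) == len(self.tuples)

# Smart environment example: three independent tuples.
smart = ProbRelation(
    tuples=["Aiden", "Bob", "Chris"],
    prob={"Aiden": 0.3, "Bob": 0.9, "Chris": 0.4},
    parts=[["Aiden"], ["Bob"], ["Chris"]],
)
# Sensor example: readings of one sensor exclude each other.
# The names r1..r4 are hypothetical; the probabilities are the slide's.
sensors = ProbRelation(
    tuples=["r1", "r2", "r3", "r4"],
    prob={"r1": 0.6, "r2": 0.4, "r3": 0.1, "r4": 0.6},
    parts=[["r1", "r2"], ["r3", "r4"]],
)
for rel in (smart, sensors):
    rel.validate()
print(smart.is_simple(), sensors.is_simple())
```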

  27. Challenges Given • A probabilistic relation Rp=<R, p, C > • An injective scoring function s over R • No ties • A non-negative integer k • What is the top-k answer set over Rp ? (Semantics) • How to compute the top-k answer of Rp ? (Query Evaluation)

  28. What is a “Good” Semantics? • Desired Properties • Exact-k • Faithfulness • Stability

  29. Properties • Exact-k • If R has at least k tuples, then exactly k tuples are returned as the top-k answer • Faithfulness • A “better” tuple, i.e. higher in score and probability, is more likely to be in the top-k answer, compared to a “worse” one • Stability • Raising the score/prob. of a winning tuple will not cause it to lose • Lowering the score/prob. of a losing tuple will not cause it to win

  30. Global-Topk Semantics Given • A probabilistic relation Rp=<R, p, C > • An injective scoring function s over R • No ties • A non-negative integer k • What is the top-k answer set over Rp ? (Semantics) • Global-Topk • Return the k highest-ranked tuples according to their probability of being in top-k answers in possible worlds • Global-Topk satisfies aforementioned three properties

  31. Smart Environment Example • Query: Find two people in lab on last Saturday night
      Personnel: Score(Face Detection, Voice Detection, …) = score; Prob.
      Aiden: Score(0.70, 0.60, …) = 0.65; 0.3
      Bob: Score(0.50, 0.60, …) = 0.55; 0.9
      Chris: Score(0.50, 0.40, …) = 0.45; 0.4
      Possible worlds (probability): {} 0.042; {Aiden} 0.018; {Bob} 0.378; {Chris} 0.028; {Aiden, Bob} 0.162; {Aiden, Chris} 0.012; {Bob, Chris} 0.252; {Aiden, Bob, Chris} 0.108
      Global-Topk Semantics: Pr(Bob in top-2) = 0.9; Pr(Aiden in top-2) = 0.3; Pr(Chris in top-2) = 0.028 + 0.012 + 0.252 = 0.292
      Top-2 Answer: {Aiden, Bob}
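The numbers on this slide can be reproduced by brute force. The sketch below enumerates all possible worlds of the three independent tuples and adds up, for each person, the probability of the worlds in which that person is among the two highest-scored tuples present. The function names are illustrative, and exhaustive enumeration is only meant to mirror the definition; it is exponential in the number of tuples.

```python
from itertools import product

# Smart environment data from the slide: name -> (score, probability).
DATA = {"Aiden": (0.65, 0.3), "Bob": (0.55, 0.9), "Chris": (0.45, 0.4)}

def possible_worlds(data):
    """Yield (tuples present, world probability) for independent tuples."""
    names = list(data)
    for present in product([True, False], repeat=len(names)):
        prob = 1.0
        for name, here in zip(names, present):
            p = data[name][1]
            prob *= p if here else 1.0 - p
        yield [n for n, here in zip(names, present) if here], prob

def global_topk(data, k):
    """Return the k tuples with the highest probability of being in a world's top-k."""
    in_topk = {name: 0.0 for name in data}
    for world, prob in possible_worlds(data):
        ranked = sorted(world, key=lambda n: data[n][0], reverse=True)
        for name in ranked[:k]:
            in_topk[name] += prob
    answer = sorted(data, key=lambda n: in_topk[n], reverse=True)[:k]
    return answer, in_topk

answer, probs = global_topk(DATA, 2)
print(answer)
print({name: round(p, 3) for name, p in probs.items()})
```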

  32. Other Semantics • Soliman, Ilyas & Chang 2007 • Two Alternative Semantics • U-Topk • U-kRanks

  33. U-Topk Semantics Given • A probabilistic relation Rp=<R, p, C > • An injective scoring function s over R • No ties • A non-negative integer k • What is the top-k answer set over Rp ? (Semantics) • U-Topk • Return the most probable top-k answer set that belongs to possible worlds • U-Topk does not satisfy all three properties

  34. Smart Environment Example • Query: Find two people in lab on last Saturday night
      Personnel: Score(Face Detection, Voice Detection, …) = score; Prob.
      Aiden: Score(0.70, 0.60, …) = 0.65; 0.3
      Bob: Score(0.50, 0.60, …) = 0.55; 0.9
      Chris: Score(0.50, 0.40, …) = 0.45; 0.4
      Possible worlds (probability): {} 0.042; {Aiden} 0.018; {Bob} 0.378; {Chris} 0.028; {Aiden, Bob} 0.162; {Aiden, Chris} 0.012; {Bob, Chris} 0.252; {Aiden, Bob, Chris} 0.108
      U-Topk Semantics: Pr({Bob}) = 0.378; …; Pr({Aiden, Bob}) = 0.162 + 0.108 = 0.27
      Top-2 Answer: {Bob}
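A brute-force sketch of U-Topk on the same data: each possible world contributes its probability to the set formed by its (at most) two highest-scored tuples, and the set with the largest accumulated probability is returned. The function name `u_topk` is illustrative. On this data the winning set has a single member, which shows how the semantics can return fewer than k tuples.

```python
from collections import defaultdict
from itertools import product

# Smart environment data from the slide: name -> (score, probability).
DATA = {"Aiden": (0.65, 0.3), "Bob": (0.55, 0.9), "Chris": (0.45, 0.4)}

def u_topk(data, k):
    """Return the most probable top-k answer set and the probability of every candidate set."""
    names = list(data)
    answer_prob = defaultdict(float)
    for present in product([True, False], repeat=len(names)):
        prob = 1.0
        world = []
        for name, here in zip(names, present):
            p = data[name][1]
            prob *= p if here else 1.0 - p
            if here:
                world.append(name)
        # The top-k answer of this world: its k highest-scored tuples (fewer if the world is small).
        top = sorted(world, key=lambda n: data[n][0], reverse=True)[:k]
        answer_prob[frozenset(top)] += prob
    best = max(answer_prob, key=answer_prob.get)
    return set(best), answer_prob

best, answer_prob = u_topk(DATA, 2)
print(best, round(answer_prob[frozenset(best)], 3))
print(round(answer_prob[frozenset({"Aiden", "Bob"})], 3))
```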

  35. U-kRanks Semantics Given • A probabilistic relation Rp=<R, p, C > • An injective scoring function s over R • No ties • A non-negative integer k • What is the top-k answer set over Rp ? (Semantics) • U-kRanks • For i=1,2,…,k, return the most probable ith-ranked tuple across all possible worlds • U-kRanks does not satisfy all three properties

  36. Smart Environment Example • Query: Find two people in lab on last Saturday night
      Personnel: Score(Face Detection, Voice Detection, …) = score; Prob.
      Aiden: Score(0.70, 0.60, …) = 0.65; 0.3
      Bob: Score(0.50, 0.60, …) = 0.55; 0.9
      Chris: Score(0.50, 0.40, …) = 0.45; 0.4
      Possible worlds (probability): {} 0.042; {Aiden} 0.018; {Bob} 0.378; {Chris} 0.028; {Aiden, Bob} 0.162; {Aiden, Chris} 0.012; {Bob, Chris} 0.252; {Aiden, Bob, Chris} 0.108
      U-kRanks Semantics, probability at rank-1: Aiden 0.3; Bob 0.63 (highest); Chris 0.028
      Probability at rank-2: Aiden 0; Bob 0.27 (highest); Chris 0.264, e.g. Pr(Chris at rank-2) = 0.012 + 0.252 = 0.264
      Top-2 Answer: {Bob}
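The rank probabilities can be checked by enumeration as well. The sketch below accumulates, for each rank position, the probability that each person occupies that position across all possible worlds, then picks the most probable person per rank. The function name `u_kranks` is illustrative. Because the same person can win more than one rank, the resulting answer set can contain fewer than k distinct tuples.

```python
from collections import defaultdict
from itertools import product

# Smart environment data from the slide: name -> (score, probability).
DATA = {"Aiden": (0.65, 0.3), "Bob": (0.55, 0.9), "Chris": (0.45, 0.4)}

def u_kranks(data, k):
    """Return the most probable tuple at each of the ranks 1..k, plus the per-rank probabilities."""
    names = list(data)
    rank_prob = [defaultdict(float) for _ in range(k)]
    for present in product([True, False], repeat=len(names)):
        prob = 1.0
        world = []
        for name, here in zip(names, present):
            p = data[name][1]
            prob *= p if here else 1.0 - p
            if here:
                world.append(name)
        ranked = sorted(world, key=lambda n: data[n][0], reverse=True)
        for position, name in enumerate(ranked[:k]):
            rank_prob[position][name] += prob
    winners = [max(probs, key=probs.get) for probs in rank_prob]
    return winners, rank_prob

winners, rank_prob = u_kranks(DATA, 2)
print(winners)                            # winner at rank-1, winner at rank-2
print(round(rank_prob[1]["Chris"], 3))    # Pr(Chris at rank-2) = 0.012 + 0.252
print(set(winners))                       # distinct tuples in the answer
```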

  37. Properties • A better semantics • * Yes when the relation is simple, No otherwise

  38. Challenges Given • A probabilistic relation Rp=<R, p, C > • An injective scoring function s over R • No ties • A non-negative integer k • What is the top-k answer set over Rp ? (Semantics) • How to compute the top-k answer of Rp ? (Query Evaluation) Global-Topk

  39. Global-Topk in Simple Relation • Given Rp=<R, p, C >, a scoring function s, a non-negative integer k • Assumptions • Tuples are independent, i.e. |C |=|R| • R={t1,t2,…,tn}, ordered in the decreasing order of their scores, i.e. s(t1) > s(t2) > … > s(tn)

  40. Global-Topk in Simple Relation • Query Evaluation • Recursion • Pk,s(ti): Global-Topk probability of tuple ti • Dynamic Programming
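One way to realize the dynamic programming idea is sketched below. It is not necessarily the exact recursion shown on the slide, which did not survive in the transcript. It relies on one observation: with tuples sorted by decreasing score, ti is in the top-k of a world exactly when ti is present and at most k-1 of t1, …, t(i-1) are present. The array `exact` is a hypothetical helper holding the probability that exactly j of the tuples processed so far are present.

```python
def global_topk_probs(probs, k):
    """probs: existence probabilities of independent tuples, in decreasing score order.

    Returns the Global-Topk probability of each tuple in O(n * k) time.
    """
    if k <= 0:
        return [0.0] * len(probs)
    # exact[j] = Pr(exactly j of the tuples processed so far are present), for j < k.
    exact = [1.0] + [0.0] * (k - 1)
    result = []
    for p in probs:
        # The tuple is in the top-k iff it is present and fewer than k better tuples are.
        result.append(p * sum(exact))
        # Fold the current tuple into the counts; go from high j to low j
        # so that exact[j - 1] still holds the value from before this tuple.
        for j in range(k - 1, 0, -1):
            exact[j] = exact[j] * (1.0 - p) + exact[j - 1] * p
        exact[0] *= 1.0 - p
    return result

# Smart environment example: Aiden, Bob, Chris in decreasing score order.
print([round(x, 3) for x in global_topk_probs([0.3, 0.9, 0.4], 2)])
```

On the smart environment data this reproduces the probabilities obtained from the possible worlds (0.3, 0.9, 0.292) without enumerating the worlds.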

  41. Optimization • Threshold Algorithm (TA) • [Fagin, Lotem & Naor 2001] • Given a system of objects, such that • For each object attribute, there is a sorted list ranking objects in the decreasing order of their scores on that attribute • An aggregation function f combines individual attribute scores xi, i=1,2,…,m, to obtain the overall object score f(x1,x2,…,xm) • f is monotonic • f(x1,x2,…,xm) <= f(x’1,x’2,…,x’m) whenever xi <= x’i for every i • TA is cost-optimal in finding the top-k objects • TA and its variants are widely used in ranking queries, e.g. top-k, skyline, etc.
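The following is a minimal in-memory sketch of the Threshold Algorithm as described above, not a middleware implementation. The objects `x`, `y`, `z`, their attribute scores, and the choice of `sum` as the monotonic aggregation function are assumptions for illustration. Sorted access walks all lists in parallel, random access fetches the remaining attributes of each newly seen object, and the loop stops once k seen objects score at least the threshold, which is f applied to the scores last seen under sorted access.

```python
import heapq

def threshold_algorithm(lists, f, k):
    """lists: one dict per attribute, mapping object -> score (every object in every dict).
    f: monotonic aggregation function taking a list of m attribute scores.
    Returns up to k (object, overall score) pairs, best first.
    """
    # Sorted access order for each attribute: decreasing score.
    sorted_lists = [sorted(attr.items(), key=lambda kv: kv[1], reverse=True)
                    for attr in lists]
    seen = {}
    depth_limit = len(sorted_lists[0]) if sorted_lists else 0
    for depth in range(depth_limit):
        last_seen = []
        for sorted_list in sorted_lists:
            obj, score = sorted_list[depth]
            last_seen.append(score)
            if obj not in seen:
                # Random access: look up the object's score on every attribute.
                seen[obj] = f([attr[obj] for attr in lists])
        # No unseen object can score higher than f of the last seen scores.
        threshold = f(last_seen)
        top = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        if len(top) >= k and top[-1][1] >= threshold:
            return top
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])

# Hypothetical data: two attributes, three objects.
attr1 = {"x": 0.9, "y": 0.8, "z": 0.1}
attr2 = {"x": 0.7, "y": 0.9, "z": 0.2}
result = threshold_algorithm([attr1, attr2], sum, 1)
print(result[0][0])
```

In this run the algorithm stops after the second round of sorted access and never touches `z`, which is where the savings come from.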

  42. Applying TA Optimization • Global-Topk • Two attributes: probability & score • Aggregation function: Global-Topk probability

  43. Global-Topk in General Relation • Given Rp=<R, p, C >, a scoring function s, a non-negative integer k • Assumptions • Tuples are independent or exclusive, i.e. |C |<|R| • R={t1,t2,…,tn}, ordered in the decreasing order of their scores, i.e. s(t1) > s(t2) > … > s(tn)

  44. Global-Topk in General Relation • Induced Event Relation • For each tuple in R, there is a probabilistic relation Ep=<E, pE, C E> generated by the following two rules • Ep is simple

  45. Sensor Network Example • Prob. Relation (general) • C1 (from Sensor 1): Prob 0.6, 0.4 • C2 (from Sensor 2): Prob 0.1, 0.6 • For example: a tuple t with p(t) = 0.6 • Induced Event Relation (simple) • Rule 2 where i=1: Prob 0.6 • Rule 1: Prob 0.6 = p(t)

  46. Global-Topk in General Relation

  47. Evaluating Global-Topk in General Relation • For each tuple t, generate corresponding induced event relation • Compute the Global-Topk probability of t by Theorem 4.3 • Pick the k tuples with the highest Global-Topk probability
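The three steps can be sketched as follows. The two generation rules and Theorem 4.3 are not reproduced in the transcript, so the sketch assumes the following reading: Rule 1 creates one event for t itself with probability p(t); Rule 2 creates, for every other part that contains tuples scoring higher than t, one event whose probability is the sum of the probabilities of those tuples. Under that assumption t is in the top-k exactly when its own event occurs and at most k-1 of the other events occur. The temperatures used as scores are made up; only the probabilities come from the sensor slides.

```python
def induced_event_probs(parts, t):
    """parts: list of parts, each a list of (score, prob) tuples that exclude each other.

    Returns p(t) and the probabilities of the events induced by the other parts
    (assumed Rule 2: total probability of that part's tuples scoring higher than t).
    """
    others = []
    for part in parts:
        if t in part:
            continue  # tuples in t's own part can never co-occur with t
        higher = sum(p for score, p in part if score > t[0])
        if higher > 0:
            others.append(higher)
    return t[1], others

def global_topk_prob_general(parts, t, k):
    """Global-Topk probability of t, computed on the simple induced event relation."""
    if k <= 0:
        return 0.0
    p_t, others = induced_event_probs(parts, t)
    # exact[j] = Pr(exactly j of the other events occur), for j < k.
    exact = [1.0] + [0.0] * (k - 1)
    for p in others:
        for j in range(k - 1, 0, -1):
            exact[j] = exact[j] * (1.0 - p) + exact[j - 1] * p
        exact[0] *= 1.0 - p
    return p_t * sum(exact)

# Hypothetical temperatures as scores, with the slide's probabilities.
PARTS = [[(22, 0.6), (10, 0.4)],   # C1 (from Sensor 1)
         [(25, 0.1), (15, 0.6)]]   # C2 (from Sensor 2)
all_tuples = [t for part in PARTS for t in part]
best = max(all_tuples, key=lambda t: global_topk_prob_general(PARTS, t, 1))
print(best, round(global_topk_prob_general(PARTS, best, 1), 3))
```

For the tuple (15, 0.6) this yields one event of probability 0.6 from C1 and the tuple's own event of probability 0.6, matching the two 0.6 entries in the induced event relation of the sensor network example.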

  48. Summary on Query Evaluation • Simple (Independent Tuples) • Dynamic Programming • Tuples are ordered on their scores • Recursion on the tuple index and k • General (Independent/Exclusive Tuples) • Polynomial reduction to simple cases

  49. Complexity • * m is a rule engine related factor: it represents how complicated the relationship between tuples could be

  50. Outline • Background • Motivation Examples • Top-k Queries in Probabilistic Databases • Conclusion
