## Data Uncertainty

**Supporting Ranking and Aggregation Queries on Uncertain and Incomplete Data**

http://www.cs.uwaterloo.ca/~ilyas/URank

**Data Uncertainty**

• Conventional databases are deterministic
  • An item either is in the database or is not in the database
  • Database is a “complete world”
  • A tuple either is in the query answer or is not
• In many current applications, data are uncertain
  • Sensor networks
  • Information extraction
  • Moving objects tracking
  • Data cleaning
• “A tuple belongs to the database” is a probabilistic event
• “A tuple is part of query answer” is a probabilistic event
• Data models (e.g., relational, XML) need to be extended to capture uncertainty

**The Future, Smart Home**

[Figure: a smart home equipped with alert systems, A/V sensors, temperature sensors, security cameras, power monitors, and motion sensors]

**Managing Dynamic Data**

• Readings are sent periodically to a server
  • GPS devices
  • Mobile phones
  • Sensors
• Infeasible to keep track of exact values
  • Limited bandwidth
  • Limited battery
  • Continuous changes
• Dynamic data is represented as intervals associated with PDFs
• Queries are answered by processing the probabilistic intervals

**Managing Sensor Data**

• Queries are submitted at a base station, parsed, optimized, and sent into the sensor network, where they are disseminated and processed, with results flowing back up the routing tree
• Sensor readings are probabilistically defined based on readings’ histories and correlations
• Possible queries
  • Finding query answers consistent with readings’ correlations?
  • Finding answers with high confidence?
  • Ranking sensor readings by their values?

**Traffic-Monitoring Applications**

• Identification mistakes
• Interference by high-voltage lines
• Imperfection in radar devices

[Figure label: Uncertain Data]

• Data dependencies
• What are the most speeding cars?
• What are the most appropriate locations for speed traps?
[Figure label: Probabilistic Correlations]

**Uncertain Data on the Web**

• Data entry errors
• Privacy concerns
• Integrating data from different sources
• Presentation style

[Figure label: Uncertain/Incomplete Data]

• What are the best apartments with areas above 70 m²?
• Find the best cars with odometer readings between 50,000 and 70,000 km

**Many New Challenges ...**

• Results Visualization: Visualizing Uncertainty, High-Level Summaries, Answer Space, Lineage
• Query Processing: Ranking & Aggregation, Computation Cost, Special Query Types, Approximation
• Data Models: Uncertainty Models, Data Types, Probabilistic Correlations, Uncertainty in Schemas

**Ranking and Aggregating Uncertain Data**

[Figure: diagram relating ranking, aggregation, and uncertainty to data analysis, decision making, data exploration, and data cleaning]

**New semantics for ranking and aggregation queries on uncertain data**

• Studying the interaction between scores and probabilities
• Integrating scoring and uncertainty dimensions into the same query definitions
• Efficient support of probabilistic ranking and aggregation queries
• Incremental and optimized query processing
• Unified framework for handling both ranking and uncertainty
• Finding the most probable answers

**Possible Worlds Semantics**

[Figure: possible worlds and their probabilities, shown for independent tuples and for correlated tuples]

**Probabilistic Query Processing**

• Intensional Semantics [Dalvi et al., VLDB’04]
• Independent base tuples
• Relational processing induces correlations among intermediate results

Example: R contains r1 = <x,z> (probability 0.3) and r2 = <y,z> (probability 0.5); S contains s1 = <z,w> (probability 0.4).

• σ(A=‘y’)(R) = {<y,z: 0.5>}, with probability Pr(r2)
• R ⋈(B=C) S = {<x,z,z,w: 0.12>, <y,z,z,w: 0.2>}, with probabilities Pr(r1 ∧ s1) and Pr(r2 ∧ s1)
• πD(R ⋈(B=C) S) = {<w: 0.26>}, with probability Pr((r1 ∧ s1) ∨ (r2 ∧ s1))

**Score-Uncertainty Interaction**

Query: SELECT tID FROM R ORDER BY Score LIMIT 2

• The most probable rank-1 tuple, with probability 0.7
• The most probable rank-2 tuple, with probability 0.324
• The most probable top-2 vector, with probability 0.28

[Figure: relation R and its possible worlds, with world probabilities 0.072, 0.048, 0.108, 0.168, 0.112, 0.252, 0.168, 0.072]

**Data Model**

• We treat tuples as probabilistic events associated with membership probabilities
  • Later, we discuss modeling uncertain attribute values
• Each tuple has a score computed using a query-specified scoring function
• We assume a general uncertainty model that allows for computing the joint probability of an arbitrary combination of tuple events
  • Computing this probability is the only interface between the uncertainty model and our processing framework
• Example models
  • Independent Tuples [Dalvi et al., VLDB04]
  • Correlated Tuples with Simple Rules [Sarma et al., ICDE06]
  • General Inference Model [Sen et al., ICDE07]

**Top-k Query Definitions**

Integrating Score and Probability

• U-kRanks: top-ith tuples; aggregates probabilities on the level of individual ranks
• U-Topk: top-k vectors; aggregates probabilities on the level of tuple vectors

**Example**

[Figure: possible worlds over tuples t1 to t6, with world probabilities 0.112, 0.168, 0.072, 0.048, 0.108, 0.168, 0.252, 0.072; the U-Top2, U-Rank1, and U-Rank2 answers are marked, with annotated values 0.28, 0.42, and 0.324]

**Ranking-Aggregate Query Definitions**

Integrating Score and Probability with Grouping Criteria

• U-kRanks-Agg: top-ith groups; aggregates probabilities on the level of individual ranks
• U-Topk-Agg: top-k group vectors; aggregates probabilities on the level of group vectors

**Example**

[Figure: grouped, aggregated, and ranked worlds over tuples t1 to t6, with world probabilities 0.112, 0.072, 0.168, 0.048, 0.168, 0.252, 0.072, 0.108]

**Processing Framework**

[Figure: processing framework. The Probabilistic Ranking Layer (state formulation and space navigation, serving probabilistic top-k and probabilistic top-k aggregate queries) returns the query answer as a ranked tuple vector. It incrementally retrieves score-ordered tuples, optionally per group (g1, g2, …, gn), together with the probabilities of tuple events, from the Tuple Access Layer, which comprises a relational query engine, a rule engine, and access methods]
[Figure, continued: the access methods include RandomAccess, ScoreAccess, and Prob.Access; the Tuple Access Layer draws on dependency information and a physical data and rules store]

**Assumptions**

• Tuple Access
  • Tuples are consumed incrementally, i.e., one by one, from the output of a query executed by the relational engine in the Tuple Access Layer
• Group Access
  • We assume an interface that allows accessing tuples from specific groups incrementally
• Available Dependency Information
  • Dependencies among query output tuples are only known when these tuples are consumed by the Probabilistic Ranking Layer
• Group Cardinality Information
  • The Tuple Access Layer provides (bounds on) the size of each group
• Event Probability Computation
  • We assume the Rule Engine responds with exact probabilities to the submitted questions (combinations of tuple events)

**Search Space Model**

• Top-l State
  • Top-l vector in one or more possible worlds
• State Probability
  • The joint probability of involved tuple events
• Search Goal
  • U-Topk: a state of length k with the maximum probability
• Probability Reduction Property
• State Comparability (Space Pruning)
  • Complete states prune partial ones
  • With independence, partial states prune each other

**Search Space Model**

U-Top2?

[Figure: search tree rooted at the empty state Ф, built by scanning tuples in score order. First level: <t2> with probability 0.7 and <¬t2> with probability 0.3. Second level: <t2, ¬t3> with probability 0.42 and <t2, t3> with probability 0.28; deeper levels involve t4 and t5]

• States are score-ordered prefixes of one or more worlds
• Pr(<t2> is the top-1) is an upper bound over the probability of any other answer
• A complete state contains a possible top-2 answer
• A partial state might generate another top-2 answer
• Space size is $\binom{n}{k}$ (in $O(n^k)$), where n is the number of consumed tuples

**Probabilistic Ranking Queries**

• Tuple Access Layer executes a score-based top-k query plan
• Probabilistic Ranking Layer consumes tuples (when needed) in score order
  • Necessary to compute non-trivial bounds on state probabilities
  • Other retrieval orders necessitate fully materializing the space
• Our main cost metric is the number of consumed tuples n
  • Space size is in $O(n^k)$
  • We adopt lazy space materialization to avoid space explosion
• Query processing algorithms aim at
  • Minimizing the number of consumed tuples, and
  • Minimizing the number of materialized space states
• Optimality guarantees on the size of the materialized space and the number of consumed tuples
  • Based on optimality of A* Search

**U-Topk Queries with Dependencies**

U-Top2?

[Figure: search tree rooted at Ф with states <¬t2> (0.3), <t2> (0.7), and the complete states <t2, ¬t3> and <t2, t3>, annotated 0.42 and 0.35]

• Complete states prune other partial states
• Partial states cannot prune other partial states

**U-Topk Queries with Independence**

U-Top3?

[Figure: search tree rooted at Ф with states <¬t2> (0.3) and <t2> (0.7)]

• Pr(<t2,*,*>) > Pr(<¬t2,*,*,*>)
• Partial states prune other partial states of the same or smaller length

**U-kRanks Queries with Independence**

U-3Ranks?
Score-ranked tuple stream, with membership probabilities: t1: 0.3, t2: 0.9, t3: 0.6

• Rank 1: t1: 0.3; t2: 0.9 × 0.7 = 0.63; t3: 0.6 × 0.1 × 0.7 = 0.042
• Rank 2: t1: 0; t2: 0.9 × 0.3 = 0.27; t3: 0.6 × (0.63 + (1 − 0.9) × 0.3) = 0.396
• Rank 3: t1: 0; t2: 0; t3: 0.6 × 0.27 = 0.162

**Probabilistic Ranking-Aggregates Query**

• Procedural definition
  • Expand possible worlds
  • Group each world based on the grouping attributes
  • Score each group in each world based on the group-aggregate function
  • Rank worlds based on groups’ scores
  • Aggregate the probabilities of top-ranked groups across worlds
  • Report the most probable answers
• Very expensive!

**Probabilistic Ranking-Aggregates Query**

2-Level Space Model

• Group World
  • A possible world restricted to a specific group
  • Projects the space of possible worlds on grouping attributes
  • Aggregating tuple scores in a group world gives a possible group aggregate value
• Global World
  • A ranking of one or more group worlds, coming from different groups, based on their aggregate scores
  • Represents a candidate query answer
• Goal: materialize the necessary group worlds and combine them into global worlds, while looking for the global worlds with the highest probabilities

**Probabilistic Ranking-Aggregates Query**

[Figure: tuples with a grouping attribute, a scoring attribute, and a membership probability, split into groups g1, g2, g3; the in-group space is shown within each group and the across-groups space spans them]

**Probabilistic Ranking-Aggregates Query**

• Intra-group search
  • Applying the aggregate function to group worlds defines a probability distribution over the possible aggregates of each group
  • We are interested in exploring the aggregate space in score order to satisfy the query’s ranking requirements
• Basic idea:
  • Incrementally scan group tuples in score order, and construct partitions of the group aggregate space
  • Each partition has a score range and a probability
  • Partitions can be ordered if they have non-overlapping score ranges

**Probabilistic Ranking-Aggregates Query**

• Assume a group g = {t1, t2, …}, where α is the maximum score of unseen group tuples

After retrieving t1 (score 5, probability 0.4):

• sg1 = {t1}: Rsum = [5, 5 + (|g| − 1) × α], Pr(sg1) = 0.4
• sg2 = {¬t1}: Rsum = [0, (|g| − 1) × α], Pr(sg2) = 0.6

After retrieving t2 (score 4, probability 0.8):

• sg11 = {t1, ¬t2}: Rsum = [5, 5 + (|g| − 2) × α], Pr(sg11) = 0.08
• sg12 = {t1, t2}: Rsum = [9, 9 + (|g| − 2) × α], Pr(sg12) = 0.32
• sg21 = {¬t1, ¬t2}: Rsum = [0, (|g| − 2) × α], Pr(sg21) = 0.12
• sg22 = {¬t1, t2}: Rsum = [4, 4 + (|g| − 2) × α], Pr(sg22) = 0.48

**Probabilistic Ranking-Aggregates Query**

Space Navigation

• Inter-groups search
  • Aggregate partitions coming from different groups are combined into an across-groups ranking
  • Candidate answers are constructed incrementally based on probability

[Figure: intra-group aggregation over groups g1, g2, …, gn produces score-ranked group partitions; inter-group aggregation combines them into inter-groups states and global states]

**Probabilistic Ranking-Aggregates Query**

[Figure: state probabilities (0.3, 0.13, 0.02, 0.5) guide the inter-groups search. Intra-group search yields score-ranked aggregate partitions from different groups (g11, g12, g21, g31, …), which are combined over g1, g2, g3 into incrementally explored candidate answers]

**Probabilistic Nearest Neighbor Search**

• A k-Nearest Neighbor query reports the closest k objects to a query point q
• Objects are points in an n-dimensional space
• Many applications:
  • Pattern recognition
  • Statistical classification
  • Location tracking
• In real applications, objects have imprecise descriptions
• We propose a novel approach to efficiently answer NN queries over uncertain data

**Examples**

• Locations of cell phones are probabilistically defined based on signal strength
• Moving objects have:
  • Uncertain locations
  • Uncertain membership
• Example query: locate the nearest witnesses/police cars to an accident site
• User’s location is anonymized to preserve privacy
• Example query: locate the closest gas stations to a given user

**Nearest-Neighbor Probability**

• An uncertain object has:
  • A membership probability
  • An uncertain attribute represented as a PDF defined on an uncertainty region
• NN probability
  • The probability of an object Oi to be the NN, Pnn(Oi, Q), is computed using nested integration over all objects

[Figure: uncertain objects O1 to O4 and their uncertainty regions around the query object Q, with a point x in O1 and a point q in Q]
Example from the figure (O3 annotated 0.9, O4 annotated 0.1): Pnn(O1 | O1 = x, Q = q) = 0.7 × 1.0 × 0.9

**Solution Overview**

• Topk-PNN query: report the top-k probable NNs based on Pnn(Oi, Q)
• Two cost factors:
  • I/O cost due to object retrieval
  • CPU cost due to integral evaluation
• Main idea:
  • Incremental retrieval model
  • Progressive (lazy) tightening of bounds computed on Pnn until a stopping criterion is met

**Bounding Pnn Values**

[Figure: uncertainty regions of O1 to O4 around q, partitioned into subregions with their probabilities]

• Pnn(O1 | O1 = x) = (0.7)(1.0)(0.9)
• (0.2)(0.7)(0.1) ≤ Pnn(O1) ≤ (0.9)(1.0)(1.0)

**Bound Tightening**

• Bounds are refined by:
  • Selecting one subregion
  • Splitting the subregion into two smaller regions and re-computing the bounds

**Ranking with Uncertain Scores**

• A tuple’s score is uncertain
  • Data entry errors
  • Privacy concerns
  • Integrating heterogeneous data sources
  • Presentation style

**Ranking with Uncertain Scores**

• Uncertain scores induce a partial order
• Multiple valid rankings conform with the partial order
• Enumerating possible rankings is infeasible
• Multiple challenges
  • Ranking model
  • Query semantics
  • Query processing

**Data Model**

• A tuple’s score is a PDF defined on a real interval
• Tuples’ scoring PDFs are independent
• Probabilistic Partial Order Model
  • Non-intersecting intervals are totally ordered
  • Intersecting intervals have a Probabilistic Dominance Relationship
• A probabilistic process generates a probability distribution on the space of linear extensions

**Query Definitions**

[Figure: example answers for UTop-Prefix(3), UTop-Set(3), and UTop-Rank(1,2)]

• Record-Rank Queries
  • UTop-Rank(i, j)
• Top-k Queries
  • UTop-Prefix(k)
  • UTop-Set(k)
• Rank Aggregation Queries
  • Optimal rank aggregation under Footrule/Kendall tau distance metrics
  • Mapping to UTop-Rank

**Applications**

• A UTop-Rank(i, j) query can be used to find the most probable athlete to end up in a range of ranks in some competition, given a partial order of competitors
• A UTop-Rank(1, k) query can be used to find the most likely location to be in the top-k hottest locations, based on uncertain sensor readings represented as intervals
• A UTop-Prefix query can be used in market analysis to find the most likely product ranking based on fuzzy evaluations in users’ reviews
• A UTop-Set query can be used to find a set of products that are most likely to be ranked higher than all other products

**Record-Rank Queries**

• Monte-Carlo Integration
  • Transform the complex space of linear extensions to the space of all possible score combinations
  • This space can be easily sampled at random
  • Sample from each [lo_i, up_i] a score value biased by f_i(x)
  • Allows estimating the summation of the probabilities of linear extensions having tuple t in the rank range i … j

[Figure: the space of all possible score combinations and, within it, the space of score combinations with t at rank i … j; the estimate of Pr(t at rank i … j) combines Avg(∏ f_i(x)) with the volume Vol(·) of the sampled space]

**Record-Rank Queries**

• Monte-Carlo integration allows us to avoid computing complex multidimensional integrals with dependent limits
• Monte-Carlo integration gives the expected value of the multidimensional integral
• The accuracy depends only on the number of drawn samples
  • The huge number of possible linear extensions does not affect accuracy
  • Error is in $O(1/\sqrt{s})$, where s is the number of samples
• We need to evaluate a linear number of integrals (one Monte-Carlo integral per tuple)
  • Given a sample size, the total cost is linear in database size
• Using a heap of size l, we can easily compute the l most probable answers on the fly

**Top-k Queries**

• The answer space is exponential
• We need to compute a nested integral for each possible permutation of k tuples, and report the most probable permutations
• Exact computation cannot scale with respect to k
• Computing approximate answers using Markov Chain Monte-Carlo (MCMC) methods by sampling the space biased by prefix probability:
  • Start from an initial linear extension with prefix p0
  • Apply a transformation to move to another linear extension with prefix p1
  • Accept the new linear extension with a probability biased by the probability of p1 with respect to p0
  • Repeat from the transformation step
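
The MCMC steps above can be sketched in Python as a Metropolis-style walk over linear extensions. This is only an illustrative sketch under stated assumptions, not the authors' algorithm: the tuple names and score intervals are made up, score PDFs are assumed uniform, the transformation is assumed to be a swap of two adjacent tuples, and the prefix probability is approximated by a hypothetical `prefix_prob` helper that counts matches in a fixed set of pre-drawn score samples (a stand-in for the nested integral).

```python
import random

# Hypothetical data (not from the slides): tuple id -> (lo, up) score interval,
# each assumed to carry a uniform score PDF.
INTERVALS = {"a": (6.0, 9.0), "b": (5.0, 8.0), "c": (4.0, 7.0), "d": (1.0, 3.0)}


def draw_score_samples(intervals, s, rng):
    """Draw s independent score combinations, one score per tuple."""
    return [{t: rng.uniform(lo, up) for t, (lo, up) in intervals.items()}
            for _ in range(s)]


def prefix_prob(prefix, samples):
    """Monte-Carlo estimate of Pr(prefix is exactly the top-k ranking)."""
    k = len(prefix)
    hits = sum(1 for w in samples
               if tuple(sorted(w, key=w.get, reverse=True)[:k]) == prefix)
    return hits / len(samples)


def is_linear_extension(order, intervals):
    """Reject orders that rank a tuple above one that always outscores it."""
    for i, t in enumerate(order):
        for u in order[i + 1:]:
            if intervals[u][0] > intervals[t][1]:
                return False
    return True


def mcmc_top_prefix(intervals, k, steps, samples, rng):
    """Walk over linear extensions; return the most probable prefix visited."""
    # Initial linear extension: ordering by interval midpoint is always valid.
    order = sorted(intervals, key=lambda t: sum(intervals[t]), reverse=True)
    p0 = prefix_prob(tuple(order[:k]), samples)
    best, best_p = tuple(order[:k]), p0
    for _ in range(steps):
        # Transformation (assumed): swap a randomly chosen adjacent pair.
        i = rng.randrange(len(order) - 1)
        cand = order[:]
        cand[i], cand[i + 1] = cand[i + 1], cand[i]
        if not is_linear_extension(cand, intervals):
            continue  # invalid move: stay at the current extension
        p1 = prefix_prob(tuple(cand[:k]), samples)
        # Accept with probability min(1, p1 / p0).
        if p0 == 0 or rng.random() < min(1.0, p1 / p0):
            order, p0 = cand, p1
            if p1 > best_p:
                best, best_p = tuple(cand[:k]), p1
    return best, best_p


rng = random.Random(7)
samples = draw_score_samples(INTERVALS, 2000, rng)
best_prefix, best_prob = mcmc_top_prefix(INTERVALS, 2, 300, samples, rng)
print(best_prefix, round(best_prob, 3))
```

The sketch reports the best prefix among the visited states rather than the most frequently visited one, since several linear extensions can share the same prefix. Because the adjacent-swap proposal is symmetric, the acceptance test only needs the ratio of the two prefix probabilities.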