Data Uncertainty

Supporting Ranking and Aggregation Queries on Uncertain and Incomplete Data
http://www.cs.uwaterloo.ca/~ilyas/URank

Data Uncertainty • Conventional databases are deterministic • An item either is in the database or not in the database • Database is a “complete world” • A tuple either is in the query answer or is not • In many current applications, data are uncertain • Sensor networks • Information extraction • Moving objects tracking • Data cleaning • “A tuple belongs to the database” is a probabilistic event • “A tuple is part of query answer” is a probabilistic event • Data models (e.g., relational, XML) need to be extended to capture uncertainty

The Future: Smart Home • Alert Systems • A/V sensors • Temp. sensors • Security Cameras • Power Monitors • Motion sensors

Managing Dynamic Data • Readings are sent periodically to a server • GPS devices • Mobile phones • Sensors • Infeasible to keep track of exact values • Limited bandwidth • Limited battery • Continuous changes • Dynamic data is represented as intervals associated with PDFs • Queries are answered by processing the probabilistic intervals

Managing Sensor Data • Queries are submitted at a base station, parsed, optimized and sent into the sensor network, where they are disseminated and processed, with results flowing back up the routing tree • Sensor readings are probabilistically defined based on readings’ histories and correlations • Possible queries • Finding query answers consistent with the correlations among readings? • Finding answers with high confidence? • Ranking sensor readings by their values?

Traffic-Monitoring Applications • Uncertain Data • Identification mistakes • Interference by high-voltage lines • Imperfection in radar devices • Probabilistic Correlations • Data dependencies • Example queries • Which cars are speeding the most? • What are the most appropriate locations for speed traps?

Uncertain Data on the Web • Uncertain/Incomplete Data • Data entry errors • Privacy concerns • Integrating data from different sources • Presentation style • Example queries • What are the best apartments with areas above 70 m²? • Find the best cars with odometer between 50,000 and 70,000 km

Many New Challenges ... • Results Visualization • Visualizing Uncertainty • High-Level Summaries • Answer Space • Lineage • Query Processing • Ranking & Aggregation • Computation Cost • Special Query Types • Approximation • Data Models • Uncertainty Models • Data Types • Probabilistic Correlations • Uncertainty in Schemas

Ranking and Aggregating Uncertain Data [Figure labels: Data Analysis, Aggregation, Uncertainty, Ranking, Decision Making, Data Exploration, Data Cleaning]

New semantics for ranking and aggregation queries on uncertain data • Studying the interaction between scores and probabilities • Integrating scoring and uncertainty dimensions into the same query definitions • Efficient support of probabilistic ranking and aggregation queries • Incremental and optimized query processing • Unified framework for handling both ranking and uncertainty • Finding the most probable answers

Possible Worlds Semantics [Figure: possible worlds and their probabilities for independent tuples versus correlated tuples; values shown: 0.3, 0.12, 0.42, 0.6, 0.18, 0.1, 0.28]

Probabilistic Query Processing • Intensional Semantics [Dalvi et al., VLDB’04] • Independent base tuples • Relational processing induces correlations among intermediate results • Example: R(A,B) = {r1 = <x,z : 0.3>, r2 = <y,z : 0.5>}, S(C,D) = {s1 = <z,w : 0.4>} • σ(A=‘y’)(R) = {<y,z : 0.5>}, with probability Pr(r2) • R ⋈B=C S = {<x,z,z,w : 0.12>, <y,z,z,w : 0.2>}, with probabilities Pr(r1 ∧ s1) and Pr(r2 ∧ s1) • πD(R ⋈B=C S) = {<w : 0.26>}, with probability Pr((r1 ∧ s1) ∨ (r2 ∧ s1))
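The probabilities in this example can be checked by enumerating possible worlds over the three independent base tuples. The sketch below is only an illustration: the helper name event_probability and the dictionary layout are assumptions made here, not part of the cited system.

```python
from itertools import product

# Membership probabilities of the independent base tuples in the example.
BASE = {"r1": 0.3, "r2": 0.5, "s1": 0.4}

def event_probability(event):
    """Sum the probabilities of all possible worlds in which `event` holds.

    `event` is a hypothetical callback that receives a dict mapping each
    base tuple name to True (present) or False (absent).
    """
    names = list(BASE)
    total = 0.0
    for bits in product([False, True], repeat=len(names)):
        world = dict(zip(names, bits))
        weight = 1.0
        for name in names:
            weight *= BASE[name] if world[name] else 1.0 - BASE[name]
        if event(world):
            total += weight
    return total

# Join results <x,z,z,w> and <y,z,z,w>, then the projection <w>.
join_r1 = event_probability(lambda w: w["r1"] and w["s1"])
join_r2 = event_probability(lambda w: w["r2"] and w["s1"])
proj_w = event_probability(lambda w: (w["r1"] or w["r2"]) and w["s1"])
```

Treating the two join results as if they were independent would instead give 1 − (1 − 0.12)(1 − 0.2) = 0.296 for the projection. Both join results share s1, which is the induced correlation the slide refers to.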

Score-Uncertainty Interaction • Query on relation R: SELECT tID FROM R ORDER BY Score LIMIT 2 • The most probable rank-1 tuple, with prob. 0.7 • The most probable rank-2 tuple, with prob. 0.324 • The most probable top-2 vector, with prob. 0.28 • [Figure: possible worlds of R with probabilities 0.072, 0.048, 0.108, 0.168, 0.112, 0.252, 0.168, 0.072]

Probabilistic Ranking and Aggregation

Data Model • We treat tuples as probabilistic events associated with membership probabilities • Later, we discuss modeling uncertain attribute values • Each tuple has a score computed using a query-specified scoring function • We assume a general uncertainty model that allows for computing the joint probability of an arbitrary combination of tuple events • Computing this probability is the only interface between the uncertainty model and our processing framework • Example Models • Independent Tuples [Dalvi et al., VLDB’04] • Correlated Tuples with Simple Rules [Sarma et al., ICDE’06] • General Inference Model [Sen et al., ICDE’07]

Top-k Query Definitions • Integrating Score and Probability • U-Topk: Top-k (vectors); aggregate probabilities on the level of tuple vectors • U-kRanks: Top-ith tuples; aggregate probabilities on the level of individual ranks
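To make the two definitions concrete, the following sketch evaluates both by brute force over all possible worlds. It assumes independent tuples and a stream already sorted by descending score; the function names and the three-tuple input in the usage note are illustrative assumptions, and the enumeration is exponential, so it is only suitable for tiny inputs.

```python
from collections import defaultdict
from itertools import product

def possible_worlds(stream):
    """Yield (world, probability) for independent (tid, prob) pairs.

    `stream` is assumed to be sorted by descending score, so every world
    is already a score-ordered list of tuple ids.
    """
    for bits in product([False, True], repeat=len(stream)):
        prob = 1.0
        world = []
        for present, (tid, p) in zip(bits, stream):
            prob *= p if present else 1.0 - p
            if present:
                world.append(tid)
        yield world, prob

def u_topk(stream, k):
    """Most probable top-k vector: sum world probabilities per vector."""
    vectors = defaultdict(float)
    for world, prob in possible_worlds(stream):
        if len(world) >= k:
            vectors[tuple(world[:k])] += prob
    return max(vectors.items(), key=lambda kv: kv[1])

def u_kranks(stream, k):
    """Most probable tuple at each rank 1..k: sum probabilities per rank."""
    ranks = [defaultdict(float) for _ in range(k)]
    for world, prob in possible_worlds(stream):
        for i, tid in enumerate(world[:k]):
            ranks[i][tid] += prob
    return [max(r.items(), key=lambda kv: kv[1]) for r in ranks]
```

For example, with the hypothetical stream [("t1", 0.3), ("t2", 0.9), ("t3", 0.6)], u_topk(stream, 2) aggregates over whole vectors while u_kranks(stream, 2) aggregates separately for rank 1 and rank 2.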

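Example [Figure: possible worlds over tuples t1 to t6 with the U-Top2, U-Rank1, and U-Rank2 answers highlighted; values shown: 0.28, 0.42, 0.324; world probabilities: 0.112, 0.168, 0.072, 0.048, 0.108, 0.168, 0.252, 0.072]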

Ranking-Aggregate Query Definitions • Integrating Score and Probability with Grouping Criteria • U-Topk-Agg: Top-k (group vectors); aggregate probabilities on the level of group vectors • U-kRanks-Agg: Top-ith groups; aggregate probabilities on the level of individual ranks

Example [Figure: grouped, aggregated, and ranked worlds over tuples t1 to t6; world probabilities shown: 0.112, 0.072, 0.168, 0.048, 0.168, 0.252, 0.072, 0.108]

Processing Framework • Probabilistic Ranking Layer • Components: State Formulation, Space Navigation • Supports Probabilistic Top-k and Probabilistic Top-k Agg. • Returns the query answer as a ranked tuple vector • Tuple Access Layer • Incremental Tuple Retrieval and Per-Group Access (groups g1, g2, …, gn) • Relational Query Engine and Rule Engine • Access Methods: RandomAccess, ScoreAccess, Prob.Access, ... • Physical Data and Rules Store • Data exchanged between layers: score-ordered tuples, score-ordered tuples from requested groups, tuple events, probabilities, dependency information, formulated states

Assumptions • Tuple Access • Tuples are consumed incrementally, i.e., one by one, from the output of a query executed by the relational engine in the Tuple Access Layer • Group Access • We assume an interface that allows accessing tuples from specific groups incrementally • Available Dependency Information • Dependencies among query output tuples are only known when these tuples are consumed by the Probabilistic Ranking Layer • Group Cardinality Information • The Tuple Access Layer provides (bounds on) the size of each group • Event Probability Computation • We assume the Rule Engine responds with exact probabilities to the submitted questions (combinations of tuple events)

Search Space Model • Top-l State • Top-l vector in one or more possible worlds • State Probability • The joint probability of involved tuple events • Search Goal • U-Topk: A state of length k with the maximum probability • Probability Reduction Property • Extending a state with more tuple events never increases its probability • State Comparability (Space Pruning) • Complete states prune partial ones • With independence, partial states prune each other

Search Space Model • Example query: U-Top2 • Tuples are scanned in score order • States are score-ordered prefixes of one or more worlds • State <t2> has probability 0.7 = Pr(<t2> is the top-1); state <¬t2> has probability 0.3 • State <t2, t3> (probability 0.28) is a complete state containing a possible top-2 answer • State <t2, ¬t3> (probability 0.42) is a partial state that might generate another top-2 answer • The figure also marks an upper bound over the probability of any other answer • Space size is C(n,k), which is in O(n^k), where n is the number of consumed tuples

Probabilistic Ranking Queries • Tuple Access layer executes a score-based top-k query plan • Probabilistic Ranking Layer consumes tuples (when needed) in score order • Necessary to compute non-trivial bounds on state probabilities • Other retrieval orders necessitate fully materializing the space • Our main cost metric is the number of consumed tuples n • Space size is in O(n^k) • We adopt lazy space materialization to avoid space explosion • Query processing algorithms aim at • Minimizing the number of consumed tuples, and • Minimizing the number of materialized space states • Optimality guarantees on the size of materialized space, and the number of consumed tuples • Based on optimality of A* Search
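The lazy, best-first exploration described above can be sketched for the simplest case of independent tuples. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name, the state layout, and the use of a binary heap are choices made here. A state records which of the first i score-ordered tuples are present; because extending a state never increases its probability, the first complete state popped from the heap is a most probable top-k vector.

```python
import heapq

def u_topk_best_first(stream, k):
    """Best-first search for the most probable top-k vector.

    `stream` is a list of (tid, prob) pairs sorted by descending score,
    with independent membership probabilities (an assumption of this sketch).
    Returns (vector, probability, number of consumed tuples), or None if
    no world contains k tuples.
    """
    # Heap entries: (-state probability, index of next tuple, chosen ids).
    heap = [(-1.0, 0, ())]
    consumed = 0
    while heap:
        neg_p, i, vec = heapq.heappop(heap)
        if len(vec) == k:
            # Complete state with the highest probability: no other state
            # can be extended into something more probable.
            return vec, -neg_p, consumed
        if i == len(stream):
            continue  # partial state that can no longer reach length k
        consumed = max(consumed, i + 1)  # tuple i is fetched lazily
        tid, p = stream[i]
        heapq.heappush(heap, (neg_p * p, i + 1, vec + (tid,)))   # tuple present
        heapq.heappush(heap, (neg_p * (1.0 - p), i + 1, vec))    # tuple absent
    return None
```

States are only materialized when their parent is popped, so low-probability branches may never be expanded. With dependencies, the product of probabilities would have to be replaced by a call to whatever joint-probability interface the uncertainty model provides.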

U-Topk Queries with Dependencies • Example query: U-Top2 • Complete states prune other partial states • Partial states cannot prune other partial states • [Figure: search tree rooted at Ф with states <t2> (0.7), <¬t2> (0.3), <t2, ¬t3> (0.42), and the complete state <t2, t3> (0.35)]

U-Topk Queries with Independence • Example query: U-Top3 • Partial states prune other partial states of the same or smaller length • Example: Pr(<t2,*,*>) > Pr(<¬t2,*,*,*>) • [Figure: search tree rooted at Ф with states <t2> (0.7) and <¬t2> (0.3)]

U-kRanks Queries with Independence • Example query: U-3Ranks • Score-ranked tuple stream: t1:0.3, t2:0.9, t3:0.6 • Rank 1: Pr(t1) = 0.3; Pr(t2) = 0.9 × 0.7 = 0.63; Pr(t3) = 0.6 × 0.1 × 0.7 = 0.042 • Rank 2: Pr(t1) = 0; Pr(t2) = 0.9 × 0.3 = 0.27; Pr(t3) = 0.6 × ((0.63) + (1 − 0.9) × 0.3) = 0.396 • Rank 3: Pr(t1) = 0; Pr(t2) = 0; Pr(t3) = 0.6 × 0.27 = 0.162
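The rank probabilities in this example follow one pattern: under independence, a tuple appears at rank j exactly when it exists and exactly j − 1 of the higher-scored tuples exist. A small dynamic program over the score-ordered stream computes this in one pass. The function name and table layout below are assumptions made for illustration.

```python
def rank_probabilities(stream, k):
    """Return {tid: [Pr(tid at rank 1), ..., Pr(tid at rank k)]}.

    `stream` is a list of (tid, prob) pairs in descending score order with
    independent membership probabilities (an assumption of this sketch).
    """
    # count[c] = Pr(exactly c of the tuples consumed so far exist), c < k.
    count = [1.0] + [0.0] * (k - 1)
    table = {}
    for tid, p in stream:
        # tid is at rank j+1 if it exists and j earlier tuples exist.
        table[tid] = [p * count[j] for j in range(k)]
        # Fold the current tuple into the running count distribution.
        updated = []
        for c in range(k):
            stay = count[c] * (1.0 - p)
            grow = count[c - 1] * p if c > 0 else 0.0
            updated.append(stay + grow)
        count = updated
    return table

table = rank_probabilities([("t1", 0.3), ("t2", 0.9), ("t3", 0.6)], 3)
```

The U-kRanks answer for rank j is then the tuple with the largest entry in column j of the table.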

Probabilistic Ranking-Aggregates Query • Procedural definition • Expand possible worlds • Group each world based on the grouping attributes • Score each group in each world based on group-aggregate function • Rank worlds based on groups’ scores • Aggregate the probabilities of top-ranked groups across worlds • Report the most probable answers • Very expensive !
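The procedural definition above can be written down directly, which also makes its cost visible: the loop runs over all 2^n possible worlds. The sketch assumes independent tuples, SUM as the group-aggregate function, and ties broken by group name; all three are assumptions made for illustration, as are the function name and the sample data in the usage note.

```python
from collections import defaultdict
from itertools import product

def topk_group_vectors(tuples, k):
    """Probability of every top-k group vector, by brute force.

    `tuples` is a list of (tid, group, score, prob) with independent
    membership probabilities. Groups with no present tuple in a world
    are not ranked in that world.
    """
    answers = defaultdict(float)
    for bits in product([False, True], repeat=len(tuples)):
        prob = 1.0
        sums = defaultdict(float)
        for present, (_tid, group, score, p) in zip(bits, tuples):
            prob *= p if present else 1.0 - p
            if present:
                sums[group] += score          # group and aggregate (SUM)
        ranked = sorted(sums, key=lambda g: (-sums[g], g))  # rank groups
        answers[tuple(ranked[:k])] += prob    # aggregate across worlds
    return dict(answers)
```

For example, with the hypothetical input [("t1", "g1", 5, 0.4), ("t2", "g1", 4, 0.8), ("t3", "g2", 7, 0.5)] and k = 1, group g1 is the top group whenever g2 is absent and g1 is non-empty, or when both t1 and t2 are present.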

Probabilistic Ranking-Aggregates Query 2-Level Space Model • Group World • A possible world restricted to a specific group • Projects the space of possible worlds on grouping attributes • Aggregating tuple scores in a group world gives a possible group aggregate value • Global World • A ranking of one or more group worlds, coming from different groups, based on their aggregate scores • Represents a candidate query answer • Goal: Materialize the necessary group worlds, and combine them into global worlds, while looking for the global worlds with the highest probabilities

Probabilistic Ranking-Aggregates Query [Figure: tuples with a grouping attribute, a scoring attribute, and a membership probability are split into groups g1, g2, g3; an in-group space is built per group and an across-groups space on top]

Probabilistic Ranking-Aggregates Query • Intra-group search • Applying the aggregate function to group worlds defines a probability distribution over the possible aggregates of each group • We are interested in exploring the aggregate space in score order to satisfy query’s ranking requirements • Basic Idea: • Incrementally scan group tuples in score order, and construct partitions of group aggregate space • Each partition has a score range and a probability • Partitions can be ordered if they have non-overlapping score ranges

Probabilistic Ranking-Aggregates Query • Assume a group g = {t1, t2, …}; α is the max score of unseen group tuples • After retrieving t1(5, 0.4) • sg1 = {t1}: Rsum = [5, 5 + (|g|−1) × α], Pr(sg1) = 0.4 • sg2 = {¬t1}: Rsum = [0, (|g|−1) × α], Pr(sg2) = 0.6 • After retrieving t2(4, 0.8) • sg11 = {t1, ¬t2}: Rsum = [5, 5 + (|g|−2) × α], Pr(sg11) = 0.08 • sg12 = {t1, t2}: Rsum = [9, 9 + (|g|−2) × α], Pr(sg12) = 0.32 • sg21 = {¬t1, ¬t2}: Rsum = [0, (|g|−2) × α], Pr(sg21) = 0.12 • sg22 = {¬t1, t2}: Rsum = [4, 4 + (|g|−2) × α], Pr(sg22) = 0.48
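The partition refinement in this example can be sketched as a small function that splits every existing partition on the presence or absence of the newly retrieved tuple. The sketch assumes independent tuples, a SUM aggregate, and non-negative scores (so the lower end of each range is the sum of the tuples known to be present); the names refine and sum_range are hypothetical.

```python
def refine(partitions, tid, score, prob):
    """Split each partition on whether the newly retrieved tuple exists.

    A partition is (members, low, probability), where members is a tuple of
    (tid, present) pairs and low is the sum of scores of present tuples.
    """
    refined = []
    for members, low, p in partitions:
        refined.append((members + ((tid, True),), low + score, p * prob))
        refined.append((members + ((tid, False),), low, p * (1.0 - prob)))
    return refined

def sum_range(low, seen, group_size, alpha):
    """Score range of a partition: every unseen tuple adds between 0 and alpha."""
    return (low, low + (group_size - seen) * alpha)

# Replay the example: retrieve t1(5, 0.4), then t2(4, 0.8).
partitions = [((), 0.0, 1.0)]
partitions = refine(partitions, "t1", 5.0, 0.4)
partitions = refine(partitions, "t2", 4.0, 0.8)
```

Two partitions can be ordered against each other once their ranges from sum_range no longer overlap, which happens as more tuples are seen and α shrinks.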

Probabilistic Ranking-Aggregates Query Space Navigation • Inter-groups search • Aggregate partitions coming from different groups are combined into an across-groups ranking • Candidate answers are constructed incrementally based on probability • [Figure: intra-group aggregation over groups g1, g2, …, gn produces score-ranked group partitions; inter-group aggregation combines them into inter-groups states and global states]

Probabilistic Ranking-Aggregates Query • State probabilities guide the search • Intra-group search: score-ranked aggregate partitions from different groups (e.g., g11, g12, g21, g31) • Inter-groups search: incrementally explored candidate answers over groups g1, g2, g3 • [Figure: search tree with state probabilities 0.3, 0.13, 0.02, 0.5]

Probabilistic Nearest Neighbor Search

Probabilistic Nearest Neighbor Search • k-Nearest Neighbor Query reports the closest k objects to a query point q • Objects are points in an n-dimensional space • Many applications: • Pattern recognition • Statistical classification • Location Tracking • In real applications, objects have imprecise descriptions • We propose a novel approach to efficiently answer NN queries over uncertain data

Examples • Locations of cell phones are probabilistically defined based on signal strength • Moving objects have: • Uncertain locations • Uncertain membership • Example query: Locate the nearest witnesses/police cars to an accident site • User’s location is anonymized to preserve privacy • Example query: Locate the closest gas stations to a given user

Nearest-Neighbor Probability • An uncertain object has: • A membership probability • An uncertain attribute represented as a PDF defined on an uncertainty region • NN Probability • The probability of an object Oi to be the NN, Pnn(Oi, Q), is computed using nested integration over all objects • Example: Pnn(O1 | O1 = x ∧ Q = q) = 0.7 × 1.0 × 0.9 • [Figure: objects O1 to O4 and query Q with their uncertainty regions; O1 is fixed at x and Q at q; labeled values: 0.3, 0.7, 1.0, 0.1, 0.9]
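As a rough illustration of what Pnn measures, the sketch below estimates it by plain Monte-Carlo sampling rather than by the nested integration named on the slide. It makes several simplifying assumptions that are not in the source: objects live on a line, each PDF is uniform on an interval, the query is a fixed point rather than an uncertain region, and objects are independent. The function name and parameters are hypothetical.

```python
import random

def estimate_pnn(objects, q, target, samples=50_000, seed=0):
    """Estimate the probability that `target` is the nearest neighbor of q.

    `objects` maps a name to (lo, hi, membership): a uniform PDF on
    [lo, hi] and the probability that the object exists at all.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(samples):
        best_name, best_dist = None, float("inf")
        for name, (lo, hi, membership) in objects.items():
            if rng.random() >= membership:
                continue  # object does not exist in this sampled world
            dist = abs(rng.uniform(lo, hi) - q)
            if dist < best_dist:
                best_name, best_dist = name, dist
        wins += best_name == target
    return wins / samples
```

For instance, with two objects uniform on [0, 1], where O1 always exists and O2 exists with probability 0.5, and q = 0, O1 wins whenever O2 is absent and half of the time otherwise, so the estimate should be close to 0.75.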

Solution Overview • Topk-PNN Query: report the top-k most probable NNs based on Pnn(Oi, Q) • Two cost factors: • I/O cost due to object retrieval • CPU cost due to integral evaluation • Main Idea: • Incremental retrieval model • Progressive (lazy) tightening of bounds computed on Pnn until a stopping criterion is met

Bounding Pnn Values • Pnn(O1 | O1 = x) = (0.7)(1.0)(0.9) • (0.2)(0.7)(0.1) ≤ Pnn(O1) ≤ (0.9)(1.0)(1.0) • [Figure: objects O1 to O4 around query point q, with uncertainty regions split into subregions labeled with probabilities]

Bound Tightening • Bounds are refined by: • Selecting one subregion • Splitting the subregion into two smaller regions and re-computing the bounds

Ranking with Uncertain Scores

Ranking with Uncertain Scores • A tuple’s score is uncertain • Data entry errors • Privacy concerns • Integrating heterogeneous data sources • Presentation style

Ranking with Uncertain Scores • Uncertain scores induce a partial order • Multiple valid rankings conform with the partial order • Enumerating possible rankings is infeasible • Multiple challenges • Ranking Model • Query Semantics • Query Processing

Data Model • A tuple’s score is a PDF defined on a real interval • Tuples’ scoring PDFs are independent • Probabilistic Partial Order Model • Non-intersecting intervals are totally ordered • Intersecting intervals have a Probabilistic Dominance Relationship • A probabilistic process generates a probability distribution on the space of linear extensions

Query Definitions • Record-Rank Queries • UTop-Rank(i,j) • Top-k Queries • UTop-Prefix(k) • UTop-Set(k) • Rank Aggregation Queries • Optimal rank aggregation under Footrule/Kendall tau distance metrics • Mapping to UTop-Rank • [Figure: example answers for UTop-Prefix(3), UTop-Set(3), and UTop-Rank(1,2)]

Applications • A UTop-Rank(i, j) query can be used to find the most probable athlete to end up in a range of ranks in some competition given a partial order of competitors • A UTop-Rank(1, k) query can be used to find the most-likely location to be in the top-k hottest locations based on uncertain sensor readings represented as intervals • A UTop-Prefix query can be used in market analysis to find the most-likely product ranking based on fuzzy evaluations in users’ reviews • A UTop-Set query can be used to find a set of products that are most-likely to be ranked higher than all other products.

Record-Rank Queries • Monte-Carlo Integration • Transform the complex space of linear extensions to the space of all possible score combinations • Can be easily sampled at random • Sample from each [loi, upi] a score value biased by fi(x) • Allows estimating the summation of the probabilities of linear extensions having tuple t in the rank range i…j • Pr(t at rank i…j) = Vol(·) × Avg_x( Π fi(x) ), where the argument of Vol was drawn as a region in the figure • [Figure: the space of score combinations with t at rank i…j, shown inside the space of all possible score combinations]

Record-Rank Queries • Monte-Carlo integration allows us to avoid computing complex multidimensional integrals with dependent limits • Monte-Carlo integration gives the expected value of the multidimensional integral • The accuracy depends only on the number of drawn samples • The huge size of possible linear extensions does not affect accuracy • Error in O(1/s^(1/2)), where s is the number of samples • We need to evaluate a linear number of integrals (one Monte-Carlo integral per tuple) • Given a sample size, the total cost is linear in database size • Using a heap of size l, we can easily compute the l most probable answers on the fly
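The sampling view of a record-rank query can be sketched in a few lines: draw one score per tuple from its own distribution, rank the tuples by the drawn scores, and count how often the target lands in the requested rank range. The sketch assumes independent uniform score PDFs on each interval, which is only one possible choice of fi; the function name and parameters are hypothetical.

```python
import random

def rank_range_probability(intervals, target, i, j, samples=20_000, seed=0):
    """Estimate Pr(target is ranked between i and j), rank 1 being the highest score.

    `intervals` maps a tuple id to (lo, up); each score is assumed to be
    uniform on its interval and independent of the others.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        # One sampled score combination, i.e. one point in the score space.
        scores = {t: rng.uniform(lo, up) for t, (lo, up) in intervals.items()}
        # Rank of the target = 1 + number of tuples with a strictly higher score.
        rank = 1 + sum(s > scores[target] for s in scores.values())
        hits += i <= rank <= j
    return hits / samples
```

The cost of one call is proportional to the number of samples times the number of tuples, and the error shrinks with the square root of the sample count regardless of how many linear extensions exist.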

Top-k Queries • The answer space is exponential • We need to compute a nested integral for each possible permutation of k tuples, and report the most probable permutations • Exact computation cannot scale with respect to k • Compute approximate answers using Markov Chain Monte-Carlo (MCMC) methods by sampling the space biased by prefix probability: • Start from an initial linear extension with prefix p0 • Apply a transformation to move to another linear extension with prefix p1 • Accept the new linear extension with a probability biased by the probability of p1 relative to p0 • Repeat the transformation and acceptance steps from the current linear extension
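The sampling loop above has the shape of a Metropolis-style random walk over orderings. The sketch below shows that shape only, under assumptions that go beyond the slide: the transformation is a swap of two adjacent tuples, every swap is assumed to yield a valid linear extension (as if all score intervals overlapped), the acceptance rule is min(1, ratio), and the prefix probability is supplied by the caller as a hypothetical weight function instead of being computed from score PDFs.

```python
import random
from collections import Counter

def sample_prefixes(order, weight, k, steps=20_000, seed=0):
    """Estimate how often each length-k prefix is visited by the chain.

    `order` is the initial linear extension (a list of tuple ids) and
    `weight(prefix)` is a caller-supplied, strictly positive stand-in for
    the prefix probability.
    """
    rng = random.Random(seed)
    current = list(order)
    counts = Counter()
    for _ in range(steps):
        # Transformation: swap a random pair of adjacent tuples.
        pos = rng.randrange(len(current) - 1)
        proposal = current[:]
        proposal[pos], proposal[pos + 1] = proposal[pos + 1], proposal[pos]
        # Acceptance biased by the new prefix's weight relative to the old one.
        ratio = weight(tuple(proposal[:k])) / weight(tuple(current[:k]))
        if rng.random() < min(1.0, ratio):
            current = proposal
        counts[tuple(current[:k])] += 1
    return {prefix: c / steps for prefix, c in counts.items()}
```

Because the swap proposal is symmetric, the chain visits each ordering in proportion to the weight of its prefix, so the most frequently visited prefixes approximate the most probable ones. Restricting swaps to tuples whose score intervals intersect would keep the walk inside the set of valid linear extensions.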
