Chapter 13: Incorporating Uncertainty into Data Integration

Principles of Data Integration, by AnHai Doan, Alon Halevy, and Zachary Ives

Outline • Sources of uncertainty in data integration • Representing uncertain data (brief overview) • Probabilistic schema mappings

Managing Uncertain Data • Databases typically model certain data: • A tuple is either true (in the database) or false (not in the database). • Real life involves a lot of uncertainty: • “The thief had either blond or brown hair” • The sensor reading is often unreliable. • Uncertain databases try to model such uncertain data and to answer queries in a principled fashion. • Data integration involves multiple facets of uncertainty!

Uncertainty in Data Integration • Data itself may be uncertain (perhaps it’s extracted from an unreliable source) • Schema mappings can be approximate (perhaps created by an automatic tool) • Reference reconciliation (and hence joins) are approximate • If the domain is broad enough, even the mediated schema could involve uncertainty • Queries, often posed as keywords, have uncertain intent.

Principles of Uncertain Databases • Instead of describing one possible state of the world, an uncertain database describes a set of possible worlds. • The expressive power of the data model determines which sets of possible worlds the database can represent. • Is the uncertainty about the values of an attribute? • Or about the presence of a tuple? • Can dependencies between tuples be represented?

C-Tables: Uncertainty without Probabilities • Alice and Bob want to go on a vacation together, but will go to either Tahiti or Ulaanbaatar. Candace will definitely go to Ulaanbaatar. • Possible worlds result from different assignments to the variables.
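The idea that each variable assignment yields one possible world can be sketched in Python. The c-table on the original slide is not recoverable here, so the rows, the variable name `x`, and its domain `{1, 2}` below are assumptions chosen only to match the vacation story.

```python
from itertools import product

# Assumed c-table: (person, destination, condition over the variable assignment).
# The conditions and the domain of x are illustrative, not taken from the slide.
C_TABLE = [
    ("Alice", "Tahiti", lambda v: v["x"] == 1),
    ("Bob", "Tahiti", lambda v: v["x"] == 1),
    ("Alice", "Ulaanbaatar", lambda v: v["x"] != 1),
    ("Bob", "Ulaanbaatar", lambda v: v["x"] != 1),
    ("Candace", "Ulaanbaatar", lambda v: True),
]
DOMAINS = {"x": [1, 2]}


def possible_worlds(rows, domains):
    """Enumerate the distinct worlds produced by all variable assignments."""
    names = sorted(domains)
    worlds = set()
    for values in product(*(domains[n] for n in names)):
        assignment = dict(zip(names, values))
        # A row belongs to the world exactly when its condition holds.
        world = frozenset(
            (person, dest) for person, dest, cond in rows if cond(assignment)
        )
        worlds.add(world)
    return worlds


WORLDS = possible_worlds(C_TABLE, DOMAINS)
for world in WORLDS:
    print(sorted(world))
```

Because Alice's and Bob's rows share the variable `x`, every world places them at the same destination, which is the kind of dependency a c-table can express without any probabilities.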

Representing Complex Distributions • The c-table represents mutual exclusion of tuples, but doesn’t represent probability distributions. • Representing complex probability distributions and correlations between tuples requires using probabilistic graphical models. • A couple of simpler models: • Independent tuple probabilities • Block independent probabilities

Tuple Independent Model • Assign each tuple a probability. • The probability of each possible world is the product, over all rows, of p_i if row i is in the database and (1 - p_i) if it is not. • Cannot represent correlations between tuples.
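A minimal sketch of the world-probability computation follows. The three tuples and their probabilities are hypothetical values picked for illustration; only the product rule itself comes from the slide.

```python
from itertools import product

# Hypothetical tuple probabilities (not from the original slide).
TRIPS = {
    ("Alice", "Tahiti"): 0.7,
    ("Bob", "Tahiti"): 0.4,
    ("Candace", "Ulaanbaatar"): 1.0,
}


def world_probabilities(tuple_probs):
    """Map every subset of the tuples (a possible world) to its probability."""
    rows = list(tuple_probs)
    worlds = {}
    for included in product([True, False], repeat=len(rows)):
        prob = 1.0
        for row, is_in in zip(rows, included):
            p = tuple_probs[row]
            # p_i if the row is present, (1 - p_i) if it is absent.
            prob *= p if is_in else (1.0 - p)
        world = frozenset(r for r, is_in in zip(rows, included) if is_in)
        worlds[world] = prob
    return worlds


TI_WORLDS = world_probabilities(TRIPS)
for world, prob in TI_WORLDS.items():
    print(sorted(world), round(prob, 4))
```

With n tuples there are 2^n worlds, so explicit enumeration is only practical for tiny examples; it is used here purely to make the semantics concrete.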

Block Independent Model • You choose one tuple from every block according to the distribution of that block. • Can represent mutual exclusion, but not co-dependence (i.e., Alice and Bob going to the same location).
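The block-independent semantics can be sketched as follows, assuming one block per person and hypothetical distributions over destinations. Each world picks exactly one tuple per block, and the picks are independent across blocks.

```python
from itertools import product

# Hypothetical blocks: one per person, each a distribution over destinations.
BLOCKS = {
    "Alice": {"Tahiti": 0.6, "Ulaanbaatar": 0.4},
    "Bob": {"Tahiti": 0.3, "Ulaanbaatar": 0.7},
    "Candace": {"Ulaanbaatar": 1.0},
}


def block_world_probabilities(blocks):
    """Choose one tuple per block; a world's probability is the product."""
    keys = list(blocks)
    worlds = {}
    for choice in product(*(blocks[k].items() for k in keys)):
        prob = 1.0
        world = []
        for key, (value, p) in zip(keys, choice):
            prob *= p
            world.append((key, value))
        worlds[frozenset(world)] = prob
    return worlds


BI_WORLDS = block_world_probabilities(BLOCKS)
for world, prob in BI_WORLDS.items():
    print(sorted(world), round(prob, 4))
```

Within a block the alternatives are mutually exclusive (Alice goes to exactly one place), but because blocks are independent, the worlds where Alice and Bob end up in different places still receive nonzero probability. That is why co-dependence cannot be expressed in this model.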

Probabilistic Schema Mappings • Source schema: • S=(pname, email-addr, home-addr, office-addr) • Target schema: • T=(name, mailing-addr) • We may not be sure which attribute of S mailing-addr should map to. • Probabilistic schema mappings let us handle such uncertainty.

Probabilistic Schema Mappings Intuitively, we want to give each mapping a probability: • S=(pname, email-addr, home-addr, office-addr) • T=(name, mailing-addr)

What are the Semantics? • S=(pname, email-addr, home-addr, office-addr) • T=(name, mailing-addr) • Should a single mapping apply to the entire table (by-table semantics)? • Or can different mappings apply to different tuples (by-tuple semantics)?

By-Table versus By-Tuple Semantics • By-table: for a source instance DS, there are 3 possible target databases DT, one per mapping: • Pr(m1)=0.5 • Pr(m2)=0.4 • Pr(m3)=0.1
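By-table semantics can be sketched in a few lines of Python. The source rows below are invented, and the choice of which source attribute each of m1, m2, m3 maps to mailing-addr is an assumption; only the schemas and the three probabilities come from the slides.

```python
# Hypothetical source instance over S=(pname, email-addr, home-addr, office-addr).
DS = [
    {"pname": "Alice", "email-addr": "alice@example.org",
     "home-addr": "Mountain View", "office-addr": "Sunnyvale"},
    {"pname": "Bob", "email-addr": "bob@example.org",
     "home-addr": "Palo Alto", "office-addr": "San Jose"},
]

# Assumed mappings: the source attribute used for mailing-addr, and Pr(m).
MAPPINGS = {
    "m1": ("email-addr", 0.5),
    "m2": ("home-addr", 0.4),
    "m3": ("office-addr", 0.1),
}


def by_table_targets(ds, mappings):
    """One target database per mapping: the same mapping is applied to every row."""
    targets = {}
    for name, (attr, prob) in mappings.items():
        target_db = frozenset((row["pname"], row[attr]) for row in ds)
        targets[name] = (target_db, prob)
    return targets


BY_TABLE = by_table_targets(DS, MAPPINGS)
for name, (target_db, prob) in BY_TABLE.items():
    print(name, prob, sorted(target_db))
```

The number of possible target databases equals the number of mappings, regardless of how many rows the source contains.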

By-Table versus By-Tuple Semantics • By-tuple: for a source instance DS, there are 9 possible target databases DT, one per sequence of mappings: • … • Pr(<m1,m3>)=0.05 • Pr(<m2,m3>)=0.04 • Pr(<m3,m3>)=0.01
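Under by-tuple semantics, each source tuple picks its own mapping. The sketch below assumes a two-row source instance (which is what 9 = 3 × 3 possible databases suggests) and assumes that the probability of a mapping sequence is the product of the individual mapping probabilities, which agrees with the listed values (for example 0.5 × 0.1 = 0.05). The rows and the attribute chosen by each mapping are hypothetical.

```python
from itertools import product

# Hypothetical two-row source instance over S.
DS = [
    {"pname": "Alice", "email-addr": "alice@example.org",
     "home-addr": "Mountain View", "office-addr": "Sunnyvale"},
    {"pname": "Bob", "email-addr": "bob@example.org",
     "home-addr": "Palo Alto", "office-addr": "San Jose"},
]

# Assumed mappings: the source attribute used for mailing-addr, and Pr(m).
MAPPINGS = {
    "m1": ("email-addr", 0.5),
    "m2": ("home-addr", 0.4),
    "m3": ("office-addr", 0.1),
}


def by_tuple_targets(ds, mappings):
    """One target database per sequence of mappings, one mapping per source row."""
    targets = {}
    for seq in product(list(mappings), repeat=len(ds)):
        prob = 1.0
        target_db = []
        for row, m in zip(ds, seq):
            attr, p = mappings[m]
            prob *= p  # mapping choices are assumed independent across rows
            target_db.append((row["pname"], row[attr]))
        targets[seq] = (frozenset(target_db), prob)
    return targets


BY_TUPLE = by_tuple_targets(DS, MAPPINGS)
for seq, (target_db, prob) in BY_TUPLE.items():
    print(seq, round(prob, 4), sorted(target_db))
```

With k mappings and n source rows there are k^n sequences, which gives some intuition for why reasoning under by-tuple semantics is the more expensive of the two.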

Complexity of Query Answering • Answering queries is more expensive under by-tuple semantics than under by-table semantics.

Summary of Chapter 13 • Uncertainty is everywhere in data integration • Work in this area is really only beginning • Great opportunity for further research. • Probabilistic schema mappings: • By-table versus by-tuple semantics • By-tuple semantics is computationally expensive, but restricted cases can be found where query answering is still polynomial. • Where do the probabilities come from? • Sometimes we interpret statistics as probabilities • Sometimes the provenance of the data is more meaningful than the probabilities
