## Chapter 13: Incorporating Uncertainty into Data Integration

*Principles of Data Integration* — AnHai Doan, Alon Halevy, Zachary Ives

**Outline**

- Sources of uncertainty in data integration
- Representing uncertain data (brief overview)
- Probabilistic schema mappings

**Managing Uncertain Data**

- Databases typically model certain data: a tuple is either true (in the database) or false (not in the database).
- Real life involves a lot of uncertainty:
  - "The thief had either blond or brown hair."
  - Sensor readings are often unreliable.
- Uncertain databases try to model such uncertain data and to answer queries over it in a principled fashion.
- Data integration involves multiple facets of uncertainty!

**Uncertainty in Data Integration**

- The data itself may be uncertain (perhaps it was extracted from an unreliable source).
- Schema mappings can be approximate (perhaps created by an automatic tool).
- Reference reconciliation (and hence joins) is approximate.
- If the domain is broad enough, even the mediated schema can involve uncertainty.
- Queries, often posed as keywords, have uncertain intent.

**Principles of Uncertain Databases**

- Instead of describing one possible state of the world, an uncertain database describes a set of possible worlds.
- The expressive power of the data model determines which sets of possible worlds the database can represent:
  - Is the uncertainty on the values of an attribute?
  - Or on the presence of a tuple?
  - Can dependencies between tuples be represented?

**C-Tables: Uncertainty without Probabilities**

- Alice and Bob want to go on a vacation together, but will go to either Tahiti or Ulaanbaatar. Candace will definitely go to Ulaanbaatar.
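The vacation example can be sketched as a tiny c-table. This is a minimal illustrative encoding (the row format, the `domains` dict, and the variable name `x` are assumptions, not notation from the chapter): Alice's and Bob's destinations share the variable `x`, so they always travel together, while Candace's row is certain.

```python
from itertools import product

# Toy c-table: each row is (person, destination); a destination may be a
# variable (the string "x") rather than a concrete value. Rows that share
# a variable are correlated: they get the same value in every world.
ctable = [("Alice", "x"), ("Bob", "x"), ("Candace", "Ulaanbaatar")]
domains = {"x": ["Tahiti", "Ulaanbaatar"]}

def possible_worlds(ctable, domains):
    """Enumerate possible worlds: one per assignment to the variables."""
    variables = sorted(domains)
    for values in product(*(domains[v] for v in variables)):
        assignment = dict(zip(variables, values))
        # Substitute the assignment into every row to obtain one world.
        yield [(person, assignment.get(dest, dest)) for person, dest in ctable]

for world in possible_worlds(ctable, domains):
    print(world)
```

Because `x` has two possible values, the c-table represents exactly two possible worlds, and in both of them Alice and Bob end up at the same destination — a correlation that a model without shared variables could not express.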
- Possible worlds result from the different assignments to the variables.

**Representing Complex Distributions**

- The c-table represents mutual exclusion of tuples, but does not represent probability distributions.
- Representing complex probability distributions and correlations between tuples requires probabilistic graphical models.
- Two simpler models:
  - Independent tuple probabilities
  - Block-independent probabilities

**Tuple Independent Model**

- Assign each tuple a probability.
- The probability of each possible world is the product of the appropriate factor for each row: p_i if row i is in the world, and (1 − p_i) if it is not.
- Cannot represent correlations between tuples.

**Block Independent Model**

- Tuples are partitioned into blocks; one tuple is chosen from each block according to that block's distribution.
- Can represent mutual exclusion, but not co-dependence (e.g., Alice and Bob going to the same location).

**Probabilistic Schema Mappings**

- Source schema: S = (pname, email-addr, home-addr, office-addr)
- Target schema: T = (name, mailing-addr)
- We may not be sure which attribute of S mailing-addr should map to.
- Probabilistic schema mappings let us handle such uncertainty: intuitively, we give each candidate mapping a probability.

**What are the Semantics?**

- Should a single mapping apply to the entire table (by-table semantics), or can different mappings apply to different tuples (by-tuple semantics)?

**By-Table versus By-Tuple Semantics**

- Suppose the candidate mappings are m1, m2, m3 with Pr(m1) = 0.5, Pr(m2) = 0.4, Pr(m3) = 0.1.
- By-table: one mapping is chosen for all of D_S, so there are 3 possible target databases D_T, one per mapping.
- By-tuple: each tuple may use a different mapping, so with two source tuples there are 9 possible target databases D_T, one per sequence of mappings; e.g., Pr(⟨m1, m3⟩) = 0.05, Pr(⟨m2, m3⟩) = 0.04, Pr(⟨m3, m3⟩) = 0.01.

**Complexity of Query Answering**

- Answering queries is more expensive under by-tuple semantics: under by-table semantics a query can be answered by running it once per mapping, whereas under by-tuple semantics the number of mapping sequences grows exponentially with the number of tuples, and query answering is intractable in general.

**Summary of Chapter 13**

- Uncertainty is everywhere in data integration.
- Work in this area is only beginning — a great opportunity for further research.
- Probabilistic schema mappings:
  - By-table versus by-tuple semantics
  - By-tuple semantics is computationally expensive, but restricted cases can be found where query answering is still polynomial.
- Where do the probabilities come from?
  - Sometimes we interpret statistics as probabilities.
  - Sometimes the provenance of the data is more meaningful than the probabilities.
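The by-table/by-tuple distinction above can be sketched in a few lines of Python. The two-tuple source instance and the encoding of a mapping as "which source attribute feeds T.mailing-addr" are illustrative assumptions, not data from the chapter; the mapping probabilities are the ones used above (0.5, 0.4, 0.1).

```python
from itertools import product

# Hypothetical instance of S = (pname, email-addr, home-addr, office-addr).
S = [
    ("Alice", "alice@example.com", "1 Elm St", "100 Main St"),
    ("Bob",   "bob@example.com",   "2 Oak St", "200 Main St"),
]

# Each mapping is (index of the source attribute that feeds T.mailing-addr,
# probability of that mapping): m1 -> email, m2 -> home, m3 -> office.
mappings = {"m1": (1, 0.5), "m2": (2, 0.4), "m3": (3, 0.1)}

def by_table_worlds(S, mappings):
    """By-table: one mapping for the whole table -> |mappings| target DBs."""
    for col, p in mappings.values():
        yield p, [(t[0], t[col]) for t in S]

def by_tuple_worlds(S, mappings):
    """By-tuple: one mapping per tuple -> one target DB per mapping
    sequence, i.e. |mappings| ** |S| of them, with product probabilities."""
    for seq in product(mappings.values(), repeat=len(S)):
        p, world = 1.0, []
        for t, (col, q) in zip(S, seq):
            p *= q
            world.append((t[0], t[col]))
        yield p, world

print(len(list(by_table_worlds(S, mappings))))   # 3 possible D_T
print(len(list(by_tuple_worlds(S, mappings))))   # 9 possible D_T
```

Note that `by_tuple_worlds` enumerates one world per mapping *sequence* (matching the ⟨m1, m3⟩ notation above); distinct sequences can happen to produce identical target instances, in which case their probabilities would be summed when computing answer probabilities.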