Chapter 13: Incorporating Uncertainty into Data Integration
Presentation Transcript

  1. Chapter 13: Incorporating Uncertainty into Data Integration • Principles of Data Integration • AnHai Doan, Alon Halevy, Zachary Ives

  2. Outline • Sources of uncertainty in data integration • Representing uncertain data (brief overview) • Probabilistic schema mappings

  3. Managing Uncertain Data • Databases typically model certain data: • A tuple is either true (in the database) or false (not in the database). • Real life involves a lot of uncertainty: • “The thief had either blond or brown hair.” • Sensor readings are often unreliable. • Uncertain databases try to model such uncertain data and to answer queries in a principled fashion. • Data integration involves multiple facets of uncertainty!

  4. Uncertainty in Data Integration • The data itself may be uncertain (perhaps it was extracted from an unreliable source) • Schema mappings can be approximate (perhaps created by an automatic tool) • Reference reconciliation (and hence joins) is approximate • If the domain is broad enough, even the mediated schema could involve uncertainty • Queries, often posed as keywords, have uncertain intent.

  5. Outline • Sources of uncertainty in data integration • Representing uncertain data (brief overview) • Probabilistic schema mappings

  6. Principles of Uncertain Databases • Instead of describing one possible state of the world, an uncertain database describes a set of possible worlds. • The expressive power of the data model determines which sets of possible worlds the database can represent: • Is the uncertainty on the values of an attribute? • Or on the presence of a tuple? • Can dependencies between tuples be represented?

  7. C-Tables: Uncertainty without Probabilities • Alice and Bob want to go on a vacation together, but will go to either Tahiti or Ulaanbaatar. Candace will definitely go to Ulaanbaatar. • Possible worlds result from different assignments to the variables in the c-table.
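
A minimal sketch, in Python, of how a c-table can be read as a set of possible worlds (the encoding below is an illustrative assumption, not the book's notation): the shared variable x forces Alice and Bob to end up in the same place, while Candace's tuple is certain.

```python
# Hypothetical c-table encoding: a cell is either a constant or a named variable.
# Each total assignment of the variables to domain values yields one possible world.
from itertools import product

DOMAIN = ["Tahiti", "Ulaanbaatar"]

ctable = [
    ("Alice",   ("var", "x")),                # Alice goes wherever x is assigned
    ("Bob",     ("var", "x")),                # Bob shares the variable, so he travels with Alice
    ("Candace", ("const", "Ulaanbaatar")),    # Candace's destination is certain
]

variables = sorted({cell[1] for _, cell in ctable if cell[0] == "var"})

def possible_worlds(ctable, variables, domain):
    """Enumerate possible worlds: one per assignment of the variables to domain values."""
    for values in product(domain, repeat=len(variables)):
        assignment = dict(zip(variables, values))
        world = [(person, assignment[cell[1]] if cell[0] == "var" else cell[1])
                 for person, cell in ctable]
        yield assignment, world

for assignment, world in possible_worlds(ctable, variables, DOMAIN):
    print(assignment, world)
# Two worlds: Alice and Bob both in Tahiti, or all three people in Ulaanbaatar.
```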

  8. Representing Complex Distributions • The c-table represents mutual exclusion of tuples, but doesn’t represent probability distributions. • Representing complex probability distributions and correlations between tuples requires using probabilistic graphical models. • A couple of simpler models: • Independent tuple probabilities • Block independent probabilities

  9. Tuple Independent Model • Assign each tuple a probability. • The probability of a possible world is the product, over all rows i, of p_i if row i is in the world and (1 - p_i) if it is not. • Cannot represent correlations between tuples.
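
A small sketch of the tuple-independent model; the tuples and probabilities below are illustrative assumptions, since the slide's table is not reproduced here.

```python
# Tuple-independent model: each tuple carries its own probability, and a possible
# world is any subset of the tuples. A world's probability multiplies p_i for tuples
# that are in the world and (1 - p_i) for tuples that are left out.
tuples = [
    ("Alice",   "Tahiti",      0.6),
    ("Bob",     "Tahiti",      0.6),
    ("Candace", "Ulaanbaatar", 1.0),
]

def world_probability(world):
    """Probability of a possible world, given as a set of (name, destination) pairs."""
    prob = 1.0
    for name, dest, p in tuples:
        prob *= p if (name, dest) in world else (1.0 - p)
    return prob

print(world_probability({("Alice", "Tahiti"), ("Candace", "Ulaanbaatar")}))
# 0.6 * (1 - 0.6) * 1.0 = 0.24; note there is no way to force Alice's and Bob's
# tuples to appear together, which is the correlation limitation the slide mentions.
```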

  10. Block Independent Model • Choose one tuple from every block according to that block's distribution; blocks are independent of one another. • Can represent mutual exclusion, but not co-dependence (e.g., requiring that Alice and Bob go to the same location).
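
A companion sketch of the block-independent model, again with illustrative data: one alternative is chosen from each block according to that block's distribution, and the blocks are independent of one another.

```python
# Block-independent model: within a block the alternatives are mutually exclusive;
# across blocks the choices are independent. Names and probabilities are illustrative.
from itertools import product

blocks = {
    "Alice":   [("Tahiti", 0.5), ("Ulaanbaatar", 0.5)],
    "Bob":     [("Tahiti", 0.5), ("Ulaanbaatar", 0.5)],
    "Candace": [("Ulaanbaatar", 1.0)],
}

def worlds(blocks):
    """Enumerate possible worlds: pick one alternative per block and multiply probabilities."""
    names = list(blocks)
    for choice in product(*(blocks[n] for n in names)):
        world = [(name, dest) for name, (dest, _) in zip(names, choice)]
        prob = 1.0
        for _, p in choice:
            prob *= p
        yield world, prob

for world, prob in worlds(blocks):
    print(prob, world)
# Mutual exclusion inside a block is captured, but nothing can force Alice's and
# Bob's blocks to agree on a destination, which is the co-dependence the slide rules out.
```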

  11. Outline • Sources of uncertainty in data integration • Representing uncertain data (brief overview) • Probabilistic schema mappings

  12. Probabilistic Schema Mappings • Source schema: • S = (pname, email-addr, home-addr, office-addr) • Target schema: • T = (name, mailing-addr) • We may not be sure which attribute of S mailing-addr should map to. • Probabilistic schema mappings let us handle such uncertainty.

  13. Probabilistic Schema Mappings • Intuitively, we want to give each candidate mapping a probability: • S = (pname, email-addr, home-addr, office-addr) • T = (name, mailing-addr) • Each choice of which attribute of S feeds mailing-addr gets a probability, and the probabilities sum to 1.

  14. What are the Semantics? • S = (pname, email-addr, home-addr, office-addr) • T = (name, mailing-addr) • Should a single mapping apply to the entire table (by-table semantics), or can different mappings apply to different tuples (by-tuple semantics)?

  15. By-Table versus By-Tuple Semantics • Under by-table semantics, one mapping is chosen and applied to all of the source instance Ds, so there are 3 possible target databases DT, one per mapping: Pr(m1) = 0.5, Pr(m2) = 0.4, Pr(m3) = 0.1. (The example instances Ds and DT are not reproduced here.)
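
A hedged sketch of by-table semantics. The probabilities 0.5 / 0.4 / 0.1 are the ones on the slide, but the attribute correspondences inside m1-m3 and the concrete source tuples in Ds are assumptions, since the slide's tables are not reproduced in this transcript.

```python
# By-table semantics: one mapping is chosen from pM and applied to the whole source
# table, so with three mappings there are exactly three possible target databases DT.
pM = [
    ({"pname": "name", "email-addr":  "mailing-addr"}, 0.5),   # m1 (assumed correspondence)
    ({"pname": "name", "home-addr":   "mailing-addr"}, 0.4),   # m2 (assumed correspondence)
    ({"pname": "name", "office-addr": "mailing-addr"}, 0.1),   # m3 (assumed correspondence)
]

Ds = [  # invented source tuples for illustration
    {"pname": "Alice", "email-addr": "alice@x.org", "home-addr": "Tahiti",      "office-addr": "Palo Alto"},
    {"pname": "Bob",   "email-addr": "bob@x.org",   "home-addr": "Ulaanbaatar", "office-addr": "Palo Alto"},
]

def apply_mapping(mapping, rows):
    """Project each source row through a single attribute-to-attribute mapping."""
    return [{tgt: row[src] for src, tgt in mapping.items()} for row in rows]

for mapping, p in pM:
    print(p, apply_mapping(mapping, Ds))   # Pr(m1) = 0.5, Pr(m2) = 0.4, Pr(m3) = 0.1
```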

  16. By-Table versus By-Tuple Semantics • Under by-tuple semantics, each tuple of Ds may be transformed by a different mapping, so with two source tuples there are 9 possible target databases DT, for example: Pr(<m1, m3>) = 0.05, Pr(<m2, m3>) = 0.04, Pr(<m3, m3>) = 0.01. (Again, the example instances are not reproduced here.)
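
Continuing the same sketch for by-tuple semantics, reusing pM, Ds, and apply_mapping from the by-table example above: each source tuple independently picks a mapping, so two tuples and three mappings give 3^2 = 9 possible target databases, with probabilities that multiply.

```python
# By-tuple semantics: a "mapping sequence" assigns one mapping per source tuple.
# pM, Ds, and apply_mapping are as defined in the by-table sketch above.
from itertools import product

for sequence in product(pM, repeat=len(Ds)):
    prob = 1.0
    DT = []
    for (mapping, p), row in zip(sequence, Ds):
        prob *= p
        DT.extend(apply_mapping(mapping, [row]))
    print(round(prob, 4), DT)
# e.g. Pr(<m1, m3>) = 0.5 * 0.1 = 0.05, Pr(<m2, m3>) = 0.4 * 0.1 = 0.04,
#      Pr(<m3, m3>) = 0.1 * 0.1 = 0.01, matching the slide.
```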

  17. Complexity of Query Answering • Answering queries is more expensive under by-tuple semantics: for select-project-join queries, query answering under by-table semantics is in PTIME, while under by-tuple semantics it is #P-complete in general.

  18. Summary of Chapter 13 • Uncertainty is everywhere in data integration • Work in this area is only beginning • A great opportunity for further research. • Probabilistic schema mappings: • By-table versus by-tuple semantics • By-tuple semantics is computationally expensive, but restricted cases can be found where query answering is still polynomial. • Where do the probabilities come from? • Sometimes we interpret statistics as probabilities • Sometimes the provenance of the data is more meaningful than the probabilities