Chapter 13: Incorporating Uncertainty into Data Integration

##### Presentation Transcript

1. Chapter 13: Incorporating Uncertainty into Data Integration • Principles of Data Integration • AnHai Doan, Alon Halevy, Zachary Ives

2. Outline • Sources of uncertainty in data integration • Representing uncertain data (brief overview) • Probabilistic schema mappings

3. Managing Uncertain Data • Databases typically model certain data: • A tuple is either true (in the database) or false (not in the database). • Real life involves a lot of uncertainty: • “The thief had either blond or brown hair” • The sensor reading is often unreliable. • Uncertain databases try to model such uncertain data and to answer queries in a principled fashion. • Data integration involves multiple facets of uncertainty!

4. Uncertainty in Data Integration • Data itself may be uncertain (perhaps it’s extracted from an unreliable source). • Schema mappings can be approximate (perhaps created by an automatic tool). • Reference reconciliation (and hence joins) is approximate. • If the domain is broad enough, even the mediated schema could involve uncertainty. • Queries, often posed as keywords, have uncertain intent.

5. Outline • Sources of uncertainty in data integration • Representing uncertain data (brief overview) • Probabilistic schema mappings

6. Principles of Uncertain Databases • Instead of describing one possible state of the world, an uncertain database describes a set of possible worlds. • The expressive power of the data model determines which sets of possible worlds the database can represent. • Is the uncertainty on the values of an attribute? • Or on the presence of a tuple? • Can dependencies between tuples be represented?

7. C-Tables: Uncertainty without Probabilities • Alice and Bob want to go on a vacation together, but will go to either Tahiti or Ulaanbaatar. Candace will definitely go to Ulaanbaatar. • Possible worlds result from different assignments to the variables.
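
The table from this slide is not part of the transcript, so the following is only a minimal sketch of how possible worlds can fall out of variable assignments. The encoding is an assumption: a single hypothetical variable `x` is shared by Alice's and Bob's rows (so they always travel together), and the names `DOMAINS`, `C_TABLE`, and `possible_worlds` are illustrative.

```python
from itertools import product

# Hypothetical encoding of the vacation example: a destination is either
# a constant or the name of a variable listed in DOMAINS.
DOMAINS = {"x": ["Tahiti", "Ulaanbaatar"]}
C_TABLE = [
    ("Alice", "x"),              # Alice and Bob share the variable x,
    ("Bob", "x"),                # so they always end up in the same place
    ("Candace", "Ulaanbaatar"),  # constant: the same in every world
]

def possible_worlds(c_table, domains):
    """Return one world (a set of ground tuples) per variable assignment."""
    names = sorted(domains)
    worlds = []
    for values in product(*(domains[n] for n in names)):
        assignment = dict(zip(names, values))
        # Replace variables by their assigned value; constants stay as they are.
        world = frozenset(
            (person, assignment.get(dest, dest)) for person, dest in c_table
        )
        worlds.append(world)
    return worlds

worlds = possible_worlds(C_TABLE, DOMAINS)
for world in worlds:
    print(sorted(world))
```

With one variable over two values, the sketch yields two worlds, and no probabilities are attached to either of them.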

8. Representing Complex Distributions • The c-table represents mutual exclusion of tuples, but doesn’t represent probability distributions. • Representing complex probability distributions and correlations between tuples requires using probabilistic graphical models. • A couple of simpler models: • Independent tuple probabilities • Block independent probabilities

9. Tuple Independent Model • Assign each tuple a probability. • The probability of each possible world is the product, over all rows, of p_i if row i is in the database and (1 - p_i) if it is not. • Cannot represent correlations between tuples.
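
As a rough illustration of the product rule on this slide, the sketch below enumerates every possible world of a two-row table. The rows and their probabilities in `ROWS` are invented for the example, and `world_probabilities` is a hypothetical helper name.

```python
from itertools import product
from math import prod

# Hypothetical tuple-independent table: each tuple carries its own probability.
ROWS = {
    ("Alice", "Tahiti"): 0.7,
    ("Bob", "Tahiti"): 0.4,
}

def world_probabilities(rows):
    """Map each possible world (a set of tuples) to its probability."""
    tuples = list(rows)
    result = {}
    for present in product([True, False], repeat=len(tuples)):
        world = frozenset(t for t, keep in zip(tuples, present) if keep)
        # p_i if row i is in the world, (1 - p_i) if it is not.
        result[world] = prod(
            rows[t] if keep else 1 - rows[t] for t, keep in zip(tuples, present)
        )
    return result

world_probs = world_probabilities(ROWS)
for world, p in world_probs.items():
    print(sorted(world), round(p, 2))
```

Because every row is decided independently, n rows give 2^n worlds, and there is no way in this model to say that two rows must appear together.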

10. Block Independent Model • You choose one tuple from every block according to the distribution of that block. • Can represent mutual exclusion, but not co-dependence (i.e., Alice and Bob going to the same location).
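
A minimal sketch of the block-independent idea, assuming one block per traveler; the distributions in `BLOCKS` are invented for illustration and `block_worlds` is a hypothetical helper name.

```python
from itertools import product
from math import prod

# Hypothetical blocks: exactly one destination is chosen per person,
# according to that person's own distribution.
BLOCKS = {
    "Alice": {"Tahiti": 0.6, "Ulaanbaatar": 0.4},
    "Bob": {"Tahiti": 0.3, "Ulaanbaatar": 0.7},
}

def block_worlds(blocks):
    """Map each world (one tuple per block) to its probability."""
    keys = list(blocks)
    worlds = {}
    for choice in product(*(blocks[k].items() for k in keys)):
        world = frozenset((k, value) for k, (value, _) in zip(keys, choice))
        # Blocks are independent, so the choice probabilities multiply.
        worlds[world] = prod(p for _, p in choice)
    return worlds

block_world_probs = block_worlds(BLOCKS)

# Probability that Alice and Bob happen to pick the same destination.
p_same = sum(
    p for world, p in block_world_probs.items()
    if len({dest for _, dest in world}) == 1
)
print(round(p_same, 2))
```

Within a block the alternatives exclude each other, but the blocks are chosen independently, so worlds in which Alice and Bob split up still receive positive probability. That is the co-dependence limitation the slide points out.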

11. Outline • Sources of uncertainty in data integration • Representing uncertain data (brief overview) • Probabilistic schema mappings

12. Probabilistic Schema Mappings • Source schema: • S=(pname, email-addr, home-addr, office-addr) • Target schema: • T=(name, mailing-addr) • We may not be sure which attribute of S mailing-addr should map to. • Probabilistic schema mappings let us handle such uncertainty.
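
One way to picture this is as a set of candidate mappings from S to T, each with a probability, where the probabilities sum to 1. The sketch below is an assumption-laden illustration: the probabilities in `P_MAPPING`, the sample row, and the helper `target_distribution` are all invented, and only the schemas S and T come from the slide.

```python
from collections import defaultdict

S_ATTRS = ("pname", "email-addr", "home-addr", "office-addr")

# Hypothetical probabilistic mapping: each candidate says which S attribute
# feeds each T attribute, with invented probabilities that sum to 1.
P_MAPPING = [
    ({"name": "pname", "mailing-addr": "home-addr"}, 0.5),
    ({"name": "pname", "mailing-addr": "office-addr"}, 0.4),
    ({"name": "pname", "mailing-addr": "email-addr"}, 0.1),
]

def target_distribution(source_row, p_mapping):
    """Translate one S row into a distribution over T=(name, mailing-addr) rows."""
    dist = defaultdict(float)
    for mapping, prob in p_mapping:
        target = (source_row[mapping["name"]], source_row[mapping["mailing-addr"]])
        # Candidate mappings that yield the same target row pool their probability.
        dist[target] += prob
    return dict(dist)

# Invented sample row in which the home and office addresses coincide.
row = dict(zip(S_ATTRS, ("Alice", "alice@example.com", "12 Oak St", "12 Oak St")))
dist = target_distribution(row, P_MAPPING)
for target, p in dist.items():
    print(target, round(p, 2))
```

In this sample the home-addr and office-addr candidates produce the same target row, so that row ends up with their combined probability.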