
# Probabilistic Information Integration


Maurice van Keulen, Ander de Keijzer

Dagstuhl Seminar 08421 - Probabilistic Information Integration

Automatic data integration

Why does data integration take so long, why not automatic?

• The schema mismatch problem
• The data conversion/mapping problem
• The overlapping data problem (entity resolution / record linkage / data cleaning)
• The proverbial 90% of cases are straightforward and can be done with little development effort
• The proverbial 10% of cases are hard and take most of the development time

Let’s simply not solve those 10% right away!

Let’s go for an initial integration that can readily be used

“Good is good enough” for many applications

Let it improve over time during use


How to deal with remaining 10%
• Conflict between sources ≠ inconsistency; conflict = (independent) observations
• Data conflicts and partial/ambiguous matchings are symptoms of semantic uncertainty

Our approach to data integration:

• Define few rules to resolve only proverbial 90% of the cases
• Store initial integration result as uncertain data
• Start using the integrated data (time-to-market 10x earlier)
• Queries will return uncertain answers
• But integrated data can already be used meaningfully


Data integration process

[Figure: data integration process. Labels: “(semi-)automatic”; “user interaction”; “1. Data integration with external source”; “2. Query”; “Allow early meaningful use of integrated data”; “Solve remaining semantic uncertainty during use”; three DB states labeled uncertain, less uncertain, and certain.]


What we built

Demo GUI

Focus of talk

• Differences / correspondences between probabilistic XML and relational DBs
• Probabilistic integration algorithm
• What would defy my purpose?
• What is quality? (metrics)
• When is it good enough? (experiments)

[Figure: system architecture. Labels: Probabilistic Integration Functionality; IMPrECISE; Probabilistic XML Database; αML; XML DBMS; MonetDB/XQuery.]


[Figure legend: probabilistic node; possibility; XML node with tag name ‘tag’]

Data representation

Probabilistic XML tree represents all possible worlds in one tree

Possible worlds

• Movie list with 1 movie (King Kong/1933): probability 8%
• Movie list with 1 movie (King Kong/1976): probability 32%
• Movie list with 2 movies (King Kong/1933 and King Kong/1976): probability 60%

Can express uncertainty about existence, dependent and independent choice

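[Figure: probabilistic XML tree for the movie list. A root choice point (probability 1) leads to a ‘movies’ node whose choice point has two possibilities. With probability .4 there is a single ‘movie’ with title (tl) King Kong and a choice point on the year (yr): 1933 with probability .2 or 1976 with probability .8. With probability .6 there are two ‘movie’ nodes, King Kong/1933 and King Kong/1976. The remaining choice points have a single possibility with probability 1.]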
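As a minimal sketch of how one tree can stand for all possible worlds, the following Python enumerates the worlds of the King Kong example. The names `Elem`, `Choice`, `worlds`, and `forest_worlds` are hypothetical and chosen for illustration only; they are not the data model of the actual system. Brute-force enumeration is used here purely for clarity and would not scale to large trees.

```python
from dataclasses import dataclass, field
from itertools import product
from typing import List, Tuple, Union


@dataclass
class Elem:
    """Ordinary XML element (hypothetical illustration class)."""
    tag: str
    children: List["Node"] = field(default_factory=list)
    text: str = ""


@dataclass
class Choice:
    """Choice point: each possibility is (probability, list of child nodes)."""
    possibilities: List[Tuple[float, List["Node"]]]


Node = Union[Elem, Choice]


def forest_worlds(nodes):
    """Combine the worlds of sibling nodes, treating them as independent."""
    combined = []
    for combo in product(*[worlds(n) for n in nodes]):
        prob, forest = 1.0, []
        for p, part in combo:
            prob *= p
            forest.extend(part)
        combined.append((prob, forest))
    return combined


def worlds(node):
    """Return [(probability, forest of certain Elem nodes)] for one node."""
    if isinstance(node, Choice):
        return [(p * q, forest)
                for p, kids in node.possibilities
                for q, forest in forest_worlds(kids)]
    return [(p, [Elem(node.tag, forest, node.text)])
            for p, forest in forest_worlds(node.children)]


def years(forest):
    """Collect the text of all 'yr' elements in a certain forest."""
    found = []
    for node in forest:
        if node.tag == "yr":
            found.append(node.text)
        found.extend(years(node.children))
    return found


def title():
    return Elem("tl", text="King Kong")


def year(value):
    return Elem("yr", text=value)


# The tree from the slide: one movie (.4) with an uncertain year, or two movies (.6).
TREE = Choice([(1.0, [Elem("movies", [Choice([
    (0.4, [Elem("movie", [title(), Choice([(0.2, [year("1933")]),
                                           (0.8, [year("1976")])])])]),
    (0.6, [Elem("movie", [title(), year("1933")]),
           Elem("movie", [title(), year("1976")])]),
])])])])

for prob, world in worlds(TREE):
    print(f"{prob:.0%}", years(world))
```

The nesting of the year choice under the .4 possibility is what makes the two choices dependent: the year is only uncertain in worlds where the two sources describe one movie.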


Differences / correspondences: XML vs. relational

What to say about our probabilistic XML DBMS

Representation:

• Choice point (▼) = variable / x-tuple
• Alternative (O) = possible variable assignment / alternative
• Dependencies expressed in ancestor/descendant relationships = event expression / lineage formula

Querying

• In XPath/XQuery vs. SQL
• Semantics of querying according to possible world theory. Scalable implementation by working directly on the compact/succinct representation


Motivating example

Scenario of the demo at ICDE, April 2008:

• Portal with daily recommendation of movies on TV
• Source 1 : TV guide (e.g., www.tvguide.com)
• Enrich with information of Source 2 : IMDB
• Combined 18 ‘attributes’ of which 6 overlap
• Entity resolution problem with movies and actors


[Figure: uncertainty concerning entity resolution, shown for two “King Kong” movie records with years 1933 and 1976 and ratings 8.0 and 5.5. Labeled possibilities: “Same movie; for conflicting fields, both are correct”; “Same movie; for conflicting fields, one is correct”; “Different movies”. An additional note reads “Schema may exclude this possibility”.]


Integration functionality

Integration algorithm =

XML Tree-merge (in recursive descent fashion)

• Similarity matching

(In Christoph’s words)

• Repair-key
• Select worlds that satisfy background knowledge
• Rules / Constraints
• Thresholds

Strict separation of concerns

• Integration mechanism: enumeration of possibilities + XML tree merge
• Integration intelligence: background knowledge + similarity matching
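The separation of concerns above can be sketched for a single pair of records. In this hedged Python illustration, the mechanism (`integrate_pair`, `merge_same`) only enumerates possibilities, while the probability `p_same` is assumed to come from the integration intelligence (rules, thresholds, similarity matching). All names are hypothetical, flat dictionaries stand in for XML subtrees, and the uniform split over conflicting values is an assumption of this sketch, not something the slides specify.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]
# An uncertain record maps each field to alternatives [(probability, value), ...].
UncertainRecord = Dict[str, List[Tuple[float, str]]]


def certain(record: Record) -> UncertainRecord:
    """Wrap a plain record so that every field has one alternative."""
    return {key: [(1.0, value)] for key, value in record.items()}


def merge_same(r1: Record, r2: Record) -> UncertainRecord:
    """Merge two records assumed to describe the same entity.

    Conflicting values become alternatives of one field, split uniformly
    (an assumption of this sketch).
    """
    merged: UncertainRecord = {}
    for key in sorted(set(r1) | set(r2)):
        values: List[str] = []
        for rec in (r1, r2):
            if key in rec and rec[key] not in values:
                values.append(rec[key])
        merged[key] = [(1.0 / len(values), v) for v in values]
    return merged


def integrate_pair(r1: Record, r2: Record, p_same: float):
    """Enumerate the possibilities for one pair: different entities or the same."""
    possibilities = []
    if p_same < 1.0:
        possibilities.append((1.0 - p_same, [certain(r1), certain(r2)]))
    if p_same > 0.0:
        possibilities.append((p_same, [merge_same(r1, r2)]))
    return possibilities


source_a = {"title": "King Kong", "year": "1976"}
source_b = {"title": "King Kong", "year": "1933"}
result = integrate_pair(source_a, source_b, p_same=0.4)
for prob, records in result:
    print(prob, records)
```

Setting `p_same` to 0 or 1 removes the choice point entirely, which is how a rule or threshold would rule out a possibility without touching the merge mechanism.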


Result
• A compact/succinct representation of all possible merged XML trees

Why in this way?

• Result need not be perfect
• An integration result of ‘good enough quality’ suffices
• Semantic issues in data integration are not an obstacle

Knowledge needed for meaningful use

• Schema info (e.g., movies have 1 year child)
• Some thresholds (e.g., less than 50% match on titles means not the same movie title)
• Few domain specific rules (e.g., for (possibly) the same movie, if actors agree on role and role is unique in movie, then decide same actor regardless of difference in name)
• Automatic fallback: edit distance similarity (should be something better)

I use a bad similarity matcher on purpose
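The fallback matcher and the title threshold can be illustrated with a small sketch. It assumes Levenshtein distance normalized by the longer string's length; the slides only say “edit distance similarity”, so the normalization, the case folding, and the function names are assumptions made for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means identical strings (assumed normalization)."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))


def may_be_same_title(a: str, b: str, threshold: float = 0.5) -> bool:
    """Below the threshold, the pair is decided to be different titles."""
    return similarity(a.lower(), b.lower()) >= threshold


print(may_be_same_title("King Kong", "king kong"))
print(may_be_same_title("King Kong", "Casablanca"))
```

A pair that passes the threshold is not declared a match; it merely stays in the set of possibilities that the integration mechanism enumerates.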


What would defy my purpose?

Purpose is to significantly reduce software development effort for obtaining integration of sources that is good enough (reduce time-to-market)

• What is good enough?
  • Useful metrics for data quality
  • Threshold on metric when good enough
• I do not reduce anything if
  • Need to manually define and fine-tune many rules
  • Need to fine-tune thresholds for sufficient accuracy
• Feedback should be able to effectively improve quality


Metrics for integration result
• Metrics for uncertainty
  • # possible worlds
  • Uncertainty density = average number of alternatives per choice point
• Metrics for probability assignment
  • Answer decisiveness: two 50/50 alternatives are less decisive than 90/10

[Figure: three examples with Dens = .25, .17, .22 and Dec = .83, .89, .72]


Year of movie “King Kong”?
//yr[../tl="King Kong"]

• Possible answers: (1933): 40% × 20% = 8%; (1976): 40% × 80% = 32%; (1933, 1976): 60%
• Ranking by probability:
  • 1976 at 92% (better: 1× 1976 at 92%, 0× 1976 at 8%)
  • 1933 at 68% (better: 1× 1933 at 68%, 0× 1933 at 32%)
• Suggests IR-like precision and recall, but
  • Query answers are possibly not distinct
  • Correct answer with high probability is better than one with low probability (and vice versa for incorrect answers)
• Approach
  • An answer only exists for as much as its probability
  • Expected value of precision and recall


Year of movie “King Kong”?
//yr[../tl="King Kong"]

• Possible answers: (1933): 40% × 20% = 8%; (1976): 40% × 80% = 32%; (1933, 1976): 60%
• Ranking by probability: 1976 at 92%; 1933 at 68%
• Suppose 1976 is a correct answer and 1933 is not
• EXP(Precision) = EXP(correct) / EXP(all answers) = 0.92 / 1.6 = 57.5%
• EXP(Recall) = EXP(correct) / |Human| = 0.92 / 1 = 92%
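The arithmetic above can be reproduced with a short sketch. It assumes that answers within one world are distinct, which the talk notes is not always the case, and the function names are hypothetical.

```python
from collections import defaultdict

# Possible worlds of the query answer: (probability, years returned).
WORLDS = [(0.08, ["1933"]), (0.32, ["1976"]), (0.60, ["1933", "1976"])]


def answer_probabilities(worlds):
    """Probability of each answer = total probability of the worlds containing it."""
    probs = defaultdict(float)
    for p, answers in worlds:
        for answer in set(answers):  # assumes distinct answers per world
            probs[answer] += p
    return dict(probs)


def expected_precision_recall(probs, correct):
    """An answer counts only for as much as its probability."""
    exp_correct = sum(p for a, p in probs.items() if a in correct)
    exp_all = sum(probs.values())
    precision = exp_correct / exp_all if exp_all else 0.0
    recall = exp_correct / len(correct) if correct else 0.0
    return precision, recall


probs = answer_probabilities(WORLDS)
precision, recall = expected_precision_recall(probs, correct={"1976"})
print(f"precision {precision:.1%}, recall {recall:.0%}")
```

Here `correct` plays the role of the human-provided answer set, written |Human| in the recall formula.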


Data: a few “Today’s picks” from the TV guide, enriched with the IMDB source with 243,000 movies.

18 attrs in total; 6 overlapping.

Queries: 43 XPath queries

Too many rules?
• Isn’t development effort simply shifted to rule definition and threshold tuning?
• Rules: DTD-info + 1 ‘rough’ rule per entity suffices
• Thresholds: Quality insensitive to ‘safe’ thresholds

Don’t worry about perfecting the rules and thresholds. Strive for an initial result that can be queried, with about 90% of entities resolved. For the 10% hard cases, just make sure that you don’t miss the one correct match (user feedback cannot invent matches).


User feedback
• Usually, user feedback can be naturally embedded in user interaction
• Example:
• Contacts application in your mobile phone, integrated/synchronized with company phone list, PC at home, other people’s phones (community) with the aim to automatically pick up changes
• Phone application ranks possible phone numbers according to likelihood for dialing
• Phone application can automatically give feedback
• Dialed number gave error ‘invalid number’
• Both “End call” and “Wrong number” buttons
• No significant additional interaction needed
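The slides do not spell out how feedback changes the stored data. One natural reading, sketched here as an assumption, is that feedback on a query answer discards the possible worlds that contradict it and renormalizes the rest. The function name and the flat world list are hypothetical and reuse the King Kong year query for illustration.

```python
# Possible worlds of a query answer: (probability, answers in that world).
WORLDS = [(0.08, ["1933"]), (0.32, ["1976"]), (0.60, ["1933", "1976"])]


def apply_feedback(worlds, answer, is_correct):
    """Keep only worlds consistent with the feedback, then renormalize.

    Positive feedback keeps worlds that contain the answer;
    negative feedback keeps worlds that do not.
    """
    kept = [(p, answers) for p, answers in worlds
            if (answer in answers) == is_correct]
    total = sum(p for p, _ in kept)
    if total == 0:
        raise ValueError("feedback contradicts every possible world")
    return [(p / total, answers) for p, answers in kept]


after_negative = apply_feedback(WORLDS, "1933", is_correct=False)
after_positive = apply_feedback(WORLDS, "1976", is_correct=True)
print(after_negative)
print(after_positive)
```

Under this reading, negative feedback removes uncertainty faster in this example because it eliminates two of the three worlds, while positive feedback on 1976 leaves the entity resolution question open. It also shows why a missed correct match cannot be recovered: feedback can only select among worlds that already exist.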


Data: integration result @ margin 4, threshold 0.8

Queries: 43 XPath queries

Feedback: several series of 40 consecutive feedbacks

Each feedback randomly chosen from possible ones

User feedback (UF) effective?
• Is user feedback effective enough to quickly and effectively improve integration quality?

[Figure: precision and recall plots for negative feedback, positive feedback, and mixed feedback]


Conclusions
• Many correspondences between probabilistic XML and relational databases
• Simple model for uncertainty in data with well-understood semantics suffices: possible world model with discrete choices
• Seems appropriate for schema and data integration for many applications (e.g. portals): early meaningful use of integrated data, improves during use with feedback
• My worries:
  • First proposals for some quality metrics
  • Few rules and safe thresholds suffice
  • Mixed targeted user feedback is effective in quickly improving integration quality


Opportunities
• Put probabilistic relational DBMS underneath
• Techniques for deriving (imperfect) (conditional) functional dependencies may be used to automate rule definition
• Since rules need not be perfect nor handle all cases, tool-support for non-expert users becomes possible?
• User feedback may also be used to learn new rules
  • Work is needed to handle wrong user feedback
  • Answer explanation may help in targeting user feedback
• Recent works on probabilistic schema matching/mapping
• More distant future:
• Autonomous applications that only rely on their own data and metadata for automatic data exchange/integration: a “community of co-operating applications”
• We need a way to let applications automatically learn how to disambiguate things
