Managing a Space of Heterogeneous Data

Xin (Luna) Dong, University of Washington, March 2007

Once upon a time…

Nowadays…
[Figure: data sources D1, D2, D3, D4, D5]

Mappings Between Heterogeneous Data Sources
[Figure labels: MovieDVD, Movie, Director, Review]

Traditional Data Integration Systems Require Semantic Mappings Between Data Sources Up Front
[Figure: a mediated schema over data sources D1 to D5; a query Q and per-source queries Q1 to Q5]

In Many Applications it is Hard to Obtain Precise Semantic Mappings
[Figure: data sources D1 to D5, with the mappings between them marked "?"]

Scenario 1. Different Websites About Movies

Scenario 2. Personal Information Space
[Figure labels: Intranet, Internet]

In Many Applications it is Hard to Obtain Precise Semantic Mappings
[Figure: a query Q posed over a mediated schema on top of data sources D1 to D5]

Managing Dataspaces
• Dataspaces [Halevy et al., PODS’06]
  • Collections of heterogeneous data sources
  • Do not necessarily include semantic mappings
  • Scenarios: personal information, enterprises, government agencies, smart homes, digital libraries, and the Web
My goal: provide quality search, querying, and browsing as the system evolves

Heterogeneity at Different Levels
Contact entry:
  Name: First: Luna, Last: Dong
  E-Mail Addresses: lunadong@cs.washington.edu
BibTeX entry:
  @inproceedings{dong05,
    author = "Xin Dong",
    title = "Semex: A Platform for Personal Information Management and Integration",
    booktitle = "VLDB 2005 PhD Workshop",
    …}

Heterogeneity at Instance Level
• Form of heterogeneity
  • The same real-world object can be referred to using different attribute values
  • Example: contact entry "First: Luna, Last: Dong" vs. BibTeX author "Xin Dong"
• Current work
  • Record linkage: most work assumes matching tuples from a single database table that has a fair number of attributes (surveyed in [Winkler, 2006])
• Contributions
  • Reference reconciliation: reconcile instances of multiple classes, with only limited attributes [Sigmod’05]

Heterogeneity at Schema Level
• Form of heterogeneity
  • The same domain can be described using different schemas
  • Data can be (semi-)structured or unstructured
  • Example: contact fields (Name: First, Last; E-Mail Addresses) vs. BibTeX fields (author, title, booktitle)
• Current work
  • Schema matching (surveyed in [Rahm&Bernstein, 2001])
  • Query reformulation (surveyed in [Halevy 2000])
• Contributions
  • Probabilistic schema mapping [VLDB’07]
  • Visualizing heterogeneous data [InfoVis’07]


Heterogeneity at Query Level
• Form of heterogeneity
  • Different terms and different levels of structural detail
  • Keyword search: 'Semex Dong'
  • Structured query: Paper (title, 'Semex'), (authoredBy, 'Dong')
• Current work
  • Keyword search on databases (Discover, DBExplorer, etc.)
• Contributions
  • Seamless querying of structured and unstructured data
    • Indexing heterogeneous data [Sigmod’07]
    • Answering structured queries on unstructured data [WebDB’06]

Outline
• Problem definition and goals
• Semex Personal Information Management System [CIDR’05, one of three Best Demos at Sigmod’05]
• Technical contributions:
  • Reference reconciliation [Sigmod 2005]
  • Indexing heterogeneous data [Sigmod 2007]
  • Answering structured queries on unstructured data [WebDB 2006]
  • Probabilistic schema mapping [VLDB 2007]
  • Visualizing heterogeneous data [InfoVis 2007]
• Future research directions

Semex Generates a Logical View of Meaningful Objects and Associations
[Figure: associations between objects, including AttachedTo, Recipient, ConfHomePage, CourseGradeIn, ExperimentOf, PublishedIn, Sender, Cites, ComeFrom, EarlyVersion, ArticleAbout, PresentationFor, FrequentEmailer, CoAuthor, AddressOf, OriginatedFrom, BudgetOf, HomePage]

Semex Provides Association Browsing of One’s Personal Information
• [Screenshot: Alon Y. Levy (Names, Emails)]
• [Screenshot: A Platform for Personal Information Management and Integration (Title, Year)]
• [Screenshot: CIDR]
• [Screenshot: Trio: A System for Integrated Management of Data, Accuracy, and Lineage]

Question 1: Which emails has my advisor sent me about my thesis?
[Figure: email addresses alonhalevy@gmail.com, alon@cs.washington.edu, halevy@google.com, alonh@transformic.com]

Question 2: Who has been working on schema matching?
• Search 'Schema Matching'
• Results: 6 messages, 67 articles, 31 persons working on schema matching (e.g., Alon Halevy, Phil Bernstein, Renee Miller, AnHai Doan)

Question 3: Which of my friends published in Sigmod 2007? My friends who published papers in Sigmod 2007

Semex Architecture
• Modules: Data Analysis Module, Schema Management Module, Data Integration Module
• Components: Searcher, Browser, Analyzer, Domain Manager, Indexer, Reference Reconciliator, Extractors, Integrator
• Stored data: Domain Model, Association DB, Index, Objects, Associations
• Data sources: Word, PPT, PDF, LaTeX, Email, Webpage, Excel, DB

Outline
• Problem definition and our principle
• Semex Personal Information Management System [CIDR’05, one of three Best Demos at Sigmod’05]
• Technical contributions:
  • Reference reconciliation [Sigmod 2005]
  • Indexing heterogeneous data [Sigmod 2007]
  • Answering structured queries on unstructured data [WebDB 2006]
  • Probabilistic schema mapping [VLDB 2007]
  • Visualizing heterogeneous data [InfoVis 2007]
• Future research directions

Heterogeneity at Different Levels
• Instance level
  • Reference reconciliation [Sigmod’05]
• Query level
  • Answering structured queries on unstructured data [WebDB’06]
  • Indexing heterogeneous data [Sigmod’07]
• Schema level
  • Probabilistic schema mapping [VLDB’07]
  • Visualization of heterogeneous data [InfoVis’07]

Reference Reconciliation is Crucial in Dataspaces
[Figure: name and email references to one person, including: Xin (Luna) Dong, Lab-#dong, xin dong, xin, luna, 董欣, xinluna, dong, luna, x. dong, dongxin, xin dong]

Previous Approaches
• A very active area of research in databases, data mining, and AI
• Most current approaches assume matching tuples from a single database table
• Traditional approaches are based on pair-wise comparisons (surveyed in [Winkler, 2006])
• New approaches explore relationships between reconciliation decisions using probability models [Russell et al, 2002] [Domingos et al, 2004]
• Harder for a complex information space


Challenges for a Complex Information Space
• Article:
  a1=(“Distributed Query Processing”, “169-180”, {p1,p2,p3}, c1)
  a2=(“Distributed query processing”, “169-180”, {p4,p5,p6}, c2)
• Venue:
  c1=(“ACM Conference on Management of Data”, “1978”, “Austin, Texas”)
  c2=(“ACM SIGMOD”, “1978”, null)
• Person:
  p1=(“Robert S. Epstein”, null)
  p2=(“Michael Stonebraker”, null)
  p3=(“Eugene Wong”, null)
  p4=(“Epstein, R.S.”, null)
  p5=(“Stonebraker, M.”, null)
  p6=(“Wong, E.”, null)
  p7=(“Eugene Wong”, “eugene@berkeley.edu”)
  p8=(null, “stonebraker@csail.mit.edu”)
  p9=(“mike”, “stonebraker@csail.mit.edu”)
• Challenges:
  1. Multiple Classes
  2. Limited Information
  3. Multi-value Attributes

Intuition: Exploit Association Network
• We extract from dataspaces networks of instances and associations between the instances
• Key: exploit the network, specifically the clues hidden in the associations

Strategy I. Exploiting Richer Evidence
• Cross-attribute similarity: name and email
  • p5=(“Stonebraker, M.”, null)
  • p8=(null, “stonebraker@csail.mit.edu”)
• Context Information I: contact list
  • p5=(“Stonebraker, M.”, null, {p4, p6})
  • p8=(null, “stonebraker@csail.mit.edu”, {p7})
  • p6=p7
• Context Information II: authored articles
  • p2=(“Michael Stonebraker”, null)
  • p5=(“Stonebraker, M.”, null)
  • p2 and p5 authored the same article
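The cross-attribute evidence above (a name on one reference, an email on the other) can be sketched as follows. This is only an illustration: the helper names and the "name token of length 3 or more appears in the email's local part" heuristic are assumptions made for this sketch, not the similarity function used in the talk.

```python
import re

def name_tokens(name):
    """Lower-cased alphabetic tokens of a name such as 'Stonebraker, M.'."""
    return re.findall(r"[a-z]+", name.lower())

def email_local_part(email):
    """Part of the address before '@', lower-cased."""
    return email.split("@", 1)[0].lower()

def name_email_evidence(name, email):
    """Hypothetical heuristic: True if some name token of length >= 3
    occurs inside the local part of the email address."""
    local = email_local_part(email)
    return any(len(tok) >= 3 and tok in local for tok in name_tokens(name))

# References from the slide: p5 has only a name, p8 has only an email.
p5 = ("Stonebraker, M.", None)
p6 = ("Wong, E.", None)
p8 = (None, "stonebraker@csail.mit.edu")

print(name_email_evidence(p5[0], p8[1]))  # name and email are compatible
print(name_email_evidence(p6[0], p8[1]))  # no shared token
```

Attribute-wise comparison would find nothing to compare between p5 and p8, since they share no populated attribute; comparing across attributes is what yields evidence here.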

Considering Only Attribute-wise Similarities Cannot Merge Persons Well
• Person references: 24076
• Real-world persons (gold standard): 1750
[Chart values: 3159, 1750, 1409]

Considering Richer Evidence Improves the Result
• Person references: 24076
• Real-world persons: 1750
[Chart values: 1750, 1409, 346]

Strategy II. Propagate Information Between Reconciliation Decisions
• Article:
  a1=(“Distributed Query Processing”, “169-180”, {p1,p2,p3}, c1)
  a2=(“Distributed query processing”, “169-180”, {p4,p5,p6}, c2)
• Venue:
  c1=(“ACM Conference on Management of Data”, “1978”, “Austin, Texas”)
  c2=(“ACM SIGMOD”, “1978”, null)
• Person:
  p1=(“Robert S. Epstein”, null)
  p2=(“Michael Stonebraker”, null)
  p3=(“Eugene Wong”, null)
  p4=(“Epstein, R.S.”, null)
  p5=(“Stonebraker, M.”, null)
  p6=(“Wong, E.”, null)
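A minimal sketch of the propagation idea, using the two articles on this slide: once a1 and a2 are judged to be the same article, that decision is pushed to their venues and authors. The union-find structure, the exact-match article test, and the assumption that the two author lists appear in the same order are simplifications introduced for this sketch; they are not the algorithm presented in the talk.

```python
from itertools import combinations

class UnionFind:
    """Tracks which references have been reconciled into the same object."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_a] = root_b

ARTICLES = {
    "a1": {"title": "Distributed Query Processing", "pages": "169-180",
           "authors": ["p1", "p2", "p3"], "venue": "c1"},
    "a2": {"title": "Distributed query processing", "pages": "169-180",
           "authors": ["p4", "p5", "p6"], "venue": "c2"},
}

def same_article(x, y):
    """Simplified test: case-insensitive title match plus identical pages."""
    return x["title"].lower() == y["title"].lower() and x["pages"] == y["pages"]

def reconcile(articles):
    uf = UnionFind()
    for a, b in combinations(sorted(articles), 2):
        x, y = articles[a], articles[b]
        if same_article(x, y):
            uf.union(a, b)
            # Propagate the article decision to associated references.
            uf.union(x["venue"], y["venue"])
            if len(x["authors"]) == len(y["authors"]):
                # Assumption: author lists are in the same order.
                for pa, pb in zip(x["authors"], y["authors"]):
                    uf.union(pa, pb)
    return uf

merged = reconcile(ARTICLES)
print(merged.find("p2") == merged.find("p5"))  # Stonebraker references merged
print(merged.find("c1") == merged.find("c2"))  # venues merged
```

Here the venue names "ACM Conference on Management of Data" and "ACM SIGMOD" would be hard to match on their own strings; the merge comes entirely from the article decision.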

Propagating Information Between Reconciliation Decisions Further Improves the Result
• Person references: 24076
• Real-world persons: 1750
[Chart values: 1750, 1409, 346, 272]

Strategy III. Reference Enrichment
• Before enrichment:
  p2=(“Michael Stonebraker”, null, {p1,p3})
  p8=(null, “stonebraker@csail.mit.edu”, {p7})
  p9=(“mike”, “stonebraker@csail.mit.edu”, null)
• After merging p8 and p9:
  p8-9=(“mike”, “stonebraker@csail.mit.edu”, {p7})
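Enrichment can be read as taking the union of the attribute values of two references once they are reconciled, so that the merged reference carries more evidence into later comparisons. The dictionary-of-sets representation and the function name below are assumptions made for this sketch.

```python
def enrich(ref_a, ref_b):
    """Merge two references already judged to denote the same person by
    taking the union of their values, attribute by attribute."""
    return {attr: ref_a[attr] | ref_b[attr] for attr in ref_a}

# p8 and p9 from the slide; null values are modelled as empty sets.
p8 = {"names": set(), "emails": {"stonebraker@csail.mit.edu"}, "contacts": {"p7"}}
p9 = {"names": {"mike"}, "emails": {"stonebraker@csail.mit.edu"}, "contacts": set()}

p8_9 = enrich(p8, p9)
print(p8_9)
```

The merged reference p8_9 has a name, an email, and a contact list, where p8 and p9 each lacked one of the three.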

Reference Enrichment Improves the Result More than Information Propagation
• Person references: 24076
• Real-world persons: 1750
[Chart values: 1750, 1409, 346, 160]

Applying Both Information Propagation and Reference Enrichment Gets the Best Result
• Person references: 24076
• Real-world persons: 1750
[Chart values: 1750, 1409, 346, 125]

Experiment Settings
• Data sets: four personal data sets
• Use the same parameters and thresholds for all data sets
• Measures
  • Precision = #(correctly reconciled reference pairs) / #(reconciled reference pairs)
  • Recall = #(correctly reconciled reference pairs) / #(reference pairs that refer to the same real-world object)
  • F-measure = 2·Precision·Recall / (Precision + Recall)
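The pairwise measures above can be computed from two clusterings of the references, one predicted and one gold standard. The sketch below is illustrative: the function names are made up for this example, and the tiny gold standard (p2, p5, p8, p9 as one person; p3, p6, p7 as another) is a hypothetical grouping in the spirit of the running example, not data from the experiments.

```python
from itertools import combinations

def same_cluster_pairs(clusters):
    """All unordered reference pairs that share a cluster."""
    return {frozenset(pair)
            for cluster in clusters
            for pair in combinations(sorted(cluster), 2)}

def pairwise_scores(predicted, gold):
    """Pairwise precision, recall, and F-measure of a predicted clustering."""
    pred_pairs = same_cluster_pairs(predicted)
    gold_pairs = same_cluster_pairs(gold)
    correct = len(pred_pairs & gold_pairs)
    precision = correct / len(pred_pairs) if pred_pairs else 1.0
    recall = correct / len(gold_pairs) if gold_pairs else 1.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

gold = [{"p2", "p5", "p8", "p9"}, {"p3", "p6", "p7"}]
predicted = [{"p2", "p5"}, {"p8", "p9"}, {"p3", "p6", "p7"}]

precision, recall, f_measure = pairwise_scores(predicted, gold)
print(precision, recall, f_measure)
```

In this example every predicted pair is correct (5 of 5), but only 5 of the 9 gold pairs are found, because the two Stonebraker groups were never merged.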

Precision and Recall Increase Substantially Compared with Attribute-wise Matching


Seamless Querying of Structured and Unstructured Data
• Structured queries: SELECT title FROM paper WHERE title LIKE '%Dataspaces%' AND year = '2005'
• Keyword search: "dataspaces"

I. Answering Structured Queries on Unstructured Data
• Structured queries: SELECT title FROM paper WHERE title LIKE '%Dataspaces%' AND year = '2005'
• Keyword search: "dataspaces"
• Our approach: query translation
  • Transform a structured query into keyword search
  • Keyword search on unstructured data
[Figure labels: DB, IR]

Challenges
• Example: SELECT title FROM paper WHERE title LIKE '%Dataspaces%' AND year = '2005'
• Candidate keyword queries and their top-10 precision:
  • select title from paper where title LIKE +dataspaces and year +2005 : 0
  • title paper title +dataspaces year +2005 : 0
  • +dataspaces +2005 : 0.2
  • +dataspaces +2005 paper title : 0.2
  • +dataspaces +2005 paper : 0.6
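To make the translation step concrete, the sketch below turns the example query into the keyword form that scored highest on these slides (+dataspaces +2005 paper): predicate values become required terms and the table name is appended as an optional term. The function name, the (attribute, value) predicate representation, and the rule itself are assumptions made for illustration; the slides show the outcome of several candidate translations, not the procedure that selects among them.

```python
def to_keyword_query(table, predicates, include_table=True):
    """Translate a simple selection query into a keyword query.

    table       -- table name from the FROM clause
    predicates  -- list of (attribute, value) pairs from the WHERE clause
    Values become required terms (prefixed with '+'); SQL wildcards are
    dropped. The table name is added as an optional term if requested.
    """
    terms = []
    for _attribute, value in predicates:
        for word in value.replace("%", " ").split():
            terms.append("+" + word.lower())
    if include_table:
        terms.append(table.lower())
    return " ".join(terms)

# SELECT title FROM paper WHERE title LIKE '%Dataspaces%' AND year = '2005'
query = to_keyword_query("paper", [("title", "%Dataspaces%"), ("year", "2005")])
print(query)
```

Passing include_table=False yields the "+dataspaces +2005" variant, which the slides report at a top-10 precision of 0.2.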
