
Managing a Space of Heterogeneous Data

Xin (Luna) Dong, University of Washington, March 2007.


Presentation Transcript


  1. Managing a Space of Heterogeneous Data Xin (Luna) Dong University of Washington March, 2007

  2. Once upon a time…

  3. Nowadays… [Figure: data sources D1, D2, D3, D4, D5]

  4. Mappings Between Heterogeneous Data Sources MovieDVD Movie Director Review

  5. Traditional Data Integration Systems Require Semantic Mappings Between Data Sources Up Front [Figure: a mediated schema over data sources D1 to D5, with queries labeled Q and Q1 to Q5]

  6. In Many Applications it is Hard to Obtain Precise Semantic Mappings [Figure: data sources D1 to D5 and a “?”]

  7. Scenario 1. Different Websites About Movies

  8. Scenario 2. Personal Information Space Intranet Internet

  9. In Many Applications it is Hard to Obtain Precise Semantic Mappings [Figure: a mediated schema, data sources D1 to D5, and a query Q]

  10. Managing Dataspaces • Dataspaces [Halevy et al., PODS’06] • Collections of heterogeneous data sources • Do not necessarily include semantic mappings • Scenarios: personal information, enterprises, government agencies, smart homes, digital libraries, and the Web • My goal: Provide quality search, querying and browsing as the system evolves

  11. Heterogeneity at Different Levels Name: First: Luna Last: Dong E-Mail Addresses: lunadong@cs.washington.edu @inproceedings{dong05, author=“Xin Dong”, title=“Semex: A Platform for Personal Information Management and Integration”, booktitle=“VLDB 2005 PhD Workshop”, …}

  12. Heterogeneity at Instance Level • Example records: Name: First: Luna Last: Dong E-Mail Addresses: lunadong@cs.washington.edu; @inproceedings{dong05, author=“Xin Dong”, title=“Semex: A Platform for Personal Information Management and Integration”, booktitle=“VLDB 2005 PhD Workshop”, …} • Form of heterogeneity • The same real-world object can be referred to using different attribute values • Current work • Record linkage: most work assumes matching tuples from a single database table that has a fair number of attributes (surveyed in [Winkler, 2006]) • Contributions • Reference reconciliation: reconcile instances of multiple classes with only limited attributes [Sigmod’05]

  13. Heterogeneity at Schema Level • Example records: Name: First: Luna Last: Dong E-Mail Addresses: lunadong@cs.washington.edu; @inproceedings{dong05, author=“Xin Dong”, title=“Semex: A Platform for Personal Information Management and Integration”, booktitle=“VLDB 2005 PhD Workshop”, …} • Form of heterogeneity • The same domain can be described using different schemas • Data can be (semi-)structured or unstructured • Current work • Schema matching (surveyed in [Rahm&Bernstein, 2001]) • Query reformulation (surveyed in [Halevy 2000]) • Contributions • Probabilistic schema mapping [VLDB’07] • Visualizing heterogeneous data [InfoVis’07]

  14. Heterogeneity at Query Level • Example records: Name: First: Luna Last: Dong E-Mail Addresses: lunadong@cs.washington.edu; @inproceedings{dong05, author=“Xin Dong”, title=“Semex: A Platform for Personal Information Management and Integration”, booktitle=“VLDB 2005 PhD Workshop”, …} • Form of heterogeneity • Different terms and different levels of structural details • Keyword search: ‘Semex Dong’ • Structured query: Paper (title, ‘Semex’), (authoredBy, ‘Dong’)

  15. Heterogeneity at Query Level • Example records: Name: First: Luna Last: Dong E-Mail Addresses: lunadong@cs.washington.edu; @inproceedings{dong05, author=“Xin Dong”, title=“Semex: A Platform for Personal Information Management and Integration”, booktitle=“VLDB 2005 PhD Workshop”, …} • Form of heterogeneity • Different terms and different levels of structural details • Keyword search: ‘Semex Dong’ • Structured query: Paper (title, ‘Semex’), (authoredBy, ‘Dong’) • Current work • Keyword search on databases (Discover, DBExplorer, etc.) • Contributions • Seamless querying of structured and unstructured data • Indexing heterogeneous data [Sigmod’07] • Answering structured queries on unstructured data [WebDB’06]

  16. Outline • Problem definition and goals • Semex Personal Information Management System [CIDR’05, one of three Best Demos at Sigmod’05] • Technical contributions: • Reference reconciliation [Sigmod 2005] • Indexing heterogeneous data [Sigmod 2007] • Answering structured queries on unstructured data [WebDB 2006] • Probabilistic schema mapping [VLDB 2007] • Visualizing heterogeneous data [InfoVis 2007] • Future research directions

  17. Semex Generates a Logical View of Meaningful Objects and Associations [Association labels in figure: AttachedTo, Recipient, ConfHomePage, CourseGradeIn, ExperimentOf, PublishedIn, Sender, Cites, ComeFrom, EarlyVersion, ArticleAbout, PresentationFor, FrequentEmailer, CoAuthor, AddressOf, OriginatedFrom, BudgetOf, HomePage]

  18. Semex Provides Association Browsing of One’s Personal Information [Screenshot: Alon Y. Levy, Names, Emails]

  19. Semex Provides Association Browsing of One’s Personal Information A Platform for Personal Information Management and Integration Title Year

  20. Semex Provides Association Browsing of One’s Personal Information CIDR

  21. Semex Provides Association Browsing of One’s Personal Information Trio: A System for Integrated Management of Data, Accuracy, and Lineage

  22. Question 1: Which emails has my advisor sent me about my thesis? alonhalevy@gmail.com alon@cs.washington.edu halevy@google.com alonh@transformic.com

  23. Question 2: Who has been working on schema matching? Search ‘Schema Matching’: 6 Messages, 67 Articles, 31 Persons working on Schema Matching (e.g., Alon Halevy, Phil Bernstein, Renee Miller, Anhai Doan)

  24. Question 3: Which of my friends published in Sigmod 2007? My friends who published papers in Sigmod 2007

  25. Semex Architecture • Modules: Data Analysis Module, Schema Management Module, Data Integration Module • Components: Searcher, Browser, Analyzer, Domain Model, Domain Manager, Association DB, Index, Indexer, Reference Reconciliater, Associations, Objects, Extractors, Integrator • Sources: Word, PPT, PDF, Latex, Email, Webpage, Excel, DB

  26. Outline • Problem definition and our principle • Semex Personal Information Management System [CIDR’05, one of three Best Demos at Sigmod’05] • Technical contributions: • Reference reconciliation [Sigmod 2005] • Indexing heterogeneous data [Sigmod 2007] • Answering structured queries on unstructured data [WebDB 2006] • Probabilistic schema mapping [VLDB 2007] • Visualizing heterogeneous data [InfoVis 2007] • Future research directions

  27. Heterogeneity at Different Levels • Example records: Name: First: Luna Last: Dong E-Mail Addresses: lunadong@cs.washington.edu; @inproceedings{dong05, author=“Xin Dong”, title=“Semex: A Platform for Personal Information Management and Integration”, booktitle=“VLDB 2005 PhD Workshop”, …} • Instance level • Reference Reconciliation [Sigmod’05] • Query level • Answering structured queries on unstructured data [WebDB’06] • Indexing heterogeneous data [Sigmod’07] • Schema level • Probabilistic schema mapping [VLDB’07] • Visualization of heterogeneous data [InfoVis’07]

  28. Reference Reconciliation is Crucial in Dataspaces [Figure text, name and email variants: Xin (Luna) Dong Lab-#dong xin dong xin luna xinluna dong Names luna x. dong dongxin Emails xin dong]

  29. Previous Approaches • A very active area of research in databases, data mining and AI • Most current approaches assume matching tuples from a single database table • Traditional approaches are based on pair-wise comparisons (surveyed in [Winkler, 2006]) • New approaches explore relationships between reconciliation decisions using probability models [Russell et al, 2002] [Domingos et al, 2004] • Harder for a complex information space

  30. Challenges for a Complex Information Space • Article: a1=(“Distributed Query Processing”,“169-180”, {p1,p2,p3}, c1) a2=(“Distributed query processing”,“169-180”, {p4,p5,p6}, c2) • Venue: c1=(“ACM Conference on Management of Data”, “1978”, “Austin, Texas”) c2=(“ACM SIGMOD”, “1978”, null) • Person: p1=(“Robert S. Epstein”, null) p2=(“Michael Stonebraker”, null) p3=(“Eugene Wong”, null) p4=(“Epstein, R.S.”, null) p5=(“Stonebraker, M.”, null) p6=(“Wong, E.”, null)

  31. Challenges for a Complex Information Space • Article: a1=(“Distributed Query Processing”,“169-180”, {p1,p2,p3}, c1) a2=(“Distributed query processing”,“169-180”, {p4,p5,p6}, c2) • Venue: c1=(“ACM Conference on Management of Data”, “1978”, “Austin, Texas”) c2=(“ACM SIGMOD”, “1978”, null) • Person: p1=(“Robert S. Epstein”, null) p2=(“Michael Stonebraker”, null) p3=(“Eugene Wong”, null) p4=(“Epstein, R.S.”, null) p5=(“Stonebraker, M.”, null) p6=(“Wong, E.”, null) p7=(“Eugene Wong”, “eugene@berkeley.edu”) p8=(null, “stonebraker@csail.mit.edu”) p9=(“mike”, “stonebraker@csail.mit.edu”) • Challenges: 1. Multiple Classes 2. Limited Information 3. Multi-value Attributes

  32. Intuition: Exploit Association Network • We extract from dataspaces networks of instances and associations between the instances • Key: exploit the network, specifically, the clues hidden in the associations

  33. Strategy I. Exploiting Richer Evidence • Cross-attribute similarity – Name&email • p5=(“Stonebraker, M.”, null) • p8=(null, “stonebraker@csail.mit.edu”) • Context Information I – Contact list • p5=(“Stonebraker, M.”, null, {p4, p6}) • p8=(null, “stonebraker@csail.mit.edu”, {p7}) • p6=p7 • Context Information II – Authored articles • p2=(“Michael Stonebraker”, null) • p5=(“Stonebraker, M.”, null) • p2 and p5 authored the same article
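
A minimal sketch of the cross-attribute (name and email) evidence on slide 33. It assumes a person reference is a (name, email) pair and that a name matches an email when the last name equals the local part of the address. The helper names and the matching rule are illustrative assumptions; the slides do not specify how Semex scores this evidence.

```python
from typing import Optional, Tuple

# A person reference as on the slides: (name, email); either field may be None.
Person = Tuple[Optional[str], Optional[str]]


def last_name(name: str) -> str:
    """Lowercased last name from 'Last, F.' or 'First Last' forms (assumed formats)."""
    if "," in name:
        return name.split(",")[0].strip().lower()
    return name.split()[-1].strip().lower()


def email_local_part(email: str) -> str:
    """Lowercased part of the address before '@'."""
    return email.split("@")[0].lower()


def name_email_match(a: Person, b: Person) -> bool:
    """Compare the name of one reference with the email of the other, in both directions."""
    for (name, _), (_, email) in ((a, b), (b, a)):
        if name and email and last_name(name) == email_local_part(email):
            return True
    return False


p5 = ("Stonebraker, M.", None)
p6 = ("Wong, E.", None)
p8 = (None, "stonebraker@csail.mit.edu")

print(name_email_match(p5, p8))  # the two references share no attribute, yet match
print(name_email_match(p6, p8))
```

Attribute-wise comparison alone cannot relate p5 and p8, because one has only a name and the other only an email; comparing across attributes gives them something to agree on.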

  34. Considering Only Attribute-wise Similarities Cannot Merge Persons Well • Person references: 24076 • Real-world persons (gold-standard): 1750 • Chart values: 1409, 1750, 3159

  35. Considering Richer Evidence Improves the Result • Person references: 24076 • Real-world persons: 1750 • Chart values: 1409, 1750, 346

  36. Strategy II. Propagate Information Between Reconciliation Decisions • Article: a1=(“Distributed Query Processing”,“169-180”, {p1,p2,p3}, c1) a2=(“Distributed query processing”,“169-180”, {p4,p5,p6}, c2) • Venue: c1=(“ACM Conference on Management of Data”, “1978”, “Austin, Texas”) c2=(“ACM SIGMOD”, “1978”, null) • Person: p1=(“Robert S. Epstein”, null) p2=(“Michael Stonebraker”, null) p3=(“Eugene Wong”, null) p4=(“Epstein, R.S.”, null) p5=(“Stonebraker, M.”, null) p6=(“Wong, E.”, null)

  37. Propagating Information Between Reconciliation Decisions Further Improves the Result • Person references: 24076 • Real-world persons: 1750 • Chart values: 1750, 272, 1409, 346
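
The following sketch illustrates the idea of propagating one reconciliation decision to others, using the article, venue, and person references from slide 36. The slide does not give the algorithm, so this uses hard rules and a union-find structure as a stand-in: once two articles are judged the same, their venues are merged and their authors are compared under a looser (last name, first initial) rule. Every rule and name here is an assumption made for illustration.

```python
class UnionFind:
    """Tracks which references have been reconciled into the same object."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

    def same(self, a, b):
        return self.find(a) == self.find(b)


# (title, pages, authors, venue) as on the slide
articles = {
    "a1": ("Distributed Query Processing", "169-180", ["p1", "p2", "p3"], "c1"),
    "a2": ("Distributed query processing", "169-180", ["p4", "p5", "p6"], "c2"),
}
persons = {
    "p1": "Robert S. Epstein", "p2": "Michael Stonebraker", "p3": "Eugene Wong",
    "p4": "Epstein, R.S.", "p5": "Stonebraker, M.", "p6": "Wong, E.",
}


def name_key(name):
    """(last name, first initial) for 'First M. Last' or 'Last, F.M.' forms."""
    if "," in name:
        last, first = [s.strip() for s in name.split(",", 1)]
    else:
        parts = name.split()
        last, first = parts[-1], parts[0]
    return last.lower(), first[0].lower()


uf = UnionFind()
t1, pg1, au1, v1 = articles["a1"]
t2, pg2, au2, v2 = articles["a2"]

# Decision 1: the two articles match on title (case-insensitive) and pages.
if t1.lower() == t2.lower() and pg1 == pg2:
    uf.union("a1", "a2")
    # Propagation: the same article has one venue and one author list.
    uf.union(v1, v2)
    for x in au1:
        for y in au2:
            if name_key(persons[x]) == name_key(persons[y]):
                uf.union(x, y)

print(uf.same("c1", "c2"), uf.same("p2", "p5"), uf.same("p1", "p5"))
```

The venue strings “ACM Conference on Management of Data” and “ACM SIGMOD” share little text, so the venue merge is only reachable through the article decision, which is the point of propagation.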

  38. Strategy III. Reference Enrichment • p2=(“Michael Stonebraker”, null, {p1,p3}) p8=(null, “stonebraker@csail.mit.edu”, {p7}) p9=(“mike”, “stonebraker@csail.mit.edu”, null) • p8-9 =(“mike”, “stonebraker@csail.mit.edu”, {p7}) [Figure marks: X X V V]

  39. Reference Enrichment Improves the Result More than Information Propagation • Person references: 24076 • Real-world persons: 1750 • Chart values: 1750, 160, 1409, 346

  40. Applying Both Information Propagation and Reference Enrichment Gets the Best Result • Person references: 24076 • Real-world persons: 1750 • Chart values: 1409, 346, 1750, 125
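
A small sketch of reference enrichment as shown on slide 38: when two references are reconciled, their attribute values are pooled so that later comparisons see more evidence. The `Ref` class, the set-valued fields, and the shared-email merge rule are assumptions for illustration, not the Semex data model.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Ref:
    """A person reference with multi-valued attributes (hypothetical layout)."""
    names: Set[str] = field(default_factory=set)
    emails: Set[str] = field(default_factory=set)
    contacts: Set[str] = field(default_factory=set)


def enrich(a: Ref, b: Ref) -> Ref:
    """Merge two reconciled references by taking the union of every attribute."""
    return Ref(a.names | b.names, a.emails | b.emails, a.contacts | b.contacts)


p8 = Ref(emails={"stonebraker@csail.mit.edu"}, contacts={"p7"})
p9 = Ref(names={"mike"}, emails={"stonebraker@csail.mit.edu"})

# Assumed rule: references that share an email address are the same person.
p8_9 = enrich(p8, p9) if p8.emails & p9.emails else None
print(p8_9)
```

The merged reference p8-9 carries a name, an email, and a contact list, whereas p8 and p9 each lacked one of them.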

  41. Experiment Settings • Data sets: Four personal data sets • Use the same parameters and thresholds for all data sets • Measures • Precision = #(correctly reconciled reference pairs) / #(reconciled reference pairs) • Recall = #(correctly reconciled reference pairs) / #(reference pairs that refer to the same real-world object) • F-measure = 2·Precision·Recall / (Precision + Recall)

  42. Precision and Recall Increase Substantially Compared with Attribute-wise Matching
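
The pairwise measures on slide 41 can be computed from two clusterings of the references, as in the sketch below. The toy `gold` and `predicted` clusterings are made up for illustration and are not the experimental data; the function names are likewise assumptions.

```python
from itertools import combinations


def pairs(clusters):
    """All unordered reference pairs placed in the same cluster."""
    return {frozenset(p) for c in clusters for p in combinations(sorted(c), 2)}


def precision_recall_f(predicted, gold):
    """Pairwise precision, recall, and F-measure as defined on the slide."""
    pred, true = pairs(predicted), pairs(gold)
    correct = len(pred & true)
    precision = correct / len(pred) if pred else 1.0
    recall = correct / len(true) if true else 1.0
    total = precision + recall
    f = 2 * precision * recall / total if total else 0.0
    return precision, recall, f


# Toy example: the system failed to merge {p2, p5} with {p8, p9}.
gold = [{"p2", "p5", "p8", "p9"}, {"p3", "p6", "p7"}]
predicted = [{"p2", "p5"}, {"p8", "p9"}, {"p3", "p6", "p7"}]

p, r, f = precision_recall_f(predicted, gold)
print(p, r, f)  # precision 1.0, recall 5/9, F-measure 5/7
```

Here every predicted pair is correct (5 of 5), but only 5 of the 9 gold pairs were found, so under-merging shows up as lost recall rather than lost precision.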

  43. Heterogeneity at Different Levels • Example records: Name: First: Luna Last: Dong E-Mail Addresses: lunadong@cs.washington.edu; @inproceedings{dong05, author=“Xin Dong”, title=“Semex: A Platform for Personal Information Management and Integration”, booktitle=“VLDB 2005 PhD Workshop”, …} • Instance level • Reference Reconciliation [Sigmod’05] • Query level • Answering structured queries on unstructured data [WebDB’06] • Indexing heterogeneous data [Sigmod’07] • Schema level • Probabilistic schema mapping [VLDB’07] • Visualization of heterogeneous data [InfoVis’07]

  44. Seamless Querying of Structured and Unstructured Data • Structured queries: SELECT title FROM paper WHERE title LIKE ‘%Dataspaces%’ AND year = ‘2005’ • Keyword search: “dataspaces”

  45. I. Answering Structured Queries on Unstructured Data [Figure labels: DB, ?, DB, IR] • Structured queries: SELECT title FROM paper WHERE title LIKE ‘%Dataspaces%’ AND year = ‘2005’ • Keyword search: “dataspaces” • Our approach: query translation • Transform a structured query into keyword search • Keyword search on unstructured data

  46. Challenges • Example: SELECT title FROM paper WHERE title LIKE ‘%Dataspaces%’ AND year = ‘2005’ • Keyword query: select title from paper where title LIKE +dataspaces and year +2005 • Top-10 precision: 0

  47. Challenges • Example: SELECT title FROM paper WHERE title LIKE ‘%Dataspaces%’ AND year = ‘2005’ • Keyword query: title paper title +dataspaces year +2005 • Top-10 precision: 0

  48. Challenges • Example: SELECT title FROM paper WHERE title LIKE ‘%Dataspaces%’ AND year = ‘2005’ • Keyword query: +dataspaces +2005 • Top-10 precision: 0.2

  49. Challenges • Example: SELECT title FROM paper WHERE title LIKE ‘%Dataspaces%’ AND year = ‘2005’ • Keyword query: +dataspaces +2005 paper title • Top-10 precision: 0.2

  50. Challenges • Example: SELECT title FROM paper WHERE title LIKE ‘%Dataspaces%’ AND year = ‘2005’ • Keyword query: +dataspaces +2005 paper • Top-10 precision: 0.6
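
A hedged sketch of the translation step, reproducing only the shape of the best-scoring keyword query on slide 50: predicate values become required terms (prefixed with “+”) and the table name becomes an optional term, while SQL keywords and attribute names are dropped. The function name and the input representation are assumptions; the slides do not describe how the actual approach selects keywords.

```python
from typing import List, Tuple


def to_keyword_query(table: str, conditions: List[Tuple[str, str]]) -> str:
    """Turn (attribute, value) predicates on one table into a keyword query.

    Values become required terms; the table name is appended as an optional
    term; attribute names are ignored (an assumption based on slide 50).
    """
    required = []
    for _attr, value in conditions:
        # Strip LIKE wildcards, lowercase, and split multi-word values.
        for token in value.strip("%").lower().split():
            required.append("+" + token)
    return " ".join(required + [table.lower()])


# SELECT title FROM paper WHERE title LIKE '%Dataspaces%' AND year = '2005'
query = to_keyword_query("paper", [("title", "%Dataspaces%"), ("year", "2005")])
print(query)  # +dataspaces +2005 paper
```

Slides 46 to 50 suggest why the choice matters: keeping SQL keywords or attribute names in the keyword query drove top-10 precision to 0 or 0.2, while required values plus the table name reached 0.6.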
