
ICDE 06 04.05.06

ICDE 06 04.05.06. Probabilistic Message Passing in Peer Data Management Systems Philippe Cudré-Mauroux, EPFL Joint work with: Karl Aberer (EPFL) Andras Feher (T.U. Darmstadt). Overview of the talk. Data Integration in Large-Scale Information Systems


Presentation Transcript


  1. ICDE 06 04.05.06 Probabilistic Message Passing in Peer Data Management Systems Philippe Cudré-Mauroux, EPFL Joint work with: Karl Aberer (EPFL) Andras Feher (T.U. Darmstadt)

  2. Overview of the talk • Data Integration in Large-Scale Information Systems • Peer Data Management Systems (PDMS) • Query Routing in PDMS • Precision / Recall tradeoff • Probabilistic Message Passing • Deriving quality measures for the mappings • Conclusions

  3. Classical Data Integration: LAV/GAV • Traditional database techniques (e.g., LAV/GAV) rely on centralized schemas to integrate data sources • Not applicable to large-scale, decentralized contexts • Scale (upper ontologies?) • Churn • Autonomy • How can we foster semantic interoperability in decentralized settings? [Figure: local attributes myDate and yourDate mapped to a central Date concept, m(myDate) = Date and m(yourDate) = Date]

  4. Peer Data Management Systems (1) • Extending data integration techniques to decentralized settings • Photoshop (own schema): <Photoshop_Image> <GUID>178A8CD8865</GUID> <Creator>Robinson</Creator> <Subject> <Bag> <Item>Tunbridge Wells</Item> <Item>Royal Council</Item> </Bag> </Subject> … </Photoshop_Image> • WinFS (known schema): <WinFSImage> <GUID>178A8CD8866</GUID> <Author> <DisplayName>Henry Peach Robinson</DisplayName> <Role>Photographer</Role> </Author> <Keyword>Tunbridge</Keyword> <Keyword>Council</Keyword> … </WinFSImage> • Mapping: T12 = <Photoshop_Image> <GUID>$fs/GUID</GUID> <Creator>$fs/Author/DisplayName</Creator> </Photoshop_Image> FOR $fs IN /WinFSImage • Q1 = <GUID>$p/GUID</GUID> FOR $p IN /Photoshop_Image WHERE $p/Creator LIKE "%Robi%" • Q2 = <GUID>$p/GUID</GUID> FOR $p IN T12 WHERE $p/Creator LIKE "%Robi%"
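The step from Q1 to Q2 on this slide can be pictured as rewriting the paths of a query through a mapping. The following is a minimal Python sketch, assuming the mapping T12 is reduced to a dictionary of path correspondences and a query to a small tuple; the dictionary layout and the function name `reformulate` are illustrative assumptions, not part of the talk.

```python
# Hypothetical, simplified view of mapping T12: each Photoshop path is
# expressed in terms of a WinFS path (correspondences read off the slide).
T12 = {
    "GUID": "GUID",
    "Creator": "Author/DisplayName",
}


def reformulate(query, mapping, target_root):
    """Rewrite a (root, select_path, where_path, pattern) query through a mapping.

    Returns None when a path used by the query has no correspondence,
    i.e. the query cannot be forwarded through this mapping.
    """
    _root, select_path, where_path, pattern = query
    if select_path not in mapping or where_path not in mapping:
        return None
    return (target_root, mapping[select_path], mapping[where_path], pattern)


# Q1 from the slide: select GUID where Creator LIKE "%Robi%".
q1 = ("/Photoshop_Image", "GUID", "Creator", "%Robi%")
q2 = reformulate(q1, T12, "/WinFSImage")
print(q2)
```

A query that uses a path absent from the mapping (for example `Subject`) yields `None` in this sketch, which is one simple way to model a query that a mapping cannot carry.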

  5. Peer Data Management Systems (2) • Pairwise mappings • Local mappings overcome global heterogeneity • Iterative query reformulation [Figure: sources with different date attributes, <xap:CreateDate>2001-12-19T18:49:03Z</xap:CreateDate> and <xap:ModifyDate>2001-12-19T20:09:28Z</xap:ModifyDate>, <es:cDate>05/08/2004</es:cDate>, <myRDF:Date>Jan 1, 2005</myRDF:Date>; pairwise mappings relate es:cDate to myRDF:Date and es:cDate to xap:CreateDate; labels: date?, weather article, myRDF:Date, xap:ModifyDate]

  6. PDMS Examples • Some academic systems • Piazza • Hyperion • BestPeer • GridVine • … • Out there on the Internet • The Sequence Retrieval System (SRS) • 388 schemata (May 05, EBI repository) • 518 mappings (ID <-> ID) • Power-law distribution of node degrees • Clustering coefficient = 0.32 • Diameter = 9 • Semantic Overlay Networks • P2P + semi-structured data • The Semantic Web

  7. Data in large-scale PDMS • Distributed Databases • Number of sources < 100 • Consistent data • Coordination • Structured data • E.g., Relational data model • Integrity constraints • Transactions • Powerful queries • E.g., SQL, aggregation • Schemas created by administrators • Relatively fixed topology • VS • Large-Scale PDMS • Number of sources > 100 • Unreliable data • Autonomy • Semi-structured data • E.g., XML/RDF • No integrity constraints • No transactions • Simple SP queries • E.g., triple patterns, ranking • Schemata created by end users • Network churn

  8. Problem: Precision/Recall Tradeoff (1) • Semantic query routing • To whom shall I forward a query posed against my local schema? • Some (most) mappings will be (partially) faulty • Low expressive power of mapping languages • samePropertyAs / sameClassAs / subclassOf • … or even worse (Microformats) • Automatic schema alignment techniques • Different views on conceptualizations • Local query resolution • Low recall • Flooding (PDMS so far) • Low precision

  9. Problem: Precision/Recall Tradeoff (2) • Standard deductive integration is not sufficient • Uncertainty on mappings and conceptualizations • Probabilistic Message Passing • Deriving quality measures for the mappings • Reduces uncertainty • Used to route query / optimize mappings • Based on a notion of agreement on conceptualizations • Decentralized decision making, Emergent Semantics • From Schema Matching to Probabilistic Message Passing • Automatic Schema Matching • INPUT: 2 schemas + data • OUTPUT: 1 mapping • Probabilistic Message Passing • INPUT: n schemas and m mappings • OUTPUT: quality measures for the mappings

  10. Probabilistic Message Passing • Link-based analysis of the PDMS • Automatically deriving quality measures for the mappings • Transitive closures of mapping operations • Mapping Cycles • Parallel Paths [Figure: a query q: art/Creator? is forwarded through mappings m0, m4 and m3; feedback f0 compares q VS m3(m4(m0(q))), e.g., art/Creator? VS art/creatDate?]
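The comparison q VS m3(m4(m0(q))) can be sketched in a few lines of Python. This is a minimal illustration, assuming each mapping is a dictionary from source attribute to target attribute; the attribute names other than Creator and creatDate, the "neutral" outcome for a lost attribute, and the function names are assumptions made for the example.

```python
def compose(path, attribute):
    """Apply a sequence of attribute mappings; None means the attribute was lost."""
    for mapping in path:
        if attribute is None:
            return None
        attribute = mapping.get(attribute)
    return attribute


def cycle_feedback(path, attribute):
    """Compare an attribute with its image after a full mapping cycle.

    True  -> positive feedback (the attribute comes back unchanged)
    False -> negative feedback (it comes back as something else)
    None  -> no feedback (the attribute was dropped along the way)
    """
    result = compose(path, attribute)
    if result is None:
        return None
    return result == attribute


# Hypothetical mappings along the cycle m0 -> m4 -> m3.
m0 = {"Creator": "Author"}
m4 = {"Author": "Maker"}
m3_ok = {"Maker": "Creator"}
m3_bad = {"Maker": "creatDate"}  # the faulty case hinted at on the slide

print(cycle_feedback([m0, m4, m3_ok], "Creator"))
print(cycle_feedback([m0, m4, m3_bad], "Creator"))
```

A negative feedback only says that at least one mapping on the cycle is suspicious; it does not say which one, which is why the talk moves on to a probabilistic treatment.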

  11. On Cycles / parallel paths [Figure: mapping graph with mappings m0 to m5 and feedback f0]

  12. Computing a Marginal for one cycle • $P(m_0, m_3, m_4, f_0) = P(m_0)\,P(m_3)\,P(m_4)\,P(f_0 \mid m_0, m_3, m_4)$ • $P(m_0 \mid f_0) = \sum_{m_3, m_4} P(m_0, m_3, m_4, f_0)\,P(f_0)^{-1}$ • But: feedbacks on different cycles are correlated • One wrong mapping will affect several cycles/paths • Need to express a global probabilistic model for the mapping graph [Figure legend: unknown / observed variables]
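The marginal for a single cycle can be computed by plain enumeration. The sketch below is a hedged illustration: the slide does not spell out P(f0 | m0, m3, m4), so the code assumes a positive feedback is certain when all mappings are correct, impossible when exactly one is wrong, and occurs with a small probability `delta` when two or more are wrong (errors compensating each other). The name `delta` and the values 0.7 and 0.1 are borrowed from the parameters quoted on the results slides purely for illustration.

```python
from itertools import product


def p_feedback_positive(states, delta):
    """Assumed conditional P(f0 = positive | mapping states); True means correct."""
    wrong = sum(1 for correct in states if not correct)
    if wrong == 0:
        return 1.0
    if wrong == 1:
        return 0.0
    return delta  # two or more errors may cancel out


def posterior_first_correct(priors, feedback_positive, delta):
    """P(first mapping = correct | f0), summing the joint over the other mappings."""
    joint = {True: 0.0, False: 0.0}
    for states in product((True, False), repeat=len(priors)):
        p = 1.0
        for prior, correct in zip(priors, states):
            p *= prior if correct else 1.0 - prior
        p_pos = p_feedback_positive(states, delta)
        p *= p_pos if feedback_positive else 1.0 - p_pos
        joint[states[0]] += p
    # Dividing by the sum of both cases is the multiplication by P(f0)^-1.
    return joint[True] / (joint[True] + joint[False])


priors = [0.7, 0.7, 0.7]  # illustrative priors for m0, m3, m4
after_positive = posterior_first_correct(priors, True, 0.1)
after_negative = posterior_first_correct(priors, False, 0.1)
print(after_positive, after_negative)
```

Under these assumptions a positive feedback raises the belief in m0 above its prior and a negative feedback lowers it.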

  13. A Brief Intro to Factor-Graphs • g(x1, x2, x3, x4) = fA(x1, x2)fB(x2, x3, x4)
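The benefit of the factorization g = fA · fB is that a marginal can be computed with local sums instead of one sum over all variables. The Python sketch below checks this on binary variables; the numeric definitions of `f_a` and `f_b` are arbitrary assumptions chosen only to make the example runnable.

```python
from itertools import product

VALUES = (0, 1)


def f_a(x1, x2):
    """Arbitrary illustrative factor over (x1, x2)."""
    return 0.9 if x1 == x2 else 0.1


def f_b(x2, x3, x4):
    """Arbitrary illustrative factor over (x2, x3, x4)."""
    return 0.2 + 0.5 * x2 + 0.2 * x3 * x4


def marginal_x1_brute_force():
    """Sum g(x1, x2, x3, x4) over x2, x3, x4 for each value of x1."""
    totals = [0.0, 0.0]
    for x1, x2, x3, x4 in product(VALUES, repeat=4):
        totals[x1] += f_a(x1, x2) * f_b(x2, x3, x4)
    return totals


def marginal_x1_sum_product():
    """Same marginal, computed with one message from fB to x2."""
    # Message fB -> x2: sum fB over the variables only fB depends on.
    msg_fb_to_x2 = [
        sum(f_b(x2, x3, x4) for x3, x4 in product(VALUES, repeat=2))
        for x2 in VALUES
    ]
    # Marginal of x1: sum over x2 of fA times the incoming message.
    return [sum(f_a(x1, x2) * msg_fb_to_x2[x2] for x2 in VALUES) for x1 in VALUES]


print(marginal_x1_brute_force())
print(marginal_x1_sum_product())
```

Both functions return the same unnormalized marginal of x1; the second one never enumerates all four variables at once.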

  14. Deriving PDMS Factor-Graphs • Abductive reasoning on transitive closures of mappings • A priori information on mappings

  15. PDMS Factor-Graphs • Cyclic graph • Junction Tree? Clustering / Stretching of variables? • Centralization • Computational + communicational overhead • Iterative Sum-Product • Approximate results • How to perform iterative sum-product by message passing on the mapping graph? • Message passing in factor graph does not correspond to connectivity of mapping graph • We want to rely on decentralized computations only • Locality VS Globality of nodes in the factor graph • Mappings: local • Feedback factor: common, global knowledge • Observed feedback variables: neighborhood

  16. Embedded Message-Passing (1)

  17. Embedded Message-Passing (2)

  18. Message Passing • Decentralized computations • Computationally inexpensive • Sums and Products • Message-Passing Schedules • Periodic • Lazy (piggybacking on query forwarding) • No message overhead
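The sums and products mentioned above can be illustrated with a small iterative sum-product loop. This is a centralized simulation written for readability, not the decentralized embedding of the talk. It assumes binary mapping variables (correct / incorrect), and a feedback factor in which a positive feedback is certain when all mappings are correct, impossible when exactly one is wrong, and has probability `delta` otherwise. That factor, the example graph and all names are assumptions for illustration.

```python
from itertools import product


def feedback_factor(states, positive, delta):
    """Assumed P(observed feedback | mapping states); True means correct."""
    wrong = states.count(False)
    p_pos = 1.0 if wrong == 0 else (0.0 if wrong == 1 else delta)
    return p_pos if positive else 1.0 - p_pos


def normalize(pair):
    total = pair[0] + pair[1]
    return (pair[0] / total, pair[1] / total)


def iterative_sum_product(priors, feedbacks, delta, rounds=20):
    """Approximate P(mapping = correct | feedbacks) for every mapping.

    priors: {mapping: P(correct)}; feedbacks: list of (mappings, positive).
    Messages are (weight for correct, weight for incorrect) pairs.
    """
    to_var = {(i, m): (0.5, 0.5) for i, (ms, _) in enumerate(feedbacks) for m in ms}
    for _ in range(rounds):
        # Mapping -> feedback: prior times messages from the other feedbacks.
        to_fac = {}
        for (i, m) in to_var:
            c, w = priors[m], 1.0 - priors[m]
            for (j, n), msg in to_var.items():
                if n == m and j != i:
                    c, w = c * msg[0], w * msg[1]
            to_fac[(i, m)] = normalize((c, w))
        # Feedback -> mapping: sum the factor over the other mappings' states.
        new_to_var = {}
        for i, (ms, positive) in enumerate(feedbacks):
            for k, m in enumerate(ms):
                out = [0.0, 0.0]
                for states in product((True, False), repeat=len(ms)):
                    weight = feedback_factor(states, positive, delta)
                    for other, s in zip(ms, states):
                        if other != m:
                            weight *= to_fac[(i, other)][0 if s else 1]
                    out[0 if states[k] else 1] += weight
                new_to_var[(i, m)] = normalize((out[0], out[1]))
        to_var = new_to_var
    posteriors = {}
    for m, prior in priors.items():
        c, w = prior, 1.0 - prior
        for (_, n), msg in to_var.items():
            if n == m:
                c, w = c * msg[0], w * msg[1]
        posteriors[m] = c / (c + w)
    return posteriors


# Two feedbacks sharing m0 and m1: positive via m2, negative via m3.
priors = {"m0": 0.7, "m1": 0.7, "m2": 0.7, "m3": 0.7}
feedbacks = [(("m0", "m1", "m2"), True), (("m0", "m1", "m3"), False)]
print(iterative_sum_product(priors, feedbacks, delta=0.1))
```

In this toy graph the belief in m3 drops below its prior while the belief in m2 rises, since m3 is the only mapping that appears solely in the negative feedback. In a deployed system each peer would hold only the messages of its own mappings and could piggyback them on forwarded queries, as the lazy schedule on the slide suggests.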

  19. Implemented System • Schemas • Import from OWL (Web Ontology Language) • Mappings • KnowledgeWeb Ontology Alignment API • Import from RDF/XML • Automated on-the-fly creation • Comparison to standard alignments • Automatic derivation of quality measures P(m=correct | {F}) for the mappings using iterative message-passing • Query routing based on the quality measures • Precision / recall tradeoff
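One simple way to turn the quality measures into a routing decision is a threshold on P(m=correct | {F}). The slide does not say how the implemented system decides, so the threshold rule, the peer names and the function name `route_query` below are assumptions for illustration.

```python
def route_query(mapping_quality, threshold):
    """Return the neighbors whose mapping quality reaches the threshold.

    mapping_quality: {neighbor: P(mapping to that neighbor is correct | feedback)}
    """
    return [peer for peer, quality in mapping_quality.items() if quality >= threshold]


# Hypothetical quality measures for three outgoing mappings.
quality = {"peerA": 0.92, "peerB": 0.55, "peerC": 0.12}

flooding = route_query(quality, 0.0)   # forwards everywhere: favors recall
selective = route_query(quality, 0.9)  # forwards over trusted mappings only: favors precision
print(flooding, selective)
```

Moving the threshold between these two extremes is one way to expose the precision / recall tradeoff as a single tunable parameter.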

  20. Some (Preliminary) Results: Convergence (undirected example graph, prior 0.7 delta 0.1)

  21. Fault-tolerance (faulty links) (undirected example graph, prior 0.8 delta 0.1)

  22. Detecting Erroneous Mappings (random network of 50 schemas and 200 mappings, no prior information)

  23. Conclusions • Deriving quality measures for PDMS mappings • Automated process • Decentralized computations • Based on agreements on conceptualizations • Emergent Semantics • Current work • More expressive mappings • E.g., subsumption • Integration in the GridVine semantic overlay network • Application to other domains • Web Services composition?

  24. Thank you for your attention Web page: lsirpeople.epfl.ch/cudre • Questions?
