Semantic Network Analysis, 11.07.05 • Analyzing Semantic Interoperability in Bioinformatic Database Networks • Philippe Cudré-Mauroux, EPFL • Joint work with Julien Gaugaz, Adriana Budura and Karl Aberer
Overview • Peer Data Management Systems (PDMS) • Semantic Interoperability in the Large • Generatingfunctionology framework • The Sequence Retrieval System • Degree distribution • Analysis of giant component • Weighted analysis • Conclusions
Beyond Keyword Search • Searching semantically richer objects in large-scale heterogeneous networks • Which of these elements denote the same notion of date? <xap:CreateDate>2001-12-19T18:49:03Z</xap:CreateDate> <xap:ModifyDate>2001-12-19T20:09:28Z</xap:ModifyDate> <es:DofCreation>05/08/2004</es:DofCreation> <myRDF:Date>Jan 1, 2005</myRDF:Date>
Decentralized Data Integration: Distributed Databases vs. Large-Scale Information Systems
Distributed Databases • Number of sources < 100 • Consistent data • Coordination • Structured data (e.g., relational data model) • Integrity constraints • Transactions • Powerful queries (e.g., SQL, aggregation) • Schemas created by administrators • Relatively fixed topology
Large-Scale Information Systems (e.g., WWW) • Number of sources > 100 • Unreliable data • Autonomy • Semi-structured data (e.g., XML/RDF) • No integrity constraints • No transactions • Simple SP queries (e.g., triple patterns, ranking) • Schemas created by end users • Network churn
Data Integration: LAV/GAV • Traditional database techniques (e.g., LAV/GAV) rely on centralized schemas to integrate data sources • Not applicable to our context • Scale (upper ontologies?) • Churn • Autonomy • How can we foster semantic interoperability in decentralized settings? [Figure: a central concept Date mapped to the local attributes myDate and yourDate, with m(Date) = myDate and m(Date) = yourDate]
Semantic Interoperability • Extending semantic interoperability techniques to decentralized settings
Photoshop (own schema): <Photoshop_Image> <GUID>178A8CD8865</GUID> <Creator>Robinson</Creator> <Subject> <Bag> <Item>Tunbridge Wells</Item> <Item>Royal Council</Item> </Bag> </Subject> … </Photoshop_Image>
WinFS (known schema): <WinFSImage> <GUID>178A8CD8866</GUID> <Author> <DisplayName>Henry Peach Robinson</DisplayName> <Role>Photographer</Role> </Author> <Keyword>Tunbridge</Keyword> <Keyword>Council</Keyword> … </WinFSImage>
Mapping: T12 = <Photoshop_Image> <GUID>$fs/GUID</GUID> <Creator>$fs/Author/DisplayName</Creator> </Photoshop_Image> FOR $fs IN /WinFSImage
Original query: Q1 = <GUID>$p/GUID</GUID> FOR $p IN /Photoshop_Image WHERE $p/Creator LIKE "%Robi%"
Rewritten query: Q2 = <GUID>$p/GUID</GUID> FOR $p IN T12 WHERE $p/Creator LIKE "%Robi%"
1. Peer Data Management Systems • Pairwise mappings • Peer Data Management Systems (PDMS) • Local mappings overcome global heterogeneity • Iterative query rewriting [Figure: peers storing dates under different attribute names, linked by pairwise mappings among xap:CreateDate, xap:ModifyDate, es:cDate and myRDF:Date] Example fragments: <xap:CreateDate>2001-12-19T18:49:03Z</xap:CreateDate> <xap:ModifyDate>2001-12-19T20:09:28Z</xap:ModifyDate> <es:cDate>05/08/2004</es:cDate> <myRDF:Date>Jan 1, 2005</myRDF:Date>
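Iterative query rewriting over pairwise mappings can be pictured as a graph traversal: each hop applies one local mapping. The following is only a minimal sketch. The attribute names come from the slide, but the mapping directions, the `MAPPINGS` table and the `rewrite_attribute` helper are assumptions made for illustration, not the PDMS implementation.

```python
from collections import deque

# Hypothetical pairwise attribute mappings between peer schemas.
# Attribute names are taken from the slide; the directions are assumed.
MAPPINGS = {
    "myRDF:Date": ["xap:ModifyDate", "es:cDate"],
    "es:cDate": ["xap:CreateDate"],
}


def rewrite_attribute(start, mappings):
    """Return every attribute reachable from `start`, with the mapping path used.

    Breadth-first traversal: each hop applies one local mapping, so a query
    posed against `start` is rewritten iteratively, one peer at a time.
    """
    paths = {start: [start]}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for target in mappings.get(current, []):
            if target not in paths:  # skip cycles and redundant rewrites
                paths[target] = paths[current] + [target]
                queue.append(target)
    return paths


rewrites = rewrite_attribute("myRDF:Date", MAPPINGS)
for attribute, path in rewrites.items():
    print(attribute, "<-", " -> ".join(path))
```

In this toy setting no direct mapping exists from myRDF:Date to xap:CreateDate, yet the query still reaches it through es:cDate, which is the sense in which local mappings overcome global heterogeneity.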
Semantic Mediation Layer • [Figure: three stacked layers, from top to bottom: semantic mediation layer, overlay layer, "physical" layer; adjacent layers are labeled correlated / uncorrelated]
Schema-to-Schema Graph • Inter-organization of the different schemas used by the peers • Logical model • Directed • Weighted • Redundant
The Semantic Connectivity Graph • Definition (Semantic Interoperability): two peers are said to be semantically interoperable if they can forward queries to each other in the Schema-to-Schema graph, potentially through a series of semantic translation links • Idea • As in physical network analyses, create a connectivity layer to account for semantic interoperability • The semantic connectivity graph S • Unweighted, irreflexive and non-redundant version of the Schema-to-Schema graph
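One possible way to derive the semantic connectivity graph S from a Schema-to-Schema graph is sketched below. The triple representation, the schema names and the weights are hypothetical and chosen only to show the three simplifications named on the slide.

```python
def semantic_connectivity_graph(mappings):
    """Collapse weighted, possibly redundant mappings into a simple digraph.

    `mappings` is an iterable of (source_schema, target_schema, weight).
    Weights are dropped (unweighted), self-loops are removed (irreflexive),
    and parallel mappings between the same ordered pair are merged
    (non-redundant).
    """
    edges = set()
    for source, target, _weight in mappings:
        if source != target:
            edges.add((source, target))
    return edges


# Hypothetical Schema-to-Schema mappings; names and weights are made up.
schema_to_schema = [
    ("A", "B", 0.9),
    ("A", "B", 0.4),  # redundant mapping for the same ordered pair
    ("B", "A", 0.7),
    ("B", "C", 0.8),
    ("C", "C", 1.0),  # reflexive mapping
]

S = semantic_connectivity_graph(schema_to_schema)
print(sorted(S))
```

A set of ordered pairs is enough here because S only records whether a directed translation link exists between two schemas.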
Observations • Theorem: peers in a set $P_s$ are semantically interoperable iff $S_s$ is strongly connected, with $S_s = \{ s \mid \exists p \in P_s,\ p \leftrightarrow s \}$, where $p \leftrightarrow s$ denotes that peer $p$ uses schema $s$ • Observation 1: a set of peers $P_s$ cannot be semantically interoperable if $|E_s| < |V_s|$ • Observation 2: a set of peers $P_s$ is semantically interoperable if $|E_s| > |V_s|(|V_s|-1) - (|V_s|-1)$
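The theorem and the two edge-count observations can be checked on small graphs with a sketch like the following. It assumes a simple directed graph (no self-loops, no parallel edges) with at least two schemas; the function names are illustrative and not taken from the talk.

```python
def reachable(start, adjacency):
    """Return the set of vertices reachable from `start` (depth-first)."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def is_strongly_connected(vertices, edges):
    """True if every vertex reaches every other one along directed edges."""
    vertices = set(vertices)
    if not vertices:
        return True
    forward, backward = {}, {}
    for source, target in edges:
        forward.setdefault(source, []).append(target)
        backward.setdefault(target, []).append(source)
    root = next(iter(vertices))
    # Strongly connected iff the root reaches all vertices and all reach the root.
    return (reachable(root, forward) >= vertices
            and reachable(root, backward) >= vertices)


def edge_count_verdict(n_vertices, n_edges):
    """Apply Observations 1 and 2 using edge and vertex counts only."""
    if n_vertices < 2:
        raise ValueError("the counting bounds assume at least two schemas")
    if n_edges < n_vertices:
        return "impossible"    # Observation 1
    if n_edges > n_vertices * (n_vertices - 1) - (n_vertices - 1):
        return "guaranteed"    # Observation 2
    return "undetermined"      # the counts alone do not decide


cycle = [("A", "B"), ("B", "C"), ("C", "A")]
print(is_strongly_connected("ABC", cycle), edge_count_verdict(3, len(cycle)))
```

Between the two bounds the counts are inconclusive, so an explicit connectivity test such as `is_strongly_connected` is still needed.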
2. Semantic Interoperability in the Large • Question • How can we analyze semantic interoperability in large-scale PDMS? • Idea: use percolation theory to detect the emergence of a strongly connected component in S • Necessary condition for vertex-strong connectivity • Necessary condition for semantic interoperability
The Model • Adaptation of a recent graph-theoretic framework • Newman, Strogatz, Watts 2001 • Large-scale semantic graphs as random graphs with arbitrary degree distribution • Exponentially distributed, small-world, scale-free… graphs • Specificities of our model • Strong clustering (clustering coefficient cc) • Bidirectionality (bidirectionality coefficient bc) (for directed networks) • Based on generatingfunctionology • Percolation occurs when the connectivity indicator ci > 0
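As a point of comparison, the classical percolation criterion for undirected random graphs with an arbitrary degree distribution (Molloy-Reed) can be computed as below. This is not the connectivity indicator ci of the talk, which additionally accounts for clustering (cc) and bidirectionality (bc); the function name and the example distributions are assumptions for illustration.

```python
def molloy_reed_indicator(degree_distribution):
    """Return sum_k k * (k - 2) * p_k for an undirected degree distribution.

    `degree_distribution` maps a degree k to its probability p_k.
    A giant component emerges when the returned value is positive.
    """
    return sum(k * (k - 2) * p for k, p in degree_distribution.items())


# Two made-up degree distributions over degrees 1 and 3.
dense = {1: 0.5, 3: 0.5}
sparse = {1: 0.9, 3: 0.1}

for name, distribution in (("dense", dense), ("sparse", sparse)):
    value = molloy_reed_indicator(distribution)
    state = "super-critical" if value > 0 else "sub-critical or critical"
    print(name, round(value, 3), state)
```

The sign of the sum plays the same role as the sign of ci on the slide: positive values indicate that a giant component is expected.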
Size of the giant component • With u the smallest non-negative solution of a fixed-point equation [equation lost in extraction] • And G1 the distribution of edges from first- to second-order neighbors [equation lost in extraction]
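For the standard undirected model of Newman, Strogatz and Watts, without the clustering and bidirectionality corrections used in this talk, the giant component size is S = 1 - G0(u), where u is the smallest non-negative solution of u = G1(u) and G1(x) = G0'(x) / G0'(1). The sketch below solves this by fixed-point iteration; the function name, tolerance and example distribution are assumptions.

```python
def giant_component_size(degree_distribution, tolerance=1e-12, max_iterations=10000):
    """Return (u, S) for an undirected degree distribution {k: p_k}.

    G0(x) = sum_k p_k x^k generates the degree distribution.
    G1(x) = G0'(x) / G0'(1) generates the number of further edges found
    after following a randomly chosen edge.
    """
    mean_degree = sum(k * p for k, p in degree_distribution.items())
    if mean_degree == 0:
        return 1.0, 0.0  # no edges at all, hence no giant component

    def g0(x):
        return sum(p * x ** k for k, p in degree_distribution.items())

    def g1(x):
        return sum(k * p * x ** (k - 1)
                   for k, p in degree_distribution.items() if k > 0) / mean_degree

    # Iterating from 0 increases monotonically towards the smallest
    # non-negative fixed point of G1.
    u = 0.0
    for _ in range(max_iterations):
        new_u = g1(u)
        if abs(new_u - u) < tolerance:
            u = new_u
            break
        u = new_u
    return u, 1.0 - g0(u)


u, size = giant_component_size({1: 0.5, 3: 0.5})
print(round(u, 4), round(size, 4))
```

Starting the iteration at 0 matters: G1 always has the trivial fixed point u = 1, and only the smallest solution gives the giant component size.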
3. The Sequence Retrieval System (SRS) • Commercial information indexing and retrieval system • Bioinformatic libraries • EMBL • SwissProt • Prosite • Etc. • Schemas described in a custom language (Icarus) • Mappings (links) from one database to others
Why is SRS interesting? • Applying our heuristics to a real large-scale corpus of interconnected databases • More than 380 databanks • More than 500 (undirected) links • Data used by professionals on a daily basis
Crawling the SRS schema-to-schema graph • Custom crawler • As of May 2005 (EBI repository) • 388 nodes • 518 edges • Giant connected component: 187 nodes • Power-law distribution of node degrees • Clustering coefficient = 0.32 • Diameter = 9
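Statistics such as the relative size of the giant connected component and the clustering coefficient can be computed from a crawled edge list along the following lines. The toy graph is made up, and the average of local clustering coefficients is only one common definition; the slide does not say which definition was used.

```python
from collections import deque


def build_adjacency(nodes, edges):
    """Undirected adjacency sets; self-loops are ignored."""
    adjacency = {node: set() for node in nodes}
    for a, b in edges:
        if a != b:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


def largest_component_fraction(adjacency):
    """Fraction of nodes that belong to the largest connected component."""
    seen = set()
    largest = 0
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            node = queue.popleft()
            size += 1
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        largest = max(largest, size)
    return largest / len(adjacency) if adjacency else 0.0


def average_clustering(adjacency):
    """Mean local clustering coefficient; nodes of degree < 2 count as 0."""
    total = 0.0
    for neighbors in adjacency.values():
        k = len(neighbors)
        if k < 2:
            continue
        links = sum(1 for a in neighbors for b in neighbors
                    if a < b and b in adjacency[a])
        total += 2 * links / (k * (k - 1))
    return total / len(adjacency) if adjacency else 0.0


# Toy undirected graph standing in for a crawled schema-to-schema graph.
toy = build_adjacency("ABCDEF", [("A", "B"), ("B", "C"), ("A", "C"),
                                 ("C", "D"), ("E", "F")])
print(largest_component_fraction(toy), average_clustering(toy))
```

Applied to the figures on the slide, the same ratio gives 187 / 388, or about 0.48.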
Results • Connectivity indicator ci = 25.4 • Super-critical state • Size of the giant component • 0.47 (derived) • 0.48 (observed: 187 of 388 nodes)
Graphs with the same power-law degree distribution • Varying number of edges
Analyzing weighted networks • Do we have a sufficient number of good mappings? • Introducing quality measures from the mappings • Weights • Attribute / schema level • Cf. Chatty Web (WWW03) • Semantic query forwarding • Per-hop forwarding behaviors • Only forward if $w_i \geq \tau$ • $\tau = 0$: flooding • $\tau = 1$: exact answers
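The per-hop forwarding rule can be sketched as a traversal that only follows mappings whose weight meets a threshold, called `tau` here. The edge list, the weights and the function name are hypothetical; a threshold of 0 reproduces flooding, and a threshold of 1 keeps only mappings of weight 1.

```python
from collections import deque


def reached_schemas(start, weighted_edges, tau):
    """Schemas a query reaches from `start` when forwarding only if w >= tau."""
    adjacency = {}
    for source, target, weight in weighted_edges:
        if weight >= tau:  # per-hop forwarding decision
            adjacency.setdefault(source, []).append(target)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for target in adjacency.get(node, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


# Hypothetical weighted mappings between four schemas.
edges = [("A", "B", 0.9), ("B", "C", 0.5), ("A", "D", 0.2), ("C", "A", 1.0)]

for tau in (0.0, 0.6, 1.0):
    print(tau, sorted(reached_schemas("A", edges, tau)))
```

Raising the threshold removes edges from the graph a query can traverse, which is why the weighted analysis can reuse the same percolation reasoning on the thinned graph.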
Weighted Results • Same degree distribution (388 nodes) • Uniformly distributed weights between 0 and 1
4. Conclusions • Analyzing a real network of bioinformatic databases • Accurate results (even for relatively small networks) • Weighted / unweighted • Current work • Compositions of weights along a path • Semantic random walkers • Public domain simulator • Future work • Analyzing other forwarding behaviors • Implementation in a real PDMS (self-organizing mappings) • GridVine
References
Philippe Cudré-Mauroux and Karl Aberer. A Necessary Condition for Semantic Interoperability in the Large. ODBASE 2004.
Karl Aberer, Philippe Cudré-Mauroux and Tim van Pelt. GridVine: Building Internet-Scale Semantic Overlay Networks. ISWC 2004.
Karl Aberer and Philippe Cudré-Mauroux. Semantic Overlay Networks (Tutorial). VLDB 2005.
… complete reference list at http://lsirpeople.epfl.ch/pcudre/
Thank you for your attention • Questions?