1 / 26

Can RDB2RDF Tools Feasible Expose Large Science Archives for Data Integration?

Can RDB2RDF Tools Feasible Expose Large Science Archives for Data Integration?. Alasdair J G Gray (University of Glasgow now Manchester) Norman Gray (Universities of Leicester and Glasgow) Iadh Ounis (University of Glasgow). ESWC 2009 – Crete 3 June 2009. Outline.

elysia
Download Presentation

Can RDB2RDF Tools Feasible Expose Large Science Archives for Data Integration?

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Can RDB2RDF Tools Feasible Expose Large Science Archives for Data Integration? Alasdair J G Gray (University of Glasgow now Manchester) Norman Gray (Universities of Leicester and Glasgow) IadhOunis (University of Glasgow) ESWC 2009 – Crete 3 June 2009

  2. Outline • Motivation: The Virtual Observatory • Can SPARQL be used to express scientific queries? • Can existing archives be exposed with semantic tools? • Can RDB2RDF tools extract large volumes of data? A.J.G. Gray - ESWC 2009

  3. International Virtual Observatory Alliance “facilitate the international coordination and collaboration necessary for the development and deployment of the tools, systems and organizational structures necessary to enable the international utilization of astronomical archives as an integrated and interoperating virtual observatory.” A.J.G. Gray - ESWC 2009

  4. Searching for Brown Dwarfs • Data sets: • Near Infrared, 2MASS/UK Infrared Deep Sky Survey • Optical, APMCAT/Sloan Digital Sky Survey • Complex colour/motion selection criteria • Similar problems • Halo White Dwarfs A.J.G. Gray - ESWC 2009

  5. Deep Field Surveys • Observations in multiple wavelengths • Radio to X-Ray • Searching for new objects • Galaxies, stars, etc • Requires correlations across many catalogues • ISO • Hubble • SCUBA • etc A.J.G. Gray - ESWC 2009

  6. The Problem Locate and combine relevant data • Heterogeneous publishers • Archive centres • Research labs • Heterogeneous data • Relational • XML • Files Virtual Observatory A.J.G. Gray - ESWC 2009

  7. A Data Integration Approach • Heterogeneous sources • Autonomous • Local schemas • Homogeneous view • Mediated global schema • Mapping • LAV: local-as-view • GAV: global-as-view Query1 Queryn Relies on agreement of a common global schema Global Schema Mappings Wrapper1 Wrapperk Wrapperi DB1 DBk DBi A.J.G. Gray - ESWC 2009

  8. P2P Data Integration Approach • Heterogeneous sources • Autonomous • Local schemas • Heterogeneous views • Multiple schemas • Mappings • From sources to common schema • Between pairs of schema • Require common integration data model Can RDF do this? Query1 Queryn Schemaj Schema1 Mappings Wrapper1 Wrapperk Wrapperi DB1 DBk DBi A.J.G. Gray - ESWC 2009

  9. Resource Description Framework • W3C standard • Designed as a metadata data model • Contains semantic details • Ideal for linking distributed data • Queried through SPARQL IAU:Star rdf:type #Sun #name The Sun #foundIn The Galaxy #name #MilkyWay #name Milky Way rdf:type IAU:BarredSpiral A.J.G. Gray - ESWC 2009

  10. SPARQL • Declarative query language • Select returned data • Graph or tuples • Attributes to return • Describe structure of desired results • Filter data • W3C standard • Syntactically similar to SQL • Should be easy for scientists to learn! A.J.G. Gray - ESWC 2009

  11. Integrating Using RDF • Data resources • Expose schema and data as RDF • Need a SPARQL endpoint • Allows multiple • Access models • Storage models • Easy to relate data from multiple sources Common Model (RDF) SPARQL query We will focus on exposing relational data sources Mappings RDF / Relational Conversion RDF / XML Conversion Relational DB XML DB A.J.G. Gray - ESWC 2009

  12. RDB2RDF Extract-Transform-Load Query-driven Conversion Data stored as relations Native SQL query support Highly optimised access methods SPARQL queries must be translated Existing translation systems D2RQ SquirrelRDF • Data replicated as RDF • Data can become stale • Native SPARQL query support • Limited optimisation mechanisms Existing RDF stores • Jena • Seasame A.J.G. Gray - ESWC 2009

  13. System Hypothesis Is it viable to perform query-driven conversions to facilitate data access from a data model that a scientist is familiar with? Can RDB2RDF tools feasibly expose large science archives for data integration? Common Model (RDF) SPARQL query SPARQL query Mappings RDB2RDF RDF / XML Conversion Relational DB XML DB A.J.G. Gray - ESWC 2009

  14. Astronomical Test Data Set • SuperCOSMOS Science Archive (SSA) • Data extracted from scans of Schmidt plates • Stored in a relational database • About 4TB of data, detailing 6.4 billion objects • Fairly typical of astronomical data archives • Schema designed using 20 real queries • Personal version contains • Data for a specific region of the sky • About 0.1% of the data • About 500MB A.J.G. Gray - ESWC 2009

  15. Analysis of Test Data • Using personal version • About 500MB in size (similar size to related work) • Organised in 14 Relations • Number of attributes: 2 – 152 • 4 relations with more than 20 attributes • Number of rows: 3 – 585,560 • Two views • Complex selection criteria in views Makes this different from business cases and previous work! A.J.G. Gray - ESWC 2009

  16. IsSPARQL expressive enough? Can the 20 sample queries be expressed in SPARQL? A.J.G. Gray - ESWC 2009

  17. Real Science Queries Query 5: Find the positions and (B,R,I) magnitudes of all star-like objects within delta mag of 0.2 of the colours of a quasar of redshift 2.5 < z < 3.5 SPARQL: SELECT ?ra ?decl ?sCorMagB ?sCorMagR2 ?sCorMagI WHERE {<bindings> FILTER (?sCorMagB – ?sCorMagR2 >= 0.05 && ?sCorMagB - ?sCorMagR2 <= 0.80) FILTER (?sCorMagR2 – ?sCorMagI >= -0.17 && ?sCorMagR2 - ?sCorMagI <= 0.64)} SQL: SELECT ra, dec, sCorMagB, sCorMagR2, sCorMagI FROM ReliableStars WHERE (sCorMagB-sCorMagR2 BETWEEN 0.05 AND 0.80) AND (sCorMagR2-sCorMagI BETWEEN -0.17 AND 0.64) A.J.G. Gray - ESWC 2009

  18. Analysis of Test Queries A.J.G. Gray - ESWC 2009

  19. Expressivity of SPARQL Features Limitations Range shorthands Arithmetic in head Math functions Trigonometry functions Sub queries Aggregate functions Casting • Select-project-join • Arithmetic in body • Conjunction and disjunction • Ordering • String matching • External function calls (extension mechanism) A.J.G. Gray - ESWC 2009

  20. Analysis of Test Queries Expressible queries: 1, 2, 3, 5, 6, 14, 15, 17, 19 A.J.G. Gray - ESWC 2009

  21. Can RDB2RDF tools feasibly expose large science archives for data integration? A.J.G. Gray - ESWC 2009

  22. Experiment • Time query evaluation • 5 out of 20 queries used • No joins • Systems compared: • Relational DB (Base line) • MySQL v5.1.25 • RDB2RDF tools • D2RQ v0.5.2 • SquirrelRDF v0.1 • RDF Triple stores • Jena v2.5.6 (SDB) • Sesame v2.1.3 (Native) SPARQL query SPARQL query SQL query RDB2RDF Relational DB Triple store Relational DB A.J.G. Gray - ESWC 2009

  23. Experimental Configuration • 8 identical machines • 64 bit Intel Quad Core Xeon 2.4GHz • 4GB RAM • 100 GB Hard drive • Java 1.6 • Linux • 10 repetitions A.J.G. Gray - ESWC 2009

  24. Performance Results 485,932 372,561 21,492 17,793 19,984 3,450 5,339 2,733 7,229 4,090 1,307 7,468 ms 1 A.J.G. Gray - ESWC 2009

  25. Conclusions • SPARQL not expressive enough for real astronomical queries • RDBMS benefits from 30+ years research • Query optimisation • Indexes • RDF stores are improving • Require existing data to be replicated • RDB2RDF tools show promise • Need to exploit relational database A.J.G. Gray - ESWC 2009

  26. Can RDB2RDF Tools Feasible Expose Large Science Archives for Data Integration? Not currently! We need more work on query translation… A.J.G. Gray - ESWC 2009

More Related