Linked Open GeoData Management in the Cloud

This publication discusses the management of Linked Open GeoData in the cloud, including infrastructure development (IaaS), software development (PaaS), and data querying (SaaS). It explores concepts such as Data as a Service (DaaS) and Linked Data, and provides solutions for publishing, querying, and updating Linked Data. The architecture of the proposed system is scalable and elastic, leveraging cloud facilities for improved data management and processing.

Presentation Transcript


  1. Linked Open GeoData Management in the Cloud • K. Kritikos, Y. Roussakis (ICS-FORTH) • D. Kotzinos (ICS-FORTH & TEI of Serres)

  2. Cloud Computing • Better (faster, reliable, etc.) infrastructure – IaaS • Development infrastructure – PaaS • Software infrastructure – SaaS

  3. Cloud Computing • Data as a Service (DaaS): data • Publication • Querying • Updating

  4. Linked (Open) Data as a Service • Publishing Linked Data: URI construction, conceptual model, storage as RDF files or SPARQL endpoints • Querying Linked Data: SPARQL, GeoSPARQL • Updating Linked Data: SPARUL, synchronization with original sources

  5. Problem Introduction (I) • INGeoCloudS FP7 Pilot B Project (www.ingeoclouds.eu) • Geophysical data from different sources and in different formats (Excel, XML, relational, nothing …) • Borehole and Groundwater Analysis • Boreholes located in Mygdonia/Thriasio in Greece, and across the whole country in Denmark and France, together with their features (static data over time) • Chemical analyses of groundwater sampled from these boreholes (data updated over time) • Earthquake events and features • Landslides

  6. Data granularity • Data refer to different levels of granularity, e.g. susceptibility maps refer to a country-wide area while earthquakes or boreholes are point-level data • Data might need to be aggregated, but such aggregation is based on the spatial dimension, i.e. points contained within a polygon • Some aggregation problems remain, since phenomena outside the area of concern may affect it, so spatial aggregation might not be enough
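
The containment-based aggregation mentioned on this slide can be illustrated with a minimal Python sketch. It assumes planar coordinates and a plain ray-casting point-in-polygon test; the function names and the sample data are hypothetical and do not come from the presentation.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]


def point_in_polygon(point: Point, polygon: List[Point]) -> bool:
    """Ray-casting test; points exactly on the boundary are not handled specially."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray starting at the point cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def aggregate_in_area(
    measurements: List[Tuple[Point, float]], polygon: List[Point]
) -> Tuple[int, Optional[float]]:
    """Return (count, mean value) over the point measurements inside the polygon."""
    values = [v for (p, v) in measurements if point_in_polygon(p, polygon)]
    if not values:
        return 0, None
    return len(values), sum(values) / len(values)
```

As the slide notes, a sketch like this ignores phenomena just outside the polygon, which is exactly why containment alone may not be enough.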

  7. Landslides: • Which area is affected, and how much? • How does this change over time? • Is the earthquake effect cumulative, or does it fade over time? • Earthquakes: • How far back in time should we go? • What information should be kept/would be relevant? • How should we query the repository to get the relevant information?

  8. Problem Introduction (II) • Data/Metadata Standards • INSPIRE standard proposes a generic conceptual schema for scientific data + models for 34 spatial data themes • Dealing with geospatial data & maintaining schemas/ontologies becomes difficult • Challenge is to exploit semantic heterogeneity • Need to offer a seamless & transparent LOD-as-a-Service (LODaaS) way to manage LOD • Lack of tools for mapping, transforming & synchronizing geo-spatial LD • Generic LOD management independent of the way LOD are stored

  9. Points of interest • GeoData get bigger and more important • Used in a variety of applications in different fields • Size & high demand impose considerable requirements on infrastructure storage size & compute power • Need to be reused and linked with other data sets • Go beyond current Web paradigm of isolated data silos • Current work on geo-spatial open data management does not address this • Cloud-based approaches: • Do not provide geo-spatial support • Some do not fully support SPARQL or offer SPARQL end-points • Centralized approaches offer geo-spatial support but: • Do not enable automatic mapping between relational and RDF data • Worse performance in general (with the exception of Strabon wrt geo-spatial query support)

  10. Proposed Solution (I) • A specific set of LODaaS services for geo-spatial LOD publishing, integration & querying • The cloud offers scalability & elasticity of computation, 24/7 availability & multiple data storage and integration offerings • Our cloud-based service-oriented system: • Exhibits good LOD management performance • Exposes a LOD management service that abstracts away RDF store peculiarities & provides a generic way for LOD access and management

  11. Proposed Solution (II) • A particular solution is adopted for mapping geo-spatial data in different formats to RDF data • The latter conform to extensible conceptual models that accurately capture thematic areas and are integrated via GeoScientific Observation Model • This allows posing queries across providers and thematic fields • Our solution is part of the system developed in the context of the InGeoCloudS project, which exploits cloud capabilities & LD technology to integrate & store heterogeneous geo-spatial data sets of different thematic fields + host & execute applications that exploit these data sets

  12. Architecture (I) • System is scalable and elastic by exploiting cloud facilities • An extensive application pool can be built on top that exploits the offered services to perform various added-value and highly demanding tasks: • LO GeoData visualization, discovery & composition of data sets, LO GeoData analytics • System could be extended to host such applications & offer various (geo-spatial) LO GeoData processing services and pre-built applications

  13. Architecture (II) • Distributor: distributes generic queries equally & collects back the results; non-generic queries are sent to the instances holding the appropriate data; data distribution is achieved by assigning new data to the scaling layer that is least loaded wrt storage space; exploits CPU monitoring & elasticity facilities of Amazon • Scaling Layer: comprises one or more LOD management components; data are replicated across these components to enhance reliability & enable layer-based load balancing • LOD Management Component: comprises a LOD Management Service (LMS) instance & a Virtuoso server for storage • LMS: provides methods for data providers to manage LOD & for other users to query & export the stored LOD • Virtuoso: underlying RDF triple store, which also allows mapping & synchronization between relational and RDF data
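
The Distributor's routing rules can be sketched in Python as follows. This is only an illustration under stated assumptions: round-robin is one possible reading of "equally distributes", and the class names, fields and storage figures are hypothetical, not the project's actual implementation.

```python
from dataclasses import dataclass, field
from itertools import cycle
from typing import List


@dataclass
class ScalingLayer:
    """Hypothetical stand-in for one scaling layer and the graphs it stores."""
    name: str
    used_storage_gb: float = 0.0
    graphs: List[str] = field(default_factory=list)


class Distributor:
    def __init__(self, layers: List[ScalingLayer]):
        self.layers = layers
        self._round_robin = cycle(layers)

    def route_generic_query(self) -> ScalingLayer:
        # Assumption: generic queries are spread evenly via round-robin.
        return next(self._round_robin)

    def route_specific_query(self, graph: str) -> List[ScalingLayer]:
        # Non-generic queries go only to layers that hold the requested data.
        return [layer for layer in self.layers if graph in layer.graphs]

    def assign_new_data(self, graph: str, size_gb: float) -> ScalingLayer:
        # New data goes to the layer with the least storage in use.
        target = min(self.layers, key=lambda layer: layer.used_storage_gb)
        target.graphs.append(graph)
        target.used_storage_gb += size_gb
        return target
```

Replication across the components inside a layer, which the slide also mentions, is deliberately left out of this sketch.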

  14. General Query Evaluation Behavior [Figure: response time plotted against time passed, marking the involvement of the 2nd instance]

  15. LOD Integration & Publishing (I) • Extension of the high-level CIDOC-CRM conceptual model • New model is called Geo-Scientific Spatial Observation Model (GSOM) & is expressed in RDF/S • It enables capturing all information coming from different fields & countries + linking data across different providers • INSPIRE was not exploited as it did not cover all requirements: • Capturing of scientific events • Complicated and cumbersome for information integration • In some cases, does not cover all the information required by the data providers in particular thematic fields • GSOM-to-INSPIRE mapping specification to enable exporting INSPIRE-compliant data

  16. LOD Integration & Publishing (II) • Two alternatives for publishing LOD • First alternative: • Create and import RDF-based descriptions of data sets via a particular LMS method • Data update process must be controlled by performing SPARUL updates via a particular LMS method • It is the data provider's responsibility to keep relational & RDF data synchronized • Perfect synchronization may also not be required, as it may incur costs -> second alternative becomes preferable

  17. LOD Integration & Publishing (III) • Second alternative: • Data provider publishes the relational data of his/her data sets + provides a mapping file in R2RML to enable the synchronization of relational to RDF data (by executing an LMS method) • System takes care of this synchronization • Relational storage as it has been used for many years + additional RDF storage for the data, with automatic one-way synchronization between the two • Provider should have a good knowledge of GSOM & RDF

  18. LOD Integration & Publishing (IV) • R2RML: • W3C recommendation since 2012 • Can specify customized mappings between RDB & RDF data • An R2RML specification is just an RDF graph in Turtle • No specific implementation is imposed • Virtuoso supports R2RML by processing the R2RML specification & creating the respective RDB2RDF triggers (used for creating/updating RDF data from relational ones) • Either an RDF view or a physical RDF graph can be created, with the second option leading to far better performance
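
To make the idea of an R2RML specification as an RDF graph in Turtle more concrete, here is a minimal sketch that generates such a mapping for a single table. The `rr:` terms are standard R2RML vocabulary; the table and column names, the `gsom:` namespace URI, the `gsom:Borehole` class name and the `example.org` URI template are hypothetical placeholders rather than the project's actual mapping. The Turtle is built as a Python string only so that all examples stay in one language.

```python
def borehole_mapping(table: str, id_column: str, name_column: str) -> str:
    """Return an R2RML mapping (Turtle) for one relational table.

    Each row becomes one resource whose URI is built from the id column
    and whose rdfs:label is taken from the name column.
    """
    return f"""\
@prefix rr: <http://www.w3.org/ns/r2rml#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix gsom: <http://example.org/gsom#> .

<#BoreholeMap> a rr:TriplesMap ;
    rr:logicalTable [ rr:tableName "{table}" ] ;
    rr:subjectMap [
        rr:template "http://example.org/borehole/{{{id_column}}}" ;
        rr:class gsom:Borehole
    ] ;
    rr:predicateObjectMap [
        rr:predicate rdfs:label ;
        rr:objectMap [ rr:column "{name_column}" ]
    ] .
"""


if __name__ == "__main__":
    print(borehole_mapping("BOREHOLE", "ID", "NAME"))
```

A string like this is the kind of input a method such as meta_addMappings would receive; how Virtuoso turns it into triggers is outside the scope of the sketch.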

  19. R2RML [Figure: publication of the Borehole relational model (RDB) as GSOM data via R2RML, with synchronization. GSOM classes shown: E26.Physical_Feature, S15.Acquifer_Concept, S16.Borehole, S2.SampleTaking, S13.Sample, E41.Appelation (Borehole_Name), E42.Identifier (Sample_ID), E54.Dimension (Waterlevel). Properties shown: O4.sampled_from, O5.removed, P121F.overlaps_with, P43F.has_dimension, P1F.is_identified_by. Other label: Intake. URI identification: http://orgURL/SampleID/XYZ]

  20. LOD Management Service (I) • REST-based service with an API exposing all the management functionality needed by geo-spatial LOD users • Abstracts away the peculiarities of RDF triple stores • Enables simple & intuitive use of a specific set of LOD management methods • Programmatic or form-based access to methods • Production of query results in different forms, such as WKT, GML, & KML • Importing/exporting capabilities in different formats (RDF/XML, NTriples, Turtle)

  21. LOD Management Service (II) • The provided methods are: • meta_query (SPARQL string, timeout (opt.), row limit (opt.)): user-requested format (e.g., JSON) • meta_update (SPARUL string, baseURI, timeout (opt.), row limit (opt.)) • meta_addMappings(R2RML string, graphURI) -> initiates mapping procedure • meta_export(graphURI, subjURI, predURI, objURI, internal): user-requested format -> last param indicates whether the result is returned inline in the response • meta_import(url, graphUri, format, blocking): ImportStatus -> RDF data are imported by downloading them via the provided URL or inline in the user request; the method can be blocking or non-blocking • import_status(importID): ImportStatus -> in case of a non-blocking import request, the user can inquire about the status of his/her import by passing the importID value returned by meta_import to this method
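
As a rough illustration of programmatic access to a method such as meta_query, the sketch below only builds the request URL and performs no network call. The base URL, the URL layout and the query parameter names (`query`, `format`, `timeout`, `limit`) are assumptions made for this example; the slides do not specify them.

```python
from typing import Optional
from urllib.parse import urlencode


def build_meta_query_url(
    base_url: str,
    sparql: str,
    timeout: Optional[int] = None,
    row_limit: Optional[int] = None,
    fmt: str = "json",
) -> str:
    """Build a GET URL for a hypothetical meta_query endpoint."""
    params = {"query": sparql, "format": fmt}
    # Optional arguments are sent only when the caller provides them.
    if timeout is not None:
        params["timeout"] = timeout
    if row_limit is not None:
        params["limit"] = row_limit
    return f"{base_url.rstrip('/')}/meta_query?{urlencode(params)}"


if __name__ == "__main__":
    print(
        build_meta_query_url(
            "http://lms.example.org/api",
            "SELECT * WHERE { ?s ?p ?o }",
            row_limit=10,
        )
    )
```

The actual parameter names and paths would have to be taken from the service's Enunciate-generated documentation.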

  22. LOD Management Service (III) • Each method accessible via specific URL + produces meaningful exception messages (e.g., in case user input is wrong) • User-friendly HTML Documentation produced via Enunciate • Implementation exploited Sesame RDF Data Management API, Virtuoso’s JDBC Driver & Jersey

  23. Open Issues (I) • Model: • Extend it to capture other thematic fields • Data published in our system could fulfill all requirements to be 5-star LOD if the respective owners decide to do so • Data mapping: • Cloud-based Virtuoso version supports native Relational DB for RDB2RDF synchronization • Trade-off between LOD management completeness & cost • Mapping tools are needed to allow visual editing of R2RML without requiring data providers to have good knowledge of RDF • Research issue: support bi-directional RDB2RDF mappings

  24. Open Issues (II) • Geo-spatial query support: • Virtuoso does not support GeoSPARQL • Virtuoso has limited geo-spatial query support only in commercial versions • 2D geometries + limited set of topological relation operators • Additional support in terms of geometry dimensionality + feature aggregation operators • Could extend Virtuoso via frameworks, such as uSeekM, which provide adequate geo-spatial support along with the capability of evaluating GeoSPARQL queries • Such solutions require processing all stored RDF data to create geo-spatial indices as well as deploying another DB -> do not fit well with automatic geo-spatial LOD management • Could resolve the problem by: (a) performing re-indexing at infrequent time intervals, (b) creating specialized triggers that start re-indexing only when RDF data are updated

  25. Open Issues (III) • Quality & provenance: • Original input data sets may not have the appropriate quality -> resulting RDF data can have the same or lower quality level • Proposed infrastructure must be extended with quality resolving procedures & methods (e.g., data cleansing methods for correcting the data exploited) • Provenance information can ensure the correct updating of LD + assist in LD reasoning process by deriving additional facts • Thus, provenance information should be exploited by our system, especially if we consider that such exploitation is not enabled by most LOD management systems

  26. Conclusions • Proposed a scalable, geo-spatial LOD-as-a-Service management system deployed on the Amazon cloud • Distributes query load + scales up/down when CPU utilization crosses specific thresholds • Exposes REST-based service with LOD management methods • Provides two different ways for publishing open geo-spatial data sets • Advance geo-spatial support level by following two directions: • Realize GSOM-to-INSPIRE mapping to enable producing INSPIRE-compliant data • Extend Virtuoso with geo-spatial indexing & query systems to enable the efficient processing of rich & expressive geo-spatial queries, expressed either in SPARQL or GeoSPARQL
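
The threshold-based scaling mentioned in the conclusions can be sketched as a small decision function. The threshold values and instance limits below are hypothetical defaults chosen for illustration; the presentation does not state the actual figures, and in the real system this decision is delegated to Amazon's monitoring and elasticity facilities.

```python
def scaling_decision(
    cpu_percent: float,
    instances: int,
    upper: float = 70.0,
    lower: float = 20.0,
    min_instances: int = 1,
    max_instances: int = 4,
) -> int:
    """Return the desired number of instances for the observed CPU utilization."""
    # Scale up by one instance when utilization exceeds the upper threshold.
    if cpu_percent > upper and instances < max_instances:
        return instances + 1
    # Scale down by one instance when utilization drops below the lower threshold.
    if cpu_percent < lower and instances > min_instances:
        return instances - 1
    # Otherwise keep the current size.
    return instances
```

Keeping a gap between the two thresholds avoids repeatedly adding and removing an instance when utilization hovers around a single value.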
