Linked Open GeoData Management in the Cloud

K. Kritikos, Y. Roussakis ICS-FORTH

D. Kotzinos ICS-FORTH & TEI of Serres


Cloud Computing

Better (faster, more reliable, etc.) infrastructure - IaaS

Development infrastructure - PaaS

Software infrastructure - SaaS


Cloud Computing

Data as a Service (DaaS):

  • Publication

  • Querying

  • Updating


Linked (Open) Data as a Service

  • Publishing Linked Data

    • URI construction

    • Conceptual Model

    • Storage as RDF files or SPARQL endpoints

  • Querying Linked Data

    • SPARQL

    • GeoSPARQL

  • Updating Linked Data

    • SPARUL (SPARQL/Update)

    • Synchronization with original sources


Problem Introduction (I)

  • InGeoCloudS FP7 Pilot B Project (www.ingeoclouds.eu)

  • Geophysical data from different sources and in different formats (Excel, XML, relational, no format at all …)

  • Borehole and Groundwater Analysis

    • Boreholes and their features, located in Mygdonia/Thriasio in Greece and across the whole country in Denmark and France (data static over time)

    • Chemical analyses of groundwater sampled from these boreholes (data updated over time)

  • Earthquake events and features

  • Landslides


Data granularity

Data refer to different levels of granularity, e.g. susceptibility maps refer to a country-wide area while earthquakes or boreholes are point-level data

Data might need to be aggregated, but such aggregation is based on the spatial dimension, i.e. points contained within a polygon

Some problems of aggregation do exist since phenomena outside the area of concern may affect it, so spatial aggregation might not be enough
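The spatial aggregation described above (points contained within a polygon) can be sketched as follows. This is only an illustrative sketch using the standard ray-casting test; the coordinates, event names and the helper names `point_in_polygon` and `aggregate_by_area` are hypothetical and do not come from the presentation.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def point_in_polygon(point: Point, polygon: List[Point]) -> bool:
    """Ray-casting test: count the polygon edges crossed by a ray going right from the point."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x coordinate at which the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def aggregate_by_area(events, polygon):
    """Return the names of the point-level events that fall inside the polygon."""
    return [name for name, pt in events if point_in_polygon(pt, polygon)]


# Hypothetical data: a square area of concern and three point-level events.
area = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
events = [
    ("borehole_a", (1.0, 1.0)),
    ("quake_b", (5.0, 2.0)),
    ("quake_c", (3.5, 3.9)),
]
in_area = aggregate_by_area(events, area)
```

Here `quake_b` is dropped because it lies just outside the polygon, even though it could still affect the area. This is the limitation of purely spatial aggregation noted above.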


  • Landslides:

    • Which area is affected, and how much?

    • How does this change over time?

    • Is the earthquake effect cumulative, or does it fade over time?

  • Earthquakes:

    • How far back in time should we go?

    • What information should be kept/would be relevant?

    • How should we query the repository to get the relevant information?


Problem Introduction (II)

  • Data/Metadata Standards

    • INSPIRE standard proposes generic conceptual schema for scientific data + models for 34 spatial data themes

  • Dealing with geospatial data & maintaining schemas/ontologies becomes difficult

    • Challenge is to exploit semantic heterogeneity

  • Need to offer a seamless & transparent LOD-as-a-Service (LODaaS) way to manage LOD

    • Lack of tools for mapping, transforming & synchronizing geo-spatial LD

    • Generic LOD management, independent of the way LOD are stored


Points of interest

  • GeoData get bigger and more important

    • Used in a variety of applications in different fields

  • Size & high demand impose considerable requirements on infrastructure storage size & compute power

  • Need to be reused and linked with other data sets

    • Go beyond current Web paradigm of isolated data silos

  • Current work on geo-spatial open data management does not offer this

    • Cloud-based approaches:

      • Do not provide geo-spatial support

      • Some do not fully support SPARQL or offer SPARQL end-points

    • Centralized approaches offer geo-spatial support but:

      • Do not enable automatic mapping between relational and RDF data

      • Worse performance in general (with the exception of Strabon w.r.t. geo-spatial query support)


Proposed Solution (I)

  • A specific set of LODaaS services for geo-spatial LOD publishing, integration & querying

  • The cloud offers scalability & elasticity of computation, 24/7 availability & multiple data storage and integration offerings

  • Our cloud-based service-oriented system:

    • Exhibits good LOD management performance

    • Exposes a LOD management service that abstracts away RDF Store peculiarities & provides a generic way for LOD access and management


Proposed Solution (II)

  • A particular solution is adopted for mapping geo-spatial data in different formats to RDF data

  • The latter conform to extensible conceptual models that accurately capture thematic areas and are integrated via the Geo-Scientific Spatial Observation Model (GSOM)

    • This allows posing queries across providers and thematic fields

  • Our solution is part of the system, developed in the context of the InGeoCloudS project, that exploits cloud capabilities & LD technology to integrate & store heterogeneous geo-spatial data sets of different thematic fields + host & execute applications that exploit these data sets


Architecture (I)

  • System is scalable and elastic by exploiting cloud facilities

  • An extensive application pool can be built on top that exploits the offered services to perform various added-value and highly demanding tasks:

    • LO GeoData visualization, discovery & composition of data-sets, LO GeoData analytics

    • System could be extended to host such applications & offer various (geo-spatial) LO GeoData processing services and pre-built applications


Architecture (II)

  • Distributor:

    • Distributes generic queries equally & collects back the results

    • Sends non-generic queries to the instances holding the appropriate data

    • Distributes data by assigning new data to the scaling layer that is least loaded w.r.t. storage space

    • Exploits the CPU monitoring & elasticity facilities of Amazon

  • Scaling Layer: comprises one or more LOD management components, data are replicated across these components to enhance reliability & enable layer-based load balancing

  • LOD Management Component: comprises LOD Management Service (LMS) instance & Virtuoso server for storage

  • LMS: provides methods for data providers to manage LOD & for other users to query & export the LOD stored

  • Virtuoso: underlying RDF triple store also allowing the mapping & synchronization between relational and RDF data
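The Distributor's routing rules can be illustrated with a small sketch. It is not the project's implementation: the class and method names are hypothetical, round-robin is assumed as one possible way to distribute generic queries equally, and storage is counted in arbitrary units.

```python
from dataclasses import dataclass, field
from itertools import cycle
from typing import List, Set


@dataclass
class ScalingLayer:
    """Hypothetical stand-in for one scaling layer and the data it holds."""
    name: str
    used_storage: int = 0                      # arbitrary units
    graphs: Set[str] = field(default_factory=set)


class Distributor:
    def __init__(self, layers: List[ScalingLayer]):
        self.layers = layers
        self._round_robin = cycle(layers)      # assumption: equal = round-robin

    def route_generic(self) -> ScalingLayer:
        """Generic queries are spread equally over all layers."""
        return next(self._round_robin)

    def route_specific(self, graph: str) -> List[ScalingLayer]:
        """Non-generic queries go only to layers holding the requested graph."""
        return [layer for layer in self.layers if graph in layer.graphs]

    def place_data(self, graph: str, size: int) -> ScalingLayer:
        """New data go to the layer with the least used storage space."""
        target = min(self.layers, key=lambda layer: layer.used_storage)
        target.graphs.add(graph)
        target.used_storage += size
        return target


layers = [ScalingLayer("layer-1", used_storage=50),
          ScalingLayer("layer-2", used_storage=10)]
distributor = Distributor(layers)
placed = distributor.place_data("http://example.org/graph/boreholes", size=5)
order = [distributor.route_generic().name for _ in range(3)]
holders = distributor.route_specific("http://example.org/graph/boreholes")
```

The sketch leaves out replication inside a scaling layer and the CPU-based elasticity, which in the described system rely on Amazon's monitoring facilities.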


  • General Query Evaluation Behavior (figure: response time plotted against time passed, with the involvement of the 2nd instance marked)


    LOD Integration & Publishing (I)

    • Extension of the high-level CIDOC-CRM conceptual model

    • New model is called Geo-Scientific Spatial Observation Model (GSOM) & expressed in RDF/S

    • It enables capturing all information coming from different fields & countries + linking data across different providers

    • INSPIRE was not exploited, as it did not cover all requirements:

      • Capturing of scientific events

      • Complicated and cumbersome for information integration

      • In some cases, does not cover all appropriate information required by the data providers in particular thematic fields

    • GSOM-to-INSPIRE mapping specification to enable exporting INSPIRE-compliant data


    LOD Integration & Publishing (II)

    • Two alternatives for publishing LOD:

    • Create and import RDF-based descriptions of data-sets via a particular LMS method

      • Data update process must be controlled by performing SPARUL updates via a particular LMS method

      • It is the data provider's responsibility to keep relational & RDF data synchronized

        • Perfect synchronization may also not be required, as it may incur costs -> second alternative becomes preferable


    LOD Integration & Publishing (III)

    • Data provider publishes relational data of his/her data sets + provides a mapping file in R2RML to enable the synchronization of relational to RDF data (by executing LMS method)

      • System takes care of this synchronization

      • Relational storage is kept as it has been used for many years + additional RDF storage for the data, with automatic one-way synchronization between the two

      • Provider should have a good knowledge of GSOM & RDF


    LOD Integration & Publishing (IV)

    • R2RML:

      • W3C recommendation since 2012

      • Can specify customized mappings between RDB & RDF data

      • An R2RML mapping specification is just an RDF graph written in Turtle

      • No specific implementation is imposed

    • Virtuoso supports R2RML by processing the R2RML specification & creating the respective RDB2RDF triggers (used for creating/updating RDF data from relational ones)

      • An RDF view or a physical RDF graph can be created, with the second option leading to far better performance
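To make the idea of a relational-to-RDF mapping concrete, here is a conceptual sketch in Python of what an R2RML-style mapping does: a subject URI template plus one predicate per column, applied to each row. It is not R2RML itself (which is written in Turtle and processed by Virtuoso), and the predicate URIs and sample rows are hypothetical; only the column names and the URI pattern echo the borehole example in the slides.

```python
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical mapping in the spirit of an R2RML triples map:
# a subject URI template plus one predicate per relational column.
MAPPING = {
    "subject_template": "http://orgURL/SampleID/{Sample_ID}",
    "predicates": {
        "Borehole_Name": "http://example.org/gsom#borehole_name",  # assumed URI
        "Waterlevel": "http://example.org/gsom#waterlevel",        # assumed URI
    },
}


def rows_to_triples(rows: Iterable[Dict[str, object]], mapping: dict) -> List[Triple]:
    """Turn each relational row into (subject, predicate, literal) triples."""
    triples: List[Triple] = []
    for row in rows:
        subject = mapping["subject_template"].format(**row)
        for column, predicate in mapping["predicates"].items():
            value = row.get(column)
            if value is not None:              # NULL columns produce no triple
                triples.append((subject, predicate, str(value)))
    return triples


# Hypothetical relational rows.
rows = [
    {"Sample_ID": "XYZ", "Borehole_Name": "B-17", "Waterlevel": 12.5},
    {"Sample_ID": "ABC", "Borehole_Name": "B-18", "Waterlevel": None},
]
triples = rows_to_triples(rows, MAPPING)
```

Re-running the function on changed rows corresponds loosely to the one-way synchronization from relational to RDF data; in the described system this is handled by triggers inside Virtuoso rather than by application code.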


    [Figure: R2RML mapping between the Borehole Relational Model (RDB) and GSOM, used for publication and synchronization]

    • GSOM classes shown: E26.Physical_Feature, S15.Acquifer_Concept, S16.Borehole, S2.SampleTaking, S13.Sample, E41.Appelation, E42.Identifier, E54.Dimension

    • GSOM properties shown: O4.sampled_from, O5.removed, P121F.overlaps_with, P43F.has_dimension, P1F.is_identified_by

    • Relational fields shown: Borehole_Name, Sample_ID, Waterlevel, Intake

    • URI Identification: http://orgURL/SampleID/XYZ


    LOD Management Service (I)

    • REST-based service with API exposing all appropriate management functionality needed by geo-spatial LOD users

      • Abstracts away from peculiarities of RDF triple stores

      • Enables simple & intuitive use of a specific set of LOD management methods

      • Programmatic or form-based access to methods

      • Production of query results in different forms, such as WKT, GML, & KML

      • Importing/exporting capabilities in different formats (RDF/XML, N-Triples, Turtle)


    LOD Management Service (II)

    • The provided methods are:

      • meta_query (SPARQL string, timeout (opt.), row limit (opt.)): user-requested format (e.g., JSON)

      • meta_update (SPARUL string, baseURI, timeout (opt.), row limit (opt.))

      • meta_addMappings(R2RML string, graphURI) -> initiates mapping procedure

      • meta_export(graphURI, subjURI, predURI, objURI, internal): user-requested format -> last param indicates if result will be inline in the response

      • meta_import(url, graphUri, format, blocking): ImportStatus -> RDF data are imported by downloading them via provided URL or inline in user-request + method can be blocking or non-blocking

      • import_status(importID): ImportStatus -> in case of a non-blocking import request, the user can inquire about the status of his/her import by passing the value of a specific field (importID), returned by the previous method, as input to this method
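A client call to one of these methods might be prepared as in the sketch below. The slides do not give the service URL, the HTTP verb or the parameter names, so the base URL `http://lms.example.org/rest`, the use of GET and the names `query`, `timeout` and `limit` are all assumptions made purely for illustration.

```python
from typing import Optional
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical base URL; the real endpoint is not given in the slides.
LMS_BASE = "http://lms.example.org/rest"


def build_meta_query(sparql: str,
                     timeout: Optional[int] = None,
                     row_limit: Optional[int] = None,
                     accept: str = "application/json") -> Request:
    """Prepare (but do not send) a meta_query request.

    The parameter names "query", "timeout" and "limit" are assumptions.
    """
    params = {"query": sparql}
    if timeout is not None:                    # optional, as in the method list
        params["timeout"] = str(timeout)
    if row_limit is not None:                  # optional, as in the method list
        params["limit"] = str(row_limit)
    url = f"{LMS_BASE}/meta_query?{urlencode(params)}"
    # The Accept header stands in for the user-requested result format.
    return Request(url, headers={"Accept": accept})


request = build_meta_query("SELECT ?s WHERE { ?s ?p ?o } LIMIT 5", timeout=30)
```

Passing `request` to `urllib.request.urlopen` would send it; the sketch stops before that step because the endpoint is made up.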


    LOD Management Service (III)

    • Each method accessible via specific URL + produces meaningful exception messages (e.g., in case user input is wrong)

    • User-friendly HTML Documentation produced via Enunciate

    • Implementation exploited Sesame RDF Data Management API, Virtuoso’s JDBC Driver & Jersey


    Open Issues (I)

    • Model:

      • Extend it to capture other thematic fields

      • Data published in our system could fulfill all requirements to be 5-star LOD if respective owners decide to do so

    • Data mapping:

      • Cloud-based Virtuoso version supports native Relational DB for RDB2RDF synchronization

        • Trade-off between LOD management completeness & cost

      • Mapping tools are needed to allow visual editing of R2RML without requiring data providers to have good knowledge of RDF

      • Research issue: support bi-directional RDB2RDF mappings


    Open Issues (II)

    • Geo-spatial query support:

      • Virtuoso does not support GeoSPARQL

      • Virtuoso has limited geo-spatial query support only in commercial versions

        • 2D geometries + limited set of topological relation operators

      • Additional support is needed in terms of geometry dimensionality + feature aggregation operators

        • Could extend Virtuoso via frameworks, such as uSeekM, which provide adequate geo-spatial support along with the capability of evaluating GeoSPARQL queries

          • Such solutions require processing all stored RDF data to create geo-spatial indices, as well as deploying another DB -> do not fit well with automatic geo-spatial LOD management

          • Could resolve the problem by: (a) performing re-indexing at infrequent time intervals, or (b) creating specialized triggers that start re-indexing only when RDF data are updated
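The two re-indexing options can be sketched in a few lines. The slides present (a) and (b) as alternatives; the sketch below combines them (rebuild only if data changed and a minimum interval has passed), which is a design assumption of this example. The class `GeoIndexManager` and its methods are hypothetical and not part of Virtuoso or uSeekM.

```python
class GeoIndexManager:
    """Hypothetical helper deciding when a geo-spatial index needs rebuilding."""

    def __init__(self, min_interval_s: float):
        self.min_interval_s = min_interval_s   # (a) re-index at most this often
        self.dirty = False                     # (b) set only when RDF data change
        self.last_reindex = float("-inf")
        self.reindex_count = 0

    def on_rdf_update(self) -> None:
        """Called by an update trigger whenever RDF data are modified."""
        self.dirty = True

    def maybe_reindex(self, now: float) -> bool:
        """Rebuild only if data changed and the minimum interval has elapsed."""
        if self.dirty and now - self.last_reindex >= self.min_interval_s:
            self._rebuild()
            self.last_reindex = now
            self.dirty = False
            return True
        return False

    def _rebuild(self) -> None:
        # Placeholder for the expensive pass over all stored RDF data.
        self.reindex_count += 1


manager = GeoIndexManager(min_interval_s=3600)
idle = manager.maybe_reindex(now=0)       # nothing changed yet
manager.on_rdf_update()
first = manager.maybe_reindex(now=10)     # data changed, never indexed before
manager.on_rdf_update()
second = manager.maybe_reindex(now=20)    # too soon after the last rebuild
third = manager.maybe_reindex(now=3700)   # interval elapsed, pending change
```

With this policy, bursts of updates cost at most one rebuild per interval, at the price of queries seeing a stale geo-spatial index in between.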


    Open Issues (III)

    • Quality & provenance:

      • Original input data sets may not have the appropriate quality -> resulting RDF data can have the same or lower quality level

      • Proposed infrastructure must be extended with quality resolving procedures & methods (e.g., data cleansing methods for correcting the data exploited)

      • Provenance information can ensure the correct updating of LD + assist in LD reasoning process by deriving additional facts

      • Thus, provenance information should be exploited by our system, especially if we consider that such exploitation is not enabled by most LOD management systems


    Conclusions

    • Proposed a scalable, geo-spatial LOD-as-a-Service management system deployed on the Amazon cloud

      • Distributes query load + scales up/down when CPU utilization crosses specific thresholds

      • Exposes REST-based service with LOD management methods

      • Provides two different ways for publishing open geo-spatial data sets

    • Advance geo-spatial support level by following two directions:

      • Realize GSOM-to-INSPIRE mapping to enable producing INSPIRE-compliant data

      • Extend Virtuoso with geo-spatial indexing & query systems to enable the efficient processing of rich & expressive geo-spatial queries, expressed either in SPARQL or GeoSPARQL

