
Linked Open GeoData Management in the Cloud

K. Kritikos, Y. Roussakis ICS-FORTH

D. Kotzinos ICS-FORTH & TEI of Serres


Cloud Computing

Better (faster, reliable, etc.) infrastructure - IaaS

Development infrastructure - PaaS

Software infrastructure - SaaS


Cloud Computing

Data as a Service (DaaS)

  • Publication

  • Querying

  • Updating


Linked (Open) Data as a Service

  • Publishing Linked Data

    • URI construction

    • Conceptual Model

    • Storage as RDF files or SPARQL endpoints

  • Querying Linked Data

    • SPARQL

    • GeoSPARQL

  • Updating Linked Data

    • SPARUL

    • Synchronization with original sources
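As a rough illustration of the querying bullet above, the following Python sketch assembles a GeoSPARQL query that selects features whose geometry lies within a polygon. The `geo:` and `geof:` prefixes, `geo:hasGeometry`, `geo:asWKT` and `geof:sfWithin` belong to the GeoSPARQL vocabulary; the function name, the variable names and the sample polygon are assumptions made for illustration and do not come from the presentation.

```python
GEO = "http://www.opengis.net/ont/geosparql#"
GEOF = "http://www.opengis.net/def/function/geosparql/"


def within_polygon_query(polygon_wkt: str, limit: int = 100) -> str:
    """Build a GeoSPARQL query for features located within a WKT polygon."""
    return (
        f"PREFIX geo: <{GEO}>\n"
        f"PREFIX geof: <{GEOF}>\n"
        "SELECT ?feature ?wkt WHERE {\n"
        "  ?feature geo:hasGeometry ?geom .\n"
        "  ?geom geo:asWKT ?wkt .\n"
        f'  FILTER(geof:sfWithin(?wkt, "{polygon_wkt}"^^geo:wktLiteral))\n'
        "}\n"
        f"LIMIT {limit}"
    )


if __name__ == "__main__":
    # Hypothetical area of interest, given as "longitude latitude" pairs.
    print(within_polygon_query("POLYGON((22 40, 24 40, 24 41, 22 41, 22 40))"))
```

The query string would then be sent to whichever SPARQL endpoint exposes the data; whether that endpoint evaluates GeoSPARQL functions depends on the store behind it.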


Problem Introduction (I)

  • InGeoCloudS FP7 Pilot B Project (www.ingeoclouds.eu)

  • Geophysical data from different sources and in different formats (Excel, XML, relational, nothing …)

  • Borehole and Groundwater Analysis

    • Boreholes located in Mygdonia/Thriasio in Greece and across the whole country in Denmark and France, together with their features (data static over time)

    • Chemical analyses of groundwater sampled from these boreholes (data updated over time)

  • Earthquake events and features

  • Landslides


Data granularity

Data refer to different levels of granularity, e.g. susceptibility maps refer to a country-wide area while earthquakes or boreholes are point-level data

Data might need to be aggregated, and such aggregation is based on the spatial dimension, i.e. points contained within a polygon

Aggregation has its problems, since phenomena outside the area of concern may affect it, so spatial aggregation alone might not be enough
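Spatial aggregation of point-level data into an area can be sketched as follows. This is a minimal, assumption-laden example: it uses a plain ray-casting point-in-polygon test on planar coordinates, and the function names are hypothetical. A production system would normally delegate this to the spatial functions of the data store.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def point_in_polygon(point: Point, polygon: List[Point]) -> bool:
    """Ray-casting test: True if the point lies strictly inside the polygon."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x coordinate at which the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def count_points_in_area(points: List[Point], polygon: List[Point]) -> int:
    """Aggregate point-level records (e.g. boreholes) by containment."""
    return sum(1 for p in points if point_in_polygon(p, polygon))
```

For example, with the square `[(0, 0), (4, 0), (4, 4), (0, 4)]` and the points `(1, 1)`, `(5, 5)` and `(2, 3)`, `count_points_in_area` counts the two points that fall inside the square. Events just outside the polygon are ignored, which is exactly the limitation noted above.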


  • Landslides:

    • Which area is affected, and how much?

    • How does this change over time?

    • Is the earthquake effect cumulative, or does it fade over time?

  • Earthquakes:

    • How far back in time should we go?

    • What information should be kept/would be relevant?

    • How should we query the repository to get the relevant information?


Problem Introduction (II)

  • Data/Metadata Standards

    • The INSPIRE standard proposes a generic conceptual schema for scientific data + models for 34 spatial data themes

  • Dealing with geospatial data & maintaining schemas/ontologies becomes difficult

    • Challenge is to exploit semantic heterogeneity

  • Need to offer a seamless & transparent LOD-as-a-Service (LODaaS) way to manage LOD

    • Lack of tools for mapping, transforming & synchronizing geo-spatial LD

    • Generic LOD management, independent of the way LOD are stored


Points of interest

  • GeoData get bigger and more important

    • Used in a variety of applications in different fields

  • Size & high demand impose considerable requirements on infrastructure storage size & compute power

  • Need to be reused and linked with other data sets

    • Go beyond current Web paradigm of isolated data silos

  • Current geo-spatial open data management work does not offer such support

    • Cloud-based approaches:

      • Do not provide geo-spatial support

      • Some do not fully support SPARQL or offer SPARQL end-points

    • Centralized approaches offer geo-spatial support but:

      • Do not enable automatic mapping between relational and RDF data

      • Perform worse in general (with the exception of Strabon wrt geo-spatial query support)


Proposed Solution (I)

  • A specific set of LODaaS services for geo-spatial LOD publishing, integration & querying

  • The cloud offers scalability & elasticity of computation, 24/7 availability & multiple data storage and integration offerings

  • Our cloud-based service-oriented system:

    • Exhibits good LOD management performance

    • Exposes a LOD management service that abstracts away RDF Store peculiarities & provides a generic way for LOD access and management


Proposed Solution (II)

  • A particular solution is adopted for mapping geo-spatial data in different formats to RDF data

  • The latter conform to extensible conceptual models that accurately capture thematic areas and are integrated via the GeoScientific Observation Model

    • This allows posing queries across providers and thematic fields

  • Our solution is part of the system, developed in the context of the InGeoCloudS project, that exploits cloud capabilities & LD technology to integrate & store heterogeneous geo-spatial data sets of different thematic fields + host & execute applications that exploit these data sets


Architecture (I)

  • System is scalable and elastic by exploiting cloud facilities

  • An extensive application pool can be built on top that exploits the offered services to perform various added-value and highly demanding tasks:

    • LO GeoData visualization, discovery & composition of data-sets, LO GeoData analytics

    • System could be extended to host such applications & offer various (geo-spatial) LO GeoData processing services and pre-built applications


Architecture (II)

  • Distributor:

    • Distributes generic queries equally & collects back the results

    • Sends non-generic queries to the instances holding the appropriate data

    • Distributes data by assigning new data to the scaling layer that is less loaded wrt storage space

    • Exploits the CPU monitoring & elasticity facilities of Amazon

  • Scaling Layer: comprises one or more LOD management components; data are replicated across these components to enhance reliability & enable layer-based load balancing

  • LOD Management Component: comprises LOD Management Service (LMS) instance & Virtuoso server for storage

  • LMS: provides methods for data providers to manage LOD & for other users to query & export the LOD stored

  • Virtuoso: underlying RDF triple store also allowing the mapping & synchronization between relational and RDF data
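The routing rules attributed to the Distributor can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class and method names are hypothetical, "equal distribution" is interpreted here as round-robin, and storage load is tracked as a simple counter rather than measured on real instances.

```python
from dataclasses import dataclass
from itertools import cycle
from typing import Dict, Iterator, List


@dataclass
class ScalingLayer:
    name: str
    used_storage_gb: float = 0.0


class Distributor:
    def __init__(self, layers: List[ScalingLayer]) -> None:
        self.layers = layers
        self._round_robin: Iterator[ScalingLayer] = cycle(layers)
        self._graph_location: Dict[str, ScalingLayer] = {}

    def store(self, graph_uri: str, size_gb: float) -> ScalingLayer:
        """Assign new data to the layer with the least used storage."""
        target = min(self.layers, key=lambda layer: layer.used_storage_gb)
        target.used_storage_gb += size_gb
        self._graph_location[graph_uri] = target
        return target

    def route_generic(self) -> ScalingLayer:
        """Generic queries are spread evenly (round-robin in this sketch)."""
        return next(self._round_robin)

    def route_specific(self, graph_uri: str) -> ScalingLayer:
        """Non-generic queries go to the layer that holds the data."""
        return self._graph_location[graph_uri]
```

Replication inside a scaling layer and the CPU-driven elasticity are left out; they would sit behind `ScalingLayer` in a fuller model.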


    [Figure: General Query Evaluation Behavior. Labels: Response Time, Time passed, 2nd instance involvement.]


    LOD Integration & Publishing (I)

    • Extension of the high-level CIDOC-CRM conceptual model

    • New model is called Geo-Scientific Spatial Observation Model (GSOM) & expressed in RDF/S

    • It makes it possible to capture all information coming from different fields & countries + link data across different providers

    • INSPIRE was not exploited as it did not cover all requirements:

      • Capturing of scientific events

      • Complicated and cumbersome for information integration

      • In some cases, does not cover all appropriate information required by the data providers in particular thematic fields

    • GSOM-to-INSPIRE mapping specification to enable exporting INSPIRE-compliant data


    LOD Integration & Publishing (II)

    • Two alternatives for publishing LOD:

    • First alternative: create and import RDF-based descriptions of data-sets via a particular LMS method

      • Data update process must be controlled by performing SPARUL updates via a particular LMS method

      • It is the data provider's responsibility to keep relational & RDF data synchronized

        • Perfect synchronization may also not be required, as it may incur costs -> second alternative becomes preferable


    LOD Integration & Publishing (III)

    • Second alternative: data provider publishes the relational data of his/her data sets + provides a mapping file in R2RML to enable the synchronization of relational to RDF data (by executing an LMS method)

      • System takes care of this synchronization

      • Relational storage in the way it has been used for many years + additional RDF storage for the data, with automatic one-way synchronization between the two

      • Provider should have a good knowledge of GSOM & RDF
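The idea of one-way synchronization (relational data is the source of truth, RDF is derived from it) can be illustrated with a small self-contained Python sketch. It does not reproduce how Virtuoso or the LMS do this; the table layout, the `http://example.org/` base URI and the `vocab/name` property are placeholders chosen for the example.

```python
import sqlite3
from typing import List

BASE = "http://example.org/"  # placeholder base URI, not from the slides


def rows_to_ntriples(conn: sqlite3.Connection) -> List[str]:
    """Regenerate triples from the relational table (one-way: RDB -> RDF)."""
    triples = []
    for borehole_id, name in conn.execute(
        "SELECT id, name FROM borehole ORDER BY id"
    ):
        subject = f"<{BASE}borehole/{borehole_id}>"
        # Escape backslashes and quotes for an N-Triples string literal.
        escaped = name.replace("\\", "\\\\").replace('"', '\\"')
        triples.append(f'{subject} <{BASE}vocab/name> "{escaped}" .')
    return triples


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE borehole (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO borehole VALUES (1, 'B-001')")
print(rows_to_ntriples(conn))

# A relational update is reflected the next time the triples are derived;
# nothing flows back from RDF to the table.
conn.execute("UPDATE borehole SET name = 'B-001a' WHERE id = 1")
print(rows_to_ntriples(conn))
```

In the described system the derivation is driven by the R2RML mapping rather than by hand-written code, so the provider only maintains the relational data and the mapping file.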


    LOD Integration & Publishing (IV)

    • R2RML:

      • W3C recommendation since 2012

      • Can specify customized mappings between RDB & RDF data

      • An R2RML specification is just an RDF graph in Turtle

      • No specific implementation is imposed

    • Virtuoso supports R2RML by processing the R2RML specification & creating the respective RDB2RDF triggers (used for creating/updating RDF data from relational ones)

      • Either an RDF view or a physical RDF graph can be created, with the second option leading to far better performance
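To make the "RDF graph in Turtle" point concrete, the sketch below generates a small R2RML mapping for a single table. The `rr:` terms (`rr:TriplesMap`, `rr:logicalTable`, `rr:tableName`, `rr:subjectMap`, `rr:template`, `rr:class`, `rr:predicateObjectMap`, `rr:predicate`, `rr:objectMap`, `rr:column`) come from the W3C R2RML vocabulary. The table name, column names, base URI and the `ex:` vocabulary are hypothetical; a real mapping in this system would target GSOM classes and properties instead.

```python
R2RML_TEMPLATE = """\
@prefix rr: <http://www.w3.org/ns/r2rml#> .
@prefix ex: <{vocab}> .

<#BoreholeMap> a rr:TriplesMap ;
    rr:logicalTable [ rr:tableName "{table}" ] ;
    rr:subjectMap [
        rr:template "{base}borehole/{{{key}}}" ;
        rr:class ex:Borehole
    ] ;
    rr:predicateObjectMap [
        rr:predicate ex:name ;
        rr:objectMap [ rr:column "{name_column}" ]
    ] .
"""


def borehole_mapping(
    table: str,
    key: str,
    name_column: str,
    base: str = "http://example.org/",
    vocab: str = "http://example.org/vocab#",
) -> str:
    """Return an R2RML mapping (Turtle) for one hypothetical borehole table."""
    return R2RML_TEMPLATE.format(
        vocab=vocab, table=table, base=base, key=key, name_column=name_column
    )


if __name__ == "__main__":
    print(borehole_mapping("BOREHOLE", "BOREHOLE_ID", "BOREHOLE_NAME"))
```

The `rr:template` value mints one subject URI per row from the key column, which is the same URI-construction step listed under publishing Linked Data.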


    [Figure: R2RML mapping between the Borehole Relational Model (RDB) and GSOM, used for publication and synchronization.
    GSOM classes: E26.Physical_Feature, S15.Acquifer_Concept, S16.Borehole, S2.SampleTaking, S13.Sample, E41.Appelation, E42.Identifier, E54.Dimension.
    GSOM properties: O4.sampled_from, O5.removed, P121F.overlaps_with, P43F.has_dimension, P1F.is_identified_by.
    Relational fields and labels: Intake, Borehole_Name, Sample_ID, Waterlevel.
    URI identification: http://orgURL/SampleID/XYZ]


    LOD Management Service (I)

    • REST-based service with API exposing all appropriate management functionality needed by geo-spatial LOD users

      • Abstracts away from peculiarities of RDF triple stores

      • Enables simple & intuitive use of a specific set of LOD management methods

      • Programmatic or form-based access to methods

      • Production of query results in different forms, such as WKT, GML, & KML

      • Importing/exporting capabilities in different formats (RDF/XML, N-Triples, Turtle)
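As a small illustration of serving one result in several geo-spatial forms, the sketch below renders a single point as WKT and as a KML placemark. It is not the service's implementation: the function names are hypothetical, coordinates are assumed to be longitude/latitude, and GML is left out because its axis order depends on the coordinate reference system in use.

```python
from xml.sax.saxutils import escape


def point_to_wkt(lon: float, lat: float) -> str:
    """Serialize a point as Well-Known Text (x = longitude, y = latitude)."""
    return f"POINT({lon} {lat})"


def point_to_kml(name: str, lon: float, lat: float) -> str:
    """Serialize a named point as a KML Placemark (coordinates are lon,lat)."""
    return (
        "<Placemark>"
        f"<name>{escape(name)}</name>"
        f"<Point><coordinates>{lon},{lat}</coordinates></Point>"
        "</Placemark>"
    )


if __name__ == "__main__":
    # Hypothetical borehole location.
    print(point_to_wkt(23.1, 40.7))
    print(point_to_kml("Borehole B-001", 23.1, 40.7))
```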


    LOD Management Service (II)

    • The provided methods are:

      • meta_query (SPARQL string, timeout (opt.), row limit (opt.)): user-requested format (e.g., JSON)

      • meta_update (SPARUL string, baseURI, timeout (opt.), row limit (opt.))

      • meta_addMappings(R2RML string, graphURI) -> initiates mapping procedure

      • meta_export(graphURI, subjURI, predURI, objURI, internal): user-requested format -> last param indicates if result will be inline in the response

      • meta_import(url, graphUri, format, blocking): ImportStatus -> RDF data are imported by downloading them via provided URL or inline in user-request + method can be blocking or non-blocking

      • import_status(importID): ImportStatus -> in case of a non-blocking import request, the user can inquire about the status of his/her import by passing the value of a specific field (importID), returned by the previous method, as input to this method
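A client that starts an import without waiting for it to finish would typically poll import_status until the import completes. The following Python sketch shows that polling loop only. The status strings (`"FINISHED"`, `"FAILED"`, `"TIMEOUT"`) and the `fetch_status` callable are assumptions: the slides do not specify the fields of ImportStatus or the service URLs, so the actual HTTP call is injected rather than invented.

```python
import time
from typing import Callable


def wait_for_import(
    import_id: str,
    fetch_status: Callable[[str], str],
    poll_seconds: float = 1.0,
    max_polls: int = 60,
) -> str:
    """Poll an import's status until it ends or the poll budget runs out.

    fetch_status is expected to wrap a call to the service's import_status
    method and return a status string (hypothetical values here).
    """
    for _ in range(max_polls):
        status = fetch_status(import_id)
        if status in ("FINISHED", "FAILED"):
            return status
        time.sleep(poll_seconds)
    return "TIMEOUT"
```

Injecting `fetch_status` keeps the loop testable with a stub and independent of the REST client library used.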


    LOD Management Service (III)

    • Each method accessible via specific URL + produces meaningful exception messages (e.g., in case user input is wrong)

    • User-friendly HTML Documentation produced via Enunciate

    • Implementation exploited Sesame RDF Data Management API, Virtuoso’s JDBC Driver & Jersey


    Open Issues (I)

    • Model:

      • Extend it to capture other thematic fields

      • Data published in our system could fulfill all requirements to be 5-star LOD if respective owners decide to do so

    • Data mapping:

      • Cloud-based Virtuoso version supports native Relational DB for RDB2RDF synchronization

        • Trade-off between LOD management completeness & cost

      • Mapping tools are needed to allow visual editing of R2RML without requiring data providers to have good knowledge of RDF

      • Research issue: support bi-directional RDB2RDF mappings


    Open Issues (II)

    • Geo-spatial query support:

      • Virtuoso does not support GeoSPARQL

      • Virtuoso has limited geo-spatial query support only in commercial versions

        • 2D geometries + limited set of topological relation operators

      • Additional support is needed in terms of geometry dimensionality + feature aggregation operators

        • Could extend Virtuoso via frameworks, such as uSeekM, which provide adequate geo-spatial support along with the capability of evaluating GeoSPARQL queries

          • Such solutions require processing all stored RDF data to create geo-spatial indices as well as deploying another DB -> do not fit well with automatic geo-spatial LOD management

          • Could resolve the problem by: (a) performing re-indexing at infrequent time intervals, (b) creating specialized triggers that start re-indexing only when RDF data are updated
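The two proposed remedies, infrequent re-indexing and update-triggered re-indexing, can be combined in a simple scheduling rule: rebuild the geo-spatial index only if the RDF data changed and enough time has passed since the last rebuild. The Python sketch below illustrates that rule only; the class name and its methods are hypothetical and nothing here reflects the API of Virtuoso or uSeekM.

```python
class GeoIndexScheduler:
    """Decides when a geo-spatial index over the RDF data should be rebuilt."""

    def __init__(self, min_interval_s: float) -> None:
        self.min_interval_s = min_interval_s
        self.dirty = False  # set when RDF data changed since the last rebuild
        self.last_reindex = float("-inf")

    def on_rdf_update(self) -> None:
        """Called by an update trigger: marks the index as stale."""
        self.dirty = True

    def should_reindex(self, now: float) -> bool:
        """Rebuild only if data changed and the minimum interval has elapsed."""
        return self.dirty and (now - self.last_reindex) >= self.min_interval_s

    def mark_reindexed(self, now: float) -> None:
        """Record a completed rebuild."""
        self.dirty = False
        self.last_reindex = now
```

Setting `min_interval_s` to zero gives pure trigger-driven behavior (b), while a large value approximates the infrequent-interval strategy (a).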


    Open Issues (III)

    • Quality & provenance:

      • Original input data sets may not have the appropriate quality -> resulting RDF data can have the same or lower quality level

      • Proposed infrastructure must be extended with quality resolving procedures & methods (e.g., data cleansing methods for correcting the data exploited)

      • Provenance information can ensure the correct updating of LD + assist in LD reasoning process by deriving additional facts

      • Thus, provenance information should be exploited by our system, especially if we consider that such exploitation is not enabled by most LOD management systems


    Conclusions

    • Proposed a scalable, geo-spatial LOD-as-a-Service management system deployed on the Amazon cloud

      • Distributes query load + scales up/down when CPU utilization crosses specific thresholds

      • Exposes REST-based service with LOD management methods

      • Provides two different ways for publishing open geo-spatial data sets

    • Advance geo-spatial support level by following two directions:

      • Realize GSOM-to-INSPIRE mapping to enable producing INSPIRE-compliant data

      • Extend Virtuoso with geo-spatial indexing & query systems to enable the efficient processing of rich & expressive geo-spatial queries, expressed either in SPARQL or GeoSPARQL
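The threshold-based scaling mentioned in the conclusions can be sketched as a single decision function. The threshold values, the instance limits and the function name are assumptions made for illustration; the presentation does not state the actual thresholds, and in the deployed system this decision is delegated to the cloud provider's monitoring and elasticity facilities.

```python
def scaling_decision(
    cpu_percent: float,
    instances: int,
    scale_up_at: float = 80.0,    # hypothetical upper CPU threshold
    scale_down_at: float = 20.0,  # hypothetical lower CPU threshold
    min_instances: int = 1,
    max_instances: int = 4,
) -> int:
    """Return the desired number of instances for the observed CPU load."""
    if cpu_percent > scale_up_at and instances < max_instances:
        return instances + 1
    if cpu_percent < scale_down_at and instances > min_instances:
        return instances - 1
    return instances
```

Keeping a gap between the two thresholds avoids adding and removing instances in quick succession when the load hovers around a single value.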