Publishing georeferenced statistical data using linked open data technologies

Mirosław Migacz, GIS Consultant, Statistics Poland. Merging statistics and geospatial information grant series.

Presentation Transcript


  1. Publishing georeferenced statistical data using linked open data technologies • Mirosław Migacz, GIS Consultant, Statistics Poland • Merging statistics and geospatial information grant series • NTTS 2019 Conference / Brussels / Belgium

  2. The project • Title: "Development of guidelines for publishing statistical data as linked open data" • "Merging statistics and geospatial information" grant series • 2016 – 2017 • main goal: prepare a background for LOD implementation in official statistics

  3. Before • 3218 • 4.4.32.64.18 • powiat łobeski (LAU 1) • lobeski • 4326418

  4. After • powiat łobeski • http://nts.stat.gov.pl/4/4/32/64/18
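
The Before and After slides suggest that the dotted NTS code 4.4.32.64.18 becomes the URI path /4/4/32/64/18. A minimal Python sketch of that mapping follows. The function name and the rule itself (one path segment per code component) are assumptions inferred from this single example, not a documented specification of the project.

```python
BASE_URI = "http://nts.stat.gov.pl/"  # base taken from the "After" slide


def nts_code_to_uri(code: str, base: str = BASE_URI) -> str:
    """Turn a dotted NTS code such as '4.4.32.64.18' into a URI.

    Assumed rule: each dot-separated component becomes one path segment.
    """
    parts = code.strip().split(".")
    if not all(part.isdigit() for part in parts):
        raise ValueError(f"not a dotted numeric code: {code!r}")
    return base + "/".join(parts)


print(nts_code_to_uri("4.4.32.64.18"))
```

Keeping the hierarchy visible in the path means a parent unit can be reached by trimming trailing segments, which may be one reason for this URI shape.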

  5. Specific objectives • identify data sources • identify statistical units • harmonize, generalize and build URIs for statistical units • transform statistical data, geospatial data and metadata into RDF (pilot) • conclude the pilot transformation and formulate recommendations for a full-scale implementation

  6. Primary data sources

  7. Identification of data sources • Other data sources: • publications • tables • communiques • announcements • articles

  8. Data sources - inventory • Metadata: • thematic category, • format (PDF, DOC, XLS, CSV), • spatial reference (country, NUTS, LAU, functional areas, urban areas), • temporal reference (years), • presence of identifiers (TERYT, NTS, NUTS), • update cycle • Preliminary analysis of data sources: • openness • redundancy of information • popularity (based on view / download stats)

  9. Statistical units inventory • Administrative boundaries: • administrative units • NUTS • Non-standard statistical units: • functional areas / urban areas • groups of administrative / statistical units • derived mostly from strategic documents • Diagram labels: NUTS, ADMINISTRATIVE

  10. Statistical units harmonization – KTS • KTS – classification combining administrative and statistical units • introduced last year to comply with NUTS 2016 • 14-digit code

  11. Geometry harmonization / generalization • Input data: • administrative boundaries since 2002 for LAU 2 (gmina), excluding 2007 • Harmonization process: • structure standardization • standardization of identifiers (creating KTS identifiers) • aggregation to higher-level units (LAU 1 -> NUTS 1) • Generalization: • several generalization scenarios tested in order to choose an optimal one • datasets with generalized and non-generalized geometries prepared for 2002-2016

  12. Linked open data pilot

  13. LOD pilot – statistical data • data: • demographic data for 2016 from three major databases (Local Data Bank, Demography Database, STRATEG system), • ontologies for classifications: • age code list defined using SKOS (skos) & Dublin Core (dct), • sex code list re-used from SDMX, with a Polish translation added, • defining metadata for statistical values (observations): • based primarily on SDMX ontologies (attribute, code, measure, dimension), • qb:Observation class from Data Cube.
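
To make the observation modelling more concrete, here is a hedged sketch of what a single qb:Observation could look like in Turtle, generated from Python. The prefixes are the published Data Cube and SDMX-RDF namespaces; the example.org URIs, the choice of properties, the literal used for the reference period and the value 1000 are all placeholders, not taken from the pilot.

```python
PREFIXES = """\
@prefix qb: <http://purl.org/linked-data/cube#> .
@prefix sdmx-dimension: <http://purl.org/linked-data/sdmx/2009/dimension#> .
@prefix sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#> .
@prefix sdmx-code: <http://purl.org/linked-data/sdmx/2009/code#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
"""


def observation_turtle(obs_uri, dataset_uri, area_uri, year, sex_code, value):
    """Return one qb:Observation as a Turtle snippet (without prefixes)."""
    return (
        f"<{obs_uri}> a qb:Observation ;\n"
        f"    qb:dataSet <{dataset_uri}> ;\n"
        f"    sdmx-dimension:refArea <{area_uri}> ;\n"
        f'    sdmx-dimension:refPeriod "{year}"^^xsd:gYear ;\n'
        f"    sdmx-dimension:sex sdmx-code:sex-{sex_code} ;\n"
        f"    sdmx-measure:obsValue {int(value)} .\n"
    )


# All URIs and the value below are hypothetical placeholders.
snippet = observation_turtle(
    "http://example.org/obs/1",
    "http://example.org/dataset/population",
    "http://example.org/unit/A",
    2016,
    "F",
    1000,
)
print(PREFIXES + "\n" + snippet)
```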

  14. LOD pilot – geospatial data • input geometries: • voivodship geometries for 2016, • ontologies: • ontology for the KTS classification defined using RDF Schema (rdfs) & GeoSPARQL (geo) vocabularies, • geometry encoding: • separate geo:Geometry entities with geometry encoded in WKT (Well-Known Text) format (geo:wktLiteral).
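
The geometry encoding described above (a separate geo:Geometry entity carrying a geo:wktLiteral) can be sketched in Python as follows. This is an illustration only: the example.org URIs and the triangle coordinates are made up, and the helper names are hypothetical. The GeoSPARQL terms used (hasGeometry, Geometry, asWKT, wktLiteral) are standard.

```python
GEO = "http://www.opengis.net/ont/geosparql#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def polygon_wkt(ring):
    """Build a WKT POLYGON from (x, y) pairs, closing the ring if needed."""
    if len(ring) < 3:
        raise ValueError("a polygon ring needs at least three points")
    points = list(ring)
    if points[0] != points[-1]:
        points.append(points[0])  # WKT rings must end where they start
    coords = ", ".join(f"{x} {y}" for x, y in points)
    return f"POLYGON (({coords}))"


def geometry_ntriples(feature_uri, geometry_uri, wkt):
    """Link a feature to a separate geometry entity holding the WKT literal."""
    return [
        f"<{feature_uri}> <{GEO}hasGeometry> <{geometry_uri}> .",
        f"<{geometry_uri}> <{RDF_TYPE}> <{GEO}Geometry> .",
        f'<{geometry_uri}> <{GEO}asWKT> "{wkt}"^^<{GEO}wktLiteral> .',
    ]


# Hypothetical URIs and a made-up triangle, for illustration only.
wkt = polygon_wkt([(0, 0), (1, 0), (1, 1)])
for line in geometry_ntriples(
    "http://example.org/unit/A", "http://example.org/geometry/A-2016", wkt
):
    print(line)
```

Keeping the geometry as its own entity lets several descriptions (for instance generalized and non-generalized shapes) hang off the same statistical unit.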

  15. LOD pilot – data sources catalogue • DCAT-AP (dcat) application profile for data portals in Europe, • data sources as instances of the dcat:Dataset class, • links to other vocabularies: • EuroVoc (for thematic categories), • EU Publications Office continent / country code list (for spatial reference), • Internet Media Type (MIME)

  16. LOD pilot – linking • dataset definitions for statistical data • spatial domain for datasets • geometries for observations

  17. Data transformation into RDF • 1. Source files in CSV

  18. Data transformation into RDF • 2. Python script using the RDFLib module for transformation

  19. Data transformation into RDF • 3a. Results in any desired format (RDF/XML)

  20. Data transformation into RDF • 3b. Results in any desired format (Turtle)
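
The CSV-to-RDF steps above can be sketched roughly as follows. The pilot used the RDFLib module, and its script is not reproduced in this transcript. To stay dependency-free, this sketch swaps in the standard library and writes N-Triples by hand. The column names, the example.org URIs and the sample values are all hypothetical.

```python
import csv
import io

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
QB = "http://purl.org/linked-data/cube#"
SDMX_DIM = "http://purl.org/linked-data/sdmx/2009/dimension#"
SDMX_MEASURE = "http://purl.org/linked-data/sdmx/2009/measure#"
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"
BASE = "http://example.org/stat/"  # hypothetical base URI

# Made-up sample input; real source files would have their own columns.
SAMPLE_CSV = """unit_uri,year,population
http://example.org/unit/A,2016,1000
http://example.org/unit/B,2016,2500
"""


def csv_to_ntriples(text):
    """Emit four N-Triples lines per CSV row (one observation per row)."""
    lines = []
    for i, row in enumerate(csv.DictReader(io.StringIO(text)), start=1):
        obs = f"<{BASE}obs/{i}>"
        area = row["unit_uri"]
        year = row["year"]
        value = int(row["population"])
        lines.append(f"{obs} <{RDF_TYPE}> <{QB}Observation> .")
        lines.append(f"{obs} <{SDMX_DIM}refArea> <{area}> .")
        lines.append(f'{obs} <{SDMX_DIM}refPeriod> "{year}" .')
        lines.append(f'{obs} <{SDMX_MEASURE}obsValue> "{value}"^^<{XSD_INTEGER}> .')
    return lines


triples = csv_to_ntriples(SAMPLE_CSV)
print("\n".join(triples))
```

With a full RDF library the same triples would be added to a graph object and serialized to RDF/XML or Turtle on request, which is what steps 3a and 3b refer to.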

  21. LOD pilot – triplestore • Apache Jena Fuseki used as a SPARQL server, • 71717 triples loaded, • single Fuseki dataset (STAT_LOD) to allow cross-querying and cross-browsing of data created initially in separate files, • SPARQL endpoint for querying
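
As an illustration of querying such an endpoint, the sketch below builds a SPARQL-over-HTTP GET request that counts qb:Observation instances. The host, port and service path in ENDPOINT are assumptions (a local Fuseki on its default port with the STAT_LOD dataset); the pilot's actual endpoint address is not given in the slides. The request is only constructed here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed address: local Fuseki, default port, dataset name from the slide.
ENDPOINT = "http://localhost:3030/STAT_LOD/sparql"

QUERY = """\
PREFIX qb: <http://purl.org/linked-data/cube#>
SELECT (COUNT(?obs) AS ?n)
WHERE { ?obs a qb:Observation }
"""


def build_sparql_request(query, endpoint=ENDPOINT):
    """Build a GET request following the SPARQL protocol ('query' parameter)."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request(QUERY)
print(request.full_url)
# Against a live server, urllib.request.urlopen(request) would return JSON results.
```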

  22. LOD pilot – SPARQL endpoint

  23. LOD pilot – Pubby frontend (catalogue)

  24. LOD pilot – Pubby frontend (dataset)

  25. LOD pilot – Pubby frontend (value)

  26. LOD pilot – Pubby frontend (geometry)

  27. LOD pilot – conclusions • No reference implementation for statistical linked open data: • lack of integrity between RDF metadata sets published by one authority, • links to non-existing entities, • lack of maintenance, • Lack of pan-European guidelines for statistical linked open data: • common vocabularies, • recommended or dedicated software components, • DIGICOM ESSNet LOD project.

  28. LOD pilot – conclusions • Some software / programming components are no longer being developed: • implementations might become unstable, • a Python-based implementation seems sustainable at this point, • Semantic harmonization of statistical classifications: • different meanings for supposedly the same classification elements, e.g. 0-5 can be “0 to 5” or “0 to less than 5”, • not only a pan-European issue, may exist at country level
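
The "0-5" ambiguity is easy to demonstrate: the two readings disagree only at the upper bound. The short sketch below (function name hypothetical) checks ages 4 and 5 against both interpretations of the same label.

```python
def in_bracket(age, low, high, closed_upper):
    """Test membership under one of the two readings of a 'low-high' label."""
    if closed_upper:
        return low <= age <= high  # "0 to 5": upper bound included
    return low <= age < high       # "0 to less than 5": upper bound excluded


label = "0-5"
low, high = (int(part) for part in label.split("-"))

for age in (4, 5):
    closed = in_bracket(age, low, high, closed_upper=True)
    half_open = in_bracket(age, low, high, closed_upper=False)
    print(age, closed, half_open)
```

Two code lists that both label an element "0-5" can therefore assign a five-year-old to different classes, which is why matching labels alone is not enough to link them.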

  29. LOD pilot – conclusions • Methodology for publishing spatial data as linked open data: • single entity per single geometry: • inventory of boundary changes, • geometry instances with non-meaningful identifiers (UUIDs), • separate geometries for respective years: • a complete set of geometries each year, regardless of changes, • geometry instances with meaningful identifiers (KTS + year).
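
The two identifier strategies for geometries can be contrasted in a few lines of Python. This is a sketch under assumptions: the base URI, the function names, the "/" separator between KTS code and year, and the all-zero code are placeholders (the code is not a real KTS value), since the slides do not specify the exact identifier syntax.

```python
import uuid

BASE = "http://example.org/geometry/"  # hypothetical base URI


def opaque_geometry_uri():
    """One entity per distinct geometry: identifier carries no meaning."""
    return f"{BASE}{uuid.uuid4()}"


def yearly_geometry_uri(kts_code, year):
    """One geometry per unit per year: identifier built from KTS code + year."""
    if len(kts_code) != 14 or not kts_code.isdigit():
        raise ValueError("expected a 14-digit KTS code")
    return f"{BASE}{kts_code}/{year}"


placeholder_kts = "0" * 14  # placeholder, not a real KTS code
print(opaque_geometry_uri())
print(yearly_geometry_uri(placeholder_kts, 2016))
```

The opaque variant avoids duplicating unchanged shapes but needs an inventory of boundary changes to tell which geometry applies in which year; the yearly variant is predictable from the code and year alone, at the cost of storing a full set every year.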

  30. LOD pilot – conclusions • Most linked open data implementations are technically correct: • it is nearly impossible to produce incorrect RDF metadata files, • you can put anything in the RDF graph, but does it make sense semantically? • Linked open data implementations based on Python scripts are easy to amend in the future, • RDF vocabulary specifications are easier to interpret with a UML model provided (Thank you, Captain Obvious)

  31. Merging statistics and geospatial information grant series • Mirosław Migacz, GIS Consultant, Statistics Poland • Publishing georeferenced statistical data using linked open data technologies • www.linkedin.com/in/migacz • m.migacz@stat.gov.pl • NTTS 2019 Conference / Brussels / Belgium
