Publishing georeferenced statistical data using linked open data technologies. Mirosław Migacz GIS Consultant Statistics Poland. Merging statistics and geospatial information grant series. The project.
Publishing georeferenced statistical data using linked open data technologies. Mirosław Migacz, GIS Consultant, Statistics Poland. Merging statistics and geospatial information grant series. NTTS 2019 Conference / Brussels / Belgium
The project • Title: "Development of guidelines for publishing statistical data as linked open data" • "Merging statistics and geospatial information" grant series • 2016 – 2017 • main goal: prepare a background for LOD implementation in official statistics
Before • 3218 • 4.4.32.64.18 • powiat łobeski (LAU 1) • lobeski • 4326418
After • powiat łobeski • http://nts.stat.gov.pl/4/4/32/64/18
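The Before/After pair suggests that the dotted NTS code is mapped segment by segment onto a URI path. A minimal sketch of that mapping, assuming the base address shown on the "After" slide and that every segment is numeric; the function name `nts_code_to_uri` is made up for illustration and is not taken from the project:

```python
# Base address as shown on the "After" slide.
BASE_URI = "http://nts.stat.gov.pl/"


def nts_code_to_uri(dotted_code: str, base: str = BASE_URI) -> str:
    """Turn a dotted NTS code such as '4.4.32.64.18' into a path-style URI."""
    parts = dotted_code.strip().split(".")
    # Assumption: every segment of the code is made of digits only.
    if not all(part.isdigit() for part in parts):
        raise ValueError(f"unexpected NTS code: {dotted_code!r}")
    return base + "/".join(parts)


print(nts_code_to_uri("4.4.32.64.18"))
# http://nts.stat.gov.pl/4/4/32/64/18
```

With one URI per unit, the other identifiers from the "Before" slide can be attached to that URI as properties instead of competing as keys.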
Specific objectives • identify data sources • identify statistical units • harmonize, generalize and build URIs for statistical units • transform statistical data, geospatial data and metadata into RDF (pilot) • conclude the pilot transformation and formulate recommendations for a full-scale implementation
Identification of data sources • Other data sources: • publications • tables • communiques • announcements • articles
Data sources - inventory • Metadata: • thematic category, • format (PDF, DOC, XLS, CSV), • spatial reference (country, NUTS, LAU, functional areas, urban areas), • temporal reference (years), • presence of identifiers (TERYT, NTS, NUTS), • update cycle • Preliminary analysis of data sources: • openness • redundancy of information • popularity (based on view / download stats)
Statistical units inventory • administrative boundaries: • administrative units • NUTS • Non-standard statistical units: • functional areas / urban areas • groups of administrative / statistical units • derived mostly from strategic documents (figure labels: NUTS, ADMINISTRATIVE)
Statistical units harmonization – KTS • KTS – classification combining administrative and statistical units • introduced last year to comply with NUTS 2016 • 14-digit code
Geometry harmonization / generalization • Input data: • administrative boundaries since 2002 for LAU 2 (gmina), excluding 2007 • Harmonization process: • structure standardization • standardization of identifiers (creating KTS identifiers) • aggregation to higher level units (LAU 1 -> NUTS 1) • Generalization: • several generalization scenarios tested to choose an optimal one • datasets with generalized and non-generalized geometries prepared for 2002-2016
LOD pilot – statistical data • data: • demographic data for 2016 from three major databases (Local Data Bank, Demography Database, STRATEG system), • ontologies for classifications: • age code list defined using SKOS (skos) & Dublin Core (dct), • sex code list re-used from SDMX, with a Polish translation added, • defining metadata for statistical values (observations): • based primarily on SDMX ontologies (attribute, code, measure, dimension), • qb:Observation class from Data Cube.
LOD pilot – geospatial data • input geometries: • voivodship geometries for 2016, • ontologies: • ontology for the KTS classification defined using RDF Schema (rdfs) & GeoSPARQL (geo) vocabularies, • geometry encoding: • separate geo:Geometry entities with geometry encoded in WKT (Well-Known Text) format (geo:wktLiteral).
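The geometry encoding described on this slide (a separate geo:Geometry entity carrying a geo:wktLiteral) can be illustrated with a small standard-library sketch that emits N-Triples lines. The unit and geometry URIs and the coordinates are invented placeholders, not a real voivodship boundary, and the helper names are hypothetical; only the GeoSPARQL terms (`hasGeometry`, `Geometry`, `asWKT`, `wktLiteral`) come from the vocabulary itself.

```python
from typing import List, Sequence, Tuple

GEO = "http://www.opengis.net/ont/geosparql#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def polygon_wkt(ring: Sequence[Tuple[float, float]]) -> str:
    """Encode one exterior ring as a WKT POLYGON, closing the ring if needed."""
    coords = list(ring)
    if len(coords) < 3:
        raise ValueError("a polygon ring needs at least three points")
    if coords[0] != coords[-1]:
        coords.append(coords[0])  # WKT rings must end where they start
    body = ", ".join(f"{x} {y}" for x, y in coords)
    return f"POLYGON (({body}))"


def geometry_triples(unit_uri: str, geometry_uri: str, wkt: str) -> List[str]:
    """Return N-Triples lines linking a unit to a separate geometry entity."""
    return [
        f"<{unit_uri}> <{GEO}hasGeometry> <{geometry_uri}> .",
        f"<{geometry_uri}> <{RDF_TYPE}> <{GEO}Geometry> .",
        f'<{geometry_uri}> <{GEO}asWKT> "{wkt}"^^<{GEO}wktLiteral> .',
    ]


# Placeholder rectangle and example.org URIs, for illustration only.
wkt = polygon_wkt([(14.0, 53.0), (16.0, 53.0), (16.0, 54.0), (14.0, 54.0)])
for line in geometry_triples(
    "http://example.org/unit/1", "http://example.org/geometry/1", wkt
):
    print(line)
```

Keeping the geometry as its own entity means the statistical unit can point at different geometries (for example per year) without changing its own URI.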
LOD pilot – data sources catalogue • DCAT-AP (dcat) application profile for data portals in Europe, • data sources as instances of the dcat:Dataset class, • links to other vocabularies: • EuroVoc (for thematic categories), • EU Publications Office continent / country code list (for spatial reference) • Internet Media Type (MIME)
LOD pilot – linking • dataset definitions for statistical data • spatial domain for datasets • geometries for observations
Data transformation into RDF 1. Source files in CSV
Data transformation into RDF 2. Python script using the RDFLib module for transformation:
Data transformation into RDF 3a. Results in any desired format (RDF/XML):
Data transformation into RDF 3b. Results in any desired format (Turtle):
LOD pilot – triplestore • Apache Jena Fuseki used as a SPARQL server, • 71717 triples loaded, • single Fuseki dataset (STAT_LOD) to allow cross-querying and cross-browsing of data created initially in separate files, • SPARQL endpoint for querying
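Querying such an endpoint needs nothing beyond the standard library. The sketch below sends a SPARQL query as a URL-encoded POST, as defined by the SPARQL 1.1 Protocol. The host, port and `/sparql` service path are assumptions about a local Fuseki installation; only the dataset name STAT_LOD comes from the slide.

```python
import json
import urllib.parse
import urllib.request

# Assumed address of a local Fuseki server; only "STAT_LOD" is from the slide.
ENDPOINT = "http://localhost:3030/STAT_LOD/sparql"

# Counts all triples in the default graph of the dataset.
COUNT_QUERY = "SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o }"


def build_request(endpoint: str, query: str) -> urllib.request.Request:
    """Prepare a SPARQL query as a URL-encoded POST asking for JSON results."""
    data = urllib.parse.urlencode({"query": query}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=data,
        headers={"Accept": "application/sparql-results+json"},
    )


def run_select(endpoint: str, query: str) -> list:
    """Send the query and return the result bindings (needs a running server)."""
    with urllib.request.urlopen(build_request(endpoint, query)) as response:
        return json.load(response)["results"]["bindings"]


# Usage against a running server:
#   bindings = run_select(ENDPOINT, COUNT_QUERY)
#   print(bindings[0]["triples"]["value"])
```

Since all files were loaded into one dataset, a single query like this can span statistical observations, geometries and catalogue metadata.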
LOD pilot – conclusions • No reference implementation for statistical linked open data: • lack of integrity between RDF metadata sets published by one authority, • links to non-existing entities, • lack of maintenance, • Lack of pan-European guidelines for statistical linked open data: • common vocabularies, • recommended or dedicated software components, • DIGICOM ESSNet LOD project.
LOD pilot – conclusions • Some software / programming components are no longer being developed, • implementations might become unstable, • Python-based implementation seems sustainable at this point, • Semantic harmonization of statistical classifications: • different meanings for supposedly the same classification elements, e.g. 0-5 can be "0 to 5" or "0 to less than 5", • not only a pan-European issue, may exist at country level
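The "0-5" ambiguity becomes visible once the class bounds are stated explicitly instead of being implied by the label. A minimal sketch, with a hypothetical `AgeClass` type that is not part of any vocabulary used in the pilot:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgeClass:
    """An age class with explicit bounds instead of a bare label."""

    label: str
    lower: int
    upper: int
    upper_inclusive: bool

    def contains(self, age: float) -> bool:
        if age < self.lower:
            return False
        if self.upper_inclusive:
            return age <= self.upper
        return age < self.upper


# Two classes that share the label "0-5" but mean different things.
zero_to_five = AgeClass("0-5", 0, 5, upper_inclusive=True)
zero_to_under_five = AgeClass("0-5", 0, 5, upper_inclusive=False)

print(zero_to_five.contains(5), zero_to_under_five.contains(5))
# True False
```

Two code lists whose labels match can therefore still disagree on which class a five-year-old belongs to, which is why matching on labels alone is not enough for harmonization.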
LOD pilot – conclusions • Methodology for publishing spatial data as linked open data: • single entity per single geometry: • inventory of boundary changes, • geometry instances with non-meaningful identifiers (UUIDs), • separate geometries for respective years: • a complete set of geometries each year, regardless of changes, • geometry instances with meaningful identifiers (KTS + year).
LOD pilot – conclusions • Most linked open data implementations are technically correct: • it is nearly impossible to produce incorrect RDF metadata files, • you can put anything in the RDF graph, but does it make sense semantically? • Linked open data implementations based on Python scripts are easy to amend in the future, • RDF vocabulary specifications are easier to interpret with a UML model provided (Thank you, Captain Obvious)
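The two identifier schemes compared on this slide can be put side by side in a short sketch. The base address, the function names and the all-zero KTS code are placeholders; the 14-digit length check follows the KTS description earlier in the deck.

```python
import uuid

GEOMETRY_BASE = "http://example.org/geometry/"  # hypothetical base address


def opaque_geometry_uri() -> str:
    """Option 1: one entity per distinct geometry, identified by a random UUID."""
    return GEOMETRY_BASE + str(uuid.uuid4())


def yearly_geometry_uri(kts_code: str, year: int) -> str:
    """Option 2: one geometry per unit per year, identified by KTS code + year."""
    if not (kts_code.isdigit() and len(kts_code) == 14):
        raise ValueError("KTS codes are expected to have 14 digits")
    return f"{GEOMETRY_BASE}{kts_code}/{year}"


placeholder_kts = "0" * 14  # not a real KTS code
print(opaque_geometry_uri())
print(yearly_geometry_uri(placeholder_kts, 2016))
```

The first option avoids duplicate geometries but requires tracking when boundaries changed. The second repeats unchanged geometries every year, yet its URIs can be constructed without any lookup.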
Merging statistics and geospatial information grant series. Mirosław Migacz, GIS Consultant, Statistics Poland. Publishing georeferenced statistical data using linked open data technologies. www.linkedin.com/in/migacz m.migacz@stat.gov.pl NTTS 2019 Conference / Brussels / Belgium