Challenges and Innovations in Triplestore Metadata Management
This experience report by Nathan Wilhelmi discusses the development and challenges faced while managing a triplestore using Sesame for metadata storage between 2006 and 2011. Key issues included changing developer teams, schema-less design impacts, and performance concerns with query patterns. The report highlights transitions towards SOLR for efficient search capabilities and the exploration of NoSQL solutions like Neo4J for improved relationship modeling. A focus is placed on URI encoding issues, maintenance challenges, and future directions for effective metadata management.
Presentation Transcript
Triplestore Experiences Nathan Wilhelmi 11/27/2012 NCAR - CISL/TDD/VETS
Our Experiences… • Disclaimers: • Did not have an ontologist • Codebase passed through multiple developers • Timelines (changing landscape) • Started work 2006 • Stopped active development ~2011 • Sesame version 2.3.0
Why a Triplestore? • Search functionality • Faceted • Free text • Model metadata • Metadata storage • Display • Semantic web
Initial Architecture • Authoritative metadata source was RDBMS • Metadata harvested into the triplestore at periodic intervals • Triplestore only contained metadata to drive search • Sesame used as a stand alone service
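The periodic harvest described on this slide can be sketched roughly as follows. This is a minimal illustration, not the project's actual harvester: the `dataset` table, its `id` and `name` columns, and the `harvest` function are all hypothetical, and an in-memory SQLite database stands in for the real RDBMS. Only the namespace URIs come from the slides.

```python
import sqlite3

# Namespace URIs as they appear in the slides.
ESG = "http://www.earthsystemgrid.org/esg.owl#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def harvest(conn):
    """Rebuild the full set of search triples from the authoritative database.

    The table and column names are assumptions made for this sketch.
    """
    triples = set()
    for dataset_id, name in conn.execute("SELECT id, name FROM dataset"):
        subject = ESG + dataset_id
        triples.add((subject, RDF_TYPE, ESG + "Dataset"))
        triples.add((subject, RDFS_LABEL, name))
    return triples


def demo():
    """Build a throwaway database and harvest it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE dataset (id TEXT PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO dataset VALUES (?, ?)",
        [("ds1", "Run A"), ("ds2", "Run B")],
    )
    return harvest(conn)
```

Because the triplestore only held a derived copy of the metadata, a harvest like this can always rebuild it from scratch.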
Sesame Triplestore • Standalone Sesame server • Stability problems • No security, triplestore could be updated by anyone • Changed to in-memory store • Stable • Picked up performance improvements • Embedded triplestore was only internally referencing • RDF didn’t work outside of the application • Distilled to key-value store
Internal Referencing <rdf:RDF ...> <rdf:Description rdf:about="http://www.earthsystemgrid.org/esg.owl#esg-ncar__ucar_cgd_ccsm_b30_072b"> .... <esg:hasUnconfiguredModelComponent rdf:resource="http://www.earthsystemgrid.org/esg.owl#modelcomponent_ccsm_run_b30.072b" /> .... </rdf:Description> </rdf:RDF>
Performance • For our query patterns, we were not seeing the performance we needed • Inferencing was removed and performance improved to acceptable levels for <5K datasets • Target volume 50K datasets • SPARQL missing key operators: ordering, limits
Tooling Support • Managing the triplestore • Protégé round trips didn’t work well • Dump full triple store to XML and grep by hand • Deleting and updating triples • Deletes were difficult, dangling triples • Rebuild from authoritative sources
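To illustrate why deletes left dangling triples, here is a small sketch that models the store as a plain Python set of (subject, predicate, object) tuples. It does not use the Sesame API; the function names and sample data are hypothetical, and treating any object that starts with the ESG namespace as a resource reference is a simplification.

```python
ESG = "http://www.earthsystemgrid.org/esg.owl#"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def delete_subject_only(triples, uri):
    """Naive delete: drops statements about uri but keeps references to it."""
    return {t for t in triples if t[0] != uri}


def delete_resource(triples, uri):
    """Drop every triple that mentions uri as subject or as object."""
    return {t for t in triples if t[0] != uri and t[2] != uri}


def dangling_references(triples, prefix=ESG):
    """Return triples whose object looks like a resource with no statements.

    Simplification: any object starting with prefix is treated as a resource,
    so class URIs used with rdf:type would also be reported here.
    """
    subjects = {s for s, _, _ in triples}
    return {
        (s, p, o)
        for s, p, o in triples
        if o.startswith(prefix) and o not in subjects
    }


# Hypothetical sample data.
DATASET = ESG + "dataset_1"
COMPONENT = ESG + "modelcomponent_1"
TRIPLES = {
    (DATASET, ESG + "hasUnconfiguredModelComponent", COMPONENT),
    (DATASET, RDFS_LABEL, "Example dataset"),
    (COMPONENT, RDFS_LABEL, "Example component"),
}
```

Deleting only the component's own statements leaves the dataset pointing at a resource that no longer exists, which is the kind of leftover that made a full rebuild from the authoritative source the simpler option.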
Implementation Issues • Schema-less design was perceived as faster • Rapid ontology changes during development • Still needed data migration tools • Modeling the problem domain • Modeled a triplestore, not the domain • Very tightly coupled code was difficult to maintain and replace • Steep learning curve for new developers
URIs Are Foundational • Properly encoding URIs • Created unencoded URIs within the triplestore • Queries were created with string concatenation • Led to broken queries and data • Generated instance URIs through a lossy algorithm to get around encoding • Could only relate from source -> triple store
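One way to avoid the lossy, one-directional mapping described above is to percent-encode the source identifier when minting the URI, which makes the mapping reversible. The sketch below uses only the Python standard library; `instance_uri`, `local_id_from_uri` and `sparql_iri` are hypothetical helper names, not part of the original codebase. The character check follows the SPARQL grammar's rule for IRI references.

```python
import re
from urllib.parse import quote, unquote

ESG = "http://www.earthsystemgrid.org/esg.owl#"

# Characters the SPARQL grammar does not allow inside <...> IRI references:
# < > " { } | ^ ` \ and anything in the range U+0000 to U+0020.
_IRI_FORBIDDEN = re.compile(r'[<>"{}|^`\\\x00-\x20]')


def instance_uri(local_id):
    """Mint an instance URI by percent-encoding the source identifier."""
    return ESG + quote(local_id, safe="")


def local_id_from_uri(uri):
    """Recover the source identifier from a URI minted by instance_uri."""
    if not uri.startswith(ESG):
        raise ValueError("not an ESG instance URI: %r" % uri)
    return unquote(uri[len(ESG):])


def sparql_iri(uri):
    """Wrap a URI for use in a query, refusing values that would break it."""
    if _IRI_FORBIDDEN.search(uri):
        raise ValueError("URI contains characters not allowed in an IRI: %r" % uri)
    return "<" + uri + ">"
```

For example, `instance_uri("ucar.cgd ccsm/b30#1")` yields a URI whose fragment is `ucar.cgd%20ccsm%2Fb30%231`, and `local_id_from_uri` maps it back to the original identifier, so the relation works in both directions.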
Our Current Path Forward • Using SOLR Search • Fantastic search tool! • Metadata in RDBMS • Working well • Effective tools, including schema migration • Scales very well for our metadata • Still needed to expose RDF metadata…
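As a rough sketch of the faceted, free-text search that SOLR took over, the snippet below builds a request URL for a Solr select handler. The parameters `q`, `rows`, `start`, `facet`, `facet.field` and `wt` are standard Solr query parameters; the base URL, the core name `datasets` and the facet field names are assumptions for illustration only. No request is sent.

```python
from urllib.parse import urlencode


def build_search_url(base_url, text, facet_fields, rows=10, start=0):
    """Build a Solr select URL for a free-text query with field facets."""
    params = [
        ("q", text),          # free-text query
        ("rows", rows),       # page size (limit)
        ("start", start),     # paging offset
        ("facet", "true"),    # turn faceting on
        ("wt", "json"),       # response format
    ]
    # One facet.field parameter per facet to compute.
    params += [("facet.field", field) for field in facet_fields]
    return base_url + "?" + urlencode(params)


# Hypothetical core and field names.
EXAMPLE_URL = build_search_url(
    "http://localhost:8983/solr/datasets/select",
    "sea surface temperature",
    ["model", "experiment"],
)
```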
RDF with RDBMS <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xmlns:sesame="http://www.openrdf.org/schema/sesame#" xmlns:esg="http://www.earthsystemgrid.org/esg.owl#"> <rdf:Description rdf:about="http://www.earthsystemgrid.org/esg.owl#${rdfIdFactory.getDatasetId(dataset)}"> <rdf:type rdf:resource="http://www.earthsystemgrid.org/esg.owl#Resource" /> <rdf:type rdf:resource="http://www.earthsystemgrid.org/esg.owl#Dataset" /> <rdf:type rdf:resource="http://www.earthsystemgrid.org/esg.owl#GeophysicalDataset" /> <rdf:type rdf:resource="http://www.earthsystemgrid.org/esg.owl#ModelDataset" /> <esg:hasUri rdf:datatype="http://www.w3.org/2001/XMLSchema#string"> resource://${gateway.name?upper_case}#${dataset.persistentIdentifier} </esg:hasUri> <rdfs:label rdf:datatype="http://www.w3.org/2001/XMLSchema#string">${dataset.name}</rdfs:label> </rdf:Description> </rdf:RDF>
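The slide renders RDF/XML from a database record with a template. A comparable approach, sketched here with Python's standard `xml.etree.ElementTree` rather than the template engine shown above, builds the same kind of document while leaving XML escaping to the library. The `dataset_to_rdf` function and its parameters are hypothetical; the namespaces and class names are taken from the slide.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
ESG = "http://www.earthsystemgrid.org/esg.owl#"
XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"

# Use readable prefixes in the serialized output.
ET.register_namespace("rdf", RDF)
ET.register_namespace("rdfs", RDFS)
ET.register_namespace("esg", ESG)


def dataset_to_rdf(dataset_id, name, types=("Resource", "Dataset")):
    """Serialize one dataset record as an RDF/XML string."""
    root = ET.Element("{%s}RDF" % RDF)
    desc = ET.SubElement(
        root,
        "{%s}Description" % RDF,
        # Percent-encode the identifier so the resulting URI stays valid.
        {"{%s}about" % RDF: ESG + quote(dataset_id, safe="")},
    )
    for type_name in types:
        ET.SubElement(desc, "{%s}type" % RDF, {"{%s}resource" % RDF: ESG + type_name})
    label = ET.SubElement(desc, "{%s}label" % RDFS, {"{%s}datatype" % RDF: XSD_STRING})
    label.text = name  # ElementTree escapes &, < and > on output
    return ET.tostring(root, encoding="unicode")
```

Building the tree programmatically means a dataset name containing `&` or `<` cannot produce malformed XML, which is one of the failure modes of plain string templating.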
Looking Forward • Storing metadata • Content management systems? • NoSQL storage options? • Modeling complicated relationships • Neo4J looks promising…
Questions / Discussion • Nathan Wilhelmi • wilhelmi@ucar.edu