
Data Sets, Vocabularies and Tools

Pablo N. Mendes, Freie Universität Berlin. 1st year review, Luxembourg, December 2011.


Presentation Transcript


  1. Data Sets, Vocabularies and Tools. Pablo N. Mendes, Freie Universität Berlin. 1st year review, Luxembourg, December 2011.

  2. Work Plan View WP4 (Gantt chart; timeline axis ticks 0 to 48 in steps of 6)
     • D4.1 Assembly and maintenance of the PlanetData data set catalogue
     • D4.2 Best practices on how to provide self-describing data
     • Task 4.1 Assembly and maintenance of the PlanetData data set catalogue, FUB
     • D4.3 PlanetData data sets, vocabularies and provisioning tools catalogue and access portal
     • Task 4.2 Community-driven creation and maintenance of vocabularies, KIT
     • Task 4.3 Development of best practices for providing self-describing data
     • D4.4 Data quality benchmark dataset, KIT
     • D4.5 PlanetData data sets, vocabularies and provisioning tools catalogue and access portal
     • Task 4.4 Assembly and maintenance of a catalogue of data provisioning tools, UPM

  3. Work Plan View WP5 (Gantt chart; timeline axis ticks 0 to 48 in steps of 6)
     • D5.1 PlanetData data management tools catalogue and access portal
     • D5.3 PlanetData data management tools catalogue and access portal
     • Task 5.1 Assembly and maintenance of PlanetData technology catalogue, EPFL
     • D5.2 Best practices on how to deploy tools on large-scale infrastructures
     • D5.3 PlanetData data management tools catalogue and access portal
     • Task 5.2 Development of best practices of large-scale data management infrastructures, KIT

  4. Summary
     • WP4
         • Assembly and maintenance of the PlanetData data set, vocabularies and tools catalogue
         • Community-driven creation and maintenance of vocabularies
         • Development of best practices
     • WP5
         • Assembly and maintenance of the PlanetData technology catalogue
         • Best practices for large-scale data management infrastructure

  5. Deliverables in Year 1
     • D4.1
         • Data Sets Catalog
         • Vocabularies Catalog
     • D5.1
         • Data Management Tools Catalog

  6. Data Sets Catalog
     • Where to maintain the catalog?
     • How to catalog?
     • What to catalog?
     • How to provide access for humans and machines?
     • How to organize a community around the catalog?

  7. Repository: TheDataHub.org
     • Maintained by the Open Knowledge Foundation (OKF) and the world-wide open data community
     • Widely used catalog
     • As of Dec 1st 2011: 2418 datasets, 314 of them LOD
     • Features of the portal:
         • Tagging, Rating, Feedback, Discussions, Groups

  8. Cataloguing Process
     • PlanetData editor
         • Collected a list of new datasets → 49 new entries
         • Updated existing entries (537 edits)
     • Crowdsourcing: data providers and third parties
         • Public call for action to mailing lists, OKFN blog
         • Supported the community contributions
     • Quality Assurance
         • Tools to support cataloguing (validator, auto-complete)
         • Joint work with LATC

  9. Catalog Metadata QuickRef
     • What?
         • package name, title, url
         • tag:lod
         • topic
         • shortname
         • format-*
     • Who?
         • author || maintainer
         • published by producer
         • provenance metadata
         • license
     • When?
         • version
         • last updated
     • How much?
         • triples
         • links:* (outlinks)
         • namespace (inlinks)
         • vocab mappings
     • Where to find?
         • example URI
         • downloads/dumps
         • SPARQL endpoint
     • Why?
         • package description

  10. Catalog Metadata
     • How are datasets described?
     • Resources:
         • example URIs
         • SPARQL endpoint
         • RDF dumps
         • Sitemaps, VoID files
     http://www.w3.org/wiki/TaskForces/CommunityProjects/LinkingOpenData/DataSets/CKANmetainformation

  11. Cataloguing process overview

  12. Catalog Entry Validator
     http://www4.wiwiss.fu-berlin.de/lodcloud/ckan/validator/validate.php
     • Checks levels of metadata completeness
     • Step-by-step annotation instructions
     • Already checks some quality indicators, e.g. availability, provenance, access methods
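The idea of checking an entry against levels of metadata completeness can be sketched roughly as follows. This is only an illustration, not the validator's actual rules: the level names, the grouping of fields into levels, and the function names `missing_fields` and `completeness_level` are all hypothetical, loosely based on the fields in the metadata QuickRef.

```python
# Hypothetical completeness levels; the real validator's rules are not given in the slides.
LEVELS = [
    ("basic", ["name", "title", "url"]),
    ("described", ["author", "license", "description"]),
    ("accessible", ["example_uri", "sparql_endpoint"]),
]


def missing_fields(entry, fields):
    """Return the fields that are absent or empty in the entry."""
    return [f for f in fields if not entry.get(f)]


def completeness_level(entry):
    """Return (highest level fully satisfied, fields blocking the next level)."""
    reached = None
    for level, fields in LEVELS:
        missing = missing_fields(entry, fields)
        if missing:
            return reached, missing
        reached = level
    return reached, []


# Made-up entry for illustration; not taken from the catalog.
entry = {
    "name": "example-dataset",
    "title": "Example Dataset",
    "url": "http://example.org/",
    "author": "Example Author",
    "license": "",
}
level, todo = completeness_level(entry)
print(level, todo)
```

Returning the blocking fields alongside the level is one way to drive step-by-step annotation instructions: the editor sees exactly what to fill in next.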

  13. CKAN Entry Validator (2)

  14. Auto-completion scripts
     • For the entries that pass the validator, we can auto-complete metadata with information such as:
         • Number of triples
         • Links to other sources
         • Vocabularies used
         • Quality indicators
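A minimal sketch of how such metadata could be derived from a dataset's triples is shown below. It is not the project's script. It assumes triples are already parsed into `(subject, predicate, object)` string tuples with literals as plain strings, treats a vocabulary as the namespace part of a predicate URI, and counts a link whenever an object URI lives on a different host than the dataset; `namespace`, `autocomplete`, and `own_host` are hypothetical names.

```python
from collections import Counter
from urllib.parse import urlparse

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def namespace(uri):
    """Cut a URI after its last '#' or '/' to approximate the vocabulary namespace."""
    cut = max(uri.rfind("#"), uri.rfind("/"))
    return uri[: cut + 1]


def autocomplete(triples, own_host):
    """Derive triple count, outlinks per target host, and vocabularies used."""
    outlinks = Counter()
    vocabularies = set()
    for s, p, o in triples:
        vocabularies.add(namespace(p))
        # rdf:type objects are classes, not links to other data sources.
        if p != RDF_TYPE and o.startswith("http"):
            target = urlparse(o).netloc
            if target != own_host:
                outlinks[target] += 1
    return {
        "triples": len(triples),
        "links": dict(outlinks),
        "vocabularies": sorted(vocabularies),
    }


# Made-up triples for illustration.
triples = [
    ("http://example.org/a", RDF_TYPE, "http://xmlns.com/foaf/0.1/Person"),
    ("http://example.org/a", "http://www.w3.org/2002/07/owl#sameAs", "http://dbpedia.org/resource/A"),
    ("http://example.org/a", "http://xmlns.com/foaf/0.1/name", "Alice"),
]
stats = autocomplete(triples, "example.org")
print(stats)
```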

  15. Catalog Access Portal
     • For machines
         • CKAN API (continuously improved by OKFN)
         • VoID descriptions for LOD group (will be continuously improved in cooperation with LATC)
     • For humans
         • LOD Cloud Diagram
         • State of the LOD Report
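For machine access, a client might build a request URL and read the JSON it gets back along these lines. The slides do not say which CKAN API version was in use; this sketch assumes the `package_show` call of CKAN's Action API (version 3), and the base URL `http://ckan.example.org/` and the sample response are hypothetical. No network request is made here.

```python
import json
from urllib.parse import urlencode


def package_show_url(base_url, package_name):
    """Build the Action API URL that returns one catalog entry as JSON."""
    query = urlencode({"id": package_name})
    return f"{base_url.rstrip('/')}/api/3/action/package_show?{query}"


def summarize(response_text):
    """Extract name, tag names, and extras from a package_show response body."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        raise ValueError("CKAN reported a failed request")
    pkg = payload["result"]
    return {
        "name": pkg["name"],
        "tags": [t["name"] for t in pkg.get("tags", [])],
        "extras": {e["key"]: e["value"] for e in pkg.get("extras", [])},
    }


# Hand-written sample shaped like an Action API response (not real catalog data).
sample = json.dumps({
    "success": True,
    "result": {
        "name": "example-dataset",
        "tags": [{"name": "lod"}, {"name": "format-rdf"}],
        "extras": [{"key": "triples", "value": "1000"}],
    },
})
url = package_show_url("http://ckan.example.org/", "example-dataset")
summary = summarize(sample)
print(url)
print(summary)
```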

  16. LOD Cloud Diagram

  17. LOD Cloud Diagram (zoom in)

  18. State of the LOD Cloud (charts: triples by domain; links by domain) http://www4.wiwiss.fu-berlin.de/lodcloud/state/

  19. State of the LOD Cloud (2)
     • SPARQL endpoint: 68.14%
     • RDF dumps: 39.66%
     • Provide provenance: 36.63%
     • Provide licensing: 17.84%
     • Chart: vocabulary use
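Percentages like these can be computed from catalog entries as the share of entries that fill in a given field. The sketch below is an assumption about the method, not a description of how the report was produced; the field names and the four sample entries are made up, and the function name `share` is hypothetical.

```python
def share(entries, field):
    """Percentage of entries with a non-empty value for the field, to 2 decimals."""
    if not entries:
        return 0.0
    have = sum(1 for e in entries if e.get(field))
    return round(100.0 * have / len(entries), 2)


# Made-up entries for illustration; not the LOD catalog data.
entries = [
    {"name": "a", "sparql_endpoint": "http://a.example.org/sparql", "license": "cc-by"},
    {"name": "b", "sparql_endpoint": "http://b.example.org/sparql", "license": ""},
    {"name": "c", "sparql_endpoint": "http://c.example.org/sparql"},
    {"name": "d"},
]
sparql_share = share(entries, "sparql_endpoint")
license_share = share(entries, "license")
print(sparql_share, license_share)
```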

  20. Vocabularies Catalog
     • Based on BTC Dataset (2.1 billion triples)
     • Shows vocabulary usage in practice
     • Executed on a 54-node Hadoop cluster
     • Access portal:
         • Searchable
         • URI Lookup
         • Top usage statistics
     Hosted at http://vocab.cc
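The kind of usage statistics described here (top classes and top properties per dataset) can be illustrated on a small scale with plain counters. This is a single-machine sketch, not the Hadoop job used for the catalog. It assumes the input is a list of `(subject, predicate, object, context)` quads and that the host of the context URI identifies the dataset, which is a simplification; `usage_by_dataset` is a hypothetical name.

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def usage_by_dataset(quads):
    """Count class and property usage, grouped by the host of the context URI."""
    classes = defaultdict(Counter)
    properties = defaultdict(Counter)
    for s, p, o, context in quads:
        dataset = urlparse(context).netloc  # assumption: one dataset per host
        properties[dataset][p] += 1
        if p == RDF_TYPE:
            classes[dataset][o] += 1
    return classes, properties


# Made-up quads for illustration.
FOAF = "http://xmlns.com/foaf/0.1/"
quads = [
    ("http://example.org/a", RDF_TYPE, FOAF + "Person", "http://example.org/doc1"),
    ("http://example.org/b", RDF_TYPE, FOAF + "Person", "http://example.org/doc2"),
    ("http://example.org/a", FOAF + "name", "Alice", "http://example.org/doc1"),
    ("http://other.example.net/c", RDF_TYPE, FOAF + "Document", "http://other.example.net/d"),
]
classes, properties = usage_by_dataset(quads)
top_class = classes["example.org"].most_common(1)
top_property = properties["example.org"].most_common(1)
print(top_class, top_property)
```

On a cluster, the same grouping would typically be expressed as a map step emitting `(dataset, term)` keys and a reduce step summing the counts.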

  21. Top Classes per Dataset

  22. Top Properties per Dataset

  23. Vocabularies Catalog (screenshots: vocab.cc search query results; vocab.cc URI lookup results)

  24. Tools Catalog
     • Initial focus on tools from the consortium
     • Currently 15 tools
     Screenshot: entry for Global Sensor Networks (GSN). Available from planet-data.eu

  25. Tools Description
     • Textual description
     • What is it?
     • Documentation
     • Publications
     • Requirements
     • License
     • Contact person/mailing list
     • Organization
     • Events
     • Tags
     • Produce
     • Publish
     • Consume
     • Provisioning

  26. Names of Tools in the Catalog
     • CumulusRDF
     • D2R
     • DBpedia Spotlight
     • GSN (Global Sensor Networks)
     • Geometry2RDF
     • LDIF
     • LDSpider (Linked Data Spider)
     • LarKC (Large Knowledge Collider)
     • MonetDB
     • NOR2O
     • R2O & ODEMapster
     • OKKAM
     • Pubby
     • R2R
     • S2O
     • Silk

  27. Tools Catalog
     • Related: LATC Tools Catalog
         • 11 tools
         • 5 tools in both, 10 new tools in PlanetData
     • Proposal for next year:
         • Join catalogs at linkeddata.org
         • Jointly maintain catalog until LATC finishes
         • Build a community → people can add their own tools
         • Afterwards PlanetData takes over and maintains the catalog for another 2 years
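The overlap figures can be checked with simple set arithmetic. The sketch below takes the 15 PlanetData tools (from the Tools Catalog slide), the 11 tools listed under the LATC catalog, and the 5 tools in both; the size of a joined catalog is derived here and is not stated in the slides. `merge_counts` is a hypothetical helper.

```python
def merge_counts(size_a, size_b, in_both):
    """Inclusion-exclusion for two catalogs that share some entries."""
    if in_both > min(size_a, size_b):
        raise ValueError("overlap cannot exceed the smaller catalog")
    return {
        "only_a": size_a - in_both,
        "only_b": size_b - in_both,
        "merged": size_a + size_b - in_both,
    }


# a = PlanetData catalog (15 tools), b = LATC catalog (11 tools), 5 in both.
counts = merge_counts(15, 11, 5)
print(counts)
```

The `only_a` value reproduces the "10 new tools in PlanetData" on the slide.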
