Performing Object Consolidation on the Semantic Web Data Graph

Aidan Hogan

Andreas Harth

Stefan Decker

Introduction
  • Aim: To merge equivalent RDF instances in large-scale RDF datasets, i.e., to perform object consolidation
  • Background:
  • RDF (Resource Description Framework) is the data model used in Semantic Web technologies
  • Ideal for entity-centric applications where structured descriptions of entities are provided in RDF (e.g. SWSE); anything can be described in RDF
  • URIs are used as identifiers for entities
  • Ideally, URIs are used consistently across data sources to describe entities, so that information on an entity can be collected and merged from different sources
  • Problem:
  • URIs are often not agreed upon (or not provided) for entities across sources, especially for real-world entities (e.g. there is no agreed URI for a person). Therefore, one entity may be split across many instances.
  • Entity-centric applications will see multiple instances as multiple entities – problematic! Example later…
Towards a Solution
  • Towards a solution:
  • RDF data is backed by ontologies in which certain properties may be declared Inverse Functional
  • Inverse Functional Properties have values unique to an entity (e.g., chat usernames unique to people, ISBN code unique to books, etc.).
  • Therefore, if two instances have the same value for the same Inverse Functional Property, they are equivalent and can be merged.
  • Three sources provide data on one person – different identifiers used
  • Two different Inverse Functional Properties:
    • foaf:mbox referring to a person’s email
    • foaf:homepage referring to a person’s homepage
  • Before consolidation: three instances, one entity. For example, an entity-centric search engine would return three results for the one person.
  • After consolidation: one instance, one entity.
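
The merging rule can be illustrated with a minimal Python sketch. The identifiers, values and the function name `equivalent_pairs` below are invented for illustration; only the two properties (foaf:mbox and foaf:homepage) come from the example above.

```python
# Hypothetical data: three sources describe one person under different identifiers.
IFPS = {"foaf:mbox", "foaf:homepage"}

triples = [
    ("_:a", "foaf:name", "Jane Doe"),
    ("_:a", "foaf:mbox", "mailto:jane@example.org"),
    ("ex:jane", "foaf:mbox", "mailto:jane@example.org"),
    ("ex:jane", "foaf:homepage", "http://example.org/~jane"),
    ("_:b", "foaf:homepage", "http://example.org/~jane"),
    ("_:b", "foaf:nick", "jd"),
]


def equivalent_pairs(triples, ifps):
    """Pair up subjects that share a value for the same inverse functional property."""
    seen = {}   # (predicate, object) -> first subject seen with that value
    pairs = []
    for s, p, o in triples:
        if p not in ifps:
            continue
        first = seen.setdefault((p, o), s)
        if first != s:
            pairs.append((first, s))
    return pairs


pairs = equivalent_pairs(triples, IFPS)
print(pairs)
```

Here `_:a` and `ex:jane` share a mailbox, and `ex:jane` and `_:b` share a homepage, so all three identifiers end up describing one entity once the pairs are closed transitively.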
Our Dataset
  • Want to perform object consolidation on entire RDF Semantic Web data graph…
  • 470M statements from multiple schemas describing 72M instances from over 3M data sources
  • 84% of instances have no URI identifier
  • Majority of data is FOAF (Friend of a Friend) descriptions of people (78%), with 99.9% having no identifiers
  • => We need a scalable algorithm for performing object consolidation
Step 1
  • Need to identify Inverse Functional Properties in dataset
  • Inverse functional properties are defined in ontologies
  • Need to retrieve ontologies describing properties in the dataset
  • Can dereference the property URIs to find the pertinent ontologies
  • Examples of inverse functional properties found:
    • foaf:mbox (email property), foaf:homepage, foaf:weblog, foaf:aimChatID and other chat ID properties, doap:homepage
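
As a rough sketch of Step 1, assume the ontologies have already been retrieved and parsed into (subject, predicate, object) tuples; dereferencing and parsing are omitted here. In OWL, a property is declared inverse functional by giving it the type `owl:InverseFunctionalProperty`. The function name and the hand-written ontology triples are illustrative assumptions, not the authors' code or real ontology content.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
OWL_IFP = "http://www.w3.org/2002/07/owl#InverseFunctionalProperty"
FOAF = "http://xmlns.com/foaf/0.1/"


def find_inverse_functional(ontology_triples, predicates_in_data):
    """Return the predicates used in the data that an ontology declares inverse functional."""
    declared = {s for s, p, o in ontology_triples if p == RDF_TYPE and o == OWL_IFP}
    return declared & set(predicates_in_data)


# Hand-written stand-in for parsed ontology statements (illustration only).
ontology_triples = [
    (FOAF + "mbox", RDF_TYPE, OWL_IFP),
    (FOAF + "homepage", RDF_TYPE, OWL_IFP),
    (FOAF + "name", RDF_TYPE, "http://www.w3.org/2002/07/owl#DatatypeProperty"),
    ("http://example.org/vocab#badgeId", RDF_TYPE, OWL_IFP),  # not used in the data
]
predicates_in_data = {FOAF + "mbox", FOAF + "homepage", FOAF + "name"}

ifps = find_inverse_functional(ontology_triples, predicates_in_data)
print(sorted(ifps))
```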
Step 2
  • Need to re-order data on-disk
  • initially, data is in NQuads format, unsorted, in SPOC order
  • Subject = identifier of entity being described
  • Predicate = property of entity being described
  • Object = value of property
  • Context = data-source of SPO triple
  • data is re-ordered to POCS order and sorted. Now data is grouped by predicate and then by object.
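
The re-ordering can be sketched in a few lines of Python. This is an in-memory stand-in with made-up quads; a dataset of the size described would need an on-disk (external) sort, which is not shown. The helper name `to_pocs` is an assumption.

```python
def to_pocs(quad):
    """Re-order a (subject, predicate, object, context) quad to (predicate, object, context, subject)."""
    s, p, o, c = quad
    return (p, o, c, s)


# Hypothetical quads in unsorted SPOC order.
quads = [
    ("_:a", "foaf:mbox", "mailto:jane@example.org", "http://src1.example/"),
    ("_:a", "foaf:name", "Jane Doe", "http://src1.example/"),
    ("ex:jane", "foaf:mbox", "mailto:jane@example.org", "http://src2.example/"),
    ("ex:jane", "foaf:homepage", "http://example.org/~jane", "http://src2.example/"),
    ("_:b", "foaf:homepage", "http://example.org/~jane", "http://src3.example/"),
]

# Sorting the re-ordered tuples places statements with the same predicate,
# and then the same object, next to each other.
pocs_sorted = sorted(to_pocs(q) for q in quads)
for row in pocs_sorted:
    print(row)
```

After sorting, the two foaf:homepage statements are adjacent, as are the two foaf:mbox statements, so a single sequential scan can compare neighbours.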
Step 3
  • Scan data for equivalent instances
  • scan the sorted POCS data looking for equivalent instances
  • if a predicate is an inverse functional property and the same object value appears with two different subjects, the instances identified by those subjects are equivalent and describe the same entity
  • equivalence is transitive, so a “same-as table” is used to store the equivalences and compute their transitive closure.
    • each row of the table contains equivalent identifiers
    • no identifier can appear in more than one row
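
One possible way to realise such a same-as table is a union-find (disjoint-set) structure, sketched below. Union-find is chosen here for illustration; the slides do not say how the table is implemented. The class name, the function name `scan` and the sample data are assumptions.

```python
from itertools import groupby


class SameAsTable:
    """Disjoint sets of equivalent identifiers; each identifier belongs to one row only."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:      # path compression
            nxt = self.parent[x]
            self.parent[x] = root
            x = nxt
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

    def rows(self):
        groups = {}
        for ident in list(self.parent):
            groups.setdefault(self.find(ident), set()).add(ident)
        return [sorted(g) for g in groups.values()]


def scan(pocs_sorted, ifps):
    """Scan sorted POCS tuples; subjects sharing an IFP value are recorded as equivalent."""
    table = SameAsTable()
    for (p, o), group in groupby(pocs_sorted, key=lambda q: (q[0], q[1])):
        if p not in ifps:
            continue
        subjects = sorted({q[3] for q in group})
        for other in subjects[1:]:
            table.union(subjects[0], other)
    return table


# Hypothetical (predicate, object, context, subject) tuples.
pocs_sorted = sorted([
    ("foaf:homepage", "http://example.org/~jane", "http://src2.example/", "ex:jane"),
    ("foaf:homepage", "http://example.org/~jane", "http://src3.example/", "_:b"),
    ("foaf:mbox", "mailto:jane@example.org", "http://src1.example/", "_:a"),
    ("foaf:mbox", "mailto:jane@example.org", "http://src2.example/", "ex:jane"),
    ("foaf:name", "Jane Doe", "http://src1.example/", "_:a"),
])

table = scan(pocs_sorted, {"foaf:mbox", "foaf:homepage"})
print(table.rows())
```

Because `union` merges whole sets, the transitive closure is maintained incrementally: `_:a` and `_:b` land in the same row even though they never share a value directly.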
Step 4
  • Pick identifiers
  • Now we have a list of equivalent instance identifiers… we need to pick one and use it for the consolidated instance
  • We…
    • Pick URIs before blank nodes
    • Subject to the above, pick the more commonly used identifier
  • Another scan of the data is performed to count the number of statements in which each identifier appears (for identifiers in the same-as table).
  • The new identifiers are called pivot identifiers
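
The selection rule can be sketched as follows. The sketch assumes blank nodes are written with the `_:` prefix (as in N-Triples/N-Quads) and counts an identifier once per statement in which it appears as subject or object; whether the original count covers both positions is not stated. Function names and data are hypothetical.

```python
from collections import Counter


def count_usage(quads, same_as_rows):
    """Count, per tracked identifier, the statements in which it appears as subject or object."""
    tracked = {ident for row in same_as_rows for ident in row}
    counts = Counter()
    for s, p, o, c in quads:
        for ident in {s, o}:
            if ident in tracked:
                counts[ident] += 1
    return counts


def pick_pivot(row, counts):
    """Prefer URIs over blank nodes, then the more frequently used identifier."""
    return max(row, key=lambda ident: (not ident.startswith("_:"), counts[ident]))


# Hypothetical data: one same-as row and the statements mentioning its members.
same_as_rows = [["_:a", "ex:jane", "ex:jdoe"]]
quads = [
    ("_:a", "foaf:name", "Jane Doe", "src1"),
    ("_:a", "foaf:mbox", "mailto:jane@example.org", "src1"),
    ("_:a", "foaf:nick", "jd", "src1"),
    ("ex:jane", "foaf:mbox", "mailto:jane@example.org", "src2"),
    ("ex:bob", "foaf:knows", "ex:jane", "src2"),
    ("ex:jdoe", "foaf:mbox", "mailto:jane@example.org", "src3"),
]

counts = count_usage(quads, same_as_rows)
pivots = {ident: pick_pivot(row, counts) for row in same_as_rows for ident in row}
print(pivots)
```

In this data the blank node `_:a` is the most frequent identifier, but the URI `ex:jane` is still chosen as pivot because URIs take priority.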
Step 5
  • Rewrite identifiers
  • Data is scanned and identifiers in subject and object position are rewritten to pivot identifiers
  • …one iteration complete
  • It’s possible that more than one iteration may be required… If a value of an inverse functional property is changed in one iteration, more equivalences may be found by another iteration
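
The rewriting scan for a single iteration might look like the sketch below. The `pivots` mapping (identifier to pivot identifier) is assumed to come from the previous step; the names and data are illustrative, and the sketch assumes literal values never collide with identifier strings.

```python
def rewrite(quads, pivots):
    """Replace identifiers in subject and object position with their pivot identifier."""
    for s, p, o, c in quads:
        yield (pivots.get(s, s), p, pivots.get(o, o), c)


# Hypothetical mapping from identifiers to pivot identifiers.
pivots = {"_:a": "ex:jane", "_:b": "ex:jane", "ex:jane": "ex:jane"}

quads = [
    ("_:a", "foaf:name", "Jane Doe", "src1"),
    ("ex:bob", "foaf:knows", "_:b", "src3"),
    ("ex:jane", "foaf:homepage", "http://example.org/~jane", "src2"),
]

rewritten = list(rewrite(quads, pivots))
for quad in rewritten:
    print(quad)
```

Predicates and contexts are left untouched; only the subject and object positions are looked up.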
  • Issues encountered when applying the algorithm to the 480M dataset:
  • foaf:weblog is defined as inverse functional, but is often given values that are communal or shared weblogs (not unique to a person)
    • we removed foaf:weblog from list of inverse functional properties
  • many people define common arbitrary values for properties such as chat IDs; e.g., ask, none
    • we define a black-list for such values
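
Both workarounds amount to a filter applied before a statement is allowed to imply an equivalence. In the sketch below the two black-listed values are the ones quoted above; the case and whitespace normalisation, the shortened property list and the function name are assumptions made for illustration.

```python
# foaf:weblog is deliberately left out of the list of inverse functional properties.
IFPS = {"foaf:mbox", "foaf:homepage", "foaf:aimChatID"}

# Placeholder values that many people share; "ask" and "none" are the quoted examples.
BLACKLIST = {"ask", "none"}


def usable_for_consolidation(predicate, value, ifps=IFPS, blacklist=BLACKLIST):
    """True if the statement may be used to infer that two instances are equivalent."""
    if predicate not in ifps:
        return False
    return value.strip().lower() not in blacklist


checks = [
    usable_for_consolidation("foaf:aimChatID", "jane_d"),
    usable_for_consolidation("foaf:aimChatID", "None"),
    usable_for_consolidation("foaf:weblog", "http://example.org/shared-blog"),
]
print(checks)
```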
  • 2,443,939 instances consolidated to 401,385
  • 1 iteration required
  • The following table shows the number of atomic equivalences found through the main inverse functional properties