Characterizing Knowledge on the Semantic Web with Watson

Mathieu d’Aquin, Claudio Baldassarre, Laurian Gridinoc, Sofia Angeletou, Marta Sabou, Enrico Motta
The Knowledge Media Institute, The Open University
m.daquin@open.ac.uk

The Semantic Web is Growing
Lee, J., Goodwin, R. (2004) The Semantic Webscape: a View of the Semantic Web. IBM Research Report.

The Semantic Web is growing…
http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData

Next Generation Semantic Web applications
• Exploiting the Semantic Web rather than engineering their own knowledge/ontologies
• Need for a Gateway to the Semantic Web

Watson: a Gateway to the Semantic Web

More on Watson? See also…
• Watson Web Interface: http://watson.kmi.open.ac.uk
• Watson poster and demo at ISWC 2007…

Characterizing Knowledge in Watson?
• Besides being a gateway for applications, Watson gives the opportunity to better understand, through an analysis of its repository:
  • How semantic technologies are used to publish knowledge online
  • How knowledge is structured on the Semantic Web
  • How ontologies and semantic documents are interconnected in a semantic network
• Such an analysis provides valuable information for application and tool developers concerning the knowledge they have to manipulate.

The Watson Collection
• Collecting semantic content:
  • A number of specialized crawlers for Google, ontology repositories (e.g. Swoogle), PingTheSemanticWeb, etc.
  • Documents are validated by parsing them with Jena, to keep only RDF documents
• Filters:
  • Before filtering, the repository was composed almost entirely of RSS and FOAF documents (more than 5 times the number of other documents)
  • The analysis would therefore have been more an analysis of RSS and FOAF than of anything else
  • These documents have been filtered out
  • A separate analysis of the FOAF part of the repository would be interesting
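The filtering step can be illustrated with a minimal sketch. It assumes each document has already been parsed into (subject, predicate, object) string triples, and it treats a document as FOAF or RSS when most of its typed resources come from those namespaces. The threshold, the function names and the namespace list are assumptions made for illustration; the slides do not describe how Watson actually detects these documents.

```python
# Namespaces to filter out (assumption: the FOAF 0.1 and RSS 1.0 namespaces).
FILTERED_NAMESPACES = (
    "http://xmlns.com/foaf/0.1/",
    "http://purl.org/rss/1.0/",
)
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def is_foaf_or_rss(triples, threshold=0.5):
    """Return True if more than `threshold` of the rdf:type statements
    point to a class in one of the filtered namespaces (hypothetical rule)."""
    types = [o for (_, p, o) in triples if p == RDF_TYPE]
    if not types:
        return False
    hits = sum(1 for t in types if t.startswith(FILTERED_NAMESPACES))
    return hits / len(types) > threshold


def filter_collection(documents):
    """documents: dict mapping a document URL to its list of triples.
    Keep only the documents that are not classified as FOAF or RSS."""
    return {url: t for url, t in documents.items() if not is_foaf_or_rss(t)}
```

A stricter or looser threshold changes how mixed documents (for example an ontology that also contains a few foaf:Person instances) are handled.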

The Watson Collection
Result: almost 25,500 semantic documents

The Watson Collection
• In order to index these documents, Watson extracts information about them:
  • Information about the content: classes, properties and individuals, the relations between them, the coverage in terms of domain topics, etc.
  • Information about the representation: the language used and its expressivity, the size and structure of the document, etc.
  • Information about the network aspects of semantic documents: identification, links between documents, etc.
• It is these elements of information that we intend to analyse.
• Note that all these elements of information are freely available through the Watson API.

In the Following
Measures on the following aspects:
• Usage of semantic technologies to publish knowledge on the Web
• Structure and coverage of semantic documents
• The knowledge network
Focusing more on the most “debatable” elements.

Semantic Web languages…
• Here a document is considered to be in a given language if it instantiates an entity of that language
  • e.g. a document is in both OWL and RDF-S if it contains an OWL property and an RDF-S class
• The majority is factual data in RDF
• OWL has been adopted as the ontology language
• Less overlap between OWL and RDF-S than between DAML+OIL and RDF-S:
  • Better separation of the meta-models in OWL
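The language test above can be sketched as follows. This is one narrow reading of "instantiates an entity of the language": a document counts as being in a language when one of its rdf:type statements points to a class defined in that language's namespace. The slides do not spell out the exact test, so the rule and the function name `languages_of` are assumptions; triples are assumed to be (subject, predicate, object) strings.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Standard namespaces of the languages discussed in the slide.
LANGUAGES = {
    "RDF": RDF,
    "RDF-S": "http://www.w3.org/2000/01/rdf-schema#",
    "OWL": "http://www.w3.org/2002/07/owl#",
    "DAML+OIL": "http://www.daml.org/2001/03/daml+oil#",
}


def languages_of(triples):
    """Return the set of languages whose classes are instantiated
    in the document (assumed interpretation of the slide's criterion)."""
    found = set()
    for _, predicate, obj in triples:
        if predicate != RDF + "type":
            continue
        for name, namespace in LANGUAGES.items():
            if obj.startswith(namespace):
                found.add(name)
    return found
```

Under this reading, a document declaring an owl:ObjectProperty and an rdfs:Class is reported as being in both OWL and RDF-S, which matches the example given in the slide.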

… and their expressivity
• Apparent contradiction:
  • Most of the documents are in OWL Full
  • But 95% use only a very restricted part of the expressive power of OWL (below OWL Lite)
• Documents fall into OWL Full because of simple syntactic mistakes

Size of the documents
• As for expressivity, a power law distribution: lots of very small documents and a few very large ones (both for ontological knowledge and for factual data, but on different scales)
[Figures: number of documents against number of classes; number of documents against number of instances]
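One rough way to check for such a distribution is to count how many documents have each size and fit a straight line to the counts on a log-log scale; a clearly negative slope is consistent with a power law. The sketch below assumes document sizes (numbers of classes or instances) are already available as a list of integers. A least-squares fit of this kind is only a quick diagnostic, not a rigorous test, and it is not claimed to be the method used in the study.

```python
import math
from collections import Counter


def size_histogram(sizes):
    """Map each (positive) document size to the number of documents of that size."""
    return Counter(s for s in sizes if s > 0)


def loglog_slope(histogram):
    """Least-squares slope of log(count) against log(size)."""
    points = [(math.log(s), math.log(c)) for s, c in histogram.items()]
    n = len(points)
    if n < 2:
        raise ValueError("need at least two distinct sizes to fit a slope")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx
```

For example, `loglog_slope(size_histogram(class_counts))` could be computed separately for class counts and instance counts to compare the two scales mentioned in the slide (`class_counts` being a hypothetical list of per-document class counts).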

Density of the representation
• On average, classes are:
  • Poorly defined (small number of properties and super-classes per class)
  • Highly instantiated (high number of instances per class)
• Even the best represented class in each ontology has only 1 property on average.
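The density measures named above can be sketched as simple averages over a document's triples. The sketch assumes that a property "belongs" to a class when the class is its rdfs:domain, and that classes are the resources typed as owl:Class or rdfs:Class; both are simplifying assumptions, and `class_density` is a hypothetical helper rather than part of Watson.

```python
from collections import defaultdict

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_SUBCLASS = "http://www.w3.org/2000/01/rdf-schema#subClassOf"
RDFS_DOMAIN = "http://www.w3.org/2000/01/rdf-schema#domain"
RDFS_CLASS = "http://www.w3.org/2000/01/rdf-schema#Class"
OWL_CLASS = "http://www.w3.org/2002/07/owl#Class"


def class_density(triples):
    """Average number of properties, super-classes and instances per class.
    Returns None when the document declares no class."""
    classes = {s for s, p, o in triples
               if p == RDF_TYPE and o in (OWL_CLASS, RDFS_CLASS)}
    if not classes:
        return None
    props, supers, insts = defaultdict(set), defaultdict(set), defaultdict(set)
    for s, p, o in triples:
        if p == RDFS_DOMAIN and o in classes:
            props[o].add(s)       # property s is attached to class o
        elif p == RDFS_SUBCLASS and s in classes:
            supers[s].add(o)      # class s has super-class o
        elif p == RDF_TYPE and o in classes:
            insts[o].add(s)       # resource s is an instance of class o
    n = len(classes)
    return {
        "properties_per_class": sum(len(props[c]) for c in classes) / n,
        "superclasses_per_class": sum(len(supers[c]) for c in classes) / n,
        "instances_per_class": sum(len(insts[c]) for c in classes) / n,
    }
```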

Topic Domain Coverage
• Level of coverage of ontologies for the top categories in DMOZ (details in the paper)
• Very heterogeneous distribution
• Not well correlated with the distribution of topics on the Web

Identification of semantic documents
• Contributes to the networked and distributed aspects of the Semantic Web
• URIs are unique identifiers, but when applied to ontologies, they may be duplicated:
  • Default URI of the ontology editor (Protégé)
  • Misuse of the URI of existing vocabularies (OWL)
  • Different versions of an ontology having the same URI
• Also, it is good practice for URIs to be dereferenceable, but only 30% of the semantic documents can be reached through their URI.
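A dereferenceability measure of this kind can be sketched with the Python standard library. The sketch below only checks that a GET request on the URI returns a 2xx status; it does not verify that the response is the expected semantic document, and it is not claimed to be how the 30% figure was obtained. The names `http_fetch` and `dereferenceable_ratio` are hypothetical, and the `fetch` parameter exists so the check can be swapped out or tested offline.

```python
import urllib.error
import urllib.request


def http_fetch(uri, timeout=10.0):
    """Return True if a GET on `uri` (asking for RDF/XML) answers with a 2xx status."""
    request = urllib.request.Request(
        uri, headers={"Accept": "application/rdf+xml"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, ValueError, OSError):
        # Unreachable host, HTTP error status, timeout or malformed URI.
        return False


def dereferenceable_ratio(uris, fetch=http_fetch):
    """Share of URIs for which `fetch` succeeds (0.0 for an empty input)."""
    uris = list(uris)
    if not uris:
        return 0.0
    return sum(1 for uri in uris if fetch(uri)) / len(uris)
```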

Connectedness and Redundancy
• Connectedness and redundancy are both important aspects of distributed systems.
• Connectedness:
  • A few large providers (W3.org, Stanford) and a few locally dense networks (Ontoworld)
  • Otherwise, very local ontologies
• Redundancy:
  • Almost 30% of the semantic documents are duplicates
  • 12% of the entities are described more than once
• Better support for the network aspects of ontologies is required.
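The two redundancy figures can be illustrated with a minimal sketch. It assumes documents are given as a dict from URL to a list of (subject, predicate, object) string tuples, treats two documents as duplicates when their sets of triples are identical, and treats an entity as "described more than once" when it appears as a subject in more than one document. These definitions and the function names are assumptions; the slides do not state the exact criteria used.

```python
import hashlib
from collections import defaultdict


def content_fingerprint(triples):
    """Order-independent hash of a document's triples (blank nodes are not
    handled specially, which is a simplification)."""
    canonical = "\n".join(sorted(" ".join(t) for t in set(triples)))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def duplicate_ratio(documents):
    """Share of documents whose content repeats that of another document."""
    if not documents:
        return 0.0
    fingerprints = {content_fingerprint(t) for t in documents.values()}
    return (len(documents) - len(fingerprints)) / len(documents)


def redescribed_ratio(documents):
    """Share of entities that occur as a subject in more than one document."""
    described_in = defaultdict(set)
    for url, triples in documents.items():
        for subject, _, _ in triples:
            described_in[subject].add(url)
    if not described_in:
        return 0.0
    repeated = sum(1 for urls in described_in.values() if len(urls) > 1)
    return repeated / len(described_in)
```

Note that with these definitions exact duplicates also inflate the second measure; removing duplicates first would give a stricter count.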

Conclusion
• Our analysis allows us to draw some conclusions about the characteristics of the knowledge published online.
• In particular, it shows that:
  • Semantic Web documents tend to be small, lightweight and weakly structured
  • Efforts are still required to publish knowledge in a variety of domains
  • The network aspects are not sufficiently taken into consideration in semantic technologies
• These constitute valuable information for tool and application developers.

Limitations
• This work can be seen as a first step towards a fine-grained characterization of the Semantic Web.
• But in its current state, it suffers from a number of limitations:
  • Only a sample of the Semantic Web
  • A snapshot of the current dataset. Should consider evolution
  • Simple analysis methods. Would data mining approaches be relevant?
  • The analyzed aspects are insufficient to fully capture the quality of the knowledge available online

A last word…
• We believe that the field of evaluation of ontologies and ontology-based tools could provide valuable inputs to this study, so please comment, suggest, question…
• Watson is an open system; our data is available through the Watson API.
http://watson.kmi.open.ac.uk
m.daquin@open.ac.uk
