Benchmarking ontology-based annotation tools for the Semantic Web

Diana Maynard, University of Sheffield, UK

What?
• Work in the context of the EU Network of Excellence KnowledgeWeb
• Case studies in the field of bioinformatics
• Developing benchmarking tools and test suites for ontology generation and evaluation
• New metrics for evaluation
• New visualisation tools
• Development of usability criteria

Why?
• Increasing interest in the use of ontologies in bioinformatics, as a means of accessing information automatically from large databases
• Ontologies such as the Gene Ontology (GO) enable annotation and querying of large databases such as SWISS-PROT
• Methods for information extraction (IE) have become extremely important in these fields
• Development of ontology-based IE (OBIE) applications is hampered by a lack of standardisation and of suitable metrics for testing and evaluation
• The main focus until now has been on performance rather than on practical aspects such as usability and accessibility

Gene Ontology
• Collaborative ontology construction has been practised in the Gene Ontology community for longer than in most other communities
• This makes it a good case study for testing applications and metrics
• Used in KnowledgeWeb to show that state-of-the-art tools supporting communities that create their own ontologies can be further advanced by, amongst other things, suitable evaluation techniques

Automatic Annotation Tools
• Semantic annotation is used to create metadata linking the text to one or more ontologies
• Enables us to combine and associate existing ontologies, to perform more detailed analysis of the text, and to extract deeper and more accurate knowledge
• Semantic annotation generally relies on ontology-based IE techniques
• Suitable evaluation metrics and tools for these new techniques are currently lacking

Requirements for Semantic Annotation Tools
• Expected functionality: level of automation, target domain, text size, speed
• Interoperability: ontology format, annotation format, platform, browser
• Usability: installation, documentation, ease of use, aesthetics
• Accessibility: flexibility of design, input and display alternatives
• Scalability: text and ontology size
• Reusability: range of applications

Performance Evaluation Metrics
• Evaluation metrics mathematically define how to measure the system's performance against a human-annotated gold standard
• A scoring program implements the metric and provides performance measures:
  • for each document and over the entire corpus
  • for each type of annotation
  • optionally, for changes over time
• A gold standard reference set also needs to be provided; this may be time-consuming to produce
• Visualisation tools show the results graphically and enable easy comparison
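The scoring step described above can be sketched as follows. This is a minimal illustration, not the GATE scorer: annotations are assumed to be `(start, end, type)` tuples matched exactly, and the function names and sample documents are hypothetical.

```python
from collections import Counter

def score(key, response):
    """Compare gold-standard and system annotation sets; return counts per type."""
    counts = {}
    for ann_type in {a[2] for a in key | response}:
        k = {a for a in key if a[2] == ann_type}
        r = {a for a in response if a[2] == ann_type}
        counts[ann_type] = Counter(
            correct=len(k & r), spurious=len(r - k), missing=len(k - r)
        )
    return counts

def score_corpus(docs):
    """docs maps a document id to (key, response); returns per-document and corpus counts."""
    per_doc = {doc_id: score(k, r) for doc_id, (k, r) in docs.items()}
    corpus = {}
    for counts in per_doc.values():
        for ann_type, c in counts.items():
            corpus.setdefault(ann_type, Counter()).update(c)
    return per_doc, corpus

def prf(c):
    """Precision, recall and F1 from correct/spurious/missing counts."""
    found = c["correct"] + c["spurious"]
    expected = c["correct"] + c["missing"]
    p = c["correct"] / found if found else 0.0
    r = c["correct"] / expected if expected else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Hypothetical two-document corpus: (key annotations, system response)
DOCS = {
    "doc1": ({(0, 5, "Protein"), (10, 14, "Gene")},
             {(0, 5, "Protein"), (20, 25, "Gene")}),
    "doc2": ({(3, 9, "Protein")},
             {(3, 9, "Protein"), (12, 18, "Protein")}),
}

per_doc, corpus = score_corpus(DOCS)
for ann_type, c in sorted(corpus.items()):
    print(ann_type, prf(c))
```

Summing the raw counts before computing the measures gives corpus-level (micro-averaged) figures; averaging the per-document scores instead would be an equally valid design choice.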

GATE AnnotationDiff Tool

Correct and incorrect instances attached to concepts

Evaluation of instances by source

Methods of evaluation
• Traditional IE is evaluated in terms of Precision, Recall and F-measure
• These are not sufficient for ontology-based IE, because the distinction between right and wrong is less clear-cut
• Recognising a Person as a Location is clearly wrong, but recognising a Research Assistant as a Lecturer is less wrong
• Similarity metrics need to be integrated so that a wrong response scores higher the closer it lies to the correct concept in the hierarchy
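The Person/Location versus Research Assistant/Lecturer contrast can be made concrete with a toy taxonomy. The hierarchy and function names below are hypothetical, chosen only to illustrate the point; strict matching scores both errors as 0, while the depth of the most specific common ancestor tells them apart.

```python
# Hypothetical toy taxonomy, given as child -> parent (the root has no entry).
PARENT = {
    "Person": "Entity",
    "Location": "Entity",
    "Academic": "Person",
    "Lecturer": "Academic",
    "ResearchAssistant": "Academic",
}

def chain(concept):
    """Concepts from the root down to `concept`, inclusive."""
    path = [concept]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path[::-1]

def common_ancestor(a, b):
    """Most specific concept shared by the chains of a and b."""
    common = None
    for x, y in zip(chain(a), chain(b)):
        if x != y:
            break
        common = x
    return common

def depth(concept):
    """Number of edges from the root to `concept`."""
    return len(chain(concept)) - 1

def strict_score(key, response):
    """Traditional all-or-nothing match."""
    return 1 if key == response else 0

for key, response in [("Person", "Location"), ("ResearchAssistant", "Lecturer")]:
    shared = common_ancestor(key, response)
    print(key, response, strict_score(key, response), shared, depth(shared))
```

In this toy hierarchy both pairs are two edges apart, so plain path length would not separate them either; it is the depth of the shared ancestor (0 versus 2) that captures the intuition.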

Learning Accuracy
• Learning Accuracy (LA) [Hahn98] was originally defined to measure how well a concept had been added at the right level of the ontology, i.e. ontology generation
• Later used to measure how well an instance has been added in the right place in the ontology, i.e. ontology population
• The main drawback is that it considers only the height of the Response concept, not the height of the Key concept
• This also means that similarity is not bidirectional, which is intuitively wrong
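The asymmetry can be seen in a small sketch. The slides do not give the LA formula, so the code assumes the commonly quoted simplified form LA = CP / (FP + DP), where CP is the length from the root to the most specific common ancestor, FP the length from the root to the response, and DP the length from that ancestor to the response. Counting lengths in nodes (root = 1) and the toy taxonomy are further assumptions; Hahn's original definition distinguishes more cases.

```python
# Hypothetical toy taxonomy, given as child -> parent (the root has no entry).
PARENT = {
    "Person": "Entity",
    "Location": "Entity",
    "Academic": "Person",
    "Lecturer": "Academic",
    "ResearchAssistant": "Academic",
}

def chain(concept):
    """Concepts from the root down to `concept`, inclusive."""
    path = [concept]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path[::-1]

def learning_accuracy(key, response):
    """Assumed simplified LA = CP / (FP + DP), with lengths counted in nodes."""
    if key == response:
        return 1.0
    key_chain, resp_chain = chain(key), chain(response)
    cp = 0  # nodes shared from the root down to the common ancestor
    for k, r in zip(key_chain, resp_chain):
        if k != r:
            break
        cp += 1
    fp = len(resp_chain)  # nodes from the root to the response
    dp = fp - cp          # steps from the common ancestor down to the response
    return cp / (fp + dp)

print(learning_accuracy("Academic", "Lecturer"))  # response too specific
print(learning_accuracy("Lecturer", "Academic"))  # response too general
```

Under these assumptions only the response's position enters the denominator, so swapping key and response changes the score: a response one level too specific is penalised, while a response one level too general is not.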

Balanced Distance Metric
• We propose the Balanced Distance Metric (BDM) as an improvement over LA
• Considers the relative specificity of the taxonomic positions of the key and the response
• Is independent of the direction of this relative specificity: the key can be a specific concept (e.g. 'car') and the response a general concept (e.g. 'relation'), or vice versa
• Distances are normalised with respect to the average length of chain
• Makes the penalty in terms of node traversal relative to the semantic density of the concepts in question

BDM: the metric
• BDM is calculated for all correct and partially correct responses
• CP = distance from root to MSCA
• DPK = distance from MSCA to Key
• DPR = distance from MSCA to Response
• n1 = average length of the set of chains containing the key or the response concept, computed from the root concept
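The formula itself is not present in the slide text. As a hedged sketch, the code below assumes the form BDM = (CP/n1) / (CP/n1 + DPK/n2 + DPR/n3), in which n2 and n3 are normalising chain lengths for the key and the response. Neither the formula nor n2 and n3 appear in the text above, so treat them, the function name and the sample values as assumptions.

```python
def bdm(cp, dpk, dpr, n1, n2, n3):
    """Assumed BDM form: (CP/n1) / (CP/n1 + DPK/n2 + DPR/n3).

    cp, dpk, dpr are the distances defined on the slide; n1, n2, n3 are
    average chain lengths used for normalisation (n2 and n3 are assumed).
    """
    if min(n1, n2, n3) <= 0:
        raise ValueError("normalising chain lengths must be positive")
    if dpk == 0 and dpr == 0:
        return 1.0  # key and response are the same concept
    common = cp / n1
    return common / (common + dpk / n2 + dpr / n3)

# Hypothetical values: the shared ancestor sits two levels below the root,
# and key and response are each one step below it.
print(bdm(cp=2, dpk=1, dpr=1, n1=4, n2=4, n3=4))
```

In this form the key and the response contribute to the denominator in the same way, so swapping them (together with their normalisers) leaves the score unchanged, and a pair whose only common ancestor is the root scores 0.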

Augmented Precision and Recall
BDM is integrated with traditional Precision and Recall in the following way:
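The slide's formula is not present in the text. One plausible integration, assumed here rather than taken from the slide, sums the BDM scores of the n matched key-response pairs and divides by n plus the number of spurious responses (augmented precision) or by n plus the number of missing keys (augmented recall). The function name and figures are hypothetical.

```python
def augmented_precision_recall(bdm_scores, spurious, missing):
    """Assumed form: AP = sum(BDM_i) / (n + spurious), AR = sum(BDM_i) / (n + missing).

    bdm_scores holds one BDM value per matched key-response pair, so n is
    its length; spurious and missing are plain counts.
    """
    n = len(bdm_scores)
    total = sum(bdm_scores)
    ap = total / (n + spurious) if n + spurious else 0.0
    ar = total / (n + missing) if n + missing else 0.0
    return ap, ar

# Three matched pairs (one only partially correct), one spurious, two missing.
print(augmented_precision_recall([1.0, 0.5, 1.0], spurious=1, missing=2))
```

When every matched pair has a BDM of 1, the sum equals the number of correct responses and both measures reduce to traditional Precision and Recall.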

Conclusions
• Semantic annotation evaluation requires:
  • New metrics
  • Usability evaluation
  • Visualisation software
• The bioinformatics field is a good test bed, e.g. evaluation of protein name taggers
• Implementation in GATE
• Knowledge Web benchmarking suite for evaluating ontologies and ontology-based tools

A final thought on evaluation
“We didn’t underperform. You overexpected.”
