Towards Mutual Understanding: Ontologies, Ontology Matching, and their Applications

Jingshan Huang
Assistant Professor, School of Computer and Information Sciences, University of South Alabama
http://cis.usouthal.edu/~huang/
CIS Department @ UO, Eugene, OR
May 21, 2010

Presentation Outline
• Research Motivation
• Learning-Based Ontology Matching – SOCCER
• Ongoing Research
• Summary

Research Motivation – Overview
• Information from heterogeneous sources has different semantics, e.g., “Long” (English) vs. “Long” (Chinese Pinyin) -> 龙 (dragon)
• Integrating information from heterogeneous sources must make use of all available clues, including syntax, semantics, context, and pragmatics
• Ontologies are a formal model to encode semantics
• Ontological techniques are critical in semantic integration

Quick Facts
What is an Ontology?
• a computational model of some domain of the world
• describes the semantics of the terms used in the domain
• often captured in the form of a DAG (directed acyclic graph)
• a finite set of concepts + properties + relationships
What is Ontology Heterogeneity?
• an inherent characteristic of ontologies developed by different parties for the same (or similar) domains
• heterogeneous semantics may occur in different ways: (1) different terms could be used for the same concept; (2) an identical term could be adopted for different concepts; (3) properties and relationships could be different
• “translation” is far from good enough, not even close…
What is Ontology Matching?
• a.k.a. “Ontology Alignment” or “Ontology Mapping”
• the process of determining correspondences between concepts from heterogeneous ontologies
• involves many different relationships, e.g., equivalentWith, subClassOf, superClassOf, and siblings

Heterogeneity in Ontologies – A Simple Example
• Formal definition of ontologies
  • A knowledge representation model of some portion of the world
  • It reflects its designers’ conceptual views
• Ontology = Concepts + Relationships + Constraints
• Concept – a category, e.g., “President”
• Property – maps between concepts and data types, e.g., “gender” of “President”
• Relationship – maps between concepts, e.g., “President” is a subClassOf “People”
• Constraint – on properties or relationships, e.g., “gender”: range = “male”
• Concept semantics: name + properties + relationships
[Figure: concepts “President” and “Person” with property “sex” (female or male)]
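The "concept semantics: name + properties + relationships" view above can be sketched as a small data structure. This is only an illustration: the class name `Concept`, its field names, and the helper `semantics` are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Concept:
    """A concept described by its name, properties, and relationships."""
    name: str
    # Property name -> constraint on its range (hypothetical encoding).
    properties: Dict[str, str] = field(default_factory=dict)
    # Names of direct superclasses (subClassOf relationships).
    superclasses: List[str] = field(default_factory=list)


def semantics(concept: Concept) -> Tuple[str, List[str], List[str]]:
    """Return the three aspects that carry a concept's semantics."""
    return (concept.name, sorted(concept.properties), list(concept.superclasses))


# The slide's example: "President" is a subClassOf "People",
# with property "gender" constrained to the range "male".
people = Concept("People", {"gender": "female or male"})
president = Concept("President", {"gender": "male"}, ["People"])
```

Two ontologies that model the same domain may disagree on any of the three aspects, which is why a matcher has to compare all of them.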

Heterogeneity in Ontologies – Running Example The Semantic Web

Heterogeneity in Ontologies – Running Example (cont.)
• Typing “professor university” into Swoogle returns 129 different results
• All are created and maintained by ontology professionals

Heterogeneity in Ontologies – Running Example (cont.)

Research Motivation – Summary
• Semantic integration is important in Computer Science and Information Technology
• Ontologies are the foundation for semantic integration; at the same time, they are inherently heterogeneous
• The only way out: match/align ontologies so that their different semantics can be understood
• Ontology matching is far from solved, despite its importance and the number of researchers who have investigated it

Classification of Current Algorithms
• Rule-Based Matching
  • Consider schema information alone
  • Specify a set of rules
  • Apply them to schema information
• Learning-Based Matching
  • Consider both schema and instances
  • Apply different machine learning techniques

Pros and Cons for Current Approaches
• Rule-Based Matching
  • Is relatively fast (pro)
  • Ignores instance information (con)
  • Uses ad hoc predefined weights (con) for concept semantics: name + properties + relationships
• Learning-Based Matching
  • Obtains extra clues from instances (pro)
  • Runs longer (con)
  • Has difficulty in getting sufficient instances (con)

SOCCER (Similar Ontology Concept ClustERing) – a learning-based algorithm Challenges and main idea Details Evaluation

Problems with Existing Matching Algorithms
• Rule-Based Matching
  • Ignores instance information (con)
  • Requires ad hoc predefined weights (con)
• Learning-Based Matching
  • Runs longer (con)
  • Has difficulty in getting sufficient instances (con)
Try to:
• Adopt machine learning techniques to avoid ad hoc predefined weights
• Base learning on schema information alone to avoid the difficulty in getting sufficient instances
The goal: to find equivalent concept pairs among different ontologies, which is the first and most critical step in semantic integration

Challenges
• It is very difficult for machines to learn how to match ontology schemas when given schema information alone
  • Diversities in terminology
  • Diversities in relationships
• Current learning-based algorithms make use of instances, more or less
• Anecdotally, instances usually have much less variety than schemas

Main Idea of SOCCER Equivalent concepts from different ontologies tend to stay “closer” to each other in a clustering space with structural dimensions Each cluster contains a number of concepts that are from different ontologies and are equivalent to each other SOCCER aims at finding such clusters by exploiting ontology schemas alone

Details – Overview
• Build a three-dimensional vector for each concept, corresponding to name, properties, and relationships
• Calculate the similarity between pairwise concepts
• Apply an agglomerative algorithm to generate clusters
Therefore, SOCCER has two phases:
• Phase I – weight learning
• Phase II – clustering

SOCCER Phase I – learn weights (1)
Learning problem’s formal description
• Task T: match two ontologies
• Performance measure P: Precision, Recall, F-Measure, and Overall with regard to manual matching
• Training experience E: a set of equivalent concept pairs by manual matching
• Target function V: a pair of concepts
• Target function representation:

SOCCER Phase I – learn weights (2)
• Hypothesis space: weight vector (w1, w2, w3)
• Learning objective: find the weight vector that best fits the training examples
• Training rule: delta rule
• Searching strategy: minimize the training error

SOCCER Phase I – learn weights (3)
• Similarity in concept names
  • d: edit distance between two strings
  • l: length of the longer string
• Similarity in concept properties
  • n: number of pairs of matched properties
  • m: smaller cardinality of lists p1 and p2
• Similarity in concept relationships (super/subClassOf)
  • calculate the similarity values for pairwise concepts in ancestor lists and choose the maximum value
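The three per-aspect similarities can be sketched as follows. The slide's formulas did not survive extraction, so the forms used here are assumptions built from the listed symbols: name similarity as 1 − d/l, and property similarity as n/m. Counting a property pair as "matched" when the names are equal ignoring case is also an assumption, as are all function names.

```python
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with one rolling row of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # substitution
        prev = curr
    return prev[-1]


def name_similarity(a: str, b: str) -> float:
    """Assumed form: 1 - d / l, with l the length of the longer string."""
    longer = max(len(a), len(b))
    if longer == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longer


def property_similarity(p1: List[str], p2: List[str]) -> float:
    """Assumed form: n / m, with m the smaller list cardinality."""
    m = min(len(p1), len(p2))
    if m == 0:
        return 0.0
    # Assumption: two properties match when their names are equal ignoring case.
    n = len({p.lower() for p in p1} & {p.lower() for p in p2})
    return n / m


def relationship_similarity(ancestors1: List[str], ancestors2: List[str]) -> float:
    """Maximum pairwise name similarity over the two ancestor lists."""
    return max((name_similarity(a, b) for a in ancestors1 for b in ancestors2),
               default=0.0)
```

For instance, `name_similarity("abc", "abd")` is 1 − 1/3 under this assumed form, and two ancestor lists that share an identically named ancestor get a relationship similarity of 1.0.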

SOCCER Phase I – learn weights (4)
• Overall similarity
• Create a matrix M between O1 and O2 (n1 x n2)
• cell[i, j] stores the similarity between the ith concept in O1 and the jth concept in O2
• wi’s are randomly initialized, and then updated by the learning process
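A minimal sketch of building the matrix M, assuming the overall similarity is the weighted sum w1·s1 + w2·s2 + w3·s3 (the slide's formula is missing, so this form is an assumption that fits the weight vector (w1, w2, w3)). The function name and the list-of-lists layout are hypothetical.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def overall_similarity_matrix(s_name: Matrix, s_prop: Matrix, s_rel: Matrix,
                              weights: Sequence[float]) -> Matrix:
    """Combine three n1 x n2 per-aspect matrices into one overall matrix.

    cell[i][j] is the (assumed) weighted sum of the name, property, and
    relationship similarities between concept i of O1 and concept j of O2.
    """
    w1, w2, w3 = weights
    n1, n2 = len(s_name), len(s_name[0])
    return [[w1 * s_name[i][j] + w2 * s_prop[i][j] + w3 * s_rel[i][j]
             for j in range(n2)]
            for i in range(n1)]


# Toy 2 x 2 example with fixed (not learned) weights.
M = overall_similarity_matrix(
    [[1.0, 0.2], [0.1, 0.8]],   # name similarities
    [[0.5, 0.0], [0.0, 1.0]],   # property similarities
    [[1.0, 0.0], [0.0, 0.5]],   # relationship similarities
    (0.5, 0.3, 0.2),
)
```

In SOCCER the weights would come from the learning phase rather than being fixed by hand as in this toy call.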

SOCCER Phase I – learn weights (5)
• Training error
• Weight update rule
where
• D: training example set
• tr: maximum value for row i
• tc: maximum value for column j
• od: network output for a specific training example d
• η: the learning rate
• sid: the si value for d
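The delta rule named in Phase I can be sketched in its standard batch form: training error E = ½ Σ (t_d − o_d)² over the examples in D, and update w_i ← w_i + η Σ (t_d − o_d)·s_id. The slide's own formulas are missing, so this standard form is an assumption. In particular, the sketch takes each target t_d as given and does not reproduce how the talk derives it from the row and column maxima tr and tc. All names are hypothetical.

```python
from typing import List, Sequence, Tuple

# One training example: (s1, s2, s3) similarity vector and a target value.
Example = Tuple[Sequence[float], float]


def output(weights: Sequence[float], s: Sequence[float]) -> float:
    """Linear unit output o_d = sum_i w_i * s_id."""
    return sum(w * si for w, si in zip(weights, s))


def training_error(examples: List[Example], weights: Sequence[float]) -> float:
    """E = 1/2 * sum over d of (t_d - o_d)^2."""
    return 0.5 * sum((t - output(weights, s)) ** 2 for s, t in examples)


def train_weights(examples: List[Example], weights: Sequence[float],
                  eta: float = 0.05, epochs: int = 200) -> List[float]:
    """Batch delta rule: w_i <- w_i + eta * sum_d (t_d - o_d) * s_id."""
    w = list(weights)
    for _ in range(epochs):
        delta = [0.0] * len(w)
        for s, t in examples:
            err = t - output(w, s)
            for i, si in enumerate(s):
                delta[i] += err * si
        w = [wi + eta * di for wi, di in zip(w, delta)]
    return w


# Toy data; the targets are made up for illustration only.
examples = [((1.0, 0.5, 0.0), 0.8), ((0.2, 1.0, 0.5), 0.6), ((0.0, 0.3, 1.0), 0.4)]
initial = [0.3, 0.3, 0.3]   # the talk initializes weights randomly
learned = train_weights(examples, initial)
```

The initial weights are passed in explicitly so that the example is deterministic; a random initialization, as on the slide, would only change the starting point.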

SOCCER Phase II – clustering (1)
• Apply the learned weights to recalculate similarity matrices for pairwise ontologies
• Cluster similar concepts among a set of ontologies
Input: A set of ontologies and the corresponding matrices
1. Each concept forms a singleton cluster
2. Find two clusters, (a) and (b), with maximum similarity
3. If s[(a), (b)] > threshold, go to step 4; else go to step 7
4. Merge (a) and (b) into (a, b)
5. Update matrix: s[(a, b), (c)] = (s[(a), (c)] + s[(b), (c)])/2
6. Repeat steps 2 and 3
7. Output current clusters
The key is then to determine the threshold
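The clustering steps above can be sketched as a short agglomerative loop. The input layout (a dictionary keyed by unordered concept pairs, with missing pairs treated as similarity 0.0), the concept labels, and the function name are assumptions made for illustration; the merge test and the averaging update follow the listed steps.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

Cluster = Tuple[str, ...]


def cluster_concepts(concepts: List[str],
                     sim: Dict[FrozenSet[str], float],
                     threshold: float) -> List[Cluster]:
    """Agglomerative clustering with s[(a,b),(c)] = (s[(a),(c)] + s[(b),(c)]) / 2."""
    # Step 1: each concept forms a singleton cluster.
    clusters: List[Cluster] = [(c,) for c in concepts]
    s = {frozenset(((a,), (b,))): sim.get(frozenset((a, b)), 0.0)
         for a, b in combinations(concepts, 2)}
    while len(clusters) > 1:
        # Step 2: find the two clusters with maximum similarity.
        a, b = max(combinations(clusters, 2), key=lambda pair: s[frozenset(pair)])
        # Step 3: stop when the best pair does not exceed the threshold.
        if s[frozenset((a, b))] <= threshold:
            break
        # Step 4: merge (a) and (b) into (a, b).
        merged = a + b
        clusters = [c for c in clusters if c not in (a, b)]
        # Step 5: update similarities between the merged cluster and the rest.
        for c in clusters:
            s[frozenset((merged, c))] = (s[frozenset((a, c))] + s[frozenset((b, c))]) / 2
        clusters.append(merged)
    # Step 7: output the current clusters.
    return clusters


# Toy input: concept labels and similarity values are made up.
concepts = ["O1:Professor", "O2:Prof", "O1:Student", "O2:Pupil"]
sim = {
    frozenset(("O1:Professor", "O2:Prof")): 0.9,
    frozenset(("O1:Student", "O2:Pupil")): 0.7,
}
result = cluster_concepts(concepts, sim, threshold=0.5)
```

With this toy input the two cross-ontology pairs are merged and the loop then stops, because the remaining similarity between the two merged clusters is below the threshold. Concept labels must be unique for the pair keys to work.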

SOCCER Phase II – clustering (2) • Let the number of concepts in Oi be ni (i in [1, k]) • WLOG, suppose n1 is the largest one in ni’s • Total number of clusters should be in [ ]

Evaluation Strategy
• The hypothesis: a set of clusters exists across different ontologies
• Need to show:
  • Weight learning is correct
  • Resultant clusters are meaningful

Evaluation – test ontologies (1)
The test ontologies are eight independently developed, real-world ontologies:
• http://www.csd.abdn.ac.uk/~cmckenzi/playpen/rdf/akt_ontology_LITE.owl
• http://www.mindswap.org/2004/SSSW04/aktive-portal-ontology-latest.owl
• http://annotation.semanticweb.org/iswc/iswc.owl
• http://www.mondeca.com/owl/moses/ita.owl
• http://protege.stanford.edu/plugins/owl/owl-library/ka.owl
• http://ontoware.org/frs/download.php/18/semiport.owl
• http://www.mondeca.com/owl/moses/univ.owl
• http://reliant.teknowledge.com/DAML/Mid-level-ontology.owl

Evaluation – test ontologies (2) Characteristics of test ontologies

Evaluation – result (1) Weight convergence

Evaluation – result (2) Clustering result

Evaluation – Four Measures
• Precision p – percentage of correct predictions over all predictions
• Recall r – percentage of correct predictions over all correct matches
• F-Measure f (= 2pr/(p + r)) – a.k.a. Harmonic Mean, avoids the bias from adopting Precision or Recall alone
• Overall o (= r(2 − 1/p)) – Post-Match Effort, i.e., how much human effort is needed to remove false matches and add missed ones
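The four measures can be computed from a set of predicted matches and a set of manual (reference) matches. The sketch below uses the standard definitions, f = 2pr/(p + r) and o = r(2 − 1/p); the function name, the pair encoding, and the handling of empty inputs are assumptions.

```python
from typing import Dict, Set, Tuple

Pair = Tuple[str, str]


def four_measures(predicted: Set[Pair], reference: Set[Pair]) -> Dict[str, float]:
    """Precision, Recall, F-Measure, and Overall against manual matching."""
    correct = len(predicted & reference)
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(reference) if reference else 0.0
    f = 2 * p * r / (p + r) if p + r > 0 else 0.0
    # Overall can be negative: when p < 0.5, fixing the result costs more
    # effort than matching from scratch. It is reported as 0.0 when p == 0.
    o = r * (2 - 1 / p) if p > 0 else 0.0
    return {"precision": p, "recall": r, "f_measure": f, "overall": o}


# Toy example: 4 predictions, 3 reference matches, 2 of them in common.
predicted = {("a", "x"), ("b", "y"), ("c", "z"), ("d", "w")}
reference = {("a", "x"), ("b", "y"), ("c", "q")}
scores = four_measures(predicted, reference)
```

In the toy example, precision is 2/4 and recall is 2/3, so the F-Measure is 4/7 and Overall is 0: the effort of removing two false matches and adding one missed match offsets the benefit of the two correct ones.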

Evaluation – result (3) Four measures

SOCCER Summary
SOCCER: a learning-based ontology matching algorithm, and the first one based on ontology schemas alone
Our contributions:
1. An ANN (artificial neural network) technique was integrated so that the weights for different semantic aspects can be learned instead of being specified by a human in advance
2. Moreover, the learning was carried out on the ontology schemas alone, which distinguishes SOCCER from most other learning-based algorithms

Ongoing Research: Bioinformatics/Medical Informatics (1)
• An abundance of medical/biological digital data promises a profound impact on both the quality and rate of discovery and innovation
• Health scientists worldwide are producing, accessing, analyzing, integrating, and storing massive amounts of digital medical data daily
• If we were able to effectively transfer and integrate data from all possible resources, we would gain:
  • A deeper understanding of all these data sets
  • Better exposed knowledge
  • Appropriate insights and actions that follow
• But… in many cases, the data users are not the data producers, and they thus face challenges in harnessing data in unforeseen/unplanned ways
• Fortunately, ontological techniques can help in this regard!

Ongoing Research: Bioinformatics/Medical Informatics (2)
• Ontological techniques have been widely applied to medical and biological research
• The most successful example is the Gene Ontology (GO) project
  • The GO’s aim: to standardize the representation of gene and gene product attributes across species and databases
  • Three ontologies in the GO: Cellular Component, Molecular Function, and Biological Process
  • The GO provides a controlled vocabulary of terms for describing gene product characteristics and gene product annotation data
  • It also provides tools to access and process such data
  • The focus of the GO is to describe how gene products behave in a cellular context
• Ontologies constructed under the auspices of the OBO (Open Biomedical Ontologies) group exhibit great variety
• Semantic integration becomes an indispensable step in biological and biomedical data mining

Ongoing Research: Bioinformatics/Medical Informatics (3) An Experiment in Bio Data Mining • The characteristics of many biomedical ontologies: i) a rich set of super/subClassOf relationships; ii) numeric strings adopted as concept names; and iii) little, if any, instance data • SOCCER suitably serves the goal of integrating semantics in computational biology

Ongoing Research: Digital Forensics (1)
• Challenges exist in Digital Forensics:
  • maintaining the integrity of evidence found by different parties (usually from distributed geographic areas, or even with cultural barriers)
  • interpreting the evidence accurately
  • drawing trustworthy conclusions thereafter
• Different parties are likely to adopt different formats and metadata for storing evidence contents, due to their specific needs
• Seamless communication among different parties, along with the knowledge sharing and reuse that follow, becomes a non-trivial problem

Ongoing Research: Digital Forensics (2)
• As a formal knowledge representation model, ontologies may help us handle the aforementioned challenges in Digital Forensics
• But… there is no central ontology large enough to include all concepts of interest to every individual criminal investigator
• Anyone can design ontologies according to his/her own conceptual view; ontological heterogeneity is thus an inherent feature
• That is, each party that needs a conceptual model will have to provide its own particular extensions, which are different from and incompatible with extensions added by other parties

Ongoing Research: Digital Forensics (3)
• An agreed-upon, global, and “all-in-one” ontology is not a feasible solution
• Different groups should maintain their own conceptual models, while utilizing ontological techniques to synthesize their data with others’ models
• This way, it is possible to effectively decouple the evidence semantics from its logical description and organization
• Digital Investigation Evidence Acquisition Model Based on Ontology Matching (DIEAOM), to facilitate:
  (1) knowledge collection from disparate, heterogeneous evidence sources
  (2) knowledge sharing and reuse
  (3) decision support for criminal investigators
• The DIEAOM aims to synthesize vast amounts of evidence from different parties by matching conceptual models
• Our goal is to benefit the current criminal investigation procedure with higher automation, enhanced effectiveness, and better knowledge sharing and reuse

Other Research Opportunities (1) Heterogeneous Knowledge Acquisition/Management Increasing growth in the scale, complexity, and diversity of data has been witnessed in recent years In addition, the data are often used in ways not envisioned by those who created them New techniques are thus needed to repurpose, transform, and integrate multiple and uncoordinated data sources; interoperability is the fundamental goal In order to better achieve interoperability among distributed knowledge sources, accurate and effective semantic integration is the first, critical step to handle the heterogeneity in data

Other Research Opportunities (2)
Component-Based Software Engineering
• Engineered software is decomposed into functional or logical components, with well-defined interfaces for communication across components
• Reusability is an important feature of a high-quality component
• A (semi)automated methodology is needed to annotate, discover, compose, and execute the software components
• Semantic integration techniques are important and fundamental in such automation processes

Other Research Opportunities (3) Semantics-Enriched Image Knowledge Bases Create image knowledge bases by using ontologies to semantically encode image features Semantic search allows users to make use of concept search, instead of traditional keyword search It also paves the way for more advanced search strategies Users can specialize or generalize a query with the help of a concept hierarchy Queries can be formed using information from ontologies

Summary
• Information from heterogeneous sources has different semantics, and semantic integration is necessary to make better use of all available information
• As a formal knowledge representation model, ontologies can help in this regard
• SOCCER, the first learning-based approach relying on schemas alone, was developed to tackle the ontology-matching problem, which is a critical component in semantic integration
• Ontological techniques can be applied to many areas to generate challenging interdisciplinary research topics

Thank you!!! • Suggestions? • Comments? • Questions?
