Ontology-Driven Semantic Matches Between Database Schemas

Dennis McLeod and Sangsoo Sung • Year: 2006 • Presenter: Munish Chopra

Overview • Introduction • Schema Matching • Types of Ontology driven Matching • Data Driven matching framework • Semantic Driven Matching frameworks • Different Experiments • Conclusion

Introduction: What is an Ontology? • An ontology is a collection of concepts and the interrelationships among them. • It can also be defined as a semantic graph, a semantic conceptual database schema, or a knowledge representation. • It is also called an explicit specification of a conceptualization.

Schema Matching • Schema matching is the task of identifying objects (attributes) in two schemas that are semantically related to each other. • It is performed by gathering mapping evidence from various aspects of the attributes. • Example: if the names, types, and sample instances of the attributes "phone number" and "telephone" in compatible tables are similar, the two attributes can be matched. • The result can be represented as a similarity matrix Mst, where schema s has attributes s1, s2, …, sn and schema t has attributes t1, t2, …, tn.
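As a rough illustration of the similarity matrix, the sketch below fills M with one score per pair of attributes. The scoring function is an assumption: it uses Python's `difflib.SequenceMatcher` on attribute names only, as a stand-in for the matchers discussed in this talk. The attribute names and the helper names `name_similarity` and `similarity_matrix` are hypothetical.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Stand-in score in [0, 1] based on character overlap of two names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def similarity_matrix(s_attrs, t_attrs, sim=name_similarity):
    """Return M where M[i][j] is the similarity of s_attrs[i] and t_attrs[j]."""
    return [[sim(s, t) for t in t_attrs] for s in s_attrs]


# Hypothetical attribute names for two schemas s and t.
s = ["phone_number", "price"]
t = ["telephone", "cost", "address"]
M = similarity_matrix(s, t)

for attr, row in zip(s, M):
    print(attr, [round(score, 2) for score in row])
```

A name-only score like this cannot recognize synonyms such as "price" and "cost", which is the kind of gap the ontology-based matching in the later slides is meant to close.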

Problems in Schema Matching • Schema matching has been difficult to automate. • In the past, a lot of research has tried to find matches between schemas by exploiting information on schema and data instances. • However, these approaches fell short because the schema and data instances could not fully capture the semantic information of the databases, so some attributes were matched to the wrong attributes. • A possible solution is a schema matching framework that supports identification of the correct matches by extracting semantics from ontologies.

Types of Mapping 1. Data Driven Mapping • Attribute names in schemas can be hard to understand and interpret (opaque), which makes matching difficult. • The data-driven matching framework works even in the presence of opaque attribute names because it relies on data values. • It combines searching for overlaps in data types and in the representation of data values, comparing patterns of data instances between schema attributes, and applying different learning techniques. • A mapping can then be found because similar attributes tend to share similar patterns or representations of data values and instances.

Types of Base Matchers • A. Pattern Based Matcher • The pattern-based matcher tries to find a common pattern in the instance values, such as fax/phone numbers or monetary units. • It determines the sequence of letters, symbols, and numbers that is most characteristic of the instances of an attribute. • Given an instance value, we transform each letter to "A", each symbol to "S", and each digit to "N". The matcher then compares patterns by calculating the edit distance between each pair of patterns. • The edit distance between two strings is the minimum number of operations needed to transform one string into the other, where an operation is an insertion, a deletion, or a substitution.

An Example • Consider the mobile number "(213)321-4321". It is transformed into "SNNNSNNNSNNNN", while "213-321-4321" is transformed into "NNNSNNNSNNNN"; the edit distance between the two patterns is 1. • Let ai and bj be instances of attributes s and t (1 ≤ i ≤ Na and 1 ≤ j ≤ Nb). Let EditDist(Ea, Eb) be the edit distance between the patterns of instances a and b. • Let gi be the number of instances of ai and hj be the number of instances of bj. We assume that ai and bj are sorted in descending order with respect to gi and hj. • The similarity between instance patterns is defined in terms of these quantities.
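The pattern transformation and the edit distance from the phone number example can be sketched as follows. This is a minimal sketch: the function names `to_pattern` and `edit_distance` are hypothetical, and treating every character that is neither a letter nor a digit (including whitespace) as a symbol is an assumption.

```python
def to_pattern(value: str) -> str:
    """Map each letter to 'A', each digit to 'N', and anything else to 'S'."""
    out = []
    for ch in value:
        if ch.isalpha():
            out.append("A")
        elif ch.isdigit():
            out.append("N")
        else:
            out.append("S")  # assumption: spaces count as symbols too
    return "".join(out)


def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions (Levenshtein)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # delete ca
                           cur[j - 1] + 1,       # insert cb
                           prev[j - 1] + cost))  # substitute or keep
        prev = cur
    return prev[-1]


p1 = to_pattern("(213)321-4321")   # "SNNNSNNNSNNNN"
p2 = to_pattern("213-321-4321")    # "NNNSNNNSNNNN"
print(p1, p2, edit_distance(p1, p2))  # distance is 1
```

Deleting the leading "S" of the first pattern yields the second pattern, which is why a single operation suffices.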

B. Attribute Based Matcher • The attribute-based matcher tries to find common properties of attributes by comparing various aspects of an attribute, such as its name and domain information. • It is basically used to map attributes by comparing attribute names and their types. • SIMatrdat(si, tj) denotes the prediction produced by the attribute-based matcher.

PROBLEMS • Matching ambiguity resolution • It can identify actual mappings although they are ambiguous • Providing candidates that refer to a similar or same object • It also provides matching candidates even if the data driven framework fails to select the candidates

AN EXAMPLE

2. Semantics Driven Mapping Framework • It tries to identify the attribute with the most similar semantics in the other schema when the attribute names are not opaque. • The name of an attribute consists of a word or a compound of words that carry the semantics of the attribute. • Thus, the semantic similarity between si and tj can be measured by finding how many words in the two attribute names are semantically alike, which in turn is determined by measuring the semantic similarity between words.

Semantic Similarity • Semantic similarity measures the relatedness between two words based on the information contained in an ontology hierarchy. • Two types of interrelations: • A child concept may be an instance of a parent concept (is-a relationship) • A child concept may be a component of a parent concept (part-of relationship) • Example: the WordNet database, which is well suited for similarity measures because it organizes nouns and verbs into hierarchies of is-a or part-of relations

Explanation • Let w denote a word of an attribute. • Information content: IC(w) = −log(p(w)) • p(w) is the probability that the word w occurs • Frequencies of words can be estimated by counting the number of occurrences in the corpus. • freq(w) is computed over Cw, the set of concepts subsumed by the word w • The concept probability for w is defined as follows: • p(w) = freq(w)/N, where N is the total number of words observed
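The information content definition can be tried out on a toy hierarchy. In this sketch, freq(w) is taken to be the sum of the corpus counts of w and of every concept it subsumes; that reading is an assumption in the style of Resnik's measure, not something the slides spell out. The hierarchy, the counts, and all function names are hypothetical.

```python
import math

# Hypothetical is-a hierarchy (parent -> children) and corpus counts.
children = {
    "entity": ["person", "location"],
    "person": ["agent", "buyer"],
}
counts = {"entity": 2, "person": 3, "location": 5, "agent": 6, "buyer": 4}


def subsumed(concept, children):
    """Return the concept together with all of its descendants."""
    seen, stack = set(), [concept]
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(children.get(c, []))
    return seen


def freq(concept, children, counts):
    """Assumed definition: total count over the concept and what it subsumes."""
    return sum(counts.get(c, 0) for c in subsumed(concept, children))


def information_content(concept, children, counts):
    """IC(w) = -log(p(w)) with p(w) = freq(w) / N."""
    n = sum(counts.values())
    p = freq(concept, children, counts) / n
    return math.inf if p == 0 else -math.log(p)


for w in ("entity", "person", "agent"):
    print(w, round(information_content(w, children, counts), 3))
```

Under this reading the root concept has probability 1 and therefore zero information content, while more specific concepts lower in the hierarchy score higher.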

Example

Compound Word Processing • Our approach gives more importance to matching the head word, which is the word that carries the main meaning of the compound. • For example, in the compound "agent name", the head word is the rightmost word, "name" [13,14]. • The approach works by decomposing the compound into atomic words and predicting the similarity between each word and the attributes in the other schema. • Decomposition raises several problems: words can appear in different formats, which makes it difficult to recognize the individual words in a compound, and compounds can contain unnecessary words.

Solutions 1. Tokenization: a process that identifies the boundaries of words; tokens that bear no content (slash, comma, parenthesis, underscore, dash, uppercase boundary, etc.) are skipped in the matching phase. Example: "agent name" can appear in many forms, such as agent_name or AgentName; tokenization finds the individual words "agent" and "name". 2. Stopword Removal: stopwords are words that occur frequently in attribute names but do not carry useful information (e.g., "of"). We eliminate these stopwords from the vocabulary list, which gives us more flexibility in matching.
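Both steps can be sketched with a regular expression and a small stopword set. This is only an illustration: the regular expression, the stopword list, and the function name `tokenize` are assumptions, not the authors' implementation.

```python
import re

# Hypothetical stopword list; a real system would use a larger one.
STOPWORDS = {"of", "the", "a", "an", "for", "in"}

# Matches runs of capitals (acronyms), capitalized or lowercase words, and digits.
# Separators such as "_", "-", "/", "," and parentheses never match, so they are skipped.
TOKEN_RE = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")


def tokenize(attribute_name: str, stopwords=STOPWORDS):
    """Split an attribute name into lowercase words and drop stopwords."""
    tokens = [t.lower() for t in TOKEN_RE.findall(attribute_name)]
    return [t for t in tokens if t not in stopwords]


print(tokenize("agent_name"))     # ['agent', 'name']
print(tokenize("AgentName"))      # ['agent', 'name']
print(tokenize("date_of_birth"))  # ['date', 'birth']
```

With the rightmost-head convention from the compound word slide, the last token in each list ("name", "birth") would be the head word.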

Similarities Regression • Using machine learning tools, we can combine the predicted similarities. • Each similarity can have a different significance in the combined result, so a weight is assigned to each similarity. • In order to improve the predictions of the different mapping frameworks, parameter optimization is performed by cross-validating on the training data with logistic regression. • Example: the similarity between s and t is given by a weighted combination of the individual predictions.
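One common way to realize such a weighted combination is the logistic function applied to a weighted sum, which is the model form that logistic regression fits. The sketch below assumes that form; the weights, the bias, and the function name `combine` are hypothetical values chosen for illustration, not results from the paper.

```python
import math


def combine(similarities, weights, bias=0.0):
    """Logistic combination: sigmoid(bias + sum of weight_k * similarity_k)."""
    z = bias + sum(w * s for w, s in zip(weights, similarities))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical predictions for one attribute pair (s_i, t_j):
# pattern-based, attribute-based, and semantics-driven similarity.
predictions = [0.8, 0.4, 0.9]

# Hypothetical weights; in practice they would be learned on training data.
weights = [2.0, 1.0, 3.0]
bias = -3.0

print(round(combine(predictions, weights, bias), 3))
```

A larger weight lets a more reliable matcher dominate the combined score, and the sigmoid keeps the result between 0 and 1 so it can be read as a match probability.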

Experiments • Test Datasets • We use two real estate domain datasets that contain information about houses for sale. • Experimental Procedure • We evaluate our technique by training the system on the real estate 1 domain. • Training is carried out by performing cross validation ten times to obtain the weights used to combine the predictions of the different frameworks.

Experiment Results • This experiment aimed at determining the relative contribution of using ontologies to identify the semantics of the attributes. • As the results figure shows, we obtain 7.5% and 19.7% higher accuracy than the complete LSD system on the two domains.

CONCLUSION • In this paper, we studied different techniques that use semantic similarity from ontologies to identify correspondences between database schemas. • Future work includes applying the mapping techniques to the seismology domain, whose data is distributed, organized in different ways, and obtained from several earthquake information providers.

Thank You
