Entity Disambiguation By Angela Maduko Directed by Amit Sheth
Entity Disambiguation Problem • Emerges mainly when merging information from different sources • Two major levels • 1. Schema/ontology level: determining the similarity of attributes/concepts/classes from the different schemas/ontologies to be merged • 2. Instance level: determining which instances of concepts/classes (or tuples in relational databases) refer to the same entity
Current approaches for both levels • Feature-based Similarity Approach (FSA) • Set-Theory Similarity Approach (STA) • Information-Theory Similarity Approach (ITA) • Hybrid Approach (HA) • Relationship-based Similarity Approach (RSA) • Hybrid Similarity Approach (HSA)
ITA • In [1], Lin presents a measure for the similarity between two concepts based on both their commonalities and differences • Intuition 1: The similarity between A and B is related to their commonality. The more commonality they share, the more similar they are. • Intuition 2: The similarity between A and B is related to the differences between them. The more differences they have, the less similar they are. • Intuition 3: The maximum similarity between A and B is reached when A and B are identical, no matter how much commonality they share.
ITA • Consider the concept Fruit • A is an Apple • B is an Orange • Commonality of A and B? • common(A, B) = Fruit(A) and Fruit(B) • The commonality between A and B is measured by I(common(A, B)), the amount of information contained in common(A, B) • The information content of a proposition S is I(S) = -log P(S)
ITA • The difference between A and B is measured by I(description(A, B)) - I(common(A, B)) • description(A, B) is a proposition that describes what A and B are • Can be applied at both levels 1 and 2 • Intuitively, sim(A, B) = 1 when A and B are exactly alike and 0 when they share no commonalities • Proposes $\mathrm{sim}(A, B) = \dfrac{\log P(\mathrm{common}(A, B))}{\log P(\mathrm{description}(A, B))}$
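A minimal sketch of the measure from [1], taken as the ratio I(common(A, B)) / I(description(A, B)) with I(S) = -log P(S). The probabilities passed in below are hypothetical values chosen only to exercise the two boundary cases named on the slide; they do not come from the source.

```python
import math

def information(p):
    """Information content I(S) = -log P(S), in nats."""
    return -math.log(p)

def lin_similarity(p_common, p_description):
    """Shared information divided by total descriptive information."""
    i_description = information(p_description)
    if i_description == 0:
        raise ValueError("description(A, B) carries no information")
    return information(p_common) / i_description

# Hypothetical probabilities, for illustration only.
identical = lin_similarity(0.01, 0.01)       # common == description, so sim is 1
nothing_shared = lin_similarity(1.0, 0.01)   # common is trivially true, so sim is 0
partial = lin_similarity(0.1, 0.01)          # about 0.5
```

The measure reaches 1 only when the common proposition says everything the full description says, which matches Intuition 3.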
ITA • In [2], Resnik measures the similarity between two concepts in an is-a taxonomy based on the information content of their most specific common super-concept • Define P(c) as the probability of encountering an instance of a concept c in the taxonomy • For any two concepts c1 and c2, define S(c1, c2) as the set of concepts that subsume both c1 and c2 • Proposes $\mathrm{sim}(c_1, c_2) = \max_{c \in S(c_1, c_2)} \left[ -\log P(c) \right]$
ITA • [Figure: example taxonomy over the concepts X, Y, Z with the nodes A, B, C, D, E, F] • 100 instances of concept X • 4 instances of concept Y • 200 instances of concept Z • 2000 instances of all concepts • Compare sim(A, B), sim(C, D), sim(A, D) and sim(A, E) • sim(C, D) > sim(A, B). Should this be so?
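The comparison above can be checked numerically. The sketch below uses the instance counts from the slide and the measure from [2], the maximum of -log P(c) over the common subsumers. The taxonomy figure is not available, so the subsumer sets are an assumption: X is taken as the most specific common super-concept of A and B, Y as that of C and D, and a hypothetical `ROOT` concept covers all 2000 instances.

```python
import math

TOTAL_INSTANCES = 2000
# Counts from the slide; ROOT is a hypothetical top concept covering everything.
INSTANCE_COUNTS = {"X": 100, "Y": 4, "Z": 200, "ROOT": TOTAL_INSTANCES}

def information_content(concept):
    """-log P(c), with P(c) estimated as the concept's share of all instances."""
    return -math.log(INSTANCE_COUNTS[concept] / TOTAL_INSTANCES)

def resnik_similarity(common_subsumers):
    """Information content of the most informative common subsumer."""
    return max(information_content(c) for c in common_subsumers)

# Assumed subsumer sets (the original figure is not recoverable).
sim_a_b = resnik_similarity({"X", "ROOT"})   # -log(100/2000), about 3.00 nats
sim_c_d = resnik_similarity({"Y", "ROOT"})   # -log(4/2000), about 6.21 nats
```

Under these assumptions sim(C, D) exceeds sim(A, B) purely because Y is rarer than X, which is the behaviour the slide questions.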
ITA • Define s(w) as the set of concepts that are word senses of word w • Proposes a measure for word similarity: $\mathrm{sim}(w_1, w_2) = \max_{c_1 \in s(w_1),\, c_2 \in s(w_2)} \mathrm{sim}(c_1, c_2)$ • Can be applied at level 1 only • Example: Doctor (medical and PhD) • Nurse (medical and nanny) • sim(Doctor, Nurse)
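A short sketch of the word-level measure, taking the maximum concept similarity over all pairs of senses. The sense names and the scores in `CONCEPT_SIM` are hypothetical placeholders, not values from [2] or from the slides.

```python
from itertools import product

# Hypothetical senses for the slide's Doctor/Nurse example.
SENSES = {
    "doctor": ["doctor_medical", "doctor_phd"],
    "nurse": ["nurse_medical", "nurse_nanny"],
}

# Hypothetical concept-level similarity scores, for illustration only.
CONCEPT_SIM = {
    ("doctor_medical", "nurse_medical"): 8.0,
    ("doctor_medical", "nurse_nanny"): 2.0,
    ("doctor_phd", "nurse_medical"): 1.5,
    ("doctor_phd", "nurse_nanny"): 1.0,
}

def word_similarity(w1, w2):
    """Best concept similarity over every pairing of the two words' senses."""
    return max(CONCEPT_SIM[(c1, c2)] for c1, c2 in product(SENSES[w1], SENSES[w2]))

best = word_similarity("doctor", "nurse")
```

With these placeholder scores the medical senses dominate, so the word-level score is driven by the closest pair of senses rather than by an average.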
STA • In [3], Tversky introduces a set-theoretical notion of a matching function F, based on the following assumptions for classes a, b, c with description sets A, B, C respectively • Matching: $s(a, b) = F(A \cap B,\; A - B,\; B - A)$ • Monotonicity: $s(a, b) \ge s(a, c)$ whenever $A \cap B \supseteq A \cap C$, $A - B \subseteq A - C$ and $B - A \subseteq C - A$
STA • Proposes two models: • Contrast model: similarity is defined as • An increasing function of common features • A decreasing function of distinctive features (features that apply to one object but not the other) • $S(a, b) = \theta f(A \cap B) - \alpha f(A - B) - \beta f(B - A)$, with $\theta, \alpha, \beta \ge 0$ • Function f measures the salience of a set of features • f depends on intensity and context factors • Intensity: physical salience (e.g. physical features) • Context: salience of features varies with context
STA • Ratio model • $S(a, b) = \dfrac{f(A \cap B)}{f(A \cap B) + \alpha f(A - B) + \beta f(B - A)}$, with $\alpha, \beta \ge 0$ • Can be applied at both levels 1 and 2
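Both models from [3] can be sketched over Python sets. This assumes the usual forms, contrast = θ·f(A∩B) - α·f(A-B) - β·f(B-A) and ratio = f(A∩B) / (f(A∩B) + α·f(A-B) + β·f(B-A)). Using set size for the salience function f is a simplifying assumption, and the feature sets and default weights are hypothetical.

```python
def contrast_similarity(a, b, theta=1.0, alpha=0.5, beta=0.5, f=len):
    """Contrast model: rewards common features, penalises distinctive ones."""
    return theta * f(a & b) - alpha * f(a - b) - beta * f(b - a)

def ratio_similarity(a, b, alpha=0.5, beta=0.5, f=len):
    """Ratio model: common salience normalised into the range [0, 1]."""
    common = f(a & b)
    denominator = common + alpha * f(a - b) + beta * f(b - a)
    # Returning 0.0 for two empty descriptions is a convention chosen here.
    return common / denominator if denominator else 0.0

# Hypothetical feature sets for illustration.
apple = {"fruit", "round", "red", "sweet"}
orange = {"fruit", "round", "orange_coloured", "citrus"}

contrast_score = contrast_similarity(apple, orange)   # 2 - 0.5*2 - 0.5*2 = 0.0
ratio_score = ratio_similarity(apple, orange)         # 2 / (2 + 1 + 1) = 0.5
```

Setting α = β = 1 in the ratio model gives the Jaccard coefficient, and unequal α and β make the measure asymmetric.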
HA • In [7], Andritsos et al. combine clustering and information content approaches for entity disambiguation (the scaLable InforMation BOttleneck (LIMBO) method) • Attempts to cluster entities in such a way that the clusters are informative about the entities within them • Model: a set T of n entities (relational tuples), defined on m attributes (A1, A2, …, Am). The domain of attribute Ai is the set Vi = {Vi,1, Vi,2, …, Vi,di} • Let T and V be two discrete random variables that can take values from T and V respectively • Initially, assigns each entity to its own cluster, i.e. #clusters = #entities. Let Cq denote this initial clustering; then the mutual information of Cq and T, I(Cq, T), equals the mutual information of V and T, I(V, T)
HA • Assumes the number of distinct entities k is known • Seeks a clustering Ck of V such that I(Ck, T) remains as large as possible, i.e. the information loss I(V, T) - I(Ck, T) is minimal
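The information-loss criterion can be illustrated with a small sketch. It computes mutual information from a co-occurrence count matrix (rows are clusters, columns are attribute values) and measures the loss caused by merging two clusters. This is only the scoring step, not the LIMBO algorithm from [7], and the tiny count matrix is hypothetical.

```python
import math

def mutual_information(joint_counts):
    """Mutual information in bits between row and column variables."""
    total = sum(sum(row) for row in joint_counts)
    row_p = [sum(row) / total for row in joint_counts]
    col_p = [sum(row[j] for row in joint_counts) / total
             for j in range(len(joint_counts[0]))]
    mi = 0.0
    for i, row in enumerate(joint_counts):
        for j, count in enumerate(row):
            if count > 0:
                p = count / total
                mi += p * math.log2(p / (row_p[i] * col_p[j]))
    return mi

def merge_clusters(joint_counts, i, j):
    """Return a new matrix in which rows i and j are summed into one cluster."""
    merged = [x + y for x, y in zip(joint_counts[i], joint_counts[j])]
    kept = [row for k, row in enumerate(joint_counts) if k not in (i, j)]
    return kept + [merged]

def information_loss(joint_counts, i, j):
    """Mutual information lost by merging clusters i and j."""
    return mutual_information(joint_counts) - mutual_information(
        merge_clusters(joint_counts, i, j))

# Hypothetical data: three tuples over four attribute values.
# Tuples 0 and 1 use the same values; tuple 2 uses different ones.
counts = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]
loss_duplicates = information_loss(counts, 0, 1)   # merging look-alikes loses nothing
loss_distinct = information_loss(counts, 0, 2)     # merging unlike tuples loses information
```

An agglomerative procedure would repeatedly merge the pair with the smallest loss until k clusters remain, so tuples with identical value distributions are merged first.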
HSA • In [8], Kashyap and Sheth introduce the concept of semantic proximity (semPro) between entities to capture their similarity • In addition to context, it employs relationships and features of entities in determining their similarity • semPro(O1, O2) = <Context, Abstraction, (D1, D2), (S1, S2)> • Context: the context in which objects O1 and O2 are being compared • Abstraction: the abstraction/mappings relating the domains of the objects • (D1, D2): domain definitions of the objects • (S1, S2): states of the objects
HSA • Abstractions • Total 1-1 value mapping • Partial many-one mapping. • Generalization/specialization. • Aggregation. • Functional dependencies. • ANY • NONE
HSA • Semantic Taxonomy • Defines 5 degrees of similarity between objects • Semantic Equivalence • Semantic Relationship • Semantic Relevance • Semantic Resemblance • Semantic Incompatibility
HSA • Semantic Equivalence: strongest measure of semantic proximity • Two objects are said to be semantically equivalent when they represent the same real-world entity, i.e. • semPro(O1, O2) = <ALL, total 1-1 value mapping, (D1, D2), _> (domain semantic equivalence) • semPro(O1, O2) = <ALL, M, (D1, D2), (S1, S2)> where M is a total 1-1 value mapping between (D1, S1) and (D2, S2) (state semantic equivalence)
HSA • Semantic Relationship: weaker than semantic equivalence • semPro(O1, O2) = <ALL, M, (D1, D2), _> where M is a partial many-one value mapping, a generalization or an aggregation • The requirement of a 1-1 mapping is relaxed such that, given an instance of O1, we can identify an instance of O2, but not vice versa
HSA • Semantic Relevance: • Two objects are semantically relevant if there exists any mapping between their domains in some context • semPro(O1, O2) = <SOME, ANY, (D1, D2), _>
HSA • Semantic Resemblance: weakest measure of semantic proximity • There does not exist any mapping between their domains in any context • The objects have the same roles in some contexts with coherent definition contexts
HSA • Semantic Incompatibility • Asserts semantic dissimilarity. • Asserts that there is no context and no abstraction in which the domains of the two objects are related. • semPro(O1,O2) = <NONE, NONE, (D1,D2), _>
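The semPro tuple and the taxonomy can be sketched as a small data structure. The string labels, the `SemPro` class and the `degree` function are hypothetical names introduced for illustration. The classifier only encodes the four tuple patterns stated on the slides; semantic resemblance depends on role information that the tuple does not carry, so it is left unclassified here.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Abstractions that the slides associate with a semantic relationship.
RELATIONSHIP_ABSTRACTIONS = {"partial many-one mapping", "generalization", "aggregation"}

@dataclass(frozen=True)
class SemPro:
    context: str                               # "ALL", "SOME" or "NONE"
    abstraction: str                           # e.g. "total 1-1 value mapping", "NONE"
    domains: Tuple[str, str]                   # (D1, D2)
    states: Optional[Tuple[str, str]] = None   # (S1, S2), or None for "_"

def degree(sp: SemPro) -> str:
    """Map a semPro tuple to a degree of similarity (illustrative rules only)."""
    if sp.context == "NONE" and sp.abstraction == "NONE":
        return "semantic incompatibility"
    if sp.context == "ALL" and sp.abstraction == "total 1-1 value mapping":
        return "semantic equivalence"
    if sp.context == "ALL" and sp.abstraction in RELATIONSHIP_ABSTRACTIONS:
        return "semantic relationship"
    if sp.context == "SOME" and sp.abstraction != "NONE":
        return "semantic relevance"
    # Resemblance needs role information that is not part of the tuple.
    return "unclassified"

# Hypothetical domains for illustration.
same_entity = SemPro("ALL", "total 1-1 value mapping", ("EmployeeId", "StaffNo"))
broader = SemPro("ALL", "generalization", ("Student", "Person"))
unrelated = SemPro("NONE", "NONE", ("Salary", "Colour"))
```

The checks run from the most specific pattern to the most general, so a tuple is assigned the strongest degree whose pattern it matches.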
HSA • In [9], Cho et al. propose a model derived from the edge-based approach that also employs the information content of the node-based approach, based on these facts: • There exists a correlation between similarity and the number of shared parent concepts in a hierarchy • The link type (hyponymy, meronymy, etc.) carries the semantic relationship
HSA • Conceptual similarity between a node and its adjacent child node may not be equal across the hierarchy • As depth increases in the hierarchy, conceptual similarity between a node and its adjacent child node decreases • The population of nodes is not uniform over the entire ontological structure (a link in a dense part of the hierarchy represents less distance than one in a less dense part)
HSA • Proposes $S(c_i, c_j) = D(L_{j \to i}) \cdot \sum_{0 \le k \le n} \left[ W(t_k) \cdot d(c_{k+1 \to k}) \cdot f(d) \right] \cdot \max[H(c)]$, where • $f(d)$ is a function that returns a depth factor (topological location in the hierarchy) • $d(c_{k+1 \to k})$ is a density function • $D(L_{j \to i})$ is a function that returns a distance factor between $c_i$ and $c_j$ (shortest path from one node to the other) • $W(t_k)$ is a weight function that assigns a weight to each link type ($W(t_k) = 1$ for an is-a link) • $H(c)$ is the information content of the super-concepts of $c_i$ and $c_j$ • For level 1 only
References • [1] Dekang Lin, An Information-Theoretic Definition of Similarity, Proceedings of the Fifteenth International Conference on Machine Learning, pp. 296-304, 1998. • [2] Philip Resnik, Using Information Content to Evaluate Semantic Similarity in a Taxonomy, IJCAI, 1995. • [3] Amos Tversky, Features of Similarity, Psychological Review 84(4): 327-352, 1977. • [4] Debabrata Dey, A Distance-Based Approach to Entity Reconciliation in Heterogeneous Databases, IEEE Transactions on Knowledge and Data Engineering, 14(3), May/June 2002. • [5] Hui Han, Hongyuan Zha and C. Lee Giles, A Model-based K-means Algorithm for Name Disambiguation, Proceedings of the Second International Semantic Web Conference (ISWC-03) Workshop on Semantic Web Technologies for Searching and Retrieving Scientific Data, 2003. • [6] M. Andrea Rodriguez and Max J. Egenhofer, Determining Semantic Similarity Among Entity Classes from Different Ontologies, IEEE Transactions on Knowledge and Data Engineering, 15(2): 442-456, 2003. • [7] Periklis Andritsos, Renee J. Miller and Panayiotis Tsaparas, Information-Theoretic Tools for Mining Database Structure from Large Data Sets, SIGMOD Conference 2004: 731-742. • [8] Vipul Kashyap and Amit Sheth, Semantic and Schematic Similarities between Database Objects: A Context-based Approach, VLDB Journal 5(4): 276-304, 1996. • [9] Miyoung Cho, Junho Choi and Pankoo Kim, An Efficient Computational Method for Measuring Similarity between Two Conceptual Entities, WAIM 2003: 381-388.