1 LRI –Paris Sud University & CNRS 2 ETIS, Cergy-Pontoise 3 INRIA Saclay Île de France

N2R-Part: Identity Link Discovery using Partially Aligned Ontologies, by Nathalie Pernelle1, Fatiha Saïs1, Brigitte Safar1, Maria Koutraki2 and Tushar Ghosh1,3. 1 LRI – Paris Sud University & CNRS; 2 ETIS, Cergy-Pontoise; 3 INRIA Saclay Île de France. WOD 2013

Context
• Discover identity links between data items in RDF data sources structured by distinct OWL ontologies (same restaurant, same laboratory, …)
• Existing data linking tools exploit the mapped entities (classes, properties) of the ontologies to define linking rules [Silk, LDIF]: A = {HumanBeing ↔ Person, foaf:name ↔ name, …}
• Some of these mappings can be declared or discovered by (semi-)automatic alignment tools [Shvaiko et al. 2012]
• But the set of mappings can be incomplete, in particular the set of property mappings

Two simple ontologies
[Figure: ontologies O1 and O2. Classes shown: Restaurant, Address, Person, Chief, with String-valued datatype properties. Property labels shown: street, city, location, hasLocation, own, hasOwner, hasChief, hasCook, food, cuisineType, smoking, name, rname, title, phonenum, phone, acceptedCard, creditCard. Legend: subsumption, mapped property, unmapped property.]
Class mappings (complete set): {Restaurant ↔ Restaurant, …}
Property mappings: {street ↔ street, rname ↔ title, city ↔ city, hasLocation ↔ location}

Two restaurants to compare
[Figure: instance graphs of restaurant r1 (in O1) and restaurant r2 (in O2), both labelled "Lotus bleu", with related instances a1, a2, p1 and p2. Property labels shown: name, title, food, cuisineType, creditCard, acceptedCard, smoking, location, hasLocation, own, hasOwner, phone, phonenum. Values shown: thai, asian, chinese, Visa card, Mastercard, "Only at bar", 3368555158.]

Aim
The “values” of the mapped properties can be very heterogeneous, or even unknown for some instances.
Street: "downing St, London, SW1A 2AA" vs. "10 Downing street"
How can recall be improved in such a context?

Main ideas
• Exploit unmapped properties to increase the similarity scores
• Exploit the ontology semantics and the property values to select the best comparable properties for two compared class instances
• Combine similarities between mapped properties and selected unmapped properties
• Propagate the similarities thanks to a graph-based data linking approach: same Restaurant → same Address → same City → same Country
• Focus on data sources that can be replicated locally
• Extend an existing graph-based data linking tool (N2R [Saïs et al. 09])

N2R Linking Tool
• Knowledge-based approach (i.e. keys)
  Common mapped keys of O1/O2 (cartesian product): O1: name; O2: birthDate, deathDate; resulting key: name + birthDate + deathDate
• Non-linear equation system
• Each equation represents how a similarity score xi can be computed using related similarity scores: fi(X) = max(fi-df(X), fi-ndf(X))
• Solved using an iterative method
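The iterative resolution of such an equation system can be sketched as a fixed-point computation. The Python sketch below is only an illustration: the two toy equations, their constants and weights, the starting point (all scores at 0) and the stopping threshold `eps` are assumptions, not details given on the slide.

```python
def solve_fixed_point(equations, eps=1e-4, max_iter=100):
    """Iterate x <- F(x), starting from all-zero scores, until no
    score changes by more than eps (or max_iter is reached)."""
    x = [0.0] * len(equations)
    for _ in range(max_iter):
        new_x = [f(x) for f in equations]
        converged = max(abs(a - b) for a, b in zip(new_x, x)) < eps
        x = new_x
        if converged:
            break
    return x


# Hypothetical equations for two pairs of instances:
# x[0] = similarity of (r1, r2), x[1] = similarity of (a1, a2).
def f_restaurants(x):
    f_df = 0.5                       # assumed key-based part (a literal similarity)
    f_ndf = 0.5 * 0.5 + 0.5 * x[1]   # assumed weighted sum that depends on x[1]
    return max(f_df, f_ndf)          # f_i(X) = max(f_i-df(X), f_i-ndf(X))


def f_addresses(x):
    return 0.9                       # assumed: only a literal similarity is involved


scores = solve_fixed_point([f_restaurants, f_addresses])
print(scores)
```

In this toy system the address score is reached first, and the restaurant score then rises from 0.5 to about 0.7 once the address similarity has been propagated into its equation.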

Impacts and propagation
[Figure: propagation graph over the pairs (r1, r2), (a1, a2) and (p1, p2), linked by key properties and by the best comparable object properties.]
• Mapped: {Le lotus bleu}, {le lotus bleu}
• Mapped: {17 rue Polar}, {rue Polar}
• Mapped: {Paris}, {Paris}
• Mapped: {Chang Lee}, {Chang lee}
• Best comparable datatype properties: {thai, asian}, {thai, asian, chinese}; {3368555158}, {3368555158, 33888…}; {Visacard}, {Mastercard, Visacard}

Comparable properties
• Exploit the ontology to select comparable properties
• Comparable object properties: there exists one compatible (more specific or equivalent) domain and one compatible range, and inverse properties are considered
  own (domain Person, range Restaurant) is comparable to Inverse(hasOwner) (domain Restaurant, range Person) and to Inverse(hasChief) (domain Restaurant, range Chief)
• Comparable datatype properties: compatible w.r.t. the datatypes of XML Schema
  cuisineType is comparable to food, acceptedCard, … (domain Restaurant, range string)
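The comparability test for object properties can be illustrated with a small Python sketch. It rests on several assumptions that the slide does not state: that Chief is a subclass of Person, that two classes are compatible when one is equal to or more specific than the other, and that class names are already unified by the class mappings. The names `Prop`, `SUBCLASS_OF`, `compatible` and `comparable` are hypothetical.

```python
from typing import NamedTuple, Optional


class Prop(NamedTuple):
    name: str
    domain: str
    range_: str


# Assumed toy hierarchy (child -> parent); not given explicitly on the slide.
SUBCLASS_OF = {"Chief": "Person"}


def ancestors(cls: str) -> set:
    """Return cls together with all of its superclasses."""
    result = {cls}
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        result.add(cls)
    return result


def compatible(c1: str, c2: str) -> bool:
    """Assumed reading: equal classes, or one more specific than the other."""
    return c1 in ancestors(c2) or c2 in ancestors(c1)


def inverse(p: Prop) -> Prop:
    return Prop(f"Inverse({p.name})", p.range_, p.domain)


def comparable(p1: Prop, p2: Prop) -> Optional[Prop]:
    """Return p2 or its inverse if it is comparable to p1, else None."""
    for candidate in (p2, inverse(p2)):
        if (compatible(p1.domain, candidate.domain)
                and compatible(p1.range_, candidate.range_)):
            return candidate
    return None


own = Prop("own", "Person", "Restaurant")
has_owner = Prop("hasOwner", "Restaurant", "Person")
has_chief = Prop("hasChief", "Restaurant", "Chief")
has_location = Prop("hasLocation", "Restaurant", "Address")

print(comparable(own, has_owner))
print(comparable(own, has_chief))
print(comparable(own, has_location))
```

Under these assumptions, own is matched with the inverses of hasOwner and hasChief, while hasLocation is rejected because Address is compatible with neither Person nor Restaurant.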

Similarity of best comparable properties
Exploit property values to select the best comparable properties for two compared class instances
• For 2 datatype property values: elementary similarity measures, e.g. sim("asian", "asian") = 1
• Sum (> given threshold): (i1, i2, prop1, prop2, sum, maxNumberOfPropertyInstances), e.g. (r1, r2, cuisineType, food, 2, 3)
• Finally, similarity of (r1, r2) based on unmapped datatype properties: simNAP(r1, r2) = (1+2+1)/(2+3+2) = 4/7 ≈ 0.57
• Same process for object property values, but with propagation
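A minimal Python sketch of this computation for datatype properties follows. Several choices are assumptions made for illustration only: the elementary measure is a case-insensitive exact match, the threshold is 0.8, the best comparable property is the candidate with the highest sum, and the property names attached to r1 and r2 are guessed from the example figures. The value "33888…" is truncated on the slide and is kept as is.

```python
def sim(v1, v2):
    """Assumed elementary measure: case-insensitive exact match."""
    return 1.0 if v1.strip().lower() == v2.strip().lower() else 0.0


def pair_score(values1, values2, threshold=0.8):
    """Sum the best similarity of each value of values1 when it exceeds
    the threshold; also return the larger number of property values."""
    total = 0.0
    for v1 in values1:
        best = max((sim(v1, v2) for v2 in values2), default=0.0)
        if best > threshold:
            total += best
    return total, max(len(values1), len(values2))


def sim_nap(inst1, inst2, comparable):
    """Aggregate, over the properties of inst1, the score obtained with
    the best comparable property of inst2."""
    num, den = 0.0, 0
    for p1, candidates in comparable.items():
        scored = [pair_score(inst1[p1], inst2[p2])
                  for p2 in candidates if p1 in inst1 and p2 in inst2]
        if scored:
            total, size = max(scored, key=lambda s: s[0])
            num += total
            den += size
    return num / den if den else 0.0


r1 = {"cuisineType": ["thai", "asian"],
      "phonenum": ["3368555158"],
      "acceptedCard": ["Visacard"]}
r2 = {"food": ["thai", "asian", "chinese"],
      "phone": ["3368555158", "33888…"],   # second value truncated in the source
      "creditCard": ["Mastercard", "Visacard"]}

# Assumption: all string-valued properties are comparable with each other.
comparable = {p: list(r2) for p in r1}

result = sim_nap(r1, r2, comparable)
print(round(result, 2))
```

With these inputs the sketch evaluates (2+1+1)/(3+2+2) = 4/7, which is about 0.57.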

Extension of N2R
• Keep the key importance in the equation
• Give a bigger importance to the mapped properties: fi(X) = max(fi-df(X), fi-map(X) + α fi-unmap(X))
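The effect of the extended equation on a single pair can be sketched in Python as below. The input scores, the value α = 0.5 and the decision threshold 0.6 are made-up illustration values, and 0 < α < 1 is assumed so that mapped properties weigh more than unmapped ones.

```python
def extended_score(f_df, f_map, f_unmap, alpha=0.5):
    """f_i(X) = max(f_i-df(X), f_i-map(X) + alpha * f_i-unmap(X))."""
    return max(f_df, f_map + alpha * f_unmap)


THRESHOLD = 0.6  # hypothetical decision threshold for declaring a link

# No key match, weak mapped-property similarity.
mapped_only = extended_score(f_df=0.0, f_map=0.4, f_unmap=0.0)
with_unmapped = extended_score(f_df=0.0, f_map=0.4, f_unmap=0.5)

print(mapped_only >= THRESHOLD, with_unmapped >= THRESHOLD)

# A strong key-based score still dominates the result.
print(extended_score(f_df=0.9, f_map=0.4, f_unmap=0.5))
```

In this toy case the pair is rejected when only mapped properties are used and accepted once the unmapped properties contribute, which is the recall gain the extension aims at.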

Conclusions – Future Work
• Conclusions
  • Extension of a graph-based data linking tool to take into account unmapped properties
• Future Work
  • Evaluation of this strategy on real data sets
  • Focus on declared (or learned) unmapped keys/unmapped discriminative properties [symeonidou11, atencia12] (i.e., select phone, but not creditCard)
  • Discover new mappings between properties thanks to discovered links

Thank you for your attention! Questions?
