Multifaceted Exploitation of Metadata for Attribute Match Discovery in Information Integration

David W. Embley, David Jackman, Li Xu

Background
• Problem: Attribute Matching
• Matching Possibilities (Facets)
  • Attribute Names
  • Data-Value Characteristics
  • Expected Data Values
  • Data-Dictionary Information
  • Structural Properties

Approach
• Target Schema T
• Source Schema S
• Framework
  • Individual Facet Matching
  • Combining Facets
  • Best-First Match Iteration

Example
[Figure: Target Schema T and Source Schema S for a car domain. Node labels appearing in the diagram: Car, Year, Make, Model, Mileage, Miles, Cost, Style, Feature, Phone. Edges are labeled "has" with participation constraints 0:1 and 0:*.]

Individual Facet Matching
• Attribute Names
• Data-Value Characteristics
• Expected Data Values

Attribute Names
• Target and Source Attributes
  • T: A
  • S: B
• WordNet
• C4.5 Decision Tree: feature selection
  • f0: same word
  • f1: synonym
  • f2: sum of distances to a common hypernym root
  • f3: number of different common hypernym roots
  • f4: sum of the number of senses of A and B
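
The name features above can be illustrated with a minimal sketch. It does not call WordNet; the `HYPERNYM` and `SYNONYMS` tables are small hypothetical stand-ins, and f2 is assumed here to use the nearest common hypernym. f3 and f4 depend on multiple word senses and are left out.

```python
# Hypothetical toy taxonomy (child -> parent hypernym); NOT real WordNet data.
HYPERNYM = {
    "car": "vehicle",
    "truck": "vehicle",
    "vehicle": "artifact",
    "phone": "artifact",
    "artifact": "entity",
    "cost": "attribute",
    "price": "attribute",
    "attribute": "entity",
}
# Hypothetical synonym pairs, standing in for shared WordNet synsets.
SYNONYMS = {frozenset({"cost", "price"}), frozenset({"mileage", "miles"})}


def ancestors(word):
    """Map `word` and each of its hypernym ancestors to its distance from `word`."""
    dist, d = {}, 0
    while word is not None:
        dist[word] = d
        word = HYPERNYM.get(word)
        d += 1
    return dist


def name_features(a, b):
    """Compute f0, f1, f2 for target attribute name `a` and source name `b`."""
    a, b = a.lower(), b.lower()
    f0 = int(a == b)                              # f0: same word
    f1 = int(frozenset({a, b}) in SYNONYMS)       # f1: synonym
    da, db = ancestors(a), ancestors(b)
    common = set(da) & set(db)
    # f2: sum of distances to the nearest common hypernym (None if there is none)
    f2 = min((da[c] + db[c] for c in common), default=None)
    return {"f0": f0, "f1": f1, "f2": f2}


print(name_features("Car", "Truck"))
print(name_features("Cost", "Price"))
```

Feature vectors of this shape, labeled as match or non-match, are the kind of input a decision-tree learner such as C4.5 would train on.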

WordNet Rule
[Figure: decision-tree rule over the WordNet features. Node labels: the number of different common hypernym roots of A and B; the sum of the number of senses of A and B; the sum of distances of A and B to a common hypernym.]

Confidence Measures

Data-Value Characteristics
• C4.5 Decision Tree
• Features
  • Numeric data (mean, variation, standard deviation, …)
  • Alphanumeric data (string length, numeric ratio, space ratio)
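
A minimal sketch of how such per-attribute features might be computed from a column of data values. The function names are hypothetical, "variation" is assumed to mean variance, and the ratios are assumed to be character counts divided by total characters; the slides do not define these details.

```python
import statistics


def numeric_features(values):
    """Summary statistics for a column of numeric values (population form)."""
    return {
        "mean": statistics.mean(values),
        "variance": statistics.pvariance(values),  # assumption: "variation" = variance
        "std_dev": statistics.pstdev(values),
    }


def alphanumeric_features(values):
    """Character-level statistics for a non-empty column of string values."""
    total_chars = sum(len(v) for v in values)
    digits = sum(ch.isdigit() for v in values for ch in v)
    spaces = sum(ch.isspace() for v in values for ch in v)
    return {
        "mean_length": total_chars / len(values),
        "numeric_ratio": digits / total_chars if total_chars else 0.0,
        "space_ratio": spaces / total_chars if total_chars else 0.0,
    }


print(numeric_features([2, 4, 4, 4, 5, 5, 7, 9]))
print(alphanumeric_features(["Ford F150", "VW"]))
```

Two attributes whose feature vectors are close are candidates for a match; the decision tree learns which differences matter.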

Expected Data Values
• Target Schema T and Source Schema S
  • Regular expression recognizer for attribute A in T
  • Data instances for attribute B in S
• Hit Ratio = N'/N for (A, B) match
  • N': number of B data instances recognized by the regular expressions of A
  • N: number of B data instances
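
The hit ratio can be sketched as follows. The `YEAR_PATTERNS` recognizer and the sample values are hypothetical, and a value is assumed to count as recognized only when an expression matches it in full.

```python
import re


def hit_ratio(patterns, instances):
    """Return N'/N: the fraction of source instances recognized by the target's patterns."""
    if not instances:
        return 0.0
    compiled = [re.compile(p) for p in patterns]
    # N': instances fully matched by at least one regular expression of A
    hits = sum(
        1 for value in instances
        if any(rx.fullmatch(value.strip()) for rx in compiled)
    )
    return hits / len(instances)  # N: all instances of B


# Hypothetical recognizer for a target attribute "Year".
YEAR_PATTERNS = [r"(19|20)\d{2}", r"'\d{2}"]
# Hypothetical data instances for a source attribute B.
source_values = ["1998", "2001", "'97", "Accord", "2003"]

ratio = hit_ratio(YEAR_PATTERNS, source_values)
print(ratio)  # 4 of the 5 values are recognized
```

A ratio near 1 suggests that B holds the kind of values expected for A; a ratio near 0 suggests it does not.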

Combined Measures
• Threshold: 0.5
[Table: matrix of 0/1 entries, five of which are 1. The row and column layout is not recoverable from the slide text.]
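
A minimal sketch of combining facet confidences and selecting matches against the 0.5 threshold. The slides do not say how facets are combined or how the best-first iteration proceeds, so the plain average, the strict comparison with the threshold, and the one-pass greedy selection below are all assumptions; the scores are made up for illustration.

```python
def combine(facet_scores):
    """Average per-facet confidences for each (target, source) pair (assumed rule)."""
    pairs = facet_scores[0].keys()
    return {p: sum(f[p] for f in facet_scores) / len(facet_scores) for p in pairs}


def best_first_matches(combined, threshold=0.5):
    """Greedily accept the highest-scoring pairs above the threshold, one per attribute."""
    matches, used_targets, used_sources = [], set(), set()
    ranked = sorted(combined.items(), key=lambda kv: kv[1], reverse=True)
    for (target, source), score in ranked:
        if score <= threshold:
            break  # every remaining pair scores at or below the threshold
        if target not in used_targets and source not in used_sources:
            matches.append((target, source))
            used_targets.add(target)
            used_sources.add(source)
    return matches


# Hypothetical confidences from three facets for (target, source) pairs.
names = {("Year", "Year"): 1.0, ("Year", "Miles"): 0.0,
         ("Mileage", "Year"): 0.0, ("Mileage", "Miles"): 0.8}
values = {("Year", "Year"): 0.9, ("Year", "Miles"): 0.4,
          ("Mileage", "Year"): 0.3, ("Mileage", "Miles"): 0.7}
expected = {("Year", "Year"): 0.8, ("Year", "Miles"): 0.2,
            ("Mileage", "Year"): 0.1, ("Mileage", "Miles"): 0.9}

combined = combine([names, values, expected])
matches = best_first_matches(combined)
print(matches)
```

Marking accepted pairs as 1 and all others as 0 yields a 0/1 matrix of the kind shown on this slide.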

Final Confidence Measures

Experimental Results
• Matched Attributes
  • 100% (32 of 32)
• Unmatched Attributes
  • 99.5% (374 of 376)
  • “Feature” --- “Color”
  • “Feature” --- “Body Type”
[Chart values, in slide order: F1 93.75%, F2 84%, F3 92%; F1 98.9%, F2 97.9%, F3 98.4%]

Conclusions
• Direct Attribute Matching – feasible
• Individual-Facet Matching – good
• Multifaceted Matching – better

Future Work
• Additional Facets
• More Sophisticated Combinations
• Additional Application Domains
• Automating Feature Selection
• Indirect Attribute Matching

www.deg.byu.edu
