This study explores attribute matching possibilities for information integration by leveraging metadata such as attribute names, data value characteristics, and data dictionary information. The approach focuses on combining facets and using a best-first match iteration for improved accuracy. Using various measures and decision trees, the study achieves high confidence levels in attribute matching. Experimental results show successful matched attributes and suggest future work to further enhance the multifaceted matching approach.
Multifaceted Exploitation of Metadata for Attribute Match Discovery in Information Integration
David W. Embley, David Jackman, Li Xu
Background • Problem : Attribute Matching • Matching Possibilities (Facets) • Attribute Names • Data-Value Characteristics • Expected Data Values • Data-Dictionary Information • Structural Properties
Approach • Target Schema T • Source Schema S • Framework • Individual Facet Matching • Combining Facets • Best-First Match Iteration
Example • [Figure: Target Schema T and Source Schema S, both centered on Car. Attribute labels in the diagrams: Year, Make, Model, Cost, Style, Phone, Feature, Mileage, Miles. Relationships are labeled "has" with participation constraints 0:1 and 0:*.]
Individual Facet Matching • Attribute Names • Data-Value Characteristics • Expected Data Values
Attribute Names • Target and Source Attributes • T : A • S : B • WordNet • C4.5 Decision Tree: feature selection • f0: same word • f1: synonym • f2: sum of distances to a common hypernym root • f3: number of different common hypernym roots • f4: sum of the number of senses of A and B
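The five name features can be illustrated with a minimal sketch. It assumes a tiny hand-built hypernym map in place of WordNet, so the HYPERNYM, SYNONYMS, and SENSES tables and the name_features function are all hypothetical. It also assumes f2 is measured to the nearest common hypernym, and that each word has a single hypernym chain, so f3 can only be 0 or 1 here.

```python
# Hypothetical toy data standing in for WordNet lookups.
HYPERNYM = {            # child -> parent
    "car": "motor_vehicle",
    "auto": "motor_vehicle",
    "truck": "motor_vehicle",
    "motor_vehicle": "vehicle",
    "vehicle": "entity",
    "cost": "expenditure",
}
SYNONYMS = {frozenset({"car", "auto"})}
SENSES = {"car": 5, "auto": 1, "truck": 2, "cost": 3}  # made-up sense counts


def ancestors(word):
    """Map the word and each of its hypernym ancestors to its distance."""
    dist, d = {}, 0
    while word is not None:
        dist[word] = d
        word = HYPERNYM.get(word)
        d += 1
    return dist


def name_features(a, b):
    """Return (f0, f1, f2, f3, f4) for attribute names a and b."""
    a, b = a.lower(), b.lower()
    da, db = ancestors(a), ancestors(b)
    common = set(da) & set(db)
    f0 = a == b                                   # same word
    f1 = frozenset({a, b}) in SYNONYMS            # synonym
    # Sum of distances to the nearest common hypernym (None if there is none).
    f2 = min((da[c] + db[c] for c in common), default=None)
    # Common hypernyms that have no parent, i.e. common roots.
    f3 = sum(1 for c in common if c not in HYPERNYM)
    f4 = SENSES.get(a, 1) + SENSES.get(b, 1)      # total number of senses
    return f0, f1, f2, f3, f4


car_truck = name_features("Car", "Truck")
car_cost = name_features("Car", "Cost")
```

In the slides, vectors like these are the inputs to the C4.5 decision tree; training the tree itself is not shown here.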
WordNet Rule • [Figure: decision tree whose tests use the number of different common hypernym roots of A and B, the sum of the number of senses of A and B, and the sum of distances of A and B to a common hypernym.]
Data-Value Characteristics • C4.5 Decision Tree • Features • Numeric data (Mean, variation, standard deviation, …) • Alphanumeric data (String length, numeric ratio, space ratio)
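The value features listed above might be computed as in the following sketch. The function names are hypothetical, "variation" is read as variance (an assumption, since the slide does not define it), and the ratios are taken over all characters in the column.

```python
import statistics


def numeric_features(values):
    """Summary statistics for a column of numeric data values."""
    return {
        "mean": statistics.mean(values),
        "variance": statistics.pvariance(values),  # assumed meaning of "variation"
        "std_dev": statistics.pstdev(values),
    }


def alphanumeric_features(values):
    """Length and character-class ratios for a column of string values."""
    total_chars = sum(len(v) for v in values)
    if not values or total_chars == 0:
        return {"mean_length": 0.0, "numeric_ratio": 0.0, "space_ratio": 0.0}
    digits = sum(ch.isdigit() for v in values for ch in v)
    spaces = sum(ch == " " for v in values for ch in v)
    return {
        "mean_length": total_chars / len(values),
        "numeric_ratio": digits / total_chars,
        "space_ratio": spaces / total_chars,
    }


num = numeric_features([2, 4, 4, 4, 5, 5, 7, 9])
alpha = alphanumeric_features(["ab 12", "c3"])
```

Feature dictionaries for a target attribute and a source attribute would then be compared by the decision tree, which this sketch does not implement.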
Expected Data Values • Target Schema T and Source Schema S • Regular expression recognizer for attribute A in T • Data instances for attribute B in S • Hit Ratio = N’/N for (A, B) match • N’ : number of B data instances recognized by the regular expressions of A • N: number of B data instances
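The hit ratio N'/N can be sketched in a few lines. The hit_ratio function, the patterns for a "Year" recognizer, and the sample source values are hypothetical illustrations, and a value counts as recognized only if one pattern matches it in full (an assumption about how the recognizer is applied).

```python
import re


def hit_ratio(patterns, values):
    """Fraction N'/N of source values recognized by any target pattern."""
    if not values:
        return 0.0
    compiled = [re.compile(p) for p in patterns]
    hits = sum(
        1 for v in values if any(c.fullmatch(v.strip()) for c in compiled)
    )
    return hits / len(values)


# Hypothetical recognizer for target attribute A = "Year".
year_patterns = [r"(19|20)\d{2}", r"'\d{2}"]
# Hypothetical data instances for source attribute B.
source_values = ["1998", "2001", "'97", "Accord", "2003"]

ratio = hit_ratio(year_patterns, source_values)  # 4 of 5 values recognized
```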
Combined Measures • Threshold: 0.5 • [Figure: matrix of 0/1 match values for target and source attribute pairs after thresholding.]
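One way to picture combining facets and then matching best-first is the sketch below. The slides do not state the combination rule, so a plain average of the per-facet confidences is assumed; the function names and all confidence values are hypothetical, and a pair is accepted when its combined confidence is at least the 0.5 threshold.

```python
def combine(facet_matrices):
    """Average per-facet confidences cell by cell (an assumed rule)."""
    rows, cols = len(facet_matrices[0]), len(facet_matrices[0][0])
    k = len(facet_matrices)
    return [
        [sum(m[i][j] for m in facet_matrices) / k for j in range(cols)]
        for i in range(rows)
    ]


def best_first_matches(combined, threshold=0.5):
    """Repeatedly accept the highest-confidence pair among unmatched attributes."""
    candidates = sorted(
        ((conf, i, j) for i, row in enumerate(combined) for j, conf in enumerate(row)),
        key=lambda t: -t[0],
    )
    used_targets, used_sources, matches = set(), set(), []
    for conf, i, j in candidates:
        if conf < threshold:
            break  # every remaining pair is below the threshold
        if i in used_targets or j in used_sources:
            continue  # one of the attributes is already matched
        matches.append((i, j))
        used_targets.add(i)
        used_sources.add(j)
    return matches


target = ["Year", "Make", "Mileage"]
source = ["Year", "Make", "Miles"]
# Hypothetical confidences, rows = target attributes, columns = source attributes.
names = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.8]]
values = [[0.9, 0.1, 0.6], [0.1, 0.9, 0.0], [0.7, 0.0, 0.9]]
expected = [[1.0, 0.0, 0.3], [0.0, 1.0, 0.0], [0.2, 0.0, 1.0]]

combined = combine([names, values, expected])
pairs = [(target[i], source[j]) for i, j in best_first_matches(combined)]
```

Because no confidences are recomputed after a match in this sketch, sorting once is equivalent to picking the maximum on each iteration; a version that re-scores after each accepted match would need a loop instead.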
Experimental Results • Matched Attributes: 100% (32 of 32) • Unmatched Attributes: 99.5% (374 of 376); pairs listed: "Feature" / "Color" and "Feature" / "Body Type" • [Figure labels: F1 93.75%, F2 84%, F3 92%; F1 98.9%, F2 97.9%, F3 98.4%]
Conclusions • Direct Attribute Matching – feasible • Individual-Facet Matching – good • Multifaceted Matching – better
Future Work • Additional Facets • More Sophisticated Combinations • Additional Application Domains • Automating Feature Selection • Indirect Attribute Matching
www.deg.byu.edu