Multifaceted Exploitation of Metadata for Attribute Match Discovery in Information Integration
David W. Embley, David Jackman, Li Xu
Background
• Problem: Attribute Matching
• Matching Possibilities (Facets)
  • Attribute Names
  • Data-Value Characteristics
  • Expected Data Values
  • Data-Dictionary Information
  • Structural Properties
Approach
• Target Schema T
• Source Schema S
• Framework
  • Individual Facet Matching
  • Combining Facets
  • Best-First Match Iteration
Example
[Figure: Target Schema T and Source Schema S for a car domain. Labels appearing in the diagrams: Car, Year, Make, Model, Cost, Mileage, Miles, Style, Feature, Phone. Relationship sets are labeled "has" with participation constraints 0:1 or 0:*.]
Individual Facet Matching
• Attribute Names
• Data-Value Characteristics
• Expected Data Values
Attribute Names
• Target and Source Attributes
  • T: A
  • S: B
• WordNet
• C4.5 Decision Tree: feature selection
  • f0: same word
  • f1: synonym
  • f2: sum of distances to a common hypernym root
  • f3: number of different common hypernym roots
  • f4: sum of the number of senses of A and B
WordNet Rule
[Figure: decision-tree rule whose tests use the number of different common hypernym roots of A and B, the sum of the number of senses of A and B, and the sum of distances of A and B to a common hypernym.]
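The five name features can be illustrated with a small sketch. Everything here is an assumption made for illustration: the lexicon is a hand-made toy stand-in for WordNet, `name_features` is a hypothetical helper, f2 is read as the smallest summed distance to any shared hypernym, and f3 as the number of distinct root synsets the two words share. The slides do not specify these details.

```python
# Toy stand-in for WordNet: every synset and word entry below is hypothetical.
HYPERNYM = {            # synset -> parent synset (None marks a root)
    "car.n.01": "vehicle.n.01",
    "vehicle.n.01": "entity.n.01",
    "cost.n.01": "amount.n.01",
    "price.n.01": "amount.n.01",
    "amount.n.01": "entity.n.01",
    "entity.n.01": None,
    "cost.v.01": "be.v.01",
    "price.v.01": "determine.v.01",
    "be.v.01": None,
    "determine.v.01": None,
}
SENSES = {              # word -> synsets (senses) it belongs to
    "car": ["car.n.01"],
    "auto": ["car.n.01"],
    "cost": ["cost.n.01", "cost.v.01"],
    "price": ["price.n.01", "price.v.01"],
}


def ancestors(synset):
    """Map each synset on the path from `synset` up to its root to its distance."""
    dist, d = {}, 0
    while synset is not None:
        dist[synset] = d
        synset, d = HYPERNYM[synset], d + 1
    return dist


def root(synset):
    """Follow hypernym links until a synset with no parent is reached."""
    while HYPERNYM[synset] is not None:
        synset = HYPERNYM[synset]
    return synset


def name_features(a, b):
    """Return [f0, f1, f2, f3, f4] for attribute names a and b."""
    sa, sb = SENSES.get(a.lower(), []), SENSES.get(b.lower(), [])
    f0 = int(a.lower() == b.lower())          # same word
    f1 = int(bool(set(sa) & set(sb)))         # share a synset (synonyms)
    f2 = None                                 # stays None if no shared hypernym
    for x in sa:
        ax = ancestors(x)
        for y in sb:
            ay = ancestors(y)
            for common in ax.keys() & ay.keys():
                total = ax[common] + ay[common]
                if f2 is None or total < f2:
                    f2 = total
    f3 = len({root(x) for x in sa} & {root(y) for y in sb})
    f4 = len(sa) + len(sb)                    # total number of senses
    return [f0, f1, f2, f3, f4]


print(name_features("Cost", "Price"))
print(name_features("car", "auto"))
```

A vector like this would be one training or test instance for the decision tree; a real implementation would query WordNet rather than a fixed dictionary.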
Data-Value Characteristics
• C4.5 Decision Tree
• Features
  • Numeric data (Mean, variation, standard deviation, …)
  • Alphanumeric data (String length, numeric ratio, space ratio)
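A minimal sketch of how such per-attribute features might be computed from a column of data values. The function names are hypothetical, "variation" is assumed here to mean population variance, and the ratios are assumed to be character counts divided by total characters. The slides list further features ("…") that are not reproduced.

```python
import statistics


def numeric_features(values):
    """Summary statistics for a column of numeric data values."""
    return {
        "mean": statistics.mean(values),
        "variance": statistics.pvariance(values),   # assumption: population variance
        "std_dev": statistics.pstdev(values),
    }


def alphanumeric_features(values):
    """Average string length, digit ratio, and space ratio for a column of strings."""
    total_chars = sum(len(v) for v in values)
    if total_chars == 0:
        return {"avg_length": 0.0, "numeric_ratio": 0.0, "space_ratio": 0.0}
    digits = sum(ch.isdigit() for v in values for ch in v)
    spaces = sum(ch.isspace() for v in values for ch in v)
    return {
        "avg_length": total_chars / len(values),
        "numeric_ratio": digits / total_chars,
        "space_ratio": spaces / total_chars,
    }


# Hypothetical source columns
print(numeric_features([1998, 2001, 1997, 2003]))
print(alphanumeric_features(["Ford Taurus", "Honda Civic", "BMW 325i"]))
```

Feature dictionaries computed for a target attribute and a source attribute could then be compared or concatenated as input to the decision tree.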
Expected Data Values
• Target Schema T and Source Schema S
  • Regular expression recognizer for attribute A in T
  • Data instances for attribute B in S
• Hit Ratio = N’/N for (A, B) match
  • N’: number of B data instances recognized by the regular expressions of A
  • N: number of B data instances
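The hit ratio can be sketched in a few lines. The patterns and sample values below are hypothetical, and treating "recognized" as a full match of the trimmed value is an assumption; the slides do not say whether partial matches count.

```python
import re


def hit_ratio(patterns, instances):
    """Return N'/N: the fraction of instances fully matched by any pattern."""
    if not instances:
        return 0.0
    compiled = [re.compile(p) for p in patterns]
    hits = sum(
        1 for value in instances
        if any(rx.fullmatch(value.strip()) for rx in compiled)
    )
    return hits / len(instances)


# Hypothetical recognizer for a target attribute A = "Year"
year_patterns = [r"(19|20)\d{2}", r"'\d{2}"]
# Hypothetical data instances for a source attribute B
source_values = ["1998", "2001", "'97", "red", "12,000"]

print(hit_ratio(year_patterns, source_values))  # 3 of the 5 values are recognized
```

A ratio near 1 suggests that B holds the kind of values A expects; a ratio near 0 suggests the pair is not a match.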
Combined Measures
• Threshold: 0.5
[Table: matrix of 0/1 entries; row and column labels were not preserved.]
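One way the combination and thresholding steps could fit together is sketched below. The slides do not state the combination formula or the exact iteration procedure, so this sketch assumes a plain average of the per-facet confidences and a greedy, highest-score-first selection that accepts scores at or above the 0.5 threshold. All names and scores are hypothetical.

```python
def combine(facet_scores):
    """Average the per-facet confidences for each (target, source) pair."""
    return {pair: sum(scores) / len(scores) for pair, scores in facet_scores.items()}


def best_first_matches(combined, threshold=0.5):
    """Accept pairs in descending score order, using each attribute at most once."""
    matches, used_targets, used_sources = [], set(), set()
    ranked = sorted(combined.items(), key=lambda item: item[1], reverse=True)
    for (target, source), score in ranked:
        if score < threshold:
            break                      # every remaining pair scores lower still
        if target not in used_targets and source not in used_sources:
            matches.append((target, source, score))
            used_targets.add(target)
            used_sources.add(source)
    return matches


# Hypothetical confidences from three facets for each candidate pair
facet_scores = {
    ("Mileage", "Miles"): [0.9, 0.8, 1.0],
    ("Mileage", "Cost"): [0.1, 0.7, 0.2],
    ("Cost", "Cost"): [1.0, 1.0, 0.7],
    ("Cost", "Miles"): [0.0, 0.6, 0.3],
}

for target, source, score in best_first_matches(combine(facet_scores)):
    print(f"{target} -> {source}: {score:.2f}")
```

Accepted pairs correspond to 1 entries in a 0/1 match matrix and all other pairs to 0 entries.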
Experimental Results
[Chart labels, two series as they appear on the slide: F1 93.75%, F2 84%, F3 92%; F1 98.9%, F2 97.9%, F3 98.4%]
• Matched Attributes
  • 100% (32 of 32)
• Unmatched Attributes
  • 99.5% (374 of 376)
  • (“Feature”, “Color”)
  • (“Feature”, “Body Type”)
Conclusions
• Direct Attribute Matching: feasible
• Individual-Facet Matching: good
• Multifaceted Matching: better
Future Work
• Additional Facets
• More Sophisticated Combinations
• Additional Application Domains
• Automating Feature Selection
• Indirect Attribute Matching
www.deg.byu.edu