Schema Matching Algorithms

1 / 17

Schema Matching Algorithms - PowerPoint PPT Presentation

Schema Matching Algorithms

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

1. Schema Matching Algorithms Phil Bernstein CSE 590sw February 2003

2. Acknowledgments • Many of these slides are from presentations by • Erhard Rahm, Hong Hai Do, and Sergey Melnik (Univ. of Leipzig)

3. Mapping Schemas • Given two schemas, return an expression that translates instances of one schema into instances of the other (i.e., performs data translation). • Applications • Web site integration • Catalog integration • Schema evolution • Data translation • Reverse engineering • Data warehouse loading • XML message translation • Ontology integration

4. Partitioning the Problem • Schema matching (aka mapping discovery) • Given two schemas, return a set of corres-pondences that specify pairs of related terms • Semantic Mapping (aka query discovery) • Given correspondences between two schemas, return an expression that translates instances of one schema into instances of the other (i.e., performs data translation).

5. TheSchema Matching Problem • Types of schemas: • Database schemas, XML schemas, ontologies, …, • Input: • Two (or more) schemas, S1 and S2 • Possibly data instances for S1 and S2 • Background knowledge – thesauri, validated matches, constraints (keys, data types), standard schemas, ontologies, NLP, etc. • Output: • A mapping between S1 and S2 Phils’ mods are in underscored Times Roman Used by permission of Erhard Rahm, Hong Hai Do, and Sergey Melnik

6. Tool 3 (Data Warehousing schemas) Tool 2 (E-Business schemas) Tool 4 (Database Design) Tool 1 (Portal schemas) Schema import/ export Global libraries (dictionaries, schemas …) Generic Match Implementation Internal Schema Representation Generic Match Implementation Used by permission of Erhard Rahm, Hong Hai Do, and Sergey Melnik

7. Match Example 1 Epinions.com Yahoo! Shopping Electronics and Photography Electronics Camcorders Video DV Camcorders Remote Controls Computers and Software Digital Cameras Home PDAs and Handhelds Computers & Internet Computer Hardware PDAs Used by permission of Erhard Rahm, Hong Hai Do, and Sergey Melnik

8. Containment link Node Match Example 2 CREATE TABLE PurchaseOrder.Customer ( custNo INT, custName VARCHAR(50), custStreet VARCHAR(50), custCity VARCHAR(50), custZip VARCHAR(10), PRIMARY KEY (custNo) ) ; CREATE TABLE PurchaseOrder.ShipTo ( poNo INT, custNo INT REFERENCES PO1.Customer, shipToStreet VARCHAR(50), shipToCity VARCHAR(50), shipToZip VARCHAR(10), PRIMARY KEY (poNo) ) ; <xsd:schema xmlns:xsd="http://www.w3.org/XMLSchema"> <xsd:complexType name=“POrder" > <xsd:sequence> <xsd:element name=“DeliverTo" type="Address"/> <xsd:element name=“BillTo" type="Address"/> </xsd:sequence> </xsd:complexType> <xsd:complexType name="Address" > <xsd:sequence> <xsd:element name=“Street" type="xsd:string"/> <xsd:element name=“City" type="xsd:string"/> <xsd:element name=“Zip" type="xsd:decimal"/> </xsd:sequence> </xsd:complexType> </xsd:schema> a) A relational schema and an XML schema POrder PurchaseOrder Customer ShipTo DeliverTo BillTo custName Address custStreet shipToStreet shipToCity custCity custZip shipToZip City Street Zip Legends: b) Their corresponding graph representation Used by permission of Erhard Rahm, Hong Hai Do, and Sergey Melnik

9. Tool Example (Biztalk Mapper) Used by permission of Erhard Rahm, Hong Hai Do, and Sergey Melnik

10. Current Situation • Finding mappings is now the bottleneck! • largely done by hand • labor intensive, tedious & error prone • Will only get worse • data sharing & XML become pervasive • proliferation of DTDs and XML schemas • translation of legacy data • reconciling ontologies on semantic web • Need semi-automatic approaches to scale up! Used by permission of Erhard Rahm, Hong Hai Do, and Sergey Melnik

11. Why Matching is Difficult • Aims to identify same real-world entity • using names, structures, types, data values, etc. • Schemas represent same entity differently • different names => same entity (synonyms): • client & user => customer • same names => different entities (homonyms): • bug=>insect or software error • Schema & data never fully capture semantics! • not adequately documented, not sufficiently expressive • data values suffer from synonyms and homonyms too • Intended semantics is typically subjective! • Cannot be fully automated. Often hard for humans. Used by permission of Erhard Rahm, Hong Hai Do, and Sergey Melnik

12. Desiderata for Match Solution • Low degree of manual work • Accuracy, efficiency, ease of use • Extensibility • Exploit additional match techniques • Exploit additional background knowledge • Support for Reuse • Exploit previous manual or automatically generated matchings • Generic approach • Different schema languages • Different application areas Used by permission of Erhard Rahm, Hong Hai Do, and Sergey Melnik

13. Special Situations • Match schema S to an incremental modification of S • Can ignore homonyms and possibly synonyms • Little if any reshaping of the structure • Instances probably don’t help • Lightweight integration for the semantic web vs E-commerce or data warehouse loading. • The former can’t afford much human review • The latter needs “perfect” mappings and hence human review

14. Automatic Match Approaches • Individual approaches • Combining approaches: hybrid vs. composite Schema-based Instance-based Reuse-oriented Element Structure Element Structure Element • Dictionaries • Thesauri • Previous match results Constraint-based Constraint-based Linguistic Constraint-based Linguistic • Types • Keys • Parents • Children • Leaves • IR (word frequencies, key terms) • Value pattern and ranges • Names • Descriptions Used by permission of Erhard Rahm, Hong Hai Do, and Sergey Melnik

15. Real matches Suggested matches A: False Negatives B: True Positives C: False Positives D: True Negatives A B C D Match Quality Measures • Comparison of automatically with manually (i.e. real) derived match correspondences • Quality measures: • Overall : post-match effort to add missed (A) and to remove false matches (C): negative Overall no gain SimilarityFlooding [ICDE02]: Used by permission of Erhard Rahm, Hong Hai Do, and Sergey Melnik

16. UW & MSR Contributions • LSD [SIGMOD ] • Learning algorithm based on structures and instances • AnHai Doan, Pedro Domingos, Alon Halvey • Cupid [VLDB 02] • Structure matching • Jayant Madhavan, Phil Bernstein • Glue [WWW 03] • Taxonomy matching. Uses relaxation labeling. • An Hai Doan, Jayant Madhavan, Pedro Domingos,Alon Halevy • Mapping Knowledge Base (MKB) [submitted] • Reuses past mappings to help produce new mappings • Jayant Madhavan, Phil Bernstein, Chuang Chen, Alon Halevy, Pradeep Shenoy

17. Where’s the Research Action? • There’s always room for new techniques • Compare distribution of data values between elements of two schemas • Create a global schema by clustering elements from different sample schemas • Re-using mappings • Combining techniques • Better user interfaces