Schema Matching Algorithms for Data Translation and Integration

Schema Matching Algorithms Phil Bernstein CSE 590sw February 2003

Acknowledgments • Many of these slides are from presentations by • Erhard Rahm, Hong Hai Do, and Sergey Melnik (Univ. of Leipzig)

Mapping Schemas • Given two schemas, return an expression that translates instances of one schema into instances of the other (i.e., performs data translation). • Applications • Web site integration • Catalog integration • Schema evolution • Data translation • Reverse engineering • Data warehouse loading • XML message translation • Ontology integration

Partitioning the Problem • Schema matching (aka mapping discovery) • Given two schemas, return a set of corres-pondences that specify pairs of related terms • Semantic Mapping (aka query discovery) • Given correspondences between two schemas, return an expression that translates instances of one schema into instances of the other (i.e., performs data translation).

TheSchema Matching Problem • Types of schemas: • Database schemas, XML schemas, ontologies, …, • Input: • Two (or more) schemas, S1 and S2 • Possibly data instances for S1 and S2 • Background knowledge – thesauri, validated matches, constraints (keys, data types), standard schemas, ontologies, NLP, etc. • Output: • A mapping between S1 and S2 Phils’ mods are in underscored Times Roman Used by permission of Erhard Rahm, Hong Hai Do, and Sergey Melnik

Tool 3 (Data Warehousing schemas) Tool 2 (E-Business schemas) Tool 4 (Database Design) Tool 1 (Portal schemas) Schema import/ export Global libraries (dictionaries, schemas …) Generic Match Implementation Internal Schema Representation Generic Match Implementation Used by permission of Erhard Rahm, Hong Hai Do, and Sergey Melnik

Match Example 1 Epinions.com Yahoo! Shopping Electronics and Photography Electronics Camcorders Video DV Camcorders Remote Controls Computers and Software Digital Cameras Home PDAs and Handhelds Computers & Internet Computer Hardware PDAs Used by permission of Erhard Rahm, Hong Hai Do, and Sergey Melnik

Containment link Node Match Example 2 CREATE TABLE PurchaseOrder.Customer ( custNo INT, custName VARCHAR(50), custStreet VARCHAR(50), custCity VARCHAR(50), custZip VARCHAR(10), PRIMARY KEY (custNo) ) ; CREATE TABLE PurchaseOrder.ShipTo ( poNo INT, custNo INT REFERENCES PO1.Customer, shipToStreet VARCHAR(50), shipToCity VARCHAR(50), shipToZip VARCHAR(10), PRIMARY KEY (poNo) ) ; <xsd:schema xmlns:xsd="http://www.w3.org/XMLSchema"> <xsd:complexType name=“POrder" > <xsd:sequence> <xsd:element name=“DeliverTo" type="Address"/> <xsd:element name=“BillTo" type="Address"/> </xsd:sequence> </xsd:complexType> <xsd:complexType name="Address" > <xsd:sequence> <xsd:element name=“Street" type="xsd:string"/> <xsd:element name=“City" type="xsd:string"/> <xsd:element name=“Zip" type="xsd:decimal"/> </xsd:sequence> </xsd:complexType> </xsd:schema> a) A relational schema and an XML schema POrder PurchaseOrder Customer ShipTo DeliverTo BillTo custName Address custStreet shipToStreet shipToCity custCity custZip shipToZip City Street Zip Legends: b) Their corresponding graph representation Used by permission of Erhard Rahm, Hong Hai Do, and Sergey Melnik

Tool Example (Biztalk Mapper) Used by permission of Erhard Rahm, Hong Hai Do, and Sergey Melnik

Current Situation • Finding mappings is now the bottleneck! • largely done by hand • labor intensive, tedious & error prone • Will only get worse • data sharing & XML become pervasive • proliferation of DTDs and XML schemas • translation of legacy data • reconciling ontologies on semantic web • Need semi-automatic approaches to scale up! Used by permission of Erhard Rahm, Hong Hai Do, and Sergey Melnik

Why Matching is Difficult • Aims to identify same real-world entity • using names, structures, types, data values, etc. • Schemas represent same entity differently • different names => same entity (synonyms): • client & user => customer • same names => different entities (homonyms): • bug=>insect or software error • Schema & data never fully capture semantics! • not adequately documented, not sufficiently expressive • data values suffer from synonyms and homonyms too • Intended semantics is typically subjective! • Cannot be fully automated. Often hard for humans. Used by permission of Erhard Rahm, Hong Hai Do, and Sergey Melnik

Desiderata for Match Solution • Low degree of manual work • Accuracy, efficiency, ease of use • Extensibility • Exploit additional match techniques • Exploit additional background knowledge • Support for Reuse • Exploit previous manual or automatically generated matchings • Generic approach • Different schema languages • Different application areas Used by permission of Erhard Rahm, Hong Hai Do, and Sergey Melnik

Special Situations • Match schema S to an incremental modification of S • Can ignore homonyms and possibly synonyms • Little if any reshaping of the structure • Instances probably don’t help • Lightweight integration for the semantic web vs E-commerce or data warehouse loading. • The former can’t afford much human review • The latter needs “perfect” mappings and hence human review

Automatic Match Approaches • Individual approaches • Combining approaches: hybrid vs. composite Schema-based Instance-based Reuse-oriented Element Structure Element Structure Element • Dictionaries • Thesauri • Previous match results Constraint-based Constraint-based Linguistic Constraint-based Linguistic • Types • Keys • Parents • Children • Leaves • IR (word frequencies, key terms) • Value pattern and ranges • Names • Descriptions Used by permission of Erhard Rahm, Hong Hai Do, and Sergey Melnik

Real matches Suggested matches A: False Negatives B: True Positives C: False Positives D: True Negatives A B C D Match Quality Measures • Comparison of automatically with manually (i.e. real) derived match correspondences • Quality measures: • Overall : post-match effort to add missed (A) and to remove false matches (C): negative Overall no gain SimilarityFlooding [ICDE02]: Used by permission of Erhard Rahm, Hong Hai Do, and Sergey Melnik

UW & MSR Contributions • LSD [SIGMOD ] • Learning algorithm based on structures and instances • AnHai Doan, Pedro Domingos, Alon Halvey • Cupid [VLDB 02] • Structure matching • Jayant Madhavan, Phil Bernstein • Glue [WWW 03] • Taxonomy matching. Uses relaxation labeling. • An Hai Doan, Jayant Madhavan, Pedro Domingos,Alon Halevy • Mapping Knowledge Base (MKB) [submitted] • Reuses past mappings to help produce new mappings • Jayant Madhavan, Phil Bernstein, Chuang Chen, Alon Halevy, Pradeep Shenoy

Where’s the Research Action? • There’s always room for new techniques • Compare distribution of data values between elements of two schemas • Create a global schema by clustering elements from different sample schemas • Re-using mappings • Combining techniques • Better user interfaces

Schema Matching Algorithms for Data Translation and Integration

Schema Matching Algorithms for Data Translation and Integration

Presentation Transcript

Automatic Schema Matching

Informationsintegration Schema Matching

Automating Schema Matching for Data Integration

Chapter 5: Schema Matching and Mapping

Efficient Algorithms for Matching

Automating Schema Matching

Managing Uncertainty of XML Schema Matching

Privacy-Preserving Schema Matching Using Mutual Information

Corpus-based Schema Matching

Generic Schema Matching with Cupid

Generic Schema Matching using Cupid

Informationsintegration Schema Matching

Putting Context into Schema Matching

Exact String Matching Algorithms

String Matching Algorithms

String Matching Algorithms

Schema matching for Database Systems

SCHEMA-BASED SEMANTIC MATCHING

Recruit candidates using matching Algorithms