Automatic Schema Matching - PowerPoint PPT Presentation

Mia_John
Presentation Transcript

  1. Automatic Schema Matching Nicole Oldham CSCI 8350 (Semantic Web Course @ Univ of Georgia) Topic Presentation

  2. Outline • Introduction • Application Domains • Classification of Schema Matching Approaches • Current Work • MWSAF Matching • Open Research Directions • Conclusion

  3. Schema Matching • Match: Takes two schemas as input and produces a mapping between the elements that correspond to each other semantically. • It is usually performed manually. • Tedious • Time Consuming • Error Prone • Expensive We must automate this process!

  4. Example • GTE telecommunications needed to integrate 40 databases with a total of 27,000 elements. • Project planners estimated that manual matching would take 12 person years to integrate. Doan A, Halevy A. Semantic Integration Research in the Database Community: A Brief Survey.

  5. Various Levels of Heterogeneity ftp://ftp.dagstuhl.de/pub/Proceedings/04/04391/04391.ChristophidesVassilis.Slides.pdf

  6. How to Deal with Semantic Heterogeneity 1. Standardize: agree on a common representation. 2. Translate: create mappings between different schemas. - requires human input and machine reasoning - mappings can be difficult and expensive 3. Annotate: create relationships between agreed-upon conceptualizations. - requires human input and machine reasoning - annotation can be difficult and expensive ftp://ftp.dagstuhl.de/pub/Proceedings/04/04391/04391.ChristophidesVassilis.Slides.pdf

  7. Challenges • The actual semantics of the involved elements are typically known only to the creators or recorded in documentation, so we must use clues in the schema and data instead. • These clues are often misleading. • E.g., ‘Area’ can refer to different entities. • E.g., the same entity can have very different names. • Clues are often ambiguous. • E.g., ‘Contact-agent’: agent name or phone number? • The matching process can be very costly. • Each element of the schema must be examined to ensure discovery of the best match. • Matching is often subjective, depending on the application. Doan A, Halevy A. Semantic Integration Research in the Database Community: A Brief Survey.

  8. Outline • Introduction • Application Domains • Classification of Schema Matching Approaches • Current Work • MWSAF Matching • Open Research Directions • Conclusion

  9. Where is Schema Matching used? • Database Application Domains • Data Integration • Data Warehousing • E-Business • Query Processing • Semantic Web • XML/HTML to an Ontology • Semantic Web Services Bernstein P, Rahm E. A survey of approaches to automatic schema matching

  10. Schema Integration Problem: Construct a global view from a set of independently constructed schemas (e.g., ontologies). - Different structures and terminologies. Solution: Schema matching is performed to find relationships between concepts in each schema. Then the matching elements can be unified. Bernstein P, Rahm E. A survey of approaches to automatic schema matching

  11. Data Warehouses Problem: Integrating data sources into a data warehouse. - Different formats between the source and warehouse. Solution: Use matching to find the elements of the source that are also present in the warehouse. Then the details of the semantics can be examined to integrate the two. Bernstein P, Rahm E. A survey of approaches to automatic schema matching

  12. E-Commerce Problem: Message translation. - Each trading partner uses its own message format. Solution: A match operation would reduce the amount of manual work needed to specify how the formats are related. Bernstein P, Rahm E. A survey of approaches to automatic schema matching

  13. Query Processing Problem: The terms used in the user’s query may be different from those in the database. Solution: Matching is used to map the user-specified concepts in the query to schema elements. Bernstein P, Rahm E. A survey of approaches to automatic schema matching

  14. Need for Data Integration on the Semantic Web • Problem: Web documents are not in RDF or any form suitable for the SW. • We must annotate them with concepts from ontologies. • Solution: Use schema matching to map between elements represented in OWL and the different schemas of web documents.

  15. Semantic Web Services • Problem: Web Services are currently searched for using keywords. • We need to annotate the WSDLs with semantic metadata so that they can be discovered efficiently. • WSDLs are in XML, Ontologies in OWL! • Solution: Use schema matching approaches to map between the two different schemas.

  16. Outline • Introduction • Application Domains • Classification of Schema Matching Approaches • Current Work • MWSAF Matching • Open Research Directions • Conclusion

  17. Term Definitions • Schema: a set of elements connected by some structure. • Mapping: a set of mapping elements , each of which indicates that certain elements of schema s1 are mapped to certain elements in s2. • Mapping Expression: Tells how s1 and s2 elements are related. Bernstein P, Rahm E. A survey of approaches to automatic schema matching

  18. Example A mapping between s1 and s2 might contain these elements: • Cust.C#=Customer.CustID • Concatenate(Cust.FirstName, Cust.LastName) = Customer.contact • Cust.CName = Customer.Company Bernstein P, Rahm E. A survey of approaches to automatic schema matching
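
The mapping elements on this slide can be sketched in Python. This is only an illustration: the `MappingElement` class, the `apply_mapping` helper, the sample record, and the choice of a single space as the concatenation separator are all assumptions, not part of the cited survey. Only the attribute names come from the slide.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class MappingElement:
    """One mapping element: source attributes, a target attribute,
    and a mapping expression that relates them (hypothetical structure)."""
    sources: Tuple[str, ...]
    target: str
    expression: Callable[..., str]


def identity(value: str) -> str:
    # 1:1 mapping expression: the value is copied unchanged.
    return value


def concatenate(first: str, last: str) -> str:
    # n:1 mapping expression; the space separator is an assumption.
    return f"{first} {last}"


# The three mapping elements from the slide (Cust -> Customer).
MAPPING = [
    MappingElement(("C#",), "CustID", identity),
    MappingElement(("FirstName", "LastName"), "contact", concatenate),
    MappingElement(("CName",), "Company", identity),
]


def apply_mapping(cust_row: Dict[str, str]) -> Dict[str, str]:
    """Translate one Cust record into a Customer record."""
    return {
        m.target: m.expression(*(cust_row[s] for s in m.sources))
        for m in MAPPING
    }


# Made-up sample record, for illustration only.
row = {"C#": "42", "FirstName": "Ada", "LastName": "Lovelace",
       "CName": "Analytical Engines"}
customer = apply_mapping(row)
```

Keeping the expression as a plain function makes the difference between a 1:1 element (`identity`) and a complex n:1 element (`concatenate`) explicit.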

  19. Example Doan A, Halevy A. Semantic Integration Research in the Database Community: A Brief Survey.

  20. Classification of Schema Matching Approaches • Instance vs Schema: matching approaches can consider instance data or schema-level information. • Element vs Structure matching: match can be performed for individual schema elements or combinations of elements. • Language vs Constraint: linguistic (names) or constraint-based (keys and relationships). • Matching Cardinality: match result may relate one or more elements of one schema to one or more elements of another. • Auxiliary Information: matcher relies on other information besides the input schemas, such as dictionaries, user input, global schemas. Bernstein P, Rahm E. A survey of approaches to automatic schema matching

  21. Classification of Schema Matching Approaches
      Schema Matching Approaches
      • Individual Matchers
        • Schema-only
          • Element Level
            • Linguistic: Name Similarity, Description Similarity, Global Namespaces
            • Constraint: Type Similarity, Key Properties
          • Structure Level
            • Constraint: Group Matching
        • Instance/Contents
          • Element Level
            • Linguistic: Word Frequency
            • Constraint: Value Pattern and Ranges
      • Combining Matchers
        • Hybrid Matchers
        • Composite Matchers: Manual Composition, Automatic Composition
      Further Criteria: Match Cardinality, Auxiliary information used
      (Leaf entries are sample approaches.) Bernstein P, Rahm E. A survey of approaches to automatic schema matching

  22. Schema Level Matchers • Consider schema information instead of instance data: Name, Description, Data Type, Relationship Types, Constraints, Structure • Often produces multiple candidates and estimates a degree of similarity for each • Granularity of match (element level vs structure level) • Match Cardinality • Linguistic Approaches: Name or Description Matching • Constraint-Based Approaches • Reusing Schema and Matching Information Bernstein P, Rahm E. A survey of approaches to automatic schema matching

  23. Element-Level • Element-Level: Identifies all elements of S1 that are the same or similar to elements of S2. • The match comparison can be based on name, description, or data type of the element. • Example of name-based element-level matching: Address = CustomerAddress Bernstein P, Rahm E. A survey of approaches to automatic schema matching

  24. Structure-Level • Structure-Level: Matches combinations of elements that appear together in S1 with combinations of elements that appear together in S2. • Full Structure Match: • Partial Structure Match: • Equivalence Patterns: Can enhance structure matching by considering known equivalence patterns stored in a library. Bernstein P, Rahm E. A survey of approaches to automatic schema matching

  25. Match Cardinality • One or more S1 elements can match one or more S2 elements. • Complex matches Bernstein P, Rahm E. A survey of approaches to automatic schema matching

  26. Complex Matches • The number of 1:1 matches is bounded by the sizes of the schemas, but there is an unbounded number of functions for combining attributes in a schema. • Little work has been done on complex matching. • Some approaches hard-code complex matches into rules. • Some rely on a domain-specific ontology. • We need domain knowledge to perform complex matching accurately. • The best match isn’t always the top match returned by the matcher, so human involvement is still needed. Doan A, Halevy A. Semantic Integration Research in the Database Community: A Brief Survey.

  27. Linguistic Approaches • Language based matchers use names and text (i.e. words or sentences) to find semantically similar schema elements. • Name Matching: match elements with similar names • Description Matching: match comments in the schemas Bernstein P, Rahm E. A survey of approaches to automatic schema matching

  28. Linguistic Approaches: Name Matching • Matches schema elements with equal or similar names. • How similarity is defined: 1. Equality of names. 2. Equality of names after stemming, which deals with prefixes and suffixes. 3. Equality of synonyms. 4. Equality of hypernyms (an SUV is a type of car). 5. Similarity of names based on common substrings, Soundex, or pronunciation (ShipTo = Ship2). 6. User-provided name matches. • Can be element-level or structure-level. • Cardinality is not limited to 1:1. Bernstein P, Rahm E. A survey of approaches to automatic schema matching
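
A few of the similarity criteria above (equality, synonyms, common substrings) can be sketched with Python's standard library. This is a rough sketch under stated assumptions: the `SYNONYMS` table, the 0.9 synonym score, and the function names are hypothetical, and `difflib.SequenceMatcher` stands in for whatever substring measure a real matcher would use. Stemming, hypernyms, and Soundex are not covered.

```python
from difflib import SequenceMatcher

# Hypothetical, hand-written synonym table; a real matcher would
# more likely consult a thesaurus or dictionary.
SYNONYMS = {
    frozenset({"customer", "client"}),
    frozenset({"car", "automobile"}),
}


def normalize(name: str) -> str:
    """Lower-case a name and drop separators such as '_' and '-'."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def name_similarity(a: str, b: str) -> float:
    """Return a similarity score in [0, 1] for two element names."""
    na, nb = normalize(a), normalize(b)
    if na == nb:
        return 1.0  # criterion 1: equality of names
    if frozenset({na, nb}) in SYNONYMS:
        return 0.9  # criterion 3: synonyms (the score is an assumption)
    # criterion 5: similarity based on common substrings
    return SequenceMatcher(None, na, nb).ratio()
```

For example, `name_similarity("ShipTo", "Ship2")` scores well above `name_similarity("ShipTo", "Salary")` because the two names share the substring "ship".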

  29. Linguistic Approaches: Description Matching • Schemas can contain comments in natural language that express the intended semantics of the schema elements. • Example: S1: empn // employee name; S2: name // name of employee • Can be as simple as keyword extraction and synonym matching, or as complex as using natural language understanding technology. Bernstein P, Rahm E. A survey of approaches to automatic schema matching
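
The simple end of that spectrum, keyword extraction followed by a set comparison, might look like the following sketch. The stopword list, the Jaccard measure, and the function names are assumptions chosen for illustration; the survey does not prescribe them.

```python
# Hypothetical stopword list; deliberately tiny.
STOPWORDS = {"of", "the", "a", "an", "for", "to"}


def keywords(comment: str) -> set:
    """Extract lower-cased keywords from a schema comment."""
    return {w for w in comment.lower().split() if w not in STOPWORDS}


def description_similarity(comment1: str, comment2: str) -> float:
    """Jaccard overlap of the keyword sets of two comments."""
    k1, k2 = keywords(comment1), keywords(comment2)
    if not k1 or not k2:
        return 0.0
    return len(k1 & k2) / len(k1 | k2)


# The two comments from the slide reduce to the same keyword set.
score = description_similarity("employee name", "name of employee")
```

With the slide's example, `empn` and `name` would be matched through their comments even though the element names themselves differ.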

  30. Constraint-Based Approaches • Schemas often contain constraints to define data types and value ranges, optionality, relationship types, cardinalities, etc. Bernstein P, Rahm E. A survey of approaches to automatic schema matching
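
As a minimal sketch of how such constraints could feed a matcher, the snippet below scores two columns by the fraction of constraints they share. The `Column` class, the three constraints chosen, and the equal weighting are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """A schema element with a few constraints (hypothetical structure)."""
    name: str
    data_type: str
    is_key: bool
    nullable: bool


def constraint_similarity(a: Column, b: Column) -> float:
    """Fraction of compared constraints on which the two columns agree."""
    checks = [
        a.data_type == b.data_type,  # same data type
        a.is_key == b.is_key,        # both keys or both non-keys
        a.nullable == b.nullable,    # same optionality
    ]
    return sum(checks) / len(checks)


cust_no = Column("C#", "int", True, False)
cust_id = Column("CustID", "int", True, False)
company = Column("Company", "varchar", False, True)
```

Here `cust_no` and `cust_id` agree on every constraint even though their names differ, which is why constraint-based matching is usually combined with other matchers rather than used alone.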

  31. Reusing Schema and Mapping Information • The effectiveness of matching can be improved by reusing common schema components and previously determined mappings. • Many schemas are very similar to each other and to previously matched schemas. E.g., in e-commerce, substructures often repeat within different message formats (address fields, name fields). • A schema library should be created, and schema editors should access the library to use predefined terms and definitions. Bernstein P, Rahm E. A survey of approaches to automatic schema matching

  32. Schema Mapping Reuse • Example: [Figure: three schemas. Schema S1: Purchase-order, Product, BillTo, Name, Address, ShipTo, Name, Address, ContactPhone. Schema S: Purchase-order, Product, BillTo, Name, Address, ShipTo, Name, Address, Contact, Name, Address. Schema S2: POrder, Article, Payee, BillAddress, Recipient, ShipAddress.] • Problems: 1. Determining which part of a new schema is similar to some part of a previously matched one is a match problem itself. 2. Similarity values may depend on the domain, e.g., salary and income may be identical in a payroll application but not in a tax reporting application. Bernstein P, Rahm E. A survey of approaches to automatic schema matching

  33. Instance-Level Approaches • Why? 1. Little or no schema information is available. 2. Enhancement of schema-level matchers: instance data gives insight into the contents and meaning of schema elements. 3. To match instance-level data. • How? 1. Preferred method: linguistic characterization. 2. Constraint-based characterization, e.g., value ranges. 3. Auxiliary information. 4. Both rule-based and learner-based techniques are also used. • Main Problem: When comparing data at the instance level, there are likely to be a very large number of possible match combinations, many of which are irrelevant. Bernstein P, Rahm E. A survey of approaches to automatic schema matching

  34. Rule-Based Solutions • Rule-Based: hand-crafted rules exploit schema information such as element names, data types, structures, and subelements. • E.g., two elements match if they have the same name and the same number of subelements. Doan A, Halevy A. Semantic Integration Research in the Database Community: A Brief Survey.
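
The example rule on this slide is simple enough to write down directly. In the sketch below, the `Element` class and the case-insensitive name comparison are assumptions; the rule itself (same name, same number of subelements) is the one quoted above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Element:
    """A schema element with optional subelements (hypothetical structure)."""
    name: str
    children: List["Element"] = field(default_factory=list)


def rule_match(a: Element, b: Element) -> bool:
    """Hand-crafted rule: same name and same number of subelements.
    Comparing names case-insensitively is an assumption."""
    return (a.name.lower() == b.name.lower()
            and len(a.children) == len(b.children))


bill_to_1 = Element("BillTo", [Element("Name"), Element("Address")])
bill_to_2 = Element("billto", [Element("Street"), Element("City")])
bill_to_3 = Element("BillTo", [Element("Name")])
```

Note that `bill_to_1` and `bill_to_2` match under this rule even though their subelements have different names, which illustrates how coarse a single hand-crafted rule can be.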

  35. Learner-Based Solutions • Learner-Based: exploit both schema and data. • Requires a large amount of training data, but can exploit instance data in addition to schema information. • Rule-based and learner-based techniques combined provide an effective matching solution. Doan A, Halevy A. Semantic Integration Research in the Database Community: A Brief Survey.

  36. Combining Different Matchers • The ideal matching system must exploit many different types of information and techniques for maximum accuracy. • More match candidates will be produced if the previous approaches are combined. • Two Combination Methods: 1. Hybrid: integrates multiple matching criteria. Better performance. 2. Composite: combines the results of independently executed matchers. More flexible. Can be done automatically or manually. Bernstein P, Rahm E. A survey of approaches to automatic schema matching
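
A composite matcher can be sketched as independent matchers whose scores are combined afterwards. Everything concrete here is an assumption made for illustration: the two toy matchers, the weighted-average combination, the weights, and the 0.4 threshold. The survey describes the idea, not this particular scheme.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List, Tuple

Matcher = Callable[[str, str], float]


def exact_name(a: str, b: str) -> float:
    """Independent matcher 1: case-insensitive name equality."""
    return 1.0 if a.lower() == b.lower() else 0.0


def fuzzy_name(a: str, b: str) -> float:
    """Independent matcher 2: substring-based name similarity."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def composite(matchers: List[Tuple[Matcher, float]], a: str, b: str) -> float:
    """Combine independently computed scores by a weighted average."""
    total_weight = sum(weight for _, weight in matchers)
    return sum(weight * m(a, b) for m, weight in matchers) / total_weight


def best_matches(s1: List[str], s2: List[str],
                 matchers: List[Tuple[Matcher, float]],
                 threshold: float = 0.4) -> Dict[str, str]:
    """For each element of s1, keep the best-scoring element of s2
    if its combined score reaches the threshold."""
    result = {}
    for a in s1:
        score, b = max((composite(matchers, a, b), b) for b in s2)
        if score >= threshold:
            result[a] = b
    return result


MATCHERS = [(exact_name, 1.0), (fuzzy_name, 2.0)]
matches = best_matches(["Name", "Address"], ["CustomerAddress", "name"], MATCHERS)
```

Because each matcher runs on its own, adding or removing one only changes the `MATCHERS` list, which is the flexibility the slide attributes to the composite approach.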

  37. Outline • Introduction • Application Domains • Classification of Schema Matching Approaches • Current Work • MWSAF Matching • Open Research Directions • Conclusion

  38. LSD (Univ. of Washington) • Learning Source Descriptions • Uses machine learning techniques to match a new data source against a previously determined global schema. • Uses a name matcher and several instance-level matchers. • System is trained with sample user inputs and it learns patterns and matching rules. • Mostly instance-oriented but can use schema information too. • Also supports user input domain constraints on the global schema. Bernstein P, Rahm E. A survey of approaches to automatic schema matching

  39. SKAT (Stanford University) • Semantic Knowledge Articulation Tool • Follows a rule-based approach to semi-automatically determine matches between two ontologies. • User input required: * The user must provide application specific match/mismatch relations. * The user must approve or reject matches. • SKAT matching is used within the ONION architecture for ontology integration. • In ONION, an “articulation ontology” is constructed from the rules. Matching is based on is-a relationships between the articulation ontology and the source ontology. Bernstein P, Rahm E. A survey of approaches to automatic schema matching

  40. TransScm (Tel Aviv University) • Uses schema matching to derive an automatic data translation between schema instances. • Schemas are transformed into labeled graphs. • Matching is performed node by node (element-level, 1:1) starting at the top. • Requires user intervention if no match is found (i.e. to provide a new rule). Bernstein P, Rahm E. A survey of approaches to automatic schema matching

  41. DIKE (Univ. of Reggio Calabria, Univ. of Calabria) • Compares pairs of objects by their attributes and the is-a relationships that they are involved in. • These pairs are given a match score between 0 and 1. • User must specify synonyms, homonyms, and inclusion properties. Bernstein P, Rahm E. A survey of approaches to automatic schema matching

  42. Cupid (Microsoft Research) • Hybrid matcher • Element and Structural-Level matches. Phase 1: Linguistic Element-Level. - categorizes elements based on name, data types, and domains. - calculates a linguistic similarity coefficient. Phase 2: - transform the original schema into a tree then perform a bottom-up structure matching. - calculates a similarity value. - calculates a weighted mean of linguistic and structural similarity of pairs of elements Phase 3: - uses the mean from phase 2 to decide on a mapping. Bernstein P, Rahm E. A survey of approaches to automatic schema matching
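
The weighted mean computed in phase 2 and the decision in phase 3 can be illustrated with a few lines of Python. This is a sketch only: the function names, the default weight of 0.5, and the acceptance threshold are assumptions, not values taken from Cupid.

```python
def weighted_similarity(lsim: float, ssim: float, w_struct: float = 0.5) -> float:
    """Weighted mean of linguistic (lsim) and structural (ssim) similarity.
    The default weight is an assumption for illustration."""
    if not 0.0 <= w_struct <= 1.0:
        raise ValueError("w_struct must be between 0 and 1")
    return w_struct * ssim + (1.0 - w_struct) * lsim


def accept_mapping(lsim: float, ssim: float,
                   w_struct: float = 0.5, threshold: float = 0.5) -> bool:
    """Phase 3 sketch: accept a pair when the weighted mean reaches
    a threshold (the threshold value is an assumption)."""
    return weighted_similarity(lsim, ssim, w_struct) >= threshold


# A pair with strong linguistic but weak structural similarity.
wsim = weighted_similarity(lsim=0.8, ssim=0.4)
```

Raising `w_struct` makes the structural evidence count for more, so the same pair can flip from accepted to rejected depending on the weight.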

  43. Clio (IBM Almaden and Univ. of Toronto) • Aims at a semi-automatic creation of match mappings between a given target schema and a new data source schema. • Three Components: Schema Readers: read schema and translate it into an internal representation. Correspondence Engine: is used to identify matching parts of the schemas or databases. Mapping Generator: generates view definitions to map data in the source schema to data in the target schema. Bernstein P, Rahm E. A survey of approaches to automatic schema matching

  44. Similarity flooding (Stanford Univ. and Univ. of Leipzig) • Graph Matching Algorithm. • Converts schemas into directed labeled graphs and determines the matches between corresponding nodes of the graphs. • Uses a name matcher to get an initial element-level match that is then given to the structural matcher. Bernstein P, Rahm E. A survey of approaches to automatic schema matching

  45. Delta (Mitre) • Uses attribute descriptions to determine attribute matches. • The method is to group the metadata about an attribute into a text string, which is treated as a document. The user is then presented with other ‘documents’ with matching attributes and can choose from those. Bernstein P, Rahm E. A survey of approaches to automatic schema matching

  46. Tess (Univ. of Massachusetts, Amherst) • System for helping to cope with schema evolution. • Takes a definition of the old schema and produces a program that will transform data that conforms to the old schema into data that conforms to the new schema. Bernstein P, Rahm E. A survey of approaches to automatic schema matching

  47. Outline • Introduction • Application Domains • Classification of Schema Matching Approaches • Current Work • MWSAF Matching • Open Research Directions • Conclusion

  48. MWSAF: Meteor-S Web Service Annotation Framework (LSDIS Lab, UGA) • What is it? A tool for semi-automatically marking up web service descriptions with ontologies. It helps in describing services semantically and aids in efficient web service discovery and composition.

  49. MWSAF Annotation Tool • Input: WSDL File • Individual elements of the WSDL are matched to concepts in the domain • The WSDL is classified into a domain. • The Matches are given to the user to accept or reject. • Upon the user’s acceptance, the annotations are written to the WSDL. • Output: WSDL File with semantic annotations

  50. MWSAF Architecture Main Components of the System: • Ontology Store: stores the DAML and RDF ontologies that will be used to annotate the WSDL files. Ontologies are categorized by domain. • Parser Library: consists of the parsers used to generate the SchemaGraphs. • Matcher Library: provides the schema matching algorithms. Patil A, Oundhakar S, Sheth A, Verma K. METEOR-S Web service Annotation Framework