A survey of approaches to automatic schema matching

A survey of approaches to automatic schema matching. Sushant Vemparala Gaurang Telang. Motivating Example. Assume UTA needs to integrate 40 databases from its different schools with a total of 27,000 elements. It would take approximately 12 person years to integrate them if done manually.


Presentation Transcript


  1. A survey of approaches to automatic schema matching. Sushant Vemparala, Gaurang Telang

  2. Motivating Example • Assume UTA needs to integrate 40 databases from its different schools with a total of 27,000 elements. • It would take approximately 12 person-years to integrate them if done manually. • How would you reduce the manual burden?

  3. Schema Matching: two example schemas (Schema 1 and Schema 2)

     <Schema name="Schema T">
       <ElementType name="Customer">
         <element type="FName"/>
         <element type="LName"/>
         <element type="CAddress"/>
       </ElementType>
       <ElementType name="CAddress">
         <element type="street"/>
         <element type="city"/>
         <element type="province"/>
         <element type="code"/>
       </ElementType>
     </Schema>

     <Schema name="Schema S">
       <ElementType name="AccountOwner">
         <element type="Name"/>
         <element type="Address"/>
         <element type="BirthDate"/>
       </ElementType>
       <ElementType name="Address">
         <element type="street"/>
         <element type="city"/>
         <element type="state"/>
         <element type="ZIP"/>
       </ElementType>
     </Schema>

  4. Schema Matching Definition • Schema matching is defined as the task of finding the semantic correspondences between elements of two schemas. • [Figure: the Match operation takes schemas S1 and S2, plus auxiliary information (user feedback, dictionaries, previous mappings), and produces a match result.]

  5. Application Domains • Schema integration: developing a global view over a set of independently developed schemas • Comparing data schemas: • Items from different shopping sites • Merger between two corporations • Preparation of data for data warehousing and analysis processes Any other examples?

  6. High Level Architecture of Generic Match http://db18.informatik.uni-leipzig.de:8080/WebEdition/

  7. Classification of Schema Matching Approaches 1) Schema Level Matching Granularity of Schema Level • Element Level • Structural Level 2) Instance level Matching 3) Hybrid and composite Matching

  8. Schema Level Matching • Uses only schema-level information (no data content) • Properties? (Name, description, data type, is-a / part-of relationships, constraints and structure) • Match finds match candidates, each with a similarity value

  9. Granularity: Element Level • For each element of Source Schema determine matching elements in Target Schema • Element Level • atomic level (Attributes in XML schema) • higher level (Columns in Relational tables) Eg: Address = CustomerAddress

  10. Granularity: Structure-Level • Structure-Level: Matches combinations of elements that appear together in S1 with “combinations” of elements that appear together in S2. • Full Structure Match vs Partial Structure Match

  11. Granularity: Structure-Level (Contd) • Equivalence Patterns: Can enhance structure matching by considering known equivalence patterns stored in a library.

  12. Matching Cardinality • One or more S1 elements can match one or more S2 elements. • Possible cardinalities: 1:1, 1:n, n:1, m:n

  13. Instance Level Matching • Insight into the contents and meaning of schema elements • Useful when schema information is limited and when semi-structured data is used • Incorrect interpretation of schema level information can be corrected Eg : X is match candidate for CompanyName and Manufacturer

  14. Techniques for Schema Level Matching • Linguistic approaches: name based (equality of names) • equality of canonical names (Cust# = CustNo) • equality of synonyms (make = brand) • equality of hypernyms (book is-a publication & article is-a publication implies book = article)

  15. Techniques for Schema Level Matching: Name Matching (Contd) • Similarity based on pronunciation or soundex (ship2 = ShipTo) • User-provided name matches (issue = bug) • Not limited to 1:1 matches (phone = {homePhone, officePhone}) • Context based: payroll application (salary = income) vs tax reporting application (salary != income)
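The name-based techniques on the two slides above can be sketched in a few lines. This is a minimal illustration and not the method of any surveyed system: the `ABBREVIATIONS` and `SYNONYMS` tables, the `canonical` and `name_similarity` functions, and the fallback to a generic string similarity are all assumptions made for the example.

```python
import re
from difflib import SequenceMatcher

# Hypothetical lookup tables; a real matcher would load these from a
# dictionary, thesaurus, or user-provided list of name matches.
ABBREVIATIONS = {"#": "no", "num": "no", "number": "no", "cust": "customer"}
SYNONYMS = {frozenset({"make", "brand"}), frozenset({"issue", "bug"})}


def canonical(name: str) -> str:
    """Lower-case a name, split it into tokens and expand abbreviations."""
    # Insert a space at camelCase boundaries, e.g. "CustNo" -> "Cust No".
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", name)
    # Keep alphabetic runs, digit runs and '#' as separate tokens.
    tokens = re.findall(r"[a-z]+|\d+|#", spaced.lower())
    return "".join(ABBREVIATIONS.get(t, t) for t in tokens)


def name_similarity(a: str, b: str) -> float:
    """Return a similarity in [0, 1] for two element names."""
    ca, cb = canonical(a), canonical(b)
    if ca == cb or frozenset({ca, cb}) in SYNONYMS:
        return 1.0
    # Assumption: fall back to a generic string similarity otherwise.
    return SequenceMatcher(None, ca, cb).ratio()
```

With these tables, `name_similarity("Cust#", "CustNo")` and `name_similarity("make", "brand")` both return 1.0, while unrelated names such as `street` and `ZIP` score near zero.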

  16. Techniques for Schema Level Matching • Description based Eg: Comments in schema elements

  17. Techniques for Schema Level Matching • Constraint-based matching - Eg: data types and value ranges, optionality, relationship types, cardinalities, etc. - Combined with other matchers to limit the number of match candidates
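As the slide notes, constraints are typically used to prune candidates produced by other matchers. A minimal sketch of that idea, assuming a hypothetical `Element` record and a hand-written `TYPE_FAMILY` table (neither comes from the presentation):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """Hypothetical schema element: a name plus a declared data type."""
    name: str
    data_type: str


# Assumed grouping of data types into compatible families.
TYPE_FAMILY = {
    "int": "numeric", "decimal": "numeric", "float": "numeric",
    "char": "text", "varchar": "text", "string": "text",
    "date": "temporal", "timestamp": "temporal",
}


def compatible(a: Element, b: Element) -> bool:
    """Two elements are compatible if their types fall in the same family."""
    fa = TYPE_FAMILY.get(a.data_type)
    fb = TYPE_FAMILY.get(b.data_type)
    return fa is not None and fa == fb


def prune(candidates):
    """Drop candidate pairs whose data types cannot correspond."""
    return [(a, b) for a, b in candidates if compatible(a, b)]
```

Value ranges, optionality, and cardinalities could be added as further checks in `compatible` in the same way.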

  18. Techniques for Schema Level Matching • Reusing Schema and Mapping Information - Idea: schemas from the same domain are often very similar, eg address fields and name fields are repeated - Create a schema library that schema editors can access (Analogy: XML namespaces) S->S2 (known) Goal: S1->S? S1->S2? (easy to find)

  19. Techniques for Instance Level • IR techniques (measures such as the Jaccard coefficient) • Constraint-based characterization (EmpNo range vs DeptNo range) • Auxiliary information • Learning (Eg: evaluate S1 contents to build Characterization 1, then evaluate S2 contents against Characterization 1) Drawback of instance-based matching?
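The Jaccard coefficient mentioned above is the size of the intersection of two value sets divided by the size of their union. A minimal sketch, with hypothetical column values chosen only for illustration:

```python
def jaccard(values_a, values_b) -> float:
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two collections of values."""
    a, b = set(values_a), set(values_b)
    if not a and not b:
        return 0.0  # Convention chosen here for two empty columns.
    return len(a & b) / len(a | b)


# Hypothetical instance data for two columns from different sources.
city_s1 = ["Miami", "Boston", "Austin"]
city_s2 = ["Boston", "Austin", "Seattle"]
overlap = jaccard(city_s1, city_s2)  # 2 shared values out of 4 distinct
```

A high coefficient suggests the two columns draw from the same domain, which is evidence for a match even when the column names differ.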

  20. Combining Matchers: Hybrid Matcher • Integrates multiple matching criteria Eg: a matcher with name matching and constraint-based matching • Single pass • Matching criteria are hard-wired

  21. Combining Matchers: Composite Matcher • Combines the results of several independently executed matchers • Iterative (the match result of the 1st matcher is consumed by the 2nd matcher) • Flexible ordering Which is more efficient: hybrid or composite?
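One way to combine independently executed matchers is a weighted average of their scores. The sketch below assumes that combination rule and two toy matchers (`exact_name`, `bigram_overlap`); none of these names or weights come from the presentation.

```python
def exact_name(a: str, b: str) -> float:
    """1.0 when the names are equal ignoring case, else 0.0."""
    return 1.0 if a.lower() == b.lower() else 0.0


def bigram_overlap(a: str, b: str) -> float:
    """Jaccard overlap of the character bigrams of the two names."""
    def grams(s):
        s = s.lower()
        return {s[i:i + 2] for i in range(len(s) - 1)}
    ga, gb = grams(a), grams(b)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)


def composite(weighted_matchers, a: str, b: str) -> float:
    """Weighted average of the scores of independently run matchers."""
    total_weight = sum(weight for _, weight in weighted_matchers)
    score = sum(matcher(a, b) * weight for matcher, weight in weighted_matchers)
    return score / total_weight


# Assumed weights; in practice they would be tuned or learned.
matchers = [(exact_name, 0.5), (bigram_overlap, 0.5)]
```

Because the matcher list is plain data, matchers can be added, removed, or reweighted without touching the combination code, which is the flexibility the slide attributes to composite matchers.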

  22. Summarization

  23. How good is a Match? • Assessing match quality is difficult • Human verification and tuning of matching is often required • A useful metric would be the amount of human work required to reach the perfect match • Recall: what fraction of the correct matches did we find? • Precision: what fraction of the matches we proposed are correct?
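Precision and recall can be computed by comparing the proposed matches with a manually verified reference set. In the sketch below the reference and proposed pairs are hypothetical, loosely based on the element names of Schema T and Schema S from slide 3:

```python
def precision_recall(proposed, reference):
    """Return (precision, recall) of proposed matches against a reference."""
    proposed, reference = set(proposed), set(reference)
    correct = len(proposed & reference)
    precision = correct / len(proposed) if proposed else 0.0
    recall = correct / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical manually verified matches between Schema T and Schema S.
reference = {
    ("Customer", "AccountOwner"), ("CAddress", "Address"),
    ("street", "street"), ("city", "city"),
    ("province", "state"), ("code", "ZIP"),
}
# Hypothetical matcher output: three correct pairs and one wrong pair.
proposed = {
    ("street", "street"), ("city", "city"),
    ("CAddress", "Address"), ("FName", "BirthDate"),
}
precision, recall = precision_recall(proposed, reference)
```

Here three of the four proposed pairs are correct (precision 0.75) and three of the six reference pairs were found (recall 0.5).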

  24. Current Work • LSD • SKAT • Similarity Flooding

  25. LSD (Learning Source Descriptions) • Produces 1:1 instance-level mappings • Suppose a user wants to integrate 100 data sources • User: • manually creates mappings for a few sources, say 3 • shows LSD these mappings • LSD learns from the mappings • “Multi-strategy” learning incorporates many types of information in a general way • Knowledge of constraints further helps • LSD proposes mappings for the remaining 97 sources

  26. LSD: Example
      Mediated schema: address, price, agent-phone, description
      Schema of realestate.com: location, listed-price, phone, comments
      Data from realestate.com:
        location: Miami, FL; Boston, MA; ...
        listed-price: $250,000; $110,000; ...
        phone: (305) 729 0831; (617) 253 1429; ...
        comments: Fantastic house; Great location; ...
      Data from homes.com:
        price: $550,000; $320,000; ...
        contact-phone: (278) 345 7215; (617) 335 2315; ...
        extra-info: Beautiful yard; Great beach; ...
      Learned hypotheses:
        If “phone” occurs in the name => agent-phone
        If “fantastic” & “great” occur frequently in data values => description

  27. LSD: Training the Learners
      Mediated schema: address, price, agent-phone, description
      Schema of realestate.com: location, listed-price, phone, comments
      Training data from realestate.com:
        <location> Miami, FL </> <listed-price> $250,000</> <phone> (305) 729 0831</> <comments> Fantastic house </>
        <location> Boston, MA </> <listed-price> $110,000</> <phone> (617) 253 1429</> <comments> Great location </>
      Name Learner training examples:
        (location, address) (listed-price, price) (phone, agent-phone) (comments, description) ...
      Naive Bayes Learner training examples:
        (“Miami, FL”, address) (“$ 250,000”, price) (“(305) 729 0831”, agent-phone) (“Fantastic house”, description) ...

  28. LSD: Applying the Learners
      Mediated schema: address, price, agent-phone, description
      Schema of homes.com: area, day-phone, extra-info
      Each instance is passed through the Name Learner and the Naive Bayes learner, and the Meta-Learner combines their predictions.
      area:
        <area>Seattle, WA</> => (address,0.8), (description,0.2)
        <area>Kent, WA</> => (address,0.6), (description,0.4)
        <area>Austin, TX</> => (address,0.7), (description,0.3)
        Combined prediction: (address,0.7), (description,0.3)
      day-phone:
        <day-phone>(278) 345 7215</> <day-phone>(617) 335 2315</> <day-phone>(512) 427 1115</>
        Combined prediction: (agent-phone,0.9), (description,0.1)
      extra-info:
        <extra-info>Beautiful yard</> <extra-info>Great beach</> <extra-info>Close to Seattle</>
        Combined prediction: (address,0.6), (description,0.4)
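The step from per-instance predictions to a single prediction for a source element can be sketched as a simple average over instances. This is an assumption made for illustration, chosen because it reproduces the numbers shown for `area` on the slide; the function name `average_predictions` is hypothetical.

```python
from collections import defaultdict


def average_predictions(instance_predictions):
    """Average per-instance label scores into one prediction per element."""
    totals = defaultdict(float)
    for prediction in instance_predictions:
        for label, score in prediction.items():
            totals[label] += score
    count = len(instance_predictions)
    return {label: total / count for label, total in totals.items()}


# Per-instance predictions for the <area> element, taken from the slide.
area_predictions = [
    {"address": 0.8, "description": 0.2},  # Seattle, WA
    {"address": 0.6, "description": 0.4},  # Kent, WA
    {"address": 0.7, "description": 0.3},  # Austin, TX
]
column_prediction = average_predictions(area_predictions)
best_label = max(column_prediction, key=column_prediction.get)
```

Averaging the three `area` predictions gives roughly 0.7 for `address` and 0.3 for `description`, so `area` is mapped to `address`.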

  29. SKAT (Semantic Knowledge Articulation Tool) • Expert supplies SKAT with a few initial rules Ex: 1) Match US.president US.chancellor 2) MisMatch human.nail factory.nail • SKAT proposes an articulation based on the supplied matching rules • Expert approves or rejects the proposals • SKAT creates corrected rules and computes an updated articulation (knowledge gained from irrelevant and rejected rules is stored)

  30. Similarity Flooding • Intuition : Whenever any two elements in the graphs G1 and G2 are similar, their neighbors tend to be similar. • Transform schemas into directed labeled graphs
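The intuition above can be sketched as a fixpoint iteration over pairs of nodes. This is a deliberately simplified version and not the published algorithm: it assumes uniform propagation weights, a uniform initial similarity of 1.0, and a fixed number of iterations. The function name and the edge-triple graph encoding are also assumptions.

```python
from collections import defaultdict
from itertools import product


def similarity_flooding(g1, g2, iterations=10):
    """Simplified similarity flooding over two labeled directed graphs.

    Each graph is a list of (source, label, target) edges. Returns a dict
    mapping (node_in_g1, node_in_g2) pairs to a similarity in [0, 1].
    """
    nodes1 = {n for s, _, t in g1 for n in (s, t)}
    nodes2 = {n for s, _, t in g2 for n in (s, t)}

    # Two node pairs are neighbors when equally labeled edges connect them.
    neighbors = defaultdict(set)
    for (s1, l1, t1), (s2, l2, t2) in product(g1, g2):
        if l1 == l2:
            neighbors[(s1, s2)].add((t1, t2))
            neighbors[(t1, t2)].add((s1, s2))

    # Assumption: every pair starts with the same similarity.
    sigma = {pair: 1.0 for pair in product(nodes1, nodes2)}
    for _ in range(iterations):
        # Each pair keeps its own score and receives its neighbors' scores.
        updated = {
            pair: sigma[pair] + sum(sigma[n] for n in neighbors[pair])
            for pair in sigma
        }
        top = max(updated.values())
        sigma = {pair: value / top for pair, value in updated.items()}
    return sigma


# Toy graphs loosely based on Schema T and Schema S from slide 3.
g1 = [("Customer", "has", "CAddress"), ("CAddress", "has", "street")]
g2 = [("AccountOwner", "has", "Address"), ("Address", "has", "street")]
sigma = similarity_flooding(g1, g2)
```

In this toy run the pair `("CAddress", "Address")` ends with the highest score because it has neighbors on both sides, while pairs with no shared structure, such as `("Customer", "street")`, decay toward zero.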

  31. Similarity Flooding Example

  32. Conclusion • User feedback: • User interaction: minimize user input but maximize the impact of the feedback • If we require user acceptance for our matches, what happens when the matcher returns hundreds or thousands of matches? • The more configurable the matcher, the better • Problems with schema representation and data: • Dealing with inconsistent data values for a schema element • Independence of schema representation • Mapping maintenance: what happens when you map between two schemas and then one changes? • Sophisticated techniques are required for n:m matches [current work is based on 1:1]

  33. Conclusion • Areas that need more attention: 1) Re-use opportunities 2) Learning from user feedback Any other issues to address?

  34. THANK YOU!