Learning to Map Between Schemas and Ontologies
Alon Halevy, University of Washington
Joint work with AnHai Doan and Pedro Domingos
Agenda
• Ontology mapping is a key problem in many applications:
  • Data integration
  • Semantic web
  • Knowledge management
  • E-commerce
• LSD:
  • A solution that uses multi-strategy learning.
  • We've started with schema matching (i.e., very simple ontologies).
  • Currently extending to more expressive ontologies.
  • Experiments show the approach is very promising!
The Structure Mapping Problem
• Types of structures: database schemas, XML DTDs, ontologies, …
• Input:
  • Two (or more) structures, S1 and S2
  • Data instances for S1 and S2
  • Background knowledge
• Output:
  • A mapping between S1 and S2
  • Should enable translating between data instances.
• Semantics of mapping?
Semantic Mappings between Schemas
• Source schemas = XML DTDs
• (Figure: two real-estate DTDs.)
  • Schema 1: house (address, contact-info (agent-name, agent-phone), num-baths)
  • Schema 2: house (location, contact (name, phone), full-baths, half-baths)
  • 1-1 mappings: address <-> location, agent-name <-> name, agent-phone <-> phone
  • Non 1-1 mapping: num-baths <-> full-baths + half-baths
Motivation
• Database schema integration
  • A problem as old as databases themselves.
  • Database merging, data warehouses, data migration.
• Data integration / information-gathering agents
  • On the WWW, in enterprises, large science projects.
• Model management:
  • Model matching: a key operator in an algebra where models and mappings are first-class objects.
  • See [Bernstein et al., 2000] for more.
• The Semantic Web
  • Ontology mapping.
• System interoperability
  • E-services, application integration, B2B applications, …
Desiderata from Proposed Solutions
• Accuracy, efficiency, ease of use.
• Realistic expectations:
  • Unlikely to be fully automated. Need the user in the loop.
• Some notion of semantics for mappings.
• Extensibility:
  • The solution should exploit additional background knowledge.
• "Memory", knowledge reuse:
  • The system should exploit previous manual or automatically generated matchings.
  • Key idea behind LSD.
LSD Overview
• LSD = L(earning) S(ource) D(escriptions)
• Problem: generating semantic mappings between a mediated schema and a large set of data source schemas.
• Key idea: generate the first mappings manually, and learn from them to generate the rest.
• Technique: multi-strategy learning (extensible!)
• Step 1 [SIGMOD 2001]: 1-1 mappings between XML DTDs.
• Current focus:
  • complex mappings
  • ontology mapping
Outline
• Overview of structure mapping
• Data integration and source mappings
• LSD architecture and details
• Experimental results
• Current work
Data Integration
• Example query: find houses with four bathrooms priced under $500,000.
• (Architecture figure: the query is posed over the mediated schema, then reformulated and optimized into queries over source schemas 1-3, which are accessed through wrappers over realestate.com, homeseekers.com, and homes.com.)
• Applications: WWW, enterprises, science projects.
• Techniques: virtual data integration, warehousing, custom code.
Semantics (preliminary)
• The semantics of mappings has received almost no attention.
• Semantics of 1-1 mappings:
  • Given: R(A1, …, An) and S(B1, …, Bm), and 1-1 mappings (Ai, Bj).
  • Then we postulate the existence of a relation W such that:
    • π_{C1,…,Ck}(W) = π_{A1,…,Ak}(R)
    • π_{C1,…,Ck}(W) = π_{B1,…,Bk}(S)
    • W also includes the unmatched attributes of R and S.
  • In English: R and S are projections of some universal relation W, and the mappings specify the projection variables and correspondences.
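To make the universal-relation reading concrete, here is a minimal Python sketch (illustrative only; the relations, attribute names, and values are invented for this example, not taken from the paper) that checks the two projection conditions for one matched attribute pair:

```python
# Illustrative check of the 1-1 mapping semantics: R and S must be
# projections of a universal relation W on corresponding attributes.
# All relation/attribute names and values here are hypothetical.

def project(rows, attrs):
    """Project a list of dict-rows onto attrs (as a set of tuples)."""
    return {tuple(row[a] for a in attrs) for row in rows}

# Matched attribute pair (Ai, Bj): R.address <-> S.location
R = [{"address": "Kent, WA", "price": 250000}]
S = [{"location": "Kent, WA", "phone": "(305) 729 0831"}]

# A candidate universal relation W; its "address" column plays the
# role of both R.address and S.location.
W = [{"address": "Kent, WA", "price": 250000, "phone": "(305) 729 0831"}]

assert project(W, ["address"]) == project(R, ["address"])
assert project(W, ["address"]) == project(S, ["location"])
print("W is consistent with the 1-1 mapping (address, location)")
```

Note that the mapping only constrains W; it does not construct it, since many such W may exist.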
Why Matching Is Difficult
• Aims to identify the same real-world entity
  • using names, structures, types, data values, etc.
• Schemas represent the same entity differently:
  • different names => same entity: area & address => location
  • same names => different entities: area => location or square-feet
• Schema & data never fully capture semantics!
  • not adequately documented, not sufficiently expressive
• Intended semantics is typically subjective!
  • IBM Almaden Lab = IBM?
• Cannot be fully automated. Often hard even for humans. Committees are required!
Current State of Affairs
• Finding semantic mappings is now the bottleneck!
  • largely done by hand
  • labor-intensive & error-prone
  • GTE: 4 hours/element for 27,000 elements [Li & Clifton 00]
• The problem will only be exacerbated:
  • data sharing & XML become pervasive
  • proliferation of DTDs
  • translation of legacy data
  • reconciling ontologies on the semantic web
• Need semi-automatic approaches to scale up!
Outline
• Overview of structure mapping
• Data integration and source mappings
• LSD architecture and details
• Experimental results
• Current work
The LSD Approach
• The user manually maps a few data sources to the mediated schema.
• LSD learns from the mappings and proposes mappings for the rest of the sources.
• Several types of knowledge are used in learning:
  • schema elements, e.g., attribute names
  • data elements: ranges, formats, word frequencies, value frequencies, length of texts
  • proximity of attributes
  • functional dependencies, number of attribute occurrences
• One learner does not fit all: use multiple learners and combine them with a meta-learner.
Example
• Mediated schema: address, price, agent-phone, description.
• Learned hypotheses:
  • If "phone" occurs in the element name => agent-phone.
  • If "fantastic" & "great" occur frequently in the data values => description.
• Schema of realestate.com: location, listed-price, phone, comments.

    location     listed-price   phone            comments
    Miami, FL    $250,000       (305) 729 0831   Fantastic house
    Boston, MA   $110,000       (617) 253 1429   Great location
    ...          ...            ...              ...

• Schema of homes.com: price, contact-phone, extra-info.

    price      contact-phone    extra-info
    $550,000   (278) 345 7215   Beautiful yard
    $320,000   (617) 335 2315   Great beach
    ...        ...              ...
Multi-Strategy Learning
• Use a set of base learners:
  • Name learner, Naive Bayes, Whirl, XML learner
• And a set of recognizers:
  • county name, zip code, phone numbers
• Each base learner produces a prediction weighted by a confidence score.
• Combine the base learners with a meta-learner, using stacking.
Base Learners
• Name Learner: matches on element names.
  • Training examples: (contact-info, office-address), (contact, agent-phone), (phone, agent-phone), (listed-price, price)
  • Prediction: contact-phone => (agent-phone, 0.7), (office-address, 0.3)
• Naive Bayes Learner [Domingos & Pazzani 97]: matches on data values.
  • "Kent, WA" => (address, 0.8), (name, 0.2)
• Whirl Learner [Cohen & Hirsh 98]
• XML Learner
  • exploits the hierarchical structure of XML data
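The following is a minimal sketch of what one base learner might look like. It is not the LSD implementation; the matcher shown (name-token overlap normalized into confidences) is a simplification of LSD's name learner, and the training pairs are the ones from the slide:

```python
# Illustrative base learner: predicts a mediated-schema element from a
# source element's *name* by token overlap with names seen in the
# manually mapped sources. Not the LSD code; a simplified stand-in.
from collections import defaultdict
import re

def tokens(s):
    return re.findall(r"[a-z]+", s.lower())

class NameLearner:
    def fit(self, pairs):                        # pairs: (source-name, label)
        self.examples = [(set(tokens(n)), lab) for n, lab in pairs]

    def predict(self, name):
        toks = set(tokens(name))
        scores = defaultdict(float)
        for ex_toks, lab in self.examples:
            # Jaccard overlap between token sets, best match per label.
            overlap = len(toks & ex_toks) / max(len(toks | ex_toks), 1)
            scores[lab] = max(scores[lab], overlap)
        total = sum(scores.values()) or 1.0
        return {lab: s / total for lab, s in scores.items()}  # confidences

nl = NameLearner()
nl.fit([("phone", "agent-phone"), ("listed-price", "price"),
        ("contact", "agent-phone"), ("office-address", "contact-info")])
print(nl.predict("contact-phone"))   # agent-phone gets most of the mass
```

A Naive Bayes base learner would follow the same fit/predict interface, but over the data values of an element rather than its name.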
Training the Base Learners
• Mediated schema: address, price, agent-phone, description.
• Schema of realestate.com: location, listed-price, phone, comments.
• Sample listings:
  <location> Miami, FL </> <listed-price> $250,000 </> <phone> (305) 729 0831 </> <comments> Fantastic house </>
  <location> Boston, MA </> <listed-price> $110,000 </> <phone> (617) 253 1429 </> <comments> Great location </>
• The Name Learner trains on (source element name, mediated element) pairs:
  (location, address), (listed-price, price), (phone, agent-phone), ...
• The Naive Bayes Learner trains on (data value, mediated element) pairs:
  ("Miami, FL", address), ("$250,000", price), ("(305) 729 0831", agent-phone), ...
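A small sketch of this training-set construction (illustrative; the schema names and listings follow the running example):

```python
# Deriving the two kinds of training examples from one manually
# mapped source. Mapping and listings are the slide's running example.
mapping = {"location": "address", "listed-price": "price",
           "phone": "agent-phone", "comments": "description"}

listings = [
    {"location": "Miami, FL", "listed-price": "$250,000",
     "phone": "(305) 729 0831", "comments": "Fantastic house"},
    {"location": "Boston, MA", "listed-price": "$110,000",
     "phone": "(617) 253 1429", "comments": "Great location"},
]

# Name learner trains on (source-element-name, mediated-element) pairs...
name_examples = [(src, med) for src, med in mapping.items()]
# ...while data-driven learners train on (data-value, mediated-element) pairs.
value_examples = [(row[src], med) for row in listings
                  for src, med in mapping.items()]

print(name_examples[0])    # ('location', 'address')
print(value_examples[0])   # ('Miami, FL', 'address')
```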
Entity Recognizers
• Use pre-programmed knowledge to identify specific types of entities
  • date, time, city, zip code, name, etc.
  • house-area (30 X 70, 500 sq. ft.)
  • county-name recognizer
• Recognizers often have nice characteristics:
  • easy to construct
  • many off-the-shelf research & commercial products
  • applicable across many domains
  • help with special cases that are hard to learn
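A recognizer can be as simple as a regular expression paired with a fixed, high confidence. A sketch, assuming simplified US phone and zip-code patterns (the patterns and the 0.9 confidence are illustrative values, not LSD's):

```python
# Two hand-coded recognizers (simplified, illustrative patterns).
import re

RECOGNIZERS = {
    "agent-phone": re.compile(r"^\(\d{3}\)\s*\d{3}[ -]?\d{4}$"),  # US phone
    "zip-code":    re.compile(r"^\d{5}(-\d{4})?$"),
}

def recognize(value):
    """Return (label, confidence) votes; recognizers are high-precision."""
    return {label: 0.9 for label, pat in RECOGNIZERS.items()
            if pat.match(value.strip())}

print(recognize("(278) 345 7215"))   # {'agent-phone': 0.9}
print(recognize("98195"))            # {'zip-code': 0.9}
```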
Meta-Learner: Stacking
• Training the meta-learner produces a weight for every (base learner, mediated-schema element) pair:
  • weight(Name-Learner, address) = 0.1
  • weight(Naive-Bayes, address) = 0.9
• To combine predictions, the meta-learner computes the weighted sum of the base learners' confidence scores:
  • <area> Seattle, WA </> : Name Learner => (address, 0.6); Naive Bayes => (address, 0.8)
  • Meta-Learner => (address, 0.6 * 0.1 + 0.8 * 0.9 = 0.78)
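The combination step is just a weighted sum. A sketch using the slide's numbers (learner and element names are the running example):

```python
# Weighted-sum combination used by the meta-learner (illustrative).
weights = {("name-learner", "address"): 0.1,
           ("naive-bayes",  "address"): 0.9}

predictions = {"name-learner": {"address": 0.6},   # for <area> Seattle, WA
               "naive-bayes":  {"address": 0.8}}

def combine(predictions, weights, element):
    return sum(weights[(learner, element)] * scores.get(element, 0.0)
               for learner, scores in predictions.items())

print(combine(predictions, weights, "address"))  # 0.6*0.1 + 0.8*0.9 = 0.78
```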
Training the Meta-Learner
• For address, each extracted XML instance yields a row: the base learners' scores and the true 0/1 label.

    Extracted XML instance         Name Learner   Naive Bayes   True
    <location> Miami, FL </>       0.5            0.8           1
    <listed-price> $250,000 </>    0.4            0.3           0
    <area> Seattle, WA </>         0.3            0.9           1
    <house-addr> Kent, WA </>      0.6            0.8           1
    <num-baths> 3 </>              0.3            0.3           0
    ...                            ...            ...           ...

• Least-squares linear regression over these rows yields:
  • weight(Name-Learner, address) = 0.1
  • weight(Naive-Bayes, address) = 0.9
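A sketch of this fit with numpy's least-squares solver, using the instance scores from the table above (a toy fit on five rows will not reproduce the slide's 0.1/0.9 weights exactly; those numbers illustrate the idea):

```python
# Fitting per-element meta-learner weights by least-squares regression.
import numpy as np

# Columns: [Name Learner score, Naive Bayes score]; target: true 0/1
# label for mediated-schema element "address", one row per instance.
X = np.array([[0.5, 0.8],    # <location> Miami, FL </>
              [0.4, 0.3],    # <listed-price> $250,000 </>
              [0.3, 0.9],    # <area> Seattle, WA </>
              [0.6, 0.8],    # <house-addr> Kent, WA </>
              [0.3, 0.3]])   # <num-baths> 3 </>
y = np.array([1, 0, 1, 1, 0])

w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)   # learned weights for (Name Learner, Naive Bayes) on "address"
```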
Applying the Learners
• Mediated schema: address, price, agent-phone, description. Schema of homes.com: area, day-phone, extra-info.
• <area> instances (Seattle, WA; Kent, WA; Austin, TX):
  • Name Learner => (address, 0.8), (description, 0.2)
  • Naive Bayes => (address, 0.6), (description, 0.4)
  • Meta-Learner => (address, 0.7), (description, 0.3)
• <day-phone> instances ((278) 345 7215; (617) 335 2315; (512) 427 1115):
  • Meta-Learner => (agent-phone, 0.9), (description, 0.1)
• <extra-info> instances (Beautiful yard; Great beach; Close to Seattle):
  • Meta-Learner => (description, 0.8), (address, 0.2)
The Constraint Handler
• Extends learning to incorporate constraints:
  • Hard constraints:
    • a = address & b = address => a = b
    • a = house-id => a is a key
    • a = agent-info & b = agent-name => b is nested in a
  • Soft constraints:
    • a = agent-phone & b = agent-name => a & b are usually close to each other
• User feedback = hard or soft constraints.
• Details in [Doan et al., SIGMOD 2001].
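A minimal sketch of how a hard constraint can prune and rerank the meta-learner's candidates (illustrative, not the LSD algorithm; it enumerates full assignments and scores each by the product of confidences, which reproduces the example numbers on the backup slide at the end of the deck):

```python
# Enumerate candidate assignments, drop those violating a hard
# constraint, and rank the rest by the product of confidence scores.
from itertools import product

candidates = {                     # meta-learner output per source element
    "area":          [("address", 0.7), ("description", 0.3)],
    "contact-phone": [("agent-phone", 0.9), ("description", 0.1)],
    "extra-info":    [("address", 0.6), ("description", 0.4)],
}

def violates(assignment):
    # Hard constraint: at most one source element maps to "address".
    return list(assignment.values()).count("address") > 1

best, best_score = None, 0.0
for combo in product(*candidates.values()):
    assignment = dict(zip(candidates, (lab for lab, _ in combo)))
    if violates(assignment):
        continue
    score = 1.0
    for _, conf in combo:
        score *= conf
    if score > best_score:
        best, best_score = assignment, score

print(best, best_score)   # area->address, ..., extra-info->description, 0.252
```

The unconstrained best assignment (score 0.378) maps both area and extra-info to address; the constraint forces extra-info to description instead.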
The Current LSD System
• (Architecture figure.) Training phase: the mediated schema, the manually mapped source schemas, their data listings, and the domain constraints feed the base learners (Base-Learner1 … Base-Learnerk) and the meta-learner.
• Matching phase: the trained learners produce predictions for new source schemas; the constraint handler applies domain constraints and user feedback to produce the final mappings.
Outline
• Overview of structure mapping
• Data integration and source mappings
• LSD architecture and details
• Experimental results
• Current work
Empirical Evaluation
• Four domains:
  • Real Estate I & II, Course Offerings, Faculty Listings
• For each domain:
  • create the mediated DTD & domain constraints
  • choose five sources
  • extract & convert data listings into XML (faithful to the schema!)
  • mediated DTDs: 14-66 elements; source DTDs: 13-48
• Ten runs for each experiment; in each run:
  • manually provide 1-1 mappings for 3 sources
  • ask LSD to propose mappings for the remaining 2 sources
  • accuracy = % of 1-1 mappings correctly identified
Matching Accuracy
• (Chart: average matching accuracy (%) per domain.)
• LSD's accuracy: 71-92%
• Best single base learner: 42-72%
• + Meta-learner: +5-22%
• + Constraint handler: +7-13%
• + XML learner: +0.8-6%
Sensitivity to Amount of Available Data
• (Chart: average matching accuracy (%) vs. number of data listings per source, Real Estate I.)
Contribution of Schema vs. Data
• (Chart: average matching accuracy (%) for LSD with only schema information, LSD with only data information, and the complete LSD system.)
• More experiments in the paper [Doan et al. 01].
Reasons for Incorrect Matching
• Unfamiliarity
  • e.g., suburb
  • solution: add a suburb-name recognizer
• Insufficient information
  • correctly identified the general type, but failed to pinpoint the exact type:
    <agent-name> Richard Smith </> <phone> (206) 234 5412 </>
  • solution: add a proximity learner
• Subjectivity
  • house-style = description?
Outline
• Overview of structure mapping
• Data integration and source mappings
• LSD architecture and details
• Experimental results
• Current work
Moving Up the Expressiveness Ladder
• Schemas are very simple ontologies.
• More expressive power = more domain constraints.
• Mappings become more complex, but the constraints provide more to learn from.
• Non 1-1 mappings:
  • F1(A1, …, Am) = F2(B1, …, Bm)
• Ontologies (of various flavors):
  • class hierarchy (i.e., containment on unary relations)
  • relationships between objects
  • constraints on relationships
Finding Non 1-1 Mappings (Current Work)
• Given two schemas, find:
  • 1-many mappings: address = concat(city, state)
  • many-1 mappings: half-baths + full-baths = num-baths
  • many-many mappings: concat(addr-line1, addr-line2) = concat(street, city, state)
• A 1-many mapping is expressed as a query:
  • value correspondence expression: room-rate = rate * (1 + tax-rate)
  • relationship: the state of tax-rate = the state of the hotel that has rate
  • special case: 1-many mappings between two relational tables
• (Figure: mediated schema with address, description, num-baths vs. source schema with city, state, comments, half-baths, full-baths.)
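For many-1 arithmetic mappings, the data instances themselves can confirm a candidate. A tiny illustrative check (column names follow the running example; real data would be noisy, so a threshold on the fraction of satisfying rows would replace the exact test):

```python
# Checking one candidate many-1 mapping against the data instances.
rows = [
    {"half-baths": 1, "full-baths": 2, "num-baths": 3},
    {"half-baths": 0, "full-baths": 2, "num-baths": 2},
]
holds = all(r["half-baths"] + r["full-baths"] == r["num-baths"] for r in rows)
print(holds)   # True: half-baths + full-baths = num-baths is supported
```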
Brute-Force Solution
• Define a set of operators: concat, +, -, *, /, etc.
• For each set of mediated-schema columns:
  • enumerate all possible mappings m1, m2, …, mk from the source-schema columns
  • evaluate each candidate by computing its similarity using all base learners, and return the best mapping
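A sketch of the enumeration for one mediated-schema column, restricted to the concat operator (illustrative; the similarity function here is a stand-in for the base learners' combined score):

```python
# Brute-force enumeration of concat-based candidate mappings.
from itertools import permutations

source_cols = ["city", "state", "comments"]

def candidates(max_args=2):
    """Yield tuples of source columns; a tuple denotes concat(cols...)."""
    for k in range(1, max_args + 1):
        for cols in permutations(source_cols, k):
            yield cols

def similarity(cols):
    # Stand-in: in LSD this would be the base learners' combined score
    # of the materialized column against the mediated-schema column.
    return 1.0 if cols == ("city", "state") else 0.1

best = max(candidates(), key=similarity)
print("address = concat(%s)" % ", ".join(best))   # concat(city, state)
```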
Search-Based Solution
• States = columns:
  • goal state: a mediated-schema column
  • initial states: all source-schema columns
  • use 1-1 matching to reduce the set of initial states
• Operators: concat, +, -, *, /, etc.
• Column similarity: use all base learners + recognizers.
Multi-Strategy Search
• Use a set of expert modules: L1, L2, …, Ln.
• Each module:
  • applies only to certain types of mediated-schema columns
  • searches a small subspace
  • uses a cheap similarity measure to compare columns
• Examples:
  • L1: text; concat; TF/IDF
  • L2: numeric; +, -, *, /; [Ho et al. 2000]
  • L3: address; concat; Naive Bayes
• Search techniques:
  • beam search as the default
  • specialized searches that do not have to materialize columns
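A generic beam-search skeleton over column compositions (a sketch only; the expand operator applies concat, and the similarity scores are invented stand-ins for an expert module's cheap measure):

```python
# Beam search over column compositions (illustrative throughout).
def beam_search(initial, expand, similarity, beam_width=3, depth=2):
    beam = sorted(initial, key=similarity, reverse=True)[:beam_width]
    for _ in range(depth):
        frontier = [cand for state in beam for cand in expand(state)]
        # Keep only the beam_width most promising states.
        beam = sorted(set(beam) | set(frontier),
                      key=similarity, reverse=True)[:beam_width]
    return beam[0]

source_cols = [("city",), ("state",), ("comments",)]

def expand(state):                 # apply the concat operator
    return [state + c for c in source_cols if c[0] not in state]

def similarity(state):             # stand-in for the learner-based score
    return {("city",): 0.5, ("city", "state"): 0.9}.get(state, 0.05)

print(beam_search(source_cols, expand, similarity))   # ('city', 'state')
```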
Multi-Strategy Search (cont'd)
• Apply all applicable expert modules; each proposes candidate mappings:
  • L1: m11, m12, m13, …, m1x
  • L2: m21, m22, m23, …, m2y
  • L3: m31, m32, m33, …, m3z
• Combine the modules' predictions: compute the similarity of the top candidates (m11, m12, m21, m22, m31, m32) using all base learners, and select the best one (e.g., m11).
Related Work
• Recognizers + schema, 1-1 matching: TRANSCM [Milo & Zohar 98], ARTEMIS [Castano & Antonellis 99], [Palopoli et al. 98], CUPID [Madhavan et al. 01]
• Single learner, 1-1 matching: SEMINT [Li & Clifton 94], ILA [Perkowitz & Etzioni 95]
• Hybrid, 1-1 matching: DELTA [Clifton et al. 97]
• Sophisticated data-driven user interaction: CLIO [Miller et al. 00], [Yan et al. 01]
• Multi-strategy learning, learners + recognizers, schema + data, 1-1 + non 1-1 matching: LSD [Doan et al. 2000, 2001]
Summary
• LSD uses multi-strategy learning to semi-automatically generate semantic mappings.
• LSD is extensible, and incorporates domain knowledge, user knowledge, and previous techniques.
• Experimental results show the approach is very promising.
• Future work and issues to ponder:
  • accommodating more expressive languages: ontologies
  • reuse of learned concepts from related domains
  • semantics?
• Data management is a fertile area for machine learning research!
Mapping Maintenance
• Ten months later… are the mappings still correct?
• (Figure: mediated schema M mapped to source schema S by m1, m2, m3; after both evolve into M' and S', do m1, m2, m3 still hold?)
Information Extraction from Text
• Extract data fragments from text documents
  • e.g., the date, location, & victim's name from a news article
• Intensive research on free-text documents.
• Many documents do have substantial structure:
  • XML pages, name cards, tables, lists
  • Each such document = a data source:
    • the structure forms a schema
    • only one data value per schema element, whereas a "real" data source has many data values per schema element
• Ongoing research in the IE community.
Contribution of Each Component
• (Chart: average matching accuracy (%) for the complete LSD system vs. LSD without the Name Learner, without Naive Bayes, without the Whirl Learner, and without the Constraint Handler.)
Exploiting Hierarchical Structure
• Existing learners flatten out all structures.
• We developed an XML learner:
  • similar to the Naive Bayes learner: an input instance = a bag of tokens
  • differs in one crucial aspect: it considers not only text tokens, but also structure tokens
• Example:
  <contact>
    <name> Gail Murphy </name>
    <firm> MAX Realtors </firm>
  </contact>

  <description> Victorian house with a view. Name your price! To see it, contact Gail Murphy at MAX Realtors. </description>
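A sketch of the tokenization idea, assuming we emit an element's tag as a structure token alongside its text tokens (this captures the crucial difference from plain Naive Bayes; the exact token design in LSD's XML learner may differ):

```python
# Tokenize XML into structure tokens plus text tokens for a bag-of-
# tokens learner (illustrative; uses the slide's example snippet).
import re
import xml.etree.ElementTree as ET

def xml_tokens(xml_string):
    root = ET.fromstring(xml_string)
    toks = []
    for elem in root.iter():
        toks.append("<%s>" % elem.tag)                     # structure token
        if elem.text:
            toks += re.findall(r"\w+", elem.text.lower())  # text tokens
    return toks

print(xml_tokens("<contact><name>Gail Murphy</name>"
                 "<firm>MAX Realtors</firm></contact>"))
# ['<contact>', '<name>', 'gail', 'murphy', '<firm>', 'max', 'realtors']
```

On this representation, "gail murphy" under <contact><name> and the same words buried in a flat <description> produce different token bags, which is exactly the signal the flattening learners lose.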
Domain Constraints
• Impose semantic regularities on sources
  • verified using schema or data
• Examples:
  • a = address & b = address => a = b
  • a = house-id => a is a key
  • a = agent-info & b = agent-name => b is nested in a
• Can be specified up front:
  • when creating the mediated schema
  • independent of any actual source schema
The Constraint Handler
• Can specify arbitrary constraints; user feedback = a domain constraint
  • e.g., ad-id = house-id
• Extended to handle domain heuristics:
  • a = agent-phone & b = agent-name => a & b are usually close to each other
• Example. Predictions from the meta-learner:
  • area: (address, 0.7), (description, 0.3)
  • contact-phone: (agent-phone, 0.9), (description, 0.1)
  • extra-info: (address, 0.6), (description, 0.4)
• Domain constraint: a = address & b = address => a = b.
• Candidate assignments, scored by the product of confidences:
  • area = address, contact-phone = agent-phone, extra-info = address: 0.7 * 0.9 * 0.6 = 0.378 (violates the constraint)
  • area = address, contact-phone = agent-phone, extra-info = description: 0.7 * 0.9 * 0.4 = 0.252 (best satisfying assignment)
  • area = description, contact-phone = description, extra-info = description: 0.3 * 0.1 * 0.4 = 0.012