
Making Holistic Schema Matching Robust: An Ensemble Approach

Bin He. Joint work with Kevin Chen-Chuan Chang, Univ. of Illinois at Urbana-Champaign. Background: MetaQuerier, large-scale integration of the deep Web.

Presentation Transcript


  1. Making Holistic Schema Matching Robust: An Ensemble Approach. Bin He. Joint work with: Kevin Chen-Chuan Chang. Univ. of Illinois at Urbana-Champaign.

  2. Background: MetaQuerier – large-scale integration of the deep Web. [Figure labels: Query, Result, MetaQuerier, The Deep Web]

  3. MetaQuerier: System architecture [CIDR’05]. Front-end (Query Execution): Result Compilation, Query Translation, Source Selection. Deep Web Repository: Query Interfaces, Query Capabilities, Subject Domains, Unified Interfaces. Back-end (Semantics Discovery): Database Crawler, Interface Extraction, Source Organization, Schema Matching. [Diagram labels: MetaQuerier, The Deep Web, Query Web databases, Find Web databases]

  4. Matching query interfaces (QIs). [Figure: query interfaces from the Book domain and the Music domain, annotated with 1:1 simple matching and m:n complex matching]

  5. Traditional approaches of schema matching: pairwise attribute correspondence. Example schemas: S1: author, title, subject, ISBN; S2: writer, title, category, format; S3: name, title, keyword, binding. Pairwise matching yields attribute correspondences: S1.author ↔ S3.name, S1.subject ↔ S2.category. • Typical matching approaches • Cupid [VLDB’01] • LSD [SIGMOD’01] • Scale is a challenge • Only small scale • Large-scale is a must for our task • Scale is an opportunity • Context information is not exploited • similar attributes across multiple schemas • co-occurrence patterns among attributes

  6. Emerging paradigm: Holistic schema matching approach • Match many schemas at the same time and find all the matchings at once. Input: a set of schemas (S1: author, title, subject, ISBN; S2: writer, title, category, format; S3: name, title, keyword, binding). Output: a ranked list of matchings (author = writer = name; subject = category; format = binding).

  7. Various techniques to realize holistic matching • Matching as hidden model discovery: Model the generative behavior of schemas from attributes and their semantic relationships • The MGS framework [SIGMOD’03] • Matching as correlation mining: The correlation of attributes across sources reflects complex relationships • The DCM framework [KDD’04] • Matching as clustering: Attributes in two schemas may be similar through attributes in other schemas • Interactive clustering based matcher [SIGMOD’04] • WISE-Integrator [VLDB’03]

  8. Holistic matching is, in essence, data mining to discover semantics for information integration • Hypothesis: semantics (semantic correspondences) generate the observations (attribute occurrences) through hidden regularities • Holistic matching approach: statistical analysis of the observations recovers the semantics (hidden model discovery, correlation mining, clustering)

  9. The baseline holistic matching architecture, with matching as correlation mining. Input: query interfaces from sources such as AA.com, United.com, Expedia.com, Delta.com. The DCM matcher outputs matchings such as {adult, child, senior} = passenger and departure date = depart.

  10. The challenge in holistic input: Noisy data quality. Because of its mining nature, holistic matching suffers from the inherent problem of noisy data quality! • Noisy input is inevitable • extraction of QIs may contain errors • organization of QIs may not be fully accurate. [Diagram: The Deep Web, Database Crawler, Interface Extraction, Source Organization, Holistic Schema Matching]

  11. Example of errors in interface extraction. Result of extraction: AA.com. The correlation between (adult, children) and passenger is affected by a single extraction error!

  12. The impact of noise: Error cascade. One subsystem with accuracy Ai (e.g., Interface Extraction) feeds another with accuracy Aj (e.g., Holistic Schema Matching). Is the accuracy of the cascade Aj? Or Ai*Aj? Q: Errors are often a minority, so why do they cascade? A: The technique of a semantics-related task, e.g., data integration, is often context-sensitive: constraints, heuristics, measures, parameters, procedures. A general solution: sampling and voting techniques (the ensemble framework).

  13. The intuition of the ensemble idea • Sampling: a way to reduce noise in the input. A sample should 1) contain sufficiently many good schemas to mine matchings and 2) contain less noise, so the holistic matcher has a better chance of being sustained • Voting: a single sample may be biased, so let us repeat the sampling multiple times and then vote. It is likely that the holistic matcher can be sustained in most samples.

  14. The ensemble framework for holistic schema matching. Input schemas: S1: author, title, subject, ISBN; S2: name, title, keyword, binding; S3: writer, title, category, format. Multiple sampling: each trial, from the 1st to the Tth, draws a sample and runs holistic schema matching on it. Rank aggregation (voting) combines the per-trial results into the final matchings, e.g., author = name = writer and subject = category.
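The multiple-sampling half of this framework can be sketched as follows. This is only an illustration under stated assumptions: `toy_matcher` is a hypothetical stand-in (it is not the DCM matcher), `run_trials`, `sample_size`, and `num_trials` are names invented here for S and T, and each sample is assumed to be drawn without replacement.

```python
import random
from collections import Counter
from itertools import combinations


def toy_matcher(schemas):
    """Hypothetical stand-in matcher (NOT DCM): rank attribute pairs that
    never appear together in one schema, most frequent attributes first."""
    counts = Counter(a for s in schemas for a in set(s))
    together = {frozenset(p) for s in schemas for p in combinations(set(s), 2)}
    pairs = [p for p in combinations(sorted(counts), 2)
             if frozenset(p) not in together]
    return sorted(pairs, key=lambda p: (-min(counts[p[0]], counts[p[1]]), p))


def run_trials(schemas, matcher, sample_size, num_trials, seed=0):
    """Draw num_trials random samples of sample_size schemas (assumed to be
    without replacement within a sample); return one ranked list per trial."""
    rng = random.Random(seed)
    pool = list(schemas)
    return [matcher(rng.sample(pool, sample_size)) for _ in range(num_trials)]


schemas = [
    ["author", "title", "subject", "ISBN"],     # S1
    ["name", "title", "keyword", "binding"],    # S2
    ["writer", "title", "category", "format"],  # S3
]
trials = run_trials(schemas, toy_matcher, sample_size=2, num_trials=5)
```

Each entry of `trials` is one ranked list; turning those lists into a single ranking is the job of the voting step, which is kept separate here so that any holistic matcher can be plugged in as `matcher`.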

  15. How the ensemble framework works: An example. Ranked lists shown around the Holistic Schema Matching boxes: (1. author = ISBN, 2. publisher = category, 3. author = name); (1. author = name, 2. subject = category, 3. author = ISBN), shown twice; (1. subject = category, 2. author = ISBN, 3. author = name); (1. author = name, 2. publisher = category, 3. author = ISBN). Please refer to our paper for more formal analysis.

  16. The ensemble idea is inspired by bagging predictors • Bagging is used in machine learning to maintain the accuracy of a classifier in the presence of a biased distribution of input data • We are essentially applying bagging techniques in a new scenario of schema matching • However, we are different in • setting: supervised vs. unsupervised • technique: sampling and voting techniques • analytic model: our modeling is specific to matching

  17. Configuration of multiple sampling • The configuration dilemma • Sample size S • If S is too small, the sampled data may not be sufficiently representative • If S is too large, the sampled data may contain too much noise • Number of trials T • If T is too small, the voting result may not be sufficiently convincing • If T is too large, more execution time is needed • Two ways to choose S and T • ST: first choose an S, then derive an appropriate T • TS: first choose a T, then derive an appropriate S • TS is better than ST, since the accuracy is very sensitive to S, not T
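To see why a small T makes the vote less convincing, one simple illustration is to assume that each trial independently succeeds with the same probability p and ask how likely it is that a strict majority of T trials succeed. Both the independence assumption and the name `majority_confidence` are introduced here for illustration; the slides do not give their analytic model, so this should not be read as the authors' derivation.

```python
from math import comb


def majority_confidence(p: float, T: int) -> float:
    """P(more than half of T independent trials succeed), assuming every
    trial succeeds with the same probability p (a simplifying assumption)."""
    need = T // 2 + 1  # smallest number of successes that is a strict majority
    return sum(comb(T, k) * p**k * (1 - p) ** (T - k) for k in range(need, T + 1))


for T in (3, 11, 21):
    print(T, round(majority_confidence(0.7, T), 3))
```

Under this assumption, when p is above 0.5 the majority becomes more reliable as T grows, at the cost of running the matcher more times, which mirrors the trade-off stated on the slide.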

  18. Aggregating matchings from all trials: Enforcing the majority matching results • Each trial outputs a ranked list of matchings • Voting thus aggregates a set of ranked lists into a single ranked list R, which reflects the ranking results of the majority • Candidate selection • If the majority of trials do not find a matching M, M is not considered a correct matching and thus does not appear in R • Ranking aggregation • If the majority of trials rank M1 higher than M2, it is desirable to also rank M1 higher than M2 in R

  19. An example of voting. Trial outputs: T1: 1. author = name, 2. subject = category, 3. author = ISBN. T2: 1. subject = category, 2. author = ISBN, 3. author = name. T3: 1. author = name, 2. publisher = category, 3. author = ISBN. All matchings: M1. author = name, M2. subject = category, M3. author = ISBN, M4. publisher = category. Candidate selection: M1, M2, and M3 are kept; M4 is found in only one of the three trials and is dropped. Rank aggregation with Borda’s aggregation: B(Mi) = Σj (rank of Mi in Tj), so B(M1) = 1 + 3 + 1 = 5, B(M2) = 2 + 1 + 3 = 6, B(M3) = 3 + 2 + 2 = 7. These sums are consistent with counting ranks after M4 is removed from T3, where the absent M2 takes the last rank, 3. Rank matchings according to B(Mi): M1. author = name, M2. subject = category, M3. author = ISBN.
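The voting example can be reproduced with a short sketch. Two conventions are assumptions chosen because they reproduce the sums on the slide: matchings rejected by candidate selection are removed from each trial's list before ranks are counted, and a kept matching that a trial did not find takes the rank just after that trial's last remaining entry. The function name `aggregate` is hypothetical.

```python
from collections import Counter


def aggregate(trials):
    """Candidate selection by strict majority, then Borda-style aggregation
    (a lower total rank is better). Returns (ranked list, scores)."""
    T = len(trials)
    found = Counter(m for t in trials for m in set(t))
    keep = {m for m, c in found.items() if c > T / 2}  # majority of trials
    scores = {m: 0 for m in keep}
    for t in trials:
        filtered = [m for m in t if m in keep]  # drop rejected candidates
        missing_rank = len(filtered) + 1        # assumed rank when absent
        for m in keep:
            scores[m] += filtered.index(m) + 1 if m in filtered else missing_rank
    return sorted(keep, key=lambda m: (scores[m], m)), scores


trials = [
    ["author = name", "subject = category", "author = ISBN"],    # T1
    ["subject = category", "author = ISBN", "author = name"],    # T2
    ["author = name", "publisher = category", "author = ISBN"],  # T3
]
ranking, scores = aggregate(trials)
print(ranking)
print(scores)
```

On this input the scores come out as 5, 6, and 7 for author = name, subject = category, and author = ISBN, matching B(M1), B(M2), and B(M3) above.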

  20. Experimental setup • Subsystems integration scenario • Interface Extraction + Holistic Schema Matching • Interface Extractor [SIGMOD’04] • The DCM Matcher [KDD’04] • Datasets • Two representative domains in the TEL-8 dataset in UIUC Web Integration Repository • Books and Airfares • http://metaquerier.cs.uiuc.edu/repository/

  21. Experimental result: Baseline vs. Ensemble. [Figure: baseline approach vs. ensemble approach; panels (a) Precision of Books, (b) Precision of Airfares, (c) Recall of Books, (d) Recall of Airfares]

  22. Experimental result: Outliers vs. Missing Data • Upper bound exists • Two types of data quality problems • Outliers (noise) • Missing data • Outliers • data that ideally should not be observed, but are observed • can be solved by the ensemble approach • Missing data • data that ideally should be observed, but are not • cannot be solved by the ensemble approach. [Figure: panels (a) Precision of Books, (b) Precision of Airfares, (c) Recall of Books, (d) Recall of Airfares]

  23. Contributions • Problem • noisy data quality is an inherent challenge for large scale schema matching • critical for sustaining holistic schema matching as a practical and viable technique • Solution • an ensemble framework with sampling and voting techniques, inspired by bagging predictors • we are essentially applying bagging techniques in a new scenario of schema matching

  24. Thank You!
