Query Translation of Web Database Integration: Issues, Advances and Directions

Query Translation of Web Database Integration: Issues, Advances and Directions Fangjiao Jiang

Outline • Query translation in web database integration • Introduction • Problems • A simple framework • Survey the current works • The Challenges and opportunities of query translation in Web DB integration • Our future works of query translation

Outline • Query translation in web database integration • Introduction • Problems • A simple framework • Survey the current work • The Challenges and opportunities of query translation in Web DB integration • Our future works of query translation

Introduction of query translation • Query translation A user’s query submitted to the integrated interface must be translated to web database interfaces automatically. query translation

DB2 DB3 DBn DB1 query …… ProblemsProblem1: Web DBs • Should we translate the query to every web database? • Necessary? costly? redundant? • What? • What web databases should we select to translate the user’s query? • Database selection? Q-Web DB Q Q Q

1-1 match Complex match ProblemsProblem2: {Depart City}={leaving from} 1:1 {Destination}={Going to} 1:1 …… {Adult, Child}={Passengers} n:1 • How? • How to translate a query from integrated interface to a web database interface? • Attribute matching? A local web DB interface Integrated interface

local web DB interface1 Integrated interface local web DB interface2 ProblemsProblem3: • Title contains “red storm” Title contains “red storm” (any words) or Title contains “red storm” (all words) or Title contains “storm” (any words) • Price<$35 Price<$25 U $25<Price<$45 • …… • How? • How to translate a query from integrated interface to a web database interface? • Constraint mapping?

Author, Title could be queried together. Author, Title, Subject could only be queried one of them at a time. How? How to translate a query from integrated interface to a web database interface? Capability-based query rewriting? Class=Economy There is no attribute in the web database interface that match the attribute “class”, so Class=true？ A local web DB interface Integrated interface Integrated interface A local web DB interface ProblemsProblem4:

Returned results Correct results ProblemsProblem5: • It is unavoidable that some of returned results are incorrect. • How? • How to filter the returned results to get the correct results? • Result filter?

The simple framework of query translation pre--processing core--processing post--processing

Outline • Query translation in web database integration • Introduction • Problems • A simple framework • Survey the current work • The Challenges and opportunities of query translation in Web DB integration • Our future works of query translation

Survey the current work (Ⅰ) • Database selection • A Frequency-based Approach for Mining Coverage Statistics in Data Integration. Zaiqing Nie and Subbarao Kambhampati. In Proceedings of the 20th ICDE 2004. • A. Y. Levy, A. Rajaraman, and J. J. Ordille. Querying heterogeneous information sources using source descriptions. In VLDB Conference, 1996. • A. Y. Halevy. Answering queries using views: A survey. The VLDB Journal, 10(4):270.294, 2001. • …… • Attributes matching • E. Rahm and P. A. Bernstein. A survey of approaches to automatic schema matching. VLDB Journal, 10(4):334–350, 2001 • Bin He.Discovering Complex Matchings across Web Query Interfaces: A Correlation Mining Approach.KDD,2004 • B. He and K. C.-C. Chang. Statistical schema matching across web query interfaces. In SIGMOD Conference, 2003. • Jiying Wang, Ji-Rong Wen, Fred Lochovsky, Wei-Ying Ma.Instance-based Schema Matching for Web Databases by Domain-specific Query Probing. Proceedings of the 30th VLDB Conference,2004. • Bin He, Kevin Chen-Chuan Chang: Making holistic schema matching robust: an ensemble approach. KDD 2005, 429-438. • ……

Survey the current work (Ⅱ) • Constraint mapping • Z. Zhang, B. He, and K. C.-C. Chang. Light-weight domain-based form assistant: Querying Web Databases On the Fly, In VLDB Conference, 2005. • K. C.-C. Chang and H. Garcá-Molina. Approximate query mapping: Accounting for translation closeness. VLDB Journal 2001. • K. C.-C. Chang, H. Garcá-Molina, and A. Paepcke. Boolean Query Mapping Across Heterogeneous Information Sources. IEEE Transactions on Knowledge and Data Engineering 1996. • K. C.-C. Chang, H. Garcá-Molina : Mind Your Vocabulary: Query Mapping Across Heterogeneous Information Sources. Proceedings of the 1999 ACM SIGMOD Conference. • …… • Query rewriting • C. Li, R. Yerneni, V. Vassalos, H. Garcia-Molina, Y. Papakonstantinou,J. Ullman, and M. Valiveti. Capability based mediation in TSIMMIS. SIGMOD Conference, 1998. • Y. Papakonstantinou, A. Gupta, and L. Haas. Capabilities-based query rewriting in mediator systems. In International Conference on Parallel and Distributed Information Systems, 1996. • Y. Papakonstantinou, A. Gupta, H. Garcia-Molina, and J. D. Ullman.A query translation scheme for rapid implementation ofwrappers. In International Conference on Deductive and Object-Oriented Databases, 1995. • …… • Result filter • ……

Need to know the coverage of each source S with respect to the query Q. UseCoverage and Overlap Statistics to Rank Sources. Only need to know the coverage of each source S with respect to the frequent query Q. Learn statistics only with respect A smaller set of frequently asked queries Store statistics with respect to query class The new query which is not in the query list can be mapped into most similar query classes AV Hierarchies and query classes Related works(1)—Database selection:A Frequency-based Approach for Mining Coverage Statistics in Data Integration. (ICDE 2004) • BibFinder Scenario — A publicly fielded computer science bibliography mediator. It integrates several online Computer Science bibliography sources, such as CSB, DBLP, ACM Digital Library, CiteSeer, and so on. • Approach

Query List

AV Hierarchies and Query Classes … …

1 . A new user query is mapped to a set of least general query classes . 2 . The mediator estimates the statistics for the query DBL using a weighted sum of the statistics of the P mapped classes . ACMDL 3 . Data sources are ranked and called in order of relevance using the estimated statistics . In particular : - The most relevant source has highest coverage CSB - The next best source has highest residual coverage As a result , the maximum number of tuples are obtained while the least number of sources are Example : called . Here , CSB has highest coverage , followed by DBLP . However , since ACMDL has higher residual coverage than DBLP , the top 2 sources that would be called are CSB and ACMDL . Using Coverage and Overlap Statistics to Rank Sources

Related works(2)—Database selection:Query heterogeneous information sources using source description（VLDB 1996） • Information Manifold • an implemented system that provides uniform access to a heterogeneous collection of more than 100 information sources on the WWW. • IM contains declarative description of the contents of the information sources. Example: Q: Get the price and review of cars for sale that were manufactured no earlier than 1992.

Use the relational model, augmented with certain object-oriented features to describe the content of information sources.

Related works(3): constraint mapping: Light-weight domain-based form assistant: Querying Web Databases On the Fly (VLDB 2005)

semantic closeness • Definition 1: Given a source query Qs and a target query form T, a query Qt* is a minimal subsuming translation w.r.t. T if: • 1. Qt* is a valid query w.r.t. T; • 2. Qt* subsumes Qs, i.e., for any database instance Di, Qs(Di) ∑Qt* (Di); • 3. Qt* is minimal, i.e., there is no query Qt such that Qt satisfies (1) and (2) and Qt* subsumes Qt. Approach: • find 37 template patterns in 150 sources. • notice that two predicate templates have mapping correspondence only if there exists a concept expressed with these two templates in different sources. CM (i, j) denote the number of concepts that are expressed using both templates Pi and Pj in the correspondence matrix CM. • As Figure 5 indicates, mappings happen mostly only within certain clusters of templates. Datetime type Numeric type Text type

Predicate mapper consists of two components: typerecognizerand type handler • predicate mapper takes a source predicate s and a matched target predicate template P as input, and outputs the closest target translation t* for s • a type handler needs to have three key components: search space, closeness estimation, and search strategy.

Closeness Estimation • Given the search space (P) covering all possible mappings, finding a Cmin mapping for numeric type and Datetime type is an easy task. • for text type, the inference of sub-sumption relationship is not trivial since it essentially needs logical reasoning.

Text Type Handler: • The question is which database instance can be used to reliably test the sub-sumption relationship? • we construct the database using words from Ws plus some additional random words. The database is composed of all possible combinations of the words (for testing the membership) with all possible orders (for testing the sequence). Figure8: t5 is the Cmin mapping.

Related works(4)—constraint mapping:Mind Your Vocabulary: Query Mapping Across Heterogeneous Information Sources. SIGMOD 1999 • Method -----Provide manuallymapping rules to translate query constraints from one to another sources consider one-to-one constraints mapping Consider inter-dependencies among constraints Explore relaxations into the closest supported version

Related works(5)—query rewriting: Query heterogeneous information sources using source description（VLDB 1996） • Information Manifold • an implemented system that provides uniform access to a heterogeneous collection of more than 100 information sources on the WWW. • IM contains declarative description of the contents and capabilities of the information sources. • use the source description to prune the set of information sources for a given query and generate executable query plans.

Use the capability records to describe the capabilities of an information sources. Every source relation associate one capability record of the form(Sin, Sout, Ssel, min, max) • Sin---must be given bindings for at least min elements of Sin, • Sout---the parameters returned from the information sources, • Ssel---the parameters that can apply selections of the form, • Min---the minimum number of inputs allowed, • Max--- the maxmum number of inputs allowed.

Related works(6)—query rewriting: Capability based Mediation in TSIMMIS (SIGMOD 1998) • keep track of the capabilities of sources to answer queries. This may not lead to generate plans involving source queries that cannot be answered by the sources. • TSIMMIS system: • The mediator encodes the relationship between the user views and the source views with a set of view definitions. • uses the Mediator Specification Language (MSL) to define user views. MSL is a logic-based language with object-oriented features. For example, the user view paper is defined as follows: <paper (<title T><author A><abs B><conf C>}> :- <entry {<title T><author A><abs B>}>Qs1, <entry {<title T><conf C>}>QS2 • Suppose the user wants to find the title and abstract of each paper written by ‘Smith’ in ‘SIGMOD-97’. The user formulates the following query, based on the user view paper: <ans {<title T><abs B>}> :- <paper {<title T><author ‘Smith’><abs B><conf ‘SIGMOD-97’>}>

When the user query arrives at the mediator, the mediator uses the view definitions to translate the query on the user views into a logical plan. The following is the logical plan for the example • user query: <ans {<title T><abs B>}> :- <entry {<title T><author ‘Smith’><abs B>}>Qsl, <entry {<title T><conf ‘SIGMOD-97’>}>Qs2 Three possible physical plans for the logical plan of the example user query are: • PI: Send query <entry {<title T><author ‘Smith’><abs B>}> to s1 ‘; send query <entry {<title T><conf ‘SIGMOD-97’>}>to s2; join the results of these source queries on the title attribute. • P2: Send query <entry {<title T><author ‘Smith’><abs B>}> to s1; for each returned title, send query <entry {<title T> <conf ‘SIGMOD-97’>}>to s2, with T bound. • Ps: Send <entry {<title T><conf ‘SIGMOD-97’>}> to s2; for each returned title, send <entry {<title T><author ‘Smith’> <abs B>}> to s1, with T bound. Some of these physical plans may or may not be feasible depending on the query capabilities of the sources,

In order to describe the capabilities of sources, the TSIMMIS system uses templates to represent sets of queries that can be processed by each source. Suppose s1 and s2 only have the following templates. templates. • TII : X:-X:<entry {<title $T><author A><abs B>}>Qs1 • T21 : X:-X:<entry {<title T><conf $C>}>Qs2 • T22 : X:-X:<entry {<title $T><conf C>}>Qs2 • PI: Send query <entry {<title T><author ‘Smith’><abs B>}> to s1 ‘; send query <entry {<title T><conf ‘SIGMOD-97’>}>to s2; join the results of these source queries on the title attribute. • P2: Send query <entry {<title T><author ‘Smith’><abs B>}> to s1; for each returned title, send query <entry {<title T> <conf ‘SIGMOD-97’>}>to s2, with T bound. • Ps: Send <entry {<title T><conf ‘SIGMOD-97’>}> to s2; for each returned title, send <entry {<title T><author ‘Smith’> <abs B>}> to s1, with T bound.

Outline • Query translation in web database integration • Introduction • Problems • A simple framework • Survey the current work • The Challenges and opportunities of query translation in Web DB integration • Our future work of query translation

Challenges How can we translate the query from the uniform integrated interface to web database interfaces? • The number of web databases we can access is very large even if in one domain. The meta-information about the web databases is very difficult to access. Logical source contents (books, new cars) Source capabilities (can answer the query) Source completeness (has all books) Statistics about the data (like in an RDBMS) Source reliability Update frequency • The web databases are heterogeneous. with heterogeneous schema • The web databases are autonomous. No central administration Uncontrolled source content overlap • The web databases are dynamic. • Approximate query translations will be unavoidable andcomplex. • manually rule-based constraints mapping will be replaced by automatically rule-based one.

Opportunities • The aggregate schema vocabulary of sources in the same domain trends to converge at a relatively small size. • The distributions of attribute frequencies is non-uniform Zipf-like. • There are 25constraints patterns overall. • The distributions of constraints patterns is Zipf-like, too. • Data-model is simple. • Some related works of schema matching must be useful to query translation.

Outline • Query translation in web database integration • Introduction • Problems • A simple framework • Survey the current work • The differences and Challenges of query translation in Web DB integration • Our future works of query translation

Our framework of query translation

Questions • How do we characterize, get and exploit source content, completeness, reliability, coverage and overlap? • How to create a hierarchy relationship with respect to the semantic mapping of attributes? • How to generate constraints mapping rules automatically according to the semantic mapping and type of attributes? • How to get and describe the capabilities of the local interfaces? How to rewrite query based on them?

Thanks!  Questions

Main references • A. Y. Levy, A. Rajaraman, and J. J. Ordille. Querying heterogeneous information sources using source descriptions. In VLDB Conference, 1996. • A. Y. Halevy. Answering queries using views: A survey. The VLDB Journal, 10(4):270.294, 2001. • C. Li, R. Yerneni, V. Vassalos, H. Garcia-Molina, Y. Papakonstantinou,J. Ullman, and M. Valiveti. Capability based mediation in TSIMMIS. SIGMOD Conference, 1998. • Y. Papakonstantinou, A. Gupta, and L. Haas. Capabilities-based query rewriting in mediator systems. In International Conference on Parallel and Distributed Information Systems, 1996. • Y. Papakonstantinou, A. Gupta, H. Garcia-Molina, and J. D. Ullman.A query translation scheme for rapid implementation of wrappers. In International Conference on Deductive and Object-Oriented Databases, 1995. • Vasilis Vassalos, Yannis Papakonstantinou: Describing and Using Query Capabilities of Heterogeneous Sources. VLDB 1997: 256-265 • K. C.-C. Chang and H. Garcá-Molina. Approximate query mapping: Accounting for translation closeness. VLDB Journal 2001. • K. C.-C. Chang, H. Garcá-Molina, and A. Paepcke. Boolean Query Mapping Across Heterogeneous Information Sources. IEEE Transactions on Knowledge and Data Engineering 1996. • K. C.-C. Chang, H. Garcá-Molina : Mind Your Vocabulary: Query Mapping Across Heterogeneous Information Sources. Proceedings of the 1999 ACM SIGMOD Conference. • Z. Zhang, B. He, and K. C.-C. Chang. Light-weight domain-based form assistant: Querying Web Databases On the Fly, In VLDB Conference, 2005. • A Frequency-based Approach for Mining Coverage Statistics in Data Integration. Zaiqing Nie and Subbarao Kambhampati. In Proceedings of the 20th International Conference on Data Engineering (ICDE 2004).

Main references • K. C.-C. Chang and H. Garc´a-Molina. Conjunctive constraint mapping for data translation. ACM ICDL,1998 • Bin He, Kevin Chen-Chuan Chang. Making holistic schema matching robust: an ensemble approach. KDD 2005, 429-438. • Wensheng Wu, AnHai Doan, Clement Yu.WebIQ: Learning from the Web to Match Deep-Web Query Interfaces.ICDE,2006. • B. He, K. C.-C. Chang, and J. Han. Automatic complex schema matching across web query interfaces: A correlation mining approach. Technical Report UIUCDCS-R-2003-2388, Dept. of Computer Science, UIUC, July 2003. • K. C.-C. Chang, B. He, C. Li, M. Patel, and Z. Zhang. Structured databases on the web: Observations and implications. SIGMOD Record, 2004. • Palopoli L, Sacca D, Ursino D. Semi-automatic, semantic discovery of properties from database schemas. IDEAS,1998. • K. C.-C. Chang, B. He, C. Li, M. Patel, and Z. Zhang. Structured databases on the web: Observations and implications. SIGMOD Record, 2004 • Milo T, Zohar S,Using schema matching to simplify heterogeneous data translation. Proc24th VLDB • Bin He. Discovering Complex Matchings across Web Query Interfaces: A Correlation Mining Approach.KDD,2004. • Jiying Wang, Ji-Rong Wen, Fred Lochovsky, Wei-Ying Ma. Instance-based Schema Matching for Web Databases by Domain-specific Query Probing. Proceedings of the 30th VLDB Conference,2004. • Hai He, Weiyi Meng, Clement Yu, and Zonghuan Wu. Automatic Integration of Web Search Interfaces with WISE-Integrator . VLDB 2003. • ……

Query Translation of Web Database Integration: Issues, Advances and Directions

Query Translation of Web Database Integration: Issues, Advances and Directions

Presentation Transcript

Query Processing

The State of the Art in Distributed Query Processing

CS 405G: Introduction to Database Systems

Outline

“ Vital Implementation Issues in Translation”

IS698: Database Management

CS 440 Database Management Systems

Outline

CS411 Database Systems

Query Optimization

Creating a database

Integrating Speech Recognition and Machine Translation

Database Management Systems

Sequence Alignment Cont’d

Our IR Strategy

Database Systems II Query Compiler

Outline

Database Queries and Structured Query Language (SQL)

Lecture 11 Introduction to Relational Database

Structured Query Language (SQL)

Chapter 23

FOCUS Quality Assurance