
Automatic Conflict Resolution to Integrate Relational Databases

Ramon Lawrence, University of Manitoba, umlawren@cs.umanitoba.ca



  1. Automatic Conflict Resolution to Integrate Relational Databases • Ramon Lawrence • University of Manitoba • umlawren@cs.umanitoba.ca

  2. Outline • Introduction, Motivation, and Background • The integration architecture • Standard dictionary, X-Specs, query processor • Example integration • Northwind, Southstorm databases • Querying the integrated databases • Generating SQL queries from semantic queries • Unity implementation • Contributions, Conclusions, and Future Work

  3. What is Integration? • Two levels of integration: • Schema integration - the description of the data • Data integration - the individual data instances • Integration problems include: • Different data models and conflicts within a model • Incompatible concept representations • Different user or view perspectives • Naming conflicts (homonym, synonym) • Integration handles the different mechanisms for storing data (structural conflicts), for referencing data (naming conflicts), and for attributing meaning to the data (semantic conflicts).

  4. Why is Integration Required? • There are many integration environments: • Operational systems within an organization • System integration during company merger • Data warehouses, Intranets, and the WWW • Users require information from many data sources which often do not work together. • Companies require a global view of their entire operations which may be present in numerous operational databases for different departments and distributed geographically. • E-commerce demands integration of web databases with production systems.

  5. Previous Work • Research systems: • integrating systems by logical rules (Sheth) • defining global dictionaries (Castano) • Carnot Project using the Cyc knowledge base • wrapper and mediator systems: • Information Manifold, TSIMMIS, Infomaster • Industrial systems and standards: • Metadata Interchange Specification (MDIS) • XML, BizTalk, E-commerce portals • Query Languages: • SQL, MSQL, IDL, DIRECT, SchemaSQL

  6. Previous Work Summary • Current techniques for database integration have some of these problems: • Require integrator to understand all databases • Integration process is manual • Do not hide system complexity from the user • Force changes on the existing database systems • Construct global view manually • Suffer from query imprecision (query containment)

  7. Our Approach • Our approach combines standardization and query mapping algorithms. • The major idea is that schema conflicts can be resolved if we: • Eliminate all naming conflicts • Define a language capable of determining schema equivalence and performing transformations • Naming conflicts are eliminated by accepting a standard term dictionary. • Not a knowledge base or set of mediated views • Leverages semantic information in English words

  8. Integration Architecture
     [Figure: architecture diagram. Labels: Client (two), Integrated Context View, X-Spec Editor, Standard Dictionary, Integration Algorithm, Query Processor and ODBC Manager, Multidatabase Layer, Subtransactions, X-Spec (two), Database (two), Local Transactions.]
     • Architecture Components:
       • 1) Integrated Context View
         • user’s view of integration
       • 2) X-Spec Editor
         • stores schema & metadata
         • uses XML
       • 3) Standard Dictionary
         • terms to express semantics
       • 4) Integration Algorithm
         • combines X-Specs into integrated context view
       • 5) Query Processor
         • accepts query on view
         • determines data source mappings and joins
         • executes queries and formats results

  9. Architecture Components • The architecture consists of four components: • A standard dictionary (SD) to capture data semantics • SD terms are used to build semantic names describing semantics of schema elements. • X-Specs for storing data semantics • Database metadata and semantic names stored using XML • Integration Algorithm • Matches concepts in different databases by semantic names. • Produces an integrated view of all database concepts. • Query Processor • Allows the user to formulate queries on the view. • Translates from semantic names in integrated view to SQL queries and integrates and formats results. • Involves determining correct field and table mappings and discovery of join conditions and join paths

  10. Integration Processes • The integration architecture consists of three separate processes: • Capture process: independently extracts database schema information and metadata into an XML document called an X-Spec. • Integration process: combines X-Specs into a structurally-neutral hierarchy of database concepts called an integrated context view. • Query process: allows the user to formulate queries on the integrated view; the query processor maps them to structural queries (SQL), then integrates and formats the results.

  11. Integration Architecture: The Capture Process
     [Figure: capture process diagram. Labels: Relational Schema, Automatic Extraction, X-Spec, Specification Editor, DBA, Lookup of terms, Standard Dictionary.]
     • Capture process involves:
       • Automatically extracting the schema information and metadata using a specification editor
       • Assigning semantic names to each schema element (tables and fields) to capture their semantics

  12. Architecture Components: The Standard Dictionary • A standard dictionary (SD) provides standardized terms to capture data semantics. • Hierarchy of terms related by IS-A or Has-A links • Contains a base set of common database concepts, but new concepts can be added • An SD term is a single, unambiguous semantic definition. • Several SD entries for a single English word are required if the word has multiple definitions. • The top-level dictionary terms are those proposed by Sowa.
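The term hierarchy can be pictured as a small tree. The sketch below is illustrative only: the `Term` class, the link labels, and the sample terms (Party, Customer, Address) are assumptions made for this example, not Unity's data structures or actual dictionary entries.

```python
class Term:
    """One standard dictionary entry: a single, unambiguous definition."""

    def __init__(self, name, definition, parent=None, link=None):
        self.name = name
        self.definition = definition
        self.parent = parent      # None for a top-level term
        self.link = link          # "IS-A" or "HAS-A" link to the parent
        self.children = []

    def add_child(self, name, definition, link):
        child = Term(name, definition, parent=self, link=link)
        self.children.append(child)
        return child

    def is_a(self, other):
        """True if `other` is reached by following only IS-A links upward."""
        node = self
        while node is not None:
            if node is other:
                return True
            if node.link != "IS-A":
                return False
            node = node.parent
        return False


# Hypothetical entries, for illustration only.
party = Term("Party", "a person or organization")
customer = party.add_child("Customer", "a party that places orders", "IS-A")
address = customer.add_child("Address", "a postal location", "HAS-A")
```

Because the dictionary is a tree, relating two terms is a walk toward the root, which is the kind of simplicity the slides attribute to this organization.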

  13. Architecture Components: Dictionary vs. Knowledge Base • The standard dictionary differs from a knowledge base such as Cyc because: • Not intended to be a general English dictionary or contain knowledge facts about the world • Dictionary is evolved as new terms are required • Not all English words are used • Dictionary provides the system with no “knowledge” • Since no facts are stored, system cannot deduce new facts • Dictionary terms are just semantic placeholders; integrators, not the system, determine the semantics of the database • Simplified organization • Dictionary is organized as a tree for efficiency and simplicity in determining related concepts • Re-use of terms • Terms are re-used in semantic names

  14. Architecture Components: Using the Standard Dictionary
     • SD terms are used to build semantic names describing the semantics of schema elements.
     • Semantic names have the form:
       • semantic name := [CT_Type] | [CT_Type] CN
       • CT_Type := CT | CT {; CT} | CT {, CT}
       • CT := context term, CN := concept name
       • each CT and CN is a single term from the SD
     • Semantic names are included in specifications describing a database.
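The grammar above is small enough to parse with one regular expression. The following is a minimal sketch, assuming that a name uses either ";" or "," between context terms but not both; `SemanticName` and `parse_semantic_name` are hypothetical names, and the parser does not check terms against a standard dictionary.

```python
import re
from typing import NamedTuple, Optional


class SemanticName(NamedTuple):
    contexts: tuple            # context terms (CT), in order
    separator: Optional[str]   # ";" or ",", or None for a single CT
    concept: Optional[str]     # concept name (CN), or None


_PATTERN = re.compile(r"^\[([^\[\]]+)\]\s*(.*)$")


def parse_semantic_name(text):
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a semantic name: {text!r}")
    inside, concept = match.group(1), match.group(2).strip()
    if ";" in inside and "," in inside:
        raise ValueError("context terms mix ';' and ','")
    separator = ";" if ";" in inside else "," if "," in inside else None
    parts = inside.split(separator) if separator else [inside]
    contexts = tuple(part.strip() for part in parts)
    if any(not term for term in contexts):
        raise ValueError("empty context term")
    return SemanticName(contexts, separator, concept or None)


field_name = parse_semantic_name("[Order;Customer;Address] City")
table_name = parse_semantic_name("[Order]")
```

Here `field_name` carries three context terms plus the concept name `City`, while `table_name` has a single context term and no concept name.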

  15. Northwind & Southstorm Integration Example

  16. Northwind & Southstorm Integration Example (2)

  17. Integration Example (3)

  18. Northwind & Southstorm Integration Example (4)

  19. What is a semantic name?
     • A semantic name is a universal, semantic identifier in a domain.
       • Similar to a field name in the Universal Relation.
       • Semantics are guaranteed unique by construction.
       • System has mechanism for comparing semantics across domains even though it does not understand them. (Exploiting semantics in English words.)
     • Important definitions:
       • context - a semantic name is a context if it maps to a table
       • concept - a semantic name is a concept if it maps to a field
       • context closure - of semantic name Si, denoted Si*, is the set of semantic names produced by taking ordered subsets of the terms of Si = {T1, T2, …, TN} starting with T1.
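One way to read the context closure definition is as the set of prefixes of the term list, each beginning at T1. The sketch below assumes that reading, and also assumes that the concept name, when present, counts as the final term; the function name `context_closure` is hypothetical.

```python
def context_closure(contexts, concept=None):
    """Return the closure as a list of semantic names, shortest first.

    Assumes the closure is the set of prefixes of the context terms,
    plus the full semantic name when a concept name is given.
    """
    if not contexts:
        raise ValueError("a semantic name needs at least one context term")
    closure = []
    for n in range(1, len(contexts) + 1):
        closure.append("[" + ";".join(contexts[:n]) + "]")
    if concept is not None:
        closure.append(closure[-1] + " " + concept)
    return closure


city_closure = context_closure(["Order", "Customer", "Address"], "City")
# city_closure == ["[Order]", "[Order;Customer]",
#                  "[Order;Customer;Address]", "[Order;Customer;Address] City"]
```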

  20. Architecture Components: X-Specs • Database metadata and semantic names are combined into specifications called X-Specs: • Stored and transmitted using XML • Contains information on a relational schema • Organized into database, table, and field levels • Stores semantic names to describe and integrate schema elements

  21. Southstorm X-Spec
     <?xml version="1.0" ?>
     <Schema name="Southstorm_xspec.xml"
             xmlns="urn:schemas-microsoft-com:xml-data"
             xmlns:dt="urn:schemas-microsoft-com:datatypes">
       <ElementType name="[Order]" sys_name="Orders_tb" sys_type="Table">
         <element type="[Order] Id" sys_name="Order_num" sys_type="Field"/>
         <element type="[Order] Total Amount" sys_name="Order_total" sys_type="Field"/>
         <element type="[Order;Customer] Name" sys_name="Cust_name" sys_type="Field"/>
         <element type="[Order;Customer;Address] Address Line 1" sys_name="Cust_address" sys_type="Field"/>
         <element type="[Order;Customer;Address] City" sys_name="Cust_city" sys_type="Field"/>
         <element type="[Order;Customer;Address] Postal Code" sys_name="Cust_pc" sys_type="Field"/>
         <element type="[Order;Customer;Address] Country" sys_name="Cust_country" sys_type="Field"/>
         <element type="[Order;Product] Id" sys_name="Item1_id" sys_type="Field"/>
         <element type="[Order;Product] Quantity" sys_name="Item1_quantity" sys_type="Field"/>
         <element type="[Order;Product] Price" sys_name="Item1_price" sys_type="Field"/>
         <element type="[Order;Product] Id" sys_name="Item2_id" sys_type="Field"/>
         <element type="[Order;Product] Quantity" sys_name="Item2_quantity" sys_type="Field"/>
         <element type="[Order;Product] Price" sys_name="Item2_price" sys_type="Field"/>
       </ElementType>
     </Schema>
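An X-Spec in this form can be read with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` on a shortened excerpt of the Southstorm X-Spec to build a table from semantic names to system names; `load_mappings` and the shape of its result are assumptions for illustration, not Unity's API.

```python
import xml.etree.ElementTree as ET

# Default namespace declared by the X-Spec's Schema element.
NS = {"x": "urn:schemas-microsoft-com:xml-data"}

# Shortened excerpt of the Southstorm X-Spec.
XSPEC = """<Schema name="Southstorm_xspec.xml"
        xmlns="urn:schemas-microsoft-com:xml-data"
        xmlns:dt="urn:schemas-microsoft-com:datatypes">
  <ElementType name="[Order]" sys_name="Orders_tb" sys_type="Table">
    <element type="[Order] Id" sys_name="Order_num" sys_type="Field"/>
    <element type="[Order;Customer] Name" sys_name="Cust_name" sys_type="Field"/>
    <element type="[Order;Product] Id" sys_name="Item1_id" sys_type="Field"/>
    <element type="[Order;Product] Id" sys_name="Item2_id" sys_type="Field"/>
  </ElementType>
</Schema>"""


def load_mappings(xspec_text):
    """Map each semantic name to a list of (table, field) system names.

    Table-level entries use None in the field position.
    """
    root = ET.fromstring(xspec_text)
    mappings = {}
    for table in root.findall("x:ElementType", NS):
        table_name = table.get("sys_name")
        mappings.setdefault(table.get("name"), []).append((table_name, None))
        for fld in table.findall("x:element", NS):
            entry = (table_name, fld.get("sys_name"))
            mappings.setdefault(fld.get("type"), []).append(entry)
    return mappings


mappings = load_mappings(XSPEC)
```

Note that `[Order;Product] Id` maps to two fields (`Item1_id` and `Item2_id`), which is why a semantic name has to carry a list of mappings rather than a single one.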

  22. Integration Product: The Integrated Context View • The product of the integration is a structurally-neutral hierarchy of concepts called an integrated context view. • Define a context view (CV) as follows: • If a semantic name Si is in CV, then for any Sj in Si*, Sj is also in CV. • For each semantic name Si in CV, there exists a set of zero or more mappings Mi that associate a schema element Ej with Si. • A semantic name Si can only occur once in the CV. • A context view (CV) is a valid Universal Relation. • Each field is assigned a semantic name which uniquely identifies its semantic connotation.
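The three properties of a context view map naturally onto a dictionary keyed by semantic name. The sketch below is one possible construction, not the integration algorithm itself: it assumes the prefix reading of the context closure, treats mappings as (database, schema element) pairs, and assumes Northwind's order id carries the semantic name `[Order] Id`. The function names are hypothetical.

```python
def closure(name):
    """Context closure of a semantic name written as "[T1;T2] CN"."""
    head, _, concept = name.partition("]")
    terms = [term.strip() for term in head.lstrip("[").split(";")]
    names = ["[" + ";".join(terms[:n]) + "]" for n in range(1, len(terms) + 1)]
    if concept.strip():
        names.append(names[-1] + " " + concept.strip())
    return names


def build_context_view(elements):
    """elements: iterable of (database, semantic_name, schema_element)."""
    view = {}
    for database, name, element in elements:
        members = closure(name)
        for member in members:
            # Closure property; dictionary keys keep each name unique.
            view.setdefault(member, [])
        # Zero or more mappings per semantic name.
        view[members[-1]].append((database, element))
    return view


view = build_context_view([
    ("Southstorm", "[Order] Id", "Orders_tb.Order_num"),
    ("Northwind", "[Order] Id", "Orders.OrderID"),  # assumed semantic name
    ("Southstorm", "[Order;Customer] Name", "Orders_tb.Cust_name"),
])
```

In the result, `[Order;Customer]` is present with no mappings: it exists only because the closure property requires it, which illustrates why a mapping set may be empty.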

  23. Northwind & Southstorm Integration Example

  24. Architecture Components: The Query Processor • The query processor: • Allows the user to formulate queries on the view. • Translates from semantic names in the context view to structural queries (SQL) on databases. • Involves determining correct field and table mappings and discovery of join conditions and join paths • Retrieves query results and formats them for display to the user. • Client-side query processing: • Perform joins between databases using common keys. • Data value formatting and transformation
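The translation step can be sketched as a lookup of field mappings followed by insertion of any known join condition between the tables involved. The mapping and join tables below are hand-written assumptions modeled on the Northwind query in the query examples; a real query processor would derive them from X-Specs, and `to_sql` is a hypothetical name.

```python
# Assumed mappings for Northwind: semantic name -> (table, field).
FIELD_MAP = {
    "[Order] Id": ("Orders", "OrderID"),
    "[Customer] Name": ("Customers", "CompanyName"),
}

# Assumed join conditions, keyed by the pair of tables they connect.
JOINS = {
    frozenset({"Orders", "Customers"}):
        "Orders.CustomerID = Customers.CustomerID",
}


def to_sql(semantic_names):
    """Build a SELECT statement for the requested semantic names."""
    columns, tables = [], []
    for name in semantic_names:
        table, column = FIELD_MAP[name]
        columns.append(f"{table}.{column}")
        if table not in tables:
            tables.append(table)
    # Add every join condition whose tables are all part of the query.
    conditions = [cond for pair, cond in JOINS.items() if pair <= set(tables)]
    sql = f"SELECT {', '.join(columns)} FROM {', '.join(tables)}"
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql


query = to_sql(["[Order] Id", "[Customer] Name"])
```

This sketch handles only directly joined tables; discovering a join path through intermediate tables, which the slides also mention, would need a search over the join graph.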

  25. Advanced Query Processing • Advanced query processor features include: • global keys and joins - a mechanism for specifying when a field stores a global key such as a social security number. • result normalization - a procedure for normalizing query results returned from each individual database. (e.g. Southstorm) • data integration - transforming data representational conflicts at the global level. • For example, “M” and “F” may represent “Male” and “Female” in one database, and another may represent these concepts using “0” and “1”.
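The data integration example can be sketched as a per-database lookup table applied before results are merged. The database labels `db_a` and `db_b`, the sample rows, and the choice of "0" for Male and "1" for Female are assumptions; the slide does not say which digit stands for which value.

```python
# Per-database translation of local codes to a global representation.
VALUE_MAPS = {
    "db_a": {"M": "Male", "F": "Female"},
    "db_b": {"0": "Male", "1": "Female"},  # assumed assignment of digits
}


def to_global(database, value):
    """Translate a local code; unknown codes pass through unchanged."""
    return VALUE_MAPS[database].get(str(value), value)


def integrate(rows_by_database):
    """Merge (key, code) rows from several databases into global values."""
    merged = []
    for database, rows in rows_by_database.items():
        for key, code in rows:
            merged.append((key, to_global(database, code)))
    return merged


# Made-up sample rows, for illustration only.
merged = integrate({
    "db_a": [(1, "M"), (2, "F")],
    "db_b": [(3, 0), (4, "1")],
})
```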

  26. Northwind & Southstorm Query Examples
     • Example 1: Retrieve all order ids ([Order] Id) and customers ([Customer] Name):
       • SS: SELECT Order_num, Cust_name FROM Orders_tb
       • NW: SELECT OrderID, CompanyName FROM Orders, Customers WHERE Orders.CustomerID = Customers.CustomerID
     • Example 2: Retrieve all ordered products ([Order;Product] Id) and their order ids:
       • SS: SELECT Order_num, Item1_id, Item2_id FROM Orders_tb
       • NW: SELECT OrderID, ProductID FROM OrderDetails
     • Note: In NW, the query processor selects from two different order id mappings. In SS, result normalization is required.
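Result normalization for Southstorm can be sketched as unpivoting the repeated item columns so that each row holds one order id and one product id, matching the shape of the Northwind result. The function name and the sample rows are made up for illustration; only the column names come from the Southstorm query above.

```python
def normalize_orders(rows, slots=("Item1_id", "Item2_id")):
    """Turn one row per order into one (order id, product id) row per item.

    rows: list of dicts keyed by Southstorm column names.
    Empty item slots (None) are skipped.
    """
    normalized = []
    for row in rows:
        for slot in slots:
            product_id = row.get(slot)
            if product_id is not None:
                normalized.append((row["Order_num"], product_id))
    return normalized


# Made-up sample result of the Southstorm query in Example 2.
southstorm_rows = [
    {"Order_num": 1, "Item1_id": 10, "Item2_id": 20},
    {"Order_num": 2, "Item1_id": 30, "Item2_id": None},
]
normalized = normalize_orders(southstorm_rows)
```

After this step the Southstorm and Northwind results share the same two-column layout and can be combined directly.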

  27. Integration Example: Discussion • Important points: • System table and field names are not presented to the user, who queries by semantic name. • Database structure is not shown to the user. • Field and table mappings are automatically determined based on X-Spec information. • Join conditions are inserted as needed when available to join tables. • Different physical representations for the same concept are combined. • Hierarchically related concepts are combined based on their IS-A relationship in the standard dictionary.

  28. Unity Overview • Unity is a software package that implements the integration architecture with a GUI. • Developed using Microsoft Visual C++ 6 and Microsoft Foundation Classes (MFC). • Unity allows the user to: • Construct and modify standard dictionaries • Build X-Specs to describe data sources • Integrate X-Specs into an integrated view • Transparently query integrated systems using ODBC and automatically generate SQL transactions • Unity is available for demonstration and distribution.

  29. Architecture Discussion • The architecture automatically integrates relational schemas into a multidatabase. • Desirable properties: • Individual mappings - information sources integrated one-at-a-time and independently • Integrated view constructed for query transparency - user queries system by semantics instead of structure • Handles schema conflicts - including semantic, structural, and naming conflicts • Automated integration - integrated view constructed efficiently and automatically • No wrapper or mediator software is required • Transparent querying - users issue semantic queries which are translated to SQL by the query processor

  30. Contributions • Architecture contributions: • Has a unique application of a standard dictionary which is not a knowledge base • Separates the capture and integration processes • Produces an integrated, high-level view of all concepts in the underlying databases • Allows transparent querying without structure • Provides algorithms for dynamically extracting database data (creating relevant views) and for mediation of global level conflicts • Arguably simpler method for capturing data semantics than using description logic • An implementation, Unity, which demonstrates the practical benefits of the architecture

  31. Conclusions & Future Work • Automatic database integration is possible by using a standard term dictionary and defining semantic names for schema elements. • Users are able to transparently query integrated systems by concept instead of structure. • We are constantly refining Unity. Future work: • Develop an integration component for a web browser. • Test the system in large industrial projects. • Allow distributed updates and global updates on all databases.

  32. References • Publications: • Unity - A Database Integration Tool, R. Lawrence and K. Barker, TRLabs Emerging Technology Bulletin, January 2000. • Multidatabase Querying by Context, R. Lawrence and K. Barker, DataSem2000, pages 127-136, Oct. 2000. • Integrating Relational Database Schemas using a Standardized Dictionary, SAC’2001 - ACM Symposium on Applied Computing, March, 2001. • Sponsors: • NSERC, TRLabs • Further Information: • http://www.cs.umanitoba.ca/~umlawren/
