autojoin providing freedom from specifying joins n.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
AutoJoin: Providing Freedom from Specifying Joins PowerPoint Presentation
Download Presentation
AutoJoin: Providing Freedom from Specifying Joins

Loading in 2 Seconds...

play fullscreen
1 / 52

AutoJoin: Providing Freedom from Specifying Joins - PowerPoint PPT Presentation


  • 95 Views
  • Uploaded on

AutoJoin: Providing Freedom from Specifying Joins. Terrence Mason ( terrence-mason@uiowa.edu ) Lixin Wang ( lixin-wang@uiowa.edu ) Dr. Ramon Lawrence ( ramon-lawrence@uiowa.edu ) Iowa Database and Emerging Application Laboratory University of Iowa.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'AutoJoin: Providing Freedom from Specifying Joins' - mayes


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
autojoin providing freedom from specifying joins

AutoJoin: Providing Freedom from Specifying Joins

Terrence Mason (terrence-mason@uiowa.edu)

Lixin Wang (lixin-wang@uiowa.edu)

Dr. Ramon Lawrence (ramon-lawrence@uiowa.edu)

Iowa Database and Emerging Application Laboratory

University of Iowa

7th International Conference on Enterprise Information Systems

ICEIS 2005 Miami, Florida

presentation outline
Presentation Outline
  • Define Query Inference
  • Query Languages that require Inference
  • AutoJoin Architecture
    • Join Graph represent a schema
    • Queries and Query Interpretations on a Join Graph
  • Pre-compute maximal join trees
    • Algorithm EMO
  • Query time processing – Example
  • Performance Evaluation
query inference problem new languages

Query Inference Problem New Languages

The query inference problemrequires enumerating and ranking query interpretations of a query such that the query interpretation desired by the user is among the highest ranked interpretations.

motivation for query inference
Motivation for Query Inference
  • State of the art query languages require it
    • Keyword Search – automatically relate keywords across relations of a schema
    • Conceptual Queries – Concepts mapped to database must be related
    • Natural Language Queries
      • Natural language query mapped to concepts
      • Relate concepts as in Conceptual Queries
  • Current approaches not scalable
    • Tied to specific language
    • Or conceptual model
motivation for query inference1
Motivation for Query Inference
  • Reduces to graph problem
    • Connect relations (nodes) with joins (edges)
    • Exponential solutions for highly connected graphs (database graphs less connected)
  • Approaches to join determination
    • Grow all ways
      • Universal Relation (Maier and Ullman, 1983)
      • Discover (Keyword) (Hristidis and Papakonstantinou, 2002, 2003, 2004)
    • Shortest Paths
      • CQL Conceptual Query Language (Owei and Navathe, 2001)
    • Limited Interpretations
      • Steiner Tree (2-Trees) (Wald and Sorenson, 1984)
      • Limit number of joins and interpretations (Zhang et al., 1999)
    • Query time find spanning trees of keywords
      • DBXplorer Keyword Search (Agrawal et al. 2002)
goal of autojoin
Goal of AutoJoin
  • Consistent, Scalable Inference Engine
    • Abstract database schema from users
    • Automatically determine joins to relate relations and attributes
    • Consistent approach to handle ambiguity in queries
    • Efficient algorithm to pre-compute potential joins
    • Minimal overhead at query time
    • Demonstrate efficiency and scalability
    • Structured on relational model without any required conceptual models
example query on tpc h schema
Example Query on TPC-H Schema

English Query:

List all parts ordered by Customers

in the United States.

  • Attribute-only SQL
  • Determine Joins with AutoJoin
  • New formulation for Query Inference problem.
slide8

TPC-H Schema

TPC-H BENCHMARK™ (http://www.tpc.org/)

List all parts ordered by Customers in the United States.

slide9
Attribute-only Query:

Select Part.Name where Nation.Name=‘United States’;

  • Part.Name - name attribute in Part Table
  • Nation.Name – name attribute in Nation Table
  • Select and where similar to SQL
  • No From clause or joins specified

Keyword Query:

Part ‘United States’

  • Maps Part toPart relation
  • Maps‘United States’totuple in Nationrelation
  • Nojoins specified
sql query select part name where nation name united states
SQL QuerySelect Part.Name where Nation.Name = ‘United States’;

SELECT P.name

FROM part P, nation N, partsupp PS, lineitem LI,

orders O, customer C

WHERE N.name = ‘United States’

And P.partkey = PS.partkey

And PS.partkey = LI.partkey

And PS.suppkey = LI.suppkey

And O.custkey = C.custkey

And C.nationkey = N.nationkey

And LI.orderkey = O.orderkey;

Specified Joins and

Tables

slide11

AutoJoin Architecture

User

Relational

Database

Execute Queries

Query Interface

Inference Request

Interpretations

Query Builder

Iterator

Generator

Ranker

XML

Document

Loader

AutoJoin Inference Engine

representing joins of a schema join graph
Representing Joins of a SchemaJoin Graph
  • Graph representation of relational schema
  • Nodes
    • Relations in schema
  • Directed Edges
    • Foreign key constraint between relations
      • Edges directed from N to 1 cardinality of relationships
      • Maintain Lossless property (No spurious tuples on joins)
create join graph tpc h
Create Join Graph TPC-H

Tables as Nodes

Line

Item

Part

Supp

Part

Supplier

Order

Nation

Customer

Region

pre compute maximal join trees
Pre-compute Maximal Join Trees
  • EMO Algorithm on Join Graph
    • Efficiently computes all Trees
    • Executes where previous strategy failed
    • Direction of edges results in lossless join trees
  • Pre-computed
    • Executed once prior to query time
    • Structures built for query time performance
compute lossless joins
Compute Lossless Joins
  • Maximal sets of lossless joins
  • Ambiguity inherent in the schema
  • Two types of ambiguity:
    • Single relation that plays multiple roles
      • Node with more than one incoming edge in join graph
    • Multiple semantic relationships between entities
      • Strongly connected components greater than one node
creation of maximal join trees lossless joins
Creation of Maximal Join TreesLossless Joins
  • Efficient Algorithm EMO
    • Determine all reachable graphs from nodes that may be a root for Maximal Set of Lossless Joins
    • Identify all Strong Connected Components (SCC)
    • For each SCC
      • If SCC is single node and no incoming edges, create reachable graph from this node
      • If SCC has multiple nodes, for each node in SCC with no incoming edges that are not part of SCC create reachable graph.
    • For each reachable graph find all spanning trees
    • Spanning trees represent Maximal Join Trees
maximal join trees of tpc h
Maximal Join Trees of TPC-H
  • LineItem is the only root for a reachable graph.
    • No strongly connected components
  • Join graph is reachable graph
  • Enumerate spanning trees on original graph
  • Remove shortcut joins and re-compute
slide18

TPC-H Join Graph

Line

Item

Part

Supp

Order

Part

Supplier

Nation

Customer

Region

slide19

TPC-H Maximal Join Trees

Line

Item

Line

Item

Line

Item

Line

Item

Part

Supp

Part

Supp

Part

Supp

Part

Supp

Supplier

Order

Supplier

Order

Supplier

Order

Part

Part

Supplier

Order

Part

Part

Nation

Customer

Nation

Customer

Nation

Customer

Nation

Customer

4

1

3

Region

Region

2

Region

Region

Line

Item

Line

Item

Line

Item

Line

Item

Part

Supp

Part

Supp

Part

Supp

Part

Supp

Supplier

Order

Supplier

Order

Supplier

Order

Part

Part

Supplier

Order

Part

Part

Nation

Customer

Nation

Customer

Nation

Customer

Nation

Customer

8

6

7

5

Region

Region

Region

Region

shortcut joins
Shortcut Joins
  • Semantically equivalent join paths
    • A shortcut join is a join that is semantically equivalent to a longer join path
    • Core join path (longer) preserved in join graph
    • Shortcut join removed for join determination
      • Appears to be a semantically different interpretation of the query
      • Substituted back into query
        • No nodes on core path in query (faster) execution)
  • TPC-H has two shortcut joins
slide21

TPC-H Join Graph

Remove Shortcut Joins

Line

Item

Red – Shortcut Joins

Part

Supp

Order

Part

Supplier

Nation

Customer

Region

slide22

Original TPC-H Maximal Join Trees

Line

Item

Line

Item

Line

Item

Line

Item

Part

Supp

Part

Supp

Part

Supp

Part

Supp

Supplier

Order

Supplier

Order

Supplier

Order

Part

Part

Supplier

Order

Part

Part

Nation

Customer

Nation

Customer

Nation

Customer

Nation

Customer

4

1

3

Region

Region

2

Region

Region

Line

Item

Line

Item

Line

Item

Line

Item

Part

Supp

Part

Supp

Part

Supp

Part

Supp

Supplier

Order

Supplier

Order

Supplier

Order

Part

Part

Supplier

Order

Part

Part

Nation

Customer

Nation

Customer

Nation

Customer

Nation

Customer

8

6

7

5

Region

Region

Region

Region

slide23

TPC-H Semantically Unique Maximal Join Trees

Line

Item

Line

Item

Part

Supp

Part

Supp

Part

Supplier

Order

Order

Part

Supplier

Nation

Customer

Nation

Customer

Region

Region

1

2

query and query interpretation autojoin
Query and Query Interpretation AutoJoin
  • Join Graphs
  • Query:
    • Sub-graph of the join graph
    • Nodes and (optionally) edges
      • Not connected requires inference
  • Query Interpretation:
    • Connected sub-graph of the join graph
    • Includes all specified nodes and edges
example query
Example Query

SELECT Part.Name

WHERE Nation.Name = ‘United States’;

  • Relate Part.Name to Nation.Name
    • Part and Nation Nodes.
  • QueryofPart and Nation nodes to AutoJoin.
  • The query is ambiguous
    • More than one query interpretation
    • Nation relates to Supplier and Customer
  • Return the query with fewest joins first
efficient query time execution
Efficient Query Time Execution
  • Find maximal join trees with query nodes
    • Reverse index - relation to its set of join trees
    • Intersect lists
  • Build Interpretations
    • Least common ancestor (vs. recursive prune)
    • Pre-compute ancestor lists
  • No lossless interpretations (no trees)
    • Find lossy interpretation
  • Rank interpretations by cost function

maximal sets of lossless joins

slide27

Both Trees Contain Query Nodes

Select Part.Name where Nation.Name = ‘United States’;

Line

Item

Line

Item

Part

Supp

Part

Supp

Part

Supplier

Order

Order

Part

Supplier

Nation

Customer

Nation

Customer

Region

Red – Target Nodes

Region

1

2

slide28

Query Processing

Line

Item

Line

Item

Part

Supp

Part

Supp

Part

Supplier

Order

Order

Part

Supplier

Nation

Customer

Nation

Customer

Red – Target Nodes

Blue – Tree Nodes

Gray – Nodes to Prune

Region

Region

1

2

slide29

Query Interpretations

Select Part.Name where Customer.Nation.Name = ‘United States’;

Select Part.Name where Supplier.Nation.Name = ‘United States’;

Line

Item

Part

Supp

Part

Supplier

Part

Supp

Part

Order

Nation

Nation

Customer

1

2

slide30

Unambiguous Query

Select Supplier.Name where Order.Id = 73;

Line

Item

Line

Item

Part

Supp

Part

Supp

Part

Supplier

Order

Order

Part

Supplier

Nation

Customer

Nation

Customer

Region

Red – Target Nodes

Region

1

2

slide31

Query Processing

Select Supplier.Name where Order.Id = 73;

Line

Item

Line

Item

Part

Supp

Part

Supp

Part

Supplier

Order

Order

Part

Supplier

Nation

Customer

Nation

Customer

Red – Target Nodes

Blue – Tree Nodes

Gray – Nodes to Prune

Region

Region

1

2

slide32

Query Interpretations

Select Supplier.Name where Order.Id = 73;

Line

Item

Line

Item

Part

Supp

Part

Supp

Supplier

Order

Supplier

Order

1

2

slide33

The Unambiguous Query Interpretation

Select Supplier.Name where Order.Id = 73;

Line

Item

Part

Supp

Supplier

Order

additional interpretations lossy joins
Additional InterpretationsLossy Joins
  • Related through a node involved in two distinct roles
    • Two maximal join trees contain all query nodes and have at least one node in common
    • Union maximal join trees
    • Common nodes provide relation for trees.
    • Interpretation where node will have two incoming edges
      • No longer lossless
    • Example Customer and Supplier related through Nation in TPC-H.
      • Cross products of Customers and Suppliers with the same nation
beyond natural joins
Beyond Natural Joins
  • Theta joins
    • Merge the two nodes related by theta join into single node and re-compute maximal objects.
    • Expand this node for final query interpretation with theta join
  • Tuple Variables
    • A query interface may specify tuple variables
    • Additional nodes and edges will be added to join graph to complete the query interpretations
performance experiments
Performance Experiments
  • Broad Range of Schemas
    • caBIO (NCI) 149 relations, 213 joins, and 1253 maximal join trees
    • TPC-H Standard Database
      • Inferred standard queries (21 specified queries)
      • Ambiguity reduced by removing shortcut joins
    • Tenant – 9 nodes, 50 joins, and 1286 maximal join trees
peformance results
Peformance Results
  • Time to generate all Maximal Join Trees
    • Handles schemas where previous method failed
    • Worst test 2.7 seconds
    • Average < 1 second
  • Reduce Ambiguity
    • Removing shortcut joins reduces ambiguity
    • Increased number of unambiguous query
      • From 45% to 68% for TPC-H Benchmark Queries
  • Minimal overhead of inference at query time
    • Average < 1 millisecond
    • Worst test 7.4 milliseconds
autojoin conclusions
AutoJoin Conclusions
  • Scalable inference engine
  • Efficiently pre-compute maximal join trees
  • Reduced ambiguity by removing shortcut joins
  • Overhead is minimal
  • Complex queries can be inferred
  • Built directly on relational model
future work
Future Work
  • Develop a query language
    • Remove requirement of understanding the underlying schema
    • Automatically determines joins
  • End user interface based on AutoJoin
  • Query inference for integration systems.
query inference previous

Query Inference(Previous)

The translation of a query in a query language into an unambiguous representation of the query

[Wald and Sorenson, 1984]

universal relation
Universal Relation
  • First model to require query inference
  • Maximal Objects (Maier and Ullman, 1983)
    • Lossless Join property to identify potential joins
    • Grows all ways on hyper-graph
    • Returns a union of all query interpretations
  • Minimum Directed Cost Steiner Tree (Wald and Sorenson, 1984)
    • Limited to Partial 2-Trees
    • Returns only lowest cost query interpretation
  • Generate a single interpretation
    • Do not meet need of new query languages
    • Limited query interpretations possible
state of the art query languages
State of the Art Query Languages
  • Keyword Searches
    • Keywords map to either specific data, attribute names, or relation names in a database.
    • Must identify joins to relate keywords spread across multiple relations.
    • Multiple approaches to identifying the top-k relationships between keywords.
keyword search t op k relationships
Keyword SearchTop-K Relationships
  • Discover (Hristidis and Papakonstantinou, 2002, 2003, 2004)
    • Grow all ways from a keyword
    • Limit on number of joins
    • Creates extra graphs
  • DBXplorer (Agrawal et al. 2002)
    • Generates spanning trees at query time
  • BANKS ( )
    • Graph of all tuples related by joins
    • Must fit in memory (limited to smaller databases)
state of the art query languages1
State of the Art Query Languages
  • Conceptual Query Languages or Models
    • Queries built with concepts that map to a database.
    • Remove the burden of knowledge of the schema.
    • Must determine joins to relate concepts in query.
    • Use conceptual model to determine joins
conceptual query languages
Conceptual Query Languages
  • CQL (Owei and Navathe, 2001)
    • Queries may include roles or joins required for a query
    • Pathfinder algorithm for completing the query
      • Based on shortest path between source and target concepts in query
      • Semantically Constrained ER Diagram as a graph used to determine joins.
  • Conceptual Model (Zhang et al., 1999)
    • Semantic graph of database
    • Search algorithm constrained by number of joins or number of interpretations
state of the art query languages2
State of the Art Query Languages
  • Natural Language Queries
    • Natural language queries map the language to concepts in a database
    • Joins must be determined to relate concepts in database similar to Conceptual Query Languages
slide52

TPC-H Join Graph

Line

Item

Part

Supp

Order

Supplier

Part

Nation

Customer

Region