Interactive query formulation over web service accessed sources
Download
1 / 31

Interactive Query Formulation over Web Service-Accessed Sources - PowerPoint PPT Presentation


  • 80 Views
  • Uploaded on

Interactive Query Formulation over Web Service-Accessed Sources. Michalis Petropoulos Alin Deutsch Yannis Papakonstantinou. ACM SIGMOD, June 2006. Large-Scale Data Integration Systems. . Web Domain. Web Forms & Reports. CNET’s Top Combinations CNET’s Search Desktops

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about ' Interactive Query Formulation over Web Service-Accessed Sources' - elvis-jensen


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Interactive query formulation over web service accessed sources

Interactive Query Formulationover Web Service-Accessed Sources

Michalis Petropoulos

Alin Deutsch

Yannis Papakonstantinou

ACM SIGMOD, June 2006


Large scale data integration systems
Large-Scale Data Integration Systems

Web

Domain

Web Forms

& Reports

  • CNET’s Top Combinations

  • CNET’s Search Desktops

  • PCWorld’s Product Finder

End User

Application

Domain

  • CNET Computer

  • PCWorld Portals

Developer

Application

Application

Integration

Domain

Mediator

Integrated

Schema

Compatible Combinations

of Computers, Routers

and Printers

Integration

Engineer

Source

Domain

  • Dell Computers by CPU

  • Cisco Routers by Rate

  • HP Printers by Speed

Web

Service

Web

Service

Web

Service

Source

Owner

Data

Source

Data

Source

  • Dell Computers

  • Cisco Routers

  • HP Printers

Source

Schema

Source

Schema


Large scale data integration systems1
Large-Scale Data Integration Systems

Web

Domain

Web Forms

& Reports

End User

Application

Domain

What queries can the

mediator answer for me?

CLIDE

Developer

Application

Application

Integration

Domain

Mediator

Integrated

Schema

Integration

Engineer

Source

Domain

Web

Service

Web

Service

Web

Service

Source

Owner

Data

Source

Data

Source

Source

Schema

Source

Schema


Running example
Running Example

Parameterized Views

Schema

Computers(cid, cpu, ram, price)

NetCards(cid, rate, standard, interface)

Views

V1ComByCpu(cpu)  (Computer)*

SELECT DISTINCT Com1.*

FROM Computers Com1

WHERE Com1.cpu=cpu

V2 ComNetByCpuRate(cpu, rate) 

(Computer, NetCard)*

SELECT DISTINCT Com1.*, Net1.*

FROM Computers Com1, Network Net1

WHERE Com1.cid=Net1.cid

AND Com1.cpu=cpu

AND Net1.rate=rate

Dell

Cisco

Schema

Routers(rate, standard, price, type)

Views

V3 RouWired()  (Router)*

SELECT DISTINCT Rou1.*

FROM Routers Rou1

WHERE Rou1.type='Wired'

V4 RouWireless()  (Router)*

SELECT DISTINCT Rou1.*

FROM Routers Rou1

WHERE Rou1.type='Wireless'

Wired Routers

Computers

for a givencpu

Wireless Routers

Computers & NetCards

for a givencpu & rate

Conjunctive Queries CQ

  • Equality & Comparison Conditions

  • Parameters


Running example1
Running Example

Integrated Schema

Developer

Application

Mediator

Integrated

Schema

  • Integrated schema puts togetherthe Dell and Cisco schemas

    Attribute Associations

  • (Computers.cid, NetCards.cid)

  • (NetCards.rate, Routers.rate)

  • (NetCards.standard, Routers.standard)

V1

V2

V3

V4

Dell

Cisco


Sophisticated mediators make feasible queries hard to predict
Sophisticated Mediators MakeFeasible Queries Hard to Predict

Feasible Queries FQ

  • Equivalent CQ query rewritings using the views

  • Might involve more than one views

  • Order might matter

Query:

Get all ‘P4’ Computers, together with their NetCards

and their compatible ‘Wireless’ Routers

Query:

Get all Computers

Feasible

Infeasible

E

Mediator

B

D

Mediator

A

C

RouWireless()

ComNetByCpuRate(‘P4’, ‘10’)

V1

ComNetByCpuRate(‘P4’, ‘54’)

V4

V2


Problem
Problem

  • Large number of sources

  • Large number of views (web-services)

  • Mediator capabilities

    Developer formulates an application query

  • Is an application query feasible?

  • If not, how do I know which ones are feasible?

    Previous options:

    • The developer had to browse the view definitions and somehow formulate a feasible query

    • Or formulate queries until a feasible one is found(trial-and-error)

      No system-provided guidance


The clide solution
The CLIDE Solution

CLIDE

  • A query formulation interface, which interactivelyguides the developer toward feasible queries by employing a coloring scheme

Developer

Application

Mediator

Integrated

Schema

V1

V2

V3

V4

Dell

Cisco


Qbe like interfaces
QBE-Like Interfaces

Microsoft SQL-Server


Clide interface
CLIDE Interface

Feasibility Flag

Selection Boxes

  • Table, selection, projection and join actions

  • Feasibility Flag

  • Color-based suggestions


Example interaction
Example Interaction

Snapshot 1

Yellow required action

  • All feasible queries require this action

    White optional action

  • Feasible queries can be formulatedw/ or w/o these actions


Example interaction1
Example Interaction

Snapshot 2

Blue required choice of action

  • At least one of them is required to be taken in order to reach a feasible query

    Join Lines:

  • Only yellow and blue are displayed

  • Must appear in Attribute Associations


  • Clide properties
    CLIDE Properties

    • Completeness of Suggestions

      • Every feasible query can be formulated by performing yellow and blue actions at every step

    • Minimality of Suggestions

      • At every step, only a minimal number of actions is suggested, i.e., the ones that are needed to preserve completeness

    • Rapid Convergence By Following Suggestions

      • The shortest sequence of actions from a query to any feasible query consists of suggested actions


    Example interaction2
    Example Interaction

    Snapshot 3

    • * any other constant

    • Red prohibited action

      • Does not appear in any feasible query

      • Lead to “Dead End” state


    Example interaction3
    Example Interaction

    Snapshot 4

    F

    Mediator

    A

    D

    RouWireless()

    ComNetByCpuRate(‘P4’, rate)

    B

    E

    V4

    V2


    Clide properties1
    CLIDE Properties

    • Completeness of Suggestions

      • Every feasible query can be formulated by performing yellow and blue actions at every step

    • Minimality of Suggestions

      • At every step, only a minimal number of actions is suggested, i.e., the ones that are needed to preserve completeness

    • Rapid Convergence By Following Suggestions

      • The shortest sequence of actions from a query to any feasible query consists of suggested actions


    Interaction graph
    Interaction Graph

    Selection

    Action

    Table

    Action

    Join Action

    Com1

    Com1.ram

    Com1.price

    Com1.cpu=‘P4’

    Net1

    Com1.cid=Net1.cid

    Rou1

    • Nodes are queries

      • One for each qCQ

    • Edges are actions

      • Table, selection, projection and join actions

    • Green nodes are feasible queries

    • Infinitely big structure

      • All CQ queries

      • All possible combinations of actions formulating them


    Interaction graph colors
    Interaction Graph: Colors

    • Yellow action 

      • Every path from current node n to a feasible node contains 

    • Blue action 

      • At least one path from current node n to a feasible node nF contains 

      • There is not path from n to nF that contains a feasible node

    • Red action 

      • No path to a feasible node contains 

    Current

    Node

    Com1.cid

    Com1.cpu

    Current

    Node

    Com1.cid=*

    Com1.cpu=*

    Com1.cpu=*

    Net1

    Com1.cid=Net1.cid

    Com1.cid=Net1.cid

    Net1.rate=’54Mbps’

    Com1.ram=*

    Com1.cid=Net1.cid

    Net1.rate=’54Mbps’

    Net1.rate=’54Mbps’

    Com1.price=*

    Com1.cpu=*

    Rou1

    Rou1

    Com1.cid=Net1.cid

    Net1.rate=Rou1.rate

    Net1

    Rou1

    Com2

    Com2

    Com2

    Com2.cid=Net1.cid

    Com2.cpu=‘P4’

    Net1.rate=‘54Mbps’


    Clide properties2
    CLIDE Properties

    • Rapid Convergence By Following Suggestions

      • The shortest path from the current node to any feasible node consists of suggested actions


    Clide architecture
    CLIDE Architecture

    Actions

    • Back-End invoked every time the user performs an action

      • i.e., the user arrives at a new node in the interactions graph

    Front-End

    User

    Current Query

    Colored Actions + Feasibility Flag

    Back-End

    Color Algorithm

    Seed Queries SQ

    Parameters Algorithm

    Closest Feasible Queries FQC

    Closest Feasible Queries Algorithm

    Aliases Collapse Rule

    Minimal Feasible

    Extension Queries

    Maximally-Contained Rewriter

    Schemas

    Views

    Attribute

    Associations


    Color determined by a finite set of feasible queries
    Color DeterminedBy a Finite Set of Feasible Queries

    Challenge: Infinitely Many Feasible Queries

    ?

    Radius

    n

    Closest Feasible

    Queries FQC

    Solution:

    • Finite Set of Feasible Queries

    • Closest Feasible Queries FQC

    • FQC is sufficient to color actions

    Challenge: How far can the Closest Feasible Queries FQC be?


    Closest feasible queries fq c algorithm
    Closest Feasible Queries FQC Algorithm

    Solution: Based on Maximally Contained Queries FQMC

    Maximally Contained

    Queries FQMC

    pL Radius

    n

    Closest Feasible

    Queries FQC

    • Compute maximally contained queries FQMC

    • Radius pL is the longest path to a maximally contained query

    • Theorem: All FQC queries are reachable via a path of length p  pL

    Challenge: The set of nodes within pL can be too many to explore


    Closest feasible queries fq c algorithm1
    Closest Feasible Queries FQC Algorithm

    Solution: Find the Closest Feasible Queries Directly

    Maximally Contained

    Queries FQMC

    n

    Closest Feasible

    Queries FQC

    More feasible nodes

    • Theorem: All queries in FQMC are in FQC

    • But not all queries in FQC are in FQMC

    • Collapse Aliases to compute FQC \ FQMC


    Remaining issues of the algorithm
    Remaining Issues of the Algorithm

    • Color remaining actions white or red

    • Handling projections

    • Dealing with parameterized queries


    Clide implementation issues
    CLIDE Implementation Issues

    • Maximally contained queries need to be minimized(affect CLIDE’s rapid convergence and minimality)

       Implemented the minimization module from scratch

       MiniCon computes mappings between the query and the views

       Exposed these mappings to guide minimization

    • Optimizations to bring the response time below 3 sec

       Amortize the computation across multiple interactions steps by observing that query is changed minimally at every step

       Instead of calling MiniCon from scratch at every step, we incrementally compute the closest feasible queries


    Thank you
    Thank you

    Play with our demo:

    http://www.clide.info


    Maximally contained queries fq mc
    Maximally Contained Queries FQMC

    Not Maximally Contained

    Query: Q3

    Get all Computers

    with a given cpu & ram

    Maximally Contained Query

    Query: Q4

    Get all Computers

    with a given ram

    Maximally Contained Query

    Query: Q2

    Get all Computerswith a given cpu

    Query: Q1

    Get all Computers

    • Assuming fixed SELECT clause (projection list)

    • Covered extensively in literature

      • MiniCon, Bucket, InverseRules Algorithms

    • FQMC is finite


    Closest feasible queries fq c algorithm2
    Closest Feasible Queries FQC Algorithm

    Solution: Collapse Aliases

    Maximally

    Contained

    Feasible

    Queries FQMC

    Closest

    Feasible

    Queries FQC

    n

    • Collapse Aliases to compute FQC \ FQMC

    • Check satisfiability


    Color algorithm
    Color Algorithm

    Yellow and Blue

    • An action  is colored based on which closest feasible queries it appear in

    • Yellow, if  appears in all queries in FQC

    • Blue, if  appears in at least one (but not all) query in FQC

      White and Red

    • Attach Maximum Projection Lists to Closest Feasible Queries

      • Projections that can be added to a feasible query, without compromising feasibility

    • Projection  is white if in the maximum projection list

    • Color selections based on projections


    Clide implementation optimizations
    CLIDE Implementation & Optimizations

    Front-End

    Maximally-Contained Rewriter

    Current Query

    Colored Actions + Feasibility Flag

    Minimal Feasible Extension Queries FQME

    + Maximum Projection Lists

    Back-End

    Redundant Actions Removal

    Color Algorithm

    Maximally-Contained Feasible Extension Queries

    + Maximum Projection Lists

    Seed Queries SQ

    Redundant Queries Removal

    Parameters Algorithm

    Feasible Extension Queries

    + Maximum Projection Lists

    Closest Feasible Queries FQC

    Closest Feasible Queries Algorithm

    Views Expansion

    Aliases Collapse Rule

    Maximally-Contained Feasible Queries over Views

    + Containment Mappings

    Minimal Feasible

    Extension Queries

    Maximally-Contained Rewriter

    MiniCon

    Containment Mappings Logging

    Schemas

    Views

    Attribute

    Associations

    • Views expansion introduce redundancy

      • Affects CLIDE’s rapid convergence and minimality

    • Efficient containment test crucial to redundancy removal


    Clide performance
    CLIDE Performance

    • Views

    • Schema

    C1

    C1

    B1

    B1

    Ci

    Ci1

    B1

    C1

    CiM

    CL

    CL

    A

    A

    Bi

    Bi1

    B2

    C1

    BiM

    BK

    BK

    Chains of Stars

    • Queries

    A-span = 7

    B-span = 3

    Selections = 4,6,8,10

    A


    ad