1 / 31

Interactive Query Formulation over Web Service-Accessed Sources

Interactive Query Formulation over Web Service-Accessed Sources. SIGMOD 2006 Best Paper Runner-Up. Michalis Petropoulos Alin Deutsch Yannis Papakonstantinou. CSE 636 Data Integration, March 2008. Large-Scale Data Integration Systems. . Web Domain. Web Forms & Reports.

myrna
Download Presentation

Interactive Query Formulation over Web Service-Accessed Sources

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Interactive Query Formulationover Web Service-Accessed Sources SIGMOD 2006 Best Paper Runner-Up Michalis Petropoulos Alin Deutsch Yannis Papakonstantinou CSE 636 Data Integration, March 2008

  2. Large-Scale Data Integration Systems  Web Domain Web Forms & Reports • CNET’s Top Combinations • CNET’s Search Desktops • PCWorld’s Product Finder End User  Application Domain   • CNET Computer • PCWorld Portals Developer Application Application  Integration Domain Mediator Integrated Schema Compatible Combinations of Computers, Routers and Printers Integration Engineer Source Domain • Dell Computers by CPU • Cisco Routers by Rate • HP Printers by Speed  Web Service Web Service Web Service … Source Owner Data Source Data Source • Dell Computers • Cisco Routers • HP Printers Source Schema Source Schema …

  3. Large-Scale Data Integration Systems  Web Domain Web Forms & Reports End User  Application Domain What queries can the mediator answer for me? CLIDE   Developer Application Application  Integration Domain Mediator Integrated Schema Integration Engineer Source Domain  Web Service Web Service Web Service … Source Owner Data Source Data Source Source Schema Source Schema …

  4. Running Example Parameterized Views Schema Computers(cid, cpu, ram, price) NetCards(cid, rate, standard, interface) Views V1ComByCpu(cpu)  (Computer)* SELECT DISTINCT Com1.* FROM Computers Com1 WHERE Com1.cpu=cpu V2 ComNetByCpuRate(cpu, rate)  (Computer, NetCard)* SELECT DISTINCT Com1.*, Net1.* FROM Computers Com1, Network Net1 WHERE Com1.cid=Net1.cid AND Com1.cpu=cpu AND Net1.rate=rate Dell Cisco Schema Routers(rate, standard, price, type) Views V3 RouWired()  (Router)* SELECT DISTINCT Rou1.* FROM Routers Rou1 WHERE Rou1.type='Wired' V4 RouWireless()  (Router)* SELECT DISTINCT Rou1.* FROM Routers Rou1 WHERE Rou1.type='Wireless' Wired Routers Computers for a givencpu Wireless Routers Computers & NetCards for a givencpu & rate • Conjunctive Queries CQ • Equality & Comparison Conditions • Parameters

  5. Running Example Integrated Schema   Developer Application Mediator Integrated Schema • Integrated schema puts togetherthe Dell and Cisco schemas Attribute Associations • (Computers.cid, NetCards.cid) • (NetCards.rate, Routers.rate) • (NetCards.standard, Routers.standard) V1 V2 V3 V4 Dell Cisco

  6. Sophisticated Mediators MakeFeasible Queries Hard to Predict Feasible Queries FQ • Equivalent CQ query rewritings using the views • Might involve more than one views • Order might matter Query: Get all ‘P4’ Computers, together with their NetCards and their compatible ‘Wireless’ Routers Query: Get all Computers Feasible Infeasible E Mediator B D Mediator A C RouWireless() ComNetByCpuRate(‘P4’, ‘10’) V1 ComNetByCpuRate(‘P4’, ‘54’) V4 V2

  7. Problem • Large number of sources • Large number of views (web-services) • Mediator capabilities Developer formulates an application query • Is an application query feasible? • If not, how do I know which ones are feasible? Previous options: • The developer had to browse the view definitions and somehow formulate a feasible query • Or formulate queries until a feasible one is found(trial-and-error) No system-provided guidance

  8. The CLIDE Solution CLIDE • A query formulation interface, which interactivelyguides the developer toward feasible queries by employing a coloring scheme   Developer Application Mediator Integrated Schema V1 V2 V3 V4 Dell Cisco

  9. QBE-Like Interfaces Microsoft SQL-Server

  10. CLIDE Interface Last/Next Step Table Alias Selection Boxes • Table, selection, projection and join actions • Feasibility Flag • Color-based suggestions Feasibility Flag Table Boxes Projection Box

  11. Example Interaction Snapshot 1 Yellow required action • All feasible queries require this action White optional action • Feasible queries can be formulatedw/ or w/o these actions

  12. Example Interaction Snapshot 2 • Blue required choice of action • At least one feasible query cannot be formulated unless this action is performed C Mediator A ComByCpu(‘P4’) B V1

  13. Example Interaction Snapshot 3 Join Lines: • Only yellow and blue are displayed • Must appear in Attribute Associations

  14. Example Interaction Snapshot 4 • * any other constant • Red prohibited action • Does not appear in any feasible query • Lead to “Dead End” state

  15. Example Interaction Snapshot 5 F Mediator A D RouWireless() ComNetByCpuRate(‘P4’, rate) E B V4 V2

  16. CLIDE Properties • Completeness of Suggestions • Every feasible query can be formulated by performing yellow and blue actions at every step • Summarization of Suggestions • At every step, only a minimal number of actions is suggested, i.e., the ones that are needed to preserve completeness • Rapid Convergence By Following Suggestions • The shortest sequence of actions from a query to any feasible query consists of suggested actions

  17. Interaction Graph Selection Action Table Action Join Action Com1 Com1.ram Com1.price Com1.cpu=‘P4’ Net1 Com1.cid=Net1.cid Rou1 … … … … … … … … … … • Nodes are queries: One for each qCQ • Edges are actions: Table, selection, projection and join actions • Green nodes are feasible queries • Infinitely big structure • All CQ queries • All possible combinations of actions formulating them

  18. Interaction Graph: Colorable Actions Com1.cid Current Node … Com1.cpu … Com1.cid=* … • Colorable actionsAC labeloutgoing edges of the current node Com1.cpu=* … Com1.ram=* … Com1.price=* … Net1 … Rou1 … Com2 …

  19. Interaction Graph: Colors • Yellow action  • Every path from current node n to a feasible node contains  • Blue action  • At least one feasible query cannot be formulated unless this action is performed (summarization) • Red action  • No path to a feasible node contains  Current Node Com1.cid … Com1.cpu … Current Node Com1.cid=* … … Com1.cpu=* Com1.cpu=* Net1 Com1.cid=Net1.cid Com1.cid=Net1.cid Net1.rate=’54Mbps’ … … … … Com1.ram=* Com1.cid=Net1.cid … … Net1.rate=’54Mbps’ Net1.rate=’54Mbps’ Com1.price=* … … … Com1.cpu=* Rou1 Rou1 Com1.cid=Net1.cid Net1.rate=Rou1.rate Net1 … … … … … … Rou1 Com2 Com2 … Com2 Com2.cid=Net1.cid Com2.cpu=‘P4’ Net1.rate=‘54Mbps’ … … … … …

  20. CLIDE Architecture  Actions • Back-End invoked every time the user performs an action • i.e., the user arrives at a new node in the interactions graph Front-End User Current Query Colored Actions + Feasibility Flag Back-End Color Algorithm Seed Queries SQ Parameters Algorithm Closest Feasible Queries FQC Closest Feasible Queries Algorithm Aliases Collapse Rule Minimal Feasible Extension Queries Maximally-Contained Rewriter Schemas Views Column Associations

  21. Color DeterminedBy a Finite Set of Feasible Queries Challenge: Infinitely Many Feasible Queries ? … Radius … Closest Feasible Queries FQC … n … … … … Solution: Closest Feasible Queries FQC • FQC is sufficient to color actions in AC • Theorem: Set of Closest Feasible Queries is Finite Challenge: How far can the Closest Feasible Queries FQC be? Solution: Based on Maximally Contained Queries FQMC

  22. Maximally Contained Queries FQMC Not Maximally Contained Query: Q3 Get all Computers with a given cpu & ram Maximally Contained Query Query: Q4 Get all Computers with a given ram Maximally Contained Query Query: Q2 Get all Computerswith a given cpu Query: Q1 Get all Computers • Assuming fixed SELECT clause (projection list) • Covered extensively in literature • MiniCon, Bucket, InverseRules Algorithms • FQMC is finite

  23. Closest Feasible Queries FQC Algorithm Challenge: How far can the Closest Feasible Queries FQC be? Solution: Maximally Contained Queries FQMC pLRadius … Maximally Contained Queries FQMC Closest Feasible Queries FQC … … n … … … • Compute maximally contained queries FQMC • Theorem: All FQC queries are reachable via a path of length p  pL • The radius pL is the longest path to a maximally contained query …

  24. Closest Feasible Queries FQC Algorithm Challenge: Find the Closest Feasible Queries Maximally Contained Feasible Queries FQMC … Closest Feasible Queries FQC … n … … … … More feasible nodes • Theorem: All queries in FQMC are in FQC • But not all queries in FQC are in FQMC

  25. Closest Feasible Queries FQC Algorithm Solution: Collapse Aliases Maximally Contained Feasible Queries FQMC … Closest Feasible Queries FQC … n … … … … • Collapse Aliases to compute FQC \ FQMC • Check satisfiability

  26. Color Algorithm Yellow and Blue • An action  is colored based on which closest feasible queries it appear in • Yellow, if  appears in all queries in FQC • Blue, if  appears in at least one (but not all) query in FQC White and Red • Attach Maximum Projection Lists to Closest Feasible Queries • Projections that can be added to a feasible query, without compromising feasibility • Projection  is white if in the maximum projection list • Color selections based on projections

  27. CLIDE Implementation & Optimizations Front-End Maximally-Contained Rewriter Current Query Colored Actions + Feasibility Flag Minimal Feasible Extension Queries FQME + Maximum Projection Lists Back-End Redundant Actions Removal Color Algorithm Maximally-Contained Feasible Extension Queries + Maximum Projection Lists Seed Queries SQ Redundant Queries Removal Parameters Algorithm Feasible Extension Queries + Maximum Projection Lists Closest Feasible Queries FQC Closest Feasible Queries Algorithm Views Expansion Aliases Collapse Rule Maximally-Contained Feasible Queries over Views + Containment Mappings Minimal Feasible Extension Queries Maximally-Contained Rewriter MiniCon Containment Mappings Logging Schemas Views Column Associations • Views expansion introduce redundancy • Affects CLIDE’s rapid convergence and summarization • Efficient containment test crucial to redundancy removal

  28. CLIDE Performance • Views • Schema C1 C1 … … B1 B1 Ci Ci1 B1 C1 … … … … … CiM CL CL A A Bi Bi1 B2 C1 … … … … BiM BK BK Chains of Stars – No Parameters • Queries A-span = 7 B-span = 3 Selections = 4,6,8,10 A

  29. CLIDE Performance • Views • Schema C1 C1 … … B1 B1 Ci Ci1 B1 C1 … … … … … CiM CL CL A A Bi Bi1 B2 C1 … … … … BiM BK BK Chains of Stars – No Parameters • Queries A-span = 7 B-span = 3 Selections = 4,6,8,10 A

  30. CLIDE Performance • Views • Schema C1 C1 … … B1 B1 Ci Ci1 B1 C1 … … … … … CiM CL CL A A Bi Bi1 B2 C1 … … … … BiM BK BK Chains of Stars – With Parameters • Queries A-span = 7 B-span = 3 Selections = 4,6,8,10 A

  31. CLIDE Summary First interactive query formulation interface based on source and mediator capabilities Applicability • Service-Oriented Architectures • Privacy-Preserving Services Contributions • Interaction Guarantees: Rapid Convergence, Completeness, Summarization of Suggestions • Interaction Graph • Back-End Algorithms • Closest Feasible Queries, Colors, Parameters • Modular, Customizable Architecture http://www.clide.info

More Related