
Efficient Keyword Search across Heterogeneous Relational Databases




  1. Efficient Keyword Search across Heterogeneous Relational Databases
     Mayssam Sayyadian, AnHai Doan (University of Wisconsin - Madison); Hieu LeKhac (University of Illinois - Urbana); Luis Gravano (Columbia University)

  2. Key Message of Paper
     • Precise data integration is expensive
     • But we can do IR-style data integration very cheaply, with no manual cost!
       • just apply automatic schema/data matching
       • then do keyword search across the databases
       • no need to verify anything manually
     • Already very useful
     • Builds upon keyword search over a single database ...

  3. Keyword Search over a Single Relational Database
     • A growing field, numerous current works
       • DBXplorer [ICDE02], BANKS [ICDE02]
       • DISCOVER [VLDB02]
       • Efficient IR-style keyword search in databases [VLDB03]
       • [VLDB05], [SIGMOD06], etc.
     • Many related works over XML / other types of data
       • XKeyword [ICDE03], XRank [SIGMOD03]
       • TeXQuery [WWW04]
       • ObjectRank [SIGMOD06]
       • TopX [VLDB05], etc.
     • More are coming at SIGMOD-07 ...

  4. A Typical Scenario
     Two relations connected by a foreign-key join (Complaints.id = Customers.custid):

     Complaints:
       tid  id    emp-name       comments
       u1   c124  Michael Smith  Repair didn’t work
       u2   c124  John           Deferred work to John Smith

     Customers:
       tid  custid  name   contact        addr
       t1   c124    Cisco  Michael Jones  …
       t2   c533    IBM    David Long     …
       t3   c333    MSR    David Ross     …

     Query: Q = [Michael Smith Cisco]
     Ranked list of answers:
       score = 0.8:  u1 (c124, Michael Smith, “Repair didn’t work”) joined with t1 (c124, Cisco, Michael Jones, …)
       score = 0.7:  u2 (c124, John, “Deferred work to John Smith”) joined with t1 (c124, Cisco, Michael Jones, …)
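The single-database scenario above can be sketched in a few lines. This is a hypothetical, deliberately simplified version using sqlite3, not the paper's system: join the two relations over the foreign key and rank joined answers by how many query keywords they cover (a crude stand-in for the IR-style score). Table contents follow the slide.

```python
import sqlite3

# Simplified sketch: keyword search over one database by joining tuples
# across a foreign key and scoring joined answers by keyword coverage.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customers(tid TEXT, custid TEXT, name TEXT, contact TEXT);
CREATE TABLE Complaints(tid TEXT, id TEXT, emp_name TEXT, comments TEXT);
INSERT INTO Customers VALUES ('t1','c124','Cisco','Michael Jones'),
                             ('t2','c533','IBM','David Long');
INSERT INTO Complaints VALUES ('u1','c124','Michael Smith','Repair didn''t work'),
                              ('u2','c124','John','Deferred work to John Smith');
""")

query = ["michael", "smith", "cisco"]

def matches(row, keywords):
    """Count how many query keywords occur somewhere in the joined row."""
    text = " ".join(map(str, row)).lower()
    return sum(1 for kw in keywords if kw in text)

# Join Complaints to Customers over the foreign key, then rank the joined
# answers by keyword coverage (illustrative, not the paper's scoring).
rows = conn.execute("""
    SELECT u.tid, c.tid, u.emp_name, u.comments, c.name, c.contact
    FROM Complaints u JOIN Customers c ON u.id = c.custid
""").fetchall()
ranked = sorted(rows, key=lambda r: -matches(r, query))
for r in ranked:
    print(r[0], r[1], "covers", matches(r, query), "of", len(query), "keywords")
```

The u1-t1 answer covers all three keywords, mirroring the slide's top-ranked tuple tree.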

  5. Our Proposal: Keyword Search across Multiple Databases → IR-style data integration
     Service-DB:
       Complaints:
         tid  id    emp-name       comments
         u1   c124  Michael Smith  Repair didn’t work
         u2   c124  John           Deferred work to John Smith
       Customers:
         tid  custid  name   contact        addr
         t1   c124    Cisco  Michael Jones  …
         t2   c533    IBM    David Long     …
         t3   c333    MSR    Joan Brown     …
     HR-DB:
       Employees:
         tid  empid  name
         v1   e23    Mike D. Smith
         v2   e14    John Brown
         v3   e37    Jack Lucas
       Groups:
         tid  eid  reports-to
         x1   e23  e37
         x2   e14  e37

     Query: [Cisco Jack Lucas]
     Answer across databases:
       t1 (Cisco) joined with u1 (Michael Smith), matched to v1 (Mike D. Smith), joined with x1 (e23, e37), joined with v3 (Jack Lucas)

  6. A Naïve Solution
     1. Manually identify FK joins across DBs
     2. Manually identify matching data instances across DBs
     3. Now treat the combination of DBs as a single DB → apply current keyword search techniques
     Just like in traditional data integration, this is too much manual work

  7. Kite Solution
     • Automatically find FK joins / matching data instances across databases (over the same Complaints/Customers and Employees/Groups databases as above)
     → no manual work is required from the user

  8. Automatically Find FK Joins across Databases
     • Current solutions analyze data values (e.g., Bellman)
       • limited accuracy: e.g., “waterfront” (values yes/no) and “electricity” (values yes/no) look joinable on values alone
     • Our solution: data analysis + schema matching
       • improves accuracy drastically (by as much as 50% F-1)
     • Automatic join/data matching can be wrong → incorporate confidence scores into answer scores
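A toy illustration of why value overlap alone misleads, and how blending in schema (name) similarity helps. The similarity functions, weights, and column names here are illustrative, not Kite's actual matcher.

```python
# Sketch: score candidate FK joins by combining value overlap with column
# name similarity. "waterfront" and "electricity" both hold yes/no values,
# so value overlap alone wrongly suggests a join; name similarity demotes it.

def jaccard(a, b):
    """Value overlap between two columns' value sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def name_sim(x, y):
    """Crude token-overlap similarity between column names."""
    tx = set(x.lower().replace("-", "_").split("_"))
    ty = set(y.lower().replace("-", "_").split("_"))
    return len(tx & ty) / len(tx | ty)

def join_score(col_a, vals_a, col_b, vals_b, w=0.5):
    # Blend value overlap with name similarity (weights are illustrative).
    return w * jaccard(vals_a, vals_b) + (1 - w) * name_sim(col_a, col_b)

good = join_score("custid", ["c124", "c533"], "custid", ["c124", "c533", "c333"])
bad = join_score("waterfront", ["yes", "no"], "electricity", ["yes", "no"])
print(good, bad)  # the true FK join scores higher than the yes/no pair
```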

  9. Incorporate Confidence Scores into Answer Scores
     • Recall: answer example in the single-DB setting
       score = 0.8:  u1 (Michael Smith, “Repair didn’t work”) joined with t1 (Cisco, Michael Jones, …)
     • Recall: answer example in the multiple-DB setting
       t1, u1, v1, x1, v3: score 0.7 for the data matching (u1 ≈ v1), score 0.9 for the FK join
     • Combined score:
       score(A, Q) = [α · score_kw(A, Q) + β · score_join(A, Q) + γ · score_data(A, Q)] / size(A)
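The combined score can be written down directly. The weights α, β, γ and the component values below are illustrative placeholders, not values from the paper.

```python
# The slide's combined score: keyword score, join confidence, and data-match
# confidence, normalized by answer size (number of tuples in the answer).

def answer_score(score_kw, score_join, score_data, size,
                 alpha=1.0, beta=1.0, gamma=1.0):
    """score(A, Q) = (α·score_kw + β·score_join + γ·score_data) / size(A)."""
    return (alpha * score_kw + beta * score_join + gamma * score_data) / size

# Illustrative example: an answer spanning 5 tuples whose FK join has
# confidence 0.9 and whose data match has confidence 0.7.
s = answer_score(score_kw=0.8, score_join=0.9, score_data=0.7, size=5)
print(s)
```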

  10. Summary of Trade-Offs
      • Precise data integration (precise SQL queries): the holy grail
      • IR-style data integration, naïve way: manually identify FK joins and matching data → still too expensive
      • IR-style data integration, using Kite: automatic FK join finding / data matching → cheap, but only approximates the “ideal” ranked list found by the naïve way

  11. Kite Architecture
      Offline preprocessing (over databases D1 … Dn):
        • Foreign-Key Join Finder = Data-based Join Finder + Schema Matcher
        • Data instance matcher
        • Index Builder: one IR index per database (IR index1 … IR indexn)
      Online querying (e.g., Q = [Smith Cisco]):
        • Condensed CN Generator (uses the discovered foreign-key joins)
        • Top-k Searcher: refinement rules (Partial, Full, Deep) → distributed SQL queries over D1 … Dn

  12. Online Querying
      What current solutions do (over the relations of each database):
      1. Create answer templates
      2. Materialize answer templates to obtain answers

  13. Create Answer Templates
      Step 1: Find tuples that contain query keywords, using each DB’s IR index
        • example: Q = [Smith Cisco]
        • tuple sets: Service-DB: ComplaintsQ = {u1, u2}, CustomersQ = {t1}; HR-DB: EmployeesQ = {v1}, GroupsQ = {}
      Step 2: Create the tuple-set graph from the schema graph
        • schema graph: Customers, Complaints, Emps, Groups connected by FK joins J1–J4
        • tuple-set graph: each relation X splits into XQ (its tuples containing query keywords) and X{} (its remaining tuples); edges carry the joins J1–J4

  14. Create Answer Templates (cont.)
      • Search the tuple-set graph to generate answer templates, also called Candidate Networks (CNs)
      • Each answer template = one way to join tuples to form an answer
      • Sample CNs (CN1–CN4) connect tuple sets such as CustomersQ, ComplaintsQ, EmpsQ, Emps{}, Groups{} via the joins J1–J4, e.g. CustomersQ -J1- ComplaintsQ
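CN generation as described above, searching the tuple-set graph for small join networks, can be sketched as a bounded graph search. The tiny graph below echoes the slides' tuple-set names but is otherwise invented; join labels are omitted from paths for brevity, and this is a simplification, not Kite's actual CN generator.

```python
# Sketch: enumerate candidate networks (answer templates) by expanding
# paths over a tuple-set graph up to a size bound. Edges are (join, node)
# pairs; the graph below is a hypothetical fragment for illustration.
edges = {
    "CustomersQ": [("J1", "ComplaintsQ")],
    "ComplaintsQ": [("J1", "CustomersQ"), ("J2", "Emps{}"), ("J2", "EmpsQ")],
    "EmpsQ": [("J4", "Groups{}")],
    "Emps{}": [("J4", "Groups{}")],
    "Groups{}": [],
}

def candidate_networks(start, max_nodes):
    """Enumerate join paths (CNs) of 2..max_nodes tuple sets from start."""
    results, frontier = [], [[start]]
    while frontier:
        path = frontier.pop()
        if len(path) > 1:
            results.append(path)          # every multi-node path is a CN
        if len(path) == max_nodes:
            continue                      # size bound reached
        for _join, nxt in edges.get(path[-1], []):
            if nxt not in path:           # avoid trivial cycles
                frontier.append(path + [nxt])
    return results

cns = candidate_networks("CustomersQ", max_nodes=3)
for cn in cns:
    print(" - ".join(cn))
```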

  15. Materialize Answer Templates to Generate Answers
      • By generating and executing SQL queries
        CN: CustomersQ -J1- ComplaintsQ  (CustomersQ = {v1}, ComplaintsQ = {u1, u2})
        SQL: SELECT *
             FROM Customers C, Complaints P
             WHERE C.cust-id = P.id
               AND (C.tuple-id = v1)
               AND (P.tuple-id = u1 OR P.tuple-id = u2)
      • Naïve solution: materialize all answer templates, score, rank, then return answers
      • Current solutions:
        • find only top-k answers
        • materialize only certain answer templates
        • make decisions using refinement rules + statistics
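Template materialization as generate-and-execute SQL can be sketched with sqlite3 under a simplified schema (hyphens in column names replaced with underscores; tuple ids follow the slide; the data is illustrative).

```python
import sqlite3

# Sketch: materialize one answer template (CN) by generating a SQL query
# that restricts each relation to its tuple set and joins over the FK.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customers(tuple_id TEXT, cust_id TEXT, name TEXT);
CREATE TABLE Complaints(tuple_id TEXT, id TEXT, comments TEXT);
INSERT INTO Customers VALUES ('v1','c124','Cisco'), ('v2','c533','IBM');
INSERT INTO Complaints VALUES ('u1','c124','Repair failed'),
                              ('u2','c124','Deferred'), ('u3','c533','Lost');
""")

def materialize_cn(customers_q, complaints_q):
    """Build and run SQL for the CN CustomersQ -J1- ComplaintsQ."""
    c_ids = ",".join(f"'{t}'" for t in customers_q)
    p_ids = ",".join(f"'{t}'" for t in complaints_q)
    sql = (f"SELECT C.tuple_id, P.tuple_id FROM Customers C, Complaints P "
           f"WHERE C.cust_id = P.id AND C.tuple_id IN ({c_ids}) "
           f"AND P.tuple_id IN ({p_ids})")
    return conn.execute(sql).fetchall()

# The slide's example: CustomersQ = {v1}, ComplaintsQ = {u1, u2}.
answers = materialize_cn(["v1"], ["u1", "u2"])
print(answers)
```

(In a real system the tuple ids would be bound as query parameters rather than interpolated into the SQL string.)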

  16. Challenges for Kite Setting • More databases  way too many answer templates to generate • can take hours on just 3-4 databases • Materializing an answer template takes way too long • requires SQL query execution across multiple databases • invoking each database incurs large overhead • Difficult to obtain reliable statistics across databases • See paper for our solutions (or backup slides)

  17. Empirical Evaluation Domains
      • The DBLP schema (databases DBLP 1, DBLP 2):
        AU (id, name), AR (id, title), CNF (id, name), PU (aid, uid), AR (aid, biblo), CITE (id1, id2)
      • Sample Inventory schema (database Inventory 1):
        AUTHOR, ARTIST, BOOK, CD, WAREHOUSE, WH2BOOK, WH2CD

  18. Runtime Performance (1)
      Compared systems: full-fledged Kite; Kite without condensed CNs; Kite without rule Deep; Kite without adaptive rule selection and without rule Deep; the Hybrid algorithm adapted to run over multiple databases.
      [Plots:]
      • Runtime (sec) vs. maximum CCN size: DBLP (2-keyword queries, k=10, 2 databases) and Inventory (2-keyword queries, k=10, 5 databases)
      • Runtime (sec) vs. number of databases: Inventory (maximum CCN size = 4, 2-keyword queries, k=10)

  19. Runtime Performance (2)
      [Plots:]
      • Runtime (sec) vs. number of keywords in the query |q|: DBLP (max CCN = 6, k=10, 2 databases) and Inventory (max CCN = 4, k=10, 5 databases)
      • Runtime (sec) vs. number of answers requested k: Inventory (2-keyword queries, max CCN = 4, |q| = 2, 5 databases)

  20. Query Result Quality
      • Pr@k = the fraction of the top-k answers that appear in the “ideal” list
      [Plots:] Pr@k vs. k, for OR-semantic queries and AND-semantic queries
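The Pr@k measure is simple to state in code; the answer lists below are made up for illustration.

```python
# Pr@k as defined on the slide: the fraction of the top-k returned answers
# that also appear in the "ideal" ranked list.

def precision_at_k(returned, ideal, k):
    """Fraction of the first k returned answers present in the ideal list."""
    ideal_set = set(ideal)
    return sum(1 for a in returned[:k] if a in ideal_set) / k

# Illustrative: 2 of Kite's top-4 answers also appear in the ideal list.
p = precision_at_k(["a1", "a2", "a3", "a4"], ["a1", "a3", "a5", "a7"], k=4)
print(p)
```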

  21. Summary
      • Kite executes IR-style data integration
        • performs some automatic preprocessing
        • then immediately allows keyword querying
      • Relatively painless
        • no manual work!
        • no need to create a global schema, nor to understand SQL
      • Can be very useful in many settings: on-the-fly, best-effort, for non-technical people
        • enterprises, on the Web: need only a few answers
        • emergency (e.g., hospital + police): need answers quickly

  22. Future Directions
      • Incorporate user feedback → interactive IR-style data integration
      • More efficient query processing: large # of databases, network latency
      • Extend to other types of data: XML, ontologies, extracted data, Web data
      Takeaways: IR-style data integration is feasible and useful; it extends current works on keyword search over DBs and raises many opportunities for future work

  23. BACKUP

  24. Condensing Candidate Networks
      • In multi-database settings → an unmanageable number of CNs
      • Many CNs share the same tuple sets and differ only in the associated joins
      • Group such CNs into condensed candidate networks (CCNs): each edge of a CCN carries a set of joins (e.g., {J2, J3}) instead of a single join
      • The tuple-set graph is condensed in the same way
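Condensation can be sketched as grouping CNs by their tuple-set sequence and unioning the joins at each edge position. This is a simplification of the CCN construction, and the CN encoding below (a list of (tuple set, incoming join) steps) is invented for illustration.

```python
from collections import defaultdict

# Sketch: CNs with identical tuple-set sequences that differ only in which
# join connects adjacent tuple sets collapse into one condensed CN (CCN).
cns = [
    [("CustomersQ", None), ("ComplaintsQ", "J1"), ("EmpsQ", "J2")],
    [("CustomersQ", None), ("ComplaintsQ", "J1"), ("EmpsQ", "J3")],
    [("CustomersQ", None), ("ComplaintsQ", "J1")],
]

def condense(cns):
    """Group CNs by tuple-set sequence; union the joins per edge position."""
    groups = defaultdict(lambda: defaultdict(set))
    for cn in cns:
        key = tuple(node for node, _ in cn)      # the tuple-set sequence
        for i, (_, join) in enumerate(cn):
            if join is not None:
                groups[key][i].add(join)
    return {key: dict(joins) for key, joins in groups.items()}

ccns = condense(cns)
print(len(cns), "CNs ->", len(ccns), "CCNs")
```

The first two CNs above merge into one CCN whose second edge carries the join set {J2, J3}.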

  25. Top-k Search
      Main ideas for top-k keyword search:
      • No need to materialize all CNs
      • Sometimes even partially materializing a CN is enough
      • Estimate score intervals for CNs, then branch-and-bound search
        • example (three iterations): with CNs P [0.6, 1], Q [0.5, 0.7], R [0.4, 0.9], materializing P and R yields answers P2 (0.9), P3 (0.7), R2 (0.85); once Res = {P2, R2} with min score 0.85, Q (upper bound 0.7) can be pruned
      • Kite approach: materialize CNs using refinement rules
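The branch-and-bound idea (explore CNs by optimistic upper bound, and stop once no remaining CN can beat the current k-th best answer) can be sketched as follows. The intervals and answer scores echo the slide's example but are otherwise illustrative, and `materialize` is a stand-in for running the CN's SQL.

```python
import heapq

# Sketch: interval-based top-k search over CNs. Each CN has a [lo, hi]
# score interval; a CN is pruned when its upper bound cannot beat the
# k-th best already-materialized answer.
def top_k(intervals, materialize, k):
    """intervals: {cn: (lo, hi)}; materialize(cn) -> exact answer scores."""
    results = []
    heap = [(-hi, cn) for cn, (_lo, hi) in intervals.items()]
    heapq.heapify(heap)                      # best upper bound first
    while heap:
        neg_hi, cn = heapq.heappop(heap)
        kth = (sorted(results, reverse=True)[k - 1]
               if len(results) >= k else float("-inf"))
        if -neg_hi <= kth:
            break                            # no remaining CN can help
        results.extend(materialize(cn))
    return sorted(results, reverse=True)[:k]

scores = {"P": [0.9, 0.7], "Q": [0.5], "R": [0.85]}
ans = top_k({"P": (0.6, 1.0), "Q": (0.5, 0.7), "R": (0.4, 0.9)},
            lambda cn: scores[cn], k=2)
print(ans)  # CN Q (upper bound 0.7) is never materialized
```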

  26. Top-k Search Using Refinement Rules
      • In the single-database setting → select rules based on database statistics
      • In the multi-database setting → inaccurate statistics
      • Inaccurate statistics → inappropriate rule selection

  27. Refinement Rules
      • Full: exhaustively extract all answers from a CN (fully materialize it)
        → too much data to move around the network (data-transfer cost)
      • Partial: try to extract the most promising answer from a CN
        → invokes remote databases for only one answer (high cost of database invocation)
      • Deep: a middle-ground approach; once a table in a remote database is invoked, extract all answers involving that table
        → takes database invocation cost into account

  28. Adaptive Search
      • Question: which refinement rule to apply next?
      • In the single-database setting → based on database statistics
      • In the multi-database setting → inaccurate statistics
      • Kite approach: adaptively select rules
        goodness-score(rule, cn) = benefit(rule, cn) − cost(rule, cn)
      • cost(rule, cn): the optimizer’s estimated cost for the rule’s SQL statements
      • benefit(rule, cn): reduced if the rule has been applied for a while without making any progress
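The goodness-score heuristic can be sketched directly; the geometric decay used here for "reduce the benefit when a rule makes no progress" is a made-up concrete choice, and all benefit/cost numbers are illustrative.

```python
# Sketch: adaptive rule selection as benefit minus cost, with a rule's
# benefit decayed for each application that produced no new answers.

def goodness(benefit, cost, stale_rounds, decay=0.5):
    """goodness-score(rule, cn); decay scheme is an illustrative choice."""
    return benefit * (decay ** stale_rounds) - cost

# (rule, cn) -> (estimated benefit, estimated cost, stale applications)
rules = {
    ("Full", "CN1"): (1.0, 0.6, 0),
    ("Deep", "CN1"): (0.8, 0.3, 0),
    ("Partial", "CN2"): (0.9, 0.2, 3),   # applied 3 times without progress
}
best = max(rules, key=lambda r: goodness(*rules[r]))
print(best)
```

Here Deep on CN1 wins: Partial's large raw benefit is decayed away by its stalled applications, and Full's data-transfer cost outweighs its benefit.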

  29. Other Experiments
      • Schema matching drastically improves the join discovery algorithm (accuracy, F1)
      • Kite also improves the single-database keyword search algorithm (vs. mHybrid): runtime (sec) vs. max CCN size, Kite over a single database
