Efficient keyword search across heterogeneous relational databases
This presentation is the property of its rightful owner.
Sponsored Links
1 / 24

Efficient Keyword Search across Heterogeneous Relational Databases PowerPoint PPT Presentation


  • 57 Views
  • Uploaded on
  • Presentation posted in: General

Efficient Keyword Search across Heterogeneous Relational Databases. Mayssam Sayyadian, AnHai Doan University of Wisconsin - Madison Hieu LeKhac University of Illinois - Urbana Luis Gravano Columbia University. Key Message of Paper. Precise data integration is expensive

Download Presentation

Efficient Keyword Search across Heterogeneous Relational Databases

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Efficient keyword search across heterogeneous relational databases

Efficient Keyword Search acrossHeterogeneous Relational Databases

Mayssam Sayyadian, AnHai Doan

University of Wisconsin - Madison

Hieu LeKhac

University of Illinois - Urbana

Luis Gravano

Columbia University


Key message of paper

Key Message of Paper

  • Precise data integration is expensive

  • But we can do IR-styledata integration very cheaply, with no manual cost!

    • just apply automatic schema/data matching

    • then do keyword search across the databases

    • no need to verify anything manually

  • Already very useful

Build upon keyword search over a single database ...


Keyword search over a single relational database

Keyword Search over a Single Relational Database

  • A growing field, numerous current works

    • DBXplorer [ICDE02], BANKS [ICDE02]

    • DISCOVER [VLDB02]

    • Efficient IR-style keyword search in databases [VLDB03],

    • VLDB-05, SIGMOD-06, etc.

  • Many related works over XML / other types of data

    • XKeyword [ICDE03], XRank [Sigmod03]

    • TeXQuery [WWW04]

    • ObjectRank [Sigmod06]

    • TopX [VLDB05], etc.

  • More are coming at SIGMOD-07 ...


A typical scenario

tid id emp-name comments

u1 c124 Michael Smith Repair didn’t work

u2 c124 John Deferred work to

John Smith

tid custid name contact addr

t1 c124 Cisco Michael Jones …

t2 c533 IBM David Long …

t3 c333 MSR David Ross …

u2 c124 John Deferred work to John Smith

t1 c124 CiscoMichael Jones …

A Typical Scenario

Customers

Complaints

Foreign-Key Join

Q = [Michael Smith Cisco]

Ranked list of answers

score=.8

u1 c124Michael Smith Repair didn’t work

t1 c124 Cisco Michael Jones …

score=.7


Our proposal keyword search across multiple databases

tid id emp-name comments

u1 c124 Michael Smith Repair didn’t work

u2 c124 John Deferred work to

John Smith

tid custid name contact addr

t1 c124 Cisco Michael Jones …

t2 c533 IBM David Long …

t3 c333 MSR Joan Brown …

t1 c124 Cisco Michael Jones …

u1 c124 Michael SmithRepair didn’t work

v1 e23 Mike D. Smith

x1 e23 e37

v3 e37 Jack Lucas

Our Proposal:Keyword Search across Multiple Databases

Employees

Complaints

tid empid name

v1 e23 Mike D. Smith

v2 e14 John Brown

v3 e37 Jack Lucas

Groups

Customers

tid eid reports-to

x1 e23 e37

x2 e14 e37

Query:[Cisco Jack Lucas]

across databases

 IR-style data integration


A naive solution

A Naive Solution

1. Manually identify FK joins across DBs

2. Manually identify matching data instances across DBs

3. Now treat the combination of DBs as a single DB

 apply current keyword search techniques

Just like in traditional data integration,

this is too much manual work


Kite solution

tid id emp-name comments

u1 c124 Michael Smith Repair didn’t work

u2 c124 John Deferred work to

John Smith

tid custid name contact addr

t1 c124 Cisco Michael Jones …

t2 c533 IBM David Long …

t3 c333 MSR Joan Brown …

Kite Solution

  • Automatically find FK joins / matching data instances across databases

     no manual work is required from user

Employees

Complaints

tid empid name

v1 e23 Mike D. Smith

v2 e14 John Brown

v3 e37 Jack Lucas

Groups

Customers

tid eid reports-to

x1 e23 e37

x2 e14 e37


Automatically find fk joins across databases

tid id emp-name comments

u1 c124 Michael Smith Repair didn’t work

u2 c124 John Deferred work to

John Smith

Automatically Find FK Joinsacross Databases

  • Current solutions analyze data values (e.g., Bellman)

  • Limited accuracy

    • e.g., “waterfront” with values yes/no “electricity” with values yes/no

  • Our solution: data analysis + schema matching

    • improve accuracy drastically (by as much as 50% F-1)

Employees

Complaints

tid empid name

v1 e23 Mike D. Smith

v2 e14 John Brown

v3 e37 Jack Lucas

Automatic join/data matching can be wrong

 incorporate confidence scores into answer scores


Incorporate confidence scores into answer scores

t1 c124 Cisco Michael Jones …

u1 c124 Michael Smith Repair didn’t work

v1 e23 Mike D. Smith

x1 e23 e37

v3 e37 Jack Lucas

Incorporate Confidence Scores into Answer Scores

  • Recall: answer example in single-DB settings

score=.8

u1 c124 Michael Smith Repair didn’t work

t1 c124Cisco Michael Jones …

  • Recall: answer example in multiple-DB settings

score 0.7 for data matching

score 0.9 for FK join

  • α.score_kw (A, Q) + β.score_join (A, Q) + γ.score_data (A, Q)

score (A, Q) =

  • size (A)


Summary of trade offs

Summary of Trade-Offs

SQL queries

  • Precise data integration

    • the holy grail

  • IR-style data integration, naive way

    • manually identify FK joins, matching data

    • still too expensive

  • IR-style data integration, using Kite

    • automatic FK join finding / data matching

    • cheap

    • only approximates the “ideal” ranked list found by naive


Kite architecture

Kite Architecture

Q = [ Smith Cisco ]

Index Builder

Condensed

CN Generator

IR index1

IR indexn

  • Partial

  • Full

  • Deep

Foreign key joins

Refinement

rules

Top-k

Searcher

Data instance

Foreign-Key Join Finder

matcher

Data-based Schema

Join Finder Matcher

Distributed SQL queries

D1

D1

Dn

Dn

Offline preprocessing

Online querying


Online querying

Database 1

Database 2

Relation 1

Relation 2

Relation 1

Relation 2

Online Querying

What current solutions do:

1. Create answer templates

2. Materialize answer templates to obtain answers


Create answer templates

Customers

v1

v2

v3

Complaints

u1

u2

J1

J4

J2

J1

J4

Customers{}

Complaints{}

Emps{}

Emps

Customers

Groups

Complaints

J2

J3

J1

J4

J3

Groups{}

J3

J4

J1

J2

CustomersQ

ComplaintsQ

EmpsQ

J1

J4

Create Answer Templates

Find tuples that contain query keywords

  • Use DB’s IR index

  • example:

    Q = [Smith Cisco]

    Tuple sets:

    Create tuple-set graph

    Schema graph:

    Tuple set graph:

Service-DB

HR-DB

Service-DB: ComplaintsQ={u1, u2}

CustomersQ={v1}

Groups

x1

x2

Employees

t1

t2

t3

HR-DB: EmployeesQ={t1}

GroupsQ={}


Create answer templates cont

sample CNs

J1

CN1:

CN2:

CN3:

CN4:

sample tuple set graph

J1

J4

Customers{}

Complaints{}

Emps{}

J2

J3

J1

J4

Groups{}

J3

J4

J2

J2

J4

J1

J2

CustomersQ

CustomersQ Complaints{Q}

EmpsQ Groups{} Emps{}  Complaints{Q}

CustomersQ

ComplaintsQ

EmpsQ

J2

J3

J4

J1

J4

EmpsQ Groups{} Emps{}  Complaints{Q}

Create Answer Templates (cont.)

  • Search tuple-set graph to generate answer templates

    • also called Candidate Networks (CNs)

  • Each answer template = one way to join tuples to form an answer


Materialize answer templates to generate answers

Materialize Answer Templatesto Generate Answers

  • By generating and executing a SQL query

CN:CustomersQ ComplaintsQ (CustomersQ = {v1} , ComplaintsQ = {u1, u2})

SQL: SELECT * FROM Customers C, Complaints P

WHERE C.cust-id = P.id AND

(C.tuple-id = v1) AND

(P.tuple-id = u1 OR tuple-id = u2)

J1

  • Naive solution

    • materialize all answer templates, score, rank, then return answers

  • Current solutions

    • find only top-k answers

    • materialize only certain answer templates

    • make decisions using refinement rules + statistics


Challenges for kite setting

Challenges for Kite Setting

  • More databases  way too many answer templates to generate

    • can take hours on just 3-4 databases

  • Materializing an answer template takes way too long

    • requires SQL query execution across multiple databases

    • invoking each database incurs large overhead

  • Difficult to obtain reliable statistics across databases

  • See paper for our solutions


Empirical evaluation

The DBLP Schema

Sample Inventory Schema

AR (aid, biblo)

CITE (id1, id2)

AUTHOR

ARTIST

BOOK

CD

PU (aid, uid)

AR (id, title)

WH2BOOK

WH2CD

AU (id, name)

CNF (id, name)

WAREHOUSE

DBLP 1

DBLP 2

Inventory 1

Empirical Evaluation

Domains


Runtime performance 1

Hybrid algorithm adapted to run

over multiple databases

Kite without condensed CNs

Kite without adaptive rule

selection and without rule Deep

Full-fledged Kite algorithm

Inventory

DBLP

time (sec)

time (sec)

Kite without rule Deep

max

CCN

size

max

CCN

size

2-keyword queries, k=10, 5 databases

2-keyword queries, k=10, 2 databases

runtime vs. # of databases

Inventory

time (sec)

# of DBs

maximum CCN size = 4, 2-keyword queries, k=10

Runtime Performance (1)

runtime vs. maximum CCN size


Runtime performance 2

runtime vs. # of keywords in the query

DBLP

Inventory

time (sec)

time (sec)

|q|

|q|

max CCN=6, k=10, 2 databases

max CCN=4, k=10, 5 databases

runtime vs. # of answers requested

Inventory

time (sec)

time (sec)

k

k

2-keyword queries, max CCN=4, 5 databases

2-keyword queries, max CCN=4, |q|=2, 5 databases

Runtime Performance (2)


Query result quality

Query Result Quality

  • [email protected] = the fraction of answers that appear in the “ideal” list

[email protected]

[email protected]

k

k

OR-semantic queries

AND-semantic queries


Summary

Summary

  • Kite executes IR-style data integration

    • performs some automatic preprocessing

    • then immediately allows keyword querying

  • Relatively painless

    • no manual work!

    • no need to create global schema, to understand SQL

  • Can be very useful in many settings: e.g., on-the-fly, best-effort, for non-technical people

    • enterprises, on the Web, need only a few answers

    • emergency (e.g., hospital + police), need answers quickly


Future directions

Future Directions

  • Incorporate user feedback  interactive IR-style data integration

  • More efficient query processing

    • large # of databases, network latency

  • Extends to other types of data

    • XML, ontologies, extracted data, Web data

IR-style data integration is feasible and useful

extends current works on keyword search over DB

raises many opportunities for future work


Backup

BACKUP


Other experiments

accuracy (F1)

Kite over single database

time (sec)

max CCN size

Other Experiments

  • Schema matching helps improve join discovery algorithm drastically

  • Kite also improves single-database keyword search algorithm mHybrid


  • Login