### Keyword Search Over Graph Databases

Institute of Computer Science and Technology of Peking University

Instructor: Lei Zou

DB/IR: Different Positions

Structured data vs unstructured data

Schema vs non-schema

Query processing vs ranking

[Figure: DB/IR landscape, plotted by search style against data type.]

- Search axis: unstructured (keywords) vs structured (SQL, XQuery)
- Data axis: structured (records) vs unstructured (documents)
- IR systems: keyword search; DB systems: structured queries
- Areas between the two: keyword search over relational graphs; digital libraries, enterprise BI, Web 2.0; querying entities and relations from information extraction
- Related themes: text data, schema later, ranking

1. Approximate matching and record linkage

“Beijing University” & “PKU”

2. Too-many-answers ranking

Answers are ranked instead of “set”-style outputs

3. Schema relaxation and heterogeneity

No Schema approach & Schema Mapping & Data Integration

4. Information Extraction and Entity (and relationship) Search

Find all books written by Mark Twain

5. Uncertain Data Management

Some data are uncertain, such as data extracted by IE

Full-Text Keyword Search

[Figure: database tuples containing the keywords “Michelle” and “XML”; the query is “Michelle XML”.]

- Find a set of tuples that contain the keywords.
- Full text index (Oracle, DB2, SQL Server …)
- SQL full-text extensions provide a “contains(A, w)” predicate that retrieves the set of tuples containing keyword w in attribute A.
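The idea of a full-text index can be sketched in a few lines of Python. This is only an illustration, not how Oracle, DB2 or SQL Server implement it: the relation `TUPLES`, the tuple ids, and the helper names `build_full_text_index` and `contains` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical toy relation: (tuple id, text of attribute A).
TUPLES = [
    ("p1", "Michelle wrote an XML primer"),
    ("p2", "Query processing for XML streams"),
    ("a1", "Michelle"),
    ("p3", "Relational keyword search"),
]


def build_full_text_index(tuples):
    """Map each lower-cased word to the set of tuple ids that contain it."""
    index = defaultdict(set)
    for tuple_id, text in tuples:
        for word in text.lower().split():
            index[word].add(tuple_id)
    return index


def contains(index, keyword):
    """Return the ids of tuples whose attribute contains `keyword`."""
    return index.get(keyword.lower(), set())


index = build_full_text_index(TUPLES)
michelle_hits = contains(index, "Michelle")
xml_hits = contains(index, "XML")
both = michelle_hits & xml_hits  # tuples that contain both keywords
print(sorted(michelle_hits), sorted(xml_hits), sorted(both))
```

Intersecting the two hit sets only finds single tuples that contain every keyword, which is exactly the limitation that structural keyword search addresses.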

Structural Keyword Search in RDB

[Figure: tuples containing “Michelle” and “XML”, linked by foreign key references; the query is “Michelle XML”.]

- Tuples are connected through foreign key references in an RDB.
- Find a set of interconnected tuples that contain the keywords

Tuple Connections (Database Graph)

[Figure: database graph over tuples c1 to c5, p1 to p4, a1 to a3 and w1 to w6; some tuples contain “Michelle” or “XML”.]

Tuples are connected through foreign key references

What is the relationship between Michelle and XML?

[Figure: alternative joining networks of tuples that connect “Michelle” and “XML” for the query “Michelle XML”, built from tuples such as c2, p1, p2, p3, a1, a3, w1, w2 and w6.]

- How are keywords connected?
- Minimal total joining network of tuples
- Total: all keywords will be contained
- Minimal: it is not total if any tuple is removed
- Tmax: maximum number of nodes allowed
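The three conditions can be checked mechanically. The sketch below assumes a tiny hypothetical database graph (`TEXT`, `EDGES`) and reads "minimal" as: removing any single tuple leaves something that is no longer a connected, total network. The function names are illustrative only.

```python
# Hypothetical tuples with their text, and foreign key links between them.
TEXT = {"a1": "Michelle", "w1": "", "p1": "XML basics", "p2": "Graph search", "c1": ""}
EDGES = [("w1", "a1"), ("w1", "p1"), ("c1", "p1"), ("c1", "p2")]


def neighbors(node, nodes):
    """Tuples in `nodes` that are linked to `node` by a foreign key reference."""
    linked = set()
    for a, b in EDGES:
        if a == node:
            linked.add(b)
        elif b == node:
            linked.add(a)
    return linked & nodes


def is_connected(nodes):
    """True if the tuples form one connected piece (empty sets do not count)."""
    nodes = set(nodes)
    if not nodes:
        return False
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        for nxt in neighbors(stack.pop(), nodes) - seen:
            seen.add(nxt)
            stack.append(nxt)
    return seen == nodes


def is_total(nodes, keywords):
    """Total: every keyword appears in at least one tuple."""
    words = {w for n in nodes for w in TEXT[n].lower().split()}
    return all(k.lower() in words for k in keywords)


def is_mtjnt(nodes, keywords, tmax):
    """Minimal total joining network of tuples with at most `tmax` nodes."""
    nodes = set(nodes)
    if len(nodes) > tmax or not (is_connected(nodes) and is_total(nodes, keywords)):
        return False
    # Minimal: no single tuple can be dropped while staying connected and total.
    return not any(
        is_connected(nodes - {n}) and is_total(nodes - {n}, keywords) for n in nodes
    )


print(is_mtjnt({"a1", "w1", "p1"}, ["Michelle", "XML"], tmax=5))
print(is_mtjnt({"a1", "w1", "p1", "c1"}, ["Michelle", "XML"], tmax=5))
```

The second network fails the check because c1 can be removed without losing a keyword or breaking connectivity.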

DISCOVER: Keyword Search in Relational Databases

- Vagelis Hristidis

University of California, San Diego

- Yannis Papakonstantinou

University of California, San Diego

Motivation

- Keyword Search is the dominant information discovery method in documents
- Increasing amount of data stored in databases

Motivation

- Currently, information discovery in databases requires:
- Knowledge of schema
- Knowledge of a query language (e.g., SQL)
- Knowledge of the role of the keywords

- DISCOVER eliminates these requirements

Keyword Query - Semantics

- Keywords are:
- in same tuple
- in same relation
- connected through primary-foreign key relationships
- Score of result:
- distance of keywords within a tuple
- distance between keywords in terms of primary-foreign key connections
- weighted distance

Result of Keyword Query

- Result is tree T of tuples where:
- each edge corresponds to a primary-foreign key relationship
- every keyword is contained in some tuple of T (total)
- no tuple of T is redundant (minimal)

Example – Keyword Query

Query: “Smith, Miller”

Results:

Smaller sizes usually denote tighter association between keywords

Architecture

[Figure: DISCOVER architecture diagram, starting from the user's keyword query.]

Candidate Networks Generator - Challenges

- A keyword may appear in multiple tuples
- The number of candidate networks can be very large (sometimes unbounded)

Candidate Network - Example

[Figure: tuple-set graph. ORDERS^Smith, ORDERS^Miller and the free ORDERS tuple set each link n:1 to CUSTOMER, and CUSTOMER links n:1 to NATION.]

CN1: O^Smith ⋈ C ⋈ O^Miller, size = 2

CN2: O^Smith ⋈ C ⋈ N ⋈ C ⋈ O^Miller, size = 4

CN3: O^Smith ⋈ C ⋈ O^Miller ⋈ C, size = 3

CN4: O^Smith ⋈ C ⋈ O ⋈ C ⋈ O^Miller, size = 4

CN4 is pruned: in c1 – o – c2, c1 = c2, because of the primary-to-foreign-key relationship from CUSTOMER to ORDERS.

Pruning Condition: R^K – S – R^L

Candidate Networks Generator - Algorithm

- Traverse the tuple set graph breadth first
- Q ← tuple sets containing keyword k1
- For each network n of tuple sets in Q:
  - if pruning_condition(n), drop n
  - else if is_CN(n), output n
  - else expand n by one tuple set in every possible direction of the tuple set graph and insert the expansions into Q

[e.g., if n is O^Smith ⋈ C, then we add to Q:

O^Smith ⋈ C ⋈ O^Miller, O^Smith ⋈ C ⋈ O, O^Smith ⋈ C ⋈ N]
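The breadth-first loop can be sketched as follows. This is a simplified illustration, not the DISCOVER implementation: it only grows path-shaped networks (the real generator also builds trees), it writes tuple sets as `O^Smith` for the ORDERS tuples containing Smith, and the schema encoding (`FOREIGN_KEYS`, `KEYWORD_SETS`), the `max_joins` bound and all function names are assumptions made for the example.

```python
from collections import deque

# Hypothetical schema: (referencing relation, referenced relation) pairs,
# i.e. ORDERS holds a foreign key to CUSTOMER, and CUSTOMER to NATION.
FOREIGN_KEYS = {("O", "C"), ("C", "N")}
RELATIONS = ["O", "C", "N"]
# Non-free tuple sets: relation -> keywords that occur in that relation.
KEYWORD_SETS = {"O": ["Smith", "Miller"]}


def adjacent(r1, r2):
    return (r1, r2) in FOREIGN_KEYS or (r2, r1) in FOREIGN_KEYS


def pruning_condition(path):
    """True if the path contains R - S - R where S holds a foreign key to R."""
    return any(
        a[0] == c[0] and (b[0], a[0]) in FOREIGN_KEYS
        for a, b, c in zip(path, path[1:], path[2:])
    )


def candidate_networks(keywords, max_joins):
    """Breadth-first generation of path-shaped candidate networks."""
    first = keywords[0]
    queue = deque(
        [(rel, first)] for rel, kws in KEYWORD_SETS.items() if first in kws
    )
    results = []
    while queue:
        path = queue.popleft()
        if pruning_condition(path):
            continue  # drop n
        covered = {kw for _, kw in path if kw is not None}
        if covered == set(keywords) and path[-1][1] is not None:
            results.append(path)  # total, and both ends are keyword tuple sets
            continue
        if len(path) - 1 >= max_joins:
            continue  # size limit reached
        for rel in RELATIONS:  # expand by one adjacent tuple set
            if not adjacent(path[-1][0], rel):
                continue
            fresh = [k for k in KEYWORD_SETS.get(rel, []) if k not in covered]
            for kw in [None] + fresh:  # None marks a free tuple set
                queue.append(path + [(rel, kw)])
    return results


def show(path):
    return " - ".join(rel if kw is None else f"{rel}^{kw}" for rel, kw in path)


for cn in candidate_networks(["Smith", "Miller"], max_joins=4):
    print(show(cn))
```

On this toy schema the sketch produces the two networks written as CN1 and CN2 in the example, and it drops the C - O - C path because both CUSTOMER tuples would have to be the same tuple.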

Candidate Networks Generator is Complete and Non-Redundant

- Prove that the set of generated Candidate Networks is
- Complete: every solution is produced by some CN
- Non-redundant: for each CN there is a database instance in which removing that CN loses a solution

Size of Candidate Networks may be Unbounded

- Size is unbounded iff schema graph G has one of the following properties:
- There is a node of G that has at least two incoming edges.

[e.g., PARTSUPP → LINEITEM ← ORDERS]

- G has a directed cycle.

[eg: ancestor schemas]
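Both properties are easy to test on a schema graph given as a list of directed edges. The sketch below assumes the edges are already oriented the way the schema graph G orients them; the relation names in the usage lines and the function names are illustrative.

```python
def has_node_with_two_incoming(edges):
    """True if some node of the schema graph has at least two incoming edges."""
    indegree = {}
    for _, target in edges:
        indegree[target] = indegree.get(target, 0) + 1
    return any(count >= 2 for count in indegree.values())


def has_directed_cycle(edges):
    """Depth-first search with three colours to detect a directed cycle."""
    graph = {}
    for source, target in edges:
        graph.setdefault(source, []).append(target)
        graph.setdefault(target, [])
    WHITE, GREY, BLACK = 0, 1, 2
    color = {node: WHITE for node in graph}

    def visit(node):
        color[node] = GREY
        for nxt in graph[node]:
            if color[nxt] == GREY or (color[nxt] == WHITE and visit(nxt)):
                return True
        color[node] = BLACK
        return False

    return any(color[node] == WHITE and visit(node) for node in graph)


def cn_size_may_be_unbounded(edges):
    return has_node_with_two_incoming(edges) or has_directed_cycle(edges)


print(cn_size_may_be_unbounded([("PARTSUPP", "LINEITEM"), ("ORDERS", "LINEITEM")]))
print(cn_size_may_be_unbounded([("PERSON", "PERSON")]))  # ancestor-style self reference
print(cn_size_may_be_unbounded([("NATION", "CUSTOMER"), ("CUSTOMER", "ORDERS")]))
```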

Execution Plan - Challenges

- Generated SQL queries are expensive due to joins
- Reusability opportunities

Execution Plan

- Each CN corresponds to a SQL statement
- CN1: O^Smith ⋈ C ⋈ O^Miller

CN2: O^Smith ⋈ C ⋈ N ⋈ C ⋈ O^Miller

- Execution Plan

CN1 ← O^Smith ⋈ C ⋈ O^Miller

CN2 ← O^Smith ⋈ C ⋈ N ⋈ C ⋈ O^Miller

Reuse Common Subexpressions - Example

- Execution Plan

CN1 ← O^Smith ⋈ C ⋈ O^Miller

CN2 ← O^Smith ⋈ C ⋈ N ⋈ C ⋈ O^Miller

- Optimized Execution Plan

Temp ← O^Smith ⋈ C

CN1 ← Temp ⋈ O^Miller

CN2 ← Temp ⋈ N ⋈ C ⋈ O^Miller
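The saving can be counted directly if every join is assumed to cost 1. In the sketch below each plan step is a list of operands joined left to right; `join_count` and `substitute` are hypothetical helpers written for this illustration.

```python
def join_count(expression):
    """Joins needed to evaluate a chain of operands from left to right."""
    return len(expression) - 1


def substitute(expression, prefix, name):
    """Replace a leading shared prefix by the name of its materialized result."""
    if expression[: len(prefix)] == prefix:
        return [name] + expression[len(prefix):]
    return expression


CN1 = ["O^Smith", "C", "O^Miller"]
CN2 = ["O^Smith", "C", "N", "C", "O^Miller"]
TEMP = ["O^Smith", "C"]  # the common subexpression

naive_cost = join_count(CN1) + join_count(CN2)
optimized_cost = (
    join_count(TEMP)
    + join_count(substitute(CN1, TEMP, "Temp"))
    + join_count(substitute(CN2, TEMP, "Temp"))
)
print(naive_cost, optimized_cost)
```

Under this unit-cost assumption the plan drops from 6 joins to 5, because the join of O^Smith with C is computed once instead of twice.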

Optimal Reuse of Common Subexpressions is NP-Complete

- Simple Cost Model: each join has cost 1
- We prove that finding the optimal set of common subexpressions is NP-complete.

Proof: by reduction from the string compression problem

Cost Model and Greedy Optimization Algorithm

- Actual Cost Model: cost of a join is size of result
- Greedy algorithm:

In each iteration, build the intermediate result of size 1 (one join) that maximizes frequency^a / log^b(size)

Tuning of Greedy Algorithm

- a: frequency factor
  - favors reusability
- b: size factor
  - favors small intermediate results
- a = 1
- 0 ≤ b ≤ 0.3

Biopathway example

Relational DB

Growing Uses of Graph-Structured Data

- Data produced directly as graphs
  - W3C standards: XML, RDF and OWL
  - Bioinformatics: BioCyc, BioMaze, etc.
- Data that can be turned into graphs by restoring implicit connections among data items
  - Relational data → graph
  - Unstructured and heterogeneous data → graph
    - Personal information management (PIM)
    - Information extraction from the Web

Keyword Search on Graphs

- New and popular query paradigm
- Simple, user-friendly query interface
- Queryability on graphs without obvious schema
- Problem definition
- Data: a weighted directed graph G = (V, E), where each node v ∈ V may be labeled with text
- Query: a keyword query q = (w1, …, wm)
- Answer: an answer to q is a pair <r, (n1, …, nm)>, where r and the ni are nodes satisfying:
- Coverage: ni contains keyword wi for every i
- Connectivity: a directed path exists from r to ni for every i
- Top-k query: return k distinct root nodes, ranking each root by the best score among the answers rooted at it

Example: q = (c, d)

T1 = <3, (3, 6)>, T2 = <2, (12, 4)>, …
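The coverage and connectivity conditions translate into a short check. The graph and labels below are a made-up example (they are not the graph from the slide), and `is_answer` is an illustrative name.

```python
# Hypothetical directed graph (adjacency lists) and node labels.
GRAPH = {1: [2, 3], 2: [4], 3: [4], 4: [], 5: [1]}
LABELS = {1: "a", 2: "c", 3: "b", 4: "d", 5: "c d"}


def reachable(graph, source):
    """All nodes reachable from `source` along directed edges (including itself)."""
    seen, stack = {source}, [source]
    while stack:
        for nxt in graph[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def is_answer(graph, labels, query, root, matches):
    """Check coverage and connectivity for a candidate answer <root, matches>."""
    if len(matches) != len(query):
        return False
    coverage = all(word in labels[node].split() for word, node in zip(query, matches))
    reach = reachable(graph, root)
    connectivity = all(node in reach for node in matches)
    return coverage and connectivity


print(is_answer(GRAPH, LABELS, ("c", "d"), root=1, matches=(2, 4)))
print(is_answer(GRAPH, LABELS, ("c", "d"), root=4, matches=(2, 4)))
```

The sketch treats a node as reachable from itself, so a root may also serve as one of its own matches.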

Roadmap

- Define scoring function
- Interplay between semantics and ease of evaluation
- Propose better graph search strategies
- Worst-case performance guarantee
- Combine graph indexing with search
- Simple, but impractical, single-level index
- Practical bi-level index in BLINKS
- Partitioning-based indexing

Scoring Function

- Score definition
  - For an answer T = <r, (n1, …, nm)> to a query q = (w1, …, wm), the score is defined as $S(T) = f\big(S_r(r) + \sum_{i=1}^{m} S_n(n_i, w_i) + \sum_{i=1}^{m} S_p(r, n_i)\big)$
  - Considers both content and graph structure
- Match-distributive property
  - The contribution of matches and root-match paths can be computed in a distributive manner by summing over all matches
  - Allows pre-computation of the best path, independently for each node/keyword
- Graph-distance property
  - The contribution of a root-match path, Sp(r, ni), is defined to be the shortest-path distance from r to ni
- To simplify presentation, we focus on the path contribution Sp(r, ni)
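A minimal sketch of the path contribution follows, under explicit assumptions: f is the identity, the root and match terms are taken as 0, and a smaller total distance is treated as a better score. The graph, labels and function names are made up for illustration.

```python
import heapq

# Hypothetical weighted directed graph: node -> [(successor, edge weight)].
GRAPH = {"r1": [("x", 1), ("y", 2)], "r2": [("x", 3)], "x": [("y", 1)], "y": []}
LABELS = {"r1": set(), "r2": set(), "x": {"c"}, "y": {"d"}}


def shortest_distances(graph, source):
    """Dijkstra: shortest-path distance from `source` to every reachable node."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale queue entry
        for v, w in graph.get(u, []):
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist


def root_score(graph, labels, query, root):
    """Sum over keywords of the distance from root to its nearest match."""
    dist = shortest_distances(graph, root)
    total = 0
    for word in query:
        best = min((d for n, d in dist.items() if word in labels[n]), default=None)
        if best is None:
            return None  # some keyword is unreachable: not an answer root
        total += best
    return total


def top_k_roots(graph, labels, query, k):
    scored = []
    for root in labels:
        score = root_score(graph, labels, query, root)
        if score is not None:
            scored.append((score, root))
    return sorted(scored)[:k]


print(top_k_roots(GRAPH, LABELS, ("c", "d"), k=2))
```

Because each keyword's contribution is chosen independently, the best match per keyword can be precomputed per node, which is the point of the match-distributive property.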

Graph Search Strategies

- Backward search [Bhalotia et al., ICDE’02]
  - Start from the keyword nodes (nodes containing at least one query keyword)
  - In each search step, choose an incoming edge to a previously visited node and follow the edge backward to visit its source node
  - Discover an answer root r once r has been visited from every keyword
- Bidirectional search [Kacholia et al., VLDB’05]
  - Explore the graph by following forward edges as well
  - Choose which node to visit using heuristic activation factors

[Figure: conceptually, the search expands a cluster of visited nodes in the graph for each keyword (w1, w2).]

Graph Search Strategies (cont’d)

- Each search step needs to decide
  - Which node to expand within a cluster
  - Which keyword cluster to expand
- Our approach
  - Equi-distance expansion within each keyword cluster
  - Cost-balanced expansion across keywords: balance the number of nodes expanded across clusters
  - Cost is at most m times that of an “oracle” backward search algorithm (m = number of query keywords)

- Equi-distance expansion: expand the node closest to the cluster origin in graph distance (optimal)
- Distance-balanced expansion: balance the diameter across all clusters (no guarantee)

[Figure: example with 3 keywords w1, w2, w3, comparing the optimal and the m-optimal expansion.]
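The two expansion policies can be combined in a small backward search. This sketch is an assumption-laden illustration rather than the BLINKS algorithm: it keeps one priority queue per keyword cluster, always expands the cluster with the fewest expanded nodes (cost-balanced), pops the nearest node within that cluster (equi-distance), and runs to exhaustion instead of stopping early at the top k. The edge list, labels and function name are made up.

```python
import heapq

# Hypothetical weighted directed edges (source, target, weight) and node labels.
EDGES = [("r1", "x", 1), ("r1", "y", 2), ("r2", "x", 3), ("x", "y", 1)]
LABELS = {"r1": set(), "r2": set(), "x": {"c"}, "y": {"d"}}


def backward_search(edges, labels, query):
    """Return (score, root) pairs, where score sums the distances to all keywords."""
    reverse = {}
    for u, v, w in edges:
        reverse.setdefault(v, []).append((u, w))
    # One cluster per keyword, seeded with the nodes that contain the keyword.
    heaps = [[(0, n) for n in sorted(labels) if word in labels[n]] for word in query]
    settled = [dict() for _ in query]  # per cluster: node -> shortest distance
    answers = {}
    while any(heaps):
        # Cost-balanced: expand the non-empty cluster with the fewest expanded nodes.
        i = min((j for j, h in enumerate(heaps) if h), key=lambda j: len(settled[j]))
        # Equi-distance: within the cluster, take the node closest to its origin.
        d, u = heapq.heappop(heaps[i])
        if u in settled[i]:
            continue
        settled[i][u] = d
        if all(u in s for s in settled):  # visited from every keyword: a root
            answers[u] = sum(s[u] for s in settled)
        for pred, w in reverse.get(u, []):  # follow incoming edges backward
            if pred not in settled[i]:
                heapq.heappush(heaps[i], (d + w, pred))
    return sorted((score, root) for root, score in answers.items())


print(backward_search(EDGES, LABELS, ("c", "d")))
```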

Using a Single-Level Index

- What is inefficient about searching without an index?
  - For each keyword, the search must maintain a priority queue storing the nodes on the current expansion “frontier” → high space/time complexity
  - Existing forward expansion is largely guesswork
- Our ideas
  - (I) For each keyword, index nodes in the order in which the search visits them: keyword-node lists
    - For each keyword w, a list LKN(w) contains the nodes that can reach w, ordered by their shortest distances to w
  - (II) Index shortest distances from nodes to keywords, enabling forward jumps: node-keyword map
    - Given node u and keyword w, a hash map MNK(u, w) returns the shortest distance from u to w in O(1) time
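Both structures can be built with one backward Dijkstra pass per keyword. The sketch below is simplified on purpose: each entry stores only the distance, and it omits any additional fields a full index entry might carry. The edge list, labels and the name `build_single_level_index` are assumptions.

```python
import heapq

# Hypothetical weighted directed edges (source, target, weight) and node labels.
EDGES = [("r1", "x", 1), ("r1", "y", 2), ("r2", "x", 3), ("x", "y", 1)]
LABELS = {"r1": set(), "r2": set(), "x": {"c"}, "y": {"d"}}


def build_single_level_index(edges, labels):
    """Return (lkn, mnk): keyword-node lists and the node-keyword map."""
    reverse = {}
    for u, v, w in edges:
        reverse.setdefault(v, []).append((u, w))
    keywords = sorted({word for words in labels.values() for word in words})
    lkn, mnk = {}, {}
    for word in keywords:
        dist = {}
        # Seed with the nodes containing the keyword, then walk edges backward.
        heap = [(0, n) for n in sorted(labels) if word in labels[n]]
        while heap:
            d, u = heapq.heappop(heap)
            if u in dist:
                continue
            dist[u] = d
            for pred, w in reverse.get(u, []):
                if pred not in dist:
                    heapq.heappush(heap, (d + w, pred))
        # LKN(w): nodes that can reach w, ordered by shortest distance to w.
        lkn[word] = sorted((d, n) for n, d in dist.items())
        # MNK(u, w): O(1) lookup of the shortest distance from u to w.
        for n, d in dist.items():
            mnk[(n, word)] = d
    return lkn, mnk


lkn, mnk = build_single_level_index(EDGES, LABELS)
print(lkn["c"])
print(mnk[("r1", "d")])
```

A pair that is missing from `mnk` means the node cannot reach the keyword at all.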

[Figure: example graph with nodes v1 to v7 and sample index entries. LKN(w1) contains entries such as (0, v2, v2, v2), (0, v7, v7, v7) and (1, v6, v7, v7). MNK(v2, w1) = (0, v2, v2), MNK(v2, w2) = (1, v4, v4) and MNK(v4, w1) = (∞, -, -).]

Search with Single-Level Index

- Search algorithm using the single-level index, applying our search strategies
  - Equi-distance expansion → use one cursor to traverse each LKN(wi)
  - Cost-balanced expansion → pick the cursor to expand next in a round-robin manner
  - Forward expansion → when visiting a node, look up its distances to the other keywords in MNK
- Efficiency
  - Exploration state is managed by m cursors instead of m priority queues
  - Finding the next node to visit is much faster from a cursor than from a queue
  - Forward expansion allows the search to converge on answers faster
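The cursor-based search can be sketched as below. It assumes a precomputed toy index (`LKN`, `MNK`, distances only), scores a root by the sum of its distances to the keywords (smaller is better), and uses a simple stopping rule that is an assumption of this sketch: any root not yet seen in any list must score at least the sum of the distances under the cursors.

```python
INF = float("inf")

# Hypothetical precomputed index: LKN lists of (distance, node), sorted by distance.
LKN = {
    "c": [(0, "x"), (1, "r1"), (3, "r2")],
    "d": [(0, "y"), (1, "x"), (2, "r1"), (4, "r2")],
}
MNK = {(node, word): d for word, entries in LKN.items() for d, node in entries}


def search(lkn, mnk, query, k):
    """Top-k roots as (score, root) pairs, using one cursor per keyword."""
    cursors = [0] * len(query)
    answers, seen = {}, set()
    turn = 0
    while True:
        heads = [
            lkn[w][c][0] if c < len(lkn[w]) else INF for w, c in zip(query, cursors)
        ]
        bound = sum(heads)  # lower bound on the score of any unseen root
        if bound == INF:
            break  # a list is exhausted: unseen nodes cannot reach that keyword
        if len(answers) >= k and sorted(answers.values())[k - 1] <= bound:
            break  # the current top k cannot be beaten
        i = turn  # cost-balanced: advance the cursors in round-robin order
        turn = (turn + 1) % len(query)
        _, node = lkn[query[i]][cursors[i]]
        cursors[i] += 1
        if node in seen:
            continue
        seen.add(node)
        # Forward expansion: look up distances to all keywords in the map.
        dists = [mnk.get((node, w), INF) for w in query]
        if INF not in dists:
            answers[node] = sum(dists)
    return sorted((score, root) for root, score in answers.items())[:k]


print(search(LKN, MNK, ("c", "d"), k=2))
```

Each cursor replaces a priority queue: the next node to visit for a keyword is simply the next list entry.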

[Figure: search walkthrough over the keyword-node lists LKN(w1) and LKN(w2) and the node-keyword map MNK, on a graph with nodes v1 to v12. LKN(w2) contains entries such as (0, v4, v4, v4), (0, v11, v11, v11) and (1, v2, v4, v4). Partial answers are rooted at v4 and v2; the answers shown are <v2, (0, 1)> and <v6, (1, 2)>.]

Bi-Level Indexing in BLINKS

- Unfortunately, the single-level index is impractical for large graphs
  - Space complexity: O(|V|K), where K is the number of keywords
- BLINKS: Bi-Level Index for Keyword Search
  - Partition the data graph into multiple (say B) subgraphs, or blocks
    - Partitioning is by nodes, called portals, which play key roles in the search
    - There are many partitioning algorithms, such as breadth-first partitioning and METIS
  - (Top-level) block index: maps keywords and portals to blocks
    - Purpose: initiate backward expansion in relevant blocks; guide backward expansion across blocks (through portals)
    - Space complexity: O(|V| + BP), where P is the total number of portals
  - (Low-level) intra-block index: stores the same kind of information as a single-level index, but restricted to each block
    - Purpose: help backward expansion and forward jumps within blocks
    - Space complexity: O(|V|K / B)
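The top-level block index can be sketched as two maps. The example assumes, as one possible reading of node-based partitioning, that every edge is assigned to a block and that a node touching edges of more than one block is a portal. The edge-to-block assignment `EDGE_BLOCKS`, the labels and the function name are hypothetical, and the partitioning step itself is not shown.

```python
from collections import defaultdict

# Hypothetical partition: each directed edge is assigned to one block id.
EDGE_BLOCKS = {("r1", "x"): 0, ("r2", "x"): 0, ("x", "y"): 1, ("r1", "y"): 1}
LABELS = {"r1": set(), "r2": set(), "x": {"c"}, "y": {"d"}}


def build_block_index(edge_blocks, labels):
    """Return (keyword -> blocks, portal -> blocks) for a given edge partition."""
    node_blocks = defaultdict(set)
    for (u, v), block in edge_blocks.items():
        node_blocks[u].add(block)
        node_blocks[v].add(block)
    # Portals are the nodes shared by more than one block.
    portal_blocks = {n: blocks for n, blocks in node_blocks.items() if len(blocks) > 1}
    keyword_blocks = defaultdict(set)
    for node, words in labels.items():
        for word in words:
            keyword_blocks[word] |= node_blocks.get(node, set())
    return dict(keyword_blocks), portal_blocks


keyword_blocks, portal_blocks = build_block_index(EDGE_BLOCKS, LABELS)
print(keyword_blocks)
print(portal_blocks)
```

With these maps a search can start backward expansion only in the blocks that contain a keyword, and it knows which blocks to enter next when the expansion reaches a portal.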

Search with the Bi-Level Index

- Similar to searching with the single-level index in
  - Overall expansion policies (which keyword cluster/node to explore next)
  - Index access (scanning LKN lists and looking up the MNK hash map)
- New challenges/complications introduced by graph partitioning
  - A single cursor per keyword is no longer sufficient
    - Backward expansion must proceed simultaneously in multiple blocks that contain the keyword
    - So we maintain a queue of cursors, one for each block we are currently exploring
  - Backward expansion needs to continue across block boundaries
    - When encountering boundaries, we retrieve new blocks to visit from the block index and add them to the queue
  - Distance information in the intra-block index ≠ global shortest distance
    - The path with the shortest distance may go across blocks
    - Our exploration order guarantees the correct global shortest distance
