Keyword search over graph databases



北京大学计算机科学技术研究所

Institute of Computer Science and Technology of Peking University

Keyword Search Over Graph Databases

Instructor: Lei Zou



DB/IR: Different Positions

Structured data vs unstructured data

Schema vs non-schema

Query processing vs ranking


In the Past

  • Search: unstructured (keywords) vs structured (SQL, XQuery)

  • Data: structured (records) vs unstructured (documents)

  • IR systems: keyword search over unstructured data (documents)

  • DB systems: structured queries (SQL, XQuery) over structured data (records)


Now

  • Search: unstructured (keywords) vs structured (SQL, XQuery)

  • Data: structured (records) vs unstructured (documents)

  • IR systems: keyword search over documents

  • DB systems: structured queries over records

  • Keyword search over relational graphs: keywords over structured data

  • Querying entities and relations from information extraction: structured queries over text data

  • Digital libraries, enterprise BI, Web 2.0

  • Schema later, ranking



In the future

Search

Integrated DB & IR System

(Models, Algorithms, Platform)

Data



1. Approximate matching and record linkage

“Beijing University” & “PKU”

2. Too-many-answers ranking

Answers are ranked instead of “set”-style outputs

3. Schema relaxation and heterogeneity

No Schema approach & Schema Mapping & Data Integration

4. Information Extraction and Entity (and relationship) Search

Find all books written by Mark Twain

5. Uncertain Data Management

Some data are uncertain, such as data extracted by IE


The First Step

IR-style Search Over Database System

Hard to Learn SQL

No idea about the Schema

keywords


Full-Text Keyword Search

[Figure: the query “Michelle XML” matched against individual database tuples that contain “Michelle” or “XML”]

  • Find a set of tuples that contain the keywords.

  • Full text index (Oracle, DB2, SQL Server …)

  • SQL full-text extensions provide a CONTAINS(A, w) predicate that retrieves the set of tuples containing keyword w in attribute A.


-- Full-text predicate: rows whose address contains the keyword 'beijing'
SELECT student_id, student_name
FROM students
WHERE CONTAINS(address, 'beijing');


Structural Keyword Search in RDB

[Figure: the query “Michelle XML” matched against connected tuples, some containing “Michelle” and others “XML”]

  • Tuples are connected through foreign key references in an RDB.

  • Find a set of interconnected tuples that contain the keywords


RDB Example

[Figure: example database with four relations: Paper (p1–p4), Author (a1–a3), Write (w1–w6), and Cite (c1–c5)]


Tuple Connections (Database Graph)

[Figure: the tuples p1–p4, a1–a3, w1–w6, and c1–c5 drawn as a graph; some nodes contain “Michelle” and others “XML”]

Tuples are connected through foreign key references


What is the relationship between Michelle and XML?

[Figure: the database graph for the query “Michelle XML”, with the tuples containing “Michelle” and “XML” labeled]


[Figure: several small networks of tuples (a1, a3, w1, w2, w6, p1, p2, p3, c1, c2), each connecting a “Michelle” tuple to an “XML” tuple]

  • How are keywords connected?

  • Minimal total joining network of tuples

    • Total: all keywords will be contained

    • Minimal: it is not total if any tuple is removed

  • Tmax: maximum number of nodes allowed
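The definition above can be checked mechanically. The following is a minimal Python sketch, not taken from the slides: the function names, the tuple texts, and the edge list are hypothetical, and the tuple ids only echo the labels used in the figures. It treats a network as minimal when removing any single tuple leaves something that is no longer a connected, total network.

```python
def is_connected(nodes, edges):
    """True if the undirected graph induced on `nodes` is non-empty and connected."""
    nodes = set(nodes)
    if not nodes:
        return False
    adj = {n: set() for n in nodes}
    for u, v in edges:
        if u in nodes and v in nodes:
            adj[u].add(v)
            adj[v].add(u)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        for nxt in adj[stack.pop()] - seen:
            seen.add(nxt)
            stack.append(nxt)
    return seen == nodes


def is_total(nodes, text, keywords):
    """Total: every keyword occurs in the text of at least one tuple."""
    words = set()
    for n in nodes:
        words |= set(text[n].lower().split())
    return all(k.lower() in words for k in keywords)


def is_mtjnt(nodes, edges, text, keywords, t_max):
    """Minimal total joining network of tuples with at most t_max nodes."""
    nodes = set(nodes)
    if len(nodes) > t_max or not is_connected(nodes, edges):
        return False
    if not is_total(nodes, text, keywords):
        return False
    # Minimal: dropping any one tuple must break connectivity or totality.
    for n in nodes:
        rest = nodes - {n}
        if is_connected(rest, edges) and is_total(rest, text, keywords):
            return False
    return True


# Hypothetical tuple contents and foreign-key links.
text = {"a1": "Michelle", "w1": "", "p1": "XML keyword search", "c1": "", "p2": "graph indexing"}
edges = [("a1", "w1"), ("w1", "p1"), ("p1", "c1"), ("c1", "p2")]

print(is_mtjnt(["a1", "w1", "p1"], edges, text, ["Michelle", "XML"], t_max=5))        # True
print(is_mtjnt(["a1", "w1", "p1", "c1"], edges, text, ["Michelle", "XML"], t_max=5))  # False
```

The second call fails because c1 can be removed without losing a keyword or disconnecting the network.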


Outline

  • Schema Based

    DISCOVER Algorithm

  • Non-Schema Based

    BLINKS Algorithm



DISCOVER: Keyword Search in Relational Databases

  • Vagelis Hristidis

    University of California, San Diego

  • Yannis Papakonstantinou

    University of California, San Diego



Motivation

  • Keyword Search is the dominant information discovery method in documents

  • Increasing amount of data stored in databases



  • Currently, information discovery in databases requires:

    • Knowledge of schema

    • Knowledge of a query language (e.g., SQL)

    • Knowledge of the role of the keywords

  • DISCOVER eliminates these requirements


Keyword Query - Semantics

  • Keywords are:

    • in same tuple

    • in same relation

    • connected through primary-foreign key relationships

  • Score of result:

    • distance of keywords within a tuple

    • distance between keywords in terms of primary-foreign key connections

    • weighted distance


Result of Keyword Query

  • Result is tree T of tuples where:

    • each edge corresponds to a primary-foreign key relationship

    • every keyword contained in a tuple of T (total)

    • no tuple of T is redundant (minimal)


Example - Schema

Subset of TPC-H schema:

ORDERS –(n:1)– CUSTOMER –(n:1)– NATION



Example - Data


Example – Keyword Query

Query: “Smith, Miller”

Results:

Smaller sizes usually denote tighter association between keywords


Architecture

[Figure: DISCOVER architecture diagram]


Candidate Networks Generator - Challenges

  • A keyword may appear in multiple tuples

  • The number of candidate networks can be very large (sometimes unbounded)


Candidate Network - Example

[Figure: tuple set graph for the query “Smith, Miller” with nodes ORDERS^Smith, ORDERS^Miller, ORDERS, CUSTOMER, and NATION, connected by n:1 edges]

CN1: O^Smith ⋈ C ⋈ O^Miller, size = 2

CN2: O^Smith ⋈ C ⋈ N ⋈ C ⋈ O^Miller, size = 4

CN3: O^Smith ⋈ C ⋈ O^Miller ⋈ C, size = 3

CN4: O^Smith ⋈ C ⋈ O ⋈ C ⋈ O^Miller, size = 4

c1 – o – c2

c1 = c2, because the primary-to-foreign-key edge goes from CUSTOMER to ORDERS (each order references a single customer)

Pruning Condition: R^K – S – R^L


Candidate Networks Generator - Algorithm

  • Traverse tuple set graph breadth first

  • Q ← tuple sets containing keyword k1

  • For each network n of tuple sets in Q do

    • If pruning_condition(n) drop n

    • else if is_CN(n) output n

    • else expand n by one tuple set to all possible directions in tuple set graph and insert expansions to Q

      [e.g., if n is O^Smith ⋈ C then we add to Q

      O^Smith ⋈ C ⋈ O^Miller, O^Smith ⋈ C ⋈ O, O^Smith ⋈ C ⋈ N]
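The breadth-first loop above can be sketched in Python. This is a simplified illustration under several assumptions that are not in the slides: the schema is hard-coded as the three-relation TPC-H subset, both keywords are assumed to occur only in ORDERS, each keyword tuple set is used at most once per network, the pruning rule covers only the c1 – o – c2 pattern, and the bound `max_joins` counts joins (the same unit as the "size" of a CN). All function names are hypothetical.

```python
from collections import deque

# Hypothetical schema edges as (primary-key side, foreign-key side).
PK_FK = {("C", "O"), ("N", "C")}
# Assumption for this sketch: both keywords occur only in ORDERS tuples.
KEYWORD_RELATIONS = {"Smith": ["O"], "Miller": ["O"]}


def schema_neighbors(rel):
    return sorted(b if a == rel else a for a, b in PK_FK if rel in (a, b))


def adjacency(nodes, edges):
    adj = {i: [] for i in range(len(nodes))}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    return adj


def canonical(nodes, edges):
    """Order-independent encoding of a labelled tree, used to drop duplicates."""
    adj = adjacency(nodes, edges)

    def enc(u, parent):
        rel, kw = nodes[u]
        label = rel if kw is None else f"{rel}^{kw}"
        kids = sorted(enc(v, u) for v in adj[u] if v != parent)
        return "(" + label + "".join(kids) + ")"

    return min(enc(root, None) for root in range(len(nodes)))


def pruned(nodes, edges):
    """c1 - o - c2 rule: a tuple cannot reference two different tuples of one relation."""
    for s, nbrs in adjacency(nodes, edges).items():
        rels = [nodes[v][0] for v in nbrs]
        if any(rels.count(r) > 1 and (r, nodes[s][0]) in PK_FK for r in set(rels)):
            return True
    return False


def is_cn(nodes, edges, query):
    """Candidate network: covers all keywords and has no free tuple set as a leaf."""
    adj = adjacency(nodes, edges)
    covered = {kw for _, kw in nodes if kw is not None}
    leaves_ok = all(nodes[u][1] is not None for u in adj if len(adj[u]) <= 1)
    return covered == set(query) and leaves_ok


def expansions(nodes, edges, query):
    used = {kw for _, kw in nodes if kw is not None}
    for i, (rel, _) in enumerate(nodes):
        for nrel in schema_neighbors(rel):
            kws = [k for k in query if k not in used and nrel in KEYWORD_RELATIONS[k]]
            for kw in [None] + kws:  # None stands for the free tuple set
                yield nodes + ((nrel, kw),), edges + ((i, len(nodes)),)


def candidate_networks(query, max_joins):
    start = [(((rel, query[0]),), ()) for rel in KEYWORD_RELATIONS[query[0]]]
    queue, seen, out = deque(start), set(), []
    while queue:
        nodes, edges = queue.popleft()
        if pruned(nodes, edges):
            continue
        if is_cn(nodes, edges, query):
            out.append((nodes, edges))
            continue
        if len(edges) >= max_joins:
            continue
        for cand in expansions(nodes, edges, query):
            key = canonical(*cand)
            if key not in seen:
                seen.add(key)
                queue.append(cand)
    return out


for nodes, edges in candidate_networks(["Smith", "Miller"], max_joins=4):
    print(canonical(nodes, edges), "size =", len(edges))
```

Under these assumptions the sketch reports two networks, of sizes 2 and 4, and drops any network in which one ORDERS tuple set is joined to two CUSTOMER tuple sets.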


Candidate Networks Generator is Complete and Non-Redundant

  • Prove that the set of Candidate Networks generated is

    • Complete: every solution is generated by some CN

    • Non-redundant: there is a database instance where removing a CN loses a solution


Size of Candidate Networks may be Unbounded

  • Size is unbounded iff schema graph G has one of the following properties:

    • There is a node of G that has at least two incoming edges.

      [e.g., PARTSUPP → LINEITEM ← ORDERS]

    • G has a directed cycle.

      [e.g., ancestor schemas]





Execution Plan - Challenges

  • Generated SQL queries are expensive due to joins

  • Reusability opportunities


Execution Plan

  • Each CN corresponds to a SQL statement

  • CN1: O^Smith ⋈ C ⋈ O^Miller

    CN2: O^Smith ⋈ C ⋈ N ⋈ C ⋈ O^Miller

  • Execution Plan

    CN1 ← O^Smith ⋈ C ⋈ O^Miller

    CN2 ← O^Smith ⋈ C ⋈ N ⋈ C ⋈ O^Miller


Reuse Common Subexpressions - Example

  • Execution Plan

    CN1 ← O^Smith ⋈ C ⋈ O^Miller

    CN2 ← O^Smith ⋈ C ⋈ N ⋈ C ⋈ O^Miller

  • Optimized Execution Plan

    Temp ← O^Smith ⋈ C

    CN1 ← Temp ⋈ O^Miller

    CN2 ← Temp ⋈ N ⋈ C ⋈ O^Miller
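To see what the shared subexpression saves, the two plans can be compared by counting joins. The snippet below is an illustrative Python sketch that assumes every join costs 1; the plan encoding and the name `plan_cost` are hypothetical.

```python
def plan_cost(plan):
    """Each step joins its operands left to right, so k operands cost k - 1 joins."""
    return sum(len(operands) - 1 for _, operands in plan)


naive = [
    ("CN1", ["O^Smith", "C", "O^Miller"]),
    ("CN2", ["O^Smith", "C", "N", "C", "O^Miller"]),
]
shared = [
    ("Temp", ["O^Smith", "C"]),              # computed once, reused twice
    ("CN1", ["Temp", "O^Miller"]),
    ("CN2", ["Temp", "N", "C", "O^Miller"]),
]

print(plan_cost(naive), plan_cost(shared))  # 6 5
```

Reusing Temp removes one join here; the saving grows with the number of candidate networks that share the same prefix.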



Optimal Reuse of Common Subexpressions is NP-Complete

  • Simple Cost Model: each join has cost 1

  • Prove that finding Optimal Common Subexpressions is NP-Complete.

    Proof: Reduce string compression problem



Cost Model and Greedy Optimization Algorithm

  • Actual Cost Model: cost of a join is size of result

  • Greedy algorithm:

    In each iteration build intermediate result of size 1 (1 join) that maximizes


Tuning of Greedy Algorithm

  • a: frequency factor

    • favors reusability

  • b: size factor

    • favors small intermediate results

  • a = 1

  • 0 ≤ b ≤ 0.3


Outline

  • Schema Based

    DISCOVER Algorithm

  • Non-Schema Based

    BLINKS Algorithm


Growing Uses of Graph-Structured Data

[Figures: RDF example, biopathway example, relational DB]

  • Data produced directly based on graphs

    • W3C standards: XML, RDF and OWL

    • Bioinformatics: BioCyc, BioMaze, etc.

  • Data that can be made graphs by restoring implicit connections among data items

    • Relational data → Graph

    • Unstructured and heterogeneous data → Graph

      • Personal information management (PIM)

      • Information extraction from Web



Keyword Search on Graphs

  • New and popular query paradigm

    • Simple, user-friendly query interface

    • Queryability on graphs without obvious schema

  • Problem definition

    • Data: a weighted directed graph G=(V,E), where each node v ∈ V may be labeled with text

    • Query: a keyword query q = (w1, …, wm)

    • Answer: an answer to q is a pair <r, (n1,…,nm)>, where r and the ni’s are nodes satisfying:

      • Coverage: ni contains keyword wi for every i

      • Connectivity: a directed path exists from r to ni for every i

    • Top-k query: return the k distinct root nodes with the best scores, where each root is scored by its best answer

Example: q = (c,d)

T1 = <3, (3,6)>

T2 = <2, (12,4)>

…
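The coverage and connectivity conditions translate directly into a small check. The following Python sketch is illustrative only: the graph, the labels, and the function names are hypothetical, and edge weights are ignored because they do not affect whether an answer is valid.

```python
from collections import deque


def reachable(graph, src):
    """All nodes reachable from src by following directed edges (src included)."""
    seen, queue = {src}, deque([src])
    while queue:
        for nxt in graph.get(queue.popleft(), ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def is_answer(graph, labels, query, root, matches):
    """Check coverage (n_i contains w_i) and connectivity (root reaches every n_i)."""
    if len(matches) != len(query):
        return False
    covered = all(w in labels.get(n, set()) for n, w in zip(matches, query))
    reach = reachable(graph, root)
    return covered and all(n in reach for n in matches)


# Hypothetical directed graph (adjacency lists) and node labels.
graph = {1: [2, 3], 2: [4], 3: [4], 4: []}
labels = {2: {"c"}, 3: {"c", "d"}, 4: {"d"}}

print(is_answer(graph, labels, ("c", "d"), root=1, matches=(2, 4)))  # True
print(is_answer(graph, labels, ("c", "d"), root=4, matches=(2, 4)))  # False
```

The second call fails on connectivity: node 4 has no outgoing edges, so it cannot reach node 2.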



Roadmap

  • Define scoring function

    • Interplay between semantics and ease of evaluation

  • Propose better graph search strategies

    • Worst-case performance guarantee

  • Combine graph indexing with search

    • Simple, but impractical, single-level index

    • Practical bi-level index in BLINKS

      • Partitioning-based indexing


Scoring Function

[Figure: an answer tree whose answer root r is connected to the matches n1, n2, n3 by paths from root to matches]

  • Score definition

    • For an answer T = <r, (n1,…,nm)> to a query q = (w1, …, wm), the score is defined as S(T) = f(Sr(r) + Σi Sn(ni, wi) + Σi Sp(r, ni))

    • Considers both content and graph structure

  • Match-distributive property

    • Contribution of matches and root-match paths can be computed in a distributive manner by summing over all matches

    • Allow pre-computation of best path, independently for each node/keyword

  • Graph-distance property

    • The contribution of a root-match path, Sp(r,ni), is defined to be shortest-path distance from r to ni

  • To simplify presentation, we focus on the path contribution Sp(r,ni)
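Restricting the score to the path contribution makes the best answer for a given root easy to compute: by the match-distributive property, the nearest match can be chosen independently for each keyword. The Python sketch below illustrates this with Dijkstra's algorithm. The graph, the labels, and the function names are hypothetical, and it assumes that a smaller total distance means a better answer.

```python
import heapq


def shortest_distances(graph, src):
    """Dijkstra over non-negative edge weights; graph[u] is a list of (v, weight)."""
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in graph.get(u, ()):
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist


def best_answer(graph, labels, query, root):
    """Return (sum of Sp(root, n_i), matches), or None if some keyword is unreachable."""
    dist = shortest_distances(graph, root)
    total, matches = 0, []
    for w in query:
        candidates = [(d, n) for n, d in dist.items() if w in labels.get(n, ())]
        if not candidates:
            return None
        d, n = min(candidates)  # nearest match, chosen per keyword
        total += d
        matches.append(n)
    return total, tuple(matches)


# Hypothetical weighted directed graph and node labels.
graph = {"r": [("a", 1), ("b", 4)], "a": [("b", 1)], "b": []}
labels = {"a": {"c"}, "b": {"d"}}

print(best_answer(graph, labels, ("c", "d"), "r"))  # (3, ('a', 'b'))
print(best_answer(graph, labels, ("c", "d"), "b"))  # None
```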




    Graph Search Strategies

    • Backward search [Bhalotia et al., ICDE’02]

      • Starting from keyword nodes (containing at least one query keyword)

      • In each search step, choose an incoming edge to a previously visited node and follow the edge backward to visit its source node

      • Discover an answer root r if r is visited from every keyword

    • Bidirectional search [Kacholia et al., VLDB’05]

      • Explore the graph by following forward edges as well

      • Choose which node to visit by heuristic activation factors
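Backward search can be illustrated with a deliberately simplified Python sketch. Instead of interleaving the expansions and stopping early, as the published algorithms do, it runs one complete backward Dijkstra per keyword over reversed edges and then intersects the visited sets. The function names and the example graph are hypothetical, and a smaller distance sum is assumed to be better.

```python
import heapq


def backward_distances(graph, sources):
    """Multi-source Dijkstra on reversed edges: distance from each node to its nearest source."""
    rev = {}
    for u, outs in graph.items():
        for v, w in outs:
            rev.setdefault(v, []).append((u, w))
    dist = {s: 0 for s in sources}
    heap = [(0, s) for s in sources]
    heapq.heapify(heap)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in rev.get(u, ()):
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist


def backward_search(graph, labels, query, k):
    """Top-k roots: nodes visited from every keyword, ranked by summed distance."""
    per_keyword = [
        backward_distances(graph, [n for n, ws in labels.items() if w in ws])
        for w in query
    ]
    roots = set(per_keyword[0]).intersection(*per_keyword[1:])
    scored = sorted((sum(d[r] for d in per_keyword), r) for r in roots)
    return scored[:k]


# Hypothetical weighted directed graph and node labels.
graph = {"r": [("a", 1), ("b", 4)], "a": [("b", 1)], "b": []}
labels = {"a": {"c"}, "b": {"d"}}

print(backward_search(graph, labels, ["c", "d"], k=2))  # [(1, 'a'), (3, 'r')]
```

Node a is the best root because it contains c itself and reaches d in one step; r reaches both keywords at a total distance of 3.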

    [Figure: clusters of visited nodes growing around keywords w1 and w2 in the graph until they intersect]

    Conceptually, expand clusters of visited nodes for each keyword



    Graph Search Strategies (cont’d)

    • Each search step needs to decide

      • Which node to expand within a cluster

      • Which keyword cluster to expand

    • Our approach

      • Equi-distance expansion in each keyword

      • Cost-balanced expansion across keywords: balance # of nodes expanded across clusters

        • Cost is at most m times that of an “oracle” backward search algorithm (m = # of query keywords)

    • Equi-distance expansion: expand the node closest to the cluster origin in graph distance (optimal)

    • Distance-balanced expansion: balance diameter across all clusters (no guarantee)

    • Cost-balanced expansion: m-optimal

    [Figure: example with 3 keywords w1, w2, w3]


    Using a Single-Level Index

    • What is inefficient about search without an index?

      • Needs to maintain, for each keyword, a priority queue storing nodes in current expansion “frontier” → High space/time complexity

      • Existing forward expansion is largely guesswork

    • Our ideas

      • (I) For each keyword, index nodes in the order of visiting them in search: Keyword-node lists

        • For each keyword w, a list LKN(w) contains nodes that can reach w, ordered by their shortest distances to w

      • (II) Index shortest distances from nodes to keywords, enabling forward jumps: Node-keyword map

        • Given node u and keyword w, a hash map MNK (u,w) returns the shortest distance from u to w in O(1) time
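Both structures can be derived from one backward shortest-path computation per keyword. The Python sketch below is an assumption-laden illustration, not the BLINKS implementation: the function names are hypothetical, and each list entry keeps only (distance, node), whereas the entries in the slides' example carry additional fields.

```python
import heapq


def backward_distances(graph, sources):
    """Multi-source Dijkstra on reversed edges: distance from each node to its nearest source."""
    rev = {}
    for u, outs in graph.items():
        for v, w in outs:
            rev.setdefault(v, []).append((u, w))
    dist = {s: 0 for s in sources}
    heap = [(0, s) for s in sources]
    heapq.heapify(heap)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in rev.get(u, ()):
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist


def build_single_level_index(graph, labels):
    """Return (LKN, MNK): keyword-node lists and the node-keyword distance map."""
    keywords = set().union(*labels.values())
    lkn, mnk = {}, {}
    for w in keywords:
        dist = backward_distances(graph, [n for n, ws in labels.items() if w in ws])
        lkn[w] = sorted((d, n) for n, d in dist.items())  # ordered by distance to w
        for n, d in dist.items():
            mnk[(n, w)] = d                               # O(1) expected lookup
    return lkn, mnk


# Hypothetical weighted directed graph and node labels.
graph = {"r": [("a", 1), ("b", 4)], "a": [("b", 1)], "b": []}
labels = {"a": {"c"}, "b": {"d"}}

lkn, mnk = build_single_level_index(graph, labels)
print(lkn["d"])         # [(0, 'b'), (1, 'a'), (2, 'r')]
print(mnk[("r", "c")])  # 1
```

The index holds one entry per (node, keyword) pair where the node can reach the keyword, so its size grows with the number of nodes times the number of keywords.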

    [Figure: example graph with nodes v1–v7, sample entries of the keyword-node list LKN(w1), and sample entries of the node-keyword map for MNK(v2,w1) and MNK(v2,w2)]


    Search with Single-Level Index

    [Figure: example MNK lookups for (v2, w2) and (v4, w1)]

    • Search algorithm using the single-level index, applying our search strategies

      • Equi-distance expansion: use one cursor to traverse each LKN(wi)

      • Cost-balanced expansion: pick the cursor to expand next in a round-robin manner

      • Forward expansion: when visiting a node, look up its distances to other keywords by MNK

    • Efficiency

      • Managing exploration states by m cursors instead of m priority queues

      • Finding next node to visit is much faster from a cursor than from a queue

      • Forward expansion allows the search to converge on answers faster
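The cursor-based search can be sketched in Python as follows. This is a simplified illustration with a small hard-coded index: the function name, the index contents, and in particular the stopping rule are assumptions of this sketch rather than details given in the slides. The stopping rule relies on the lists being sorted: a root that no cursor has reached yet must lie at or beyond every cursor position, so its score is at least the sum of the current cursor distances.

```python
import math


def cursor_search(lkn, mnk, query, k):
    """Top-k roots using one cursor per keyword list and MNK lookups for forward jumps."""
    lists = [lkn.get(w, []) for w in query]
    pos = [0] * len(query)
    visited, answers = set(), []
    turn = 0
    while any(p < len(lst) for p, lst in zip(pos, lists)):
        i = turn % len(query)            # cost-balanced: round-robin over cursors
        turn += 1
        if pos[i] >= len(lists[i]):
            continue
        _, u = lists[i][pos[i]]          # equi-distance: next-closest node to keyword i
        pos[i] += 1
        if u not in visited:
            visited.add(u)
            dists = [mnk.get((u, w), math.inf) for w in query]  # forward expansion
            if all(d < math.inf for d in dists):
                answers.append((sum(dists), u))
                answers.sort()
        # Lower bound on the score of any root not yet visited by a cursor.
        bound = sum(lst[p][0] if p < len(lst) else math.inf for p, lst in zip(pos, lists))
        if math.isinf(bound) or (len(answers) >= k and answers[k - 1][0] <= bound):
            break
    return answers[:k]


# Hypothetical index contents for two keywords, c and d.
lkn = {"c": [(0, "a"), (1, "r")], "d": [(0, "b"), (1, "a"), (2, "r")]}
mnk = {("a", "c"): 0, ("r", "c"): 1, ("b", "d"): 0, ("a", "d"): 1, ("r", "d"): 2}

print(cursor_search(lkn, mnk, ["c", "d"], k=1))  # [(1, 'a')]
print(cursor_search(lkn, mnk, ["c", "d"], k=2))  # [(1, 'a'), (3, 'r')]
```

With k=1 the search stops after a single cursor step: visiting a from the c list and looking up its distance to d already yields an answer whose score matches the lower bound.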

    [Figure: worked example with the keyword-node lists LKN(w1) and LKN(w2), the node-keyword map MNK, partial answers rooted at v4 and v2, and the answers <v2,(0, 1)> and <v6,(1, 2)>]


    Bi-Level Indexing in BLINKS

    [Figure: example graph with nodes numbered 1–17]

    • Unfortunately, single-level index is impractical for large graphs

      • Space complexity: O(|V|K) where K is the number of keywords

    • BLINKS: Bi-Level Index for Keyword Search

      • Partition a data graph into multiple, say B, subgraphs, or blocks

        • Partitioning by nodes, called portals, which will play key roles in search

        • There are many partitioning algorithms, such as Breadth-first and METIS

      • (Top-level) block index: map keywords and portals to blocks

        • Purpose: Initiate backward expansion in relevant blocks; guide backward expansion across blocks (through portals)

        • Space complexity: O( |V| + BP ) where P is the total number of portals

      • (Low-level) intra-block index: stores information similar to that of a single-level index, but restricted to each block

        • Purpose: Help backward expansion and forward jumps within blocks

        • Space complexity: O( |V|K / B )
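A top-level block index can be illustrated with a few lines of Python. The sketch assumes a partitioning step has already assigned every edge to a block (that input, the node names, and the function name are hypothetical). Under that assumption, a node that touches edges from more than one block acts as a portal, and the index maps portals and keywords to the blocks they belong to.

```python
from collections import defaultdict


def build_block_index(edge_blocks, labels):
    """edge_blocks maps each edge (u, v) to a block id; labels maps nodes to keyword sets."""
    node_blocks = defaultdict(set)
    for (u, v), block in edge_blocks.items():
        node_blocks[u].add(block)
        node_blocks[v].add(block)
    # Portals: nodes shared by two or more blocks.
    portals = {n: blocks for n, blocks in node_blocks.items() if len(blocks) > 1}
    # Keyword-to-blocks map: where backward expansion for a keyword should start.
    keyword_blocks = defaultdict(set)
    for n, words in labels.items():
        for w in words:
            keyword_blocks[w] |= node_blocks.get(n, set())
    return portals, dict(keyword_blocks)


# Hypothetical partitioning of a five-node path into two blocks.
edge_blocks = {("v1", "v2"): 0, ("v2", "v3"): 0, ("v3", "v4"): 1, ("v4", "v5"): 1}
labels = {"v1": {"w1"}, "v3": {"w1"}, "v5": {"w2"}}

portals, keyword_blocks = build_block_index(edge_blocks, labels)
print(portals)         # {'v3': {0, 1}}
print(keyword_blocks)
```

Here v3 is the only portal, keyword w1 occurs in both blocks, and w2 occurs only in block 1, so a search for w2 would start its backward expansion in block 1 and cross into block 0 through v3.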



    Search with the Bi-Level Index

    • Similar to searching with single-level index in

      • Overall expansion policies (which keyword cluster/node to explore next)

      • Index access (scanning LKN lists and looking-up MNK hash map)

    • New challenges/complications introduced by graph partitioning

      • A single cursor for a keyword is no longer sufficient

        • Need to expand backward simultaneously in multiple blocks that contain the keyword

        • So we maintain a queue of cursors, one for each block we are currently exploring

      • Backward expansion needs to continue across block boundaries

        • When encountering boundaries, we retrieve new blocks to visit from the block index and add them to the queue

      • Distance information in the intra-block index ≠ global shortest distance

        • The path with the shortest distance may happen to go across blocks

        • Our exploration order guarantees correct global shortest distance

