Keyword searching in relational databases esha palta 05329017 kumar gaurav bijay 02005013
Download
1 / 62

Keyword Searching in Relational Databases Esha Palta 05329017 Kumar Gaurav Bijay 02005013 - PowerPoint PPT Presentation


  • 89 Views
  • Uploaded on

Keyword Searching in Relational Databases Esha Palta (05329017) Kumar Gaurav Bijay (02005013) . Dilbert Strip . Motivation. Keyword search We have SQL, why keyword-querying? SQL - not appropriate for naive users

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Keyword Searching in Relational Databases Esha Palta 05329017 Kumar Gaurav Bijay 02005013' - kana


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Keyword searching in relational databases esha palta 05329017 kumar gaurav bijay 02005013

Keyword Searching in Relational Databases

Esha Palta (05329017)

Kumar Gaurav Bijay (02005013)



Motivation
Motivation

  • Keyword search

  • We have SQL, why keyword-querying?

    • SQL - not appropriate for naive users

    • So many online databases (imdb, citeseer, bseindia …) – user cannot keep track of schema for all of these


Simple approaches
Simple Approaches

  • Using Form interfaces

    • Require separate form for each type of query – confusing

    • Not suitable for ad-hoc queries – how many forms will you provide?

  • How about Google?

    • Export data from db to documents and do keyword- querying on these

    • Suffers from duplication overheads

    • Google wants all keywords in one document. DB is often normalized, so need to join tables and store as documents

    • Multiple combinations of tables to join. Not scalable …


Differences from web search
Differences from Web Search

  • Related data split across multiple tuples due to normalization

  • Different keywords may match tuples from different relations

    • What joins are to be computed can only be decided on the fly

    • Need to find result containing all keywords and rank them somehow

Writes

(AuthorId,

PaperId)

Paper

(PaperId,

PaperName)

Author

(AuthorId,

AuthorName)

Cites

(Citing,

Cited)

The DBLP Bibliography Schema


Systems for db search
Systems for DB search

  • BANKS (Browsing and Keyword Search)

    – IITB (ICDE ’02)

  • DBXplorer – Microsoft Research (ICDE ’02)

  • ObjectRank – IBM, UCSD, FIU (VLDB ’04)

  • Bidirectional BANKS – IITB (VLDB ’05)


Systems for db search1
Systems for DB search

  • BANKS (Browsing and Keyword Search)

    – IITB (ICDE ’02)

  • DBXplorer – Microsoft Research (ICDE ’02)

  • ObjectRank – IBM, UCSD, FIU (VLDB ’04)

  • Bidirectional BANKS – IITB (VLDB ’05)

    will cover in depth



The banks system
The BANKS system

  • BANKS Architecture

  • Available on the web

    • http://www.cse.iitb.ac.in/banks

  • Connects to database using JDBC

    • JDBC metadata features used to provide schema browsing

  • Preprocesses db

User

BANKS

JDBC

HTTP

Web-server

Database


Basic model

BANKS01:Keyword Search

MO:MultiQuery Optimizn

paper

Charuta:BANKS01

writes

Charuta

S. Sudarshan

Prasan Roy

author

Basic Model

  • Database: modeled as a graph

    • Nodes = tuples

    • Edges = references between tuples

      • foreign key (assume for this talk), inclusion dependencies, ..

      • Edges are directed.

PaperId:PaperName

AuthorID:PaperId

AuthorId

DBLP example


The banks answer model
The BANKS Answer Model

  • Query: set of search terms {t1, t2, .., tn}

    • For each search term ti we find set of nodes Si matching ti

    • Eg: Query = Sudarshan Roy (t1= Sudarshan, t2 = Roy)

  • Answer: rooted, directed tree connecting nodes matching keywords

    • Root node has special significance, may be restricted to some relations

      • E.g. relations representing entities, not relationships

    • May include intermediate nodes not in any Si (Steiner Tree)

  • Multiple answers

    • Ranking based on proximity + prestige


Answer example

MultiQuery Optimization

Prasan Roy

S. Sudarshan

Answer Example

Query: sudarshanroy

Paper

  • We would like to find sets of (closely) connected tuples that match all given keywords

Writes

Writes

Author

Author


Edge directionality
Edge Directionality

  • Directed tree will miss desired answers. For eg:

    Query = DBXplorerObjectRank

  • So, for each forward edge, BANKS adds a back edge

CitedBy

Cited

Cites

BANKS

DBXPlorer

ObjectRank

Cites

Cited

CitedBy

BANKS

Cites

DBXPlorer

ObjectRank

Cites


Edge directionality1
Edge Directionality

  • What if we ignore directionality?

    • Some popular tuples are connected to many other tuples

    • E.g. Students -> departments -> university

  • Problem: A popular tuple would create misleading shortcuts between tuples

    • E.g. every student would be closely linked with every other student via the department/university

  • Solution: define different forward and backward edge weights

    • Forward edges: In the direction of the foreign key reference


Edge weight

3

1

3

1

3

1

Edge Weight

  • Weight of forward edge based on schema

    • e.g. citation link weights > “writes” link weights

  • Weight of backward edge = indegree of edges pointing to the node


Edge weight scaling
Edge Weight Scaling

  • Normalize edge score Escore(e)

    • Make edge weight scale-free by dividing edge weigth by wmin

  • Problem: Some backward edges have unduly large weights

    • Depress the scale by defining Escore(e) as log(1+w(e)/wmin )

  • Overall Escore E = 1 / (1 + e Escore(e))


Node weight
Node Weight

  • Set weight of a Node = Indegree of the node

    • As per prestige rankings nodes with multiple pointers to them get a higher prestige

    • So, higher node weight corresponds to higher prestige

  • Problem: Nodes with many in-edges result in skewed answers

    • Subdue extreme node weights by using log(1+indegree)

  • Node score Nscore = Average of node scores (root-node-weight +  leaf-node-weights)


Combining scores
Combining Scores

  • Combining two independent metrics: node weight and edge weight

    • Normalize each to 0-1

    • Combine using weighting factor 

      • Additive: (1- ) Escore +  Nscore

      • Multiplicative: Escore * Nscore

  • Performance study to compare alternatives and to find reasonable values for 


First step symbol table
First Step – Symbol Table

  • The first step is to build a symbol table

  • This table is in the db and is not normalized

  • Example:


Searching for best answers
Searching for Best Answers

  • Backward Expanding Search Algorithm:

    • Assume: graph fits in memory

    • Idea: find vertices from which a forward path exists to at least one node from each Si.

    • Run concurrent single source shortest path algorithm from each node matching a keyword

      • Create an iterator for each node matching a keyword

        • Traverse the graph edges in reverse direction

      • Output a node whenever it is on the intersection of the sets of nodes reached from each keyword

  • Answer trees may not be generated in relevance order


Backward expanding search

Query: sudarshan roy

paper

MultiQuery Optimization

writes

S. Sudarshan

Prasan Roy

authors

Backward Expanding Search

Iterators


Banks query result example
BANKS Query Result Example

  • Result of “Sudarshan Roy”


Result ordering
Result Ordering

  • Answers need not be always in Relevance order

This tree is output

Better Root Missed

2

2

2

5

2

1


Result ordering contd
Result Ordering (contd…)

  • Solution:

    • Generate all connection trees and then sort them

      • Increases computation costs and leads to a greatly increased time to generate initial results

    • Create a small heap ordered on the relevance of the trees

      • Output highest ranked tree from heap to user when heap is full

  • What about duplicate results?

    • Maintain a list of generated results for duplicate detection

    • Discard result according to relevance


Experience and performance
Experience and Performance

  • BANKS provides keyword search coupled with extensive browsing facilities

    • Schema browsing + data browsing

    • Graphical display of data

  • Implemented using Java + servlets

  • Keyword search response times typically 1 to 3 seconds on

    • DBLP database with 100,000 tuples/300,000 edges

    • P3 600 MHz, 512 MB RAM


Anecdotes
Anecdotes

  • “Mohan”

    • Returns C. Mohan at top based on prestige (number of papers written)

  • “Transaction”

    • Returns Jim Gray’s classic paper and textbook as top answers based on prestige (number of citations)

  • “Sunita Seltzer”

    • No common papers, but both have papers with Stonebraker: system finds this connection


Effect of parameters

EdgeLog

Effect of Parameters

  • Log scaling of edge weights worked well

  • (1- ) E +  N versus E Nmade little difference

  • Best with  = .2 (subdue node weights but not entirely)



Motivation1
Motivation

  • BANKS performs poorly if

    • Keyword matches lot of nodes (so lot of Dijkstra sources)

    • Search hits a node with large fan – in.

Wastes time

Sudarshan

Roy


New ideas forward search
New Ideas – Forward Search

  • Why only backward, lets search forward too :

How about fwd

Searching ?

Sudarshan

Roy


New ideas activation
New Ideas - Activation

  • Activation :- Cannot forward search from each node.

    • Spread activation from keyword nodes to others.

    • Activation is like Page Rank with decay.

      High Activation  close to many keywords.


Activation spreading
Activation Spreading

  • Spreading Activation

    • Node with highest activation explored first

  • Activation spread to neighbors (μ = 0.3)

  • Gives low activation to neighbors of hubs


Modifications to model
Modifications to Model

  • Graph model stays the same.

  • BANKS is concerned with search more than how to tune parameters or define node – weights / edge – weights.

  • BANKS code :

    Tree Node – Score, N =

    Tree Edge – Score, E =

    Total Score = ENl (l = 0.2)


The new algorithm
The New Algorithm

  • Need two priority queues :

    Qin - do backward search from these nodes

    Qout - do forward search from these nodes

  • Each node, n keeps 3 variables per keyword, ti

    • sp [i] : Node to got to from n for shortest-path to ti

    • distance [i] : Length of the shortest-path from n to ti

    • Activation [i] : Activation to n from keyword ‘ti’


The new algorithm continued
The New Algorithm – continued…

  • Set initial activation keyword nodes and add to Qin for backward-search.

  • At each step, pick node with maximum activation

    i.e. if (Qin.getMaxActivation > Qout. getMaxActivation))

    // use node from Qin

    else

    // use node from Qout

  • If node from Qin,do backward search and add itself toQout. (newly explored nodes into Qin)

  • If node from Qout, do forward search

  • If node has reached from all keyword, generate result-tree. [answer is buffered as results can be out of order]


Explanation with example
Explanation with example

Qin

Qout

N100

N4

Roy

Sudarshan

N1

N3

N2

Roy

Sudarshan


Explanation with example1
Explanation with example

Qin

Qout

N100

Roy

Sudarshan

N4

N1

N2

N1

N3

N2

Roy

Sudarshan


Explanation with example2
Explanation with example

Qin

Qout

N100

N2

N3

N100

N1

Roy

Sudarshan

N4

N1

N3

N2

Roy

Sudarshan

Result Found !


Generation of top k results
Generation of top-k results

  • If we know the score of next-best answer, all buffered answers with better score can be output.

  • Need upper bounds


Computation of upper bound
Computation of upper bound

  • For each keyword ti, we have explored nodes upto some length – say li.

  • So, next – best – score (approx.) =

  • This is not a true upper bound, but works quite well and is simple !


Are we losing answers
Are we losing answers ?

  • BANKS – I used many Dijkstra states, BANKS – II uses 2 only – forward and backward search-states.

  • The result is that we can now lose answers !


Answer loss example

Ny

K1

K2

Ny

Nx

K2

K1

Answer Loss Example

Ny

K1

Nx

K2

K1

This is the generated answer.

This answer is lost.


NY

K2

NX

K1


Metrics of performance
Metrics of Performance

  • Manually obtain best relevant answers.

  • Determine 2 times :

    • Time taken to produce last relevant answer.

    • Time taken to output last relevant answer.

  • Search algorithms

    • MI-Bkwd: original backward search

      • Iterator for every node matching a keyword

    • SI-Bkwd: backward search with single backward iterator

    • Bidirec: bidirectional search

  • Datasets

    • DBLP, IMDB ~ 2 million nodes, 9 million edges

    • US Patent DB ~ 4 million nodes, 15 million edges


Graph i
Graph - I

  • MI-Bkwd versus SI-Bkwd

  • SI-Bkwd gain increases with origin size, # keywords


Graph ii
Graph - II

  • SI-Bkwd versus Bidirec

  • Bidirec gain increases with origin size, # keywords


A critique
A Critique

  • BANKS needs a lot of memory.

  • Need to cluster and keep parts of graph on disk.

  • Work is in progress 



Dbxplorer microsoft research
DBXplorer : (Microsoft Research)

  • Use symbol – table to determine which tables to join.

  • Generate all possible table – join combinations :

    Figure :

T1, T2, T3, T4 and T5 are tables


Cool ideas in dbxplorer
Cool ideas in DBXplorer

  • Symbol table need not be at tuple level. If column has an index, column – level symbol table is ok.

  • Table Compression :

  • e.g. : KeywordsColumnsKeywordsColumns

K1

K1

C1

C1

K2

K2

X

C2

K3

C2

K3

K4

K4

K5

K5

Intermediate Column



Objectrank ibm fiu ucsd
ObjectRank (IBM, FIU, UCSD)

  • Creates objects in database. Object definition is manual.

    e.g. in DBLP, author, conference and paper can be defined as objects.

  • Heavily inspired by PageRank.

  • Each node is given global ObjectRank just like PageRank of Google.


Objectrank ideas
ObjectRank Ideas

  • Keyword-level ObjectRank : for each keyword, precompute and save object ranks of nodes [can optimize by defining cut-off)

  • Score of node, n w.r.t. keyword k :

    scorek(n) = f (Global-object-rank (n), Objectrankk (n))

  • At run time, scores are combined :

    scorek1,k2,…,km(n) = scorek1(n) * scorek2(n) * …* scorekm(n)


Objectrank algorithm and answers
ObjectRank Algorithm and answers

  • If graph is DAG or near DAG, topologically sort and spread ObjectRank in this order.

  • Answers are single objects and not Cluster / group as in BANKS.

  • Demo at :

    http://teriyaki.ucsd.edu:9099/objrank/main05_new.html


Conclusion
Conclusion

  • Studied BANKS, both versions.

  • Covered cool ideas from DBXplorer and ObjectRank.

  • Graph of BANKS must be made disk-resident.


References
References

  • Gaurav Bhalotia, Arvind Hulgeri, Charuta Nakhe, Soumen Chakrabarti, and S. Sudarshan.Keyword Searching and Browsing in Databases using BANKS.In International Conference on Data Engineering (ICDE), pages 1083–1096, 2002.

  • Varun Kacholia, Shashank Pandit, Soumen Chakrabarti, S. Sudarshan et. al.Bidirectional Expansion for Keyword Search on Graph Databases.In VLDB Conference, pages 505–516, 2005.

  • Sanjay Agrawal, Surajit Chaudhari, and Gautam Das.DBXplorer: A System for Keyword-Based Search over Relational Databases.In International Conference on Data Engineering (ICDE), pages 5–22, 2002.

  • Andrey Balmin, Vagelis Hristidis, and Yannis Papakonstantinou.ObjectRank: Authority-Based Keyword Search in Databases.In VLDB Conference, pages 564–575, 2004.


Appendix

Appendix

Extra slides


Browsing may add
Browsing - May add??????

  • Hyperlinks are there for all primary key foreign key attributes

  • Each table is displayed with set of tools for interacting with data

    • Projection (using drop), Selection, Join, Group-by, Sort

  • Template facilities to do a variety of tasks

    • Browsing data by grouping and creating crosstabs

      • e.g., theses grouped by department and year

    • Hierarchical views of data

      • Nested XML style, even on relational data

    • Graphical displays

      • Bar charts, pie charts, etc

  • Templates are generic and can be applied on any data matching assumed schema

    • Can be applied after applying selections

    • New templates can be created by user, interactively



Related work
Related Work

  • DataSpot (DTL)/Mercado Intuifind [VLDB 98]

    • Based on patent by Palmon (filed 1995, granted 1998)

    • Based on hypergraph model, similar answer model to ours

    • Differences: our model of backward link weights and prestige

  • Proximity Search [VLDB98]

    • Different model of proximity based on adding up support

    • No edge weights, prestige, different evaluation algorithm

  • Information units (linked Web pages) [WWW10]

    • No directionality, only studied in Web context

  • Microsoft DBExplorer (this conference)

    • No ranking, based on SQL generation

    • Addresses efficient construction of text indexes

  • Microsoft English query


Extensions
Extensions

  • Summarization of output

    • group the output tuples into sets that have same tree structure

    • define the notion of similarity between two result trees

    • perform restricted search

  • Metadata queries (attribute:keyword queries)

    • For example: author:levy

    • match all the tuples of a relation

      • costly

    • Forward searching approach


Proposed conclusions and future work
Proposed Conclusions and Future Work

BANKS is an integrated browsing and keyword querying system for relational databases

Future work:

  • Keyword queries on XML

  • Disambiguating queries by selecting

    • Nodes: G.W.Bush: “Bush Jr” or “Bush Sr”

    • Tree structure: “coauthors” or “cites”

  • Boolean queries

  • Metadata queries

  • Summarization of output


ad