Loading in 5 sec....

Keyword Searching in Relational Databases Esha Palta (05329017) Kumar Gaurav Bijay (02005013) PowerPoint Presentation

Keyword Searching in Relational Databases Esha Palta (05329017) Kumar Gaurav Bijay (02005013)

- By
**kana** - Follow User

- 90 Views
- Uploaded on

Download Presentation
## PowerPoint Slideshow about 'Keyword Searching in Relational Databases Esha Palta (05329017) Kumar Gaurav Bijay (02005013)' - kana

**An Image/Link below is provided (as is) to download presentation**

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript

### Appendix

Motivation

- Keyword search
- We have SQL, why keyword-querying?
- SQL - not appropriate for naive users
- So many online databases (imdb, citeseer, bseindia …) – user cannot keep track of schema for all of these

Simple Approaches

- Using Form interfaces
- Require separate form for each type of query – confusing
- Not suitable for ad-hoc queries – how many forms will you provide?

- How about Google?
- Export data from db to documents and do keyword- querying on these
- Suffers from duplication overheads
- Google wants all keywords in one document. DB is often normalized, so need to join tables and store as documents
- Multiple combinations of tables to join. Not scalable …

Differences from Web Search

- Related data split across multiple tuples due to normalization
- Different keywords may match tuples from different relations
- What joins are to be computed can only be decided on the fly
- Need to find result containing all keywords and rank them somehow

Writes

(AuthorId,

PaperId)

Paper

(PaperId,

PaperName)

Author

(AuthorId,

AuthorName)

Cites

(Citing,

Cited)

The DBLP Bibliography Schema

Systems for DB search

- BANKS (Browsing and Keyword Search)
– IITB (ICDE ’02)

- DBXplorer – Microsoft Research (ICDE ’02)
- ObjectRank – IBM, UCSD, FIU (VLDB ’04)
- Bidirectional BANKS – IITB (VLDB ’05)

Systems for DB search

- BANKS (Browsing and Keyword Search)
– IITB (ICDE ’02)

- DBXplorer – Microsoft Research (ICDE ’02)
- ObjectRank – IBM, UCSD, FIU (VLDB ’04)
- Bidirectional BANKS – IITB (VLDB ’05)
will cover in depth

The BANKS system

- BANKS Architecture
- Available on the web
- http://www.cse.iitb.ac.in/banks

- Connects to database using JDBC
- JDBC metadata features used to provide schema browsing

- Preprocesses db

User

BANKS

JDBC

HTTP

Web-server

Database

MO:MultiQuery Optimizn

paper

Charuta:BANKS01

writes

Charuta

S. Sudarshan

Prasan Roy

author

Basic Model- Database: modeled as a graph
- Nodes = tuples
- Edges = references between tuples
- foreign key (assume for this talk), inclusion dependencies, ..
- Edges are directed.

PaperId:PaperName

AuthorID:PaperId

AuthorId

DBLP example

The BANKS Answer Model

- Query: set of search terms {t1, t2, .., tn}
- For each search term ti we find set of nodes Si matching ti
- Eg: Query = Sudarshan Roy (t1= Sudarshan, t2 = Roy)

- Answer: rooted, directed tree connecting nodes matching keywords
- Root node has special significance, may be restricted to some relations
- E.g. relations representing entities, not relationships

- May include intermediate nodes not in any Si (Steiner Tree)

- Root node has special significance, may be restricted to some relations
- Multiple answers
- Ranking based on proximity + prestige

Prasan Roy

S. Sudarshan

Answer ExampleQuery: sudarshanroy

Paper

- We would like to find sets of (closely) connected tuples that match all given keywords

Writes

Writes

Author

Author

Edge Directionality

- Directed tree will miss desired answers. For eg:
Query = DBXplorerObjectRank

- So, for each forward edge, BANKS adds a back edge

CitedBy

Cited

Cites

BANKS

DBXPlorer

ObjectRank

Cites

Cited

CitedBy

BANKS

Cites

DBXPlorer

ObjectRank

Cites

Edge Directionality

- What if we ignore directionality?
- Some popular tuples are connected to many other tuples
- E.g. Students -> departments -> university

- Problem: A popular tuple would create misleading shortcuts between tuples
- E.g. every student would be closely linked with every other student via the department/university

- Solution: define different forward and backward edge weights
- Forward edges: In the direction of the foreign key reference

1

3

1

3

1

Edge Weight- Weight of forward edge based on schema
- e.g. citation link weights > “writes” link weights

- Weight of backward edge = indegree of edges pointing to the node

Edge Weight Scaling

- Normalize edge score Escore(e)
- Make edge weight scale-free by dividing edge weigth by wmin

- Problem: Some backward edges have unduly large weights
- Depress the scale by defining Escore(e) as log(1+w(e)/wmin )

- Overall Escore E = 1 / (1 + e Escore(e))

Node Weight

- Set weight of a Node = Indegree of the node
- As per prestige rankings nodes with multiple pointers to them get a higher prestige
- So, higher node weight corresponds to higher prestige

- Problem: Nodes with many in-edges result in skewed answers
- Subdue extreme node weights by using log(1+indegree)

- Node score Nscore = Average of node scores (root-node-weight + leaf-node-weights)

Combining Scores

- Combining two independent metrics: node weight and edge weight
- Normalize each to 0-1
- Combine using weighting factor
- Additive: (1- ) Escore + Nscore
- Multiplicative: Escore * Nscore

- Performance study to compare alternatives and to find reasonable values for

First Step – Symbol Table

- The first step is to build a symbol table
- This table is in the db and is not normalized
- Example:

Searching for Best Answers

- Backward Expanding Search Algorithm:
- Assume: graph fits in memory
- Idea: find vertices from which a forward path exists to at least one node from each Si.
- Run concurrent single source shortest path algorithm from each node matching a keyword
- Create an iterator for each node matching a keyword
- Traverse the graph edges in reverse direction

- Output a node whenever it is on the intersection of the sets of nodes reached from each keyword

- Create an iterator for each node matching a keyword

- Answer trees may not be generated in relevance order

paper

MultiQuery Optimization

writes

S. Sudarshan

Prasan Roy

authors

Backward Expanding SearchIterators

BANKS Query Result Example

- Result of “Sudarshan Roy”

Result Ordering

- Answers need not be always in Relevance order

This tree is output

Better Root Missed

2

2

2

5

2

1

Result Ordering (contd…)

- Solution:
- Generate all connection trees and then sort them
- Increases computation costs and leads to a greatly increased time to generate initial results

- Create a small heap ordered on the relevance of the trees
- Output highest ranked tree from heap to user when heap is full

- Generate all connection trees and then sort them
- What about duplicate results?
- Maintain a list of generated results for duplicate detection
- Discard result according to relevance

Experience and Performance

- BANKS provides keyword search coupled with extensive browsing facilities
- Schema browsing + data browsing
- Graphical display of data

- Implemented using Java + servlets
- Keyword search response times typically 1 to 3 seconds on
- DBLP database with 100,000 tuples/300,000 edges
- P3 600 MHz, 512 MB RAM

Anecdotes

- “Mohan”
- Returns C. Mohan at top based on prestige (number of papers written)

- “Transaction”
- Returns Jim Gray’s classic paper and textbook as top answers based on prestige (number of citations)

- “Sunita Seltzer”
- No common papers, but both have papers with Stonebraker: system finds this connection

Effect of Parameters

- Log scaling of edge weights worked well
- (1- ) E + N versus E Nmade little difference
- Best with = .2 (subdue node weights but not entirely)

Motivation

- BANKS performs poorly if
- Keyword matches lot of nodes (so lot of Dijkstra sources)
- Search hits a node with large fan – in.

…

Wastes time

Sudarshan

Roy

New Ideas – Forward Search

- Why only backward, lets search forward too :

How about fwd

Searching ?

…

Sudarshan

Roy

New Ideas - Activation

- Activation :- Cannot forward search from each node.
- Spread activation from keyword nodes to others.
- Activation is like Page Rank with decay.
High Activation close to many keywords.

Activation Spreading

- Spreading Activation
- Node with highest activation explored first

- Activation spread to neighbors (μ = 0.3)
- Gives low activation to neighbors of hubs

Modifications to Model

- Graph model stays the same.
- BANKS is concerned with search more than how to tune parameters or define node – weights / edge – weights.
- BANKS code :
Tree Node – Score, N =

Tree Edge – Score, E =

Total Score = ENl (l = 0.2)

The New Algorithm

- Need two priority queues :
Qin - do backward search from these nodes

Qout - do forward search from these nodes

- Each node, n keeps 3 variables per keyword, ti
- sp [i] : Node to got to from n for shortest-path to ti
- distance [i] : Length of the shortest-path from n to ti
- Activation [i] : Activation to n from keyword ‘ti’

The New Algorithm – continued…

- Set initial activation keyword nodes and add to Qin for backward-search.
- At each step, pick node with maximum activation
i.e. if (Qin.getMaxActivation > Qout. getMaxActivation))

// use node from Qin

else

// use node from Qout

- If node from Qin,do backward search and add itself toQout. (newly explored nodes into Qin)
- If node from Qout, do forward search
- If node has reached from all keyword, generate result-tree. [answer is buffered as results can be out of order]

Explanation with example

Qin

Qout

N100

N2

N3

…

N100

N1

Roy

Sudarshan

N4

N1

…

N3

N2

Roy

Sudarshan

Result Found !

Generation of top-k results

- If we know the score of next-best answer, all buffered answers with better score can be output.
- Need upper bounds

Computation of upper bound

- For each keyword ti, we have explored nodes upto some length – say li.
- So, next – best – score (approx.) =
- This is not a true upper bound, but works quite well and is simple !

Are we losing answers ?

- BANKS – I used many Dijkstra states, BANKS – II uses 2 only – forward and backward search-states.
- The result is that we can now lose answers !

K1

K2

Ny

Nx

K2

K1

Answer Loss ExampleNy

K1

Nx

K2

K1

This is the generated answer.

This answer is lost.

- But, we will generate this tree rooted at Nx:
- So, a rotated tree with same nodes but different root is often generated !

NY

K2

NX

K1

Metrics of Performance

- Manually obtain best relevant answers.
- Determine 2 times :
- Time taken to produce last relevant answer.
- Time taken to output last relevant answer.

- Search algorithms
- MI-Bkwd: original backward search
- Iterator for every node matching a keyword

- SI-Bkwd: backward search with single backward iterator
- Bidirec: bidirectional search

- MI-Bkwd: original backward search
- Datasets
- DBLP, IMDB ~ 2 million nodes, 9 million edges
- US Patent DB ~ 4 million nodes, 15 million edges

A Critique

- BANKS needs a lot of memory.
- Need to cluster and keep parts of graph on disk.
- Work is in progress

DBXplorer : (Microsoft Research)

- Use symbol – table to determine which tables to join.
- Generate all possible table – join combinations :
Figure :

T1, T2, T3, T4 and T5 are tables

Cool ideas in DBXplorer

- Symbol table need not be at tuple level. If column has an index, column – level symbol table is ok.
- Table Compression :
- e.g. : KeywordsColumnsKeywordsColumns

K1

K1

C1

C1

K2

K2

X

C2

K3

C2

K3

K4

K4

K5

K5

Intermediate Column

ObjectRank (IBM, FIU, UCSD)

- Creates objects in database. Object definition is manual.
e.g. in DBLP, author, conference and paper can be defined as objects.

- Heavily inspired by PageRank.
- Each node is given global ObjectRank just like PageRank of Google.

ObjectRank Ideas

- Keyword-level ObjectRank : for each keyword, precompute and save object ranks of nodes [can optimize by defining cut-off)
- Score of node, n w.r.t. keyword k :
scorek(n) = f (Global-object-rank (n), Objectrankk (n))

- At run time, scores are combined :
scorek1,k2,…,km(n) = scorek1(n) * scorek2(n) * …* scorekm(n)

ObjectRank Algorithm and answers

- If graph is DAG or near DAG, topologically sort and spread ObjectRank in this order.
- Answers are single objects and not Cluster / group as in BANKS.
- Demo at :
http://teriyaki.ucsd.edu:9099/objrank/main05_new.html

Conclusion

- Studied BANKS, both versions.
- Covered cool ideas from DBXplorer and ObjectRank.
- Graph of BANKS must be made disk-resident.

References

- Gaurav Bhalotia, Arvind Hulgeri, Charuta Nakhe, Soumen Chakrabarti, and S. Sudarshan.Keyword Searching and Browsing in Databases using BANKS.In International Conference on Data Engineering (ICDE), pages 1083–1096, 2002.
- Varun Kacholia, Shashank Pandit, Soumen Chakrabarti, S. Sudarshan et. al.Bidirectional Expansion for Keyword Search on Graph Databases.In VLDB Conference, pages 505–516, 2005.
- Sanjay Agrawal, Surajit Chaudhari, and Gautam Das.DBXplorer: A System for Keyword-Based Search over Relational Databases.In International Conference on Data Engineering (ICDE), pages 5–22, 2002.
- Andrey Balmin, Vagelis Hristidis, and Yannis Papakonstantinou.ObjectRank: Authority-Based Keyword Search in Databases.In VLDB Conference, pages 564–575, 2004.

Extra slides

Browsing - May add??????

- Hyperlinks are there for all primary key foreign key attributes
- Each table is displayed with set of tools for interacting with data
- Projection (using drop), Selection, Join, Group-by, Sort

- Template facilities to do a variety of tasks
- Browsing data by grouping and creating crosstabs
- e.g., theses grouped by department and year

- Hierarchical views of data
- Nested XML style, even on relational data

- Graphical displays
- Bar charts, pie charts, etc

- Browsing data by grouping and creating crosstabs
- Templates are generic and can be applied on any data matching assumed schema
- Can be applied after applying selections
- New templates can be created by user, interactively

Related Work

- DataSpot (DTL)/Mercado Intuifind [VLDB 98]
- Based on patent by Palmon (filed 1995, granted 1998)
- Based on hypergraph model, similar answer model to ours
- Differences: our model of backward link weights and prestige

- Proximity Search [VLDB98]
- Different model of proximity based on adding up support
- No edge weights, prestige, different evaluation algorithm

- Information units (linked Web pages) [WWW10]
- No directionality, only studied in Web context

- Microsoft DBExplorer (this conference)
- No ranking, based on SQL generation
- Addresses efficient construction of text indexes

- Microsoft English query

Extensions

- Summarization of output
- group the output tuples into sets that have same tree structure
- define the notion of similarity between two result trees
- perform restricted search

- Metadata queries (attribute:keyword queries)
- For example: author:levy
- match all the tuples of a relation
- costly

- Forward searching approach

Proposed Conclusions and Future Work

BANKS is an integrated browsing and keyword querying system for relational databases

Future work:

- Keyword queries on XML
- Disambiguating queries by selecting
- Nodes: G.W.Bush: “Bush Jr” or “Bush Sr”
- Tree structure: “coauthors” or “cites”

- Boolean queries
- Metadata queries
- Summarization of output

Download Presentation

Connecting to Server..