1 / 20

DBXplorer: A System for Keyword-Based Search over Relational Databases.

DBXplorer: A System for Keyword-Based Search over Relational Databases. Sanjay Agarwal Surajit Chaudhuri Gautam Das. Presented By : SRUTHI GUNGIDI. Outline. Introduction Overview of DBExplorer Symbol Table Design – Publish Keyword Search Generalized Matches

ezhno
Download Presentation

DBXplorer: A System for Keyword-Based Search over Relational Databases.

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. DBXplorer: A System for Keyword-Based Search over Relational Databases. Sanjay Agarwal Surajit Chaudhuri Gautam Das Presented By : SRUTHI GUNGIDI

  2. Outline • Introduction • Overview of DBExplorer • Symbol Table Design – Publish • Keyword Search • Generalized Matches • Results and Conclusions CSE 6339: Data Exploration and Analysis in Relational Databases Fall 2010

  3. Introduction • Keyword-Based Search: Given a set of keywords, a ranked list of documents is returned. • Traditional Database systems required: - knowledge of the schema • Search over databases: Applying IR techniques from documents world to databases without the above requirements is a challenging task. Given a set of keywords, matching rows may need to be obtained by joining several tables on the fly. Example: Search for book on ‘Programming’ by ‘Ritchie’ Authors AuthorsBooks Books BooksStores Stores CSE 6339: Data Exploration and Analysis in Relational Databases Fall 2010

  4. DBExplorer Overview(1/2) • Given a set of query keywords, DBExplorer returns all matching rows(either from single table or by joining tables connected by foreign-key joins)such that each row contains all keywords. • Goal: Enabling the keyword search without necessarily requiring to know the schema of the respective databases. • Let’s see the case where there is a single Database. • The system also allows to search multiple databases simultaneously . CSE 6339: Data Exploration and Analysis in Relational Databases Fall 2010

  5. DBExplorer Overview(2/2) • Requires 2 steps: Publish(Preprocessing step): • Identifies the Database, along with the set of tables and columns within the database to be published. • Builds the symbol table (auxiliary tables). Search step: Given a query of keywords • Lookup symbol table to identify the tables, columns/cells containing keywords • Enumerate join trees • For each join tree, construct and execute SQL statement to select those rows that contain all keywords. Rank rows and return. CSE 6339: Data Exploration and Analysis in Relational Databases Fall 2010

  6. Symbol Table Designs • Traditional IR techniques use data structures such as Inverted lists -to efficiently identify documents containing a query keyword. • Symbol table is the key data structure used to store the information about keywords at different granularities. - Column level (Pub-Col): list of table. column - Cell level (Pub-Cell): list of table.column.rowid. -Row level : list of all rows that contain it.(have very little advantage) CSE 6339: Data Exploration and Analysis in Relational Databases Fall 2010

  7. Symbol Table Granularity • The Pub-Col symbol table is almost always better unless certain columns do not have indexes. • Space and time to build Pub-Col symbol tables is less since we only need to record the distinct values in a column. • Keyword search performance: depends on the efficient generation and execution of the SQL statements. Pub-Cell returns more number of SQL statements than Pub-Col. • Ease of symbol table Maintenance: Pub-Col table requires an update only if the insertions cause new values to be introduced in some column whereas Pub-Cell table need to be updated for every inserted row. - Similarly deletions. CSE 6339: Data Exploration and Analysis in Relational Databases Fall 2010

  8. Symbol Table Granularity • Hybrid symbol table: If an index is available for a column – publish with Pub-Col granularity otherwise with Pub-Cell granularity. CSE 6339: Data Exploration and Analysis in Relational Databases Fall 2010

  9. Storing Symbol Tables in Databases • Store Pub-Col symbol table as (keyword hash, ColId) • FK-Comp(Foreign key): If there is key-foreign key relationship between c1 and c2, store only c1 • CP-Comp: (not necessarily tied by fk relationships) • Partition H into a minimum number of bipartite cliques • Compress each clique. • Store Pub-Cell table as (keyword hash, list of CellIds). v2 v3 v4 c1 c2 x Uncompressed hash table ColumnsMap table Compressed hash table CSE 6339: Data Exploration and Analysis in Relational Databases Fall 2010

  10. Keyword Search(1/2) • Looks up symbol table to find tables/columns which contain at least one of the keywords. • Enumerate join trees View the schema graph G as an undirected graph and enumerate all possible join trees i.e., sub-trees of G such that • the leaves belong to the MatchedTables and • together, the leaves contain all keywords of the query Join trees Schema graph CSE 6339: Data Exploration and Analysis in Relational Databases Fall 2010

  11. Keyword Search(2/2) 3. Identify the matching rows • Each join tree is then mapped to a single SQL statement that joins the tables as specified in the tree, and selects those rows that contain all keywords • The retrieved rows are ranked before being output. CSE 6339: Data Exploration and Analysis in Relational Databases Fall 2010

  12. Generalized Matches(1/2) • Token Matches: The keyword in the query matches only a token or sub-string of an attribute value. (e.g., LIKE “%string%”) • Pub-Prefix method: • Some pre-computation is done but can perform token searches using B+ tree indexes. • Clause is of the form WHERE T.C LIKE ‘P%K%’ • During publishing of a database, for every keyword K, the entry (hash(K), T.C, P) is kept in the symbol table if there exists in column T.C which contains a token K and has prefix P. CSE 6339: Data Exploration and Analysis in Relational Databases Fall 2010

  13. Generalized Matches(2/2) • Example: • Let the hash values of the searchable tokens ‘string’, ‘ball’ and ‘round’ be 1, 2, and 3 respectively. • Consider searching keyword “string” • Pub-Prefix table returns “th” and “no” and subsequent SQL will contain WHERE (T.C LIKE ‘th%ball%’) OR (T.C LIKE ‘an%ball%’) Database table Pub-Prefix table CSE 6339: Data Exploration and Analysis in Relational Databases Fall 2010

  14. Results(1/2) Fig 2: Quality of Compression techniques • The search scales with the number of query keywords even when the databases have complex schema. • The compression achieved for FK-Comp is consistently less when CP-Comp. CSE 6339: Data Exploration and Analysis in Relational Databases Fall 2010

  15. Results & Conclusions(2/2) • As the prefix length increases the discriminating abilities increases, and in the limit the prefix method degenerates to Pub-Cell Fig 3: Search time Vs Prefix length • If a full-text index is available, use Pub-Col. • If only a traditional index is available and the column width is small, use Pub-Prefix, otherwise Pub-Cell. CSE 6339: Data Exploration and Analysis in Relational Databases Fall 2010

  16. GUI for DBXplorer • Keywords: {livia, karsen, computer} • Published Databases CSE 6339: Data Exploration and Analysis in Relational Databases Fall 2010

  17. GUI for DBXplorer • For each keyword, list of table and column pairs. • Enumerated Join trees. CSE 6339: Data Exploration and Analysis in Relational Databases Fall 2010

  18. GUI for DBXplorer Fig: Matching Rows CSE 6339: Data Exploration and Analysis in Relational Databases Fall 2010

  19. References [1] T. Feder, R. Motwani, Clique partitions, Graph Compression and Speeding-Up Algorithms, STOC, 1991. [2] S. Kapoor, H. Ramesh, Algorithms for Enumerating all Spanning Trees of Directed and Undirected Graphs, SIAM J. Computing, 1995. CSE 6339: Data Exploration and Analysis in Relational Databases Fall 2010

  20. Questions ? CSE 6339: Data Exploration and Analysis in Relational Databases Fall 2010

More Related