DBXplorer: A System for Keyword-Based Search over Relational Databases.

DBXplorer: A System for Keyword-Based Search over Relational Databases. Sanjay Agarwal Surajit Chaudhuri Gautam Das Presented By : SRUTHI GUNGIDI

Outline • Introduction • Overview of DBExplorer • Symbol Table Design – Publish • Keyword Search • Generalized Matches • Results and Conclusions CSE 6339: Data Exploration and Analysis in Relational Databases Fall 2010

Introduction • Keyword-Based Search: Given a set of keywords, a ranked list of documents is returned. • Traditional Database systems required: - knowledge of the schema • Search over databases: Applying IR techniques from documents world to databases without the above requirements is a challenging task. Given a set of keywords, matching rows may need to be obtained by joining several tables on the fly. Example: Search for book on ‘Programming’ by ‘Ritchie’ Authors AuthorsBooks Books BooksStores Stores CSE 6339: Data Exploration and Analysis in Relational Databases Fall 2010

DBExplorer Overview(1/2) • Given a set of query keywords, DBExplorer returns all matching rows(either from single table or by joining tables connected by foreign-key joins)such that each row contains all keywords. • Goal: Enabling the keyword search without necessarily requiring to know the schema of the respective databases. • Let’s see the case where there is a single Database. • The system also allows to search multiple databases simultaneously . CSE 6339: Data Exploration and Analysis in Relational Databases Fall 2010

DBExplorer Overview(2/2) • Requires 2 steps: Publish(Preprocessing step): • Identifies the Database, along with the set of tables and columns within the database to be published. • Builds the symbol table (auxiliary tables). Search step: Given a query of keywords • Lookup symbol table to identify the tables, columns/cells containing keywords • Enumerate join trees • For each join tree, construct and execute SQL statement to select those rows that contain all keywords. Rank rows and return. CSE 6339: Data Exploration and Analysis in Relational Databases Fall 2010

Symbol Table Designs • Traditional IR techniques use data structures such as Inverted lists -to efficiently identify documents containing a query keyword. • Symbol table is the key data structure used to store the information about keywords at different granularities. - Column level (Pub-Col): list of table. column - Cell level (Pub-Cell): list of table.column.rowid. -Row level : list of all rows that contain it.(have very little advantage) CSE 6339: Data Exploration and Analysis in Relational Databases Fall 2010

Symbol Table Granularity • The Pub-Col symbol table is almost always better unless certain columns do not have indexes. • Space and time to build Pub-Col symbol tables is less since we only need to record the distinct values in a column. • Keyword search performance: depends on the efficient generation and execution of the SQL statements. Pub-Cell returns more number of SQL statements than Pub-Col. • Ease of symbol table Maintenance: Pub-Col table requires an update only if the insertions cause new values to be introduced in some column whereas Pub-Cell table need to be updated for every inserted row. - Similarly deletions. CSE 6339: Data Exploration and Analysis in Relational Databases Fall 2010

Symbol Table Granularity • Hybrid symbol table: If an index is available for a column – publish with Pub-Col granularity otherwise with Pub-Cell granularity. CSE 6339: Data Exploration and Analysis in Relational Databases Fall 2010

Storing Symbol Tables in Databases • Store Pub-Col symbol table as (keyword hash, ColId) • FK-Comp(Foreign key): If there is key-foreign key relationship between c1 and c2, store only c1 • CP-Comp: (not necessarily tied by fk relationships) • Partition H into a minimum number of bipartite cliques • Compress each clique. • Store Pub-Cell table as (keyword hash, list of CellIds). v2 v3 v4 c1 c2 x Uncompressed hash table ColumnsMap table Compressed hash table CSE 6339: Data Exploration and Analysis in Relational Databases Fall 2010

Keyword Search(1/2) • Looks up symbol table to find tables/columns which contain at least one of the keywords. • Enumerate join trees View the schema graph G as an undirected graph and enumerate all possible join trees i.e., sub-trees of G such that • the leaves belong to the MatchedTables and • together, the leaves contain all keywords of the query Join trees Schema graph CSE 6339: Data Exploration and Analysis in Relational Databases Fall 2010

Keyword Search(2/2) 3. Identify the matching rows • Each join tree is then mapped to a single SQL statement that joins the tables as specified in the tree, and selects those rows that contain all keywords • The retrieved rows are ranked before being output. CSE 6339: Data Exploration and Analysis in Relational Databases Fall 2010

Generalized Matches(1/2) • Token Matches: The keyword in the query matches only a token or sub-string of an attribute value. (e.g., LIKE “%string%”) • Pub-Prefix method: • Some pre-computation is done but can perform token searches using B+ tree indexes. • Clause is of the form WHERE T.C LIKE ‘P%K%’ • During publishing of a database, for every keyword K, the entry (hash(K), T.C, P) is kept in the symbol table if there exists in column T.C which contains a token K and has prefix P. CSE 6339: Data Exploration and Analysis in Relational Databases Fall 2010

Generalized Matches(2/2) • Example: • Let the hash values of the searchable tokens ‘string’, ‘ball’ and ‘round’ be 1, 2, and 3 respectively. • Consider searching keyword “string” • Pub-Prefix table returns “th” and “no” and subsequent SQL will contain WHERE (T.C LIKE ‘th%ball%’) OR (T.C LIKE ‘an%ball%’) Database table Pub-Prefix table CSE 6339: Data Exploration and Analysis in Relational Databases Fall 2010

Results(1/2) Fig 2: Quality of Compression techniques • The search scales with the number of query keywords even when the databases have complex schema. • The compression achieved for FK-Comp is consistently less when CP-Comp. CSE 6339: Data Exploration and Analysis in Relational Databases Fall 2010

Results & Conclusions(2/2) • As the prefix length increases the discriminating abilities increases, and in the limit the prefix method degenerates to Pub-Cell Fig 3: Search time Vs Prefix length • If a full-text index is available, use Pub-Col. • If only a traditional index is available and the column width is small, use Pub-Prefix, otherwise Pub-Cell. CSE 6339: Data Exploration and Analysis in Relational Databases Fall 2010

GUI for DBXplorer • Keywords: {livia, karsen, computer} • Published Databases CSE 6339: Data Exploration and Analysis in Relational Databases Fall 2010

GUI for DBXplorer • For each keyword, list of table and column pairs. • Enumerated Join trees. CSE 6339: Data Exploration and Analysis in Relational Databases Fall 2010

GUI for DBXplorer Fig: Matching Rows CSE 6339: Data Exploration and Analysis in Relational Databases Fall 2010

References [1] T. Feder, R. Motwani, Clique partitions, Graph Compression and Speeding-Up Algorithms, STOC, 1991. [2] S. Kapoor, H. Ramesh, Algorithms for Enumerating all Spanning Trees of Directed and Undirected Graphs, SIAM J. Computing, 1995. CSE 6339: Data Exploration and Analysis in Relational Databases Fall 2010

Questions ? CSE 6339: Data Exploration and Analysis in Relational Databases Fall 2010

DBXplorer: A System for Keyword-Based Search over Relational Databases.