An Evaluation of Index Architectures for DB-IR Integration in an Open-Source IRMS, KRISTAL

Jinsuk Kim, Information System Development Team, KISTI, jinsuk@kisti.re.kr, 2007. 6. 1.

DB+IR vs. KRISTAL-IRMS • IRMS = Information Retrieval & Management System

Contents • The Great Divide in DB and IR • Approaches in DB-IR Integration • Strategies for Dynamic Index Maintenance • Direct Index Update • Stand-alone Auxiliary Index Strategy • Pulsing Auxiliary Index Strategy • Experimental Result • Conclusion • Discussion

The Great Divide in DB and IR • The Great Data Divide: structured data (database systems) vs. unstructured data (information retrieval systems) • The Great Query Divide: complex and structured queries (database systems) vs. ranked keyword search queries (information retrieval systems) • Source: Dr. Jayavel Shanmugasundaram, Cornell University, SIGMOD 2005

RDB vs. IR • [Figure: RDB, IR, and web search engines (?)] • Source: Dr. R. Baeza-Yates & Dr. M. Consens, VLDB 2004

RDB vs. IR vs. KRISTAL • Example request: Find presentations between the years 2005 and 2007 whose author is “Jinsuk” and whose abstract is about “pulsing auxiliary index”. • Example document: … <presentation date="1 June 2007"> <title>Index Maintenance in DB-IR Integration</title> <author>Jinsuk Kim</author> <abstract>Index maintenance strategies in DB-IR … stand-alone and pulsing auxiliary index architectures … </abstract> </presentation> … • KRISTAL query formula: (DATE: 2005 ~ 2007) AND (AUTHOR: Jinsuk) AND (ABSTRACT: pulsing auxiliary index)
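
The query formula on this slide combines a range condition on a structured field with keyword conditions on text fields. A minimal sketch of how such a formula could be evaluated over fielded records is given below. The `Presentation` record, the `contains_all` and `matches` helpers, and the reading of the ABSTRACT condition as "all terms present" are illustrative assumptions, not KRISTAL's actual query engine.

```python
from dataclasses import dataclass


@dataclass
class Presentation:
    # Hypothetical record layout mirroring the XML example on the slide.
    year: int
    author: str
    abstract: str


def contains_all(text: str, phrase: str) -> bool:
    """True if every word of `phrase` occurs in `text` (case-insensitive)."""
    words = set(text.lower().split())
    return all(w in words for w in phrase.lower().split())


def matches(doc, year_from, year_to, author, abstract_terms):
    """(DATE: year_from ~ year_to) AND (AUTHOR: ...) AND (ABSTRACT: ...)."""
    return (year_from <= doc.year <= year_to
            and contains_all(doc.author, author)
            and contains_all(doc.abstract, abstract_terms))


docs = [
    Presentation(2007, "Jinsuk Kim",
                 "stand-alone and pulsing auxiliary index architectures"),
    Presentation(2004, "Jinsuk Kim", "pulsing auxiliary index"),    # year out of range
    Presentation(2006, "Someone Else", "pulsing auxiliary index"),  # wrong author
]

hits = [d for d in docs
        if matches(d, 2005, 2007, "Jinsuk", "pulsing auxiliary index")]
```

Only the first record satisfies all three conditions. A real IRMS would answer each condition from an index rather than by scanning records.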

Strategies in DB-IR Integration (1/2) • DB-IR Middleware Approach • Glue existing DB and IR engines at the application level • DBMS for data management and IRS for text search facilities • Inevitable document-index gap • DB-IR loose coupling • Extend DBMS by SQL-level IR interface • Examples: Oracle ConText, DB2 Text Extender, QUIQ, TopX, MonetDB/X100 • DB-IR tight coupling • Extend IR facilities at the DBMS storage level (IR on DB) • Example: Odysseus ORDBMS • Extend DB management facilities at the IR storage level (DB on IR: IRMS) • Example: KRISTAL-IRMS • Novel architecture for DB-IR unification • Still under discussion • “The storage-level core system with RISC-style functionality in DB-IR integration” suggested by Chaudhuri et al.

Strategies in DB-IR Integration (2/2) • [Figure: systems positioned by DB features vs. IR features: DBMS; DB extenders (DB2, Cartridge); Odysseus (IR on DB tight coupling); IRMS (DB on IR tight coupling); DB-IR middleware approach; IRS; web search engines; web crawlers]

As a Text Management System • Sample input document: Author: Jinsuk Kim; Title: Evaluation of index maintenance in DB-IR integration; Keywords: pulsing auxiliary index, postings list • DB: fast input, rollback, crash recovery, slow retrieval • IR: fast input, no rollback, no crash recovery, fast retrieval • DB-IR integration: slow input, rollback, crash recovery, fast retrieval • How to solve this problem?

A Basic Problem in DB-IR Integration • Index Maintenance for Incoming Documents • A document usually contains hundreds of terms to be indexed, so an index update involves hundreds of disk accesses. This is an extremely time-consuming task. • Traditional IR systems accumulate the incoming postings lists from a block of new documents in in-memory structures. When no additional memory is available, the in-memory postings lists are merged into the on-disk main index. • However, the in-memory postings lists are volatile and can be lost under certain crash conditions. • For DB-IR integration, the index update for each document should guarantee document-index integrity, as a DB typically does. We call such a document-level transaction per-document transactional index maintenance.

How to Solve the Basic Problem? • Requirements (1) Updating the index for an incoming document should be fast. (How fast?) • Avoiding relocations of long postings lists is essential to speed up index maintenance tasks. (2) The task should be rolled back if an error occurs. (3) The result of the task should be consistent even after system crashes. • To cope with (1), direct the index update for incoming documents to a supplementary or auxiliary index storage area. • Updating the on-disk main index directly is time-consuming due to heavy disk accesses. • Instead, update the index in a smaller auxiliary storage area. • To cope with (2), transaction logs should be written to an on-disk area. • To cope with (3), the auxiliary index should be stored on disk, not in memory.
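
Requirements (2) and (3) can be illustrated with a toy per-document transaction. This is a minimal sketch under stated assumptions: the `AuxIndex` class, its JSON file layout, and the before-image log are hypothetical and are not KRISTAL's storage format. Logging a full before-image is wasteful and is chosen here only to keep the rollback logic short.

```python
import json
import os
import tempfile


class AuxIndex:
    """Toy on-disk auxiliary index (term -> doc ids) with an undo log."""

    def __init__(self, directory):
        self.index_path = os.path.join(directory, "aux_index.json")
        self.log_path = os.path.join(directory, "txn.log")
        if not os.path.exists(self.index_path):
            self._write(self.index_path, {})
        self.recover()  # undo any transaction interrupted by a crash

    def _write(self, path, obj):
        # Write to a temporary file, sync it, then rename atomically.
        tmp = path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(obj, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)

    def _read(self):
        with open(self.index_path) as f:
            return json.load(f)

    def recover(self):
        """A surviving log means the last update never committed: undo it."""
        if os.path.exists(self.log_path):
            with open(self.log_path) as f:
                before = json.load(f)
            self._write(self.index_path, before)
            os.remove(self.log_path)

    def add_document(self, doc_id, terms, fail=False):
        before = self._read()
        self._write(self.log_path, before)       # (2) on-disk transaction log
        try:
            after = {t: list(p) for t, p in before.items()}
            for term in set(terms):
                after.setdefault(term, []).append(doc_id)
            if fail:
                raise RuntimeError("simulated error during index update")
            self._write(self.index_path, after)  # (3) on-disk auxiliary index
        except Exception:
            self.recover()                       # roll back this document
            raise
        os.remove(self.log_path)                 # commit


workdir = tempfile.mkdtemp()
idx = AuxIndex(workdir)
idx.add_document(1, ["pulsing", "index"])
try:
    idx.add_document(2, ["index"], fail=True)    # rolled back
except RuntimeError:
    pass
state = AuxIndex(workdir)._read()                # reopen, as after a restart
```

After the failed second insertion, the reopened index contains only the postings of document 1. A complete system would also cover the document table in the same transaction so that documents and index stay in step.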

KRISTAL: Index Maintenance Strategies (1/2) • Direct Index Update (as baseline) • The postings list for each term in a new document is appended to the main index • Relocation of postings lists severely degrades performance • Stand-alone Auxiliary Index • Postings lists are updated in a small auxiliary on-disk index • Relocation size in the auxiliary structure is usually smaller than in the main index • As the auxiliary index grows, relocation size grows too. • Pulsing Auxiliary Index • As new documents arrive, any auxiliary postings list longer than a given threshold is moved to the main index by an in-place update; this keeps the auxiliary index size nearly constant as new documents are added • Every relocation in the auxiliary index is smaller than the given threshold • Relocations of long postings lists are dispersed among the insertions of new documents • Example: high-frequency terms such as ‘the’, ‘on’, and ‘of’ do not co-occur in exactly the same documents, so their postings lists do not all reach the threshold at the same time
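
The pulsing behaviour described on this slide can be sketched with in-memory dictionaries standing in for the on-disk structures. The `PulsingIndex` class, its `threshold` parameter, and the `flushes` counter are illustrative names, and the sketch ignores B+-trees, logging, and deletions.

```python
from collections import defaultdict


class PulsingIndex:
    """Toy model: short postings lists live in the auxiliary index and are
    pushed to the main index once they grow past `threshold`."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.main = defaultdict(list)  # stands in for the on-disk main index
        self.aux = defaultdict(list)   # stands in for the auxiliary index
        self.flushes = 0               # number of aux -> main updates

    def add_document(self, doc_id, terms):
        for term in set(terms):
            postings = self.aux[term]
            postings.append(doc_id)
            if len(postings) > self.threshold:
                # Pulse: append the auxiliary list to the main list and
                # drop it from the auxiliary index.
                self.main[term].extend(postings)
                del self.aux[term]
                self.flushes += 1

    def lookup(self, term):
        # A query has to consult both structures.
        return self.main.get(term, []) + self.aux.get(term, [])


idx = PulsingIndex(threshold=3)
for doc_id in range(1, 11):
    terms = ["the"] + (["kristal"] if doc_id == 5 else [])
    idx.add_document(doc_id, terms)
```

With a threshold of 3, the frequent term is flushed twice (after documents 4 and 8) while the rare term stays in the auxiliary index, so no auxiliary list ever holds more than three postings.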

KRISTAL: Index Maintenance Strategies (2/2) • [Figure: (A) Direct Index Update: document table and main index (B+-tree); (B) Stand-alone Auxiliary Index: document table, main index, and an auxiliary index with a delete list and an update list (B+-trees); (C) Pulsing Auxiliary Index: as in (B), with in-place updates from the auxiliary index to the main index]

Experimental Setting • Hardware • Dual Pentium CPUs (Clock Speed = 3GHz) • 8GB of RAM • RAID-5 SCSI HDD • Software • OS: RedHat Enterprise Linux 4 • Storage and Retrieval Engine: KRISTAL-IRMS • Test Data • Bibliographic texts • 10,000, 100,000, and 1,000,000 records for base data • Additional 10,000 documents for appending experiments • Query Evaluation • Three sets of single terms with varying document frequencies • Complex queries used in real bibliography service in KISTI

Experiment – A sample document
@DOCUMENT (1296)
#TITLE=Regression with Doubly Censored Current Status Data
#AUTHOR=Rabinowitz, Daniel ; Jewell, Nicholas P.
#JOURNAL=Journal of the Royal Statistical Society. Series B (Methodological)
#VOLUME=58
#NUMBER=3
#PAGE START=541
#PAGE END=550
#PUBDATE=20010324
#ABSTRACT=Data from settings in which an initiating event and a subsequent event occur in sequence are called doubly censored current status data if the time of neither event is observed directly, but instead it is determined at a random monitoring time whether either the initiating or subsequent event has yet occurred. This paper is concerned with using doubly censored current status data to estimate the regression coefficient in an accelerated failure time model for the length of time between the initiating event and the subsequent event. Motivated by a problem in the epidemiology of acquired immune deficiency syndrome, attention here is focused on a special case, the case in which the initiating event, given that it has occurred before the monitoring time, may be assumed to follow a uniform distribution. The main result is that the likelihood in the special case has the same structure as the likelihood in a simpler setting, the setting in which the time of the initiating event is known. The result allows methods developed for the simpler setting to be applied in the special case. The results of the application of the approach to real data are reported.
#KEYWORDS=Accelerated Failure Time ; Acquired Immune Deficiency Syndrome ; Current Status Data ; Double Censoring ; Survival Analysis
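
The sample record uses an `@DOCUMENT (id)` header followed by `#FIELD=value` entries. A minimal sketch of reading such a record into a dictionary is shown below. The `parse_record` function, the `MULTI_VALUED` set, and the rule that `;` separates multiple values are assumptions inferred from the sample, not a documented KRISTAL loader.

```python
import re

# Field marker such as "#TITLE=" or "#PAGE START=" (names may contain spaces).
FIELD = re.compile(r"#([A-Z][A-Z ]*)=")

# Assumption: these fields hold several values separated by ";".
MULTI_VALUED = {"AUTHOR", "KEYWORDS"}


def parse_record(text):
    """Parse one '@DOCUMENT (id) #FIELD=value ...' record into a dict."""
    record = {}
    header = re.match(r"\s*@DOCUMENT\s*\((\d+)\)", text)
    if header:
        record["id"] = int(header.group(1))
    # With one capture group, split() returns
    # [preamble, name1, value1, name2, value2, ...].
    parts = FIELD.split(text)
    for name, value in zip(parts[1::2], parts[2::2]):
        value = " ".join(value.split())  # normalise whitespace and newlines
        if name in MULTI_VALUED:
            record[name] = [v.strip() for v in value.split(";")]
        else:
            record[name] = value
    return record


sample = (
    "@DOCUMENT (1296) "
    "#TITLE=Regression with Doubly Censored Current Status Data "
    "#AUTHOR=Rabinowitz, Daniel ; Jewell, Nicholas P. "
    "#VOLUME=58 #PAGE START=541 #PAGE END=550"
)
record = parse_record(sample)
```

The sketch would misparse a value that itself contains a `#NAME=` pattern; a production loader would need an escaping rule for that case.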

Experiment – DB schema and Index Statistics

Experiment – Appending 10,000 Documents • 10,000 new documents are appended to each of three pre-built tables: 10K (10,000 docs), 100K (100,000 docs), and 1M (1,000,000 docs)

Experiment – 10K Table • 10K + 10K • Appending 10,000 new documents to a base table with 10,000 existing documents • Results • Direct update performs poorly • Stand-alone auxiliary index is better than direct update but worse than pulsing auxiliary index • Pulsing auxiliary strategy behaves consistently across all 10,000 insertions

Experiment – 100K Table • 100K + 10K • Appending 10,000 new documents to a base table with 100,000 existing documents • Results • Pulsing auxiliary index strategy is better than stand-alone auxiliary index. • However, the pulsing strategy shows many outlier points throughout the insertion

Experiment – 1M Table • 1M + 10K • Appending 10,000 new documents to a base table with 1,000,000 existing documents • Results • Pulsing auxiliary index strategy is better than stand-alone auxiliary index. • However, the pulsing strategy shows many large outlier points throughout the insertion • For larger base tables, pulsing may be inferior to the stand-alone strategy

Experiment – Overall Result • Average Processing Time per Document for 10,000 Insertions • Overall performance of the pulsing auxiliary strategy is superior to that of the stand-alone auxiliary index. • Stand-alone auxiliary index shows nearly constant performance since the main index and the auxiliary index are independent of each other. • Pulsing shows degraded performance as the size of the base table grows.

Experiment – Postings Access • Boolean-mode access for terms with varying DF (document frequency) ranges after adding 10,000 new documents to the 1M table • Pulsing auxiliary index shows performance comparable to that of the re-built table. • cf) Re-build = table built in bulk mode from all 1.01 million documents (1M base + 10,000 added)

Experiment – Query Evaluation(1/2) • Target tables • 10K, 100K, and 1M table after adding 10,000 new documents • Queries • 2994 subject queries used in KISTI bibliography database service • Examples: • yellow* /N8 (polyurethane* OR urethane*) • silicon AND (optic* /N8 signal*) AND module* • food* /N3 (wastewater* OR (waste /W1 water*)) AND treat* • ceramic* AND (bulletproof* OR (bullet /W1 proof*) OR (bullet /W1 resist*) OR (bullet* /N2 (protect* OR resist*))) • wood* /N5 (substitut* OR replacement*) • (catalyst* OR catalyzer*) /N5 (regenerat* OR ((precious OR valu* OR noble*) /N2 metal* /N5 recover*)) • Heavy truncations reflect B+-tree performance by exploiting leaf nodes of the tree • Within/Near operations reflect the performance of positional information
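
The truncation and proximity operators in these queries rely on term positions within a document. The sketch below shows one possible reading. The slides do not define the operators, so the semantics used here are assumptions: `term*` as prefix match, `/Nk` as "within k positions in either order", and `/Wk` as "second term follows the first within k positions". The function names are hypothetical.

```python
def positions(tokens, pattern):
    """Positions of a term; a trailing '*' is treated as prefix truncation."""
    if pattern.endswith("*"):
        prefix = pattern[:-1]
        return [i for i, tok in enumerate(tokens) if tok.startswith(prefix)]
    return [i for i, tok in enumerate(tokens) if tok == pattern]


def near(tokens, a, b, k):
    """Assumed /Nk: a and b occur within k positions, in either order."""
    return any(i != j and abs(i - j) <= k
               for i in positions(tokens, a)
               for j in positions(tokens, b))


def within(tokens, a, b, k):
    """Assumed /Wk: b occurs after a, at most k positions later."""
    return any(0 < j - i <= k
               for i in positions(tokens, a)
               for j in positions(tokens, b))


tokens = "treatment of food waste water in plants".split()

ordered_hit = within(tokens, "waste", "water*", 1)   # "waste water"
reversed_hit = within(tokens, "water*", "waste", 1)  # wrong order
near_hit = near(tokens, "food*", "water*", 3)        # two positions apart
far_miss = near(tokens, "treat*", "plants", 3)       # six positions apart
```

In an index, the per-term position lists would come from the postings rather than from scanning the token list, which is why these operators exercise the positional information stored with each posting.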

Experiment – Query Evaluation(2/2) • Average query performance for complex queries is best on the re-built table • However, the performance of the pulsing auxiliary index is only 18% worse than that of the re-built table (for 1M, 120 ms to 147 ms), while the stand-alone auxiliary index is degraded by 44% (120 ms to 212 ms)
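
The percentages on this slide can be related to the millisecond figures in two ways. The sketch below computes both the increase relative to the re-built baseline and the increase relative to the slower measurement. The `slowdown` helper is illustrative. The slide does not state which formula was used and gives only rounded timings, so the interpretation in the note that follows is an assumption.

```python
def slowdown(baseline_ms, measured_ms):
    """Return (increase relative to baseline, increase relative to measured)."""
    increase = measured_ms - baseline_ms
    return increase / baseline_ms, increase / measured_ms


# Rounded timings reported for the 1M table.
pulsing = slowdown(120, 147)
standalone = slowdown(120, 212)

for name, (vs_baseline, vs_measured) in [("pulsing", pulsing),
                                         ("stand-alone", standalone)]:
    print(f"{name}: +{vs_baseline:.1%} vs. baseline, "
          f"{vs_measured:.1%} of the measured time")
```

With these rounded inputs, the second ratio comes out near the 18% and 44% quoted on the slide, while the increase over the re-built baseline is larger (about 22.5% and 76.7%).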

Conclusion • Index Maintenance • Pulsing Auxiliary Index is superior to the Stand-alone Auxiliary Index strategy in index maintenance for newly arriving documents • cf) For larger base tables, pulsing may be inferior to the stand-alone auxiliary strategy • Query Evaluation • Query evaluation performance of the pulsing auxiliary index is comparable to that of the re-built table • Pulsing auxiliary index can be a candidate index architecture for DB-IR integration

Discussion (1/4) • Recent implementation of a new index maintenance strategy in KRISTAL • Postings segmentation (4.4 seconds to 1.5 seconds per document for the 1M table)

Discussion (2/4) • This approach is still inferior to IR’s block-of-documents approaches • IR: 0.1 seconds per document • KRISTAL: 1.5~4.4 seconds per document • Overallocation • Overallocation of postings lists in the auxiliary index may relieve the relocation problem • Index Compression • Compression of postings lists will reduce relocation sizes
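
The effect of overallocation on relocations can be illustrated with a simple counting model. This is a sketch under stated assumptions: a postings list is relocated whenever it outgrows its allocated capacity, and the `growth` factor is a hypothetical tuning parameter, not a KRISTAL setting.

```python
def count_relocations(n_postings, growth):
    """Count relocations while appending n postings one at a time.

    growth=1.0 models exact-fit allocation (capacity grows by one slot);
    growth=2.0 models overallocation by doubling the capacity.
    """
    capacity, size, relocations = 1, 0, 0
    for _ in range(n_postings):
        if size == capacity:
            # The list is full: allocate a larger area and move the list.
            capacity = max(capacity + 1, int(capacity * growth))
            relocations += 1
        size += 1
    return relocations


exact_fit = count_relocations(1000, growth=1.0)
doubling = count_relocations(1000, growth=2.0)
```

In this model, exact-fit allocation relocates the list on almost every append, while doubling relocates it only when the size crosses a power of two. The price is unused space, in this run 24 empty slots out of 1,024 allocated.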

Discussion (3/4) • KRISTAL toward DB-IR Integration from an IRMS viewpoint • Solved Problems (Intra-table operations) • Structured query evaluation • Structured data processing • XML repository • Dynamic index maintenance (?) • To be solved (Inter-table operations) • Table Join • View and Materialized View • Trigger • Query optimization (and SQL-like query language?)

Discussion (4/4) • KRISTAL toward Open-Source IRMS • Aiming at Open Source Initiative (OSI) compliance • Currently KRISTAL’s source is open for educational and research purposes • However, KRISTAL-IRMS is intended to reach OSI-level open source sooner or later • Work on supporting other languages in KRISTAL, such as Uzbek and Mongolian, is in progress at the open-source level • Download KRISTAL at http://www.kristalinfo.com

http://www.yeskisti.net http://www.kristalinfo.com Thank you
