INTEGRATING NETWORK STORAGE INTO INFORMATION RETRIEVAL APPLICATIONS

Svetlana Y. Mironova • The University of Tennessee, Knoxville • Spring 2003

Topics of Discussion • Motivation • General Text Parser (GTP) • Network Storage Stack • GTP with Network Storage • Implementation Challenges • Performance • Future Work

Motivation • The amount of text-based information stored on our computers and on the Web is growing rapidly. • Researchers and scientists need storage to run simulations and store their outputs. • Data mining and information retrieval professionals need a tool that can create an index from a document collection, store it on the network, and share it with others.

General Text Parser (GTP) • Two modules: GTP and GTPQUERY • Text/document parsing and indexing • Constructs sparse matrix data structures • Creates a vector-space model in which documents and queries are vectors in a low-dimensional subspace • Term-by-document matrix defines relationships between documents and distinct terms • Underlying model is Latent Semantic Indexing (LSI)
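
As a rough illustration of the term-by-document matrix described above, the following minimal Python sketch counts whitespace-separated terms and keeps only the nonzero entries. It is not GTP's parser: the function name `term_by_document`, the lowercase-and-split tokenization, and the dictionary layout are all assumptions made for this example.

```python
from collections import Counter


def term_by_document(docs):
    """Build a sparse term-by-document count matrix (illustrative only).

    Returns (terms, entries) where entries[(i, j)] is the number of times
    terms[i] occurs in docs[j]; zero counts are simply not stored.
    """
    # Hypothetical tokenization: lowercase, split on whitespace.
    counts = [Counter(doc.lower().split()) for doc in docs]
    terms = sorted(set().union(*counts))
    row = {term: i for i, term in enumerate(terms)}
    entries = {
        (row[term], j): n
        for j, doc_counts in enumerate(counts)
        for term, n in doc_counts.items()
    }
    return terms, entries


docs = [
    "network storage stack",
    "storage depots share storage",
    "query the index",
]
terms, entries = term_by_document(docs)
for (i, j), n in sorted(entries.items()):
    print(f"term={terms[i]!r} doc={j} count={n}")
```

Storing only the nonzero counts mirrors the sparse data structures mentioned on the slide; a real collection would also need stop-word filtering and term weighting.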

Versions of GTP • C++ (original) • Parallel C++ using MPI (for SVD computation) • Java (GUI recently developed) • C++ version runs on Solaris (Unix) and Linux • Parallel version runs only on Solaris • Java version runs on Solaris, Linux, and Mac OS X

GTP Process • Filter documents (optional) • Create database of keys, IDs and weights • Perform matrix decomposition (SVD) on the term-by-document matrix • Clean up • Write out summary
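
The decomposition step can be written out as a truncated SVD. The notation below is a standard formulation chosen for this note, not taken from the slides: $A$ is the $m \times n$ term-by-document matrix, $k$ is the number of factors kept, and $\sigma_i$, $u_i$, $v_i$ are the singular triplets.

```latex
A = U \Sigma V^{T}, \qquad
A_k = U_k \Sigma_k V_k^{T} = \sum_{i=1}^{k} \sigma_i \, u_i v_i^{T},
\qquad \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_k > 0
```

Under this notation, the rows of $U_k$ serve as length-$k$ term vectors and the rows of $V_k$ as length-$k$ document vectors.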

Query Process • Filter queries (optional) • Parse first query • Generate query vector • Scale query vector by singular values (optional) • Perform cosine matching • Write results to file for this query • Repeat for more queries
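
The query steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not GTPQUERY's code: the factors `U`, `s`, and `V` are tiny hand-made values rather than a real SVD, the function names are hypothetical, and dividing by the singular values is one common LSI convention for the optional scaling step (the slides do not say which convention GTP uses).

```python
import math


def project_query(q, U, s, scale=True):
    """Project a term-space query q onto k factors: q^T U, optionally divided by s."""
    k = len(s)
    q_hat = [sum(q[i] * U[i][f] for i in range(len(q))) for f in range(k)]
    if scale:
        # Assumed convention: divide each coordinate by its singular value.
        q_hat = [q_hat[f] / s[f] for f in range(k)]
    return q_hat


def cosine(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_documents(q, U, s, V, scale=True):
    """Return (score, document index) pairs, best match first."""
    q_hat = project_query(q, U, s, scale)
    scores = [(cosine(q_hat, doc_vec), j) for j, doc_vec in enumerate(V)]
    return sorted(scores, reverse=True)


# Illustrative factors: 3 terms, 3 documents, k = 2.
U = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]   # term vectors (rows)
s = [2.0, 1.0]                              # singular values
V = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]   # document vectors (rows)
query = [1, 0, 0]                           # query contains only term 0

for score, doc in rank_documents(query, U, s, V):
    print(f"doc {doc}: cosine {score:.3f}")
```

Repeating the process for more queries only repeats the projection and matching; the factors are computed once by GTP.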

Network Storage Stack • Framework for storing and transferring data over network • Modeled after Internet Protocol (IP) Stack • Designed to add storage resources to the Internet in a sharable and scalable manner


IBP • Internet Backplane Protocol • Foundation of Network Storage Stack • Share resources across networks • Use of local storage to create global storage service • Echoes advantages of IP: abstraction of datagram delivery, scalability, simple fault detection (discard faulty datagrams) • Temporary and “unreliable”

IBP Client Calls • Allocate • Store • Load • Copy • Mcopy • Manage

exNode • IBP capabilities are hard to manage by hand • exNode automates this • exNodes are pointers to IBP allocations • Allows network files with stronger properties (fault tolerance, longer duration, etc.) to be built from unreliable IBP allocations • Two major components: metadata and mappings
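
To make the "metadata and mappings" idea concrete, here is a minimal in-memory sketch in Python. It is not the exNode XML format or the exNode library API: the class names, field names, depot host names, and capability strings are all hypothetical placeholders invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Mapping:
    offset: int     # logical offset of this piece within the file
    length: int     # number of bytes in this piece
    depot: str      # hypothetical depot host name
    read_cap: str   # placeholder standing in for an IBP read capability


@dataclass
class ExNode:
    metadata: dict = field(default_factory=dict)
    mappings: list = field(default_factory=list)

    def replicas_for(self, position):
        """All mappings that hold the byte at the given logical position."""
        return [m for m in self.mappings
                if m.offset <= position < m.offset + m.length]

    def is_complete(self, size):
        """True if every byte in [0, size) is covered by at least one mapping."""
        covered = 0
        for m in sorted(self.mappings, key=lambda m: m.offset):
            if covered >= size:
                break
            if m.offset > covered:
                return False  # gap before this mapping starts
            covered = max(covered, m.offset + m.length)
        return covered >= size


node = ExNode(
    metadata={"filename": "output", "size": 100},
    mappings=[
        Mapping(0, 50, "depot-a.example", "cap-1"),
        Mapping(50, 50, "depot-b.example", "cap-2"),
        Mapping(0, 100, "depot-c.example", "cap-3"),
    ],
)
print([m.depot for m in node.replicas_for(60)])
print(node.is_complete(100))
```

Because byte 60 is held by two depots in this example, losing either one still leaves the file readable, which is the kind of fault tolerance the slide refers to.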

L-Bone • The Logistical Backbone • Resource discovery service • Maintains list of public depots and metadata about them • Uses Network Weather Service (NWS) to monitor throughput between depots • http://loci.cs.utk.edu/lbone

LoRS • Logistical Runtime System • Automate finding of IBP depots via L-Bone, creation and management of IBP capabilities and exNodes • C API and command line interface tool set

LoRS Functions • Upload • Download • Augment • Trim • Refresh • List

GTP with Network Storage • Creating an index is a dynamic process • Large document collection => large output files => lots of storage space required • Need to share the results with others (across the globe) • If not satisfied with the result, the stored files expire automatically • If happy with the collection, can either keep it on IBP longer or store it locally (burn to CD, etc.)

GTP and Upload • GTP parses the collection • GTP creates output files (keys and output) • Files are uploaded to remote network storage (IBP) • Upload accepts some optional information from the user • This information helps optimize performance • Capabilities are returned in the form of XML files (.xnd extension)

GTP and Upload (contd) • Location (Null, hostname, zip, state, city, country, airport) • Duration in days • Fragments • Copies
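
One way to picture how the "Fragments" and "Copies" settings interact is the Python sketch below. It rests on assumptions that the slides do not confirm: that each copy is split into the requested number of fragments, and that pieces are spread over depots in simple round-robin order. The function `plan_upload` and the depot names are hypothetical; this is not the LoRS upload algorithm.

```python
def plan_upload(size, fragments, copies, depots):
    """Return (copy, offset, length, depot) tuples covering [0, size) once per copy.

    Assumes each copy is cut into `fragments` nearly equal pieces and that
    pieces are assigned to depots round-robin (illustrative only).
    """
    base, extra = divmod(size, fragments)
    plan = []
    slot = 0
    for copy in range(copies):
        offset = 0
        for frag in range(fragments):
            # The first `extra` fragments absorb the remainder, one byte each.
            length = base + (1 if frag < extra else 0)
            plan.append((copy, offset, length, depots[slot % len(depots)]))
            offset += length
            slot += 1
    return plan


for piece in plan_upload(size=10, fragments=3, copies=2,
                         depots=["a", "b", "c", "d"]):
    print(piece)
```

In this sketch, more copies raise fault tolerance at the cost of upload time, while more fragments allow more depots to serve a later download in parallel.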

Download and GTPQUERY • The keys and output files are downloaded using information from the .xnd files • Download is multithreaded • Adaptive algorithm: takes into account throughput from each depot to the client • “Faster” depots provide more blocks of data
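
The effect that faster depots end up providing more blocks can be reproduced with a small deterministic simulation. The Python sketch below is not the LoRS download algorithm: it assumes every depot holds a replica of every block, that each depot serves one block at a time, and that the next block always goes to whichever depot becomes free first. The function name and throughput figures are invented for illustration.

```python
import heapq


def simulate_download(num_blocks, block_size, throughput):
    """Simulate greedy block assignment.

    throughput maps depot name -> bytes per second (assumed constant).
    Returns (blocks served per depot, time at which the last block arrives).
    """
    # Priority queue of (time the depot becomes free, depot name).
    ready = [(0.0, depot) for depot in sorted(throughput)]
    heapq.heapify(ready)
    served = {depot: 0 for depot in throughput}
    finish = 0.0
    for _ in range(num_blocks):
        free_at, depot = heapq.heappop(ready)
        done = free_at + block_size / throughput[depot]
        served[depot] += 1
        finish = max(finish, done)
        heapq.heappush(ready, (done, depot))
    return served, finish


served, finish = simulate_download(
    num_blocks=5, block_size=4, throughput={"fast": 4.0, "slow": 1.0}
)
print(served, finish)
```

A multithreaded client gets the same behavior without an explicit scheduler: each worker thread pulls the next block from a shared queue as soon as its previous transfer completes.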

Download + GTPQUERY • [Figure: representation of the binary file output for the 5K collection]

Implementation Challenges • GTP is written in Java, while the LoRS tools are written in C • Go through a server (first xnd_server, then lors_server) • Adapt to changes: both GTP and the LoRS tools are constantly evolving • Threading to optimize performance • User friendliness

Performance • All results were obtained using the Java version of GTP • Three subcollections of FBIS (Foreign Broadcast Information Service) were used to produce benchmarks • Server located in Tennessee • Upload/download to/from Tennessee (TN), California (CA), France (FR)

Run Specifications • By default, GTP uses 100 SVD factors, i.e., all term and document vectors are of length 100 • The weighting scheme used was log entropy • For the query, only the first 15 singular triplets were used • Three queries were used on each collection: “Yugoslavia Croatia Bosnia-Herzegovina”, “Russia embassy FIS”, “Nissan Motor”
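
For readers unfamiliar with log-entropy weighting, the Python sketch below applies one standard formulation to a small dense count matrix: a local weight of log2(1 + f_ij) multiplied by a global weight of 1 + (sum over j of p_ij log p_ij) / log n, where p_ij is f_ij divided by the term's total count and n is the number of documents. The choice of base 2 for the local weight and the function name are assumptions; the slides do not give GTP's exact formula.

```python
import math


def log_entropy(matrix):
    """Apply log-entropy weighting to a term-by-document count matrix.

    matrix[i][j] is the raw count of term i in document j.
    """
    n = len(matrix[0])  # number of documents
    weighted = []
    for row in matrix:
        total = sum(row)
        # Sum of p * log(p) over documents where the term occurs (always <= 0).
        entropy = sum((f / total) * math.log(f / total) for f in row if f > 0)
        # Global weight: 1 for a term concentrated in one document,
        # 0 for a term spread evenly over all documents.
        g = 1.0 + entropy / math.log(n) if n > 1 else 1.0
        weighted.append([g * math.log2(1 + f) for f in row])
    return weighted


counts = [
    [1, 1],  # term spread evenly: carries no discriminating information
    [2, 0],  # term concentrated in document 0
]
for row in log_entropy(counts):
    print([round(w, 3) for w in row])
```

The evenly spread term receives weight 0 in every document, while the concentrated term keeps its full local weight, which is the behavior that makes the scheme useful for retrieval.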

FBIS 5K • [Charts: GTP + Upload; Download + GTPQUERY]

Performance Analysis: GTP + Upload • GTP time is directly proportional to the collection size • Additional overhead for upload is not significant compared to the total time • Upload time depends on multiple factors: location, network bandwidth, time of day, size of file, number of copies requested, and status of depots at the time of the upload

Performance Analysis: Download + GTPQUERY • All “heavy-duty” preprocessing of the collection was done by GTP • Query process simply projects the query into the term-by-document vector space • Dimension of the vector space and number of factors used affect query time • Number of queries requested affects query time • Download takes up the greater portion of the total time • Download is affected by location of fragments and network conditions

Future Work • Optimize Java performance • Incorporate fully with GUI • Incorporate network storage into the other (C++, parallel) versions of GTP • Streaming data directly while it is generated? • Avoid local file generation • User friendliness
