
INTEGRATING NETWORK STORAGE INTO INFORMATION RETRIEVAL APPLICATIONS

Svetlana Y. Mironova, The University of Tennessee, Knoxville, Spring 2003


Presentation Transcript


  1. INTEGRATING NETWORK STORAGE INTO INFORMATION RETRIEVAL APPLICATIONS Svetlana Y. Mironova The University of Tennessee, Knoxville Spring 2003

  2. Topics of Discussion • Motivation • General Text Parser (GTP) • Network Storage Stack • GTP with Network Storage • Implementation Challenges • Performance • Future Work

  3. Motivation • The amount of text-based information stored on our computers and on the Web is growing rapidly. • Researchers and scientists need storage to run simulations and store outputs. • Data mining and information retrieval professionals need a tool capable of creating an index from a document collection, storing it on the network, and sharing it with others.

  4. General Text Parser (GTP) • Two modules: GTP and GTPQUERY • Text/document parsing and indexing • Construct sparse matrix data structures • Create vector-space model where documents and queries are vectors in low-dimensional subspace • Term-by-document matrix defines relationships between docs and distinct terms • Underlying model is Latent Semantic Indexing (LSI)
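The term-by-document matrix on this slide can be illustrated with a minimal sketch. It assumes whitespace tokenization and raw counts, which is far simpler than GTP's actual parser; the function name `term_by_document` and the sample documents are hypothetical.

```python
from collections import Counter

def term_by_document(docs):
    """Return (terms, matrix) with matrix[i][j] = count of terms[i] in docs[j]."""
    # One bag of words per document (assumption: lowercase, split on whitespace).
    counts = [Counter(doc.lower().split()) for doc in docs]
    # Distinct terms across the whole collection, in a stable order.
    terms = sorted(set().union(*counts))
    # Rows are terms, columns are documents.
    matrix = [[c[t] for c in counts] for t in terms]
    return terms, matrix

docs = ["network storage for text", "text parsing and indexing", "storage depots"]
terms, matrix = term_by_document(docs)
for term, row in zip(terms, matrix):
    print(f"{term:10s} {row}")
```

Most entries are zero even in this tiny example, which is why sparse matrix data structures are the natural storage format for real collections.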

  5. Versions of GTP • C++ (original) • Parallel C++ using MPI (for SVD computation) • Java (GUI recently developed) • Solaris (Unix), Linux in C++ • Parallel only on Solaris • Solaris, Linux, Mac OS X in Java

  6. GTP Process • Filter documents (optional) • Create database of keys, IDs and weights • Perform matrix decomposition (SVD) on the term-by-document matrix • Clean up • Write out summary
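In standard LSI notation (the symbols below are assumptions, not taken from the slides), the decomposition step keeps only the k largest singular triplets of the m × n term-by-document matrix A:

```latex
A \approx A_k = U_k \Sigma_k V_k^{T}, \qquad
U_k \in \mathbb{R}^{m \times k},\quad
\Sigma_k = \operatorname{diag}(\sigma_1, \dots, \sigma_k),\quad
V_k \in \mathbb{R}^{n \times k}
```

Rows of U_k then serve as term vectors and rows of V_k as document vectors, each of length k.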

  7. Query Process • Filter queries (optional) • Parse first query • Generate query vector • Scale query vector by singular values (optional) • Perform cosine matching • Write results to file for this query • Repeat for more queries
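The query steps above can be sketched in plain Python. This is an illustration under stated assumptions, not GTPQUERY's implementation: `U` holds k-dimensional term vectors, `s` the singular values, and `doc_vectors` the k-dimensional document vectors; the optional scaling is taken here to mean dividing by the singular values, which is one common LSI convention. The numbers in the usage example are made up for illustration.

```python
import math

def project_query(q, U, s=None):
    """Project a term-space query q into k dimensions: q_hat = q^T U.

    If s is given, each coordinate is divided by the matching singular
    value (assumed convention for the optional scaling step).
    """
    k = len(U[0])
    q_hat = [sum(q[i] * U[i][j] for i in range(len(q))) for j in range(k)]
    if s is not None:
        q_hat = [x / sj for x, sj in zip(q_hat, s)]
    return q_hat

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def rank_documents(q, U, doc_vectors, s=None):
    """Return (score, document index) pairs, best match first."""
    q_hat = project_query(q, U, s)
    scores = [(cosine(q_hat, d), j) for j, d in enumerate(doc_vectors)]
    return sorted(scores, reverse=True)

# Illustrative factors only: 3 terms, 2 factors, 3 documents.
U = [[0.6, 0.0], [0.8, 0.0], [0.0, 1.0]]
s = [2.0, 1.0]
doc_vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
for score, j in rank_documents([1, 0, 0], U, doc_vectors, s):
    print(f"doc {j}: {score:.3f}")
```

Because all the expensive work sits in the precomputed factors, each query costs only a few small dot products.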

  8. Network Storage Stack • Framework for storing and transferring data over network • Modeled after Internet Protocol (IP) Stack • Designed to add storage resources to the Internet in a sharable and scalable manner

  9. Network Storage Stack

  10. IBP • Internet Backplane Protocol • Foundation of Network Storage Stack • Share resources across networks • Use of local storage to create global storage service • Echoes advantages of IP: abstraction of datagram delivery, scalability, simple fault detection (discard faulty datagrams) • Temporary and “unreliable”

  11. IBP Client Calls • Allocate • Store • Load • Copy • Mcopy • Manage
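To make the allocate/store/load pattern concrete, here is a toy in-memory model. It is not the IBP client library: the class `ToyDepot`, its method signatures, and the token format are all hypothetical, and the caller passes the current time explicitly so that expiry is easy to see. It only illustrates the idea of time-limited allocations accessed through capabilities.

```python
import secrets

class ToyDepot:
    """Hypothetical in-memory stand-in for a storage depot (not real IBP)."""

    def __init__(self):
        self._allocs = {}  # allocation id -> {"data", "size", "expires"}
        self._caps = {}    # capability token -> (kind, allocation id)

    def allocate(self, size, duration, now):
        """Reserve `size` bytes for `duration` time units; return capabilities."""
        alloc_id = len(self._allocs)
        self._allocs[alloc_id] = {"data": bytearray(), "size": size,
                                  "expires": now + duration}
        caps = {}
        for kind in ("read", "write", "manage"):
            token = secrets.token_hex(8)
            self._caps[token] = (kind, alloc_id)
            caps[kind] = token
        return caps

    def _resolve(self, token, kind, now):
        cap_kind, alloc_id = self._caps[token]
        if cap_kind != kind:
            raise PermissionError(f"{cap_kind} capability used for {kind}")
        alloc = self._allocs[alloc_id]
        if now >= alloc["expires"]:
            raise TimeoutError("allocation expired")  # storage is temporary
        return alloc

    def store(self, write_cap, data, now):
        """Append data to the allocation (simplification: append-only)."""
        alloc = self._resolve(write_cap, "write", now)
        if len(alloc["data"]) + len(data) > alloc["size"]:
            raise ValueError("allocation full")
        alloc["data"] += data

    def load(self, read_cap, offset, length, now):
        alloc = self._resolve(read_cap, "read", now)
        return bytes(alloc["data"][offset:offset + length])

depot = ToyDepot()
caps = depot.allocate(size=16, duration=10, now=0)
depot.store(caps["write"], b"keys", now=1)
print(depot.load(caps["read"], 0, 4, now=2))
```

Handing out separate read, write, and manage tokens lets the owner share read access without giving away the ability to modify or extend the allocation.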

  12. exNode • IBP capabilities are hard to manage by hand • The exNode automates this • exNodes are pointers to IBP allocations • They allow network files with stronger properties (fault tolerance, longer duration, etc.) to be built from unreliable IBP allocations • Two major components: metadata and mappings
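The two components named on this slide, metadata and mappings, can be sketched as a small data structure. The field names (`offset`, `length`, `depot`, `read_cap`) and the method `is_recoverable` are assumptions chosen for illustration and do not reproduce the actual exNode XML schema.

```python
from dataclasses import dataclass, field

@dataclass
class Mapping:
    offset: int    # logical offset of this fragment within the file
    length: int    # number of bytes the fragment covers
    depot: str     # hypothetical depot identifier
    read_cap: str  # placeholder for an IBP read capability

@dataclass
class ExNode:
    metadata: dict = field(default_factory=dict)
    mappings: list = field(default_factory=list)

    def replicas_for(self, offset):
        """All mappings (copies) that contain the given byte offset."""
        return [m for m in self.mappings
                if m.offset <= offset < m.offset + m.length]

    def is_recoverable(self, size, live_depots):
        """True if every byte in [0, size) is held by at least one live depot."""
        extents = sorted((m.offset, m.offset + m.length)
                         for m in self.mappings if m.depot in live_depots)
        covered = 0
        for start, end in extents:
            if start > covered:   # gap that no live depot can fill
                return False
            covered = max(covered, end)
        return covered >= size

# A 100-byte file stored as 2 fragments with 2 copies each, spread over 3 depots.
xnd = ExNode(
    metadata={"filename": "output", "size": 100},
    mappings=[
        Mapping(0, 50, "A", "cap-1"), Mapping(50, 50, "B", "cap-2"),
        Mapping(0, 50, "B", "cap-3"), Mapping(50, 50, "C", "cap-4"),
    ],
)
print(xnd.is_recoverable(100, {"A", "C"}))
```

Overlapping mappings on different depots are what turn individually unreliable allocations into a file that survives the loss of a depot.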

  13. L-Bone • The Logistical Backbone • Resource discovery service • Maintains list of public depots and metadata about them • Uses Network Weather Service (NWS) to monitor throughput between depots • http://loci.cs.utk.edu/lbone

  14. LoRS • Logistical Runtime System • Automate finding of IBP depots via L-Bone, creation and management of IBP capabilities and exNodes • C API and command line interface tool set

  15. LoRS Functions • Upload • Download • Augment • Trim • Refresh • List

  16. GTP with Network Storage • Creating an index is a dynamic process • Large document collection => large output files => require lots of storage space • Need to share produced results with others (across the globe) • If not satisfied with result – stored files will go away automatically • If happy with collection – can either store on IBP longer or store locally (burn on CD, etc)

  17. GTP and Upload • GTP parses the collection • GTP creates the output files (keys and output) • Files are uploaded to remote network storage (IBP) • Upload accepts some optional information from the user • This information helps optimize performance • Capabilities are returned in the form of XML files (.xnd extension)

  18. GTP and Upload (contd) • Location (Null, hostname, zip, state, city, country, airport) • Duration in days • Fragments • Copies

  19. Download and GTPQUERY • The keys and output files are downloaded using information from the .xnd files • Download is multithreaded • Adaptive algorithm: takes into account throughput to the client • “Faster” depots provide more blocks of data
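One simple way to get the "faster depots provide more blocks" behavior is a shared work queue: each depot thread takes the next block as soon as it finishes the previous one. The sketch below is not the LoRS algorithm; it simulates depots with `time.sleep`, assumes every depot holds a full copy of the file, and uses hypothetical names throughout.

```python
import queue
import threading
import time

def download(blocks, depots):
    """Fetch all blocks using one thread per depot.

    blocks: list of bytes objects standing in for remote data.
    depots: {name: simulated seconds per block}.
    Returns (reassembled bytes, {name: number of blocks served}).
    """
    todo = queue.Queue()
    for i in range(len(blocks)):
        todo.put(i)
    result = [None] * len(blocks)
    served = {name: 0 for name in depots}

    def worker(name, delay):
        while True:
            try:
                i = todo.get_nowait()   # claim the next unfetched block
            except queue.Empty:
                return
            time.sleep(delay)           # simulated transfer time
            result[i] = blocks[i]       # each index is written by one thread only
            served[name] += 1           # each thread updates only its own key

    threads = [threading.Thread(target=worker, args=(n, d))
               for n, d in depots.items()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return b"".join(result), served

data, served = download([bytes([i]) for i in range(40)],
                        {"fast": 0.001, "slow": 0.02})
print(served)
```

No throughput measurement is needed in this design: a slow depot simply returns to the queue less often, so the load balances itself.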

  20. Download + GTPQUERY • [Figure: representation of the binary file output for the 5K collection]

  21. Implementation Challenges • GTP is written in Java, while the LoRS tools are written in C • Calls go through a server (first xnd_server, then lors_server) • Adapting to change: both GTP and the LoRS tools are constantly evolving • Threading to optimize performance • User friendliness

  22. Performance • All results were achieved using the Java version of GTP • Three subcollections of FBIS (Foreign Broadcast Information Service) were used to produce benchmarks • Server located in Tennessee • Upload/download to/from Tennessee (TN), California (CA), France (FR)

  23. Run Specifications • By default, GTP uses 100 SVD factors, i.e., all term and document vectors are of length 100 • The weighting scheme used was log entropy • For the query only the first 15 singular triplets were used • Three queries were used on each collection: “Yugoslavia Croatia Bosnia-Herzegovina”, “Russia embassy FIS”, “Nissan Motor”
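For readers unfamiliar with log-entropy weighting, here is a sketch of the textbook form: a local weight log2(1 + f_ij) multiplied by a global weight 1 + Σ_j p_ij log2(p_ij) / log2(n), where p_ij = f_ij / Σ_j f_ij and n is the number of documents. GTP's exact variant may differ, and the function name `log_entropy` is hypothetical.

```python
import math

def log_entropy(matrix):
    """Apply log-entropy weighting to a term-by-document count matrix.

    matrix[i][j] is the raw count of term i in document j.
    """
    n = len(matrix[0])  # number of documents
    weighted = []
    for row in matrix:
        gf = sum(row)   # global frequency of this term
        entropy = 0.0
        if n > 1 and gf > 0:
            for f in row:
                if f > 0:
                    p = f / gf
                    entropy += p * math.log2(p)
            entropy /= math.log2(n)
        g = 1.0 + entropy  # 1 for a term in a single doc, 0 if spread evenly
        weighted.append([g * math.log2(1 + f) for f in row])
    return weighted

counts = [
    [2, 0, 0, 0],  # concentrated in one document: keeps full weight
    [1, 1, 1, 1],  # spread evenly over all documents: weight drops to zero
]
for row in log_entropy(counts):
    print([round(x, 3) for x in row])
```

The global factor down-weights terms that appear uniformly across the collection, since such terms do little to distinguish one document from another.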

  24. FBIS 5K • [Charts: GTP + Upload; Download + GTPQUERY]

  25. FBIS 10K • [Charts: GTP + Upload; Download + GTPQUERY]

  26. FBIS 20K • [Charts: GTP + Upload; Download + GTPQUERY]

  27. Performance Analysis: GTP + Upload • GTP time is directly proportional to the collection size • Additional overhead for upload is not significant compared to the total time • Upload time depends on multiple factors: location, network bandwidth, time of day, size of file, number of copies requested, and status of depots at the time of the upload

  28. Performance Analysis: Download + GTPQUERY • All “heavy-duty” preprocessing of the collection was done by GTP • Query process simply projects the query into the term-by-document vector space • Dimension of the vector space and number of factors used affect query time • Number of queries requested affects query time • Download takes up the greater portion of the total time • Download is affected by location of fragments and network conditions

  29. Future Work • Optimize Java performance • Incorporate fully with GUI • Incorporate network storage into the other (C++, parallel) versions of GTP • Streaming data directly while it is generated? • Avoid local file generation • User friendliness
