1 / 8

Optimizing Full-Text Search for HathiTrust: A Technical Approach

This document outlines the strategies for optimizing full-text search within the HathiTrust project at the University of Michigan. Launched in October 2008, the project includes 48 member institutions and millions of volumes. We address the challenge of query response times by experimenting with index shards, server memory configurations, and disk media. Our findings emphasize the critical role of memory in improving performance and explore technologies such as SSD for enhanced query response. Continuous optimization and synchronization processes ensure system reliability.

iden
Download Presentation

Optimizing Full-Text Search for HathiTrust: A Technical Approach

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Scaling Full-text Search Cory Snavely Library IT Core Services manager University of Michigan September 2011

  2. HathiTrust project profile • Launched October 2008 • 48 29 member institutions and growing • primarily Google-scanned materials but also other sources (e.g. Internet Archive, and increasingly content digitized by partners themselves) • 9.6 6.7 million volumes, 350 pages average • 430 250 terabytes in two US instances

  3. Full-text search overview Determine, through experimentation, the optimal… • …number and size of index “shards”. • …amount of server memory for IO cache and JVM. Evaluate whether rotational media can provide, given the above, … • …acceptable query throughput. • …a manageable update process, assuming continuous availability at two sites.

  4. Query response on rotational media

  5. Cache is king! More memory improves response time linearly with index size, so… • …choose a shard size that provides acceptable response time, and divide up the index evenly. • …buy as much memory as possible! Current configuration: 5TB index, 10 shards, ~3 shards and 72GB RAM per server.

  6. Snapshots make daily releases a snap • Index new materials throughout the day. • When queue is empty, optimize, check, and quiesce. • 3am: take snapshot of index and begin synchronizing from Michigan to Indiana. • 6am: check that synchronization is complete – it should be - and release simultaneously in both sites or calmly notify staff. • Rinse and repeat.

  7. Boosting performance with SSD Given that… • …99thpercentile of query response is > 1 second, • …rotational media has nothing more to give, and • …some parts of the index are repeatedly read, we are looking into a DRAM- or SSD-based NFS cache as potentially the simplest way to reduce response time of edge cases without perverting our simple and elegant indexing and release workflow.

  8. me: Cory Snavely csnavely@umich.edu primary IR research: Tom Burton-West tburtonw@umich.edu http://www.hathitrust.org/large_scale_search

More Related