O|B|F Flatfile Indexing

Andrew Dalke, Dalke Scientific Software, LLC. O|B|F Flatfile Indexing. One of the Biohackathon projects.

Presentation Transcript


  1. O|B|F Flatfile Indexing
     Andrew Dalke, Dalke Scientific Software, LLC
     One of the Biohackathon projects

  2. Use case
     Sally, a bioinformatics researcher, needs fast access to many different records from GenBank. She is in a small group with little experience in database management systems, so she wants a simple system that doesn't involve a client/server model. She also wants the different tools she has (written for the different Bio* projects) to be able to access the system, so she doesn't need to continuously extract data with one tool for use by another.

  3. Background
     - Have a set of large data files
     - Each contains many records
     - Records have identifiers: id, accession, gid, entry name, etc.
     - Want to retrieve a record given an identifier
     - Don't want to set up a database server
     - Make an indexer

  4. Indexer
     - Nothing new here
     - "Everyone" has written one
     - Spec out a standard and use it

  5. "Schema" (diagram)
     - Primary identifier maps to (filename, start byte, length); the filename is actually normalized to a fileid
     - Primary identifiers are linked by many (*) relationships to secondary identifiers
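The schema can be pictured as two lookup tables. The following is only a rough sketch in Python, not the format defined by the spec: the names `Location`, `primary`, `secondary`, and `resolve` are made up for illustration, and the location given for `P00012` is a hypothetical value.

```python
from typing import NamedTuple


class Location(NamedTuple):
    """Where a record lives: which data file, and which byte range in it."""
    fileid: int
    start: int
    length: int


# Primary identifier -> location of the record (illustrative values).
primary = {
    "P00012": Location(fileid=1, start=0, length=100),      # hypothetical
    "P12345": Location(fileid=1, start=10000, length=100),
}

# Secondary identifier -> primary identifiers (one secondary may map to many).
secondary = {
    "GI22222": ["P00012", "P12345"],
    "GI22223": ["P86753"],
}


def resolve(identifier, primary, secondary):
    """Return every known record location for a primary or secondary id."""
    if identifier in primary:
        return [primary[identifier]]
    return [primary[p] for p in secondary.get(identifier, []) if p in primary]
```

A secondary identifier is resolved in two steps (secondary to primary, then primary to location), so only the primary table needs to know about files and byte offsets.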

  6. Index as flat-file
     config.dat:
         index \t flat/1
         fileid_1 \t /path/to/here
         fileid_2 \t /path/to/there
         ....
     key_ID.key:
         P12345 \t 1 \t 10000 \t 100
         ....
     id_ACC.index:
         GI22222 \t P00012
         GI22222 \t P12345
         GI22223 \t P86753
         ....
     The .key and .index files are fixed width and sorted. Allows fast binary searches.
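Because the records are fixed width and sorted, a lookup can seek directly to the middle record and halve the search range each time. The sketch below assumes space-padded, newline-terminated records with tab-separated fields (id, fileid, start, length) and takes the record width as a parameter; the padding rule, the function names, and the way the width is obtained are assumptions for illustration, not details taken from the spec.

```python
import os


def write_key_file(path, entries):
    """Write (id, fileid, start, length) rows as sorted, fixed-width records.

    Returns the record width in bytes (assumed layout, see lead-in).
    """
    rows = ["\t".join(str(field) for field in entry)
            for entry in sorted(entries, key=lambda e: e[0])]
    width = max(len(row) for row in rows) + 1  # +1 for the newline
    with open(path, "wb") as out:
        for row in rows:
            out.write(row.ljust(width - 1).encode("ascii") + b"\n")
    return width


def lookup(path, width, key):
    """Binary-search the key file; return (fileid, start, length) or None."""
    with open(path, "rb") as f:
        lo, hi = 0, os.path.getsize(path) // width
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(mid * width)
            fields = f.read(width).decode("ascii").rstrip().split("\t")
            if fields[0] == key:
                return int(fields[1]), int(fields[2]), int(fields[3])
            if fields[0] < key:
                lo = mid + 1
            else:
                hi = mid
    return None


def fetch(data_path, start, length):
    """Read one record from the data file using its byte offset and length."""
    with open(data_path, "rb") as f:
        f.seek(start)
        return f.read(length)
```

Each probe reads a single record, so a file with a million keys needs about twenty seeks, and the data file itself is only opened once the location is known.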

  7. Index in BerkeleyDB
     - Use BDB tables for the key/value information
     - Faster
     - More scalable
     - Easier to edit, modify
     - More space efficient
     - But it has an external dependency
     - Client code can determine the format automatically
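One plausible way for client code to determine the format is to read the `index` line of config.dat before opening anything else. The sketch below assumes that approach. The slides only show the value `flat/1`, so the BerkeleyDB check (any value starting with "berkeleydb", case-insensitive) is a guess, and `read_config` and `choose_backend` are hypothetical names.

```python
def read_config(path):
    """Parse config.dat into (index_format, {fileid_key: path})."""
    index_format = None
    files = {}
    # UTF-8 is an assumption; the slides list non-ASCII filenames as a TODO.
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            key, _, value = line.partition("\t")
            if key == "index":
                index_format = value
            elif key.startswith("fileid_"):
                files[key] = value
    return index_format, files


def choose_backend(index_format):
    """Map the declared index format to a backend name."""
    if index_format == "flat/1":
        return "flat"
    # Assumption: the BerkeleyDB format string is not shown on the slides.
    if index_format is not None and index_format.lower().startswith("berkeleydb"):
        return "bdb"
    raise ValueError(f"unrecognized index format: {index_format!r}")
```

With the format read up front, the caller can use the same lookup interface whichever index type is on disk.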

  8. Bio* support
     - Biopython - Andrew Dalke
     - Bioperl - Michele Clamp & Lincoln Stein
     - BioJava - Matthew Pocock
     - BioRuby - Toshiaki Katayama (starting)
     - BioC - Steve Searle
     And they really do interoperate!

  9. TODO
     - Still tweaking the spec
     - How to handle format non-ASCII filenames / internationalization
     - Need a cross-platform regression test suite
