BLAST benchmarks
George Coulouris
NCBI/NLM/NIH
coulouri@ncbi.nlm.nih.gov
June 2005

Overview: Design a test suite that approximates NCBI BLAST's stable production workload of more than 150,000 searches per day, using representative databases and queries while preserving database size ratios, so that BLAST performance can be compared across hardware platforms and implementations.
Motivation and goal
• It’s hard to define what constitutes a “typical” search.
• NCBI BLAST processes over 150,000 searches per day.
• Large-scale characteristics of this workload are stable over time.
• Goal: design a test suite that approximates this workload.
Applications
• Evaluate the relative performance of BLAST running on different hardware.
• Evaluate the relative performance of different BLAST implementations.
Components
• Databases
• Queries
• Tasks
• Driver
Databases
• Protein "nr" and nucleotide "nt" account for >80% of all searches, making them a good choice for representative databases.
• Sequences are constantly added and removed; the databases are updated daily.
• This volatility, together with their large size, makes the production databases unsuitable for direct use in a benchmark.
Databases
• Solution: generate benchmark databases from subsets of "nr" and "nt" (sketched below).
• Non-redundant proteins are sampled from "nr".
• The size ratio of the nucleotide to protein databases is preserved to avoid skewing runtime results.
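A minimal sketch of how such subsets might be generated, assuming plain FASTA dumps of "nr" and "nt" and the BLAST+ makeblastdb formatter (the 2005-era equivalent was formatdb). The file names and sampling fraction are illustrative, not taken from the slides; sampling both databases with the same fraction preserves the nucleotide-to-protein size ratio in expectation.

    import random
    import subprocess

    FRACTION = 0.05   # illustrative sampling fraction (not specified in the slides)
    random.seed(0)    # fixed seed so the benchmark databases are reproducible

    def read_fasta(path):
        """Yield (header, sequence) records from a FASTA file."""
        header, chunks = None, []
        with open(path) as fh:
            for line in fh:
                line = line.rstrip("\n")
                if line.startswith(">"):
                    if header is not None:
                        yield header, "".join(chunks)
                    header, chunks = line, []
                else:
                    chunks.append(line)
        if header is not None:
            yield header, "".join(chunks)

    def sample_fasta(src, dst, fraction):
        """Write a random subset of the records in src to dst."""
        with open(dst, "w") as out:
            for header, seq in read_fasta(src):
                if random.random() < fraction:
                    out.write(f"{header}\n{seq}\n")

    # Sampling "nr" and "nt" with the same fraction keeps the
    # nucleotide-to-protein size ratio roughly constant.
    sample_fasta("nr.fasta", "bench_nr.fasta", FRACTION)
    sample_fasta("nt.fasta", "bench_nt.fasta", FRACTION)

    # Format the subsets as BLAST databases (makeblastdb is the BLAST+ tool;
    # formatdb played this role in 2005).
    subprocess.run(["makeblastdb", "-in", "bench_nr.fasta", "-dbtype", "prot",
                    "-out", "bench_nr"], check=True)
    subprocess.run(["makeblastdb", "-in", "bench_nt.fasta", "-dbtype", "nucl",
                    "-out", "bench_nt"], check=True)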
Queries
• >90% of protein queries are <1000 residues in length.
• >90% of nucleotide queries are <2000 base pairs in length.
• The query set should cover the major model organisms.
• Solution: sample 200 queries from refseq_rna and refseq_protein. The resulting set covers many organisms and has a typical length distribution.
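A sketch of the query sampling step under the same assumptions. The refseq FASTA file names, the output layout, and the even 100/100 split between protein and nucleotide queries are assumptions (the slides only specify 200 queries in total). A uniform one-pass reservoir sample naturally reproduces the workload's length distribution, which can be checked against the <1000-residue and <2000-bp figures above.

    import os
    import random

    def read_fasta(path):
        """Yield (header, sequence) records from a FASTA file (same helper as above)."""
        header, chunks = None, []
        with open(path) as fh:
            for line in fh:
                line = line.rstrip("\n")
                if line.startswith(">"):
                    if header is not None:
                        yield header, "".join(chunks)
                    header, chunks = line, []
                else:
                    chunks.append(line)
        if header is not None:
            yield header, "".join(chunks)

    def reservoir_sample(records, k):
        """Uniformly sample k records in a single pass (reservoir sampling)."""
        chosen, seen = [], 0
        for rec in records:
            seen += 1
            if len(chosen) < k:
                chosen.append(rec)
            else:
                j = random.randrange(seen)
                if j < k:
                    chosen[j] = rec
        return chosen

    random.seed(0)
    protein_queries = reservoir_sample(read_fasta("refseq_protein.fasta"), 100)
    nucleotide_queries = reservoir_sample(read_fasta("refseq_rna.fasta"), 100)

    # Sanity check against the observed workload: most protein queries should be
    # under 1000 residues and most nucleotide queries under 2000 bp.
    print(sum(len(s) < 1000 for _, s in protein_queries) / len(protein_queries))
    print(sum(len(s) < 2000 for _, s in nucleotide_queries) / len(nucleotide_queries))

    # Write one query per file so the driver can issue one search per query.
    os.makedirs("queries/prot", exist_ok=True)
    os.makedirs("queries/nucl", exist_ok=True)
    for i, (header, seq) in enumerate(protein_queries):
        with open(f"queries/prot/{i:03d}.fasta", "w") as out:
            out.write(f"{header}\n{seq}\n")
    for i, (header, seq) in enumerate(nucleotide_queries):
        with open(f"queries/nucl/{i:03d}.fasta", "w") as out:
            out.write(f"{header}\n{seq}\n")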
Tasks
Program distribution:
• blastn     50%
• megablast  10%
• blastp     20%
• blastx     10%
• tblastn     5%
• tblastx     5%
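One way the mix above could be expanded into a concrete list of 200 tasks; the expansion is a straightforward reading of the percentages, not a method stated in the slides.

    # Program mix from the table above, as fractions of the 200 searches.
    PROGRAM_MIX = {
        "blastn":    0.50,
        "megablast": 0.10,
        "blastp":    0.20,
        "blastx":    0.10,
        "tblastn":   0.05,
        "tblastx":   0.05,
    }

    TOTAL_SEARCHES = 200

    def build_task_list():
        """Expand the percentages into a concrete list of 200 program names:
        100 blastn, 20 megablast, 40 blastp, 20 blastx, 10 tblastn, 10 tblastx."""
        tasks = []
        for program, fraction in PROGRAM_MIX.items():
            tasks.extend([program] * round(TOTAL_SEARCHES * fraction))
        return tasks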
Driver script
• Executes 200 searches according to the program distribution above.
• Runs in 35 minutes on current (2005) hardware.
• Can be used to measure speed or throughput.
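A minimal sketch of such a driver, reusing build_task_list and the query and database files from the sketches above. It issues BLAST+ command lines (blastn -task megablast, -query/-db/-out); a 2005-era driver would have called blastall and the standalone megablast binary instead. Serial execution with total wall-clock time gives a speed measurement; throughput could be estimated by running several copies of the suite concurrently and counting searches completed per unit time.

    import glob
    import subprocess
    import time

    # Query pool and benchmark database used by each program
    # (paths come from the earlier sketches and are illustrative).
    NUCL_QUERIES = sorted(glob.glob("queries/nucl/*.fasta"))
    PROT_QUERIES = sorted(glob.glob("queries/prot/*.fasta"))

    PROGRAM_SETUP = {
        "blastn":    (NUCL_QUERIES, "bench_nt"),
        "megablast": (NUCL_QUERIES, "bench_nt"),
        "blastp":    (PROT_QUERIES, "bench_nr"),
        "blastx":    (NUCL_QUERIES, "bench_nr"),
        "tblastn":   (PROT_QUERIES, "bench_nt"),
        "tblastx":   (NUCL_QUERIES, "bench_nt"),
    }

    def run_search(program, query_path, db):
        """Run a single search with BLAST+ (megablast maps to 'blastn -task megablast')."""
        cmd = ["blastn", "-task", "megablast"] if program == "megablast" else [program]
        cmd += ["-query", query_path, "-db", db, "-out", "/dev/null"]
        subprocess.run(cmd, check=True)

    def run_suite(tasks):
        """Run every task serially and return total wall-clock time in seconds."""
        used = {program: 0 for program in PROGRAM_SETUP}
        start = time.time()
        for program in tasks:
            queries, db = PROGRAM_SETUP[program]
            query = queries[used[program] % len(queries)]  # cycle through the query pool
            used[program] += 1
            run_search(program, query, db)
        return time.time() - start

    if __name__ == "__main__":
        elapsed = run_suite(build_task_list())  # task list from the sketch above
        print(f"Total wall-clock time for 200 searches: {elapsed:.0f} s")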