BLAST benchmarks
George Coulouris
NCBI/NLM/NIH
coulouri@ncbi.nlm.nih.gov
June 2005

Overview: Design a test suite that approximates NCBI BLAST's stable production workload of more than 150,000 searches per day, using representative databases and queries while preserving database size ratios, so that BLAST performance can be compared across hardware platforms and implementations.
Motivation and goal
• It’s hard to define what constitutes a “typical” search.
• NCBI BLAST processes over 150,000 searches per day.
• Large-scale characteristics of this workload are stable over time.
• Goal: design a test suite that approximates this workload.
Applications
• Evaluate the relative performance of BLAST running on different hardware.
• Evaluate the relative performance of different BLAST implementations.
Components
• Databases
• Queries
• Tasks
• Driver
Databases
• Protein "nr" and nucleotide "nt" account for >80% of all searches, making them a good choice for representative databases.
• Sequences are constantly added and removed; the databases are updated daily.
• This volatility, together with their large size, makes the production databases unsuitable for direct use in a benchmark.
Databases
• Solution: generate benchmark databases from subsets of "nr" and "nt" (sketched below).
• Non-redundant proteins are sampled from "nr".
• The size ratio of the nucleotide to protein databases is preserved to avoid skewing runtime results.
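A minimal sketch of how such subsets might be generated, assuming plain FASTA dumps of "nr" and "nt" and the BLAST+ makeblastdb formatter (the 2005-era equivalent was formatdb). The file names and sampling fraction are illustrative, not taken from the slides; sampling both databases with the same fraction preserves the nucleotide-to-protein size ratio in expectation.

    import random
    import subprocess

    FRACTION = 0.05   # illustrative sampling fraction (not specified in the slides)
    random.seed(0)    # fixed seed so the benchmark databases are reproducible

    def read_fasta(path):
        """Yield (header, sequence) records from a FASTA file."""
        header, chunks = None, []
        with open(path) as fh:
            for line in fh:
                line = line.rstrip("\n")
                if line.startswith(">"):
                    if header is not None:
                        yield header, "".join(chunks)
                    header, chunks = line, []
                else:
                    chunks.append(line)
        if header is not None:
            yield header, "".join(chunks)

    def sample_fasta(src, dst, fraction):
        """Write a random subset of the records in src to dst."""
        with open(dst, "w") as out:
            for header, seq in read_fasta(src):
                if random.random() < fraction:
                    out.write(f"{header}\n{seq}\n")

    # Sampling "nr" and "nt" with the same fraction keeps the
    # nucleotide-to-protein size ratio roughly constant.
    sample_fasta("nr.fasta", "bench_nr.fasta", FRACTION)
    sample_fasta("nt.fasta", "bench_nt.fasta", FRACTION)

    # Format the subsets as BLAST databases (makeblastdb is the BLAST+ tool;
    # formatdb played this role in 2005).
    subprocess.run(["makeblastdb", "-in", "bench_nr.fasta", "-dbtype", "prot",
                    "-out", "bench_nr"], check=True)
    subprocess.run(["makeblastdb", "-in", "bench_nt.fasta", "-dbtype", "nucl",
                    "-out", "bench_nt"], check=True)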
Queries
• >90% of protein queries are <1000 residues in length.
• >90% of nucleotide queries are <2000 base pairs in length.
• The query set should cover the major model organisms.
• Solution: sample 200 queries from refseq_rna and refseq_protein. The resulting set covers many organisms and has a typical length distribution.
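A sketch of the query sampling step under the same assumptions. The refseq FASTA file names, the output layout, and the even 100/100 split between protein and nucleotide queries are assumptions (the slides only specify 200 queries in total). A uniform one-pass reservoir sample naturally reproduces the workload's length distribution, which can be checked against the <1000-residue and <2000-bp figures above.

    import os
    import random

    def read_fasta(path):
        """Yield (header, sequence) records from a FASTA file (same helper as above)."""
        header, chunks = None, []
        with open(path) as fh:
            for line in fh:
                line = line.rstrip("\n")
                if line.startswith(">"):
                    if header is not None:
                        yield header, "".join(chunks)
                    header, chunks = line, []
                else:
                    chunks.append(line)
        if header is not None:
            yield header, "".join(chunks)

    def reservoir_sample(records, k):
        """Uniformly sample k records in a single pass (reservoir sampling)."""
        chosen, seen = [], 0
        for rec in records:
            seen += 1
            if len(chosen) < k:
                chosen.append(rec)
            else:
                j = random.randrange(seen)
                if j < k:
                    chosen[j] = rec
        return chosen

    random.seed(0)
    protein_queries = reservoir_sample(read_fasta("refseq_protein.fasta"), 100)
    nucleotide_queries = reservoir_sample(read_fasta("refseq_rna.fasta"), 100)

    # Sanity check against the observed workload: most protein queries should be
    # under 1000 residues and most nucleotide queries under 2000 bp.
    print(sum(len(s) < 1000 for _, s in protein_queries) / len(protein_queries))
    print(sum(len(s) < 2000 for _, s in nucleotide_queries) / len(nucleotide_queries))

    # Write one query per file so the driver can issue one search per query.
    os.makedirs("queries/prot", exist_ok=True)
    os.makedirs("queries/nucl", exist_ok=True)
    for i, (header, seq) in enumerate(protein_queries):
        with open(f"queries/prot/{i:03d}.fasta", "w") as out:
            out.write(f"{header}\n{seq}\n")
    for i, (header, seq) in enumerate(nucleotide_queries):
        with open(f"queries/nucl/{i:03d}.fasta", "w") as out:
            out.write(f"{header}\n{seq}\n")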
Tasks
Program distribution:
• blastn     50%
• megablast  10%
• blastp     20%
• blastx     10%
• tblastn     5%
• tblastx     5%
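One way the mix above could be expanded into a concrete list of 200 tasks; the expansion is a straightforward reading of the percentages, not a method stated in the slides.

    # Program mix from the table above, as fractions of the 200 searches.
    PROGRAM_MIX = {
        "blastn":    0.50,
        "megablast": 0.10,
        "blastp":    0.20,
        "blastx":    0.10,
        "tblastn":   0.05,
        "tblastx":   0.05,
    }

    TOTAL_SEARCHES = 200

    def build_task_list():
        """Expand the percentages into a concrete list of 200 program names:
        100 blastn, 20 megablast, 40 blastp, 20 blastx, 10 tblastn, 10 tblastx."""
        tasks = []
        for program, fraction in PROGRAM_MIX.items():
            tasks.extend([program] * round(TOTAL_SEARCHES * fraction))
        return tasks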
Driver script
• Executes 200 searches according to the program distribution above.
• Runs in 35 minutes on current (2005) hardware.
• Can be used to measure speed or throughput.
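A minimal sketch of such a driver, reusing build_task_list and the query and database files from the sketches above. It issues BLAST+ command lines (blastn -task megablast, -query/-db/-out); a 2005-era driver would have called blastall and the standalone megablast binary instead. Serial execution with total wall-clock time gives a speed measurement; throughput could be estimated by running several copies of the suite concurrently and counting searches completed per unit time.

    import glob
    import subprocess
    import time

    # Query pool and benchmark database used by each program
    # (paths come from the earlier sketches and are illustrative).
    NUCL_QUERIES = sorted(glob.glob("queries/nucl/*.fasta"))
    PROT_QUERIES = sorted(glob.glob("queries/prot/*.fasta"))

    PROGRAM_SETUP = {
        "blastn":    (NUCL_QUERIES, "bench_nt"),
        "megablast": (NUCL_QUERIES, "bench_nt"),
        "blastp":    (PROT_QUERIES, "bench_nr"),
        "blastx":    (NUCL_QUERIES, "bench_nr"),
        "tblastn":   (PROT_QUERIES, "bench_nt"),
        "tblastx":   (NUCL_QUERIES, "bench_nt"),
    }

    def run_search(program, query_path, db):
        """Run a single search with BLAST+ (megablast maps to 'blastn -task megablast')."""
        cmd = ["blastn", "-task", "megablast"] if program == "megablast" else [program]
        cmd += ["-query", query_path, "-db", db, "-out", "/dev/null"]
        subprocess.run(cmd, check=True)

    def run_suite(tasks):
        """Run every task serially and return total wall-clock time in seconds."""
        used = {program: 0 for program in PROGRAM_SETUP}
        start = time.time()
        for program in tasks:
            queries, db = PROGRAM_SETUP[program]
            query = queries[used[program] % len(queries)]  # cycle through the query pool
            used[program] += 1
            run_search(program, query, db)
        return time.time() - start

    if __name__ == "__main__":
        elapsed = run_suite(build_task_list())  # task list from the sketch above
        print(f"Total wall-clock time for 200 searches: {elapsed:.0f} s")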