VIGOR3 re-engineering VIGOR

VIGOR3 re-engineering VIGOR

VIGOR3 – re-engineering VIGOR Goals Incorporate “best practices” into software design Decrease time required to implement a new genome Simplify use Improve integration with current JCVI processes

Incorporate “best practices” into software design • Code has been reorganized using a structured programming approach. • Reporting disentangled from gene calling. • Code is heavily parameterized (blast parameters, similarity and coverage requirements, candidate selectivity, etc.). • # initialization • # candidate regions • set_parameter( "candidate_blastopts", "-p blastx -M BLOSUM45 -e 1E-5 -g F -b 1000 -v 0 -F \"\"" ); • set_parameter( "candidate_evalue", " ); • set_parameter( "min_candidate_pctsimilarity", 50 ); • set_parameter( "min_candidate_sbjcoverage", 33 ); • set_parameter( "selectivity", 0.985 ); • … • SQLite database is queried for overrides of parameter defaults appropriate to selected reference database. • sqlite> select col1,col2 from vigor_params where db='cov_g1a_db' order by col1; • candidate_blastopts-p blastx -M BLOSUM62 -e 1E-5 -v 0 -F \"\" -b 1000 • min_candidate_pctsimilarity55 • selectivity 0.980

Decrease time required to incorporate a new genome • Developed generalized module to deal with each customization exception • Overlapping genes • Splicing • Ribosomal slippage • RNA editing • Translation exceptions • Functional annotation • Mature peptide annotation • Execution of the customization modules is configured by defline tags in the reference database. • >YP_006491235.1 gene="SP-2" product="truncated polyprotein" shared_cds="SP" matpepdb="<vigordata>/veev_sp2_mp" structure="e2409i-1e78" ribosomal_slippage=Y slippage_motif="[NT]{6}[NA]" slippage_frameshift=-1 db="veevX_db“ • First small steps towards a true rule-based annotation? • check overlapping genes against list of permitted and forbidden overlaps • score result for each possible splice in region around expected intron and use best • apply frameshift to each potential slippage site, score results and use best • apply edit (regexp) to each potential edit site, score results and use best • alternate start codon/stop codon readthru – use best result, with/without exception • gene symbol, product name, and (optionally) notes • homology based – use hits from reference db that best cover the polyprotein

Simplify use • With customization moved out of the code and into configuration data, a single executable can now handle all genomes • The reference genome can be specified as a command line option • An auto-select mode has been implemented to allow VIGOR3 to choose the reference database best suited to the genome • A detailed summary is generated for each genomic sequence • Sequence Length Genes Pseudogenes CDS Bases Peptide Bases %Ref Coverage %Ref Identity %Ref Similarity Ref DB • gi|32172464|gb|AY309 15384 9 0 15453 5142 99.9 97.7 99.4 mmp_db • gene_id %id %sim %cov %t5 %gap %t3 start..stop pep sz ref sz ref id definition • gi|32172464.1 97.8 99.2 100.0 0.0 0.0 0.0 146..1795 549 549 Q77IS8.1 NP | nucleocapsid protein • gi|32172464.2 93.9 97.4 96.5 0.0 0.0 3.5 1979..2472 165 171 AEY76114.1 I | I protein • transcription exception 2435..2441:AAGAGGGGGGG • gi|32172464.3 97.8 99.0 100.0 0.0 0.0 0.0 1979..2653 224 224 BAK26765.1 V | V protein • gi|32172464.4 97.4 98.9 100.0 0.0 0.0 0.0 1979..3152 391 391 NP_054708.1 P | phosphoprotein • transcription exception 2435..2441:AAGAGGGGG • gi|32172464.5 99.5 99.9 100.0 0.0 0.0 0.0 3264..4391 375 375 NP_054710.1 M | membrane protein • gi|32172464.6 96.5 99.3 100.0 0.0 0.0 0.0 4546..6162 538 538 AAT81469.1 F | fusion protein • gi|32172464.7 91.2 96.1 100.0 0.0 0.0 0.0 6268..6441 57 57 P28087.1 SH | small hydrophobic protein • gi|32172464.8 97.3 99.4 100.0 0.0 0.0 0.0 6614..8362 582 582 AAT81471.1 HN | hemagglutinin-neuraminidase • gi|32172464.9 98.3 99.6 100.0 0.0 0.0 0.0 8438..15223 2261 2261 P30929.1 L | large protein

Improve integration with current JCVI processes • Better support for draft genomes • gap aware – genes are broken at N-gaps, producing “N-terminal,” “fragment,” and “C-terminal” components. • improved handling of partial sequences. • Integrated with AutoTasker to support closure. • Improved diagnosis of potential frameshifts and sequencing errors. • Integrated into the annotation pipeline via a “push-button” wrapper that allows annotation of genomes in batches of hundreds of sequences. • Data driven annotation makes it easier to respond to changing community standards as knowledge about the genome evolves.

Gene Calling in VIGOR3 Blast genome against reference database …filter out low quality HSPs >46 gtaaaaaatgcgtactacaaacttgcgtaaaccaaaaaaatggggcaaataagaatttgataagtaccacttaaatttaactcctttggttagagatgggcagcaattcattgagtatgataaaagttagattacaaaatttatttgacaatgatgaagtagcattgttaaaaataacctgctatactgacaaattgatacatttaactaatgctttggctaaggcagtgatacatacaatcaaattgaatggcattgtatttgtgcatgttattacaagtagtgatatttgccctaataataatattgtagtgaaatccaacttcacaacaatgccagtgttacaaaatggaggttatatatgggaaatgatggaattaacacactgctctcaacccaatggcctaatagatgacaattgtgaaatcaaattctccaa… >CA181295.1 gene="F"… MELPILKANAITTILAAVTFCFASSQNITEEFYQS… >AA238520.1 gene="F"… MELPILKTNAITTILAAVTLCFASSQNITEEFYQS… >AJ188031.1 gene="G"… MSKHKNQRTARTLEKTWDTLNHLIVISSCLYKLNL… >AQ166856.1 gene="G"… MSKTKDQRTAKTLERTWDTLNHLLFISSCLYKLNL… BLAST-X CA181295 CA181295 CA… CA181295 C A CA181295 High Scoring Pairs AA238520 A AA238520 A AJ188031 AQ… C • C C AQ166856 C A AJ…

For each HSP …project the implied gene …cluster the projections gene F generated TWO clusters Cluster F-1 CA181295 CA181295 CA… CA181295 Cluster F-2 CA181295 AA238520 AA238520 Cluster G AQ… AQ166856 AJ… AJ188031

Within a cluster ...generate “hits” (combinations of HSPs) …find missing exons* …discard hits below threshold …score the hits …calculate threshold score Cluster CA181295 CA181295 CA… CA181295 322.8 … … 288.3 … … AA238520 327.1 * 0.95 = 310.745 AA238520 327.1 …

Finding missing exons …blast …generate all possible hits establish region to search HSPs from blast implied exon (based on known gene structure) expected intron and exon (based on known gene structure) buffer buffer search region for missing exon

…remove hits that fail coverage requirement Pool all hits from all clusters …recluster hits …remove conflicting hits each cluster will yield a gene or pseudogene G HN Cluster F Cluster L F L F F L Cluster HN HN HN Cluster G G F HN G G F

For each hit …merge HSPs from same exon …adjust internal edges to splice sites …adjust external edges to start and stop codons …keep the best …score the hits there’s your gene ! 513.5 G 481.0 G no start codon 347.7 G frameshift !

Acknowledgements Shiliang Wang for his original concept, for many critical algorithms, for development of several reference databases, and for many useful insights Neha Gupta, Lakshmi Appalla for the development of the bulk of the reference databases, for hours of testing, and for their valuable feedback and many useful insights Susmita Shrivastava For her many insights, and particularly for her valuable suggestions for utilities and procedures to assist in the development of reference databases Paolo Amedeo, Dan Katzel for many useful suggestions, and for their assistance in the integration of VIGOR3 into existing JCVI procedures and pipelines. Ramana Madupu, Chris Town, Nadia Fedorova for their patience, valuable feedback, and willingness to serve as beta-testers (whether they realized it or not). Tim Stockwell for providing the opportunity and support for this effort.

VIGOR3 re-engineering VIGOR