Improving the Sensitivity of Peptide Identification for Genome Annotation

Improving the Sensitivityof Peptide Identification for Genome Annotation Nathan Edwards Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center

Lost peptide identifications • Missing from the sequence database • Search engine strengths, weaknesses, quirks • Poor score or statistical significance • Thorough search takes too long

Lost peptide identifications • Missing from the sequence database • Build exhaustive peptide sequence databases • Search prokaryotic genomes • Build evidence for unannotated proteins and protein isoforms • Search engine strengths, weaknesses, quirks • Use multiple search engines and combine results • Poor score or statistical significance • Use search-engine consensus to boost confidence • Use machine-learning to distinguish true from false • Thorough search takes too long • Harness the power of heterogeneous computational grids

Searching under the street-light… • Tandem mass spectrometry doesn’t discriminate against novel peptides......but protein sequence databases do! • Searching traditional protein sequence databases biases the results in favor ofwell-understoodand/orcomputationally predicted proteins and protein isoforms!

Unannotated Splice Isoform • Human Jurkat leukemia cell-line • Lipid-raft extraction protocol, targeting T cells • von Haller, et al. MCP 2003. • LIME1 gene: • LCK interacting transmembrane adaptor 1 • LCK gene: • Leukocyte-specific protein tyrosine kinase • Proto-oncogene • Chromosomal aberration involving LCK in leukemias. • Multiple significant peptide identifications

Unannotated Splice Isoform

Translation start-site correction • Halobacterium sp. NRC-1 • Extreme halophilic Archaeon, insoluble membrane and soluble cytoplasmic proteins • Goo, et al. MCP 2003. • GdhA1 gene: • Glutamate dehydrogenase A1 • Multiple significant peptide identifications • Observed start is consistent with Glimmer 3.0 prediction(s)

Halobacterium sp. NRC-1ORF: GdhA1 • Peptide identifications filtered at 10% FDR • Consistent/not consistentwith NP_279651

Halobacterium sp. NRC-1ORF: GdhA1 • Peptide identifications filtered at 20% FDR • Consistent/not consistentwith NP_279651

Translation start-site correction

We can observe evidence for… • Known coding SNPs • Unannotated coding mutations • Alternate splicing isoforms • Alternate/Incorrect translation start-sites • Microexons • Alternate/Incorrect translation frames …though it must be treated thoughtfully.

Peptide Sequence Databases All amino-acid seqs of at most 30 amino-acids from: • IPI and all IPI constituent protein sequences • IPI, HInvDB, VEGA, UniProt, EMBL, RefSeq, GenBank • SwissProt variants, conflicts, splices, and annotated signal peptide truncations. • Genbank and RefSeq mRNA sequence • 3 frame translation • GenBank EST and HTC sequences • 6 frame translation and found in at least 2 sequences Grouped by Gene/UniGene cluster and compressed.

Peptide Sequence Databases • Formatted as a FASTA sequence database • Easy integration with search engines. • One entry per gene/cluster. • Automated rebuild every few months.

Peptide evidence, in context • Statistically significant identified peptides can be misleading… • Isobaric amino-acid/PTM substitutions • Unsubstantiated peptide termini • Few b-ions or y-ions suggest “random” mass match • Single amino-acids on upstream or downstream exons • Peptides in 5’ UTR with no upstream Met • Need tools to quickly check the corroborating (genomic, transcript, SNP) evidence

Counts: by gene and evidence EST, mRNA, Protein Sequences: accessions by gene UniProt variants nucleotide sequence & link to BLAT alignment Genomic Loci: one-click projection to the UCSC genome browser PeptideMapper Web Service

PeptideMapper Web Service I’m Feeling Lucky

PeptideMapper Web Service • Suffix-tree index on peptide sequence database • Fast peptide to gene/cluster mapping • “Compression” makes this feasible • Peptide alignment with cluster evidence • Amino-acid or nucleotide; exact & near-exact • Genomic-loci mapping via • UCSC “known-gene” transcripts, and • Predetermined, embedded genomic coordinates

SEQUEST Mascot 28% 14% 14% 38% 1% 3% 2% X! Tandem Comparison of search engine results • No single score is comprehensive • Search engines disagree • Many spectra lack confident peptide assignment Searle et al. JPR 7(1), 2008

Combining search engine results – harder than it looks! • Consensus boosts confidence, but... • How to assess statistical significance? • Gain specificity, but lose sensitivity! • Incorrect identifications are correlated too! • How to handle weak identifications? • Consensus vs disagreement vs abstention • Threshold at some significance? • We apply unsupervised machine-learning.... • Lots of related work unified in a single framework.

Supervised Learning

Unsupervised Learning

PepArML Combining Results Q-TOF Edwards, et al., Clin. Prot. 5(1), 2009 MALDI LTQ

Unsupervised Learning U*-TMO U-TMO C-TMO H Edwards, et al., Clin. Prot. 5(1), 2009

Peptide Atlas A8_IP LTQ Dataset

MALDI spectra of proteolytic peptides in serum Top-down CID spectra after decharging Halobacterium six-frame search PepArML found 389 non-RefSeq peptides Mascot: 173, OMSSA: 168, K-Score: 292 Peptides for GdhA1: PepArML 9(2), K-Score: 6(1) Semi-tryptic searches work particularly well. PepArML in the trenches… S17 Spectra at 10% FDR

Searching for Consensus Search engine quirks can destroy consensus • Initial methionine loss as tryptic peptide • Charge state enumeration or guessing • X!Tandem's refinement mode • Pyro-Gln, Pyro-Glu modifications • Difficulty tracking spectrum identifiers • Precursor mass tolerance (Da vs ppm) Decoy searches must be identical!

Configuring for Consensus Search engine configuration can be difficult: • Correct spectral format • Search parameter files and command-line • Pre-processed sequence databases. • Tracking spectrum identifiers • Extracting peptide identifications, especially modifications and protein identifiers

Instrument Precursor Tolerance Fragment Tolerance Max. Charge Sequence Database Target and # of Decoys Modification Fixed/Variable Amino-Acids Position Delta Proteolytic Agent Motif Peptide Candidates Termini Specificity Precursor Tolerance Missed cleavages Charge State Handling # 13C Peaks Search Engines Mascot, X!Tandem, K-Score, OMSSA, MyriMatch Peptide Identification Meta-Search Parameters

Simple unified search interface for: Mascot, X!Tandem, K-Score, OMSSA, MyriMatch Automatic decoy searches Automatic spectrumfile "chunking" Automatic scheduling Serial, Multi-Processor, Cluster, Grid Peptide Identification Meta-Search

Peptide Identification Grid-Enabled Meta-Search X!Tandem, KScore, OMSSA, MyriMatch, Mascot (1 core). NSF TeraGrid 1000+ CPUs Heterogeneous compute resources X!Tandem, KScore, OMSSA. Secure communication Edwards Lab Scheduler & 48+ CPUs Scales easily to 250+ simultaneoussearches X!Tandem, KScore, OMSSA. Single, simplesearch request UMIACS 250+ CPUs

Peptide Identification Grid-Enabled Meta-Search Heterogeneous compute resources NSF TeraGrid 1000+ CPUs Edwards Lab Scheduler & 48+ CPUs Secure communication Simple searchrequest UMIACS 250+ CPUs

Peptide Atlas A8_IP LTQ Dataset • Tryptic search of Human ESTs using PepSeqDB • 107084 spectra searched ~ 26 times: • Target + 2 decoys, 5 engines, 1+ vs 2+/3+ charge • 8685 search jobs • 25.7 days of CPU time. • 5211 TeraGrid TKO jobs < 2 hours • Using 143 different machines • Total elapsed time < 26 hours • Bottleneck: Mascot license.

Peptide Identification Grid-Enabled Meta-Search • Access to high-performance computing resources for the proteomics community • NSF TeraGrid Community Portal • University/Institute HPC clusters • Individual lab compute resources • Contribute cycles to the community and get access to others’ cycles in return. • Centralized scheduler • Compute capacity can still be exclusive, or prioritized. • Compute client plays well with HPC grid schedulers.

Conclusions Improve the scope and sensitivity of peptide identification for genome annotation, using • Exhaustive peptide sequence databases • Machine-learning for combining • Meta-search tools to maximize consensus • Grid-computing for thorough search http://edwardslab.bmcb.georgetown.edu

Acknowledgements • Dr. Catherine Fenselau • University of Maryland Biochemistry • Dr. Rado Goldman • Georgetown University Medical Center • Dr. Chau-Wen Tseng & Dr. Xue Wu • University of Maryland Computer Science • Funding: NIH/NCI

Improving the Sensitivity of Peptide Identification for Genome Annotation

Improving the Sensitivity of Peptide Identification for Genome Annotation

Presentation Transcript

Genome annotation

MICROBIAL GENOME ANNOTATION

Computational Genome Annotation

Proteogenomics : Refining and Improving Genome Annotation

Genome Annotation

Genome Annotation

Eukaryotic Genome Annotation

Genome Annotation

Genome Annotation

Peptide-assisted annotation of the Mlp genome

Genome Annotation

Basics of Genome Annotation

Improving the Sensitivity of Peptide Identification for Genome Annotation

microbial genome annotation

Genome Annotation

Genome Annotation

Improving the Sensitivity of Peptide Identification With Meta-Search and Machine Learning

Eukaryotic Genome Annotation

Improving Genome Annotation using Proteomics

Arabidopsis Genome Annotation

Annotation of the Laccaria genome

Genome Annotation