Improving Statistical Significance Assignment in Mass Spectrometry Based Peptide Identification

• Overview
• Statistical Significance in Peptide Identification
• Statistics through de novo method
• Example
• New way to Combine Search Results

Yi-Kuo Yu
Quantitative Molecular Biological Physics (QMBP) Group
National Center for Biotechnology Information
National Library of Medicine, National Institutes of Health

QMBP Research using Biowulf
• Molecular Dynamics: protein folding simulations
• Molecular Networks: information transduction in protein-protein interaction networks
• Molecular Interactions: exact electrostatic force/energy
• Mass Spectrometry: statistics of peptide/protein ID

Mass Spect. Task force: LCDR Gelio Alves, Dr. Aleksey Ogurtsov

Relevant References:
1. Gelio Alves and Yi-Kuo Yu. Statistical Characterization of a 1D Random Potential Problem – with applications in score statistics of MS-based peptide sequencing. Physica A (2008), 387:6538-6544. doi:10.1016.
2. G. Alves, A. Ogurtsov, Wells W. Wu, Guanghui Wang, R-F Shen and Yi-Kuo Yu. Calibrating E-values for MS2 Database Search Methods. Biology Direct, 2007, 2:26.
3. Gelio Alves, Wells W. Wu, Guanghui Wang, Rong-Fong Shen and Yi-Kuo Yu. Enhancing Peptide Identification Confidence by Combining Search Methods. Journal of Proteome Research, 7:3102-3113 (2008).

Overview: MS-based Proteomics
Protein identification is important for proteomics and systems biology.
• Important issues:
(1) Protein ID in a mixture,
(2) Protein circuit / localization,
(3) Signaling and communication.
Desirable to understand which proteins are involved.
Figure: a generic pathway.

What can mass spect do?
Protein identification through peptide identification: MS/MS produces partial-peptide fragments (a, b, c and x, y, z ions) and thus provides more information about the peptide for sequencing. Given a set of MS/MS spectra, one may identify the peptides involved, by database search or de novo sequencing, and then infer the proteins involved.

What is the problem?
Confidence assignment in peptide identifications (how to confidently interpret biological experiments):
• Where to draw the line when selecting peptide candidates?
• How to rank peptide candidates across spectra?
• How to compare results analyzed using different search methods? (Does a top hit in method M1 carry the same meaning as one in method M2?)
• How to compare results from different experiments?
A possible solution is a robust statistical significance assignment that provides
• a quantifiable confidence measure for peptide ID
• the flexibility to compare results from different spectra and even from different search methods.

Solid Statistics (E-values) might be our best rescue
In the context of peptide searches, both the E- and P-values may be viewed as monotonically decreasing functions of some algorithm-dependent quality score S. For a given quality score cutoff, the P-value is the probability of finding a random hit with quality score greater than or equal to the cutoff. The E-value is the expected number of hits in a random database with quality score greater than or equal to the cutoff.
E = P × (random database size) [equivalent to the Bonferroni correction]
Key assumption needed: aside from the true peptides, the rest of the peptides in the database appear to be random with respect to any given MS/MS spectrum.
Using correct E-values, we can compare search results from different spectra and even from different search methods!
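The relation E = P × (random database size) can be illustrated with a minimal sketch. The function name `e_value` and the interpretation of `db_size` as the number of candidate peptides scored against the spectrum are assumptions made here for illustration, not something the slides specify.

```python
def e_value(p_value: float, db_size: int) -> float:
    """Expected number of random hits scoring at or above the cutoff.

    p_value: probability that one random peptide reaches the cutoff score.
    db_size: number of random candidate peptides scored (assumed meaning).
    """
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie in [0, 1]")
    if db_size < 0:
        raise ValueError("db_size must be non-negative")
    return p_value * db_size


# A hit with P = 1e-6 among 250,000 candidates gives E of about 0.25:
# roughly one such hit by chance in every four searches of this size.
print(e_value(1e-6, 250_000))
```

Because the E-value already accounts for the number of candidates tried, two hits with the same E-value carry the same expected number of chance matches even when the candidate counts differ.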

Aren’t there many methods reporting E-values already? Why not just use them?
Apparently, most E-values reported deviate from the textbook definition.
Figure legend: CR: removal of highly homologous clusters [Ref: Biol. Direct, 2007, 2:25]

To circumvent the statistical inaccuracy:
(1) We developed RAId_DbS, a new search method that has satisfactory statistics (see below) but without losing performance (see ROC curves to the right).
Figure panels: using profile data; using centroid data.

(2) We provide a protocol to calibrate E-values. Some methods do not report E-values. To compare the search results from these methods, one needs to calibrate their statistics; see G. Alves, A. Y. Ogurtsov, Y.-K. Yu, Calibrating E-values for MS2 Database Search Methods [Biol. Direct (2007), 2:26]. Problem: calibration may lose spectrum-specific statistics.

(3) Statistical calibration leads to a way to combine search results from different methods (but cannot enforce spectrum-specific statistics); see Alves et al., Enhancing peptide identification by combining search methods [JPR (2008), 7:3102–3113].

Another advantage of having accurate E-values: a simple connection to the False Discovery Rate (FDR), expressed in terms of the E-value cutoff Ec, the total number of spectra N, and H(Ec), the cumulative number of hits with E-value smaller than or equal to Ec. No need to search a decoy database to get the FDR!
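The explicit FDR relation is not reproduced in the text. One reading consistent with the stated definitions, offered here as an assumption rather than as the authors' formula, is FDR ≈ N × Ec / H(Ec): with accurate E-values, about Ec chance hits are expected per spectrum at cutoff Ec, so about N × Ec of the H(Ec) reported hits are expected to be false. The function name `estimate_fdr` and its arguments are hypothetical.

```python
def estimate_fdr(hit_e_values, n_spectra, e_cutoff):
    """Estimate the FDR at an E-value cutoff (assumed relation N*Ec/H(Ec)).

    hit_e_values: E-values of all reported hits, pooled over spectra.
    n_spectra:    total number of spectra searched (N).
    e_cutoff:     E-value cutoff (Ec).
    """
    # H(Ec): cumulative number of hits with E-value <= Ec.
    hits = sum(1 for e in hit_e_values if e <= e_cutoff)
    if hits == 0:
        return 0.0  # nothing reported, so nothing can be a false discovery
    expected_false = n_spectra * e_cutoff
    return min(1.0, expected_false / hits)


# Five spectra, three hits at or below Ec = 0.02:
# expected false hits = 5 * 0.02 = 0.1, so the estimate is 0.1 / 3.
print(estimate_fdr([0.001, 0.01, 0.02, 0.5, 2.0], n_spectra=5, e_cutoff=0.02))
```

Under this reading no decoy search is needed, since the expected number of false hits comes directly from the E-value cutoff.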

Spectrum-Specific Statistics
Why spectrum-specific statistics? Fragment peaks depend on:
• the parent ion charge state,
• the presence of co-eluted materials and their physical interactions with each other,
• the relative kinetic energy of the inert gas (CID) or of the electrons (ECD, ETD),
• the peptide/co-eluted material concentrations,
• the peptide/co-eluted material conformation in the gaseous phase, etc.
Figure: spectra from the same peptide, KVPQVSTPTLVEVSR.

The complication
• Spectrum-specific noise demands spectrum-specific statistics.
• Not every search method can do this.
• Only two known methods use spectrum-specific statistics:
  • X!Tandem (fitted empirically)
  • RAId_DbS (derived theoretically)
• Recently SEQUEST developers have also investigated the possibility of using spectrum-specific XCorr statistics.

A new approach: obtaining a statistical standard from scoring all possible peptides
Merit: bypasses the need for a decoy database (when FDR is considered) and the need for E-value calibration.
• Challenge: the astronomically large number of peptides to score.
• For a peptide of molecular weight 2300 Da there are ≈ 10^26 possible peptides.
• Scoring 10^9 peptides per second would take ≈ 3.2 × 10^9 years!
Solution: see our recent paper, Physica A (2008), 387:6538-6544. doi:10.1016.
Figure panels: all possible human tryptic peptides; all possible tryptic peptides.
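The time estimate on this slide follows from simple arithmetic, which the short check below reproduces. The constant and function names are chosen here for illustration only.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # Julian year in seconds


def years_to_score(n_peptides: float, rate_per_second: float) -> float:
    """Time, in years, to score n_peptides one by one at the given rate."""
    return n_peptides / rate_per_second / SECONDS_PER_YEAR


# About 1e26 candidate peptides at 1e9 peptides per second.
years = years_to_score(1e26, 1e9)
print(f"{years:.1e} years")  # on the order of 3.2e9 years
```

Brute-force enumeration is therefore out of reach, which is what motivates an approach that never lists peptides individually.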

A new approach: obtaining a statistical standard from scoring all possible peptides (cont…)
Algorithm: a dynamic programming algorithm that can score all possible peptides in a few seconds. It is also capable of incorporating internal structures such as peptide length, hydrophobicity, etc. by extending the dimension of the internal array. [Physica A (2008), 387:6538-6544]
Scoring functions: RAId_DbS, Hyperscore (X!Tandem), K-score, XCorr, WP.
A similar algorithm was proposed independently by Pevzner’s group [JPR (2008), 7:3354-3363].
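The idea of scoring all possible peptides by dynamic programming can be sketched as follows. This is a toy illustration under strong assumptions, not the algorithm of the cited paper: residue masses are integers, the alphabet is cut down to five residues, and the score is a hypothetical additive one (each prefix mass that lands on a peak adds that peak's integer weight). The names `RESIDUE_MASS`, `score_histogram` and `peak_score` are invented for this sketch.

```python
from collections import Counter

# Nominal (integer) residue masses for a small illustrative alphabet.
RESIDUE_MASS = {"G": 57, "A": 71, "S": 87, "P": 97, "V": 99}


def score_histogram(target_mass, peak_score):
    """Histogram of scores over ALL sequences whose residues sum to target_mass.

    peak_score maps an integer prefix mass to an integer score increment
    (a stand-in for a real scoring function).
    """
    # hist[m][s] = number of residue sequences with total mass m and score s.
    hist = [Counter() for _ in range(target_mass + 1)]
    hist[0][0] = 1  # the empty sequence: mass 0, score 0
    for mass in range(target_mass + 1):
        if not hist[mass]:
            continue  # no sequence reaches this mass
        for res_mass in RESIDUE_MASS.values():
            new_mass = mass + res_mass
            if new_mass > target_mass:
                continue
            gain = peak_score.get(new_mass, 0)
            # Extend every sequence ending at `mass` by one residue.
            for score, count in hist[mass].items():
                hist[new_mass][score + gain] += count
    return hist[target_mass]


# Mass 128 is reached only by "GA" and "AG" in this alphabet.
# "GA" has prefix mass 57 (weight 3); "AG" has prefix mass 71 (weight 1).
print(score_histogram(128, {57: 3, 71: 1}))
```

The work grows with the target mass, the alphabet size and the number of distinct scores, not with the number of peptides, because sequences sharing the same (mass, score) pair are counted together. Adding a further index to the table, for example peptide length, corresponds to extending the dimension of the internal array.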

The P-value of each candidate peptide from the 50MB random database is inferred from the de novo score histogram of all possible peptides.
Figure: ISB, centroid data (RAId_DbS strategy).
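Inferring a P-value from a score histogram amounts to taking the fraction of all possible peptides that score at least as well as the candidate. A minimal sketch, assuming the histogram is available as a mapping from score to peptide count (the function name and the example counts are made up for illustration):

```python
def p_value_from_histogram(histogram, score):
    """Fraction of all scored peptides with score >= the given score."""
    total = sum(histogram.values())
    if total == 0:
        raise ValueError("histogram is empty")
    tail = sum(count for s, count in histogram.items() if s >= score)
    return tail / total


# Hypothetical histogram over 1000 peptides; 10 of them score 2 or higher.
example = {0: 900, 1: 90, 2: 9, 3: 1}
print(p_value_from_histogram(example, 2))  # 10 / 1000
```

Since the histogram is built per spectrum, the resulting P-value is spectrum-specific by construction.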

Combining search results
For a given spectrum σ, search a database using methods M1 and M2, which return hit lists L1(σ) and L2(σ), respectively, along with database P-values Pdb.

L1(σ)
Peptide    Pdb
SAMPLER    1.4e-4
TVPMRQK    4.6e-2
HVGTMHK    0.13
…

L2(σ)
Peptide    Pdb
GAMHLER    3.4e-6
TVPMRQK    1.6e-3
VGTMGSK    0.06
…

L1(σ) ∪ L2(σ)
Peptide    Pdb (M1)    Pdb (M2)
GAMHLER    1.0         3.4e-6
SAMPLER    1.4e-4      1.0
TVPMRQK    4.6e-2      1.6e-3
VGTMGSK    1.0         0.06
HVGTMHK    0.13        1.0
…

A peptide not present in a report list is assigned a database P-value of 1.
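The union step can be sketched as below, using the P-values from the example lists. The function names are hypothetical. The `combined_p` helper goes beyond what this slide states: it uses the standard result that the product of two independent, uniformly distributed P-values falls at or below τ with probability τ(1 − ln τ), and it is included only as one possible way to turn each pair into a single value, assuming the two methods are independent.

```python
import math


def merge_hit_lists(hits_1, hits_2, missing_p=1.0):
    """Union of two hit lists; a peptide absent from a list gets P = missing_p."""
    peptides = list(hits_1) + [pep for pep in hits_2 if pep not in hits_1]
    return {
        pep: (hits_1.get(pep, missing_p), hits_2.get(pep, missing_p))
        for pep in peptides
    }


def combined_p(p1, p2):
    """P-value of the product p1*p2, ASSUMING independent uniform P-values."""
    tau = p1 * p2
    if tau <= 0.0:
        return 0.0
    return tau * (1.0 - math.log(tau))


l1 = {"SAMPLER": 1.4e-4, "TVPMRQK": 4.6e-2, "HVGTMHK": 0.13}
l2 = {"GAMHLER": 3.4e-6, "TVPMRQK": 1.6e-3, "VGTMGSK": 0.06}

merged = merge_hit_lists(l1, l2)
for peptide, (p_m1, p_m2) in merged.items():
    print(peptide, p_m1, p_m2, combined_p(p_m1, p_m2))
```

Assigning P = 1 to a missing peptide means that method contributes no evidence for it, so a peptide reported by both methods, such as TVPMRQK, can overtake one reported by a single method.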

Remarks and Acknowledgement
• It is anticipated that combining search methods that are orthogonal to each other might be most advantageous. It is easy to check the correlation between the information utilized by various scoring methods.
• RAId_denovo can be accessed from http://www.ncbi.nlm.nih.gov/CBBresearch/qmbp/ (a standalone version will be available for download this summer).
• We thank the administrative group of the Biowulf computers for constant technical support, which considerably helped our computational progress in improving the peptide identification statistics over the past few years.
• We thank Dr. R.-F. Shen for providing various peptide MS/MS data.
