Statistical significance for peptide identification by tandem mass spectrometry
Statistical Significance for Peptide Identification by Tandem Mass Spectrometry. Nathan Edwards, Center for Bioinformatics and Computational Biology, University of Maryland, College Park. High Quality Peptide Identification: E-value < 10^-8.


Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

Nathan Edwards

Center for Bioinformatics and Computational Biology

University of Maryland, College Park


High Quality Peptide Identification: E-value < 10^-8


Moderate Quality Peptide Identification: E-value < 10^-3


Peptide Identification

  • Peptide fragmentation by CID is poorly understood

  • MS/MS spectra represent incomplete information about amino-acid sequence

    • I/L, K/Q, GG/N, …

  • Correct identifications don’t come with a certificate!


Peptide Identification

  • High-throughput workflows demand we analyze all spectra, all the time.

  • Spectra may not contain enough information to be interpreted correctly

    • …bad static on a cell phone

  • Peptides may not match our assumptions

    • …it’s all Greek to me

  • “Don’t know” is an acceptable answer!


Peptide Identification

  • Rank the best peptide identifications

  • Is the top ranked peptide correct?


Peptide Identification

  • Incorrect peptide has best score

    • Correct peptide is missing?

    • Potential for incorrect conclusion

    • What score ensures no incorrect peptides?

  • Correct peptide has weak score

    • Insufficient fragmentation, poor score

    • Potential for weakened conclusion

    • What score ensures we find all correct peptides?


Statistical Significance

  • Can’t prove particular identifications are right or wrong...

    • ...need to know fragmentation in advance!

  • A minimal standard for identification scores...

    • ...better than guessing.

    • p-value, E-value, statistical significance


Pin the tail on the donkey…


Probability Concepts

Throwing darts

  • One at a time

  • Blindfolded

  • Uniform distribution?

  • Independent?

  • Identically distributed?

Pr [ Dart hits 20 ] = 0.05


Probability Concepts

Throwing darts

  • One at a time

  • Blindfolded

  • Three darts

    Pr [Hitting 20 3 times]

    = 0.05 * 0.05 * 0.05

    Pr [Hit 20 at least twice]

    = 0.007125 + 0.000125


Probability Concepts

Throwing darts

  • One at a time

  • Blindfolded

  • 100 darts

    Pr [Hitting 20 3 times]

    = 0.139575

    Pr [Hit 20 at least twice]

    = 0.9629188
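
The dart probabilities quoted for three and for 100 darts follow from the binomial distribution. A minimal sketch, assuming independent throws that each hit 20 with probability 0.05; the helper names `binom_pmf` and `binom_tail` are illustrative and do not come from the lecture:

```python
from math import comb

def binom_pmf(k, n, p):
    """Pr[exactly k hits in n independent throws, each hitting with probability p]."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_tail(k, n, p):
    """Pr[at least k hits in n independent throws]."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))

p = 0.05  # Pr[dart hits 20], as stated on the slides

# Three darts
print(binom_pmf(3, 3, p))     # all three darts hit 20
print(binom_tail(2, 3, p))    # at least two darts hit 20

# 100 darts
print(binom_pmf(3, 100, p))   # exactly three darts hit 20
print(binom_tail(2, 100, p))  # at least two darts hit 20
```

With more darts, hitting 20 a few times stops being surprising, which is why the number of throws has to enter any significance calculation.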


Match Score

  • Dartboard represents the mass range of the spectrum

  • Peaks of a spectrum are “slices”

    • Width of slice corresponds to mass tolerance

  • Darts represent

    • random masses

      • masses of fragments of a random peptide

      • masses of peptides of a random protein

      • masses of biomarkers from a random class

    • How many darts do we get to throw?


Match Score

[Figure: example spectrum, % Intensity (0 to 100) against m/z (250 to 1000), with the masses 270, 330, 550, 580, 755 and 870 marked]

What is the probability that we match at least 5 peaks?


Match Score

  • Pr [ Match ≥ s peaks ]

    = Binomial( p , n )

    ≈ Poisson( p n ), for small p and large n

    p is prob. of random mass / peak match,

    n is number of darts (fragments in our answer)
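
The quality of the Poisson approximation can be checked numerically. A sketch only: the values of p, n and s below are chosen for illustration and are not taken from the lecture.

```python
from math import comb, exp, factorial

def binom_tail(s, n, p):
    """Pr[at least s matches] under Binomial(p, n)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(s, n + 1))

def poisson_tail(s, lam):
    """Pr[at least s matches] under Poisson(lam), computed as 1 - Pr[fewer than s]."""
    return 1.0 - sum(exp(-lam) * lam**k / factorial(k) for k in range(s))

p = 0.02   # assumed chance that one random mass matches some peak
n = 50     # assumed number of darts (fragments in our answer)
s = 5      # number of matched peaks

exact = binom_tail(s, n, p)
approx = poisson_tail(s, p * n)
print(exact, approx)
```

For these values the two tails agree to within about 0.001, and the agreement improves as p shrinks and n grows with p n held fixed.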


Match Score

Theoretical distribution

  • Used by OMSSA

  • Proposed, in various forms, by many.

  • Probability of random mass / peak match

    • IID (independent, identically distributed)

    • Based on match tolerance


Match Score

Theoretical distribution assumptions

  • Each dart is independent

    • Peaks are not “related”

  • Each dart is identically distributed

    • Chance of random mass / peak match is the same for all peaks


Tournament Size

[Figure: number of 20’s hit with 100 darts, shown for tournaments of 100, 1000, 10000 and 100000 people]


Number of Trials

  • Tournament size == number of trials

    • Number of peptides tried

    • Related to sequence database size

  • Probability that a random match score is ≥ s

    • 1 – Pr [ all match scores < s ]

    • 1 – Pr [ match score < s ]^Trials (*)

    • Assumes IID!

  • Expect value

    • E = Trials * Pr [ match ≥ s ]

    • Corresponds to Bonferroni bound on (*)
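
The relationship between (*) and the expect value can be sketched as follows. The single-trial probability `p_single` is a hypothetical number chosen for illustration, and both function names are made up for this example:

```python
def prob_any_random_match(p_single, trials):
    """(*): Pr[at least one of `trials` IID random match scores is >= s]."""
    return 1.0 - (1.0 - p_single) ** trials

def e_value(p_single, trials):
    """Expect value: expected number of random match scores >= s."""
    return trials * p_single

p_single = 1e-6  # hypothetical Pr[match score >= s] for one random peptide
for trials in (100, 10_000, 1_000_000):
    print(trials, prob_any_random_match(p_single, trials), e_value(p_single, trials))
```

The expect value is never smaller than (*), and the two nearly coincide while Trials * Pr [ match ≥ s ] is well below 1. The same score therefore becomes less significant as the sequence database grows.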


Better Dart Throwers


Better Random Models

  • Comparison with completely random model isn’t really fair

  • Match scores for real spectra with real peptides obey rules

  • Even incorrect peptides match with non-random structure!


Better Random Models

  • Want to generate random fragment masses (darts) that behave more like the real thing:

    • Some fragments are more likely than others

    • Some fragments depend on others

  • Theoretical models can only incorporate this structure to a limited extent.


Better Random Models

  • Generate random peptides

    • Real looking fragment masses

    • No theoretical model!

    • Must use empirical distribution

    • Usually require they have the correct precursor mass

  • Score function can model anything we like!
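
One way to build such an empirical distribution is sketched below. The assumptions are loud ones: random peptides are made by shuffling the candidate's own residues (which keeps the precursor mass fixed), the score is a simple count of matched b- and y-ion masses, and the tolerance and function names are illustrative. None of this is claimed to be the procedure used by any particular search engine.

```python
import random

# Monoisotopic amino-acid residue masses (Da)
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON, WATER = 1.00728, 18.01056

def fragment_masses(peptide):
    """Singly charged b- and y-ion m/z values of a peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    frags = []
    for i in range(1, len(masses)):
        frags.append(sum(masses[:i]) + PROTON)          # b ion
        frags.append(sum(masses[i:]) + WATER + PROTON)  # y ion
    return frags

def match_score(peptide, peaks, tol=0.5):
    """Number of theoretical fragments within tol of some observed peak."""
    return sum(any(abs(f - pk) <= tol for pk in peaks)
               for f in fragment_masses(peptide))

def empirical_p_value(peptide, peaks, n_random=1000, tol=0.5, seed=0):
    """Fraction of shuffled peptides scoring at least as well as the candidate."""
    rng = random.Random(seed)
    observed = match_score(peptide, peaks, tol)
    residues = list(peptide)
    hits = 0
    for _ in range(n_random):
        rng.shuffle(residues)  # same composition, hence same precursor mass
        if match_score("".join(residues), peaks, tol) >= observed:
            hits += 1
    return (hits + 1) / (n_random + 1)  # pseudo-count avoids reporting exactly zero

# Idealized spectrum: every fragment of the (hypothetical) true peptide is observed.
peaks = fragment_masses("PEPTIDEK")
print(match_score("PEPTIDEK", peaks), empirical_p_value("PEPTIDEK", peaks))
```

Because the distribution is built by sampling, `match_score` could be swapped for any other score function without changing the rest of the procedure.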


Better Random Models

Fenyo & Beavis, Anal. Chem., 2003


Better Random Models

  • Truly random peptides don’t look much like real peptides

  • Just use peptides from the sequence database!

  • Caveats:

    • Correct peptide (non-random) may be included

    • Peptides are not independent

  • Reverse sequence avoids only the first problem


Extrapolating from the Empirical Distribution

  • Often, the empirical shape is consistent with a theoretical model

Geer et al., J. Proteome Research, 2004

Fenyo & Beavis, Anal. Chem., 2003


False Positive Rate Estimation

  • Each spectrum is a chance to be right, wrong, or inconclusive.

    • How many decisions are wrong?

  • Given identification criteria:

    • SEQUEST Xcorr, E-value, Score, etc., plus...

    • ...threshold

  • Use “decoy” sequences

    • random, reverse, cross-species

    • Identifications must be incorrect!


False Positive Rate Estimation

  • # FP in real search = # hits in decoy search

    • Need same size database, or rate conversion

  • FP Rate: # decoy hits / # real hits

  • FP Rate: 2 x # decoy hits / (# real hits + # decoy hits)
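
Both estimates can be sketched in a few lines. This assumes that a higher score is better and that the decoy database has the same size as the real one; the scores are made-up numbers. Reading the second formula as the one for a search in which real and decoy sequences compete in a single database is also an assumption, since the slide does not say so.

```python
def fp_rate_separate(real_scores, decoy_scores, threshold):
    """FP rate = # decoy hits / # real hits at the given score threshold."""
    real_hits = sum(s >= threshold for s in real_scores)
    decoy_hits = sum(s >= threshold for s in decoy_scores)
    return decoy_hits / real_hits if real_hits else 0.0

def fp_rate_combined(real_scores, decoy_scores, threshold):
    """FP rate = 2 x # decoy hits / (# real hits + # decoy hits)."""
    real_hits = sum(s >= threshold for s in real_scores)
    decoy_hits = sum(s >= threshold for s in decoy_scores)
    total = real_hits + decoy_hits
    return 2 * decoy_hits / total if total else 0.0

# Hypothetical best-hit scores, one per spectrum
real_scores = [5.1, 4.2, 3.9, 3.0, 2.5, 1.0]
decoy_scores = [3.1, 2.0, 1.5, 0.5, 0.4, 0.2]

print(fp_rate_separate(real_scores, decoy_scores, threshold=3.0))  # 1 decoy hit / 4 real hits
print(fp_rate_combined(real_scores, decoy_scores, threshold=3.0))  # 2 * 1 / (4 + 1)
```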


False Positive Rate Estimation

  • A form of statistical significance

    • In “theory”, E-value and a FP rate are the same.

  • Search engine independent

    • Easy to implement

  • Assumes a single threshold for all spectra

    • Spectrum/Peptide Identification scores are not IID!...

    • ...but E-values, in principle, are.


Peptide Prophet

  • From the Institute for Systems Biology

    • Keller et al., Anal. Chem. 2002

  • Re-analysis of SEQUEST results

  • Spectra are trials

  • Assumes that many of the spectra are not correctly identified


Peptide Prophet

Keller et al., Anal. Chem. 2002

Distribution of spectral scores in the results


Peptide Prophet

  • Assumes a bimodal distribution of scores, with a particular shape

  • Ignores database size

    • …but it is included implicitly

  • Like empirical distribution for peptide sampling, can be applied to any score function

    • Can be applied to any search engine’s results
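
The bimodal idea can be illustrated with a two-component mixture fitted by expectation-maximization. This is a deliberately simplified sketch, not the Peptide Prophet model: it assumes both components are Gaussian, uses simulated scores, and all function names are hypothetical.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def posterior_correct(x, weight, mu, sigma):
    """Pr[score x came from the higher-scoring ('correct') component]."""
    d0 = weight[0] * normal_pdf(x, mu[0], sigma[0])
    d1 = weight[1] * normal_pdf(x, mu[1], sigma[1])
    return d1 / (d0 + d1) if d0 + d1 > 0 else 0.5

def fit_mixture(scores, n_iter=100):
    """EM fit of a two-Gaussian mixture; component 1 is the higher-scoring one."""
    lo, hi = min(scores), max(scores)
    spread = (hi - lo) or 1.0
    mu = [lo + 0.25 * spread, lo + 0.75 * spread]
    sigma = [spread / 4, spread / 4]
    weight = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of the 'correct' component for each score
        post = [posterior_correct(x, weight, mu, sigma) for x in scores]
        # M-step: re-estimate weight, mean and spread of each component
        for k, resp in enumerate(([1 - r for r in post], post)):
            total = sum(resp)
            if total == 0:
                continue
            weight[k] = total / len(scores)
            mu[k] = sum(r * x for r, x in zip(resp, scores)) / total
            var = sum(r * (x - mu[k]) ** 2 for r, x in zip(resp, scores)) / total
            sigma[k] = max(math.sqrt(var), 1e-3)
    return weight, mu, sigma

# Simulated scores: 700 incorrect identifications, 300 correct ones
rng = random.Random(1)
scores = [rng.gauss(0, 1) for _ in range(700)] + [rng.gauss(5, 1) for _ in range(300)]
weight, mu, sigma = fit_mixture(scores)
print(mu, posterior_correct(6.0, weight, mu, sigma))
```

The fitted weight of the lower component estimates the fraction of incorrect identifications directly from the results, which is one way to see how database size enters only implicitly.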


Peptide Prophet

  • Caveats

    • Are spectra scores sampled from the same distribution?

    • Are there enough correct identifications for a second peak?

    • Are spectra independent observations?

    • Are distributions appropriately shaped?

  • Huge improvement over raw SEQUEST results


Peptides to Proteins

Nesvizhskii et al., Anal. Chem. 2003


Peptides to Proteins

  • A peptide sequence may occur in many different protein sequences

    • Variants, paralogues, protein families

  • Separation, digestion and ionization are not well understood

  • Proteins in sequence database are extremely non-random, and very dependent


Publication Guidelines

  • Computational parameters

    • Spectral processing

    • Sequence database

    • Search program

    • Statistical analysis

  • Number of peptides per protein

    • Each peptide sequence counts once!

    • Multiple forms of the same peptide count once!
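
The counting rule can be sketched as below. The record layout (protein, sequence, modification, charge) is a hypothetical one chosen for this example, and treating modified or differently charged versions as the "multiple forms" of a peptide is an interpretation of the guideline, not a quotation from it.

```python
from collections import defaultdict

def distinct_peptides_per_protein(identifications):
    """Count distinct peptide sequences per protein.

    identifications: iterable of (protein, sequence, modification, charge) tuples.
    """
    peptides = defaultdict(set)
    for protein, sequence, _modification, _charge in identifications:
        # Modified or differently charged forms collapse onto the bare sequence.
        peptides[protein].add(sequence)
    return {protein: len(seqs) for protein, seqs in peptides.items()}

# Hypothetical identifications
ids = [
    ("P1", "LVNELTEFAK", None, 2),
    ("P1", "LVNELTEFAK", None, 3),         # same peptide, different charge
    ("P1", "YLYEIAR", None, 2),
    ("P2", "AEFVEVTK", "oxidation", 2),
    ("P2", "AEFVEVTK", None, 2),           # same peptide, unmodified form
]
print(distinct_peptides_per_protein(ids))
```

Here P2 ends up with a single peptide, so under the guidelines it would need explicit justification as a single-peptide protein.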


Publication Guidelines

  • Single-peptide proteins must be explicitly justified by

    • Peptide sequence

    • N and C terminal amino-acids

    • Precursor mass and charge

    • Peptide Scores

    • Multiple forms of the peptide counted once!

  • Biological conclusions based on single-peptide proteins must show the spectrum


Publication Guidelines

  • More stringent requirements for PMF data analysis

    • Similar to that for tandem mass spectra

  • Management of protein redundancy

    • Peptides identified from a different species?

  • Spectra submission encouraged


Summary

  • Could guessing be as effective as a search?

  • More guesses improve the best guess

  • Better guessers help us be more discriminating

  • Going from peptides to proteins is not as simple as it seems

  • Publication guidelines reflect sound statistical principles.

