### Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

Nathan Edwards

Center for Bioinformatics and Computational Biology

University of Maryland, College Park

Peptide Identification

- Peptide fragmentation by CID is poorly understood
- MS/MS spectra represent incomplete information about amino-acid sequence
- I/L, K/Q, GG/N, …
- Correct identifications don’t come with a certificate!

Peptide Identification

- High-throughput workflows demand we analyze all spectra, all the time.
- Spectra may not contain enough information to be interpreted correctly
- …bad static on a cell phone
- Peptides may not match our assumptions
- …it’s all Greek to me
- “Don’t know” is an acceptable answer!

Peptide Identification

- Rank the best peptide identifications
- Is the top ranked peptide correct?

Peptide Identification

- Incorrect peptide has best score
- Correct peptide is missing?
- Potential for incorrect conclusion
- What score ensures no incorrect peptides?
- Correct peptide has weak score
- Insufficient fragmentation, poor score
- Potential for weakened conclusion
- What score ensures we find all correct peptides?

Statistical Significance

- Can’t prove particular identifications are right or wrong...
- ...need to know fragmentation in advance!
- A minimal standard for identification scores...
- ...better than guessing.
- p-value, E-value, statistical significance

Probability Concepts

Throwing darts

- One at a time
- Blindfolded
- Uniform distribution?
- Independent?
- Identically distributed?

Pr [ Dart hits 20 ] = 0.05

Probability Concepts

Throwing darts

- One at a time
- Blindfolded
- Three darts

Pr [ Hit 20 all three times ] = 0.05 * 0.05 * 0.05 = 0.000125

Pr [ Hit 20 at least twice ] = 0.007125 + 0.000125 = 0.00725

Probability Concepts

Throwing darts

- One at a time
- Blindfolded
- 100 darts

Pr [ Hit 20 exactly three times ] = 0.139575

Pr [ Hit 20 at least twice ] = 0.9629188
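The dart probabilities on these two slides can be checked with a short binomial calculation. This is a minimal sketch, assuming each throw is independent and hits 20 with probability 0.05; the function names are illustrative and not from the lecture.

```python
from math import comb

def binom_pmf(k, n, p):
    """Pr[exactly k hits in n independent throws, each hitting with probability p]."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def binom_tail(k, n, p):
    """Pr[at least k hits in n independent throws]."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))

p = 0.05  # Pr[dart hits 20], as on the slide

three_of_three = binom_pmf(3, 3, p)           # three darts, all hit 20
at_least_two_of_three = binom_tail(2, 3, p)   # three darts, at least two hit 20
exactly_three_of_100 = binom_pmf(3, 100, p)   # 100 darts, exactly three hit 20
at_least_two_of_100 = binom_tail(2, 100, p)   # 100 darts, at least two hit 20

print(three_of_three, at_least_two_of_three)
print(exactly_three_of_100, at_least_two_of_100)
```

With 100 darts, hitting 20 at least twice is nearly certain, which is the point of the comparison: more throws make an apparently impressive outcome easy to get by chance.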

Match Score

- Dartboard represents the mass range of the spectrum
- Peaks of a spectrum are “slices”
- Width of slice corresponds to mass tolerance
- Darts represent
- random masses
- masses of fragments of a random peptide
- masses of peptides of a random protein
- masses of biomarkers from a random class
- How many darts do we get to throw?

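[Figure: example spectrum, % Intensity vs. m/z, with m/z axis ticks at 250, 500, 750, and 1000]

Match Score

What is the probability that we match at least 5 peaks?

[Figure: spectrum with the values 270, 330, 870, 550, 755, and 580 marked]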

Match Score

- Pr [ Match ≥ s peaks ] is the upper tail (≥ s) of Binomial( p , n ), which is ≈ the upper tail of Poisson( p n ) for small p and large n
- p is the probability of a random mass / peak match
- n is the number of darts (fragments in our answer)
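To see how close the Poisson approximation is, the two tail probabilities can be computed side by side. This is a rough sketch: the values of p, n, and s below are made up for illustration and do not come from the lecture or from any search engine.

```python
from math import comb, exp, factorial

def binomial_tail(s, n, p):
    """Pr[at least s of n IID darts land on a peak], each with probability p."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(s, n + 1))

def poisson_tail(s, lam):
    """Pr[Poisson(lam) >= s], computed as 1 minus the lower tail."""
    return 1.0 - sum(exp(-lam) * lam**k / factorial(k) for k in range(s))

# Hypothetical values: chance of a random mass / peak match, number of darts,
# and the number of matched peaks we want to assess.
p, n, s = 0.02, 100, 5

exact = binomial_tail(s, n, p)
approx = poisson_tail(s, p * n)
print(f"binomial tail = {exact:.4f}, Poisson tail = {approx:.4f}")
```

The approximation degrades as p grows or n shrinks, so the exact binomial tail is the safer choice when n is only a handful of fragments.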

Match Score

Theoretical distribution

- Used by OMSSA
- Proposed, in various forms, by many.
- Probability of random mass / peak match
- IID (independent, identically distributed)
- Based on match tolerance

Match Score

Theoretical distribution assumptions

- Each dart is independent
- Peaks are not “related”
- Each dart is identically distributed
- Chance of random mass / peak match is the same for all peaks

Number of Trials

- Tournament size == number of trials
- Number of peptides tried
- Related to sequence database size
- Probability that a random match score is ≥ s
- 1 – Pr [ all match scores < s ]
- 1 – Pr [ match score < s ]^Trials (*)
- Assumes IID!
- Expect value (E-value)
- E = Trials * Pr [ match ≥ s ]
- Corresponds to Bonferroni bound on (*)
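The relationship between (*) and the expect value can be checked numerically. This is a small sketch under the slide's IID assumption; the single-trial probability and the number of trials are invented for illustration.

```python
def prob_any_random_match(p_single, trials):
    """(*) Pr[at least one of `trials` IID random peptides scores >= s]."""
    return 1.0 - (1.0 - p_single) ** trials

def expect_value(p_single, trials):
    """E = Trials * Pr[match >= s]: expected number of random peptides scoring >= s."""
    return trials * p_single

# Hypothetical numbers: per-peptide chance of a random score >= s, and the
# number of candidate peptides tried for one spectrum.
p_single = 1e-6
trials = 50_000

p_any = prob_any_random_match(p_single, trials)
e_value = expect_value(p_single, trials)
print(f"(*) = {p_any:.6f}, E = {e_value:.6f}")
```

E is never smaller than (*), and the two nearly coincide when E is small, which is why a small E-value can be read approximately as a probability.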

Better Random Models

- Comparison with completely random model isn’t really fair
- Match scores for real spectra with real peptides obey rules
- Even incorrect peptides match with non-random structure!

Better Random Models

- Want to generate random fragment masses (darts) that behave more like the real thing:
- Some fragments are more likely than others
- Some fragments depend on others
- Theoretical models can only incorporate this structure to a limited extent.

Better Random Models

- Generate random peptides
- Real looking fragment masses
- No theoretical model!
- Must use empirical distribution
- Usually require they have the correct precursor mass
- Score function can model anything we like!
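An empirical distribution of this kind can be sketched by scoring shuffled versions of a candidate peptide, since shuffling keeps the residue composition and therefore the precursor mass. Everything here is an illustrative assumption: the score is a bare count of matched prefix (b-ion-like) masses, the residue masses are approximate monoisotopic values for a handful of amino acids, and the "spectrum" is idealized.

```python
import random

# Approximate monoisotopic residue masses (Da) for the residues used below.
RESIDUE_MASS = {"P": 97.05276, "E": 129.04259, "T": 101.04768,
                "I": 113.08406, "D": 115.02694, "K": 128.09496}
PROTON = 1.00728

def prefix_masses(peptide):
    """Singly charged prefix (b-ion-like) masses; a deliberately simplified fragment model."""
    masses, total = [], PROTON
    for aa in peptide[:-1]:
        total += RESIDUE_MASS[aa]
        masses.append(total)
    return masses

def match_score(peptide, peaks, tol=0.5):
    """Toy score: number of prefix masses within `tol` Da of some peak."""
    return sum(any(abs(m - pk) <= tol for pk in peaks) for m in prefix_masses(peptide))

def empirical_p_value(peptide, peaks, n_random=1000, tol=0.5, seed=0):
    """Fraction of shuffled peptides scoring at least as well as `peptide`."""
    rng = random.Random(seed)
    observed = match_score(peptide, peaks, tol)
    residues = list(peptide)
    at_least = 0
    for _ in range(n_random):
        rng.shuffle(residues)  # same composition, hence same precursor mass
        if match_score("".join(residues), peaks, tol) >= observed:
            at_least += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (at_least + 1) / (n_random + 1)

peptide = "PEPTIDEK"            # hypothetical candidate peptide
peaks = prefix_masses(peptide)  # idealized spectrum containing all of its prefix ions
observed, p_value = empirical_p_value(peptide, peaks)
print(observed, p_value)
```

The p-value cannot go below 1 / (n_random + 1), so very small p-values need either many more random peptides or extrapolation from a fitted distribution.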

Better Random Models

Fenyo & Beavis, Anal. Chem., 2003

Better Random Models

- Truly random peptides don’t look much like real peptides
- Just use peptides from the sequence database!
- Caveats:
- Correct peptide (non-random) may be included
- Peptides are not independent
- Reverse sequence avoids only the first problem

Extrapolating from the Empirical Distribution

- Often, the empirical shape is consistent with a theoretical model

Geer et al., J. Proteome Research, 2004

Fenyo & Beavis, Anal. Chem., 2003

False Positive Rate Estimation

- Each spectrum is a chance to be right, wrong, or inconclusive.
- How many decisions are wrong?
- Given identification criteria:
- SEQUEST Xcorr, E-value, Score, etc., plus...
- ...threshold
- Use “decoy” sequences
- random, reverse, cross-species
- Identifications must be incorrect!

False Positive Rate Estimation

- # FP in real search = # hits in decoy search
- Need same size database, or rate conversion
- FP Rate: # decoy hits / # real hits
- FP Rate: 2 x # decoy hits / (# real hits + # decoy hits)
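Both estimates are simple ratios of hit counts at a fixed score threshold. In this minimal sketch, the scores and the threshold are invented, and the function names are illustrative. My reading is that the first form suits separate real and decoy searches of equal size, and the second suits a single search of the combined database, but the slide does not say so explicitly.

```python
def count_hits(scores, threshold):
    """Number of identifications whose score meets the threshold."""
    return sum(score >= threshold for score in scores)

def fp_rate_separate(decoy_hits, real_hits):
    """FP rate = # decoy hits / # real hits."""
    return decoy_hits / real_hits if real_hits else 0.0

def fp_rate_combined(decoy_hits, real_hits):
    """FP rate = 2 x # decoy hits / (# real hits + # decoy hits)."""
    total = real_hits + decoy_hits
    return 2 * decoy_hits / total if total else 0.0

# Hypothetical best-hit scores per spectrum from a real and a decoy search.
real_scores = [4.1, 3.8, 3.5, 2.9, 2.7, 2.2, 1.9, 1.4, 1.1, 0.8]
decoy_scores = [2.6, 1.8, 1.5, 1.3, 1.2, 1.0, 0.9, 0.7, 0.6, 0.4]
threshold = 2.5

real_hits = count_hits(real_scores, threshold)
decoy_hits = count_hits(decoy_scores, threshold)
print(real_hits, decoy_hits, fp_rate_separate(decoy_hits, real_hits))
```

Raising the threshold lowers the estimated FP rate at the cost of fewer accepted identifications, so in practice the threshold is chosen to hit a target rate.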

False Positive Rate Estimation

- A form of statistical significance
- In “theory”, E-value and a FP rate are the same.
- Search engine independent
- Easy to implement
- Assumes a single threshold for all spectra
- Spectrum/Peptide Identification scores are not iid!...
- ...but E-values, in principle, are.

Peptide Prophet

- From the Institute for Systems Biology
- Keller et al., Anal. Chem. 2002
- Re-analysis of SEQUEST results
- Spectra are trials
- Assumes that many of the spectra are not correctly identified

Peptide Prophet

- Assumes a bimodal distribution of scores, with a particular shape
- Ignores database size
- …but it is included implicitly
- Like empirical distribution for peptide sampling, can be applied to any score function
- Can be applied to any search engine’s results
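The bimodal idea can be illustrated with a two-component mixture fitted by expectation-maximization. This sketch is not Peptide Prophet's model: the slide does not name the "particular shape", so two Gaussians are assumed here purely for illustration, and the scores are simulated.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def prob_correct(x, weight, mu, sigma):
    """Posterior probability that score x belongs to the high-scoring component."""
    low = weight[0] * normal_pdf(x, mu[0], sigma[0])
    high = weight[1] * normal_pdf(x, mu[1], sigma[1])
    return high / (low + high)

def fit_two_gaussians(scores, n_iter=100):
    """EM for a 1-D mixture of two Gaussians; index 1 is the higher-scoring mode."""
    lo, hi = min(scores), max(scores)
    mu = [lo + 0.25 * (hi - lo), lo + 0.75 * (hi - lo)]
    sigma = [(hi - lo) / 4.0, (hi - lo) / 4.0]
    weight = [0.5, 0.5]
    n = len(scores)
    for _ in range(n_iter):
        # E-step: responsibility of the high-scoring component for each score.
        resp = [prob_correct(x, weight, mu, sigma) for x in scores]
        n_high = sum(resp)
        n_low = n - n_high
        # M-step: re-estimate means, standard deviations, and mixing weights.
        mu = [sum((1 - r) * x for r, x in zip(resp, scores)) / n_low,
              sum(r * x for r, x in zip(resp, scores)) / n_high]
        var_low = sum((1 - r) * (x - mu[0]) ** 2 for r, x in zip(resp, scores)) / n_low
        var_high = sum(r * (x - mu[1]) ** 2 for r, x in zip(resp, scores)) / n_high
        sigma = [max(math.sqrt(var_low), 1e-6), max(math.sqrt(var_high), 1e-6)]
        weight = [n_low / n, n_high / n]
    return weight, mu, sigma

# Simulated scores: 700 "incorrect" and 300 "correct" identifications (made-up parameters).
rng = random.Random(1)
scores = ([rng.gauss(1.0, 0.5) for _ in range(700)]
          + [rng.gauss(4.0, 0.8) for _ in range(300)])

weight, mu, sigma = fit_two_gaussians(scores)
print(weight, mu, sigma, prob_correct(3.0, weight, mu, sigma))
```

The fit works here because the simulated modes are well separated and the second mode is well populated. With too few correct identifications there is little to anchor the second peak.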

Peptide Prophet

- Caveats
- Are spectra scores sampled from the same distribution?
- Are there enough correct identifications for the second peak?
- Are spectra independent observations?
- Are distributions appropriately shaped?
- Huge improvement over raw SEQUEST results

Peptides to Proteins

Nesvizhskii et al., Anal. Chem. 2003

Peptides to Proteins

- A peptide sequence may occur in many different protein sequences
- Variants, paralogues, protein families
- Separation, digestion and ionization are not well understood
- Proteins in sequence database are extremely non-random, and very dependent

Publication Guidelines

- Computational parameters
- Spectral processing
- Sequence database
- Search program
- Statistical analysis
- Number of peptides per protein
- Each peptide sequence counts once!
- Multiple forms of the same peptide count once!
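The counting rule amounts to collecting distinct peptide sequences per protein, ignoring modification state and charge. A minimal sketch follows; the record layout, protein names, and peptides are hypothetical.

```python
from collections import defaultdict

def peptides_per_protein(identifications):
    """Count distinct peptide sequences per protein.

    Each record is (protein, sequence, modifications, charge). Different charge
    states or modified forms of the same sequence are counted only once.
    """
    sequences = defaultdict(set)
    for protein, sequence, _modifications, _charge in identifications:
        sequences[protein].add(sequence)
    return {protein: len(seqs) for protein, seqs in sequences.items()}

# Hypothetical identifications for illustration only.
identifications = [
    ("PROT_A", "LSSPATLNSR", "", 2),
    ("PROT_A", "LSSPATLNSR", "", 3),           # same peptide, different charge
    ("PROT_A", "MVNNGHSFNVEYDDSQDK", "Oxidation(M)", 2),
    ("PROT_A", "MVNNGHSFNVEYDDSQDK", "", 2),   # same peptide, unmodified form
    ("PROT_B", "GITWGEETLMEYLENPK", "", 2),
]

counts = peptides_per_protein(identifications)
print(counts)
```

Here PROT_B is a single-peptide protein, the case that needs explicit justification.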

Publication Guidelines

- Single-peptide proteins must be explicitly justified by
- Peptide sequence
- N and C terminal amino-acids
- Precursor mass and charge
- Peptide Scores
- Multiple forms of the peptide counted once!
- Biological conclusions based on single-peptide proteins must show the spectrum

Publication Guidelines

- More stringent requirements for PMF data analysis
- Similar to that for tandem mass spectra
- Management of protein redundancy
- Peptides identified from a different species?
- Spectra submission encouraged

Summary

- Could guessing be as effective as a search?
- More guesses improve the best guess
- Better guessers help us be more discriminating
- Peptides to proteins is not as simple as it seems
- Publication guidelines reflect sound statistical principles.
