

Peptide Identification Statistics: Pin the tail on the donkey?

US HUPO: Bioinformatics for Proteomics

Nathan Edwards – March 12, 2005

Peptide Identification
  • Peptide fragmentation by CID is poorly understood
  • MS/MS spectra represent incomplete information about amino-acid sequence
    • I/L, K/Q, GG/N, …
  • Correct identifications don’t come with a certificate!

Peptide Identification
  • High-throughput workflows demand we analyze all spectra, all the time.
  • Spectra may not contain enough information to be interpreted correctly
    • …bad static on a cell phone
  • Peptides may not match our assumptions
    • …it’s all Greek to me
  • “Don’t know” is an acceptable answer!

Peptide Identification

We can’t prove we are right…

…so can we prove we aren’t wrong?

NO!

The best we can do is to show our answer is better than guessing!

Better than guessing…
  • Better implies comparison
    • Score or measure of degree of success
  • Guessing implies randomness
    • Probability and statistics

Pin the tail on the donkey…

Probability Concepts

Throwing darts

  • One at a time
  • Blindfolded
  • Identically distributed?
  • Uniform distribution?
  • Mutually exclusive?
  • Independent?

Pr [ Dart hits x ] = 0.05

Probability Concepts

Throwing darts

  • One at a time
  • Blindfolded
  • Three darts

Pr [Hitting 20 all 3 times] = 0.05 * 0.05 * 0.05 = 0.000125

Pr [Hit 20 at least twice] = Pr [exactly twice] + Pr [all 3 times] = 0.007125 + 0.000125 = 0.00725
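
The three-dart numbers follow from the binomial distribution. The sketch below is only an illustration: it assumes the darts are independent and that each one hits 20 with probability 0.05, and the names `binom_pmf` and `binom_tail` are made up for this example.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k hits in n independent throws."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_tail(s: int, n: int, p: float) -> float:
    """Probability of at least s hits in n independent throws."""
    return sum(binom_pmf(k, n, p) for k in range(s, n + 1))


p_hit_20 = 0.05  # assumption: one sector out of 20, uniform blindfolded throw
print(binom_pmf(3, 3, p_hit_20))   # all three darts hit 20
print(binom_tail(2, 3, p_hit_20))  # at least two of the three darts hit 20
```

The same two functions cover the "evens" case by passing p = 0.5.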

Probability Concepts

Throwing darts

  • One at a time
  • Blindfolded
  • Three darts

Pr [Hitting evens 3 times] = Pr [Hitting 1-10 3 times] = 0.5 * 0.5 * 0.5 = 0.125

Pr [Evens at least twice] = 0.5

Probability Concepts

Throwing darts

  • One at a time
  • Blindfolded
  • 100 darts

Pr [Hitting 20 exactly 3 times] = 0.139575

Pr [Hit 20 at least twice] = 0.9629188
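
The 100-dart values can also be checked by playing the game many times instead of using a formula. This is a hedged sketch: it assumes independent throws with a 0.05 chance of hitting 20, and the helper names, the number of simulated games, and the seed are arbitrary choices made for illustration.

```python
import random


def simulate_hits(n_darts: int, p: float, rng: random.Random) -> int:
    """Count how many of n_darts independent throws land on the target."""
    return sum(rng.random() < p for _ in range(n_darts))


def estimate(n_darts: int = 100, p: float = 0.05, games: int = 20000, seed: int = 1):
    """Estimate Pr[exactly 3 hits] and Pr[at least 2 hits] from simulated games."""
    rng = random.Random(seed)
    counts = [simulate_hits(n_darts, p, rng) for _ in range(games)]
    exactly_three = sum(c == 3 for c in counts) / games
    at_least_two = sum(c >= 2 for c in counts) / games
    return exactly_three, at_least_two


exactly_three, at_least_two = estimate()
# The estimates should fall close to the exact values 0.139575 and 0.9629188.
print(exactly_three, at_least_two)
```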

Match Score
  • Dartboard is peaks in a spectrum
  • Each dart is a peptide fragment
  • Pr [ Match ≥ s peaks ] = upper tail of Binomial( p , n ) ≈ upper tail of Poisson( p n ), for small p and large n
    • p is the probability of a fragment / peak match
    • n is the number of fragments
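
To see how close the Poisson approximation is, the two tail probabilities can be compared directly. A minimal sketch, assuming IID fragment / peak matches; the values n = 100 and p = 0.05 are borrowed from the dart example and are not parameters of any real search engine.

```python
from math import comb, exp, factorial


def binomial_tail(s: int, n: int, p: float) -> float:
    """Pr[at least s matches] when each of n fragments matches with probability p."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(s, n + 1))


def poisson_tail(s: int, mean: float) -> float:
    """Pr[at least s events] for a Poisson count with the given mean."""
    return 1.0 - sum(exp(-mean) * mean**k / factorial(k) for k in range(s))


n, p = 100, 0.05  # illustrative values only
for s in (2, 5, 10, 15):
    print(s, binomial_tail(s, n, p), poisson_tail(s, n * p))
```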

Match Score

Theoretical distribution

  • Used by OMSSA
  • Proposed, in various forms, by many.
  • Probability of fragment / peak match
    • IID (independent, identically distributed)
    • Based on match tolerance
  • Can use fragments or peaks as darts!

Match Score

Theoretical distribution assumptions

  • Each dart is independent
    • Peaks are not “related”
  • Each dart is identically distributed
    • Chance of fragment / peak match is the same for all peaks and fragments

Tournament Size

[Figure: number of 20’s hit with 100 darts, in tournaments of 100, 1000, 10000, and 100000 people]

Number of Trials
  • Tournament size == number of trials
    • Number of peptides tried
    • Related to sequence database size
  • Probability that a random match score is ≥ s
    • 1 – Pr [ all match scores < s ]
    • 1 – Pr [ match score < s ]^Trials (*)
    • Assumes IID!
  • Expect value
    • E = Trials * Pr [ match ≥ s ]
    • Corresponds to Bonferroni bound on (*)
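
The relationship between (*) and the Expect value can be sketched numerically. The single-trial probability used below is a hypothetical number picked for illustration, and the function names are not from any search tool.

```python
def prob_any_random_match(p_single: float, trials: int) -> float:
    """Pr[at least one of `trials` IID random match scores is >= s], i.e. (*)."""
    return 1.0 - (1.0 - p_single) ** trials


def expect_value(p_single: float, trials: int) -> float:
    """Expected number of random match scores >= s; never smaller than (*)."""
    return trials * p_single


p_single = 1e-5  # hypothetical Pr[match >= s] for a single random peptide
for trials in (100, 1000, 10000, 100000):
    print(trials, prob_any_random_match(p_single, trials), expect_value(p_single, trials))
```

For small Expect values the two numbers nearly coincide; as the number of trials grows, the Expect value keeps rising while the probability saturates at 1.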

Better Dart Throwers

Better Random Models
  • Comparison with completely random model isn’t really fair
  • Match scores for real spectra with real peptides obey rules
  • Even incorrect peptides match with non-random structure!

Better Random Models
  • Want to generate random fragment masses (darts) that behave more like the real thing:
    • Some fragments are more likely than others
    • Some fragments depend on others
  • Theoretical models can only incorporate this structure to a limited extent.
  • Cannot model the properties of a particular peptide!
    • Must capture behavior of fragments in general

Better Random Models
  • Generate random peptides
    • Real looking fragment masses
    • No theoretical model!
    • Must use empirical distribution
    • Usually require they have the correct precursor mass
  • Score function can model anything we like!
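
One way to picture the empirical approach: shuffle a peptide to get random peptides with the same composition (and therefore the same precursor mass), score them, and ask how often a random score reaches the observed one. Everything in this sketch is an assumption made for illustration. In particular, `toy_score` is a stand-in for a real peptide-spectrum score, the example sequence is arbitrary, and the add-one correction is just one common estimator.

```python
import random
from typing import List


def shuffled_peptides(peptide: str, count: int, rng: random.Random) -> List[str]:
    """Random peptides with the same residues, hence the same precursor mass."""
    residues = list(peptide)
    shuffled = []
    for _ in range(count):
        rng.shuffle(residues)
        shuffled.append("".join(residues))
    return shuffled


def empirical_p_value(observed: float, null_scores: List[float]) -> float:
    """Fraction of random scores >= observed, with an add-one correction."""
    at_least = sum(score >= observed for score in null_scores)
    return (at_least + 1) / (len(null_scores) + 1)


def toy_score(candidate: str, reference: str) -> int:
    """Placeholder score: number of positions where the two sequences agree."""
    return sum(a == b for a, b in zip(candidate, reference))


rng = random.Random(0)
target = "LVNELTEFAK"  # arbitrary example sequence
random_peptides = shuffled_peptides(target, 1000, rng)
null_scores = [toy_score(pep, target) for pep in random_peptides]
print(empirical_p_value(toy_score(target, target), null_scores))
```

Because the p-value comes from counting, it can never be smaller than 1 / (number of random peptides + 1) without some form of extrapolation.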

Better Random Models

Fenyo & Beavis, Anal. Chem., 2003

Better Random Models
  • Truly random peptides don’t look much like real peptides
  • Just use peptides from the sequence database!
  • Caveats:
    • Correct peptide (non-random) may be included
    • Peptides are not independent
  • Reverse sequence avoids only the first problem
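
Building a reversed-sequence database is simple to sketch. The protein names and sequences below are hypothetical, and the `REV_` prefix is only a naming choice for this example.

```python
from typing import Dict


def reversed_database(proteins: Dict[str, str], prefix: str = "REV_") -> Dict[str, str]:
    """Return entries whose sequences are the originals read backwards."""
    return {prefix + name: sequence[::-1] for name, sequence in proteins.items()}


# Hypothetical two-entry database, for illustration only.
targets = {"protA": "MKWVTFISLLK", "protB": "MDDIAALVVDNGSGMCK"}
reversed_entries = reversed_database(targets)
print(reversed_entries)
```

Reversal keeps each protein's length and amino-acid composition, so the reversed entries remain as related to each other as the originals were; that is the second caveat, which reversal does not address.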

Extrapolating from the Empirical Distribution
  • Often, the empirical shape is consistent with a theoretical model

Fenyo & Beavis, Anal. Chem., 2003

Geer et al., J. Proteome Research, 2004
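
One common way to extrapolate beyond the observed random scores is to fit a straight line to the logarithm of the survival counts (how many random scores are at least s) and read off the expected count at a higher score. The sketch below illustrates that idea only; it does not claim to reproduce the procedure of the cited papers. It assumes distinct scores with a roughly exponential upper tail, and `top_fraction` is an arbitrary parameter.

```python
import math
from typing import List, Tuple


def fit_log_survival(null_scores: List[float], top_fraction: float = 0.5) -> Tuple[float, float]:
    """Least-squares line through log10(#scores >= s) versus s on the upper tail."""
    ordered = sorted(null_scores, reverse=True)
    keep = max(2, int(len(ordered) * top_fraction))
    xs = ordered[:keep]
    # With distinct scores, the rank of a score equals the number of scores >= it.
    ys = [math.log10(rank) for rank in range(1, keep + 1)]
    mean_x = sum(xs) / keep
    mean_y = sum(ys) / keep
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


def extrapolated_expect(score: float, slope: float, intercept: float) -> float:
    """Expected number of random scores >= score, read off the fitted line."""
    return 10 ** (slope * score + intercept)


# Synthetic null scores whose log-survival is exactly linear (slope -1, intercept 0).
synthetic = [-math.log10(rank) for rank in range(1, 9)]
slope, intercept = fit_log_survival(synthetic, top_fraction=1.0)
print(slope, intercept, extrapolated_expect(1.0, slope, intercept))
```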

Peptide Prophet
  • From the Institute for Systems Biology
    • Keller et al., Anal. Chem. 2002
  • Re-analysis of SEQUEST results
  • Spectra are trials (NOT peptides!)
  • Assumes that many of the spectra are not correctly identified

Peptide Prophet

Keller et al., Anal. Chem. 2002

Distribution of spectral scores in the results

Peptide Prophet
  • Assumes a bimodal distribution of scores, with a particular shape
  • Ignores database size
    • …but it is included implicitly
  • Like empirical distribution for peptide sampling, can be applied to any score function
    • Can be applied to any search engine’s results
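
With a bimodal model in hand, the probability that a given score belongs to the "correct" mode follows from Bayes' rule. This sketch is an illustration under loud assumptions: both modes are taken to be normal distributions purely for simplicity, all parameter values are invented, and the step of fitting the mixture to real scores is not shown.

```python
import math
from typing import Tuple


def normal_pdf(x: float, mean: float, sd: float) -> float:
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def prob_correct(
    score: float,
    frac_correct: float,
    correct: Tuple[float, float],
    incorrect: Tuple[float, float],
) -> float:
    """Posterior probability that a score came from the 'correct' mode."""
    pos = frac_correct * normal_pdf(score, *correct)
    neg = (1.0 - frac_correct) * normal_pdf(score, *incorrect)
    return pos / (pos + neg)


# Invented parameters: (mean, sd) of each mode and the fraction of correct spectra.
correct_mode, incorrect_mode, frac = (4.0, 1.0), (0.0, 1.0), 0.3
for s in (0.0, 2.0, 4.0):
    print(s, prob_correct(s, frac, correct_mode, incorrect_mode))
```

The fraction of correct spectra acts as a prior, so the same score earns a lower probability in a data set where few spectra are correctly identified.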

Peptide Prophet
  • Caveats
    • Are spectra scores sampled from the same distribution?
    • Are there enough correct identifications to form the second peak?
    • Are spectra independent observations?
    • Are distributions appropriately shaped?
  • Huge improvement over raw SEQUEST results

Peptides to Proteins

Nesvizhskii et al., Anal. Chem. 2003

Peptides to Proteins
  • A peptide sequence may occur in many different protein sequences
    • Variants, paralogues, protein families
  • Separation, digestion and ionization are not well understood
  • Proteins in sequence database are extremely non-random, and very dependent

Peptides to Proteins
  • Mascot
    • Protein score is sum of peptide scores
    • Assumes peptide identifications are independent!
  • SEQUEST
    • Keeps only one of the proteins for each peptide?

Peptides to Proteins
  • Protein Prophet
    • Nesvizhskii et al., Anal. Chem. 2003
  • Models probability that a protein is correct based on
    • Probability that its peptides are correct
  • Models probability that a peptide is correct based on
    • Probability that its proteins are correct
  • Proteins with one high-probability peptide are not eliminated
    • …but are down-weighted
  • Assumes identification probabilities from the same protein are independent (like Mascot)
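
Under the independence assumption, peptide probabilities can be combined into a protein probability as the chance that at least one peptide identification is correct. This is a simplified sketch: it leaves out the down-weighting of single-peptide proteins and the handling of peptides shared between proteins, and the function name is illustrative.

```python
from math import prod
from typing import Iterable


def protein_probability(peptide_probs: Iterable[float]) -> float:
    """Pr[at least one peptide identification is correct], assuming independence."""
    return 1.0 - prod(1.0 - p for p in peptide_probs)


print(protein_probability([0.9]))            # one high-probability peptide
print(protein_probability([0.6, 0.6, 0.6]))  # three moderate peptides
```

If the peptide identifications are not truly independent, this product overstates the evidence, which is the caveat noted in the last bullet.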

Peptides to Proteins
  • Best available method, to date, is Protein Prophet.
  • The problem will only get worse, as we search variant and isoform sequences
  • Proteins do not have a single sequence!
    • Peptide identification is not protein identification!

Publication Guidelines
  • Computational parameters
    • Spectral processing
    • Sequence database
    • Search program
    • Statistical analysis
  • Number of peptides per protein
    • Each peptide sequence counts once!
    • Multiple forms of the same peptide count once!
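
The counting rule can be sketched as collapsing identifications to distinct peptide sequences per protein. The record layout and the example data are hypothetical; real search results would need to be mapped into this shape first.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# Hypothetical record layout: (protein, peptide sequence, modification label, charge)
Identification = Tuple[str, str, str, int]


def peptides_per_protein(ids: Iterable[Identification]) -> Dict[str, int]:
    """Count distinct peptide sequences per protein, ignoring charge and modifications."""
    distinct: Dict[str, Set[str]] = defaultdict(set)
    for protein, sequence, _modification, _charge in ids:
        distinct[protein].add(sequence)
    return {protein: len(sequences) for protein, sequences in distinct.items()}


ids = [
    ("protA", "LVNELTEFAK", "", 2),
    ("protA", "LVNELTEFAK", "", 3),  # same peptide, different charge state
    ("protA", "AEFVEVTK", "", 2),
    ("protB", "MDDIAALVVDNGSGMCK", "Oxidation", 2),
    ("protB", "MDDIAALVVDNGSGMCK", "", 2),  # same peptide, unmodified form
]
print(peptides_per_protein(ids))
```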

Publication Guidelines
  • Single-peptide proteins must be explicitly justified by
    • Peptide sequence
    • N- and C-terminal amino acids
    • Precursor mass and charge
    • Peptide Scores
    • Multiple forms of the peptide counted once!
  • Biological conclusions based on single-peptide proteins must show the spectrum

Publication Guidelines
  • More stringent requirements for PMF (peptide mass fingerprint) data analysis
    • Similar to those for tandem mass spectra
  • Management of protein redundancy
    • Peptides identified from a different species?
  • Spectra submission encouraged

Summary
  • Could guessing be as effective as a search?
  • More guesses improve the best guess
  • Better guessers help us be more discriminating
  • Independent observations only count if they are independent!
  • Going from peptides to proteins is not as simple as it seems
  • Publication guidelines reflect sound statistical principles.
