## Peptide Identification Statistics: Pin the tail on the donkey?


**Peptide Identification Statistics: Pin the tail on the donkey?**

US HUPO: Bioinformatics for Proteomics

Nathan Edwards, March 12, 2005

**Peptide Identification**

- Peptide fragmentation by CID is poorly understood
- MS/MS spectra represent incomplete information about amino-acid sequence
  - I/L, K/Q, GG/N, …
- Correct identifications don’t come with a certificate!

**Peptide Identification**

- High-throughput workflows demand we analyze all spectra, all the time.
- Spectra may not contain enough information to be interpreted correctly
  - …bad static on a cell phone
- Peptides may not match our assumptions
  - …it’s all Greek to me
- “Don’t know” is an acceptable answer!

**Peptide Identification**

We can’t prove we are right…

…so can we prove we aren’t wrong?

NO!

The best we can do is to show our answer is better than guessing!

**Better than guessing…**

- Better implies comparison
  - Score or measure of degree of success
- Guessing implies randomness
  - Probability and statistics

**Pin the tail on the donkey…**

**Probability Concepts**

Throwing darts

- One at a time
- Blindfolded

Identically distributed? Uniform distribution? Mutually exclusive? Independent?

Pr[dart hits x] = 0.05

**Probability Concepts**

Throwing darts

- One at a time
- Blindfolded
- Three darts

Pr[hitting 20 three times] = 0.05 × 0.05 × 0.05 = 0.000125

Pr[hitting 20 at least twice] = 0.007125 + 0.000125 = 0.00725

**Probability Concepts**

Throwing darts

- One at a time
- Blindfolded
- Three darts

Pr[hitting evens three times] = Pr[hitting 1-10 three times] = 0.5 × 0.5 × 0.5 = 0.125

Pr[hitting evens at least twice] = 0.5

**Probability Concepts**

Throwing darts

- One at a time
- Blindfolded
- 100 darts

Pr[hitting 20 exactly three times] = 0.139575

Pr[hitting 20 at least twice] = 0.9629188

**Match Score**

- Dartboard is peaks in a spectrum
- Each dart is a peptide fragment
- Pr[match ≥ s peaks] is the upper tail of Binomial(n, p) ≈ Poisson(np), for small p and large n
  - p is the probability of a fragment / peak match
  - n is the number of fragments

**Match Score**

Theoretical distribution

- Used by OMSSA
- Proposed, in various forms, by many.
- Probability of fragment / peak match
  - IID (independent, identically distributed)
  - Based on match tolerance
- Can use fragments or peaks as darts!
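The dartboard numbers above can be reproduced directly from the binomial distribution. The following is a minimal Python sketch, not code from the talk: the function names are illustrative, and the only inputs are the slide values (p = 0.05 per number, 3 or 100 darts). It also checks how close the Poisson approximation comes in the 100-dart case.

```python
from math import comb, exp, factorial


def binom_pmf(k, n, p):
    """Pr[exactly k hits in n independent throws, each hitting with probability p]."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_tail(s, n, p):
    """Pr[at least s hits in n throws], i.e. the upper tail of Binomial(n, p)."""
    return 1.0 - sum(binom_pmf(k, n, p) for k in range(s))


def poisson_tail(s, lam):
    """Pr[at least s events] under Poisson(lam), used as an approximation with lam = n * p."""
    return 1.0 - sum(exp(-lam) * lam**k / factorial(k) for k in range(s))


P_HIT = 0.05  # one of 20 equally likely numbers, as on the slides

three_darts_at_least_two = binom_tail(2, 3, P_HIT)
hundred_darts_exactly_three = binom_pmf(3, 100, P_HIT)
hundred_darts_at_least_two = binom_tail(2, 100, P_HIT)
hundred_darts_poisson = poisson_tail(2, 100 * P_HIT)

print(f"3 darts, 20 at least twice:   {three_darts_at_least_two:.6f}")
print(f"100 darts, 20 exactly thrice: {hundred_darts_exactly_three:.6f}")
print(f"100 darts, 20 at least twice: {hundred_darts_at_least_two:.7f}")
print(f"Poisson approximation:        {hundred_darts_poisson:.7f}")
```

With p = 0.05 and n = 100 the Poisson tail differs from the exact binomial tail by a few thousandths; the approximation tightens as p shrinks and n grows.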
**Match Score**

Theoretical distribution assumptions

- Each dart is independent
  - Peaks are not “related”
- Each dart is identically distributed
  - Chance of fragment / peak match is the same for all peaks and fragments

**Tournament Size**

Figure: number of 20s hit with 100 darts, in tournaments of 100, 1000, 10000, and 100000 people.

**Number of Trials**

- Tournament size == number of trials
  - Number of peptides tried
  - Related to sequence database size
- Probability that a random match score is ≥ s
  - 1 - Pr[all match scores < s]
  - 1 - Pr[match score < s]^Trials (*)
  - Assumes IID!
- Expect value
  - E = Trials × Pr[match ≥ s]
  - Corresponds to Bonferroni bound on (*)

**Better Dart Throwers**

**Better Random Models**

- Comparison with completely random model isn’t really fair
- Match scores for real spectra with real peptides obey rules
- Even incorrect peptides match with non-random structure!

**Better Random Models**

- Want to generate random fragment masses (darts) that behave more like the real thing:
  - Some fragments are more likely than others
  - Some fragments depend on others
- Theoretical models can only incorporate this structure to a limited extent.
  - Cannot model the properties of a particular peptide!
  - Must capture behavior of fragments in general

**Better Random Models**

- Generate random peptides
  - Real looking fragment masses
  - No theoretical model!
  - Must use empirical distribution
  - Usually require they have the correct precursor mass
- Score function can model anything we like!

**Better Random Models**

Figure: Fenyo & Beavis, Anal. Chem., 2003

**Better Random Models**

- Truly random peptides don’t look much like real peptides
- Just use peptides from the sequence database!
- Caveats:
  - Correct peptide (non-random) may be included
  - Peptides are not independent
  - Reverse sequence avoids only the first problem

**Extrapolating from the Empirical Distribution**

Figure: Fenyo & Beavis, Anal. Chem., 2003

**Extrapolating from the Empirical Distribution**

- Often, the empirical shape is consistent with a theoretical model

Fenyo & Beavis, Anal. Chem., 2003; Geer et al., J. Proteome Research, 2004

**Peptide Prophet**

- From the Institute for Systems Biology
- Keller et al., Anal. Chem., 2002
- Re-analysis of SEQUEST results
- Spectra are trials (NOT peptides!)
- Assumes that many of the spectra are not correctly identified

**Peptide Prophet**

Figure: distribution of spectral scores in the results (Keller et al., Anal. Chem., 2002)

**Peptide Prophet**

- Assumes a bimodal distribution of scores, with a particular shape
- Ignores database size
  - …but it is included implicitly
- Like empirical distribution for peptide sampling, can be applied to any score function
- Can be applied to any search engine’s results

**Peptide Prophet**

- Caveats
  - Are spectra scores sampled from the same distribution?
  - Are there enough correct identifications for the second peak?
  - Are spectra independent observations?
  - Are distributions appropriately shaped?
- Huge improvement over raw SEQUEST results

**Peptides to Proteins**

Figure: Nesvizhskii et al., Anal. Chem., 2003

**Peptides to Proteins**

- A peptide sequence may occur in many different protein sequences
  - Variants, paralogues, protein families
- Separation, digestion and ionization are not well understood
- Proteins in sequence database are extremely non-random, and very dependent

**Peptides to Proteins**

- Mascot
  - Protein score is sum of peptide scores
  - Assumes peptide identifications are independent!
- SEQUEST
  - Keeps only one of the proteins for each peptide?

**Peptides to Proteins**

- Protein Prophet
  - Nesvizhskii et al., Anal. Chem., 2003
  - Models probability that a protein is correct based on
    - Probability that its peptides are correct
  - Models probability that a peptide is correct based on
    - Probability that its proteins are correct
  - Proteins with one high-probability peptide are not eliminated
    - …but are down-weighted
  - Assumes identification probabilities from the same protein are independent (like Mascot)

**Peptides to Proteins**

- Best available method, to date, is Protein Prophet.
- The problem will only get worse, as we search variants and isoform sequences
- Proteins do not have a single sequence!
- Peptide identification is not protein identification!

**Publication Guidelines**

- Computational parameters
  - Spectral processing
  - Sequence database
  - Search program
  - Statistical analysis
- Number of peptides per protein
  - Each peptide sequence counts once!
  - Multiple forms of the same peptide count once!
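The counting rule for peptides per protein (each sequence once, multiple forms once) amounts to deduplicating identifications by bare sequence. The sketch below is one possible reading of that rule, not a procedure from the guidelines: the accessions, peptide sequences, and record layout are all hypothetical and made up for illustration.

```python
from collections import defaultdict

# Hypothetical identifications: (protein accession, peptide sequence, charge, modification).
# Charge states and modified forms of one sequence are "multiple forms of the same peptide".
identifications = [
    ("PROT_A", "LVNELTEFAK", 2, None),
    ("PROT_A", "LVNELTEFAK", 3, None),             # same sequence, different charge
    ("PROT_A", "MLVNELTK", 2, "Oxidation (M)"),
    ("PROT_A", "MLVNELTK", 2, None),               # same sequence, unmodified form
    ("PROT_B", "AEFVEVTK", 2, None),
]


def distinct_peptides_per_protein(ids):
    """Count each peptide sequence once per protein, ignoring charge and modification."""
    peptides = defaultdict(set)
    for protein, sequence, _charge, _modification in ids:
        peptides[protein].add(sequence)
    return {protein: len(sequences) for protein, sequences in peptides.items()}


counts = distinct_peptides_per_protein(identifications)

# Proteins supported by a single distinct peptide need explicit justification.
single_peptide_proteins = sorted(p for p, n in counts.items() if n == 1)

print(counts)
print(single_peptide_proteins)
```

Here four identifications for the first protein collapse to two distinct peptides, and the second protein is flagged as a single-peptide protein.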
**Publication Guidelines**

- Single-peptide proteins must be explicitly justified by
  - Peptide sequence
  - N and C terminal amino-acids
  - Precursor mass and charge
  - Peptide scores
  - Multiple forms of the peptide counted once!
- Biological conclusions based on single-peptide proteins must show the spectrum

**Publication Guidelines**

- More stringent requirements for PMF data analysis
  - Similar to those for tandem mass spectra
- Management of protein redundancy
  - Peptides identified from a different species?
- Spectra submission encouraged

**Summary**

- Could guessing be as effective as a search?
- More guesses improve the best guess
- Better guessers help us be more discriminating
- Independent observations only count if they are independent!
- Going from peptides to proteins is not as simple as it seems
- Publication guidelines reflect sound statistical principles.
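The point that more guesses improve the best guess can be made concrete with the formulas from the Number of Trials slide. This is a rough sketch under the slide's IID assumption; the single-trial probability is an illustrative value, not one from the talk, and the function names are made up here. The trial counts are the tournament sizes from the Tournament Size slide.

```python
import math


def prob_any_random_match(p_single, trials):
    """1 - Pr[match score < s]^trials, assuming independent, identically distributed trials.

    Computed as -expm1(trials * log1p(-p)) to stay accurate for very small p.
    """
    return -math.expm1(trials * math.log1p(-p_single))


def expect_value(p_single, trials):
    """E = trials * Pr[match >= s]: expected number of random matches scoring at least s."""
    return trials * p_single


P_SINGLE = 1e-4  # illustrative assumption: chance that one random peptide scores >= s
TOURNAMENT_SIZES = [100, 1000, 10000, 100000]

results = [
    (trials, prob_any_random_match(P_SINGLE, trials), expect_value(P_SINGLE, trials))
    for trials in TOURNAMENT_SIZES
]

for trials, prob, e_value in results:
    print(f"{trials:>6} trials: Pr[any random match >= s] = {prob:.4f}, E = {e_value:.2f}")
```

The Expect value is never smaller than the exact probability, which is the Bonferroni bound in action: the two agree closely while E is small and diverge once E approaches 1.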