Understanding Large-Scale Protein Identification Data Analysis

Name:
Matriculation number:

Exercise 1. Large-scale protein identification data – take a look at single proteins

Find alpha-enolase and double-click it. Why does the data not cover more of the protein sequence? Export the table to Excel and calculate the average mass error (and its standard deviation) for the individual peptides. What does this value mean for protein identification by database searching? What is the range of peptide masses identified, and what is the range of charge states? Any idea why you see what you see? Why do very long peptides show more b-ions than short peptides? Do all peptides of this protein have the same quality? How do you judge the quality of an individual spectrum?
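If you prefer a script to Excel for the mass error calculation, the following is a minimal sketch. It assumes you have already extracted the observed and theoretical peptide masses from the exported table; the function names and the example masses are made up for illustration and are not taken from the exercise data set. The error is expressed in parts per million (ppm), and the standard deviation is the sample standard deviation.

```python
import statistics


def ppm_errors(observed, theoretical):
    """Return the relative mass error of each peptide in ppm."""
    return [(obs - theo) / theo * 1e6 for obs, theo in zip(observed, theoretical)]


def summarize(errors):
    """Return (mean, sample standard deviation) of a list of errors."""
    return statistics.mean(errors), statistics.stdev(errors)


# Made-up example masses in Da (hypothetical, not from the exercise file).
observed = [1000.002, 1500.003, 2000.000]
theoretical = [1000.000, 1500.000, 2000.000]

errors = ppm_errors(observed, theoretical)
mean_ppm, sd_ppm = summarize(errors)
print(f"mean = {mean_ppm:.2f} ppm, sd = {sd_ppm:.2f} ppm")
```

Reporting the error in ppm rather than in Da makes peptides of different mass directly comparable.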

Name:
Matriculation number:

Exercise 2. Large-scale protein identification data – what is in a list

Filter the protein list for 99% protein and peptide probability. How many proteins are identified in the HeLa and A431 samples? Filter the protein list by the number of unique peptides. Then sort the list by molecular weight (low MW at the top). Why are very small proteins generally more difficult to identify than large proteins? How many kinases and receptors are in the list?

Name:
Matriculation number:

Exercise 3. Large-scale protein identification data – false discovery rate

Large-scale protein identification relies on statistical criteria rather than manual verification of peptide-spectrum matches. How many proteins have been identified in A431 and HeLa cells, respectively, when using a protein and peptide false discovery rate (FDR) of 1% each? Filter the protein list for 50% protein and peptide probability. Scaffold tells you that the FDR is about 1.4%. Rank the protein list according to the number of unique peptides. Decoy matches (e.g. IPI00375578-R) are fairly far down on the list. Why is this? Take a look at some decoy hits. What do you notice in terms of peptide length, Mascot ion score and Mascot delta ion score?
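To get a feel for how decoy hits relate to an FDR figure, here is a minimal sketch of one common decoy-based estimate: the number of decoy hits divided by the number of target hits. This is an assumption made for illustration and is not necessarily how Scaffold arrives at the value it reports. Decoys are recognized by the "-R" suffix seen in the example accession above; the accession list in the example is made up.

```python
def is_decoy(accession, suffix="-R"):
    """Return True if the accession carries the decoy suffix."""
    return accession.endswith(suffix)


def decoy_fdr(accessions, suffix="-R"):
    """Estimate the FDR as (number of decoy hits) / (number of target hits)."""
    decoys = sum(1 for acc in accessions if is_decoy(acc, suffix))
    targets = len(accessions) - decoys
    if targets == 0:
        raise ValueError("no target hits in the list")
    return decoys / targets


# Made-up protein list: 98 target hits and 2 decoy hits (hypothetical names).
accessions = [f"PROT{i:04d}" for i in range(98)] + ["PROT9001-R", "PROT9002-R"]

fdr = decoy_fdr(accessions)
print(f"estimated FDR = {fdr:.1%}")
```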

Name:
Matriculation number:

Exercise 4. Large-scale protein identification data – semi-quantitative analysis

Rank the list of proteins by the number of assigned spectra. Could this rank order correspond to cellular abundance? What are the pros and cons of this? This file contains two analyses: one from a HeLa digest and one from an A431 digest. Compare the two lists. What do you notice? Is it fair to compare the number of spectra between these two experiments? How do you interpret the fact that some proteins that are shared between the two samples have very different numbers of peptides?
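When thinking about spectral counts as a proxy for abundance, it can help to see one widely used normalization, the normalized spectral abundance factor (NSAF): each protein's spectral count is divided by its sequence length, and the result is divided by the sum of these ratios over all proteins in the same run. The sketch below is only an illustration under that assumption; the protein names, counts and lengths are made up.

```python
def nsaf(spectral_counts, lengths):
    """Return the NSAF of each protein within one run.

    spectral_counts: dict mapping protein name to number of assigned spectra
    lengths: dict mapping protein name to sequence length in residues
    """
    # Spectral abundance factor: spectra per residue.
    saf = {p: spectral_counts[p] / lengths[p] for p in spectral_counts}
    total = sum(saf.values())
    # Normalize so that the values of one run sum to 1.
    return {p: value / total for p, value in saf.items()}


# Made-up example: two proteins with equal spectral counts but different length.
counts = {"protein_A": 100, "protein_B": 100}
lengths = {"protein_A": 200, "protein_B": 800}

result = nsaf(counts, lengths)
for protein, value in result.items():
    print(f"{protein}: NSAF = {value:.2f}")
```

Because the values of each run sum to 1, they are less sensitive than raw counts to differences in the total number of spectra acquired in the two experiments.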

Name:
Matriculation number:

Exercise 5. Peptide modifications

Methionine spontaneously oxidizes to the sulfoxide. Filter the protein list for this modification and pick a protein. The methyl sulfoxide group is often lost upon fragmentation. This generates satellite peaks in tandem mass spectra. What mass difference relative to the y-ions containing the Met residue does this generate?

Now do the same for acetylation. Which peptide often/exclusively carries this modification? Do you know why? Why do we see so few of them in the data?

Now do the same for phosphorylation. Pick the BCL2-associated transcription factor. Phosphopeptides often lose phosphoric acid upon fragmentation (-98 Da). What does this do to the Ser and Thr residues? How is this useful for localizing the phosphorylation site?
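When looking for neutral-loss peaks in a spectrum, keep in mind that the observed shift on the m/z axis is the neutral mass divided by the charge of the ion. The following minimal sketch shows this for the loss of phosphoric acid mentioned above, using its monoisotopic mass of about 97.977 Da. The function name and the example ion are hypothetical.

```python
# Monoisotopic mass of phosphoric acid (H3PO4) in Da.
PHOSPHORIC_ACID = 97.9769


def neutral_loss_mz(ion_mz, charge, loss_mass=PHOSPHORIC_ACID):
    """Return the m/z of an ion after losing a neutral fragment.

    The charge stays on the ion, so the m/z shift is loss_mass / charge.
    """
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return ion_mz - loss_mass / charge


# Made-up example: the same nominal m/z at three different charge states.
for z in (1, 2, 3):
    shifted = neutral_loss_mz(800.0, z)
    print(f"z = {z}: 800.0000 -> {shifted:.4f} (shift {800.0 - shifted:.4f})")
```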
