
Protein Identification via Database searching

Attila Kertész-Farkas (kfattila@icgeb.org), Protein Structure and Bioinformatics Group, ICGEB, Trieste.


Presentation Transcript


  1. Protein Identification via Database searching Attila Kertész-Farkas kfattila@icgeb.org Protein Structure and Bioinformatics Group, ICGEB, Trieste

  2. Mass Spectra analysis Biological sample Results report

  3. Mass Spectra analysis Biological sample Results report

  4. Computational analysis of MS/MS • Two approaches, plus a hybrid: • De novo sequencing • Database searching • Hybrid (combines the two)

  5. De novo sequencing

  6. De novo sequencing • Advantages: • Can identify new peptides and proteins • Able to discover (new) PTMs • Independent of protein databases • Disadvantages: • Requires MS/MS data of good quality • No statistics-based validation

  7. Database searching-based MS/MS tandem mass spectra identification • Pipeline: • Input data • Peptide assignment • Validation • Protein inference • Interpretation • Quantitation

  8. Database searching-based MS/MS tandem mass spectra identification • Pipeline: • Input data • Peptide assignment • Validation • Protein inference • Interpretation • Quantitation

  9. Database searching-based MS/MS tandem mass spectra identification • Pipeline: • Input data (data formats) • Peptide identification (database searching) • Validation (statistical methods for validation) • Protein inference (protein assembling) • Interpretation (quantitation)

  10. Input data • Mass spectrum: • Histogram of the mass-to-charge ratio (m/z) of the observed fragment ions. • Spectrum normalization: intensity is usually scaled to the [0,100] interval.
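
The normalization step mentioned on this slide can be sketched as follows. This is a minimal illustration, assuming a spectrum is stored as a list of (m/z, intensity) pairs; the function name and the example peaks are hypothetical and not taken from the slides.

```python
def normalize_intensities(peaks, top=100.0):
    """Scale intensities so that the most intense peak equals `top`.

    `peaks` is a list of (mz, intensity) tuples (an assumed representation).
    """
    if not peaks:
        return []
    max_intensity = max(intensity for _, intensity in peaks)
    if max_intensity <= 0:
        # Nothing to scale: return all-zero intensities.
        return [(mz, 0.0) for mz, _ in peaks]
    return [(mz, top * intensity / max_intensity) for mz, intensity in peaks]


# Hypothetical raw spectrum: three fragment peaks with arbitrary intensities.
spectrum = [(114.09, 250.0), (227.18, 1000.0), (340.26, 500.0)]
normalized = normalize_intensities(spectrum)
print(normalized)
```

Scaling by the base peak keeps relative peak heights intact, so scores that depend only on matched positions (such as the shared peak count) are unaffected.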

  11. Input data • The most common formats are mzXML, MGF and DAT.

  12. Input data • MGF file format
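
The slide's example file is not preserved in the transcript. As a rough sketch, an MGF (Mascot Generic Format) file stores each spectrum between `BEGIN IONS` and `END IONS`, with `KEY=value` header lines followed by one "m/z intensity" pair per line. The reader below is a minimal illustration under that assumption; the function name `read_mgf` and the example spectrum are hypothetical, and only the most common parts of the format are handled.

```python
import io


def read_mgf(handle):
    """Yield one dict per BEGIN IONS ... END IONS block of an MGF file."""
    spectrum = None
    for raw_line in handle:
        line = raw_line.strip()
        if not line or line[0] in "#;!/":
            continue  # skip blank lines and comment lines
        if line == "BEGIN IONS":
            spectrum = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if spectrum is not None:
                yield spectrum
            spectrum = None
        elif spectrum is None:
            continue  # file-level parameters are ignored in this sketch
        elif line[0].isdigit():
            # Peak line: m/z followed by an optional intensity.
            fields = line.split()
            intensity = float(fields[1]) if len(fields) > 1 else 0.0
            spectrum["peaks"].append((float(fields[0]), intensity))
        elif "=" in line:
            key, value = line.split("=", 1)
            spectrum["params"][key.strip().upper()] = value.strip()


# Hypothetical single-spectrum file, not taken from the presentation.
example = """BEGIN IONS
TITLE=hypothetical spectrum 1
PEPMASS=523.28
CHARGE=2+
114.09 250.0
227.18 1000.0
340.26 500.0
END IONS
"""
spectra = list(read_mgf(io.StringIO(example)))
print(spectra[0]["params"]["PEPMASS"], len(spectra[0]["peaks"]))
```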

  13. Input data • .mzXML file format

  14. Peptide assignment Input data Validation Protein inference Interpretation • Scores: • 1. 2 Quantitation Input data Experimental Spectra Spectra comparison: Protein sequence DB >IPI:IPI00000044.1|SWISS-PROT:P01127 MNRCWALFLSLCCYLRLVSAEGDPIPEELYEMLSDHSIRSFDDLQRLLHGDPGEEDKAELDLNMTRSHSGGELESLARGRRSLGSLTIAEPAMIAECKTRTEVFEISRRLIDRTNANFLVWPPCVEVQRCSGCCNNRNVQCRPTQVQLRPVQVRKIEIVRKKPIFKKATVTLEDHLACKCETVAAARPVTRSPGGSQEQRAKTPQTRVTIRTVRVRRPPKGKHRKFKHTHDKTALKETLGA

  15. Peptide assignment Input data Validation Protein inference Interpretation • Scores: • 1. 2 • 2. 1 Quantitation Input data Experimental Spectra Spectra comparison: Protein sequence DB >IPI:IPI00000044.1|SWISS-PROT:P01127 MNRCWALFLSLCCYLRLVSAEGDPIPEELYEMLSDHSIRSFDDLQRLLHGDPGEEDKAELDLNMTRSHSGGELESLARGRRSLGSLTIAEPAMIAECKTRTEVFEISRRLIDRTNANFLVWPPCVEVQRCSGCCNNRNVQCRPTQVQLRPVQVRKIEIVRKKPIFKKATVTLEDHLACKCETVAAARPVTRSPGGSQEQRAKTPQTRVTIRTVRVRRPPKGKHRKFKHTHDKTALKETLGA

  16. Peptide assignment Input data Validation Protein inference Interpretation Scores: 3. 4 1. 2 2. 1 Quantitation Input data Experimental Spectra Spectra comparison: Protein sequence DB >IPI:IPI00000044.1|SWISS-PROT:P01127 MNRCWALFLSLCCYLRLVSAEGDPIPEELYEMLSDHSIRSFDDLQRLLHGDPGEEDKAELDLNMTRSHSGGELESLARGRRSLGSLTIAEPAMIAECKTRTEVFEISRRLIDRTNANFLVWPPCVEVQRCSGCCNNRNVQCRPTQVQLRPVQVRKIEIVRKKPIFKKATVTLEDHLACKCETVAAARPVTRSPGGSQEQRAKTPQTRVTIRTVRVRRPPKGKHRKFKHTHDKTALKETLGA

  17. Peptide assignment Input data Validation Protein inference Interpretation Scores: 3. 4 1. 2 2. 1 4. 1 Quantitation Input data Experimental Spectra Spectra comparison: Protein sequence DB >IPI:IPI00000044.1|SWISS-PROT:P01127 MNRCWALFLSLCCYLRLVSAEGDPIPEELYEMLSDHSIRSFDDLQRLLHGDPGEEDKAELDLNMTRSHSGGELESLARGRRSLGSLTIAEPAMIAECKTRTEVFEISRRLIDRTNANFLVWPPCVEVQRCSGCCNNRNVQCRPTQVQLRPVQVRKIEIVRKKPIFKKATVTLEDHLACKCETVAAARPVTRSPGGSQEQRAKTPQTRVTIRTVRVRRPPKGKHRKFKHTHDKTALKETLGA

  18. Peptide assignment Input data Validation Protein inference Interpretation Scores: 3. 4 1. 2 2. 1 4. 1 5. 1 Quantitation Input data Experimental Spectra Spectra comparison: Protein sequence DB >IPI:IPI00000044.1|SWISS-PROT:P01127 MNRCWALFLSLCCYLRLVSAEGDPIPEELYEMLSDHSIRSFDDLQRLLHGDPGEEDKAELDLNMTRSHSGGELESLARGRRSLGSLTIAEPAMIAECKTRTEVFEISRRLIDRTNANFLVWPPCVEVQRCSGCCNNRNVQCRPTQVQLRPVQVRKIEIVRKKPIFKKATVTLEDHLACKCETVAAARPVTRSPGGSQEQRAKTPQTRVTIRTVRVRRPPKGKHRKFKHTHDKTALKETLGA

  19. Peptide assignment Input data Validation Protein inference Interpretation Scores: 3. 4 1. 2 2. 2 2. 1 4. 1 5. 1 Quantitation Input data Experimental Spectra Spectra comparison: Protein sequence DB >IPI:IPI00000044.1|SWISS-PROT:P01127 MNRCWALFLSLCCYLRLVSAEGDPIPEELYEMLSDHSIRSFDDLQRLLHGDPGEEDKAELDLNMTRSHSGGELESLARGRRSLGSLTIAEPAMIAECKTRTEVFEISRRLIDRTNANFLVWPPCVEVQRCSGCCNNRNVQCRPTQVQLRPVQVRKIEIVRKKPIFKKATVTLEDHLACKCETVAAARPVTRSPGGSQEQRAKTPQTRVTIRTVRVRRPPKGKHRKFKHTHDKTALKETLGA

  20. Peptide assignment Input data Validation Protein inference Interpretation • Scores: • 3. 4 • 14. 3 • 1. 2 • 2 • 7. 2 • 2. 1 • 4. 1 • 9. 1 • 12. 1 Quantitation Input data Experimental Spectra Spectra comparison: Protein sequence DB >IPI:IPI00000045.1|SWISS-PROT:P18510-1 MEICRGLRSHLITLLLFLFHSETICRPSGRKSSKMQAFRIWDVNQKTFYLRNNQLVAGYLQGPNVNLEEKIDVVPIEPHALFLGIHGGKMCLSCVKSGDETRLQLEAVNITDLSENRKQDKRFAFIRSDSGPTTSFESAACPGWFLCTAMEADQPVSLTNMPDEGVMVTKFYFQEDE

  21. Peptide assignment Input data Validation Protein inference Interpretation • Scores: • 15. 32 • 3. 4 • 14. 3 • 1. 2 • 2 • 7. 2 • 2. 1 • 4. 1 • 9. 1 • 12. 1 Quantitation Input data Experimental Spectra Spectra comparison: Protein sequence DB >IPI:IPI00000045.1|SWISS-PROT:P18510-1 MEICRGLRSHLITLLLFLFHSETICRPSGRKSSKMQAFRIWDVNQKTFYLRNNQLVAGYLQGPNVNLEEKIDVVPIEPHALFLGIHGGKMCLSCVKSGDETRLQLEAVNITDLSENRKQDKRFAFIRSDSGPTTSFESAACPGWFLCTAMEADQPVSLTNMPDEGVMVTKFYFQEDE

  22. Peptide assignment Input data Validation Protein inference Interpretation • Scores: • 15. 32 • 3. 4 • 14. 3 • 1. 2 • 2 • 7. 2 • 2. 1 • 4. 1 • 9. 1 • 12. 1 Quantitation Input data Experimental Spectra Score: 32 Peptide: SHLITLLLFLFHSETICR Protein sequence DB

  23. Peptide assignment Input data Validation Protein inference Interpretation Scores: 13. 4 6. 4 1. 4 9. 3 4. 3 3. 2 7. 2 11. 2 8. 1 10. 1 2. 1 5. 1 12. 1 Quantitation Input data Experimental Spectra Spectra comparison: Protein sequence DB >IPI:IPI00000044.1|SWISS-PROT:P01127 MNRCWALFLSLCCYLRLVSAEGDPIPEELYEMLSDHSIRSFDDLQRLLHGDPGEEDKAELDLNMTRSHSGGELESLARGRRSLGSLTIAEPAMIAECKTRTEVFEISRRLIDRTNANFLVWPPCVEVQRCSGCCNNRNVQCRPTQVQLRPVQVRKIEIVRKKPIFKKATVTLEDHLACKCETVAAARPVTRSPGGSQEQRAKTPQTRVTIRTVRVRRPPKGKHRKFKHTHDKTALKETLGA

  24. Peptide assignment Input data Validation Protein inference Interpretation Scores: 13. 4 6. 4 1. 4 9. 3 4. 3 3. 2 7. 2 11. 2 8. 1 10. 1 2. 1 5. 1 12. 1 Quantitation Input data Experimental Spectra Score: 32 Peptide: SHLITLLLFLFHSETICR Score: 4 Peptide: AELDLNMTR Protein sequence DB

  25. Peptide assignment Input data Validation Protein inference Interpretation Scores: 11. 3 6. 3 9. 3 3. 3 1. 3 4. 2 7. 2 13. 2 1. 1 10. 1 Quantitation Input data Experimental Spectra Score: 32 Peptide: SHLITLLLFLFHSETICR Score: 4 Peptide: AELDLNMTR Score: 3 Peptide: MEICRGLR Protein sequence DB

  26. Peptide assignment Input data Validation Protein inference Interpretation Scores: 13. 15 6. 4 1. 4 9. 3 4. 3 3. 2 7. 2 11. 2 8. 1 10. 1 2. 1 5. 1 12. 1 Quantitation Input data Experimental Spectra Score: 32 Peptide: SHLITLLLFLFHSETICR Score: 4 Peptide: AELDLNMTR Score: 3 Peptide: MEICRGLR Score: 15 Peptide: LLHGDPGEEDK Protein sequence DB

  27. Peptide assignment Input data Validation Protein inference Interpretation • Scores: • 1. 2 Quantitation Input data Experimental Spectra Spectra comparison: Protein sequence DB >IPI:IPI00000044.1|SWISS-PROT:P01127 MNRCWALFLSLCCYLRLVSAEGDPIPEELYEMLSDHSIRSFDDLQRLLHGDPGEEDKAELDLNMTRSHSGGELESLARGRRSLGSLTIAEPAMIAECKTRTEVFEISRRLIDRTNANFLVWPPCVEVQRCSGCCNNRNVQCRPTQVQLRPVQVRKIEIVRKKPIFKKATVTLEDHLACKCETVAAARPVTRSPGGSQEQRAKTPQTRVTIRTVRVRRPPKGKHRKFKHTHDKTALKETLGA

  28. Peptide assignment Input data Validation Protein inference Interpretation • Scores: • 1. 2 Quantitation Input data Experimental Spectra Spectra comparison: 1. Protein sequence DB >IPI:IPI00000044.1|SWISS-PROT:P01127 MNRCWALFLSLCCYLRLVSAEGDPIPEELYEMLSDHSIRSFDDLQRLLHGDPGEEDKAELDLNMTRSHSGGELESLARGRRSLGSLTIAEPAMIAECKTRTEVFEISRRLIDRTNANFLVWPPCVEVQRCSGCCNNRNVQCRPTQVQLRPVQVRKIEIVRKKPIFKKATVTLEDHLACKCETVAAARPVTRSPGGSQEQRAKTPQTRVTIRTVRVRRPPKGKHRKFKHTHDKTALKETLGA

  29. Peptide assignment Input data Validation Protein inference Interpretation • Scores: • 1. 2 Quantitation Input data Experimental Spectra Spectra comparison: 1. 2. Protein sequence DB >IPI:IPI00000044.1|SWISS-PROT:P01127 MNRCWALFLSLCCYLRLVSAEGDPIPEELYEMLSDHSIRSFDDLQRLLHGDPGEEDKAELDLNMTRSHSGGELESLARGRRSLGSLTIAEPAMIAECKTRTEVFEISRRLIDRTNANFLVWPPCVEVQRCSGCCNNRNVQCRPTQVQLRPVQVRKIEIVRKKPIFKKATVTLEDHLACKCETVAAARPVTRSPGGSQEQRAKTPQTRVTIRTVRVRRPPKGKHRKFKHTHDKTALKETLGA

  30. Peptide assignment • 1. Spectra comparison • Shared Peak Count (SPC): the number of peaks in the theoretical spectrum that match peaks in the experimental spectrum. • (Figure: spectrum plot, intensity axis from 0% to 100%.)

  31. Peptide assignment • 1. Spectra comparison • Shared Peak Count (SPC): the number of peaks in the theoretical spectrum that match peaks in the experimental spectrum. • SPC = 7
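
A minimal sketch of the shared peak count, assuming peaks are compared by m/z within a fixed tolerance. The tolerance value, the function name and the peak lists are assumptions made for illustration; the slides do not specify how a "match" is decided.

```python
def shared_peak_count(theoretical_mz, experimental_mz, tolerance=0.5):
    """Count theoretical peaks that have an experimental peak within `tolerance`."""
    return sum(
        1
        for t in theoretical_mz
        if any(abs(t - e) <= tolerance for e in experimental_mz)
    )


# Hypothetical peak lists: ten theoretical peaks, seven of which are matched.
theoretical = [100.0, 200.0, 300.0, 400.0, 500.0, 600.0, 700.0, 800.0, 900.0, 1000.0]
experimental = [100.1, 200.2, 299.9, 400.0, 500.3, 650.0, 700.1, 799.8, 950.0]
spc = shared_peak_count(theoretical, experimental)
print(spc)
```

Because SPC ignores intensities, a noisy spectrum with many low peaks can match by chance; the intensity-weighted scores on the following slides address this.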

  32. Peptide assignment • 1. Spectra comparison • Inner product (I): the sum of the intensities of the peaks in the experimental spectrum that match peaks in the theoretical spectrum. • (Figure: spectrum plot, intensity axis from 0% to 100%.)

  33. Peptide assignment • 1. Spectra comparison • Inner product (I): the sum of the intensities of the peaks in the experimental spectrum that match peaks in the theoretical spectrum. • I = 3.5
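
The inner product as defined on this slide can be sketched like this, again assuming a fixed m/z tolerance for matching and intensities on a 0 to 1 scale. The names and the example peaks are hypothetical.

```python
def matched_intensity(theoretical_mz, experimental_peaks, tolerance=0.5):
    """Sum the intensities of experimental peaks that match a theoretical peak.

    `experimental_peaks` is a list of (mz, intensity) tuples.
    """
    return sum(
        intensity
        for mz, intensity in experimental_peaks
        if any(abs(mz - t) <= tolerance for t in theoretical_mz)
    )


# Hypothetical data: four of the five experimental peaks are matched.
theoretical = [100.0, 200.0, 300.0, 400.0, 500.0]
experimental = [(100.1, 0.5), (200.2, 1.0), (350.0, 0.8), (400.0, 0.25), (500.3, 0.75)]
inner = matched_intensity(theoretical, experimental)
print(inner)
```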

  34. Peptide assignment • 1. Spectra comparison • Hyperscore: H = I * Nb! * Ny! • I is the sum of the intensities of the matched peaks • Nb (resp. Ny) is the number of matched b (resp. y) peaks in the theoretical spectrum • ! is the factorial function • (Figure: spectrum with peaks labeled b y b b y b y y b y, intensity axis from 0% to 100%.)

  35. Peptide assignment • 1. Spectra comparison • Hyperscore: H = I * Nb! * Ny! • I is the sum of the intensities of the matched peaks • Nb (resp. Ny) is the number of matched b (resp. y) peaks in the theoretical spectrum • ! is the factorial function • H = 3.2 * 3! * 4! = 3.2 * 6 * 24 = 460.8
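
The worked hyperscore example on this slide can be checked with a few lines of code. This is only a sketch of the formula as written on the slide; the function and parameter names are my own.

```python
from math import factorial


def hyperscore(matched_intensity, n_b, n_y):
    """H = I * Nb! * Ny!, following the formula on the slide."""
    return matched_intensity * factorial(n_b) * factorial(n_y)


# Values from the slide: I = 3.2, three matched b ions, four matched y ions.
h = hyperscore(3.2, 3, 4)
print(round(h, 1))  # 460.8, as on the slide
```

The factorial terms grow very quickly with the number of matched ions, so in this score the ion counts tend to dominate the intensity term for long peptides.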

  36. Peptide assignment • 1. Spectra comparison • Xcorr • q is the query spectrum • t is the theoretical spectrum • (Figure: spectrum plot, intensity axis from 0% to 100%.)

  37. Peptide assignment • 1. Spectra comparison • Xcorr • q is the query spectrum • t is the theoretical spectrum • I(q,t) = 3.2

  38. Peptide assignment • 1. Spectra comparison • Xcorr • q is the query spectrum • t is the theoretical spectrum • I(q,t) = 3.2 • I(q,t[-75]) =

  39. Peptide assignment • 1. Spectra comparison • Xcorr • q is the query spectrum • t is the theoretical spectrum • I(q,t) = 3.2 • I(q,t[-32]) =

  40. Peptide assignment • 1. Spectra comparison • Xcorr • q is the query spectrum • t is the theoretical spectrum • I(q,t) = 3.2 • I(q,t[0]) =

  41. Peptide assignment • 1. Spectra comparison • Xcorr • q is the query spectrum • t is the theoretical spectrum • I(q,t) = 3.2 • I(q,t[32]) = • And so on.
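
The Xcorr formula itself is not preserved in the transcript. The sketch below assumes the commonly cited SEQUEST form: the inner product at zero offset minus the mean inner product over offsets from -75 to 75, where t[k] denotes the theoretical spectrum shifted by k bins. It also assumes both spectra are already binned into equal-length intensity vectors. All names are hypothetical, and some implementations differ in detail (for example, by excluding offset 0 from the mean).

```python
def shifted_inner_product(q, t, offset):
    """Inner product of q with t shifted by `offset` bins.

    Bins shifted outside the vector are treated as zero.
    """
    total = 0.0
    for i, q_i in enumerate(q):
        j = i - offset  # index in t that lands on bin i after the shift
        if 0 <= j < len(t):
            total += q_i * t[j]
    return total


def xcorr(q, t, max_offset=75):
    """Zero-offset inner product minus the mean over all offsets (assumed form)."""
    offsets = range(-max_offset, max_offset + 1)
    background = sum(shifted_inner_product(q, t, k) for k in offsets) / len(offsets)
    return shifted_inner_product(q, t, 0) - background


# Hypothetical binned spectra; a small offset window keeps the example readable.
query = [0, 1, 0, 0, 2, 0, 0, 1]
matching_theoretical = [0, 1, 0, 0, 1, 0, 0, 1]
unrelated_theoretical = [1, 0, 0, 1, 0, 0, 1, 0]

good = xcorr(query, matching_theoretical, max_offset=3)
bad = xcorr(query, unrelated_theoretical, max_offset=3)
print(good > bad)
```

Subtracting the background term penalizes spectra that would correlate equally well at any shift, which is why the score needs the inner product at many offsets and is comparatively slow to compute.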

  42. Peptide assignment • 2. Protein sequence DB • Protein Sequence Databases • Completeness: • Pro: complete • Con: longer searching time • Redundancy: • Pro: sequence variations can be found • Con: a redundant database can distort the statistics • Quality of sequence annotation

  43. Peptide assignment • 2. Protein sequence DB • Entrez Protein DB • http://www.ncbi.nlm.nih.gov/sites/entrez?db=protein • Most complete, redundant • Reference Sequence (RefSeq) and UniProt (Swiss-Prot and TrEMBL) • http://www.ncbi.nlm.nih.gov/RefSeq/ • http://www.uniprot.org/ • Well annotated, non-redundant • International Protein Index (IPI) • http://www.ebi.ac.uk/IPI/IPIhelp.html • Represents a good balance between redundancy and completeness. • Contains cross-references to Ensembl, UniProt and RefSeq. • Sequences from a single genome • Difficult to obtain good statistics on small datasets.

  44. Peptide assignment • 2. Protein sequence DB • Taxonomy • Allows searches to be limited to entries from particular species or groups of species. • Speeds up a search and ensures that the hit list contains only entries from the selected species. • For non-redundant databases, a single entry may represent identical sequences from multiple species. The accession string and title text from the FASTA entry, listed on the master results page, will usually describe just one of these entries. To see the equivalent entries and to explore their taxonomy, follow the accession number link in the results list to the Protein View. If the hit is from a non-redundant database and represents multiple entries with identical sequences, the Protein View will include links to NCBI Entrez and the NCBI Taxonomy Browser for all equivalent entries.

  45. Peptide assignment • Run time • A database search has to enumerate all peptides and compare them to all experimental spectra. • This can be slow with large protein sequence databases, especially when a slow scoring function such as Xcorr is applied.
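
To make the "enumerate all peptides" step concrete, here is a minimal in-silico digestion sketch. It assumes trypsin-like cleavage after every K or R with up to one missed cleavage; the enzyme is not named on the slides, but the peptides shown earlier (for example SHLITLLLFLFHSETICR and MEICRGLR) are consistent with this rule. The function names and the minimum length are hypothetical choices.

```python
def cleavage_fragments(protein):
    """Split a protein sequence after every K or R (assumed cleavage rule)."""
    fragments, start = [], 0
    for i, residue in enumerate(protein):
        if residue in "KR":
            fragments.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        fragments.append(protein[start:])
    return fragments


def candidate_peptides(protein, missed_cleavages=1, min_length=6):
    """Enumerate peptides built from up to `missed_cleavages` + 1 adjacent fragments."""
    fragments = cleavage_fragments(protein)
    peptides = set()
    for i in range(len(fragments)):
        last = min(i + missed_cleavages, len(fragments) - 1)
        for j in range(i, last + 1):
            peptide = "".join(fragments[i:j + 1])
            if len(peptide) >= min_length:
                peptides.add(peptide)
    return peptides


# Start of the IPI00000045.1 sequence shown on the slides.
protein_start = "MEICRGLRSHLITLLLFLFHSETICRPSGRK"
peptides = candidate_peptides(protein_start)
print(sorted(peptides))
```

The number of candidates grows with database size and with the number of allowed missed cleavages, and every candidate has to be scored against every spectrum, which is where the run time on this slide comes from.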

  46. Peptide assignment • Speedup techniques • Fast database indexing • Fast implementation of sequence indexing in the database • Parent mass check • PTMs can be lost • SEQUEST's preliminary score • Tag-based filtering (de novo hybrid) • Increases the specificity (or sensitivity)

  47. Peptide assignment • Advanced database indexing • Better implementation of the sequence indexing • Better representation of protein sequences.

  48. Peptide assignment • Scores: • 1. 2 • Input data: experimental spectra • Parent mass check • Spectra comparison • Protein sequence DB: >IPI:IPI00000044.1|SWISS-PROT:P01127 MNRCWALFLSLCCYLRLVSAEGDPIPEELYEMLSDHSIRSFDDLQRLLHGDPGEEDKAELDLNMTRSHSGGELESLARGRRSLGSLTIAEPAMIAECKTRTEVFEISRRLIDRTNANFLVWPPCVEVQRCSGCCNNRNVQCRPTQVQLRPVQVRKIEIVRKKPIFKKATVTLEDHLACKCETVAAARPVTRSPGGSQEQRAKTPQTRVTIRTVRVRRPPKGKHRKFKHTHDKTALKETLGA

  49. Peptide assignment • Scores: • Input data: experimental spectra • Parent mass check • Spectra comparison • Protein sequence DB: >IPI:IPI00000044.1|SWISS-PROT:P01127 MNRCWALFLSLCCYLRLVSAEGDPIPEELYEMLSDHSIRSFDDLQRLLHGDPGEEDKAELDLNMTRSHSGGELESLARGRRSLGSLTIAEPAMIAECKTRTEVFEISRRLIDRTNANFLVWPPCVEVQRCSGCCNNRNVQCRPTQVQLRPVQVRKIEIVRKKPIFKKATVTLEDHLACKCETVAAARPVTRSPGGSQEQRAKTPQTRVTIRTVRVRRPPKGKHRKFKHTHDKTALKETLGA
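
A parent mass check can be sketched as a cheap filter applied before any spectrum comparison. The example below assumes unmodified peptides, standard monoisotopic residue masses, a neutral precursor mass and a tolerance of 1 Da; the function names and the tolerance are hypothetical.

```python
# Monoisotopic residue masses in Da (standard values for unmodified residues).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added for the peptide termini


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[residue] for residue in peptide) + WATER


def parent_mass_filter(peptides, precursor_mass, tolerance=1.0):
    """Keep only peptides whose mass is within `tolerance` Da of the precursor."""
    return [p for p in peptides if abs(peptide_mass(p) - precursor_mass) <= tolerance]


# Peptides taken from the earlier slides; the precursor mass is hypothetical.
candidates = ["AELDLNMTR", "MEICRGLR", "LLHGDPGEEDK"]
survivors = parent_mass_filter(candidates, precursor_mass=1061.52)
print(survivors)
```

Only the survivors go on to the expensive spectrum comparison. A peptide carrying a modification has a shifted mass and would fail this filter, which is the "PTMs can be lost" caveat from the speedup slide.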

  50. Peptide assignment • Fast prescoring (used in SEQUEST) • The so-called Sp score • R(q,t) is the maximum number of consecutive matched b-y ions. • Sp = 3.2 * 7 * (1 + 0.0075 * 4) / 10 = 2.3072 • SEQUEST selects the top 500 peptides ranked by Sp and rescores them using Xcorr. • (Figure: spectrum plot, intensity axis from 0% to 100%.)
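
The general Sp formula is not preserved in the transcript, but the worked number suggests the form Sp = I * Nm * (1 + 0.0075 * R) / Nt. In the sketch below, I is the matched intensity (3.2), R is the longest run of consecutive matched ions (4), and Nm = 7 and Nt = 10 are assumed to be the number of matched peaks and the total number of theoretical peaks. That reading of 7 and 10, and all names in the code, are assumptions.

```python
def sp_score(matched_intensity, n_matched, max_consecutive, n_theoretical):
    """Preliminary score in the form suggested by the slide's worked example."""
    return (
        matched_intensity
        * n_matched
        * (1 + 0.0075 * max_consecutive)
        / n_theoretical
    )


# Numbers from the slide: 3.2 * 7 * (1 + 0.0075 * 4) / 10
sp = sp_score(3.2, 7, 4, 10)
print(round(sp, 4))  # 2.3072, as on the slide
```

Because Sp needs only counts and a sum, it is far cheaper than Xcorr, which makes it a reasonable first pass for shortlisting candidates before rescoring.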
