
Statistical calibration of MS/MS spectrum library search scores

Barbara Frewen, University of Washington. January 10, 2011.


Presentation Transcript


  1. Statistical calibration of MS/MS spectrum library search scores Barbara Frewen January 10, 2011 University of Washington

  2. Protein identification Peptides EYWDYEAHMIEWGQIDDYQLVR GGTNIITLLDVVK VVVFLFDLLYFNGEPLV YQTTGQVQYSCLVR LIVVNSEDQLR HPLISLLLLIAFYSTSSEAFVPK … Protein Mixture Proteins B0205.7 casein kinase C29A12.3a lig-1 DNA ligase C29E6.1a mucin like protein … Digestion to Peptides

  3. Acquiring MS/MS spectra [Workflow diagram: Cell lysis → Isolate proteins → Digest to peptides → Load onto column → µLC/µLC → MS → MS/MS]

  4. Which proteins are in my sample? Peptides EYWDYEAHMIEWGQIDDYQLVR GGTNIITLLDVVK VVVFLFDLLYFNGEPLV YQTTGQVQYSCLVR LIVVNSEDQLR HPLISLLLLIAFYSTSSEAFVPK … Protein Mixture Proteins B0205.7 casein kinase C29A12.3a lig-1 DNA ligase C29E6.1a mucin like protein … Digestion to Peptides

  5. Matching a spectrum to a peptide sequence • De novo: infer the peptide sequence from the m/z of observed peaks • Database search: compare observed peaks to predicted peaks for each peptide from a list of candidate sequences • Library search: compare observed peaks to known spectra

  6. Building a spectrum library • Ideally, infuse synthesized peptides • ISB has gold standard spectra from five peptides per protein in human • University of Washington (MacCoss) will have spectra from 790 transcription factors and 350 kinases • Alternatively, use high-quality peptide-spectrum matches from shotgun proteomics experiments • BiblioSpec now parses search results from SEQUEST, Mascot, X! Tandem, ProteinPilot, Scaffold

  7. Library file formats

  8. Using a spectrum library Spectrum identification via library searching Resource for designing SRM directed experiments Compact, unified format for compiling results and sharing between labs

  9. Searching a spectrum library SEQUEST BiblioSpec Peptide ID list Scan1 0.7 EGSSDEEVP… Scan1 0.3 TFAEILNPI… Scan1 0.2 ARFDLNNHD… ------------------- Scan2 0.5 EDEESIRAV… Scan2 0.2 WLGDDCFMV… Scan2 0.1 IDRAAWKAV… ------------------- Scan3 0.2 EITTRDMGN… Scan3 0.1 GRNMCTAKL… m/z 594.2 score = 0.2 MS/MS query spectra 2 GDTIENFK 300.4 1 CGCCLYNT 522.3 2 FMACSDEK 593.9 3 QWDKEPPR 765.1 Library of identified spectra 3 NGISLTIVR 940.4
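The search step on this slide can be sketched as follows. This is a minimal illustration, not BiblioSpec's implementation: it assumes that candidates are chosen by precursor m/z within a tolerance and scored with a normalized dot product on binned peaks. The slides do not state the scoring function, the tolerance, or the bin width, so `similarity`, `precursor_tol`, and `bin_width` are hypothetical. The sequences and precursor m/z values are taken from the slide (reading the decimal values as precursor m/z is itself an assumption); the peak lists are invented.

```python
import math

def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins."""
    binned = {}
    for mz, intensity in peaks:
        key = int(mz // bin_width)
        binned[key] = binned.get(key, 0.0) + intensity
    return binned

def similarity(query_peaks, library_peaks, bin_width=1.0):
    """Normalized dot product (cosine similarity) of two binned spectra."""
    q = bin_spectrum(query_peaks, bin_width)
    lib = bin_spectrum(library_peaks, bin_width)
    dot = sum(q[k] * lib[k] for k in q.keys() & lib.keys())
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in lib.values())))
    return dot / norm if norm > 0 else 0.0

def search_library(query, library, precursor_tol=3.0):
    """Score every library entry whose precursor m/z is within tolerance.

    Returns (score, sequence) pairs, best score first.
    """
    hits = [
        (similarity(query["peaks"], entry["peaks"]), entry["sequence"])
        for entry in library
        if abs(entry["precursor_mz"] - query["precursor_mz"]) <= precursor_tol
    ]
    return sorted(hits, reverse=True)

# Toy library: sequences and precursor m/z from the slide, peaks invented.
library = [
    {"sequence": "GDTIENFK", "precursor_mz": 300.4, "peaks": [(150.1, 10.0), (420.3, 5.0)]},
    {"sequence": "CGCCLYNT", "precursor_mz": 522.3, "peaks": [(210.2, 8.0), (610.4, 3.0)]},
    {"sequence": "FMACSDEK", "precursor_mz": 593.9,
     "peaks": [(175.1, 6.0), (304.2, 9.0), (712.3, 4.0)]},
    {"sequence": "QWDKEPPR", "precursor_mz": 765.1, "peaks": [(272.2, 7.0), (530.3, 2.0)]},
    {"sequence": "NGISLTIVR", "precursor_mz": 940.4, "peaks": [(288.2, 5.0), (601.4, 6.0)]},
]
query = {"precursor_mz": 594.2, "peaks": [(175.2, 5.0), (304.1, 10.0), (450.0, 1.0)]}

hits = search_library(query, library)
```

With the assumed 3.0 m/z tolerance, only the entry at 593.9 is a candidate for the query at 594.2, so `hits` holds a single scored match.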

  10. Comparing library and database search • Created a large library of spectra from worm peptides • Identified a different set of spectra using both library and database search • Compared BiblioSpec results with SEQUEST results to evaluate performance • spectrum score library SEQUEST agree? • 0.17 AFEQWK LVVAMK NO False positive • 0.83 DLAVER DLAVER YES True positive • …

  11. Similarity score discriminates between correct and incorrect matches [Figure: histogram of search scores for matches that agree vs. disagree; ROC and 1% ROC curves; AUC = 0.978]
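For readers unfamiliar with the AUC reported on this slide, here is a minimal sketch of how an area under the ROC curve can be computed from search scores and agree/disagree labels. It uses the pairwise (Mann-Whitney) definition; the function name `roc_auc` and the toy values are illustrative and are not taken from the presentation.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve.

    Equals the probability that a randomly chosen correct match (label True)
    scores higher than a randomly chosen incorrect match (label False),
    counting ties as one half.
    """
    correct = [s for s, ok in zip(scores, labels) if ok]
    incorrect = [s for s, ok in zip(scores, labels) if not ok]
    if not correct or not incorrect:
        raise ValueError("need at least one correct and one incorrect match")
    wins = 0.0
    for c in correct:
        for i in incorrect:
            if c > i:
                wins += 1.0
            elif c == i:
                wins += 0.5
    return wins / (len(correct) * len(incorrect))

# Toy example: both correct matches outscore both incorrect ones.
perfect = roc_auc([0.9, 0.8, 0.3, 0.2], [True, True, False, False])
```

An AUC of 1.0 means the score separates the two classes perfectly, and 0.5 means it does no better than chance.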

  12. BiblioSpec and SEQUEST results agree • BiblioSpec found 91% of SEQUEST IDs • Two reasons BiblioSpec and SEQUEST disagree: • Query ion not in library • BiblioSpec found a different peptide to be more similar • Only 7% of query spectra not correctly identified were in library. Most disagreed because the correct match was not in library.

  13. Compute p-values to evaluate results • The BiblioSpec search score provides good discrimination • But it’s unclear where to place a threshold between correct and incorrect matches • Use statistical methods to estimate the probability that a match is incorrect and to estimate the fraction of incorrect matches above a score threshold.

  14. How likely is the match incorrect? distribution of scores for a spectrum vs all possible incorrect matches low score large area to right p-value = 0.4 high score small area to right p-value = 0.01 score
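The "area to the right" idea can be sketched with an empirical null distribution: the p-value of a score is the fraction of null (incorrect-match) scores at least as large. The +1 pseudo-count in the numerator and denominator is a common convention that is assumed here, not something stated on the slide, and `empirical_p_value` is a hypothetical name.

```python
def empirical_p_value(score, null_scores):
    """Estimate P(null score >= score) from a sample of null scores.

    The +1 terms keep the estimate away from exactly zero when no null
    score reaches the observed one (an assumed convention).
    """
    at_least_as_high = sum(1 for s in null_scores if s >= score)
    return (at_least_as_high + 1) / (len(null_scores) + 1)

# Toy null sample of nine scores from incorrect matches.
null_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

p_high = empirical_p_value(0.95, null_scores)  # high score, small area to the right
p_low = empirical_p_value(0.55, null_scores)   # lower score, larger area to the right
```

The smallest p-value this estimator can report is 1 / (n + 1), so the resolution of the estimate depends on how many null scores are available.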

  15. Estimating the null distribution • Representative sample of scores from incorrect matches • Guarantee they are incorrect by using decoys • In database searching, scores from decoy peptides are used to estimate the null distribution • How can we create decoy spectra?

  16. Generate decoy spectra by shifting the m/z of the peaks Requirements: • fast to generate • sequence agnostic • representative scores Evaluation: • score distributions mimic real spectra • generate a data set of incorrect matches to real spectra real spectrum decoy spectrum
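A circular shift of this kind can be sketched as below. The shift size and the m/z window are assumptions chosen for illustration; the slides do not give the values used. The function needs no peptide sequence, which matches the "sequence agnostic" requirement.

```python
def circular_shift_decoy(peaks, shift, min_mz, max_mz):
    """Shift every peak's m/z by `shift`, wrapping around inside [min_mz, max_mz).

    `peaks` is a list of (mz, intensity) pairs whose m/z values lie in the
    window. Intensities are left unchanged.
    """
    span = max_mz - min_mz
    if span <= 0:
        raise ValueError("max_mz must be greater than min_mz")
    shifted = [
        (min_mz + (mz - min_mz + shift) % span, intensity)
        for mz, intensity in peaks
    ]
    return sorted(shifted)

# Toy spectrum; the peak at m/z 900 wraps around to m/z 200.
real = [(100.0, 5.0), (450.0, 2.0), (900.0, 9.0)]
decoy = circular_shift_decoy(real, shift=200.0, min_mz=100.0, max_mz=1000.0)
```

Because only m/z values move, the decoy keeps the real spectrum's peak count and intensity distribution, which is one plausible reason its scores would resemble those of real incorrect matches.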

  17. Circularly shifted peaks are similar to real spectra

  18. Circularly shifted peaks are similar to real spectra

  19. Percolator computes p-values Semi-supervised machine learning to classify correct versus incorrect matches • Trains with high-scoring real matches vs. decoy matches • Classifies all real matches using that model • http://per-colator.com • Käll et al. 2007 Nature Methods • Käll et al. 2008 Bioinformatics

  20. Evaluate p-values • Compute p-values for incorrect matches to real spectra • Percolator p-values should correspond with rank-based p-values ID Percolator rank rank/n 745AF_8518 0.000230787 1 1/n 691AF_10025 0.000461467 2 2/n 691AF_10107 0.000692201 3 3/n 691AF_10301 0.000922934 4 4/n ... ... ... ... 691AF_5048 0.001153669 12 12/n ... ... ... ...
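The comparison in this table can be sketched as follows: sort the calculated p-values of known-incorrect matches, pair the p-value at rank r with r/n, and measure how far the two are apart. The helper names are hypothetical and the example values are synthetic, not the Percolator output shown on the slide.

```python
def calibration_pairs(calculated_p_values):
    """Pair each calculated p-value with its rank-based p-value, rank / n."""
    ordered = sorted(calculated_p_values)
    n = len(ordered)
    return [(p, rank / n) for rank, p in enumerate(ordered, start=1)]

def max_calibration_error(calculated_p_values):
    """Largest absolute gap between calculated and rank-based p-values."""
    return max(abs(p - expected) for p, expected in calibration_pairs(calculated_p_values))

# Synthetic, perfectly calibrated p-values for four incorrect matches.
pairs = calibration_pairs([0.75, 0.25, 0.5, 1.0])
error = max_calibration_error([0.75, 0.25, 0.5, 1.0])
```

For well-calibrated p-values computed on incorrect matches, the pairs fall near the diagonal when calculated p-value is plotted against rank p-value.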

  21. Calibrating p-values Calculated p-value Rank p-value

  22. Better discrimination with p-values Percolator combines: • search score • delta m/z • delta search score • charge • peptide length • candidates • copies in library precision (tp / (tp + fp)) recall (tp / (tp + fn))
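The precision and recall used on the plot axes can be computed as in the sketch below, where a match is accepted if its p-value is at or below a threshold. The function name, the threshold, and the toy data are illustrative. The values returned for the empty-denominator cases are a convention assumed here.

```python
def precision_recall(matches, p_threshold):
    """Precision and recall of the matches accepted at p-value <= p_threshold.

    `matches` is a list of (p_value, is_correct) pairs.
    precision = tp / (tp + fp), recall = tp / (tp + fn).
    """
    tp = sum(1 for p, ok in matches if p <= p_threshold and ok)
    fp = sum(1 for p, ok in matches if p <= p_threshold and not ok)
    fn = sum(1 for p, ok in matches if p > p_threshold and ok)
    precision = tp / (tp + fp) if (tp + fp) else 1.0  # nothing accepted
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # no correct matches at all
    return precision, recall

# Toy data: three correct and two incorrect matches.
matches = [(0.001, True), (0.004, True), (0.02, False), (0.2, True), (0.6, False)]
precision, recall = precision_recall(matches, p_threshold=0.05)
```

Sweeping the threshold from strict to permissive traces out a precision-recall curve like the one on the slide.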

  23. Better discrimination with p-values

  24. p-values distinguish between correct and incorrect matches precision (tp / (tp + fp)) recall (tp / (tp + fn))

  25. p-values distinguish between correct and incorrect matches

  26. p-values provide a universal metric for comparing to other search results high scoring matches library search Spectra Compiled results low scoring spectra database search high scoring matches

  27. Acknowledgements MacCoss lab Jesse Canterbury Michael Bereman Jarrett Egertson Greg Finney Eileen Heimer Edward Hsieh Alana Killeen Brendan MacLean Gennifer Merrihew Daniela Tomazela Mike MacCoss Bill Noble

  28. Number of real matches above a fixed q-value
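As background for this slide, a q-value is the smallest estimated false discovery rate at which a match would still be accepted. The sketch below estimates it from real and decoy match scores using the simple ratio of decoys to real matches above each threshold. This is one common estimator, assumed here for illustration; it is not necessarily how Percolator computes q-values, and all names and numbers are hypothetical.

```python
def q_values(real_scores, decoy_scores):
    """Estimate a q-value for each real match score (higher score = better).

    FDR at a threshold is estimated as (# decoys >= threshold) divided by
    (# real matches >= threshold), capped at 1. The q-value is the minimum
    FDR over this threshold and all more permissive ones.
    """
    decoys = sorted(decoy_scores, reverse=True)
    order = sorted(range(len(real_scores)), key=lambda i: real_scores[i], reverse=True)
    fdrs = []
    n_decoys = 0
    for rank, i in enumerate(order, start=1):
        # Count decoys scoring at least as high as this real match.
        while n_decoys < len(decoys) and decoys[n_decoys] >= real_scores[i]:
            n_decoys += 1
        fdrs.append(min(1.0, n_decoys / rank))
    result = [0.0] * len(real_scores)
    running_min = 1.0
    for pos in range(len(order) - 1, -1, -1):
        running_min = min(running_min, fdrs[pos])
        result[order[pos]] = running_min
    return result

# Toy scores for four real and four decoy matches.
q = q_values([0.9, 0.8, 0.7, 0.4], [0.75, 0.5, 0.3, 0.1])
accepted_at_1_percent = sum(1 for value in q if value <= 0.01)
```

Counting the real matches with a q-value at or below a fixed cutoff, as in the last line, gives the kind of number this slide's title refers to.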

  29. Percolator distinguishes between correct and incorrect matches

  30. Spectrum-sequence assignments • spectrum score library SEQUEST agree? • 0.17 AFEQWK LVVAMK NO False positive • 0.83 DLAVER DLAVER YES True positive • …

  31. Test procedure [Flow diagram] • Library: MS/MS spectra, whole worm lysate, 4 fractionation methods, 31 MuDPITs, 6,634,874 spectra → SEQUEST → DTASelect → List of spectrum-sequence pairs (366,400 spectra, estimated 51 false positives; file scan seq: run1.ms2 404 DALLQW…, run1.ms2 651 PJAMVM…, run5.ms2 924 SAITTY…, …) → BlibBuild → Library (multiple spectra per peptide) → BlibFilter → Filtered Library • Statistics: 26,708 spectra, 21,264 sequences, 3,573 proteins • Query Spectra: unfractionated worm, one MuDPIT, 220,845 spectra; similar DTASelect criteria; 14,926 spectra, 5,358 ions • BlibSearch → Peptide ID List: Scan1 0.7 EGSSDEEVP… Scan1 0.3 TFAEILNPI… Scan1 0.2 ARFDLNNHD… ------------------- Scan2 0.5 EDEESIRAV… Scan2 0.2 WLGDDCFMV… Scan2 0.1 IDRAAWKAV… ------------------- Scan3 0.2 EITTRDMGN… Scan3 0.1 GRNMCTAKL…

  32. Optimize processing parameters • Noise removal • a fixed number of peaks • a fixed fraction of the total intensity • all peaks above a defined noise level • Intensity normalization • log transform • bin peaks, divide by base peak in each bin • square root of intensity • square root weighted by peak m/z 100
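A few of the processing options listed on this slide can be sketched as small functions over a list of (m/z, intensity) peaks. The function names and default values are hypothetical; the slides do not say which settings were tested or in what form BiblioSpec implements them.

```python
import math

def keep_top_n(peaks, n):
    """Noise removal: keep the n most intense peaks, returned in m/z order."""
    strongest = sorted(peaks, key=lambda p: p[1], reverse=True)[:n]
    return sorted(strongest)

def keep_intensity_fraction(peaks, fraction=0.5):
    """Noise removal: keep the most intense peaks until they account for
    at least `fraction` of the total intensity."""
    total = sum(intensity for _, intensity in peaks)
    kept, running = [], 0.0
    for mz, intensity in sorted(peaks, key=lambda p: p[1], reverse=True):
        if running >= fraction * total:
            break
        kept.append((mz, intensity))
        running += intensity
    return sorted(kept)

def sqrt_intensity(peaks):
    """Intensity normalization: replace each intensity by its square root."""
    return [(mz, math.sqrt(intensity)) for mz, intensity in peaks]

def normalize_by_bin(peaks, bin_width=100.0):
    """Intensity normalization: divide each peak by the base (most intense)
    peak of its m/z bin."""
    base = {}
    for mz, intensity in peaks:
        key = int(mz // bin_width)
        base[key] = max(base.get(key, 0.0), intensity)
    return [
        (mz, intensity / base[int(mz // bin_width)] if base[int(mz // bin_width)] > 0 else 0.0)
        for mz, intensity in peaks
    ]

# Toy spectrum.
peaks = [(100.0, 4.0), (200.0, 16.0), (300.0, 1.0), (400.0, 9.0)]
top_two = keep_top_n(peaks, 2)
half_intensity = keep_intensity_fraction(peaks, 0.5)
rooted = sqrt_intensity(peaks)
```

The order in which noise removal and intensity normalization are applied can change the result, because normalization changes which peaks rank as most intense.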

  33. Uses of Spectrum Libraries • A basis for spectrum identification via spectrum-spectrum searches • A reference for designing SRM experiments • Skyline • A repository for spectrum identifications • A unified format for consolidating results, sharing with other labs

  34. Spectrum shuffling techniques • Blindly shuffle peaks • Shuffle blocks of peaks • Shift peaks circularly • Identify fragment ions from peptides, shuffle sequence and move peaks accordingly
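The first two techniques in this list can be read in more than one way. The sketch below shows one possible reading, in which m/z positions stay fixed and intensities are permuted, either individually or in contiguous blocks. This interpretation, the function names, and the block size are assumptions, not details from the presentation.

```python
import random

def shuffle_peaks(peaks, rng):
    """Blind shuffle: randomly reassign intensities among the m/z positions."""
    ordered = sorted(peaks)
    intensities = [intensity for _, intensity in ordered]
    rng.shuffle(intensities)
    return [(mz, intensity) for (mz, _), intensity in zip(ordered, intensities)]

def shuffle_blocks(peaks, block_size, rng):
    """Block shuffle: permute contiguous runs of intensities, keeping the
    order of peaks inside each run."""
    ordered = sorted(peaks)
    intensities = [intensity for _, intensity in ordered]
    blocks = [intensities[k:k + block_size] for k in range(0, len(intensities), block_size)]
    rng.shuffle(blocks)
    flat = [intensity for block in blocks for intensity in block]
    return [(mz, intensity) for (mz, _), intensity in zip(ordered, flat)]

# Toy spectrum and a seeded generator so the result is repeatable.
spectrum = [(100.0, 1.0), (200.0, 2.0), (300.0, 3.0), (400.0, 4.0), (500.0, 5.0), (600.0, 6.0)]
rng = random.Random(0)
blind_decoy = shuffle_peaks(spectrum, rng)
block_decoy = shuffle_blocks(spectrum, block_size=2, rng=rng)
```

Both variants preserve the set of m/z values and the set of intensities, so they differ only in how much local peak structure survives.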

  35. Parameter Test Results Processing Order: N noise first I intensity first Intensity Adjustments: BIN bin peaks, divide by max per bin MZ weight peak intensity by m/z SQ square root of intensity Noise Reduction: T top n peaks used C top 50% of peak intensity
