Statistical calibration of MS/MS spectrum library search scores

Statistical calibration of MS/MS spectrum library search scores Barbara Frewen January 10, 2011 University of Washington

Protein identification • Protein mixture (proteins: B0205.7 casein kinase; C29A12.3a lig-1 DNA ligase; C29E6.1a mucin like protein; …) • Digestion to peptides • Peptides: EYWDYEAHMIEWGQIDDYQLVR, GGTNIITLLDVVK, VVVFLFDLLYFNGEPLV, YQTTGQVQYSCLVR, LIVVNSEDQLR, HPLISLLLLIAFYSTSSEAFVPK, …

Acquiring MS/MS spectra • Cell lysis • Isolate proteins • Digest to peptides • Load onto column • µLC/µLC • MS • MS/MS

Which proteins are in my sample? • Protein mixture (proteins: B0205.7 casein kinase; C29A12.3a lig-1 DNA ligase; C29E6.1a mucin like protein; …) • Digestion to peptides • Peptides: EYWDYEAHMIEWGQIDDYQLVR, GGTNIITLLDVVK, VVVFLFDLLYFNGEPLV, YQTTGQVQYSCLVR, LIVVNSEDQLR, HPLISLLLLIAFYSTSSEAFVPK, …

Matching a spectrum to a peptide sequence • De novo: infer the peptide sequence from the m/z of observed peaks • Database search: compare observed peaks to predicted peaks for each peptide from a list of candidate sequences • Library search: compare observed peaks to known spectra

Building a spectrum library • Ideally, infuse synthesized peptides • ISB has gold standard spectra from five peptides per protein in human • University of Washington (MacCoss) will have spectra from 790 transcription factors and 350 kinases • Alternatively, use high-quality peptide-spectrum matches from shotgun proteomics experiments • BiblioSpec now parses search results from SEQUEST, Mascot, X! Tandem, ProteinPilot, Scaffold

Library file formats

Using a spectrum library • Spectrum identification via library searching • Resource for designing SRM directed experiments • Compact, unified format for compiling results and sharing between labs

Searching a spectrum library SEQUEST BiblioSpec Peptide ID list Scan1 0.7 EGSSDEEVP… Scan1 0.3 TFAEILNPI… Scan1 0.2 ARFDLNNHD… ------------------- Scan2 0.5 EDEESIRAV… Scan2 0.2 WLGDDCFMV… Scan2 0.1 IDRAAWKAV… ------------------- Scan3 0.2 EITTRDMGN… Scan3 0.1 GRNMCTAKL… m/z 594.2 score = 0.2 MS/MS query spectra 2 GDTIENFK 300.4 1 CGCCLYNT 522.3 2 FMACSDEK 593.9 3 QWDKEPPR 765.1 Library of identified spectra 3 NGISLTIVR 940.4
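
The search pictured on this slide can be sketched roughly as follows: library entries whose precursor m/z is close to the query's are selected as candidates, and each candidate is scored by spectrum similarity. This is a minimal sketch under assumptions, not BiblioSpec's implementation: the binned cosine (normalized dot product) score, the 1.0 m/z bin width, the 3.0 m/z precursor tolerance, and all function and field names are illustrative choices.

```python
import math
from collections import defaultdict


def bin_peaks(peaks, bin_width=1.0):
    """Sum intensities into fixed-width m/z bins; returns {bin_index: intensity}."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz / bin_width)] += intensity
    return binned


def cosine_similarity(peaks_a, peaks_b, bin_width=1.0):
    """Normalized dot product of two binned spectra (assumed score, range 0..1)."""
    a = bin_peaks(peaks_a, bin_width)
    b = bin_peaks(peaks_b, bin_width)
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def search_library(query, library, precursor_tol=3.0):
    """Return (score, sequence) pairs for candidates, best score first.

    `query` and each library entry are dicts with hypothetical keys
    "precursor_mz" and "peaks"; library entries also carry "sequence".
    """
    candidates = [
        entry
        for entry in library
        if abs(entry["precursor_mz"] - query["precursor_mz"]) <= precursor_tol
    ]
    scored = [
        (cosine_similarity(query["peaks"], entry["peaks"]), entry["sequence"])
        for entry in candidates
    ]
    return sorted(scored, reverse=True)


library = [
    {"sequence": "GDTIENFK", "precursor_mz": 300.4, "peaks": [(100.0, 1.0)]},
    {"sequence": "FMACSDEK", "precursor_mz": 593.9,
     "peaks": [(100.0, 1.0), (200.0, 2.0)]},
]
query = {"precursor_mz": 594.2, "peaks": [(100.0, 1.0), (200.0, 2.0)]}
results = search_library(query, library)
```

Only the entry within the precursor tolerance is scored, which mirrors the slide's query at m/z 594.2 being compared against the library entry at 593.9.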

Comparing library and database search • Created a large library of spectra from worm peptides • Identified a different set of spectra using both library and database search • Compared BiblioSpec results with SEQUEST results to evaluate performance • Columns: spectrum score, library match, SEQUEST match, agree? • 0.17 AFEQWK LVVAMK NO (false positive) • 0.83 DLAVER DLAVER YES (true positive) • …

Similarity score discriminates between correct and incorrect matches • Figure: histogram of search scores for matches that agree and disagree • Figure: ROC and 1% ROC curve, AUC = 0.978

BiblioSpec and SEQUEST results agree • BiblioSpec found 91% of SEQUEST IDs • Two reasons BiblioSpec and SEQUEST disagree: • Query ion not in library • BiblioSpec found a different peptide to be more similar • Only 7% of the query spectra not correctly identified were in the library. Most disagreed because the correct match was not in the library.

Compute p-values to evaluate results • The BiblioSpec search score provides good discrimination • But it’s unclear where to place a threshold between correct and incorrect matches • Use statistical methods to estimate the probability that a match is incorrect and to estimate the fraction of incorrect matches above a score threshold.
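
One simple way to estimate the fraction of incorrect matches above a score threshold is sketched below. It assumes a set of scores from matches known to be incorrect (for example, decoys) is available and that there are as many of them as real matches; the function name and the counting estimator are illustrative, not the method used in the talk.

```python
def estimate_fdr(target_scores, decoy_scores, threshold):
    """Estimated fraction of incorrect matches among real matches >= threshold.

    Assumption: the number of known-incorrect scores at or above the
    threshold approximates the number of incorrect real matches there.
    """
    n_target = sum(1 for s in target_scores if s >= threshold)
    n_decoy = sum(1 for s in decoy_scores if s >= threshold)
    if n_target == 0:
        return 0.0  # nothing accepted, so nothing can be wrong
    return min(1.0, n_decoy / n_target)


targets = [0.9, 0.8, 0.7, 0.3, 0.2]
decoys = [0.75, 0.3, 0.2, 0.1, 0.1]
fdr_at_07 = estimate_fdr(targets, decoys, threshold=0.7)
```

Here three real matches and one known-incorrect match score at least 0.7, so the estimate is one third.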

How likely is the match incorrect? • Figure: distribution of scores for a spectrum vs. all possible incorrect matches (x-axis: score) • Low score: large area to the right, p-value = 0.4 • High score: small area to the right, p-value = 0.01
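
The "area to the right" idea can be written down as an empirical p-value: the fraction of null (incorrect-match) scores at least as large as the observed score. This is a minimal sketch; the function name and the toy null scores are assumptions made for illustration.

```python
def empirical_p_value(score, null_scores):
    """Fraction of null scores that are >= the observed score."""
    if not null_scores:
        raise ValueError("need at least one null score")
    n_at_least = sum(1 for s in null_scores if s >= score)
    return n_at_least / len(null_scores)


# Toy null distribution: ten scores from matches known to be incorrect.
null_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
p_low = empirical_p_value(0.65, null_scores)   # low score, large area to the right
p_high = empirical_p_value(0.95, null_scores)  # high score, small area to the right
```

With only ten null scores the smallest nonzero p-value is 0.1; a larger null sample is needed to resolve small p-values such as 0.01.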

Estimating the null distribution • Representative sample of scores from incorrect matches • Guarantee they are incorrect by using decoys • In database searching, scores from decoy peptides are used to estimate the null distribution • How can we create decoy spectra?

Generate decoy spectra by shifting the m/z of the peaks • Requirements: • fast to generate • sequence agnostic • representative scores • Evaluation: • score distributions mimic real spectra • generate a data set of incorrect matches to real spectra • Figure: real spectrum and decoy spectrum
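
A circular m/z shift can be sketched as below: every peak moves up by a fixed amount, and peaks pushed past the top of the m/z range wrap around to the bottom. The shift size, the range bounds, and the function name are assumptions for illustration; the talk does not specify them.

```python
def circular_shift_decoy(peaks, shift, min_mz, max_mz):
    """Shift each peak's m/z by `shift`, wrapping around within [min_mz, max_mz).

    `peaks` is a list of (mz, intensity) pairs assumed to lie inside the range.
    Intensities are unchanged, so the decoy keeps the real spectrum's
    intensity distribution and needs no peptide sequence.
    """
    width = max_mz - min_mz
    shifted = [
        (min_mz + (mz - min_mz + shift) % width, intensity)
        for mz, intensity in peaks
    ]
    return sorted(shifted)  # keep peaks ordered by m/z


real = [(100.0, 5.0), (450.0, 2.0)]
decoy = circular_shift_decoy(real, shift=100.0, min_mz=50.0, max_mz=500.0)
```

In this toy case the peak at 100.0 moves to 200.0, while the peak at 450.0 wraps around to 100.0.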

Circularly shifted peaks are similar to real spectra

Percolator computes p-values • Semi-supervised machine learning to classify correct versus incorrect matches • Trains with high-scoring real matches vs. decoy matches • Classifies all real matches using that model • http://per-colator.com • Käll et al. 2007 Nature Methods • Käll et al. 2008 Bioinformatics

Evaluate p-values • Compute p-values for incorrect matches to real spectra • Percolator p-values should correspond with rank-based p-values

ID            Percolator    rank   rank/n
745AF_8518    0.000230787   1      1/n
691AF_10025   0.000461467   2      2/n
691AF_10107   0.000692201   3      3/n
691AF_10301   0.000922934   4      4/n
...           ...           ...    ...
691AF_5048    0.001153669   12     12/n
...           ...           ...    ...
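
The rank-based check can be sketched as follows: sort the calculated p-values of matches known to be incorrect, assign rank/n to each, and compare the two columns. Well-calibrated p-values should track rank/n closely. The function names and the toy values are assumptions; the deviation summary is one possible way to compare the columns, not necessarily the one used in the talk.

```python
def rank_p_values(calculated):
    """Pair each calculated p-value with its rank-based p-value, rank / n."""
    n = len(calculated)
    return [(p, rank / n) for rank, p in enumerate(sorted(calculated), start=1)]


def max_deviation(pairs):
    """Largest absolute gap between calculated and rank-based p-values."""
    return max(abs(calc - ranked) for calc, ranked in pairs)


# Toy p-values for four incorrect matches (perfectly calibrated by design).
pairs = rank_p_values([0.5, 0.25, 1.0, 0.75])
worst = max_deviation(pairs)
```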

Calibrating p-values • Figure: calculated p-value vs. rank p-value

Better discrimination with p-values • Percolator combines: • search score • delta m/z • delta search score • charge • peptide length • candidates • copies in library • Figure axes: precision (tp / (tp + fp)) and recall (tp / (tp + fn))
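
Precision and recall at a given p-value cutoff can be computed as in this sketch. It assumes each match carries a p-value and a known correct/incorrect label (for instance, from agreement with a reference search); the function name, the cutoffs, and the toy data are illustrative.

```python
def precision_recall(matches, p_cutoff):
    """Precision and recall when matches with p-value <= p_cutoff are accepted.

    `matches` is a list of (p_value, is_correct) pairs.
    precision = tp / (tp + fp), recall = tp / (tp + fn)
    """
    tp = sum(1 for p, ok in matches if p <= p_cutoff and ok)
    fp = sum(1 for p, ok in matches if p <= p_cutoff and not ok)
    fn = sum(1 for p, ok in matches if p > p_cutoff and ok)
    precision = tp / (tp + fp) if (tp + fp) else 1.0  # nothing accepted
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # no correct matches exist
    return precision, recall


matches = [(0.001, True), (0.002, True), (0.02, False), (0.03, True), (0.5, False)]
strict = precision_recall(matches, p_cutoff=0.01)
loose = precision_recall(matches, p_cutoff=0.05)
```

Sweeping the cutoff over all observed p-values traces out a precision-recall curve.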

p-values distinguish between correct and incorrect matches • Figure axes: precision (tp / (tp + fp)) and recall (tp / (tp + fn))

p-values provide a universal metric for comparing to other search results • Spectra go to library search • High-scoring matches go to the compiled results • Low-scoring spectra go to database search • High-scoring database search matches also go to the compiled results

Acknowledgements MacCoss lab Jesse Canterbury Michael Bereman Jarrett Egertson Greg Finney Eileen Heimer Edward Hsieh Alana Killeen Brendan MacLean Gennifer Merrihew Daniela Tomazela Mike MacCoss Bill Noble

Number of real matches above a fixed q-value

Percolator distinguishes between correct and incorrect matches

Spectrum-sequence assignments • Columns: spectrum score, library match, SEQUEST match, agree? • 0.17 AFEQWK LVVAMK NO (false positive) • 0.83 DLAVER DLAVER YES (true positive) • …

Test procedure • Library source: MS/MS spectra from whole worm lysate, 4 fractionation methods, 31 MuDPITs, 6,634,874 spectra • SEQUEST and DTASelect: list of spectrum-sequence pairs (file, scan, seq: run1.ms2 404 DALLQW…; run1.ms2 651 PJAMVM…; run5.ms2 924 SAITTY…; …), 366,400 spectra, estimated 51 false positives • BlibBuild: library with multiple spectra per peptide • BlibFilter: filtered library • Statistics: 26,708 spectra, 21,264 sequences, 3,573 proteins • Query spectra: unfractionated worm, one MuDPIT, 220,845 spectra; similar DTASelect criteria, 14,926 spectra, 5,358 ions • BlibSearch: peptide ID list (Scan1 0.7 EGSSDEEVP…, 0.3 TFAEILNPI…, 0.2 ARFDLNNHD…; Scan2 0.5 EDEESIRAV…, 0.2 WLGDDCFMV…, 0.1 IDRAAWKAV…; Scan3 0.2 EITTRDMGN…, 0.1 GRNMCTAKL…)

Optimize processing parameters • Noise removal • a fixed number of peaks • a fixed fraction of the total intensity • all peaks above a defined noise level • Intensity normalization • log transform • bin peaks, divide by base peak in each bin • square root of intensity • square root weighted by peak m/z 100
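
A few of the processing options listed above can be sketched as small functions: keeping a fixed number of peaks, keeping the most intense peaks up to a fixed fraction of total intensity, and taking the square root of intensity. The function names and the toy spectrum are assumptions, and the exact definitions used in the parameter tests may differ.

```python
import math


def keep_top_n(peaks, n):
    """Noise removal: keep the n most intense (mz, intensity) peaks."""
    top = sorted(peaks, key=lambda p: p[1], reverse=True)[:n]
    return sorted(top)  # restore m/z order


def keep_top_intensity_fraction(peaks, fraction):
    """Noise removal: keep the most intense peaks until their summed
    intensity reaches `fraction` of the spectrum's total intensity."""
    target = fraction * sum(intensity for _, intensity in peaks)
    kept, running = [], 0.0
    for mz, intensity in sorted(peaks, key=lambda p: p[1], reverse=True):
        if running >= target:
            break
        kept.append((mz, intensity))
        running += intensity
    return sorted(kept)


def sqrt_intensity(peaks):
    """Intensity normalization: replace each intensity by its square root."""
    return [(mz, math.sqrt(intensity)) for mz, intensity in peaks]


spectrum = [(100.0, 4.0), (200.0, 16.0), (300.0, 1.0), (400.0, 9.0)]
# "Noise first" ordering: remove noise, then adjust intensities.
noise_first = sqrt_intensity(keep_top_n(spectrum, 2))
half_intensity = keep_top_intensity_fraction(spectrum, 0.5)
```

Swapping the order of the two calls gives the "intensity first" variant, which matters for methods where the noise criterion depends on the adjusted intensities.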

Uses of Spectrum Libraries • A basis for spectrum identification via spectrum-spectrum searches • A reference for designing SRM experiments • Skyline • A repository for spectrum identifications • A unified format for consolidating results, sharing with other labs

Spectrum shuffling techniques • Blindly shuffle peaks • Shuffle blocks of peaks • Shift peaks circularly • Identify fragment ions from peptides, shuffle sequence and move peaks accordingly

Parameter Test Results Processing Order: N noise first I intensity first Intensity Adjustments: BIN bin peaks, divide by max per bin MZ weight peak intensity by m/z SQ square root of intensity Noise Reduction: T top n peaks used C top 50% of peak intensity
