Protein Identification by Sequence Database Search

Nathan Edwards

Department of Biochemistry and Mol. & Cell. Biology

Georgetown University Medical Center


Peptide Mass Fingerprint

[Workflow figures: cut out the 2D-gel spot; trypsin digest; MS]


Peptide Mass Fingerprint

  • Trypsin: digestion enzyme

    • Highly specific

    • Cuts after K & R except if followed by P

  • Protein sequence from sequence database

    • In silico digest

    • Mass computation

  • For each protein sequence in turn:

    • Compare computer generated masses with observed spectrum
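The comparison step above can be sketched as a simple match count. This is only an illustration, not the scoring used by any particular search engine: the function names, the 0.2 Da tolerance, the observed peak list, and the "hypothetical_protein" entry are all assumptions. The myoglobin masses are a subset of the values on the Peptide Masses slide.

```python
# Sketch of peptide-mass-fingerprint scoring: for each protein, count how many
# of its theoretical peptide masses have an observed peak within a tolerance.

def count_matches(theoretical, observed, tol=0.2):
    """Number of theoretical masses with an observed peak within +/- tol."""
    return sum(
        any(abs(t - o) <= tol for o in observed)
        for t in theoretical
    )

def rank_proteins(database, observed, tol=0.2):
    """Return (protein, match count) pairs, highest count first."""
    scores = {name: count_matches(masses, observed, tol)
              for name, masses in database.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Theoretical peptide masses per protein. The second protein is invented
# purely for contrast.
database = {
    "myoglobin": [1606.85, 1271.66, 1378.83, 1982.05,
                  1853.95, 1502.66, 748.43],
    "hypothetical_protein": [900.50, 1271.70, 1420.75, 2100.10],
}

# Invented observed peak list for the example.
observed_peaks = [748.45, 1271.65, 1378.80, 1502.70, 1606.90, 1853.90]

ranking = rank_proteins(database, observed_peaks)
print(ranking)
```

A raw match count favors large proteins with many peptides, which is one reason real tools convert counts into a statistical score.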


Protein Sequence

  • Myoglobin GLSDGEWQQV LNVWGKVEAD IAGHGQEVLI RLFTGHPETL EKFDKFKHLK TEAEMKASED LKKHGTVVLT ALGGILKKKG HHEAELKPLA QSHATKHKIP IKYLEFISDA IIHVLHSKHP GDFGADAQGA MTKALELFRN DIAAKYKELG FQG



Peptide Mass & m/z

  • Peptide Molecular Weight: N-terminal-mass (0.00) + Sum (AA residue masses) + C-terminal-mass (18.010565)

  • Observed Peptide m/z: (Peptide Molecular Weight + z * Proton-mass (1.007276)) / z

  • Monoisotopic mass values!


Peptide Masses

1815.90 GLSDGEWQQVLNVWGK

1606.85 VEADIAGHGQEVLIR

1271.66 LFTGHPETLEK

1378.83 HGTVVLTALGGILK

1982.05 KGHHEAELKPLAQSHATK

1853.95 GHHEAELKPLAQSHATK

1885.02 YLEFISDAIIHVLHSK

1502.66 HPGDFGADAQGAMTK

748.43 ALELFR
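The mass and m/z formulas can be checked against the list above with a short sketch. It assumes standard monoisotopic residue masses, treats the terminal groups as one water, and uses the proton mass for the charge carrier. The function names are illustrative.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # H on the N-terminus plus OH on the C-terminus
PROTON = 1.007276   # mass of one proton (the charge carrier)

def peptide_mass(seq):
    """Neutral monoisotopic molecular weight of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER

def peptide_mz(seq, z=1):
    """m/z of the peptide carrying z protons."""
    return (peptide_mass(seq) + z * PROTON) / z

for pep in ("ALELFR", "LFTGHPETLEK"):
    print(pep, round(peptide_mz(pep, 1), 3), round(peptide_mz(pep, 2), 3))
```

With these constants, singly charged ALELFR and LFTGHPETLEK come out within 0.01 of the listed 748.43 and 1271.66, so the listed values behave like singly protonated m/z values.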


Peptide Mass Fingerprint

[Figure: spectrum peaks labeled with the peptides YLEFISDAIIHVLHSK, GHHEAELKPLAQSHATK, GLSDGEWQQVLNVWGK, HPGDFGADAQGAMTK, HGTVVLTALGGILK, VEADIAGHGQEVLIR, KGHHEAELKPLAQSHATK, ALELFR, LFTGHPETLEK]


Sample Preparation for Tandem Mass Spectrometry

Enzymatic digest and fractionation




Peptide Fragmentation

Peptides consist of amino acids arranged in a linear backbone.

N-terminus   H…-HN-CH(R[i-1])-CO-NH-CH(R[i])-CO-NH-CH(R[i+1])-CO-…OH   C-terminus

[Diagram labels: AA residue i-1, AA residue i, AA residue i+1, with side chains R[i-1], R[i], R[i+1]]



Peptide Fragmentation

-HN-CH(R[i])-CO-NH-CH(R[i+1])-CO-NH-

[Diagram: fragment labels b[i] and y[n-i] at one backbone cleavage site, b[i+1] and y[n-i-1] at the next]


Peptide Fragmentation

-HN-CH(R[i])-CO-NH-CH(R[i+1])-CO-NH-

[Diagram: N-terminal fragment ions a[i], b[i], c[i] and b[i+1]; C-terminal fragment ions x[n-i], y[n-i], z[n-i] and y[n-i-1]; side-chain labels CH-R’ and R”]


Peptide Fragmentation

Peptide: S-G-F-L-E-E-D-E-L-K


Peptide Fragmentation

b ions     88   145   292   405   534   663   778   907  1020  1166
Residue     S     G     F     L     E     E     D     E     L     K
y ions   1166  1080  1022   875   762   633   504   389   260   147

[Figure: MS/MS spectrum, % intensity (0 to 100) versus m/z (250 to 1000), with labeled peaks b3 to b9 and y2 to y9]
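The b- and y-ion ladders for S-G-F-L-E-E-D-E-L-K can be recomputed with a minimal sketch. It assumes singly charged fragments and standard monoisotopic residue masses. The function names are illustrative, and only the residues in this peptide are tabulated.

```python
# Monoisotopic residue masses (Da) for the residues in the example peptide.
RESIDUE_MASS = {
    "S": 87.03203, "G": 57.02146, "F": 147.06841, "L": 113.08406,
    "E": 129.04259, "D": 115.02694, "K": 128.09496,
}
WATER = 18.010565
PROTON = 1.007276

def b_ions(seq):
    """Singly charged b-ion m/z values b1..b(n-1): N-terminal prefixes."""
    ions, total = [], 0.0
    for aa in seq[:-1]:
        total += RESIDUE_MASS[aa]
        ions.append(total + PROTON)
    return ions

def y_ions(seq):
    """Singly charged y-ion m/z values y1..y(n-1): C-terminal suffixes."""
    ions, total = [], WATER   # y ions keep the C-terminal OH plus an extra H
    for aa in reversed(seq[1:]):
        total += RESIDUE_MASS[aa]
        ions.append(total + PROTON)
    return ions

peptide = "SGFLEEDELK"
b = b_ions(peptide)
y = y_ions(peptide)
print("b:", [round(m, 2) for m in b])
print("y:", [round(m, 2) for m in y])
```

The computed values agree with the slide's integer labels to within 1 Da. The slide's final entry in each row (1166) corresponds to the intact singly protonated peptide rather than a fragment, so it is not produced here.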


Peptide Identification

Given:

  • The mass of the precursor ion, and

  • The MS/MS spectrum

Output:

  • The amino-acid sequence of the peptide


Sequence Database Search

[Figure sequence: (1) the observed MS/MS spectrum, % intensity (0 to 100) versus m/z (250 to 1000), with the candidate peptide S-G-F-L-E-E-D-E-L-K; (2) the candidate's b ions (88, 145, 292, 405, 534, 663, 778, 907, 1020, 1166) and y ions (1166, 1080, 1022, 875, 762, 633, 504, 389, 260, 147) laid over the spectrum; (3) the matched peaks labeled b3 to b9 and y2 to y9]


Sequence Database Search

  • No need for complete ladders

  • Possible to model all known peptide fragments

  • Sequence permutations eliminated

  • All candidates have some biological relevance

  • Practical for high-throughput peptide identification

  • Correct peptide might be missing from database!


Peptide Candidate Filtering

  • Digestion Enzyme: Trypsin

  • Cuts just after K or R unless followed by a P.

  • Basic residues (K & R) at C-terminal attract ionizing charge, leading to strong y-ions

  • “Average” peptide length about 10-15 amino-acids

  • Must allow for “missed” cleavage sites


Peptide Candidate Filtering

>ALBU_HUMAN MKWVTFISLLFLFSSAYSRGVFRRDAHKSEVAHRFKDLGEENFKALVLIAFAQYLQQCPFEDHVKLVNEVTEFAK…

No missed cleavage sites

MK

WVTFISLLFLFSSAYSR

GVFR

R

DAHK

SEVAHR

FK

DLGEENFK

ALVLIAFAQYLQQCPFEDHVK

LVNEVTEFAK


Peptide Candidate Filtering

>ALBU_HUMAN MKWVTFISLLFLFSSAYSRGVFRRDAHKSEVAHRFKDLGEENFKALVLIAFAQYLQQCPFEDHVKLVNEVTEFAK…

One missed cleavage site

MKWVTFISLLFLFSSAYSR

WVTFISLLFLFSSAYSRGVFR

GVFRR

RDAHK

DAHKSEVAHR

SEVAHRFK

FKDLGEENFK

DLGEENFKALVLIAFAQYLQQCPFEDHVK

ALVLIAFAQYLQQCPFEDHVKLVNEVTEFAK
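Both peptide lists for the ALBU_HUMAN fragment can be generated by an in silico digest. The sketch below assumes the trypsin rule stated earlier (cut after K or R unless followed by P). The function names and the `missed_cleavages` parameter are illustrative.

```python
def cleavage_pieces(protein):
    """Split after every K or R that is not followed by P."""
    pieces, start = [], 0
    for i, aa in enumerate(protein):
        nxt = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and nxt != "P":
            pieces.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):          # trailing piece without K/R
        pieces.append(protein[start:])
    return pieces

def tryptic_digest(protein, missed_cleavages=0):
    """All peptides spanning up to `missed_cleavages` skipped cut sites."""
    pieces = cleavage_pieces(protein)
    peptides = []
    for i in range(len(pieces)):
        last = min(i + missed_cleavages, len(pieces) - 1)
        for j in range(i, last + 1):
            peptides.append("".join(pieces[i:j + 1]))
    return peptides

# The ALBU_HUMAN fragment shown on the slide (the slide truncates it with "…").
albumin = ("MKWVTFISLLFLFSSAYSRGVFRRDAHKSEVAHRFKDLGEENFK"
           "ALVLIAFAQYLQQCPFEDHVKLVNEVTEFAK")

no_missed = tryptic_digest(albumin, 0)
up_to_one = tryptic_digest(albumin, 1)
one_missed = [p for p in up_to_one if p not in no_missed]
print(no_missed)
print(one_missed)
```

Each extra allowed missed cleavage adds roughly one more peptide per cleavage site, so the candidate list grows linearly with this setting.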


Peptide Candidate Filtering

  • Peptide molecular weight

  • Only have m/z value

    • Need to determine charge state

  • Ion selection tolerance

  • Mass for each amino-acid symbol?

    • Monoisotopic vs. Average

    • “Default” residue mass

    • Depends on sample preparation protocol

    • Cysteine almost always modified
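Because only the precursor m/z is observed, one common approach is to try each plausible charge state and keep candidates whose molecular weight falls within the tolerance. The sketch below illustrates this. The charge range, the 0.5 Da tolerance, the observed m/z, and the function names are assumptions; the candidate masses are neutral monoisotopic values computed from standard residue masses.

```python
PROTON = 1.007276

def neutral_mass(mz, z):
    """Neutral molecular weight implied by an m/z value at charge z."""
    return z * (mz - PROTON)

def candidates_for(mz, candidate_masses, charges=(1, 2, 3), tol=0.5):
    """(peptide, charge) pairs whose mass is within tol of the implied mass."""
    hits = []
    for z in charges:
        implied = neutral_mass(mz, z)
        for peptide, mass in candidate_masses.items():
            if abs(mass - implied) <= tol:
                hits.append((peptide, z))
    return hits

# Neutral monoisotopic masses (Da) of three unmodified myoglobin peptides.
candidate_masses = {
    "ALELFR": 747.4279,
    "LFTGHPETLEK": 1270.6557,
    "HGTVVLTALGGILK": 1377.8340,
}

# Invented precursor m/z for the example.
hits = candidates_for(636.335, candidate_masses)
print(hits)
```

Here only the doubly charged interpretation finds a candidate. With a wider tolerance or a larger database, several charge states can produce hits, and all of them must be scored.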


Peptide Molecular Weight

[Figure: isotope peaks of the same peptide, labeled i = 0 to i = 4, where i = # of C13 isotopes]


Peptide Molecular Weight

…from “Isotopes” – An IonSource.Com Tutorial


Peptide Molecular Weight

  • Peptide sequence WVTFISLLFLFSSAYSR

  • Potential phosphorylation? S,T,Y + 80 Da

  • 7 Molecular Weights

  • 64 “Peptides”
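The counts of 7 molecular weights and 64 "peptides" follow from enumerating every on/off combination of the phosphorylation sites. The sketch below is illustrative only: the names are assumptions, and it uses the monoisotopic phosphate shift of 79.96633 Da where the slide rounds to 80 Da.

```python
from itertools import product

PHOSPHO = 79.96633   # monoisotopic mass added per phosphorylation (HPO3)

def phospho_variants(seq, sites="STY"):
    """Return the candidate site positions and every subset of modified sites."""
    positions = [i for i, aa in enumerate(seq) if aa in sites]
    variants = []
    for mask in product([False, True], repeat=len(positions)):
        variants.append(frozenset(p for p, on in zip(positions, mask) if on))
    return positions, variants

positions, variants = phospho_variants("WVTFISLLFLFSSAYSR")
n_forms = len(variants)                        # distinct modified "peptides"
site_counts = sorted({len(v) for v in variants})
mass_shifts = [round(n * PHOSPHO, 3) for n in site_counts]

print(len(positions), "sites,", n_forms, "forms,", len(mass_shifts), "weights")
print(mass_shifts)
```

With k candidate sites there are 2^k forms but only k + 1 distinct masses, which is why variable modifications inflate the number of candidates to score far faster than the number of precursor masses to check.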


Peptide Scoring

  • Peptide fragments vary based on

    • The instrument

    • The peptide’s amino-acid sequence

    • The peptide’s charge state

    • Etc…

  • Search engines model peptide fragmentation to various degrees.

    • Speed vs. sensitivity tradeoff

    • y-ions & b-ions occur most frequently

  • The scores have no a priori “scale”


Peptide Identification

  • High-throughput workflows demand we analyze all spectra, all the time.

  • Spectra may not contain enough information to be interpreted correctly

    • ...cell phone call drops in and out

  • Spectra may contain too many irrelevant peaks

    • …bad static

  • Peptides may not match our assumptions

    • …it’s all Greek to me

  • “Don’t know” is an acceptable answer!


Peptide Identification

  • Rank the best peptide identifications

  • Is the top ranked peptide correct?


Peptide Identification

  • Incorrect peptide has best score

    • Correct peptide is missing?

    • Potential for incorrect conclusion

    • What score ensures no incorrect peptides?

  • Correct peptide has weak score

    • Insufficient fragmentation, poor score

    • Potential for weakened conclusion

    • What score ensures we find all correct peptides?


Statistical Significance

  • Can’t prove particular identifications are right or wrong...

    • ...need to know fragmentation in advance!

  • A minimal standard for identification scores...

    • ...better than guessing.

    • p-value, E-value, statistical significance


Random Peptide Models

  • "Generate" random peptides

    • Real looking fragment masses

    • No theoretical model!

    • Must use empirical distribution

    • Usually require they have the correct precursor mass

  • Score function can model anything we like!
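With an empirical distribution of random-peptide scores, significance reduces to counting. The sketch below is a minimal illustration under stated assumptions: the +1 correction (which keeps the estimate above zero) and the E-value taken as p-value times the number of candidates are common conventions rather than anything the slides specify, and the scores are invented.

```python
def empirical_p_value(observed_score, random_scores):
    """Estimated chance that a random peptide scores at least this well."""
    n_as_good = sum(s >= observed_score for s in random_scores)
    return (n_as_good + 1) / (len(random_scores) + 1)

def e_value(p_value, n_candidates):
    """Expected number of random candidates scoring at least this well."""
    return p_value * n_candidates

# Invented random-peptide scores: 1, 2, ..., 99.
random_scores = list(range(1, 100))

p = empirical_p_value(95, random_scores)
e = e_value(p, n_candidates=50)
print(p, e)
```

The smallest p-value this estimator can report is 1 / (N + 1) for N random scores, so very small p-values need either a very large sample or an extrapolated tail.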


Random Peptide Models

[Figures: Fenyo & Beavis, Anal. Chem., 2003]


Random Peptide Models

  • Truly random peptides don’t look much like real peptides

    • Just use (incorrect) peptides from the sequence database!

  • Caveats:

    • Correct peptide (non-random) may be included

    • Homologous incorrect peptides may be included

    • (Incorrect) peptides are not independent


Extrapolating from the Empirical Distribution

  • Often, the empirical shape is consistent with a theoretical model

Geer et al., J. Proteome Research, 2004

Fenyo & Beavis, Anal. Chem., 2003
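One way to extrapolate beyond the sampled scores is to fit a simple model to the empirical survival function and evaluate it at the observed score. The sketch below assumes an approximately exponential tail, so that log10 of the survival fraction is linear in score. That model choice, the function names, and the data are assumptions for illustration, not a reproduction of the cited methods.

```python
import math

def fit_log_survival(scores):
    """Least-squares line through (score, log10 fraction of scores >= score)."""
    n = len(scores)
    xs = sorted(set(scores))
    ys = [math.log10(sum(s >= x for s in scores) / n) for x in xs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

def extrapolated_p_value(score, slope, intercept):
    """Survival fraction predicted by the fitted line, capped at 1."""
    return min(1.0, 10 ** (slope * score + intercept))

# Invented scores of incorrect candidates; the survival fraction halves
# with each unit of score, so the log-linear fit is exact here.
incorrect_scores = [0] * 8 + [1] * 4 + [2] * 2 + [3] + [4]

slope, intercept = fit_log_survival(incorrect_scores)
p_at_10 = extrapolated_p_value(10, slope, intercept)
print(slope, intercept, p_at_10)
```

Real score histograms are noisier, so the fit is usually restricted to the high-scoring tail. The extrapolated value is only as good as the assumed shape.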


False Positive Rate Estimation

  • A form of statistical significance

  • Search engine independent

    • Easy to implement

  • Assumes a single threshold for all spectra

    • Best if E-value or similar is used to compute a spectrum normalized score


False Positive Rate Estimation

  • Each spectrum is a chance to be right, wrong, or inconclusive.

    • At any given threshold, how many peptide identifications are wrong?

    • Computed for an entire spectral dataset

  • Given identification criteria:

    • SEQUEST Xcorr, E-value, Score, etc., plus...

    • ...threshold

  • Use “decoy” sequences

    • random, reverse, cross-species

    • Identifications must be incorrect!


Decoy Search Strategies

  • Concatenated target & decoy

    • “Competition” for best hit...

    • Masks good decoy scores due to spectral variation

  • Separate searches

    • Cleaner estimation of false hit distribution

    • More conservative than concatenation

  • Must ensure:

    • Decoy searches do not change target peptide scores

    • Single score distribution across dataset


Decoy Search Strategies

  • Reversed Decoys

    • Captures redundancy of peptide sequences

    • Susceptible to mass-shift anomalies

    • Bad choice for protein-level statistics

  • Shuffled & Random Decoys

    • Multiple independent decoys can be created.

    • Better estimation of tail probabilities

    • More conservative than reversed decoys


False Positive Rate Estimation: Concatenated Target & Decoy

  • Choose a threshold t.

  • Count # of (rank 1) target ids (Tt) with score ≥t.

  • Count # of (rank 1) decoy ids (Dt) with score ≥t.

  • Compute FPR = ( 2 x Dt ) / ( Tt + Dt )

    Principle:

  • Decoy peptides equally likely as false hits at rank 1

    Issues:

  • What to do with decoy hits?

  • Change in database size may affect scores
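The concatenated target-decoy estimate amounts to two counts and one ratio. In the sketch below, the data layout (a list of `(score, is_decoy)` pairs for rank-1 hits), the function name, and the scores are assumptions for illustration.

```python
def fpr_concatenated(rank1_hits, threshold):
    """FPR = 2 * Dt / (Tt + Dt) for rank-1 hits from one concatenated search.

    rank1_hits: iterable of (score, is_decoy) pairs, one per spectrum.
    """
    tt = sum(1 for score, is_decoy in rank1_hits
             if score >= threshold and not is_decoy)
    dt = sum(1 for score, is_decoy in rank1_hits
             if score >= threshold and is_decoy)
    total = tt + dt
    return (2 * dt) / total if total else 0.0

# Invented rank-1 results: 18 strong target hits, 2 strong decoy hits,
# and 10 weak hits split evenly between target and decoy.
rank1_hits = ([(50, False)] * 18 + [(40, True)] * 2
              + [(10, False)] * 5 + [(12, True)] * 5)

print(fpr_concatenated(rank1_hits, 30))   # Tt = 18, Dt = 2
print(fpr_concatenated(rank1_hits, 0))    # Tt = 23, Dt = 7
```

The factor of 2 reflects the stated principle: if false hits land on decoys and targets equally often, each decoy hit above threshold stands for one undetected false target hit as well.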


False Positive Rate Estimation: Separate Decoy Search

  • Choose a threshold t.

  • Count # of (rank 1) target ids (Tt) with score ≥t.

  • Count # of (rank 1) decoy ids (Dt) with score ≥t.

  • Compute FPR = Dt / Tt

    Principle:

  • Find the distribution of false hit scores, apply to target

    Issues:

  • Can choose to merge after the fact...

  • Decoy search cannot change target scores

  • A few good decoy scores can inflate small FDR values
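For separate searches, the same counts give FPR = Dt / Tt, and the threshold can be chosen as the lowest score that meets a desired rate. In this sketch the function names and scores are invented, and scanning observed target scores for the threshold is just one simple strategy.

```python
def fpr_separate(target_scores, decoy_scores, threshold):
    """FPR = Dt / Tt using rank-1 scores from separate target/decoy searches."""
    tt = sum(s >= threshold for s in target_scores)
    dt = sum(s >= threshold for s in decoy_scores)
    return dt / tt if tt else 0.0

def lowest_threshold(target_scores, decoy_scores, max_fpr):
    """Smallest observed target score whose estimated FPR is <= max_fpr."""
    for t in sorted(set(target_scores)):
        if fpr_separate(target_scores, decoy_scores, t) <= max_fpr:
            return t
    return None

# Invented rank-1 scores, one per spectrum, from each search.
target_scores = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
decoy_scores = [5, 12, 18, 25, 33, 41, 8, 15, 22, 28]

print(fpr_separate(target_scores, decoy_scores, 30))     # Dt = 2, Tt = 8
print(lowest_threshold(target_scores, decoy_scores, 0.15))
```

With few identifications above the threshold, a single high-scoring decoy changes the estimate noticeably, as the last issue above warns.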


Peptide Prophet

  • Re-analysis of SEQUEST results

    • Spectrum-dependent scores (XCorr) +

    • Additional features form a discriminant score

  • Assumes that many of the spectra are not correctly identified

    • These identifications act like decoy hits


Peptide Prophet

Keller et al., Anal. Chem. 2002

Distribution of spectral scores in the results


Peptides to Proteins

Nesvizhskii et al., Anal. Chem. 2003



Peptides to Proteins

  • A peptide sequence may occur in many different protein sequences

    • Variants, paralogues, protein families

  • Separation, digestion, and ionization are not well understood

  • Proteins in sequence databases are extremely non-random and far from independent of one another

  • No great tools for assessing statistical confidence of protein identifications.










Sequence Database Search Traps and Pitfalls

Search options may eliminate the correct peptide

  • Precursor mass tolerance too small

  • Fragment m/z tolerance too small

  • Incorrect precursor ion charge state

  • Non-tryptic or semi-tryptic peptide

  • Incorrect or unexpected modification

  • Sequence database too conservative

  • Unreliable taxonomy annotation


Sequence Database Search Traps and Pitfalls

Search options can cause infinite search times

  • Variable modifications increase search times exponentially

  • Non-tryptic search increases search time by two orders of magnitude

  • Large sequence databases contain many irrelevant peptide candidates


Sequence Database Search Traps and Pitfalls

Best available peptide isn’t necessarily correct!

  • Score statistics (e-values) are essential!

    • What is the chance a peptide could score this well by chance alone?

  • The wrong peptide can look correct if the right peptide is missing!

  • Need scores (or e-values) that are invariant to spectrum quality and peptide properties


Sequence Database Search Traps and Pitfalls

Search engines often make incorrect assumptions about sample prep

  • Proteins with lots of identified peptides are not more likely to be present

  • Peptide identifications do not represent independent observations

  • All proteins are not equally interesting to report


Sequence Database Search Traps and Pitfalls

Good spectral processing can make a big difference

  • Poorly calibrated spectra require large m/z tolerances

  • Poorly baselined spectra make small peaks hard to believe

  • Poorly de-isotoped spectra have extra peaks and misleading charge state assignments


Summary

  • Protein identification from tandem mass spectra is a key proteomics technology.

  • Protein identifications should be treated with healthy skepticism.

    • Look at all the evidence!

  • Spectra remain unidentified for a variety of reasons.


Further Reading

  • Matrix Science (Mascot) Web Site

    • www.matrixscience.com

  • Seattle Proteome Center (ISB)

    • www.proteomecenter.org

  • Proteomic Mass Spectrometry Lab at The Scripps Research Institute

    • fields.scripps.edu

  • UCSF ProteinProspector

    • prospector.ucsf.edu