Mass Spectrometry-Based Methods for Protein Identification

Joseph A. Loo
Department of Biological Chemistry, David Geffen School of Medicine
Department of Chemistry and Biochemistry
University of California, Los Angeles, CA, USA

Genomics and Proteomics: Characterizing many genes and gene products simultaneously

Proteomics Aids Biological Research
[Figure: a complex protein mixture from a biological sample passes through protein separation and mass spectrometry, yielding protein identification, protein modification, and protein abundance.]

Proteomics - What is it?
• An assay to systematically analyze the diverse properties of proteins
• Biological processes are dynamic
• A quantitative comparison of states is required
• The study of protein expression and function on a genome scale
• Purpose: Examine altered gene expression pathways in disease states and under different environmental conditions
The completion of the human genome has provided researchers with the blueprint for life, and proteomics offers scientists the means for analyzing the expressed genome.

Genome to Proteome
dsDNA (gene) → transcription → mRNA → translation → protein (H2N ... COOH) → mass spectrometry
Example protein sequence (numbers mark residue positions):
  1 MTDLKASSLRALKLMDLTTLNDDDTDEKVIALCHQAKTPVGNTAAICIYP
 51 RFIPIARKTLKEQGTPEIRIATVTNFPHGNDDIDIALAETRAAIAYGADE
101 VDVVFPYRALMAGNEQVGFDLVKACKEACAAANVLLKVIIETGELKDEAL
151 IRKASEISI

Approaches for Protein Identification
What is this protein?
• Molecular weight
• Isoelectric point
• Amino acid composition
• Other physical/chemical characteristics
• Partial or complete amino acid sequence
  • Edman (N-terminal sequence), if the N-terminus is not blocked
  • C-terminal sequence, not commonly performed
• Mass spectrometry-measured information

Protein Identification by Mass Spectrometry
Workflow:
• 2-D gel electrophoresis
• Excise separated protein “spots”
• In-gel trypsin digest
• Recover tryptic peptides
• Peptide mass fingerprint by MALDI-TOF or LC-ESI-MS. Additional sequence information can be obtained by MS/MS.
• Protein identification by searching proteomic or genomic databases
[Figure: 2-D gel with MW (x 10^3; markers at 150, 75, 40, 25, 18, and 10) against pI (6.5 to 4.5), and an example peptide mass spectrum over m/z 500 to 3000 with peaks labeled 717, 1089, 1272, 1401, 1547, 1700, 1857, 2384, and 2791.]

Mass Spectrometry: A method to “weigh” molecules
A simple measurement of mass is used to confirm the identity of a molecule, but it can be used for much more...
Other information can be inferred from a weight measurement:
• Post-translational modifications
• Molecular interactions
• Shape
• Sequence
• Physical dimensions
• etc.

Mass Spectrometer for Proteomics
Pre-Separation (Liquid Chromatography) → Ion Source → Mass Analyzer → Ion Detector
Mass analyzers:
• Time-of-Flight (TOF)
• Quadrupole TOF (QTOF)
• Ion Trap (IT)
• Fourier Transform-Ion Cyclotron Resonance (FT-ICR)

The Nobel Prize in Chemistry 2002
"for the development of methods for identification and structure analyses of biological macromolecules"
John Fenn and Koichi Tanaka: "for their development of soft desorption ionisation methods for mass spectrometric analyses of biological macromolecules"

Electrospray: Generation of aerosols and droplets

Electrospray Ionization (ESI)
• Multiple charging
• More charges for larger molecules
• MW range > 150 kDa
• Liquid introduction of analyte
• Interface with liquid separation methods, e.g. liquid chromatography
• Tandem mass spectrometry (MS/MS) for protein sequencing
[Figure: a high voltage produces highly charged droplets; the ESI mass spectrum shows a series of charge states from 14+ to 22+ over mass/charge (m/z) 500 to 1100.]

ESI-MS of Large Proteins
[Figure: ESI-MS (Q-TOF) at pH 7.5 showing a distribution of multiply charged molecules along m/z. Labeled peaks: (M+13H)13+ at 3543, (M+14H)14+ at 3323, (M+15H)15+ at 3102, and (M+16H)16+. Labeled masses: 46048 and 46512.]
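The charge state of a peak in such a distribution can be worked out from two neighboring peaks, since adjacent peaks differ by one proton. The following is a minimal sketch, assuming the two peaks belong to the same species and carry z and z+1 protons; the function names are illustrative and the proton mass is rounded to 1.00728 Da.

```python
PROTON = 1.00728  # mass of a proton in Da (rounded)

def charge_from_adjacent(mz_high, mz_low):
    """Charge of the peak at mz_high, assuming the peak at mz_low
    is the same molecule carrying exactly one more proton."""
    return round((mz_low - PROTON) / (mz_high - mz_low))

def neutral_mass(mz, z):
    """Neutral mass M computed from the m/z of an (M + zH)z+ ion."""
    return z * (mz - PROTON)

# Peak labels taken from the ESI-MS figure: 3323 and 3102
z = charge_from_adjacent(3323.0, 3102.0)
mass = neutral_mass(3323.0, z)
print(z, round(mass, 1))
```

With the rounded labels 3323 and 3102 this gives z = 14 and a mass near 46,508 Da, which is within a few Da of the 46512 label. The 3543 peak taken as 13+ gives about 46,046 Da, which appears to correspond to the second labeled mass, 46048.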

History of Electrospray Ionization
• Malcolm Dole demonstrated the production of intact oligomers of polystyrene up to MW 500,000
  • Mass analysis of large ions was problematic
• John Fenn (Yale University)
  • Chemical engineer, expert in supersonic molecular beams
  • Began work on electrospray in 1981
  • Adapted ESI to operate on a more “conventional” mass spectrometer
  • Recognized that multiply charged ions were produced by ESI
  • Reduced the m/z range required

Electrospray process
• Analyte dissolved in a suitable solvent flows through a small diameter capillary tube
• Liquid in the presence of a high electric field generates a fine “mist” or aerosol spray of highly charged droplets
• About 10^6 charges for a 30 micron droplet
[Figure: electrospray schematic with power supply and cathode, showing excess positive charge in the emitted droplets.]

Matrix-assisted Laser Desorption/Ionization (MALDI) with a Time-of-Flight (TOF) Analyzer
[Figure: a laser strikes the MALDI sample held at high voltage; ions of masses m1, m2, m3 travel with velocities v1, v2, v3 through the drift region to the detector.]

MALDI Mass Spectrometry of Large Proteins
[Figure: MALDI-MS of rat MVP, % intensity versus m/z (about 30,000 to 130,000), showing (M+H)+ and (M+2H)2+ ions; labeled values include 36446, 48658, 58309, 97430, and 98563.]

MALDI
• Developed by Tanaka (Japan) and Hillenkamp/Karas (Germany)
• Peptide/protein analyte of interest is co-crystallized on the MALDI target plate with an appropriate matrix
  • Matrices are small, highly conjugated organic molecules which strongly absorb energy at a particular wavelength
• Energy is transferred to analyte indirectly, inducing desorption from target surface
• Analyte is ionized by gas-phase proton transfer (perhaps from ionized matrix molecules)
[Figure: pulsed laser light strikes the sample and matrix on the sample stage (target) held at 20 kV; peptide/protein ions are desorbed from the matrix.]

MALDI matrices (for 337 nm irradiation)
• 2,5-dihydroxybenzoic acid (DHB): peptides and proteins
• 4-hydroxy-α-cyanocinnamic acid (“alpha-cyano” or 4-HCCA): peptides
• 3,5-dimethoxy-4-hydroxycinnamic acid (sinapinic acid): proteins

MALDI
• 337 nm irradiation is provided by a nitrogen (N2) laser
• The target plate is inserted into the high vacuum region of the source and the sample is irradiated with a laser pulse. The matrix absorbs the laser energy and transfers energy to the analyte molecule. The molecules are desorbed and ionized during this stage of the process.
• MALDI is most commonly interfaced to a time-of-flight (TOF) mass spectrometer.

R. Aebersold and M. Mann, Nature (2003), 422, 198-207.

Time-of-Flight Mass Spectrometer
Principle of Operation of Linear TOF
[Figure: ions of masses m1, m2, m3 leave the high-voltage source with velocities v1, v2, v3 and cross the drift region (length L) to the detector.]
A time-of-flight mass spectrometer measures the mass-dependent time it takes ions of different masses to move from the ion source to the detector. This requires that the starting time (the time at which the ions leave the ion source) is well-defined. Recall that the kinetic energy of an ion is:
$$\frac{1}{2} m v^2 = eV$$
where “v” is ion velocity, “m” is mass, “e” is the charge on the electron, and “V” is the accelerating voltage. The ion velocity, v, is also the length of the flight path, L, divided by the flight time, t:
$$v = \frac{L}{t}$$
Substituting this expression for v into the kinetic energy relation, we can derive the working equation for the time-of-flight mass spectrometer:
$$m = \frac{2eV}{L^2}\, t^2$$
Mass is proportional to (time)^2.
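To get a feel for the time scale, the working equation m = 2eV t^2 / L^2 can be evaluated numerically. This is a minimal sketch in SI units; the 20 kV accelerating voltage echoes the MALDI source figure, while the 1 m flight path and the function names are assumptions chosen for illustration only.

```python
import math

E_CHARGE = 1.602176634e-19   # elementary charge, C
DALTON = 1.66053906660e-27   # 1 Da in kg

def flight_time(mass_da, voltage, length, charge=1):
    """Flight time in seconds from (1/2) m v^2 = z e V and v = L / t."""
    m = mass_da * DALTON
    return length * math.sqrt(m / (2 * charge * E_CHARGE * voltage))

def mass_from_time(t, voltage, length, charge=1):
    """Mass in Da from the flight time: m = 2 z e V t^2 / L^2."""
    return 2 * charge * E_CHARGE * voltage * t**2 / length**2 / DALTON

# Assumed instrument: 20 kV acceleration, 1.0 m drift length
for mass in (1000.0, 4000.0):
    t = flight_time(mass, voltage=20e3, length=1.0)
    print(f"{mass:7.1f} Da -> {t * 1e6:.1f} microseconds")
```

Under these assumptions a 1000 Da ion arrives after roughly 16 microseconds, and a 4000 Da ion takes twice as long, as expected from the square-root dependence of time on mass.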

Approaches for Protein Sequencing and Identification
• “Top Down”: protein → MS/MS
• “Bottom Up”: protein → enzymatic or chemical degradation → MS/MS
Example protein sequence:
MIRERICACVLALGMLTGFTHAFGSKDAAADGKPLVVTTIGMIADAVKNIAQGDVHLKGLMGPGVDPHLYTATAGDVEWLGNADLILYNGLHLETKMGEVFSKLRGSRLVVAVSETIPVSQRLSLEEAEFDPHVWFDVKLWSYSVKAVYESLCKLLPGKTREFTQRYQAYQQQLDKLDAYVRRKAQSLPAERRVLVTAHDAFGYFSRAYGFEVKGLQGVSTASEASAHDMQELAAFIAQRKLPAIFIESSIPHKNVEALRDAVQARGHVVQIGGELFSDAMGDAGTSEGTYVGMVTHNIDTIVAALAR

Identification of proteins from gels
• Proteins are separated first by high resolution two-dimensional polyacrylamide gel electrophoresis and then stained. At this point, to identify an individual protein spot or a set of spots, the researcher can consider several options, depending on the techniques available.
• For protein spots that appear to be relatively abundant (e.g., more than 1 pmol), traditional protein characterization methods may be employed.
• Methods such as amino acid analysis and Edman sequencing can be used to provide the necessary protein identification information. 2-DE provides approximate molecular weight and isoelectric point. Augmented with information on amino acid composition and/or amino-terminal sequence, a confident identification can be obtained.
• The sensitivity gains of using MS allow for the identification of proteins below the 1 pmol level and in many cases in the femtomole regime.


Protein Cleavage
• For protein identification by mass spectrometry, the protein bands/spots from a 2-D gel are excised and exposed to a highly specific enzymatic cleavage reagent (e.g., trypsin cleaves on the C-terminal side of arginine and lysine residues). The resulting tryptic fragments are extracted from the gel slice and then subjected to MS methods.
• One of the major barriers to high throughput in the proteomic approach to protein identification is the “in-gel” proteolytic digestion and subsequent extraction of the proteolytic peptides from the gel. Common protocols for this process are often long and labor intensive.
[Figure: protein digestion robot]

Protein cleavage - proteolysis and chemical methods

Mass spectrometry-based protein identification
• A mass spectrum of the resulting digest products produces a “peptide map” or a “peptide fingerprint”.
• The measured masses can be compared to theoretical peptide maps derived from database sequences for identification. From this point there are a few choices of mass analysis, depending on available instrumentation and other factors. The resulting peptide fragments can be subjected to MALDI-MS or ESI-MS analysis.
• A small aliquot of the digest solution can be directly analyzed by MALDI-MS to obtain a peptide map. The sequence coverage (relative to the entire protein sequence) given by the tryptic peptides observed in the MALDI mass spectrum can be quite high, i.e., greater than 80% of the sequence, although it can vary considerably depending on the protein, sample amount, etc. The measured molecular weights of the peptide fragments, along with the specificity of the enzyme employed, can be searched and compared against protein sequence databases using a number of computer searching routines available on the Internet.

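Protein identification from peptide fragments
[Figure: the experimental path runs protein → tryptic peptides → mass spectrum; the database path runs protein sequence → theoretical tryptic peptides → theoretical mass spectrum; the two spectra are compared.]
Protein sequence: SEMHIKHYTTKILGFREEGDSCPLKQWDDSKILVAVADKLLEYEEKILLFNSAKYLLDESSTYKLMHDDSV
Theoretical tryptic peptides: SEMHIKHYTTK, ILGFR, EEGDSCPLK, QWDDSK, ILVAVADK, LLEYEEK, ILLFNSAK, YLLDESSTYK, LMHDDSV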
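The step from protein sequence to theoretical tryptic peptides and their masses can be sketched in a few lines. This is an illustrative sketch, not the routine used by any particular search engine: it cleaves after K or R, skips cleavage before proline (a common convention that the slides do not state), and uses standard monoisotopic residue masses. The function names are hypothetical.

```python
# Monoisotopic residue masses in Da
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728

def tryptic_digest(sequence, missed_cleavages=0):
    """Cleave C-terminal to K or R (assumed: not when followed by P)."""
    pieces, start = [], 0
    for i, aa in enumerate(sequence):
        is_last = i == len(sequence) - 1
        if is_last or (aa in "KR" and sequence[i + 1] != "P"):
            pieces.append(sequence[start:i + 1])
            start = i + 1
    peptides = []
    for n in range(missed_cleavages + 1):        # n = number of missed sites
        for j in range(len(pieces) - n):
            peptides.append("".join(pieces[j:j + n + 1]))
    return peptides

def mh_plus(peptide):
    """Monoisotopic (M+H)+ of an unmodified peptide."""
    return sum(MONO[aa] for aa in peptide) + WATER + PROTON

PROTEIN = ("SEMHIKHYTTKILGFREEGDSCPLKQWDDSKILVAVADKLLEYEEK"
           "ILLFNSAKYLLDESSTYKLMHDDSV")

for pep in tryptic_digest(PROTEIN):
    print(f"{pep:12s} {mh_plus(pep):10.4f}")
```

A strict digest of the example sequence splits SEMHIK and HYTTK at the internal lysine, whereas the slide lists SEMHIKHYTTK as a single peptide; calling `tryptic_digest(PROTEIN, missed_cleavages=1)` includes that longer peptide as well.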

MALDI-MS of tryptic peptides
[Figure: MALDI mass spectrum, m/z 1000 to 3000; all peaks are (M+H)+; * marks trypsin autolysis peaks. Labeled peaks: 1116.67, 1247.70, 1287.73, 1375.76, 1424.85, 1505.77, 1574.20, 1665.89, 1811.85, 1849.12, 2005.07, 2476.21, 2550.52, 2719.48.]
Protein sequence:
ARIIVVTSGK GGVGKTTSSA AIATGLAQKG KKTVVIDFDI GLRNLDLIMG
CERRVVYDFV NVIQGDATLN QALIKDKRTE NLYILPASQT RDKDALTREG
VAKVLDDLKA MDFEFIVCDS PAGIETGALM ALYFADEAII TTNPEVSSVR
DSDRILGILA SKSRRAENGE EPIKEHLLLT RYNPGRVSRG DMLSMEDVLE
ILRIKLVGVI PEDQSVLRAS NQGEPVILDI NADAGKAYAD TVERLLGER
PFRFIEEEKK GFLKRLFGG

ESI-MS and LC-MS for protein identification
• An approach for peptide mapping similar to MALDI-MS uses ESI-MS. A peptide map can be obtained by analysis of the peptide mixture by ESI-MS. An advantage of ESI is its ease of coupling to separation methodologies such as HPLC. Thus, alternatively, to reduce the complexity of the mixture, the peptides can be separated by HPLC with subsequent mass measurement by on-line ESI-MS. The measured masses can be compared to sequence databases.
[Figure: LC-MS with ESI. Chromatogram: relative abundance versus time (5 to 12 min), with peaks at 6.2, 6.8, 7.7, 8.4, 8.9, 9.4, and 9.8 min. Mass spectrum: relative abundance versus m/z (400 to 2000), with peaks at 629.0 and 965.3; the 965.3 peak is (M+2H)2+, MW ~ 1928.6 Da.]
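The molecular weight quoted in the figure follows directly from the m/z value and the charge state. As a short worked example, taking the proton mass as about 1.0073 Da and z = 2 for the (M+2H)2+ ion:

$$M = z\left(\frac{m}{z} - m_{\mathrm{H^+}}\right) = 2\,(965.3 - 1.0073) \approx 1928.6\ \text{Da}$$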

LC-MS/MS for protein identification
• An improvement in throughput of the overall method can be obtained by performing LC-MS/MS in the data-dependent mode. As full scan mass spectra are acquired continuously in LC-MS mode, any ion detected with a signal intensity above a pre-defined threshold will trigger the mass spectrometer to switch over to MS/MS mode. Thus, the mass spectrometer switches back and forth between MS mode (molecular mass information) and MS/MS mode (sequence information) in a single LC run. The data-dependent scanning capability can dramatically increase the capacity and throughput for protein identification.
[Figure: LC-MS chromatogram (5 to 12 min; peaks at 6.8, 8.4, 9.4, and 9.8 min), LC-MS spectrum (m/z 400 to 2000) with 629.0 and the (M+2H)2+ ion at 965.3, and the LC-MS/MS spectrum of that ion (m/z 400 to 1800) with ion labels b3, b5, b6, b8, y4, and y8 to y14 and labeled values 668.4, 838.5, 1261.4, 1374.5, and 1474.4.]
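The threshold-triggered switching described above can be illustrated with a small selection routine. This is a simplified sketch of the decision logic only, not instrument control software: the `top_n` limit and the exclusion list are common refinements that are assumed here rather than taken from the slides, and the intensities in the example scan are invented.

```python
def select_precursors(full_scan, threshold, top_n=3, exclude=(), tol=0.5):
    """Pick the ions in one full scan that would trigger MS/MS.

    full_scan : list of (m/z, intensity) pairs
    threshold : intensity an ion must exceed to trigger MS/MS
    top_n     : assumed cap on MS/MS events per full scan
    exclude   : m/z values already fragmented (assumed exclusion list)
    """
    candidates = [
        (intensity, mz)
        for mz, intensity in full_scan
        if intensity > threshold
        and not any(abs(mz - seen) <= tol for seen in exclude)
    ]
    candidates.sort(reverse=True)          # most intense first
    return [mz for _, mz in candidates[:top_n]]

# Hypothetical full scan; the m/z values echo the figure, intensities are made up
scan = [(629.0, 4.0e5), (965.3, 9.0e5), (700.2, 2.0e4)]
print(select_precursors(scan, threshold=1.0e5))
print(select_precursors(scan, threshold=1.0e5, exclude=[965.3]))
```

In this example the first call selects 965.3 and then 629.0, and the second call, with 965.3 already fragmented, selects only 629.0.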

Peptide sequencing by mass spectrometry
• Peptide molecules are fragmented by collisionally activated dissociation (CAD)
  • Collisions with neutral background gas molecules (nitrogen, argon, etc.)
  • Typically dissociate by cleavage of the -CO-NH- bond
[Figure: schematic peptide A-B-C-D-E (N-term. to C-term.) and its N-terminal product ions along the m/z axis.]

Peptide sequencing by mass spectrometry
• Ideally, one can measure the spacings between product ion peaks to deduce the sequence
  • if each amide bond dissociates with equal probability
  • if only a single amide bond fragments for each molecule
  • if only C-terminal or N-terminal product ions are formed
• In reality, this is not the case...
[Figure: schematic of the C-terminal product ions of peptide A-B-C-D-E along the m/z axis.]
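Reading a sequence from peak spacings amounts to matching each gap between consecutive ions of one series to a residue mass. The sketch below assumes the peaks are singly charged and all belong to the same ion series; the function names and the 0.2 Da tolerance are illustrative choices. The example values are peak labels from the cysteine synthase A peptide spectrum in this deck.

```python
# Monoisotopic residue masses in Da; Leu and Ile are isobaric
RESIDUES = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L/I": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}

def residue_for_gap(delta, tol=0.2):
    """Residue labels whose mass matches a peak spacing within tol (Da)."""
    return [name for name, mass in RESIDUES.items() if abs(mass - delta) <= tol]

def read_ladder(peaks, tol=0.2):
    """Candidate residues for each gap between consecutive peaks."""
    peaks = sorted(peaks)
    return [residue_for_gap(hi - lo, tol) for lo, hi in zip(peaks, peaks[1:])]

print(read_ladder([555.4, 668.4]))           # one gap of 113.0 Da
print(read_ladder([990.5, 1091.5, 1261.4]))  # second gap spans two residues
print(residue_for_gap(128.08))               # ambiguous at this tolerance
```

The 113.0 Da gap points to Leu or Ile, which cannot be told apart by mass alone, and the 101.0 Da gap to Thr. The 169.9 Da gap matches no single residue because a ladder peak is missing, and a gap near 128.1 Da returns both Gln and Lys. These are some of the reasons why, as noted above, the ideal case rarely holds.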

Nomenclature for MS Sequencing of Peptides (Klaus Biemann, MIT)
• N-terminal fragments: b1, b2, b3
• C-terminal fragments: y3, y2, y1
• The subscript denotes the number of residues contained in the product ion
[Figure: backbone of a four-residue peptide, H2N-CH(R1)-CO-NH-CH(R2)-CO-NH-CH(R3)-CO-NH-CH(R4)-COOH, with cleavage at each amide bond giving the pairs b1/y3, b2/y2, and b3/y1.]

Nomenclature for MS Sequencing of Peptides
• Low-energy collisions promote fragmentation of a peptide primarily along the peptide backbone
• Peptide fragmentation which maintains the charge on the C terminus is designated a y-ion
• Fragmentation which maintains the charge on the N terminus is designated a b-ion
• Low energy collisions: ion trap, QQQ, QTOF, FT-ICR
• High energy collisions: TOF-TOF
  • Cleavage of amino acid side chain bonds (d-ion and w-ion)
  • Differentiate Leu vs. Ile

Peptide Sequencing by Mass Spectrometry
Peptide: LVDKVIGITNEEAISTAR (Cysteine Synthase A, residues 242 to 259)
• MS/MS of 2+ charged tryptic peptides (often) yields 1+ charged product ions (but 2+ charged products can be observed as well)
• A mixture of b-ions and y-ions is present
[Figure: MS/MS spectrum, relative abundance (0 to 100) versus m/z (400 to 1800). Ion labels: b3 to b17 and y4 to y14. Labeled m/z values: 555.4, 668.4, 838.5, 990.5, 1091.5, 1261.4, 1374.5, and 1474.4.]
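The b- and y-ion series for this peptide can be computed from residue masses. The following is a minimal sketch for singly charged, unmodified fragments using monoisotopic masses; `fragment_ions` is an illustrative name, not a function from any search package.

```python
# Monoisotopic residue masses in Da
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728

def fragment_ions(peptide):
    """Singly charged b- and y-ion m/z values, keyed by ion number."""
    masses = [MONO[aa] for aa in peptide]
    b, y = {}, {}
    for n in range(1, len(peptide)):
        b[n] = sum(masses[:n]) + PROTON           # first n residues
        y[n] = sum(masses[-n:]) + WATER + PROTON  # last n residues
    return b, y

b, y = fragment_ions("LVDKVIGITNEEAISTAR")
for n in (5, 6, 8):
    print(f"b{n}: {b[n]:.2f}")
for n in (9, 10):
    print(f"y{n}: {y[n]:.2f}")
```

The computed values b5 = 555.35, b6 = 668.43, b8 = 838.54, y9 = 990.49, and y10 = 1091.53 agree with the labels 555.4, 668.4, 838.5, 990.5, and 1091.5 in the spectrum. The labels on the heavier fragments differ from the monoisotopic values by a few tenths of a Da, which may reflect the mass accuracy and resolution of the instrument used.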

Computer-based Sequence Searching Strategies
• A list of experimentally determined masses is compared to lists of computer-generated theoretical masses prepared from a database of protein primary sequences. With the current exponential growth in the generation of genomic data, these databases are expanding every day.
• There are typically three types of search strategies employed:
  • searching with peptide fingerprint data
  • searching with sequence data
  • searching with raw MS/MS data
• One limiting factor that must be considered for all of the approaches is that they can only identify proteins that have already been identified and reside within an available database, or that are highly homologous to a protein in the database.

Searching with Peptide Fingerprints
• The majority of the available search engines allow one to define certain experimental parameters to optimize a particular search:
  • Minimum number of peptides to be matched
  • Allowable mass error
  • Monoisotopic versus average mass data
  • Mass range of starting protein
  • Type of protease used for digestion
  • Information about potential protein modification, such as N- and C-terminal modification, carboxymethylation, oxidized methionines, etc.

Searching with Peptide Fingerprints
• Most protein databases contain primary sequence information only
• Any shift in mass relative to the primary sequence as a result of post-translational modification will result in an experimental mass that disagrees with the theoretical mass.
• Modifications such as glycation and phosphorylation can result in missed identifications.
• A single amino acid substitution can shift the mass of a peptide to such a degree that even a protein with a great deal of homology with another in the database cannot be identified.

Searching with Peptide Fingerprints
• A number of factors affect the utility of peptide fingerprinting.
  • The greater the experimental mass accuracy, the narrower you can set your search tolerances, thereby increasing your confidence in the match, and decreasing the number of "false positive" responses.
  • A common practice used to increase mass accuracy in peptide fingerprinting is to employ an autolysis fragment from the proteolytic enzyme as an internal standard to calibrate a MALDI mass spectrum.
• Peptide fingerprinting is also amenable to the identification of proteins in complex mixtures.
  • Peptides generated from the digest of a protein mixture will simply return two or more results that are a "good" fit.
  • Peptides that are "left over" in a peptide fingerprint after the identification of one component can be resubmitted for the possible identification of another component.
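The effect of the mass tolerance on matching can be shown with a small comparison routine. This is a minimal sketch of tolerance-based matching only, without any scoring; the function names are illustrative and the theoretical masses in the example are hypothetical values chosen to sit near three of the observed peaks from the tryptic digest spectrum.

```python
def ppm_error(observed, theoretical):
    """Signed mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6

def match_peaks(observed, theoretical, tol_ppm=50.0):
    """Pair each observed mass with its closest theoretical mass,
    keeping the pair only if the error is within tol_ppm."""
    if not theoretical:
        return []
    matches = []
    for obs in observed:
        best = min(theoretical, key=lambda theo: abs(obs - theo))
        err = ppm_error(obs, best)
        if abs(err) <= tol_ppm:
            matches.append((obs, best, err))
    return matches

observed = [1116.67, 1247.70, 1375.76]        # (M+H)+ peak labels
theoretical = [1116.66, 1247.71, 1375.70]     # hypothetical database masses

for tol in (50.0, 20.0):
    hits = match_peaks(observed, theoretical, tol_ppm=tol)
    print(f"{tol:.0f} ppm: {len(hits)} matches")
```

With these illustrative numbers, all three peaks match at 50 ppm but only two match at 20 ppm, because the third pair differs by about 44 ppm. A tighter tolerance is only useful when the spectrum is calibrated well enough to support it, which is where the internal autolysis standard comes in.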

Web addresses of some representative internet resources for protein identification from mass spectrometry data

Mascot
• MOWSE, among the first programs for identifying proteins by peptide mass fingerprinting, was developed out of a collaboration between the Imperial Cancer Research Fund (ICRF) and SERC Daresbury Laboratory, UK.
• The name is an acronym of Molecular Weight Search. The MOWSE databases were fully indexed so as to allow very rapid searching and retrieval of sequence data. Subsequently, the software was further developed and renamed Mascot.
• Licensed and distributed by Matrix Science Ltd.
• Specialized tools include Peptide Mass Fingerprint, Sequence Query, and MS/MS Ion Search.
• Search output is Web-based.
• Good visual representation of search quality (graphical probability chart).
• Simple graphical user interface.
• Reports MOWSE scores as a quantitative measure of search quality.

Mowse Scoring
• Rather than just counting the number of matching peptides, Mowse uses empirically determined factors to assign a statistical weight to each individual peptide match (Rapid identification of proteins by peptide-mass fingerprinting. Curr. Biol. 3:327, 1993).
• Scoring scheme assigns more weight to matches of higher molecular weight peptides (more discriminating).
• Compensates for the non-random distribution of fragment molecular weights in proteins of different sizes.
• Was the first protein identification program to recognize that the relative abundance of peptides of a given length in a proteolytic digest depends on the lengths of both peptide and protein.
• Developed for MALDI peptide mass fingerprinting.
Probability-Based Mowse
• Mascot incorporates a probability-based enhanced Mowse algorithm, described in Perkins et al. (Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 20:3551-3567, 1999).
• A simple rule can be used to judge whether a result is significant or not.
• Different types of matching (peptide masses and fragment ions) can be combined in a single search.

Databases
• Three components are required for database searching support of proteomics: MALDI or MS/MS data, the algorithms used to search protein databases with the MALDI or MS/MS data, and the protein databases themselves.
• The protein databases can be as small as one protein, can be large, public domain databases of all known and predicted proteins, or may be predicted open reading frames based on genomic sequence.
• A major challenge for database searching is that these protein databases are constantly changing, making database search results potentially obsolete as new entries are added that better fit the MALDI or MS data.
• Even as genomes are completed there is still flux as new coding regions are identified and novel mechanisms of increased translational complexity are better understood, such as alternative splice products, RNA editing, and ribosome slippage leading to novel, unexpected translation products.

Databases
• NCBI non-redundant (NCBInr)
  • Non-redundant database from the National Center for Biotechnology Information for use with their search tools BLAST and Entrez; comprised of translated sequences from the GenBank/EMBL/DDBJ consortium, SwissProt, Protein Information Resource (PIR), and Brookhaven Protein Data Bank (PDB).
  • New releases are published bimonthly while updates occur daily.
• OWL
  • OWL is comprised of Swiss-Prot, PIR, translated GenBank, and NRL-3D (PDB). All sequences are compared to Swiss-Prot to remove identical and “trivially different” sequences. Has not been updated since May, 1999.
• SWISSPROT
  • While SwissProt contains only a subset of proteins, the proteins in this database are much better annotated and the sequences are much more reliable than those available in any other database.
• MSDB
  • Comprehensive, non-identical protein sequence database maintained by the Proteomics Department at the Hammersmith Campus of Imperial College London. Designed specifically for MS applications.

Databases
• EST Clusters (dbEST)
  • Division of GenBank that contains "single-pass" cDNA sequences, or Expressed Sequence Tags (ESTs), from a number of organisms.
  • ESTs are relatively short, usually 3’ end sequences from isolated mRNA.
  • ESTs tend to be highly redundant, and the sequence quality is much lower than that from other sources. An advantage of using ESTs is that they represent only expressed sequences (no introns) and include alternative splice variants; the problems of length, redundancy, and low quality are greatly reduced by using clustered ESTs, such as the Compugen clusters.
  • The EST database has some redundancy because it contains all possible combinations of alternative splice products, and so it can be very large (and slow to search).
  • During a Mascot search, the nucleic acid sequences are translated in all six reading frames. dbEST is a very large database, and is divided into three sections: EST_human, EST_mouse, and EST_others. Even so, searches of these databases take far longer than a search of one of the non-redundant protein databases. You should only search an EST database if a search of a protein database has failed to find a match.
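Six-frame translation, mentioned above for EST searches, means translating three reading frames on the given strand and three on its reverse complement. The sketch below uses the standard genetic code and is meant only to illustrate the idea; it is not Mascot's implementation, and the function names and the short example sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]

def translate(dna):
    """Translate complete codons; unknown codons become 'X', stops '*'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )

def six_frame(dna):
    """Translations of frames +1, +2, +3 and -1, -2, -3."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames

for name, protein in six_frame("ATGGCCAAATAA").items():
    print(name, protein)
```

Each nucleotide entry therefore yields six candidate protein sequences to digest and compare, which is one reason EST searches take much longer than searches of a protein database.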

MALDI-MS peptide fingerprint (tryptic digest of a single protein)
[Figure: MALDI mass spectrum, m/z 1000 to 3000; all peaks are (M+H)+; * marks trypsin autolysis peaks. Labeled peaks: 1116.67, 1247.70, 1287.73, 1375.76, 1424.85, 1505.77, 1574.20, 1665.89, 1811.85, 1849.12, 2005.07, 2476.21, 2550.52, 2719.48.]

Mascot (Matrix Science) for peptide mass fingerprints enter peak list

Mascot (Matrix Science) for peptide mass fingerprints possible identification
