Efficient and accurate algorithms for peptide mass spectrometry
Dissertation presentation
Stephen Tanner
May 30, 2007
Lab page: http://peptide.ucsd.edu

Overview
Introduction: What is mass spectrometry? How does it fit into the broader context of biology and bioinformatics? (Chapter 1)
Capillary sequencers are a central technology for studying DNA.
Microarrays are a central technology for studying RNA.
Mass spectrometry is a central technology for studying the proteome.
“…Some five years ago, mass spectrometry definitively crossed the border to biochemistry. The general ways that it provides structural determination, identification and trace level analysis have many applications in the biochemical field. It has become an attractive alternative to Edman sequencing, earlier dominant, and has an unsurpassed ability to identify posttranscriptional modifications and non-covalent interactions in for example antigen-antibody binding studies for identifying ligands to orphan receptors….”
A protein sample is digested (typically with trypsin) to generate peptides.
The peptides are then separated by liquid chromatography.
The mass spectrometer separates the eluting peptides by mass-to-charge ratio (m/z), and records a mass spectrum.
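The first and last steps above can be illustrated in silico. This is a minimal sketch, not part of the original slides: it assumes the usual trypsin rule (cleave after K or R unless the next residue is P), standard monoisotopic residue masses, and no missed cleavages or modifications. The function names are hypothetical.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # H2O added to the residue sum of a free peptide
PROTON = 1.00728   # mass of one charge-carrying proton


def tryptic_digest(protein):
    """Cleave after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):          # C-terminal peptide without K/R
        peptides.append(protein[start:])
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def mz(peptide, charge):
    """Mass-to-charge ratio of the peptide carrying `charge` protons."""
    return (peptide_mass(peptide) + charge * PROTON) / charge


print(tryptic_digest("AKRPGKLM"))  # ['AK', 'RPGK', 'LM']
```

The same peptide appears at a different m/z for each charge state, which is why charge-2 and charge-3 spectra are treated separately later in the talk.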
Above: Diagram of a mass spectrometer (courtesy of ChemGuide.com). Molecules are accelerated by a series of charged plates, their time of flight determined by their mass-to-charge ratio.
Right: A high-end Fourier Transform mass spectrometer (image from Pacific Northwest National Labs)
Above: A sample peptide tandem mass spectrum, identified and labeled by the InsPecT software toolkit.
Genome assembly and gene finding are two important problems in genomics.
Finding up- and down-regulated genes and gene sets is a key problem in transcriptomics.
Peptide identification (InsPecT) and modification site identification (MS-Alignment) are two important problems in proteomics.
Tanner, S., Shu, H., Frank, A., Wang, L., Zandi, E., Mumby, M., Pevzner, P., and Bafna, V., 2005. Inspect: Fast and accurate identification of post-translationally modified peptides from tandem mass spectra. Anal. Chem., 77(14):4626–4639.
Frank, A., Tanner, S., Bafna, V., and Pevzner, P., 2005. Peptide sequence tags for fast database search in mass-spectrometry. J. of Proteome Research, 4(4):1287–1295.
InsPecT: Fast and Accurate Spectrum Annotation
(#1 of 10,000)
Tools like Sequest must score every peptide from the database with approximately correct mass (left). Using InsPecT, the expensive scoring step need only be run on those candidates matching a sequence tag (right).
Tags from all spectra are loaded into a trie. The trie lets us scan the protein database for any number of strings in linear time. When a tripeptide tag is matched and the flanking masses are matched, we obtain a candidate peptide.
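The tag scan described above might look roughly like the following. This is a simplified sketch rather than the InsPecT implementation: residue masses are rounded to integers, a tag is a hypothetical (prefix mass, tripeptide, suffix mass) tuple, and the function names are invented for illustration. Because every tag is three residues long, each database position costs at most three trie steps, so the scan is linear in database length regardless of how many tags are loaded.

```python
# Integer residue masses keep the example readable (an assumption);
# a real search uses accurate masses with an instrument-specific tolerance.
MASS = {"G": 57, "A": 71, "S": 87, "P": 97, "V": 99,
        "T": 101, "L": 113, "K": 128, "E": 129, "R": 156}


def build_trie(tags):
    """tags: iterable of (prefix_mass, tripeptide, suffix_mass)."""
    root = {}
    for tag in tags:
        node = root
        for aa in tag[1]:
            node = node.setdefault(aa, {})
        node.setdefault("$", []).append(tag)  # "$" marks the end of a tag
    return root


def extend(db, pos, step, target, tol):
    """Walk from pos in direction step until residue masses sum to target.
    Returns the number of residues consumed, or None if no match."""
    total, n = 0, 0
    while abs(total - target) > tol:
        if total > target or not (0 <= pos < len(db)) or db[pos] not in MASS:
            return None
        total += MASS[db[pos]]
        pos += step
        n += 1
    return n


def scan(db, tags, tol=0.5):
    """Return candidate peptides whose tag and both flanking masses match."""
    root = build_trie(tags)
    candidates = []
    for i in range(len(db)):
        node, j = root, i
        while j < len(db) and db[j] in node:
            node = node[db[j]]
            j += 1
            for prefix_mass, _tag, suffix_mass in node.get("$", []):
                left = extend(db, i - 1, -1, prefix_mass, tol)
                right = extend(db, j, +1, suffix_mass, tol)
                if left is not None and right is not None:
                    candidates.append(db[i - left:j + right])
    return candidates


# Tag "SPV" with prefix mass G+A = 128 and suffix mass T+L+K = 342.
print(scan("EEKGASPVTLKAAR", [(128, "SPV", 342)]))  # ['GASPVTLK']
```

Only the candidates returned by such a scan would need the expensive scoring step, which is the source of the speedup over scoring every peptide of approximately correct mass.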
Figure 3.2: Bayesian network for scoring masses. In nodes corresponding to peaks, the odds that a peak is present (in a charge-2 or a charge-3 spectrum) are indicated.
Left: Simplified portion of the conditional probability table for one node of the Bayesian network. In ion trap spectra, most break points produce a relatively strong y fragment, and a weak (but present) b fragment.
Above: Histogram showing false discovery rate (y axis) versus weighted score (x axis) for results of a large search.
The sequence of human crystallin beta B1 is shown above, annotated with post-translational modifications discovered by InsPecT in a study of cataractous lens.
Some modifications are produced by chemical damage; others are “deliberate” modifications carried out in a carefully regulated manner. Comparisons of modification rates suggest that deamidation (net mass shift +1) plays a role in cataract formation.
Tsur, D., Tanner, S., Zandi, E., Bafna, V., and Pevzner, P., 2005. Identification of post-translational modifications via blind search of mass-spectra. Nature Biotechnology, 23:1562–1567.
Tanner, S., Pevzner, P., and Bafna, V., 2006. Unrestrictive identification of post-translational modifications through peptide mass spectrometry. Nature Protocols, 1(1):67–72.
Wilmarth, P. A., Tanner, S., Dasari, S., Nagalla, S. R., Riviere, M. A., Bafna, V., Pevzner, P. A., and David, L. L., 2006. Age-related changes in human crystallins determined from comparative analysis of post-translational modifications in young and aged lens: Does deamidation contribute to crystallin insolubility? Journal of Proteome Research.
Tanner, S., Payne, S. H., Dasari, S., Shen, Z., Wilmarth, P., David, L., Loomis, W. F., Briggs, S. P., and Bafna, V., 2007. Accurate annotation of peptide modifications through unrestrictive database search. In preparation.
Simplified diagram of MS-Alignment algorithm. We construct Searchdots for each database position (horizontal axis) and for each spectrum peak (vertical axis). Paths are diagonal lines, with one or two modifications (horizontal / vertical segments) permitted. An annotation is a path from the top to the bottom of the graph. The highest-scoring paths are retained and re-scored.
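To make the idea of an unrestricted mass shift concrete, here is a brute-force sketch for a single candidate peptide and a single modification. It is not the MS-Alignment algorithm, which finds paths over all database positions and peaks; it only shows how a mass offset placed at one residue shifts every later prefix mass, which corresponds to one horizontal or vertical jump in the diagram. Integer masses and all names are assumptions for illustration.

```python
# Integer residue masses (an assumption made for readability).
MASS = {"G": 57, "A": 71, "S": 87, "P": 97, "V": 99,
        "T": 101, "L": 113, "K": 128, "E": 129, "R": 156}


def prefix_masses(peptide, site, delta):
    """Cumulative residue masses of all proper prefixes; every prefix that
    contains the residue at index `site` is shifted by `delta`."""
    total, out = 0, []
    for i, aa in enumerate(peptide[:-1]):
        total += MASS[aa]
        out.append(total + (delta if i >= site else 0))
    return out


def best_single_modification(peptide, observed, total_mass, tol=0.5):
    """Try the unexplained mass `delta` at every residue and keep the site
    whose shifted prefix masses match the most observed peaks.
    Returns (delta, site_index, matched_peak_count)."""
    delta = total_mass - sum(MASS[aa] for aa in peptide)
    best_score, best_site = -1, None
    for site in range(len(peptide)):
        theoretical = prefix_masses(peptide, site, delta)
        score = sum(any(abs(t - o) <= tol for o in observed)
                    for t in theoretical)
        if score > best_score:
            best_score, best_site = score, site
    return delta, best_site, best_score


# Prefix masses of GASPVTLK with +80 on the serine (index 2).
observed = [57, 128, 295, 392, 491, 592, 705]
print(best_single_modification("GASPVTLK", observed, 833))  # (80, 2, 7)
```

No list of allowed modification masses is supplied: the offset is inferred from the difference between the measured total mass and the unmodified peptide, which is what makes the search unrestrictive.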
We obtained interesting results in the Nature Biotechnology paper, but did not report a false discovery rate for sites.
As peptide datasets grow, there will be less emphasis on individual spectral correctness.
Instead we use the high redundancy of large datasets to focus on identification of modified peptides, and modified sites.
Analysis of unrestrictive results
Figure 6.3: ROC curve for categorization of modified lens peptides using the PTMFinder support vector machine (SVM). The accuracy of the PTMFinder model is significantly higher than a simple spectrum-level score cutoff. In addition, PTMFinder is more effective than selecting those sites which correspond to the most common modification types (amino acid and mass) by spectrum count.
Ten different peptide species witness histidine methylation of actin. Combining evidence from multiple peptide species gives a site p-value of 6.6 × 10^-12. Fully tryptic peptides are most common, but missed cleavages and post-digest decay produce several other peptide species. We found this modification site to be conserved between Homo sapiens and the protist Dictyostelium discoideum.
Figure 6.5: Venn diagram summarizing sites of N-terminal acetylation (left) and phosphorylation (right) in human proteins. Known sites from two databases (UniProt and HPRD) are shown, along with sites identified from a corpus of ~20 million spectra derived from the HEK293 cell line analyzed by MS-Alignment and PTMFinder.
Tanner, S., Shen, Z., Ng, J., Florea, L., Guigo, R., Briggs, S. P., and Bafna, V., 2007. Improving gene annotation with mass spectrometry. Genome Research, 17(2):231–239.
Gupta, N., Tanner, S., Jaitly, N., Adkins, J., Lipton, M., Edwards, R., Romine, M., Osterman, A., Bafna, V., Smith, R. D., and Pevzner, P., 2007. Whole proteome analysis of post-translational modifications: applications of mass-spectrometry for proteogenomic annotation. In preparation.
Genome annotation is generally seen as something done before transcriptomics and proteomics.
The direction of information flow mirrors the central dogma.
Mass spectrometry is an attractive method for discovering genes and improving gene annotations.
Roughly 25% of tryptic peptides span a splice junction, so intron boundaries (and alternative splicing) can be confirmed at the translational level.
MS/MS has different sources of error than ESTs, providing a novel line of evidence for gene finding.
Figure 7.3: A portion of the exon graph for heterogeneous nuclear ribonucleoprotein K. The labeled edge represents a codon split across a splice junction. The dotted edge is an “adjacent edge” corresponding to a longer form of an exon. Searching the exon graph reveals peptides spanning both outgoing edges from the central node, confirming alternative splicing at the level of translation.
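A toy version of an exon graph search is sketched below. It assumes each node already holds translated amino acids, so it ignores split codons and adjacent edges, and the exon names and sequences are made up for illustration. The function reports which splice edges lie strictly inside an occurrence of a peptide.

```python
def paths_from(node, edges, max_len):
    """All paths of up to max_len nodes starting at node (acyclic graph)."""
    yield [node]
    if max_len > 1:
        for nxt in edges.get(node, []):
            for rest in paths_from(nxt, edges, max_len - 1):
                yield [node] + rest


def spanning_edges(exons, edges, peptide, max_len=3):
    """Return the set of (src, dst) splice edges crossed by the peptide."""
    found = set()
    for start in exons:
        for path in paths_from(start, edges, max_len):
            text, bounds = "", []
            for name in path:
                bounds.append(len(text))   # offset where this exon begins
                text += exons[name]
            pos = text.find(peptide)
            while pos != -1:
                end = pos + len(peptide)
                for k in range(1, len(path)):
                    if pos < bounds[k] < end:   # junction inside the peptide
                        found.add((path[k - 1], path[k]))
                pos = text.find(peptide, pos + 1)
    return found


# Hypothetical central exon e1 with two alternative downstream exons.
exons = {"e1": "MKTAYIAK", "e2a": "QRQISFVK", "e2b": "GSHFSR"}
edges = {"e1": ["e2a", "e2b"]}
print(spanning_edges(exons, edges, "YIAKQRQ"))  # {('e1', 'e2a')}
print(spanning_edges(exons, edges, "IAKGSH"))   # {('e1', 'e2b')}
```

Observing peptides for both outgoing edges of e1, as in this toy case, is the kind of evidence the figure describes for alternative splicing.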
Above: Evidence for novel exons upstream of the annotated start site of retinoblastoma-associated protein RAP140 (gi:5881256). Matched peptides are shown below their corresponding genomic location, with spectrum counts indicated in parentheses. Those peptides which match the reference protein sequence are also shown.
Figure 7.6: Novel exons are supported by peptide identifications and by sequence homology. Above is a multiple alignment for hypothetical protein sequences from chimp (gi:55639283), rat (gi:62531299), and human (genome translation, similar to gi:20070384). Introns are indicated by colons. The peptides identified from mass spectra are indicated below the protein sequence. The novel 3’ exon is supported by three peptide identifications, as well as >95% amino acid sequence conservation across species.
Tanner, S. and Agarwal, P., 2007. Generalized gene set queries for microarray analysis. In preparation.
Above: Empirical cumulative distribution function (CDF) for gene set scores across a corpus of experiments. Calibration filters false positives: for example, a score of 0.05 is highly significant for one gene set, but not for others.
Accuracy (computed using the false discovery rate) of queries for five pairs of related microarray experiments. Calibration of p-values using a large corpus (either GEO or the Connectivity Map) was significantly more effective than computing a p-value using permutation of class labels. (Permutation also requires many replicates to be reliable.)
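Calibration against an empirical CDF can be sketched as follows. This is an illustration under stated assumptions, not the GQuery code: it assumes lower raw scores are more significant, that each gene set has its own list of scores from the corpus, and it uses a common add-one correction so that no calibrated value is exactly zero. The function name and the corpus values are hypothetical.

```python
from bisect import bisect_right


def calibrated_p(score, corpus_scores):
    """Fraction of corpus experiments in which this gene set scored at
    least as low as `score`, with an add-one correction."""
    ranked = sorted(corpus_scores)
    at_or_below = bisect_right(ranked, score)
    return (at_or_below + 1) / (len(ranked) + 1)


# Made-up corpus scores for two gene sets.
rarely_low = [0.2, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99]
often_low = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.2, 0.5, 0.9]

print(calibrated_p(0.05, rarely_low))  # 0.1
print(calibrated_p(0.05, often_low))   # 0.6
```

The same raw score of 0.05 is unusual for the first gene set and routine for the second, which is the effect the CDF plot illustrates.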
Above: Examples of gene sets obtained by the GQuery algorithm. Statistical power is greatly increased when measuring up- or down-regulation of many genes at once. Gene sets such as “Long-term memory” are easier to interpret than lists of gene identifiers.