Efficient and accurate algorithms for peptide mass spectrometry
Efficient and accurate algorithms for peptide mass spectrometry. Dissertation presentation Stephen Tanner May 30, 2007 Lab page: http://peptide.ucsd.edu. Overview. Introduction: What is mass spectrometry? How does it fit into the broader context of biology and bioinformatics? (Chapter 1)

Dissertation presentation

Stephen Tanner

May 30, 2007

Lab page: http://peptide.ucsd.edu


Overview

  • Introduction: What is mass spectrometry? How does it fit into the broader context of biology and bioinformatics? (Chapter 1)

  • Spectrum annotation (Chapters 2, 3, 4)

  • Discovering post-translational modifications (Chapters 5, 6)

  • Genome annotation (Chapter 7)

  • Gene set analysis of microarrays (Chapter 8)


From genomics to proteomics

DNA → (transcription) → mRNA → (translation) → protein


Key technologies

Genomics: Capillary sequencers are a central technology for studying DNA.

Transcriptomics: Microarrays are a central technology for studying RNA.

Proteomics: Mass spectrometry is a central technology for studying the proteome.


2002 Chemistry Nobel Prize

  • Given for MS and NMR applied to proteins

  • The citation highlights several current and potential applications

“…Some five years ago, mass spectrometry definitively crossed the border to biochemistry. The general ways that it provides structural determination, identification and trace level analysis have many applications in the biochemical field. It has become an attractive alternative to Edman sequencing, earlier dominant, and has an unsurpassed ability to identify posttranscriptional modifications and non-covalent interactions in for example antigen-antibody binding studies for identifying ligands to orphan receptors….”


Peptide Mass Spectrometry

A protein sample is digested (typically with trypsin) to generate peptides.

The peptides are then separated by liquid chromatography.


Mass spectrometry

The mass spectrometer separates the eluting peptides by mass-to-charge ratio (m/z), and records a mass spectrum.

[Figure: a mass spectrum, plotting intensity against m/z]


Above: Diagram of a mass spectrometer (courtesy of ChemGuide.com). Molecules are accelerated by a series of charged plates, their time of flight determined by their mass-to-charge ratio.


Left: An LTQ mass spectrometer (image from University of Vermont)

Right: A high-end Fourier Transform mass spectrometer (image from Pacific Northwest National Labs)


Tandem MS

[Figure labels: ionized parent peptide; secondary fragmentation]


Peptide fragmentation

H...-HN-CH-CO-NH-CH-CO-…OH

  • Peptides are fragmented, typically through collision with inert atoms.

  • Peptides break at peptide bonds, generating an N-terminal b ion and a C-terminal y ion.
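The b and y ion masses described above can be computed directly from residue masses. The following is a minimal sketch, assuming singly charged fragments and standard monoisotopic masses; the function name `fragment_ions` is hypothetical and is not part of InsPecT.

```python
# Standard monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # mass of H2O, carried by the C-terminal fragment
PROTON = 1.00728   # mass of the charge-carrying proton


def fragment_ions(peptide):
    """Return (b_ions, y_ions): singly charged m/z values for each peptide bond."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for cut in range(1, len(peptide)):
        # b ion: N-terminal residues up to the broken bond, plus a proton.
        b_ions.append(sum(masses[:cut]) + PROTON)
        # y ion: C-terminal residues after the bond, plus water and a proton.
        y_ions.append(sum(masses[cut:]) + WATER + PROTON)
    return b_ions, y_ions


if __name__ == "__main__":
    b, y = fragment_ions("GAK")
    print([round(m, 4) for m in b], [round(m, 4) for m in y])
```

For every bond, the b and y values sum to the same constant (the peptide mass plus two protons), which is why a single peak can be read as either fragment type.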

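[Figure: the backbone breaks at a peptide bond between the residues with side chains Rn-1 and Rn+1. The N-terminal fragment H...-HN-CH-CO is the b ion (includes the N-terminus); the C-terminal fragment H3N-CH-CO-…-OH is the y ion (includes the C-terminus). Spectrum: one peak for each fragment type.]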


Above: A sample peptide tandem mass spectrum, identified and labeled by the InsPecT software toolkit.


The Need for Bioinformatics

  • High-throughput technologies like MS generate huge volumes of data much faster than the data can be analyzed and integrated by legacy methods.

  • Analysis becomes the bottleneck, and algorithms address this bottleneck

  • Bioinformatics also helps improve accuracy - and provide accurate measurements of accuracy.


Known problem

  • Suppose it takes 1 second to locate one word in a large text. How long would it take to locate 1 million words?

  • (The naive answer: One million seconds!)

  • The Aho-Corasick algorithm takes roughly the same time to find one million words as for one word.

Bioinformatics application

  • Suppose it takes 1 second to interpret one spectrum using a database. How long would it take to search 1 million spectra?

  • Early tools, like Sequest, have runtimes that grow linearly with the number of scans

  • InsPecT uses the Aho-Corasick algorithm to search efficiently (up to 100 times faster than Sequest)


Key algorithms

Genomics: Genome assembly and gene finding are two important problems in genomics.

Transcriptomics: Finding up- and down-regulated genes and gene sets is a key problem in transcriptomics.

Proteomics: Peptide identification (InsPecT) and modification site identification (MS-Alignment) are two important problems in proteomics.


Peptide identification

  • Given a peptide tandem spectrum, we wish to identify the peptide which produced it.

  • Identifying peptides with modified residues (or point mutations) is important as well.

  • Many interesting applications of mass spectrometry (e.g. quantitation) rely upon accurate peptide annotations.


InsPecT: Fast and Accurate Spectrum Annotation

Tanner, S., Shu, H., Frank, A., Wang, L., Zandi, E., Mumby, M., Pevzner, P., and Bafna, V., 2005. Inspect: Fast and accurate identification of post-translationally modified peptides from tandem mass spectra. Anal. Chem., 77(14):4626–4639.

Frank, A., Tanner, S., Bafna, V., and Pevzner, P., 2005. Peptide sequence tags for fast database search in mass-spectrometry. J. of Proteome Research, 4(4):1287–1295.


Database search

  • One way to identify peptides (first implemented by tools like Sequest and Mascot) is to enumerate and score all possibilities from the sequence database.

  • Theoretical spectra are compared against the “mass fingerprint” of the spectrum
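As an illustration of enumerate-and-score database search, here is a minimal sketch. It assumes a shared-peak count as the match score, which is a deliberately simple stand-in and not the scoring used by Sequest, Mascot, or InsPecT. The function names and the tolerances are hypothetical.

```python
# Standard monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def peptide_mass(peptide):
    """Neutral monoisotopic mass of a peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def theoretical_spectrum(peptide):
    """Sorted singly charged b and y ion m/z values."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    peaks = []
    for cut in range(1, len(peptide)):
        peaks.append(sum(masses[:cut]) + PROTON)
        peaks.append(sum(masses[cut:]) + WATER + PROTON)
    return sorted(peaks)


def shared_peak_count(observed, theoretical, tol=0.5):
    """Number of theoretical peaks that have an observed peak within tol."""
    return sum(1 for t in theoretical if any(abs(t - o) <= tol for o in observed))


def search(observed, parent_mass, database, mass_tol=1.0, peak_tol=0.5):
    """Score every database peptide with a matching parent mass; best first."""
    results = []
    for peptide in database:
        if abs(peptide_mass(peptide) - parent_mass) > mass_tol:
            continue  # wrong parent mass: not a candidate
        score = shared_peak_count(observed, theoretical_spectrum(peptide), peak_tol)
        results.append((peptide, score))
    results.sort(key=lambda item: item[1], reverse=True)
    return results


if __name__ == "__main__":
    spectrum = theoretical_spectrum("GAK")  # an idealized, noise-free input
    print(search(spectrum, peptide_mass("GAK"), ["AGK", "GAK", "VLK"]))
```

Every candidate with the right parent mass is scored, so the cost grows with the number of candidates per spectrum.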

[Figure: a theoretical spectrum (#1 of 10,000) is compared against the input spectrum to give a match score]


Drawbacks of database search

  • Enumerating all candidates is too slow, particularly when modifications and non-tryptic peptides must be considered.

  • A modern instrument produces a million spectra per day!

  • Early tools used an over-simplified match scoring model


De novo interpretation

  • What if we have no sequence database?

  • A de novo algorithm such as PEAKS or PepNovo attempts to recover the entire peptide sequence from the spectrum.

  • However, due to incomplete fragmentation and noise peaks, we can only generate partial peptide reconstructions in most cases.

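[Figure: a partial de novo reconstruction. Some positions are ambiguous (NG? GN? AT?), the residues G, V, P are called, and the remainder is unknown (??).]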


Filtering via tags

  • If we identify a part of the sequence (tag) from the spectrum itself, we can efficiently filter for regions containing that string.

    • Recall: Exact match for strings is very fast.

    • Search time does not grow with number of query strings.

  • Computational problem: identify a collection of tags from a spectrum, such that at least one matches the true peptide.

  • We identify tags via a graph theoretic formulation


Peptide mass graphs

  • We obtain candidate prefix residue masses by treating spectrum peaks as b or y fragments.

  • Masses which differ by the mass of an amino acid are linked by an edge.
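The graph construction above can be sketched as follows. This is a minimal illustration, assuming a list of candidate prefix masses is already available and ignoring scoring. The names `build_edges` and `find_tags` and the 0.02 Da tolerance are hypothetical.

```python
# Standard monoisotopic residue masses in daltons.
# Isoleucine is omitted because it has the same mass as leucine (I/L).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def build_edges(masses, tol=0.02):
    """Link each pair of masses whose difference matches a residue mass."""
    masses = sorted(masses)
    edges = {}  # node index -> list of (next node index, amino acid)
    for i, low in enumerate(masses):
        for j in range(i + 1, len(masses)):
            diff = masses[j] - low
            for aa, mass in RESIDUE_MASS.items():
                if abs(diff - mass) <= tol:
                    edges.setdefault(i, []).append((j, aa))
    return masses, edges


def find_tags(masses, length=3, tol=0.02):
    """Return (tag, prefix mass) for every path of `length` edges."""
    masses, edges = build_edges(masses, tol)
    found = []

    def extend(node, letters, start):
        if len(letters) == length:
            found.append(("".join(letters), masses[start]))
            return
        for nxt, aa in edges.get(node, []):
            extend(nxt, letters + [aa], start)

    for start in range(len(masses)):
        extend(start, [], start)
    return found


if __name__ == "__main__":
    # Prefix masses of the peptide G-A-V (0, G, G+A, G+A+V).
    print(find_tags([0.0, 57.02146, 128.05857, 227.12698]))
```

In this example the graph also contains an edge labeled Q from 0.0 to 128.05857, because G plus A has almost exactly the mass of Q. That path does not extend to three residues here, but it shows why several tags per spectrum are kept.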

[Figure: a peptide mass graph whose edges are labeled with amino acids (W, R, V, A, L, G, T, E, P, L, K, C, W, D, T)]


Tag-based search

  • InsPecT generates short peptide sequence tags from the spectrum, and uses these tags to filter the database.

  • Tag-based search is a hybrid of de novo and traditional database search.

  • Tags make database search much faster, analogous to the way that BLAST’s filter speeds up sequence search.

Tag    Prefix mass
AVG    0.0
WTD    120.2
PET    211.4

[Figure: the peptide mass graph from which the tags are read, with edges labeled W, R, V, A, L, T, G, E, P, L, K, C, W, D, T]


Tag-based filtering

[Figure: two panels, left and right, each listing the same database peptides:]

MDHPEDESHSEK
QDDEEALARLEEIK
SIEAKLTLR
QNNLNPERPDSAYLR
LKQINEEQREGLR
FVSEAVTAICEAK
SSDIQAAVQICSLLHQR
EFSASLTQGLLK
SAEDLEADK

Tools like Sequest must score every peptide from the database with approximately correct mass (left). Using InsPecT, the expensive scoring step need only be run on those candidates matching a sequence tag (right).


[Figure: a trie of sequence tags. Below the root, nodes are labeled with amino acids (A, D, F, ..., H, V, ..., I/L, M). Leaves record the flanking masses for each spectrum: Spectrum #1 (prefix 250.1 Da, suffix 1000.5 Da), Spectrum #23 (prefix 762.8 Da, suffix 626.0 Da), Spectrum #3 (prefix 334.5 Da, suffix 220.5 Da).]

Tags from all spectra are loaded into a trie. The trie lets us scan the protein database for any number of strings in linear time. When a tripeptide tag is matched and the flanking masses are matched, we obtain a candidate peptide.
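The trie scan can be sketched with a compact Aho-Corasick automaton. This is an illustrative implementation, not InsPecT's code. It reports only where tags occur and omits the flanking-mass check. The names `build_automaton` and `find_all` are hypothetical.

```python
from collections import deque


def build_automaton(patterns):
    """Build the trie (goto), failure links (fail) and outputs (out)."""
    goto, fail, out = [{}], [0], [[]]
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pattern)
    # Breadth-first pass: a node's failure link points to the longest
    # proper suffix of its string that is also a path from the root.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text, patterns):
    """Return (start position, pattern) for every occurrence, in one pass."""
    goto, fail, out = build_automaton(patterns)
    state, hits = 0, []
    for pos, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern in out[state]:
            hits.append((pos - len(pattern) + 1, pattern))
    return hits


if __name__ == "__main__":
    print(find_all("MDHPEDESHSEK", ["PED", "ESH", "HSE", "XYZ"]))
```

The text is read once regardless of how many tags are loaded, so the scan time is driven by the database length plus the number of matches rather than by the number of tags.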


Scoring tag masses

Figure 3.2: Bayesian network for scoring masses. In nodes corresponding to peaks, the odds that a peak is present (in a charge-2 or a charge-3 spectrum) are indicated.


Scoring tag masses

  • We use a Bayesian network to score each mass, using binned intensity levels

  • Masses receive high scores if they have peak patterns typical of valid break points

Left: Simplified portion of the conditional probability table for one node of the Bayesian network. In ion trap spectra, most break points produce a relatively strong y fragment, and a weak (but present) b fragment.


Scoring tags

  • Each tag is scored using the Bayesian network (for masses), including flanking amino acid effects.

  • Edge skew is penalized.

  • The top 25 tags are retained for searching.

  • InsPecT can easily be extended to new instruments. For instance, it can be retrained to handle c and z ion series (from ETD instruments) without recompiling the code.


Scoring candidate peptides

  • Filtering results in a list of candidate peptides which must be scored to obtain the best match.

  • A match scoring function assigns a match quality score (MQScore), given a spectrum and a peptide.

  • The MQScore is computed using a support vector machine (SVM) on a total of seven features measuring match quality.

  • The MQScore distinguishes the correct candidate from incorrect candidates.


Identifying correct annotations

  • In a typical experiment, only 10-30% of spectra are successfully interpreted.

  • We wish to focus on those spectra whose top-ranking candidate is correct.

  • To help do this, we consider the gap between the top candidate’s MQScore and the nearest runner-up (delta-score).
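The delta-score described above is a simple difference between the two best scores. A minimal sketch, assuming the candidates arrive as (peptide, MQScore) pairs; the function name and the example scores are hypothetical.

```python
def top_match_with_delta(candidates):
    """Return (peptide, score, delta) for the best-scoring candidate.

    `candidates` is a list of (peptide, score) pairs. `delta` is the gap
    between the best score and the runner-up, or None when there is no
    runner-up to compare against.
    """
    if not candidates:
        return None
    ranked = sorted(candidates, key=lambda item: item[1], reverse=True)
    best_peptide, best_score = ranked[0]
    delta = best_score - ranked[1][1] if len(ranked) > 1 else None
    return best_peptide, best_score, delta


if __name__ == "__main__":
    # Illustrative scores only, not real MQScores.
    print(top_match_with_delta([("AGK", 1.0), ("GAK", 3.1), ("VLK", -0.5)]))
```

A large gap suggests that the top candidate stands clearly apart from the alternatives, while a gap near zero means the ranking is ambiguous.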


False discovery rates

  • In any high-throughput experiment, quantifying false discovery rates is crucial

  • We include decoy (shuffled) proteins in the database as a negative control.

  • We quantify the empirical false discovery rate by counting the number of matches to these bogus records.
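Counting decoy hits gives an empirical false discovery rate at any score cutoff. The sketch below assumes the decoy database is the same size as the real one, so that the number of decoy hits estimates the number of false hits among real proteins. The exact estimator used in the dissertation is not stated here, and the function names are hypothetical.

```python
def empirical_fdr(matches, threshold):
    """Estimate the FDR among matches scoring at or above `threshold`.

    `matches` is a list of (score, is_decoy) pairs, one per spectrum.
    """
    targets = sum(1 for score, is_decoy in matches if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in matches if score >= threshold and is_decoy)
    if targets == 0:
        return 0.0 if decoys == 0 else 1.0
    return min(1.0, decoys / targets)


def threshold_for_fdr(matches, max_fdr):
    """Lowest observed score whose cutoff keeps the estimated FDR within max_fdr."""
    for threshold in sorted({score for score, _ in matches}):
        if empirical_fdr(matches, threshold) <= max_fdr:
            return threshold
    return None


if __name__ == "__main__":
    matches = [(5.0, False), (4.0, False), (3.0, True),
               (2.5, False), (2.0, False), (1.0, True)]
    print(empirical_fdr(matches, 2.0), threshold_for_fdr(matches, 0.3))
```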


Above: Histogram showing false discovery rate (y axis) versus weighted score (x axis) for results of a large search.


The sequence of human crystallin beta B1 is shown above, annotated with post-translational modifications discovered by InsPecT in a study of cataractous lens.

Some modifications are produced by chemical damage, others are “deliberate” modifications carried out in a carefully-regulated manner. Comparisons of modification rates suggest that deamidation (net mass shift +1) plays a role in cataract formation.


MS-Alignment and PTMFinder: Unrestrictive Modification Search

Tsur, D., Tanner, S., Zandi, E., Bafna, V., and Pevzner, P., 2005. Identification of post-translational modifications via blind search of mass-spectra. Nature Biotechnology, 23:1562–1567.

Tanner, S., Pevzner, P., and Bafna, V., 2006. Unrestrictive identification of post-translational modifications through peptide mass spectrometry. Nat Protocols, 1(1):67–72.

Wilmarth, P. A., Tanner, S., Dasari, S., Nagalla, S. R., Riviere, M. A., Bafna, V., Pevzner, P. A., and David, L. L., 2006. Age-related changes in human crystallins determined from comparative analysis of post-translational modifications in young and aged lens: Does deamidation contribute to crystallin insolubility? Journal of Proteome Research.

Tanner, S., Payne, S. H., Dasari, S., Shen, Z., Wilmarth, P., David, L., Loomis, W. F., Briggs, S. P., and Bafna, V., 2007. Accurate annotation of peptide modifications through unrestrictive database search. In preparation.


Post-translational modifications

  • After assembly, proteins are often modified, whether to control their structure, to regulate enzyme activity, or as a result of chemical damage.

  • Hundreds of different modification types are known. Databases such as UniMod, RESID, and ABRF catalog them.


Restrictive vs. unrestrictive search

  • InsPecT can handle several modification types at once, but the user must still “guess” a list of allowed modification types

  • In unrestrictive search, the virtual database of modified peptides is thousands of times larger than the sequence database itself.

  • Enumerating all peptide candidates becomes infeasible. However, an alignment procedure can find the best modified peptide.


Simplified diagram of the MS-Alignment algorithm. We construct dots for each database position (horizontal axis) and for each spectrum peak (vertical axis). Paths are diagonal lines, with one or two modifications (horizontal / vertical segments) permitted. An annotation is a path from the top to the bottom of the graph. The highest-scoring paths are retained and re-scored.


Analysis of unrestrictive results

We obtained interesting results in the Nature Biotechnology paper, but did not report a false discovery rate for sites.

As peptide datasets grow, there will be less emphasis on individual spectral correctness.

Instead we use the high redundancy of large datasets to focus on identification of modified peptides, and modified sites.


PTMFinder

  • The PTMFinder procedure attaches a false discovery rate to modification sites (analogous to PeptideProphet and unmodified search)

  • A site may be supported by several peptides, and by hundreds of spectra.

  • High spectrum-level accuracy is not sufficient (or necessary) to give high site-level accuracy

  • Combining features across spectra produces a very accurate model.


Handling δ-correct annotations

  • In unrestrictive search, each peptide has dozens of “neighbors” with similar fragmentation

  • Examples:

    Q-17GEAMLAPK QG-17EAMLAPK

    Q-16GEAMLAPK G+111EAMLAPK

  • PTMFinder merges and reconciles redundant peptides, and attempts to annotate peptides using known chemical modifications (Unrestrictive, but not blind)


Figure 6.3: ROC curve for categorization of modified lens peptides using the PTMFinder support vector machine (SVM). The accuracy of the PTMFinder model is significantly higher than a simple spectrum-level score cutoff. In addition, PTMFinder is more effective than selecting those sites which correspond to the most common modification types (amino acid and mass) by spectrum count.


PTMFinder analysis

  • Studied a small, heavily-modified data set from human lens, and a large data set from HEK293 cell extract

  • Also studied ~1.4 million spectra from the protist Dictyostelium discoideum


Ten different peptide species witness histidine methylation of actin. Combining evidence from multiple peptide species gives a site p-value of 6.6 × 10^-12. Fully tryptic peptides are most common, but missed cleavages and post-digest decay produce several other peptide species. We found this modification site to be conserved between Homo sapiens and the protist Dictyostelium discoideum.


Figure 6.5: Venn diagram summarizing sites of N-terminal acetylation (left) and phosphorylation (right) in human proteins. Known sites from two databases (Uniprot and HPRD) are shown, along with sites identified from a corpus of ~20 million spectra derived from the HEK293 cell line analyzed by MS-Alignment and PTMFinder.


Genome Annotation

Improving gene annotation with mass spectrometry. Tanner, Stephen and Shen, Zhouxin and Ng, Julio and Florea, Liliana and Guigo, Roderic and Briggs, Steven P and Bafna, Vineet, 2007. Genome Research 17(2), 231-239.

Whole proteome analysis of post-translational modifications: applications of mass-spectrometry for proteogenomic annotation. Nitin Gupta, Stephen Tanner, Navdeep Jaitly, Joshua Adkins, Mary Lipton, Robert Edwards, Margaret Romine, Andrei Osterman, Vineet Bafna, Richard D. Smith, Pavel Pevzner, 2007. In preparation.


Genome annotation

Genome annotation is generally seen as something done before transcriptomics and proteomics.

The direction of information flow mirrors the central dogma.

[Figure labels: Genomics, Transcriptomics, Proteomics]


Genome annotation

Mass spectrometry is an attractive method for discovering genes and improving gene annotations.

Roughly 25% of tryptic peptides span a splice junction, so intron boundaries (and alternative splicing) can be confirmed at the translational level.

MS/MS has different sources of error than ESTs, providing a novel line of evidence for gene finding.

[Figure labels: Genomics; ESTs (Transcriptomics); Peptide IDs (Proteomics)]


Genomic Search

  • Proteins have many isoforms and sequence variants. Storing and searching every feasible sequence is inefficient!

  • Storing the proteome as an exon graph is more efficient, and results are trivially mapped back onto the genome

  • (Valuable even in cases where the genome is perfectly annotated!)
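To illustrate why a graph is a compact way to store isoforms, the sketch below matches a peptide against a small exon graph by walking edges, without ever writing out the full isoform sequences. It is a simplification: nodes hold amino acid sequences, and codons split across a splice junction are ignored. The node names, sequences, and the function `find_peptide` are hypothetical.

```python
def find_peptide(nodes, edges, peptide):
    """Return every node path along which `peptide` occurs.

    `nodes` maps a node name to its amino acid sequence; `edges` maps a
    node name to the list of nodes that may follow it.
    """
    hits = []

    def extend(node, offset, remaining, path):
        available = nodes[node][offset:]
        n = min(len(available), len(remaining))
        if available[:n] != remaining[:n]:
            return  # mismatch inside this exon
        if n == len(remaining):
            hits.append(path)  # the whole peptide has been matched
            return
        # The exon ended first: continue across each outgoing junction.
        for nxt in edges.get(node, []):
            extend(nxt, 0, remaining[n:], path + [nxt])

    for node, sequence in nodes.items():
        for offset in range(len(sequence)):
            extend(node, offset, peptide, [node])
    return hits


if __name__ == "__main__":
    nodes = {"e1": "MKTAY", "e2": "IAKQR", "e3": "QISFVK"}
    edges = {"e1": ["e2", "e3"], "e2": ["e3"]}  # e1 -> e3 skips exon e2
    print(find_peptide(nodes, edges, "TAYIAK"))  # spans the e1/e2 junction
    print(find_peptide(nodes, edges, "TAYQIS"))  # spans the e1/e3 junction
```

Because each hit is reported as a path of exon nodes, it maps directly back to genomic coordinates, and hits on both outgoing edges of a node are evidence of alternative splicing.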


Figure 7.3: A portion of the exon graph for heterogeneous nuclear ribonucleoprotein K. The labeled edge represents a codon split across a splice junction. The dotted edge is an “adjacent edge” corresponding to a longer form of an exon. Searching the exon graph reveals peptides spanning both outgoing edges from the central node, confirming alternative splicing at the level of translation.


Exon Graph

  • We used gene predictions (GeneID) and EST mappings (dbEST, ESTMapper) to build a graph of putative exons and introns in the human genome

  • The graph incorporates coding SNPs from dbSNP

  • A modified version of InsPecT was then used to search the graph


Above: Evidence for novel exons upstream of the annotated start site of retinoblastoma-associated protein RAP140 (gi:5881256). Matched peptides are shown below their corresponding genomic location, with spectrum counts indicated in parentheses. Those peptides which match the reference protein sequence are also shown.


Figure 7.6: Novel exons are supported by peptide identifications and by sequence homology. Above is a multiple alignment for hypothetical protein sequences from chimp (gi:55639283), rat (gi:62531299), and human (genome translation, similar to gi:20070384). Introns are indicated by colons. The peptides identified from mass spectra are indicated below the protein sequence. The novel 3’ exon is supported by three peptide identifications, as well as >95% amino acid sequence conservation across species.


Genome annotation results

  • Discovery of novel exons in a dozen human genes

  • Confirmation of many genes predicted de novo or from ESTs

  • Detection of alternative splicing, and coding SNPs, at the translational level


Gene Set Analysis

Generalized gene set queries for microarray analysis. Stephen Tanner, Pankaj Agarwal, 2007. In preparation.


Overview

  • Microarray experiments compare mRNA between tissue types or treatment conditions. They measure up- and down-regulation of genes.

  • The data is very noisy (particularly for low-abundance transcripts) and can be difficult to interpret.

  • Collecting readings corresponding to gene sets (e.g. a set of all genes annotated with a GO term) is one way to address this.


Motivating Example

  • A microarray experiment compared muscle RNA from young and from aged males (GEO data-set GDS287).

  • I computed the cyber-t statistic for each gene.

  • After correcting for multiple hypothesis testing, no genes were significantly up- or down-regulated.

  • But, perhaps we can find a set of genes that’s significantly up- or down-regulated.


GQuery algorithm

Input:

  • a vector of readings for ~20,000 genes

  • a binary vector indicating which genes are members of the gene set

Output:

  • An enrichment score measuring the degree to which the set is enriched for up- or down-regulated genes. Computed using Pearson correlation.
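The input and output above can be made concrete with a short sketch. It computes the Pearson correlation between the per-gene readings and the binary membership vector. This shows only the raw score, with none of the statistical calibration, and the function names and example values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant vector carries no signal
    return cov / math.sqrt(var_x * var_y)


def enrichment_score(readings, membership):
    """Positive when set members tend to be up-regulated, negative when down."""
    return pearson(readings, membership)


if __name__ == "__main__":
    readings = [2.0, 1.5, -0.5, -1.0]   # one statistic per gene
    membership = [1, 1, 0, 0]           # 1 if the gene belongs to the set
    print(round(enrichment_score(readings, membership), 5))
```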


GQuery algorithm notes

  • Several other enrichment statistics were tried, with similar results

  • The same model handles queries against a database of other microarray experiments (e.g. matching diseases to compounds with opposing effects)

  • The computations are simple, but careful statistics are required.


Challenge: False positives

  • Because gene sets represent co-regulated genes, the expression levels of their members are tightly correlated.

  • Our null model must correct for this, or we will obtain many false positives.

  • We calibrated our p-value readings using a diverse corpus of microarray experiments.
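One way to picture corpus-based calibration is as an empirical p-value: the score a gene set receives in the current experiment is compared with the scores the same gene set received across a corpus of other experiments. The sketch below assumes a two-sided comparison and add-one smoothing, both of which are illustrative choices rather than details taken from the dissertation.

```python
def empirical_p_value(score, corpus_scores):
    """Fraction of corpus scores at least as extreme as `score`.

    `corpus_scores` holds the scores of ONE gene set across many
    experiments. Add-one smoothing keeps the estimate above zero.
    """
    extreme = sum(1 for s in corpus_scores if abs(s) >= abs(score))
    return (extreme + 1) / (len(corpus_scores) + 1)


if __name__ == "__main__":
    # Hypothetical corpus scores for two gene sets.
    quiet_set = [0.01, -0.02, 0.015, -0.005]   # rarely moves
    noisy_set = [0.2, -0.3, 0.1, 0.04]         # moves in most experiments
    print(empirical_p_value(0.05, quiet_set))
    print(empirical_p_value(0.05, noisy_set))
```

The same raw score of 0.05 is unusual for the first set but ordinary for the second, which is the effect calibration is meant to capture.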


Above: Empirical cumulative distribution function (CDF) for gene set scores across a corpus of experiments. Calibration filters false positives - e.g. a score of 0.05 is highly significant for one gene set, but not for others.


Validation experiment

  • Given related experiments, we expect to see the same enriched gene sets

  • Example: Comparing muscle from young and aged males/females, we see downregulation of the TCA cycle

  • To a first approximation, sets shared across unrelated experiments are false positives.


Accuracy (computed using the false discovery rate) of queries for five pairs of related microarray experiments. Calibration of p-values using a large corpus (either GEO or the Connectivity Map) was significantly more effective than computing a p-value using permutation of class labels. (Permutation also requires many replicates to be reliable)


Above: Examples of gene sets obtained by the GQuery algorithm. Statistical power is greatly increased when measuring up- or down-regulation of many genes at once. Gene sets such as “Long-term memory” are easier to interpret than lists of gene identifiers.


Summary

  • Analysis of high-throughput mass spectrometry requires efficient algorithms

  • Reporting accuracy (e.g. using a negative control to compute false discovery rates) is vital for high-throughput analysis

  • The software I developed will be used and extended by our lab and by collaborators.


Acknowledgments

  • My committee - Vineet Bafna, Julian Schroeder, Pavel Pevzner, Steve Briggs, Trey Ideker

  • Fellow students - Ari Frank, Nuno Bandeira, Samuel Payne, Nitin Gupta, Julio Ng, Natalie Castellana, Vagisha Sharma, Qian Peng

  • Industry contacts - Helge Weissig, Pankaj Agarwal

  • Collaborating labs - Ingolf Krueger, Larry David, Ebrahim Zandi, Marc Mumby, Elizabeth Komives, Guy Salvesen, Richard Smith, and many more!