Faster, more sensitive peptide identification from tandem mass spectra by sequence database compression

Introduction

C3 Database Compression

Faster Search

Searching Human ESTs

C3 sequence database [1]

  • Complete: All 30-mers are represented.
  • Correct: No new 30-mers are represented.
  • Compact: 30-mers occur exactly once.
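
The three C3 properties can be stated as set conditions on 30-mers. The following is a minimal sketch of such a check, not the construction algorithm from [1]; the function names `kmers` and `check_c3` are hypothetical, and the toy example uses k = 3 rather than 30 so the result is easy to verify by hand.

```python
from collections import Counter


def kmers(sequences, k):
    """Yield every length-k substring of every sequence."""
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            yield seq[i:i + k]


def check_c3(original, compressed, k=30):
    """Report whether `compressed` is complete, correct, and compact
    with respect to the k-mers of `original` (hypothetical helper)."""
    original_set = set(kmers(original, k))
    compressed_counts = Counter(kmers(compressed, k))
    compressed_set = set(compressed_counts)
    return {
        # Complete: every original k-mer is represented.
        "complete": original_set <= compressed_set,
        # Correct: no k-mer appears that was absent from the original.
        "correct": compressed_set <= original_set,
        # Compact: each k-mer occurs exactly once.
        "compact": all(n == 1 for n in compressed_counts.values()),
    }


# Toy database with overlapping, redundant entries (k = 3 for readability).
toy_original = ["ABCDE", "BCDEF", "ABCDE"]
print(check_c3(toy_original, ["ABCDEF"], k=3))
print(check_c3(toy_original, ["ABCDE", "CDEF"], k=3))   # "CDE" occurs twice
print(check_c3(toy_original, ["ABCDEFG"], k=3))         # "EFG" is new
```

In the toy example the single sequence `ABCDEF` satisfies all three conditions, while the other two candidates each violate exactly one of them.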

Figure 1 compares the original and C3-compressed sizes of various sequence databases.

Human ESTs represent a largely unexplored source of peptide sequence data. EST sequence databases are highly redundant, with a sequencing error rate of about 1%.

We translate the Human ESTs in all 6 frames and C3-compress the result. In the process, we eliminate open reading frames of fewer than 50 amino acids, as well as amino-acid 30-mers that are observed only once.

We achieve a 40-fold reduction in sequence database size and, therefore, in running time.
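
The translation and filtering steps described above can be sketched as follows. This is an illustration under stated assumptions, not the poster's implementation: an "open reading frame" is taken to be any stop-to-stop stretch of a translated frame, codons containing ambiguous bases are translated to `X`, 30-mer occurrences are counted across all frames and all ESTs, and the names `six_frame_orfs` and `repeated_kmers` are hypothetical. The final C3 compression step is not shown.

```python
from collections import Counter
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate one reading frame; unknown codons become 'X' (assumption)."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_orfs(dna, min_len=50):
    """Yield stop-to-stop stretches of at least min_len residues, all 6 frames."""
    dna = dna.upper()
    reverse_complement = dna.translate(COMPLEMENT)[::-1]
    for strand in (dna, reverse_complement):
        for frame in range(3):
            for orf in translate(strand[frame:]).split("*"):
                if len(orf) >= min_len:
                    yield orf


def repeated_kmers(ests, k=30, min_len=50, min_count=2):
    """Return the k-mers seen at least min_count times across all ESTs."""
    counts = Counter()
    for est in ests:
        for orf in six_frame_orfs(est, min_len):
            for i in range(len(orf) - k + 1):
                counts[orf[i:i + k]] += 1
    return {kmer for kmer, n in counts.items() if n >= min_count}


# Toy run with small thresholds so the output is easy to inspect.
est = "ATGGCCTAAGGG"
print(list(six_frame_orfs(est, min_len=2)))
print(repeated_kmers([est, est], k=3, min_len=3))
```

With the poster's thresholds, the defaults `k=30`, `min_len=50`, and `min_count=2` apply; the toy call uses smaller values only because the example sequence is 12 bases long.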

Current sequence databases contain considerable peptide sequence redundancy. Amino-acid sequence databases often contain considerably fewer distinct 30-mers than their size suggests. The distinct 30-mers can be searched faster than the original sequence database, while the statistical significance of the peptide identifications is improved.
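
One way to see this redundancy is to compare the total number of 30-mer positions in a database with the number of distinct 30-mers. The sketch below does this for an in-memory list of sequences; the function name `kmer_redundancy` is hypothetical, and the toy data uses k = 3 for readability.

```python
def kmer_redundancy(sequences, k=30):
    """Return (total k-mer positions, number of distinct k-mers)."""
    total = 0
    distinct = set()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            total += 1
            distinct.add(seq[i:i + k])
    return total, len(distinct)


# Toy database: two identical entries and one overlapping entry.
total, distinct = kmer_redundancy(["ABCDE", "ABCDE", "BCDEF"], k=3)
print(f"{total} k-mer positions, {distinct} distinct k-mers")
```

A large gap between the two numbers indicates that a search engine scanning the original database scores the same candidate peptides repeatedly.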

Peptide Sequences are Short

  • Peptides identified in MS/MS workflows are rarely longer than 30 amino-acids.
  • Trypsin cuts at K or R, unless followed by P
  • Precursor ion is usually < 3000 Da
  • Charge state is usually +1, +2, or +3
  • Peptides of more than 20 amino-acids typically don’t fragment well.
  • 30 is a conservative upper-bound on the length of peptides identified by MS/MS workflows.
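
The trypsin rule in the list above can be illustrated with a small in silico digest. This is a minimal sketch, assuming cleavage after every K or R not followed by P and a simple length cut-off; the function name `tryptic_peptides` and its parameters are hypothetical, and precursor mass and charge state are not modeled.

```python
def tryptic_peptides(protein, missed_cleavages=0, max_len=30):
    """Digest a protein after K or R, unless the next residue is P.

    Peptides spanning up to `missed_cleavages` uncut sites are included,
    and peptides longer than `max_len` residues are discarded.
    """
    # Collect cut positions: sequence start, each cleavage site, sequence end.
    cuts = [0]
    for i, residue in enumerate(protein[:-1]):
        if residue in "KR" and protein[i + 1] != "P":
            cuts.append(i + 1)
    cuts.append(len(protein))

    peptides = []
    for start in range(len(cuts) - 1):
        last = min(start + missed_cleavages + 2, len(cuts))
        for end in range(start + 1, last):
            peptide = protein[cuts[start]:cuts[end]]
            if peptide and len(peptide) <= max_len:
                peptides.append(peptide)
    return peptides


# "KR" is cut after K; "RP" is not cut because R is followed by P.
print(tryptic_peptides("MKRPAAKGG"))
print(tryptic_peptides("MKRPAAKGG", missed_cleavages=1))
```

Because every retained peptide is at most 30 residues long, each one is contained in some 30-mer of the original database (or is a whole shorter sequence), which is why representing all 30-mers is sufficient for this kind of search.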

Figure 2: Absolute (seconds) and relative Mascot search time for original and C3 sequence databases. Blue represents Mascot search time, other colors represent peptide mapping time.

Protein Sequence Databases

More Sensitive Search

We searched the Open Proteomics Database MS/MS dataset “SiHa human cell line used to model cervical cancer” against C3 Human dbEST with Mascot on a PC (512 MB RAM).

  • Total spectra: 47788
  • Search time: 4.5 hours
  • Novel peptides: 19

Figure 1: Sequence databases: size, C3 size, distinct 30-mers.

IPI-HUMAN, from EBI; IPI, concatenation of IPI-HUMAN, IPI-MOUSE, and IPI-RAT from EBI; Swiss-Prot, from ExPASy; Swiss-Prot-VS, Swiss-Prot plus varsplic.pl variant enumeration; UniProt, concatenation of Swiss-Prot and TrEMBL; UniProt-VS, UniProt plus varsplic.pl variant enumeration; MSDB, from Imperial College; NRP, from NCI Frederick; NCBI-nr, from NCBI; and UnionNR, non-redundant union of all.

Figure 1 shows the size and the number of distinct 30-mers for each database.

Experiment Parameters

  • ISB 17 Protein Mix (2043 MS/MS Spectra)
  • Mascot 2.0
    • Precursor tolerance: 2 Da
    • Fragment tolerance: 0.15 Da
    • Up to 2 missed trypsin cleavages
  • IPI-HUMAN, Swiss-Prot(-VS), UniProt(-VS)
  • Dell PC with 512 MB of RAM

[1] N. Edwards and R. Lippert. Sequence database compression for peptide identification from tandem mass spectra. WABI 2004.

Figure 3: Significant E-values for Mascot search against Swiss-Prot-VS (x) vs C3 Swiss-Prot-VS (y).

Faster, more sensitive peptide identification from tandem mass spectra by sequence database compression

Nathan J. Edwards, Center for Bioinformatics & Computational Biology, University of Maryland, College Park

P-1014

Figure 4: Size (Blue, bottom axis, in Mb) and Mascot search time (Red, top axis, in hours) for C3 and brute force Human EST database.

References