C3 Database Compression
Searching Human ESTs
C3 sequence database 
Complete: All 30-mers are represented
Correct: No new 30-mers are represented
Compact: 30-mers occur exactly once.
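The three properties above can be checked mechanically. The following is a minimal sketch, not the authors' tool: the function names (`kmers`, `kmer_counts`, `check_c3`) are hypothetical, and sequences are assumed to be plain Python strings of amino-acid letters.

```python
from collections import Counter

K = 30  # k-mer length used on the poster


def kmers(seq, k=K):
    """Yield every length-k substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def kmer_counts(seqs, k=K):
    """Count how often each k-mer occurs across a list of sequences."""
    counts = Counter()
    for s in seqs:
        counts.update(kmers(s, k))
    return counts


def check_c3(original, compressed, k=K):
    """Report whether `compressed` is complete, correct, and compact
    with respect to `original` (illustrative check only)."""
    orig = kmer_counts(original, k)
    comp = kmer_counts(compressed, k)
    return {
        # Complete: every k-mer of the original is represented.
        "complete": all(m in comp for m in orig),
        # Correct: the compressed database introduces no new k-mers.
        "correct": all(m in orig for m in comp),
        # Compact: each k-mer occurs exactly once in the compressed database.
        "compact": all(c == 1 for c in comp.values()),
    }
```

For example, with k = 3 the sequences ABCDE, BCDEF, and ABCDE can be replaced by the single sequence ABCDEF, which satisfies all three checks.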
Figure 1 shows the size of the C3-compressed sequence database for various source databases.
Human ESTs represent a largely unexplored source of peptide sequence data. EST sequence databases are highly redundant, with a sequencing error rate of about 1%.
We translate the Human ESTs in all 6 frames and C3 compress the result. In the process, we eliminate open reading frames of fewer than 50 amino acids, and amino-acid 30-mers that are observed only once.
We achieve a 40-fold reduction in sequence database size and, therefore, in running time.
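The filtering steps described above (6-frame translation, dropping short open reading frames, and dropping 30-mers seen only once) might be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: an open reading frame is taken to be a stop-to-stop stretch of a translated frame, occurrences are counted across all ESTs and frames, ambiguous codons are translated as X, and all function names are hypothetical. The C3 compression step itself is not shown.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate DNA codon by codon; unknown codons (e.g. with N) become X."""
    return "".join(CODON.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return the 3 forward and 3 reverse-complement translations."""
    dna = dna.upper()
    rev = dna.translate(COMPLEMENT)[::-1]
    return [translate(strand[f:]) for strand in (dna, rev) for f in range(3)]


def orfs(dna, min_len=50):
    """Yield stop-to-stop stretches of at least min_len amino acids."""
    for frame in six_frames(dna):
        for piece in frame.split("*"):
            if len(piece) >= min_len:
                yield piece


def repeated_kmers(ests, k=30, min_len=50, min_count=2):
    """Return the amino-acid k-mers observed at least min_count times."""
    counts = Counter()
    for est in ests:
        for orf in orfs(est, min_len):
            counts.update(orf[i:i + k] for i in range(len(orf) - k + 1))
    return {m for m, c in counts.items() if c >= min_count}
```

The set returned by `repeated_kmers` is what would then be handed to the compression step.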
Current sequence databases contain considerable peptide sequence redundancy. Amino-acid sequence databases often contain considerably fewer distinct 30-mers than their size suggests. The distinct 30-mers can be searched faster than the original sequence database, while the statistical significance of the peptide identifications is improved.
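The redundancy described here can be measured by comparing the number of 30-mer positions in a database with the number of distinct 30-mers. The sketch below is illustrative only; `distinct_kmer_summary` is a hypothetical helper, and it holds all distinct 30-mers in memory, which would likely be impractical for the largest databases.

```python
def distinct_kmer_summary(seqs, k=30):
    """Return (total k-mer positions, number of distinct k-mers)."""
    total = 0
    distinct = set()
    for s in seqs:
        n = max(len(s) - k + 1, 0)  # sequences shorter than k contribute nothing
        total += n
        distinct.update(s[i:i + k] for i in range(n))
    return total, len(distinct)
```

A large gap between the two numbers indicates how much a search over distinct 30-mers could save relative to a search over the original database.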
Peptide Sequences are Short
Figure 2: Absolute (seconds) and relative Mascot search time for original and C3 sequence databases. Blue represents Mascot search time, other colors represent peptide mapping time.
Protein Sequence Databases
More Sensitive Search
We searched the Open Proteomics Database MS/MS dataset “SiHa human cell line used to model cervical cancer” against C3 Human dbEST with Mascot on a PC (512 MB RAM).
Total spectra: 47788
Search Time: 4.5 hours
Novel Peptides: 19
Figure 1: Sequence databases: size, C3 size, distinct 30-mers.
IPI-HUMAN, from EBI; IPI, concatenation of IPI-HUMAN, IPI-MOUSE, and IPI-RAT from EBI; Swiss-Prot, from ExPASy; Swiss-Prot-VS, Swiss-Prot plus varsplic.pl variant enumeration; UniProt, concatenation of Swiss-Prot and TrEMBL; UniProt-VS, UniProt plus varsplic.pl variant enumeration; MSDB, from Imperial College; NRP, from NCI Frederick; NCBI-nr, from NCBI; and UnionNR, non-redundant union of all.
Figure 1 shows size and distinct 30-mers.
 N. Edwards and R. Lippert. Sequence database compression for peptide identification from tandem mass spectra. WABI 2004.
Figure 3: Significant E-values for Mascot search against Swiss-Prot-VS (x) vs C3 Swiss-Prot-VS (y).
Faster, more sensitive peptide identification from tandem mass spectra by sequence database compression
Nathan J. Edwards, Center for Bioinformatics & Computational Biology, University of Maryland, College Park
Figure 4: Size (Blue, bottom axis, in Mb) and Mascot search time (Red, top axis, in hours) for C3 and brute force Human EST database.