De novo sequencing and homology searching with de novo sequence tags
Sponsored Links
This presentation is the property of its rightful owner.
1 / 50

De Novo Sequencing and Homology Searching with De Novo Sequence Tags PowerPoint PPT Presentation


  • 133 Views
  • Uploaded on
  • Presentation posted in: General

De Novo Sequencing and Homology Searching with De Novo Sequence Tags. Inexact protein DB. protein DB. Possible Ways to Interpret MS/MS Data. MS/MS Spectra. 2. de novo sequencing. peptides. homology search. database search. 1. 3. peptides. h omologous peptides. Why Bother?.

Download Presentation

De Novo Sequencing and Homology Searching with De Novo Sequence Tags

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


De Novo Sequencing and Homology Searching with De Novo Sequence Tags


Inexact

protein DB

protein DB

Possible Ways to Interpret MS/MS Data

MS/MS Spectra

2

de novo

sequencing

peptides

homology

search

database

search

1

3

peptides

homologous peptides


Why Bother?

  • De novo sequencing derives the sequence without looking into a database.

  • De novosequencing is useful for

    • unsequenced genomes (no protein database)

    • novel peptides (unmatched spectra after database search)

    • single amino acid polymorphism

    • unexpected PTM

    • database error

    • validate a database match


Outline

  • Basics

  • Manual De Novo Sequencing

  • De novo Sequencing Algorithm (PEAKS)

  • Homology Search with De Novo Tags


O

[N-term]-NH-CHR-C---NH2+-[C-term]H+

Sequence-specific fragment ions

(M+2H)+2

[N-term]-NH-CHR-CO+ + NH2-[C-term] H+

b-ion

y-ion

a – NH3 or H2O

b – NH3 or H2O

y – NH3 or H2O

[N-term]-NH=CHR+ + CO

a-ion


Non-sequence-specific fragmentations


Why does everyone analyze positively-charged tryptic peptides?

  • Usually better sensitivity from positively-charged peptide ions.

  • “Mobile protons” protonate peptide bonds and promote b/y fragmentation

    • Arg sequesters protons in gas phase

    • Tryptic peptides typically have 0 -1 Arg

    • Tryptic peptide ions typically have two protons

    • Therefore, tryptic peptides usually have b/y ions

  • Placing Arg’s at the C-terminus makes it more likely that a complete series of y-ions will be observed.


MS/MS spectrum of doubly-charged tryptic peptide (one Arg and two protons)

Y

y6

y5

y4

y3

y2

y1

Y L Y E I A R

b2

a2

y5

b2

L

y2

y1

y4

y3

y6


MS/MS spectrum of a doubly-charged non-tryptic peptide(two Arg’s and two protons)

(M+2H)+2

y2

Y S R R H P E

a5-17

(b6+18)+2

Relative Ab. (%)

b5-17

Y

y4-17

(b5+18)+2

b4

H

y4

y3

P

y6-17

y1


300

400

500

600

700

800

900

1000

1100

1200

1300

1400

1500

m/z

0

100

200

300

400

500

600

700

800

900

1000

1100

1200

1300

1400

1500

CID in traps vsquadrupoles

b13

IPIGFAGAQGGFDTR

Ion trap

y9

y14+2

y2

y10

Relative Abundance

y12

y6

y5

b6

b9

y3

y7

y8

b10

y4

b7

b8

b12

y13

b4

b5

y11

b3

b11

y2

IPIGFAGAQGGFDTR

Qtof

b2

(M + 2H) +2

Relative Abundance

b6

b4

b5

y3

y5

y14+2

y1

y9

y12

y4

y10

y6

b3

y7

y8

y13

y11

b13

m/z


Annoying things to remember when sequencing peptides by MS/MS

  • Leucine and isoleucine have the same mass

  • Glutamine and lysine differ by 0.036 u

  • Phenylalanine and oxidized methionine differ by 0.033 u

  • Cleavages do not occur at every bond (more often than not, there is no cleavage between the first and second residues)

  • Certain amino acids have the same mass as pairs of other amino acids: G + G = N, A + G = Q, G + V ~ R, A + D ~ W,

  • S + V ~ W

  • However: mass accuracy resolves many of these ambiguities


Outline

  • Basics

  • Manual De Novo Sequencing

  • De novo Sequencing Algorithm (PEAKS)

  • Homology Search with De Novo Tags


Two approaches to manually sequencing peptides from MS/MS spectra

  • Finding a series of ions in the middle of the peptide, and working out towards the termini (illustrated using ion trap data)

  • Finding the C-terminus and working towards the N-terminus (illustrated using qtof data)


Sequencing from the middle: look for ion series in the region above the precursor ion (m/z 615)


An obvious series is the one that involves the more abundant fragment ions (m/z 575, 688, 775, 888, and 987)

L

S

L

V


-18

-18

-18

-18

-18

Another ion series contains pairs separated by 18 Da (water losses)

L

V

E

S


Two ion series have been identified in the region above the precursor ion

Problem: Two ion series defining partial sequences LSLV and LVES have been identified, but it is not known if these are y- or b-ions (i.e., the sequence direction is unknown).

Solution: Since ion trap data often exhibits high mass b-ions, check to see if the highest mass ion in either series corresponds to a loss of either Arg or Lys (the usual tryptic C-terminus). If not, check to see if the mass difference corresponds to a dipeptide containing Lys or Arg (it is possible that the b-ion defining the C-terminus is absent).

Calculation: Peptide MW – 17 – fragment ion = C-terminal residue mass


For the first ion series: 1228 – 17 – 987 = 224 Da

L

S

L

V

224 – 128 = 96

224 – 156 = 68

Therefore this does not look like a b-ion series


-18

-18

-18

-18

-18

For the second ion series: 1228 – 17 – 1083 = 128 (the residue mass of Lys); this looks like a b-ion series and maybe the other one is a y-ion series

L

V

E

S

K


The high mass b-series predicts the presence of some low mass y-ions; are they there?

b-series: …LVESK

y1: 147 No

y2: 234 Yes

y3: 363 Yes

y4: 462 Yes

y5: 575 Yes!!

y-ions = residue mass

plus 19 Da

b-ions

y-ions


The high mass y-series predicts the presence of some low mass b-ions; are they there?

y-series: [242]VLSL…

b2: 243 Yes

b3: 342 Yes

b4: 455 Yes

b5: 542 Yes

b6: 655!! Yes

b-ions

y-ions


Can I account for most of the remaining ions as neutral losses or internal fragments?

[242]VLSLLVESK

242 = N+Q, N+K, L+E

b-ions

y-ions

neutral loss


Two approaches to manually sequencing peptides from MS/MS spectra

  • Finding a series of ions in the middle of the peptide, and working out towards one of the termini (illustrated using ion trap data)

  • Finding the C-terminus and working towards the N-terminus (illustrated using qtof data)


Outline

  • Basics

  • Manual De Novo Sequencing

  • De novo Sequencing Algorithm (PEAKS)

  • Homology Search with De Novo Tags


Algorithm Design

  • The first thing for algorithm design is to define the property of the solution.

  • For the de novo sequencing problem, one wants to compute a peptide that “best matches” the given spectrum.

  • This “best match” is practically defined by a scoring function.


Peptide-Spectrum Match Score

peptide

suffix

prefix

  • A fragment score can be computed for every two adjacent amino acids. This score depends on the presence of the corresponding b and y ions.

  • The peptide score is the sum of the fragment scores.


The Fragment Score for a Mass

peptide

suffix

prefix

  • The fragment score calculation only requires the prefix mass but not the sequence

    • Note: suffix mass = total residue mass – prefix mass.

  • Thus it is possible to define score for each prefix mass value .


How to Define Statistically

  • Learn two probabilities from large training data

    • : Prob(a peak is observed at a y-ion m/z).

    • : Prob(a peak is observed at a random m/z).

    • Usually .

  • If an expected y-ion is observed, is added to .

    • is called the log-likelihood-ratio

  • If an expected y-ion is missing, , is added to .

  • Thus, matching ion is rewarded and missing ion is penalized.

  • Other fragment ion types can be considered similarly.


De Novo Sequencing

  • For a sequence with prefix masses the peptide score is defined as

  • De Novo Seuqenicng: Given scoring function and mass , computes a sequence P with total residue mass , and maximizing .


Algorithm Idea

  • : the maximized score that can be achieved by a prefix with mass .

best sequence for

  • If is the best sequence for , then must be the best sequence for .

  • Thus, .

  • To compute , try 20 residues and use the one that maximizes the above formula.


Dynamic Programming

  • The algorithm initializes and all other cells to be .

  • Then computes for from 1 to by

    .

  • The best sequence can be retrieved by a backtracking process by repetitively computing the last amino acid .

BestScore

0

3

1

2


A Note on PTM

  • Variable PTM does not cause major speed slow down for de novo sequencing algorithms.

    • Instead of trying 20 regular amino acids in the maximization, the algorithm simply tries all modified amino acids too.

    • The time complexity is increased by a constant factor. (Compare to the exponential growth in database search approach).

  • However, since the solution space is larger when many variable PTMs are allowed, the accuracy of the algorithm is reduced.


Accounting for Other Ion Types

  • When internal cleavage ions are considered in the scoring function, it becomes difficult to design efficient algorithm to find the optimal sequence.

  • A compromise between efficiency and accuracy is to employ a two-stage approach.

    • First, compute many (e.g. 10,000) sequences using an efficient score function that uses only a few of the most important ions.

    • Then, evaluate these candidates using a more sophisticated scoring function additional ions.

  • This two-round approach is a tradeoff between the algorithm speed and accuracy.


Mass Segment Error

  • Most errors are due to incomplete ion ladders in the spectrum.

    • Thus, a segment of amino acids cannot be determined.

    • However, the total mass of the segment, is fixed.

    • E.g. [242]VLSLLVESK, where 242 = N+Q, N+K, or L+E

  • The first two or three residues often have low confidence, because of a lack of fragment ions.

  • Most de novo sequencing software uses the precursor mass as a constraint (thus the mass of the derived sequence is usually correct).


Outline

  • Basics

  • Manual De Novo Sequencing

  • De novo sequencing Algorithm

  • Homology Search with De Novo Tags


Why Homology Search with De Novo Sequence

  • Advantages:

    • Database may not contain the exact peptide sequence, but a homologous one is there.

    • De novo + homology search is great to use the database of one organism to study a similar organism.

  • Disadvantages:

    • De novo sequence can only provide partially correct sequence tags.

    • Conventional homology search may fail due to de novo sequencing errors.


Traditional Sequence Alignment

  • Two peptide sequences are aligned by inserting spaces to appropriate positions. E.g.

    FVEVTKL-TDLTK

    | || || |||||

    FAEV-KLVTDLTK

  • The matching residues (including gaps, ‘-’) in each column has a similarity score that can be looked up in a pre-defined amino acid substitution matrix, such as BLOSUM or PAM.

  • The alignment score is equal to the sum of the column-wise scores.

  • There are algorithms to compute the optimal alignment that maximizes the alignment score.


Limitations of Conventional Homology Search

  • Conventional search ignores the possible errors in de novo sequencing.

  • Suppose a true sequence is SLCAFK, and the de novo sequence is LSCFAK, and the homolog is SLAAFK.

(denovo) X: LSCFAK

|

(homolog) Z: SLAAFK

(denovo) X: [LS]C[FA]K

(real) Y: [SL]C[AF]K

|| || |

(homolog) Z: [SL]A[AF]K

Conventional search using evolutionary similarities to explain the mismatches results in a poor match.

If de novo sequencing errors are considered, the match becomes more significant.


A Simple Approach

  • We can enumerate all possible combinations of a mass segment, and search all of them together.

    • MS BLAST will do this.

  • Difficulties:

    • Do not know which portion of the sequence is error.

    • Exponential growth of possibilities.

LSCFAK

SLCFAK

TVCFAK

VTCFAK

LSCAFK

SLCAFK

TVCAFK

VTCAFK

[LS]C[FA]K


SPIDER Model

(de novo) X: [LS]C[FA]K

(real) Y: [SL]C[AF]K

|| || |

(homolog) Z: [SL]A[AF]K

  • Given a de novo sequence X, and a database sequence Z. Try to reconstruct the real sequence Y.

    • The difference between X and Y is explained by de novo sequencing errors.

    • The difference between Y and Z is explained by homology mutations.

  • The real Y should minimize the de novo errors and the homology mutations needed in the above explanation.


Two exercises

(denovo) X: LSCFAV

(real) Y: SLCFAV

(homolog) Z: SLCF-V

  • The swap of L and S is more likely a de novo error than a mutation.

  • The deletion of A is unlikely a de novo error (de novo does not change peptide mass).

(denovo) X: LSCFV

(real) Y: EACFV

(homolog) Z: DACFV

  • Mutation and de novo error overlap. Hard for manual interpretation. Algorithm is needed.

blosum62

m(LS)=m(EA)=200.1 Da


Conclusion

  • When the target peptides are not in a database.

    • De novo sequencing

  • When the homologous peptides are in database

    • Homology search with the de novo tags can find them

    • Some de novo errors can be corrected by combining the homolog information


  • Login