Popitam,

Popitam, a mutation- and modification-tolerant method for identifying proteins from tandem mass spectrometry (MS/MS) data. Patricia Hernandez, Swiss Institute of Bioinformatics.


Popitam,

a mutation- and modification-tolerant method for identifying proteins from mass spectrometry (MS/MS) data

Patricia Hernandez

Swiss Institute of Bioinformatics


Overview

- proteomics
  - proteome
  - proteome visualization: 2D gels

- protein identification
  - classical workflow
  - shared peak count (SPC)

- modifications and identification
  - modified peptides
  - SPC
  - spectral alignment, de novo sequencing, tag extraction

- Popitam
  - overview
  - tags
  - scoring function, genetic programming
  - some results


proteomics

Proteome

--> Proteomics: the science that studies the proteins expressed by a genome, i.e. the proteome

--> the proteome changes with the state of development, the tissue or the environmental conditions

--> identification and quantification
--> 3D structure prediction
--> localisation in the cell
--> biological function
--> modifications
--> interactions with other proteins

...


proteomics

2D gels

--> a simple way to "see" a proteome

--> numerous proteins from a biological sample (example: blood) are separated according to 2 criteria:

- molecular weight of the protein
- isoelectric point

--> this method allows separating thousands of proteins simultaneously and displaying them on a two-dimensional map

--> spot = (generally) one purified protein

--> we can "see" the proteins, but we don't know which protein a given spot corresponds to...


protein identification

Spot identification: classical workflow

--> identify a spot = give a protein name to a spot

--> protein databases (for example SwissProt)
- record all known protein sequences
- annotated

Workflow, in order:

- select an unknown purified protein, e.g. MGQGWATAGLPSFRPEPYKCYGHPVPSQEASQQVTVKTHGTSSQATTSSQK…
- cut the amino acid chain into peptides (after every K and R)
- measure the mass of the peptides by MS
- MS identification (PMF, peptide mass fingerprinting)
- select a peptide
- fragment it
- measure the mass of the fragments by MS
- MS/MS identification
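The digestion step of the workflow (cutting the chain after every K and R) can be sketched in a few lines. This is a minimal illustration, not Popitam's code: the function name `digest` is hypothetical, and the rule is the plain "after K or R" rule given on the slide, with no exception for a following proline.

```python
def digest(sequence, cleave_after="KR"):
    """Cut a protein sequence after every K and R (simplified rule from the slide)."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        if aa in cleave_after:
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        # keep the C-terminal remainder when the chain does not end with K or R
        peptides.append(sequence[start:])
    return peptides


# Beginning of the example sequence shown on the slide
print(digest("MGQGWATAGLPSFRPEPYKCYGHPVPSQEASQQVTVK"))
```

With this simplified rule the example sequence yields the peptide PEPYK, the peptide used in the fragmentation figure of a later slide.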


protein identification

Shared peak count

MS spectrum: list of the masses of the peptides that constitute the protein of interest

MS/MS spectrum: list of the masses of the fragments that constitute a peptide of the protein of interest

MS: virtually cut the theoretical sequence into peptides and compute the masses

MS/MS: virtually cut the theoretical sequence into peptides, further cut the peptides into fragments, and compute the masses

Compare the lists of experimental and theoretical masses in order to find the best match between the experimental and virtual spectra

--> detection

--> ions

--> noise

[Figure: protein database entries, e.g. hbb_human]
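A shared peak count can be sketched as follows: for each experimental mass, check whether some theoretical mass lies within a tolerance, and count the hits. The function name, the tolerance value and the toy mass lists are assumptions made for illustration; real search engines use more elaborate matching and scoring.

```python
import bisect


def shared_peak_count(experimental, theoretical, tol=0.5):
    """Count experimental masses that have a theoretical mass within +/- tol."""
    theo = sorted(theoretical)
    shared = 0
    for mz in experimental:
        i = bisect.bisect_left(theo, mz - tol)  # first theoretical mass >= mz - tol
        if i < len(theo) and theo[i] <= mz + tol:
            shared += 1
    return shared


# Toy data (hypothetical): one experimental spectrum, two candidate virtual spectra
experimental = [100.0, 200.2, 300.0, 455.5]
candidates = {
    "candidate_a": [100.1, 200.0, 300.3, 400.0],
    "candidate_b": [150.0, 300.1],
}
best = max(candidates, key=lambda name: shared_peak_count(experimental, candidates[name]))
print(best)
```

The candidate with the highest count is reported as the best match; peaks without a partner (here 455.5) are treated as noise.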


modifications and identification

Modified peptides (1)

PTMs (post-translational modifications)

--> affect most eukaryotic proteins

--> addition of a chemical group:
- methylation: +14 Da
- phosphorylation: +80 Da
- glycosylation: >800 Da
...

--> participate in:
- protein structures
- protein functions
- control of metabolic pathways

The sequence in the database may differ from the experimental peptide:

CONFLICT (different sources report differing sequences)

--> in about 4'600 human entries

VARIANT (authors report that sequence variants exist) = alleles

--> in about 2'200 human entries

MUTATIONS associated with diseases

--> 187 references to mutations and diseases in the COMMENTS section


modifications and identification

Modified peptides (2)

[Figure: a modified protein (MGQGWATAGLPSFRPEPYKCYGHPVPSQEASQQVTVKTHGTSSQATTSSQK…) goes through digestion, then MS and selection of the peptide (PEPYK), then fragmentation (fragments such as PEP, EPYK, PYK); the spectra are plotted as intensity versus m/z]


modifications and identification

SPC and modified peptides

[Figure: an experimental MS/MS spectrum and a modified experimental MS/MS spectrum, each compared with the spectrum of the theoretical peptide; axes: intensity versus m/z]

"Shared peak count" algorithms have to introduce modifications into the theoretical peptide databases.


modifications and identification

Database size (1)

Initial sequence: AAIEGKLMQRAPALK

Initial database: AAIEGK, LMQR, APALK

New database, if the two following modifications are taken into account:

- modification occurring on amino acid A: A->a

- modification occurring on amino acids L: L->l and E: E->e

AAIEGK --> AAIEGK aAIEGK AaIEGK aaIEGK AAIeGK aAIeGK AaIeGK aaIeGK

LMQR --> LMQR lMQR

APALK --> APALK aPALK APaLK aPaLK APAlK aPAlK APalK aPalK

= all the peptides from the initial database, plus all modified peptides that can be built from the initial database
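The expansion of the database can be reproduced by enumerating, for every modifiable position, the choice "unmodified or modified". The sketch below follows the slide's notation (lowercase letter = modified residue); the function name `modified_variants` is hypothetical.

```python
from itertools import product


def modified_variants(peptide, modifications):
    """Return the peptide plus every variant obtained by modifying any subset
    of its modifiable positions (modifications maps residue -> modified symbol)."""
    choices = [
        (aa, modifications[aa]) if aa in modifications else (aa,)
        for aa in peptide
    ]
    return ["".join(variant) for variant in product(*choices)]


modifications = {"A": "a", "L": "l", "E": "e"}
for peptide in ["AAIEGK", "LMQR", "APALK"]:
    variants = modified_variants(peptide, modifications)
    print(peptide, len(variants), variants)
```

A peptide with k modifiable positions produces 2^k entries, which is why the database grows so quickly.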


modifications and identification

Database size (2)

Aim: assess the number of peptides that contain zero, one, two... "positions" for a possible modification

B(L,p,k) gives the probability of having k positions of modification in a sequence of length L, if p is the probability that a position may be modified

(we assume the positions to be independent)

N0: xxxx

N1: oxxx xoxx xxox xxxo

N2: ooxx oxox oxxo xoox xoxo xxoo

L = 10, p = 1/20:

800'000 = 478'990 + 252'100 + 59'710 + 8'380 + 771 + c

L = 10, p = 5/20:

800'000 = 45'050 + 150'169 + 225'254 + 200'225 + 116'798 + c
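The two decompositions above are consistent with B(L,p,k) being the binomial probability C(L,k) p^k (1-p)^(L-k), applied to 800'000 peptides. The sketch below assumes that reading (the slide does not spell the formula out) and recomputes N0 to N4; function names are hypothetical.

```python
from math import comb


def binomial(L, p, k):
    """Probability of exactly k modifiable positions among L independent positions."""
    return comb(L, k) * p**k * (1 - p) ** (L - k)


def peptide_counts(n, L, p, k_max):
    """Expected number of peptides with 0, 1, ..., k_max modifiable positions."""
    return [n * binomial(L, p, k) for k in range(k_max + 1)]


for p in (1 / 20, 5 / 20):
    counts = peptide_counts(800_000, 10, p, 4)
    print(p, [round(c) for c in counts])
```

The computed values agree with the slide's figures to within rounding; the remainder c collects the peptides with more than four modifiable positions.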


modifications and identification

Database size (3)

Expected number s of peptides that may contain exactly M modifications

Expected size of database when taking into account 0 to M modifications

[Figure: peptides grouped by their number of modifiable positions (N0, N1, N2, ...), shown as patterns such as xxxx, oxxx, xoxx, xxox, xxxo, ooxx, ...]


modifications and identification

Database size (3)

SwissProt Human, 10'000 proteins

n = 806'787 peptides [300,3000] (=~ from 3 to 30 aa)

L = 11 amino acids

0 to 3 modifications occurring on one specific amino acid: p = 1/20

P0to3_mod = 1'375'700 + c

0 to 3 modifications that may occur on several loci:

Phosphorylation: H, D, S, T, Y (eukaryotes): p = 5/20

P0to3_mod = 4'865'100 + c

0 to 3 modifications that may occur on every amino acid: p = 1

P0to3_mod = 3.97e12 + c

Mutation scenario:

Each amino acid may mutate into one of the remaining 19 amino acids: All possible words = 19k-1

P1_mut = 1.16e14
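The formulas behind these figures did not survive in the transcript. One reading that happens to reproduce the first two modification figures (p = 1/20 and p = 5/20) is sketched below: peptides with k modifiable positions are counted with a binomial term, each contributes one database entry per way of placing 0 to 3 modifications on its k positions, and the sum is cut at k = 4 (the rest being the "+ c"). This is a reconstruction under those assumptions, not the author's stated formula, and it does not reproduce the p = 1 or the mutation figures.

```python
from math import comb


def binomial(L, p, k):
    """Probability of exactly k modifiable positions among L independent positions."""
    return comb(L, k) * p**k * (1 - p) ** (L - k)


def expected_db_size(n, L, p, max_mods, k_max):
    """Assumed reading: sum over k <= k_max of
    (expected peptides with k modifiable positions) x (placements of 0..max_mods modifications)."""
    total = 0.0
    for k in range(k_max + 1):
        placements = sum(comb(k, m) for m in range(min(max_mods, k) + 1))
        total += n * binomial(L, p, k) * placements
    return total


print(round(expected_db_size(806_787, 11, 1 / 20, max_mods=3, k_max=4)))
print(round(expected_db_size(806_787, 11, 5 / 20, max_mods=3, k_max=4)))
```

Both results fall within rounding distance of the slide's 1'375'700 and 4'865'100.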


modifications and identification

Other strategies

2 major problems:

- size of the database

- a priori knowledge of the deltaMass due to the modification

Solutions:

Define an identification algorithm that is not based on SPC

--> spectral convolution/alignment

- PEDANTA (2000)

--> de novo sequencing followed by sequence matching

- extraction of one or several complete sequences: LUTEFISK (1997), SHERENGA (1999)...

- extraction of one or several small tags: PeptideSearch (1994), Patchwork sequencing...

--> Popitam (2003): "guided" sequencing


modifications and identification

Spectral convolution/alignment

Pevzner PA, Dancik V, Tang CL: Mutation-tolerant protein identification by mass spectrometry. J.Comput.Biol. 2000, 7:777-787

Key idea: k-similarity D(k)

Given Sexp and Stheo, the goal is to find a series of k shifts in Sexp that makes Sexp and Stheo as similar as possible.

D(k) represents the maximum number of elements in common between a theoretical and an experimental spectrum after k shifts

[Formula: a recurrence with one case "if (i',j') and (i,j) are co-diagonal" and one case "otherwise"]

[Figure: theoretical MS/MS spectrum and experimental MS/MS spectrum, with peaks labelled A to F]

SPC score: D(k=0) = 2

SA score: D(k=2) = 6
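The idea of a shift can be illustrated with a spectral convolution: count how often each mass difference (experimental minus theoretical) occurs. The count at difference 0 is the shared peak count, and a frequent non-zero difference hints at a mass shift carried by part of the peptide. This is only a simplified sketch with toy masses; it is not the dynamic-programming spectral alignment of the cited paper, and all names are hypothetical.

```python
from collections import Counter


def spectral_convolution(experimental, theoretical, precision=1):
    """Count the occurrences of each mass difference (experimental - theoretical)."""
    return Counter(
        round(e - t, precision) for e in experimental for t in theoretical
    )


# Toy spectra (hypothetical): the last three experimental peaks are shifted by +80
theoretical = [71.0, 184.1, 313.2, 410.3, 523.4]
experimental = [71.0, 184.1, 393.2, 490.3, 603.4]

conv = spectral_convolution(experimental, theoretical)
print("shared peaks without shift:", conv[0.0])
print("most frequent difference:", conv.most_common(1)[0])
```

In this toy case two peaks match without any shift, while a single shift of +80 brings the three remaining peaks into agreement as well.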


modifications and identification

De novo sequencing

Taylor JA, Johnson RS: Sequence database searches via de novo peptide sequencing by tandem mass spectrometry. Rapid Commun.Mass Spectrom. 1997, 11:1067-1075

Longest path problem in a directed acyclic graph

--> dynamic programming

--> complete sequences

--> mutations, but no modifications
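The longest-path formulation can be sketched on a tiny spectrum graph: nodes are candidate prefix masses, an edge links two nodes whose mass difference equals an amino acid residue mass, and dynamic programming over the mass-sorted nodes keeps the longest labelled path. This is a minimal sketch, not the algorithm of the cited paper: the residue table is deliberately partial, and the tolerance and the toy masses are assumptions.

```python
# Partial table of monoisotopic residue masses (Da); a real tool would list all 20
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406,
}


def longest_path_sequence(prefix_masses, tol=0.01):
    """Longest path in the spectrum graph, by dynamic programming over sorted nodes."""
    nodes = sorted(prefix_masses)
    best = [(0, "")] * len(nodes)  # (path length, sequence) of the best path ending at each node
    for j in range(len(nodes)):
        for i in range(j):
            delta = nodes[j] - nodes[i]
            for aa, mass in RESIDUE_MASS.items():
                if abs(delta - mass) <= tol and best[i][0] + 1 > best[j][0]:
                    best[j] = (best[i][0] + 1, best[i][1] + aa)
    return max(best)[1]


# Toy prefix masses for G, GA, GAS plus one noise peak at 100.0
print(longest_path_sequence([0.0, 57.02146, 100.0, 128.05857, 215.0906]))
```

Because every edge must equal a known residue mass, a substituted residue still produces a valid path, whereas an unexpected modification mass breaks it; this is the limitation stated on the slide.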


modifications and identification

Tag extraction

Mann M, Wilm M: Error-tolerant identification of peptides in sequence databases by peptide sequence tags. Anal.Chem. 1994, 66:4390-4399

- island of sequence ions
- the tags (m1-SEQ-m2) are manually extracted
- 2 steps: tags used as a filter, then SPC

Schlosser A, Lehmann WD: Patchwork peptide sequencing: Extraction of sequence information from accurate mass data of peptide tandem mass spectra recorded at high resolution. Proteomics. 2002, 2:524-533

- based on very accurate masses (10 mDa)
- small tags (2 aa) are extracted from low mass regions


Popitam

Popitam's key ideas

Spectrum graph

--> good way to structure the information contained in the MS/MS spectrum, allows mutations

Tags

--> modified source peptides

--> fragmented spectra

Search space

--> use database information during tag extraction

--> take into account only mutations compatible with the spectrum (graph)

--> build only modification scenarios compatible with the current theoretical peptide

Scoring function

--> take into account a lot of parameters

--> genetic programming


Popitam

Popitam overview

[Figure labels: any source of biological sequences; peptide sequence database; filter; P1, P2, ...; MS/MS; initial node; final node; I(P1), I(P2), ...; IDENTIFICATION]

For each Pi:

    extractTags();
    processTags();
    score();


Popitam

Spectrum graph

- # nodes > # peaks
- families
- selection based on intensity
- for each peak, make all possible hypotheses

"N-term": bMass = chargeNb * m/z - (chargeNb-1) - offset

"C-term": bMass = PM - […]

[Figure: a peak of measured mass [m/z] interpreted under several ionic hypotheses (b+, b+-NH3, a+-H2O, y++), each giving a bMass (ideal fragmentation)]
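The "N-term" conversion can be written directly from the slide's formula; one peak then yields one candidate node per (ion type, charge) hypothesis, which is why the graph has more nodes than peaks. The offset values in the table below are hypothetical placeholders chosen for illustration, not Popitam's parameters, and the elided "C-term" formula is not covered.

```python
def b_mass_nterm(mz, charge, offset):
    """Slide formula for N-terminal ions: bMass = chargeNb * m/z - (chargeNb - 1) - offset."""
    return charge * mz - (charge - 1) - offset


# Hypothetical offsets (Da) relative to the ideal b ion, for illustration only
HYPOTHETICAL_OFFSETS = {"b": 0.0, "b-NH3": -17.03, "a-H2O": -46.01}


def node_hypotheses(mz, max_charge=2):
    """All (ion type, charge, bMass) hypotheses generated by one peak."""
    return [
        (ion, z, b_mass_nterm(mz, z, offset))
        for ion, offset in HYPOTHETICAL_OFFSETS.items()
        for z in range(1, max_charge + 1)
    ]


for hypothesis in node_hypotheses(300.0):
    print(hypothesis)
```

Hypotheses from different peaks that land on the same bMass can then be merged into one node, which is presumably what the slide calls "families".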


Popitam

Tag extraction

ck TE et vm go EV

LTE Let Lvm ITE Iet Ivm tlE

peLTE peLet peLvm peITE peIet peIvm petlE

9 nodes, 11 edges

--> 21 tags


Popitam

Tag extraction (2)

Pentium, 1.6 GHz

LVNELTEFAK (125 peaks)

AIGGGLSSVGGSSTIK (1159 peaks)

1   16/97     5.6*10^4    0m02s
2   30/338    5.4*10^6    0m27s
3   44/692    5.7*10^7    3m16s
4   58/1121   3.4*10^8    21m09s
5   72/1667   2.3*10^9    2h17m07s

AHFSISNSAEDPFIAIHADSK (145 peaks)

1   24/121    6.1*10^4    0m02s
2   46/308    1.9*10^8    16m15s
3   68/831    2.0*10^10   22h06m47s


Popitam

Tag extraction (3)

Recursively extract from the graph all tags that are compatible with the current theoretical peptide

--> a tag = a path (bMass, edge label, ionic hypothesis…)

[Figure: recursion tree for the theoretical peptide ACCACMCAK, following graph edges labelled A, C, k, ... while the remaining suffix shrinks (CACMCAK, CMCAK, MCAK)]
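A guided extraction can be sketched as a depth-first walk that only follows an edge when its label matches the next residue(s) of the theoretical peptide. The graph representation (a dictionary from node to labelled edges), the minimum tag length and the toy graph are assumptions made for this illustration; ionic hypotheses and masses are left out.

```python
def extract_tags(graph, peptide, min_len=2):
    """Return (sequence, node path) for every maximal path whose edge labels
    spell a substring of the theoretical peptide."""
    tags = []

    def extend(node, pos, seq, path):
        extended = False
        for label, nxt in graph.get(node, []):
            if peptide.startswith(label, pos):  # edge compatible with the peptide
                extended = True
                extend(nxt, pos + len(label), seq + label, path + [nxt])
        if not extended and len(seq) >= min_len:
            tags.append((seq, path))

    for node in graph:
        for pos in range(len(peptide)):
            extend(node, pos, "", [node])
    return tags


# Toy spectrum graph (hypothetical): node -> list of (edge label, next node)
graph = {0: [("A", 1)], 1: [("C", 2), ("G", 3)], 2: [("C", 4)], 3: [], 4: []}
print(extract_tags(graph, "ACCAK"))
```

The edge labelled G is never followed because the peptide offers no G at that position, which is how the database information prunes the search. The output still contains subtags (here CC inside ACC).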


Popitam

Tag processing

- discard subtags
- discard tags that begin the theoretical peptide, but not the graph (and vice versa)
- discard tags that finish on the last aa, but not on the last node
- group "family" tags

AVVQDPALKPLALVYGEATSR

    PeakNb: 1260
    ParentMass: 2197.15
    NodeNb: 86
    EdgeNb: 142 / 1098

29 tags --> 13 subSeqs

KplALVYGE   30 39 43 45 50 58 63 64 68
plALVYGE    39 43 45 50 58 63 64 68
ALVYGE      43 45 50 58 63 64 68
LVYGE       45 50 58 63 64 68
VYGE        50 58 63 64 68
YGE         58 63 64 68

paLKplALvy  0 4 10 16 22 26 31 42
LKplALvy    4 10 16 22 26 31 42
KplALvy     10 16 22 26 31 42
plALvy      16 22 26 31 42
ALvy        22 26 31 42

LKPla       10 13 19 22 31
LKPla       10 14 19 22 31
KPla        13 19 22 31
KPla        14 19 22 31

PLAlv       29 35 40 42 48

LAlv        35 40 42 48

DpaL        65 69 78 84

LKP         11 15 20 24

LVY         16 19 24 29

LVY         44 49 57 62
PAL         19 22 26 31

QDP         10 16 20 24

alkpL       54 63 71 75

avVqd       0 5 9 18
dpAL        37 43 45 50
avVQD       55 60 65 70 75
VQD         60 65 70 75

paLK        59 66 69 75
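The first rule, discarding subtags, can be sketched by treating a tag as a subtag when its node list is a contiguous part of another tag's node list. That criterion is an assumption suggested by the listing above (for example plALVYGE uses the last eight nodes of KplALVYGE); the other rules and the grouping into families are not shown.

```python
def is_subpath(short, long):
    """True if the node list 'short' is a strictly shorter contiguous slice of 'long'."""
    n = len(short)
    return n < len(long) and any(
        long[i:i + n] == short for i in range(len(long) - n + 1)
    )


def discard_subtags(tags):
    """Keep only the tags whose node path is not contained in another tag's path."""
    return [
        tag for tag in tags
        if not any(is_subpath(tag[1], other[1]) for other in tags)
    ]


# A few tags taken from the listing: (sequence, node numbers)
tags = [
    ("KplALVYGE", [30, 39, 43, 45, 50, 58, 63, 64, 68]),
    ("plALVYGE", [39, 43, 45, 50, 58, 63, 64, 68]),
    ("YGE", [58, 63, 64, 68]),
    ("LKPla", [10, 13, 19, 22, 31]),
    ("KPla", [13, 19, 22, 31]),
]
print([sequence for sequence, _ in discard_subtags(tags)])
```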


Popitam

Subsequence processing (1)

Aim:

Find all possible arrangements of subsequences, given the theoretical peptide, BUT do not include in the same arrangement tags that are incompatible with the others.

Compatibility rules:

--> no peak shared

--> beginMasses must respect positions in the sequences

A V V Q D P A L K P L A L V Y G E A T S R (positions 0, 5, 10, 15 marked under the sequence)

Subsequences (index, subsequence, beginMass, numbers):

0   KplALVYGE   794.41    0 1 2 6 15 19 21 27 30
1   LKPla       282.17    2 7 29 33 41
2   PLAlv       785.34    6 8 19 21 28
3   DpaL        1673.89   14 20 31 36
4   LKP         284.11    17 22 32 36
5   LVY         410.26    14 22 28 29
...

Compatibility graph

[Figure: compatibility matrix between subsequences 0, 1, 2, 3, 4, 5, ..., with an x marking each compatible pair]

Each clique found in the graph is a possible arrangement of subsequences. Here, 91 cliques, but most of them are really uninteresting.
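The clique search can be sketched with the classic Bron-Kerbosch recursion. The sketch makes two loud assumptions: the numbers listed after each beginMass are read as peak indices, and only the first rule ("no peak shared") is applied, so the resulting graph is not expected to match the slide's matrix. It also lists maximal cliques only, whereas the slide counts every clique.

```python
from itertools import combinations

# Subsequence index -> assumed peak indices (numbers listed on the slide)
PEAKS = {
    0: {0, 1, 2, 6, 15, 19, 21, 27, 30},  # KplALVYGE
    1: {2, 7, 29, 33, 41},                # LKPla
    2: {6, 8, 19, 21, 28},                # PLAlv
    3: {14, 20, 31, 36},                  # DpaL
    4: {17, 22, 32, 36},                  # LKP
    5: {14, 22, 28, 29},                  # LVY
}


def compatibility_graph(peaks):
    """Link two subsequences when they share no peak (first rule only)."""
    adj = {i: set() for i in peaks}
    for a, b in combinations(peaks, 2):
        if not peaks[a] & peaks[b]:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def maximal_cliques(adj):
    """Bron-Kerbosch enumeration of maximal cliques (without pivoting)."""
    found = []

    def expand(r, p, x):
        if not p and not x:
            found.append(sorted(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return sorted(found)


print(maximal_cliques(compatibility_graph(PEAKS)))
```

Adding the second rule would only remove edges from the graph, so the same enumeration applies unchanged.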


Popitam

Scoring function (1)

--> 2-level scoring:

- scoring linked to the subsequences (local)

subscores:

number of tags that compose the subsequence

length of the subsequence

occurrence probabilities of the hypothesized ionic types (geometric/arithmetic mean)

- scoring linked to the arrangement (global)

subscores:

global coverage

linear regression

Example arrangements for AVVQDPALKPLALVYGEATSR (subsequence, beginMass):

KplALVYGE 794.4, LKP 284.1, LVY 1202.7

KplALVYGE 794.4, LKP 284.1, LVY 1202.7, avVqd 1.0

KplALVYGE 794.4, LKP 284.1, avVqd 1.0

KplALVYGE 794.4, LVY 1202.7

...


Popitam

Scoring function (2)

How can we combine the subscores in order to build an efficient scoring function?

--> empirical function (expert knowledge)

--> probabilistic function

--> function built using GENETIC PROGRAMMING

GENETIC PROGRAMMING

population of "programs": trees

nodes:
- mathematical operators (+, -, *, /, ^, ...)
- boolean operators (AND, OR, NOT...)
- conditional operators (if-then-else...)
- iterative functions (do-until...)
- other specific functions...

leaves: subscores, coefficients
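A scoring "program" of this kind can be represented as a small expression tree whose inner nodes are operators and whose leaves are subscores or coefficients. The sketch below shows only the evaluation of such a tree with four arithmetic operators; the tuple encoding, the protected division and the subscore names are assumptions made for illustration.

```python
import operator

# Inner nodes: binary operators (division is "protected" against a zero divisor)
OPERATORS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": lambda a, b: a / b if b else 1.0,
}


def evaluate(tree, subscores):
    """Evaluate a tree given as (operator, left, right), a subscore name, or a number."""
    if isinstance(tree, tuple):
        op, left, right = tree
        return OPERATORS[op](evaluate(left, subscores), evaluate(right, subscores))
    if isinstance(tree, str):
        return subscores[tree]  # leaf: named subscore
    return float(tree)          # leaf: coefficient


# Hypothetical scoring function: coverage * 2.0 + length
scoring_function = ("+", ("*", "coverage", 2.0), "length")
print(evaluate(scoring_function, {"coverage": 0.5, "length": 9}))
```

Mutation and crossing-over then amount to replacing or swapping subtrees of such tuples.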


Popitam

Genetic operators (1)

Initialization:

Programs are initially generated at random (structure, functions, values)

Iterations:

At each iteration, the programs are evaluated (fitness function). Only the best are allowed to reproduce, using genetic operators (permutation, mutation, crossing-over...).


Popitam

Genetic operators (2)


Popitam

Genetic programming

Genetic programming allows testing several scoring functions and making them "cleverly" evolve in order to find an optimal one.

if (correctId())
    si ∈ ]0.5;1[     (according to the discriminative power)
else {
    if (belongToList())
        si ∈ ]0;0.5]     (according to the position in the list)
    else
        si = 0;
}

[Figure: a tree population of scoring functions (scoring function1, scoring function2, scoring function3, ...); each one is plugged into Popitam and receives a fitness]
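The fitness rule can be sketched as a function of where the correct peptide lands in the candidate list produced by a scoring function. The slide only gives the intervals; the linear mappings inside each interval, the `margin` argument standing for the discriminative power, and the averaging over spectra are assumptions made for this illustration.

```python
def spectrum_fitness(rank, list_size, margin=0.5):
    """Fitness s_i for one spectrum.

    rank      -- 1-based position of the correct peptide in the list, or None if absent
    list_size -- length of the candidate list (at least 2)
    margin    -- assumed measure of discriminative power, clipped to (0, 1)
    """
    if rank == 1:                                   # correctId()
        clipped = min(max(margin, 0.001), 0.999)
        return 0.5 + 0.5 * clipped                  # value in ]0.5;1[
    if rank is not None and rank <= list_size:      # belongToList()
        return 0.5 * (list_size - rank + 1) / (list_size - 1)  # value in ]0;0.5]
    return 0.0


def fitness(results, list_size=10):
    """Assumed overall fitness: mean of s_i over (rank, margin) pairs."""
    scores = [spectrum_fitness(rank, list_size, margin) for rank, margin in results]
    return sum(scores) / len(scores)


print(fitness([(1, 0.5), (2, 0.0), (None, 0.0)]))
```

A scoring function that ranks the correct peptide first with a clear margin is rewarded most, one that merely keeps it in the list is rewarded less, and one that loses it gets nothing.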


Popitam

Some results
