1 / 37

T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)



  1. T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models). Morten Nielsen, CBS, BioCentrum, DTU

  2. Processing of intracellular proteins and MHC binding (figure: http://www.nki.nl/nkidep/h4/neefjes/neefjes.htm)

  3. What makes a peptide a potential and effective epitope?
  • Part of a pathogen protein
  • Successful processing
  • Proteasome cleavage
  • TAP binding
  • Binds to MHC molecule
  • Protein function
  • Early in replication
  • Sequence conservation in evolution

  4. From proteins to immunogens: 20% processed, 0.5% bind MHC, 50% give a CTL response (Lauemøller et al., 2000) => roughly 1/2000 peptides is immunogenic

  5. MHC Class I and II
  • Class I
  • Peptides 8-12 amino acids long
  • Intracellular pathogen presentation
  • Broad range of bioinformatical prediction tools
  • Class II
  • Peptides 13+ amino acids long
  • Intravesicular pathogen presentation
  • Few prediction tools

  6. MHC class I with peptide http://www.nki.nl/nkidep/h4/neefjes/neefjes.htm

  7. Prediction of HLA binding specificity
  • Simple motifs: allowed/non-allowed amino acids
  • Extended motifs: amino acid preferences (SYFPEITHI); anchor/preferred/other amino acids
  • Hidden Markov models: peptide statistics from a sequence alignment
  • Neural networks: can take sequence correlations into account

  8. SYFPEITHI database
  • Anchors: required for binding
  • Auxiliary anchors: help binding

  9. Pattern recognition. 10 peptides from the MHCpep database that bind the MHC complex A*0201:
  ALAKAAAAM ALAKAAAAM ALAKAAAAM ALAKAAAAV ALAKAAAAV GMNERPILV GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV
  Which of the following are most likely to bind?
  1. FLLTRILTI
  2. WLDQVPFSV
  3. TVILGVLLL
  The regular expression X1[LMIV]2X3…X8[MVL]9 says that 2 and 3 will bind and 1 will not, and it cannot tell whether 2 is more likely to bind than 3. The truth is that 1 and 2 bind, with 1 binding the strongest, while 3 does not bind. A probabilistic model can capture this!

  10. Probability estimation
  ALAKAAAAM ALAKAAAAM ALAKAAAAM ALAKAAAAV ALAKAAAAV GMNERPILV GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV

  11. Weight matrices
  • Estimate the amino acid frequencies pij from the alignment
  • A weight matrix is then given as Wij = log(pij/qj)
  • Here i is a position in the motif and j an amino acid; qj is the background frequency for amino acid j
  • In nature not all amino acids are found equally often: PA = 0.07, PW = 0.013. Finding 6% A is hence not significant, but 6% W is highly significant
  • W is an L x 20 matrix, where L is the motif length

  12. Scoring sequences to a weight matrix
       A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V
  1  0.6  0.4 -3.5 -2.4 -0.4 -1.9 -2.7  0.3 -1.1  1.0  0.3  0.0  1.4  1.2 -2.7  1.4 -1.2 -2.0  1.1  0.7
  2 -1.6 -6.6 -6.5 -5.4 -2.5 -4.0 -4.7 -3.7 -6.3  1.0  5.1 -3.7  3.1 -4.2 -4.3 -4.2 -0.2 -5.9 -3.8  0.4
  3  0.2 -1.3  0.1  1.5  0.0 -1.8 -3.3  0.4  0.5 -1.0  0.3 -2.5  1.2  1.0 -0.1 -0.3 -0.5  3.4  1.6  0.0
  4 -0.1 -0.1 -2.0  2.0 -1.6  0.5  0.8  2.0 -3.3  0.1 -1.7 -1.0 -2.2 -1.6  1.7 -0.6 -0.2  1.3 -6.8 -0.7
  5 -1.6 -0.1  0.1 -2.2 -1.2  0.4 -0.5  1.9  1.2 -2.2 -0.5 -1.3 -2.2  1.7  1.2 -2.5 -0.1  1.7  1.5  1.0
  6 -0.7 -1.4 -1.0 -2.3  1.1 -1.3 -1.4 -0.2 -1.0  1.8  0.8 -1.9  0.2  1.0 -0.4 -0.6  0.4 -0.5 -0.0  2.1
  7  1.1 -3.8 -0.2 -1.3  1.3 -0.3 -1.3 -1.4  2.1  0.6  0.7 -5.0  1.1  0.9  1.3 -0.5 -0.9  2.9 -0.4  0.5
  8 -2.2  1.0 -0.8 -2.9 -1.4  0.4  0.1 -0.4  0.2 -0.0  1.1 -0.5 -0.5  0.7 -0.3  0.8  0.8 -0.7  1.3 -1.1
  9 -0.2 -3.5 -6.1 -4.5  0.7 -0.8 -2.5 -4.0 -2.6  0.9  2.8 -3.0 -1.8 -1.4 -6.2 -1.9 -1.6 -4.9 -1.6  4.5
  Peptide scores: ILYQVPFSV = 15.0, ALPYWNFAT = -3.4, MTAQWWLDA = 0.8
  Which peptide is most likely to bind? Which peptide second?
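The scoring rule above can be sketched in a few lines: a peptide's score is the sum of the matrix entries for its residue at each position. The matrix is copied from the slide; `score` is an illustrative helper name, and the computed values match the slide's scores up to rounding.

```python
# Score peptides against a position-specific weight matrix:
# score(peptide) = sum over positions i of W[i][peptide[i]].
AA = "ARNDCQEGHILKMFPSTWYV"
ROWS = """\
 0.6  0.4 -3.5 -2.4 -0.4 -1.9 -2.7  0.3 -1.1  1.0  0.3  0.0  1.4  1.2 -2.7  1.4 -1.2 -2.0  1.1  0.7
-1.6 -6.6 -6.5 -5.4 -2.5 -4.0 -4.7 -3.7 -6.3  1.0  5.1 -3.7  3.1 -4.2 -4.3 -4.2 -0.2 -5.9 -3.8  0.4
 0.2 -1.3  0.1  1.5  0.0 -1.8 -3.3  0.4  0.5 -1.0  0.3 -2.5  1.2  1.0 -0.1 -0.3 -0.5  3.4  1.6  0.0
-0.1 -0.1 -2.0  2.0 -1.6  0.5  0.8  2.0 -3.3  0.1 -1.7 -1.0 -2.2 -1.6  1.7 -0.6 -0.2  1.3 -6.8 -0.7
-1.6 -0.1  0.1 -2.2 -1.2  0.4 -0.5  1.9  1.2 -2.2 -0.5 -1.3 -2.2  1.7  1.2 -2.5 -0.1  1.7  1.5  1.0
-0.7 -1.4 -1.0 -2.3  1.1 -1.3 -1.4 -0.2 -1.0  1.8  0.8 -1.9  0.2  1.0 -0.4 -0.6  0.4 -0.5 -0.0  2.1
 1.1 -3.8 -0.2 -1.3  1.3 -0.3 -1.3 -1.4  2.1  0.6  0.7 -5.0  1.1  0.9  1.3 -0.5 -0.9  2.9 -0.4  0.5
-2.2  1.0 -0.8 -2.9 -1.4  0.4  0.1 -0.4  0.2 -0.0  1.1 -0.5 -0.5  0.7 -0.3  0.8  0.8 -0.7  1.3 -1.1
-0.2 -3.5 -6.1 -4.5  0.7 -0.8 -2.5 -4.0 -2.6  0.9  2.8 -3.0 -1.8 -1.4 -6.2 -1.9 -1.6 -4.9 -1.6  4.5
"""
# One dict per motif position, mapping amino acid -> matrix entry
W = [dict(zip(AA, map(float, line.split()))) for line in ROWS.splitlines()]

def score(peptide):
    return sum(W[i][aa] for i, aa in enumerate(peptide))

for pep in ("ILYQVPFSV", "ALPYWNFAT", "MTAQWWLDA"):
    print(pep, round(score(pep), 1))  # ILYQVPFSV scores highest
```

Summing the per-position entries reproduces the slide's ranking: ILYQVPFSV binds best, MTAQWWLDA second.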

  13. Weight-matrix construction: example from real life
  • 10 peptides from the MHCpep database that bind the MHC complex
  • Estimate the sequence motif and weight matrix
  • Evaluate on 528 peptides (not included in the training)
  ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV
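The estimation step can be sketched as below, assuming equal sequence weights and no pseudo-counts (the refinements the next slides add); the flat background frequency of 0.05 is a simplification, since in practice qj is estimated from the amino acid composition of nature.

```python
import math
from collections import Counter

# Minimal weight-matrix estimation from the 10 binders above:
# W[i][j] = log(p_ij / q_j), with p_ij the observed frequency of amino
# acid j at position i and a flat background q_j = 0.05 (assumption).
peptides = ["ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
            "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV"]
N, Q = len(peptides), 0.05

# Amino acids never observed at a position get no entry here (log 0 is
# undefined) -- exactly the problem pseudo-counts fix on the next slides.
W = [{aa: math.log(c / N / Q) for aa, c in Counter(col).items()}
     for col in zip(*peptides)]

print(round(W[1]["L"], 2))  # L dominates position 2 (7/10), so it scores high
```

With 7 of 10 sequences showing L at position 2, the entry is log(0.7/0.05) ≈ 2.64, a strong positive score for the anchor position.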

  14. Pseudo-count and sequence weighting
  ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV } similar sequences, weight 1/5 each
  GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV
  • Limited number of data
  • Poor or biased sampling of sequence space
  • I is not found at position P9. Does this mean that I is forbidden?
  • No! Use the Blosum substitution matrix to estimate a pseudo frequency of I at P9
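Sequence weighting can be sketched as simple clustering: similar peptides share one unit of weight, so the five ALAKAAAA- variants each get weight 1/5. The 80% identity threshold and the greedy single-linkage clustering are illustrative choices, not the slide's exact procedure.

```python
# Clustering-based sequence weighting (sketch): peptides sharing >= 80%
# identity (assumed threshold) form one cluster; each member gets weight
# 1 / cluster size, so a cluster contributes one sequence's worth of counts.
peptides = ["ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
            "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV"]

def identity(a, b):
    """Fraction of identical positions between two equal-length peptides."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

clusters = []  # greedy single-linkage clustering
for p in peptides:
    for c in clusters:
        if any(identity(p, q) >= 0.8 for q in c):
            c.append(p)
            break
    else:
        clusters.append([p])

weight = {p: 1 / len(c) for c in clusters for p in c}
print(weight["ALAKAAAAM"], weight["GILGFVFTM"])  # 0.2 for the ALAK cluster
```

The five ALAKAAAA- peptides (8/9 identical) collapse into one cluster of weight 1, while each dissimilar binder keeps weight 1.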

  15. Low count correction using Blosum matrices
  Blosum62 substitution frequencies:
       I     L     V
  L  0.12  0.38  0.10
  V  0.16  0.13  0.27
  • Every time, for instance, L or V is observed, I is also likely to occur
  • Estimate the low (pseudo) count correction using this approach
  • As more data are included, the pseudo count correction becomes less important
  • Neff: number of sequences; b: weight on the prior (pseudo count)
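A minimal sketch of the correction: even though I is never observed at P9, the substitution frequencies from the residues that are observed (L, V) give it a non-zero pseudo frequency g, which is then mixed with the observed frequencies using the weights alpha (from Neff) and beta (the prior weight b). The observed frequencies f and the reading of the table rows as conditional probabilities are illustrative assumptions.

```python
# Blosum pseudo-count correction (sketch). Table values from the slide,
# read as q(a | b): probability of amino acid a given observed b.
BLOSUM_COND = {
    "L": {"I": 0.12, "L": 0.38, "V": 0.10},
    "V": {"I": 0.16, "L": 0.13, "V": 0.27},
}
f = {"L": 0.8, "V": 0.2}  # assumed observed frequencies at P9 (I never seen)

# Pseudo frequency of I: g_I = sum_k f_k * q(I | k)
g_I = sum(f[k] * BLOSUM_COND[k]["I"] for k in f)

# Mix observed and pseudo frequencies: p = (alpha*f + beta*g) / (alpha + beta)
alpha, beta = 4, 50  # alpha ~ Neff - 1; beta = b, the weight on the prior
p_I = (alpha * 0.0 + beta * g_I) / (alpha + beta)
print(round(g_I, 3), round(p_I, 3))  # I gets a non-zero corrected frequency
```

As Neff (and hence alpha) grows, the observed frequencies dominate and the correction fades, matching the slide's last bullet.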

  16. Example from real life (cont.)
  • Raw sequence counting (no sequence weighting, no pseudo count): prediction accuracy 0.45
  • Sequence weighting (no pseudo count): prediction accuracy 0.50

  17. Example from real life (cont.)
  • Sequence weighting and pseudo count: prediction accuracy 0.60
  • Sequence weighting, pseudo count and anchor weighting: prediction accuracy 0.72

  18. Example from real life (cont.)
  • Sequence weighting, pseudo count and anchor weighting: prediction accuracy 0.72
  • Motif found on all data (485 peptides): prediction accuracy 0.79

  19. Training on small data sets. Using a biased weight matrix with differential weights on the anchor positions gives reliable performance for N ~ 20-50 (Lundegaard et al. 2004). [Figure: performance vs. training-set size for class I and class II]

  20. How to predict
  • The effect on the binding affinity of having a given amino acid at one position can be influenced by the amino acids at other positions in the peptide (sequence correlations)
  • Two adjacent amino acids may, for example, compete for the space in a pocket of the MHC molecule
  • Artificial neural networks (ANN) are ideally suited to take such correlations into account

  21. Neural networks
  • Neural networks can learn higher order correlations!
  • What does this mean? Consider the XOR function:
  0 0 => 0
  0 1 => 1
  1 0 => 1
  1 1 => 0
  • No linear function can learn this pattern

  22. Learning higher order correlation
  0 0 => 0; 1 0 => 1
  1 1 => 0; 0 1 => 1
  A single-layer network (inputs x1, x2 with weights w1, w2) has no solution to this problem; a network with a hidden layer (weights wij into hidden units h1, h2, and v1, v2 to the output) does. [Figure: the two network architectures]
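The hidden-layer solution can be hand-wired to make the point concrete: one hidden unit computes OR, the other AND, and the output fires when OR is on but AND is off. The specific weights below are one of many solutions, chosen for readability.

```python
# XOR with one hidden layer of threshold units: h1 = OR, h2 = AND,
# output = h1 AND NOT h2. No single threshold unit can compute this.
def step(x):
    return 1 if x > 0 else 0

def xor_net(x1, x2):
    h1 = step(x1 + x2 - 0.5)   # OR of the inputs
    h2 = step(x1 + x2 - 1.5)   # AND of the inputs
    return step(h1 - h2 - 0.5) # fires only when exactly one input is on

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, "=>", xor_net(a, b))  # reproduces the 0, 1, 1, 0 pattern
```

A trained network finds weights playing the same roles; the anchor-pocket competition on the previous slide is the biological analogue of such a higher order pattern.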

  23. Mutual information
  I(i,j) = Σaai Σaaj P(aai,aaj) * log[ P(aai,aaj) / (P(aai)*P(aaj)) ]
  Example for positions P1 and P6 of the peptides below:
  P(G1) = 2/9 = 0.22, ...; P(V6) = 4/9 = 0.44, ...
  P(G1,V6) = 2/9 = 0.22; P(G1)*P(V6) = 8/81 = 0.10
  log(0.22/0.10) > 0
  ALWGFFPVA ILKEPVHGV ILGFVFTLT LLFGYPVYV GLSPTVWLS YMNGTMSQV GILGFVFTL WLSLLVPFV FLPSDFFPS
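The calculation above can be carried out in full for positions P1 and P6 of the nine peptides. The natural logarithm is an assumption (the slide does not fix a base), so the result is in nats.

```python
import math
from collections import Counter

# Mutual information between peptide positions P1 and P6:
# I(i,j) = sum_a sum_b P(a,b) * log[ P(a,b) / (P(a)*P(b)) ]
peptides = ["ALWGFFPVA", "ILKEPVHGV", "ILGFVFTLT", "LLFGYPVYV", "GLSPTVWLS",
            "YMNGTMSQV", "GILGFVFTL", "WLSLLVPFV", "FLPSDFFPS"]
N = len(peptides)
i, j = 0, 5  # positions P1 and P6 (0-based)

p_i = Counter(p[i] for p in peptides)          # marginal at P1
p_j = Counter(p[j] for p in peptides)          # marginal at P6
p_ij = Counter((p[i], p[j]) for p in peptides) # joint distribution

mi = sum((c / N) * math.log((c / N) / ((p_i[a] / N) * (p_j[b] / N)))
         for (a, b), c in p_ij.items())

# The (G1, V6) term alone: P(G1,V6) = 2/9 exceeds P(G1)*P(V6) = 8/81,
# so this pair contributes positively, as on the slide.
term = (2 / 9) * math.log((2 / 9) / ((2 / 9) * (4 / 9)))
print(round(term, 3), round(mi, 3))
```

On slide 24 this quantity, computed per position pair, separates binding peptides from random ones.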

  24. Epitope predictions: mutual information. 313 binding peptides, 313 random peptides

  25. Choice of method
  • Neural networks are superior when trained on many data
  • Simple and extended motif methods when little or no data is available
  • HMM/weight matrices with position-specific differential weights otherwise
  • Increase the weight on anchor positions

  26. Evaluation of prediction accuracy
  • True positive proportion = TP/AP, false positive proportion = FP/AN (TP: true positives, FP: false positives, AP: actual positives, AN: actual negatives)
  • Pearson correlation
  • ROC curves
  [Figure: example ROC curves with Aroc = 0.8 and Aroc = 0.5]

  27. Construction of ROC curves
  True positive proportion = TP/AP; false positive proportion = FP/AN
  Peptides with assignment > 0.5 are actual positives (AP = 16); those with assignment < 0.5 are actual negatives (AN = 4)
  Number Sequence  Assignment Prediction
  1  ILYQVPFSV 0.853 12.137
  2  YLEPGPVTV 0.647 11.509
  3  GLMTAVYLV 0.798 10.021
  4  YLDLALMSV 0.842  9.632
  5  GLYSSTVPV 0.697  9.335
  6  HLYQGCQVV 0.539  9.265
  7  RMYGVLPWI 0.689  8.948
  8  FLPWHRLFL 0.564  8.926
  9  LLPSLFLLL 0.554  8.890
  10 ILSSLGLPV 0.638  8.491
  11 FLLTRILTI 0.803  8.343
  12 ILDEAYVMA 0.494  6.084
  13 VVMGTLVAL 0.589  5.935
  14 MALLRLPLV 0.634  4.761
  15 MLQDMAILT 0.527  4.450
  16 KILSVFFLA 0.851  3.578
  17 ILTVILGVL 0.451  3.358
  18 ALAKAAAAA 0.563  2.849
  19 LVSLLTFMI 0.301  1.193
  20 ALPYWNFAT 0.323  0.994
  Walking down the prediction-ranked list gives the ROC points: e.g. TP=3, FP=0 after the top three; TP=11, FP=1 after rank 12; TP=16, FP=4 after all 20
  [Figure: ROC curves with Aroc = 0.8 and Aroc = 0.5]
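The construction can be sketched end to end: rank by predicted score, step through the list accumulating TP and FP counts into (FP/AN, TP/AP) points, and take the trapezoidal area under them as Aroc. The assignments below are the slide's, already in order of decreasing prediction score.

```python
# Build the ROC curve for the 20 peptides: actual positives have
# assignment > 0.5 (AP = 16), the rest are negatives (AN = 4).
assignments = [0.853, 0.647, 0.798, 0.842, 0.697, 0.539, 0.689, 0.564, 0.554,
               0.638, 0.803, 0.494, 0.589, 0.634, 0.527, 0.851, 0.451, 0.563,
               0.301, 0.323]  # already sorted by decreasing prediction score

labels = [a > 0.5 for a in assignments]
AP, AN = sum(labels), len(labels) - sum(labels)

tp = fp = 0
points = [(0.0, 0.0)]
for pos in labels:  # walk down the ranked list, one peptide at a time
    tp, fp = tp + pos, fp + (not pos)
    points.append((fp / AN, tp / AP))

# Trapezoidal area under the ROC curve
auc = sum((x1 - x0) * (y0 + y1) / 2
          for (x0, y0), (x1, y1) in zip(points, points[1:]))
print(AP, AN, round(auc, 3))
```

For this particular list the area comes out at about 0.91; a random predictor would give Aroc = 0.5.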

  28. Epitope predictions: sequence motif and HMMs. Sequence motif: cc 0.76, Aroc 0.92. HMM: cc 0.80, Aroc 0.95

  29. Epitope predictions: neural networks. cc: 0.91, Aroc: 0.98

  30. Evaluation of prediction accuracy

  31. Location of class I epitopes GP1200 protein Structure (1GM9)

  32. Hepatitis C virus. Epitope predictions

  33. MHC Class II binding
  • TEPITOPE: virtual matrices (Hammer J. Current Opinion in Immunology 1995;7:263-269)
  • PROPRED: quantitative matrices (Singh H, Raghava GP. Bioinformatics 2001 Dec;17(12):1236-7); web interface: http://www.imtech.res.in/raghava/propred
  • Gibbs sampler (Nielsen et al. Bioinformatics 2004: Improved prediction of MHC class I and II epitopes using a novel Gibbs sampler approach)

  34. MHC class II prediction: complexity of the problem
  • Peptides of different length
  • Weak motif signal
  • Alignment crucial
  • Gibbs Monte Carlo sampler
  RFFGGDRGAPKRG YLDPLIRGLLARPAKLQV KPGQPPRLLIYDASNRATGIPA GSLFVYNITTNKYKAFLDKQ SALLSSDITASVNCAK PKYVHQNTLKLAT GFKGEQGPKGEP DVFKELKVHHANENI SRYWAIRTRSGGI TYSTNEIDLQLSQEDGQTIE

  35. Class II binding motif: alignment by the Gibbs sampler
  RFFGGDRGAPKRG YLDPLIRGLLARPAKLQV KPGQPPRLLIYDASNRATGIPA GSLFVYNITTNKYKAFLDKQ SALLSSDITASVNCAK PKYVHQNTLKLAT GFKGEQGPKGEP DVFKELKVHHANENI SRYWAIRTRSGGI TYSTNEIDLQLSQEDGQTI
  [Figure: sequence logo of the Gibbs sampler motif]
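The alignment step can be sketched as a toy Gibbs sampler over core offsets: hold all 9-mer cores fixed except one, score every possible offset of the remaining peptide against the motif built from the others, and sample a new offset in proportion to exp(score). The flat background of 0.05, the +1 pseudo-count and the sweep count are illustrative choices; a real implementation adds proper background frequencies, Blosum pseudo-counts and temperature scaling.

```python
import math
import random
from collections import Counter

# Toy Gibbs sampler aligning a 9-mer binding core in class II peptides
# of different length (the ten peptides from the slide).
peptides = ["RFFGGDRGAPKRG", "YLDPLIRGLLARPAKLQV", "KPGQPPRLLIYDASNRATGIPA",
            "GSLFVYNITTNKYKAFLDKQ", "SALLSSDITASVNCAK", "PKYVHQNTLKLAT",
            "GFKGEQGPKGEP", "DVFKELKVHHANENI", "SRYWAIRTRSGGI",
            "TYSTNEIDLQLSQEDGQTIE"]
CORE = 9
random.seed(1)
offsets = [random.randrange(len(p) - CORE + 1) for p in peptides]

def log_odds(cores):
    """Log-odds matrix from the current cores (+1 pseudo-count, flat bg)."""
    n = len(cores)
    default = math.log(1 / ((n + 20) * 0.05))  # score for an unseen residue
    W = [{aa: math.log((c + 1) / ((n + 20) * 0.05))
          for aa, c in Counter(col).items()}
         for col in zip(*cores)]
    return W, default

for _ in range(200):  # Gibbs sweeps
    for k, pep in enumerate(peptides):
        others = [p[o:o + CORE] for m, (p, o) in
                  enumerate(zip(peptides, offsets)) if m != k]
        W, default = log_odds(others)
        scores = [sum(W[i].get(pep[o + i], default) for i in range(CORE))
                  for o in range(len(pep) - CORE + 1)]
        weights = [math.exp(s) for s in scores]
        offsets[k] = random.choices(range(len(scores)), weights=weights)[0]

cores = [p[o:o + CORE] for p, o in zip(peptides, offsets)]
print(cores)  # one sampled 9-mer binding core per peptide
```

Sampling (rather than always taking the best offset) lets the chain escape poor early alignments; the sampled cores are then summarized as the motif logo on the slide.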

  36. MHC class II predictions for allele DRB1_0401. [Figure: prediction accuracy]

  37. Summary
  • The binding motif of class I MHC binding is well characterized by HMMs/weight matrices, even when only limited data are available
  • Neural networks can be trained to predict MHC binding with high accuracy, and can include higher order sequence correlations
  • The MHC class II peptide binding motif can be described using a Gibbs sampler algorithm
