1 / 37

T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models)



  1. T cell Epitope predictions using bioinformatics (Neural Networks and hidden Markov models). Morten Nielsen, CBS, BioCentrum, DTU

  2. Processing of intracellular proteins and MHC binding (figure: http://www.nki.nl/nkidep/h4/neefjes/neefjes.htm)

  3. What makes a peptide a potential and effective epitope?
  • Part of a pathogen protein
  • Successful processing
  • Proteasome cleavage
  • TAP binding
  • Binds to MHC molecule
  • Protein function
  • Early in replication
  • Sequence conservation in evolution

  4. From proteins to immunogens: 20% processed, 0.5% bind MHC, 50% give a CTL response (Lauemøller et al., 2000) => roughly 1/2000 peptides is immunogenic

  5. MHC Class I and II
  • Class I
  • Peptides 8-12 amino acids long
  • Intracellular pathogen presentation
  • Broad range of bioinformatical prediction tools
  • Class II
  • Peptides 13+ amino acids long
  • Intravesicular pathogen presentation
  • Few prediction tools

  6. MHC class I with peptide http://www.nki.nl/nkidep/h4/neefjes/neefjes.htm

  7. Prediction of HLA binding specificity
  • Simple motifs: allowed/non-allowed amino acids
  • Extended motifs: amino acid preferences (SYFPEITHI); anchor/preferred/other amino acids
  • Hidden Markov models: peptide statistics from a sequence alignment
  • Neural networks: can take sequence correlations into account

  8. SYFPEITHI database
  • Anchors: required for binding
  • Auxiliary anchors: help binding

  9. Pattern recognition. 10 peptides from the MHCpep database that bind the MHC complex A*0201:
  ALAKAAAAM ALAKAAAAM ALAKAAAAM ALAKAAAAV ALAKAAAAV GMNERPILV GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV
  Which of the following are most likely to bind?
  1. FLLTRILTI
  2. WLDQVPFSV
  3. TVILGVLLL
  The regular expression X1[LMIV]2X3…X8[MVL]9 says that 2 and 3 will bind and 1 will not, and it cannot tell whether 2 is more likely to bind than 3. The truth is that 1 and 2 bind, with 1 binding the strongest, while 3 does not bind. A probabilistic model can capture this!

  10. Probability estimation
  ALAKAAAAM ALAKAAAAM ALAKAAAAM ALAKAAAAV ALAKAAAAV GMNERPILV GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV

  11. Weight matrices
  • Estimate the amino acid frequencies pij from the alignment
  • A weight matrix is then given as Wij = log(pij/qj)
  • Here i is a position in the motif and j an amino acid; qj is the background frequency for amino acid j
  • In nature not all amino acids are found equally often: PA = 0.07, PW = 0.013. Finding 6% A is hence not significant, but 6% W is highly significant
  • W is an L x 20 matrix, where L is the motif length

  12. Scoring sequences to a weight matrix
       A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V
  1  0.6  0.4 -3.5 -2.4 -0.4 -1.9 -2.7  0.3 -1.1  1.0  0.3  0.0  1.4  1.2 -2.7  1.4 -1.2 -2.0  1.1  0.7
  2 -1.6 -6.6 -6.5 -5.4 -2.5 -4.0 -4.7 -3.7 -6.3  1.0  5.1 -3.7  3.1 -4.2 -4.3 -4.2 -0.2 -5.9 -3.8  0.4
  3  0.2 -1.3  0.1  1.5  0.0 -1.8 -3.3  0.4  0.5 -1.0  0.3 -2.5  1.2  1.0 -0.1 -0.3 -0.5  3.4  1.6  0.0
  4 -0.1 -0.1 -2.0  2.0 -1.6  0.5  0.8  2.0 -3.3  0.1 -1.7 -1.0 -2.2 -1.6  1.7 -0.6 -0.2  1.3 -6.8 -0.7
  5 -1.6 -0.1  0.1 -2.2 -1.2  0.4 -0.5  1.9  1.2 -2.2 -0.5 -1.3 -2.2  1.7  1.2 -2.5 -0.1  1.7  1.5  1.0
  6 -0.7 -1.4 -1.0 -2.3  1.1 -1.3 -1.4 -0.2 -1.0  1.8  0.8 -1.9  0.2  1.0 -0.4 -0.6  0.4 -0.5 -0.0  2.1
  7  1.1 -3.8 -0.2 -1.3  1.3 -0.3 -1.3 -1.4  2.1  0.6  0.7 -5.0  1.1  0.9  1.3 -0.5 -0.9  2.9 -0.4  0.5
  8 -2.2  1.0 -0.8 -2.9 -1.4  0.4  0.1 -0.4  0.2 -0.0  1.1 -0.5 -0.5  0.7 -0.3  0.8  0.8 -0.7  1.3 -1.1
  9 -0.2 -3.5 -6.1 -4.5  0.7 -0.8 -2.5 -4.0 -2.6  0.9  2.8 -3.0 -1.8 -1.4 -6.2 -1.9 -1.6 -4.9 -1.6  4.5
  Peptide scores: ILYQVPFSV = 15.0, ALPYWNFAT = -3.4, MTAQWWLDA = 0.8
  Which peptide is most likely to bind? Which peptide second?
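The scoring rule above can be sketched in a few lines: a peptide's score is the sum of the matrix entries for its residue at each position. The matrix is copied from the slide; `score` is an illustrative helper name, and the computed values match the slide's scores up to rounding.

```python
# Score peptides against a position-specific weight matrix:
# score(peptide) = sum over positions i of W[i][peptide[i]].
AA = "ARNDCQEGHILKMFPSTWYV"
ROWS = """\
 0.6  0.4 -3.5 -2.4 -0.4 -1.9 -2.7  0.3 -1.1  1.0  0.3  0.0  1.4  1.2 -2.7  1.4 -1.2 -2.0  1.1  0.7
-1.6 -6.6 -6.5 -5.4 -2.5 -4.0 -4.7 -3.7 -6.3  1.0  5.1 -3.7  3.1 -4.2 -4.3 -4.2 -0.2 -5.9 -3.8  0.4
 0.2 -1.3  0.1  1.5  0.0 -1.8 -3.3  0.4  0.5 -1.0  0.3 -2.5  1.2  1.0 -0.1 -0.3 -0.5  3.4  1.6  0.0
-0.1 -0.1 -2.0  2.0 -1.6  0.5  0.8  2.0 -3.3  0.1 -1.7 -1.0 -2.2 -1.6  1.7 -0.6 -0.2  1.3 -6.8 -0.7
-1.6 -0.1  0.1 -2.2 -1.2  0.4 -0.5  1.9  1.2 -2.2 -0.5 -1.3 -2.2  1.7  1.2 -2.5 -0.1  1.7  1.5  1.0
-0.7 -1.4 -1.0 -2.3  1.1 -1.3 -1.4 -0.2 -1.0  1.8  0.8 -1.9  0.2  1.0 -0.4 -0.6  0.4 -0.5 -0.0  2.1
 1.1 -3.8 -0.2 -1.3  1.3 -0.3 -1.3 -1.4  2.1  0.6  0.7 -5.0  1.1  0.9  1.3 -0.5 -0.9  2.9 -0.4  0.5
-2.2  1.0 -0.8 -2.9 -1.4  0.4  0.1 -0.4  0.2 -0.0  1.1 -0.5 -0.5  0.7 -0.3  0.8  0.8 -0.7  1.3 -1.1
-0.2 -3.5 -6.1 -4.5  0.7 -0.8 -2.5 -4.0 -2.6  0.9  2.8 -3.0 -1.8 -1.4 -6.2 -1.9 -1.6 -4.9 -1.6  4.5
"""
# One dict per motif position, mapping amino acid -> matrix entry
W = [dict(zip(AA, map(float, line.split()))) for line in ROWS.splitlines()]

def score(peptide):
    return sum(W[i][aa] for i, aa in enumerate(peptide))

for pep in ("ILYQVPFSV", "ALPYWNFAT", "MTAQWWLDA"):
    print(pep, round(score(pep), 1))  # ILYQVPFSV scores highest
```

Summing the per-position entries reproduces the slide's ranking: ILYQVPFSV binds best, MTAQWWLDA second.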

  13. Weight-matrix construction: example from real life
  • 10 peptides from the MHCpep database that bind the MHC complex
  • Estimate the sequence motif and weight matrix
  • Evaluate on 528 peptides (not included in the training)
  ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV
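The estimation step can be sketched as below, assuming equal sequence weights and no pseudo-counts (the refinements the next slides add); the flat background frequency of 0.05 is a simplification, since in practice qj is estimated from the amino acid composition of nature.

```python
import math
from collections import Counter

# Minimal weight-matrix estimation from the 10 binders above:
# W[i][j] = log(p_ij / q_j), with p_ij the observed frequency of amino
# acid j at position i and a flat background q_j = 0.05 (assumption).
peptides = ["ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
            "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV"]
N, Q = len(peptides), 0.05

# Amino acids never observed at a position get no entry here (log 0 is
# undefined) -- exactly the problem pseudo-counts fix on the next slides.
W = [{aa: math.log(c / N / Q) for aa, c in Counter(col).items()}
     for col in zip(*peptides)]

print(round(W[1]["L"], 2))  # L dominates position 2 (7/10), so it scores high
```

With 7 of 10 sequences showing L at position 2, the entry is log(0.7/0.05) ≈ 2.64, a strong positive score for the anchor position.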

  14. Pseudo-count and sequence weighting
  ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV } similar sequences, weight 1/5 each
  GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV
  • Limited number of data
  • Poor or biased sampling of sequence space
  • I is not found at position P9. Does this mean that I is forbidden?
  • No! Use the Blosum substitution matrix to estimate a pseudo frequency of I at P9
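Sequence weighting can be sketched as simple clustering: similar peptides share one unit of weight, so the five ALAKAAAA- variants each get weight 1/5. The 80% identity threshold and the greedy single-linkage clustering are illustrative choices, not the slide's exact procedure.

```python
# Clustering-based sequence weighting (sketch): peptides sharing >= 80%
# identity (assumed threshold) form one cluster; each member gets weight
# 1 / cluster size, so a cluster contributes one sequence's worth of counts.
peptides = ["ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
            "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV"]

def identity(a, b):
    """Fraction of identical positions between two equal-length peptides."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

clusters = []  # greedy single-linkage clustering
for p in peptides:
    for c in clusters:
        if any(identity(p, q) >= 0.8 for q in c):
            c.append(p)
            break
    else:
        clusters.append([p])

weight = {p: 1 / len(c) for c in clusters for p in c}
print(weight["ALAKAAAAM"], weight["GILGFVFTM"])  # 0.2 for the ALAK cluster
```

The five ALAKAAAA- peptides (8/9 identical) collapse into one cluster of weight 1, while each dissimilar binder keeps weight 1.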

  15. Low count correction using Blosum matrices
  Blosum62 substitution frequencies:
       I     L     V
  L  0.12  0.38  0.10
  V  0.16  0.13  0.27
  • Every time, for instance, L or V is observed, I is also likely to occur
  • Estimate the low (pseudo) count correction using this approach
  • As more data are included, the pseudo count correction becomes less important
  • Neff: number of sequences; b: weight on the prior (pseudo count)
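A minimal sketch of the correction: even though I is never observed at P9, the substitution frequencies from the residues that are observed (L, V) give it a non-zero pseudo frequency g, which is then mixed with the observed frequencies using the weights alpha (from Neff) and beta (the prior weight b). The observed frequencies f and the reading of the table rows as conditional probabilities are illustrative assumptions.

```python
# Blosum pseudo-count correction (sketch). Table values from the slide,
# read as q(a | b): probability of amino acid a given observed b.
BLOSUM_COND = {
    "L": {"I": 0.12, "L": 0.38, "V": 0.10},
    "V": {"I": 0.16, "L": 0.13, "V": 0.27},
}
f = {"L": 0.8, "V": 0.2}  # assumed observed frequencies at P9 (I never seen)

# Pseudo frequency of I: g_I = sum_k f_k * q(I | k)
g_I = sum(f[k] * BLOSUM_COND[k]["I"] for k in f)

# Mix observed and pseudo frequencies: p = (alpha*f + beta*g) / (alpha + beta)
alpha, beta = 4, 50  # alpha ~ Neff - 1; beta = b, the weight on the prior
p_I = (alpha * 0.0 + beta * g_I) / (alpha + beta)
print(round(g_I, 3), round(p_I, 3))  # I gets a non-zero corrected frequency
```

As Neff (and hence alpha) grows, the observed frequencies dominate and the correction fades, matching the slide's last bullet.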

  16. Example from real life (cont.)
  • Raw sequence counting (no sequence weighting, no pseudo count): prediction accuracy 0.45
  • Sequence weighting (no pseudo count): prediction accuracy 0.50

  17. Example from real life (cont.)
  • Sequence weighting and pseudo count: prediction accuracy 0.60
  • Sequence weighting, pseudo count and anchor weighting: prediction accuracy 0.72

  18. Example from real life (cont.)
  • Sequence weighting, pseudo count and anchor weighting: prediction accuracy 0.72
  • Motif found on all data (485 peptides): prediction accuracy 0.79

  19. Training on small data sets. Using a biased weight matrix with differential weights on the anchor positions gives reliable performance for N ~ 20-50 (Lundegaard et al. 2004). [Figure: performance vs. training-set size for class I and class II]

  20. How to predict
  • The effect on the binding affinity of having a given amino acid at one position can be influenced by the amino acids at other positions in the peptide (sequence correlations)
  • Two adjacent amino acids may, for example, compete for the space in a pocket of the MHC molecule
  • Artificial neural networks (ANN) are ideally suited to take such correlations into account

  21. Neural networks
  • Neural networks can learn higher order correlations!
  • What does this mean? Consider the XOR function:
  0 0 => 0
  0 1 => 1
  1 0 => 1
  1 1 => 0
  • No linear function can learn this pattern

  22. Learning higher order correlation
  0 0 => 0; 1 0 => 1
  1 1 => 0; 0 1 => 1
  A single-layer network (inputs x1, x2 with weights w1, w2) has no solution to this problem; a network with a hidden layer (weights wij into hidden units h1, h2, and v1, v2 to the output) does. [Figure: the two network architectures]
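The hidden-layer solution can be hand-wired to make the point concrete: one hidden unit computes OR, the other AND, and the output fires when OR is on but AND is off. The specific weights below are one of many solutions, chosen for readability.

```python
# XOR with one hidden layer of threshold units: h1 = OR, h2 = AND,
# output = h1 AND NOT h2. No single threshold unit can compute this.
def step(x):
    return 1 if x > 0 else 0

def xor_net(x1, x2):
    h1 = step(x1 + x2 - 0.5)   # OR of the inputs
    h2 = step(x1 + x2 - 1.5)   # AND of the inputs
    return step(h1 - h2 - 0.5) # fires only when exactly one input is on

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, "=>", xor_net(a, b))  # reproduces the 0, 1, 1, 0 pattern
```

A trained network finds weights playing the same roles; the anchor-pocket competition on the previous slide is the biological analogue of such a higher order pattern.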

  23. Mutual information
  I(i,j) = Σaai Σaaj P(aai,aaj) * log[ P(aai,aaj) / (P(aai)*P(aaj)) ]
  Example for positions P1 and P6 of the peptides below:
  P(G1) = 2/9 = 0.22, ...; P(V6) = 4/9 = 0.44, ...
  P(G1,V6) = 2/9 = 0.22; P(G1)*P(V6) = 8/81 = 0.10
  log(0.22/0.10) > 0
  ALWGFFPVA ILKEPVHGV ILGFVFTLT LLFGYPVYV GLSPTVWLS YMNGTMSQV GILGFVFTL WLSLLVPFV FLPSDFFPS
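The calculation above can be carried out in full for positions P1 and P6 of the nine peptides. The natural logarithm is an assumption (the slide does not fix a base), so the result is in nats.

```python
import math
from collections import Counter

# Mutual information between peptide positions P1 and P6:
# I(i,j) = sum_a sum_b P(a,b) * log[ P(a,b) / (P(a)*P(b)) ]
peptides = ["ALWGFFPVA", "ILKEPVHGV", "ILGFVFTLT", "LLFGYPVYV", "GLSPTVWLS",
            "YMNGTMSQV", "GILGFVFTL", "WLSLLVPFV", "FLPSDFFPS"]
N = len(peptides)
i, j = 0, 5  # positions P1 and P6 (0-based)

p_i = Counter(p[i] for p in peptides)          # marginal at P1
p_j = Counter(p[j] for p in peptides)          # marginal at P6
p_ij = Counter((p[i], p[j]) for p in peptides) # joint distribution

mi = sum((c / N) * math.log((c / N) / ((p_i[a] / N) * (p_j[b] / N)))
         for (a, b), c in p_ij.items())

# The (G1, V6) term alone: P(G1,V6) = 2/9 exceeds P(G1)*P(V6) = 8/81,
# so this pair contributes positively, as on the slide.
term = (2 / 9) * math.log((2 / 9) / ((2 / 9) * (4 / 9)))
print(round(term, 3), round(mi, 3))
```

On slide 24 this quantity, computed per position pair, separates binding peptides from random ones.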

  24. Epitope predictions: mutual information. 313 binding peptides, 313 random peptides

  25. Choice of method
  • Neural networks are superior when trained on many data
  • Simple and extended motif methods when little or no data is available
  • HMM/weight matrices with position-specific differential weights otherwise
  • Increase the weight on anchor positions

  26. Evaluation of prediction accuracy
  • True positive proportion = TP/AP, false positive proportion = FP/AN (TP: true positives, FP: false positives, AP: actual positives, AN: actual negatives)
  • Pearson correlation
  • ROC curves
  [Figure: example ROC curves with Aroc = 0.8 and Aroc = 0.5]

  27. Construction of ROC curves
  True positive proportion = TP/AP; false positive proportion = FP/AN
  Peptides with assignment > 0.5 are actual positives (AP = 16); those with assignment < 0.5 are actual negatives (AN = 4)
  Number Sequence  Assignment Prediction
  1  ILYQVPFSV 0.853 12.137
  2  YLEPGPVTV 0.647 11.509
  3  GLMTAVYLV 0.798 10.021
  4  YLDLALMSV 0.842  9.632
  5  GLYSSTVPV 0.697  9.335
  6  HLYQGCQVV 0.539  9.265
  7  RMYGVLPWI 0.689  8.948
  8  FLPWHRLFL 0.564  8.926
  9  LLPSLFLLL 0.554  8.890
  10 ILSSLGLPV 0.638  8.491
  11 FLLTRILTI 0.803  8.343
  12 ILDEAYVMA 0.494  6.084
  13 VVMGTLVAL 0.589  5.935
  14 MALLRLPLV 0.634  4.761
  15 MLQDMAILT 0.527  4.450
  16 KILSVFFLA 0.851  3.578
  17 ILTVILGVL 0.451  3.358
  18 ALAKAAAAA 0.563  2.849
  19 LVSLLTFMI 0.301  1.193
  20 ALPYWNFAT 0.323  0.994
  Walking down the prediction-ranked list gives the ROC points: e.g. TP=3, FP=0 after the top three; TP=11, FP=1 after rank 12; TP=16, FP=4 after all 20
  [Figure: ROC curves with Aroc = 0.8 and Aroc = 0.5]
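The construction can be sketched end to end: rank by predicted score, step through the list accumulating TP and FP counts into (FP/AN, TP/AP) points, and take the trapezoidal area under them as Aroc. The assignments below are the slide's, already in order of decreasing prediction score.

```python
# Build the ROC curve for the 20 peptides: actual positives have
# assignment > 0.5 (AP = 16), the rest are negatives (AN = 4).
assignments = [0.853, 0.647, 0.798, 0.842, 0.697, 0.539, 0.689, 0.564, 0.554,
               0.638, 0.803, 0.494, 0.589, 0.634, 0.527, 0.851, 0.451, 0.563,
               0.301, 0.323]  # already sorted by decreasing prediction score

labels = [a > 0.5 for a in assignments]
AP, AN = sum(labels), len(labels) - sum(labels)

tp = fp = 0
points = [(0.0, 0.0)]
for pos in labels:  # walk down the ranked list, one peptide at a time
    tp, fp = tp + pos, fp + (not pos)
    points.append((fp / AN, tp / AP))

# Trapezoidal area under the ROC curve
auc = sum((x1 - x0) * (y0 + y1) / 2
          for (x0, y0), (x1, y1) in zip(points, points[1:]))
print(AP, AN, round(auc, 3))
```

For this particular list the area comes out at about 0.91; a random predictor would give Aroc = 0.5.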

  28. Epitope predictions: sequence motif and HMMs. Sequence motif: cc 0.76, Aroc 0.92. HMM: cc 0.80, Aroc 0.95

  29. Epitope predictions: neural networks. cc: 0.91, Aroc: 0.98

  30. Evaluation of prediction accuracy

  31. Location of class I epitopes GP1200 protein Structure (1GM9)

  32. Hepatitis C virus. Epitope predictions

  33. MHC Class II binding
  • TEPITOPE: virtual matrices (Hammer J. Current Opinion in Immunology 1995;7:263-269)
  • PROPRED: quantitative matrices (Singh H, Raghava GP. Bioinformatics 2001 Dec;17(12):1236-7); web interface: http://www.imtech.res.in/raghava/propred
  • Gibbs sampler (Nielsen et al. Bioinformatics 2004: Improved prediction of MHC class I and II epitopes using a novel Gibbs sampler approach)

  34. MHC class II prediction: complexity of the problem
  • Peptides of different length
  • Weak motif signal
  • Alignment crucial
  • Gibbs Monte Carlo sampler
  RFFGGDRGAPKRG YLDPLIRGLLARPAKLQV KPGQPPRLLIYDASNRATGIPA GSLFVYNITTNKYKAFLDKQ SALLSSDITASVNCAK PKYVHQNTLKLAT GFKGEQGPKGEP DVFKELKVHHANENI SRYWAIRTRSGGI TYSTNEIDLQLSQEDGQTIE

  35. Class II binding motif: alignment by the Gibbs sampler
  RFFGGDRGAPKRG YLDPLIRGLLARPAKLQV KPGQPPRLLIYDASNRATGIPA GSLFVYNITTNKYKAFLDKQ SALLSSDITASVNCAK PKYVHQNTLKLAT GFKGEQGPKGEP DVFKELKVHHANENI SRYWAIRTRSGGI TYSTNEIDLQLSQEDGQTI
  [Figure: sequence logo of the Gibbs sampler motif]
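The alignment step can be sketched as a toy Gibbs sampler over core offsets: hold all 9-mer cores fixed except one, score every possible offset of the remaining peptide against the motif built from the others, and sample a new offset in proportion to exp(score). The flat background of 0.05, the +1 pseudo-count and the sweep count are illustrative choices; a real implementation adds proper background frequencies, Blosum pseudo-counts and temperature scaling.

```python
import math
import random
from collections import Counter

# Toy Gibbs sampler aligning a 9-mer binding core in class II peptides
# of different length (the ten peptides from the slide).
peptides = ["RFFGGDRGAPKRG", "YLDPLIRGLLARPAKLQV", "KPGQPPRLLIYDASNRATGIPA",
            "GSLFVYNITTNKYKAFLDKQ", "SALLSSDITASVNCAK", "PKYVHQNTLKLAT",
            "GFKGEQGPKGEP", "DVFKELKVHHANENI", "SRYWAIRTRSGGI",
            "TYSTNEIDLQLSQEDGQTIE"]
CORE = 9
random.seed(1)
offsets = [random.randrange(len(p) - CORE + 1) for p in peptides]

def log_odds(cores):
    """Log-odds matrix from the current cores (+1 pseudo-count, flat bg)."""
    n = len(cores)
    default = math.log(1 / ((n + 20) * 0.05))  # score for an unseen residue
    W = [{aa: math.log((c + 1) / ((n + 20) * 0.05))
          for aa, c in Counter(col).items()}
         for col in zip(*cores)]
    return W, default

for _ in range(200):  # Gibbs sweeps
    for k, pep in enumerate(peptides):
        others = [p[o:o + CORE] for m, (p, o) in
                  enumerate(zip(peptides, offsets)) if m != k]
        W, default = log_odds(others)
        scores = [sum(W[i].get(pep[o + i], default) for i in range(CORE))
                  for o in range(len(pep) - CORE + 1)]
        weights = [math.exp(s) for s in scores]
        offsets[k] = random.choices(range(len(scores)), weights=weights)[0]

cores = [p[o:o + CORE] for p, o in zip(peptides, offsets)]
print(cores)  # one sampled 9-mer binding core per peptide
```

Sampling (rather than always taking the best offset) lets the chain escape poor early alignments; the sampled cores are then summarized as the motif logo on the slide.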

  36. MHC class II predictions for allele DRB1_0401. [Figure: prediction accuracy]

  37. Summary
  • The binding motif of class I MHC binding is well characterized by HMMs/weight matrices, even when only limited data are available
  • Neural networks can be trained to predict MHC binding with high accuracy, and can include higher order sequence correlations
  • The MHC class II peptide binding motif can be described using a Gibbs sampler algorithm
