A neural network-based method for predicting protein stability changes upon single point mutations

Biocomputing Unit, Department of Biology, University of Bologna, Italy (www.biocomp.unibo.it) Emidio Capriotti, Piero Fariselli and Rita Casadio

Problem Definition • The State of the Art • Data Base • Neural Network Predictor • Results • Comparison with other Methods • I-Mutant

Problem Definition (I) If we replace Alanine 35 with a Leucine (mutation A35L), is the protein stability increased or decreased? [Figure: native structure (A) and mutant structure (L)]

Problem Definition (I) If we replace Alanine 35 with a Leucine, is the protein stability increased or decreased? [Figure: free energy diagram of the unfolded (U) and folded (F) states for the native and the mutant protein] ΔG_f = G_u − G_f; ΔΔG_f = ΔG_f^mut − ΔG_f^nat

Problem Definition (II) The sign of ΔΔG_f identifies the direction of the stability change. The sign is more informative than |ΔΔG|.
ΔΔG_f < 0 => the mutation increases the protein stability
ΔΔG_f > 0 => the mutation decreases the protein stability
Our neural networks are trained to predict the sign of the stability change.
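As a minimal sketch of the sign convention on this slide, the function below computes ΔΔG from two folding free energies and maps its sign to a stability label. The function name, the "neutral" label for a zero difference and the numeric values (taken to be in kcal/mol) are illustrative assumptions, not part of the original method.

```python
from typing import Tuple


def ddg_sign_class(dg_mut: float, dg_nat: float) -> Tuple[float, str]:
    """Return (ddG, label) with ddG = dG_mut - dG_nat, as defined on the slide."""
    ddg = dg_mut - dg_nat
    if ddg < 0:
        label = "increase"   # ddG < 0: the mutation increases stability
    elif ddg > 0:
        label = "decrease"   # ddG > 0: the mutation decreases stability
    else:
        label = "neutral"    # assumption: the slide does not cover ddG == 0
    return ddg, label


# Hypothetical folding free energies, for illustration only.
print(ddg_sign_class(dg_mut=-6.5, dg_nat=-5.0))  # (-1.5, 'increase')
```

The predictor itself only outputs the label, which is why the sign convention has to be fixed before any training data are prepared.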

The State of the Art: energy-based predictive methods
1) physical effective energy potentials (classical MM force fields): E = ½ k_s,ij (r_ij − r_0)² + ½ k_b,ij (θ_ij − θ_0)² + ...
2) statistical potentials: E(i,j) = −kT log(f(i,j))
3) empirical energies: ΔG = W_vdw ΔG_vdw + W_solv ΔG_solv + W_sc TΔS_sc + ...
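The statistical potential in item 2 can be illustrated with a short sketch. It assumes that f(i,j) is a relative frequency (for example, observed over expected contact frequency for the residue pair i, j), which the slide does not specify; the function name, the default temperature and the use of kcal/mol units are likewise assumptions.

```python
import math

# Boltzmann constant in kcal/(mol*K); the choice of units is an assumption.
K_B = 0.0019872


def statistical_potential(f_ij: float, temperature: float = 298.0) -> float:
    """Effective energy E(i,j) = -kT * log(f(i,j)) for a relative frequency f(i,j)."""
    if f_ij <= 0:
        raise ValueError("f(i,j) must be positive")
    return -K_B * temperature * math.log(f_ij)


# A pair seen twice as often as expected gets a favourable (negative) energy,
# a pair seen half as often gets an unfavourable (positive) one.
favourable = statistical_potential(2.0)
unfavourable = statistical_potential(0.5)
```

With this form, pairs that occur exactly as often as expected (f = 1) contribute zero energy.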

[Figure: prediction outcomes for the + and - signs; cells are labelled OK or Over/Under-predictions]

The Data Base
ProTherm is a collection of numerical data of thermodynamic parameters, including Gibbs free energy change, enthalpy change, heat capacity change, transition temperature etc., for wild type and mutant proteins.
Total number of entries: 15379
Number of unique proteins: 471
Total number of all proteins: 668
Number of proteins with mutants: 195
Number of single mutations: 7586
Number of double mutations: 1192
Number of multiple mutations: 563
Number of wild type: 6038
Gromiha et al. (2000). Nucleic Acids Res. 28, 283-285 http://www.rtc.riken.go.jp/jouhou/Protherm/

Training/testing Data set (I) The data set of proteins was extracted from ProTherm, with the following constraints: i) the ΔΔG value was experimentally measured and reported in the data base; ii) the protein structure is known with atomic resolution (and deposited in the PDB (Berman et al., 2000)); iii) the data refer to single mutations (no multiple mutations have been taken into account).

Training/testing Data set (II) After this filtering procedure, we ended up with 2 data sets:
S1615: 1615 different single mutations
S388: 388 mutations, containing only experiments performed at physiological conditions (T 20-40 °C, pH 6-8)
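The filtering described on the two data set slides can be sketched as follows. The `Entry` record and its field names are hypothetical and do not reflect the actual ProTherm format; treating the physiological set as a subset of the full set is also an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Entry:
    """Hypothetical record for one thermodynamic measurement."""
    mutation: str              # e.g. "A35L"
    ddg: Optional[float]       # None if no experimental value is reported
    has_structure: bool        # True if a PDB structure is available
    temperature_c: float
    ph: float
    n_mutations: int = 1


def passes_constraints(e: Entry) -> bool:
    # i) ddG reported, ii) known structure, iii) single mutation only
    return e.ddg is not None and e.has_structure and e.n_mutations == 1


def is_physiological(e: Entry) -> bool:
    # T 20-40 degrees C and pH 6-8, as stated on the slide
    return 20.0 <= e.temperature_c <= 40.0 and 6.0 <= e.ph <= 8.0


def build_sets(entries: List[Entry]) -> Tuple[List[Entry], List[Entry]]:
    full = [e for e in entries if passes_constraints(e)]
    physiological = [e for e in full if is_physiological(e)]
    return full, physiological
```

Applied to the real data base, the first list would correspond to S1615 and the second to S388.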

Neural Network Predictor (I)
• N1: a 20-element vector describing the amino acid mutation, plus two inputs for pH and T (22 inputs in total)
• N2: adds to the N1 input one more neuron for the relative solvent accessibility of the mutated residue (23 inputs in total)
• N3: adds to N2 20 more input neurons (43 in total) encoding the three-dimensional residue environment

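Neural Network Predictor (II) [Figure: input encoding for the example mutation E->A. Network N1: mutation vector over the 20 residue types (A C D E F G H I K L M N P Q R S T V W Y), with -1 at the native residue and 1 at the new residue, plus pH and T. N2: adds the relative solvent accessibility. N3: adds the environment vector over the same 20 residue types, counting the residues found within a given radius of the mutated residue.]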
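The input layouts of N1, N2 and N3 can be sketched as below. The -1/+1 coding of the native and new residue and the use of raw neighbour counts are read off the figure and should be treated as assumptions; the function names are hypothetical, and no input scaling is applied because the slides do not describe any.

```python
from typing import List, Optional, Sequence

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # residue order used in the figure


def encode_mutation(native: str, new: str) -> List[float]:
    """20-element vector: -1 at the native residue, +1 at the new one (assumed)."""
    vec = [0.0] * 20
    vec[AMINO_ACIDS.index(native)] = -1.0
    vec[AMINO_ACIDS.index(new)] = 1.0
    return vec


def encode_environment(neighbours: Sequence[str]) -> List[float]:
    """20-element vector counting residue types within the chosen radius (assumed)."""
    vec = [0.0] * 20
    for residue in neighbours:
        vec[AMINO_ACIDS.index(residue)] += 1.0
    return vec


def build_input(native: str, new: str, ph: float, temperature: float,
                rsa: Optional[float] = None,
                neighbours: Optional[Sequence[str]] = None) -> List[float]:
    x = encode_mutation(native, new) + [ph, temperature]   # N1: 22 inputs
    if neighbours is not None and rsa is None:
        raise ValueError("N3 extends N2, so rsa is required with neighbours")
    if rsa is not None:
        x.append(rsa)                                      # N2: 23 inputs
    if neighbours is not None:
        x += encode_environment(neighbours)                # N3: 43 inputs
    return x


n3_input = build_input("E", "A", ph=7.0, temperature=25.0,
                       rsa=0.3, neighbours="AGLLLIAEG")
print(len(n3_input))  # 43
```

Building the three layouts with one function makes the nesting explicit: each network sees everything the previous one sees plus one extra block of inputs.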

Cross-validation performance of the different neural networks on S1615
Method  Q2    P(+)  Q(+)  P(-)  Q(-)  C
N1      0.74  0.59  0.23  0.76  0.94  0.24
N2      0.75  0.57  0.45  0.80  0.87  0.34
N3      0.81  0.71  0.52  0.83  0.91  0.49
+ and -: the index is evaluated for positive and negative signs of the protein stability change, respectively.

Cross-validation performance of N3 as a function of different protein environments (different radius) centred on the mutated residue
Method   Radius  Q2    P(+)  Q(+)  P(-)  Q(-)  C
N3-4.5   4.5     0.79  0.63  0.55  0.83  0.88  0.45
N3-6.0   6.0     0.79  0.63  0.57  0.84  0.87  0.46
N3-9.0   9.0     0.81  0.71  0.52  0.83  0.91  0.49
N3-12.0  12.0    0.79  0.63  0.59  0.84  0.87  0.47

Q2 accuracy of neural network (N3-9.0) as a function of the reliability index (Rel)

Q2 accuracy of neural network (N3-9.0) as a function of the absolute value of protein stability changes upon mutation (|Stability Change|, kcal/mol)

Comparison of neural network with other methods on S388
Method        Q2    P(+)  Q(+)  P(-)  Q(-)  C
FOLDX (1)     0.75  0.26  0.56  0.93  0.78  0.25
DFIRE (2)     0.68  0.18  0.44  0.90  0.71  0.11
PoPMuSiC (3)  0.85  0.33  0.25  0.90  0.93  0.20
N3-9.0        0.87  0.44  0.21  0.90  0.96  0.25
(1) http://fold-x.embl-heidelberg.de
(2) http://phyyz4.med.buffalo.edu/hzhou/dmutation.html
(3) http://babylone.ulb.ac.be/popmusic/

Accuracy of joint-methods on subsets of S388
Method                 Agreement  Q2    P(+)  Q(+)  P(-)  Q(-)  C
N3-9.0 + FOLDX (1)     72%        0.93  0.88  0.28  0.93  0.99  0.47
N3-9.0 + DFIRE (2)     69%        0.90  0.36  0.16  0.92  0.97  0.19
N3-9.0 + PoPMuSiC (3)  86%        0.91  0.67  0.07  0.92  0.99  0.19
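A joint-method evaluation of this kind can be sketched as below, assuming that "Agreement" is the fraction of mutations on which the two predictors give the same sign and that Q2 is then computed only on that subset. The function name and the toy labels are hypothetical.

```python
from typing import Sequence, Tuple


def joint_accuracy(pred_a: Sequence[str], pred_b: Sequence[str],
                   observed: Sequence[str]) -> Tuple[float, float]:
    """Return (agreement, Q2 on the subset where both predictors agree)."""
    if not (len(pred_a) == len(pred_b) == len(observed)) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    # Keep only the cases where both methods predict the same sign.
    agreed = [(a, o) for a, b, o in zip(pred_a, pred_b, observed) if a == b]
    agreement = len(agreed) / len(observed)
    q2 = sum(a == o for a, o in agreed) / len(agreed) if agreed else float("nan")
    return agreement, q2


# Toy example with made-up labels.
nn_pred = ["+", "-", "-", "+", "-"]
other_pred = ["+", "-", "+", "+", "-"]
observed = ["+", "-", "-", "-", "-"]
print(joint_accuracy(nn_pred, other_pred, observed))  # (0.8, 0.75)
```

Restricting the evaluation to agreeing predictions trades coverage for accuracy, which is the pattern visible in the Agreement and Q2 columns of the table.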

I-Mutant

I-Mutant Web Server http://gpcr.biocomp.unibo.it/cgi/predictors/I-Mutant/I-Mutant.cgi/

That's all! Thank you for your attention. Stability test. Emidio Capriotti, Piero Fariselli, Rita Casadio

Measures of Accuracy The efficiency of the predictor is scored using the following statistical indexes: overall accuracy (Q2), coverage (Q), probability of correct prediction (P) and correlation coefficient (C), where N is the total number of predictions, p the number of correct predictions, and u and o the numbers of under- and over-predictions.
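The formulas for these indexes did not survive on the slide, so the sketch below uses the standard two-class definitions as an assumption: Q2 = p/N, Q(s) = p(s)/(p(s)+u(s)), P(s) = p(s)/(p(s)+o(s)) and a Matthews-style correlation C(s). The count n(s) of correctly rejected cases and the function name are introduced here for illustration only.

```python
import math
from typing import Dict, Sequence


def score(observed: Sequence[str], predicted: Sequence[str],
          sign: str = "+") -> Dict[str, float]:
    """Accuracy indexes for one sign; standard definitions assumed, not taken from the slide."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = list(zip(observed, predicted))
    p = sum(o == sign and q == sign for o, q in pairs)    # correct predictions of `sign`
    u = sum(o == sign and q != sign for o, q in pairs)    # under-predictions
    ov = sum(o != sign and q == sign for o, q in pairs)   # over-predictions
    n = sum(o != sign and q != sign for o, q in pairs)    # correct rejections (assumed term)
    denom = math.sqrt((p + u) * (p + ov) * (n + u) * (n + ov))
    nan = float("nan")
    return {
        "Q2": (p + n) / len(pairs),                 # overall accuracy
        "Q": p / (p + u) if p + u else nan,         # coverage
        "P": p / (p + ov) if p + ov else nan,       # probability of correct prediction
        "C": (p * n - u * ov) / denom if denom else nan,
    }


obs = ["+", "+", "+", "-", "-", "-", "-", "-"]
pred = ["+", "-", "-", "+", "-", "-", "-", "-"]
result = score(obs, pred, sign="+")
print(result["Q2"], result["P"])  # 0.625 0.5
```

Calling `score` once with `sign="+"` and once with `sign="-"` gives the P(+), Q(+), P(-) and Q(-) columns used in the result tables.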

Q2 accuracy of the neural network (N3-9.0) as a function of the relative accessibility value of the mutated residue

Q2 accuracy as a function of the residue mutation type
native \ new  Charged     Polar        Apolar
Charged       0.62 (4%)   0.77 (8%)    0.72 (9%)
Polar         0.69 (6%)   0.82 (10%)   0.77 (17%)
Apolar        0.75 (3%)   0.92 (12%)   0.87 (31%)
