Aim of this presentation: About the project The Program

BioInformatics - Protein Structure PredictionRajalingam AravinthanGad AbrahamSummer Studentship(2003/2004) Under the supervision of Professor Heiko Schroder, Dr Margaret Hamilton and Dr Ron Van Schyndelwith help from Abdullah Amin and Saravanan Dayalan. • Aim of this presentation: • About the project • The Program • To analyze what we have done so far and where we should concentrate in our project.

The Project Specifications This research project attempts to predict protein structures in linear time. Our method is based on statistical analysis of known structures - We compare a given sequence of amino acids of unknown structure with sequences from a database of known structures. (any examples ?) In particular we are looking for perfect matches of short sub-sequences (maximal length 5) from a given protein sequence. With our approach we cannot go beyond the length of 5 as then the expected number of matches in the existing database will be below 1. We would like to complement and compare our method by looking at "less then perfect matches" of sub-sequences of considerably longer length than 5.

In order to do this it is necessary to implement the Smith-Waterman algorithm (a well known algorithm used in the area of homology modeling) This algorithm is then to be run against the complete database of known protein structures. From this project we expect the answer to the question of how perfect matches of very short sub-sequences compare with "less than perfect matches" of significantly longer sequences. Eg : The phi and psi angles around Glysine in perfect matches with ‘VGI’. And less than perfect matches with ‘AVGID’. In the latter case if we have say ‘ALGID’ That will be considered as well for analysis.

The PDB file The pdb file for this project was created by Mr. Saravanan as part of his PhD research. The pdb contains all the proteins in a systematic format. It was very useful to our project as it is easy for us to extract the angles and data from this text data base. ‘pdb’ file <Serial Number> <Protein> <Amino Acid> <Phi> <Psi >1 1BA1 GLY 0.000000 -61.2000002 1BA1 PRO -63.900000 158.0000003 1BA1 ALA -86.100000 153.4000004 1BA1 VAL -118.500000 163.900000 : : : : : : : : : : : : : : : : : 377 1BA1 LEU -78.600000 -9.100000378 1BA1 SER -82.900000 0.000000379 1BA2 LYS 0.000000 144.100000380 1BA2 ASP -71.700000 163.600000: : : : : : : : : : : : : : : : : 10757863 1R1A ASP 21.000000 0.000000 1BA1 1B50

The Program • To take any protein from the pdb file “phi-psi” and create a protein file that will be analyzed by the main program - “Align.java” Ex: Say we are analyzing ‘1BA1’ • The protein ‘1BA1’ is a chain of 378 Amino Acids. The program ‘Align.java’ can take any number of odd length sub-sequences from the ‘1BA1’ and do alignment with the whole pdb file. • Say we are considering a length of 5. So there could be 374 windows of length 5 sub-sequences of amino acids.

Sliding windows of sub-sequences in a chain of amino acids Window1-first 5 a.a out of the 378 amino acids S K G P A V G I D L G T T Y S G V Q H G K V……. Window 3 • For each sub-sequence (Window above) we will do the alignment for all the proteins in the pdb file. There are around 17,388 proteins in the database. Alignment will be done separately for each one of those 17,388 proteins.(Note 1) • Alignment creates an alignment matrix. We will be using a score matrix to give values in the matrix. Ex: ‘blosum62’ or ‘pam250’ • Our program is a modified model of the Smith-Waterman algorithm.

1. When there is an amino acid code (code) match we give the score from the score matrix being used. 2. When there is a mismatch we don’t give any gap costs instead apply the score from the matrix used for those codes. 3. Each cell’s score is the sum of its score from matrix and the upper diagonal neighbor's score instead of the max. of three adjacent cells . Negative values are kept. An example alignment for sequence AVGID with protein ‘1BA1’: ……. …….. ……..

Where ‘10d-test’ is the file with sub-sequence ‘AVGID’ ‘1BA1’ is the pdb file (it could be one or many proteins) ‘blosum62’ is the score matrix used 5 is the length of sub-sequence 0.2 is the penalty as a percentage Say ‘AVGID’ has a maximum score of 24 (that is when matched with another ‘AVGID’ as the score matrix has higher scores diagonally) then the penalty for this window of sequence is 20% of 24 or approx 5. Please see ‘BackTracking’ for the importance of this penalty value

Blosum 62 : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : : :

Back Tracking:1. In this modified version of SW-Algorithm we only backtrack from the bottom row. That is we want to preserve the matched pdb sub-sequence to be the same length as the current window.2. We ignore gaps altogether. So there is only diagonal trace back.3. Trace back is done when the score is >= target score( Max.Score – penalty = 24- 20% of 24 = 24 –5 = 19) A sample trace back is shown in green in the alignment Matrix.4. We get the pdb sub-sequences and find the phi and psi angle for the middle amino acid. Keep count of the phi and psi anglesseparately in 1/10 degree intervals or 10 degree interval as appropriate.

Draw the Final angle counts using jGraphExample in 10 degree intervals for AVGID with the whole pdb The points were taken as the middle value of the range except for zero, –180.00 and +180.00 which were unchanged.Ex:0-10 = 5, 40-50 = bin 45 180.00 = bin 180.0 0.0 = bin 0.0

9. The program out puts “phi” and “psi” files, which will contain the graph points for the actual and predicted angles. When we say predicted angles it is the angle for the high frequency of occurrences for a given sub-sequence.We can use jGraph to plot these points. • This simple way of finding the predicted angle is not accurate.How can we say with (some)certainty that this will be the angle for all the sub-sequence? We need more sophisticated way of finding the peaks.

What we have done so far Running tests on sequence lengths 3,5,7,9 for the penalty of 0.1 to 0.5 and produced graphs in 1/10 degree intervals. Sequence 3: VGI ; penalty = 0.0 ; 1/10 Degree interval phi psi

Why Penalty? The ‘pdb’ has about 11 million amino acids. This is from 20 known Amino Acids. When we are doing matching for length 3 sub-sequences the probability of finding one is then 1/( 20^3) = 1/8000 So we could expect to find 10757863/8000 occurrences of that particular sub-sequence. Which is around 1344. It is a good number for statistics.But, in this project, we are considering lengths of 5 or more!For an exact match the probability is 1/(20^5) = 1/3.2millionSo we could only expect 3 occurrences of them throughout the entire database. This will not be enough to do statistical analyses. Penalty gives the freedom of not matching exact sequences but, close enough sequences. So the results could be considered for the analysis in determining the phi, psi angles

Penalty Comparisons Sequence 5 : AVGID ; Penalty = 0.1 psi phi Sequence 5 : AVGID ; Penalty = 0.5 psi phi

For the sequence AVGID the max score is 24. When the penalty is 0.1 The the target score is 21. So from pdb any sub-sequence having score more than 21 is considered. But the narrow margin of penalty score 3 allows only certain elements to be swapped depending on the chosen matrix. It depends on two factors. • Element been swapped and the score when matched against itself. Ex: I-I = 4 • The element comes in and the score when matched against the element been swapped. I-V= 3 • For the sub-sequence to be considered max score -(1) +(2) >= 21 must be true. • AVGID Searched Sequence : . : . : AIGVD Protein sequence • Both gives Alignment Score = 22 AVGID Searched Sequence : : : . : AVGLD Protein sequence

It shows that when the penalty is low(0.1) the matrix allows only certain elements to be swapped. Here according to blosum62 only the elements marked with pink squares(positive values) can be swapped. Like V-I, I-V, I-L And More precisely when penalty is 0.1 if ‘A’ has to swapped out (A-A =4 according to blosum62) then any element having more than a score of >=+1 with ‘A’ can take that place but, if for ‘R’(R-R=5) then it has to be >=+2 with ‘R’ to keep the target score >=21 So it shows increasing the penalty not only increases the chances of elements being swapped but, also what elements could be swapped as well.

Reasons for penalty as a percentage Assume we have two sub-sequences in protein. 1. ‘AVGID’ max score =24 according to blosum62. 2. ‘PCHWT’. The max.score is 40 according to blosum62. If the penalty is a number say 3 then the target score for seq1 is 21 and 37 for seq2. A lower max.score means the sequence contains elements which are of lower scores. A penalty of 3 could be easily substituted with another element. But for sequence2 substituting even the lower scored element(T-T=5)costs 5 and so we have to find an element having at least a score of 2 with ‘T’ to come in ! That means no substitution possible according to blosum62.

That is why we take a percentage of the max.score .If 0.1 penalty for seq2 then that is 10% of 40=4.0 = 4 So any swap that loses a score of 4 or less can be considered instead of 3 as before. So we have to find an element having at least a score of 1 with ‘T’ to come in . Here ‘T’ could be substituted with ‘S’ (T-S =1) and yet we can keep the score at 36 which is equal or above the target 36. So we give a margin for sub-sequences to be considered and will get at least some data from pdb.

What happens when the penalty increases?Sequence 5 :AVGID ; Penalty = 3.0; 10 degree interval The total max score possible for AVGID = 24. Giving penalty 3.0 makes the target score = -48(24*(1-3.0))As we can see the are the phi and psi distribution for ‘G’ throughout the whole pdb data base Phi Artifacts Psi

Length of Sequence Comparisons Comparison : For a given penalty how graphs differ when we take a length of 3, 5, 7, 9.Important point to note is that we took a short sequence of 9 amino acids and took all these sub-sequences keeping the 5th element as the middle one. 1 1BA1 GLY 0.000000 -61.2000002 1BA1 PRO -63.900000 158.0000003 1BA1 ALA -86.100000 153.4000004 1BA1 VAL -118.500000 163.9000005 1BA1 GLY -113.200000 129.8000006 1BA1 ILE -125.600000 126.9000007 1BA1 ASP -99.600000 104.4000008 1BA1 LEU -89.200000 86.9000009 1BA1 GLY -78.700000 161.000000

Sequence 3 : VGI ; Penalty = 0.0; 1/10th degree interval psi phi Sequence 5 : AVGID ; Penalty = 0.4; 1/10 degree interval phi psi

Sequence 7 : PAVGIDL ; Penalty = 0.4; 1/10th degree interval phi psi Sequence 9 :GPAVGIDLG ; Penalty = 0.4; 1/10th degree interval phi psi

1/10th Degree & 10 degree interval for Sequence 3 Phi Psi 1/10 degree 10 degree

Pam250 & Blosum62 Comparisons Sequence 5 :AVGID ; Penalty = 0.3; 10 degree interval Phi Angles Blosum62 Pam250 Psi Angles

Sequence5, Blosum62 PhiAngles for ‘G’- Glysine in sequence 5 with penalties 10% – 100% The phi Distribution for ‘G’

The psi Distribution for ‘G’ PsiAngles for ‘G’- Glysinein sequence 5 with penalties 0.1 – 1.0

Predicting ‘1BA1’ Comparing the Phi and Psi angles for the first 25 amino acid sequences from ‘1BA1’ with the test outputs. The peak values were selected manually from the outputs.

‘1BA1’ Phi for first 5 windows ‘1BA1’ Psi for first 5 windows

END

Aim of this presentation: About the project The Program