Exhaustive RT-RICO Algorithm For Mining Association Rules In Protein Secondary Structure

Leong Lee, Department of Computer Science, Austin Peay State University, Clarksville, Tennessee, USA
Jennifer L. Leopold, Department of Computer Science, and Ronald L. Frank, Department of Biological Sciences, Missouri University of Science and Technology, Rolla, Missouri, USA

• Introduction
• Central Dogma of Biology
• Protein Structure Prediction: A Brief Introduction
• Protein Secondary Structure Prediction Problem
• Related Work
• BLAST-ERT-RICO
• Exhaustive RT-RICO Rule Generation Algorithm
• References

Central Dogma of Biology
• DNA --> transcription --> RNA --> translation --> protein
• This flow is referred to as the central dogma of molecular biology (Jones and Pevzner, 2004)
• DNA sequence determines protein sequence
• Protein sequence determines protein structure
• Protein structure determines protein function
• Regulatory mechanisms deliver the right amount of the right function to the right place at the right time (Lesk, 2008)

Molecular Biology: A Brief Introduction
• Cell information: the instruction book of life
• DNA/RNA: strings written in a four-letter nucleotide alphabet (A, C, G, T/U)
• Protein: strings written in a 20-letter amino acid alphabet
• Example: the transcription of DNA into RNA, and the translation of RNA into a protein (Jones and Pevzner, 2004)
DNA:     TAC CGC GGC TAT TAC TGC CAG GAA GGA ACT
RNA:     AUG GCG CCG AUA AUG ACG GUC CUU CCU UGA
Protein: Met Ala Pro Ile Met Thr Val Leu Pro Stop
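
The example above can be reproduced with a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, the codon table is deliberately partial (it holds only the codons that occur in the slide's example, not the full genetic code), and the DNA string is complemented base by base in the order written, as the slide does.

```python
# Complement map from the DNA bases on the slide to mRNA bases.
DNA_TO_RNA = {"A": "U", "T": "A", "C": "G", "G": "C"}

# Partial codon table (assumption: only the codons used in the slide's example).
CODON_TABLE = {
    "AUG": "Met", "GCG": "Ala", "CCG": "Pro", "AUA": "Ile",
    "ACG": "Thr", "GUC": "Val", "CUU": "Leu", "CCU": "Pro",
    "UGA": "Stop",
}

def transcribe(dna_codons):
    """Complement each DNA codon into the corresponding RNA codon."""
    return ["".join(DNA_TO_RNA[base] for base in codon) for codon in dna_codons]

def translate(rna_codons):
    """Look up each RNA codon in the (partial) codon table."""
    return [CODON_TABLE[codon] for codon in rna_codons]

dna = "TAC CGC GGC TAT TAC TGC CAG GAA GGA ACT".split()
rna = transcribe(dna)
protein = translate(rna)
print(" ".join(rna))
print(" ".join(protein))
```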

Molecular Biology: A Brief Introduction Image courtesy of Griffiths et al. Genetic code, from the perspective of mRNA. AUG also acts as a “start” codon

Protein Structure Prediction: A Brief Introduction >1PSN:A|PDBID|CHAIN|SEQUENCE VDEQPLENYLDMEYFGTIGIGTPAQDFTVVFDTGSSNLWVPSVYCSSLACTNHNRFNPEDSSTYQSTSETVSITYGTGSMTGILGYDTVQVGGISDTNQIFGLSETEPGSFLYYAPFDGILGLAYPSISSSGATPVFDNIWNQGLVSQDLFSVYLSADDQSGSVVIFGGIDSSYYTGSLNWVPVTVEGYWQITVDSITMNGEAIACAEGCQAIVDTGTSLLTGPTSPIANIQSDIGASENSDGDMVVSCSAISSLPDIVFTINGVQYPVPPSAYILQSEGSCISGFQGMNLPTESGELWILGDVFIRQYFTVFDRANNQVGLAPVA Image courtesy of RCSB Protein Data Bank (http://www.pdb.org) 3D structure of pepsin (PDB ID: 1PSN)

Protein Structure Prediction: A Brief Introduction
• Experimental methods can provide the precise arrangement of every atom of a protein.
• The two main methods are X-ray crystallography and NMR spectroscopy.
• X-ray crystallography requires the protein or complex to form a reasonably well-ordered crystal, a feature that is not universally shared by proteins.
• NMR spectroscopy needs proteins to be soluble, and there is a limit to the size of protein that can be studied.
• Both are time-consuming techniques; we cannot hope to use them to solve the structures of all proteins in the universe in the near future.
• Problem: how to relate the amino acid sequence of a protein to its 3D structure.

Background – Protein Primary Structure Image courtesy of National Human Genome Research Institute (NHGRI) • Protein primary structures are chains of amino acids • 20 amino acids {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y} • 1san:A • MTYTRYQTLELEKEFHFNRYLTRRRRIEIAHALSLTERQIKIWFQNRRMKWKKENKTKGEPG

Background - Protein Secondary Structure • Secondary structure is normally defined by hydrogen bonding patterns • Amino acids vary in ability to form various secondary structure elements • 8 types of secondary structure defined: {G, H, I, T, E, B, S, -} Image courtesy of Carl Fürstenberg Alpha helices are shown in color, and random coil in white, there are no beta sheets shown >1SAN:A:sequence MTYTRYQTLELEKEFHFNRYLTRRRRIEIAHALSLTERQIKIWFQNRRMKWKKENKTKGEPG >1SAN:A:secstr ----HHHHHHHHHHHHH-SS--HHHHHHHHHHHT--SHHHHHHHHHHHHTTTTTS-TT-S--

Protein Secondary Structure Prediction - Motivation Important research problem in bioinformatics / biochemistry Of high importance for design of drugs and novel enzymes Determination of protein structures by experimental methods is lagging far behind discovery of protein sequences Predicting protein tertiary structure is an even more challenging problem, but more tractable if using simpler secondary structure definitions; focus for current research (tertiary structure of a protein is its three-dimensional structure, as defined by the atomic coordinates)

Protein Secondary Structure Prediction Problem Description
• Input (Baldi et al., 2000)
  • Amino acid sequence, A = a_1, a_2, … a_N
  • Data for comparison, D = d_1, d_2, … d_N
  • a_i is an element of a set of 20 amino acids, {A,R,N…V}
  • d_i is an element of a set of secondary structures, {H,E,C}, which represents helix H, sheet E, and coil C
• Output
  • Prediction result: X = x_1, x_2, … x_N
  • x_i is an element of a set of secondary structures, {H,E,C}
• 3-Class Prediction (Zhang and Zhang, 2003)
  • Multi-class prediction problem with 3 classes {H,E,C} in which one obtains a 3 x 3 confusion matrix Z = (z_ij)

Protein Secondary Structure Prediction Problem Description
• 3 x 3 matrix (3 classes); rows are reality, columns are prediction
              Prediction
              H     E     C
Reality  H    Z11   Z12   Z13
         E    Z21   Z22   Z23
         C    Z31   Z32   Z33
• Z_ij: input predicted to be in class j while in reality belonging to class i
• Q total = 100 ∑_i Z_ii / N (percentage), where N is the number of residues
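
As a rough illustration of the matrix and the Q total formula, the sketch below builds Z from two equal-length strings over {H,E,C}. The function names are hypothetical, and the two toy strings are the ones used in the Q3 example later in the deck.

```python
CLASSES = "HEC"

def confusion_matrix(actual, predicted):
    """Z[i][j] = number of residues in class i (reality) predicted as class j."""
    index = {c: k for k, c in enumerate(CLASSES)}
    z = [[0] * len(CLASSES) for _ in CLASSES]
    for a, p in zip(actual, predicted):
        z[index[a]][index[p]] += 1
    return z

def q_total(z):
    """Q total = 100 * (sum of diagonal entries) / N."""
    n = sum(sum(row) for row in z)
    return 100.0 * sum(z[i][i] for i in range(len(z))) / n

actual = "HHHEEECCCC"      # data for comparison, D
predicted = "HHEEECCCCC"   # prediction, X
z = confusion_matrix(actual, predicted)
for label, row in zip(CLASSES, z):
    print(label, row)
print(q_total(z))
```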

Q3 Score
• Q3 = Wαα + Wββ + Wcc
Wαα = % of all residues correctly predicted as helix
Wββ = % of all residues correctly predicted as sheet
Wcc = % of all residues correctly predicted as coil
• Example of Q3 calculation
Protein: 10% helices, 10% sheets, 80% coils
Prediction: 100% coils
Q3 = 0% + 0% + 80% = 80%

Q3 Score
• Q3 = Wαα + Wββ + Wcc
Wαα = % of all residues correctly predicted as helix
Wββ = % of all residues correctly predicted as sheet
Wcc = % of all residues correctly predicted as coil
• Example of Q3 calculation, length 10
Amino acid (primary structure) sequence (A): MTYTRYQTLE
(Secondary structure) data for comparison (D): HHHEEECCCC
(Secondary structure) prediction (X): HHEEECCCCC
Q3 = 2/10 + 2/10 + 4/10 = 0.80 (80%)
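
Both Q3 examples can be checked with a short sketch. The helper name `q3_terms` is hypothetical; each term is the fraction of all residues that belong to a class and are predicted as that class.

```python
def q3_terms(actual, predicted):
    """Return the per-class contributions to Q3 as fractions of all residues."""
    n = len(actual)
    terms = {}
    for cls in "HEC":
        correct = sum(1 for a, p in zip(actual, predicted) if a == cls and p == cls)
        terms[cls] = correct / n
    return terms

# Length-10 example from the slide.
terms = q3_terms("HHHEEECCCC", "HHEEECCCCC")
q3 = sum(terms.values())
print(terms, q3)

# All-coil prediction for a protein that is 10% helix, 10% sheet, 80% coil.
q3_all_coil = sum(q3_terms("H" + "E" + "C" * 8, "C" * 10).values())
print(q3_all_coil)
```

The second case shows why Q3 alone can flatter a trivial predictor: predicting coil everywhere still scores 0.80 on a coil-rich protein.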

Related Work Not easy to evaluate performance of a protein secondary structure prediction method (e.g., different datasets used for training and testing) Rost and Sander (1993a) selected a list of 126 protein domains (RS126); now constitutes comparative standard Cuff and Barton (1999) described development of non-redundant test set of 396 protein domains (CB396) PHD, one of the first methods surpassing the 70% accuracy threshold, uses multiple sequence alignments as input to a neural network (Rost and Sander, 1993b)

Rost’s Neural Network (Rost and Sander 1993a) Image courtesy of Rost and Sander

Rost’s Neural Network (Rost and Sander 1993a) PHD, uses multiple sequence alignments as input to a neural network (Rost and Sander, 1993b)

BLAST-ERT-RICO
• Given input protein A (amino acid sequence, A = a_1, a_2, … a_N), a protein BLAST search (Web-based) is performed using A as the query sequence
• BLAST returns a list of proteins with significant sequence alignments
• Suitable proteins are chosen to form the training dataset for A
• The ERT-RICO algorithm generates rules from the training dataset; the rules are used to predict the secondary structure for protein A
• Output is the predicted secondary structure sequence X

BLAST-ERT-RICO Step 1: Online BLAST and PDB Data Match
• BLAST search (Web crawler program) is performed using A as the query sequence
• It returns a list of proteins with significant sequence alignments and corresponding BLAST scores; proteins with score ≤ 30 are removed from the list (test protein A is also removed)
• Some of these proteins may have corresponding secondary structure records in PDB (Berman et al., 2000)
• Those records are retrieved and become inputs to the next step, data preparation
• If a protein from the list does not have a known secondary structure record in PDB, it will require data from offline preprocessing
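
The filtering rule in this step (drop hits with score ≤ 30 and drop the test protein itself) might look like the sketch below. The hit identifiers and scores are made up for illustration, and the function name is hypothetical; the slide does not describe the crawler's actual data format.

```python
SCORE_CUTOFF = 30  # from the slide: proteins with score <= 30 are removed

def select_training_hits(hits, query_id, cutoff=SCORE_CUTOFF):
    """Keep hits scoring above the cutoff, excluding the query protein itself."""
    return [(pid, score) for pid, score in hits
            if score > cutoff and pid != query_id]

# Hypothetical (protein id, BLAST score) pairs.
hits = [("QUERY", 250.0), ("HIT_A", 120.5), ("HIT_B", 30.0), ("HIT_C", 31.2)]
selected = select_training_hits(hits, query_id="QUERY")
print(selected)
```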

BLAST-ERT-RICO Step 2: Data Preparation (Math content, skip)
• For test protein A, there is a set of protein primary structure sequences B_i and a set of corresponding secondary structure sequences C_i, where B_i ∈ {B_1, B_2, B_3, B_4, … B_y} and C_i ∈ {C_1, C_2, C_3, C_4, … C_y}
• Primary structure sequence is B_i = b_{i,1}, b_{i,2}, b_{i,3}, … b_{i,w_i}
• Corresponding secondary structure sequence is C_i = c_{i,1}, c_{i,2}, c_{i,3}, … c_{i,w_i}
• B_1 to B_y are not necessarily of the same length, because they represent different proteins
• Each b_{i,j} is an element of a set of 20 amino acids, {A,R,N…V}
• c_{i,j} is an element of the set of 8-state secondary structures, {H, G, I, E, B, T, S, -} (PDB); it is converted to an element of a set of 4-state secondary structures, {H, E, C, -}
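
The slide does not say which 8-state letters map to which of the 4 states. The sketch below assumes one commonly used reduction (H, G, I to H; E, B to E; T, S to C; "-" kept as is); this mapping is an assumption for illustration, not something confirmed by the presentation. The input string is the 1SAN:A record shown earlier.

```python
# ASSUMED 8-state -> 4-state reduction; the slide does not specify the mapping.
REDUCE = {
    "H": "H", "G": "H", "I": "H",   # helix-like states
    "E": "E", "B": "E",             # strand-like states
    "T": "C", "S": "C",             # turn and bend treated as coil
    "-": "-",                       # unassigned state kept as its own symbol
}

def reduce_states(secstr):
    """Map every 8-state symbol to its assumed 4-state counterpart."""
    return "".join(REDUCE[s] for s in secstr)

dssp = "----HHHHHHHHHHHHH-SS--HHHHHHHHHHHT--SHHHHHHHHHHHHTTTTTS-TT-S--"
reduced = reduce_states(dssp)
print(reduced)
```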

BLAST-ERT-RICO Step 2Data Preparation • Protein primary structure n-residue segments and related secondary structure elements representation (n=9)

BLAST-ERT-RICO Step 2: Data Preparation (Math content, skip)
• If B_i is the primary structure sequence, C_i is the secondary structure sequence, and the length of the sequences is w_i, then each n-residue segment is of the form: b_{i,j-floor(n/2)}, … b_{i,j-1}, b_{i,j}, b_{i,j+1}, … b_{i,j+floor(n/2)}, c_{i,j}; and j has a value from ceiling(n/2) to (w_i - floor(n/2))
• This data preparation step is performed for all B_i and C_i pairs, where i is from 1 to y
• These n-residue segments are the main inputs to the ERT-RICO rule generation algorithm
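
The segment definition amounts to a sliding window of odd length n, labelled with the secondary structure of the centre residue. A minimal sketch, assuming 0-based indexing and a hypothetical function name; the toy inputs are the first 13 symbols of the 1SAN:A records shown earlier.

```python
def make_segments(primary, secondary, n=9):
    """Return (window, centre_label) pairs for every full n-residue window."""
    assert len(primary) == len(secondary)
    half = n // 2
    segments = []
    # 0-based centre j covers the slide's 1-based range
    # ceiling(n/2) .. w - floor(n/2) for odd n.
    for j in range(half, len(primary) - half):
        window = primary[j - half : j + half + 1]
        segments.append((window, secondary[j]))
    return segments

primary = "MTYTRYQTLELEK"
secondary = "----HHHHHHHHH"
segments = make_segments(primary, secondary, n=9)
for window, label in segments:
    print(window, label)
```

A sequence of length w yields w - n + 1 segments, so the 13-residue toy input gives 5 rows.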

BLAST-ERT-RICO Step 3: Rule Generation
• Sample rules generated by ERT-RICO (n=9, m=1708)
+,+,+,L,+,+,+,+,S,E,84.21,19,16,0.93676815
+,+,+,T,V,+,+,+,+,E,76.47,51,39,2.28337237
Q,A,+,+,+,+,+,+,G,E,100.00,7,7,0.40983607
……
(3,L)(8,S) -> (9,E), 84.21%, occurrences of ((3,L)(8,S)) = 19, occurrences of ((3,L)(8,S) -> (9,E)) = 16, Support % = 0.93676815
(3,T)(4,V) -> (9,E), 76.47%, occurrences of ((3,T)(4,V)) = 51, occurrences of ((3,T)(4,V) -> (9,E)) = 39, Support % = 2.28337237
(0,Q)(1,A)(8,G) -> (9,E), 100.00%, occurrences of ((0,Q)(1,A)(8,G)) = 7, occurrences of ((0,Q)(1,A)(8,G) -> (9,E)) = 7, Support % = 0.40983607
……
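
The relationship between the comma-separated records and the readable rules can be sketched as follows. The field layout (n positions, decision, confidence %, antecedent count, rule count, support %) is inferred from the sample rows above, and the function names are hypothetical.

```python
def parse_rule(record, n=9):
    """Split one comma-separated rule record into its parts."""
    fields = record.split(",")
    antecedent = [(pos, aa) for pos, aa in enumerate(fields[:n]) if aa != "+"]
    decision = fields[n]
    confidence = float(fields[n + 1])   # percent
    n_antecedent = int(fields[n + 2])   # occurrences of the antecedent
    n_rule = int(fields[n + 3])         # occurrences of antecedent -> decision
    support = float(fields[n + 4])      # percent of all m segments
    return antecedent, decision, confidence, n_antecedent, n_rule, support

def format_rule(record, n=9):
    """Render a record in the (position,letter) -> (n,decision) notation."""
    antecedent, decision, *_ = parse_rule(record, n)
    lhs = "".join(f"({pos},{aa})" for pos, aa in antecedent)
    return f"{lhs} -> ({n},{decision})"

record = "+,+,+,L,+,+,+,+,S,E,84.21,19,16,0.93676815"
_, _, conf, n_ante, n_rule, supp = parse_rule(record)
print(format_rule(record))
# Recompute confidence and support from the counts, with m = 1708 segments.
print(round(100 * n_rule / n_ante, 2), round(100 * n_rule / 1708, 8))
```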

BLAST-ERT-RICO Step 4: Prediction
• Protein primary structure n-residue segments and related secondary structure elements prediction (n=9)
• Here x_i is an element of the set {H,E,C,-}. It is then converted to an element of the set {H, E, C}.

Exhaustive RT-RICO (ERT-RICO) Rule Generation Algorithm
• Rule generation is the most computationally intensive step
• Previously, this research team presented a prediction method, BLAST-RT-RICO
• Some areas of the algorithm were in need of improvement; most importantly, the time complexity of the rule generation step needed to be reduced
• RT-RICO has a time complexity of O(m^2 2^n), where m is the number of all entities (the number of rows of n-residue segments), and n = |S| (the number of attributes)
• m^2 dominates the time complexity because n is a small value (9 in this case)

Exhaustive RT-RICO (ERT-RICO) Rule Generation Algorithm
• Sometimes a very large m can cause running time issues
• When we ran datasets with different n value and t (threshold) value combinations to find the optimal segment length and threshold value, we faced the challenge of running several datasets in a reasonable period of time
• We developed the Exhaustive RT-RICO algorithm (ERT-RICO), which is a modified version of the old RT-RICO algorithm, and has an improved time complexity of O(m log(m) 2^n); m log(m) dominates the time complexity
• ERT-RICO has a space complexity of O((2^n - 1)(20^n)(4)); in practice the space required is much smaller than that, because different segments generate a large number of duplicate rules
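
To get a feel for the difference between the two bounds, the sketch below compares the leading terms for m = 1708 and n = 9 (the values quoted with the sample rules). This is only a back-of-the-envelope comparison: it ignores constant factors, and the base-2 logarithm is an assumption, since the slide does not state the base.

```python
import math

def rt_rico_cost(m, n):
    """Leading term of the RT-RICO bound, m^2 * 2^n (constants ignored)."""
    return m * m * 2 ** n

def ert_rico_cost(m, n):
    """Leading term of the ERT-RICO bound, m * log2(m) * 2^n (base 2 assumed)."""
    return m * math.log2(m) * 2 ** n

m, n = 1708, 9
ratio = rt_rico_cost(m, n) / ert_rico_cost(m, n)
print(f"m^2 2^n / (m log2(m) 2^n) = {ratio:.1f}")
```

The 2^n factor cancels, so the ratio is simply m / log(m) and grows with the number of segments.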

ERT-RICO, Number of All Possible Rules

ERT-RICO Rule Generation Algorithm
• Space complexity could be an issue; for n = 9, we need (2^9 - 1)(20^9)(4), around 1.04653 × 10^15 counters; we made data structure adjustments (different segments generate lots of duplicate rules)
• We know all possible values for each position in a segment (hence all possible rules)
• For an m×(n+1) matrix, each row (segment) is of length n+1
• The first n elements are made up of letters from a set of 20 amino acid residues, {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y}, and the last element is a letter from a set of four secondary structure states {H, E, C, -}
• Convert a rule to a numeric value, and convert the number back to the original rule
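
The counter estimate quoted for n = 9 can be checked directly; the function name below is hypothetical.

```python
def counter_upper_bound(n, n_amino=20, n_states=4):
    """Evaluate (2^n - 1) * (n_amino^n) * n_states, the bound given on the slide."""
    return (2 ** n - 1) * (n_amino ** n) * n_states

bound = counter_upper_bound(9)
print(f"{bound:.5e}")
```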

ERT-RICO, Converting A Rule to A Number
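
The figure for this slide is not reproduced in the transcript, so the exact conversion used by the authors is not shown here. One plausible way to make a rule and a number interchangeable is a mixed-radix encoding: each of the n positions takes one of 21 symbols (a wildcard or one of 20 amino acids) and the decision takes one of 4 states. The sketch below is an assumption of that kind, not the authors' scheme, and all names in it are hypothetical.

```python
AMINO = "ACDEFGHIKLMNPQRSTVWY"
STATES = "HEC-"
WILDCARD = "+"
SYMBOLS = WILDCARD + AMINO  # 21 possible symbols per antecedent position

def rule_to_number(positions, decision):
    """Encode a rule (string of n symbols plus a decision) as one integer."""
    key = 0
    for sym in positions:
        key = key * len(SYMBOLS) + SYMBOLS.index(sym)
    return key * len(STATES) + STATES.index(decision)

def number_to_rule(key, n=9):
    """Invert rule_to_number: recover the n symbols and the decision."""
    key, d = divmod(key, len(STATES))
    syms = []
    for _ in range(n):
        key, s = divmod(key, len(SYMBOLS))
        syms.append(SYMBOLS[s])
    return "".join(reversed(syms)), STATES[d]

key = rule_to_number("+++L++++S", "E")
print(key, number_to_rule(key))
```

Because the mapping is invertible, only the integer key needs to be stored in the hash table, and the letters can be regenerated when the rules are written out.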

ERT-RICO Rule Generation Algorithm
• The ERT-RICO rule generation algorithm finds the set C of all relaxed coverings of R in S (and the related rules), with threshold probability t (0 < t ≤ 1), where S is the set of all attributes, and R is the set of all decisions.
• The input to ERT-RICO is in the form of an m×(n+1) matrix, where m is the number of all entities (the number of segments, each made of n residues plus one secondary structure element), and n = |S| (the number of attributes).
Algorithm 2: ERT-RICO
begin
  for each segment (each row of matrix)
    for each of the 2^n - 1 rules that can be generated from segment
      generate unique hash key which is a numeric index
      if hash index does not exist in the hash table then
        add hash index and hash value (1) to the hash table (hash value = number of occurrences of each rule)
      else
        update hash value in the hash table (hash value = hash value + 1)
      end-if
    end-for
  end-for
  for each key in the hash table
    generate rule from key (in amino acid and secondary structure letters)
    calculate confidence and support using hash value and related keys
    if confidence > t then
      add rule, confidence, and support to output file
    end-if
  end-for
end-algorithm.
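
A compact Python sketch of the same counting idea follows. It is not the authors' implementation (which was written in Perl): it uses tuples as dictionary keys instead of numeric hash indices, keeps a second counter for antecedents so that confidence can be computed, and reports confidence and support as fractions rather than percentages. All names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def ert_rico(segments, t):
    """segments: list of (window, label) pairs; t: confidence threshold in (0, 1]."""
    m = len(segments)
    rule_counts = Counter()        # (antecedent, label) -> occurrences
    antecedent_counts = Counter()  # antecedent -> occurrences
    for window, label in segments:
        n = len(window)
        # Every non-empty subset of positions gives one of the 2^n - 1 rules.
        for k in range(1, n + 1):
            for positions in combinations(range(n), k):
                antecedent = tuple((p, window[p]) for p in positions)
                antecedent_counts[antecedent] += 1
                rule_counts[(antecedent, label)] += 1
    rules = []
    for (antecedent, label), count in rule_counts.items():
        confidence = count / antecedent_counts[antecedent]
        support = count / m
        if confidence > t:
            rules.append((antecedent, label, confidence, support))
    return rules

# Toy input with n = 3 (hypothetical segments, not real data).
toy = [("AKL", "H"), ("AKV", "H"), ("AKL", "E"), ("GKL", "E")]
rules = ert_rico(toy, t=0.6)
for antecedent, label, confidence, support in rules:
    print(antecedent, "->", label, round(confidence, 3), support)
```

In the toy input, the antecedent (0,A)(1,K) occurs three times and is followed by H twice, so the rule (0,A)(1,K) -> H has confidence 2/3 and support 2/4 and passes the 0.6 threshold.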

Conclusion
• ERT-RICO has an improved time complexity of O(m log(m) 2^n)
• This improvement over RT-RICO’s O(m^2 2^n) enabled the research team to run much larger test datasets with different choices of segment length and threshold value
• At the time of this paper’s submission, preliminary test results showed that BLAST-ERT-RICO achieved a Q3 score of 92.19% on the standard test dataset RS126
• The adoption of the ERT-RICO algorithm also resolves the space complexity issues of our earlier implementations

Conclusion
• The test programs (rule generation and prediction for the RS126 set, at n=9) were written in Perl and executed on a computer with an Intel Dual-Core processor, 32 GB of RAM, and Windows 7 OS
• The total program running time was approximately 21 days (this can be improved in the future)
• Even with the use of standard test datasets, it is still difficult to compare the accuracies of prediction methods
• The RS126 set is a very representative test dataset; all test proteins can generate a number of significant alignments through BLAST

::: Thank You :::
Leong Lee, Department of Computer Science, Austin Peay State University, Clarksville, Tennessee, USA
Jennifer L. Leopold, Department of Computer Science, and Ronald L. Frank, Department of Biological Sciences, Missouri University of Science and Technology, Rolla, Missouri, USA

References Altschul, S. F., Madden, T. L., Schäffer, A. A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D. J. (1997) ‘Gapped BLAST and PSI-BLAST: a new generation of protein database search programs’, Nucleic Acids Res., Vol. 25, No. 17, pp.3389-402. Andreeva, A., Howorth, D., Chandonia, J. M., Brenner, S. E., Hubbard, T. J., Chothia, C. and Murzin, A. G. (2008) ‘Data growth and its impact on the SCOP database: new developments’, Nucleic Acids Res, Vol. 36 (Database issue), D419-25. Baldi, P., Brunak, S., Chauvin, Y., Andersen, C. A. F. and Nielsen, H. (2000) ‘Assessing the accuracy of prediction algorithms for classification: an overview’, Bioinformatics, Vol. 16, No. 5, pp.412-24. Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., Shindyalov, I. N. and Bourne, P. E. (2000) ‘The Protein Data Bank’, Nucleic Acids Res., Vol. 28, No. 1, pp.235-42. BLAST (2009). BLAST: Basic Local Alignment Search Tool. Obtained through the Internet: http://blast.ncbi.nlm.nih.gov/, [accessed 30/11/2009] Bryson, K., McGuffin, L. J., Marsden, R. L., Ward, J. J., Sodhi, J. S. and Jones, D. T. (2005) ‘Protein structure prediction servers at University College London’, Nucleic Acids Res., Vol. 33(Web Server issue), W36-8. Cuff, J. A. and Barton, G. (1999) ‘Evaluation and improvement of multiple sequence methods for protein secondary structure prediction’, Proteins, Vol. 34, pp.508–519. Cuff, J. A. and Barton, G. (2000) ‘Application of multiple sequence alignment profiles to improve protein secondary structure prediction’, Proteins, Vol. 40, No. 3, pp.502-11. Fadime, U. Y., Özlem, Y. and Metin, T. (2008) ‘Prediction of secondary structures of proteins using a two-stage method’, Computers & Chemical Engineering, Vol. 32, No. 1-2, pp.78-88.

References Frishman, D. and Argos, P. (1997) ‘Seventy-five percent accuracy in protein secondary structure prediction’, Proteins, Vol. 27, pp.329–335. Grzymala-Busse, J. W. (1991) ‘Ch.3. Knowledge Acquisition’, Managing Uncertainty in Expert Systems, (pp.43-76), Boston: Kluwer Academic. Han, J. and Kamber, M. (2001) Data Mining: Concepts and Techniques, (pp.155-157) Morgan Kaufmann. Hu, H., Pan, Y., Harrison, R. and Tai, P. (2004) ‘Improved protein secondary structure prediction using support vector machine and a new encoding scheme and an advanced tertiary classifier’, IEEE Trans.NanoBiosci., Vol. 3, pp.265–271. Jones, D. T. (1999) ‘Protein secondary structure prediction based on position-specific scoring matrices’, J. Mol. Biol., Vol. 292, No. 2, pp.195-202. Jones, N. C. and Pevzner, P. A. (2004) An Introduction to Bioinformatics Algorithms, MIT Press. Kabsch, W. and Sander, C. (1983) ‘How good are predictions of protein secondary structure?’, FEBS Letters, Vol. 155, pp.179-182. Kim, H. and Park, H., (2003) ‘Protein secondary structure prediction based on an improved support vector machines approach’, Protein Eng., Vol. 16, pp.553-60. King, R. D. and Sternberg, M. J. E. (1996) ‘Identification and application of the concepts important for accurate and reliable protein secondary structure prediction’, Protein. Sci., Vol. 5, pp.2298–2310. Klepeis, J. L. and Floudas, C. A. (2002) ‘Ab initio prediction of helical segments in polypeptides’, J Comput. Chem, Vol. 23, No. 2, pp.245-66.

References Leopold, J. L., Maglia, A. M., Thakur, M., Patel, B. and Ercal, F. (2007) ‘Identifying Character Non-Independence in Phylogenetic Data Using Parallelized Rule Induction From Coverings’, Data Mining VIII: Data, Text, and Web Mining and Their Business Applications, WIT Transactions on Information and Communication Technologies, Vol. 38, pp. 45-54. Levitt, M. and Chothia, C. (1976) ‘Structural patterns in globular proteins’, Nature, Vol. 261, No. 5561, pp.552-8. Lee, L., Leopold, J. L., Frank, R. L., and Maglia, A. M. (2009) ‘Protein Secondary Structure Prediction Using Rule Induction from Coverings,’ Proceedings of IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology 2009, Nashville, Tennessee, USA, pp. 79-86. Lee, L., Kandoth, C., Leopold, J. L., and Frank, R. L. (2010a) ‘Protein Secondary Structure Prediction Using Parallelized Rule Induction from Coverings,’ International Journal of Medicine and Medical Sciences, Vol. 1, No. 2, pp. 99-105. Lee, L., Leopold, J. L., Kandoth, C., and Frank, R. L. (2010b) ‘Protein secondary structure prediction using RT-RICO: a rule-based approach,’ The Open Bioinformatics Journal, Vol. 4, pp. 17-30. Lee, L., Leopold, J. L., Edgett, P. G., and Frank, R. L. (2010c) ‘Rule Visualization of Protein Motif Sequence Data for Secondary Structure Prediction,’ Proceedings of ANNIE 2010 conference, St. Louis, Missouri, USA. Lee, L., Leopold, J. L., and Frank, R. L. (2011) ‘Protein secondary structure prediction using BLAST and Relaxed Threshold Rule Induction from Coverings,’ Proceedings of IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology 2011, Paris, France, accepted for publication. Lesk, A. M. (2008) Introduction to Bioinformatics, 3rd Edition, Oxford. Maglia, A. M., Leopold, J. L. and Ghatti, V. R. (2004) ‘Identifying Character Non-Independence in Phylogenetic Data Using Data Mining Techniques’, Proc. Second Asia-Pacific Bioinformatics Conference, Dunedin, New Zealand.

References Murzin, A. G., Brenner, S. E., Hubbard, T. and Chothia, C. (1995) ‘SCOP: a structural classification of proteins database for the investigation of sequences and structures’, J Mol. Biol, Vol. 247, No. 4, pp.536-40. Nguyen, N. and Rajapakse, J. C. (2007) ‘Two stage support vector machines for protein secondary structure prediction’, Intl. J. Data Mining & Bioinformatics, Vol. 1, pp.248-269. Pawlak, Z. (1984) ‘Rough Classification’, Int. J. Man-Machine Studies, Vol. 20, pp.469-483. Rost, B. and Sander, C. (1993a) ‘Prediction of protein secondary structure at better than 70% accuracy’, J. Mol. Biol., Vol. 232, pp.584-599. Rost, B. and Sander, C. (1993b) ‘Improved prediction of protein secondary structure by use of sequence profiles and neural networks’, Proc. Natl. Acad. Sci. USA, Vol. 90, pp.7558–7562. Rost, B. (2003) ‘Rising accuracy of protein secondary structure prediction’, In: Chasman, D. (Ed.), Protein structure determination, analysis, and modeling for drug discovery, (pp.207–249), New York: Dekker. Salamov, A. A. and Solovyev, V. V. (1995) ‘Prediction of protein secondary structure by combining nearest-neighbor algorithms and multiple sequence alignments’, J Mol. Biol., Vol. 247, pp.11–15. Tramontano, A. (2006) Protein Structure Prediction, Wiley-VCH. Wong, P. C., Whitney, P. and Thomas, J. (1999) ‘Visualizing Association Rules for Text Mining’ Proceedings of the 1999 IEEE Symposium on Information Visualization, pp. 120-123, 152. Zhang, C. T. and Zhang, R. (2003) ‘Q9, a content-balancing accuracy index to evaluate algorithms of protein secondary structure prediction’, Int J Biochem Cell Biol., Vol. 35, No. 8, pp.1256-62.
