
Neural Networks in Bioinformatics

Neural Networks in Bioinformatics. I-Fang Chung ifchung@ym.edu.tw Institute of Bioinformatics, YM 4-27-2006. Experience and Education. 1989-2000 Electrical and Control Engineering in NCTU 2000-2003 (Postdoc) ECE: Laboratory of Intelligent Control


Presentation Transcript


  1. Neural Networks in Bioinformatics I-Fang Chung ifchung@ym.edu.tw Institute of Bioinformatics, YM 4-27-2006

  2. Experience and Education • 1989-2000 Electrical and Control Engineering in NCTU • 2000-2003 (Postdoc) ECE: Laboratory of Intelligent Control • 2003-2004 (Postdoc) Laboratory of DNA Information Analysis of Human Genome Center, Institute of Medical Science, Tokyo University • 2004-now Institute of Bioinformatics, Yang-Ming

  3. Outline • Motivation • To solve one problem in bioinformatics • Identification of RNA-Interacting Residues in Protein • Current projects

  4. Neural Networks • Neural networks are constructed to resemble the behavior of the human brain (neurons) • They are characterized by the ability to learn, recall, and generalize from training patterns • [Figure: a single neuron i with inputs x1, x2, …, xm, weights wi1, wi2, …, wim, net input neti, activation function a(.), and output yi]
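
The neuron in the slide 4 figure can be sketched in a few lines. This is a minimal illustration, not the speaker's implementation: it assumes the net input is the weighted sum of the inputs and that the activation a(.) is a logistic sigmoid, which the slide does not specify. The function names are hypothetical.

```python
import math


def sigmoid(net):
    """Logistic activation; an assumed choice for a(.), not stated on the slide."""
    return 1.0 / (1.0 + math.exp(-net))


def neuron_output(inputs, weights, activation=sigmoid):
    """Compute y_i = a(net_i) with net_i = sum_j w_ij * x_j."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    net = sum(w * x for w, x in zip(weights, inputs))
    return activation(net)


# Three inputs x1..x3 with weights wi1..wi3; here the weighted sum is zero.
y = neuron_output([1.0, 0.0, 1.0], [0.5, -0.3, -0.5])
```

Any other differentiable activation can be passed through the `activation` argument without changing the rest of the sketch.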

  5. Neural Networks (cont’d) • Good at tasks such as pattern matching, classification, function approximation, and data clustering • Good at tasks in bioinformatics such as coding region recognition, protein structure prediction, and gene clustering • [Figure: network diagram with inputs x1, x2, …, xn, weights v and w, and output y]

  6. Basic Principles of Discrimination • Each object is associated with a class label (or response) Y ∈ {1, 2, …, K} and a feature vector (vector of predictor variables) of G measurements: X = (X1, …, XG) • Aim: predict Y from X • [Figure: objects sorted into predefined classes {1, 2, …, K}; a labelled object has class label Y = 2 and feature vector X = {colour, shape}; the classification rule must assign Y = ? to a new object with X = {red, square}]

  7. Example • Learning set: arrays (objects) with gene expression feature vectors • Predefined classes (clinical outcome): bad prognosis (recurrence < 5 yrs) and good prognosis (recurrence > 5 yrs) • The classification rule assigns a new array to one of the classes (in the figure: good prognosis, metastasis > 5 yrs) • Reference: L. van’t Veer et al. (2002), Gene expression profiling predicts clinical outcome of breast cancer. Nature, Jan.

  8. Design Issues • Domain knowledge, e.g. biology (molecule, chemistry) • Problem definition (desired input/output mapping) • Neural network applications: molecular structure, sequence discrimination, feature detection, classification, structure prediction • Example inputs: DNA: ATGCGCTC, Protein: MASSTFYI • Pre-processing: feature representation (knowledge extraction), input encoding • Neural network (human brain): network architecture, learning algorithm, parameter adjustment • Post-processing: output encoding • Training data sets, testing data sets, system evaluation

  9. Prediction of Protein Secondary Structures • Adopted from Qian and Sejnowski, 1988

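  10. Sliding Window • Sliding window concept • A substring of consecutive amino acids (a window of width w) is taken as the input • Only the central position of the window is examined, to detect which kind of secondary-structure (2-D) info. occurs there • [Figure: a window of width w sliding along amino acid chains Chain_1, Chain_2, Chain_3, …; inputs x1, x2, x3; outputs y1, y2, y3 (2-D info)]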

  11. Binary Bit Encoding Method • Example code word: 000001000000000000000 • Input encoding for each input pattern • Unary encoding scheme for protein sequences • 21 binary bits: 20 for the amino acid types, plus 1 bit for window positions that overlap a chain terminal • Input layer with multiple input patterns • A window of ‘w’ consecutive residues is considered • ‘21 * w’ units for the sequence only • Output layer with 3 units • Describes which kind of 2-D info. occurs (‘1, 0, 0’ for helix, ‘0, 1, 0’ for sheet, ‘0, 0, 1’ for coil) • One hidden layer for non-linear 2-class pattern classification
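
The sliding-window unary encoding described on slides 10 and 11 could look roughly like the sketch below. Several details are assumptions rather than facts from the slides: the alphabetical ordering of the 20 amino acid letters, the `-` padding symbol that occupies the 21st bit when the window runs past a chain end, and the function names.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 types
PAD = "-"                             # hypothetical symbol for "beyond the terminal"
SYMBOLS = AMINO_ACIDS + PAD           # 21 symbols, one bit each


def one_hot(symbol):
    """Return the 21-bit unary code for one window position."""
    bits = [0] * len(SYMBOLS)
    bits[SYMBOLS.index(symbol)] = 1
    return bits


def encode_windows(sequence, w):
    """Return one input vector of 21 * w bits per residue of the sequence.

    Each vector encodes the w residues centred on that residue; positions
    that fall outside the chain are encoded with the padding bit.
    """
    if w % 2 == 0:
        raise ValueError("window size w must be odd so that it has a centre")
    half = w // 2
    padded = PAD * half + sequence + PAD * half
    vectors = []
    for centre in range(len(sequence)):
        window = padded[centre:centre + w]
        vector = []
        for symbol in window:
            vector.extend(one_hot(symbol))
        vectors.append(vector)
    return vectors


# The protein fragment from slide 8 with a window of 5 residues.
inputs = encode_windows("MASSTFYI", 5)
```

Each vector would feed the `21 * w` input units; the 3-unit output target (helix, sheet, coil) applies to the residue at the window centre.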

  12. More Complex NN Structure: PHD • Multiple sequence alignment (MSA) is a way to compare multiple sequences; the result is called an alignment profile • Breakthrough: use the evolutionary information in the MSA instead of a single sequence • Adopted from Rost and Sander, 1993

  13. Outline • Motivation • To solve one problem in bioinformatics • Identification of RNA-Interacting Residues in Protein • Current projects

  14. Identification of RNA-Interacting Residues in Protein • Task • Predicting putative RNA-interacting sites within a protein chain • Given a protein sequence, find the RNA-binding positions (residues) • Method • Using a feedforward neural network based on sequence profiles • Analyzing and qualifying a large set of the network weights trained on sequence profiles

  15. Data Generation • Source: Protein Data Bank (PDB) • Collect protein-RNA complexes resolved by X-ray at ≤ 3.0 Å • Remove redundant protein structures with sequence identity over 70% • 86 non-homologous protein chains (21990 residues) • Residues in interaction sites • The closest distance between atoms of the protein and the partner RNA is less than 7 Å • Hydrogen-bond, stacking, electrostatic, hydrophobic, and van der Waals interactions are considered • Residues in interaction sites: 21.7% (4782)
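
The distance criterion on slide 15 can be illustrated as follows. This is only a sketch under stated assumptions: atom coordinates are plain (x, y, z) tuples in Å, a chain is a list of residues with each residue a list of atom coordinates, and the function names are hypothetical. Parsing PDB files is outside its scope.

```python
import math

CUTOFF_ANGSTROM = 7.0  # distance threshold quoted on the slide


def is_interaction_residue(residue_atoms, rna_atoms, cutoff=CUTOFF_ANGSTROM):
    """True if any protein atom of the residue lies closer than cutoff to any RNA atom."""
    return any(
        math.dist(protein_atom, rna_atom) < cutoff
        for protein_atom in residue_atoms
        for rna_atom in rna_atoms
    )


def label_chain(chain, rna_atoms, cutoff=CUTOFF_ANGSTROM):
    """Return one 0/1 label per residue (1 = interaction site)."""
    return [
        1 if is_interaction_residue(residue, rna_atoms, cutoff) else 0
        for residue in chain
    ]


# Toy data: two single-atom residues and one RNA atom.
toy_chain = [[(0.0, 0.0, 0.0)], [(20.0, 0.0, 0.0)]]
toy_rna = [(3.0, 4.0, 0.0)]
labels = label_chain(toy_chain, toy_rna)
```

The all-pairs loop is quadratic in the number of atoms; for whole complexes a spatial index would be the usual refinement.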

  16. [Figure: a classifier applied to a window of width w sliding along chains Chain_1, Chain_2, Chain_3, …; inputs x1, x2, x3 (amino acids, 2D info., appearance probability); outputs y1, y2 (interaction site or not)]

  17. PSSM • Position-Specific Iterated BLAST (PSI-BLAST) • A strong measure of residue conservation at a given location • Position-specific scoring matrix (PSSM) • A 20-dimensional vector representing probabilities of conservation against mutations to 20 different amino acids, including itself • Positions important to the function of a protein tend to be conserved in the course of evolution

  18. Experimental Results (cont’d) • Agreement with structural studies of protein-RNA interactions • Arg, Lys, Ser, Thr, Asp and Glu are preferred in hydrogen bonding • Phe and Ser are frequently found in van der Waals and stacking interactions • Some conflicting situations • Ala, Leu and Val are known to be less preferred types in interactions • Asn is typically thought of as one of the most preferred amino acid types in hydrogen bonding • Adopted from Jeong and Miyano, 2006

  19. Saliency Factor • Objective: define a matrix to represent the importance of the presence of specific residues at specific positions • Step 1: Normalization of weight xij for each input unit aij • M: the window size, 1 ≤ i ≤ M • N: the # of distinct residue symbols, 1 ≤ j ≤ N • H: the # of hidden units, 1 ≤ k ≤ H • Adopted from Jeong and Miyano, 2006

  20. Saliency Factor (cont’d) • Weight conservation: the amount of weight information represented at each position i in the given window, defined as the difference between the maximum entropy and the entropy of the observed weight distribution • Saliency factor of residue j at window position i • New input • M: the window size, 1 ≤ i ≤ M • N: the # of distinct residue symbols, 1 ≤ j ≤ N • H: the # of hidden units, 1 ≤ k ≤ H • Adopted from Jeong and Miyano, 2006
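
The verbal definition of weight conservation (maximum entropy minus observed entropy) can be made concrete with a small sketch. The slide's own formulas are not in the transcript, so the following rests on assumptions: the normalized weights at one window position are treated as a probability distribution over the N residue symbols, entropy is measured in bits (log base 2), and the function name is hypothetical. It does not reproduce the saliency factor itself.

```python
import math


def weight_conservation(distribution):
    """Maximum entropy log2(N) minus the Shannon entropy of the distribution.

    `distribution` holds the normalized weights of the N residue symbols at
    one window position and is assumed to sum to 1.
    """
    if abs(sum(distribution) - 1.0) > 1e-9:
        raise ValueError("normalized weights must sum to 1")
    n = len(distribution)
    # Terms with p == 0 contribute nothing to the entropy.
    entropy = -sum(p * math.log2(p) for p in distribution if p > 0)
    return math.log2(n) - entropy


# A position whose weight is spread evenly carries no information;
# a position dominated by one residue carries the maximum, log2(N) bits.
flat = weight_conservation([0.25, 0.25, 0.25, 0.25])
peaked = weight_conservation([1.0, 0.0, 0.0, 0.0])
```

Under these assumptions the value ranges from 0 (uniform weights) to log2(N), which is about 4.32 bits for N = 20.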

  21. Notations • Four kinds of measuring parameters are defined: • True Positive (TP): the number of accurately predicted interaction sites • True Negative (TN): the number of accurately predicted not-interaction sites • False Positive (FP): the number of inaccurately predicted interaction sites • False Negative (FN): the number of inaccurately predicted not-interaction sites • Example (1: positive, 0: negative): Observed 0101000010011001111000, Predicted 1100001110001111110011, with each position labelled TN, FN, FP, or TP
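
Tallying the four counts from a pair of label strings is mechanical; the sketch below applies the slide's definitions to the slide's own example strings. The function name and the dictionary layout are illustrative choices, not part of the presentation.

```python
def confusion_counts(observed, predicted):
    """Count TP, TN, FP, FN for two equal-length strings of '0' and '1'."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for obs, pred in zip(observed, predicted):
        if pred == "1":
            # Predicted interaction site: correct only if observed is also 1.
            counts["TP" if obs == "1" else "FP"] += 1
        else:
            # Predicted not-interaction site: correct only if observed is 0.
            counts["TN" if obs == "0" else "FN"] += 1
    return counts


observed = "0101000010011001111000"
predicted = "1100001110001111110011"
counts = confusion_counts(observed, predicted)
# counts == {"TP": 6, "TN": 6, "FP": 7, "FN": 3}
```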

  22. Measuring Performance • Total accuracy: • Percentage of all correctly predicted interaction and not-interaction sites • Accuracy (Specificity): • The fraction of predicted interaction sites that are correct • Coverage (Sensitivity): • The fraction of true interaction sites that are predicted • Matthews correlation coefficient (MCC): • Takes into account both under- and over-predictions • Ranges between 1 (perfect prediction) and -1 (completely wrong prediction)
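
The transcript lost the formulas for these measures, so the sketch below uses the standard definitions that match the slide's wording: accuracy as TP / (TP + FP), coverage as TP / (TP + FN), and the usual Matthews correlation coefficient. Returning 0.0 when a denominator is zero is a convention chosen here, and the function names are hypothetical.

```python
import math


def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero (assumed convention)."""
    return numerator / denominator if denominator else 0.0


def performance(tp, tn, fp, fn):
    """Return total accuracy, accuracy, coverage, and MCC from the four counts."""
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        # Share of all sites, interaction or not, that were predicted correctly.
        "total_accuracy": safe_ratio(tp + tn, tp + tn + fp + fn),
        # Share of predicted interaction sites that are correct.
        "accuracy": safe_ratio(tp, tp + fp),
        # Share of true interaction sites that were predicted.
        "coverage": safe_ratio(tp, tp + fn),
        # Balanced measure in [-1, 1] that penalizes under- and over-prediction.
        "mcc": safe_ratio(tp * tn - fp * fn, mcc_denominator),
    }


# Counts tallied from the example strings on slide 21.
scores = performance(tp=6, tn=6, fp=7, fn=3)
```

With a class imbalance like the 21.7% interaction residues in this data set, total accuracy alone can look good for a predictor that rarely predicts interaction sites, which is why MCC is reported alongside it.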

  23. Receiver Operating Characteristic (ROC) Curve • [Figure: ROC curves labelled "Our method" and "ATGpr"]

  24. Experimental Results Adopted from Jeong and Miyano, 2006

  25. Experimental Results (cont’d) Adopted from Jeong and Miyano, 2006

  26. Experimental Results (cont’d) • [Figure labels: underpredicted interaction; overpredicted not-interaction] • Adopted from Jeong and Miyano, 2006

  27. References • E. Jeong, I-F. Chung, and S. Miyano, “Prediction of Residues in Protein-RNA Interaction Sites by Neural Networks,” Proc. of the 14th International Conference on Genome Informatics, pp. 506-507, 2003. • E. Jeong, I-F. Chung, and S. Miyano, “A Neural Network Method for Identification of RNA-Interacting Residues in Protein,” Proc. of the 4th International Workshop on Bioinformatics and Systems Biology, pp. 105-116, 2004. • E. Jeong and S. Miyano, “A weighted profile based method for protein-RNA interacting residue prediction,” Trans. on Comput. Syst. Biol., IV, LNBI 3939, pp. 123-139, 2006.

  28. Current Projects • To discover the relationship between protein sequence and protein structure • To identify RNA-interacting residues in proteins • To predict metal-binding residues in proteins • To predict phosphorylation sites • Microarray data analysis • Significant gene selection, clustering, classification • Prediction of polymorphic short tandem repeats

  29. Mini-Workshop: Knowledge Discovery Techniques for Bioinformatics Dr. Limsoon Wong

  30. Hierarchy of Protein Structure • 2nd structure prediction • 3rd structure prediction

  31. Protein Secondary Structures • [Figure labels: anti-parallel beta sheet, alpha helix, loop, parallel beta sheet]
