1 / 26

Accurate Identification of disordered protein residues using deep neural network

Accurate Identification of disordered protein residues using deep neural network. The 4 th Annual Conference on Computational Biology and Bioinformatics Speaker: Sumaiya Iqbal Author: Sumaiya Iqbal, Denson Smith, Md Tamjidul Hoque

kelvin
Download Presentation

Accurate Identification of disordered protein residues using deep neural network

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Accurate Identification of disordered protein residues using deep neural network The 4th Annual Conference on Computational Biology and Bioinformatics Speaker: Sumaiya Iqbal Author: Sumaiya Iqbal, Denson Smith, MdTamjidulHoque Computer Science, University of New Orleans, New Orleans, LA 70148 Email: {siqbal1, dsmith8, thoque}@uno.edu

  2. Introduction What is protein disorder? Significance of disordered protein identification Why do we need computational tool for disorder prediction? Our contribution

  3. What is Protein Disorder? • A fully functional protein is usually the one that is appropriately twisted and folded into a specific three dimensional conformation. • However, proteins can misfold, and can be unable to adopt well-defined, stable three dimensional (3D) structures in an isolated state and under different non native environments. • These proteins or partial regions of proteins are called intrinsically disordered proteins (IDPs) or disordered regions in proteins (IDRs)[1, 2] • The coordinates of their backbone atoms have no specific equilibriumstates, and thus adopt dynamic structural ensembles.

  4. Significance of Disordered Protein Identification • IDPs or, IDRs do not follow the well-known paradigm: • Disordered proteins have comparatively higher total surface for interaction with partners than those of the ordered (or, structured) proteins. • IDPs play essential biological functions via protein-protein, protein-nucleic acid and protein-ligand interactions: • Cell cycle control and cellular signal transduction • Transcriptional and translational regulation • Molecular assembly and protein modification etc.

  5. Significance of Disordered Protein Identification (contd.) • Binding regions within IDRs and IDPs become biologically active through disorder to structure transitions, known as induced folding [3 – 9]. • Such binding regions is associated with critical human diseases, including cancer, cardiovascular disease, amyloidoses, neurodegenerative diseases, and diabetes. • About 80% of the human proteinsfound in the available disorder databases, such as DisProt and IDEAL, contain at least one amyloidogenic region that are directly linked with diseases such as Parkinson’s diseases, Alzheimer’s diseases or, type II diabetes. • Thus, identifying protein disorder is essential and assists in effective drug development.

  6. Why Do We Need Computational Tool for Disorder Prediction? • IDPs are abundant in nature and increasing fast in numbers. • The experimental annotation (X-ray crystallography, NMR spectroscopy, ultra violet circular dichroism) of disordered residues is progressing slowly • Costly both in terms of time and money • The computational tools for disordered residue prediction using characteristics features and machine learning algorithms play an alternative and useful role in understanding the functions of protein disorder. • The well-known CASP competitions are biennially assessing their performances [10 – 12].

  7. Our Contribution • We propose a new disorder prediction tool, DisPredict3, from protein sequence alone that involves • Deep neural network (DNN) • A new structural stability measuring feature (PSEE) • An effective feature selection procedure (GFFS) • We investigate the interconnection of disorder and critical binding regionss. • We explore the structural characteristics of disordered regions

  8. Materials & method Dataset preparation Feature set preparation Predictor model development

  9. Dataset Preparation Database of Protein Disorder[14] Protein Data bank [13] (v 5.0) (v 5.1 – 6.02) Training Set Test Set Filter  Remove sequences having 25% similarity using Blastclust[21] Purify  Remove sequences with abnormal amino acids or, unknown annotation SL477 (Short-Long, 477 chains) DD73 (Disorder Dataset, 73 chains)

  10. Feature Set Preparation Position Specific Estimated Energy (1) State of stability of the residue in the 3D conformation computed using contact energy and relative exposure expressed as energy Terminal (1) Flexible terminal region residue indicator Accessible Surface Area (1+1) Solvent Accessible Area information, predicted using SPINE X [23] and REGAd3p[20] Monogram Bigram (1, 20) Conserved amino acid subsequence information computed using PSSM Secondary Structure (3+3) Three different local secondary structure (helix, beta and coil) like tendency, predicted using SPINE X[23] and MetaSSPred[19] Backbone Angle Fluctuations (1, 1) Backbone torsion angle flexibility information, predicted using Davar[22] Amino acid (1) Characterizes specific amino acid type Physical Properties (7) Steric parameter, Polarizability, Volume, Hydrophobicity, Isoelectric point, Helix and Sheet probability Position Specific Scoring Matrix (20) Sequence alignment based evolutionary information computed using PSI-BLAST[21]

  11. Feature Set Preparation (Contd.) 61 features candidate feature set Step 2: Feature selection using genetic forest feature selector GFFS Genetic Algorithm Extra Tree Classifier feature importance Step 3: Include neighboring residue information applying a sliding window of size 21 37 features … … 21 × 37 features

  12. Predictor Model Development Primary Protein Sequence AVEGK… 3 hidden layers 150 hidden nodes Exponential activation function Learning rate 0.1 DisPredict3 A D 0.961054 V D 0.866233 E O 0.477779 G O 0.372149 K O 0.277364 …

  13. RESULTS Performance measures State-of-the-art methods to compare Binary classification performance and comparisons Probability prediction performance and comparisons

  14. Performance measures Table 1: Name and definition of performance measuring parameters

  15. State-of-the-art Methods to Compare • MFDp [15] • Uses support vector machine with linear kernel and combined outputs of 3 predictors • SPINE-D [16] • Uses artificial neural network (2 hidden layers and 51 hidden nodes) • DisPredict [17] • Uses optimized support vector machine with radial basis function as kernel • DisPredict2 [18] • Uses optimized support vector machine with radial basis function as kernel • Uses optimized threshold and new feature selection method

  16. Binary Classification Performance and Comparisons The most useful measure for evaluation of binary predictor Table 2: Binary disorder classification performance comparison on DD73 dataset.

  17. Probability Prediction Performance and Comparisons Figure 1: Binary disorder classification performance comparison on DD73 dataset.

  18. Discussions Dynamic structural characteristics of protein disorder Case studies Future works and conclusions

  19. Structural Characteristics of Disordered Region (PSEE vs. Exposure) Energetically unstable area, preferred by disordered region. Our PSEE feature is useful. We see that disorder can show order-like characteristics ! 1st Quadrant (Disorder Dominated) High relative exposure Less negative energy Energetically stable area, preferred by ordered region. Our PSEE feature is useful. 3rd Quadrant (Order Dominated) Low relative exposure High negative energy Figure 2. Correlation between PSEE and relative exposure of ordered regions (blue circle) and disordered regions (red diamond). The vertical dashed line separates the average PSEE of all ordered and disordered region and the horizontal dash-dotted line separates the ordered and disordered regions with more and less than 25% exposure.

  20. Structural Characteristics of Disordered Region (PSEE vs. Coil tendency) Energetically unstable area, preferred by disordered region. Our PSEE feature is useful. We see that disorder can show order-like characteristics ! 1st Quadrant (Disorder Dominated) High coil probability Less negative energy Energetically stable area, preferred by ordered region. Our PSEE feature is useful. 3rd Quadrant (Order Dominated) Low coil probability High negative energy Figure 3. Correlation between PSEE and coil probability of ordered regions (blue circle) and disordered regions (red diamond). The vertical dashed line separates the average PSEE of all ordered and disordered region and the horizontal dash-dotted line separates the ordered and disordered regions with more and less than 50% coil tendency.

  21. Why Overlap?Noisy Annotation Human host defense cathelicidin LL-37 and its smallest antimicrobial peptide PDB 2KO6A DisProt DP0004_C002 Figure 4. Identical protein chain from PDB and DisProt, showing completely structured (ordered) state and unstructured (disordered) state from PDB and DisProt, respectively.

  22. Why Overlap?Case Study 1: disorder  order Human disordered proteins contain amyloidogenic regions that undergo disorder (without amyloid formation) to order (with amyloid formation) transitions. Beta – 2 – macroglobulin (Homo sapiens) Figure 5: Disorder probability plot for -2-microglobulin. Location: 21 – 119 Disorder probability Mean = 0.324 Standard deviation = 0.181

  23. Why Overlap? (contd.)Case Study 2: disorder  order Human disordered proteins contain amyloidogenic regions that undergo disorder (without amyloid formation) to order (with amyloid formation) transitions. Lysozyme – C (Homo sapiens) Figure 6: Disorder probability plot for Lysozyme C. Location: 19 – 148 Disorder probability Mean = 0.352 Standard deviation = 0.205

  24. Conclusions and Future Works • DisPredict3 gives competitive output in predicting disordered protein residues • The parameters of DNN can be tuned further to have better performance • The new feature (PSEE) and the feature selection steps of DisPredict3 gives two-fold benefits: • The feature space dimension becomes lower with the selected features • Relevant features characterizes the disorder better • It is possible to identify the critical binding regions within disordered regions that undergo disorder to structure transitions using disorder predictor • Thus it will be interesting to further investigate the usability of the tool for binding region or, induced folding region prediction.

  25. Acknowledgement • Bioinformatics and Machine Learning Lab • http://cs.uno.edu/~tamjid/ • This work is supported by the Louisiana Board of Regents (BoR) through the Board of Regents Support Fund, LEQSF (2013-16)-RD-A-19. We gratefully acknowledge BoR.

  26. THANK YOU

More Related