Accurate Identification of disordered protein residues using deep neural network

Accurate Identification of disordered protein residues using deep neural network The 4th Annual Conference on Computational Biology and Bioinformatics Speaker: Sumaiya Iqbal Author: Sumaiya Iqbal, Denson Smith, MdTamjidulHoque Computer Science, University of New Orleans, New Orleans, LA 70148 Email: {siqbal1, dsmith8, thoque}@uno.edu

Introduction What is protein disorder? Significance of disordered protein identification Why do we need computational tool for disorder prediction? Our contribution

What is Protein Disorder? • A fully functional protein is usually the one that is appropriately twisted and folded into a specific three dimensional conformation. • However, proteins can misfold, and can be unable to adopt well-defined, stable three dimensional (3D) structures in an isolated state and under different non native environments. • These proteins or partial regions of proteins are called intrinsically disordered proteins (IDPs) or disordered regions in proteins (IDRs)[1, 2] • The coordinates of their backbone atoms have no specific equilibriumstates, and thus adopt dynamic structural ensembles.

Significance of Disordered Protein Identification • IDPs or, IDRs do not follow the well-known paradigm: • Disordered proteins have comparatively higher total surface for interaction with partners than those of the ordered (or, structured) proteins. • IDPs play essential biological functions via protein-protein, protein-nucleic acid and protein-ligand interactions: • Cell cycle control and cellular signal transduction • Transcriptional and translational regulation • Molecular assembly and protein modification etc.

Significance of Disordered Protein Identification (contd.) • Binding regions within IDRs and IDPs become biologically active through disorder to structure transitions, known as induced folding [3 – 9]. • Such binding regions is associated with critical human diseases, including cancer, cardiovascular disease, amyloidoses, neurodegenerative diseases, and diabetes. • About 80% of the human proteinsfound in the available disorder databases, such as DisProt and IDEAL, contain at least one amyloidogenic region that are directly linked with diseases such as Parkinson’s diseases, Alzheimer’s diseases or, type II diabetes. • Thus, identifying protein disorder is essential and assists in effective drug development.

Why Do We Need Computational Tool for Disorder Prediction? • IDPs are abundant in nature and increasing fast in numbers. • The experimental annotation (X-ray crystallography, NMR spectroscopy, ultra violet circular dichroism) of disordered residues is progressing slowly • Costly both in terms of time and money • The computational tools for disordered residue prediction using characteristics features and machine learning algorithms play an alternative and useful role in understanding the functions of protein disorder. • The well-known CASP competitions are biennially assessing their performances [10 – 12].

Our Contribution • We propose a new disorder prediction tool, DisPredict3, from protein sequence alone that involves • Deep neural network (DNN) • A new structural stability measuring feature (PSEE) • An effective feature selection procedure (GFFS) • We investigate the interconnection of disorder and critical binding regionss. • We explore the structural characteristics of disordered regions

Materials & method Dataset preparation Feature set preparation Predictor model development

Dataset Preparation Database of Protein Disorder[14] Protein Data bank [13] (v 5.0) (v 5.1 – 6.02) Training Set Test Set Filter  Remove sequences having 25% similarity using Blastclust[21] Purify  Remove sequences with abnormal amino acids or, unknown annotation SL477 (Short-Long, 477 chains) DD73 (Disorder Dataset, 73 chains)

Feature Set Preparation Position Specific Estimated Energy (1) State of stability of the residue in the 3D conformation computed using contact energy and relative exposure expressed as energy Terminal (1) Flexible terminal region residue indicator Accessible Surface Area (1+1) Solvent Accessible Area information, predicted using SPINE X [23] and REGAd3p[20] Monogram Bigram (1, 20) Conserved amino acid subsequence information computed using PSSM Secondary Structure (3+3) Three different local secondary structure (helix, beta and coil) like tendency, predicted using SPINE X[23] and MetaSSPred[19] Backbone Angle Fluctuations (1, 1) Backbone torsion angle flexibility information, predicted using Davar[22] Amino acid (1) Characterizes specific amino acid type Physical Properties (7) Steric parameter, Polarizability, Volume, Hydrophobicity, Isoelectric point, Helix and Sheet probability Position Specific Scoring Matrix (20) Sequence alignment based evolutionary information computed using PSI-BLAST[21]

Feature Set Preparation (Contd.) 61 features candidate feature set Step 2: Feature selection using genetic forest feature selector GFFS Genetic Algorithm Extra Tree Classifier feature importance Step 3: Include neighboring residue information applying a sliding window of size 21 37 features … … 21 × 37 features

Predictor Model Development Primary Protein Sequence AVEGK… 3 hidden layers 150 hidden nodes Exponential activation function Learning rate 0.1 DisPredict3 A D 0.961054 V D 0.866233 E O 0.477779 G O 0.372149 K O 0.277364 …

RESULTS Performance measures State-of-the-art methods to compare Binary classification performance and comparisons Probability prediction performance and comparisons

Performance measures Table 1: Name and definition of performance measuring parameters

State-of-the-art Methods to Compare • MFDp [15] • Uses support vector machine with linear kernel and combined outputs of 3 predictors • SPINE-D [16] • Uses artificial neural network (2 hidden layers and 51 hidden nodes) • DisPredict [17] • Uses optimized support vector machine with radial basis function as kernel • DisPredict2 [18] • Uses optimized support vector machine with radial basis function as kernel • Uses optimized threshold and new feature selection method

Binary Classification Performance and Comparisons The most useful measure for evaluation of binary predictor Table 2: Binary disorder classification performance comparison on DD73 dataset.

Probability Prediction Performance and Comparisons Figure 1: Binary disorder classification performance comparison on DD73 dataset.

Discussions Dynamic structural characteristics of protein disorder Case studies Future works and conclusions

Structural Characteristics of Disordered Region (PSEE vs. Exposure) Energetically unstable area, preferred by disordered region. Our PSEE feature is useful. We see that disorder can show order-like characteristics ! 1st Quadrant (Disorder Dominated) High relative exposure Less negative energy Energetically stable area, preferred by ordered region. Our PSEE feature is useful. 3rd Quadrant (Order Dominated) Low relative exposure High negative energy Figure 2. Correlation between PSEE and relative exposure of ordered regions (blue circle) and disordered regions (red diamond). The vertical dashed line separates the average PSEE of all ordered and disordered region and the horizontal dash-dotted line separates the ordered and disordered regions with more and less than 25% exposure.

Structural Characteristics of Disordered Region (PSEE vs. Coil tendency) Energetically unstable area, preferred by disordered region. Our PSEE feature is useful. We see that disorder can show order-like characteristics ! 1st Quadrant (Disorder Dominated) High coil probability Less negative energy Energetically stable area, preferred by ordered region. Our PSEE feature is useful. 3rd Quadrant (Order Dominated) Low coil probability High negative energy Figure 3. Correlation between PSEE and coil probability of ordered regions (blue circle) and disordered regions (red diamond). The vertical dashed line separates the average PSEE of all ordered and disordered region and the horizontal dash-dotted line separates the ordered and disordered regions with more and less than 50% coil tendency.

Why Overlap?Noisy Annotation Human host defense cathelicidin LL-37 and its smallest antimicrobial peptide PDB 2KO6A DisProt DP0004_C002 Figure 4. Identical protein chain from PDB and DisProt, showing completely structured (ordered) state and unstructured (disordered) state from PDB and DisProt, respectively.

Why Overlap?Case Study 1: disorder  order Human disordered proteins contain amyloidogenic regions that undergo disorder (without amyloid formation) to order (with amyloid formation) transitions. Beta – 2 – macroglobulin (Homo sapiens) Figure 5: Disorder probability plot for -2-microglobulin. Location: 21 – 119 Disorder probability Mean = 0.324 Standard deviation = 0.181

Why Overlap? (contd.)Case Study 2: disorder  order Human disordered proteins contain amyloidogenic regions that undergo disorder (without amyloid formation) to order (with amyloid formation) transitions. Lysozyme – C (Homo sapiens) Figure 6: Disorder probability plot for Lysozyme C. Location: 19 – 148 Disorder probability Mean = 0.352 Standard deviation = 0.205

Conclusions and Future Works • DisPredict3 gives competitive output in predicting disordered protein residues • The parameters of DNN can be tuned further to have better performance • The new feature (PSEE) and the feature selection steps of DisPredict3 gives two-fold benefits: • The feature space dimension becomes lower with the selected features • Relevant features characterizes the disorder better • It is possible to identify the critical binding regions within disordered regions that undergo disorder to structure transitions using disorder predictor • Thus it will be interesting to further investigate the usability of the tool for binding region or, induced folding region prediction.

Acknowledgement • Bioinformatics and Machine Learning Lab • http://cs.uno.edu/~tamjid/ • This work is supported by the Louisiana Board of Regents (BoR) through the Board of Regents Support Fund, LEQSF (2013-16)-RD-A-19. We gratefully acknowledge BoR.

THANK YOU

Accurate Identification of disordered protein residues using deep neural network

Accurate Identification of disordered protein residues using deep neural network

Presentation Transcript

Protein Identification

Using Neural Networks for remote OS Identification

Estimation of Oil Saturation Using Neural Network

IDENTIFICATION OF CARDIAC ARRHYTHMIA USING ARTIFICIAL NEURAL NETWORK: A PRACTICAL APPROACH

Identification of protein-protein binding motifs

Neural Network Training Using MATLAB

MUSICAL SCALE IDENTIFICATION USING NEURAL NETWORKS

Identification of protein homology using domain architecture

Identification of Protein Domains

Using Neural Networks for remote OS Identification

Spam Image Identification Using an Artificial Neural Network

Using Neural Networks for remote OS Identification

Scheduling problems using Neural network

Deep Neural Network Language Models

Face Detection Using Neural Network

Protein Identification Using Tandem Mass Spectrometry

Accurate Identification

Automatic Detection of ADHD subjects using Deep Convolutional Neural Network

Accurate, Scalable In-Network Identification of P2P Traffic Using Application Signatures

protein identification

Deep Learning and Neural Network

Deep Neural Network and Its Elements