Protein Domain Boundary Prediction

Which model is best?
Paul Yoo
Advanced Networks Research Group, School of Information Technologies, The University of Sydney

What Is a Protein Domain?
Domains can be seen as distinct functional and/or structural units of a protein. A domain is an independent folding unit of a polypeptide chain that also carries a specific function. Domains are often identified as recurring (sequence or structure) units, which may exist in various contexts.
Figure: 1IGR (PDB: 1IGR), the first three domains of the protein, 1998. Domain 1: L domain (magenta); Domain 2: growth factor receptor domain (brown); Domain 3: L domain (green).

Introduction
• Domains provide some of the most valuable information for predicting protein structure, function, evolution and design.
• Since Anfinsen’s (1973) seminal work, many models have been proposed for predicting structure from the amino acid sequence alone.
• This study:
  - Provides an overview of the modeling methods for protein domain boundary prediction.
  - Proposes a new semi-parametric model that can outperform the existing models.

Motivation
• Accurate prediction of domain boundaries forms a basis for many types of protein research.
  - New proteins such as chimeric proteins can be created, as they are composed of multifunctional domains (Suyama & Ohara, 2003).
  - The template search used in comparative modeling can be optimized by delineating domain boundaries (Contreras-Moreira & Bates, 2002).
  - Threading methods can improve their performance through domain boundary prediction, which enhances the signal-to-noise ratio (Wheelan et al., 2000).
  - Accurate identification of domain boundaries for homologous domains plays a key role in reliable multiple sequence alignment (Gracy & Argos, 1998).

Problem Statement
• Limitations of experimental tools
  - X-ray crystallography
  - Nuclear Magnetic Resonance (NMR)
  - Costly, time-consuming, laborious and inefficient
• 3D coordinates to 1D amino acids
  - Assumption: a domain has relatively more contacts within itself than with residues in the remainder of the structure.
• High dimensionality of protein data
• Bias and variance dilemma of ANN models
• Long-range dependencies
• Multi-domain benchmark dataset

Outline
• High Dimensionality
• Bias and Variance Dilemma
• Introduction to Improved General Regression Network
• Experiment 1: ML Models on Benchmark_2 Dataset
• Experiment 2: Domain Predictors on CASP7 Dataset
• Long-Range Information
• Future Work

High Dimensionality
• High dimensionality of protein sequential data
  - A window of 10 amino acids represents a search space of 20^10 possibilities and requires a network with 200 inputs (20 per residue).
• Learning in high-dimensional space
  - Training a large network requires a large dataset of known examples.
  - Computational complexity
  - Overfitting problem
• Performance of ANNs depends on their input data:
  - Fewer inputs mean fewer weights to adjust, which gives better generalization and faster training.
  - Beyond a certain point, adding new features can actually reduce the performance of the classification system.
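The input count quoted above can be reproduced with a minimal sketch, assuming a one-hot encoding of the 20 standard amino acids (the slides do not name the encoding, so this is an assumption). The window string and the helper name `one_hot_window` are hypothetical.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_window(window: str) -> np.ndarray:
    """Encode a residue window as a flat one-hot vector of length 20 * len(window)."""
    encoded = np.zeros((len(window), len(AMINO_ACIDS)))
    for pos, aa in enumerate(window):
        encoded[pos, AA_INDEX[aa]] = 1.0
    return encoded.ravel()

window = "MKTAYIAKQR"                    # hypothetical 10-residue window
vec = one_hot_window(window)
print(vec.shape)                         # (200,)
print(len(AMINO_ACIDS) ** len(window))   # 20**10 = 10240000000000
```

Each additional residue in the window adds 20 inputs under this encoding, which is why the window size directly drives the number of weights to train.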

Bias and Variance Dilemma
• Bias: measures the extent to which the estimation function differs from the true function.
• Variance: measures the sensitivity of the estimation function to the data sample.
• Parametric models tend to have high bias but low variance (underfitting).
• Non-parametric models tend to have low bias but high variance (overfitting).
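The two quantities above appear together in the standard decomposition of expected squared error. The notation is an assumption, since the slides define none: the target is modeled as y = f(x) + ε with zero-mean noise of variance σ², f̂ is the estimator, and the expectation is taken over training samples and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```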

Bias and Variance Dilemma cont.
• Bias and variance tradeoff
  - It is desirable to have both low bias and low variance, but the two goals conflict.
  - Variance can be reduced only at the cost of increased bias.
  - A good tradeoff between bias and variance (between the states of underfitting and overfitting) is needed.
• Semi-parametric models have been shown theoretically to achieve a good tradeoff.

A New Semi-Parametric Model
• Semi-parametric (SP) modeling
  - SP models make assumptions that are stronger than those of non-parametric models but less restrictive than those of parametric models.
  - They avoid the most serious practical disadvantages of non-parametric methods, but at the price of an increased risk of specification error.
• Improved General Regression Network (IGRN)
  - Finds the optimal trade-off between parametric and non-parametric models.
  - Low learning bias and low generalization variance.
• New decision function for IGRN: [equation not preserved in this transcript]

A New Semi-Parametric Model cont.
• Reduced computation
  - In the GRNN (General Regression Neural Network) equation, every training data pair is incorporated into the architecture.
  - In IGRN, each local region of the input space is represented by a centre vector.
• Semi-parametric approximation
  - GRNN uses a spherical kernel function as a radial basis function (non-parametric).
  - IGRN depends more on the Gaussian radial basis function (semi-parametric).
• Applicability of the boosting method
  - SP models have a high specification error.
  - Boosting combines base learners to find a better fit for the training set by maintaining a set of weights over the training samples.
  - No parameter tuning
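The "reduced computation" point can be illustrated with a minimal sketch. It implements the standard GRNN estimate (a Gaussian-kernel weighted average of training targets) and then evaluates the same estimator on a few centre vectors. This is not the IGRN decision function, which the transcript does not preserve. The toy data, the hand-picked centres, and the value of `sigma` are all assumptions for illustration.

```python
import numpy as np

def grnn_predict(x, train_x, train_y, sigma=1.0):
    """Standard GRNN estimate: Gaussian-kernel weighted average of the targets."""
    sq_dist = np.sum((train_x - x) ** 2, axis=1)      # squared distance to each pattern
    weights = np.exp(-sq_dist / (2.0 * sigma ** 2))   # one kernel unit per pattern
    return float(np.dot(weights, train_y) / np.sum(weights))

# Toy 1-D training set (made up): two local regions with targets 0 and 1.
train_x = np.array([[0.0], [0.2], [1.0], [1.2]])
train_y = np.array([0.0, 0.0, 1.0, 1.0])

# Hypothetical centre vectors, one per local region, with the mean target of each.
centres = np.array([[0.1], [1.1]])
centre_targets = np.array([0.0, 1.0])

query = np.array([0.6])
full = grnn_predict(query, train_x, train_y, sigma=0.3)         # uses 4 kernel units
reduced = grnn_predict(query, centres, centre_targets, sigma=0.3)  # uses 2 kernel units
print(full, reduced)
```

The full network needs one kernel unit per training pair, while the centre-based variant needs one per local region, so evaluation cost scales with the number of centres rather than the size of the training set.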

Experiment 1: ML Models on Benchmark_2
• New multi-domain benchmark dataset (Benchmark_2)
  - Contains proteins of known structure for which three methods (CATH (Pearl et al., 2000), SCOP (Andreeva et al., 2004) and literature) agree on the assignment of the number of domains.
  - Comprises 315 polypeptide chains
  - Non-redundant: each combination of topologies occurs only once
  - All sequences are taken from the Protein Data Bank (PDB)
• Pre-processing
  - Position Specific Scoring Matrix (PSSM) using PSI-BLAST
  - Secondary Structure (SSpro, Pollastri et al., 2002)
  - Solvent Accessibility (ACCpro, Pollastri et al., 2002)
  - Domain Linker Index (DomCut, Suyama & Ohara, 2003)
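The per-residue features listed above have to be turned into fixed-length inputs before a network can use them. A minimal sketch of one common way to do that follows, assuming a sliding window centred on each residue with zero padding at the chain termini. The slides do not describe the exact windowing, so the function name `window_features`, the padding choice, and the toy profile are hypothetical.

```python
import numpy as np

def window_features(profile: np.ndarray, window: int) -> np.ndarray:
    """Return one flattened feature window per residue.

    profile: array of shape (n_residues, n_features), e.g. PSSM columns
             concatenated with other per-residue features.
    window:  odd window size, centred on the residue being described.
    """
    if window % 2 == 0:
        raise ValueError("window must be odd")
    half = window // 2
    # Zero rows are added at both termini so that every residue gets a full window.
    padded = np.pad(profile, ((half, half), (0, 0)))
    return np.stack([padded[i:i + window].ravel() for i in range(profile.shape[0])])

profile = np.arange(12, dtype=float).reshape(4, 3)  # toy chain: 4 residues, 3 features
X = window_features(profile, window=3)
print(X.shape)  # (4, 9)
```

The input dimension is `window * n_features`, so widening the window trades more sequence context against the dimensionality concerns raised earlier in the talk.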

Experiment 1 cont.
• Table: comparison of prediction accuracy and generalization variance for different window sizes.
• IGRN shows higher learning bias than its original model (GRNN) but low generalization variance.
• IGRN II achieves both low learning bias and low generalization variance.

Experiment 2: Domain Predictors on CASP7
• CASP7 benchmark dataset
  - The most widely known benchmark dataset
  - Comprises 94 polypeptide chains
• Different structural information
  - DOMpro: PSSM, secondary structure, and solvent accessibility
  - DomPred: homology (PSSM) and fold recognition (secondary structure)
  - DomSSEA: secondary structure
  - DomainDiscovery, IGRN and ML models: PSSM, secondary structure, solvent accessibility, and domain linker index

Experiment 2 cont.
• Predictive performance comparison on CASP7
• IGRN II achieved better predictive performance than the existing domain boundary predictors on CASP7.
• The structural information used by IGRN and the other ML models was more useful than the information used by the other predictors.

Experiment 2 cont.
• Figure: comparison of prediction scores produced by IGRN and GRNN on a protein chain, CASP7 target T0318 (PDB code: 2HB6).
• The protein chain has two domains, and the domain boundary is at residue 155.
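A per-residue score profile like the one compared above has to be reduced to a boundary position before it can be checked against the known boundary. The sketch below takes the highest-scoring residue as the prediction and counts it as correct if it falls within a tolerance of the true boundary. The ±20 residue tolerance, the chain length of 300, and the synthetic score profile are assumptions for illustration, not values or results from the talk.

```python
import numpy as np

def predicted_boundary(scores: np.ndarray) -> int:
    """Return the 1-based residue position with the highest boundary score."""
    return int(np.argmax(scores)) + 1

def is_hit(predicted: int, true_boundary: int, tolerance: int = 20) -> bool:
    """Count a prediction as correct if it lies within +/- tolerance residues."""
    return abs(predicted - true_boundary) <= tolerance

# Synthetic score profile (made up): a smooth peak centred on residue 150.
residues = np.arange(1, 301)  # hypothetical chain length of 300
scores = np.exp(-((residues - 150) ** 2) / (2 * 10.0 ** 2))

pred = predicted_boundary(scores)
print(pred, is_hit(pred, true_boundary=155))  # 150 True
```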

Long-Range Information
• The most notable breakthrough: the exploitation of evolutionary information (Rost and Sander 1993).
  - A machine learning based prediction method (ANNs) applied to a profile compiled from multiple sequence alignments.
  - Increased the prediction accuracy by 6 to 8 percentage points.
  - Overall three-state accuracy of 70.8% for globular proteins.
• Long-range interactions also play a key role.
  - β-sheet regions involve long-range interactions between amino acids.
  - Thus, prediction accuracy for β-strands is lower than that for α-helices or coil.
• Accurate prediction of β-sheets is useful for a variety of biological problems:
  - tertiary structure prediction,
  - elucidating folding pathways,
  - and designing new proteins.
• The problem:
  - β-sheet formation is a tertiary structure interaction that brings two or more strands together through hydrogen bonds, and these strands can be situated far apart in the amino acid sequence.

Future Work
• Further improve the new semi-parametric model so that it captures long-range information efficiently.
• At the same time, find or develop a new encoding scheme or profiles that contain more structural information.

Thank you!
