Machine Learning as Applied to Structural Bioinformatics: Results and Challenges

Philip E. Bourne

University of California San Diego

DIMACS - Machine Learning in Bioinformatics


The Current Situation

  • Structure contributes greatly to our understanding of living systems

  • We are locked into thinking about structure in specific ways which limits our view

    • All too often we consider structure as a static entity

    • The view at left is not how another protein or a small molecule ligand sees PKA

  • We are still not very good at certain problems …



Example Unsolved Problems that Machine Learning Can Address

  • Predicting flexibility and disorder in protein structure

  • Predicting sites of protein-protein and protein-ligand interaction

  • Predicting protein function

  • Defining domain boundaries from sequence

  • Predicting secondary, tertiary and quaternary structure

  • Predicting what will crystallize

* Will talk about this

* Will offer as a challenge



The Current Situation: The Potential “Training Set” is Growing Quickly

  • High level of redundancy as measured by sequence or structure

  • Structure space is clearly very finite, but not clear how much is covered

  • Increase in functionally uncharacterized structures

  • Complexity is increasing, but still lack complexes

  • Structures predominantly 1 and 2 domains

  • Lack membrane proteins

  • In summary, the training set is still not truly representative, but structural genomics will improve this situation



Predicting Functional Flexibility

Jenny Gu

Gu, Gribskov & Bourne PLoS Computational Biology 2006 Early On-line Release



If we believe that the 3-dimensional structure of a protein is defined by its 1-dimensional sequence then why not its flexibility?

[Figure: spectrum of protein order and disorder, from ordered structures to disordered structures]



Bridging the Sequence-flexibility Gap

Generalize the sequence-flexibility relationship to identify local protein regions important for allostery



The Training Dataset

The dataset has the following properties:

  • Non-redundant sequences

    • Training-set sequences share ≤ 10% identity

  • Good quality structures

    • R-factor < 0.30

  • High resolution

    • Resolution < 2.0 Å

Total number of proteins in the dataset: 1277 sequences



Obtaining Protein Dynamic Information

Protein structures treated as a 3-D elastic network.

Bahar, I., A.R. Atilgan, and B. Erman. Direct evaluation of thermal fluctuations in proteins using a single-parameter harmonic potential. Folding & Design, 1997. 2(3): p. 173-181.



Defining the Target Features

Gaussian Network Model:

  • Models protein structure as a 3-D elastic network.

    • Each Cα is a node in the network.

    • Each node undergoes Gaussian-distributed fluctuations influenced by neighboring interactions within a given cutoff distance (7 Å).

  • Decompose protein fluctuation into a summation of different modes.
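
The Gaussian Network Model outlined above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used for the talk: the function name `gnm_fluctuations`, the uniform spring constant (fluctuations are returned only up to a constant factor) and the toy five-residue chain are assumptions; only the Cα nodes and the 7 Å cutoff come from the slide.

```python
import numpy as np

def gnm_fluctuations(ca_coords, cutoff=7.0):
    """Return relative mean-square fluctuations and the residue cross-correlation matrix."""
    coords = np.asarray(ca_coords, dtype=float)
    n = len(coords)
    # Pairwise C-alpha distances.
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Kirchhoff (connectivity) matrix: -1 for each contact, contact count on the diagonal.
    contact = (dist <= cutoff) & ~np.eye(n, dtype=bool)
    kirchhoff = -contact.astype(float)
    np.fill_diagonal(kirchhoff, contact.sum(axis=1))
    # Decompose into modes and drop the zero mode(s), which carry no internal motion.
    eigvals, eigvecs = np.linalg.eigh(kirchhoff)
    keep = eigvals > 1e-8
    # Sum over modes: mode k contributes u_k u_k^T / lambda_k (pseudo-inverse).
    inv = (eigvecs[:, keep] / eigvals[keep]) @ eigvecs[:, keep].T
    fluct = np.diag(inv).copy()
    corr = inv / np.sqrt(np.outer(fluct, fluct))
    return fluct, corr

# Toy example (hypothetical): five C-alpha atoms on a line, 3.8 Å apart,
# so only sequence neighbours fall within the cutoff.
coords = [[3.8 * i, 0.0, 0.0] for i in range(5)]
fluct, corr = gnm_fluctuations(coords)
```

In this toy chain the two ends fluctuate more than the middle residue, which is the qualitative behaviour expected of loosely connected termini. Slow (small-eigenvalue) modes dominate the sum, which is why the decomposition into modes is useful.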



Side Note: Gaussian Network Model vs Molecular Dynamics

  • GNM is relatively coarse grained

  • GNM is fast to compute compared with MD

    • Look over larger time scales

    • Suitable for high throughput



Functional Flexibility Score

  • Utilize correlated movements to help define regional flexibility with functional importance.

Functionally Flexible Score, for each residue:

  • Find the maximum and minimum correlation

  • Use these to scale the normalized fluctuation to determine functional importance
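
The slide gives the recipe only in outline, so the following is a sketch under stated assumptions rather than the published score. It assumes min-max normalization of the fluctuations and uses half the range between each residue's maximum and minimum correlation as the scaling factor; both choices, and the name `functional_flexibility_sketch`, are hypothetical.

```python
import numpy as np

def functional_flexibility_sketch(fluct, corr):
    """Scale normalized fluctuations by each residue's correlation range (assumed form)."""
    fluct = np.asarray(fluct, dtype=float)
    corr = np.asarray(corr, dtype=float)
    # ASSUMPTION: min-max normalization of fluctuations to [0, 1].
    span = fluct.max() - fluct.min()
    norm = (fluct - fluct.min()) / span if span > 0 else np.zeros_like(fluct)
    # Ignore self-correlation when looking for the extremes.
    off_diag = corr.copy()
    np.fill_diagonal(off_diag, np.nan)
    c_max = np.nanmax(off_diag, axis=1)
    c_min = np.nanmin(off_diag, axis=1)
    # ASSUMPTION: half the correlation range, which lies in [0, 1].
    return norm * (c_max - c_min) / 2.0

# Made-up three-residue input for illustration only.
example_fluct = [1.0, 2.0, 3.0]
example_corr = [[1.0, 0.5, -0.5],
                [0.5, 1.0, 0.2],
                [-0.5, 0.2, 1.0]]
scores = functional_flexibility_sketch(example_fluct, example_corr)
```

The intent is that a residue scores highly only if it both moves a lot and moves in a coordinated way (strongly correlated with some residues and anti-correlated with others).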



Example: Identifying Functionally Flexible Regions (FFR) in HIV Protease

[Figure: correlated modes (yellow) and anti-correlated modes (blue); normalized scores for a single chain]

Gu, Gribskov & Bourne, PLoS Comp. Biol. 2006 Early Release


Identifying Regions in Bovine Pancreatic Trypsin Inhibitor and Calmodulin



How to Represent the Protein Sequence?

  • Residues are classified as FFR or not: approx. 20% of residues are in FFRs, with FFR lengths typically 9 ± 11

  • The longer the protein the longer the FFR

  • We use hidden Markov models to represent each protein sequence in the training dataset.

  • Hidden Markov models capture evolutionary information along with the probability of finding one of the 20 amino acids in each position of the sequence.

  • Use probability states as input features in the first layer of an architecture containing two SVM layers.
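
The per-residue input to the first SVM layer can be pictured as a sliding window over the profile. The sketch below assumes a profile matrix with 29 columns per position and a centred window of 9 positions, matching the "9*29 features" quoted for Wiggle; the zero padding at the termini, the function name `window_features` and the random-looking toy profile are assumptions made for illustration.

```python
import numpy as np

def window_features(profile, window=9):
    """Stack a centred window of profile rows into one feature vector per residue."""
    profile = np.asarray(profile, dtype=float)
    n_res, n_cols = profile.shape
    half = window // 2  # assumes an odd window size
    # ASSUMPTION: positions beyond the termini are padded with zeros.
    pad = np.zeros((half, n_cols))
    padded = np.vstack([pad, profile, pad])
    return np.stack([padded[i:i + window].ravel() for i in range(n_res)])

# Hypothetical 12-residue profile with 29 values per position.
toy_profile = np.arange(12 * 29, dtype=float).reshape(12, 29)
features = window_features(toy_profile)  # one row of 9 * 29 = 261 values per residue
```

Each row can then be passed to a classifier together with the FFR / non-FFR label of the central residue.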



Architecture of Wiggle

[Diagram: two-stage architecture. One stage captures evolutionary effects; the other captures local effects (smoothing). 9*29 features are used for each residue.]



Generating Additional Input Features: Modified Bootstrapping for Tripeptides (Accounts for Nearest-Neighbor Effects)

[Flow diagram: pooled patterns (window size: 3) are sampled with replacement (199515 times and 44645 times) to build null models* for FFR regions and for non-FFR regions. A Z score and P value are calculated for each pattern with the respective null models.]

* Generate 10,000 null models
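
The bootstrapping idea can be sketched with the standard library alone. This is an illustrative reduction, not the procedure from the paper: the function names, the empirical one-sided P value and the tiny pool are assumptions, and far fewer null models are drawn here than the 10,000 quoted on the slide.

```python
import random
import statistics
from collections import Counter

def tripeptides(sequence):
    """All overlapping patterns of window size 3 in a sequence."""
    return [sequence[i:i + 3] for i in range(len(sequence) - 2)]

def pattern_scores(observed, pool, n_draws, n_models=200, seed=0):
    """Z score and empirical P value of each observed pattern against a bootstrap null."""
    rng = random.Random(seed)
    obs_counts = Counter(observed)
    null_counts = {pattern: [] for pattern in obs_counts}
    for _ in range(n_models):
        # One null model: sample the pooled patterns with replacement.
        sample = Counter(rng.choices(pool, k=n_draws))
        for pattern in obs_counts:
            null_counts[pattern].append(sample[pattern])
    scores = {}
    for pattern, counts in null_counts.items():
        mean = statistics.mean(counts)
        sd = statistics.pstdev(counts)
        z = (obs_counts[pattern] - mean) / sd if sd > 0 else 0.0
        # ASSUMPTION: one-sided empirical P value with a +1 correction.
        exceed = sum(c >= obs_counts[pattern] for c in counts)
        scores[pattern] = (z, (exceed + 1) / (n_models + 1))
    return scores

# Made-up data: "AAA" is rare in the pool but common in the region of interest.
pool = ["AAA"] * 10 + ["CCC"] * 90
region = ["AAA"] * 8 + ["CCC"] * 2
scores = pattern_scores(region, pool, n_draws=len(region))
```

Patterns with a large positive Z score against one null model and not the other are candidates for additional input features.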



Predictors Trained on the Entire Dataset Perform Poorly on Smaller Proteins.

[Figure panels: False Positive; False Negative]

The characteristics of small proteins are different, e.g. the percentage of complexes



Partition Training Set Based on Sequence Length

[Figure panels: >200 AA long; <200 AA long]

  • Prediction performance of the SVM trained on a partitioned dataset (solid lines) is compared with that of the SVM trained on the entire dataset (dashed line).

  • Prediction quality improves when the dataset is partitioned, most notably for proteins up to 200 amino acid residues long. Slight improvements are observed for proteins longer than 200 residues.



Performance of Wiggle Predictors

  • Wiggle: accuracy 66.01%, precision 37.11%, recall 70.49%

  • Wiggle 200: accuracy 76.46%, precision 48.99%, recall 78.27%
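
For reference, the three figures quoted above are related to the confusion matrix as follows. The counts in the example are invented to show how a minority positive class (roughly 20% of residues) can give high recall with much lower precision; they are not the Wiggle results.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision and recall from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Hypothetical counts: 100 true FFR residues among 500 residues in total.
metrics = classification_metrics(tp=70, fp=120, tn=280, fn=30)
```

With these made-up counts recall is 70% while precision is under 40%, because false positives are drawn from a much larger pool of negative residues.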



Case Study: PvuII Endonuclease

(homodimer for DNA-specific cleavage)

  • Identify known loop for minor groove recognition

  • Identify hinge residues not previously seen

  • Important result for mutagenesis studies

[Figure panels: FF score; Wiggle 200]



Conclusions for Wiggle

  • FFRs can be measured from structure

  • With some empirical effort these data can be used as input to an SVM to predict FFRs from sequence alone

  • Useful for:

    • Improving docking studies

    • Understanding protein function better

    • Engineering more or less stable proteins

    • …

Gu, Gribskov & Bourne, PLoS Comp. Biol. 2006 Early Release



Exploiting Sequence and Structure Homologs to Identify Protein-Protein Binding Sites

JoLan Chung

Chung, Wang & Bourne 2006 Proteins: Structure, Function and Bioinformatics, 62(3) 630-640



Methods to Identify Protein-protein Binding Sites

  • Docking

  • Threading and homology modeling

  • Evolutionary tracing

  • Correlated mutations

  • Properties of patches

  • Hydrophobicity

  • Neural networks and support vector machines (SVM)



Structurally Conserved Surface Residues?

  • None of the above methods considers the residues that are spatially conserved on the surfaces of structure homologs

  • These residues are reported to correspond to the energy hot spots on protein interfaces and can be derived from multiple structure alignments



Method: Incorporate Structural Conservation to Predict the Interface Residue Using SVM

Sequence + structure information → Support vector machine → Binding site location



Derive the Structurally Conserved Residues

  • The structural conservation scores were derived from multiple structural alignments and weighted by the normalized B-factors, to account for structural flexibility that would otherwise result in a bad alignment (FFRs could be used in the future)

  • Each position in the alignment has a structural conservation score, which represents the conservation in 3D space

  • A position has a high conservation score if the aligned residues are spatially conserved



Structurally Conserved Residues and Interface Residues

E.g. Residues with the top 20% of structure conservation scores (red) mapped to adrenodoxin (Adx, PDB code 1E6E:B) and known to bind adrenodoxin reductase (AR, blue).



Training Dataset

  • 274 non-redundant chains of heterocomplexes (<30% sequence identity) extracted from the PDB

  • Each of these chains was accompanied by a structure alignment with at least 4 members



SVM Training

A surface residue → Sequence profile + ASA + structural conservation score in a window of 13 residues (the residue to be predicted and its 12 spatially nearest surface residues) → Support vector machine classifier → Interface or non-interface residue?



SVM Training

  • Each residue was encoded as a feature vector with 13×21 dimensions: (the surface residue to be predicted + 12 nearest neighbors) × (20 amino acids + accessible surface area)

  • Implemented using SVMlight with the radial basis function as the kernel (γ = 0.01, regularization parameter C = 10)

  • A set of non-interface surface residues was randomly selected to make the ratio of positive and negative data 1:1

  • 3-fold cross-validation was performed
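
The encoding and class balancing described above can be sketched as follows. This is an assumption-laden illustration, not the original pipeline: the function names, the use of one coordinate per surface residue, and the random toy data are hypothetical; the 12 nearest neighbors, the 21 values per residue and the 1:1 ratio come from the slide.

```python
import numpy as np

def neighbor_features(coords, profile, asa, n_neighbors=12):
    """Concatenate each residue's 21 values with those of its spatially nearest neighbors."""
    coords = np.asarray(coords, dtype=float)
    # 20 profile columns + accessible surface area = 21 values per residue.
    per_residue = np.hstack([np.asarray(profile, dtype=float),
                             np.asarray(asa, dtype=float)[:, None]])
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    # Sorting by distance puts the residue itself first (distance 0),
    # assuming no two residues share the same coordinates.
    order = np.argsort(dist, axis=1, kind="stable")[:, :n_neighbors + 1]
    return per_residue[order].reshape(len(coords), -1)

def balance_classes(labels, seed=0):
    """Indices of all positives plus an equal-sized random subset of negatives."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    kept_neg = rng.choice(neg, size=min(len(pos), len(neg)), replace=False)
    return np.sort(np.concatenate([pos, kept_neg]))

# Toy data (hypothetical): 30 surface residues with random coordinates and features.
rng = np.random.default_rng(1)
toy_coords = rng.random((30, 3)) * 20.0
toy_profile = rng.random((30, 20))
toy_asa = rng.random(30)
toy_labels = np.array([1] * 5 + [0] * 25)

X = neighbor_features(toy_coords, toy_profile, toy_asa)  # 13 * 21 = 273 columns
train_idx = balance_classes(toy_labels)
```

Using spatial rather than sequence neighbors means a window can bring together residues that are far apart in sequence but adjacent on the protein surface.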



The Performance of Various Predictors

  • Predictor 1: Sequence profile + ASA

  • Predictor 2: Sequence profile + ASA + structural conservation score

  • Predictor 3: Sequence profile + ASA + raw structural conservation score, not weighted by the normalized B-factor

  • Predictor 4: Sequence profile + ASA + normalized B-factor



The Performances of the Predictors

Precise prediction: at least 70% of the interface residues were identified

Correct prediction: at least 50% of the interface residues were identified

Partial prediction: some, but less than 50%, of the interface residues were identified

Wrong prediction: no interface residues were identified
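
The four categories above translate directly into a small helper. The function name and the choice to report only the most specific category (so a prediction at or above 70% is labelled precise rather than also correct) are assumptions made for this sketch.

```python
def prediction_category(true_interface, predicted):
    """Classify one protein's prediction by the fraction of interface residues recovered."""
    true_set = set(true_interface)
    if not true_set:
        raise ValueError("the protein has no annotated interface residues")
    fraction = len(true_set & set(predicted)) / len(true_set)
    if fraction >= 0.7:
        return "precise"
    if fraction >= 0.5:
        return "correct"
    if fraction > 0.0:
        return "partial"
    return "wrong"

# Hypothetical residue numbers for illustration.
interface = range(1, 11)
category = prediction_category(interface, [1, 2, 3, 4, 5, 6, 7, 8, 40, 41])
```

Note that these categories measure only how much of the true interface is recovered; residues predicted outside the interface (40 and 41 in the example) do not lower the category.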



Predicted Binding Sites - Example 1

Protein: domain 1 of the human coxsackie and adenovirus receptor (CAR D1)

  • Mediates adenovirus and coxsackievirus B infection

  • CAR is an integral membrane protein expressed in a broad range of human and murine cell types. CAR D1 is one of its two extracellular domains

    Binding partner: knob domain of adenovirus serotype 12 (Ad12)



Predicted Binding Sites - Example 2

Protein: adrenodoxin (Adx)

  • In mitochondria of the adrenal cortex, the steroid hydroxylating system requires the transfer of electrons from the membrane-attached flavoprotein AR via the soluble Adx to the membrane-integrated cytochrome P450 of the CYP 11 family

    Binding partner: adrenodoxin reductase (AR)



Predicted Binding Sites - Example 3

Protein : fibroblast growth factor receptor 2 (FGFR2) Ser252Trp Mutant

  • Apert syndrome (AS) is caused by substitution of one of two adjacent residues, Ser252Trp or Pro253Arg

    Binding partner: fibroblast growth factor (FGF2)



Conclusions – Protein-protein Binding Sites

  • Incorporating the structural conservation score improved the prediction performance of SVM significantly

  • This study is an initial trial that exploits multiple structure alignment for the large scale prediction of functional regions

  • We need better algorithms for multiple structure alignment (we have one benchmark for anyone interested)

  • This method can be used to guide experiments, such as site-specific mutagenesis, or combined with docking procedures to limit the search space



General Conclusions

  • Known features of protein structure can be mapped to the corresponding sequences and used to train an SVM

  • Performance can then be determined by evaluating the SVM in cross-validation tests

  • Good performance is shown in training for both flexibility and sites of protein-protein interaction

  • These predictors are currently being used to solve real biological problems

  • Can this approach be applied to other aspects of structure?



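Consider Domain Definitions:

[Figure: five example structures (panels A to E, labelled 1d0gt, 1aoga, 1ytf, 1dgk and 1fohb) comparing the number of domains assigned by experts with the number assigned by PUU. The two assignments differ in every panel, e.g. 1dgk (PUU: 6, experts: 4) and 1fohb (PUU: 2, experts: 3); the remaining panels show experts 3 vs PUU 1, experts 2 vs PUU 1, and experts 3 vs PUU 4.]

Holland et al. 2006 JMB Early Release

Veretnik et al. 2004 JMB 339(3), 647-678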


Challenge – Defining Domain Boundaries from Sequence

  • A domain is the unit of currency of proteins – domain structures define function, indicate evolutionary relationships etc…

  • Domain prediction from structure is easier than from sequence, but is still not a solved problem

  • Recently developed an accurate test set of domain definitions and boundaries: http://pdomains.sdsc.edu

  • Good luck!

Benchmark data available; see Holland et al. 2006 JMB Early Release



Acknowledgements

  • Functional Flexibility

    • Jenny Gu & Michael Gribskov

  • Protein-protein Interactions

    • JoLan Chung & Wei Wang

  • Domain Definitions

    • Stella Veretnik, Tim Holland, Ilya Shindyalov, Nick Alexandrov, Abdur Sikur

  • Funding: NSF, NIH



The structural conservation score

  • Raw structural conservation score: [equations missing; the pairwise term takes one value if a is not a gap and b is not a gap, and another value otherwise]

    where N is the total number of aligned structures, s_i(x) is the amino acid at position x in the i-th structure in the alignment, and m is a modified PET substitution matrix calculated by Valdar et al.



The structural conservation score

  • The B-factors determined by X-ray crystallographic experiments provide an indication of the degree of mobility and disorder of an atom in a protein structure

  • Raw structural conservation scores were weighted by the normalized B-factors (B_norm,i) to account for structural flexibility [equation missing]
