Machine Learning as Applied to Structural Bioinformatics: Results and Challenges

Philip E. Bourne

University of California San Diego

[email protected]

DIMACS - Machine Learning in Bioinformatics


The Current Situation

  • Structure contributes greatly to our understanding of living systems

  • We are locked into thinking about structure in specific ways, which limits our view

    • All too often we consider structure as a static entity

    • The view at left is not how another protein or a small molecule ligand sees PKA

  • We are still not very good at certain problems …


Example Unsolved Problems that Machine Learning Can Address

  • Predicting flexibility and disorder in protein structure

  • Predicting sites of protein-protein and protein-ligand interaction

  • Predicting protein function

  • Defining domain boundaries from sequence

  • Predicting secondary, tertiary and quaternary structure

  • Predicting what will crystallize

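This talk covers the first two problems (flexibility and disorder, and sites of interaction) and offers the fourth (defining domain boundaries from sequence) as a challenge.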


The Current Situation: The Potential “Training Set” is Growing Quickly

  • High level of redundancy as measured by sequence or structure

  • Structure space is clearly finite, but it is not clear how much of it is covered

  • Increase in functionally uncharacterized structures

  • Complexity is increasing, but complexes are still lacking

  • Structures are predominantly 1 and 2 domains

  • Membrane proteins are lacking

  • In summary, the training set is still not truly representative, but structural genomics will improve this situation


Predicting Functional Flexibility

Jenny Gu

Gu, Gribskov & Bourne PLoS Computational Biology 2006 Early On-line Release


Spectrum of Protein Order and Disorder

If we believe that the 3-dimensional structure of a protein is defined by its 1-dimensional sequence, then why not its flexibility?

[Figure: a spectrum running from ordered structures to disordered structures]


Bridging the Sequence-flexibility Gap

Generalize the sequence-flexibility relationship to identify local protein regions important for allostery


The Training Dataset

The dataset has the following properties:

  • Non-redundant sequences

    • Training set sequences share ≤ 10% identity.

  • With good quality structures

    • R-factor < 0.30

  • At high resolution

    • Resolution < 2.0 Å.

      Total number of proteins in dataset: 1277 sequences
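
The quality criteria above can be sketched as a simple filter. This is a minimal illustration, not the authors' pipeline: the `Chain` record, its field names, and the example values are hypothetical, and the sequence-identity redundancy filter is not shown.

```python
from dataclasses import dataclass


@dataclass
class Chain:
    # Hypothetical record; field names are illustrative, not from the talk.
    chain_id: str
    r_factor: float
    resolution: float  # in angstroms


def passes_quality(chain, max_r_factor=0.30, max_resolution=2.0):
    """Keep structures with R-factor < 0.30 and resolution < 2.0 angstroms."""
    return chain.r_factor < max_r_factor and chain.resolution < max_resolution


# Made-up candidates: only the first satisfies both thresholds.
candidates = [
    Chain("A", 0.19, 1.8),
    Chain("B", 0.31, 1.5),  # fails the R-factor threshold
    Chain("C", 0.22, 2.4),  # fails the resolution threshold
]
kept = [c.chain_id for c in candidates if passes_quality(c)]
```

In a real workflow the surviving chains would then be clustered by sequence identity so that only one representative per cluster remains.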


Obtaining Protein Dynamic Information

Protein structures treated as a 3-D elastic network.

Bahar, I., A.R. Atilgan, and B. Erman. Direct evaluation of thermal fluctuations in proteins using a single-parameter harmonic potential. Folding & Design, 1997. 2(3): p. 173-181.


Defining the Target Features

Gaussian Network Model:

  • Models protein structure as a 3-D elastic network.

    • Each Cα is a node in the network.

    • Each node undergoes Gaussian-distributed fluctuations influenced by neighboring interactions within a given cutoff distance (7 Å).

  • Decompose protein fluctuation into a summation of different modes.
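
The elastic network described above can be sketched by building its connectivity (Kirchhoff) matrix from Cα coordinates. This is a minimal illustration using only the standard library; the function name and the toy coordinates are assumptions, and the 7 Å cutoff is the value quoted on the slide.

```python
import math


def kirchhoff_matrix(ca_coords, cutoff=7.0):
    """Connectivity matrix of the network: -1 for each pair of nodes
    within the cutoff, and the contact count of each node on the diagonal."""
    n = len(ca_coords)
    gamma = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(ca_coords[i], ca_coords[j]) <= cutoff:
                gamma[i][j] = gamma[j][i] = -1
                gamma[i][i] += 1
                gamma[j][j] += 1
    return gamma


# Toy chain: four collinear "C-alpha" positions 3.8 angstroms apart (made up).
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
gamma = kirchhoff_matrix(coords)
```

In the Gaussian Network Model, residue fluctuations are proportional to the diagonal of the pseudo-inverse of this matrix, and its eigenvectors give the modes into which the motion is decomposed; that linear algebra step would normally be delegated to a numerical library and is omitted here.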

Bahar, I., A.R. Atilgan, and B. Erman. Direct evaluation of thermal fluctuations in proteins using a single-parameter harmonic potential. Folding & Design, 1997. 2(3): p. 173-181.


Side Note: Gaussian Network Model vs Molecular Dynamics

  • GNM is relatively coarse grained

  • GNM is fast to compute compared with MD

    • Can look over larger time scales

    • Suitable for high throughput


Functional Flexibility Score

  • Utilize correlated movements to help define regional flexibility with functional importance.

Functionally Flexible Score, for each residue:

  • Find the maximum and minimum correlation

  • Use them to scale the normalized fluctuation to determine functional importance
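
The slide does not give the scaling formula, so the sketch below only illustrates the idea under a stated assumption: each residue's normalized fluctuation is multiplied by the spread between its maximum and minimum correlation with other residues. The function name, the choice of spread as the scaling factor, and the toy numbers are all hypothetical.

```python
def functional_flexibility_scores(norm_fluct, corr):
    """Scale each residue's normalized fluctuation by the spread of its
    correlations with all other residues (assumed scaling, see lead-in)."""
    scores = []
    for i, fluct in enumerate(norm_fluct):
        others = [c for j, c in enumerate(corr[i]) if j != i]
        spread = max(others) - min(others)  # assumed scaling factor
        scores.append(fluct * spread)
    return scores


# Toy inputs: three residues with made-up fluctuations and correlations.
norm_fluct = [0.2, 1.0, 0.5]
corr = [
    [1.0, 0.1, -0.1],
    [0.1, 1.0, -0.9],
    [-0.1, -0.9, 1.0],
]
scores = functional_flexibility_scores(norm_fluct, corr)
```

With these toy values the second residue scores highest, because it both fluctuates strongly and is strongly anti-correlated with another residue.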


Example: Identifying Functional Flexible Regions (FFR) in HIV Protease

Correlated modes (yellow)

Anti-correlated (blue)

Normalized scores – single chain

Gu, Gribskov & Bourne, PLoS Comp. Biol. 2006 Early Release


Identifying Regions in Bovine Pancreatic Trypsin Inhibitor and Calmodulin


How to Represent the Protein Sequence?

  • Residues are characterized as functionally flexible or not: approx. 20% of residues are functionally flexible, with FFR lengths typically 9 +/- 11

  • The longer the protein, the longer the FFR

  • We use hidden Markov models to represent each protein sequence in the training dataset.

  • Hidden Markov models capture evolutionary information along with the probability of finding one of the 20 amino acids in each position of the sequence.

  • The probability states are used as input features in the first layer of an architecture containing two SVM layers.


Architecture of Wiggle

[Figure: the Wiggle architecture. One stage captures evolutionary effects and another captures local effects (smoothing).]

9*29 features used for each residue
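
One reading of "9*29 features" is a 9-residue window with 29 values per residue. The sketch below builds such a window vector under that assumption; the function name, the zero padding at the chain termini, and the toy profile are hypothetical and not taken from the talk.

```python
def window_features(per_residue, center, window=9):
    """Concatenate the per-residue feature rows in a window centered on
    `center`; positions beyond the chain ends are zero padded (assumption)."""
    half = window // 2
    width = len(per_residue[0])
    features = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(per_residue):
            features.extend(per_residue[pos])
        else:
            features.extend([0.0] * width)
    return features


# Toy input: 12 residues, each with 29 made-up feature values.
profile = [[i + 1.0] * 29 for i in range(12)]
vec = window_features(profile, center=0)
```

For the first residue, the four window positions before the chain start are padded, so the vector still has 9 * 29 = 261 entries.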


Generating Additional Input Features: Modified Bootstrapping for Tripeptides (Accounts for Nearest Neighbor Effects)

  • Tripeptide patterns are pooled (window size: 3)

  • Two null models* are built by sampling the pooled patterns with replacement, one for FFR regions and one for non-FFR regions (sample counts shown on the slide: 199515 and 44645)

  • Calculate Z score and P value for each pattern with respective null models

* Generate 10,000 Null Models
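
The bootstrapping idea can be sketched as follows: pool all tripeptides, repeatedly draw samples with replacement to estimate how often a pattern would appear by chance, and compare the observed count with that null distribution. This is a simplified illustration, not the authors' "modified" procedure; the sequences, helper names, sample size, and the 200 null models are made up to keep it small.

```python
import random
from statistics import mean, pstdev


def tripeptides(seq):
    """All overlapping windows of size 3 in a sequence."""
    return [seq[i:i + 3] for i in range(len(seq) - 2)]


def null_counts(pool, pattern, sample_size, n_models, rng):
    """Count `pattern` in each of `n_models` samples drawn with replacement."""
    counts = []
    for _ in range(n_models):
        sample = rng.choices(pool, k=sample_size)  # sampling with replacement
        counts.append(sample.count(pattern))
    return counts


def z_score(observed, null):
    sd = pstdev(null)
    return (observed - mean(null)) / sd if sd > 0 else 0.0


rng = random.Random(0)
ffr_regions = ["GSGSGS", "GSGAGS"]              # made-up FFR sequences
other_regions = ["LLVVAA", "AVLLVA", "VALLAV"]  # made-up non-FFR sequences
pool = [t for s in ffr_regions + other_regions for t in tripeptides(s)]
ffr_patterns = [t for s in ffr_regions for t in tripeptides(s)]

observed = ffr_patterns.count("GSG")
null = null_counts(pool, "GSG", sample_size=len(ffr_patterns), n_models=200, rng=rng)
z = z_score(observed, null)
```

A positive Z score indicates that the tripeptide is enriched in the FFR regions relative to the pooled background; an empirical P value could be taken as the fraction of null counts at least as large as the observed count.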


Predictors Trained on the Entire Dataset Perform Poorly on Smaller Proteins.

[Figure: false positives and false negatives]

The characteristics of small proteins are different, e.g. the percentage of complexes.


Partition Training Set Based on Sequence Length

>200 AA Long

<200 AA Long

  • Prediction performance of an SVM trained on a partitioned dataset (solid lines) is compared to that of an SVM trained on the entire dataset (dashed line).

  • Prediction quality improved when the dataset was partitioned, most notably for proteins up to 200 amino acid residues long. Slight improvements were observed for proteins longer than 200 residues.


Performance of Wiggle Predictors

Wiggle

Accuracy: 66.01%

Precision: 37.11%

Recall: 70.49%

Wiggle 200

Accuracy: 76.46%

Precision: 48.99%

Recall: 78.27%
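
For reference, the three reported measures follow the standard definitions in terms of true/false positives and negatives. The counts below are made up (they are not the Wiggle results) and were chosen so that about 20% of residues are positives, which shows how high recall can coexist with modest precision.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard accuracy, precision and recall from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall


# Made-up counts for 1000 residues, 200 of which are truly in an FFR.
acc, prec, rec = classification_metrics(tp=140, fp=240, tn=560, fn=60)
```

Here accuracy and recall are both 0.70 while precision is about 0.37, because false positives drawn from the much larger negative class outnumber the true positives.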


Case Study: PvuII Endonuclease

(homodimer for DNA specific cleavage)

  • Identify known loop for minor groove recognition

  • Identify hinge residues not previously seen

  • Important result for mutagenesis studies

FF SCORE

Wiggle 200


Conclusions for Wiggle

  • FFRs can be measured from structure

  • With some empirical effort these data can be used as input to an SVM to predict FFRs from sequence alone

  • Useful for:

    • Improving docking studies

    • Better understanding of protein function

    • Engineering more or less stable proteins

    • ……

Gu, Gribskov & Bourne, PLoS Comp. Biol. 2006 Early Release


Exploiting Sequence and Structure Homologs to Identify Protein-Protein Binding Sites

JoLan Chung

Chung, Wang & Bourne 2006 Proteins: Structure, Function and Bioinformatics, 62(3) 630-640


Methods to Identify Protein-protein Binding Sites

  • Docking

  • Threading and homology modeling

  • Evolutionary tracing

  • Correlated mutations

  • Properties of patches

  • Hydrophobicity

  • Neural networks and support vector machines (SVM)


Structurally Conserved Surface Residues?

  • None of the above methods considers the residues that are spatially conserved on the surfaces of structure homologs

  • These residues are reported to correspond to the energy hot spots on protein interfaces and can be derived from multiple structure alignments


Method: Incorporate Structural Conservation to Predict the Interface Residue Using SVM

Sequence + structure information

Support vector machine

Binding site location


Derive the Structurally Conserved Residues

  • The structural conservation scores were derived from multiple structural alignments and weighted by the normalized B-factors to account for structural flexibility, which can result in a bad alignment (could use FFRs in the future)

  • Each position in the alignment has a structural conservation score, which represents the conservation in 3D space

  • A position has a high conservation score if the aligned residues are spatially conserved


Structurally Conserved Residues and Interface Residues

E.g. Residues with the top 20% of structure conservation scores (red) mapped to adrenodoxin (Adx, PDB code 1E6E:B) and known to bind adrenodoxin reductase (AR, blue).


Training Dataset

  • 274 non-redundant chains of heterocomplexes (<30% sequence identity) extracted from the PDB

  • Each of these chains was accompanied by a structure alignment with at least 4 members


SVM Training

For each surface residue:

  • Input: sequence profile + ASA + structural conservation score in a window of 13 residues (the residue to be predicted and the 12 spatially nearest surface residues)

  • Classifier: support vector machine

  • Output: interface or non-interface residue?


SVM Training

  • Each residue was encoded as a feature vector with 13×21 dimensions: (the surface residue to be predicted + 12 nearest neighbors) × (20 amino acids + accessible surface area)

  • Implemented using SVMlight with the radial basis function as the kernel (γ = 0.01, regularization parameter C = 10)

  • A set of non-interface surface residues was randomly selected to make the ratio of positive to negative data 1:1

  • 3-fold cross-validation was performed
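
The encoding and balancing steps listed above can be sketched as follows. This is an illustration under assumptions rather than the authors' code: the helper names, the use of a single coordinate per residue to find the 12 spatially nearest surface residues, and the random toy data are all hypothetical. Training itself (SVMlight with an RBF kernel) is not shown.

```python
import math
import random


def nearest_surface_neighbors(coords, index, k=12):
    """Indices of the k surface residues spatially closest to `index`."""
    order = sorted(
        (j for j in range(len(coords)) if j != index),
        key=lambda j: math.dist(coords[index], coords[j]),
    )
    return order[:k]


def encode_residue(profiles, asa, coords, index, k=12):
    """Concatenate (20 profile values + 1 ASA value) for the target residue
    and its k nearest surface neighbors: (k + 1) * 21 features in total."""
    vector = []
    for j in [index] + nearest_surface_neighbors(coords, index, k):
        vector.extend(profiles[j])
        vector.append(asa[j])
    return vector


def balance(positives, negatives, rng):
    """Randomly subsample negatives to a 1:1 ratio with positives."""
    return positives, rng.sample(negatives, len(positives))


# Toy surface of 30 residues with made-up coordinates, profiles and ASA.
rng = random.Random(1)
n = 30
coords = [tuple(rng.uniform(0, 20) for _ in range(3)) for _ in range(n)]
profiles = [[rng.random() for _ in range(20)] for _ in range(n)]
asa = [rng.uniform(5, 150) for _ in range(n)]

x = encode_residue(profiles, asa, coords, index=0)
pos, neg = balance(list(range(5)), list(range(5, 30)), rng)
```

Each encoded residue has 13 × 21 = 273 features, and the balanced set contains as many non-interface as interface examples.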


The Performance of Various Predictors

  • Predictor 1: Sequence profile + ASA

  • Predictor 2: Sequence profile + ASA + structural conservation score

  • Predictor 3: Sequence profile + ASA + raw structural conservation score (not weighted by the normalized B-factor)

  • Predictor 4: Sequence profile + ASA + normalized B-factor


The Performances of the Predictors

Precise prediction: at least 70% of interface residues were identified

Correct prediction: at least 50% of interface residues were identified

Partial prediction: some, but less than 50%, of interface residues were identified

Wrong prediction: no interface residues were identified
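
These four categories can be expressed as a small function of the fraction of true interface residues that were recovered. Because the "precise" and "correct" thresholds overlap, the sketch assumes the most specific label is assigned; the function name and the toy residue numbers are hypothetical.

```python
def prediction_category(true_interface, predicted):
    """Label a prediction by the fraction of true interface residues found."""
    true_set = set(true_interface)
    found = len(true_set & set(predicted))
    coverage = found / len(true_set)
    if coverage >= 0.70:
        return "precise"
    if coverage >= 0.50:
        return "correct"
    if found > 0:
        return "partial"
    return "wrong"


# Toy protein with ten true interface residues (made-up residue numbers).
interface = list(range(1, 11))
labels = [
    prediction_category(interface, [1, 2, 3, 4, 5, 6, 7, 8, 40]),
    prediction_category(interface, [1, 2, 3, 4, 5, 41, 42]),
    prediction_category(interface, [1, 2, 50]),
    prediction_category(interface, [60, 61]),
]
```

The four toy predictions recover 8, 5, 2 and 0 of the ten interface residues, so they fall into the four categories in order.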


Predicted Binding Sites - Example 1

Protein: domain 1 of the human coxsackie and adenovirus receptor (CAR D1)

  • Mediates adenovirus and coxsackie virus B infection

  • CAR is an integral membrane protein expressed in a broad range of human and murine cell types. CAR D1 is one of its two extracellular domains

    Binding partner: knob domain of adenovirus serotype 12 (Ad12)


Predicted Binding Sites - Example 2

Protein: adrenodoxin (Adx)

  • In mitochondria of the adrenal cortex, the steroid hydroxylating system requires the transfer of electrons from the membrane-attached flavoprotein AR via the soluble Adx to the membrane-integrated cytochrome P450 of the CYP 11 family

    Binding partner: adrenodoxin reductase (AR)


Predicted Binding Sites - Example 3

Protein : fibroblast growth factor receptor 2 (FGFR2) Ser252Trp Mutant

  • Apert syndrome (AS) is caused by substitution of one of two adjacent residues, Ser252Trp or Pro253Arg

    Binding partner: fibroblast growth factor (FGF2)


Conclusions – Protein-protein Binding Sites

  • Incorporating the structural conservation score improved the prediction performance of SVM significantly

  • This study is an initial trial that exploits multiple structure alignment for the large scale prediction of functional regions

  • We need better algorithms for multiple structure alignment (we have one benchmark for anyone interested)

  • This method can be used to guide experiments, such as site-specific mutagenesis, or combined with docking procedures to limit the search space


General Conclusions

  • Known features of protein structure can be mapped to the corresponding sequences and used to train an SVM

  • Evaluating the SVM in cross-validation tests determines its performance

  • Good performance is shown in training for both flexibility and sites of protein-protein interaction

  • These predictors are currently being used to solve real biological problems

  • Can this approach be applied to other aspects of structure?


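Consider Domain Definitions:

[Figure: panels A to E show five chains (1d0gt, 1aoga, 1ytf, 1dgk, 1fohb), each labeled with an "Experts" value and a "PUU" value. The value pairs shown are Experts 3 / PUU 1, Experts 2 / PUU 1, Experts 3 / PUU 4, Experts 4 / PUU 6 and Experts 3 / PUU 2.]

Holland et al. 2006 JMB Early Release

Veretnik et al. 2004 JMB 339(3), 647-678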


Challenge – Defining Domain Boundaries from Sequence

  • A domain is the unit of currency of proteins – domain structures define function, indicate evolutionary relationships etc…

  • Domain prediction from structure is easier than from sequence, but is still not a solved problem

  • We recently developed an accurate test set of domain definitions and boundaries: http://pdomains.sdsc.edu

  • Good luck!

Benchmark data available; see Holland et al. 2006 JMB Early Release


Acknowledgements

  • Functional Flexibility

    • Jenny Gu & Michael Gribskov

  • Protein-protein Interactions

    • JoLan Chung & Wei Wang

  • Domain Definitions

    • Stella Veretnik, Tim Holland, Ilya Shindyalov, Nick Alexandrov, Abdur Sikder

  • Funding: NSF, NIH


The structural conservation score

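  • Raw structural conservation score

    [Equation not preserved in the transcript: the raw score at position x is defined through a pairwise term that takes one value if a is not a gap and b is not a gap, and another value otherwise.]

    Here N is the total number of aligned structures, s_i(x) is the amino acid at position x in the i-th structure in the alignment, and m is a modified PET substitution matrix calculated by Valdar et al.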
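
Since the equation itself is missing from the transcript, the following is only a sketch of one form consistent with the definitions above: an average over all pairs of aligned structures of a substitution score, with gapped pairs contributing zero. The averaging, the zero value for gaps, the function names, and the toy matrix are all assumptions; the toy matrix is not the modified PET matrix.

```python
from itertools import combinations

GAP = "-"


def pair_score(a, b, m):
    """Substitution score for a pair of aligned residues; pairs involving
    a gap contribute zero (assumption, the slide's value is not preserved)."""
    if a != GAP and b != GAP:
        return m[frozenset((a, b))]
    return 0.0


def raw_conservation(column, m):
    """Average pairwise score over the N residues aligned at one position
    (assumed sum-of-pairs form; requires N >= 2)."""
    n = len(column)
    total = sum(pair_score(a, b, m) for a, b in combinations(column, 2))
    return total / (n * (n - 1) / 2)


# Toy symmetric substitution scores in [0, 1] for two amino acids.
toy_m = {frozenset("A"): 1.0, frozenset("G"): 1.0, frozenset("AG"): 0.5}

fully_conserved = raw_conservation(["A", "A", "A", "A"], toy_m)
mixed_with_gap = raw_conservation(["A", "G", "-", "A"], toy_m)
```

Under these assumptions a fully conserved column scores 1.0, while a column with a substitution and a gap scores lower, which matches the intent that high scores mark spatially conserved positions.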


The structure conservation score

  • The B-factors determined by X-ray crystallographic experiments provide an indication of the degree of mobility and disorder of an atom in a protein structure

  • Raw structural conservation scores were weighted by the normalized B-factors (B_norm,i) to account for structural flexibility

    [Equation and the definition of B_norm,i not preserved in the transcript]
