Ioerger Lab – Bioinformatics Research

  • Pattern recognition/machine learning

    • issues of representation

    • effect of feature extraction, weighting, and interaction on performance of induction algorithm

  • Applications in Structural Biology

    • molecular basis of biology: protein structures

    • predicting structures

    • tools for solving structures (X-ray crystallography, NMR)

    • stability, folding, packing, motions

    • drug design (small-molecule inhibitors)

    • large datasets exist – exploit them – find the patterns


TEXTAL - Automated Crystallographic Protein Structure Determination Using Pattern Recognition

Principal Investigators: Thomas Ioerger (Dept. Computer Science)

James Sacchettini (Dept. Biochem/Biophys)

Other contributors: Tod D. Romo, Kreshna Gopal, Erik McKee,

Lalji Kanbi, Reetal Pai & Jacob Smith

Funding: National Institutes of Health

Texas A&M University


X-ray crystallography

  • Most widely used method for protein modeling

  • Steps:

    • Grow crystal

    • Collect diffraction data

    • Generate electron density map (Fourier transform)

    • Interpret map, i.e. infer atomic coordinates

    • Refine structure

  • Model-building

    • Currently: crystallographers

    • Challenges: noise, resolution

    • Goal: automation


Overview of TEXTAL

  • Automated model-building program

  • Can we automate the kind of visual processing of patterns that crystallographers use?

    • Intelligent methods to interpret density, despite noise

    • Exploit knowledge about typical protein structure

  • Focus on medium-resolution maps

    • optimized for 2.8 Å (actually, 2.6-3.2 Å is fine)

    • typical for MAD data (useful for high-throughput)

    • other programs exist for higher-res data (ARP/wARP)

Electron density map (or structure factors) → TEXTAL → Protein model (may need refinement)


  • Crystal → collect data → diffraction data → electron density map

  • CAPRA: models backbone

    • SCALE MAP

    • TRACE MAP

    • CALCULATE FEATURES

    • PREDICT Cα’s

    • BUILD CHAINS

    • PATCH & STITCH CHAINS

    • REFINE CHAINS

    • Output: model of backbone

  • LOOKUP: model side chains

    • Output: model of backbone & side chains

  • POST-PROCESSING

    • SEQUENCE ALIGNMENT

    • REAL SPACE REFINEMENT

    • Output: corrected & refined model


F=<1.72,-0.39,1.04,1.55...>

F=<1.58,0.18,1.09,-0.25...>

F=<0.90,0.65,-1.40,0.87...>

F=<1.79,-0.43,0.88,1.52...>


Examples of Numeric Density Features

  • Distance from center-of-sphere to center-of-mass

  • Moments of inertia - relative dispersion along orthogonal axes

  • Geometric features like “Spoke angles”

  • Local variance and other statistics

Features are designed to be rotation-invariant, i.e. same values for region in any orientation/frame-of-reference.

TEXTAL uses 19 distinct numeric features to represent the pattern of density in a region, each calculated over 4 different radii, for a total of 76 features.
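
Two of the listed features can be sketched as follows. This is an illustrative NumPy version, not TEXTAL's code: it assumes the density is sampled on a cubic grid, that the region contains positive density, and the function name `sphere_features` is made up. It computes the distance from the sphere center to the center of mass and the principal moments of inertia, both of which are unchanged when the region is rotated.

```python
import numpy as np

def sphere_features(density, center, radius):
    """Rotation-invariant descriptors of the density inside a sphere.

    density: 3-D array on a cubic grid; center: (i, j, k) grid index;
    radius: sphere radius in grid units.
    Returns (distance from sphere center to center of mass,
             principal moments of inertia sorted ascending).
    """
    idx = np.indices(density.shape).reshape(3, -1).T.astype(float)
    rel = idx - np.asarray(center, dtype=float)
    inside = np.linalg.norm(rel, axis=1) <= radius
    rel = rel[inside]
    w = density.reshape(-1)[inside]           # density values act as masses

    com = (w[:, None] * rel).sum(axis=0) / w.sum()
    dist_to_com = float(np.linalg.norm(com))

    centered = rel - com
    r2 = (centered ** 2).sum(axis=1)
    outer = np.einsum("i,ij,ik->jk", w, centered, centered)
    inertia = (w * r2).sum() * np.eye(3) - outer   # sum_i w_i (|r|^2 I - r r^T)
    moments = np.sort(np.linalg.eigvalsh(inertia))
    return dist_to_com, moments
```

Evaluating the same function at several radii around one point yields one block of features per radius, which is the pattern the slide describes.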


The LOOKUP Process

  • Region in map to be interpreted

  • Database of known maps

  • Two-step filter:

    • 1) by features

    • 2) by density correlation

  • Find optimal rotation

  • “2-norm”: weighted Euclidean distance metric for retrieving matches
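
The first filter step (retrieval by features) can be sketched as a weighted nearest-neighbor search. The exact form of the metric is not preserved in this transcript, so the version below assumes the usual weighted Euclidean distance, sqrt(sum_i w_i (x_i - y_i)^2); the function names are hypothetical. The second step (density correlation over rotations) is not shown.

```python
import numpy as np

def weighted_distance(x, y, w):
    """Weighted Euclidean distance between feature vectors (assumed form)."""
    x, y, w = (np.asarray(a, dtype=float) for a in (x, y, w))
    return np.sqrt(np.sum(w * (x - y) ** 2, axis=-1))

def top_k_matches(query, database, w, k):
    """Indices and distances of the k database regions closest to the query.

    database: array of shape (n_regions, n_features).
    """
    d = weighted_distance(database, query, w)
    order = np.argsort(d)[:k]
    return order, d[order]
```

Lowering the weight of an uninformative feature changes which database regions come back first, which is the motivation for learning the weights.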


SLIDER: Feature-weighting algorithm

  • Euclidean distance metric used for retrieval

  • relevant features – good, irrelevant features – bad

  • Goal: find the optimal weight vector w that generates the highest probability of hits (matches) in the top K candidates from the database

  • Concept of Slider:

    • adjust feature weights so that most matches are ranked higher than mismatches

Slider Algorithm(w, F, {Ri}, matches, mismatches)

  choose feature f ∈ F at random

  for each <Ri, Rj, Rk>, Rj ∈ matches(Ri), Rk ∈ mismatches(Ri):

    compute cross-over point λi where:

      dist’(Ri,Rj) = dist’(Ri,Rk)

      dist’(X,Y) = λ(Xf − Yf)² + (1 − λ)·dist\f(X,Y)

  pick λ that is the best compromise among the λi

    (ranks most matches above mismatches)

  update weight vector: w’ ← update(w, f, λ), w’f = λ

  repeat until convergence
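
One pass of the inner loop can be sketched as below. This is an interpretation rather than the authors' code: it assumes that the distance excluding feature f is the weighted squared distance over the remaining features under the current weights, and it treats "best compromise" as the mixing weight that puts the most matches strictly ahead of their mismatches. All function names are hypothetical. Setting the two dist’ expressions equal and solving gives the cross-over (b_k - b_j) / ((a_j - a_k) + (b_k - b_j)), where a is the squared difference on f and b is the distance on the other features.

```python
import numpy as np

def split_distance(x, y, f, w):
    """Return (squared difference on feature f, weighted squared distance on the rest)."""
    others = np.arange(len(w)) != f
    on_f = float((x[f] - y[f]) ** 2)
    rest = float(np.sum(w[others] * (x[others] - y[others]) ** 2))
    return on_f, rest

def crossover_lambda(ri, rj, rk, f, w):
    """Mixing weight at which match rj and mismatch rk are equidistant from ri."""
    a_j, b_j = split_distance(ri, rj, f, w)
    a_k, b_k = split_distance(ri, rk, f, w)
    denom = (a_j - a_k) + (b_k - b_j)
    if denom == 0:
        return None  # the two distances never cross
    return (b_k - b_j) / denom

def count_correct(lam, triples, f, w):
    """Count triples whose match is strictly closer than the mismatch under lam."""
    n = 0
    for ri, rj, rk in triples:
        a_j, b_j = split_distance(ri, rj, f, w)
        a_k, b_k = split_distance(ri, rk, f, w)
        if lam * a_j + (1 - lam) * b_j < lam * a_k + (1 - lam) * b_k:
            n += 1
    return n

def best_lambda(triples, f, w):
    """Pick the value in [0, 1] that ranks the most matches above mismatches."""
    points = {0.0, 1.0}
    for ri, rj, rk in triples:
        lam = crossover_lambda(ri, rj, rk, f, w)
        if lam is not None and 0.0 < lam < 1.0:
            points.add(lam)
    points = sorted(points)
    # Rankings only change at cross-over points, so test one value per interval.
    midpoints = [(p + q) / 2 for p, q in zip(points, points[1:])]
    return max(midpoints + [0.0, 1.0],
               key=lambda lam: count_correct(lam, triples, f, w))
```

The outer loop would repeat this for randomly chosen features until the weights stop changing; how the remaining weights are rescaled in the update step is not specified on the slide and is left out here.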


Quality of TEXTAL models

  • Typically builds >80% of the protein atoms

  • Accuracy of coordinates: ~1Å error (RMSD)

    • Depends on resolution and quality of map
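
The coordinate error quoted above is a root-mean-square deviation. A minimal sketch, assuming the model and reference atoms are already paired one-to-one and expressed in the same coordinate frame (so no superposition is needed); the function name is illustrative:

```python
import numpy as np

def rmsd(model, reference):
    """Root-mean-square deviation between paired atoms, in the input units (e.g. Å)."""
    model = np.asarray(model, dtype=float)
    reference = np.asarray(reference, dtype=float)
    if model.shape != reference.shape:
        raise ValueError("model and reference must contain the same atoms")
    sq_dist = np.sum((model - reference) ** 2, axis=1)  # per-atom squared distance
    return float(np.sqrt(np.mean(sq_dist)))
```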


Closeup of β-strand (TEXTAL model in green)


Deployment

  • September 2004: Linux and OSX distributions

    • Can be downloaded from http://textal.tamu.edu

    • 40 trial licenses granted so far

  • June 2002: WebTex (http://textal.tamu.edu)

    • Until May 2005: TB Structural Genomics Consortium members only

    • Recently opened to the public

    • users upload data; processed on server; can download results

    • 120 users from 70 institutions in 20 countries

  • July 2003: Model building component of PHENIX

    • Python-based Hierarchical ENvironment for Integrated Xtallography

    • Consortium members:

      • Lawrence Berkeley National Lab

      • University of Cambridge

      • Los Alamos National Lab

      • Texas A&M University


Intelligent Methods for Drug Design

  • structure-based:

    • given protein structure, predict ligands that might bind active site

  • other methods:

    • QSAR, high-throughput/combi-chem, manual design using 3D

  • Virtual Screening

    • docking algorithm + large library of chemical structures

    • sort compounds by interaction energy

    • purchase top-ranked hits and assay in lab

    • looking for μM inhibitors (leads that can be refined)

    • goal: enrichment to ~5% hit rate
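
The ranking and enrichment steps can be sketched in a few lines. This assumes the common convention that a lower (more negative) interaction energy is better, and that a set of experimentally confirmed actives is available for evaluation; the function names and compound identifiers are made up.

```python
def rank_by_energy(energies):
    """Sort compound ids from most to least favorable (lowest energy first)."""
    return sorted(energies, key=energies.get)

def hit_rate(compounds, actives):
    """Fraction of the given compounds that are confirmed actives."""
    compounds = list(compounds)
    if not compounds:
        raise ValueError("no compounds selected")
    return sum(c in actives for c in compounds) / len(compounds)

def enrichment_factor(ranked, actives, top_n):
    """Hit rate among the top_n ranked compounds relative to the whole library."""
    base = hit_rate(ranked, actives)
    if base == 0:
        raise ValueError("library contains no known actives")
    return hit_rate(ranked[:top_n], actives) / base
```

For example, with `energies = {"a": -9.1, "b": -3.0, "c": -7.5, "d": -1.2}` the ranking starts with `"a"` and `"c"`, and those would be the compounds purchased and assayed first.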


Virtual Screening

  • diversity

  • ZINC database: ~2.6 million compounds

    • purchasable; satisfy Lipinski’s rules

  • docking algorithms:

    • FlexX, DOCK, GOLD, AutoDock, ICM...

    • search for position and conformation of ligand

  • scoring function

    • electrostatic + steric + desolvation

    • entropy effects?

  • major open issues:

    • active site flexibility, charge state, waters, co-factors

    • works best with co-crystal structures (already bound)
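
As an illustration of the library filter mentioned above, Lipinski's rule of five can be checked from four precomputed properties. This sketch assumes the properties are supplied by the caller (for instance from a cheminformatics toolkit); the `Compound` class and the example values are hypothetical. The strict default allows no violations, and `max_violations` can be raised for workflows that tolerate one.

```python
from dataclasses import dataclass

@dataclass
class Compound:
    name: str
    mol_weight: float   # daltons
    log_p: float        # octanol-water partition coefficient
    h_donors: int       # hydrogen-bond donors
    h_acceptors: int    # hydrogen-bond acceptors

def lipinski_violations(c):
    """Number of rule-of-five limits the compound exceeds."""
    return sum([
        c.mol_weight > 500,
        c.log_p > 5,
        c.h_donors > 5,
        c.h_acceptors > 10,
    ])

def passes_lipinski(c, max_violations=0):
    """True if the compound exceeds at most max_violations of the four limits."""
    return lipinski_violations(c) <= max_violations
```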


Grid at Texas A&M

  • gridmaster.tamu.edu: DOCK binaries + receptor files + 20 ligands at a time

  • ~1600 computers in student labs on TAMU campus (Open-Access Labs)

    • West Campus Library

    • Blocker

    • Zachary

  • typical configuration: 2.8 GHz dual-core Pentium CPUs running Windows XP

  • GridMP software by United Devices (Austin, TX)
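
Splitting a ligand library into work units of 20 can be sketched as follows. This only illustrates the batching idea; it is not the GridMP API, and the function name is made up.

```python
from itertools import islice

def work_units(ligand_ids, batch_size=20):
    """Yield successive lists of at most batch_size ligand ids."""
    it = iter(ligand_ids)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch
```

Each yielded batch would be packaged with the DOCK binaries and receptor files and sent to one idle machine.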


Data Mining of Results

  • promiscuous binders

  • clusters of related compounds

  • patterns of contacts within active site

  • hydrogen-bonding interactions

  • adjust weights of scoring function for unique properties of each site

    • open/closed, hydrophobic/charged...

  • ideas for active site variations

  • development of pharmacophore search patterns
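
As one example of this kind of mining, promiscuous binders can be flagged by counting how many different targets each compound appears against in the top-ranked hits. A minimal sketch, assuming the per-target hit lists are already available; the threshold `min_targets` and the identifiers are hypothetical.

```python
from collections import Counter

def promiscuous_binders(top_hits_by_target, min_targets=3):
    """Compounds that appear in the top hits of at least min_targets targets.

    top_hits_by_target: dict mapping target name -> iterable of compound ids.
    """
    counts = Counter(
        compound
        for hits in top_hits_by_target.values()
        for compound in set(hits)          # count each target at most once
    )
    return sorted(c for c, n in counts.items() if n >= min_targets)
```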


Current Screens in Sacchettini Lab

  • proteins related to tuberculosis (Mycobacterium)

    • focus on unique pathways involved in dormancy/starvation

      • glyoxylate shunt – slow-growth metabolic pathway

      • cell-wall biosynthesis (unique mycolic acid layer in TB)

      • biosynthesis of amino acids/co-factors that humans get from diet

    • isocitrate lyase

    • malate synthase

    • PcaA: mycolic acid cyclopropane synthase

    • ACPS: acyl-carrier protein synthase

    • InhA: enoyl-ACP reductase (target of isoniazid)

    • KasB: fatty-acid synthase

    • BioA: biotin (co-factor) synthase

    • PGDH: phosphoglycerate dehydrogenase (serine biosynthesis)

  • Related proteins in malaria, SARS, Shigella


Conclusions

  • Many opportunities for research in Structural Bioinformatics

    • large datasets

    • significant problems

  • Provides challenges for machine learning

    • drives development of novel methods, especially for dealing with noise, sampling biases, extraction of features...

  • Requires inherently interdisciplinary approach

    • training in biochemistry; knowledge of molecular interactions

    • understanding chemical intuition; use of visualization tools

    • insights about strengths and limitations of existing methods

  • Requires collaboration to construct appropriate representations to enable learning algorithms to find patterns

    • translate expectations about what is relevant, dependencies, smoothing, sources of noise...

