In silico biology computational path toward holistic understanding of living cells
This presentation is the property of its rightful owner.
Sponsored Links
1 / 19

In silico biology: computational path toward holistic understanding of living cells PowerPoint PPT Presentation


  • 120 Views
  • Uploaded on
  • Presentation posted in: General

In silico biology: computational path toward holistic understanding of living cells. Andrey A. Gorin Computer Science and Mathematics Oak Ridge National Laboratory [email protected] Dynamics Kinetics. Function. Models. Structure. Comparative Proteomics. Genes. Comparative Genomics.

Download Presentation

In silico biology: computational path toward holistic understanding of living cells

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


In silico biology computational path toward holistic understanding of living cells

In silico biology: computational path toward holistic understanding of living cells

Andrey A. Gorin

Computer Science and Mathematics

Oak Ridge National Laboratory

[email protected]


Motivation predictive biology

Dynamics

Kinetics

Function

Models

Structure

Comparative Proteomics

Genes

Comparative Genomics

Molecular Simulation

NetworkSimulation

Structure

Prediction

1000 TB

0 TB

1 TB

10 TB

100 TB

Regulatory Region

Coding Region

P

O

GenA

GenB

GenC

Motivation – Predictive Biology

Developments in Biological Sciences:

  • Experimental: From Reduction to Systems Science

  • Computational: From Validation to Prediction

    Development in Technologies:

  • High Throughput Experimentation

  • High Performance Computing

    Uniqueness of Biology:

  • First Principles Approach is Impossible/Impractical

  • Enormous Multitude of Scales

  • Descriptive Models from Diverse Data

  • Large Uncertainties in Data


Scientific drivers

Scientific Drivers

  • Bio-remediation

  • Environmental restoration using microbial processes requires study of:

    • Microbial attachment to mineral environments

    • Uptake of contaminant ions by microbes and metal reduction at the microbial membrane

    • Conversion of toxic chemicals by microbial systems

  • Bio-energy

  • Development of integrated experimental an computational approaches for:

    • Feedstock optimization with the goal of better cellulose deconstruction by special bacterial systems

    • Understanding of microbial communities in single batch processes (harsh environments)

    • Enzyme or regulatory circuit design to increase desirable output


Exploring protein dimension of bio universe

Exploring Protein Dimension of Bio Universe

Entire proteome is analyzed in a few hours

1 out of 105-1011 must be selected as the correct peptide

Mass spectrometry process


De novo platform probability profile method

0.5

1.

0.8

Chain

Confidence

Database

FPVW1

>KKRRHA…LKAAKHREVFKR

FPVW2

>KAH….………..

FPVW3

……….

Random

Match Model

Memory

Indices

De novo Platform: Probability Profile Method

We made several advancements in the understanding of the “mass spec” mathematics. Taken together they lead to conceptually novel platform.

Peak

Assignment

m1 m2 m3 m4 …

[510] VDDLSSLT [305]


Results capabilities in mass spectrometry

Results: Capabilities in Mass Spectrometry

Advances in the mathematical understanding and dramatic acceleration of fundamental operations lead to principally new capabilities

  • Output gains, and especially in highly confident identifications. Our method gives several times more of highly confident identifications

  • Capability to detect unexpected phenomena in the samples. We have found novel biological phenomena in the legacy data and uncovered mistakes in the data sets regarded as benchmarks.

Deamidations (6 spectra)

IHPFAQTQSLVYPFPGPIPN IHPFAQTQSLVYPFPGPIPD

Incorrect peptides (7 spectra)

VIPAADLSQQISTAGTEASGTGNMK->VIPAADLSEQISTAGTEASGTGNMK

Disulphide bond (1 spectra)

AAANFFSASCVPCADQSSFPK

De novo methods can improve even manually verified benchmark data sets obtained by the existing technology.


Model dependent science questions

Model-dependent Science Questions

Many fundamental questions in systems biology are hampered by the lack of reliable predictive models.

Network Models Reliant:

Structural Models Reliant:

  • What are biochemical or regulatory functions of the proteins that are shown to be important for hydrogen production?

  • What are the hot spots of cellulases that could impair their binding?

  • What are the components of hydrogen producing protein assemblies?

  • What biochemical processes in a microbe are related to its traits (hydrogen or ethanol producer, ethanol resistant)?

  • How does a bacteria degrade lignin or cellulose?

  • What are the mechanisms behind the conversion of toxic waste to nontoxic substances by bacteria?


Predictive model building

NMR

X-ray

Genomics

Neutron Scattering

Imaging

Structural Models

3-d Structure Protein Docking Protein-RNA Protein-DNA Protein-Ligand

Interactomics

Genomics

Metabolomics

Quantitative Proteomics

Transcriptomics

Network/Pathway Models

Metabolic Regulatory SignalingProtein Interaction

Predictive Model Building


Protein complexes in genomics gtl

Protein Complexes in Genomics:GTL

GTL is focused on protein interactions that make life work

Express Bait

Protein

Is the interaction real or an artifact?

What is the structure of the protein

complex?

What is its function?

What is its dynamic mechanism?

Can we answer these questions at scale?

Exogenous / Endogenous

Pulldown

Mass

Spectrometry

Analysis

Putative Interacting Proteins at High Throughput

Xray Diffraction


Modeling protein structures and complexes

Statistical potentials

Modeling Protein Structures and Complexes

Combinatorial and optimization techniques are applied for two areas: development of knowledge based potentials and analysis of ultra large structural sets.

Discovery of protein complexes

?

Geometry and bioinfo libraries

Shared memory indices

Protein folding

Parallel implemenations

Ligand binding

Graph algorithms


Computational algorithms in structure modeling

Finding

Common Motifs

ROSETTA Monte Carlo protein folding

Known structures

Knowledge-based Energy Tables

100 GB

Search, Optimization, Enumeration

Energy Optimization

3 GB – 5 TB

Finding

MaximalCliques

Merging & Scoring

Decoy structures

Cliques of

structures

Search, Optimization,

Enumeration

103-105

(104~50TB)

Search

10 GB –

500 TB

Native Structures

Computational Algorithms in Structure Modeling

Multitude of combinatorial optimization problems with different data access patterns.

Example: Ab Initio Prediction of Protein 3-d Structure


Results protein docking

40 decoys with the theoretical probability near 0.8 have on average 60% of native contacts

Up to 1000

threads

Results: Protein Docking

  • Full implementation for Bayesian potential energy functions for protein docking

  • The size of the docking benchmarking set (~1,200) is unprecedented in the field. High quality results were obtained for over 70% of all tested complexes (native in top 5), indicating that the method is very efficient.

  • Successfully modeled protein complexes using the structures from other organisms

  • Docking calculations are scalable to 1000s processor due to surface patch approach for parallelization


Future protein docking

Future:Protein Docking

  • Develop further Protein Interface Server (PINS) database (pins.ornl.gov).

  • Develop theory and computational implementation for docking potentials taking into account orientation of the contacting residues (Bayesian). Petascale implementations for docking platforms.

  • Improve mathematical methods for prediction validation.

  • Analyze and annotate PINS interfaces related to the metabolic functions involving carbon-processing pathways in cellulose-degrading bacteria. Construct predictions for the organisms important for Bioenergy Centers in cases when full or partial genome sequence are available.


Predictive model building1

NMR

X-ray

Genomics

Neutron Scattering

Imaging

Structural Models

3-d Structure Protein Docking Protein-RNA Protein-DNA Protein-Ligand

Interactomics

Genomics

Metabolomics

Quantitative Proteomics

Transcriptomics

Network/Pathway Models

Metabolic Regulatory SignalingProtein Interaction

Predictive Model Building


In silico biology computational path toward holistic understanding of living cells

LDRDD07-014Modeling Cellular Mechanisms for Efficient Bioethanol Production through Petascale Analysis of Biological Networks

Andrey Gorin, Nabeela Ahmad, Andrew Bordner, Robert Day,

Jessie Gu, Guruprasad Kora, Chongle Pan,

Byung-Hoon Park, Nagiza Samatova, Edward Uberbacher, Cray. Inc

Oak Ridge National Laboratory, FY2007-FY2008


Future graph algorithms for networks

Future: Graph Algorithms for Networks

  • Analysis of graphs reflecting physical interactions in structures

    • Very wide set of problems, but with a uniform set of “translation” rules

    • Enormously large graphs

    • Directly connected to modeling petascale applications

  • Developing graph representations for real cellular subsystems.

    Nodes: genes, proteins, DNA elements, metabolites.

    Edges: translation, positive regulation, production of metabolite

  • Flexible representations

    Integration of many tools to annotate 5’ regions

    Promotor alignment

    Models to predict transcription patterns based on promotor models


Future mining networks for bioenergy

Future: Mining Networks for Bioenergy

  • Switch grass paralogs/homologs has to be identified in collections of >250,000 partial gene data (EST) from rice, maize, sorghum genomes

  • It is expected that 6000 Populus lines will be passed through cell wall phenotyping pipeline (expression arrays, proteomics data, etc)

  • Huge data sets are already (e.g. Obauashi et al. (2007) 1,388 expression arrays)

(From Bioenergy Center proposal)


Future taking it to petascale

Future: Taking It to Petascale

  • Port to NLCF platforms

    • Scale Maximal Clique Enumeration from 100s to 1000s processors

    • Introduce similar implementations for other codes in pGraph and BioGraphE

  • Optimize performance on NLCF architectures:

    • Deliver efficient MPI-based multi-core implementations

    • Minimize thread locking/unlocking overheads

    • Exploit data locality in job redistribution strategies

    • Minimize I/O overheads via buffering, data compression, hierarchical parallel I/O.


Potential areas of joint work

Potential Areas of Joint Work

Mathematics

  • Statistics of pattern recognition problems

  • Combinatorial optimization methods beyond dynamic programming

  • Bayesian system with strong feature correlations

    Computer science

  • Graph algorithms working for VERY large graphs 104-105 nodes and across many graphs (103) in parallel

  • Flexible software systems for representation transcription regulation networks in the bacterial cells (4 genes)

  • Petascale implementation strategies: MPI implementations, load balancing, etc

    Biological Applications:

  • Novel applications in mass-spectrometry (e.g. gene finding in new genomes, special gene expression in stem cells, mass-spec of cross-linked systems)

  • Expression array analysis and metabolic pathway construction for bacteria involved in bioenergy processing


  • Login