Grids and biology


Grids and Biology. Professor Carole Goble, University of Manchester, UK. BBSRC Bioinformatics and eScience Grant Holders Workshop, Warwick, UK, 28th October 2002. Outline: a take on the Grid; issues in Bioinformatics for Grid; various BioGrids; applicability of Grid to Biology.


Presentation Transcript


Grids and biology

Grids and Biology

Professor Carole Goble

University of Manchester, UK

BBSRC Bioinformatics and eScience Grant Holders Workshop, Warwick, UK

28th October 2002


Grids and Biology

A take on the Grid

Issues in Bioinformatics for Grid

Various BioGrids

Applicability of Grid to Biology

Reality check


What is the grid

What is the Grid?

“Grid computing [is] distinguished from conventional distributed computing by its focus on large-scale resource sharing, innovative applications, and, in some cases, high-performance orientation ... we review the "Grid problem", which we define as flexible, secure, coordinated resource sharing among dynamic collections of individuals, institutions, and resources - what we refer to as virtual organizations.”

From "The Anatomy of the Grid: Enabling Scalable Virtual Organizations" by Foster, Kesselman and Tuecke


What is the Grid?

  • Resource sharing & coordinated problem solving in dynamic, multi-institutional virtual organizations

  • On-demand, ubiquitous access to computing, data, and services

  • New capabilities constructed dynamically and transparently from distributed services

  • No central location, No central control, No existing trust relationships, Little predetermination

  • Uniformity for Pooling Resources

  • Virtual pools of resources: databases, clusters….


Biology as a grid application

Biology as a Grid Application

  • Informational Science

  • Large Scale

  • Distributed

  • No one organisation owns it all


Motivation

[Figure. Labels recovered from the slide: Computational Load; Genome Data; Moore's Law; ESTs; Human Genome; Pharmacogenomics; Metabolic Pathways; Combinatorial Chemistry. Time axis: 1990, 2000, 2010.]


Biomedical computation

BioMedical Computation

[Rick Stevens, Argonne Labs]


Biomedical Data: High Complexity and Large Scale

[Rick Stevens, Argonne Labs]

[Figure. Labels recovered from the slide, grouped in the order they appear:]

  • Proteins: sequence, 2º structure, 3º structure, 4º structure (example sequence: MPMILGYWDIRGLAHAIRLLLEYTDSSYEEKKYT...)

  • DNA sequences: alignments (example sequence: ...atcgaattccaggcgtcacattctcaattcca...)

  • Protein-Protein Interactions: metabolism, pathways, receptor-ligand

  • Physiology: Cellular biology, Biochemistry, Neurobiology, Endocrinology, etc.

  • Polymorphism and Variants: genetic variants, individual patients, epidemiology

  • ESTs: Expression patterns, Large-scale screens

  • Genetics and Maps: Linkage, Cytogenetic, Clone-based

  • Scale labels (their pairing with the data types is not recoverable): billions, billions, millions, millions, millions, hundred thousands


Biogrid projects

myGrid

BioGrid Projects

  • EUROGRID BioGRID

  • Asia Pacific BioGRID

  • North Carolina BioGrid

  • Bioinformatics Research Network

  • Osaka University BioGrid

  • Indiana University BioArchive BioGrid

  • myGrid

  • BioSim

  • e-Protein

  • ObiGrid


Today’s Grid

A Single System Image

  • Transparent wide-area access to large data banks

  • Transparent wide-area access to applications on heterogeneous platforms

  • Transparent wide-area access to processing resources

Security, certification, single sign-on authentication, AAA

  • Grid Security Infrastructure

Data access, transfer & replication

  • GridFTP, Giggle

Computational resource discovery, allocation and process creation

  • GRAM, Unicore, Condor-G


Immediate benefits

Immediate benefits

  • Uniform file views of directories, regardless of platform

  • Grid-based data transfer libraries for faster access to large files, reducing need for mirror-site servers.

  • Replication to support mirroring

  • Grid APIs give a job manager metadata about services, which it can pass on to the user. Service providers can then be evaluated on factors that may include more than just server performance and availability.

  • Grid-aware applications -- split sequence reference libraries among several servers, where BLAST comparisons can be conducted in parallel.

  • Shielding from a variety of low-level computing problems that users would otherwise have to address themselves.
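The idea of splitting a sequence reference library among several servers and comparing in parallel can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: threads stand in for remote servers, the `score` function is a toy longest-common-substring measure rather than BLAST, and the names `partition`, `search_shard`, `parallel_search` and the `LIBRARY` contents are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def score(query: str, subject: str) -> int:
    """Toy similarity: length of the longest common substring (not BLAST)."""
    best = 0
    prev = [0] * (len(subject) + 1)
    for qc in query:
        cur = [0] * (len(subject) + 1)
        for j, sc in enumerate(subject, start=1):
            if qc == sc:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def partition(library: dict, n_servers: int) -> list:
    """Split the reference library into n_servers roughly equal shards."""
    shards = [{} for _ in range(n_servers)]
    for i, (name, seq) in enumerate(sorted(library.items())):
        shards[i % n_servers][name] = seq
    return shards


def search_shard(query: str, shard: dict) -> list:
    """Work done by one (hypothetical) server: score the query against its shard."""
    return [(score(query, seq), name) for name, seq in shard.items()]


def parallel_search(query: str, library: dict, n_servers: int = 3, top: int = 2) -> list:
    """Fan the query out to every shard in parallel, then merge and rank the hits."""
    shards = partition(library, n_servers)
    with ThreadPoolExecutor(max_workers=n_servers) as pool:
        partial = list(pool.map(lambda shard: search_shard(query, shard), shards))
    hits = [hit for part in partial for hit in part]
    hits.sort(key=lambda h: (-h[0], h[1]))  # best score first, ties by name
    return hits[:top]


# Hypothetical reference library, for illustration only.
LIBRARY = {
    "seqA": "atcgaattccaggcgtcacattctcaattcca",
    "seqB": "ggggggggcccccccc",
    "seqC": "ttaattccaggtt",
    "seqD": "acacacacacac",
}

if __name__ == "__main__":
    print(parallel_search("aattccagg", LIBRARY))
```

The merge step is what a Grid-aware application would add on top of an unmodified search tool: each shard is searched independently, and only the ranked hits travel back to the user.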


Grid landscape

Grid Landscape

Computationally Intensive

Collaborative

Visualisation

Data Intensive

Knowledge Intensive



Classical grids

Classical Grids

Classical Grids emphasise sharing of physical resources.

Existing Grid middleware (e.g. Globus, Condor, Unicore) allows resource discovery, resource allocation, data movement, certification …


High performance bioinformatics software

High Performance Bioinformatics Software

[Jack da Silva, NCSC, Paracel]


European datagrid

European DataGrid


Grids and biology

Managed access to specialist remote resources


Grids and biology

  • Access portal for biomolecular modeling resources.

  • Interfaces that enable chemists and biologists to submit work to HPC facilities

  • Visualization of electrostatic field generated by a molecule.

    Dr Krzysztof Nowinski (ICM)


Biogrid system

Biogrid system

[System diagram. Labels recovered from the slide:]

  • Grid system 1: Express5800/ISS for PC-Cluster, Xeon 2.2G x 8 + Management node 1; Flat Neighborhood networks; 1000Base-SX

  • Grid system 2: NEC Blade Server, 78 nodes (156 CPUs); 1000Base-T x 12

  • Data Grid Disk: Express5800/140Ra-4 x 3

  • SCORE Management Station (shown twice); Myrinet-2000; connected to Grid system 3


Remote control of instruments

Remote control of instruments

[Network diagram. Labels recovered from the slide: UHVEM (Osaka, Japan); Osaka University; JGN; Tokyo XP; TransPAC, APAN; STAR TAP (Chicago); vBNS; SDSC (UC San Diego); NCMIR (San Diego)]

  • Sharing of the UHVEM (Ultra High Voltage Electron Microscope) at Osaka University with NCMIR (National Center for Microscopy and Imaging Research)

    • 3 million electron volts

    • the most powerful microscope


Home computers evaluate aids drugs

Home Computers Evaluate AIDS Drugs

  • Community =

    • 1000s of home computer users

    • Philanthropic computing vendor (Entropia)

    • Research group (Scripps)

  • Common goal = advance AIDS research

From Steve Tuecke 12 Oct. 01


Matlab

Matlab

Geodise release in November 02

  • Matlab and toolboxes for mathematical computation, analysis, visualization, and algorithm development:

“MATLAB is an intuitive language and a technical computing environment. It provides core mathematics and advanced graphical tools for data analysis, visualization, and algorithm and application development. With more than 600 mathematical, statistical, and engineering functions, engineers and scientists rely on the MATLAB environment for their technical computing needs.”

(www.mathworks.com)

CROSS PLATFORM/ OS


Grids and biology

BioSim -- Molecular simulations as a tool for protein structure analysis

[Sansom]

synchrotron

compute GRID

MD database

novel biology…

  • Overall vision – simulation as an integral component of structural genomics

  • Needs both capacity (many systems) and capability (large systems - HPCx)

  • Molecular Dynamics database (distributed)


Grid Landscape

Computationally Intensive

Collaborative

Visualisation

Data Intensive

Knowledge Intensive


Visualization bioinformatics

Visualization + Bioinformatics

[Rick Stevens, Argonne Labs]

[Diagram. Labels recovered from the slide, in the order they appear:]

  • Visualization Environment

  • Bioinformatic Analysis Tools

  • Microbiology & Biochemistry

  • Genome Visualization Tools

  • Function Assignment

  • Whole Genome Analysis

  • Metabolic Reconstruction

  • Enzymatic Constants

  • Metabolic ***

  • Network Visualization Tools

  • Stoichiometric Representation & Flux Analysis

  • Proteomics

  • Interactive Stoichiometric Graphical Tools

  • Dynamic Simulation

  • Whole Cell Visualizations

  • Image/Spectra Augmentations

  • Laboratory Verification


X ray microtomography

X-ray microtomography

  • Scientific discovery can be enhanced by closely coupling computation and experiment: simulation, visualization and data gathering are coupled

  • X-ray microtomography produces 3D X-ray attenuation maps of specimens at a microscopic level

  • Expensive synchrotron beam time resources optimally used to obtain sufficient resolution for simulation


Interactive steering

Interactive Steering

  • User steers calculation from laptop

  • Controlled steering on supercomputers

  • Visualization and computation use large scale machines accessed via Grid.

Enables controlled simulation using knowledge and skills of trained scientist.


Scalable molecular dynamics

Scalable molecular dynamics

  • Structure of a protein in a fluid medium

  • Calculation takes into account forces between protein and ambient medium (in this case water molecules)

  • Run on the world's largest academic computer, LeMieux at PSC (6 Tflops theoretical peak)


Grid Landscape

Computationally Intensive

Collaborative

Visualisation

Data Intensive

Knowledge Intensive


Grids and biology

UCSF

UIUC

From Klaus Schulten, Center for Biomolecular Modeling and Bioinformatics, Urbana-Champaign


Grids and biology

http://www.ks.uiuc.edu/Research/biocore/


Grid landscape data

Grid Landscape: DATA!!

Computationally Intensive

Collaborative

Visualisation

Data Intensive

Knowledge Intensive


Information weaving and question answering

Information Weaving and Question Answering

  • Large amounts of different kinds of data & many applications.

  • Highly heterogeneous.

    • Different types, algorithms, forms, implementations, communities, service providers

  • High autonomy.

  • Highly complex and inter-related, & volatile.


Grids and biology

Annotation Pipeline

[Mike Sternberg]

[Pipeline diagram. Labels recovered from the slide, in the order they appear:]

  • proteome sequences

  • sequences; SCOP; CATH; PDB; NRPROT; INTERPRO

  • TM, CC, LC, SIG & MOTIFS

  • PSIBLAST & HMMs

  • PDB hit; no PDB hit

  • 3D modelling x 2

  • fold recognition x 2

  • structure-based function prediction

  • structural and functional annotation


Mygrid

myGrid

RASMOL

  • Personalised extensible environments for data-intensive in silico experiments in biology

  • Straightforward discovery, interoperation, deployment & sharing of services

    • Service-oriented architecture

  • Integration and Information

    • Workflow & Databases

  • Experimentation

    • Provenance, propagating change, personalisation

For bioinformaticians who are building tools and using or providing services
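To make the pairing of workflow and provenance concrete, here is a minimal sketch of a workflow runner that logs a provenance record for every step. It is not myGrid's design or API: the function names, the record fields and the two toy steps standing in for remote services are all assumptions made for illustration.

```python
import hashlib
import json
from datetime import datetime, timezone


def digest(value) -> str:
    """Short content hash so inputs and outputs can be compared across runs."""
    return hashlib.sha256(json.dumps(value, sort_keys=True).encode()).hexdigest()[:12]


def run_workflow(steps, data):
    """Apply each (name, function) step in order and log one provenance record per step."""
    provenance = []
    for name, func in steps:
        result = func(data)
        provenance.append({
            "step": name,
            "input": digest(data),
            "output": digest(result),
            "at": datetime.now(timezone.utc).isoformat(),
        })
        data = result
    return data, provenance


# Hypothetical stand-ins for remote bioinformatics services.
def transcribe(seq: str) -> str:
    """DNA to RNA: upper-case and replace T with U."""
    return seq.upper().replace("T", "U")


def gc_content(seq: str) -> float:
    """Fraction of G and C bases, rounded to three places."""
    return round(sum(base in "GC" for base in seq) / len(seq), 3)


if __name__ == "__main__":
    steps = [("transcribe", transcribe), ("gc_content", gc_content)]
    result, log = run_workflow(steps, "atgcgc")
    print(result)
    for record in log:
        print(record["step"], record["input"], "->", record["output"])
```

Because each record stores a hash of its input and output, a changed upstream result shows up as a changed input hash downstream, which is one simple way to notice that a change needs propagating.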


Discoverynet

DiscoveryNet

  • Bio Chip Applications

    Protein-folding chips: SNP chips, Diff. Gene chips using LFII

    Protein-based fluorescent micro arrays

[Diagram. Labels recovered from the slide: 1-1000; 10-1000; >10000; Data Quality; Visualisation; Structuring; Clustering; Distributed Dynamic Knowledge Management]

http://www.discovery-on-the.net/

High Throughput Sensing (HTS) Applications

  • Large-scale Dynamic Real-time Decision support

  • Large-scale Dynamic System Knowledge Discovery

Based on Kensington Discovery Platform

  • Grid-based Knowledge Discovery: Grid-based Data Mining, Collaborative Visualisation

  • Information Structuring: Information Integration & Composition, Semantics & Domain-based Ontologies, Sharing

  • Distributed Data Engineering: Data Registration, Data Normalisation, Data Quality

Based on Globus & ORB Infrastructure

  • High Throughput Computing Services: Utilising Grid Infrastructure for HT Computing

  • Grid Basic Infrastructure: Globus/Condor/SRB


Grid evolution

Grid Evolution

  • 1st Generation Grid

    • Computationally intensive, file access/transfer

    • Bag of various heterogeneous protocols & toolkits

    • Recognises internet, Ignores Web

    • Academic teams

  • 2nd Generation Grid

    • Data intensive -> knowledge intensive

    • Services-based architecture

    • Recognises Web and Web services

    • Global Grid Forum

    • Industry participation

We are here!


Grids and biology

A Grid vs The Grid

A Grid of resources, not just compute resources but databases, digital libraries, instruments, workflows, documents …

[Diagram. Labels recovered from the slide:]

  • Logical: NovartisGrid, BioSimGrid, MouseGrid

  • Grid Middleware

  • Physical: Nodes on a Gigabit IP Network

  • Geographically (e.g. UKGrid)

These configurations are dynamic: resources are discovered, combined, used and disbanded as and when needed or available.
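The lifecycle described on this slide (discover, combine, use, disband) can be sketched in a few lines. This is an illustrative model only, not any Grid middleware's interface: the `Resource`, `Registry` and `VirtualGrid` classes, the resource names and the site names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Resource:
    name: str
    kind: str  # e.g. "compute", "database", "instrument"
    site: str


@dataclass
class Registry:
    """Hypothetical directory where sites advertise and withdraw resources."""
    resources: set = field(default_factory=set)

    def advertise(self, resource: Resource) -> None:
        self.resources.add(resource)

    def withdraw(self, resource: Resource) -> None:
        self.resources.discard(resource)

    def discover(self, kind: str) -> list:
        """Return the currently advertised resources of one kind, sorted by name."""
        return sorted((r for r in self.resources if r.kind == kind), key=lambda r: r.name)


class VirtualGrid:
    """A temporary, logical grid assembled from whatever is available right now."""

    def __init__(self, name: str, registry: Registry, kinds: list):
        self.name = name
        self.members = [r for kind in kinds for r in registry.discover(kind)]

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.members = []  # disband: release the combined resources
        return False


if __name__ == "__main__":
    registry = Registry()
    registry.advertise(Resource("cluster-1", "compute", "site-a"))
    registry.advertise(Resource("seq-db", "database", "site-b"))
    registry.advertise(Resource("microscope", "instrument", "site-c"))

    with VirtualGrid("BioSimGrid", registry, ["compute", "database"]) as grid:
        print([r.name for r in grid.members])
    print(grid.members)
```

Using a context manager makes the "disbanded" step explicit: the logical grid exists only for the duration of the `with` block, while the physical resources stay registered for the next configuration.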


A configuration of resources

A configuration of resources

services

  • Not just compute services but databases, digital libraries, instruments, workflows, documents …

Open Grid Service Architecture

OGSA

Grid Services

Web Services

Grid Technology


Bio services

Bio Services

  • Drug Discovery

  • Microbial Engineering

  • Molecular Ecology

  • Oncology Research

Domain Oriented Services

  • Integrated Databases

  • Sequence Analysis

  • Protein Interactions

  • Cell Simulation

Basic BioGrid Services

Grid Resource Services

  • Compute Services

  • Pipeline Services

  • Data Archive Service

  • Database Hosting

  • Workflow Enactment

  • Event notification

Common Services

Base Services

Fabric Services


What we need to create

What We Need to Create

  • Grid Bio applications enablement software layer

    • Provide applications with access to Grid services

    • Provide OS-independent services

  • Grid enabled version of bioinformatics data management tools (e.g. DL, SRS, etc.)

    • Need to support virtual databases via Grid services

    • Grid support for commercial databases

  • Bioinformatics applications “plug-in” modules

    • End user tools for a variety of domains

    • Support major existing Bio IT platforms


Requirements for the biogrid

Requirements for the BioGrid

  • Open and extendable architecture

    • Enable tie in to service stack at appropriate points

    • Not just access via Portals

  • Leverage scripting tools in wide use for Bioinformatics

    • Create BioGrid services bindings for Perl and Python

  • Address data federation and integration

    • Leverage work of IBM, Lion BioSciences, DAS, BioMOBY, etc.

  • Match the biology workflow and tool chain

    • Create high-level BioGrid services to address critical stages in existing workflow

    • Support composability of new BioGrid tools with existing tool chain elements


Some biogrid challenges

Some BioGrid Challenges

  • Scalable human bioinformatics expertise

    • Best people working on the important problems

    • Exploit collaboration technology to create world class teams

  • Robust local bioinformatics computing environment

    • Best systems administrators and high-end technologies

    • Embed local resources into the Grid via portal technologies

  • Access to leading edge bioinformatics software and databases customized to user needs

    • Core content from top scientists and developers

    • Integrated access to biological databases

  • Worldwide access to robust computing and database infrastructure

    • Leverage Grid technology to provide worldwide access

    • Integrate purpose built systems and service providers


Reality checks

Reality Checks!!

  • The Technology is Ready

    • Not true: it’s emerging

      • Building middleware, advancing standards, developing dependability

      • Building demonstrators.

      • The computational grid is in advance of the data intensive middleware

      • Integration and curation are probably the obstacles

      • But!! It doesn’t have to be all there to be useful.

  • We know how we will use grid services

    • No — Disruptive technology

      • Lower the barriers of entry.


Reality Checks!!

  • It’s the only game

    • Not true — I3C, BioMOBY, bioDAS, OMG LSR

      • Grid and Web service merge makes integration likely.

  • One Size Fits All

    • Not true

      • Addressed by a minimum set of composable virtual services, but starting with Globus

  • It’s only for “big” science

    • No — “small” science collaborates too!

  • Biology is not unique!

    • AstroGrid


Not a silver bullet

Not a silver bullet!

It’s just middleware, not magic

  • Data quality

  • Content management of databases (controlled vocabularies)

  • Provenance and versioning policies

  • Appropriate use of tools

  • Computational inaccessibility of free text annotation

  • Database accessibility through means other than point and click web interfaces.

    Independent of the Grid!


Life sciences grid lsg

Life Sciences Grid (LSG)

http://people.cs.uchicago.edu/~dangulo/LSG/

