
Overview IST-2001-38344 - PowerPoint PPT Presentation


Overview IST-2001-38344. Cells are a collection of protein nanomachines. A biological challenge. To build models of protein complexes & understand the function of each component, based upon available evidence.




A biological challenge

  • To build models of protein complexes & understand the function of each component, based upon available evidence.

  • However, to build evidence for each protein interaction, a biologist must find, integrate, compare & then validate the results from a number of separate resources.

Genomics & Proteomics

[Figure: DNA 'chips', HTP sequencing, gene prediction, domain analysis]

Genomics & Proteomics

[Figure: Expression Space, Literature Space]

The need for computerised information systems

  • New HTP methods produce orders of magnitude more data than before:

    • More than is interpretable manually.

    • Data are stored in a (semi-)structured format.

  • Much knowledge is in literature & patents:

    • 13,000,000 abstracts in MEDLINE.

    • Knowledge is stored in an unstructured format.

  • Solution: computerised information systems:

    • Enable data mining & visualisation of integrated resources, with text analysis.

Components of bioGrid

  • Gene expression:

    • ExpressionSpace:

      • Clustering of microarray data.

      • May require large memory.

  • Protein interaction:

    • PSIMAP:

      • Predict interactions between protein domains.

      • Can be pre-computed, as the data are relatively unchanging.

  • Literature:

    • GoPubMed-D:

      • Organises corpus of documents into the GO ontology.

      • Lexical analysis requires lengthy computation.

bioGrid: An integrated platform for gene expression data, protein interaction data, and literature

[Figure: Expression Space: Space Explorer; Interaction Space: PSIMAP; Literature Space: Classification Server]

Workflow for use case - Part I

  • Search literature for papers about the experimental system studied:

    • Microarray & mitochondria.

  • Upload the gene expression data set.

  • Cluster the gene expression data set.

  • Identify a cluster that contains genes of interest, e.g. energy production.

  • Examine the expression profiles of the genes in the cluster.
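The clustering step above can be sketched in a few lines of Python. The slides do not say which algorithm ExpressionSpace uses, so this is only a stand-in: it groups genes whose expression profiles have a Pearson correlation above a threshold (single-linkage grouping). The gene names, profile values, and the `threshold` parameter are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a flat profile correlates with nothing
    return cov / (sx * sy)


def cluster_profiles(profiles, threshold=0.9):
    """Group genes that correlate above `threshold` with any cluster member."""
    clusters = []
    for gene in sorted(profiles):
        # Find every existing cluster this gene links to.
        linked = [
            c for c in clusters
            if any(pearson(profiles[gene], profiles[m]) >= threshold for m in c)
        ]
        merged = {gene}
        for c in linked:
            merged |= c
            clusters.remove(c)
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)


# Hypothetical microarray profiles (one value per experimental condition).
PROFILES = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.5],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [8.0, 6.0, 4.5, 2.0],
}

if __name__ == "__main__":
    for members in cluster_profiles(PROFILES):
        print(members)
```

A biologist would then pick the cluster that contains the genes of interest and inspect the member profiles directly.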

Workflow for use case - Part II

  • Calculate an induced PSIMAP graph for the genes in the expression cluster.

  • Explore PSIMAP graph & nodes.

  • For pairs of genes predicted to interact:

    • Search literature for papers citing both genes.

    • Classify literature to assess possible function or metabolic processes of genes.

  • Assimilate evidence for components of a protein complex.
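As a rough illustration of Part II, the sketch below derives predicted gene pairs from domain-level interactions and then looks up papers that cite both genes of a pair. It does not use the real PSIMAP or GoPubMed-D interfaces, which the slides do not describe; the data structures (`gene_domains`, `domain_interactions`, `gene_papers`) and all identifiers in the sample data are assumptions made for the example.

```python
from itertools import combinations


def predicted_gene_pairs(gene_domains, domain_interactions, cluster):
    """Pairs of genes in `cluster` whose domains are predicted to interact.

    This is the induced graph: only interactions with both ends inside
    the expression cluster are kept.
    """
    interacting = {frozenset(pair) for pair in domain_interactions}
    pairs = []
    for g1, g2 in combinations(sorted(cluster), 2):
        if any(
            frozenset((d1, d2)) in interacting
            for d1 in gene_domains.get(g1, ())
            for d2 in gene_domains.get(g2, ())
        ):
            pairs.append((g1, g2))
    return pairs


def co_citing_papers(gene_papers, pair):
    """Papers that cite both genes of a predicted pair."""
    g1, g2 = pair
    return sorted(gene_papers.get(g1, set()) & gene_papers.get(g2, set()))


# Hypothetical inputs: gene -> domains, domain-domain interactions,
# and gene -> identifiers of papers that mention the gene.
GENE_DOMAINS = {"geneA": {"dom1"}, "geneB": {"dom2"}, "geneC": {"dom3"}}
DOMAIN_INTERACTIONS = [("dom1", "dom2")]
GENE_PAPERS = {
    "geneA": {"paper1", "paper2"},
    "geneB": {"paper2", "paper3"},
    "geneC": {"paper3"},
}

if __name__ == "__main__":
    cluster = ["geneA", "geneB", "geneC"]
    for pair in predicted_gene_pairs(GENE_DOMAINS, DOMAIN_INTERACTIONS, cluster):
        print(pair, co_citing_papers(GENE_PAPERS, pair))
```

The co-cited papers returned for each pair are the candidates for the literature classification step.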

Distributed technology implementation

  • Globus, Unicore, Legion, …

    • Are geared towards computational complexity, not semantic complexity.

  • BioGrid’s approach:

    • Agent-based approach.

    • Integration of rules, reasoning, and messaging in a Java-environment.

    • Using meta-model.

  • Advantage:

    • Easy to maintain, easy to use, includes code distribution, architecture independent, geared towards farms of local and remote machines.

Prova-AA

  • Extensions to Prova for rule-based agent scripting.

  • Prova-AA introduces:

    • Messaging (local, JMS, and JADE).

    • Reaction rules.

    • Context-dependent inline reactions for asynchronous messaging.

    • Embedding of Prova agents in Java and Web apps.

  • Advantages:

    • Cooperating agents vs. GRID RPC.

    • Ease of development and maintenance.

    • Platform independence and portability.

    • High level specification of communication protocols.

    • Native syntax integration with Java.

    • Low-cost creation of distributed workflows and ad-hoc networks of computation nodes.


Distributed GoPubMed-D (2/3)

The BioGrid prototype integrates with GoPubMed-D via an embedded Prova-AA JADE agent.


Distributed computation with Prova-AA agents

A flexible solution for self-managing, self-balancing distributed computation:

  • Manager and Workers architecture based on Prova-AA agents with Java computation modules.

  • Loosely synchronous interaction.

  • Minimal compact coding (30 lines for Manager and 20 lines for Worker).

  • Manager does not need to keep a registry of Workers, which can join in at any time.

  • Computation is divided into small atomic subtasks (4 or 5 proteins).

  • Manager dispatches a new subtask asynchronously upon receiving a ready message from a Worker.

  • Worker computes a subtask and responds with the results in a reply message and a new ready message.

  • Workers compute subtasks at their own pace so load balancing is automatic.

  • Workers extended with routing capabilities are available.

  • Can be easily extended with failover capabilities.
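The Manager and Workers protocol listed above can be mimicked with plain Python threads and queues. This is not Prova-AA code and does not reproduce its rule syntax or its JMS/JADE messaging; it only illustrates the message flow (ready, task, result). The message tuples, the function names, and the placeholder `compute` function are assumptions made for the example.

```python
import queue
import threading


def chunk(items, size=4):
    """Split the work into small atomic subtasks (here 4 proteins each)."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def compute(protein):
    """Placeholder for the real computation module (hypothetical)."""
    return len(protein)


def worker(name, manager_inbox):
    """Announce readiness, compute one subtask, reply, and repeat."""
    inbox = queue.Queue()
    while True:
        # The ready message carries the reply address, so the Manager
        # needs no registry of Workers.
        manager_inbox.put(("ready", name, inbox))
        msg = inbox.get()
        if msg[0] == "stop":
            return
        _, task_id, proteins = msg
        result = {p: compute(p) for p in proteins}
        manager_inbox.put(("result", task_id, result))


def manager(inbox, subtasks):
    """Hand out one subtask per ready message until all results are in."""
    pending = list(enumerate(subtasks))
    results, done = {}, 0
    while done < len(subtasks):
        msg = inbox.get()
        if msg[0] == "ready":
            reply_to = msg[2]
            if pending:
                task_id, proteins = pending.pop(0)
                reply_to.put(("task", task_id, proteins))
            else:
                reply_to.put(("stop",))
        elif msg[0] == "result":
            results.update(msg[2])
            done += 1
    return results


def run_demo(proteins, n_workers=3):
    """Driver: start Workers, run the Manager, then shut Workers down."""
    inbox = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(f"worker{i}", inbox))
        for i in range(n_workers)
    ]
    for t in threads:
        t.start()
    results = manager(inbox, chunk(proteins))
    # Answer any late ready messages so every Worker exits cleanly.
    while any(t.is_alive() for t in threads):
        try:
            msg = inbox.get(timeout=0.1)
        except queue.Empty:
            continue
        if msg[0] == "ready":
            msg[2].put(("stop",))
    return results


if __name__ == "__main__":
    print(run_demo(["MKT", "MKTAY", "MA", "MKTAYIAK", "MKV", "M"]))
```

Because a Worker only receives a subtask after it reports ready, faster Workers naturally take more subtasks, which is the automatic load balancing the slide describes.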

Building an information system for biology is non-trivial

  • Molecular biology resources:

    • Are heterogeneous in content:

      • Genomics, proteomics, literature.

    • Exist in large numbers:

      • Public, commercial, organisational, personal.

    • Vary in quality: curated vs. automatic.

    • Have different interfaces: Web, SQL, SOAP, etc.

    • Are geographically distributed w/o yellow pages.

    • Store data in different formats - few standards.

    • Change rapidly.

    • Are subject to confidentiality & IPR protection.

    • Are too large to transport conveniently.

Technology challenges in building bioGrid

  • Semantic Complexity:

    • Computer does not “understand” data.

    • DBs and systems cannot inter-operate.

  • Computational complexity:

    • Generating a protein interaction map takes ca. 1 day.

    • Analysing large sets of gene expression data can take up to an hour.

    • Analysis of large text bodies is complex.

Social challenges in building Grid

  • Over-hype & scepticism.

  • Technology stability & reliability.

  • Security.

  • Usability.

  • Peer-reviewed results in major biomedical journals:

    • Science, Nature, Cell, BMJ, Lancet, etc.