Authors: Mario Cannataro 1 , Carmela Comito 2 , Filippo Lo Schiavo 1 , and

Proteus, a Grid based Problem Solving Environment (PSE) for Bioinformatics: Architecture and Experiments. Presenter: Michael Robinson. Agnostic: Javier Munoz. Advanced Topics in Software Engineering, CIS 6612, Florida International University.

Presentation Transcript


  1. Proteus, a Grid based Problem Solving Environment (PSE) for Bioinformatics: Architecture and Experiments. Presenter: Michael Robinson. Agnostic: Javier Munoz. Advanced Topics in Software Engineering, CIS 6612, Florida International University, July 31, 2006. Authors: Mario Cannataro (1), Carmela Comito (2), Filippo Lo Schiavo (1), and Pierangelo Veltri (1) (February 2004). (1) University of Magna Graecia of Catanzaro, Italy; (2) University of Calabria, Italy.

  2. Organization • Abstract • ~60% is about Bioinformatics • Proteus Architecture • First Test Implementation • Results of First Test • Conclusion and Future Work

  3. Abstract • Life sciences, Bioinformatics, Computer Science • Data file sizes • Computer power

  4. The Partners • What are the life sciences • What is Bioinformatics • Other sciences used in Bioinformatics • What is Computer Science

  5. Human Genome • The sum total of DNA in an organism is its genome. • The Human Genome Project (HGP), an international effort, began in October 1990; its completion is dated variously to 1999, 2003, and 2004 (http://www.pbs.org/wgbh/nova/genome/program.html). • Project goals were to: determine the complete sequence of the 3 billion DNA bases; identify all human genes; and make them accessible for further biological study.

  6. Human Genome • The bacterium E. coli and others were used to help develop the technology and interpret human gene function. • The Human Genome Project was sponsored by the U.S. Department of Energy and the U.S. National Institutes of Health (http://www.preventiongenetics.com/edu/genetics_nutshell.htm).

  7. DNA (ACGT) • Humans have from 10 to 100 trillion cells. • Each human cell has about 3 billion nucleotides. • We have approximately 30,000 genes. • Of the three billion letters of DNA that we have, only 1 to 1.5 percent is genes; the rest is "stuff". • The functions are unknown for over 50% of known genes.

  8. DNA (ACGT) Human Genome • ~3,000,000,000 DNA bases • ~30,000,000 bases in genes • ~2,970,000,000 "stuff" • Adenine (A) forms a base pair with thymine (T); guanine (G) forms a base pair with cytosine (C).
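
The pairing rule and the base counts on this slide can be checked with a few lines of Python. This is a minimal sketch, not part of the presentation: the function names are illustrative, the 1 percent gene fraction is the lower bound quoted on slide 7, and the ten-base fragment is the start of the listing on slide 12.

```python
# Watson-Crick pairing as stated on the slide: A pairs with T, G pairs with C.
PAIR = {"a": "t", "t": "a", "g": "c", "c": "g"}

def complement(dna):
    """Return the base-by-base complement of a DNA string."""
    return "".join(PAIR[base] for base in dna.lower())

def reverse_complement(dna):
    """Return the complementary strand read in its own 5'-to-3' direction."""
    return complement(dna)[::-1]

# Base counts from the slide, assuming the 1 percent gene fraction.
GENOME_BASES = 3_000_000_000
gene_bases = GENOME_BASES // 100
other_bases = GENOME_BASES - gene_bases

fragment = "gcttgtcccg"  # first ten bases of the listing on slide 12
print(complement(fragment))          # cgaacagggc
print(reverse_complement(fragment))  # cgggacaagc
print(gene_bases, other_bases)       # 30000000 2970000000
```

The reverse complement matters for the next slides, where the gene is annotated on the complementary strand.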

  9. Similarities to Human DNA

  10. The gene sizes • The largest known human gene is dystrophin, at 2.4 million bases. • Chromosome 21 is the smallest human chromosome. Three copies of this autosome cause Down syndrome, the most frequent genetic disorder associated with significant mental retardation. Academic groups from Germany and Japan mapped and sequenced it; it has 33,546,361 bp of DNA. Analysis of the chromosome revealed 127 known genes, 98 predicted genes, and 59 pseudogenes. • Smallest bacterial genome: Mycoplasma genitalium, 580 kbp.

  11. Bioinformatics • DNA, RNA, proteins, mutations, illnesses, medications, cloning

  12. DNA (ACGT) • Pseudomonas aeruginosa PAO1: 6,264,403 bases, 5565 genes • complement(6264226..6264360)
      6264181 gcttgtcccg gtcgaagtcc cgactcacca cccgtaccgg ataaatcaga cggtcagacg
      6264241 cttacggcct ttggcgcgac gacgcgacag aacctgacgg ccgttcttgg tggccatacg
      6264301 ggcgcggaaa ccgtggacgc gagcgcgctt gagggtgctg ggttggaaag tacgtttcat
      6264361 gattcggtac ctgggttgac gacttgaggt cgcagtgacc ccg
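
The listing on this slide uses the usual numbered layout: each line starts with the coordinate of its first base, followed by blocks of ten bases. A minimal sketch of how the annotated region 6264226..6264360 could be cut out of it is shown here; `parse_listing` and `slice_region` are hypothetical helper names, not functions from Proteus or any library.

```python
LISTING = """
6264181 gcttgtcccg gtcgaagtcc cgactcacca cccgtaccgg ataaatcaga cggtcagacg
6264241 cttacggcct ttggcgcgac gacgcgacag aacctgacgg ccgttcttgg tggccatacg
6264301 ggcgcggaaa ccgtggacgc gagcgcgctt gagggtgctg ggttggaaag tacgtttcat
6264361 gattcggtac ctgggttgac gacttgaggt cgcagtgacc ccg
"""

def parse_listing(text):
    """Return (coordinate of first base, sequence) from numbered lines."""
    start, chunks = None, []
    for line in text.strip().splitlines():
        number, *blocks = line.split()
        if start is None:
            start = int(number)  # coordinate of the very first base
        chunks.append("".join(blocks))
    return start, "".join(chunks)

def slice_region(start, sequence, first, last):
    """Extract the 1-based, inclusive coordinate range first..last."""
    return sequence[first - start : last - start + 1]

start, sequence = parse_listing(LISTING)
region = slice_region(start, sequence, 6264226, 6264360)

print(start + len(sequence) - 1)  # 6264403, the genome length on the slide
print(len(region))                # 135 bases, i.e. 45 codons
```

The region is still the forward strand at this point; the `complement(...)` wrapper in the annotation means the gene is read from the reverse complement of these 135 bases.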

  13. RNA • In RNA, thymine is replaced by uracil (U).
      DNA
      6264181 gcttgtcccg gtcgaagtcc cgactcacca cccgtaccgg ataaatcaga cggtcagacg
      6264241 cttacggcct ttggcgcgac gacgcgacag aacctgacgg ccgttcttgg tggccatacg
      6264301 ggcgcggaaa ccgtggacgc gagcgcgctt gagggtgctg ggttggaaag tacgtttcat
      6264361 gattcggtac ctgggttgac gacttgaggt cgcagtgacc ccg
      RNA
      6264181 gcuugucccg gucgaagucc cgacucacca cccguaccgg auaaaucaga cggucagacg
      6264241 cuuacggccu uuggcgcgac gacgcgacag aaccugacgg ccguucuugg uggccauacg
      6264301 ggcgcggaaa ccguggacgc gagcgcgcuu gagggugcug gguuggaaag uacguuucau
      6264361 gauucgguac cuggguugac gacuugaggu cgcagugacc ccg
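
The DNA-to-RNA rewrite shown on this slide is a plain letter substitution, which can be sketched in one line of Python. The function name `transcribe` is illustrative, and the sketch mirrors only the textual replacement of t by u on the slide, not the biology of which strand is actually transcribed.

```python
def transcribe(dna):
    """Replace every thymine (t) with uracil (u), as on the slide."""
    return dna.lower().replace("t", "u")

dna = "gcttgtcccg gtcgaagtcc"  # first two blocks of the slide's DNA listing
rna = transcribe(dna)
print(rna)  # gcuugucccg gucgaagucc
```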

  14. Amino Acids

  15. Proteins (sequences)
      DNA
      6264181 gcttgtcccg gtcgaagtcc cgactcacca cccgtaccgg ataaatcaga cggtcagacg
      6264241 cttacggcct ttggcgcgac gacgcgacag aacctgacgg ccgttcttgg tggccatacg
      6264301 ggcgcggaaa ccgtggacgc gagcgcgctt gagggtgctg ggttggaaag tacgtttcat
      6264361 gattcggtac ctgggttgac gacttgaggt cgcagtgacc ccg
      RNA
      6264181 gcuugucccg gucgaagucc cgacucacca cccguaccgg auaaaucaga cggucagacg
      6264241 cuuacggccu uuggcgcgac gacgcgacag aaccugacgg ccguucuugg uggccauacg
      6264301 ggcgcggaaa ccguggacgc gagcgcgcuu gagggugcug gguuggaaag uacguuucau
      6264361 gauucgguac cuggguugac gacuugaggu cgcagugacc ccg
      PROTEIN
      MKRTFQPSTLKRARVHGFRARMATKNGRQVLSRRRAKGRKRLTV
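
The protein on this slide can be reproduced from the DNA listing. The sketch below takes bases 6264226..6264360 of the listing (the region annotated as `complement(...)` on slide 12), forms the reverse complement, and translates it codon by codon. It assumes the standard genetic code; the names `REGION`, `CODON_TABLE`, and `translate` are illustrative and not taken from the presentation.

```python
from itertools import product

# Standard genetic code, codons enumerated in t, c, a, g order.
BASES = "tcag"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

# Forward-strand bases 6264226..6264360, copied from the slide's listing.
REGION = (
    "tcaga cggtcagacg "
    "cttacggcct ttggcgcgac gacgcgacag aacctgacgg ccgttcttgg tggccatacg "
    "ggcgcggaaa ccgtggacgc gagcgcgctt gagggtgctg ggttggaaag tacgtttcat"
).replace(" ", "")

def reverse_complement(dna):
    """Complement each base (a<->t, c<->g) and reverse the strand."""
    return dna.translate(str.maketrans("acgt", "tgca"))[::-1]

def translate(cds):
    """Translate codons into one-letter amino acids, stopping at a stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino = CODON_TABLE[cds[i:i + 3]]
        if amino == "*":  # stop codon ends the protein
            break
        protein.append(amino)
    return "".join(protein)

protein = translate(reverse_complement(REGION))
print(protein)  # MKRTFQPSTLKRARVHGFRARMATKNGRQVLSRRRAKGRKRLTV
```

The 135 bases give 44 amino acids plus one stop codon, matching the 44-letter protein on the slide.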

  16. Proteins: Pattern Matching G-H-E-X(2)-G-X(4,5)-[GA]
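
The pattern on this slide is written in PROSITE-style notation: elements are separated by hyphens, X stands for any amino acid, a parenthesised number or range is a repeat count, and square brackets list the allowed residues. A minimal sketch of how such a pattern could be turned into a Python regular expression follows. The converter `prosite_to_regex` is a hypothetical helper covering only the constructs used here plus `{...}` exclusions, and the test sequence is made up for illustration.

```python
import re

# One pattern element: a core (letter, x, [..] or {..}) plus optional (n) or (n,m).
ELEMENT = re.compile(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?")

def prosite_to_regex(pattern):
    """Convert a hyphen-separated PROSITE-style pattern to a regex string."""
    parts = []
    for element in pattern.split("-"):
        core, low, high = ELEMENT.fullmatch(element).groups()
        if core.lower() == "x":          # x matches any residue
            core = "."
        elif core.startswith("{"):       # {AB} means any residue except A or B
            core = "[^" + core[1:-1] + "]"
        if low:                          # repeat count (n) or range (n,m)
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)

regex = prosite_to_regex("G-H-E-X(2)-G-X(4,5)-[GA]")
print(regex)  # GHE.{2}G.{4,5}[GA]

sequence = "AAGHEKKGAAAAGTT"  # invented sequence, not from the presentation
match = re.search(regex, sequence)
print(match.start(), match.group())  # 2 GHEKKGAAAAG
```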

  17. Proteins: Structures • Chemical properties that distinguish the 20 different amino acids cause the protein chains to fold up into specific three-dimensional structures that define their particular functions in the cell

  18. Reality • Somewhere in this dense chemical forest are genes involved in deafness, Alzheimer's, cancer, cataracts, etc. But where? This is such a maze that scientists need a map. • Out of three billion base pairs in our DNA, just one single letter can make a difference.

  19. Data Locations • GenBank in the US, 1974 (http://www.ncbi.nlm.nih.gov/): 1997 = 1.26 gigabases, 2004 = 39 gigabases, 2005 = 100 gigabases • EMBL in England, 1980 (http://www.ebi.ac.uk/embl/) • DDBJ in Japan, 1984 (http://www.ddbj.nig.ac.jp/)

  20. Some Databases • The Swiss Institute of Bioinformatics maintains the following databases: Ashbya Genome Database, Cancer Immunome Database, Eukaryotic Promoter Database (EPD), GermOnline, MyHits, PROSITE, Swiss-Prot and TrEMBL, SWISS-2DPAGE, SWISS-MODEL Repository

  21. Specialization • PlasmoDB (http://www.plasmodb.org/plasmo/home.jsp): the parasitic eukaryote Plasmodium, the causative agent of the disease malaria. apibugz@delphi.pcbi.upenn.edu

  22. Proteus General Architecture

  23. Proteus’ Software Modules

  24. Some Taxonomies of the Bioinformatics Ontology

  25. Snapshot of the Ontology Browser

  26. Human Protein Clustering Workflow

  27. Snapshot of VEGA: Workspace 1 of the Data Selection Phase

  28. Software Installed in the Example Grid

  29. Snapshot of the Ontology Browser

  30. Snapshot of the Ontology Browser

  31. Snapshot of the Ontology Browser

  32. Snapshot of VEGA: Workspace 1 of the Pre-processing Phase

  33. Conclusions and Future Work • Execution Times of the Application

  34. References • In the paper, the authors cite 27 references.

  35. Questions Thank you
