
The Global Bio Grid

The Global Bio Grid. Virginia Center for Grid Research. Andrew Grimshaw University of Virginia January, 2006. Why Bio Grids? Grid Basics The Global Bio Grid. In ten years the world will be very different. Think back ten years. No web Wide-spread internet was new

Presentation Transcript


  1. The Global Bio Grid Virginia Center for Grid Research Andrew Grimshaw University of Virginia January, 2006

  2. Why Bio Grids? • Grid Basics • The Global Bio Grid

  3. In ten years the world will be very different.

  4. Think back ten years. • No web • Wide-spread internet was new • Human Genome Project still far from completion • Science (biology) done primarily in individual labs

  5. Today • Billions a year in e-commerce • Internet everywhere • Broadband to your home • Wireless becoming pervasive • Pervasive devices are proliferating – motes • Sequencing of organisms a daily event. Bioinformatics hitting the mainstream

  6. Tomorrow • $1000/sequence for humans – becomes standard clinical practice • “Biology is becoming an information science” (Large Scale Biomedical Science: Exploring Strategies for Future Research, Institute of Medicine, National Research Council, 2003) • Global interconnected networks – grids • Provide transparent, secure access to data, applications, and on-demand compute. • Research using not just your data, but all trusted data, not just your applications, but any trusted application. • Implications for progress are significant.

  7. There are a number of “catches” • So much data! • So many organizations with so little trust! • So much complexity!

  8. An IT guy’s view • Data is all over, in many different forms, with lots of different policies • Need to get the right data in the right place at the right time • Ontology problem – how do we compare and integrate the databases? • Need to understand semantics, automatically transform • Semantics • Knowledge Discovery – “mining”

  9. This is where grids enter the picture (we do the plumbing)

  10. Some lessons learned • 10+ years in academic and commercial grids • All/most problems are not technical • Users don’t want change! • Too many grids are technology-centric • Must keep “activation energy” low • Need a user-centric approach • There are at least four classes of users • Wide variance in computational savvy

  11. What is a Grid? A grid is all about gathering together resources and making them accessible to users and applications. A grid enables users to collaborate securely by sharing processing, applications, workflows and processes, and data across heterogeneous systems and administrative domains for collaboration, faster application execution, and easier access to data. The emphasis is on secure access to a wide variety of resources.

  12. Grid System: Characteristics of Grid systems • Numerous Resources • Ownership by Mutually Distrustful Organizations & Individuals • Connected by Heterogeneous, Multi-Level Networks • Different Security Requirements & Policies Required • Different Resource Management Policies • Potentially Faulty Resources • Geographically Separated • Resources are Heterogeneous

  13. Characteristics of a Grid system • Numerous Resources • Ownership by Mutually Distrustful Organizations & Individuals • Connected by Heterogeneous, Multi-Level Networks • Different Security Requirements & Policies Required • Different Resource Management Policies • Potentially Faulty Resources • Geographically Separated • Resources are Heterogeneous

  14. What grids are not • The solution to all problems • Clusters of machines • SETI@home • Any one particular technology

  15. Users view [Diagram labels: Users; Provide shared services; Access Data; Run programs; Collaborate; Grid; Site 0, Site 1, Site 2, Site 3; HPSS; Cluster]

  16. Grid Computing Scenarios (Legion Grid Software – Compute and Data Grid) • Partner Grids: multiple owners, sites, domains; multiple file systems; Internet connectivity • Campus/Enterprise Grids: multiple owners, domains; multiple file systems; WAN connection • Cluster Grids: single owner, department, project; single domain, file system; LAN connection • Desktop Cycle Aggregation: limited acceptance in commercial enterprises

  17. Standards • Global Grid Forum – ggf.org • OGSA – Open Grid Services Architecture • Web-Services based IPC • WSRF and possibly other • OGSA-BES – Basic Execution Service • OGSA-ByteIO – file IO • WS-Naming – abstract name to EPR • RNS-lite – Resource Name Space
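
The two naming bullets on the standards slide (a resource name space, and an abstract name that maps to an EPR) can be illustrated with a toy resolver. This is only a sketch of the idea, not the RNS or WS-Naming interfaces: the `NameSpace` and `EndpointReference` classes, the `urn:example:` name, and the `example.org` addresses are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class EndpointReference:
    """Hypothetical stand-in for an endpoint reference (EPR)."""
    abstract_name: str  # location-independent identity of the resource
    address: str        # where the resource can be reached right now


class NameSpace:
    """Toy hierarchical name space: path -> abstract name -> current address."""

    def __init__(self) -> None:
        self._root: Dict[str, object] = {}    # nested dicts model directories
        self._locations: Dict[str, str] = {}  # abstract name -> address

    @staticmethod
    def _split(path: str) -> List[str]:
        parts = [part for part in path.split("/") if part]
        if not parts:
            raise ValueError("path must name at least one entry")
        return parts

    def bind(self, path: str, abstract_name: str, address: str) -> None:
        """Register a resource under a human-readable path."""
        *directories, leaf = self._split(path)
        node = self._root
        for name in directories:
            child = node.setdefault(name, {})
            if not isinstance(child, dict):
                raise NotADirectoryError(name)
            node = child
        node[leaf] = abstract_name
        self._locations[abstract_name] = address

    def relocate(self, abstract_name: str, new_address: str) -> None:
        """Record that a resource moved; paths and abstract names stay valid."""
        if abstract_name not in self._locations:
            raise KeyError(abstract_name)
        self._locations[abstract_name] = new_address

    def resolve(self, path: str) -> EndpointReference:
        """Walk the path, then look up where the named resource lives now."""
        node: object = self._root
        for name in self._split(path):
            if not isinstance(node, dict) or name not in node:
                raise KeyError(path)
            node = node[name]
        if isinstance(node, dict):
            raise IsADirectoryError(path)
        return EndpointReference(abstract_name=node, address=self._locations[node])
```

The point of the indirection is that a user keeps working with the same path even after the resource moves to another site; only the address stored behind the abstract name changes.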

  18. The Global Bio Grid

  19. GBG concept • Federated access to multiple • Data sources • Public databases • Commercial databases • In-house databases, annotations, etc. • Application suites (including processes and workflows) • Compute resources • Shared among collaborative research teams • Multiple research locations • Virtual organizations • Built on evolving computing standards (GGF, I3C, WS-*)

  20. Global Bio Grid • Datagrid using Avaki DG technology • Working on ADG available free for “.edu” • UVA, NCBIO, U-Texas, Texas Tech • Already operational • Flat file and relational • Working on an OGSA-compliant implementation • Compute grid at UVA on-line • 64 dual-processor Opterons available • Sunfires • Hundreds of Windows machines • Legion 1.8 based – moving towards OGSA-compliant services • Applications • Biomarker • Searching PubMed • Hospital info integration

  21. Three resource classes illustrate the Grid-effect • Data • Processing • Applications

  22. Data • Suppose you have collaborators with critical databases (clinical, protein, other) that you need to use. • You use a number of databases that change on a regular basis. • You want to “mine” heterogeneous data sets (relational, flat-file, XML, …) in different locations – say in a hospital • Want to produce, consume, or share derivative data products, e.g., the result of a set of joins and data transformation steps. • This applies to business data (BI/EII) as well as life science data

  23. Data • DataGrid: Unifying fabric for data access • Transparent access to multiple DBs • Multiple domains • Highly-secure, flexible access control • Automatic cache management and coherence [Diagram labels: Public DB (PDB, NCBI, EMBL); SEQ_1, SEQ_2, SEQ_3; APP 1, APP 2; Biology; Biochemistry; Research Institution; Partner Institution (×2)]
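
To make the "automatic cache management and coherence" bullet concrete, here is a minimal sketch of one simple strategy: validate the local copy against the source's version on every read and refetch only when it has changed. The slides do not say how the data grid actually keeps caches coherent, so treat this as an assumption; `Source` and `CachedView` are hypothetical names.

```python
class Source:
    """Hypothetical remote database that bumps a version on every update."""

    def __init__(self, data: str) -> None:
        self._data = data
        self._version = 1
        self.reads = 0  # counts full transfers, to show what the cache saves

    def version(self) -> int:
        return self._version

    def read(self) -> str:
        self.reads += 1
        return self._data

    def update(self, data: str) -> None:
        self._data = data
        self._version += 1


class CachedView:
    """Local copy that is revalidated against the source before each use."""

    def __init__(self, source: Source) -> None:
        self._source = source
        self._cached_version = None
        self._cached_data = None

    def get(self) -> str:
        current = self._source.version()    # cheap check
        if current != self._cached_version:
            self._cached_data = self._source.read()  # expensive transfer
            self._cached_version = current
        return self._cached_data
```

With this shape a researcher always sees the latest release of a dataset without downloading it again when nothing has changed.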

  24. Three Concrete Examples • KDS – “data mining” on widely separated data sets such as PubMed. • “Map” UniProt datasets into data grid • Researchers no longer need to spend time downloading latest • Extended Hospital

  25. Extended Hospital [Diagram labels: HOSPITAL; Department Domain (×3), each with Data; Data Warehouse; Non-related Hospitals; Research; Authorized Family; Clinics / Large Practices; Emergency vehicles; Insurance companies]

  26. Processing • Classic high-throughput computing • Suppose you have thousands of computationally intensive jobs to run • SW, CHARMm, Sequest, a.out • Your usage is bursty – you need a lot over a short period of time, but often have idle resources • You wish you had more!

  27. Processing • Compute Grid: Shared access to processing • Flexible, location-independent access to virtually unlimited processing, on-demand • Scheduling, usage, management policies • System detects, recovers from job failures • Heterogeneous platform support • Usage accounting, as required [Diagram labels: Processing (Cluster 1, Cluster 2, Cluster N); Data; Public DB (PDB, NCBI, EMBL); SEQ_1, SEQ_2, SEQ_3; APP 1, APP 2; Biology; Biochemistry; Research Institution; Partner Institution (×2)]
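
The "detects, recovers from job failures" bullet can be sketched as a retry loop that moves a failed job to a different resource. This is an illustration under simplifying assumptions, not how Legion schedules work: jobs run one at a time, a resource is modeled as a plain callable that raises on failure, and `run_jobs` is a hypothetical name.

```python
from typing import Callable, Dict, Optional, Sequence

# A resource takes a job's input and returns its result, or raises on failure.
Resource = Callable[[str], str]


def run_jobs(
    jobs: Sequence[str],
    resources: Sequence[Resource],
    max_attempts: int = 3,
) -> Dict[str, str]:
    """Run independent jobs, retrying each failure on the next resource."""
    if not resources:
        raise ValueError("at least one resource is required")
    results: Dict[str, str] = {}
    for index, job in enumerate(jobs):
        last_error: Optional[Exception] = None
        for attempt in range(max_attempts):
            # Spread jobs round-robin; each retry shifts to the next resource.
            resource = resources[(index + attempt) % len(resources)]
            try:
                results[job] = resource(job)
                break
            except Exception as error:  # failure detected
                last_error = error
        else:
            raise RuntimeError(
                f"job {job!r} failed after {max_attempts} attempts"
            ) from last_error
    return results
```

A production scheduler would run jobs concurrently and apply site policies when choosing a resource; the sketch only shows the recovery path that keeps a faulty machine from losing work.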

  28. Concrete Examples • Biomarkers project wants to run Sequest-2 using public databases • CHARMm/Amber • Gnomad (Altman et al.) • BLAST, FASTA, … • Autodock

  29. Applications • Suppose you want to use applications or workflows developed, maintained, and supported by others – without the hassle of installing all of them on your gear. • Suppose you want to couple multiple applications developed at different institutions together.

  30. Applications • Grid users share applications, employing multiple data & processing resources • Flexible binary management • No need to recompile applications • Securely share applications • Restrict who gains access • Restrict where apps run [Diagram labels: Applications (APP 1, APP 2, APP N); Processing (Cluster 1, Cluster 2, Cluster N); Data; Public DB (PDB, NCBI, EMBL); SEQ_1, SEQ_2, SEQ_3, SEQ_N; Biology; Biochemistry; Research Institution; Partner Institution (×2)]
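
The two restrictions on shared applications (who gains access, and where apps run) amount to a policy check made before a job is placed. A minimal sketch, assuming a simple allow-list model that the slides do not specify; `AppPolicy`, `eligible_sites`, and the user and site names are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Sequence


@dataclass(frozen=True)
class AppPolicy:
    """Hypothetical sharing policy attached to one application."""
    allowed_users: FrozenSet[str]  # who may run the application
    allowed_sites: FrozenSet[str]  # where the owner permits it to run


def eligible_sites(
    policy: AppPolicy, user: str, candidate_sites: Sequence[str]
) -> List[str]:
    """Return the candidate sites where this user may run the application."""
    if user not in policy.allowed_users:
        return []  # access denied outright
    return [site for site in candidate_sites if site in policy.allowed_sites]


policy = AppPolicy(
    allowed_users=frozenset({"alice"}),
    allowed_sites=frozenset({"site0", "site2"}),
)
```

Here `eligible_sites(policy, "alice", ["site0", "site1", "site2"])` keeps only the sites the application owner has approved, and any user not on the list gets no sites at all.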

  31. Better Research, Faster • Secure, wide-area access to global breadth of consistent, current data • Access to vast processing power • Ability to securely share proprietary data and applications, as needed [Diagram labels: Applications (APP 1, APP 2, APP N); Processing (Cluster 1, Cluster 2, Cluster N); Data; Public DB (PDB, NCBI, EMBL); SEQ_1, SEQ_2, SEQ_3; Biology; Biochemistry; Research Institution; Partner Institution (×2)]

  32. Evolution in action [Timeline diagram labels: 50’s; 60’s to 80’s; Today; Now & Future!; Bare Metal Programming; Batch OS; Multi-User Timeshare; Low Level Network Programming; Grid & WS; Summary]

  33. Summary • Grids will have a huge impact on the life sciences • Prototype GBG operational • Applications are underway • We’re always looking for new applications
