Applications of Grids to biomedicine

Ignacio Blanquer, Vicente Hernández

Introduction and motivation: Objectives
• Describe some of the challenges faced when adapting biomedical applications to Grids, and the proposed solutions.
• Discuss open research lines and needs in the area of Healthgrids.
• Share requirements that are common to many other research areas.

Contents
• Introduction and motivation.
• Two significant areas:
  • Medical Imaging.
  • Bioinformatics.
• Conclusions.

The Grid and High Performance Computing Research Group
• A research group in the area of ICT and Computational Sciences, created in 1986 by Vicente Hernández and currently comprising 29 members (http://www.grycap.upv.es).
• The group started its activity in basic research on Numeric Computing.
• This research then evolved towards Parallel and Distributed Computing.
• Later, the group focused on Grid Technologies.
• Currently, the main target of the group's research activity is e-Science.
[Figure labels: Numeric Computing, Parallel Computing, Distributed Computing, Grid Technologies, Middleware, e-Infrastructures, e-Science; Medical Imaging, Proteomics, Photonics, e-Government, Epidemiology, Bioinformatics, Engineering.]

The Grid and High Performance Computing Research Group
• The GRyCAP has participated in 19 European projects, acting as coordinator in 10 of them (7 of them in the biomedical sector).
• The GRyCAP is integrated in the Institute for Applications of Advanced Information and Communication Technologies (ITACA) and in the Biomedical Research Network Centre (CRIB).
• Vicente Hernández is currently the Scientific Coordinator of the Spanish Network for e-Science.

What is e-Science?

e-Science as “enhanced” Science
• “e-Science is about global collaboration in key areas of science, and the next generation of infrastructure that will enable it.”
• “e-Science will change the dynamic of the way science is undertaken.”
John Taylor, Director General of Research Councils, Office of Science and Technology

e-Science as a new view of science
• Large challenges with large social impact.
• The famous “data deluge”.
• Virtual laboratories as an alternative to experimental ones.
• Leading role of simulation.
• Interdisciplinarity.
• Virtual Research Communities.
[Figure labels: revolution in science & engineering, research & education; networking, grids, instrumentation, computing, data curation…; value added of distributed collaborative research (virtual communities); application pull; technology push. Source: Mario Campolargo, former Head of Unit on e-Infrastructures, European Commission.]

… and what is biomedicine?

The Biomedical Research Community
• Biomedicine integrates many different disciplines related to health, life sciences and biochemistry.
• It comprises the storage, management and processing of data related to the physiology and structure of living beings.
• The biomedical community is therefore wide and heterogeneous, and faces many different challenges.

Technological needs in Biomedicine
Technologies:
• HT / HP Computing
• Service Integration
• Data storage and Integration
• Authorization and Security
• Collaborative Work
Disciplines:
• Bioinformatics
• Medical Imaging
• Biomedical Simulation
• Epidemiology
Grids & e-Science 2009, 15-19/6/2009, Santander

Challenges in biomedicine
• Increasing computing requirements
  • Genetics, Genomics, Proteomics, Metagenomics, etc.
  • Medical Imaging post-processing.
  • Biomedical simulation (VPH).
• Data storage and integration
  • Integrate distributed storage resources.
  • Efficient storage for read-only bulk data.
• Authorization and security
  • Virtual Organizations.
  • Encrypted storage.
• Collaborative work
  • A natural way to develop distributed applications on common and distributed data.

Healthgrid association
• International non-profit organisation: http://www.healthgrid.org
• Its objective is to contribute to structuring the European Research Area on Health.
• It creates partnerships to benefit higher education and research centres in the field of Grid and distributed computing for health, in a broad sense and in an international context.
• Coordinators of the White Paper on Health Grids: http://www.healthgrid.org/download.php
• Organisers of a yearly conference
  • Last edition (7th) was in Paris in 2010.

SHARE Project
• SHARE: “Supporting and structuring Healthgrid Activities & Research in Europe: Developing a Roadmap”.
• A European project (www.eu-share.org) aimed at developing a roadmap that sets out the objectives, current situation, deficiencies, barriers, opportunities, milestones and stakeholders for the adoption of Grids in health.
• The roadmap covers technological, legal, ethical and economic issues.
• Available at www.healthgrid.org

Do we have tools to tackle these challenges?

E-Infrastructures
• E-Infrastructures support e-Science.
• They are built on top of the scientific networks, with the aim of sharing resources to enable the sharing of data, applications and scientific results.
• They are not only a matter of technology: they also require the definition of policies and agreements.
[Figure labels: Sharing and federating scientific data; Sharing computers, instruments and applications; Linking at the speed of light.]

• 240 sites
• 45 countries
• 80,000 CPUs
• 5 PetaBytes
• >5000 users
• >100 VOs
• >100,000 jobs/day
• Archaeology
• Astronomy
• Astrophysics
• Civil Protection
• Computational Chemistry
• Earth Sciences
• Finance
• Fusion
• Geophysics
• High Energy Physics
• Life Sciences
• Biomedicine
• Multimedia
• Material Sciences
• …
32%
Source: R. Jones, EGEE Dir.; EGEE Bioinformatics Activity, Dr. C. Blanchet (Resp.)

The Biomed VO in EGEE
• 253 registered users.
• Authorised for an important share of the available resources (around 25%).
• 8 million jobs and 25 million CPU hours since accounting started.

E-Infrastructures: ES-NGI
• National (Spanish) Grid Infrastructure.
• Coordinated by the CSIC-IFCA (Isabel Campos).
• Evolving to become an ICTS.
• Currently comprising 5725 cores shared by 10 Spanish centres.
• Supporting 7 thematic virtual organisations.
• In close cooperation with Portugal.

Significant Use Cases: Medical Imaging

Challenges in Medical Imaging
• Large content provider
  • By the end of 2010 it is expected to use 30% of global storage [1].
• Intensive computing processing.
• Critical data: privacy
  • Privacy management has been a major concern in highly distributed systems such as Grids [2].
• Evidence-based medicine
  • Need for storing and organising knowledge.
  • Need for computer-aided post-processing workflows.
[1] Fact sheet: Information Storage Trends, IBM.
[2] SHARE HealthGrid Roadmap.

Challenges in Medical Imaging
• Medical images are large, so post-processing is computationally intensive and in many cases exceeds the resources of hospitals.
• Key information in medical images can be difficult to observe, even for trained specialists.
• Training is mainly based on evidence.
• The data produced yearly in a medium-sized hospital is on the order of terabytes, so efficient organisation of the data for research is difficult.
• Data is stored in a distributed way, but consolidated access is difficult or non-existent.
• Privacy is a key issue when dealing with patient data, and even more so with medical images.

Valencian Cyberinfrastructure for Oncological Medical Imaging (Ciberinfraestructura Valenciana para Imagen Médica Oncológica)
• Platform developed within the project “Creación de una Ciberinfraestructura para el Aprendizaje, Investigación y Estudio Epidemiológico del Cáncer Mediante Imágenes Médicas” (Creation of a Cyberinfrastructure for Learning, Research and Epidemiological Study of Cancer through Medical Images).
• Coordinated by the UPV with the participation of 5 hospitals from the Comunidad Valenciana and British Telecom.
• The platform uses TRENCADIS, developed by the UPV, to organise medical data by virtual communities.
• It enables the creation and search of structured reports and associated images.

Technical Design Principles: Towards a Grid Environment for Processing and Sharing DICOM Objects (TRENCADIS)
• Open and standard software architecture, based on the Web Service Resource Framework (WSRF) and implemented using Globus 4.
• Integrates different local storages of DICOM objects from several centres.
• Different storage resources are virtualized behind a common interface.
• It organizes data by communities.
• Indexing is performed locally; central services simply keep references to the storages where information relevant to each community is stored.

TRENCADIS Middleware: Data Model (Semantic Organisation)
• Users organise themselves into Virtual Communities.
• From the studies available, only those matching the selection criteria of the Virtual Community profile are accessible.
• From the images available to a Virtual Community, a user can create an experiment with the studies matching a set of restrictions.
• From this experiment, more detailed views can be obtained.
• The selection criteria for the relevant information rely on the DICOM tags of the images and on the Structured Reports.
Levels, from most general to most specific:
• Global Database: all the images and reports shared.
• User Community (e.g. Paediatric Neuroimaging).
• Experiment (e.g. Neuroblastoma).
• View (e.g. patients between 1 and 2 years with heterogeneous supratentorial findings).
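The successive narrowing described on this slide (global database, community, experiment, view) can be pictured as a chain of filters. The following is only an illustrative sketch: the `Study` class, its fields and the sample records are hypothetical stand-ins for DICOM tags and structured-report fields, and nothing here reflects the actual TRENCADIS interfaces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Study:
    # Hypothetical, simplified stand-in for DICOM tags and report fields.
    study_id: str
    specialty: str
    diagnosis: str
    patient_age_years: float
    finding: str


def select(studies, predicate):
    """Return the studies that satisfy the predicate, preserving order."""
    return [s for s in studies if predicate(s)]


# Invented sample data playing the role of the global database.
GLOBAL_DB = [
    Study("s1", "Paediatric Neuroimaging", "Neuroblastoma", 1.5,
          "heterogeneous supratentorial"),
    Study("s2", "Paediatric Neuroimaging", "Neuroblastoma", 4.0,
          "homogeneous infratentorial"),
    Study("s3", "Paediatric Neuroimaging", "Medulloblastoma", 1.2,
          "heterogeneous supratentorial"),
    Study("s4", "Breast Imaging", "Carcinoma", 52.0, "spiculated mass"),
]

# Each level is a filter applied to the level above it.
community = select(GLOBAL_DB, lambda s: s.specialty == "Paediatric Neuroimaging")
experiment = select(community, lambda s: s.diagnosis == "Neuroblastoma")
view = select(
    experiment,
    lambda s: 1 <= s.patient_age_years <= 2
    and "heterogeneous supratentorial" in s.finding,
)

print([s.study_id for s in view])
```

Because each level only ever sees what the level above exposes, a view can never return a study that falls outside the community profile.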

TRENCADIS software architecture
[Figure: layered architecture diagram. Labels recovered from the figure: Applications; Components; Server Services; Server Mw; CORE Mw; Fabric; Virtual Storage; DICOM-SR Storage; Processing; Plug-in; Communic.; Informat.; Grid; Ontolog.; Volume Rendering; DICOM Storage; Image Downl.; Ontology Server; DCM Package; https; VOMS Server; GridFtp; DICOM Gateway; SQL Gateway; Execution Service; Keys Server; SSL; DICOM Workstation; SQL; PACS; Computing Farm; Code Server.]

Conventional Report vs. Structured Report
Conventional report:
  Observation of an irregular spiculated mass of 7 mm maximum diameter, located on the inter-quadrant inferior line of the left breast. The lesion comprises heterogeneous, linear, grained and branched microcalcifications. Apparently a malignant lesion.
  Free text. No structure is defined or agreed, and the report is not associated with images.
Structured report:
  Malignancy
    Based on: Mass
      Size: 7 mm
      Shape: Irregular
      Margins: Spiculated
    Associated Calcification
      Type: Heterogeneous
      Type: Branched
      Distribution: Grouped
  Structure and lexicon defined and agreed. Associated with images.

Increase of Expressivity
Problem: retrieve the report IDs of all patients that have a staging value of II with a heterogeneous secondary lesion.
Relational (SQL) query:
  SELECT Patients.SRID
  FROM (Patients INNER JOIN Stages ON Patients.PID = Stages.PID)
       INNER JOIN Lesions ON (Stages.PID = Lesions.PID) AND (Stages.IDS = Lesions.IDS)
  WHERE ((Stages.IDS = "StageII") AND (Lesions.IDL = "2") AND (Lesions.CONTENT = "Heterogeneous"))
Query over the structured report:
  Find /root/STAGE_II/Findings/Lesion_2 'like (CONTENT, "Heterogeneous")'
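To illustrate why the path-based form is more compact, here is a minimal sketch of how such a lookup could be evaluated over reports held as nested dictionaries. The function `find`, the sample reports and the reading of `like` as a case-insensitive substring match are all assumptions made for this example; the slide does not describe how the real query engine works.

```python
def find(reports, path, field, pattern):
    """Return IDs of reports whose node at `path` has `field` containing `pattern`.

    `path` uses the slash-separated form shown on the slide; `like` is
    interpreted here as a case-insensitive substring test (an assumption).
    """
    keys = [k for k in path.split("/") if k]
    matches = []
    for report_id, tree in reports.items():
        node = {"root": tree}
        for key in keys:
            node = node.get(key) if isinstance(node, dict) else None
            if node is None:
                break
        if isinstance(node, dict) and pattern.lower() in str(node.get(field, "")).lower():
            matches.append(report_id)
    return matches


# Invented sample reports, keyed by a hypothetical report ID.
reports = {
    "SR-001": {"STAGE_II": {"Findings": {
        "Lesion_1": {"CONTENT": "Homogeneous"},
        "Lesion_2": {"CONTENT": "Heterogeneous"},
    }}},
    "SR-002": {"STAGE_II": {"Findings": {
        "Lesion_2": {"CONTENT": "Homogeneous"},
    }}},
    "SR-003": {"STAGE_III": {"Findings": {
        "Lesion_2": {"CONTENT": "Heterogeneous"},
    }}},
}

result = find(reports, "/root/STAGE_II/Findings/Lesion_2", "CONTENT", "Heterogeneous")
print(result)
```

The hierarchy of the report replaces the joins of the relational query: walking the path is the join.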

Fetching DICOM-SR Objects

Security model
• Authorisation and authentication
  • Users require X.509 certificates and use proxies for single sign-on.
  • Roles and groups in VOMS credentials define access permissions.
• Privacy
  • Every transaction uses secure protocols.
  • Data is encrypted in Grid stores to prevent access by users with administrative privileges.
  • Encryption keys are fragmented, and the fragments are replicated across key stores.
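The slide does not say which scheme is used to fragment the encryption keys. As one possible illustration, the sketch below uses simple XOR splitting, in which every fragment is needed to rebuild the key, and places each fragment on two key stores. The function names, the store names and the replication factor are hypothetical.

```python
import secrets
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def split_key(key: bytes, n_fragments: int) -> list:
    """Split `key` into n fragments; all of them are needed to rebuild it."""
    if n_fragments < 2:
        raise ValueError("need at least two fragments")
    random_parts = [secrets.token_bytes(len(key)) for _ in range(n_fragments - 1)]
    # The last fragment is the key XORed with every random fragment.
    last = reduce(xor_bytes, random_parts, key)
    return random_parts + [last]


def join_key(fragments: list) -> bytes:
    """Rebuild the key by XORing all fragments together."""
    return reduce(xor_bytes, fragments)


key = secrets.token_bytes(32)      # e.g. a 256-bit symmetric key
fragments = split_key(key, 3)

# Replicate each fragment on two of three hypothetical key stores,
# so that losing one store does not lose any fragment.
placement = {
    i: [f"keystore-{(i + r) % 3}" for r in range(2)]
    for i in range(len(fragments))
}

print(join_key(fragments) == key)
```

With this layout no single key store holds all three fragments, so an administrator of one store cannot rebuild the key alone. A threshold scheme such as Shamir's secret sharing would be an alternative if rebuilding from a subset of fragments were required.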

Patient-based stores
• Commercial providers
  • AccelaRAD: seemyradiology.com
  • eMiX: www.eMix.com
  • Candelis: www.candelis.com

Significant Use Cases: Bioinformatics

Current users of e-Infrastructures
• Accustomed to working over the Internet
  • GenBank, PDB, SWISSPROT, KEGG, NCBI portal.
• Users of a wide set of common tools
  • BLAST, ClustalW, ePCR, SAGE, EMBOSS, Phylip, ...
• Open-source and open-data community.
• Currently a significant computing-intensive community
  • >5-10% of the European HPC resources.
  • >5% of the EGEE infrastructure (largest non-LHC community).

Interest in metagenomic analysis is increasing
• Metagenomic analysis is the study of samples containing the genomes of several different organisms.
• The Genomic Standards Consortium has set up a working group called the M5 (metagenomics, metadata, meta-analysis, multiscale models, and meta-infrastructure).
• It aims to improve data sharing and interoperability in metagenomics studies.
• This implies standardising annotation and tools, so that workflows can be set up on computational infrastructures at large.

One relevant area is metagenomics
[…] in two years. At current database sizes all-versus-all comparisons are already impossible without a supercomputer. Development of more efficient algorithms will help, but this will not solve the basic problem of too little computing power. Individual access to supercomputers or cloud computing would help, at least temporarily.

Computing needs
• Annotating large metagenome samples requires a large amount of computational resources, beyond what sequential approaches can provide.
• Computing the BLAST homology search of the Sargasso Sea metagenome against the GenBank non-redundant (NR) database involves:
  • 811,372 sequences in the Sargasso Sea sample.
  • 7,166,228 sequences in the NCBI NR.
  • 1.4 CPU-years.
  • Around 60 GB of secondary storage.
• However, the process is highly parallel, which can significantly reduce the response time.

Design of experiments
• Selection of resources.
• Replication of the reference databases close to the computing elements to be used.
• Execution of a single job per block.
• Collection of results.

Data partitioning
• Predictability is fundamental for making decisions about efficient scheduling and resubmission of jobs
  • Otherwise, it is difficult to know whether a job is stuck or simply running slowly.
  • Resources whose wall-time limit cannot be guaranteed to suffice can be excluded from an experiment.
• The choice of partitioning strategy and block size has a strong impact on performance
  • The execution time depends on the size of the input data and the size of the reference database.
  • However, the benchmarking information published by the resources is not reliable enough.

Data partitioning
[Figure: % completed jobs versus time (minutes) for two partitioning strategies, blocks per size and blocks per sequences. Results obtained with a partitioning schema of 610 blocks of approximately equal size.]
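The two partitioning strategies named in the figure can be sketched as follows. This is one possible reading, not the authors' implementation: "blocks per sequences" is taken to mean an equal number of sequences per block, and "blocks per size" an approximately equal total length per block, built here with a greedy longest-first heuristic chosen for the example. The sequence lengths are invented.

```python
def partition_by_count(seq_lengths, n_blocks):
    """Contiguous blocks holding (almost) the same number of sequences."""
    base, extra = divmod(len(seq_lengths), n_blocks)
    blocks, start = [], 0
    for i in range(n_blocks):
        end = start + base + (1 if i < extra else 0)
        blocks.append(seq_lengths[start:end])
        start = end
    return blocks


def partition_by_size(seq_lengths, n_blocks):
    """Blocks of roughly equal total length.

    Greedy heuristic: take sequences from longest to shortest and put each
    one into the block that is currently the lightest.
    """
    blocks = [[] for _ in range(n_blocks)]
    totals = [0] * n_blocks
    for length in sorted(seq_lengths, reverse=True):
        i = totals.index(min(totals))
        blocks[i].append(length)
        totals[i] += length
    return blocks


# Invented example: two long sequences followed by six short ones.
lengths = [900, 800, 100, 100, 100, 100, 100, 100]

by_count = partition_by_count(lengths, 2)
by_size = partition_by_size(lengths, 2)

print([sum(b) for b in by_count])  # total length per block, equal counts
print([sum(b) for b in by_size])   # total length per block, balanced sizes
```

If execution time grows with input size, as the previous slide states, balanced block sizes should give more uniform job durations, which in turn makes it easier to tell a slow job from a stuck one.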

Obtained results
Alternative: send redundant jobs.

Obtained results
• Running the experiment on a single computer would have been impractical
  • It would have required 512 days, and the results would have been out of date.
• The experiment on the Grid took 13 days
  • Crunching factor of 40.
• The factor is much larger when considering only part of the experiment (90% in 7 days)
  • Crunching factor of 80.
• If every job had completed successfully, the speed-up at 90% could have risen to a factor of 140.
• Important focus on reducing failure rates.

Quality of the phylogenetic annotation of bacteria
• Comparative phylogenetic experiment on a soil sample against different releases of the GenBank NR database.
• Many of the associations of sample fragments to biological families have changed, even recently.
• The rate of change does not decrease over time; in many cases it increases.
• This reveals that the complete diversity of such communities is not yet sufficiently well described in current databases.

Using different infrastructures simultaneously: cross-comparison among biological families

Objective
• Taxonomic and functional characterisation of a bacterial population from the digestive tract of Ostrinia nubilalis.
• Comparison between individuals bred in captivity and those collected directly in the field.

Material and methods
• Execution
  • blastx (blastall 2.2.22), expected value: 0.01, blast hits: 5000.
• Reference database
  • nr_nEuk_Pstipitis_Scerevisiae_Dhansenii.fna (8.1 GB formatted with formatdb).
  • It consists of the nr database plus the genomes of Pichia stipitis, Saccharomyces cerevisiae S288c and Debaryomyces hansenii CBS767.
• Input sequences
  • 538,480 sequences with a size of 229 MB and an average length of 363.45 nucleotides.
  • Obtained through 454 pyrosequencing (Titanium).

Results
• Submission using the three available e-Infrastructures
  • EGEE (VO: biomed), EELA (VO: prod.vo.eela-eu.eu), NGI (VO: vo.blast.es-ngi.eu).
• 600 jobs, split by size (390 KB each), with an average duration of 13.11 hours.
• Distributed according to the availability of resources:
  • EGEE (240), EELA (120), NGI (240).
• Results
  • Estimated sequential time: 6113 hours (8 months).
  • Makespan: 5 days and 10 hours.
  • Approximate speed-up: 47×.
  • Percentage of resubmission: 18%.
  • Total output: 26 GB.
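As a quick check of how the reported figures relate to each other, the speed-up can be recomputed from the estimated sequential time and the makespan. The sketch below uses only numbers stated on this slide; the variable names are illustrative.

```python
# Figures taken from the slide.
sequential_hours = 6113                  # estimated sequential time
makespan_hours = 5 * 24 + 10             # 5 days and 10 hours
jobs_per_infrastructure = {"EGEE": 240, "EELA": 120, "NGI": 240}

# Speed-up is sequential time divided by wall-clock time on the Grid.
speedup = sequential_hours / makespan_hours

# Sequential time expressed in days, for comparison with "8 months".
sequential_days = sequential_hours / 24

# Share of the jobs sent to each infrastructure.
total_jobs = sum(jobs_per_infrastructure.values())
shares = {name: n / total_jobs for name, n in jobs_per_infrastructure.items()}

print(f"makespan: {makespan_hours} h")
print(f"speed-up: {speedup:.1f}x")
print(f"sequential time: {sequential_days:.0f} days")
print(shares)
```

The recomputed ratio, 6113 / 130, agrees with the reported speed-up of about 47×, and the three job counts add up to the 600 jobs submitted.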

Experiment progression

Experiment throughput

But things are getting worse…

Or even worse!
