Presentation Transcript


  1. Data Integration in Grid Environments Alex Poulovassilis, Birkbeck, U. of London

  2. Lecture Overview Part I: • Grid Computing • Grid Architectures • Grid Standards Part II: • The ISPIDER project – a grid application in Bioinformatics • The AutoMed project • OGSA-DAI and OGSA-DQP middleware • DAI/DQP/AutoMed interoperability in ISPIDER

  3. Part I • Grid Computing • Grid Architectures • Grid Standards

  4. What is Grid Computing? • the term first arose in the mid 1990s and it is also known as utility computing • the world is full of computing resources connected by networks, but their distribution, heterogeneity and autonomy make it hard for such resources to be shared • the development of grid computing has been motivated by the need for flexible, secure and coordinated resource sharing to solve large-scale computing problems • this resource sharing is between dynamic collections of individuals, institutions and resources, which collectively form a Virtual Organisation (VO)

  5. What is Grid Computing? • resource sharing includes shared usage of hardware, software, data resources, sensor networks, etc. • this sharing is necessary in order to solve in a collaborative fashion large scale computing problems arising in science, engineering and business • the sharing is controlled, with providers of resources defining what may be shared and under what conditions

  6. How did grid computing come about? • arose in academia (e-science), with Ian Foster and Carl Kesselman leading the development of the Globus toolkit • was then picked up by industry e.g. SUN, IBM, HP, Oracle • driving forces were: (a) computationally-intensive scientific problems e.g. simulations (b) scientific problems involving huge quantities of data e.g. analysis of large data sets • leading to so-called Computational Grids and Data Grids • grid computing may be viewed as an extension of the WWW (information sharing) to sharing of general computing resources

  7. The international grid community • the Global Grid Forum (GGF) was formed in 2000 as a community of researchers, users and vendors aiming to exchange ideas on grid development and deployment, and to develop specifications for grid standards • the Enterprise Grid Alliance (EGA) was formed in 2004 as a non-profit vendor organisation to develop grid computing in industry • GGF and EGA merged in 2006 into the new Open Grid Forum • there has been much funding of grid research and development in the EU, regionally, and nationally over the past decade

  8. How is grid different from distributed computing? • in grid computing, the focus is on large-scale resource sharing, requiring authentication, authorisation, resource discovery, resource scheduling and costing • in grid applications, presentation services access the functionality provided by service-oriented grid middleware, which virtualises the dynamic deployment of the actual computing resources • by contrast, in client-server or multi-tier applications, the presentation, application and back-end services and resources are separated, but are fixed and their deployment is known

  9. What is a Grid Architecture? • was originally envisaged as being organised into layers: • Application • Collective: see next slide • Resource: protocols for initiation, monitoring and control of the computing and data resources • Connectivity: communication and authentication protocols • Fabric: the physical computing and data resources • the components in each layer share common characteristics and build on the services provided by lower layers • the Resource and Connectivity services can be implemented over lower-level resources at the Fabric layer • and can in turn support a range of higher-level services at the Collective and Application layers

  10. Collective Layer • comprises global protocols and services that capture interactions across collections of resources e.g. • directory services to search for resources by name or attributes such as type, availability, load • allocation, scheduling and brokering services to request allocation of one or more resources for a specific task, and scheduling of tasks on resources • monitoring and diagnostic services to monitor the execution of tasks • workload management systems - for specifying and executing workflows consisting of multiple tasks • accounting and payment services gathering resource usage information

  11. Subsequent developments in Grid Architectures • no longer a layered architecture but a service oriented architecture (SOA) e.g. the Open Grid Services Architecture • services are loosely coupled peers that can interact with each other to achieve a given capability e.g. • a service may extend the capabilities of another service in order to provide its own functionality • a service may compose the capabilities of other services to provide higher-level functionality

  12. This leads to a 3-tier Architecture • Applications • Service pool • Resources • the service pool is the grid middleware • below it are the actual physical resources • above it are the applications which access the service pool • the location and nature of the actual resources is transparent to the applications • thus the grid middleware allows resource virtualisation • applications use services as and when needed • and pay only for this usage, hence also the term utility computing

  13. Different types of grids • departmental grids: • built on clusters or groups of clusters owned by one department of an enterprise • enterprise grids: • sharing of common resources by many departments of an enterprise • partner grids: • involving several partner institutions, known to each other and with common goals • open grids: • anyone can join and become a resource provider and/or resource user

  14. Status of grid computing today • numerous commercial departmental and enterprise grids are in operation today • e.g. for drug discovery, stock market trading, integrated circuit design, enterprise resource planning • also many international partner grids • e.g. in high energy physics, life sciences, design and engineering, computational chemistry, astrophysics, earth sciences • software to support open grids is also emerging

  15. Grid Standards • Numerous Grid products are available from various organisations and vendors • these need to be able to interoperate, and this is made possible by the development of standards e.g. the Open Grid Services Architecture (OGSA) • OGSA is based on Web Services • web standards of relevance to the grid include: HTTP (transport), XML (data format), SOAP (message syntax), WSDL (web service definition), UDDI (web service registry), WS-Security, BPEL (workflow definition)

  16. Grid Standards • a key area in which Grid requirements have motivated new WS standards is in the representation and manipulation of state • standard WSs are stateless from the point of view of the requester of the service • OGSA assumes that service interfaces and service behaviours are defined as in WSRF (the Web Services Resource Framework) • WSRF defines how state should be modelled, accessed and managed; how services should be grouped; and how faults should be modelled • Also, WS-Notification defines notification mechanisms that support subscription to, and notification of, changes to services and to their state

  17. OGSA vs other Web Service environments • apart from state, the other major difference of OGSA compared with other WS environments is that grid environments are not static: • in contrast to standard WSs, grid services can be created and deployed dynamically • the set of available resources and their load at any time may be highly variable, while still requiring application requirements and SLAs (service-level agreements) to be met • failures to meet SLAs or occurrences of faults may require dynamic restart of executions on alternative resources • there is thus a need for monitoring of grid applications and for responding dynamically to their needs until completion

  18. Implementing Grid services • associated with a Grid Service is a set of Service Data Elements (SDEs) • these are XML documents and represent information about grid service instances, allowing their discovery and management • each Grid Service “port type” has an associated set of SDEs • different types of Grid Service are realised by providing different sets of port types
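
A minimal Java sketch of the SDE/port-type idea is given below. The class names and the example SDE contents are purely illustrative (they are not the Globus/OGSA interfaces); the point is only that a grid service instance is described by the XML service data attached to its port types, which is what makes discovery and management possible.

    // Conceptual sketch only: illustrative types, not the actual OGSA/Globus API.
    // A grid service exposes one or more port types; each port type carries a set
    // of Service Data Elements (SDEs), XML fragments describing the service
    // instance so that clients and registries can discover and manage it.
    import java.util.List;
    import java.util.Map;

    public class SdeSketch {

        /** An SDE is a named XML document about the service instance. */
        record ServiceDataElement(String name, String xml) {}

        /** A port type groups operations with the SDEs that describe them. */
        record PortType(String name, List<ServiceDataElement> serviceData) {}

        /** A grid service is realised by the particular set of port types it offers. */
        record GridService(String handle, Map<String, PortType> portTypes) {}

        public static void main(String[] args) {
            PortType gds = new PortType("GridDataService", List.of(
                new ServiceDataElement("databaseSchema",
                    "<schema><table name=\"proteinhit\"/></schema>"),
                new ServiceDataElement("supportedActivities",
                    "<activities><sqlQuery/><xslTransform/><deliverToURL/></activities>")));

            GridService service = new GridService(
                "http://example.org/gds/instance-42", Map.of(gds.name(), gds));

            // A registry or client could discover the service by inspecting its SDEs.
            service.portTypes().get("GridDataService").serviceData()
                   .forEach(sde -> System.out.println(sde.name() + ": " + sde.xml()));
        }
    }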

  19. Background Reading for Part I • Grid Cafe, http://www.gridcafe.org/ • The EGEE project, www.eu-egee.org/ • Worldwide LHC (Large Hadron Collider) Computing Grid, http://lcg.web.cern.ch/LCG/public/ • Open Grid Forum, www.ogf.org • Globus toolkit, www.globus.org/alliance/publications/

  20. Part II • The ISPIDER project – a grid application in Bioinformatics • The AutoMed project • OGSA-DAI and OGSA-DQP middleware • DAI/DQP/AutoMed interoperability in ISPIDER

  21. The ISPIDER Project • Partners: Birkbeck, EBI, Manchester, UCL • Requirements: • There are vast amounts of heterogeneous proteomics data being produced via a variety of new techniques • Proteomics is the study of the protein complement of the genome • It is targeted at the elucidation of biological function from genomic data • There is a need for interoperability between autonomous proteomics data resources • And for complex analyses over integrated virtual resources

  22. Biological data: Genes → Proteins → Biological Function (diagram, adapted from Nigel Martin’s Lecture Notes on Bioinformatics). Genome: DNA sequences of 4 bases (A, C, G, T), the permanent copy of a gene. RNA: a temporary copy of the DNA sequence. Protein: the product, a sequence of 20 amino acids (each triple of RNA bases encodes an amino acid). Proteins carry out biological processes and hence biological function.

  23. Aims of ISPIDER • Hence, the development of a Proteomics Grid Infrastructure, using existing proteomics resources and developing new ones; also developing new proteomics clients for querying, visualisation, workflow etc. • The development of such a system is beneficial for a number of reasons: • Access to more data sources yields more reliable analyses • Integrating resources increases the breadth of information available for the biologist • Enables new analyses to be undertaken which would have been prohibitively difficult or impossible with just the individual resources

  24. ISPIDER Architecture

  25. Some ISPIDER data resources • gpmDB See http://gpmdb.thegpm.org • a publicly available database with more than 2 million proteins and almost 470,000 unique peptide identifications • provides access to a wealth of peptide identifications from a range of different laboratories and instruments • PEDRo http://pedrodb.man.ac.uk:8080/pedrodb • provides access to a collection of descriptions of experimental data sets in proteomics • PepSeeker http://nwsr.smith.man.ac.uk/pepseeker • developed as part of the ISPIDER project and targeted at the identification stage of the proteomics pipeline • currently holds over 50,000 proteins and 50,000 unique peptide identifications

  26. myGrid / DQP / AutoMed Middleware • myGrid: provides a workflow environment over web/grid services, allowing high-level integration of data and applications for in-silico experiments in biology • OGSA-DQP: provides distributed query processing over Grid enabled data resources • AutoMed: provides heterogeneous data integration functionality over distributed data sources (the AutoMed project partners are Birkbeck and Imperial College) • ISPIDER research: integration of AutoMed and DAI/DQP (topic of this lecture); also integration of AutoMed and myGrid workflows

  27. Motivation for AutoMed • Data Integration (DI) is the process of creating an integrated resource which • combines data from a variety of autonomous data sources • in order to support new queries and analyses • the data sources may be heterogeneous in terms of their: • data model, query interfaces, query processing capabilities, database schema or data exchange format, data types used, nomenclature adopted • this poses several challenges, and has led to a variety of methodologies, architectures and systems being developed to support DI • these aim to abstract out data transformation and aggregation logic from application programs into generic data integration software

  28. AutoMed • Supports a metamodel, the Hypergraph Data Model (HDM), in terms of which higher-level modelling languages can be defined – so extensible with new modelling languages • After a modelling language has been specified in terms of the HDM, a set of primitive schema transformations become available for schemas expressed in that language • Schemas can be incrementally transformed and integrated by applying to them a sequence of primitive transformations • Schemas may or may not have data associated with them: so virtual, materialised (data warehousing) or hybrid integration can be supported • Transformations are accompanied by queries, allowing data and query translation between source and target schemas
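
To make the idea of transformation pathways concrete, here is a small Java sketch; the Transformation record and the way the pathway is built are hypothetical simplifications rather than the actual AutoMed class library, and the IQL fragments are modelled on the examples used later in this lecture.

    // Illustrative sketch of a schema transformation pathway, assuming a hypothetical
    // representation (this is NOT the actual AutoMed API). Each primitive step adds,
    // removes or renames one schema construct and carries an IQL query defining its
    // extent in terms of constructs that already exist in the schema.
    import java.util.ArrayList;
    import java.util.List;

    public class PathwaySketch {

        record Transformation(String kind, String construct, String iqlQuery) {}

        public static void main(String[] args) {
            List<Transformation> pathway = new ArrayList<>();

            // Grow the source schema towards the global schema, one primitive step at a time.
            pathway.add(new Transformation("add", "<<Protein>>",
                "[id2lsid [`pedro.protein:', toString d] | {d} <- <<protein>>]"));
            pathway.add(new Transformation("add", "<<Protein,accession_number>>",
                "[{id2lsid [`pedro.protein:', toString d], x} | {d,x} <- <<protein,accession_num>>]"));
            // Constructs with no counterpart in the global schema are contracted away.
            pathway.add(new Transformation("contract", "<<protein,raw_file>>", "Void"));

            // Because every step records a query, the pathway can be traversed in either
            // direction to translate queries and data between source and global schemas.
            pathway.forEach(t ->
                System.out.printf("%s %s with %s%n", t.kind(), t.construct(), t.iqlQuery()));
        }
    }

Since the pathway carries queries rather than copied data, it can support the virtual style of integration used in ISPIDER, where queries on the integrated schema are reformulated and pushed down to the sources.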

  29. AutoMed Architecture (diagram). Components: Wrappers over the Distributed Data Sources; Schema and Transformation Repository; Model Definitions Repository; Model Definition Tool; Schema Transformation and Integration Tools; Schema Evolution Tool; Global Query Processor; Global Query Optimiser.

  30. Global Query Processing in AutoMed • We handle query language heterogeneity by translation into/from an intermediate query language, IQL • A query Q expressed in a high-level query language such as SQL on a global schema S would first be translated into IQL • For example, the following IQL query on a global schema retrieves all identifications for the protein with accession number ENSP00000339074: [id | {id,an} <- <<Protein,accession_number>>; an=`ENSP00000339074'] • View definitions are then derived from the transformation pathways between S and the data source schemas (in this case gpmDB, PEDRo and PepSeeker) • These view definitions are substituted into Q, reformulating it into an IQL query over source schema constructs

  31. Global Query Processing in AutoMed (cont’d) • E.g. for Q as above the reformulated query is:
    [id | {id,an} <-
            [{id2lsid [`pepseeker.proteinhit:', toString d], x} |
               {d,x} <- distinct [{k,x} | {k,x} <- <<proteinhit,ProteinID>>]]
         ++ [{id2lsid [`pedro.protein:', toString d], x} |
               {d,x} <- <<protein,accession_num>>]
         ++ [{id2lsid [`gpmdb.proseq:', toString d], x} |
               {d,x} <- <<proseq,label>>];
         an=`ENSP00000339074']

  32. Global Query Processing (cont’d) • Query optimisation then occurs • One goal of this is to generate the largest possible sub-queries that can be submitted to data source Wrappers for translation into the data source query languages and evaluation by the data sources • Query evaluation then follows, during which the AutoMed Evaluator submits to Wrappers sub-queries that they are able to translate into the data source query language (currently, AutoMed supports wrappers for SQL, OQL, XPath, XQuery and flat-file data resources) • The Wrappers submit sub-queries to data sources, and translate sub-query results back into the IQL type system • The Evaluator then undertakes any further necessary query evaluation to combine sub-query results
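
The division of labour between the Evaluator and the Wrappers described above can be sketched as a simple interface; the names below are illustrative only (not the real AutoMed wrapper classes) and the IQL-to-SQL translation is a placeholder, but the contract is the one just described: decide whether a sub-query can be pushed down, translate it, and run it at the source.

    // Minimal sketch of the wrapper role, with illustrative names (not the actual
    // AutoMed wrapper API). A wrapper advertises which IQL sub-queries it can handle,
    // translates them into the source's query language, and returns results that the
    // Evaluator then maps into the IQL type system and combines further.
    import java.util.List;

    public class WrapperSketch {

        /** The part of query processing that a concrete wrapper must provide. */
        interface Wrapper {
            boolean canTranslate(String iqlSubQuery);     // can this sub-query be pushed down?
            String toSourceLanguage(String iqlSubQuery);  // e.g. IQL -> SQL
            List<String> execute(String sourceQuery);     // run it at the data source
        }

        /** A toy SQL wrapper, showing the shape of the contract only. */
        static class ToySqlWrapper implements Wrapper {
            public boolean canTranslate(String iql) { return iql.contains("<<"); }
            public String toSourceLanguage(String iql) {
                // A real translation walks the IQL comprehension; here we just pretend.
                return "SELECT ProteinID FROM proteinhit";
            }
            public List<String> execute(String sql) { return List.of("ENSP00000339074"); }
        }

        public static void main(String[] args) {
            Wrapper w = new ToySqlWrapper();
            String sub = "[{k,x} | {k,x} <- <<proteinhit,ProteinID>>]";
            if (w.canTranslate(sub)) {
                // The Evaluator combines such sub-query results with any remaining IQL evaluation.
                System.out.println(w.execute(w.toSourceLanguage(sub)));
            }
        }
    }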

  33. OGSA-DAI and OGSA-DQP • OGSA-DAI (Data Access and Integration) • delivers data access, transport and metadata services for the grid • there are other OGSA services that focus on data derivation, consistency and replication services • OGSA-DQP (Distributed Query Processing) • provides services for the compilation, optimisation and distributed evaluation of queries over grid data resources accessed via OGSA-DAI

  34. OGSA-DAI functionality • provides a consistent interface to data resources regardless of the underlying technology e.g. relational (Oracle, DB2, MySQL) or XML (Xindice, eXist) • OGSA-DAI extends standard Grid Services with several new port types, including: • Grid Data Service (GDS)

  35. OGSA-DAI functionality • Grid Data Service (GDS): • accepts requests, in the form of XML documents, instructing the Grid Service instance to interact with a database in order to create, retrieve, update or delete data • its primary operation is perform, through which such requests are passed to the service • a request may consist of a collection of linked activities e.g. a data access, followed by a data translation, followed by a data delivery • all of these can be bundled into one request in order to reduce the number of round trips required between the client and the service, as sketched below
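
The effect of bundling linked activities into a single perform document can be sketched as follows; the helper record and element names are hypothetical (they do not reproduce the actual OGSA-DAI request schema or client toolkit), but they show why one composite request replaces several client/service round trips.

    // Hypothetical sketch of a composite perform request: a query activity, a
    // transformation activity and a delivery activity bundled into one XML document.
    // Element names are illustrative, not the real OGSA-DAI schema.
    import java.util.List;

    public class PerformRequestSketch {

        record Activity(String name, String body) {}

        static String toPerformDocument(List<Activity> activities) {
            StringBuilder xml = new StringBuilder("<performRequest>\n");
            for (Activity a : activities) {
                xml.append("  <").append(a.name()).append(">")
                   .append(a.body())
                   .append("</").append(a.name()).append(">\n");
            }
            return xml.append("</performRequest>").toString();
        }

        public static void main(String[] args) {
            List<Activity> request = List.of(
                new Activity("sqlQueryStatement",
                    "SELECT * FROM proteinhit WHERE ProteinID = 42"),
                new Activity("xslTransform", "rowset-to-csv.xsl"),
                new Activity("deliverToURL", "ftp://client.example.org/results.csv"));

            // One round trip to the GDS instead of three separate interactions.
            System.out.println(toPerformDocument(request));
        }
    }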

  36. OGSA-DQP functionality OGSA-DQP implements the GDS and GDT port types from OGSA-DAI and also adds two new port types: GDQS and GQES. Grid Distributed Query Service (GDQS): • can interact with known registries to obtain the schemas of data resources and also information about computational resources • this set-up phase occurs once in the lifetime of a GDQS instance • clients can then submit a query to the GDQS via the GDS port type, using a perform call • the query is compiled, optimised and partitioned into a distributed query execution plan, each of whose partitions will be scheduled for execution at a different GQES (see below) • the GDQS then creates the necessary GQES instances on their designated execution nodes, and hands over to each GQES the partition assigned to it (a sketch of such a partitioned plan is given below)
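
A minimal sketch of a partitioned distributed query execution plan is given below; the classes, node names and operators are hypothetical (not the OGSA-DQP implementation), but they illustrate what the GDQS hands to each GQES instance.

    // Hypothetical data structure only: after compilation and optimisation, the GDQS
    // splits the plan into partitions and assigns each partition to a GQES instance
    // on some execution node.
    import java.util.List;

    public class PlanPartitionSketch {

        record PlanPartition(String gqesNode, List<String> operators) {}

        public static void main(String[] args) {
            List<PlanPartition> plan = List.of(
                new PlanPartition("gqes://nodeA", List.of(
                    "scan(<<proteinhit,ProteinID>>)", "exchange -> nodeC")),
                new PlanPartition("gqes://nodeB", List.of(
                    "scan(<<protein,accession_num>>)", "exchange -> nodeC")),
                new PlanPartition("gqes://nodeC", List.of(
                    "union", "filter(an = 'ENSP00000339074')", "deliver -> client")));

            // The GDQS would create one GQES instance per partition and hand it its operators.
            plan.forEach(p -> System.out.println(p.gqesNode() + " executes " + p.operators()));
        }
    }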

  37. OGSA-DQP functionality Grid Query Evaluation Service (GQES): • each GQES instance is an execution node in a distributed query execution plan • it is responsible for that part of the execution plan allocated to it by the GDQS • it implements a physical algebra over other Grid Data Services • encapsulated within these other GDSs are the data resources whose schemas were imported during the GDQS set-up phase

  38. DAI/DQP/AutoMed Interoperability • Data sources are wrapped with OGSA-DAI • AutoMed-DAI wrappers extract the data sources’ metadata • Semantic integration of the data sources is expressed using AutoMed transformation pathways • IQL queries submitted to an integrated schema are reformulated into IQL queries on the data sources, using the transformation pathways • The reformulated queries are submitted to OGSA-DQP for evaluation (rather than being evaluated by AutoMed itself)

  39. The AutoMed-DAI Wrapper • The AutoMed-DAI wrapper requests the schema of the data source using an OGSA-DAI service • The service replies with the source schema encoded in an XML response document • The AutoMed-DAI wrapper then creates the corresponding schema in the AutoMed repository

  40. The AutoMed-DQP Wrapper • The AutoMed-DQP wrapper undertakes two tasks: • it informs AutoMed of the subset of IQL that it is capable of translating into OQL • it makes interactions with OGSA-DQP transparent to the remainder of the AutoMed infrastructure • On receiving an IQL query, the AutoMed-DQP wrapper first translates it into the equivalent OQL query • The OQL query is then sent to OGSA-DQP for evaluation • The reply from OGSA-DQP is in the form of an XML response document containing the query results • The AutoMed-DQP wrapper translates these results into the IQL type system, and returns the result to AutoMed's evaluator for any further necessary evaluation (see the sketch below)
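
A compact sketch of these two tasks follows; the method names and the IQL-to-OQL translation are hypothetical stand-ins rather than the actual AutoMed-DQP wrapper code, but they trace the round trip from an IQL query, through OGSA-DQP, and back into the IQL type system.

    // Hypothetical sketch of the AutoMed-DQP wrapper's round trip (illustrative names only).
    import java.util.List;

    public class DqpWrapperSketch {

        /** Which fragment of IQL this wrapper promises to translate (toy check only). */
        static boolean supports(String iql) {
            return iql.startsWith("[") && iql.contains("<-");
        }

        /** Toy IQL -> OQL translation; a real wrapper walks the IQL syntax tree. */
        static String toOql(String iql) {
            return "select p.ProteinID from proteinhit p where p.accession = 'ENSP00000339074'";
        }

        /** Stand-in for the call to OGSA-DQP, which replies with an XML result document. */
        static String submitToDqp(String oql) {
            return "<results><row><ProteinID>42</ProteinID></row></results>";
        }

        /** Map the XML rows back into values of the IQL type system (here, just strings). */
        static List<String> toIqlValues(String xml) {
            return List.of("{42}");
        }

        public static void main(String[] args) {
            String iql = "[id | {id,an} <- <<proteinhit,ProteinID>>; an = `ENSP00000339074']";
            if (supports(iql)) {
                List<String> result = toIqlValues(submitToDqp(toOql(iql)));
                System.out.println(result);  // returned to AutoMed's evaluator for further processing
            }
        }
    }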

  41. Background Reading for Part II • OGSA Version 1.0 document, January 2005 • “Service-Based Distributed Querying on the Grid” by Alpdemir et al., Proc. of the 1st International Conference on Service Oriented Computing, 2003, pp 467-482 • “The design and implementation of grid database services in OGSA-DAI” by Antonioletti et al., Concurrency - Practice and Experience, Vol 17, No 2-4, 2005, pp 357-376
