Data Management in a Grid Environment - theory and practical examples

Presentation Transcript

  1. N+N meeting Australia 2003, e-Science Centre, Kerstin Kleese van Dam
     Data Management in a Grid Environment - theory and practical examples
     Kerstin Kleese van Dam et al., CCLRC e-Science Centre
     k.kleese@dl.ac.uk
     http://www.e-science.clrc.ac.uk

  2. Council for the Central Laboratory of the Research Councils (CCLRC)
     • One of Europe's largest research support organisations, providing large-scale experimental, data and computing facilities, primarily to the UK research community in both academia and industry.
     • Annually supports around 12000 scientists from all major scientific domains.
     • 1800 members of staff over three sites:
         • Rutherford Appleton Laboratory in Oxfordshire
         • Daresbury Laboratory in Cheshire
         • Chilbolton Observatory in Hampshire
     • Large quantities of data associated with the various facilities. Houses 1 World Data Centre, 3 National Data Centres and a range of community-based data services.
     • http://www.cclrc.ac.uk

  3. CCLRC e-Science Centre
     • Early involvement in e-Science (from 1999 Data Grid / WOS onwards).
     • Centre established in 2000; since 2001 with direct governmental funding, plus additional funding through participation in other projects.
     • Currently housing the UK Grid Support Centre (together with Manchester and Edinburgh) and the BBSRC Grid Support Centre.
     • Involved in DataGrid, GridPP, AstroGrid and NERC DataGrid.
     • Currently 40 permanent members of staff, 10 in the data management group.
     • http://www.escience.clrc.ac.uk

  4. Data Management Group

  5. Current e-Science Projects of the Data Management Group
     Working on collaborations with partners inside CCLRC, the UK and internationally:
     • CLRC DataPortal
     • Integration of ISIS and BADC operational Data Catalogues
     • Environment from the Molecular Level
     • NERC DataGrid
     • e-Science Technologies for the Simulation of Complex Materials
     • Extensions of the Storage Resource Broker (SRB) together with SDSC
     • Earth Science Portal Project
     • Database service for CCLRC and related e-Science projects

  6. Data Management

  7. Your personal e-Science interface, wherever you are.
     Currently the scientist has to take care of his data, providing the binding link between different areas of work. In the future we hope that e-Science technologies will provide scientists with a more helpful environment …

  8. Issues
     • Data capture from instruments and computers
     • Data storage
     • Annotating data
     • Data discovery
     • Association of data with appropriate applications
     • Conversion of data from one application to the other
     • Merging of data from different sources

  9. Data capture from instruments and computers
     In a Grid environment, scientists will ultimately have little control over where their experiment or calculation is carried out and, therefore, where their data will be.
     • Capture data
     • Capture information about the environment
     • Direct where output goes

  10. Data Capture from Experimental Facilities (1)
     • Instruments produce varying amounts of data, ranging from small (e.g. temperature readings at a station) to large (e.g. LHC with several Tbytes per second).
     • Each instrument produces data in its own format, often incompatible with anything else.
     • Most facilities provide their own short-term storage, but will neither annotate nor manage the data.
     • The collection of environmental information is often limited; much of the information is still recorded in lab notebooks.
     • Correction values or error margins related to the instrument are not linked to the collected data.

  11. Data Capture from Experimental Facilities (2) - Requirements
     • Generalised description of data format (possible standardisation for instruments of the same type).
     • Automatic capture of environment information, including instrument scientists if necessary.
     • Automatic linking of data about the environment and the raw data produced by the instrument.
     • Automatic insertion of both types of data into an interim or final data repository.
     • Automatic linking of the donated data to existing related information, e.g. proposal, other experiments of the same project.

  12. Data Capture from Experimental Facilities (3) - Examples
     ICAT - CLRC ISIS Catalogue http://www.isis.rl.ac.uk/dataanalysis
     • Collection of raw data from the instrument, detector-specific information for this experiment, etc.
     • Integrate raw data with original proposal information and log files of the instrument scientists.
     • Finally integrated with other facility data within and outside CCLRC via instances of the CCLRC DataPortal software.
     See also: Comb-e-Chem - http://www.combechem.org

  13. Data Storage
     The Grid environment provides access to a multitude of storage systems, often hiding the type of system behind service interfaces.
     • Where is the data?
     • How can I manage it?
     • On which media is my data (access time)?
     • How can it be accessed?
     • Where are replicas of my data?

  14. Data Storage (2) - Requirements
     • Easy overview of where your data is on the Grid
     • Support to manage your data (transfers/replicas)
     • Access and access control to your data wherever it is
     • Support to share your data
     Two possible solutions:
     • Globus Data Management tools - example ESG http://www.earthsystemsgrid.org
     • Storage Resource Broker (SRB) from the San Diego Supercomputer Center http://www.npaci.edu/DICE/SRB

  15. Typical Analysis Scenario and the use of Storage Resource Managers (SRM)
     [Diagram labels: client's site, clients, logical query, Metadata catalog, Replica catalog, logical files, Request Interpreter, request planning, Request Executer, site-specific files, Network Weather Service, pinning & file transfer requests, network, HRM, DRM, tape system, Disk Cache]
     • Metadata Catalogue for data discovery within one Virtual Organisation.
     • Replica Catalogue keeps track of all replicas of specific datasets within one Virtual Organisation.
     • The Network Weather Service helps to plan the fastest access routes to the data.
     • Request goes out to Disk and Hierarchical Storage Resource Managers.
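     The request-planning step on this slide can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `Replica` type, the in-memory catalogue, the site names and the bandwidth table (standing in for Network Weather Service forecasts) are hypothetical and are not the interfaces of any real SRM, replica catalogue or NWS implementation.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Replica:
    """One physical copy of a logical file (hypothetical structure)."""
    site: str
    physical_path: str
    size_mb: float


# Hypothetical replica catalogue: logical file name -> known physical replicas.
REPLICA_CATALOGUE: Dict[str, List[Replica]] = {
    "climate/run42.nc": [
        Replica("site-a", "/tape/run42.nc", 800.0),
        Replica("site-b", "/disk/cache/run42.nc", 800.0),
    ],
}

# Hypothetical bandwidth forecasts in MB/s, standing in for a network weather service.
BANDWIDTH_MB_PER_S: Dict[str, float] = {"site-a": 10.0, "site-b": 40.0}


def estimated_transfer_seconds(replica: Replica, bandwidth: Dict[str, float]) -> float:
    """Estimate the transfer time of a replica from its size and forecast bandwidth."""
    return replica.size_mb / bandwidth[replica.site]


def choose_replica(logical_name: str,
                   catalogue: Dict[str, List[Replica]],
                   bandwidth: Dict[str, float]) -> Replica:
    """Resolve a logical file name and pick the replica with the shortest estimated transfer."""
    replicas = catalogue.get(logical_name, [])
    if not replicas:
        raise KeyError(f"no replica registered for {logical_name!r}")
    return min(replicas, key=lambda r: estimated_transfer_seconds(r, bandwidth))


best = choose_replica("climate/run42.nc", REPLICA_CATALOGUE, BANDWIDTH_MB_PER_S)
print(best.site, best.physical_path)
```

     A real planner would also have to account for tape staging and pinning, which this sketch leaves out.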

  16. Storage Resource Broker (1)
     • Professional data storage management system, initially developed in the mid 90's by the San Diego Supercomputer Center. http://www.npaci.edu/DICE/SRB/
     • Current version supports many platforms and authentication methods.
     • Web services interfaces.

  17. Storage Resource Broker
     • Device Interface Modules for a wide range of platforms; easy to extend to new systems.
     • SRB External Interface Modules: MySRB (web based), command line interface, C and Fortran APIs. Password and certificate authorisation.
     • MCAT links logical to physical data locations and tracks replicas and versions. MCAT can be run on a variety of relational databases.
     • Integrated access to data on PC, UNIX, Linux, DB and tape store.
     • http://www.npaci.edu/dice/srb/mySRB/mySRB.html
     • Also used in the BIRN project http://www.nbirn.net/

  18. [Screenshot annotations]
     • Functions including ingestion, movement and replication of data
     • Providing access to data for others
     • Version of data
     • Type of data
     • Replica or original data
     • Physical data location and type of resource

  21. Biomedical Informatics Research Network

  22. Annotating Data
     Data without further information is only of short and very limited use.
     • Information about the data itself
     • Information about the where, why, who and when
     • Information about the environment in which the data was captured
     • Related information
     Example: CLRC Scientific Metadata Schema http://www.e-science.clrc.ac.uk/Activity/ACTIVITY=DataPortal;SECTION=5;

  23. Diversity: Users & Searches
     [Diagram labels: Discovery, Excavation; Experimenter, Data curator, General community, Wider science community, Specialist user]

  24. General Scientific Metadata
     [Diagram labels: Science Metadata Model; Social Science, ISIS, SRS, HEP, Space Science, Earth Science]
     • A generic metadata model for all scientific applications, with specialisation for each domain
     • Can answer questions across domains
     • Can answer questions about specific domains

  25. CLRC DataPortal - Scientific Metadata Model
     The Metadata Object consists of:
     • Topic: keywords providing an index on what the study is about.
     • Study Description: provenance about what the study is, who did it and when.
     • Access Conditions: conditions of use, providing information on who can access the data and how.
     • Data Description: detailed description of the organisation of the data into datasets and files.
     • Data Location: locations providing navigation to where the data on the study can be found.
     • Related Material: references into the literature and community, providing context about the study.
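     The six parts of the Metadata Object on this slide could be modelled roughly as follows. This is only a sketch: the field names, types and the example record are assumptions chosen for illustration and do not reproduce the actual CLRC Scientific Metadata Schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MetadataObject:
    """Illustrative container mirroring the six parts named on the slide."""
    topic: List[str]             # keywords indexing what the study is about
    study_description: str       # provenance: what the study is, who did it, when
    access_conditions: str       # who can access the data and how
    data_description: str        # organisation of the data into datasets and files
    data_locations: List[str]    # where the data on the study can be found
    related_material: List[str] = field(default_factory=list)  # literature, context

    def matches_keyword(self, keyword: str) -> bool:
        """Case-insensitive keyword lookup against the topic index."""
        return keyword.lower() in (k.lower() for k in self.topic)


# Hypothetical example record; all values are invented for illustration.
record = MetadataObject(
    topic=["zeolite", "neutron scattering"],
    study_description="Example study, carried out by an example group in 2002",
    access_conditions="Registered facility users only",
    data_description="One dataset containing three raw files",
    data_locations=["srb://example-zone/study-1/raw"],
)
print(record.matches_keyword("Zeolite"))
```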

  26. Data Discovery
     Most data is currently 'discovered' by word of mouth from friends and colleagues, or by sheer luck.
     • Discovery
     • Browsing
     • Selection
     • Comparison
     • Access
     Example: CLRC DataPortal http://esc.dl.ac.uk:9000/index.html

  27. Different Levels of Metadata supporting Discovery and Selection
     • A-Metadata: can be derived from the data itself
     • B-Metadata: a summary of all other types of metadata
     • C-Metadata: all related metadata, papers, pictures, related studies
     • D-Metadata: user-provided information on what, who, what and when

  28. CLRC DataPortal
     The DataPortal currently allows access to selected metadata and data from four facilities, the first three housed by CLRC:
     • The Synchrotron Radiation Department (SRD)
     • The Neutron Spallation Source (ISIS)
     • The British Atmospheric Data Centre (BADC)
     • Max-Planck Institute for Meteorology (MPIM)
     You will be able to access the available data via the basic search. If you are not one of our partners but would like to try the system, you can use one of our test accounts: log in using 'dpuser' as both username and password.
     http://esc.dl.ac.uk:9000/index.html

  29. DataPortal Architecture
     • The major functions of the DataPortal (DP) are grouped into modules. Each module has a grid services interface to communicate with the other DP services and, in some cases, with outside services such as the Visualisation or HPC Portal. The SOAP protocol is used for communication and WSDL to describe the various services.
     • We do not change any local metadata system, but use our own wrappers to translate our general query format into the local syntax. Replies from the resources are XML files compliant with the CLRC Scientific Metadata Format (http://www-dienst.rl.ac.uk/library/2002/tr/dltr-2002001.pdf).
     • The UK e-Science Grid CA provides Globus X.509 certificates for the UK e-Science community. The CA is located at RAL and is run as part of the Grid Support Centre funded by the Research Councils' Core e-Science programme (http://www.grid-support.ac.uk/).
     • Implementing the core modules as grid services allows the DataPortal to be a truly distributed application and allows several instances of the DataPortal to be logically combined, thus extending any user query.
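     The wrapper idea described on this slide (one general query, translated per facility into local syntax, with replies returned as XML) might look something like the sketch below. All names are hypothetical: the `GeneralQuery` fields, the two local syntaxes and the `<reply>`/`<study>` elements are assumptions for illustration and are not the DataPortal's real query format or the CLRC Scientific Metadata Format.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class GeneralQuery:
    """Hypothetical facility-independent query."""
    keyword: str
    discipline: str


def wrap_for_sql_catalogue(query: GeneralQuery) -> Tuple[str, Tuple[str, str]]:
    """Translate into a parameterised SQL statement for a relational catalogue (assumed schema)."""
    sql = "SELECT study_name FROM studies WHERE keyword = ? AND discipline = ?"
    return sql, (query.keyword, query.discipline)


def wrap_for_keyvalue_catalogue(query: GeneralQuery) -> str:
    """Translate into a key=value request string for a different local system (assumed syntax)."""
    return f"discipline={query.discipline};keyword={query.keyword}"


def build_reply(facility: str, study_names: List[str]) -> str:
    """Package local results as a small XML reply (element names are illustrative only)."""
    root = ET.Element("reply", facility=facility)
    for name in study_names:
        ET.SubElement(root, "study").text = name
    return ET.tostring(root, encoding="unicode")


query = GeneralQuery(keyword="zeolite", discipline="chemistry")
print(wrap_for_sql_catalogue(query))
print(wrap_for_keyvalue_catalogue(query))
print(build_reply("facility-1", ["study A", "study B"]))
```

     The point of the design is that the local metadata systems stay untouched; only the thin translation layer is facility-specific.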

  30. General CLRC DataPortal Architecture
     [Diagram: the CLRC DataPortal Server, linked to other instances of the CLRC DataPortal Server, talks to Facility 1 ... Facility N. Each facility has an XML wrapper in front of its local metadata and local data.]

  31. DataPortal Architecture (2)
     [Diagram components: DataPortal Web Interface, Session Management, Authentication & Authorisation, Certification Authority, Service Look Up, Query & Reply, Shopping Cart, DataPortal Permanent Repository, Data Transfer, External Data File Store(s), Facility Administration, Facilities XML Wrappers, Facilities Access Control]
     • Access the DataPortal either via the Web Interface or via Web Services interfaces, e.g. Query and Reply.
     • Authenticate and authorise the user by checking certificate validity and checking with associated facilities for general access rights.
     • Query generation and selection of suitable facilities to query. Farm out the query to the selected facilities in parallel, then collect and collate the results.
     • As well as interacting with the DataPortal via the Web Interface, users can also run queries by directly calling the Query & Reply service, assuming that they are properly authenticated. Other services are also externally visible, for example the Shopping Cart.
     • The Shopping Cart allows registered users to permanently store and annotate pointers to the external data files and data sets. Put interesting data in your personal, permanent Shopping Cart, which you can share with others as required.
     • Use the Data Transfer Service to send your data on to a chosen application or service.
     • Facility Administration allows external facilities to advertise their grid services to the DataPortal.
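     The "farm out in parallel, then collect and collate" step can be sketched with Python's standard thread pool. The facility names, the in-memory catalogues and the function names are all hypothetical stand-ins; the real Query & Reply service is a SOAP-based grid service and is not shown here.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

# A facility query takes a keyword and returns the names of matching studies.
FacilityQuery = Callable[[str], List[str]]


def make_fake_facility(catalogue: Dict[str, List[str]]) -> FacilityQuery:
    """Build a stand-in for a remote facility wrapper backed by an in-memory dict."""
    def run_query(keyword: str) -> List[str]:
        return catalogue.get(keyword, [])
    return run_query


def query_facilities(facilities: Dict[str, FacilityQuery], keyword: str) -> Dict[str, List[str]]:
    """Send the same query to every selected facility in parallel and gather the replies."""
    with ThreadPoolExecutor(max_workers=max(1, len(facilities))) as pool:
        futures = {name: pool.submit(fn, keyword) for name, fn in facilities.items()}
        return {name: future.result() for name, future in futures.items()}


def collate(results: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Merge per-facility replies into one sorted list of (facility, study) pairs."""
    return sorted((name, hit) for name, hits in results.items() for hit in hits)


# Hypothetical facilities and contents, invented for illustration.
facilities = {
    "facility-1": make_fake_facility({"ozone": ["study-12"]}),
    "facility-2": make_fake_facility({"ozone": ["study-7", "study-3"]}),
}
merged = collate(query_facilities(facilities, "ozone"))
print(merged)
```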

  32. [Screenshot annotations]
     • Choose facilities of interest
     • Select discipline and reduce search field

  36. [Screenshot annotations]
     • Annotate your search results
     • Specific services associated with this data
     • Forgotten where your data came from?

  37. Association of data with appropriate applications
     Scientists will need to be able to link to all their favourite applications for analysis, simulation and visualisation, but they also need to be informed about other suitable programs.
     • Suitable applications
     • Correct format
     • Suitable for your environment
     • Availability

  38. HPC Grid Services Portal
     This is a pilot project funded by the CLRC e-Science Centre to develop a Web portal to search for resources and submit HPC applications to a computational Grid in the UK. It will form the basis of application portals for the UK e-Science Grid and "thematic Grids", e.g. for NERC DataGrid and HPCI Consortia. The project is a collaboration with the San Diego Supercomputer Center, who have developed the GridPort portal and HotPage software for the NPACI HPC Grid, and with the University of Lecce, Italy, who have developed the Grid Resource Broker.
     http://esc.dl.ac.uk/HPCPortal/

  39. HPC Grid Services Portal
     Provides a portal for HPC resources which can be customised for domain-specific applications. Original collaboration with San Diego Supercomputer Center, now University of Texas (Mary Thomas).
     Similar functionality to HotPage and GridPort (SDSC):
     • Single sign-on using a digital certificate (GSI)
     • Resource monitoring and discovery (Globus)
     • Application discovery (search engine)
     • Personal "desktop" workspace
     • File transfer (Globus) and job submission (Globus)

  40. InfoPortal - Searching for Applications on the UK Level 2 Grid
     [Screenshot labels: HPCPortal, DataPortal]

  41. [Screenshot annotations]
     • Choose application: DLPOLY
     • Resulting findings for DLPOLY

  42. [Screenshot annotations]
     • Summary description
     • Web service address for the DLPOLY code
     • Information about the systems on which the code is installed and available for use
     • Link to job submission

  44. All machines on the UK Level 2 Grid and their availability

  45. Conversion of data from one application to the other
     Scientists will need to be able to pass data from one application to the next seamlessly and with minimum interference on their part.
     • Determining data formats
     • Data schema
     • Interchange/conversion
     Example: e-Materials Project

  46. The CLRC DataPortal Related Projects: e-Science Technologies in the Simulation of Complex Materials
     A combination of novel computational and computer science methodologies and teams will be used to develop Grid e-Science technologies that deliver new simulation solutions to problems and fields relating to combinatorial materials science and polymorph prediction. The project will exploit the latest developments in scientific simulation methodologies (both electronic structure and force field based) and hardware ranging from desktop to HPC. It will establish a field-tested, integrated data and computing e-Science infrastructure customised for these key areas of current materials science. This infrastructure will, among other things, enable the automatic submission of simulations, triggered by the identification of knowledge gaps in the database in response to user queries. Furthermore, the automatic integration of experimental and computational results for screening applications will be supported.

  47. The Science: Filtering
     [Workflow labels: Purely SiO4 zeolite; Metal substitution with addition of proton; Calculation of vibrational frequencies; Increase quality of calculation for best candidates; Add probe]
     Calculation of vibrational frequencies:
     • Two-point displacement method used to build up the dynamical matrix.
     • Single-point energy calculation at each displacement, +ve and -ve in x, y and z.
     Information of interest:
     • Structure
     • Total energy
     • Binding energy
     • HOMO/LUMO
     • Population analysis
     • Vibrational frequencies
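     A two-point (central) displacement scheme of the kind named on this slide can be sketched as below. The sketch assumes that each single-point calculation also returns the gradient, so that the second-derivative matrix is built from gradient differences; the slide does not say whether the project did it this way. The toy two-coordinate potential is invented for illustration, and the mass weighting needed to turn the matrix into a true dynamical matrix is left out.

```python
from typing import Callable, List

Vector = List[float]
Matrix = List[List[float]]


def finite_difference_hessian(gradient: Callable[[Vector], Vector],
                              x0: Vector,
                              step: float = 1e-3) -> Matrix:
    """Build second derivatives from gradients at +step and -step along each coordinate."""
    n = len(x0)
    hessian = [[0.0] * n for _ in range(n)]
    for i in range(n):
        plus = list(x0)
        minus = list(x0)
        plus[i] += step    # positive displacement of coordinate i
        minus[i] -= step   # negative displacement of coordinate i
        g_plus = gradient(plus)
        g_minus = gradient(minus)
        for j in range(n):
            hessian[j][i] = (g_plus[j] - g_minus[j]) / (2.0 * step)
    # Symmetrise to average out numerical noise between the two triangles.
    return [[0.5 * (hessian[i][j] + hessian[j][i]) for j in range(n)] for i in range(n)]


def toy_gradient(x: Vector) -> Vector:
    """Gradient of the toy potential E = 0.5*k1*x0^2 + 0.5*k2*x1^2 + c*x0*x1."""
    k1, k2, c = 2.0, 3.0, 0.5
    return [k1 * x[0] + c * x[1], k2 * x[1] + c * x[0]]


H = finite_difference_hessian(toy_gradient, [0.0, 0.0])
print(H)
```

     For a purely quadratic potential such as this one, the central difference recovers the force constants up to rounding error; for a real system the step size has to be chosen with care.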

  48. The Computation
     [Diagram labels: ChemShell, GAMESS-UK, GULP, RMS=x, ChemShell Optimiser, Maxg and maxs < 0.01]
     1. Micro-iterations to relax shells with respect to forces from the QM region. RMS criteria (x) tested for further movement of shells.
     2. Energy and gradients passed from GAMESS-UK to GULP, and then final forces passed back to ChemShell (newopt module), which performs geometry optimisation.
     3. Optimisation is considered complete when both max gradient and max step are below set criteria.
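     The stopping rule in step 3 (both the maximum gradient component and the maximum step below a threshold, here 0.01 as on the diagram) can be illustrated with a toy optimiser. This is a plain steepest-descent loop on an invented quadratic potential; it is not ChemShell's newopt algorithm, and the function names and the step-size parameter are assumptions.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def converged(gradient: Vector, step: Vector,
              max_gradient_tol: float = 0.01, max_step_tol: float = 0.01) -> bool:
    """True when the largest gradient component and the largest step are both below tolerance."""
    max_g = max(abs(g) for g in gradient)
    max_s = max(abs(s) for s in step)
    return max_g < max_gradient_tol and max_s < max_step_tol


def optimise(gradient_fn: Callable[[Vector], Vector], x: Vector,
             rate: float = 0.25, max_iterations: int = 200) -> Tuple[Vector, int]:
    """Steepest descent that stops on the combined max-gradient / max-step criterion."""
    for iteration in range(1, max_iterations + 1):
        g = gradient_fn(x)
        step = [-rate * gi for gi in g]
        x = [xi + si for xi, si in zip(x, step)]
        if converged(gradient_fn(x), step):
            return x, iteration
    raise RuntimeError("optimisation did not converge")


def toy_gradient(x: Vector) -> Vector:
    """Gradient of the toy potential E = x0^2 + x1^2, whose minimum is at the origin."""
    return [2.0 * xi for xi in x]


geometry, iterations = optimise(toy_gradient, [1.0, -2.0])
print(geometry, iterations)
```

     Testing both quantities guards against stopping on a flat gradient while the geometry is still moving, or on a tiny step while forces are still large.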

  49. CML - Chemical Markup Language
     CML is a new approach to managing molecular information. It has a large scope, covering disciplines from macromolecular sequences to inorganic molecules and quantum chemistry. CML is new in bringing the power of XML to the management of chemical information. CML and its associated tools allow current files to be converted without semantic loss into structured documents, including chemical publications, and provide for the precise location of information within files.
     Developed by Peter Murray-Rust and Henry S. Rzepa. http://www.xml-cml.org
     In addition, they are also looking at CCML, a Computational Chemical Markup Language.
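     To show what "precise location of information within files" means in practice, here is a small sketch that reads values out of a CML-style fragment with Python's standard XML parser. The fragment is a simplified, namespace-free excerpt written for this example (the values are taken from the caffeine document on the next slide); it is an assumption about the shape of such a file, not a schema-valid CML document.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Simplified, hypothetical CML-style fragment (no namespace, no schema validation).
CML_FRAGMENT = """
<molecule title="caffeine" id="mol_caffeine_karne" convention="mol">
  <formula>C8 H10 N4 O2</formula>
  <string title="CAS">58-08-2</string>
  <float title="molecule weight">194.19</float>
  <float title="melting point" units="degC">238</float>
</molecule>
"""


def extract_properties(xml_text: str) -> Dict[str, object]:
    """Collect the formula plus every titled <float> and <string> child of the molecule."""
    root = ET.fromstring(xml_text)
    props: Dict[str, object] = {
        "title": root.get("title"),
        "formula": root.findtext("formula"),
    }
    for element in root.findall("float"):
        # Numeric values are stored with their (optional) units attribute.
        props[element.get("title")] = (float(element.text), element.get("units"))
    for element in root.findall("string"):
        props[element.get("title")] = element.text
    return props


props = extract_properties(CML_FRAGMENT)
print(props["formula"], props["melting point"])
```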

  50. <document>
       <!--CML document - caffeine - karne - 7/8/00 -->
       <!--file converted from: MDL .mol -->
       <cml title="caffeine" id="cml_caffeine_karne" xmlns="x-schema:cml_schema_ie_02.xml">
         <molecule title="caffeine" id="mol_caffeine_karne" convention="mol">
           <formula>C8 H10 N4 O2</formula>
           <string title="CAS">58-08-2</string>
           <string title="ACX">I1001269</string>
           <string title="DOT">UN 1544</string>
           <string title="RTECS">EV6475000</string>
           <float title="molecule weight">194.19</float>
           <float title="melting point" units="degC">238</float>
           <float title="specific gravity">1.23</float>
           <string title="water solubility" units="g/100 mL" convention="g per 100 mL at 23 degC">1-5</string>
           <string title="comments">White powder or white glistening needles usually melted together. LIGHT SENSITIVE</string>
           <list title="alternate names">