
Architecture of gLite Data Management System

Architecture of gLite Data Management System. Tony Calanducci, INFN Catania. International Summer School on Grid Computing 2006, Ischia (Naples), 9-21 July 2006. Outline: Grid Data Management Challenge; Storage Elements and SRM; File Catalogs and DM tools; Metadata Services


Presentation Transcript


  1. Architecture of gLite Data Management System. Tony Calanducci, INFN Catania. International Summer School on Grid Computing 2006, Ischia (Naples), 9-21 July 2006

  2. Outline • Grid Data Management Challenge • Storage Elements and SRM • File Catalogs and DM tools • Metadata Services • File Transfer Services ISSGC’06, Ischia, 09-21 July 2006

  3. The Grid DM Challenge
     • Need a common interface to storage resources
       • Storage Resource Manager (SRM)
     • Need to keep track of where data are stored
       • File and Replica Catalogs
     • Need scheduled, reliable file transfer
       • File transfer service
     • Heterogeneity
       • Data are stored on different storage systems using different access technologies
     • Distribution
       • Data are stored in different locations; in most cases there is no shared file system or common namespace
       • Data need to be moved between different locations

  4. Introduction
     • Assumptions:
       • Users and programs produce and require data
       • The lowest granularity of the data is the file level (we deal with files rather than data objects or tables)
       • Data = files
     • Files:
       • Mostly write once, read many
       • Located in Storage Elements (SEs)
       • Several replicas of one file at different sites
       • Accessible by Grid users and applications from "anywhere"
       • Locatable by the WMS (data requirements in JDL)
     • Also:
       • The WMS can send (small amounts of) data to/from jobs: Input and Output Sandbox
       • Files may be copied between local filesystems (WNs, UIs) and the Grid (SEs)

  5. gLite Grid Storage Requirements
     • Definition: the Storage Element is the service which allows a user or an application to store data for future retrieval
     • Manage local storage (disk) and/or interface to complex Mass Storage Systems (disk arrays and tape libraries) like
       • HPSS, CASTOR, DiskeXtender (UNITREE), …
     • Offer a single virtual file system even if it uses different storage technologies (arrays of disks and tapes), hiding the details from the users (by providing an SRM interface)
     • Support basic file transfer protocols
       • GridFTP mandatory (GSI-enabled FTP)
       • Others if available (https, ftp, etc.)
     • Support a native I/O (remote file) access protocol
       • POSIX(-like) I/O client library for direct access to data

  6. SRM in an example
     • She is running a job which needs:
       • Data for physics event reconstruction
       • Simulated data
       • Some data analysis files
     • She will write files remotely too
     • The data are spread over several sites:
       • They are at CERN, in dCache
       • They are at Fermilab, in a disk array
       • They are at Nikhef, in a classic SE

  7. SRM in an example
     • dCache: own system, own protocols and parameters
     • Castor: no connection with dCache or classic SE
     • classic SE: independent system from dCache or Castor
     • You as a user need to know all the systems!
     • SRM: "I talk to them on your behalf. I will even allocate space for your files, and I will use transfer protocols to send your files there."

  8. Storage Resource Management
     • The SRM (Storage Resource Manager) is a protocol for storage resource management
       • It does not do any data transfer
       • It is used to ask a Mass Storage System (MSS) to make a file ready for transfer, or to create space in a disk cache to which a file can be uploaded
       • The actual transfer is done using the file transfer protocol supported by the backend MSS
     • Storage resource management needs to take into account
       • Transparent access to files (migration to/from disk pool)
       • File pinning
       • Space reservation
       • File status notification
       • Lifetime management
     • The SRM is a single interface that takes care of local storage interaction and provides a Grid interface to the outside world
     • In gLite, interactions with the SRM interface are hidden by higher-level tools (DM tools and APIs)

  9. gLite Storage Element

  10. File naming conventions
      • Logical File Name (LFN)
        • An alias created by a user to refer to some item of data, e.g. "lfn:/grid/gilda/tony/simple2.dat"
      • Globally Unique Identifier (GUID)
        • A non-human-readable unique identifier for an item of data, e.g. "guid:3a69a819-2023-4400-a2a1-f581ab942044"
      • Site URL (SURL)
        • Indicates the place (Storage Element) where the file is actually found
        • Understood by the SRM interface
        • "srm://aliserv6.ct.infn.it/dpm/ct.infn.it/home/gilda/generated/2006-07-10/filef7a916f7-159b-48df-9159-877f2d3c6f58"
      • Transport URL (TURL)
        • Temporary locator of a replica plus its access protocol; understood by the backend MSS
        • "gsiftp://aliserv6.ct.infn.it/aliserv6.ct.infn.it:/gpfs/dpm/gilda/2006-07-10/filef7a916f7-159b-48df-9159-877f2d3c6f58.46193.0"
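
The four kinds of names above can be told apart by their prefix. The following is a minimal sketch, not part of gLite: the function names are hypothetical, and treating every scheme other than lfn, guid and srm as a TURL is an assumption made only for illustration.

```python
from urllib.parse import urlparse

# Prefix -> kind of name, following the conventions listed on the slide.
KINDS = {"lfn": "LFN", "guid": "GUID", "srm": "SURL"}


def classify(name: str) -> str:
    """Return 'LFN', 'GUID', 'SURL' or 'TURL' for a Grid file name."""
    if ":" not in name:
        raise ValueError(f"no scheme prefix in {name!r}")
    scheme = name.split(":", 1)[0].lower()
    # Assumption of this sketch: any other scheme (gsiftp, rfio, ...) is a TURL.
    return KINDS.get(scheme, "TURL")


def storage_host(url: str) -> str:
    """Return the host part of an SURL or TURL (the Storage Element)."""
    return urlparse(url).hostname
```

For example, `classify("lfn:/grid/gilda/tony/simple2.dat")` gives `"LFN"`, while a `gsiftp://` locator is reported as a TURL.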

  11. SRM Interactions (diagram: Client, SRM and Storage, with arrows numbered 1 to 5)
      1. The client asks the SRM for a file, providing an SURL (Site URL)
      2. The SRM asks the storage system to provide the file
      3. The storage system notifies the availability of the file and its location
      4. The SRM returns a TURL (Transport URL), i.e. the location from where the file can be accessed
      5. The client interacts with the storage using the protocol specified in the TURL
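
The five steps can be sketched with two toy classes. This is an illustrative model only: ToyStorage, ToySRM, fetch and the host name se.example.org are hypothetical and do not correspond to any real SRM client API. The point it shows is that the SRM only resolves the SURL into a TURL and the data transfer itself bypasses it.

```python
class ToyStorage:
    """Stands in for the backend storage system (hypothetical)."""

    def __init__(self, host, files):
        self.host = host
        self.files = files  # SURL -> physical path on the storage

    def stage(self, surl):
        # Steps 2-3: make the file available and report its location.
        return self.files[surl]

    def read(self, turl):
        # Step 5: the client talks to the storage directly.
        return f"contents served from {turl}"


class ToySRM:
    """Resolves SURLs into TURLs; never moves data itself."""

    def __init__(self, storage, protocol="gsiftp"):
        self.storage = storage
        self.protocol = protocol

    def get_turl(self, surl):
        path = self.storage.stage(surl)                        # steps 2-3
        return f"{self.protocol}://{self.storage.host}{path}"  # step 4


def fetch(srm, surl):
    turl = srm.get_turl(surl)      # steps 1 and 4
    return srm.storage.read(turl)  # step 5, the SRM is not involved
```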

  12. What is a file catalog (diagram: gLite UI, File Catalog, three SEs)

  13. The LFC (LCG File Catalog)
      • It keeps track of the location of copies (replicas) of Grid files
      • The LFN acts as the main key in the database. It has:
        • Symbolic links to it (additional LFNs)
        • Unique identifier (GUID)
        • System metadata
        • Information on replicas
        • One field of user metadata
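
The record kept for each LFN can be pictured with a small in-memory model. This is a sketch under stated assumptions, not the LFC schema: ToyCatalog, CatalogEntry and their method names are hypothetical, and the GUID is simply generated with Python's uuid module.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    guid: str
    replicas: list = field(default_factory=list)  # SURLs of the copies
    symlinks: list = field(default_factory=list)  # additional LFNs
    comment: str = ""                             # the single user-metadata field


class ToyCatalog:
    def __init__(self):
        self.entries = {}  # primary LFN -> CatalogEntry
        self.links = {}    # symbolic-link LFN -> primary LFN

    def register(self, lfn, surl):
        """Record a replica for an LFN, creating the entry if needed."""
        if lfn not in self.entries:
            self.entries[lfn] = CatalogEntry(guid=f"guid:{uuid.uuid4()}")
        self.entries[lfn].replicas.append(surl)
        return self.entries[lfn].guid

    def symlink(self, target, link):
        self.links[link] = target
        self.entries[target].symlinks.append(link)

    def list_replicas(self, lfn):
        primary = self.links.get(lfn, lfn)  # follow a symbolic link, if any
        return list(self.entries[primary].replicas)
```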

  14. LFC commands: summary of the LFC Catalog commands

  15. LFC C API
      Low-level methods (many POSIX-like):
      lfc_access, lfc_aborttrans, lfc_addreplica, lfc_apiinit, lfc_chclass, lfc_chdir, lfc_chmod, lfc_chown,
      lfc_closedir, lfc_creat, lfc_delcomment, lfc_delete, lfc_deleteclass, lfc_delreplica, lfc_endtrans,
      lfc_enterclass, lfc_errmsg, lfc_getacl, lfc_getcomment, lfc_getcwd, lfc_getpath, lfc_lchown,
      lfc_listclass, lfc_listlinks, lfc_listreplica, lfc_lstat, lfc_mkdir, lfc_modifyclass, lfc_opendir,
      lfc_queryclass, lfc_readdir, lfc_readlink, lfc_rename, lfc_rewind, lfc_rmdir, lfc_selectsrvr,
      lfc_setacl, lfc_setatime, lfc_setcomment, lfc_seterrbuf, lfc_setfsize, lfc_starttrans, lfc_stat,
      lfc_symlink, lfc_umask, lfc_undelete, lfc_unlink, lfc_utime, send2lfc

  16. GFAL: Grid File Access Library
      • Interactions with an SE require several components:
        → File catalog services to locate replicas
        → SRM interfaces
        → A file access mechanism to access files from the SE on the UI/WN
      • GFAL does all these tasks for you:
        → Hides all these operations
        → Presents a POSIX interface for the I/O operations
        → Single shared library in threaded and unthreaded versions: libgfal.so, libgfal_pthr.so
        → Single header file: gfal_api.h
        → The user can create all commands needed for storage management
        → It also offers an interface to SRM
      • Supported protocols:
        → file (local or nfs-like access)
        → dcap, gsidcap and kdcap (dCache access)
        → rfio (Castor access) and gsirfio (DPM)

  17. GFAL: File I/O API (I)
      int gfal_access (const char *path, int amode);
      int gfal_chmod (const char *path, mode_t mode);
      int gfal_close (int fd);
      int gfal_creat (const char *filename, mode_t mode);
      off_t gfal_lseek (int fd, off_t offset, int whence);
      int gfal_open (const char *filename, int flags, mode_t mode);
      ssize_t gfal_read (int fd, void *buf, size_t size);
      int gfal_rename (const char *old_name, const char *new_name);
      ssize_t gfal_setfilchg (int, const void *, size_t);
      int gfal_stat (const char *filename, struct stat *statbuf);
      int gfal_unlink (const char *filename);
      ssize_t gfal_write (int fd, const void *buf, size_t size);
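
The prototypes above mirror the POSIX file calls, so the usual open / lseek / read / close sequence carries over. The sketch below is only an analogy: it uses Python's standard os module on a local file, not GFAL, and the helper name read_tail is hypothetical. The comments indicate which GFAL call plays the corresponding role.

```python
import os


def read_tail(path, nbytes):
    """Read the last nbytes of a local file using POSIX-style calls."""
    fd = os.open(path, os.O_RDONLY)                       # role of gfal_open
    try:
        size = os.lseek(fd, 0, os.SEEK_END)               # role of gfal_lseek
        os.lseek(fd, max(size - nbytes, 0), os.SEEK_SET)
        return os.read(fd, nbytes)                        # role of gfal_read
    finally:
        os.close(fd)                                      # role of gfal_close
```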

  18. GFAL Java API
      • The GFAL APIs are available for C/C++ programmers
      • Because of the ISSGC'06 exercise requirements, we needed a Java version of them
      • We wrote a wrapper around the C APIs using the Java Native Interface, and the Java APIs on top of it
      • More information can be found here: https://grid.ct.infn.it/twiki/bin/view/GILDA/APIGFAL

  19. lcg-utils DM tools
      • High-level interface (command-line tools and APIs) to
        • Upload/download files to/from the Grid (UI, CE and WN <---> SEs)
        • Replicate data between SEs and locate the best replica available
        • Interact with the file catalog
      • Definition: a file is considered to be a Grid file if it is both physically present in an SE and registered in the File Catalog
      • lcg-utils ensure the consistency between files in the Storage Elements and entries in the File Catalog
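
The definition of a Grid file suggests a simple consistency check between the two sides. This is a toy sketch, not how lcg-utils work internally: the function name and the plain dict/set inputs are assumptions made for illustration.

```python
def find_inconsistencies(catalog, storage):
    """Compare catalog entries with what is physically stored.

    catalog: dict mapping each LFN to a set of registered SURLs
    storage: set of SURLs actually present on the Storage Elements
    Returns (dangling, orphans):
      dangling = registered in the catalog but missing on the SEs
      orphans  = present on an SE but not registered in the catalog
    """
    registered = set().union(*catalog.values())
    dangling = registered - storage
    orphans = storage - registered
    return dangling, orphans
```

A file counts as a Grid file in the sense of the slide only if its SURL appears in neither of the two returned sets.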

  20. lcg-utils commands: Replica Management; File Catalog Interaction

  21. LFC interfaces (diagram components: SEs; LFC server; LCG utils; GFAL; Python; LFC client; C API; DLI; WMS; CLI: lfc-ls, lfc-mkdir, lfc-setacl, …)

  22. Metadata on the Grid
      • Metadata is data about data
      • On the Grid: mainly information about files
        • Describe files
        • Locate files based on their contents
        • They can also add details on running jobs
        • …
      • But also simplified DB access on the Grid
        • Many Grid applications need structured data
        • Many applications require only simple schemas
          • Can be modelled as metadata
        • Main advantage: better integration with the Grid environment
          • The Metadata Service is a Grid component
          • Grid security
          • Hide DB heterogeneity
      • AMGA is the Metadata Component of gLite

  23. Example
      • Suppose we have a set of movie trailers saved on several storage elements
        $ lfc-ls -l /grid/gilda/trailers
        -rw-rw-r-- 1 101 102 10188804 Apr 14 17:21 BatmanBegins.mpg
        -rw-rw-r-- 1 109 102 3201028 Apr 14 19:34 alien.mpg
        -rw-rw-r-- 1 101 102 3545092 Apr 14 17:19 amelie.mpg
        -rw-rw-r-- 1 101 102 5277700 Apr 14 17:27 american2.mpg
        -rw-rw-r-- 1 101 102 5828612 Apr 14 17:28 fastfurious.mpg
        -rw-rw-r-- 1 192 102 20509586 Apr 20 14:08 insideman.avi
        -rw-rw-r-- 1 101 102 5912580 Apr 14 17:31 madagascar.mpg
        -rw-rw-r-- 1 101 102 5812228 Apr 14 17:30 matrix.mpg
        -rw-rw-r-- 1 192 102 12918756 Apr 20 19:09 pinkpanther.mov
        -rw-rw-r-- 1 101 102 6240260 Apr 14 17:30 spiderman.mpg
      • We could add more details on their contents (Movie Title, Cast, Runtime, PlotOutline, Genre, Director) by associating metadata with them
      • We could then look for movies that satisfy some desired search criteria (e.g. comedies in which our preferred actor performed, or movies about animals and zoos)

  24. Metadata Concepts
      • Basic definitions
        • Entries: list of items to which we want to attach metadata (e.g. each movie will be represented as an entry in AMGA)
        • Attribute: key/value pair with type information
          • Name/Key: the name of the attribute (e.g. MovieTitle, Cast, PlotOutline, Runtime, …)
          • Type: the type (e.g. varchar, int, float, text, numeric, …)
          • Value: value of an entry's attribute (e.g. "Spider Man 2", "Tobey Maguire, Kirsten Dunst", 127, …)
        • Metadata: list of attributes associated with entries
        • Schema: a set of attributes
        • Collection: a set of entries associated with a schema
      • We can think of collections as DB tables, schemas as the list of fields (with their types), attributes as columns, and entries as rows
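
The table analogy in the last bullet can be made concrete with an ordinary relational database. The sketch below uses Python's built-in sqlite3 module purely as an illustration and is not AMGA: the table name trailers, the helper names, and the Genre and Runtime values for the second row are made up for the example.

```python
import sqlite3


def build_collection():
    db = sqlite3.connect(":memory:")
    # Collection -> table; schema -> the typed column list (attributes).
    db.execute(
        "CREATE TABLE trailers ("
        "entry TEXT PRIMARY KEY, MovieTitle TEXT, Genre TEXT, Runtime INTEGER)"
    )
    # Entries -> rows; each value is one attribute value of that entry.
    db.executemany(
        "INSERT INTO trailers VALUES (?, ?, ?, ?)",
        [
            ("spiderman.mpg", "Spider Man 2", "Action", 127),
            ("madagascar.mpg", "Madagascar", "Comedy", 86),  # illustrative values
        ],
    )
    return db


def find_by_genre(db, genre):
    """Return the entry names whose Genre attribute matches."""
    rows = db.execute(
        "SELECT entry FROM trailers WHERE Genre = ? ORDER BY entry", (genre,)
    )
    return [row[0] for row in rows]
```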

  25. AMGA Features
      • Dynamic schemas
        • Schemas can be modified at runtime by the client
          • Create, delete schemas
          • Add, remove attributes
      • Metadata organised as a hierarchy
        • Collections can contain sub-collections
        • Analogy to a file system:
          • Collection ↔ Directory; Entry ↔ File
      • Flexible queries
        • SQL-like query language
        • Joins between schemas
        • Example:
          selectattr /gLibrary:FileName /gLAudio:Author /gLAudio:Album '/gLibrary:FILE=/gLAudio:FILE and like(/gLibrary:FileName, "%.mp3")'
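
To read the selectattr example, it may help to see a plain SQL join with the same shape. This is an analogy under stated assumptions, written with Python's sqlite3 module: it does not show how AMGA evaluates the query, the two tables merely borrow the collection names from the example, and the rows are made up.

```python
import sqlite3


def build_library():
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE gLibrary (FILE TEXT, FileName TEXT)")
    db.execute("CREATE TABLE gLAudio (FILE TEXT, Author TEXT, Album TEXT)")
    # Made-up rows, for illustration only.
    db.executemany("INSERT INTO gLibrary VALUES (?, ?)",
                   [("f1", "song.mp3"), ("f2", "talk.ppt")])
    db.executemany("INSERT INTO gLAudio VALUES (?, ?, ?)",
                   [("f1", "Some Author", "Some Album")])
    return db


def mp3_authors(db):
    """SQL counterpart of the selectattr join: match on FILE, keep *.mp3."""
    return db.execute(
        "SELECT l.FileName, a.Author, a.Album "
        "FROM gLibrary AS l JOIN gLAudio AS a ON l.FILE = a.FILE "
        "WHERE l.FileName LIKE '%.mp3' ORDER BY l.FileName"
    ).fetchall()
```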

  26. Security
      • Unix-style permissions
      • ACLs: per-collection or per-entry
      • Secure connections: SSL
      • Client authentication based on
        • Username/password
        • General X509 certificates
        • Grid-proxy and VOMS-proxy certificates
      • Access control via a Virtual Organization Management System (VOMS)

  27. AMGA Implementation
      • C++ multiprocess server
        • Runs on any Linux flavour
      • Backends
        • Oracle, MySQL, PostgreSQL, SQLite
      • Two frontends
        • TCP streaming
          • High performance
          • Client API for C++, Java, Python, Perl, Ruby
        • SOAP
          • Interoperability
      • Also implemented as a standalone Python library
        • Data stored on the filesystem

  28. GILDA Use Cases

  29. gLibrary Use Case
      • Attempts to create a Multimedia Management System on the Grid
      • Examples of multimedia contents handled by gLibrary:
        • Images
        • Movies
        • Audio files
        • Office documents (PowerPoint, Word, Excel, OpenOffice)
        • E-mails, PDFs, HTML pages
        • Customized versions of well-known document types (e.g. EGEE PPTs)
        • …
      • Keeps track of, and organizes in a uniform way, all the additional details (metadata) of files saved in Storage Elements and registered in File Catalogues
      • Provides users with an easy way to locate and retrieve files based on their contents

  30. gLibrary Java GUI screenshot (alpha prototype)

  31. gLibrary deployment scenario (diagram labels: UI; VOMS; Authenticate with X509 Certificate; VOMS Proxy with Group & Role Information; AMGA Server; PostgreSQL; gLibraryManager, gLibrarySubmitter, VO user; File Catalog; three SEs)

  32. gMOD: grid Movie On Demand
      • gMOD provides a Video-On-Demand service
      • The user chooses from a list of videos and the chosen one is streamed in real time to the video client on the user's workstation
      • For each movie many details (Title, Runtime, Country, Release Date, Genre, Director, Cast, Plot Outline) are stored, and users can search for a particular movie by querying on one or more attributes
      • Two kinds of users can interact with gMOD: TrailersManagers, who can administer the database of movies (uploading new ones and attaching metadata to them), and GILDA VO users (guests), who can browse, search and choose a movie to be streamed

  33. gMOD interactions (diagram labels: User; GENIUS Portal; VOMS; get Role; AMGA; Metadata Catalogue; LFC Catalogue; Workload Management System; CE with three WNs; Storage Elements)

  34. gMOD screenshot: gMOD is accessible through the GENIUS Portal (https://glite-tutor.ct.infn.it)

  35. Data movement introduction
      • Grids are naturally distributed systems
      • This means that data also need to be distributed
      • First-generation data distribution mainly concentrated on copy protocols in a grid environment:
        • gridftp
        • http + mod_gridsite
      • File movement is started and controlled on the client side
      • But copies controlled by clients have problems…

  36. Direct Client Controlled Data Movement
      • Although the transport protocol may be robust, state is held inside the client: inconvenient and fragile
      • The client only knows about local state and has no global knowledge about data transfers between storage elements
        • Storage elements can be overwhelmed with replication requests
        • Multiple replications of the same data can happen simultaneously
        • A site has little control over the balance of network resources (DoS)
      (Diagram: the client holds control channels to the source and destination Storage Elements; the data flow channel runs between the two Storage Elements.)

  37. Transfer Service
      • Clear need for a service for data transfer
        • The client connects to the service to submit a request
        • The service maintains state about the transfer
        • The client can periodically reconnect to check status or cancel the request
      • The service can have knowledge of global state, not just of a single request
        • Load balancing
        • Scheduling
      (Diagram: the client submits new requests, monitors progress and cancels requests via SOAP over https; the Transfer Service controls the data flow from the source to the destination Storage Element.)
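
The difference from client-controlled copies is that the request state lives in the service. A minimal in-memory sketch of that idea follows; the class, method and state names are hypothetical and are not the gLite FTS interface.

```python
import itertools


class ToyTransferService:
    """Keeps the state of every transfer request on the service side."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.jobs = {}  # job id -> {"source", "dest", "state"}

    def submit(self, source, dest):
        job_id = next(self._ids)
        self.jobs[job_id] = {"source": source, "dest": dest, "state": "Submitted"}
        return job_id

    def status(self, job_id):
        # The client can disconnect and ask again later.
        return self.jobs[job_id]["state"]

    def cancel(self, job_id):
        if self.jobs[job_id]["state"] in ("Submitted", "Active"):
            self.jobs[job_id]["state"] = "Canceled"

    def queued_between(self, source, dest):
        """Global view that a single client does not have."""
        return [
            job_id
            for job_id, job in self.jobs.items()
            if (job["source"], job["dest"]) == (source, dest)
            and job["state"] == "Submitted"
        ]
```

Because all requests pass through one place, the service can see duplicated or competing transfers between the same pair of Storage Elements and schedule them, which a set of independent clients cannot do.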

  38. gLite FTS: Channels
      • The FTS service has a concept of channels
      • A channel is a unidirectional connection between two sites
      • Transfer requests between these two sites are assigned to that channel
      • Channels usually correspond to a dedicated network pipe (e.g., OPN) associated with production
      • But channels can also take wildcards:
        • * to MY_SITE : all incoming
        • MY_SITE to * : all outgoing
        • * to * : catch all
      • Channels control certain transfer properties: transfer concurrency, gridftp streams
      • Channels can be controlled independently: started, stopped, drained
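
Assigning a transfer request to a channel with wildcards can be sketched as a lookup from the most specific pattern to the least specific one. The precedence order used here (dedicated channel, then incoming wildcard, then outgoing wildcard, then catch-all) is an assumption of this sketch, as are the function name and the site names in the usage below; the slide does not state how FTS resolves overlapping channels.

```python
def pick_channel(channels, source, dest):
    """Return the channel a transfer from source to dest is assigned to.

    channels: set of (source_site, dest_site) pairs, with "*" as wildcard.
    Returns None if no channel matches.
    """
    candidates = [
        (source, dest),  # dedicated channel between the two sites
        ("*", dest),     # all incoming to dest
        (source, "*"),   # all outgoing from source
        ("*", "*"),      # catch all
    ]
    for candidate in candidates:
        if candidate in channels:
            return candidate
    return None
```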

  39. Data Management Services Summary
      • Storage Elements: save data and provide a common interface
        • Storage Resource Manager (SRM): Castor, dCache, DPM, …
        • Native access protocols: rfio, dcap, nfs, …
        • Transfer protocols: gsiftp, ftp, …
      • Catalogs: keep track of where data are stored
        • File Catalog and Replica Catalog: LCG File Catalog (LFC)
        • Metadata Catalog: AMGA Metadata Catalogue
      • Data Movement: schedules reliable file transfer
        • File Transfer Service: gLite FTS (manages physical transfers)

  40. References
      • gLite documentation homepage: http://glite.web.cern.ch/glite/documentation/default.asp
      • DM subsystem documentation: http://egee-jra1-dm.web.cern.ch/egee-jra1-dm/doc.htm
      • LFC and DPM documentation: https://uimon.cern.ch/twiki/bin/view/LCG/DataManagementDocumentation
      • AMGA Project Homepage: http://project-arda-dev.web.cern.ch/project-arda-dev/metadata/
      • FTS user guide: https://edms.cern.ch/file/591792/1/EGEE-TECH-591792-Transfer-CLI-v1.0.pdf

  41. Questions…
