
Introduction to iRODS

Introduction to iRODS. Jean-Yves Nief. Talk overview. Data management context. Some data management goals: storage virtualization. Virtualization of the data management policy. Examples of data management rules. Why choose iRODS? What is iRODS? iRODS usage examples.


Presentation Transcript


  1. Introduction to iRODS Jean-Yves Nief

  2. Talk overview
     • Data management context.
     • Some data management goals:
     • Storage virtualization.
     • Virtualization of the data management policy.
     • Examples of data management rules.
     • Why choose iRODS?
     • What is iRODS?
     • iRODS usage examples.
     • Prospects: scalability.
     • Summary.
     Introduction to iRODS - DPHEP

  3. CC-IN2P3 activities
     • Data centers like CC-IN2P3 (Lyon, France) work for international scientific collaborations.
     • Examples:
     • High energy physics: CERN (France/Switzerland), SLAC (USA), Fermilab (USA), BNL (USA), etc.
     • Astroparticle physics / astrophysics: Auger (Argentina), HESS (Namibia), AMS (Int. Space Station).
     => Distributed environment: experimental sites, data centers, collaborators spread around the world.

  4. CC-IN2P3 activities

  5. Data management context
     • Multidisciplinary environment:
     • High energy and nuclear physics.
     • Astroparticle, astrophysics.
     • Biology, biomedical apps, arts and humanities.
     => Various constraints, various needs for data management.

  6. Data management context
     • Data stored on various sites.
     • Heterogeneous storage:
     • Data format: flat files, databases, data streams…
     • Storage media, server hardware: disks, tapes.
     • Data access protocols, information systems.
     • Heterogeneous OS on both the client and server sides.
     => Need to federate all this in a homogeneous way.

  7. Data management context
     • Out of scope here:
     • Intensive parallel I/O for data analysis.
     • In scope here:
     • Data preservation (replication, consistency…).
     • Data access distributed over different sites.
     • Data life cycle (file format transformation, data workflows, interactions with various information systems).
     • Need for virtualization of the storage:
     • Logical view and organization of the data.
     • Data migration to new hardware/software is transparent to the end-client tools: no view of the physical location of the data or of the underlying technologies.
     • A single logical view of the data for all users, independent of their location.
     • Virtual organization (VO) of the users:
     • Unique id for each user.
     • Organization by groups and roles (simple user, sysadmin, etc.).
     • Access rights to the data within the VO.
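
The storage virtualization described on this slide can be sketched in a few lines of plain Python. This is not iRODS code: the `Catalog` class, its methods, and the paths and storage URLs are all hypothetical, chosen only to show that a client keeps using the same logical path while the physical location changes underneath.

```python
from dataclasses import dataclass, field


@dataclass
class Catalog:
    """Toy metadata catalog: logical path -> physical replica locations."""
    replicas: dict[str, list[str]] = field(default_factory=dict)

    def register(self, logical_path: str, physical_url: str) -> None:
        """Record one more physical copy of a logical object."""
        self.replicas.setdefault(logical_path, []).append(physical_url)

    def resolve(self, logical_path: str) -> str:
        """Return one physical location; the client never picks it itself."""
        return self.replicas[logical_path][0]

    def migrate(self, logical_path: str, old_url: str, new_url: str) -> None:
        """Move a replica to new storage without touching the logical path."""
        locations = self.replicas[logical_path]
        locations[locations.index(old_url)] = new_url


catalog = Catalog()
catalog.register("/zone/home/alice/run42.dat", "disk://lyon-01/vol3/0001")
print(catalog.resolve("/zone/home/alice/run42.dat"))

# The data moves to another storage technology; the logical path is unchanged.
catalog.migrate("/zone/home/alice/run42.dat",
                "disk://lyon-01/vol3/0001", "tape://hpss/abc/0001")
print(catalog.resolve("/zone/home/alice/run42.dat"))
```

A real catalog would also carry ownership, access rights and replica state; the sketch keeps only the logical-to-physical mapping.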

  8. Some data management goals
     • Storage virtualization is not enough.
     • For client applications relying on this middleware:
     • No safeguard.
     • No guarantee of a strict application of the data preservation policy.
     • Real need for a data distribution project to define a coherent and homogeneous policy for:
     • data management.
     • storage resource management.
     • Crucial for massive archival projects (digital libraries…).
     • No grid/cloud/… tool had these features until 2006.

  9. Virtualization of the data management policy
     • Typical pitfalls:
     • Pre-established rules are not respected.
     • Several data management applications may co-exist at the same moment.
     • Several versions of the same application can be used within a project at the same time.
     => Potential inconsistency.
     • Remove the various site-specific constraints from the client applications.
     • Solution:
     • Data management policy virtualization.
     • Policy expressed in terms of rules.

  10. Examples of data management rules
     • Customized access rights to the system:
     • Disallow file removal from a particular directory, even by the owner.
     • Security and integrity check of the data:
     • Automatic checksum launched in the background.
     • On-the-fly anonymization of the files, even if it has not been done by the client.
     • Metadata registration:
     • Automated registration of metadata associated with objects (inside or outside the iRODS database).
     • Small-file aggregation before migration to the Mass Storage System (MSS).
     • Customized transfer parameters:
     • Number of streams, stream size, TCP window as a function of the client or server IP.
     • … up to your needs …
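
Two of the rules listed on this slide (refusing removal from a protected directory, and computing a checksum on the server side) can be illustrated with a minimal policy-hook sketch. It is plain Python, not iRODS rule code: the event names `pre_delete` and `post_put`, the `on` decorator, and the `/zone/archive/` path are assumptions made up for the example.

```python
import hashlib
from typing import Callable

# Hypothetical rule registry: event name -> rules run on the server side.
Rule = Callable[[dict], None]
rules: dict[str, list[Rule]] = {"pre_delete": [], "post_put": []}


class PolicyViolation(Exception):
    """Raised by a rule to veto the client's request."""


def on(event: str) -> Callable[[Rule], Rule]:
    """Register a function as a rule for the given event."""
    def decorator(rule: Rule) -> Rule:
        rules[event].append(rule)
        return rule
    return decorator


@on("pre_delete")
def protect_archive(ctx: dict) -> None:
    # Disallow removal from a particular directory, even by the owner.
    if ctx["path"].startswith("/zone/archive/"):
        raise PolicyViolation(f"removal of {ctx['path']} is not allowed")


@on("post_put")
def record_checksum(ctx: dict) -> None:
    # Checksum computed by the server, whatever the client did or did not do.
    ctx["checksum"] = hashlib.sha256(ctx["data"]).hexdigest()


def trigger(event: str, ctx: dict) -> dict:
    """Run every rule registered for an event, in registration order."""
    for rule in rules[event]:
        rule(ctx)
    return ctx


stored = trigger("post_put", {"path": "/zone/archive/a.dat", "data": b"abc"})
print(stored["checksum"])
try:
    trigger("pre_delete", {"path": "/zone/archive/a.dat", "user": "owner"})
except PolicyViolation as err:
    print("denied:", err)
```

The point of the design is that the policy lives with the server: a client application cannot skip the checksum or bypass the removal ban, whichever tool or version it uses.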

  11. Why did we choose iRODS?
     • Provides a solution to the above requirements.
     • SRB (the iRODS predecessor) has been used so far:
     • Data virtualization.
     • But no rule-based policy mechanisms.
     • In 2007, no « grid » tool except iRODS could provide rule-based data management policies.
     • Scalable.
     • Can be customized to fit a wide variety of use cases.

  12. What is iRODS?
     • Integrated Rule-Oriented Data System (DICE team: UNC, San Diego):
     • started in 2006.
     • open source.
     • CC-IN2P3: collaborator.
     • In a « zone » (administrative domain):
     • One or several servers connected to a centralized metacatalog (RDBMS) holding file metadata, user information, data locations, etc.
     => Logical view of the data in a given zone.
     • Data servers spread geographically within a zone.
     • Possibility to have different zones (separate administrative domains) interconnected.
     • Data management policies expressed with rules in a « C-like » language:
     • Can be triggered automatically for various actions (put, get, list, rename…).
     • Can be run manually.
     • Can be run in batch mode.
     • Rules versioning.
     • Client interactions with iRODS:
     • APIs (C, Java, PHP, Python), shell commands, GUIs, web interfaces.

  13. Logical namespace (example of a web interface)

  Rule example (e.g.: are all the files in a given collection owned by a given user?)

      integrityFileOwner {
      #Input parameter is:
      #  Name of collection that will be checked
      #  Owner name that will be verified
      #Output is:
      #  List of all files in the collection that have a different owner

        #Verify that input path is a collection
        msiIsColl(*Coll,*Result, *Status);
        if(*Result == 0) {
          writeLine("stdout","Input path *Coll is not a collection");
          fail;
        }
        *ContInxOld = 1;
        *Count = 0;

        #Loop over files in the collection
        msiMakeGenQuery("DATA_ID,DATA_NAME,DATA_OWNER_NAME","COLL_NAME = '*Coll'",*GenQInp);
        msiExecGenQuery(*GenQInp, *GenQOut);
        msiGetContInxFromGenQueryOut(*GenQOut,*ContInxNew);
        while(*ContInxOld > 0) {
          foreach(*GenQOut) {
            msiGetValByKey(*GenQOut,"DATA_OWNER_NAME",*Attrname);
            if(*Attrname != *Attr) {
              msiGetValByKey(*GenQOut,"DATA_NAME",*File);
              writeLine("stdout","File *File has owner *Attrname");
              *Count = *Count + 1;
            }
          }
          *ContInxOld = *ContInxNew;
          if(*ContInxOld > 0) {msiGetMoreRows(*GenQInp,*GenQOut,*ContInxNew);}
        }
        writeLine("stdout","Number of files in *Coll with owner other than *Attr is *Count");
      }
      INPUT *Coll = "/tempZone/home/rods/sub1", *Attr = "rods"
      OUTPUT ruleExecOut
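
The control flow of this rule (an outer loop that fetches the next page of catalog results while a continuation index is positive, and an inner loop over the rows of that page) may be easier to follow in a plain-Python rendition. This is only a sketch: the in-memory `collection` list stands in for the catalog query, and `query_pages` and `PAGE_SIZE` are invented helpers, not part of any iRODS API.

```python
from typing import Iterator

PAGE_SIZE = 2  # deliberately small so that several pages are needed


def query_pages(rows: list[dict], page_size: int = PAGE_SIZE) -> Iterator[list[dict]]:
    """Stand-in for a catalog query that returns its results page by page."""
    for start in range(0, len(rows), page_size):
        yield rows[start:start + page_size]


def files_with_other_owner(rows: list[dict], expected_owner: str) -> list[str]:
    """Names of the files whose owner differs from expected_owner."""
    mismatches = []
    for page in query_pages(rows):      # outer loop: fetch the next page
        for row in page:                # inner loop: rows of the current page
            if row["DATA_OWNER_NAME"] != expected_owner:
                mismatches.append(row["DATA_NAME"])
    return mismatches


# Invented catalog content for the collection being checked.
collection = [
    {"DATA_NAME": "a.dat", "DATA_OWNER_NAME": "rods"},
    {"DATA_NAME": "b.dat", "DATA_OWNER_NAME": "alice"},
    {"DATA_NAME": "c.dat", "DATA_OWNER_NAME": "rods"},
]
for name in files_with_other_owner(collection, "rods"):
    print(f"File {name} has a different owner")
```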

  14. iRODS usage @ CC-IN2P3
     • Used for data distribution, archiving, integration in analysis or data life cycle management workflows.
     • Interfaced with:
     • Mass Storage Systems: HPSS.
     • External databases or information systems: RDBMS, Fedora Commons.
     • Web servers.
     • High energy and nuclear physics:
     • BaBar: data management of the entire data set between SLAC and CC-IN2P3: 2 PB foreseen in total.
     • dChooz: neutrino experiment (France, USA, Japan, etc.): 550 TB.
     • Astroparticle and astrophysics:
     • AMS: cosmic ray experiment on the International Space Station (600 TB).
     • TREND, BAOradio: radio astronomy (170 TB).
     • Biology and biomedical applications (50 TB).
     • Arts and Humanities: Adonis (70 TB).

  15. iRODS usage example
     • Archival in Lyon of the entire BaBar data set (2 PB in total).
     • Automatic transfer from tape to tape: 3-4 TB/day (no limitation).
     • Automatic recovery of faulty transfers.
     • Ability for a SLAC admin to recover files directly from the CC-IN2P3 zone if data is lost at SLAC.

  16. Some rule examples (HEP)
     • Mass Storage System integration:
     • Using compound resources: iRODS disk cache + tapes.
     • Data on the disk cache is replicated into the MSS asynchronously (1 h later) using a delayExec rule.
     • Recovery mechanism: retries until success; the delay between retries is doubled at each round.
     • ACL management:
     • Rules needed for fine-grained access rights management.
     • E.g.:
     • 3 groups of users (admins, experts, users).
     • ACLs on /<zone-name>/*/rawdata => admins: r/w, experts + users: r
     • ACLs on all other subcollections => admins + experts: r/w, users: r
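
The recovery mechanism mentioned on this slide (retry until success, doubling the delay at each round) can be sketched as follows. This is plain Python under stated assumptions, not the delayExec rule itself: `retry_until_success` is a hypothetical helper, the failing task is simulated, and the 60-second first delay is an arbitrary value.

```python
import time
from typing import Callable


def retry_until_success(task: Callable[[], bool],
                        first_delay: float,
                        sleep: Callable[[float], None] = time.sleep) -> list[float]:
    """Run task until it returns True, doubling the wait after each failure.

    Returns the list of delays that were applied, which is handy for testing.
    """
    delays = []
    delay = first_delay
    while not task():
        sleep(delay)
        delays.append(delay)
        delay *= 2  # the delay between retries is doubled at each round
    return delays


# Simulated replication to tape that fails twice before succeeding.
attempts = iter([False, False, True])
waited = retry_until_success(lambda: next(attempts), first_delay=60.0,
                             sleep=lambda seconds: None)  # no real sleeping here
print(waited)
```

As written, the loop never gives up, which matches "retries until success" on the slide; a deployment that must not retry forever would need an upper bound on the delay or on the number of rounds.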

  17. Prospects: scalability
     • 5.5 PB managed by iRODS so far.
     • 26 million files registered.
     => No scalability issues foreseen.
     • Pitfalls:
     • Metadata scalability? (billions of entries in the catalog?)
     • The number of simultaneous connections must be controlled (as for Apache servers): needed in a wide-open environment.

  18. Summary
     • iRODS:
     • Lots of features for data management in a distributed environment.
     • Provides the flexibility and the freedom to interface or work directly with an unlimited list of systems.
     • Not the same goal as a parallel file system.

  19. Summary
     • Some iRODS user examples:
     • USA: DataNet, NASA, NOAO, iPlant.
     • Canada: Virtual Observatory.
     • France: National Library, Strasbourg Observatory, Grenoble Campus, CINES, etc.
     • Europe: EUDAT, PRACE.
     • Private sector: DDN, etc.
     • Enterprise version ramping up (E-iRODS):
     • Wider source of funding.

  20. Learn more
     • https://www.irods.org/index.php/Main_Page
