
Toward new HSM solution using GPFS/TSM/StoRM integration

This project aims to combine GPFS, TSM, and StoRM to provide a transparent GRID-friendly HSM solution with information lifecycle management for efficient data movement between disks and tapes.


Presentation Transcript


  1. Toward new HSM solution using GPFS/TSM/StoRM integration
     Vladimir Sapunenko (INFN, CNAF), Luca dell’Agnello (INFN, CNAF), Daniele Gregori (INFN, CNAF), Riccardo Zappi (INFN, CNAF), Luca Magnoni (INFN, CNAF), Elisabetta Ronchieri (INFN, CNAF), Vincenzo Vagnoni (INFN, Bologna)

  2. Storage classes @ CNAF
     • Implementation of the 3 storage classes needed for LHC
     • Disk0Tape1 (D0T1) → CASTOR
       • space managed by the system
       • data migrated to tapes and deleted from disk when the staging area is full
     • Disk1Tape0 (D1T0) → GPFS/StoRM (in production)
       • space managed by the VO
     • Disk1Tape1 (D1T1) → CASTOR (production), GPFS/StoRM (production prototype for LHCb only)
       • space managed by the VO (i.e. if the disk is full, the copy fails)
       • large permanent disk buffer with tape back-end and no garbage collection
     HEPiX 2008, Geneva

  3. Looking into an HSM solution based on StoRM/GPFS/TSM
     • Project developed as a collaboration between:
       • the GPFS development team (US)
       • the TSM HSM development team (Germany)
       • end-users (INFN-CNAF)
     • The main idea is to combine new features of GPFS (v.3.2) and TSM (v.5.5) with SRM (StoRM) to provide a transparent, GRID-friendly HSM solution
     • Information Lifecycle Management (ILM) is used to drive the movement of data between disks and tapes
     • The interface between GPFS and TSM is on our shoulders
     • Improvements and development are needed from all sides
     • Transparent recalls vs. massive (list-ordered, optimized) recalls
     HEPiX 2008, Geneva

  4. What we have now
     • GPFS and TSM are widely used as separate products
     • Built-in functionality in both products to implement backup and archiving from GPFS
     • In GPFS v.3.2 the concept of “external storage pool” extends the use of policy-driven ILM to tape storage
     • Some groups in the HEP world are starting to investigate this solution or have expressed interest in starting
     HEPiX 2008, Geneva

  5. GPFS Approach: “External Pools”
     • External pools are really interfaces to external storage managers, e.g. HPSS or TSM
     • An external pool “rule” defines the script to call to migrate/recall/etc. files:
       RULE EXTERNAL POOL 'PoolName' EXEC 'InterfaceScript' [OPTS 'options']
     • The GPFS policy engine builds candidate lists and passes them to the external pool scripts (a sketch of such an interface script follows this slide)
     • The external storage manager actually moves the data
     HEPiX 2008, Geneva
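
To make the interface contract concrete, here is a minimal sketch (in Python, and not the script actually used at CNAF) of an external-pool interface script. It assumes that GPFS invokes the script with an operation keyword and the path of a file list built by the policy engine, and that the TSM HSM client commands dsmmigrate/dsmrecall are available on the node; the exact invocation and file-list record format should be checked against the GPFS documentation for the release in use.

#!/usr/bin/env python
# Hypothetical GPFS external-pool interface script (sketch, not the CNAF one).
# Assumed invocation: InterfaceScript <operation> <file-list> [options]
import subprocess
import sys

def parse_file_list(path):
    """Extract path names from a GPFS policy file list.

    Each record is assumed to end with '-- /full/path/name'; adjust to the
    format documented for the GPFS release in use."""
    files = []
    with open(path) as file_list:
        for line in file_list:
            if ' -- ' in line:
                files.append(line.split(' -- ', 1)[1].rstrip('\n'))
    return files

def main():
    operation = sys.argv[1]           # e.g. TEST, MIGRATE, RECALL
    if operation == 'TEST':
        sys.exit(0)                   # tell GPFS the external pool is usable
    files = parse_file_list(sys.argv[2])
    cmd = {'MIGRATE': 'dsmmigrate', 'RECALL': 'dsmrecall'}.get(operation)
    if cmd is None or not files:
        sys.exit(0)
    # Hand the candidate list to the TSM HSM client; the real scripts add
    # batching, logging and error handling here.
    sys.exit(subprocess.call([cmd] + files))

if __name__ == '__main__':
    main()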

  6. Storage class Disk1-Tape1
     • A D1T1 prototype in GPFS/TSM was tested for about two months
     • Quite simple, since there is no competition between migration and recall
     • D1T1 requires that every file written to disk is copied to tape (and remains resident on disk)
       • recalls are needed only in case of data loss (on disk)
     • Although D1T1 is a living concept…
     • Some adjustments were needed in StoRM
       • basically to place a file on hold for migration until the write operation is completed (SRM “putDone” on the file); see the sketch after this slide
     • Definitely positive results of the tests with the current testbed hardware
       • more tests at a larger scale are needed
       • a production model needs to be established
     HEPiX 2008, Geneva
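
As a purely illustrative sketch of the "hold until putDone" idea (the marker-file convention below is an assumption, loosely inspired by the PINPREFIX option in the configuration example later on, and is not necessarily how StoRM actually signals it), migration candidates could be filtered like this:

import os

PIN_PREFIX = '.STORM_T1D1_'   # assumed marker prefix, cf. the PINPREFIX option

def ready_for_migration(path):
    """Return True if the file may be migrated to tape.

    A file is held back while a companion marker file exists, i.e. until
    StoRM has seen the SRM putDone for it (hypothetical convention)."""
    directory, name = os.path.split(path)
    marker = os.path.join(directory, PIN_PREFIX + name)
    return not os.path.exists(marker)

def filter_candidates(candidates):
    """Drop candidates that are still being written."""
    return [f for f in candidates if ready_for_migration(f)]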

  7. Storage class Disk0-Tape1
     • The prototype is ready and being tested now
     • More complicated logic is needed
       • define the priority between reads and writes (for example, in the current version of CASTOR migration to tape has absolute priority)
       • reordering of recalls (“list-optimized recall”): by tape and by file position inside a tape (see the sketch after this slide)
     • The logic is realized by means of special scripts
     • First tests are encouraging, even considering the complexity of the problem
     • Modifications were requested in StoRM to implement the recall logic and pinning of files in use
     • The identified solutions are simple and linear
     HEPiX 2008, Geneva
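
The list-optimized recall idea can be sketched as follows: recall requests are grouped by tape volume and sorted by position on the tape, so that each tape is mounted once and read sequentially. The location_of() lookup is a placeholder for a query against the TSM server database, not a real TSM API call.

from collections import defaultdict

def order_recalls(requested_files, location_of):
    """Order recall requests for list-optimized recall.

    `location_of` maps a file path to (tape_volume, position_on_tape),
    e.g. via a query to the TSM server database (hypothetical lookup).
    Requests are grouped by tape and sorted by position, so each tape is
    mounted once and read front to back."""
    by_tape = defaultdict(list)
    for path in requested_files:
        volume, position = location_of(path)
        by_tape[volume].append((position, path))
    ordered = []
    for volume in sorted(by_tape):             # one mount per tape volume
        for _, path in sorted(by_tape[volume]):
            ordered.append(path)
    return ordered

# Usage with a stub lookup:
# order_recalls(files, location_of=lambda p: lookup_in_tsm_db(p))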

  8. GPFS+TSM tests
     • So far we have performed full tests of the D1T1 solution (StoRM+GPFS+TSM); the D0T1 implementation is being developed in close contact with the IBM GPFS and TSM developers
     • D1T1 is now entering its first production phase, being used by LHCb during this month’s CCRC08
       • as is D1T0, which is served by the same GPFS cluster but without migrations
     • The GPFS/StoRM-based D1T0 has also been in use by ATLAS since February
     HEPiX 2008, Geneva

  9. D1T0 and D1T1 @ CNAF using StoRM/GPFS/TSM
     • 3 StoRM instances
     • 3 major HEP experiments
     • 2 storage classes
     • 12 servers, 200 TB of disk space
     • 3 LTO-2 tape drives
     HEPiX 2008, Geneva

  10. Hardware used for the tests
     • 40 TB GPFS file system (v.3.2.0-3) served by 4 I/O NSD servers (SAN devices are EMC CX3-80)
     • FC (4 Gbit/s) interconnection between servers and disk arrays
     • TSM v.5.5
     • 2 servers (1 Gb Ethernet) as HSM front-ends, each acting as:
       • GPFS client (reads and writes on the file system via LAN)
       • TSM client (reads and writes from/to tapes via FC)
     • 3 LTO-2 tape drives
     • The tape library (STK L5500) is shared between Castor and TSM
       • i.e. both work with the same tape library
     HEPiX 2008, Geneva

  11. LHCb D1T0 and D1T1 details
     • 2 EMC CX3-80 controllers
     • 4 GPFS servers
     • 2 StoRM servers
     • 2 GridFTP servers
     • 2 HSM front-end nodes
     • 3 LTO-2 tape drives
     • 1 TSM server
     • 1/10 Gbps Ethernet
     • 2/4 Gbps FC
     [Architecture diagram: GridFTP and GPFS servers on the FC SAN and Gigabit LAN; two GPFS/TSM client (HSM front-end) nodes attached via the FC TAN to the three tape drives; TSM server with its DB, plus a backup TSM server with a DB mirror]
     HEPiX 2008, Geneva

  12. How it works
     • GPFS performs file system metadata scans according to ILM policies specified by the administrators
     • The metadata scan is very fast (it is not a find…) and is used by GPFS to identify the files that need to be migrated to tape
     • Once the list of files is obtained, it is passed to an external process that runs on the HSM nodes and actually performs the migration to TSM
       • this is in particular what we implemented (a sketch of the migration driver follows this slide)
     • Note:
       • the GPFS file system and the HSM nodes can be kept completely decoupled, in the sense that it is possible to shut down the HSM nodes without interrupting file system availability
       • all components of the system have intrinsic redundancy (GPFS failover mechanisms)
       • no need to put in place any kind of HA features (apart from the unique TSM server)
     HEPiX 2008, Geneva
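
To give a feel for the interface logic (a sketch only, under assumed conventions; the real CNAF scripts are more elaborate), the fragment below takes the candidate list produced by a GPFS metadata scan on an HSM node, splits it into streams of a configurable number of files, runs a bounded number of dsmmigrate invocations (the TSM HSM client's migrate command) in parallel, and abandons the session after a configurable timeout, leaving leftovers for the next scan. The parameter names mirror those in the configuration example shown later.

import queue
import subprocess
import threading
import time

# Values mirroring the configuration file shown later (sketch only)
MIGRATETHREADSMAX = 30        # max parallel migrate streams per HSM node
MIGRATESTREAMNUMFILES = 30    # files handed to one dsmmigrate invocation
MIGRATESESSIONTIMEOUT = 4800  # seconds before the session is abandoned

def run_migration_session(candidates):
    """Migrate the candidate list from a GPFS policy scan in parallel streams.

    Streams still queued when the timeout expires are simply dropped; the
    next metadata scan will list those files again."""
    start = time.time()
    work = queue.Queue()
    for i in range(0, len(candidates), MIGRATESTREAMNUMFILES):
        work.put(candidates[i:i + MIGRATESTREAMNUMFILES])

    def worker():
        while time.time() - start < MIGRATESESSIONTIMEOUT:
            try:
                stream = work.get_nowait()
            except queue.Empty:
                return
            # Hand one batch of files to the TSM HSM client
            subprocess.call(['dsmmigrate'] + stream)

    threads = [threading.Thread(target=worker) for _ in range(MIGRATETHREADSMAX)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()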

  13. Example of an ILM policy
     /* Policy implementing T1D1 for LHCb:
        -) 1 GPFS storage pool
        -) 1 SRM space token: LHCb_M-DST
        -) 1 TSM management class
        -) 1 TSM storage pool */

     /* Placement policy rules */
     RULE 'DATA1' SET POOL 'data1' LIMIT (99)
     RULE 'DATA2' SET POOL 'data2' LIMIT (99)
     RULE 'DEFAULT' SET POOL 'system'

     /* We have 1 space token: LHCb_M-DST. Define 1 external pool accordingly. */
     RULE EXTERNAL POOL 'TAPE MIGRATION LHCb_M-DST' EXEC '/var/mmfs/etc/hsmControl' OPTS 'LHCb_M-DST'

     /* Exclude from migration hidden directories (e.g. .SpaceMan), baby files, hidden and weird files. */
     RULE 'exclude hidden directories' EXCLUDE WHERE PATH_NAME LIKE '%/.%'
     RULE 'exclude hidden file' EXCLUDE WHERE NAME LIKE '.%'
     RULE 'exclude empty files' EXCLUDE WHERE FILE_SIZE=0
     RULE 'exclude baby files' EXCLUDE WHERE (CURRENT_TIMESTAMP-MODIFICATION_TIME)<INTERVAL '3' MINUTE
     HEPiX 2008, Geneva

  14. Example of an ILM policy (cont.)
     /* Migrate to the external pool according to space token (i.e. fileset). */
     RULE 'migrate from system to tape LHCb_M-DST' MIGRATE FROM POOL 'system' THRESHOLD(0,100,0)
          WEIGHT(CURRENT_TIMESTAMP-ACCESS_TIME) TO POOL 'TAPE MIGRATION LHCb_M-DST' FOR FILESET('LHCb_M-DST')
     RULE 'migrate from data1 to tape LHCb_M-DST' MIGRATE FROM POOL 'data1' THRESHOLD(0,100,0)
          WEIGHT(CURRENT_TIMESTAMP-ACCESS_TIME) TO POOL 'TAPE MIGRATION LHCb_M-DST' FOR FILESET('LHCb_M-DST')
     RULE 'migrate from data2 to tape LHCb_M-DST' MIGRATE FROM POOL 'data2' THRESHOLD(0,100,0)
          WEIGHT(CURRENT_TIMESTAMP-ACCESS_TIME) TO POOL 'TAPE MIGRATION LHCb_M-DST' FOR FILESET('LHCb_M-DST')
     HEPiX 2008, Geneva

  15. Example of configuration file
     # TSM admin user name
     TSMID=xxxxx
     # TSM admin user password
     TSMPASS=xxxxx
     # report period (in sec)
     REPORTFREQUENCY=86400
     # report email addresses (comma separated)
     REPORTEMAILADDRESS=Vladimir.Sapunenko@cnaf.infn.it,Daniele.Gregori@cnaf.infn.it,Luca.dellAgnello@cnaf.infn.it,Angelo.Carbone@bo.infn.it,Vincenzo.Vagnoni@bo.infn.it
     # alarm email addresses (comma separated)
     ALARMEMAILADDRESS=t1-admin@cnaf.infn.it
     # alarm email delay (in sec)
     ALARMEMAILDELAY=7200
     # HSM node list (comma separated)
     HSMNODES=diskserv-san-14,diskserv-san-16
     # system directory path
     SVCFS=/storage/gpfs_lhcb/system
     # filesystem scan minimum frequency (in sec)
     SCANFREQUENCY=1800
     # maximum time allowed for a migrate session (in sec)
     MIGRATESESSIONTIMEOUT=4800
     # maximum number of migrate threads per node
     MIGRATETHREADSMAX=30
     # number of files for each migrate stream
     MIGRATESTREAMNUMFILES=30
     # sleep time for lock file check loop
     LOCKSLEEPTIME=2
     # pin prefix
     PINPREFIX=.STORM_T1D1_
     HEPiX 2008, Geneva

  16. Example of a report
     A first automatic reporting system has been implemented.

     ---------------------------------------------------------------------------
     Start:   Sun 04 May 2008 11:38:48 PM CEST
     Stop:    Mon 05 May 2008 08:03:15 AM CEST
     Seconds: 30267
     ---------------------------------------------------------------------------
     Tape              Files  Failures  File throughput   Total throughput
     L00595                5         0   31.0798 MiB/s    0.702259 MiB/s
     L00599               10         0   32.4747 MiB/s     1.41891 MiB/s
     L00611               57         0   29.0862 MiB/s     6.59165 MiB/s
     L00614               47         0   31.5084 MiB/s     6.61944 MiB/s
     L00615               46         0   30.3926 MiB/s     6.57133 MiB/s
     L00617               47         0   31.1735 MiB/s      6.5116 MiB/s
     L00618               62         0   28.4119 MiB/s     6.06469 MiB/s
     L00619               44         0   27.0226 MiB/s     4.10937 MiB/s
     L00620               53         0   27.1009 MiB/s     7.13976 MiB/s
     L00621               66         0   28.9043 MiB/s     6.67269 MiB/s
     L00624               44         0   11.4347 MiB/s     5.82468 MiB/s
     L00626               62         0   30.4792 MiB/s     6.53114 MiB/s
     ---------------------------------------------------------------------------
     Drive             Files  Failures  File throughput   Total throughput
     DRIVE3              218         0   30.2628 MiB/s     25.7269 MiB/s
     DRIVE4              197         0   29.5188 MiB/s     23.6487 MiB/s
     DRIVE5              128         0   21.5395 MiB/s     15.3819 MiB/s
     ---------------------------------------------------------------------------
     Host              Files  Failures  File throughput   Total throughput
     diskserv-san-14     285         0   29.9678 MiB/s     34.0331 MiB/s
     diskserv-san-16     258         0   25.6928 MiB/s     30.7245 MiB/s
     ---------------------------------------------------------------------------
                       Files  Failures  File throughput   Total throughput
     Total               543         0   27.9366 MiB/s     64.7575 MiB/s
     ---------------------------------------------------------------------------

     The alarm part is being developed. An email with the report is sent every day (the period is configurable via the option file); a sketch of how such a summary can be aggregated follows this slide.
     HEPiX 2008, Geneva
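
For illustration, the per-tape, per-drive and per-host summaries above could be produced by aggregating one record per migrated file, along the lines of the sketch below (the record fields and their names are assumptions, not the actual log format used at CNAF).

from collections import defaultdict

def summarize(records, key):
    """Aggregate migration records by `key` ('tape', 'drive' or 'host').

    Each record is assumed to carry the migrated file's size in MiB, the
    wall-clock seconds spent writing it, a success flag, and the length in
    seconds of the reporting period (used for the total-throughput column)."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)
    rows = []
    for name, recs in sorted(groups.items()):
        ok = [r for r in recs if r['ok']]
        mib = sum(r['size_mib'] for r in ok)
        write_seconds = sum(r['seconds'] for r in ok)
        file_tp = mib / write_seconds if write_seconds else 0.0   # MiB/s while writing
        total_tp = mib / recs[0]['report_seconds']                # MiB/s over the whole period
        rows.append((name, len(recs), len(recs) - len(ok), file_tp, total_tp))
    return rows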

  17. Description of the tests
     • Test A
       • data transfer of LHCb files from CERN Castor disk to CNAF StoRM/GPFS using the File Transfer Service (FTS)
       • automatic migration of the data files from GPFS to TSM while the data was being transferred by FTS
       • this is a realistic scenario
     • Test B
       • 1 GiB zeroed files created locally on the GPFS file system with migration turned off, then migrated to tape when the writes were finished
       • the migration of zeroed files to tape is faster due to compression → measures the physical limits of the system
     • Test C
       • similar to Test B, but with real LHCb data files instead of dummy zeroed files
       • realistic scenario, e.g. when a long queue of files to be migrated accumulates in the file system during maintenance
     HEPiX 2008, Geneva

  18. Test A: input files
     [Plot: file size distribution]
     • Most of the files are 4 GiB or 2 GiB in size, with a few other sizes in addition
     • The data files are LHCb stripped DSTs
     • 2477 files
     • 8 TiB in total
     HEPiX 2008, Geneva

  19. Test A: results
     [Plot: throughput vs. time; black curve: net data throughput from CERN to CNAF, red curve: net data throughput from GPFS to TSM; annotations mark when a third LTO-2 drive was added, when a drive was removed, and when FTS transfers were temporarily interrupted]
     • Started with just two LTO-2 drives
     • Zero tape migration failures, zero retrials
     • 8 TiB in total were transferred from CERN to tape in 150k seconds (almost 2 days)
     • About 50 MiB/s to tape with two LTO-2 drives and 65 MiB/s with three LTO-2 drives
     HEPiX 2008, Geneva

  20. Test A: results (II)
     [Plot: retention time on disk (time from when a file is written until it is migrated to tape)]
     • Most of the files were migrated in less than 3 hours, with a tail up to 8 hours
     • The tail comes from the fact that at some point the CERN-to-CNAF throughput rose to 80 MiB/s, exceeding the maximum tape-migration performance at that time, so GPFS/TSM accumulated a queue of files with respect to the FTS transfers
     HEPiX 2008, Geneva

  21. Test A: results (III)
     [Plot: distribution of throughput per migration to tape]
     • The distribution peaks at about 33 MiB/s, which is the maximum the LTO-2 drives can sustain for LHCb data files
       • due to compression, the actual performance depends on the content of the files…
     • The tail is mostly due to the fact that some of the tapes showed much smaller throughputs
       • for this test we reused old tapes no longer used by Castor
     • What is the secondary peak? It is due to files written at the end of a tape, which TSM splits onto a subsequent tape (i.e. it must dismount the tape and mount a new one to continue writing the file)
     HEPiX 2008, Geneva

  22. Intermezzo
     • Between Test A and Test B we realized that the interface logic was not perfectly balancing the load between the two HSM nodes
     • The interface logic was then slightly changed in order to improve the performance
     HEPiX 2008, Geneva

  23. Test B: results
     [Plot: net throughput to tape versus time]
     • File system prefilled with 1000 files of 1 GiB each, all filled with zeroes
       • migration to tape turned off while writing the data to disk
       • migration to tape turned on when prefilling finished
     • Hardware compression is very effective for such files
     • About 100 MiB/s observed over 10k seconds
     • What is the valley in the plot? Explained on the next slide, where the valleys are more visible
     • No tape migration failures and no retrials observed
     HEPiX 2008, Geneva

  24. Test C: results
     [Plot: net throughput to tape versus time]
     • Similar to Test B, but with real LHCb data files taken from the same sample as Test A instead of zeroed files
     • The valleys clearly visible here have a period of exactly 4800 seconds
       • they were also partially present in Test A, but not clearly visible in that plot due to the larger binning
     • The valleys are due to a tunable feature of our interface
       • each migration session is timed out if not finished within 4800 seconds
       • after the timeout GPFS performs a new metadata scan and a new migration session is initiated
       • 4800 seconds is not a magic number; it could be larger or even infinite
     • About 70 MiB/s on average, with peaks up to 90 MiB/s
     • No tape migration failures and no retrials observed
     HEPiX 2008, Geneva

  25. Conclusions and outlook
     • The first phase of tests for the StoRM/GPFS/TSM-based T1D1 has concluded
       • LHCb is now starting the first production experience with such a T1D1 system
     • Work is ongoing on a T1D0 implementation in collaboration with the IBM GPFS and TSM HSM development teams
       • T1D0 is more complicated, since it should include active recall optimization, concurrence between migrations and recalls, etc.
       • IBM will introduce efficient ordered-recall features in the next major release of TSM
       • while waiting for that release, we are implementing it through an intermediate layer of intelligence between GPFS and TSM, driven by StoRM
       • a first proof-of-principle prototype already exists, but this is something to be discussed in a future talk… stay tuned!
     • A new tape library was recently acquired at CNAF
       • once the new library is online and the old data files have been repacked to it, the old library will be devoted entirely to TSM production systems and testbeds
       • about 15 drives, a much more realistic and interesting scale than 3 drives
     HEPiX 2008, Geneva
