3rd French HPSS users Meeting
Jessica Orban
Jessica.orban@ecmwf.int
European cooperation at its best
• ECMWF’s role is to address the critical and most difficult research problems in medium-range NWP (Numerical Weather Prediction) that no one country could tackle on its own
• Deliverables and research
  • Global numerical weather forecasts
  • Composition of the atmosphere: monitoring and forecasting
  • Climate reanalysis: monitoring
  • Supercomputing & data archiving
  • Education programme
European Centre for Medium-Range Weather Forecasts
Summary
• Business as usual
• Migration incident
• Data growth for the next years
• TS4500 acceptance tests
• HPSS 7.5.3 migration
• HPSS 7.5.3 testing
• Bologna
Business as usual
• 3 environments: prod, preprod and test
• HPSS 7.4.2u1p1
• AIX 6.1
• 1 partition
• 3 subsystems:
  • General
    • 2 TiB of disks (4 LUNs of 512 GiB)
    • 7.04 PiB of tapes (431 tapes)
    • 401 883 files, 6 184 directories, 4 filesets
  • Mars
    • No disk
    • 357.18 PiB of tapes (53 509 tapes)
    • 11 327 276 files, 7 020 516 directories, 629 filesets, 623 junctions
Business as usual
• ECFS
  • 1.16 PiB of disks (64 LUNs of 512 GiB, 268 LUNs of 2 TiB, 138 LUNs of 4 TiB, 2 LUNs of 8 TiB)
  • 95.19 PiB of tapes (17 655 tapes)
  • 367 600 840 files, 30 826 308 directories, 13 filesets, 10 junctions
• 36 CoS/Hier (5 tests)
• 31 active SC (10 disk SC, 21 tape SC)
• Devices:
  • 493 disks
  • SL8500
    • 16 LTO7, 155 T10KD, 56 T10KC
  • TS3500
    • 11 LTO6, 10 LTO7, 6 LTO8
• 70 tape drives moved to direct connection
  • Had to set the QLogic driver parameter "Target enable reset" to 0 (comparable to the lpfc module parameter lpfc_fcp2_no_tgt_reset)
    echo "options qla2xxx ql2xtargetreset=0" > /etc/modprobe.d/qla2xxx.conf
Business as usual
• Repack
  • MARS used tape-to-tape hierarchies
  • Some research data are deleted after being written to both levels of the hierarchy
  • About 6000 tapes repacked in the last 6 months
• Big Purge
  • 60 PB deleted in April 2018
  • 35 PB deleted in March 2019
Business as usual
• Change of technologies
  • 1 SC (secondary copy) changed from LTO6 to LTO7
    • 2597 LTO5
    • 11608 LTO6
    • 384 LTO7
    • 2400 LTO5 repacked since mid-February
  • 2 SC (secondary copy) changed from LTO6 to LTO8M
    • Write only new data, as we don’t have enough LTO8 drives to migrate data and repack at the same time
    • 1st SC
      • 11754 LTO6
      • 330 LTO8M
    • 2nd SC
      • 5586 LTO6
      • 170 LTO8M
Migration incident
Data growth for the coming years
• Estimate based on the new HPC (new ITT mid 2019)
  • end 2019: 410 PB
  • end 2020: 580 PB
  • end 2021: 810 PB
  • end 2022: 1 EB
• Will depend on the next HPC and which upgrades we will be able to afford
  • end 2023: 1.5 EB
  • end 2024: 2.1 EB
  • end 2025: 2.9 EB
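As a rough illustration of what these estimates imply, the sketch below computes the year-on-year growth factors and the compound annual growth rate from the figures listed above. It assumes 1 EB = 1000 PB; the function and variable names are illustrative only.

```python
# Projected archive size at the end of each year, in PB, as listed on the slide.
# Assumption: 1 EB is taken as 1000 PB.
projection_pb = {
    2019: 410,
    2020: 580,
    2021: 810,
    2022: 1000,
    2023: 1500,
    2024: 2100,
    2025: 2900,
}


def yearly_growth(projection):
    """Return {year: size(year) / size(previous year)} for consecutive years."""
    years = sorted(projection)
    return {
        year: projection[year] / projection[previous]
        for previous, year in zip(years, years[1:])
    }


def compound_annual_growth(projection):
    """Return the constant yearly rate that links the first and last estimate."""
    years = sorted(projection)
    first, last = years[0], years[-1]
    return (projection[last] / projection[first]) ** (1 / (last - first)) - 1


growth = yearly_growth(projection_pb)
cagr = compound_annual_growth(projection_pb)

for year, factor in growth.items():
    print(f"end {year}: x{factor:.2f} compared with the previous year")
print(f"compound annual growth {min(projection_pb)}-{max(projection_pb)}: {cagr:.1%}")
```

The estimates correspond to the archive growing by roughly 40% per year over the whole period, with a slower step between 2021 and 2022.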
TS4500 acceptance tests
• Functional tests
  • TS1160: bug in D3I5_457 firmware (fixed in new firmware). All tapes must be partitioned again. Data on tape will be lost.
  • Drive error codes are not reported in the GUI
  • Syslog messages are issued with localhost instead of the library’s IP and are hardcoded to local3
  • GUI can’t display 16 Gb/s on FC ports
• Reliability tests
  • Redundant power loss is not always reported in SNMP or syslog
  • Several issues with I/O stations, including tapes dropped inside the library
  • Sometimes, putting one accessor in service disrupts the other one
HPSS 7.5.3 migration
• Objectives:
  • Migrate core servers from AIX to Linux (finally)
  • Migrate from single-partition HPSS 7.4.2 to a partitioned (9) HPSS 7.5.3
  • Do this with minimal interruption of service.
    • conversion done on the fly, with the AIX environment operational
    • small downtime (2-3 hours?) to complete the conversion and transfer services to the Linux machine.
• Factors:
  • Operation should have happened 2 years ago, in two steps.
    • Other projects (e.g. relocation) delayed this process.
    • Decision to do the two jumps in one go.
  • Our AIX machine has fairly limited resources.
  • One very busy subsystem, with 370 million files
    • A hell of a lot of data to transfer and migrate
    • very active database with many parallel transactions
HPSS 7.5.3 migration
• General concepts
  • qrep
    • load: copy (with transform) one table at a time.
    • apply changes to loaded tables.
  • qverify
    • compare the contents of source and target databases, and track differences.
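The load / apply changes / verify pattern described above can be illustrated with a toy model. This is not the actual qrep or qverify tooling: tables are plain dictionaries keyed by primary key, and the `add_partition` transform (assigning a row to one of 9 partitions by `id % 9`) is a hypothetical stand-in for the real schema conversion.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

Row = Dict[str, object]
Table = Dict[int, Row]  # primary key -> row
Change = Tuple[str, int, Optional[Row]]  # (operation, primary key, new row)


def load(source: Table, transform: Callable[[Row], Row]) -> Table:
    """Initial load: copy every row of one table, transforming it on the way."""
    return {key: transform(row) for key, row in source.items()}


def apply_changes(target: Table, changes: Iterable[Change],
                  transform: Callable[[Row], Row]) -> None:
    """Replay changes captured on the source after the load started."""
    for operation, key, row in changes:
        if operation == "delete":
            target.pop(key, None)
        else:  # "insert" or "update"
            target[key] = transform(row)


def verify(source: Table, target: Table,
           transform: Callable[[Row], Row]) -> Dict[int, str]:
    """Compare source and target contents and report every difference."""
    differences = {}
    for key in source.keys() | target.keys():
        if key not in target:
            differences[key] = "missing in target"
        elif key not in source:
            differences[key] = "extra in target"
        elif transform(source[key]) != target[key]:
            differences[key] = "content differs"
    return differences


def add_partition(row: Row) -> Row:
    """Hypothetical transform: tag each row with one of 9 partitions."""
    return {**row, "partition": row["id"] % 9}


source = {1: {"id": 1, "name": "a"}, 2: {"id": 2, "name": "b"}}
target = load(source, add_partition)

# The source keeps changing while the conversion runs.
source[2] = {"id": 2, "name": "renamed"}
source[3] = {"id": 3, "name": "c"}
before = verify(source, target, add_partition)

apply_changes(target, [("update", 2, source[2]), ("insert", 3, source[3])],
              add_partition)
after = verify(source, target, add_partition)
```

In this model the verification step reports two differences while the changes are still pending and none once they have been applied, which is the condition that has to hold before services can be switched to the target.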
HPSS 7.5.3 migration
• 3Q18: test environment converted with minimal issues.
• 4Q18: Logic issues discovered while starting the production conversion
  • heavy usage of renames and partitioning did not mix well during the apply changes part of qrep
• Dec18: New conversion code delivered, but performance issues encountered on AIX.
  • Latency between source table update and target table update reaches several days.
• 1Q19: additional parallelism and better capture balancing are introduced.
  • we now just about manage to keep up with the source machine’s updates.
• 2Q19: We are now dealing with qverify performance.
HPSS 7.5.3 migration
• The conversion date has been moved multiple times.
• May 4th is the ultimate target.
  • Our AIX boxes are not supported anymore.
  • We want some of the 7.5 features.
  • We need 7.5.3 to connect TS1160s
  • Francis will leave ECMWF shortly after.
• Hopefully a qrep-based solution, but...
• ... We have a plan B
  • Accept a long downtime and load databases offline.
  • 10-12 hours downtime (estimate based on a dry-run test).
  • What if errors are encountered on the day?
HPSS 7.5.3 testing
• Tape media 60F JE default size is 320 TB
• Recover
  • Tests:
    • Hierarchies
      • Disk to dual copies on tape
      • Tape to tape
    • Recover of second copy
    • Recover of primary copy
  • Results:
    • It works and uses the TOR and RAO features (RAO calls can be improved)
    • No logs in Alarms and Events
    • No timestamp in the recover logfile (/var/hpss/tmp/recover_<VolID>.txt). If run several times, history is lost
    • Doesn’t indicate which tapes are needed for the recovery
    • CRs opened for dry run and listing of tapes needed
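Until the recover logfile carries its own history, one possible workaround is to copy it aside under a timestamped name after each run. The sketch below is an assumption about how that could be done, not part of HPSS: the function name `archive_recover_log` and the naming scheme of the copy are hypothetical, and only the logfile location comes from the slide.

```python
import shutil
import time
from pathlib import Path
from typing import Optional


def archive_recover_log(vol_id: str,
                        log_dir: str = "/var/hpss/tmp") -> Optional[Path]:
    """Copy recover_<VolID>.txt to recover_<VolID>.<timestamp>.txt.

    The timestamp is the last modification time of the logfile, so each
    recover run keeps its own copy. Returns the path of the copy, or None
    if there is no logfile for this volume.
    """
    source = Path(log_dir) / f"recover_{vol_id}.txt"
    if not source.is_file():
        return None
    stamp = time.strftime("%Y%m%dT%H%M%S",
                          time.localtime(source.stat().st_mtime))
    destination = source.with_name(f"recover_{vol_id}.{stamp}.txt")
    shutil.copy2(source, destination)  # copy2 also preserves the file times
    return destination
```

Calling `archive_recover_log("<VolID>")` after each recover run would leave one timestamped file per run next to the original logfile.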
HPSS 7.5.3 testing
• Db2 configuration
  • Change database parameter LOGSECOND from 10 to -1
• Db2 backup
  • With multiple partitions, each database has 1 backup file per partition
  • Db2_fullbackup.ksh only verifies and copies the last written file to the secondary backup partition
• Tape drive quotas
  • Several major bugs found
    • If a drive is locked while in use, In Use values (read or write) are not updated (fixed in the last patch)
    • When PVL is restarted, In Use values are reset to 0 (fixed in the last patch)
  • CR opened to set the recall limit as a percentage
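Because a partitioned database produces one backup file per partition, a completeness check has to look at all of them rather than only the last written file. The sketch below is one way to do that, under two assumptions: backup images follow the Db2 naming pattern `<alias>.<type>.<instance>.DBPART<nnn>.<timestamp>.<seq>`, and all partitions of one backup share the same timestamp. The database and instance names in the example are made up.

```python
import re
from collections import defaultdict

# Assumed Db2 backup image name: alias.type.instance.DBPARTnnn.timestamp.seq
BACKUP_NAME = re.compile(
    r"^(?P<db>[^.]+)\.\d\.(?P<instance>[^.]+)"
    r"\.DBPART(?P<part>\d{3})\.(?P<ts>\d{14})\.\d{3}$"
)


def partitions_per_backup(file_names):
    """Group backup images by (database, timestamp) and collect partitions."""
    found = defaultdict(set)
    for name in file_names:
        match = BACKUP_NAME.match(name)
        if match:  # ignore files that are not backup images
            found[(match["db"], match["ts"])].add(int(match["part"]))
    return found


def incomplete_backups(file_names, partition_count):
    """Return {(database, timestamp): [missing partitions]} for partial sets."""
    expected = set(range(partition_count))
    return {
        key: sorted(expected - parts)
        for key, parts in partitions_per_backup(file_names).items()
        if parts != expected
    }


# Hypothetical listing: partition 2 of the second backup is missing.
example_files = [
    "HSUBSYS1.0.hpssdb.DBPART000.20190401120000.001",
    "HSUBSYS1.0.hpssdb.DBPART001.20190401120000.001",
    "HSUBSYS1.0.hpssdb.DBPART002.20190401120000.001",
    "HSUBSYS1.0.hpssdb.DBPART000.20190402120000.001",
    "HSUBSYS1.0.hpssdb.DBPART001.20190402120000.001",
]
missing = incomplete_backups(example_files, partition_count=3)
```

A backup script could refuse to mark a backup as copied to the secondary location until this check returns an empty result for it.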
HPSS 7.5.3 testing
• Tape drive quotas
  • If the number of unlocked drives for a PVR goes below the recall limit (recall limit not set to -1), the recall limit is automatically changed to the number of available drives
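The observed behaviour can be written down as a small rule. This is a sketch of the behaviour as described above, not HPSS code; the function name is hypothetical, and it assumes that "available drives" means the unlocked drives of the PVR.

```python
def effective_recall_limit(recall_limit: int, unlocked_drives: int) -> int:
    """Return the recall limit in force for a PVR.

    A recall limit of -1 is left untouched. Otherwise, when fewer drives
    are unlocked than the configured limit, the limit drops to the number
    of unlocked drives.
    """
    if recall_limit == -1:
        return -1
    if unlocked_drives < recall_limit:
        return unlocked_drives
    return recall_limit
```

For example, with a recall limit of 10 and only 6 unlocked drives the limit in force becomes 6, while a limit of 4 stays at 4.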
Data Centre fit-out timeline (figure, 2019 to 2020)
• Phases: Procurements; Delivery and installation (racks, fibres, network, servers, storage); Operational services
• Milestones: Start procurements; HPC contract signed; Data Centre construction; Bologna DC handover; N&S infrastructure deployed; 100 Gbps link Reading-Bologna; DHS operation in Bologna; HPC operational in Bologna
• Dates marked on the figure: 10-2019 and 05-2020
Questions?
Thank you