
3rd French HPSS Users Meeting

Jessica Orban, Jessica.orban@ecmwf.int. European cooperation at its best: ECMWF’s role is to address the critical and most difficult research problems in medium-range NWP (Numerical Weather Prediction) that no one country could tackle on its own.


Presentation Transcript


  1. 3rd French HPSS Users Meeting • Jessica Orban • Jessica.orban@ecmwf.int

  2. European cooperation at its best • ECMWF’s role is to address the critical and most difficult research problems in medium-range NWP (Numerical Weather Prediction) that no one country could tackle on its own • Deliverables and research • Global numerical weather forecasts • Composition of the atmosphere: monitoring and forecasting • Climate reanalysis: monitoring • Supercomputing & data archiving • Education programme European Centre for Medium-Range Weather Forecasts

  3. European Centre for Medium-Range Weather Forecasts

  4. Summary • Business as usual • Migration incident • Data growth for the next years • TS4500 acceptance tests • HPSS 7.5.3 migration • HPSS 7.5.3 testing • Bologna

  5. Business as usual • 3 environments: prod, preprod and test • HPSS 7.4.2u1p1 • AIX 6.1 • 1 partition • 3 subsystems: • General • 2 TiB of disks (4 LUNs of 512 GiB) • 7.04 PiB of tapes (431 tapes) • 401 883 files, 6 184 directories, 4 filesets • Mars • No disk • 357.18 PiB of tapes (53 509 tapes) • 11 327 276 files, 7 020 516 directories, 629 filesets, 623 junctions

  6. Business as usual • ECFS • 1.16 PiB of disks (64 LUNs of 512 GiB, 268 LUNs of 2 TiB, 138 LUNs of 4 TiB, 2 LUNs of 8 TiB) • 95.19 PiB of tapes (17 655 tapes) • 367 600 840 files, 30 826 308 directories, 13 filesets, 10 junctions • 36 CoS/Hier (5 tests) • 31 active SC (10 disk SC, 21 tape SC) • Devices: • 493 disks • SL8500 • 16 LTO7, 155 T10KD, 56 T10KC • TS3500 • 11 LTO6, 10 LTO7, 6 LTO8 • 70 tape drives moved to direct connection • Had to set the QLogic driver parameter « Target enable reset » to 0 (the comparable lpfc module parameter is lpfc_fcp2_no_tgt_reset): echo "options qla2xxx ql2xtargetreset=0" > /etc/modprobe.d/qla2xxx.conf

  7. Business as usual • Repack • MARS uses tape-to-tape hierarchies • Some research data are deleted after being written to both levels of the hierarchy • About 6000 tapes repacked in the last 6 months • Big Purge • 60 PB deleted in April 2018 • 35 PB deleted in March 2019

  8. Business as usual • Change of technologies • 1 SC (secondary copy) changed from LTO6 to LTO7 • 2597 LTO5 • 11608 LTO6 • 384 LTO7 • 2400 LTO5 repacked since mid-February • 2 SC (secondary copy) changed from LTO6 to LTO8M • Write only new data, as we don’t have enough LTO8 drives to migrate data and repack at the same time • 1st SC • 11754 LTO6 • 330 LTO8M • 2nd SC • 5586 LTO6 • 170 LTO8M

  9. Business as usual

  10. Business as usual

  11. Business as usual

  12. Business as usual

  13. Migration incident

  14. Data growth for the next years • Estimate based on the new HPC (new ITT mid 2019) • end 2019: 410 PB • end 2020: 580 PB • end 2021: 810 PB • end 2022: 1 EB • Will depend on the next HPC and which upgrades we will be able to afford • end 2023: 1.5 EB • end 2024: 2.1 EB • end 2025: 2.9 EB
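
The year-on-year growth implied by these estimates can be worked out with a few lines of arithmetic. The sketch below is illustrative only: it takes the figures from the slide and assumes decimal units (1 EB = 1000 PB), a convention the slide does not state.

```python
# Archive size estimates from the slide, expressed in PB.
# Assumption: 1 EB = 1000 PB (decimal units); the slide does not say which
# convention is used.
PB_PER_EB = 1000

estimates_pb = {
    2019: 410,
    2020: 580,
    2021: 810,
    2022: 1 * PB_PER_EB,
    2023: 1.5 * PB_PER_EB,
    2024: 2.1 * PB_PER_EB,
    2025: 2.9 * PB_PER_EB,
}


def growth_factors(estimates):
    """Return {year: size / previous year's size} for consecutive years."""
    years = sorted(estimates)
    return {
        year: estimates[year] / estimates[prev]
        for prev, year in zip(years, years[1:])
    }


factors = growth_factors(estimates_pb)
for year, factor in factors.items():
    print(f"end {year}: x{factor:.2f} over the previous year")
```

Under that unit assumption, every step works out to a factor of roughly 1.2 to 1.5 per year.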

  15. TS4500 acceptance tests • Functional tests • TS1160: bug in D3I5_457 firmware (fixed in new firmware). All tapes must be partitioned again. Data on tape will be lost. • Drive error codes are not reported in the GUI • Syslog messages are issued with localhost instead of the library’s IP and are hardcoded to local3 • GUI can’t display 16 Gb/s on FC ports • Reliability tests • Redundant power loss is not always reported in SNMP or syslog • Several issues with I/O stations, including tapes dropped inside the library • Sometimes, putting an accessor in service interferes with the other one

  16. HPSS 7.5.3 migration • Objectives: • Migrate core servers from AIX to Linux (finally) • Migrate from single-partition HPSS 7.4.2 to a partitioned (9) HPSS 7.5.3 • Do this with minimal interruption of service. • Conversion done on the fly, with the AIX environment operational • Small downtime (2-3 hours?) to complete the conversion and transfer services to the Linux machine. • Factors: • Operation should have happened 2 years ago, in two steps. • Other projects (e.g. relocation) delayed this process. • Decision to do the two jumps in one go. • Our AIX machine has fairly limited resources. • One very busy subsystem, with 370 million files • A hell of a lot of data to transfer and migrate • Very active database with many parallel transactions

  17. HPSS 7.5.3 migration • General concepts • qrep • load: copy (with transform) one table at a time. • apply changes to loaded tables. • qverify • compare the contents of the source and target databases, and track differences.
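
The load, apply and verify phases can be pictured with a toy model. The sketch below is not qrep or qverify and does not use any HPSS or Db2 interface: tables are plain dictionaries, and every name in it (`load`, `apply_changes`, `verify`, `add_partition`, the change-record format) is invented for illustration.

```python
def load(source_table, transform):
    """Initial load: copy every row of one table, applying a transform."""
    return {key: transform(row) for key, row in source_table.items()}


def apply_changes(target_table, changes, transform):
    """Replay changes captured on the source while the load was running."""
    for op, key, row in changes:
        if op == "delete":
            target_table.pop(key, None)
        else:  # "insert" or "update"
            target_table[key] = transform(row)


def verify(source_table, target_table, transform):
    """Compare source and target; return the sorted keys that differ."""
    differences = []
    for key in source_table.keys() | target_table.keys():
        expected = transform(source_table[key]) if key in source_table else None
        if target_table.get(key) != expected:
            differences.append(key)
    return sorted(differences)


def add_partition(row, partitions=9):
    """Hypothetical transform: derive a partition number from the row id."""
    return {**row, "partition": row["id"] % partitions}


source = {1: {"id": 1, "name": "a"}, 2: {"id": 2, "name": "b"}}
target = load(source, add_partition)

# The source keeps changing after the load; the changes are captured...
changes = [
    ("update", 2, {"id": 2, "name": "b2"}),
    ("insert", 10, {"id": 10, "name": "c"}),
]
for op, key, row in changes:
    source[key] = row

before = verify(source, target, add_partition)  # rows 2 and 10 differ
# ...and replayed on the target afterwards.
apply_changes(target, changes, add_partition)
after = verify(source, target, add_partition)   # no differences left
print(before, after)
```

In this model the target only converges if changes are applied at least as fast as the source produces them.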

  18. HPSS 7.5.3 migration • 3Q18: test environment converted with minimal issues. • 4Q18: Logic issues discovered while starting the production conversion • Heavy usage of renames and partitioning did not mix well during the apply-changes part of qrep • Dec18: New conversion code delivered, but performance issues encountered on AIX. • Latency between source table update and target table update reached several days. • 1Q19: Additional parallelism and better capture balancing introduced. • We now just about manage to keep up with the source machine’s updates. • 2Q19: We are now dealing with qverify performance.

  19. HPSS 7.5.3 migration

  20. HPSS 7.5.3 migration • The conversion date has been moved multiple times. • May 4th is the ultimate target. • Our AIX boxes are not supported anymore. • We want some of the 7.5 features. • We need 7.5.3 to connect TS1160s • Francis will leave ECMWF shortly after. • Hopefully a qrep-based solution, but... • ... We have a plan B • Accept a long downtime and load databases offline. • 10-12 hours of downtime (estimate based on a dry-run test) • What if errors are encountered on the day?

  21. HPSS 7.5.3 testing • Tape media 60F JE default size is 320 TB • Recover • Tests: • Hierarchies • Disk to dual copies on tapes • Tape to tape • Recover of second copy • Recover of primary copy • Results: • It works and uses the TOR and RAO features (RAO calls can be improved) • No logs in Alarms and Events • No timestamp in the recover logfile (/var/hpss/tmp/recover_<VolID>.txt). If run several times, history is lost • Doesn’t indicate which tapes are needed for the recovery • CRs opened for dry-run and listing of tapes needed

  22. HPSS 7.5.3 testing • Db2 configuration • Change database parameter LOGSECOND from 10 to -1 • Db2 backup • With multiple partitions, each database has 1 backup file per partition • Db2_fullbackup.ksh only verifies and copies the last written file to the secondary backup partition • Tape drive quotas • Several major bugs found • If a drive is locked while in use, In Use values (read or write) are not updated (fixed in the last patch) • When PVL is restarted, In Use values are reset to 0 (fixed in the last patch) • CR opened to set recall limit as a percentage

  23. HPSS 7.5.3 testing • Tape drive quotas • If the number of unlocked drives for a PVR goes below the recall limit (recall limit not set to -1), the recall limit is automatically changed to the number of available drives
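
The rule described on this slide can be written as a small function. This is only a reading of the slide text, not HPSS code: the function name and arguments are hypothetical, -1 is taken to mean that no recall limit is configured, and "available drives" is taken to mean the unlocked drives of the PVR. Both readings are assumptions.

```python
def effective_recall_limit(recall_limit, unlocked_drives):
    """Return the recall limit after the automatic adjustment on the slide."""
    if recall_limit == -1:
        # Assumption: -1 means no limit is set, so nothing is adjusted.
        return -1
    if unlocked_drives < recall_limit:
        # Fewer unlocked drives than the limit: the limit is lowered to match.
        return unlocked_drives
    return recall_limit


print(effective_recall_limit(8, 10))   # limit unchanged
print(effective_recall_limit(8, 5))    # limit lowered to the drives available
print(effective_recall_limit(-1, 5))   # no limit configured, left alone
```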

  24. HPSS 7.5.3 testing • Tape drive quotas

  25. Data Centre fit-out timeline [Timeline figure with a quarterly axis, 2019 to 2020. Phases: Data Centre construction; Procurements; Delivery and installation: racks, fibres, network, servers, storage; Operational services. Milestones: Start Procurements, HPC contract signed, N&S infrastructure deployed, 100 Gbps link Reading-Bologna, Bologna DC handover, DHS operation in Bologna, HPC operational in Bologna. Dates marked: 10-2019, 05-2020.]

  26. Questions?

  27. Thank you
