ASGC Tier1 Center & Service Challenges activities ASGC, Jason Shih
Outline • Tier1 center operations • Resource status, QoS and utilization • User support • Other activities at ASGC (excluding HEP) • Biomed DC2 • Service availability • Service challenges • SC4 disk-to-disk throughput testing • Future remarks • SA improvement • Resource expansion
Computing resources • Instability of the IS caused ASGC service endpoints to be removed from the experiments' BDII • High load on the CE affects the site information published (site GIIS runs on the CE)
Job execution at ASGC • Instability of the site GIIS causes dynamic-information publishing errors • High load on the CE leads to abnormal functioning of Maui
OSG/LCG resource integration • Mature technology helps integrate resources • GCB introduced to help integrate with the IPAS T2 computing resources • CDF/OSG users can submit jobs by gliding in to the GCB box • Access T1 computing resources through the "twgrid" VO • Customized UI to help access backend storage resources • Helps local users not yet ready for the grid • HEP users access T1 resources
ASGC Helpdesk • Currently supports the following service queues: • CIC/ROC • PRAGMA • HPC • SRB • Classification of the CIC/ROC sub-queues: • T1 • CASTOR • SC • SSC
Biomed DC2 • First run on part of the 36,690 ligands (started 4 April 2006); fourth run started 21 April • Added 90 KSI2k dedicated to DC2 activities; introduced an additional subcluster in the IS • Keeping the site functional to help run grid jobs from DC2 • Troubleshooting grid-wide issues • Collaborating with biomed on AP operation • AP: GOG-Singapore devoted resources to DC2
Biomed DC2 (cont.) • Two frameworks introduced: DIANE and WISDOM • Average 30% contribution from ASGC across the four runs (DIANE)
SC4 disk-to-disk transfer • Problems observed at ASGC: • system crashed immediately when the TCP buffer size was increased • CASTOR experts helped with troubleshooting, but the problem remained for 2.6 kernel + XFS • downgraded the kernel to 2.4 + 1.2.0rh9 GridFTP + XFS • again, crashed when the window size was tuned • problem resolved only after downgrading GridFTP to the same version used for the SC3 disk rerun (Apr. 27, 7 AM) • tried on one disk server first, then rolled out to the remaining three • 120+ MB/s has been observed • continued running for one week
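The crash trigger above was enlarging the TCP window. A minimal sketch of that kind of buffer tuning, in Python for illustration only (GridFTP does its own tuning internally; the 128 MB figure is the maximum quoted on the troubleshooting slide):

```python
import socket

def make_tuned_socket(buf_bytes=128 * 1024 * 1024):
    """Create a TCP socket with enlarged send/receive buffers.

    Illustrative sketch only: this mimics the "window size tuned" step
    that crashed the 2.6-kernel + XFS disk servers. On Linux the kernel
    silently clamps the request to net.core.rmem_max / wmem_max.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, buf_bytes)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, buf_bytes)
    return s
```

Because the kernel clamps oversized requests, the application-level call succeeds even when the effective window is smaller; the crashes were in the kernel/filesystem stack, not in this API.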
Castor troubleshooting • * GridFTP bundled with CASTOR • + ver. 2.4, 2.4.21-40.EL.cern, adopted from CERN • ** ver. 2.4, 2.4.20-20.9.XFS1.3.1, introduced by SGI • ++ exact version 2.6.9-11.EL.XFS • $ TCP window size tuned, up to a maximum of 128 MB • Stack size recompiled to 8 KB for each experimental kernel adopted
SC Castor throughput: GridView • disk-to-disk nominal rate • ASGC has currently reached 120+ MB/s steady throughput • round-robin SRM headnodes associated with 4 disk servers, each providing ~30 MB/s • kernel/CASTOR software issues debugged early in SC4 (throughput reduced to 25%, before further tuning)
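The aggregate figure follows directly from the per-server rates quoted above; a quick arithmetic check:

```python
# 4 disk servers behind round-robin SRM headnodes, ~30 MB/s each
servers = 4
per_server_mb_s = 30
aggregate = servers * per_server_mb_s   # 120 MB/s, matching the observed 120+
# early in SC4, before tuning, throughput dropped to 25% of nominal
degraded = aggregate * 0.25             # ~30 MB/s, roughly one server's worth
```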
Castor2@ASGC • testbed was expected to be deployed by the end of March • delayed due to: • obtaining the LSF license from Platform • DB schema troubleshooting • manpower overlapping with CASTOR SC throughput debugging • revised in the 2006 Q1 quarterly report • split into two phases: • Phase I: without tape functional testing • plan to connect to the tape system in the next phase • Phase I expected to complete by mid-May • Phase II planned to finish by mid-June
Future remarks • Resource expansion plan • QoS improvement • Castor2 deployment • new tape system installed • continue with disk-to-tape throughput validation • Resource sharing with local users • for users more ready to use the grid • large storage resources required
Resource expansion: MoU *FTT: Federated Taiwan Tier2
Resource expansion (I) • CPU • Current status: • 430 KSI2k (composed of IBM HS20 and Quanta blades) • Goal: • Quanta blades • 7U, 10 blades, dual CPU, ~1.4 KSI2k/CPU • ratio ~30 KSI2k/7U; to meet 950 KSI2k, 19 chassis needed (~4 racks) • IBM blades • LV model available (saves ~70% power consumption) • higher density: 54 processors (dual-core + SMP Xeon) • ratio ~80 KSI2k/7U, only 13 chassis needed (~3 racks)
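The chassis counts above can be sanity-checked from the per-chassis figures on the slide (the ~30 KSI2k/7U ratio rounds up from the exact product of 28 KSI2k):

```python
# Quanta: 10 blades x 2 CPUs x ~1.4 KSI2k per CPU, per 7U chassis
quanta_per_chassis = 10 * 2 * 1.4          # 28, quoted as ~30 KSI2k/7U
current_ksi2k = 430
added_ksi2k = 19 * 30                      # 19 chassis at the quoted ratio
assert current_ksi2k + added_ksi2k >= 950  # meets the 950 KSI2k goal
# IBM LV blades: ~80 KSI2k per 7U chassis
assert 13 * 80 >= 950                      # 13 chassis clear the goal outright
```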
Resource expansion (II) • Disk • Current status: • 3U array, 400 GB drives, 14 drives per array • ratio: 4.4 TB/6U • Goal: • 400 TB, ~90 arrays needed • ~9 racks (assuming 11 arrays per rack) • Tape • new 3584 tape library installed mid-May • 4 x LTO4 tape drives providing ~80 MB/s throughput • originally expected to be installed mid-March • delayed due to: • internal procurement • updating project items with the funding agency • new tape system expected to be in place mid-May • full system in operation within two weeks of installation
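The array and rack counts for the 400 TB goal follow from the quoted 4.4 TB-per-array ratio:

```python
import math

goal_tb = 400
per_array_tb = 4.4                           # quoted ratio per array (6U)
arrays = math.ceil(goal_tb / per_array_tb)   # 91, i.e. the "~90" quoted
racks = math.ceil(arrays / 11)               # at 11 arrays per rack
assert racks == 9                            # matches the ~9 racks quoted
```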
Resource expansion (III) • New machine room in the C2 area of IPAS • Rack space design • AC/cooling requirements: • 20 racks (2800 KSI2k): 1,360,000 BTUH, or 113.3 tons of cooling • 36 racks (1440 TB): 1,150,000 BTUH, or 95 tons • HVAC: ~800 kVA estimated (HS20: 4000 W x 5 x 20 + STK array: 1000 W x 11 x 36) • generator
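The cooling and power estimates above are consistent with the standard conversion of 12,000 BTU/h per ton of cooling:

```python
# 1 ton of cooling = 12,000 BTU/h
tons_20_racks = 1_360_000 / 12_000   # ~113.3 tons for the 20 compute racks
tons_36_racks = 1_150_000 / 12_000   # ~95.8 tons for the 36 storage racks
# power: 20 racks of HS20 chassis plus 36 racks of disk arrays
blade_kw = 4000 * 5 * 20 / 1000      # 400 kW
array_kw = 1000 * 11 * 36 / 1000     # 396 kW
total_kw = blade_kw + array_kw       # 796 kW, behind the ~800 kVA estimate
```

The ~800 kVA HVAC figure tracks the 796 kW load directly (i.e. it assumes a power factor near 1, plus a small margin).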
Summary • new tape system ready mid-May; full operation within two weeks • plan to run disk-to-tape throughput testing • split the batch system and the CE • helps stabilize scheduling functionality (mid-May) • site GIIS is sensitive to high CPU load; move it to an SMP box • CASTOR2 deployed mid-June • connect to the new tape library • migrate data from the disk cache
Acknowledgment • CERN: • SC: Jamie, Maarten • Castor: Olof • Atlas: Zhong-Liang Ren • CMS: Chia-Ming Kuo • ASGC: • Min, Hung-Che, J-S • Oracle: J.H. • Network: Y.L., Aries • CA: Howard • IPAS: P.K., Tsan, & Suen
Disk server snapshot (I) • Host: lcg00116 • Kernel: 2.4.20-20.9.XFS1.3.1 • Castor gridftp ver.: VDT1.2.0rh9-1
Disk server snapshot (II) • Host: lcg00118 • Kernel: 2.4.21-40.EL.cern • Castor gridftp ver.: VDT1.2.0rh9-1
Disk server snapshot (III) • Host: sc003 • Kernel version: 2.6.9-11.EL.XFS • Castor gridftp ver.: VDTALT1.1.8-13d.i386