GDB - February 2014 Summary (Jeremy's notes). Agenda: http://indico.cern.ch/event/272618/
Introduction (M Jouvin) • Please check the 2014 meeting dates. • March 12th – CNAF Bologna (please register). • WLCG workshop (1st/2nd week of July), Barcelona. Possibly 8th-9th July. • GDB actions: https://twiki.cern.ch/twiki/bin/view/LCG/GDBActionInProgress • Future (pre-)GDB topics welcome. • Upcoming topic – pay-per-usage as part of the funding model: by introducing a pay-per-usage scheme, the funding agencies will have the information to measure the level of usage of a service and whether it justifies their investment. In addition, if the pay-per-usage model is implemented by giving some of the financial control to the users, they will favour those services which offer better value propositions. • Site Nagios testing – any feedback? • OSG Federation workshop: https://indico.fnal.gov/conferenceDisplay.py?confId=7207 • HEPiX: 19th-23rd May, Annecy. EGI CF: 19th-23rd May, Helsinki.
HEP SW Collaboration (I Bird) • Performance is now a limiting factor. • CPU technology trends: more transistors, but not easy to use them. • Most software is designed for sequential processing; migrating to multi-threaded is not easy. Targets: Geant and ROOT. • Concurrency Forum established two years ago. Towards an Open Scientific Software Initiative. • Components such as Geant and ROOT should be part of a modular infrastructure. • HEP S/W Collaboration: goal to build/maintain libraries… • Establish a formal collaboration to develop open scientific software packages guaranteed to work together (inc. frameworks to assemble applications). • Workshop 3rd-4th April 2014.
IPv6 Update (D Kelsey) • WG meeting 23/24 Jan 2014 (included CERN cloud and OpenStack). • Progress in various areas. CERN campus-wide deployment in March (some DHCPv6 issues): http://ipv6.web.cern.ch/ • perfSONAR very useful… works at IC. Run dual stack? • IPv6 file transfer testbed has decayed a bit. • ATLAS testing (Alastair): AGIS; simple tests then HammerCloud (HC). Squid 2.8 not IPv6 compatible. • Plan to get the mesh working again. Site deployments. Move to use SRM/FTS… • Define use-cases. • A barrier to moving for some sites if availability is affected by going dual stack, etc. • Software survey shows 15/66 'services' known to be fully compliant. • Pre-release of dCache 2.8.0 has IPv6 fixes. • Want to survey sites – when will they run out of IPv4 and be capable of IPv6? Pre-GDB meeting in June.
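A minimal sketch of the kind of dual-stack check discussed here: ask the resolver for both A and AAAA records for a host and attempt a TCP connection over each address family. The host name and port are placeholders, not endpoints from the meeting.

```python
# Minimal dual-stack connectivity probe (illustrative; host/port are placeholders).
import socket

HOST = "dual-stack-service.example.org"   # hypothetical dual-stack endpoint
PORT = 443

def check(family, label):
    try:
        infos = socket.getaddrinfo(HOST, PORT, family, socket.SOCK_STREAM)
    except socket.gaierror as exc:
        print(f"{label}: no address record ({exc})")
        return
    addr = infos[0][4][0]
    try:
        with socket.create_connection((addr, PORT), timeout=5):
            print(f"{label}: connected to {addr}")
    except OSError as exc:
        print(f"{label}: connection failed ({exc})")

check(socket.AF_INET, "IPv4")
check(socket.AF_INET6, "IPv6")
```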
Future of SLC (J Polok) • The CentOS team is joining Red Hat, in the open standards team (not RHEL). • The CentOS Linux platform is not changing. • Impact for SL5/6: source packages may have to be generated from git repositories. • No other changes – releases stay as now. • SL(C) 7 options being discussed: rebuild from source as for 5 and 6, OR create a Scientific CentOS variant, OR adopt the CentOS core. • Approaches: 1. Keep the current process: build from source with the existing tool chain. 2. Create a SIG for our variant. 3. SL becomes an add-on repository to the CentOS core. • CentOS 7 beta in preparation. RHEL 7 production due in summer. Source RPMs not guaranteed after summer. • Need to ensure the risks for 5 and 6 are covered.
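As an illustration of the "build from source" option, a rough sketch of rebuilding an upstream source RPM with the local tool chain; the package name and URL are placeholders, not part of the talk.

```python
# Illustrative rebuild of an upstream source RPM (URL/package are placeholders).
import subprocess
import urllib.request

SRPM_URL = "https://vault.example.org/7/os/Source/example-package-1.0-1.el7.src.rpm"
SRPM = SRPM_URL.rsplit("/", 1)[-1]

urllib.request.urlretrieve(SRPM_URL, SRPM)                    # fetch the source RPM
subprocess.run(["rpmbuild", "--rebuild", SRPM], check=True)   # rebuild with the local tool chain
# Resulting binary RPMs land under ~/rpmbuild/RPMS/ by default.
```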
Ops coordination report (S. Campana) • Input based on the pre-GDB Ops Coordination meeting. • gLexec: CMS SAM test not yet critical. Still 20 sites have not deployed it. • perfSONAR: it is a service. Sites without it, or running an old release, will feature in report(s) to the MB. • Tracking tools evolution – Savannah to JIRA. JIRA still lacks some GGUS functionality. • SHA-2 migration: progress with VOMS-Admin, but a manual process is needed. New host certificates soon. • Machine/Job features: prototype ready (a reader sketch follows after the continuation slide below). Options for clouds being looked at. • Middleware Readiness: the model will rely on experiments & frameworks + sites deploying test instances + monitoring. The MB will discuss a process for 'rewarding' site participation.
Ops Coordination - cont • Baseline enforcement: looking at options to monitor and then automate for campaigns. • WMS decommissioning: shared/CMS instances end in April. SAM will use them until June. • Multi-core deployment: ATLAS & CMS have different usage patterns. Trying prototypes. Torque/Maui a concern. • FTS3 deployment: FTS3 works well. Few instances needed – 3 or 4 for resilience. • Experiment Computing Commissioning: experiment plans for 2014 discussed. Conclusion: no need for a common commissioning exercise. • Conclusion – some deployment areas are being escalated.
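A minimal sketch of how a payload might consume the Machine/Job Features prototype mentioned above, assuming the WLCG convention of $MACHINEFEATURES/$JOBFEATURES directories containing one value per file; the example key names in the comments are commonly documented ones, not taken from the talk.

```python
# Read Machine/Job Features key/value files (illustrative; assumes the
# $MACHINEFEATURES / $JOBFEATURES directory convention).
import os

def read_features(env_var):
    base = os.environ.get(env_var)
    if not base or not os.path.isdir(base):
        return {}
    features = {}
    for name in os.listdir(base):
        path = os.path.join(base, name)
        if os.path.isfile(path):
            with open(path) as fh:
                features[name] = fh.read().strip()
    return features

machine = read_features("MACHINEFEATURES")   # e.g. hs06, jobslots
job = read_features("JOBFEATURES")           # e.g. wall_limit_secs, cpu_limit_secs
print("machine features:", machine)
print("job features:", job)
```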
High memory jobs (J Templon) • NIKHEF observations. • Which high-mem problem!? Virtual memory usage in GB; pvmem limit 4096 MB. User jobs and some production jobs show high usage, and these don't 'ask' for the memory. Multi-core and high memory are linked. • pvmem – a ulimit on the process – allows handling of the out-of-memory condition (the allocation fails rather than the job being killed). • There are different ways to ask for more memory in a job… few work, and inconsistencies arise. • Situation being reviewed.
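A small sketch of the behaviour pvmem relies on: with a virtual-memory ulimit in place, an over-large allocation fails inside the process and can be handled, instead of the job being killed by the batch system. The 2 GB limit and 4 GB allocation below are illustrative numbers only.

```python
# Illustrative: under an address-space ulimit, a failed allocation raises
# MemoryError inside the job instead of the job being killed externally.
import resource

LIMIT = 2 * 1024**3  # 2 GB virtual-memory cap (illustrative value)
resource.setrlimit(resource.RLIMIT_AS, (LIMIT, LIMIT))

try:
    data = bytearray(4 * 1024**3)  # try to allocate 4 GB
except MemoryError:
    print("allocation failed under the ulimit - job can clean up and exit gracefully")
```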
SAM test scheduling (Luca Magnoni) • SAM: framework to schedule checks (Nagios) via dedicated plug-ins (probes = scripts/executables). • Categories: public grid/cloud services (custom probes); job submission (via WMS); WNs (via job payloads). • Job submission to include direct CREAM and Condor-G. • Remote testing assumes deterministic execution. There are granularity issues (CE vs site), and site and experiment views do not always agree. • Can test with different credentials. Jobs can time out when the VO is out of its share. Site availability is determined by the experiments' critical profiles. • Most timeouts looked to be on the WMS side! • New Condor-G and CREAM probes for job submission coming. • Aim to provide a web UI/API for users. • Looking at options to replace Nagios as the scheduler. • Test submission via other frameworks (e.g. HC) being investigated – ATLAS want a hybrid approach; CMS do not support the framework approach.
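For context, SAM probes follow the Nagios plugin convention: a short executable that prints one status line and exits with 0/1/2/3 for OK/WARNING/CRITICAL/UNKNOWN. A minimal Python skeleton, assuming a hypothetical check_service() helper standing in for a real service test:

```python
#!/usr/bin/env python
# Minimal Nagios-style probe skeleton (illustrative, not an actual SAM probe).
import sys

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def check_service():
    """Placeholder for a real test, e.g. a job submission or a service query."""
    return True, "service responded in 0.4s"

def main():
    try:
        healthy, message = check_service()
    except Exception as exc:
        print(f"UNKNOWN - probe error: {exc}")
        return UNKNOWN
    if healthy:
        print(f"OK - {message}")
        return OK
    print(f"CRITICAL - {message}")
    return CRITICAL

if __name__ == "__main__":
    sys.exit(main())
```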
New transfer dashboard (A Beche) • Reviewed the history of data transfer monitoring. Separate web API/UI for FTS, FAX and AAA; ALICE and EOS added. • Plan to federate. Data split into schemas: FTS, XRootD and a highly optimised one. • Data retention policies differ for raw data and statistics. • The dashboard now aggregates over the APIs. • Plan for a map view.
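A rough sketch of what "aggregating over the APIs" can look like: query several JSON endpoints and merge the per-source transfer statistics. The URLs and field names below are hypothetical, not the dashboard's actual API.

```python
# Illustrative aggregation over several monitoring APIs (URLs/fields are hypothetical).
import json
from urllib.request import urlopen

SOURCES = {
    "fts":    "https://dashboard.example.org/fts/api/transfers?hours=1",
    "xrootd": "https://dashboard.example.org/xrootd/api/transfers?hours=1",
}

totals = {"bytes": 0, "files": 0}
for name, url in SOURCES.items():
    with urlopen(url, timeout=10) as resp:
        stats = json.load(resp)
    totals["bytes"] += stats.get("bytes_transferred", 0)
    totals["files"] += stats.get("files_transferred", 0)
    print(f"{name}: {stats}")

print("combined:", totals)
```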
WLCG monitoring coordination (Pablo Saiz) • Consolidation group: reduce complexity; modular design; simplify ops and support; common development and core. • Need more site input. • Timeline – starting to deploy. • Survey & tasks. Tasks in JIRA: https://its.cern.ch/jira/browse/WLCGMON • 1. Application support (for jobs, transfers, infrastructure…) • 2. Running the services (moving to AI, Koji, SL6, Puppet…) • 3. Merging applications (SSB+SAM; SSB+REBUS; HC+Nagios…). The idea is to reduce the number of tools to make maintenance easier. Many infrastructure monitoring tools – the schema copes with several use-cases. http://wlcg-sam-atlas.cern.ch/ • 4. Technology evaluation. • Nagios plug-in for sites developed by PIC. • SAM/SUM -> SAM3 (for SUM background see https://indico.cern.ch/event/285330/contribution/3/material/slides/1.pdf) • Next steps: https://its.cern.ch/jira/browse/WLCGMON
Data Preservation Update (J Shiers) • Things are going well. • Workshop. • Increasing archive growth. • Annual cost of WLCG is 100M euro. • Need 4 staff: documentation; standards; … • DPHEP portal core. Digital library. Sustainable software + virtualisation tech + validation frameworks. Sustainable funding. Open data.
LHCOPN/LHCONE evolution workshop (E. Martelli) • Networking is stable and key. Growth with technology evolution is OK. • New sites appearing in areas where the network is under-developed. • ATLAS: expect bursty traffic. US sites -> 40/100 Gbps. • CMS: the mesh will increase traffic. • LHCb: no specific concerns. • More bandwidth needed at T2s. Connectivity to Asia needs to improve – both capacity and RTTs (see the arithmetic below). • Demands for better network monitoring and LHCONE operations. • Point-to-point link on demand (over-provisioning vs complexity (L3VPN)).
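To illustrate why RTT matters as much as capacity on the Asia paths: a single TCP stream is bounded by window/RTT, so a long round-trip caps per-stream throughput regardless of link speed. A worked example with illustrative numbers:

```python
# Illustrative bandwidth-delay arithmetic: per-stream TCP throughput <= window / RTT.
window_bytes = 16 * 1024**2        # 16 MiB TCP window (illustrative)
for rtt_ms in (20, 150, 300):      # e.g. intra-Europe vs transatlantic vs Europe-Asia
    throughput_gbps = window_bytes * 8 / (rtt_ms / 1000) / 1e9
    print(f"RTT {rtt_ms:>3} ms -> max ~{throughput_gbps:.2f} Gb/s per stream")
```

With a 16 MiB window this gives roughly 6.7, 0.9 and 0.45 Gb/s per stream, so high-RTT paths need larger windows or more parallel streams to fill the available capacity.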
perfSONAR (Shawn McKee) • Sites to use the "mesh" configuration. • Metrics will adjust over time. • 85% of sites with perfSONAR have issues to resolve (firewalls, versions…). • Likely to go with MaDDash (Monitoring and Debugging Dashboard). • Checking of primitive services – OMD (Open Monitoring Distribution). • For the test instance…. WLC*** WLC*** • Context between all sites… • The 3.3.4 release will mean only one machine is needed. • Alerting – high priority but complicated.
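A small sketch of what consuming a centrally published mesh configuration might look like: fetch the JSON and list the participating hosts so a site can confirm it appears in the mesh. The URL and the JSON layout ("sites" entries with "hosts" lists) are assumptions for illustration, not the actual WLCG mesh schema.

```python
# Illustrative mesh-configuration check (URL and JSON layout are assumptions).
import json
from urllib.request import urlopen

MESH_URL = "https://psconfig.example.org/wlcg-latency-mesh.json"  # hypothetical
MY_HOST = "perfsonar.example-site.org"                            # hypothetical

with urlopen(MESH_URL, timeout=10) as resp:
    mesh = json.load(resp)

hosts = [h for site in mesh.get("sites", []) for h in site.get("hosts", [])]
print(f"mesh contains {len(hosts)} hosts")
print("this site is in the mesh" if MY_HOST in hosts else "this site is MISSING from the mesh")
```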