
SA1 – Infrastructure Operations: Overview and Achievements



  1. SA1 – Infrastructure Operations: Overview and Achievements PSC05 Meeting Dubrovnik, 9-11 September 2009 Antun Balaz SA1 Leader Institute of Physics Belgrade antun@ipb.ac.rs The SEE-GRID-SCI initiative is co-funded by the European Commission under the FP7 Research Infrastructures contract no. 211338

  2. Overview • SA1 objectives, metrics, activities • SA1 deliverables status • SA1 milestones status • Infrastructure development • Infrastructure management • Service Level Agreement • Infrastructure usage • Network link to Moldova status • Collaboration/Interoperation • Action points

  3. SA1 objectives and metrics • Objective 2: Providing infrastructure for new communities • O2.1: Expand the current infrastructure • MTSA1.1: Increase in the number of computing and storage resources (tables given in DoW) • O2.2: Inclusion of Armenia and Georgia • MTSA1.2: Number of Grid sites and processing and storage resources (tables given in DoW) • O2.3: Achieve high reliability, availability and automation • MTSA1.3: Increase of the average overall Grid site availability (M01 >= 70%, M12 >= 75%, M24 >= 81%) • MTSA1.4: Number of successful jobs run as a percentage of total jobs (M01 >= 50%, M12 >= 55%, M24 >= 60%) • MTSA1.5: Number of management tools expanded or developed (+ achieving tools integration and automation) • O2.4: Provision of the network link to Moldova

  4. SA1 activities SA1.1: Implementation of the advanced SEE-GRID infrastructure SA1.1.1: Expand the existing SEE-GRID infrastructure and deploy Grid middleware components and OS in SEE Resource Centers SA1.1.2: Operate the SEE-GRID infrastructure SA1.1.3: Deploy and Operate the core services for new VOs SA1.1.4: Catch-all CA and deployment and operational support for new and emerging Grid CAs SA1.1.5: Certify and migrate SEE-GRID sites from regional to global production-level eInfrastructure SA1.2: Resource Centre SLA monitoring and enforcement SA1.2.1: SLA detailed specification, identification and deployment of operational tools relevant for SLA monitoring SA1.2.2: Monitoring, assessment and enforcement of RC conformance to SLA SA1.3: Network Resource Provision SA1.3.1: Network resource provision and liaison with regional eInfrastructure networking projects SA1.3.2: Procurement of a link between Moldova and GEANT

  5. SA1 deliverables status • DSA1.1a: Infrastructure Deployment Plan (M04) • CERN, Editor: D. Stojiljkovic • DSA1.2: SLA detailed specification and related monitoring tools (M05) • UOBL, Editor: M. Savic • DSA1.3a: Infrastructure overview and assessment (M12) • UKIM, Editor: B. Jakimovski • DSA1.1b: Infrastructure Deployment Plan (M14) • UOB-IPB, Editor: A. Balaz • DSA1.3b: Infrastructure overview and assessment (M23) • UKIM, Editor: B. Jakimovski

  6. SA1 milestones status • MSA1.1: Infrastructure deployment plan defined (M04) • CERN (verified by DSA1.1a) • MSA1.2: SLA structure and enforcement plan defined (M05) • UoBL (verified by DSA1.2) • MSA1.3: Network link for Moldova established (M23) • RENAM (verified by the operational link to MD and DSA1.3b) • MSA1.4: Infrastructure performance and usage assessed (M23) • UKIM (verified by DSA1.3b)

  7. Infrastructure development

  8. Core services • Catch-all Certification Authority • Enables regional sites to obtain user and host certificates • Virtual Organisation Membership Service (VOMS) • For each scientific community deployed in two instances for failover • Supporting groups and roles • Workload Management Service (gLite-WMS/LB) and Information Services (BDII) • For each scientific community deployed in several instances for failover (see the BDII query sketch below) • Logical File Catalogue (LFC) • For each scientific community deployed in several instances for failover • MyProxy • Supports certificate renewal for all deployed WMS/RB services • For each scientific community deployed in several instances for failover • File Transfer Service (FTS) • Used in production • Relational Grid Monitoring Architecture (R-GMA), Registry and Schema • SEE-GRID accounting publisher, with support for MPI job accounting • AMGA Metadata Catalogue
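
As a rough illustration of how the BDII information services listed above are queried in practice, here is a minimal Python sketch using the ldap3 library. The endpoint bdii.example.org is a placeholder; the standard gLite BDII port (2170), the o=grid search base and the GLUE 1.x attribute names reflect common gLite practice.

```python
# Minimal sketch: query a top-level BDII for computing elements.
# bdii.example.org is a hypothetical endpoint; BDIIs allow anonymous bind.
from ldap3 import ALL, Connection, Server

server = Server("ldap://bdii.example.org:2170", get_info=ALL)
conn = Connection(server, auto_bind=True)

# GLUE 1.x schema: computing elements are published as GlueCE objects
conn.search(
    search_base="o=grid",
    search_filter="(objectClass=GlueCE)",
    attributes=["GlueCEUniqueID", "GlueCEInfoTotalCPUs"],
)
for entry in conn.entries:
    print(entry.GlueCEUniqueID, entry.GlueCEInfoTotalCPUs)
```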

  9. Core services map

  10. Certificate authorities map: M01 [Map legend: Established CA, New CA, Catch-All CA, Candidate CA, Training CA, RA]

  11. Certificate authorities map: M12 [Map legend: Established CA, New CA, Catch-All CA, Candidate CA, Training CA, RA]

  12. Infrastructure expansion (1) • The SEE-GRID-SCI infrastructure currently contains the following resources: • Dedicated CPUs: 1086 total (increase in Y1: 318 CPUs) • Storage: 288 TB (237 TB more than planned at the end of Y1) • 40 sites in SEE-GRID-SCI production (increase in Y1: 6 sites) • Typical machine configuration: dual or quad-core CPUs, with 1 GB of RAM per CPU core; many sites with 64-bit architecture • All sites on gLite-3.1; Scientific Linux 4.x used as the base OS, but others also present (CentOS, Debian) • Metric MTSA1.1 generally fulfilled • Armenia and Georgia have deployed new Grid sites and joined the SEE-GRID infrastructure – MTSA1.2

  13. Infrastructure expansion (2) [Chart: Total number of CPUs, storage size (TB) and number of Grid sites]

  14. Infrastructure expansion (3) [Chart: Number of CPUs per SEE country]

  15. Infrastructure expansion (4) [Chart: Available storage per SEE country (TB)]

  16. Infrastructure expansion (5) [Chart: Number of Grid sites per SEE country]

  17. Infrastructure management

  18. Grid operations • Convergence in procedures with EGEE-SEE in the (extended) region • Monitoring switched to the new core services and the new VO • Still problems with VOMS certificates at some sites! • Migration to SL5/gLite-3.2 • MPI issues • Deployment of sites • RO missing 1 site • AL missing 5 sites • MD missing 1 site • BA missing 1 site • ME missing 1 site • Excellent progress in AM

  19. Operational/monitoring tools (1) • Hierarchical Grid Site Management (HGSM) (+ interface to GOCDB) – Turkey • BBmSAM Service Availability Monitoring + extensions – Bosnia and Herzegovina with Serbia support • Helpdesk + NMTT (+ interoperation with EGEE-SEE and GGUS + integration with Nagios) – Romania with CERN support • SEE-GRID GoogleEarth – Turkey + ic.ac.uk • Global Grid Information Monitoring System (GStat) – ASGC, Taiwan • R-GMA and Accounting Portal – Bulgaria • Nagios – Bulgaria • Real Time Monitor (RTM) – ic.ac.uk and Turkey (HGSM) • MONitoring Agents using a Large Integrated Services Architecture (MonALISA) – Romania • What is at the Grid (WatG) – Serbia • WMSMON tool – Serbia • Pakiti – Greece • GSSVA (security-enabled Pakiti extension) – SZTAKI • SEE-GRID Wiki with detailed information for site administrators

  20. Operational/monitoring tools (2) • Static Database: HGSM • Static database containing all relevant data about all SEE-GRID-SCI sites • Kept synchronized with the actual state of the infrastructure • Monitoring • BBmSAM • Portal that provides access to the database of SAM test results • Central tool for identification of operational problems • Provides SLA metrics (see the sketch below)
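
The SLA metrics that BBmSAM provides boil down to ratios over the history of SAM test results. A minimal sketch of such a calculation, assuming hourly status samples and an illustrative treatment of scheduled downtime (not the exact BBmSAM algorithm):

```python
# Sketch: availability/reliability from hourly SAM statuses.
# Status labels and downtime handling are illustrative assumptions.
def sla_metrics(statuses):
    total = len(statuses)
    ok = statuses.count("OK")
    scheduled = statuses.count("DOWNTIME")
    availability = ok / total                     # every hour counts
    reliability = ok / max(total - scheduled, 1)  # scheduled downtime excluded
    return availability, reliability

history = ["OK"] * 700 + ["CRITICAL"] * 20 + ["DOWNTIME"] * 24
a, r = sla_metrics(history)
print(f"availability {a:.2%}, reliability {r:.2%}")
```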

  21. Operational/monitoring tools (3) • Gstat • Central tool for monitoring of the information system of the SEE-GRID-SCI infrastructure • Nagios • Collection of alarms raised by various tools (see the probe sketch below) • In the future, automatic creation of Helpdesk tickets will be implemented • Pakiti • Helps system administrators keep multiple machines up to date and prevents unpatched machines from remaining silently on the network • GSSVA (JRA1)
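
What makes it possible for Nagios to collect alarms from many different tools is its simple probe contract: the exit code encodes the alarm level (0 = OK, 1 = WARNING, 2 = CRITICAL) and the first output line is the human-readable summary. A hypothetical probe in that style (the endpoint and the check logic are invented for illustration):

```python
#!/usr/bin/env python
# Hypothetical Nagios-style probe: the exit code is the alarm level.
import sys
import urllib.request

URL = "http://bbmsam.example.org/status"  # placeholder endpoint

try:
    with urllib.request.urlopen(URL, timeout=10) as resp:
        body = resp.read()
except Exception as exc:
    print(f"CRITICAL - cannot reach {URL}: {exc}")
    sys.exit(2)

if b"OK" in body:
    print("OK - service responding")
    sys.exit(0)

print("WARNING - unexpected response")
sys.exit(1)
```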

  22. Operational/monitoring tools (4) • WMSMON • Aggregated and detailed status view of all monitored WMS services • Links to the appropriate troubleshooting guides • Real Time Monitor • Using satellite imagery from NASA, the RTM clients display the SEE-GRID-SCI infrastructure as it is geographically spread over the region • GridICE • GoogleMap • MonALISA

  23. Operational/monitoring tools (5) • Helpdesk: OneOrZero • Central reference point for tracking of all operational and user problems • Identified problems are reported through the Helpdesk and assigned to the appropriate supporter • NMTT (JRA1) • Accounting portal • Collects the accounting data from all SEE-GRID-SCI sites through the APEL-based, MPI-enabled accounting publisher developed by the project • Provides aggregated accounting data by site, country, institution, application (see the sketch below) • Operations wiki
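
The aggregation the accounting portal performs can be pictured as a group-by over usage records. A minimal sketch, with invented field names and figures:

```python
# Sketch: aggregate accounting records by site, country, institution
# or application (VO). Record fields and numbers are invented.
from collections import defaultdict

records = [
    {"site": "SITE-A", "country": "RS", "vo": "seismo", "cpu_hours": 1200.0},
    {"site": "SITE-B", "country": "BG", "vo": "environ", "cpu_hours": 800.0},
    {"site": "SITE-A", "country": "RS", "vo": "meteo", "cpu_hours": 300.0},
]

def aggregate(records, key):
    totals = defaultdict(float)
    for rec in records:
        totals[rec[key]] += rec["cpu_hours"]
    return dict(totals)

print(aggregate(records, "country"))  # {'RS': 1500.0, 'BG': 800.0}
print(aggregate(records, "vo"))
```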

  24. Service Level Agreement (1) • Sites need to conform to SEE-GRID-SCI SLA availability and reliability criteria • Monitoring done automatically by the BBmSAM portal • New SLA defined (80% availability goal) • SLA Enforcement Team (SET) was established to monitor the conformance of sites to the SLA • Sites that fully conform to the SLA availability (> 80%) are upgraded to the new status in HGSM: seegrid_certified (see the sketch below) • The aim is to include them into all production BDIIs in the SEE region (incl. EGEE infrastructure), so that reliable resources are available to all users
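
The certification step itself reduces to a threshold decision per site. A minimal sketch, with hypothetical site names, availability figures and fallback status:

```python
# Sketch of the SET decision: sites above the 80% SLA availability
# threshold are flagged for the seegrid_certified status in HGSM.
# Site names, figures and the fallback status are illustrative.
SLA_THRESHOLD = 0.80

quarterly_availability = {"SITE-A": 0.93, "SITE-B": 0.71}

for site, avail in sorted(quarterly_availability.items()):
    status = "seegrid_certified" if avail > SLA_THRESHOLD else "certified"
    print(f"{site}: {avail:.1%} -> {status}")
```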

  25. Service Level Agreement (2) • Overall availability of resources (MTSA1.3) • Q2: 82.80%, Q3: 82.61%, Q4: 91.17% • Apparent drop in site availabilities from Q1 to Q2 caused by the change in SLA calculation method [Charts: SLA site service availability and SLA site service reliability, with the introduction of the new SLA marked]

  26. Helpdesk statistics (1) [Chart: Number of Helpdesk tickets solved per SEE country]

  27. Helpdesk statistics (2) [Chart: Average ticket solving time (days) per SEE country]

  28. Infrastructure usage (1) • SEE-GRID-SCI sites have provided more than 15.4 million normalized elapsed CPU hours and more than 1.2 million jobs during Y1; total utilization 90.71% (see the normalization sketch below) • SEEGRID VOs (seegrid, seismo, environ and meteo) • National VOs (AEGIS VO, TRGRID VOs) • Regional VOs (see) [Charts: Normalized CPU time (hours); Number of jobs]
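
"Normalized" elapsed time here means wall-clock CPU time scaled by a per-site benchmark factor, so that hours delivered by fast and slow sites are comparable; EGEE-era accounting typically used kSI2K factors for this. A sketch of that normalization, with invented benchmark values:

```python
# Sketch: normalize elapsed CPU hours by a per-site benchmark factor
# (kSI2K-style, as in EGEE accounting). All values are invented.
site_benchmark = {"SITE-A": 1.8, "SITE-B": 1.1}  # kSI2K per core

def normalized_hours(site, elapsed_hours):
    return elapsed_hours * site_benchmark[site]

jobs = [("SITE-A", 10.0), ("SITE-B", 24.0)]
total = sum(normalized_hours(site, hours) for site, hours in jobs)
print(f"normalized elapsed time: {total:.1f} kSI2K-hours")
```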

  29. Infrastructure usage (2) • Assessment of the usage of the regional infrastructure per scientific discipline [Charts: Number of jobs per discipline VO; Normalized elapsed time (h) per discipline VO]

  30. Infrastructure usage (3) • Number of jobs and elapsed time per country for all supported VOs in SEE-GRID-SCI [Charts: Number of jobs per SEE country; Elapsed CPU time (h) per SEE country]

  31. Job success rates • Data gathered from the Logging and Bookkeeping (LB) services of all core service instances in the region provides information on the number of successful jobs (MTSA1.4; see the sketch below) • Out of more than 800k jobs submitted through SEE WMS services during Y1, 94.27% were successful
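
The MTSA1.4 figure comes down to counting terminal job states in the LB records. A minimal sketch, assuming simplified state labels rather than the exact LB state names:

```python
# Sketch of the MTSA1.4 computation: share of jobs whose final LB
# state is successful. State labels are simplified assumptions.
from collections import Counter

final_states = ["DONE_OK"] * 943 + ["ABORTED"] * 40 + ["CANCELLED"] * 17

counts = Counter(final_states)
success_rate = counts["DONE_OK"] / sum(counts.values())
print(f"job success rate: {success_rate:.2%}")
```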

  32. Network link to Moldova status • Two stages • Upgrade of the existing radio-relay link Chisinau-Iasi: this is the approach implemented so far, though it restricts future growth; the currently operational connection to RoEduNet has 2x155 Mbps capacity • Provision of a direct dark-fiber link Chisinau-Iasi; contract signed in February 2009 • In November 2008 an updated proposal was submitted to NATO for co-funding of the dark-fiber link • NATO Science for Peace Committee authorities positively evaluated the updated proposal, and in February 2009 the NATO project co-director received confirmation that the revised proposal was accepted

  33. Collaboration/Interoperation (1) • NA4: support of discipline VOs (core services and resources) and applications • JRA1: developing and deploying operational tools • NA3: providing training infrastructure (core services and resources) • NA2: inputs to/implementation of policy documents • Infrastructure fully interoperable with EGEE and a number of other regional Grid infrastructures • Active participation in the EGEE Operations Automation Team (OAT) • Joint work and development of Nagios solutions for Grid resources • Interoperation of HGSM and GOCDB; regionalization and testing of GOCDB • Testing of GStat 2.0 • Grid-Operator-on-Duty experiences communicated to EGEE • Basis for regionalization of COD • Sharing of tools with other projects/infrastructures: WMSMON, WatG • Collaboration with the EDGeS project on establishing interoperability with infrastructures based on desktop Grids

  34. Collaboration/Interoperation (2)

  35. Collaboration/Interoperation (3)

  36. Collaboration/Interoperation (4)

  37. Action points • AP42: HGSM interface with GOCDB (Hakan, ongoing) • AP43: Helpdesk statistics (AlexS, 15 Sep 08 -> 15 Sep 09) • AP49: Nagios – integration of all alarms; “CIC dashboard” (Emanouil, 30 Jun 08 -> 15 Sep 09) • AP54: Wiki reorganization (Boro, 30 Jun 08 -> 15 Sep 09) • AP98: WMS monitoring supported on all WMS+LBs (Dusan, site admins, 15 Sep 09) • AP160: GOOD Templates functionality in Nagios to be implemented (Emanouil, 15 Sep 09) • AP221: Deployment of additional core services for discipline VOs per country (Antun, site admins, 15 Sep 09) • AP214: Define logical DNS names for SA1 services (Antun, 15 Sep 09) • New: All sites to install VOMS certificates (site admins, 15 Oct 09) • New: Accounting portal to produce reliable data (Emanouil, 15 Oct 09) • New: BBmSAM portal to produce full SLA reports (Mihajlo, 15 Oct 09) • New: Migration of 64-bit WNs to SL5/gLite-3.2 (Dusan, site admins, 15 Dec 09) • New: Re-certification of bad sites (SET, GIMs, 15 Oct 09) • New: Send list of wrongly configured WMS (Emanouil, 15 Sep 09)
