Grid INTER-Operations

Hélène Cordier EGEE/WLCG Operations IN2P3 Computing Centre Lyon (France) - helene.cordier@in2p3.fr. Grid INTER-Operations. Contents. Existing Common Interests in solving mainly 2 issues so far: Security and accounting issues, monitoring workflow efforts are diverse.


Presentation Transcript


  1. Hélène Cordier EGEE/WLCG Operations IN2P3 Computing Centre Lyon (France) - helene.cordier@in2p3.fr Grid INTER-Operations

  2. Contents
     • Existing common interests in solving mainly 2 issues so far:
       • Security and accounting issues; monitoring workflow efforts are diverse.
     • Existing efforts at inter-project level involving:
       • Grid Interoperability Now (GIN, a working group of the OGF) https://forge.ogf.org/sf/go/projects.gin/wiki
     • Existing efforts at project level involving:
       • EGEE, WLCG and OSG
       • NDGF, PRAGMA, TERAGRID and NAREGI
     • Existing efforts at IN2P3-CC:
       • IGTMD
     • Concerns and Updates

  3. Joint Security Policy Group
     • Certification Authorities: EUGridPMA, IGTF and so on.
     • Grid Acceptable Use Policy (AUP): a common, general and simple AUP for all VO members using many Grid infrastructures, e.g. EGEE, OSG, SEE-GRID, DEISA, national Grids…
     • Incident Handling and Response:
       • defines basic communication paths
       • defines requirements (musts) for IR
       • is not meant to replace or interfere with local response plans
     • Diagram labels: Incident Response, Certification Authorities, Audit Requirements, Usage Rules, Security & Availability Policy, VO Security, Application Development & Network Admin Guide, User Registration & VO Management, Security & Policy
     • Grid Security Policy (v5.7): https://edms.cern.ch/document/428008/4
     • Grid Site Operations Policy (v1.4): https://edms.cern.ch/document/819783/1
     • Virtual Organisation Operations Policy (v1.0): https://edms.cern.ch/document/853968/1

  4. Usage record working group
     Mandate: In order for resources to be shared, sites must be able to exchange basic accounting and usage data in a common format. This working group proposes to define a common usage record based on those in current practice. The record format will be specific enough to facilitate information sharing among grid sites, yet general enough that the usage data can be used for a variety of purposes: traditional usage accounting, service usage monitoring, performance tuning, etc. This group will therefore concentrate on collecting and disseminating resource consumption data. It will not address how that data is to be collected by the resource sites, nor how it will be used by its recipients.
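To make the idea of a common, exchangeable usage record concrete, here is a minimal Python sketch. It is not the working group's record format: the class `UsageRecord`, its field names, and the JSON encoding are all assumptions chosen for illustration. It only shows the property the mandate asks for, namely that a record written by one site can be read back unchanged by another.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import json


@dataclass(frozen=True)
class UsageRecord:
    """Hypothetical minimal usage record; all field names are illustrative."""
    record_id: str
    site: str
    vo: str
    user: str
    start_time: datetime
    end_time: datetime
    cpu_seconds: float
    wall_seconds: float

    def to_json(self) -> str:
        # Serialise timestamps as ISO 8601 strings so any site can parse them.
        data = asdict(self)
        data["start_time"] = self.start_time.isoformat()
        data["end_time"] = self.end_time.isoformat()
        return json.dumps(data, sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "UsageRecord":
        # Rebuild the record from the exchanged text form.
        data = json.loads(text)
        data["start_time"] = datetime.fromisoformat(data["start_time"])
        data["end_time"] = datetime.fromisoformat(data["end_time"])
        return cls(**data)


if __name__ == "__main__":
    record = UsageRecord(
        record_id="r1", site="SITE-A", vo="cms", user="user1",
        start_time=datetime(2007, 6, 27, 10, 0, tzinfo=timezone.utc),
        end_time=datetime(2007, 6, 27, 11, 0, tzinfo=timezone.utc),
        cpu_seconds=1800.0, wall_seconds=3600.0,
    )
    print(record.to_json())
```

How each site fills in such a record, and what the recipient does with it, is deliberately left out, in line with the mandate above.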

  5. Accounting
     • Tools needed to collect and report information on resource utilization
     • Intended audience: site managers, virtual organization managers, grid operators, funding agencies,…
     • Need to define common ways of measuring resource consumption
       • Including usage of the same units
     • LCG/EGEE
       • CPU usage information (per user or per VO) provided by each site and stored in a central repository: reports (charts and numeric data) available through a web interface
       • Next step: collect information on storage utilization.
       • Developed and operated by Grid Operations Centre (UK) and CESGA (Spain).
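The need for common units can be illustrated with a short Python sketch. It assumes a hypothetical input of `(vo, amount, unit)` tuples reported by sites in different time units, and sums them into CPU hours per VO. The function name, the tuple layout and the unit labels are illustrative assumptions, not the LCG/EGEE accounting schema.

```python
from collections import defaultdict

# Conversion factors to seconds for the units this sketch accepts (assumed labels).
SECONDS_PER_UNIT = {"s": 1.0, "min": 60.0, "h": 3600.0}


def cpu_hours_per_vo(reports):
    """Sum CPU usage per VO, converting every report to hours first."""
    totals = defaultdict(float)
    for vo, amount, unit in reports:
        if unit not in SECONDS_PER_UNIT:
            # Refuse to guess: mixing unknown units would corrupt the totals.
            raise ValueError(f"unknown unit: {unit!r}")
        totals[vo] += amount * SECONDS_PER_UNIT[unit] / 3600.0
    return dict(totals)


if __name__ == "__main__":
    site_reports = [("cms", 7200, "s"), ("atlas", 90, "min"), ("cms", 1, "h")]
    print(cpu_hours_per_vo(site_reports))
```

Rejecting unknown units rather than passing them through is a design choice of the sketch: a central repository that silently adds seconds to hours would produce reports that look plausible and are wrong.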

  6. Accounting – Cont’d

  7. Accounting

  8. Accounting

  9. Site monitoring High-Level Model

  10. Site monitoring (cont’d)
     • We can’t/won’t impose a solution on sites, as they might/should have something already.
     • A specification-based approach allows our probes to fit into any fabric monitoring system.
     • The data exchange format allows higher-level services to consume the data regardless of the fabric monitoring system: https://twiki.cern.ch/twiki/bin/view/LCG/GridMonitoringDataExchangeStandard
     • WLCG Monitoring Working Groups since January 23rd 2007:
       • System Management Working Group – SMWG / J. Casey, I. Neilson https://twiki.cern.ch/twiki/bin/view/LCG/SystemManagement
       • Grid Service Monitoring Working Group – GSMWG / A. Forti, M. Jouvin https://twiki.cern.ch/twiki/bin/view/LCG/GridServiceMonitoringInfo
       • System Analysis Working Group – SAWG / J. Andreeva, P. Saiz https://twiki.cern.ch/twiki/bin/view/LCG/SystemAnalysisMonitoringInfo
     [Rob Quick, Workshop on Grid services Monitoring HPDC’07 – June 27th 2007]
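The point of a shared exchange format can be sketched in Python as two small adapters that turn results from different local monitoring systems into one common record. The record layout (`host`, `service`, `status`, `detail`) and the `key=value;…` input line are hypothetical and are not taken from the WLCG data exchange standard linked above. The only external convention used is the Nagios plugin exit-code mapping (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
# Nagios plugin exit codes and their conventional meaning.
NAGIOS_STATUS = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def from_nagios(host, service, exit_code, output):
    """Adapt a Nagios-style probe result to the (assumed) common record."""
    return {
        "host": host,
        "service": service,
        "status": NAGIOS_STATUS.get(exit_code, "UNKNOWN"),
        "detail": output,
    }


def from_keyvalue(line):
    """Adapt a hypothetical 'key=value;key=value' line from another system."""
    fields = dict(item.split("=", 1) for item in line.split(";") if item)
    return {
        "host": fields["host"],
        "service": fields["service"],
        "status": fields.get("status", "UNKNOWN").upper(),
        "detail": fields.get("detail", ""),
    }


if __name__ == "__main__":
    a = from_nagios("ce01.example.org", "CE-job-submit", 2, "timeout")
    b = from_keyvalue(
        "host=ce01.example.org;service=CE-job-submit;status=critical;detail=timeout"
    )
    # A higher-level service sees the same record whichever system produced it.
    print(a == b)
```

Each site keeps its own fabric monitoring system and only needs an adapter at the boundary, which is the sense in which no solution is imposed on sites.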

  11. CMS Dashboard 1/2

  12. CMS Dashboard 2/2

  13. CIC Operations Portal
     • Web portal for integrating all the tools and sources of operations-related information into one single place
     • Developed and operated by CC-IN2P3, failover instance at CNAF
     • http://cic.gridops.org/
     • Provides and maintains an integrated operations dashboard for the grid operator on duty
     • Provides mechanisms for keeping the information needed for an appropriate handover between operators on duty
     • Easy access to appropriate contact information for every actor involved in the operations of the grid
     • Provides communication tools

  14. Alarms Dashboard

  15. Opening tickets

  16. Tracking incidents via GGUS
     • Incident tracking model
       • Unique channel for opening tickets
         • End-users: e.g. job submission failures, failed data transfers
         • Operators: e.g. job submission failures
       • Classification and first assignment done by the ticket process manager
       • Tickets are assigned to support units, one per domain of expertise
         • Grid operators, applications, federations, m/w experts,..
     • OSG: ggus@tick.globalnoc.iu.edu
       • Automatic helpdesk / XML format exchange: 4 tickets created by CMS users from June 27th
     • WLCG/EGEE
       • Central incident tracking tool: https://gus.fzk.de/
       • Same tool used by grid operators and end users via e-mail and web interface
       • Sites failing the tests are assigned a ticket
       • Escalation procedure for solving site-related problems
         • Involves the regional operator and the site operator
       • Interface with ticket handling tools used by sites/federations (if needed)
       • Tools for collecting metrics on the responsiveness of support units
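The "one support unit per domain of expertise" model can be sketched in Python as a small routing table. This is not how GGUS works internally: in the model above, classification and first assignment are done by the ticket process manager. The keyword rules, the unit names and the `Ticket` class below are illustrative assumptions only.

```python
from dataclasses import dataclass

# Hypothetical keyword-to-support-unit rules; the first match wins.
ROUTING_RULES = [
    ("job submission", "grid-operators"),
    ("data transfer", "middleware-experts"),
    ("network", "network-support"),
]

# Tickets that match no rule stay with the ticket process manager.
DEFAULT_UNIT = "ticket-process-manager"


@dataclass
class Ticket:
    ticket_id: int
    description: str
    support_unit: str = DEFAULT_UNIT


def assign(ticket):
    """Assign the ticket to the first support unit whose keyword matches."""
    text = ticket.description.lower()
    for keyword, unit in ROUTING_RULES:
        if keyword in text:
            ticket.support_unit = unit
            break
    return ticket


if __name__ == "__main__":
    for t in (Ticket(1, "Job submission failures on CE"),
              Ticket(2, "Data transfer failed between two sites"),
              Ticket(3, "Unclear problem")):
        print(t.ticket_id, assign(t).support_unit)
```

Keeping a default owner means no ticket is left unassigned when the rules do not cover it, which mirrors the role of a single entry channel with a human first-line classifier.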

  17. The ENOC
     • The EGEE Network Operations Centre (ENOC):
       • Single point of contact between EGEE and the NRENs
       • Where EGEE and the network can exchange operational information
       • Network support unit in GGUS

  18. IGTMD: Grid Interoperability and Massive Data Transfer
     • 3 years, started in Feb 2006
     • Renater, ENS, CC-IN2P3, FNAL (unfunded)
     Goals
     • Disk-to-disk bulk data transfer
     • Replication and referring mechanisms
     • Information System and job management interoperability
     • Grid control and monitoring
     • Usage of statistics and accounting data

  19. IGTMD Roadmap
     • Network: items 1 and 2
       • 2 × 1 Gb/s CC-IN2P3/FNAL on October 16th 2006 – LCG/EGEE
       • Tests on massive data transfer – CC-IN2P3/FNAL
     • Interoperability: item 3
       • Access to grid resources through standard APIs – LCG/EGEE
       • State of the art, cf. JTR – October 17th
       • Roadmap at the IGTMD face-to-face meeting, May 4th
     • Inter-operations: items 4 to 5
       • Test suite relevancy to US sites – EGEE
       • Operations and daily monitoring of services – EGEE
       • Usage records and accounting – OGF

  20. Concerns and updates
     • Achieve a real 24x7 production-quality service: failover mechanisms
     • Increase automation of daily monitoring tools and alarm handling.
     • OGF20 - GIN JOBS - EGEE/TERAGRID/OSG/NORDUGRID/DEISA
       • https://forge.ogf.org/sf/wiki/do/viewPage/projects.gin/wiki/WorkerNodeEnvironmentOGF20
       • 29/08/2005 http://edms.cern.ch/document/630962
       • 29/03/2007 mail from Laurence Field on GIN-JOB
     • GIN-OPS: Savannah and Ninf-G
     • GIN-IS: EGEE-NDGF and EGEE-OSG not updated since 17 August 2006
     • GIN-data: idem
     • GIN-auth: AUP for the gin.gg.org VO since 12/06.
