GridPP Deployment Status, User Status and Future Outlook

Presentation Transcript


  1. GridPP Deployment Status, User Status and Future Outlook. Tony Doyle

  2. Introduction • What is the deployment status? • Is the system usable? • What is the future of GridPP? Wot no middleware?

  3. Middleware. GridPP Middleware is: • Workload Management • Grid Data Management • Network Monitoring • Information Services • Security • Storage Interfaces

  4. Middleware, e.g. the LCG monitoring applet • Monitors resource brokers and virtual organisations (ATLAS, CMS, LHCb, DTeam, other) • Works via SQL queries to the logging and book-keeping database (a minimal, hypothetical sketch follows below)
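To make the applet's mode of operation concrete, here is a minimal sketch of the kind of per-VO query it might run. The table and column names are hypothetical (the real logging and book-keeping schema is not shown on the slide), and SQLite stands in for the actual database purely so the example is self-contained.

    # Hedged sketch: hypothetical schema, SQLite standing in for the real
    # logging and book-keeping database.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE jobs (job_id TEXT, vo TEXT, state TEXT)")
    conn.executemany(
        "INSERT INTO jobs VALUES (?, ?, ?)",
        [("j1", "atlas", "Running"), ("j2", "cms", "Scheduled"),
         ("j3", "lhcb", "Running"), ("j4", "dteam", "Done")],
    )

    # Count jobs per VO and state, as a monitoring display would.
    query = "SELECT vo, state, COUNT(*) FROM jobs GROUP BY vo, state ORDER BY vo"
    for vo, state, n in conn.execute(query):
        print(f"{vo:8s} {state:10s} {n}")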

  5. Middleware, e.g. APEL and R-GMA (R-GMA structure shown on the slide) • Used in the accounting system (GOCDB) • For gLite the sensors are provided by DGAS via DGAS2APEL • The EGEE portal for accounting data is provided by CESGA

  6. Resources • Steady climb since 2004 towards a target of ~10,000 CPU cores (~job slots) • Sunday's status snapshot:
     totalCPU  freeCPU  runJob  waitJob  seAvail (TB)  seUsed (TB)  maxCPU  avgCPU
     6949      3210     2912    77321    246           313          8176    6716
     17/12/06: EGEE total slots 34141 => UKI (6949) is ~20% of the total
     17/12/06: EGEE jobs running 21291 => UKI (2912) is ~14% of the jobs
     Max EGEE = 42517, Max UKI = 8176 (N.B. hyperthreading distorts the 1:1 job-slot:CPU-core relation and reduces the UKI numbers by ~500)
     http://goc.grid.sinica.edu.tw/gstat/UKI.html
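The UKI shares quoted above follow directly from the gstat figures; a quick check using only the numbers on the slide:

    # Cross-check of the shares quoted on the slide (17/12/06 snapshot).
    egee_slots, uki_slots = 34141, 6949
    egee_jobs, uki_jobs = 21291, 2912

    print(f"UKI share of EGEE job slots: {uki_slots / egee_slots:.1%}")  # ~20%
    print(f"UKI share of running jobs:   {uki_jobs / egee_jobs:.1%}")    # ~14%

    # Hyperthreading correction noted on the slide: subtract ~500 from the
    # UKI slot count to get closer to real CPU cores.
    print(f"UKI slots corrected for hyperthreading: ~{uki_slots - 500}")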

  7. Resources: 2006 CPU usage by region (via APEL accounting) http://www3.egee.cesga.es/gridsite/accounting/CESGA/tree_egee.php

  8. Resources (not all records are being accounted) http://www3.egee.cesga.es/gridsite/accounting/CESGA/tree_egee.php

  9. Resources: 2006 CPU usage by experiment http://www3.egee.cesga.es/gridsite/accounting/CESGA/tree_egee.php Total CPU used: 52,876,788 kSI2k-hours!

  10. Utilisation (estimated from gstat job slots/usage; see the sketch below) • UKI mirrors overall EGEE utilisation • Average utilisation for Q306: 66%, compared to a target of ~70% • CPU utilisation was a T2 issue, but is now improving..
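A minimal sketch of the estimate described above: utilisation is taken as occupied job slots over total job slots, averaged over gstat snapshots. The sampling interface is an assumption for illustration, not gstat's real output format, and the additional sample values are placeholders.

    def estimated_utilisation(samples):
        """samples: list of (running_jobs, total_job_slots) snapshots."""
        return sum(r for r, _ in samples) / sum(t for _, t in samples)

    # First sample is the real snapshot from slide 6 (2912 running / 6949 slots);
    # the others are placeholder values purely to show the quarterly averaging.
    q306_samples = [(2912, 6949), (5100, 7000), (5600, 7500)]
    print(f"estimated utilisation: {estimated_utilisation(q306_samples):.0%}"
          "  (Q306 average quoted as 66%, target ~70%)")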

  11. Efficiency (measured by the UK Tier-1 for all VOs) • ~90% CPU efficiency, allowing for i/o bottlenecks, would be OK; the concern is that it is currently ~75% against that target • Each experiment needs to work to improve its system/deployment practice, anticipating e.g. hanging gridftp connections during batch work (see the sketch below)
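The efficiency figure is CPU time over wall-clock time, summed over a VO's batch jobs; the sketch below shows how a single job stalled on i/o (e.g. a hanging gridftp connection) pulls the average down. The record format is an assumption, not the Tier-1's actual batch accounting schema.

    def cpu_efficiency(jobs):
        """jobs: iterable of dicts with 'cpu_s' and 'wall_s' in seconds."""
        cpu = sum(j["cpu_s"] for j in jobs)
        wall = sum(j["wall_s"] for j in jobs)
        return cpu / wall if wall else 0.0

    jobs = [
        {"cpu_s": 3500, "wall_s": 3600},  # healthy job, ~97% efficient
        {"cpu_s": 1800, "wall_s": 3600},  # stalled on a gridftp transfer, 50%
    ]
    print(f"CPU efficiency: {cpu_efficiency(jobs):.0%}")  # ~74%, near the ~75% observed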

  12. Storage (still an issue for the Tier-1 and Tier-2s) http://www.gridpp.ac.uk/storage/status/gridppDiscStatus.html • Utilisation is low (~30%) at T2s and accounting [by VO] is not (yet) there

  13. Storage: GOCDB Accounting Display (under development) • Looking at data for RAL-LCG2 • Storage units: 1 TB = 10^6 MB • Tape Used + Disk Used = Total [plot of Tape Used, Disk Used and Total Used Storage (TB)] • Sensor drop-outs have been fixed

  14. Storage
  • SRM at the T1: ~200 TB of disk (deployment problem in 2006); ~100% usage (a problem for the 2006 service challenges); Castor 2.1
  • SRM at all T2s: ~200 TB of disk in total; ~30% usage (difficult to calculate); dCache 1.7.0 and DPM v1.5.10
  • Dedicated disk servers advised (storage should be robust)
  • Need to make sure sites are running the latest GIP plugins (https://twiki.cern.ch/twiki/bin/view/EGEE/GIP-Plugins)
  • New GOC storage accounting system being put in place; being deployed at Tier-2s
  • SRM v2.2 is being implemented: need to test interoperability

  15. File Transfers (individual rates) http://www.gridpp.ac.uk/wiki/Service_Challenge_Transfer_Test_Summary • Current goals: >250 Mb/s inbound-only, >250 Mb/s outbound-only, >200 Mb/s inbound and outbound (a back-of-envelope check follows) • Aim: to maintain data transfers at a sustainable level as part of the experiment service challenges
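As a back-of-envelope check of what these goals mean in volume terms (taking 1 Mb/s as 10^6 bits per second and decimal TB):

    def tb_per_day(rate_mbit_s):
        """Sustained data volume per day, in decimal TB, for a rate in Mb/s."""
        bytes_per_day = rate_mbit_s * 1e6 / 8 * 86400
        return bytes_per_day / 1e12

    for rate in (200, 250):
        print(f"{rate} Mb/s sustained  ~ {tb_per_day(rate):.1f} TB/day")
    # 250 Mb/s sustained is roughly 2.7 TB/day in each direction.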

  16. Tier-1 Resource
  • Approval for new (shared) machine room; ETA Summer 2008; space for 300 racks
  • Procurement: March 06: 52 AMD 270 units, 21 disk servers (168 TB data capacity). FY 06/07: 47 disk servers (282 TB disk capacity), 64 twin dual-core Intel Woodcrest 5130 units (550 kSI2K). FY 06/07 upcoming: a further 210 TB of disk capacity plus high-availability systems (redundant PSUs, hot-swappable paired HDDs)
  • Storage commissioning saga: ongoing problems with the March kit; firmware updates have now solved the problem (disks on the Areca 1170 in RAID 6 experienced multiple dropouts during testing of the WD drives)
  • Move to CASTOR: very support-heavy, but made available for CSA06 and performing well
  • General: air-con problems, with high temperatures triggering high-pressure cut-outs in the refrigerator gas circuits (summers are warmer even in the UK...); July security incident; 10 Gb CERN line in place, second 10 Gb line scheduled in 07Q1

  17. T2 Resources, e.g. Glasgow (UKI-SCOTGRID-GLASGOW): 800 kSI2k, 100 TB DPM, needed for LHC start-up • IC-HEP: 440 kSI2k, 52 TB dCache • Brunel: 260 kSI2k, 5 TB DPM • (dates on the slide timeline: August 28, September 1, October 13, October 23)

  18. T2 Resources. As overheard at one T2 site.. (could also be 2006)

  19. A. “Usability” (Prequel) • GridPP runs a major part of the EGEE/LCG Grid, which supports ~3000 users • The Grid is not (yet) as transparent as end-users want it to be • The underlying overall failure rate is ~10% • User (interface)s, middleware and operational procedures (need to) adapt • Procedures to manage the underlying problems such that the system is usable are highlighted

  20. Virtual Organisations
  • Users are grouped into VOs; users/VO varies from 1 to 806 members (and growing..)
  • Broadly four classes of VO: LHC experiments; EGEE supported; worldwide (mainly non-LHC particle physics); local/regional, e.g. UK PhenoGrid
  • Sites can choose which VOs to support, subject to MOU/funding commitments; most GridPP sites support ~20 VOs
  • GridPP nominally allocates 1% of resources to EGEE non-HEP VOs
  • GridPP currently contributes 30% of the EGEE CPU resources

  21. User evolution
  Number of users of the UK Grid (exc. Deployment Team); many EGEE VOs supported (c.f. 3000 EGEE target):
    Quarter:  05Q4   06Q2   06Q3
    Users:    1342   1831   2777
  Number of active users (> 10 jobs per month):
    Quarter:   05Q4   06Q1   06Q2
    Active:    83     166    201
    Fraction:  6.2%          11.0%
  Viewpoint: growing fairly rapidly, but not as active as they could be? Depends on the definition of “active” (the fractions are checked in the sketch below).
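The quoted fractions follow from the two tables above (active users divided by registered users in the same quarter):

    registered = {"05Q4": 1342, "06Q2": 1831, "06Q3": 2777}
    active = {"05Q4": 83, "06Q1": 166, "06Q2": 201}  # > 10 jobs per month

    for quarter in ("05Q4", "06Q2"):
        frac = active[quarter] / registered[quarter]
        print(f"{quarter}: {frac:.1%} of registered UK Grid users were active")
    # 05Q4: 6.2%, 06Q2: 11.0%, matching the slide.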

  22. Know your users? UK-enabled VOs (users per VO): 806 atlas, 763 dzero, 577 cms, 566 dteam, 150 lhcb, 131 alice, 75 bio, 65 dteamsgm, 41 esr, 31 ilc, 27 atlassgm, 27 alicesgm, 21 cmsprg, 18 atlasprg, 17 fusn, 15 zeus, 13 dteamprg, 13 cmssgm, 11 hone, 9 pheno, 9 geant, 7 babar, 6 aliceprg, 5 lhcbsgm, 5 biosgm, 3 babarsgm, 2 zeussgm, 2 t2k, 2 geantsgm, 2 cedar, 1 phenosgm, 1 minossgm, 1 lhcbprg, 1 ilcsgm, 1 honesgm, 1 cdf

  23. Resource allocation
  • Assign quotas and priorities to VOs and measure delivery; further work required on VOMS roles/groups within each VO
  • VOMS provides group/role information in the proxy
  • Tools to control quotas and priorities in site services are being developed; so far only at whole-VO level
  • The Maui batch scheduler is flexible and easy to map to groups/roles; sites set the target shares (a configuration sketch follows below)
  • Can publish VO/group-specific values in the GLUE schema, so the RB can use them for scheduling
  • The accounting tool (APEL) measures CPU use at global level (a UK task); storage accounting is currently being added; GridPP monitors storage across the UK
  • Privacy issues around user-level accounting, being solved by encryption
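As a sketch of how whole-VO target shares might be expressed for the Maui scheduler, the snippet below emits fairshare lines using Maui's GROUPCFG/FSTARGET parameters. The share values are illustrative, and mapping each VO onto a Unix group of the same name is an assumption about the site setup, not a GridPP prescription.

    # Illustrative VO shares (percent); not actual GridPP allocations.
    vo_shares = {"atlas": 40, "cms": 25, "lhcb": 25, "dteam": 5, "other": 5}

    # One Maui fairshare line per VO, assuming a VO-per-Unix-group mapping.
    maui_cfg = [f"GROUPCFG[{vo}] FSTARGET={share}" for vo, share in vo_shares.items()]
    print("\n".join(maui_cfg))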

  24. User Support
  • Becoming vital as the number of users grows, but only modest effort is available in the various projects
  • The Global Grid User Support (GGUS) portal at Karlsruhe provides a central ticket interface; problems are categorised; tickets are classified by an on-duty Ticket Process Manager and assigned to an appropriate support unit
  • The UK (GridPP) contributes support effort
  • GGUS has a web-service interface to the ticketing systems at each ROC; other support units are local mailing lists
  • Mostly best-effort support, working hours only
  • Currently ~tens of tickets/week: manageable, but may not scale much further; some tickets slip through the net

  25. Documentation & Training
  • Documentation and training are needed for both system managers and users; mostly expert users up to now, but the user community is expanding
  • Induction of new VOs is a particular problem: no peer support
  • EGEE is running User Fora for users to share experience; the next is in Manchester in May ’07 (with OGF)
  • EGEE has a dedicated training activity run by NeSC/Edinburgh
  • Documentation is often a low priority, with little dedicated effort; the rapid pace of change means that material requires constant review
  • Effort on documentation is now increasing: GridPP has appointed a documentation officer; GridPP web site and wiki; the installation manual for admins is good; there is also a wiki for admins to share experience; focus is now on user documentation; new EGEE web site coming soon

  26. Alternative view? • The number of users in the Grid School for the Gifted is ~manageable now • The system may be too complex, requiring too much work by the “average user”? • Or the (virtual) help desk may not be enough? • Or the documentation may be misleading? • Or.. • Having smart users helps (the current ones are)

  27. Future? Timeline - 1 (http://www.gridpp.ac.uk/docs/gridpp3/): a year-long process, from proposal writing to proposal defence, to define future LHC exploitation
  • 31st March: PPARC Call
  • 16th June: GridPP16 at QMUL
  • 13th July: Bid submitted
  • 6th September: 1st PPRP review
  • 1st November: GridPP17
  • 8th November: PPRP “visiting panel”
  (CB and OC meetings also marked on the slide timeline, Apr-Oct)

  28. Scenario Planning: Resource Requirements [TB, kSI2k] • GridPP requested a fair share of global requirements, according to experiment requirements • Changes in the LHC schedule prompted a(nother) round of resource planning, presented to the CRRB on Oct 24th • New UK resource requirements have been derived and incorporated in the scenario planning, e.g. for the Tier-1

  29. Input to Scenario Planning: Hardware Costing • Empirical extrapolations with (large) extrapolated uncertainties • Hardware prices have been re-examined following the recent Tier-1 purchase • CPU (Woodcrest) was cheaper than expected based on extrapolation of the previous 4 years of data (illustrated in the sketch below)
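A sketch of the kind of empirical extrapolation referred to above: fit an exponential trend to past relative price/performance points and project it forward. The yearly figures are placeholders, not the actual Tier-1 procurement data, so the output is illustrative only.

    import math

    # Placeholder relative cost per kSI2k by purchase year (not real figures).
    history = {2003: 1.00, 2004: 0.62, 2005: 0.40, 2006: 0.24}

    # Least-squares fit of log(cost) = a + b * year.
    years, logs = list(history), [math.log(c) for c in history.values()]
    n = len(years)
    b = (n * sum(y * l for y, l in zip(years, logs)) - sum(years) * sum(logs)) / (
        n * sum(y * y for y in years) - sum(years) ** 2)
    a = (sum(logs) - b * sum(years)) / n

    for year in (2007, 2008):
        print(f"{year}: projected relative cost {math.exp(a + b * year):.2f}")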

  30. Scenario Planning: an example 70% scenario based on experiment inputs [£m]

  31. Back to the Future? Timeline - 2 (Nov-May: PPRP, Science Committee, PPARC Council, grants etc.)
  • 8th Nov: PPRP Visiting Panel
  • 6th Dec: PPRP recommendation to the Science Committee (SC)
  • GridPP2+ outcome (1/9/07-31/3/08) now known: emphasis on operations (modest middleware support); anticipates GridPP3
  • GridPP3 outcome (1/4/08-31/3/11) known in the New Year

  32. Conclusion • What is the deployment status? (snapshot) See e.g. “Performance of the UK Grid for Particle Physics” http://www.gridpp.ac.uk/papers/GridPP_IEEE06.pdf for more info. • Is the system usable? Yes, but more work is required from the end-user perspective • What is the future of GridPP? An operations-led activity, working with EGEE/EGI (EU) and NGS (UK)
