
GridPP: A Brief History Of UK Particle Physics Grid

This article provides an overview of the history and development of GridPP, the UK Particle Physics Grid. It discusses the challenges faced and the solutions implemented to create a large-scale science Grid for the worldwide particle physics community.




Presentation Transcript


  1. GridPP: That is the Grid that is Tony Doyle

  2. Outline • That was the Grid that was: A Brief History Of GridPP; UK Computing Centres; The Grid & its Challenges • That is the Grid that is: Resource Accounting; Performance Monitoring; Outlook; Conclusions x 3 • A talk with a start, no middle and 3 ends? The Icemen Cometh GridPP19 Collaboration Meeting

  3. Context (2000) • To create a UK Particle Physics Grid and the computing technologies required for the Large Hadron Collider (LHC) at CERN • To place the UK in a leadership position in the international development of an EU Grid infrastructure GridPP19 Collaboration Meeting

  4. From Web to Grid - Building the next IT Revolution • Premise: The next IT revolution will be the Grid. The Grid is a practical solution to the data-intensive problems that must be overcome if the computing needs of many scientific communities and industry are to be fulfilled over the next decade. • Aim: The GridPP Collaboration aims to develop and deploy a large-scale science Grid in the UK for use by the worldwide particle physics community. • GridPP Vision: many challenges; a shared distributed infrastructure for all applications. Tony Doyle - University of Glasgow

  5. The Grid Step 0: certificates and passwords I am going to put my root password on a web page. I'll encrypt it of course, to prevent abuse. My fantastic encryption consists of uppercasing the entire password. Now, if anybody decrypts my lowercase password then I'll go after them in court for invading my privacy. GridPP19 Collaboration Meeting

  6. A Brief History Of GridPP
  1999: Grid: Blueprint for a New Computing Infrastructure by Ian Foster and Carl Kesselman published.
  February 2000: A Joint Infrastructure Fund bid is submitted for £6.2m to fund a prototype Tier-1 centre at RAL for the EU-funded DataGrid project. At the time of the JIF bid the LHC was expected to produce 4PB of data a year for 10 years; by 2005 the expected figures had risen to 15PB a year for 15 years. RAL was chosen as the location of the Tier-1 centre because it already hosted the UK BaBar computing centre.
  May 2000: R-ECFA meeting in the UK.
  October 2000: PPARC signs up to the EU DataGrid project, contributing 20 people and a Tier-1 centre.
  November 2000: Trade and Industry Secretary Stephen Byers announces £98m for e-Science in Spending Review 2000. This includes £26m for PPARC to develop HEP and astronomy Grid projects.
  December 2000: GridPP plan created at a meeting at RAL. Initially the £26m was to help fund UK posts to coordinate the UK arm of LCG, as part of that organisation.
  April 2001: A Shadow Project Management Board, referred to as "DataGrid-UK", is established. The first GridPP proposal is submitted.
  30/31st May 2001: PPARC's e-Science Committee meets to consider the proposal and approves the GridPP project, allocating £17m.
  1st September 2001: GridPP officially starts, with funding for 3 years.
  January 2002: DataGrid releases the first production version of the testbed middleware.
  February 2002: First international file transfers using X.509 digital certificates.
  1st March 2002: RAL takes part in a test of DataGrid, creating a small 5-site testbed Grid with CERN, IN2P3-Lyon, CNAF-Bologna and NIKHEF.
  11th March 2002: LHC Computing Grid Project launched.
  23rd March 2002: First prototype Tier1/A hardware delivered to RAL, consisting of 156 dual-CPU PCs with 30GB of storage each.
  25th April 2002: UK National e-Science Centre (NeSC) opened in Edinburgh by Gordon Brown.
  June 2002: ScotGrid, one of the four Tier-2s in GridPP, goes into production.
  August 2002: GridPP makes its first visit to the All Hands e-Science meeting.
  GridPP19 Collaboration Meeting

  7. A Brief History Of GridPP
  December 2002: PPARC receives a further £31.6m for its e-Science programme. The UK plays a significant role in the LHCb Data Challenge.
  February 2003: PPARC puts out a call for proposals for the second phase of its e-Science programme.
  June 2003: Proposal for GridPP2 submitted.
  August 2003: The UKHEP Certificate Authority is replaced by the UK e-Science Certificate Authority, which issues the digital certificates needed to use the Grid.
  September 2003: LHC Computing Grid is launched.
  December 2003: GridSite, initially a tool used by the GridPP website, gets its first production release. The GridPP2 proposal is accepted by PPARC, ensuring the project will run until September 2007 with £16.9m.
  April 2004: The EU DataGrid project ends and is replaced by EGEE (Enabling Grids for E-science in Europe).
  September 2004: GridPP2 is launched. The GridPP website wins the award for Best e-Science Project Website at the All Hands Meeting.
  October 2004: CERN's 50th anniversary.
  January 2005: BaBar UK demonstrates the first successful integration of the Grid into the official BaBar Monte Carlo production system.
  March 2005: LCG passes 100 sites worldwide.
  May 2005: GridPP has grown to 2,740 CPUs and 67 TB of storage.
  July 2005: GridPP members use the UKLight high-speed connection between Lancaster and RAL for the first time, moving data 50 times faster than a normal ADSL line.
  September 2005: The first WISDOM biomedical data challenge for drug discovery is run, looking for drugs against malaria. GridSiteWiKi software is released, allowing users with the correct digital certificate to edit wiki pages. A new version of the Real Time Monitor is launched at the e-Science All Hands Meeting.
  October 2005: The International Grid Trust Federation (IGTF) is established to regulate the digital certificates used on the Grid worldwide.
  November 2005: GridPP's storage capacity reaches 100TB. 2,000,000 jobs were run on the EGEE Grid in 2005.
  January 2006: LCG reaches data speeds of 1GB/s during testing of the infrastructure.
  March 2006: The PEGASUS project is announced, a social-science study of GridPP by researchers from the London School of Economics. PPARC signs the LCG Memorandum of Understanding with CERN, committing the UK Tier-1 at RAL and the four UK Tier-2s to provide services and resources to the LCG.
  April 2006: EGEE enters its 2nd phase. PPARC calls for proposals for the continuation of the UK's Grid computing for particle physicists after September 2007.
  May 2006: The second WISDOM biomedical data challenge for drug discovery is run, looking for drugs against avian flu.
  July 2006: Proposal for GridPP3 submitted; this would extend the project beyond the current end date of September 2007.
  August 2006: GridPP has 3,240 CPUs and 246.25TB of storage.
  December 2006: GridPP accounts for 27% of the 2006 total EGEE CPU resources.
  March 2007: PPARC announces £30m for the GridPP extension.
  GridPP19 Collaboration Meeting

  8. The Grid [workflow diagram, steps 0-11] The numbered job lifecycle through the gLite components: from the Grid UI (gridui), the submitter obtains a proxy via VOMS-proxy-init (step 0), then submits a JDL job description to the Resource Broker (RB/WLMS), which consults the information system (BDII) and file catalogue (LFC) to match the job to Grid Enabled Resources (CPU nodes and storage at the sites). Job status is tracked through the Logging & Bookkeeping service, and the cycle ends with job retrieval back at the submitter. GridPP19 Collaboration Meeting
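The lifecycle on this slide revolves around a JDL (Job Description Language) file handed to the Resource Broker. As a rough illustration only (the executable, sandbox files and VO below are invented, and the command names in the comments are the standard gLite client tools rather than anything GridPP-specific), a minimal JDL could be assembled like this:

```python
# A minimal, illustrative JDL (Job Description Language) document of the
# kind submitted to the gLite Workload Management System. The executable,
# file names and VO are placeholders, not a real GridPP job.
jdl_fields = {
    "Executable": "run_analysis.sh",
    "Arguments": "input.dat",
    "StdOutput": "job.out",
    "StdError": "job.err",
    "InputSandbox": '{"run_analysis.sh", "input.dat"}',
    "OutputSandbox": '{"job.out", "job.err"}',
    "VirtualOrganisation": "atlas",
}

def make_jdl(fields):
    """Render a dict of JDL attributes as a classad-style document."""
    def fmt(v):
        # Brace-delimited lists pass through; plain strings are quoted.
        return v if v.startswith("{") else f'"{v}"'
    body = ";\n".join(f"{k} = {fmt(v)}" for k, v in fields.items())
    return "[\n" + body + ";\n]"

print(make_jdl(jdl_fields))

# The submission sequence sketched on the slide then corresponds roughly to:
#   voms-proxy-init --voms atlas        (step 0: obtain a VOMS proxy)
#   glite-wms-job-submit -a job.jdl     (hand the JDL to the RB/WMS)
#   glite-wms-job-status <jobid>        (poll Logging & Bookkeeping)
#   glite-wms-job-output <jobid>        (retrieve the output sandbox)
```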

  9. The Grid LCG OSG NDG NGS An open operating system does not only have advantages GridPP19 Collaboration Meeting

  10. Context (2007) • 2007 is the third year for the UK Production Grid • More than 5,000 CPUs and more than 1/2 Petabyte of disk storage • The UK is the largest CPU provider on the EGEE Grid, with total CPU used of 25 GSI2k-hours in the last year • The GridPP2 project has met 69% of its original targets, with 92% of the metrics within specification • The LCG (full) Grid Service is now underway • The aim is to continue to improve reliability and performance • The GridPP2 project has been extended by 7 months to April 2008 • The GridPP3 proposal was recently accepted by PPARC (£30m) to extend the project to March 2011 • It is not the end of the Grid as we know it • We anticipate a challenging period ahead GridPP19 Collaboration Meeting

  11. Tier-1 Centre at RAL • High quality data services • National and International Role • UK focus for International Grid development • 1500 CPUs • 750 TB Disk • 530 TB Tape (Capacity 1PB) • “Who is General Failure and why is he reading my disk?” Grid Operations Centre GridPP19 Collaboration Meeting

  12. UK Tier-2 Centres 2. The Tier-2 philosophy basically involves giving you enough rope to hang yourself. And then a couple of feet more, just to be sure. ScotGrid Durham, Edinburgh, Glasgow NorthGrid Daresbury, Lancaster, Liverpool, Manchester, Sheffield SouthGrid Birmingham, Bristol, Cambridge, Oxford, RAL PPD London Brunel, Imperial, QMUL, RHUL, UCL Mostly funded by HEFCE (SFC) GridPP19 Collaboration Meeting

  13. GridPP: Who are we? 19 UK Universities + STFC GridPP1 2001-2004 "From Web to Grid" [£16m+] GridPP2+ 2004-2008 "From Prototype to Production” [£17m+] GridPP3 2008-2011 "From Production to Exploitation” [£30m] 2 + 2 = 5, for sufficiently large values of 2. GridPP19 Collaboration Meeting

  14. Workload Management Grid Data Management Network Monitoring Information Services Security Storage Interfaces Middleware GridPP Middleware is.. C++: an octopus made by nailing extra legs onto a dog. Java is high performance. By high performance we mean adequate. By adequate we mean slow. GridPP19 Collaboration Meeting

  15. Grid Challenges 1. Software process 2. Software efficiency 3. Deployment planning 4. Link centres 5. Share data 6. Manage data 7. Install software 8. Analyse data 9. Accounting 10. Policies (Data Management, Security and Sharing) GridPP19 Collaboration Meeting

  16. Real Time Monitor 3. Any sufficiently advanced technology is indistinguishable from magic (Arthur C. Clarke) GridPP19 Collaboration Meeting

  17. Grid Status • Aim: by 2008 (full year’s data taking) • CPU ~100MSI2k (100,000 CPUs) • Storage ~80PB • Involving >100 institutes worldwide • Build on complex middleware being developed in advanced Grid technology projects, both in Europe (gLite) and in the USA (VDT) • Prototype went live in September 2003 in 12 countries • Extensively tested by the LHC experiments in September 2004 • February 2006: 25,547 CPUs, 4,398 TB storage • Status in May 2007: 177 sites, 29,266 CPUs, 13,815 TB storage • Monitoring via Grid Operations Centre GridPP19 Collaboration Meeting

  18. Resources Accumulated EGEE CPU Usage: 102,191,758 kSI2k-hours, or >100 GSI2k-hours (!) http://www3.egee.cesga.es/gridsite/accounting/CESGA/tree_egee.php UKI: 24,788,212 kSI2k-hours, via APEL accounting GridPP19 Collaboration Meeting
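The unit arithmetic behind the headline figure: 1 GSI2k-hour is 10^6 kSI2k-hours, so the accumulated total is just over 100 GSI2k-hours, and the UKI contribution works out to roughly a quarter of the EGEE total. A quick check:

```python
# Accumulated EGEE CPU usage, from the APEL accounting figures on the slide.
egee_total_ksi2k_hours = 102_191_758
uki_ksi2k_hours = 24_788_212

# kilo -> giga is six orders of magnitude: 1 GSI2k-hour = 1e6 kSI2k-hours.
egee_total_gsi2k_hours = egee_total_ksi2k_hours / 1e6
uki_fraction = uki_ksi2k_hours / egee_total_ksi2k_hours

print(f"{egee_total_gsi2k_hours:.1f} GSI2k-hours")  # just over 100
print(f"UKI share: {uki_fraction:.1%}")             # roughly a quarter
```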

  19. UK Resources Past year’s CPU Usage by experiment GridPP19 Collaboration Meeting

  20. UK Resources Past year’s CPU Usage by Region Close contest in 2007 for CPU honours.. GridPP19 Collaboration Meeting

  21. Job Slots and Use [plots: job slots and usage by year, 2004-2007] GridPP19 Collaboration Meeting

  22. Resource Accounting [plot: CPU capacity vs time up to LHC start-up, scale 100,000 3GHz CPUs] • CPU resources at ~required levels (just-in-time delivery) • Grid-accessible disk accounting being improved • Grid Operations Centre GridPP19 Collaboration Meeting

  23. Efficiency (measured by UK Tier-1 and Tier-2 for all VOs) • Target ~90% CPU efficiency; due to i/o bottlenecks there is concern that this is currently ~70% at the Tier-1 http://www.gridpp.ac.uk/pmb/docs/GridPP-PMB-113-Inefficient_Jobs_v1.0.pdf • Each experiment needs to work to improve its system/deployment practice, anticipating e.g. hanging gridftp connections during batch work • A big issue for the Tier-2s.. A bigger issue for the Tier-1.. GridPP19 Collaboration Meeting
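CPU efficiency here is simply CPU time consumed divided by the wall-clock time the job occupied its batch slot; a job blocked on i/o (e.g. a hanging gridftp transfer) accumulates wall time without accumulating CPU time, dragging the site-wide figure down. A minimal sketch of the calculation, using invented job records:

```python
# CPU efficiency = CPU time used / wall-clock time the slot was occupied.
# Jobs stuck on i/o burn wall time but no CPU time, so they lower the
# aggregate figure. The job records below are invented for illustration.
jobs = [
    {"cpu_s": 9_000, "wall_s": 10_000},   # healthy: 90% efficient
    {"cpu_s": 7_000, "wall_s": 10_000},   # i/o-bound: 70% efficient
    {"cpu_s":   100, "wall_s": 21_600},   # effectively stalled
]

def site_efficiency(jobs):
    """Aggregate CPU efficiency across a set of job records."""
    total_cpu = sum(j["cpu_s"] for j in jobs)
    total_wall = sum(j["wall_s"] for j in jobs)
    return total_cpu / total_wall

print(f"{site_efficiency(jobs):.1%}")  # one stalled job hurts the aggregate
```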

  24. Draft Policy
  All UK sites are given flexibility to deal with stalled jobs (in order that their CPUs are occupied more fully overall), according to the following stalled-job definition: any job consuming <10 minutes of CPU over a given 6-hour period (efficiency < 0.027) is considered stalled. The following intervention scheme should be applied:
  Either: if the site identifies that the problem is due to a well-known cause, e.g. a hanging lcg-cp command, then the jobs may be deleted at once, with the user or VO being informed of the problem. (Note: the site should attempt to identify the SE, SURL or LFN involved, to help debug the underlying data management issue.)
  Or: in cases where the reasons for stalling are not obvious (e.g. a binary just hanging), sites should contact users or VO production teams, informing them of the number of stalled jobs at the site and providing as much information as possible to help debug the problem. (An example of such an email is given in Appendix 1.) In this case, the user or VO should respond within 6 hours, stating whether the site should simply delete the jobs or take further debugging steps. If no response is received within 6 hours, the site may delete the jobs, informing the user of the action taken.
  To contact users by email, the CIC portal lookup of a user's email from their DN will be used. To contact VO production teams, the Operational Contact email address from the CIC portal will be used. If no contact with the user or VO is possible because this information is unavailable, the site may delete the stalled jobs from the batch system immediately. If a site deletes more than 20 jobs this way in any 24-hour period, it should raise a GGUS ticket against the user or VO for future reference.
  This draft policy will be in force from 1st August to 31st December 2007; the Production Manager should be informed of any issues that arise in implementing it. Future (2008 onwards) policy and all inefficient-job parameters (highlighted in bold) will be reviewed at the outset and annually by DTeam. GridPP19 Collaboration Meeting
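The policy's numeric threshold is self-consistent: 10 CPU-minutes over a 6-hour window is 600 s / 21,600 s ≈ 0.027, the efficiency quoted in the definition. A sketch of how a site script might flag stalled jobs under that definition (the batch-system snapshot below is hypothetical):

```python
# Stalled-job check per the draft policy: any job consuming fewer than
# 10 CPU-minutes over a 6-hour window is considered stalled.
WINDOW_S = 6 * 3600          # 6-hour observation window, in seconds
CPU_THRESHOLD_S = 10 * 60    # 10 minutes of CPU, in seconds
STALL_EFFICIENCY = CPU_THRESHOLD_S / WINDOW_S   # ~0.027, as in the policy

def is_stalled(cpu_seconds_in_window, window_s=WINDOW_S):
    """True if the job's efficiency over the window is below the threshold."""
    return cpu_seconds_in_window / window_s < STALL_EFFICIENCY

# Hypothetical snapshot: (job id, CPU seconds consumed in the last 6 hours).
snapshot = [("job-1", 5_400), ("job-2", 120), ("job-3", 0)]
stalled = [jid for jid, cpu in snapshot if is_stalled(cpu)]
print(stalled)  # ['job-2', 'job-3']: both fall below the 10-minute threshold
```

Per the policy, jobs flagged this way would then trigger either immediate deletion (for known causes such as a hanging lcg-cp) or an email to the user/VO with a 6-hour response window.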

  25. Site testing SAM tests (critical = subset): BDII (Top-level BDII), sBDII (Site BDII), FTS (File Transfer Service), gCE (gLite Computing Element), LFC (Global LFC), VOMS, CE (Computing Element), SRM, gRB (gLite Resource Broker), MyProxy, RB (Resource Broker), VOBOX (VO Box), SE (Storage Element), RGMA (RGMA Registry) • Covering Global Tier-1s and the UK Tier-1 (RAL) http://gridview.cern.ch/GRIDVIEW/same_index.php GridPP19 Collaboration Meeting

  26. ATLAS site testing • End-user analysis tests in advance of LHC data-taking • Example: ATLAS • Hourly polling of all sites [plot period 12/01/07 to 10/05/07] http://hepwww.ph.qmul.ac.uk/~lloyd/atlas/atest.php • Measurably improved performance GridPP19 Collaboration Meeting

  27. ATLAS site testing [plot: % success, 12/01/07 to 16/08/07] http://hepwww.ph.qmul.ac.uk/~lloyd/atlas/atest.php • Performance at ~80% can be maintained.. GridPP19 Collaboration Meeting

  28. SAM site testing [plot: % success, 30/05/07 to 27/08/07] http://hepwww.ph.qmul.ac.uk/~lloyd/gridpp/sam.html • Performance at ~90% can be maintained.. • (go away for a holiday in July and don’t come back?) GridPP19 Collaboration Meeting

  29. CMS Challenge CSA06: Successful CMS global 25% capacity test over a 6 week period in Sep/Oct 2006. 5. “CMS-OS” *is* user friendly. It's not idiot-friendly or fool-friendly! • Reconstruction, event selection, calibration, alignment, analysis. • 1PB of data shipped between T0 – T1 – T2s in 6 weeks. • 30 analysis projects involving 70 physicists GridPP19 Collaboration Meeting

  30. LHCb Production [chart: production share by country: CERN, Germany, UK, Spain, France, Italy] UK consistently the largest producer for LHCb 6. If Windows is the answer, it must have been a stupid question. GridPP19 Collaboration Meeting

  31. Conclusion • From the UK Particle Physics perspective the Grid is the basis for computing in the 21st Century: • needed to utilise computing resources efficiently and securely • uses gLite middleware (with evolving standards for interoperation) • required significant investment from PPARC (STFC) – O(£100m) over 10 yrs - including support from HEFCE/SFC • required 3 years’ prototype testbed development [GridPP1] • provides a working production system that has been running for over two years in the build-up to LHC data-taking [GridPP2] • enables seamless discovery of computing resources: utilised to good effect across the UK – internationally significant • not (yet) as efficient as end-user analysts require: ongoing work to improve performance • ready for LHC – just-in-time delivery • future operations-led activity as part of LCG, working with EGEE/EGI (EU) and NGS (UK) [GridPP3] • future challenge is to exploit this infrastructure to perform (previously impossible) physics analyses from the LHC (and ILC and nFact and..) GridPP19 Collaboration Meeting

  32. Philosophical Conclusion “Everything is becoming, nothing is.” Plato “Common sense is the best distributed commodity in the world. For every (wo)man is convinced (s)he is well supplied with it.” Descartes “The superfluous is very necessary.” Voltaire “Heidegger, Heidegger was a boozy beggar, I drink therefore I am.” Python “Only daring speculation can lead us further, and not accumulation of facts.” Einstein “The real, then, is that which, sooner or later, information and reasoning would finally result in.” C. S. Peirce “The philosophers have only interpreted the world in various ways; the point is to change it.” Marx (some of) these may be relevant to your view of “The Grid”… GridPP19 Collaboration Meeting

  33. Alternative Conclusion • Any sufficiently advanced technology is indistinguishable from magic (Arthur C. Clarke) • Who is General Failure and why is he reading my disk? • 2 + 2 = 5, for sufficiently large values of 2. • The UNIX philosophy basically involves giving you enough rope to hang yourself. And then a couple of feet more, just to be sure. • C++: an octopus made by nailing extra legs onto a dog. • Java is high performance. By high performance we mean adequate. By adequate we mean slow. • Emacs is a great OS, but it lacks a good text editor. That's why I use vi. • Linux *is* user friendly. It's not idiot-friendly or fool-friendly! • If Windows is the answer, it must have been a stupid question. • It compiles - let's ship it! • If it ain't broke, fix it 'til it is! • Dilbert's Project Uncertainty Principle: If you understand a project, you won't know its cost, and vice versa. • After all is said and done, there is always a lot more said than done. http://www.cs.ualberta.ca/~mburo/quotes.html GridPP19 Collaboration Meeting

  34. It's the end of the Grid as we know it An open operating system does not only have advantages Any sufficiently advanced technology is indistinguishable from magic (Arthur C. Clarke) Who is General Failure and why is he reading my disk? 2 + 2 = 5, for sufficiently large values of 2. The UNIX philosophy basically involves giving you enough rope to hang yourself. And then a couple of feet more, just to be sure. C++: an octopus made by nailing extra legs onto a dog. Java is high performance. By high performance we mean adequate. By adequate we mean slow. Emacs is a great OS, but it lacks a good text editor. That's why I use vi. Linux *is* user friendly. It's not idiot-friendly or fool-friendly! If Windows is the answer, it must have been a stupid question. It compiles - let's ship it! The other night I dreamt no APIs, continental drift divide. Grids sat in a line, Leonard Bernstein. Leonid Brezhnev, Lenny Bruce and Lester Bangs. Birthday party, Ambleside, PMB, boom! Its your symbolic Grid bug net, right? Right. It's the end of the World as we know it. (Its time I had some time alone)It's the end of the World as we know it. (Its time I had Grid time alone)It's the end of the World as we know it and I feel fine...fine... If it ain't broke, fix it 'til it is! Dilbert's Project Uncertainty Principle: If you understand a project, you won't know its cost, and vice versa. After all is said and done, there is always a lot more sung than done. Six o'clock - VO hour. Don't get caught in foreign towers. Code and burn, return, listen to your code churn. Lock it in, planet blogging, book burning, standards setting. Every user escalate. Automate, incinerate. Write a portal, light a motive. Step down CPU. Watch your disk crush, crushed, uh-oh, this means no fear at this Tier. Revoked certificate steer clear! A tournament, not what I meant, a firmament of sites. Offer me solutions, offer me alternatives and I decline. It's the end of the World as we know it. 
(Its time I had some time alone) It's the end of the World as we know it. (Its time I had Grid time alone) It's the end of the World as we know it and I feel fine. After all is said and done, there is always a lot more sung than done. After all is said and done, there is always a lot more sung than done. After all is said and done, there is always a lot more sung than done. Its time I had Grid time alone Its time I had Grid time alone Its time I had Grid time alone Its time I had Grid time alone Its time I had Grid time alone Its time I had Grid time alone It's the end of the World as we know it and I feel fine. GridPP19 Collaboration Meeting That's Grid, it starts with an earthquake, middleware, an aeroplane and Lenny Bruce is not afraid. Eye of a hurricane, listen to yourself turn – windows serves its own needs, dummy serve your own needs. Feed it off linux, speed grunt, more strength, the testbed starts to clatter where bits make bytes. Fire in a wire, representing seven names, and an RA for hire at a Tier-2 site. Left of west and coming in a hurry with the users breathing down your neck. Deployment team reporters baffled, trumped, tethered, stumped? Look at that code playing. Fine, then. Uh oh, overflow, population, common Grid, but it'll do to run your jobs, serve yourself. (Hello) World serves its own needs, listen to your heartbeat dummy with the rapture and the revered and the right, right. You vitriolic, patriotic, slam, fight, bright light, feeling pretty site. It's the end of the World as we know it. It's the end of the World as we know it. It's the end of the World as we know it and I feel fine.
