The Worldwide LHC Computing Grid

Presentation Transcript


  1. The Worldwide LHC Computing Grid Processing the Data from the World’s Largest Scientific Machine --- Jamie Shiers, CERN, Geneva, Switzerland

  2. Abstract • The world's largest scientific machine will enter production about one year from the time of this conference • In order to exploit its scientific potential, computational resources way beyond those needed for previous accelerators are required • Based on these requirements, a distributed solution based on Grid technologies has been developed • This talk describes the overall requirements that come from the Computing Models of the experiments, the state of deployment of the production services, the on-going validation of these services and of the offline infrastructure of the experiments, and finally the steps that must still be achieved in the months remaining before the deluge of data arrives. The Worldwide LHC Computing Grid - Jamie.Shiers@cern.ch - CCP 2006 - Gyeongju, Republic of Korea

  3. Overview • Brief Introduction to CERN & LHC • Data Processing requirements • The Worldwide LHC Computing Grid • Status and Outlook

  4. LHC Overview The Large Hadron Collider: a proton-proton collider using an existing tunnel, 27 km in circumference and ~100 m underground, which lies beneath the French/Swiss border near Geneva

  5. CERN – European Organization for Nuclear Research

  6. The LHC Machine

  7. Trigger and data-acquisition chain: 40 MHz (1000 TB/sec) → Level 1: 75 kHz (75 GB/sec) → Level 2: 5 kHz (5 GB/sec) → Level 3: 100 Hz (100 MB/sec) → Data Recording & Offline Analysis. CMS Data Rates: • 1 PB/s from detector • 100 MB/s – 1.5 GB/s to ‘disk’ • 5-10 PB growth / year • ~3 GB/s per PB of data. Data Processing: • 100,000 of today’s fastest PCs
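The throughputs quoted at each trigger level are simply rate × event size. A minimal back-of-envelope sketch in Python, with the event sizes inferred from the slide's own rate/bandwidth pairs (e.g. 1000 TB/s at 40 MHz implies ~25 MB per raw event, 75 GB/s at 75 kHz implies ~1 MB after Level 1) rather than from any official trigger configuration:

```python
# Back-of-envelope check of the trigger-chain throughputs quoted above.
# Event sizes are inferred from the slide's rate/bandwidth pairs, not from
# an official CMS configuration.

stages = [
    # (name, rate in Hz, approximate event size in bytes)
    ("Detector output",      40e6, 25e6),  # ~1000 TB/s
    ("Level 1",              75e3,  1e6),  # ~75 GB/s
    ("Level 2",               5e3,  1e6),  # ~5 GB/s
    ("Level 3 / recording",   100,   1e6), # ~100 MB/s to offline
]

for name, rate_hz, event_bytes in stages:
    throughput = rate_hz * event_bytes  # bytes per second
    print(f"{name:22s} {rate_hz:>10.0f} Hz  ->  {throughput / 1e9:10.1f} GB/s")
```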

  9. Data Handling and Computation for Physics Analysis [diagram, les.robertson@cern.ch]: detector → event filter (selection & reconstruction) → raw data (RAW) → reconstruction → event summary data (ESD) → batch physics analysis / event reprocessing → analysis objects (AOD, extracted by physics topic) → interactive physics analysis; event simulation feeds the same chain

  10. Data formats and approximate annual volumes: RAW ~1 PB/yr (1 PB/s prior to reduction!), ESD ~100 TB/yr, AOD ~10 TB/yr, TAG ~1 TB/yr; data flows between the Tier0 and Tier1s, with user access patterns ranging from sequential (RAW) to random (TAG)

  11. Physics @ LHC • Principal Goals: • Explore a new energy/distance scale • Look for ‘the’ Higgs boson • Look for supersymmetry/extra dimensions, … • Find something the theorists did not expect (John Ellis, TH Division, PH Department, CERN – concluding talk, Kraków, July 2006)

  12. LHC: Higgs Decay into 4 muons • Selectivity: 1 in 10¹³ – like finding one person among a thousand world populations, or a needle in 20 million haystacks

  13. Physics example: H → ZZ(*) → 4ℓ, the “gold-plated” channel for Higgs discovery at the LHC [Feynman diagram: gluon-gluon fusion through a top-quark loop producing H, decaying to Z Z(*) → four leptons (e, μ)]. Signal expected in ATLAS after ‘early' LHC operation; simulation of an H → μμee event in ATLAS

  14. (W)LCG Overview The LHC Computing Grid A Worldwide Grid built on existing Grid infrastructures, including the Open Science Grid (OSG), EGEE and NorduGrid

  15. Grid Computing • Today there are many definitions of Grid computing: • The definitive definition of a Grid is provided by Ian Foster [1] in his article "What is the Grid? A Three Point Checklist" [2]. • The three points of this checklist are: • Computing resources are not administered centrally; • Open standards are used; • Non-trivial quality of service is achieved. • … Some sort of Distributed System at least… • that crosses Management / Enterprise domains

  16. LCG depends on 2 major science grid infrastructures … The LCG service runs & relies on the grid infrastructures provided by: EGEE - Enabling Grids for E-SciencE OSG - US Open Science Grid

  17. EGEE – Close-up • Many EGEE regions are Grids in their own right • In some cases these too are built out of smaller, regional Grids • These typically have other, local users, in addition to those of the ‘higher-level’ Grid(s) • Similarly, OSG also supports communities other than those of the LCG…

  18. [Diagram: WLCG spanning EGEE, OSG and many national/regional Grids] • WLCG: • A federation of fractal Grids… • A (small) step towards “the” Grid • (rather than “a” Grid)

  19. Why a Grid Solution? • The LCG Technical Design Report lists: • Significant costs of [ providing ] maintaining and upgrading the necessary resources … more easily handled in a distributed environment, where individual institutes and … organisations can fund local resources … whilst contributing to the global goal • … no single points of failure. Multiple copies of the data, automatic reassigning of tasks to resources… facilitates access to data for all scientists independent of location. … round the clock monitoring and support.

  20. WLCG Collaboration • The Collaboration • ~130 computing centres • 12 large centres (Tier-0, Tier-1) • 40-50 federations of smaller “Tier-2” centres • 29 countries • Memorandum of Understanding • Agreed in October 2005, now being signed • Purpose • Focuses on the needs of the 4 LHC experiments • Commits resources • Each October for the coming year • 5-year forward look • Agrees on standards and procedures

  21. LCG Service Model (Les Robertson) • Tier0 – the accelerator centre (CERN): data acquisition & initial processing; long-term data curation; distribution of data → Tier1s • Tier1 – “online” to the data acquisition process → high availability: managed mass storage – grid-enabled data service; data-intensive analysis; national and regional support; continual reprocessing activity • Tier1 centres: Canada – TRIUMF (Vancouver); France – IN2P3 (Lyon); Germany – Forschungszentrum Karlsruhe; Italy – CNAF (Bologna); Netherlands – NIKHEF (Amsterdam); Nordic countries – distributed Tier-1; Spain – PIC (Barcelona); Taiwan – Academia Sinica (Taipei); UK – CLRC (Didcot); US – FermiLab (Illinois) and Brookhaven (NY) • Tier2 – ~100 centres in ~40 countries: simulation; end-user analysis – batch and interactive

  22. Networking (CPU, disk and tape requirement charts not reproduced) Requirements: • GB/s out of CERN (1.6 GB/s nominal + factor 6 safety margin) • 100s of MB/s into Tier1s • 10s of MB/s into / out of Tier2s Provisioned: • (Backbone at CERN) • 10 Gbps link to each Tier1 site • 1 Gbps minimum to Tier2s
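As a rough sanity check of these link figures (a sketch using only the numbers quoted on the slide plus standard unit conversions; nothing here is provisioning data), a 10 Gbps Tier1 link corresponds to roughly 1.25 GB/s, comfortably above "hundreds of MB/s", while 1.6 GB/s nominal with a factor ~6 safety margin implies several tens of Gbps of backbone capacity out of CERN:

```python
# Rough sanity check of the WAN figures above (illustrative only).

GBIT = 1e9 / 8                      # bytes per second in one Gbps

tier1_link = 10 * GBIT              # 10 Gbps per Tier1 link
nominal_out_of_cern = 1.6e9         # 1.6 GB/s nominal export rate from CERN
safety_factor = 6                   # headroom quoted on the slide

print(f"10 Gbps Tier1 link       ~ {tier1_link / 1e9:.2f} GB/s")
print(f"CERN export with margin  ~ {nominal_out_of_cern * safety_factor / 1e9:.1f} GB/s "
      f"(~{nominal_out_of_cern * safety_factor / GBIT:.0f} Gbps of backbone capacity)")
```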

  23. Summary of Tier0/1/2 Roles • Tier0: safe keeping of RAW data (first copy); first-pass reconstruction; distribution of RAW data and reconstruction output to Tier1s; reprocessing of data during LHC down-times • Tier1: safe keeping of a proportional share of RAW and reconstructed data; large-scale reprocessing and safe keeping of the corresponding output; distribution of data products to Tier2s and safe keeping of a share of the simulated data produced at these Tier2s • Tier2: handling analysis requirements and a proportional share of simulated event production and reconstruction. N.B. there are differences in roles by experiment – essential to test using the complete production chain of each!

  24. ATLAS Computing Model • Tier-0: • Copy RAW data to Castor tape for archival • Copy RAW data to Tier-1s for storage and reprocessing • Run first-pass calibration/alignment (within 24 hrs) • Run first-pass reconstruction (within 48 hrs) • Distribute reconstruction output (ESDs, AODs & TAGS) to Tier-1s • Tier-1s: • Store and take care of a fraction of RAW data • Run “slow” calibration/alignment procedures • Rerun reconstruction with better calib/align and/or algorithms • Distribute reconstruction output to Tier-2s • Keep current versions of ESDs and AODs on disk for analysis • Tier-2s: • Run simulation • Keep current versions of AODs on disk for analysis

  25. ATLAS Tier-0 Data Flow [diagram: event filter (EF) → Castor disk buffer → tape, CPU farm and Tier-1s] Per-format figures: RAW – 1.6 GB/file, 0.2 Hz, ~17K files/day, 320 MB/s, ~27 TB/day; ESD – 0.5 GB/file, 0.2 Hz, ~17K files/day, 100 MB/s, ~8 TB/day; AOD – 10 MB/file, 2 Hz, ~170K files/day, 20 MB/s, ~1.6 TB/day; AODm – 500 MB/file, 0.04 Hz, ~3.4K files/day, 20 MB/s, ~1.6 TB/day. Aggregate flows include 440 MB/s (0.44 Hz, ~37K files/day) to tape (RAW + ESD + AODm) and 720 MB/s (1 Hz, ~85K files/day) to the Tier-1s (RAW, ESD ×2, AODm ×10)

  26. ATLAS “average” Tier-1 Data Flow (2008) [diagram: incoming data from Tier-0, exchanges with other Tier-1s and Tier-2s, tape, disk buffer, disk storage and CPU farm] Real data storage, reprocessing and distribution; plus simulation & analysis data flow. Representative streams: RAW – 1.6 GB/file, 0.02 Hz, ~1.7K files/day, 32 MB/s, ~2.7 TB/day; ESD2 – 0.5 GB/file, 0.02 Hz, ~1.7K files/day, 10 MB/s, ~0.8 TB/day; AOD2 – 10 MB/file, 0.2 Hz, ~17K files/day, 2 MB/s, ~0.16 TB/day; AODm2 – 500 MB/file, 0.004-0.04 Hz, 2-20 MB/s, up to ~1.6 TB/day. Combined RAW + ESD2 + AODm2 from Tier-0: 0.044 Hz, ~3.74K files/day, 44 MB/s, ~3.66 TB/day
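Both flow diagrams rest on the same arithmetic: file size × file rate gives the bandwidth of a stream, and scaling by 86 400 s gives files and volume per day. A small illustrative sketch reproducing a few of the Tier-0 figures (stream names and values are copied from the slide; this is not ATLAS bookkeeping code):

```python
# Reproduce a few of the quoted ATLAS Tier-0 stream figures from
# file size x rate (values copied from the slide above).

SECONDS_PER_DAY = 86_400

streams = [
    # (name, file size in bytes, file rate in Hz)
    ("RAW",  1.6e9, 0.2),    # quoted: 320 MB/s, ~17K files/day, ~27 TB/day
    ("ESD",  0.5e9, 0.2),    # quoted: 100 MB/s, ~8 TB/day
    ("AOD",  10e6,  2.0),    # quoted: 20 MB/s, ~1.6 TB/day
    ("AODm", 500e6, 0.04),   # quoted: 20 MB/s, ~1.6 TB/day
]

for name, size_bytes, rate_hz in streams:
    bandwidth = size_bytes * rate_hz                  # bytes per second
    files_per_day = rate_hz * SECONDS_PER_DAY
    volume_per_day = bandwidth * SECONDS_PER_DAY      # bytes per day
    print(f"{name:5s} {bandwidth / 1e6:6.0f} MB/s  "
          f"{files_per_day / 1e3:6.1f}K files/day  "
          f"{volume_per_day / 1e12:5.1f} TB/day")
```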

  27. Nominal Tier0 → Tier1 Data Rates (pp) [table of per-site rates, with a ‘Heat’ rating, not reproduced – see the Scoville scale below]

  28. Global Inter-Site Rates

  29. The Scoville Scale • The Scoville scale is a measure of the hotness of a chilli pepper. These fruits of the Capsicum genus contain capsaicin, a chemical compound which stimulates thermoreceptor nerve endings in the tongue, and the number of Scoville heat units (SHU) indicates the amount of capsaicin present. Many hot sauces use their Scoville rating in advertising as a selling point. • It is named after Wilbur Scoville, who developed the Scoville Organoleptic Test in 1912 [1]. As originally devised, a solution of the pepper extract is diluted in sugar water until the 'heat' is no longer detectable to a panel of (usually five) tasters; the degree of dilution gives its measure on the Scoville scale. Thus a sweet pepper, containing no capsaicin at all, has a Scoville rating of zero, meaning no heat detectable even undiluted. Conversely, the hottest chillies, such as habaneros, have a rating of 300,000 or more, indicating that their extract has to be diluted 300,000-fold before the capsaicin present is undetectable. The greatest weakness of the Scoville Organoleptic Test is its imprecision, because it relies on human subjectivity.

  30. Scoville Scale – cont.

  31. LCG Status The LHC Computing Grid Status of Deployment of Worldwide Production Grid Services

  32. The LCG Service • The LCG Service has been validated over the past 2 years via a series of dedicated “Service Challenges”, designed to test the readiness of the service infrastructure • These are complementary to tests by the experiments of the offline Computing Models – the Service Challenges have progressively ramped up the level of service in preparation for ever more detailed tests by the experiments • The target: full production services by end September 2006! • Some additional functionality is still to be added; resource levels will continue to ramp up in 2007 and beyond • Resource requirements are strongly coupled to the total volume of data acquired to date

  33. The Service Challenge Programme • Significant focus on Data Management, including data export from Tier0 → Tier1s • Services required by VO / site agreed in mid-2005, with small but continuous evolution expected • Goal is delivery of stable production services • Status: after several iterations, requirements and plans of experiments understood, required services by site established • Still some operational and functional problems, being pursued on a regular basis

  34. CERN (Tier0) MoU Commitments

  35. Breakdown of a normal year (from Chamonix XIV; R. Bailey, Chamonix XV, January 2006) • 7-8 service upgrade slots? • ~140-160 days for physics per year, not forgetting ion and TOTEM operation • Leaves ~100-120 days for proton luminosity running? • Efficiency for physics 50%? • → ~50 days ~ 1200 h ~ 4 × 10⁶ s of proton luminosity running / year
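A one-line check of the arithmetic on this slide (pure unit conversion using the slide's own numbers; the 100-day figure is the lower end of the quoted range):

```python
# Check: ~100-120 days of proton running at ~50% efficiency for physics
# gives roughly 50 days ~ 1200 h ~ 4e6 s of luminosity running per year.

days_for_protons = 100   # lower end of the quoted 100-120 days
efficiency = 0.5         # "efficiency for physics 50%?"

effective_days = days_for_protons * efficiency
hours = effective_days * 24
seconds = hours * 3600

print(f"~{effective_days:.0f} days  ~{hours:.0f} h  "
      f"~{seconds:.1e} s of proton luminosity running / year")
```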

  36. [Chart of data-export throughput: Easter weekend, target 10-day period, July-August 2006 disk and tape rates] Testing of experiment-driven data export at 50% of nominal rate, more than 1 year prior to first collisions

  37. Experiment Production • Experiments currently testing full production chain • Elements include: • Data export • Job submission • Full integration of Tier0/Tier1/Tier2 sites

  38. Plans Prior to First Collisions • Between now and first collisions these activities will continue, progressively ramping up in scope and scale • Still significant work to involve ~100 Tier2s in a distributed, reliable service • Still much work to do to attain data rates for prolonged periods (weeks) including recovery from site failure • power, cooling, service issues

  39. And Beyond… • First collisions LHC expected November 2007 • These will be at ‘low’ energy – 450 GeV per beam • Main target will be understanding detectors, trigger and offline software • ‘Re-discover’ existing physics – excellent for calibration! • Data rates will be full nominal values! (Machine efficiency?) • First full energy run in 2008: 7 + 7 TeV • Physics discovery run! • Heavy Ions in 2009? Data export schedule? • Typically takes ~years to fully understand detector and software chain • Much of the initial ‘analysis’ will be done starting from RAW/ESD datasets • Big impact on network load – larger datasets, transferred more frequently • Potential mismatch with ‘steady-state’ planning? • Much larger initial bandwidth requirement (but do you really believe it will go down?) • Those sites that have it will be more ‘competitive’ (and vice-versa…) • Rate calculations have overhead for recovering backlogs due to down-time • But not for recovery from human and / or software error! • e.g. bug in alignment / calibration / selection / classification code -> junk data!

  40. Summary & Conclusions • Deploying a Worldwide Production Grid is not without its challenges • Much has been accomplished; much still outstanding • My two top issues? • Collaboration & communication at such a scale requires significant and constant effort • We are not yet at the level that this is just basic infrastructure • “Design for failure” – i.e. assume that things don’t work, rather than hope that they always do! • A lesson from our “founding fathers” – the creators of the Internet?
