
Large-Scale Computing with Grids




Presentation Transcript


  1. Large-Scale Computing with Grids Jeff Templon KVI Seminar 17 February 2004

  2. Information I intend to transfer • Why are Grids interesting? Grids are solutions, so I will spend some time talking about the problem and showing how Grids are relevant. Solutions should solve a problem. • What are Computational Grids? • How are Grids being used, in HEP as well as in other fields? • What are some examples of current Grid “research topics”?

  3. Our Problem • Place event info on a 3D map • Trace trajectories through hits • Assign a type to each track • Find the particles you want • Needle in a haystack! • This is a “relatively easy” case

  4. More complex example

  5. Data Handling and Computation for Physics Analysis [diagram: the detector feeds the event filter (selection & reconstruction), which produces raw data; reconstruction and event reprocessing turn raw data into processed data (event summary data); batch physics analysis extracts analysis objects by physics topic for interactive physics analysis; event simulation feeds into the same chain]

  6. Computational Aspects • To reconstruct and analyze 1 event takes about 90 seconds • Maybe only a few out of a million are interesting. But we have to check them all! • Analysis program needs lots of calibration; determined from inspecting results of first pass. • Each event will be analyzed several times!

  7. One of the four LHC detectors: the online system uses a multi-level trigger to filter out background and reduce the data volume. 40 MHz (40 TB/sec) → level 1, special hardware → 75 KHz (75 GB/sec) → level 2, embedded processors → 5 KHz (5 GB/sec) → level 3, PCs → 100 Hz (100 MB/sec) → data recording & offline analysis

  8. Computational Implications (2) • 90 seconds per event to reconstruct and analyze • 100 incoming events per second • To keep up, need either: • A computer that is nine thousand times faster, or • nine thousand computers working together • Moore’s Law: wait 20 years and computers will be 9000 times faster (we need them in 2007!)
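
The arithmetic on this slide is simple enough to check in a few lines. The figures (about 90 seconds per event, 100 incoming events per second) come from the preceding slides; the snippet itself is only an illustrative back-of-envelope sketch.

```python
# Back-of-envelope check of the CPU count quoted on the slide.
seconds_per_event = 90    # CPU time to reconstruct and analyze one event (slide 6)
events_per_second = 100   # event rate after the trigger chain (slide 7)

# Work arrives at 100 events/s and each event costs 90 CPU-seconds, so
# keeping up requires 90 * 100 = 9000 CPUs of the reference speed.
cpus_needed = seconds_per_event * events_per_second
print(f"CPUs needed to keep up with one experiment: {cpus_needed}")
```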

  9. More Computational Issues • Four LHC experiments – roughly 36k CPUs needed • BUT: accelerator not always “on” – need fewer • BUT: multiple passes per event – need more! • BUT: haven’t accounted for Monte Carlo production – more!! • AND: haven’t addressed the needs of “physics users” at all!

  10. Large Dynamic Range Computing • It is clear that we have enough work to keep somewhere between 50k and 100k CPUs busy • Most of the activities are “bursty”, so doing them with dedicated facilities is inefficient • Need some meta-facility that can serve all the activities • Reconstruction & reprocessing • Analysis • Monte-Carlo • All four LHC experiments • All the LHC users

  11. LHC User Distribution • Putting all computers in one spot leads to traffic jams • Which spot is willing to pay for & maintain 100k CPUs?

  12. Classic Motivation for Grids • Large Scales: 50k CPUs, petabytes of data • (if we’re only talking ten machines, who cares?) • Large Dynamic Range: bursty usage patterns • Why buy 25k CPUs if 60% of the time you only need 900 CPUs? • Multiple user groups on single system • Can’t “hard-wire” the system for your purposes • Wide-area access requirements • Users not in same lab or even continent

  13. Solution using Grids • Large Scales: 50k CPUs, petabytes of data • Assemble 50k+ CPUs and petabytes of mass storage • Don’t need to be in the same place! • Large Dynamic Range: bursty usage patterns • When you need less than you have, others use excess capacity • When you need more, use others’ excess capacities • Multiple user groups on single system • “Generic” grid software services (think web server here) • Wide-area access requirements • Public Key Infrastructure for authentication & authorization

  14. How do Grids Work? Public Key Infrastructure & Generic Grid Software Services

  15. Public Key Infrastructure – Single Login • Based on asymmetric cryptography: public/private key pairs • Need network of trusted Certificate Authorities (e.g. Verisign) • CAs issue certificates to users, computers, or services running on computers (typically valid for two years) • Certificates carry • an “identity” /O=dutchgrid/O=users/O=kvi/CN=Nasser Kalantar • a public key • identity of the issuing CA • a digital signature (a transformation of the certificate text made with the CA’s private key) • Need to tighten this part up, takes too long
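
As a concrete illustration of the fields listed above, here is a minimal sketch that loads a certificate, prints its identity, issuer, and validity, and checks the CA's signature. It assumes the third-party `cryptography` package, an RSA-signed certificate, and hypothetical file names (`usercert.pem`, `ca.pem`); it is not the actual grid middleware tooling.

```python
# Minimal sketch: inspect an X.509 certificate and verify the CA signature.
# Assumes the `cryptography` package, RSA-signed certs, and hypothetical
# file names -- not the actual grid middleware tools.
from cryptography import x509
from cryptography.hazmat.primitives.asymmetric import padding

with open("usercert.pem", "rb") as f:      # hypothetical user certificate
    cert = x509.load_pem_x509_certificate(f.read())
with open("ca.pem", "rb") as f:            # hypothetical CA certificate
    ca = x509.load_pem_x509_certificate(f.read())

print("identity:  ", cert.subject.rfc4514_string())  # e.g. the /O=dutchgrid/... name
print("issued by: ", cert.issuer.rfc4514_string())   # identity of the issuing CA
print("valid until:", cert.not_valid_after)          # typically ~two years

# The digital signature is a transformation of the certificate body made with
# the CA's private key; anyone holding the CA's public key can check it.
ca.public_key().verify(
    cert.signature,
    cert.tbs_certificate_bytes,
    padding.PKCS1v15(),
    cert.signature_hash_algorithm,
)
print("CA signature OK")
```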

  16. What is a Grid? (what are these Generic Grid Software Services?)

  17. Info System Defines the Grid [diagram: compute clusters (C) and a data store (DS), each running a grid software service (like an http server), all publishing into the Information System] The Information System is the Central Nervous System of the Grid
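
To make the "central nervous system" idea concrete, here is a toy sketch in which each site's grid service publishes a status record into a shared information system and other services query it. This is an illustration only, not the MDS or R-GMA systems mentioned later; all the site names and attributes are made up.

```python
# Toy information system: every cluster or data store runs a grid software
# service (like an http server) that publishes a status record, and other
# grid services query those records. Illustrative only; not MDS or R-GMA.
from dataclasses import dataclass

@dataclass
class SiteRecord:
    name: str
    kind: str            # "cluster" or "data-store"
    free_cpus: int = 0
    free_tb: float = 0.0

class InformationSystem:
    def __init__(self):
        self._records = {}

    def publish(self, record: SiteRecord) -> None:
        """Called periodically by each site's grid service."""
        self._records[record.name] = record

    def query(self, **constraints):
        """Return all records whose attributes match the given constraints."""
        return [r for r in self._records.values()
                if all(getattr(r, k) == v for k, v in constraints.items())]

# Example: two clusters and a data store announce themselves (made-up sites).
info_sys = InformationSystem()
info_sys.publish(SiteRecord("nikhef-cluster", "cluster", free_cpus=120))
info_sys.publish(SiteRecord("kvi-cluster", "cluster", free_cpus=8))
info_sys.publish(SiteRecord("sara-store", "data-store", free_tb=40.0))
print(info_sys.query(kind="cluster"))
```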

  18. Data Grid [diagram: compute clusters (C) and data stores (DS), each running a grid software service, connected to the Information System (I.S.) and the Data Management System (D.M.S.)]

  19. Computing Task Submission [diagram: the user sends a proxy + command (and data) to the Workload Management System (W.M.S.); the W.M.S. sends coarse requirements to the Information System (I.S.), gets back a list of candidate clusters, and then gets fresh, detailed info on them before dispatching the task to a cluster; the Data Management System (D.M.S.) and the clusters (C) and data stores (DS) are also shown]
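
The submission step on this slide amounts to matchmaking: coarse requirements narrow the field to candidate clusters, and fresh, detailed information decides among them. Here is a toy sketch of that two-step logic; it is purely illustrative, not the actual EDG Workload Manager, and the sites, attributes, and ranking rule are assumptions.

```python
# Toy workload-management matchmaking, following the flow on the slide:
# coarse requirements -> candidate clusters -> fresh, detailed info -> choice.
# Purely illustrative; not the actual EDG/LCG workload manager.

def candidate_clusters(info_sys_snapshot, min_cpus):
    """Step 1: filter a (possibly stale) snapshot with coarse requirements."""
    return [name for name, attrs in info_sys_snapshot.items()
            if attrs["free_cpus"] >= min_cpus]

def choose_cluster(candidates, fresh_lookup):
    """Step 2: re-query each candidate for fresh, detailed info and rank."""
    detailed = {name: fresh_lookup(name) for name in candidates}
    # Prefer the cluster with the shortest estimated queue time.
    return min(detailed, key=lambda name: detailed[name]["queue_minutes"])

# Example with made-up sites and numbers.
snapshot = {"nikhef": {"free_cpus": 120}, "kvi": {"free_cpus": 8}}
fresh = lambda name: {"nikhef": {"queue_minutes": 5},
                      "kvi": {"queue_minutes": 0}}[name]
best = choose_cluster(candidate_clusters(snapshot, min_cpus=4), fresh)
print("submit proxy + command to:", best)
```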

  20. Computing Task Execution [diagram: the W.M.S. forwards the proxy + command (and data) to the chosen cluster; the running job asks the I.S. how to find the D.M.S., asks the D.M.S. “Where is my data?”, and gets back a list of best locations; progress is reported to a logger]

  21. Computing Task Execution [diagram: when the job finishes, it asks the D.M.S. “Where do I put the data?” and the I.S. how to contact the O.D.S.; it sends the proxy + data there, registers the output with the D.M.S., and reports “Done” to the logger]
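
Slides 20 and 21 both revolve around the data management system's replica bookkeeping: a running job asks where existing data lives, and a finishing job registers where it stored its output. Here is a toy version of that catalogue; it is illustrative only, not the Replica Location Service mentioned later, and the logical names and site names are made up.

```python
# Toy replica catalogue for the data-management steps on slides 20-21:
# "Where is my data?" and "Register output". Illustrative only; not the
# actual Replica Location Service.
from collections import defaultdict

class ReplicaCatalogue:
    def __init__(self):
        self._replicas = defaultdict(list)   # logical name -> storage sites

    def register(self, logical_name, storage_site):
        """Called by a finishing job after it has written its output."""
        self._replicas[logical_name].append(storage_site)

    def locate(self, logical_name, near=None):
        """Answer 'Where is my data?': list of locations, preferred site first."""
        sites = list(self._replicas[logical_name])
        sites.sort(key=lambda s: 0 if s == near else 1)
        return sites

catalogue = ReplicaCatalogue()
catalogue.register("lfn:run1234/raw.dat", "sara-store")   # made-up names
catalogue.register("lfn:run1234/raw.dat", "cern-store")
print(catalogue.locate("lfn:run1234/raw.dat", near="sara-store"))
```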

  22. Can do a bit more with data…

  23. Computing Task Submission with data [diagram: as on slide 19, but the proxy + command now carries a data file spec; the W.M.S. asks the D.M.S. to find copies of the data and combines that with the coarse requirements, candidate clusters, and fresh, detailed info from the I.S. before choosing a cluster]

  24. Computing Task Execution [diagram: as on slide 20, but with the data file spec: the job finds the D.M.S. via the I.S., asks “Where is my data?”, gets a list of best locations, and uses local access to the copy at its own site]
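
The refinement on slides 23 and 24 is that the broker also asks the data management system where copies of the input file live and prefers clusters close to a copy, so the job can read it with local access instead of pulling it over the wide-area network. A toy ranking function along those lines; the topology is invented for the example.

```python
# Toy data-aware brokering for slides 23-24: prefer candidate clusters that
# sit next to a replica of the requested input file, so the job can use
# local access instead of wide-area transfers. Illustrative only.

def rank_by_data_locality(candidates, replica_sites, site_of_cluster):
    """Order candidate clusters so those co-located with a replica come first."""
    return sorted(
        candidates,
        key=lambda cluster: 0 if site_of_cluster[cluster] in replica_sites else 1,
    )

# Example with a made-up topology: the input file has replicas at two sites.
replicas = {"sara-store", "cern-store"}                  # answer from the D.M.S.
cluster_site = {"nikhef": "sara-store", "kvi": "kvi-store"}
print(rank_by_data_locality(["kvi", "nikhef"], replicas, cluster_site))
# -> ['nikhef', 'kvi']: nikhef comes first because its data store holds a copy.
```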

  25. Grids are working now • See it in action

  26. Applications • Biomedical Image Comparison • D0 Run II Data Reprocessing • Ozone Profile Computation, Storage, & Access

  27. Grid Research Topics • “push” vs. “pull” model of operation (you just saw “push”) • We know how to do fine-grained priority & access control in “push” • We think “pull” is more robust operationally • Software distribution • HEP has BIG programs with many versions … “pre” vs. “zero” install • Data distribution • Similar issues (software is just executable data!) • A definite bottleneck in D0 reprocessing • (1 Gbit/100/2/2=modem!) • Speedup limited to 92% of theoretical • About 1 hour of CPU time lost per job • Optimization in general is hard; is it worth it?
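
To make the push/pull distinction concrete: in the push model (what the previous slides showed) the broker decides where each job goes, while in the pull model each site fetches work from a shared task queue whenever it has free capacity. The sketch below is a toy contrast of the two, with invented site names and numbers; it is not how any particular grid middleware implements either model.

```python
# Toy contrast of the two operating models on this slide. In "push" the
# broker assigns jobs to sites; in "pull" each site takes work from a
# shared task queue when it has free CPUs. Purely illustrative.
from collections import deque

def push(jobs, sites):
    """Broker decides: send each job to the site with the most free CPUs."""
    assignment = {}
    for job in jobs:
        site = max(sites, key=sites.get)   # relies on (possibly stale) site info
        assignment[job] = site
        sites[site] -= 1
    return assignment

def pull(task_queue, free_cpus):
    """Site decides: take work from the shared queue only while capacity lasts."""
    taken = []
    while task_queue and free_cpus > 0:
        taken.append(task_queue.popleft())
        free_cpus -= 1
    return taken

jobs = [f"job{i}" for i in range(4)]
print("push:", push(list(jobs), {"nikhef": 3, "kvi": 1}))
queue = deque(jobs)
print("pull:", pull(queue, free_cpus=3), "still queued:", list(queue))
```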

  28. Directed Acyclic Graphs
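
This slide is a diagram; the connection to grids is that workflows are often expressed as jobs with dependencies forming a directed acyclic graph, which fixes a valid execution order (this is what workflow tools such as Condor's DAGMan manage). A minimal sketch of that idea, with a hypothetical workflow; it is not DAGMan itself.

```python
# Minimal sketch of a job-dependency DAG and one valid execution order.
# Illustrative only; real workflow tools (e.g. Condor DAGMan) add retries,
# sub-DAGs, and per-job submit descriptions.
from graphlib import TopologicalSorter   # Python 3.9+

# Each job lists the jobs it depends on (hypothetical workflow).
dag = {
    "simulate":    set(),
    "reconstruct": {"simulate"},
    "calibrate":   {"reconstruct"},
    "analyze":     {"reconstruct", "calibrate"},
}

order = list(TopologicalSorter(dag).static_order())
print("one valid execution order:", order)
# -> ['simulate', 'reconstruct', 'calibrate', 'analyze']
```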

  29. Didn’t talk about • Grid projects • European Data Grid – ends this month, 10 M€ / 3y • VL-E – ongoing, 20 M€ / 5y • EGEE – starts in April, 30 M€ / 2y • Gridifying HEP software & computing activities • Transition from supercomputing to Grid computing (funding) • Integration with other technologies • Submit, monitor, and control computing work from your PDA or mobile!

  30. Conclusions • Grids are working now for workload management in HEP • Big developments in next two years with large-scale deployments of LCG & EGEE projects • Seems Grids are catching on (IBM / Globus alliance) • Lots of interesting stuff to try out!!!

  31. What’s There Now? • Job Submission • Marriage of Globus, Condor-G, EDG Workload Manager • Latest version reliable at 99% level • Information System • New System (R-GMA) – very good information model, implementation still evolving • Old System (MDS) – poor information model, poor architecture, but hacks allow 99% “uptime” • Data Management • Replica Location Service – convenient and powerful system for locating and manipulating distributed data; mostly still user-driven (no heuristics) • Data Storage • “Bare gridFTP server” – reliable but mostly suited to disk-only mass store • SRM – no mature implementations

  32. Grids are Information [diagram: an A-Grid, defined by its Information System, alongside a B-Grid, defined by its Information System]
