
Real-life experiences with grids: It’s not as easy as it looks



Presentation Transcript


  1. Real-life experiences with grids: It’s not as easy as it looks Alain Roy roy@cs.wisc.edu University of Wisconsin-Madison Condor Team Grid Experiences

  2. Who Am I? • Member of Condor Team • Experience with Condor • Experience with grid deployment • Developer of Virtual Data Toolkit • Used by GriPhyN, EDG, LCG… • Packaging of Globus, Condor, etc. • Collaborator with INFN • Working with Paolo Mazzanti • In Bologna for four weeks

  3. Italy • Italy is beautiful • The food is wonderful • The people are friendly

  4. Background • Condor’s environment is a little like a grid • Not all computers (grid sites) are under Condor’s control • Computers (grid sites) disappear at the owner’s whim • Everything changes constantly • Condor was built to deal with this dynamic environment • Grid software needs to do the same

  5. Background • Late 1980s until today • Condor developed and deployed on hundreds of sites • Condor built to deal with failures • Recently • Condor-G: your window to the grid • Condor team has helped deploy grid technology for real use, not just experiments

  6. Background: Condor • Condor is a batch job system • Goal: High-throughput computing • Different from high-performance computing • Goal: High reliability • Goal: Support distributed ownership

  7. High-Throughput Computing • Worry about FLOPS/year, not FLOPS/second • Use all resources effectively • Dedicated clusters • Non-dedicated computers (desktop)

  8. Effective Resource Use • Requires high reliability: computers come and go, your jobs shouldn’t • Checkpointing • Be prepared for everything breaking • Requires distributed ownership

  9. Condor-G • Condor-G submits Globus jobs • Jobs are in persistent queue • Unlike globus-job-run • Jobs are retried on system failures • Jobs are held on some failures • Condor-G makes it easy to submit grid jobs

  10. Background: USCMS • CMS: • Detector online in 2007 • Needs to simulate & reconstruct millions of events • USCMS testbed • Joint PPDG/GriPhyN effort • Integrate CMS tools with grid tools • Globus • Condor-G • Contribute real work to CMS

  11. Background: USCMS • 7 sites, 250+ CPUs • Spring 2002: Deploy & test • Fall 2002 • Last-minute production • 150,000 events in two weeks • Successful, but lots of work • Today: • Wider deployment & use

  12. Background: DØ • Experiment at Fermilab • Already doing real production, real analysis • Deploying on grid sites today • Condor-G • Globus • SAM

  13. DØ: Condor-G • They liked Condor-G • But Condor-G was missing a feature: • Deciding which grid site to use • SAM (data handling software) knows where data is located • SAMGrid: • Condor-G asks SAM for advice • Condor-G decides where to run jobs

  14. DØ: deployment • Spring: Beginning of deployment • Late summer: production • Early results: • It looks good • We have more work to do • Better error reporting • Better matchmaking • What will we learn later?

  15. Problems & Lessons • During our experiences, we’ve: • Encountered many problems • Developed solutions to these problems • Learned many lessons about grids • This talk: • Shares some interesting problems • Gives some advice & solutions

  16. Problem: Taking a taxi • How do you take a taxi in Paestum, Italy? • We don’t need to: walk 4 km there • The ruins were lovely • The ruins were outside • It was about 35°C • Wife is pregnant

  17. Lesson: Use all your resources • Walk up to storekeeper • Ask: Dovay Ooon Taxi? (Dove un taxi?) • Be patient: Wait ten minutes • Take taxi • I assumed my resources (local knowledge, Italian) were insufficient, but they saved me time when I used them

  18. Lesson: Use all your resources • Condor: • Uses dedicated machines (I can walk) • Uses non-dedicated machines (I can sometimes ask for help) • Grids: • Connect your machine rooms • Can you take advantage of other resources? • Avoid the mentality “I must control all resources” and you will prosper

  19. Grid: distributed machine room? • You can have good control • You can pre-install applications • You know how everything works • BUT… • You lose flexibility • How quickly can you upgrade sites? • Did they install everything correctly? • Can you use new grid sites easily?

  20. Grid: Use all resources • Assume: basic grid software is installed • Assume: nothing else is installed • Bring your software with you • Submit one job: install software • Submit N jobs: use software • You control software • You ensure correct installation • Easy to use any grid site

  21. Problem: Long-running programs • Long-running programs crash • Condor has daemons on each machine: • User (job) agent • Machine agent • Matchmaker • They crash: • Programming errors • Network failures • Disk failures • …

  22. Lesson: Watch programs • Condor master • Small program, rarely changed • Runs Condor daemons • When daemon crashes: • Restart daemon, send email • If it crashes again, restart after backoff • Result: • Many errors are silently fixed • Yet we don’t just ignore crashes
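The watch-and-restart idea can be sketched in a few lines of Python. This is not the Condor master itself, only an illustration under stated assumptions: the function name `supervise`, its parameters, and the `notify` callback (standing in for "send email") are hypothetical. The first restart is immediate, and later restarts wait for a doubling delay.

```python
import subprocess
import sys
import time


def supervise(cmd, max_restarts=5, base_delay=1.0, sleep=time.sleep, notify=print):
    """Run cmd, restarting it after every crash (hypothetical helper).

    The first restart happens immediately; each later restart waits twice as
    long as the previous one. Returns the number of restarts performed.
    """
    restarts = 0
    delay = 0.0
    while True:
        returncode = subprocess.call(cmd)
        if returncode == 0:
            return restarts
        # Report the crash rather than silently ignoring it.
        notify(f"{cmd[0]} exited with code {returncode}")
        if restarts == max_restarts:
            raise RuntimeError(f"giving up after {max_restarts} restarts")
        sleep(delay)
        # Back off: 0, base_delay, 2 * base_delay, 4 * base_delay, ...
        delay = base_delay if delay == 0.0 else delay * 2
        restarts += 1
```

Keeping the supervisor this small is the point of the slide: the less the watcher does, the less likely it is to crash itself.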

  23. Problem: Short-running programs • Short-running programs crash/hang • Example: globus-url-copy • USCMS testbed: staging data • Some fraction of copies hang or fail • Programming error + delicate network • Hard to reproduce and fix

  24. Lesson: Watch programs • When copy exceeds timeout, kill and retry • Possible to do in shell scripting languages, but not easy • Use Fault Tolerant Shell to watch programs
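For comparison, here is a minimal Python sketch of "kill and retry on timeout". It is not how the USCMS testbed or FTSH did it; the helper name `run_with_timeout` and its parameters are assumptions made for illustration. It relies on `subprocess.run`, which kills the child process and raises `TimeoutExpired` when the timeout is exceeded.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s, attempts):
    """Run cmd up to `attempts` times, killing any run that exceeds timeout_s.

    Returns the 1-based number of the attempt that succeeded and raises
    RuntimeError if every attempt hung or failed (hypothetical helper).
    """
    for attempt in range(1, attempts + 1):
        try:
            result = subprocess.run(cmd, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            continue  # the hung child has been killed; try again
        if result.returncode == 0:
            return attempt
    raise RuntimeError(f"{cmd[0]} did not succeed in {attempts} attempts")
```

A copy command such as globus-url-copy would be passed as the `cmd` list. The sketch retries immediately, with no delay between attempts.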

  25. Fault Tolerant Shell • Shell language built for coping with errors • Exponential backoff on failure: wait {1, 2, 4…} seconds × rand in [1, 2]

          try for 30 minutes
            wget http://www.example.com/file.tar.gz
            gunzip file.tar.gz
            tar xf file.tar
          end

  26. FTSH: exponential backoff • Why exponential backoff? • What if 100 ftsh scripts are executing? • Avoid synchronization → reduce load, increase chance of success • Similar to Ethernet
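The backoff rule quoted in the talk (wait 1, 2, 4… seconds, multiplied by a random factor in [1, 2]) can be written out as a short Python sketch. This is not FTSH's implementation; `backoff_delay`, `try_for`, and their parameters are hypothetical names. The clock, sleep function, and random source are injectable so the behaviour can be checked without waiting.

```python
import random
import time


def backoff_delay(failures, base=1.0, rng=random.random):
    """Delay after `failures` consecutive failures: base * 2^(failures-1) * [1, 2)."""
    return base * 2 ** (failures - 1) * (1.0 + rng())


def try_for(action, budget_s, clock=time.monotonic, sleep=time.sleep, rng=random.random):
    """Call action() until it succeeds or the time budget would be exceeded."""
    deadline = clock() + budget_s
    failures = 0
    while True:
        try:
            return action()
        except Exception:
            failures += 1
            delay = backoff_delay(failures, rng=rng)
            if clock() + delay > deadline:
                raise  # out of time: let the last error propagate
            sleep(delay)
```

The random factor is what keeps 100 scripts that failed at the same moment from all retrying at the same moment.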

  27. Fault Tolerant Shell • Easier to cope with failures • The catch block cleans up the partially downloaded file, if it exists

          try 5 times
            wget http://www.example.com/file.tar.gz
          catch
            rm -f file.tar.gz
            failure
          end

  28. Fault Tolerant Shell • Flexible • Inner try blocks cope with network failure (the download) and with disk failure (the unpacking)

          try for 30 minutes
            try for 5 minutes
              wget http://example.com/file.tar.gz
            end
            try for 1 minute or 3 times
              gunzip file.tar.gz
              tar xf file.tar
            catch
              rm -rf file.tar
            end
          end

  29. FTSH: More information • Work of Doug Thain • thain@cs.wisc.edu • Excellent paper: • The Ethernet Approach to Grid Computing, by Doug Thain • Available from: http://www.cs.wisc.edu/~thain • Even if you don’t use FTSH, read this paper!

  30. Problem: Whose error is it? • The source of an error is not always obvious • The source of an error influences how you react to the error • Example: Java universe in Condor

  31. Java Universe • Users submit Java jobs to Condor • Whose error is it? Check result code: • 1: Program dereferenced NULL pointer (job shouldn’t run again) • 1: Job’s image is corrupt (job shouldn’t run again) • 1: VM doesn’t have enough memory to run program (try another machine with more memory) • 1: Java installation is misconfigured (don’t use this machine for Java)

  32. Lesson: Don’t trust configuration • User tells Condor: “Java is installed” • This is just a hint! • Condor verifies Java configuration • Run simple job, verify output • If Java works, Condor advertises that Java can be used • If Java fails, error is reported, Java can’t be used
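The "treat configuration as a hint, then verify" pattern can be sketched as follows. This is not Condor's actual check; the names `probe` and `advertise` and the shape of the `probes` table are assumptions made for illustration. A capability is advertised only if a simple test command runs and prints the expected output.

```python
import subprocess
import sys


def probe(cmd, expected_output, timeout_s=30):
    """Return True only if cmd runs, exits 0, and prints expected_output."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except (OSError, subprocess.TimeoutExpired):
        return False  # missing binary, permission problem, or hang
    return result.returncode == 0 and result.stdout.strip() == expected_output


def advertise(claimed, probes):
    """Keep only the claimed capabilities whose probe actually passes.

    `probes` maps a capability name to a (command, expected_output) pair.
    A claim with no probe is not trusted.
    """
    return {name for name in claimed if name in probes and probe(*probes[name])}
```

For Java, the probe command would launch a small known program with the configured JVM and compare its output. Only the pattern is shown here, not a specific Java command line.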

  33. Lesson: Look for error scope • Add Java wrapper to all Java jobs • Run program • Examine return code/exception • Write all details to file • Examine output of wrapper, or exception from JVM • We know if job is bad • We know if JVM is insufficient for job • We know if JVM is bad
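The wrapper idea can be illustrated with a Python analogue. Condor's wrapper is for Java jobs and is not reproduced here; the `Scope` enum, the `run_wrapped` function, and the mapping from Python exception types to scopes are all assumptions chosen only to show the shape of the technique. The wrapper turns "it failed" into "who should react".

```python
import enum


class Scope(enum.Enum):
    OK = "ok"
    JOB = "job"            # the program itself is bad: don't run it again
    RESOURCE = "resource"  # this machine is too small: try another one
    MACHINE = "machine"    # the runtime is broken here: stop using this machine


def run_wrapped(job):
    """Run job() and return (scope, detail) instead of a bare exit code.

    The exception-to-scope mapping is a deliberately crude assumption.
    """
    try:
        job()
    except MemoryError as exc:
        return Scope.RESOURCE, repr(exc)
    except ImportError as exc:
        # Stand-in for "the installation is misconfigured".
        return Scope.MACHINE, repr(exc)
    except Exception as exc:
        return Scope.JOB, repr(exc)
    return Scope.OK, ""
```

A scheduler reading the result can then choose between not rerunning the job, rerunning it elsewhere, or no longer advertising the machine.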

  34. Error Scope • We could have an entire talk on error scope • Excellent paper: Error Scope on a Computational Grid: Theory and Practice, by Doug Thain • Useful paper even if you don’t use Condor or Java

  35. Problem: Many layers in a grid • Diagram labels: condor_submit, Globus GRAM, Condor-G job agent, inetd, Globus gatekeeper, Globus jobmanager, condor_submit, Condor job agent, Condor matchmaker, Execution computer

  36. We forgot inetd • We submitted 300 jobs at once • Inetd noticed many connections per second • Inetd presumed there was a denial of service attack and refused connections for five minutes • Lots of debugging!

  37. There are more layers! • USCMS Testbed Architecture (a bit dated) • Diagram labels: Master Site Worker, Impala, Globus, MOP, Batch System (Condor, PBS), DAGMan, Real Work, Condor-G

  38. More layers than that! • USCMS Testbed Architecture (a bit dated) • Diagram labels: MCRunJob, Impala, MOP, condor_schedd, DAGMan, Condor-G, condor_schedd, condor_gridmanager, gahp_server, globus-gatekeeper, globus-job-manager, globus-job-manager-script.pl, local batch system submit, local batch system execute, MOP wrapper, Impala wrapper, actual job • This disregards inetd, network, file servers, file transfers…

  39. Lesson: Recovery at multiple levels • Fault tolerance and recovery are built in at many levels: • Condor_master: restart daemons • Condor_schedd: job queue • DAGMan: checkpoint DAG of jobs • Gahp_server: isolate Globus libraries • And others…

  40. Lesson: Allocate debugging time • Allocate lots of debugging time • It is very hard to propagate errors • How does a user find a remote error? • Call system administrator • Admin looks through log files for each layer (not accessible to user) • We need better debugging methods

  41. Problem: Everything will fail (Everything) • In the USCMS testbed production: • Power outage for several hours • Network outages: from a few minutes to 11 hours • Failed configuration change • Site upgraded • Jobs accidentally removed • Software bugs everywhere

  42. How do you cope? • Condor-G: • “Error: job cannot run” is not good enough • Resubmit jobs that can be resubmitted, perhaps after a delay • Put jobs on hold in queue: • User examines hold reason (proxy is expired) • User fixes error • User restarts job

  43. Problem: Everything will fail (Even the little things) • Condor Matchmaker: • Collects descriptions of machines & jobs • Soft state in matchmaker (push smarts to edge, like Internet) • UDP packets to advertise machines • Less overhead than many TCP connections • Works great in a LAN • But…

  44. Everything will fail: UDP • But you lose some UDP packets • Send packets every five minutes • Keep stale information for 15 minutes • Be prepared to cope with stale information • This has worked for years in Condor • DØ: matchmaking on grid • UDP packets from Korea to Chicago were completely lost on weekdays • Added TCP option
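The soft-state scheme (advertise every five minutes, keep information for 15 minutes) can be sketched as a small table with expiry. This is not the Condor matchmaker's code; the class name `SoftStateTable`, its methods, and the injectable clock are hypothetical. With a 15-minute lifetime and 5-minute updates, a machine stays listed even if two consecutive advertisements are lost.

```python
import time


class SoftStateTable:
    """Advertisements that silently expire unless they are refreshed."""

    def __init__(self, ttl_s=15 * 60, clock=time.monotonic):
        self._ttl_s = ttl_s
        self._clock = clock
        self._ads = {}  # name -> (ad, time the ad was received)

    def advertise(self, name, ad):
        """Record or refresh an advertisement (one received packet)."""
        self._ads[name] = (ad, self._clock())

    def live(self):
        """Drop ads older than the TTL and return the remaining ones."""
        now = self._clock()
        self._ads = {
            name: (ad, seen)
            for name, (ad, seen) in self._ads.items()
            if now - seen <= self._ttl_s
        }
        return {name: ad for name, (ad, _seen) in self._ads.items()}
```

Nothing ever has to tell the table that a machine is gone, so a lost packet costs only staleness. That stops working when every packet is lost, as on the Korea to Chicago link.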

  45. Lesson: Be prepared • Assume everything will fail • Have recovery at multiple levels • Understand scope of errors • Don’t trust configuration: • Verify it • Install & configure software “on the fly” • Assume bugs are everywhere • Build software to cope with errors
