
Decision Time


lplatt

Presentation Transcript


  1. Metronome and The NMI Lab: This subtitle included solely to steal the “longest title” award from Ewa, who thought she won it this morning with, “Pegasus and DAGMan: From Concept to Execution Mapping Scientific Workflows onto the National Cyberinfrastructure”

  2. Decision Time • Past • Quick Review: why, what, who • Present • Current status, new this year • Future • Future plans, new next year

  3. Why: The Problem • Good distributed computing (“grid”) software is… • badly needed • hard to find • hard to build and test

  4. The Fix (Part of it, anyway) • Good build/test cycle • To be good, build/test process must be… • frequent • reliable • automatic • repeatable

  5. The (Next) Problem • Building and testing distributed computing software requires… • Distributed resources • Not always in-house, not always dedicated to builds • I.e., shared, scheduled resources • Unless you have a spare Blue Gene lying around… and an old Alpha running RedHat 7.2… and an HPUX 11 box… and an Itanium running Scientific Linux 3 (CERN-flavored) … and… • Distributed testbeds, tests • Not: “the grid works on my machine… ship it!”

  6. Grid Build and Test • Building and testing distributed computing software brings distributed challenges… • Complex workflows, cross-site/project/user scheduling priorities, data management, fault-tolerance, failure recovery • A lot like “real” distributed computing • Tinderbox or the latest Web 2.0 build system doesn’t cut it • Deep, integrated software stacks • Distributed providers

  7. How We Do It • Use proven grid software to build and test new grid software • “Condor works, let’s use Condor” • Metronome is our second-generation build/test framework built on top of Condor, DAGMan, and other distributed computing technologies • NSF-funded

  8. Metronome Principles • Tool-independent • Lightweight • Encourage explicit, well-controlled build/test environments • Central results repository • Fault-tolerance • Support platform-neutral and platform-specific tasks • Build/test separation

  9. Metronome [architecture diagram; labels only] • INPUT: Spec File, Customer Source Code, Customer Build/Test Scripts • NMI Build & Test Software • DAGMan DAG • Condor Queue • build/test jobs • Distributed Build/Test Pool • results • OUTPUT: Web Portal, Finished Binaries, MySQL Results DB
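
The flow on slide 9 (a spec file becomes a DAG, the DAG becomes build/test jobs, and results are collected) can be illustrated with a small sketch. This is not Metronome's code or its spec-file format: the `spec` dictionary, the task names, the platform names, and the `runner` callback are all hypothetical, and Python's standard `graphlib` stands in for DAGMan. It only shows the general idea of ordering per-platform build and test tasks by dependency and keeping build and test as separate steps.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def build_dag(spec):
    """Map each task to the set of tasks it depends on.

    `spec` is a hypothetical dict such as {"platforms": ["linux_x86"]};
    it is an illustration, not Metronome's real spec-file format.
    """
    dag = {"fetch": set()}  # fetching the source has no prerequisites
    for platform in spec["platforms"]:
        build = f"build:{platform}"
        test = f"test:{platform}"
        dag[build] = {"fetch"}  # each build needs the fetched source
        dag[test] = {build}     # each test needs that platform's build
    return dag


def run(spec, runner):
    """Run tasks in dependency order and record one result per task.

    `runner(task)` is a caller-supplied function returning True on success.
    A task whose prerequisite did not succeed is recorded as "skipped".
    """
    dag = build_dag(spec)
    results = {}
    for task in TopologicalSorter(dag).static_order():
        if any(results[dep] != "ok" for dep in dag[task]):
            results[task] = "skipped"
        else:
            results[task] = "ok" if runner(task) else "failed"
    return results


if __name__ == "__main__":
    spec = {"platforms": ["linux_x86", "hpux_11"]}
    # Pretend the HP-UX build fails; its test should then be skipped.
    outcome = run(spec, lambda task: task != "build:hpux_11")
    for task, status in outcome.items():
        print(task, status)
```

In this sketch a failed build on one platform skips only that platform's test and leaves other platforms unaffected, which is one simple way to keep per-platform results independent.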

  10. NMI Lab • Dedicated, heterogeneous distributed computing facility • Opposite extreme from typical “cluster” -- instead of 1000’s of identical CPUs, we have a handful of CPUs each for 50+ platforms. • Much harder to manage! You try finding a monitoring tool that works on 50 platforms! • Carefully-controlled resources • No mystery meat

  11. The Team • Subset of the Condor Team • Becky Gietzel, master of all things NMI • Todd Miller, new guy on the block • Andy Pavlo, part-timer, short-timer • Ken Hahn, sysadmin to the stars • Me

  12. Dogfood and Hats • Eating our own dogfood… • Condor builds failed last weekend (true!) • Condor developers complained to NMI Lab (“your build system failed… fix it!”) • NMI Lab discovered Condor bug (“hmm…”) • NMI Lab complained to Condor developers (“your software failed… fix it!”) • Feel the love!

  13. The Past Year: What We Did on Our Summer Vacation

  14. New Name! • Before: • NMI Build & Test System, NMI Build & Test Software, NMI Build & Test Framework, NMI Software, NMI Build & Test Lab, UW-Madison Build & Test Lab, Build & Test Lab at UW-Madison • After: • Metronome + the NMI Lab • Why? • Old names were a mouthful • Clear separation between the software framework (Metronome) and the facility (the NMI Lab)

  15. Real Work • Extremely Productive Collaborations • TeraGrid: production Metronome deployment using dynamically provisioned resources • ETICS, OMII: building higher-level services to generate and manage build/test jobs across an international federation of Metronome deployments • Extremely Productive Users • Condor, TeraGrid, Open Science Grid / VDT, Globus, NCSA (MyProxy), SDSC (SRB), LIGO, many others in this room…

  16. New Metronome Capabilities • “Productization”, customization for other sites • Parallel testing • Enables dynamic, co-scheduled, distributed testbeds! • Automatic cross-site job migration • Run your own local Metronome pool with access to ours for exotic platforms • Many smaller features and extensions for production users -- users drive development • More bugs fixed than introduced!
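
The "local pool plus access to ours for exotic platforms" idea on slide 16 can be sketched as a routing decision. The slides do not describe how Metronome's cross-site migration is actually implemented, so everything here is an assumption made for illustration: the function names, the platform names, and the rule "prefer the local pool, fall back to the remote lab".

```python
def choose_pool(platform, local_platforms, remote_platforms):
    """Pick where a job for `platform` should run (hypothetical policy).

    Prefer the local pool; fall back to the remote lab for platforms
    the local pool does not have. Returns None if neither has it.
    """
    if platform in local_platforms:
        return "local"
    if platform in remote_platforms:
        return "remote"
    return None


def route_jobs(platforms, local_platforms, remote_platforms):
    """Group requested platforms by the pool that would run them."""
    routed = {"local": [], "remote": [], "unroutable": []}
    for platform in platforms:
        pool = choose_pool(platform, local_platforms, remote_platforms)
        routed[pool or "unroutable"].append(platform)
    return routed


if __name__ == "__main__":
    local = {"linux_x86"}                           # what a site owns
    remote = {"linux_x86", "hpux_11", "linux_ia64"}  # what the lab offers
    print(route_jobs(["linux_x86", "hpux_11", "vms_vax"], local, remote))
```

A real deployment would also have to weigh queue length, scheduling priority, and data movement, which this sketch deliberately ignores.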

  17. New NMI Lab Capabilities • More platforms • “always with the platforms…” • new Itanium platforms, NLOTW (New Linux of the Week), additional vendor Unix machines, etc. • Now over 50 (!) platforms • Improved Lab Management • No, not me… better design and automation of systems & their administration

  18. Future

  19. The Plan: Metronome • “Support, maintain, enhance” • VM--I mean slot--no wait, I mean VM support • Enhanced parallel testing support • Custom testbed environments (network, etc.) • Dynamic deployments (glide-in) • Advanced scheduling policies • Scalability testing enhancements • Better docs/installation/management

  20. The Plan: NMI Lab • “Support, maintain, enhance” • More platforms, always with the platforms • More capacity • VM servers for… • Root-level testing • On-demand platforms • Federation with other Metronome labs • Better support, smoother management, less downtime • New sysadmin starting in June: take a bow, Ross!

  21. You • Want to use it? • Metronome • The NMI Lab • http://nmi.cs.wisc.edu/

  22. Feedback • When we started, the state of the art was unimpressive (almost nonexistent)… we had to build our own • More build tools now exist -- if you know & like one of them, what do you like about it? • We’d like to better understand what we do well, what we don’t, and how we can integrate with other systems you find useful…
