SC04 Release, API Discussions, SDK, and FastOS

SC04 Release, API Discussions, SDK, and FastOS. Al Geist August 26-27, 2004 Chicago, ILL. Resource Management. Accounting & user mgmt. System Monitoring. System Build & Configure. Job management. ORNL ANL LBNL PNNL. SNL LANL Ames. IBM Cray Intel SGI. NCSA PSC SDSC.



Presentation Transcript


  1. SC04 Release, API Discussions, SDK, and FastOS. Al Geist, August 26-27, 2004, Chicago, IL

  2. Scalable Systems Software
     Areas shown: Resource Management; Accounting & user mgmt; System Monitoring; System Build & Configure; Job management
     Participating Organizations: ORNL, ANL, LBNL, PNNL, SNL, LANL, Ames, IBM, Cray, Intel, SGI, NCSA, PSC, SDSC
     Problem
     • Computer centers use incompatible, ad hoc sets of systems tools
     • Present tools are not designed to scale to multi-Teraflop systems
     Goals
     • Collectively (with industry) define standard interfaces between systems components for interoperability
     • Create scalable, standardized management tools for efficiently running our large computing centers
     To learn more visit www.scidac.org/ScalableSystems

  3. Participating Organizations
     Coordinator: Al Geist
     Participating Organizations: ORNL, ANL, LBNL, PNNL, SNL, LANL, Ames, NCSA, PSC, IBM, SGI, Cray, Intel
     • How do we position ourselves with respect to the
       - National Leadership-class facility? NLCF is a partnership between ORNL (Cray), ANL (BG), PNNL (cluster)
       - NERSC and NSF centers

  4. Major Topics this Meeting
     • Fred’s Feedback on the Project: he asks that we discuss several things at this meeting.
     • SC04 Suite Release: code freeze Sept 3. Where do we stand? What do we show/demo at SC? (make demo list)
     • SDK for SSS Components: developed since last meeting. What it is, how to use it.
     • FastOS presentations: SNL, ORNL, ANL, and LANL winners. What are they proposing? How can SSS help?
     • API Discussions: SSSRMAP and Restriction Syntax, the saga continues.

  5. Fred’s Requests
     • “The ISICs will not be continued beyond 5 years. Time to think about how you are going to wrap up the SSS effort.”
     • Fred would like a list of our accomplishments consistent with the $10M he has spent on the project over 5 years.
     • List our priorities for what is still to be done (and what is likely to not be done) by the end.
     • Get our software out on large clusters (LLNL, NCSA, NERSC, PNNL were the ones he mentioned).
     • Is our software robust enough to use at NLCF? NERSC? (in part or whole)
     • Define our relationship to the new FastOS effort.

  6. Scalable Systems Software Suite
     Any updates to this diagram?
     Components written in any mixture of C, C++, Java, Perl, and Python can be integrated into the Scalable Systems Software Suite.
     Diagram labels: Grid Interfaces; Meta Scheduler; Meta Monitor; Meta Manager; Meta Services; Accounting; Scheduler; System & Job Monitor; Node State Manager; Service Directory; Standard XML interfaces; Node Configuration & Build Manager; authentication; communication; Event Manager; Allocation Management; Packaging & Install; Usage Reports; Process Manager; Job Queue Manager; Hardware Infrastructure Manager; Validation & Testing; Checkpoint / Restart

  7. Review of Last Meeting
     Scalable Systems Software Center, May 6-7, Argonne
     Details in main project notebook

  8. Highlights from May mtg
     • Craig: general thoughts on the official V1.0, released at SC04.
     • This will be the first time many people will see the software.
     • Our orthogonal directions in syntax are disturbing.
     • If we don’t make a decision soon, it hurts project progress towards V1.0.
     • Brett, who works with both, favors SSSRMAP. He likes its more descriptive and OO nature.
     • Paul says one is better, but two is not too bad.
     • Scott doesn’t think we can reconcile.
     • Paul asks for a straw vote for a preference:
       - For SSSRMAP: 7 votes representing 5 institutions
       - For Restriction Syntax: 3 votes, all ANL
       - Abstain: 3 votes
     • Craig says he will do whatever it takes to make either work. He is going to make ssslib SSSRMAP work.
     • Neil says “users” are the guiding factor and RMAP is better there.
     • Paul says understandability and acceptability are key and RMAP is better.
     • Both say that RS is more compact and elegant.

  9. Highlights from May Meeting (cont)
     • Narayan asks: does it just need documentation and tutorials?
     • Paul says no. There is a closer match for SOAP et al. The OO was not a factor in his choice, but it is more popular today.
     • Neil says potential users won’t have a Narayan to figure this out. Components are both client and server, so the developer has to know the syntax.
     • Rusty: what if something else were added to RS that made it easier to use or understand? He is not sure it is a good idea.
     • Will: documentation is better in RMAP, and he has looked at RMAP more. Would all this stuff be more abstracted? Users do as little as they can and read the manual only after they get stuck. Doesn’t care as long as we pick ONE! Need to have the same look and feel across the project.
     • Rick: I don’t care which. I don’t like XML. What about the SD and EM that are already accepted?
     • Al says he feels that RMAP would be more acceptable to vendors, and this would be critical to the long-term success of the project.

  10. This Meeting Scalable Systems Software Center August 26-27, 2004

  11. Agenda: August 26
      8:30 Al Geist: Project Status and Fred’s comments
      9:30 Scott Jackson: Resource Management
      10:30 Paul Hargrove: Process Management
      11:30 Narayan Desai: Node Build, Configure
      12:30 Lunch (on own, cafeteria)
      1:30 Rusty Lusk: Comparing Restriction Syntax and SSSRMAP
      3:00 Break
      FastOS presentations and discussion of SSS interaction:
      3:30 Beckman
      4:00 Brightwell
      4:30 Scott
      5:00 Al: Discussion on SC04. What demos will we have?
      5:30 Adjourn

  12. Agenda: August 27
      8:30 Discussion, proposals, votes
           • Rusty: new SDK for SSS components
           • Will McClendon: Validation and Testing
           • Thomas Naughton: SSS-OSCAR and v1.0 release
      10:30 Break
      11:00 Al Geist: Response to Fred
           • v1.0 SSS-OSCAR release
           • formal collaboration with FastOS projects
           • SSS on NLCF machines rather than production clusters
           • run scale tests on big clusters (short windows)
           • steps towards long-term support
           • priority to vote on written component interfaces
           Next meeting date:
           • Hacking mtg: Oct. 6-8, ORNL
           • Next regular mtg: Jan 25-26, 2005 (check w/Fred); location: DC for Fred to attend
      12:00 Meeting ends

  13. Meeting notes
      • Al Geist presents Fred’s requests and goals for this meeting.
      • Scott: RM working group status
        - Updated and implemented SSSRMAP v3 spec
        - Second alpha release including Maui, Bamboo, Warehouse, Gold
        - Added interactive FAQOMATIC
        - Completed merger of Maui 3.2 and Maui SSS; uses SSS interfaces (commercial versions of these will also, as a matter of course)
        - QM: interactive job support finished and tested. Packaging updated to separate out components required on the execution nodes.
        - Accounting and allocation: complete rewrite in Perl. Significantly improved accounting design and account report. Completed allocation, reservation, quotation, and charge rates. GOLD GUI.
        - Metascheduler (grid scheduler, Silver): migrated interface to use SSS
      • Future work
        - Beta release of all components, including Silver
        - FT supporting 25% cluster loss
        - Continued OS support for Linux, AIX, Tru-64. Future: OS-X, Unicos, HPUX
      • Who is using these alpha releases? A few, just looking at it.
        - Production deployment of GOLD on 11.8TF PNNL cluster (November)
        - Also think MauiHPC center, ANL, and DOD centers likely

  14. Meeting notes (cont)
      • SC04 demo discussion: RM GUI (commercial) by David Jackson
      • Paul: Process Management status
        - Checkpoint status: full save and restore of registers, memory, signals, PID, files (open but unmodified, open and appended, pipes between processes), and communication (via LAM/MPI over TCP)
        - Handles in-flight data (drains), linear scaling, and migration. OpenMPI in the future? Paul will check.
        - Discussion of handling files
        - Will always be a Linux-only solution (across all of them)
        - Presently x86 only; Alpha and PPC are possible future work, not high priority
        - Future work: more on files (mutable files, directories), process groups
        - Checkpoint Manager works with Bamboo and MPDPM
      • Process Manager: continued daily use on Chiba
        - New option to signal entire unix process group
        - Misc hardening of MPD system, prompted by Intel use (cook & associates)
        - Future: Intel donated an IA64 test cluster that could be used to test SSS
      • Warehouse: major bug fix; works with RM components
      • Ssslib version with RMAP delayed due to hard drive crash.

  15. Meeting notes (cont)
      • Narayan: Build and configure status
        - Infrastructure improvements: ssslib wire protocol, user support additions, SSS SDK development
        - Component improvements: efficiency of service directory and event mgr
        - Node state manager: simplified implementation by using other SSS components, particularly from PM
      • More discussion of SC04 demos: GUIs and handling failure.
      • Rusty: Syntax discussion
        - We agreed on XML as the basis of the communication mechanism. Many benefits.
        - Allowed multiple wire protocols, with the service directory to keep track
        - We have created a couple of XML styles. Rusty thinks having two or more is fine. He is not suggesting we have only one, although others in the group have.
        - Steps: match a set of objects, apply a function with args to the set of objects, and construct the return message.
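
The three steps Rusty lists on slide 15 (match, apply, construct the return message) can be pictured with a small Python sketch. This is only an illustration under stated assumptions: the record fields, the `set_state` function, and all helper names are hypothetical and are not taken from any SSS component.

```python
def match(objects, criteria):
    """Step 1: select every object whose fields equal all given criteria."""
    return [o for o in objects
            if all(o.get(k) == v for k, v in criteria.items())]


def apply_to(selected, func, **args):
    """Step 2: apply a function (with its arguments) to each selected object."""
    for obj in selected:
        func(obj, **args)
    return selected


def build_reply(selected, fields):
    """Step 3: construct the return message from the requested fields."""
    return [{f: o[f] for f in fields} for o in selected]


def set_state(obj, state):
    """Hypothetical function a component might apply to a node record."""
    obj["state"] = state


def handle(objects, criteria, func, args, fields):
    """Run the three steps in order and return the reply."""
    selected = match(objects, criteria)
    apply_to(selected, func, **args)
    return build_reply(selected, fields)
```

Calling `handle(nodes, {"arch": "x86", "state": "up"}, set_state, {"state": "offline"}, ["name", "state"])` on a list of node dictionaries would mark the matching nodes offline and report their names and new states.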

  16. Meeting notes (cont)
      • Rusty: Syntax discussion, cont.
      • RS syntax is <command> predicate </command>
        - Command is the function to apply (args go here)
        - Predicate is a field-value match to select a set of objects
        - Return message includes info on all fields in the predicate
        - Goes through a few examples in RS and explains them
      • In SSSRMAP: <request action=value><where …></where></request>
        - Goes through the same examples in RMAP
        - Matching is in the “where” clause
        - Args are in the “Option” object
        - Return message is indicated by the “Get” object
      • Looking at both
        - Completeness: probably equal; both lack general negation
        - Validation: RS is somewhat better here
        - Extensibility: SSSRMAP is somewhat better
        - Readability: SSSRMAP is somewhat better here
        - Conciseness: RS is better
        - Atomicity: equal
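
To make the contrast on slide 16 concrete, here is a minimal sketch that builds one message in each shape with Python's standard `xml.etree.ElementTree`. Only the overall shapes come from the slide; the command name, the use of attributes for the RS predicate, the `name` attribute, and the exact tag spellings (`request`, `where`, `Option`, `Get`) are assumptions and may differ from the real specifications.

```python
import xml.etree.ElementTree as ET


def rs_message(command, obj, fields, args=None):
    """RS shape: <command args...> predicate </command>.

    Assumption: the predicate is a child element whose attributes
    hold the field values to match.
    """
    root = ET.Element(command, dict(args or {}))
    ET.SubElement(root, obj, dict(fields))
    return ET.tostring(root, encoding="unicode")


def rmap_message(action, where, options=None, get=None):
    """SSSRMAP shape: <request action=...> with Get, Option, and where children."""
    root = ET.Element("request", {"action": action})
    for name in get or []:
        ET.SubElement(root, "Get", {"name": name})       # fields to return
    for name, value in (options or {}).items():
        opt = ET.SubElement(root, "Option", {"name": name})  # arguments
        opt.text = value
    for name, value in where.items():
        clause = ET.SubElement(root, "where", {"name": name})  # matching
        clause.text = value
    return ET.tostring(root, encoding="unicode")
```

For the same query, the RS form stays short because matching lives in attributes, while the SSSRMAP form spells out matching, arguments, and returned fields as separate objects, which is consistent with the conciseness and readability notes above.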

  17. Meeting notes (cont)
      • Rusty: Syntax discussion, cont.
      • Critique of both
        - RS: puts too much in attributes; overloads use for (see slides flesh out here)

  18. Meeting notes (cont)
      • Rusty: Syntax discussion, cont.
      • The Less Restrictive Syntax
        - Keep high-level spec of commands like RS for validation
        - Move attributes in RS to subobjects as in SSSRMAP
        - Explicitly specify fields to return
      • Show and discuss examples in new syntax style:
        <function>
          <List of objects to match>
            <matching criteria>
          </List>
        </function>
      • Still has the same implicit AND and OR that was in RS.
      • Argonne is starting to transition to the Less Restrictive Syntax.
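
A rough sketch of the nested shape on slide 18, again with `xml.etree.ElementTree`. The tag names `objects`, `object`, and `return` are hypothetical, and the placement of the implicit logic (criteria inside one entry are ANDed, separate entries are ORed) is an assumption, since the slide only says that the implicit AND and OR carry over from RS.

```python
import xml.etree.ElementTree as ET


def lrs_request(function, alternatives, return_fields):
    """Build <function><objects><object>criteria...</object>...</objects><return/>.

    Criteria are subelements rather than attributes, and the fields
    to return are listed explicitly.
    """
    root = ET.Element(function)
    objs = ET.SubElement(root, "objects")
    for criteria in alternatives:
        entry = ET.SubElement(objs, "object")
        for field, value in criteria.items():
            ET.SubElement(entry, field).text = value
    ret = ET.SubElement(root, "return")
    for field in return_fields:
        ET.SubElement(ret, field)
    return root


def matches(request, record):
    """Assumed semantics: AND within one entry, OR across entries."""
    for entry in request.find("objects"):
        if all(record.get(c.tag) == c.text for c in entry):
            return True
    return False
```

With two entries, `{"state": "up", "arch": "x86"}` and `{"name": "n9"}`, a record matches if it is an x86 node that is up, or if it is named n9.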

  19. Meeting notes (cont)
      • Pete: FastOS at ANL and U Oregon (budget starts Oct. at 2/3 of request)
      • Future systems (smart memory, message processor, stream processor)
      • Functional decomposition and hierarchical organization
      • Example: BG/L uses 4 OSes: SuSE 8, SuSE 9, embedded Linux, microkernel
      • Get a BG/L in December
      • For Petascale:
        - How many OSes will be required?
        - What are their performance characteristics and requirements?
        - Can they be dynamic?
        - What is the cost of each component? What if a part is left out?
        - Are collective, coupled OSes needed?
        - Can we build an experimental framework for FT?
      • Four focus areas: flexible OS suites, scalable system calls, FT, performance tools
      • Interact with SSS
        - Dynamic node builds and kernel loads
        - Tao will be added to kernel and middleware; could complement SSS
        - Faulty Towers provides info to SSS layers via component interface
      • OS is Linux 2.6 kernel; embedded Linux

  20. Meeting notes (cont)
      • Ron: SNL, UNM, CalTech also got 2/3 budget
      • Need OS functionality out on the network interface: distributing bits of OS
      • OS bypass, offload, splintering
      • Light Weight Kernel influences: hard to make changes
      • Programming models: problems with mixing PIM, MPI, OpenMP
      • Usage models: apps number and time change over time
      • External services: parallel file system, chkpt, dynamic libraries
      • What: build a collection of micro services
        - Small components with well-defined interfaces
        - Combine services specifically for an app and system
        - Tools for combining micro services
        - Building custom OS on the fly
      • See http://coset.irisa.fr/

  21. Meeting notes (cont)
      • Stephen: ORNL, UNM, NCSU, OSU, Louisiana Tech got 1/2 budget. RAS for scientific and engineering apps.
      • Paul: getting K42 to work on clusters
      • Scott: PNNL going to do work in SGI and single system image

  22. Meeting notes, Day 2
      • Rusty: SDK for SSS components
        - Lots of components in the future, some we have never imagined
        - Crucial to make component development easy
        - Ssslib, event manager, and service directory as a foundation
        - Also need to encapsulate functionality of an abstract component
        - Have been trying this with Python classes
        - Useful for BG/L and FastOS experiments
      • Low levels of SDK: multiple wire protocols, EM, SD, and communication for any language
      • Upper level of SDK: Server and Event receiver classes provide all the services that are independent of component: registers, logging, errors, XML validating, …
      • Shows the “stack”
      • Goes through echo example: makes SSS coding pretty easy!
      • Goes through job submitter example (several slides)
      • Conclusion: makes writing SSS components easy; currently for Python components, but other languages possible
      • Scott says it should be easy to implement RMAP syntax in this SDK
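
The slides with the actual echo example are not part of this transcript, so the following is only a guess at the general idea: a base class supplies the component-independent services (here logging, error replies, and an XML well-formedness check), and the component adds a handler. The class names, method names, and message tags are all hypothetical and are not the SDK's real API; registration with the service directory is left out.

```python
import logging
import xml.etree.ElementTree as ET


class Server:
    """Hypothetical base class holding component-independent services."""

    name = "component"

    def __init__(self):
        self.log = logging.getLogger(self.name)
        self.handlers = {}

    def register(self, tag, func):
        """Associate a message tag with the method that handles it."""
        self.handlers[tag] = func

    def handle(self, raw):
        """Check the XML, dispatch on the root tag, and serialize the reply."""
        try:
            msg = ET.fromstring(raw)
        except ET.ParseError as exc:
            self.log.warning("rejected malformed XML: %s", exc)
            return "<error>malformed XML</error>"
        handler = self.handlers.get(msg.tag)
        if handler is None:
            return "<error>unknown message</error>"
        return ET.tostring(handler(msg), encoding="unicode")


class Echo(Server):
    """The component itself only needs to supply its handler."""

    name = "echo"

    def __init__(self):
        super().__init__()
        self.register("echo", self.echo)

    def echo(self, msg):
        reply = ET.Element("echo-reply")
        reply.text = msg.text
        return reply
```

In this sketch `Echo().handle("<echo>hi</echo>")` returns an `echo-reply` element carrying the same text, while the parsing, dispatch, and error handling all live in the base class.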

  23. Meeting notes, Day 2
      • Will: Validation and Testing status
        - Mainly working on APItest; current release v0.2.0
        - Available for download on SNL ftp site (see slides)
        - Easier to define a new test type (already does shell, script, and SSS)
        - There is some caution with SUID
        - Packages required: Python 2.3, ElementTree (www.effbot.com), Twisted, and ssslib
        - SSS Service Directory test: need to extend to all SSS components
      • Discussion about details of how to use it for SSS tests. What about a user manual?
      • Future work
        - develop more tests for SSS components
        - test developer GUI
        - additional native test types: http, TCP, XMLRPC
        - user guide
        - ability to SU jobs to different users
      • Discussion
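
The note that new test types are easier to define suggests some kind of pluggable test-type mechanism. The sketch below shows one generic way to do that in Python with a registry and a shell-style test; it is not APItest's actual design or API, and the spec keys (`type`, `argv`, `expect_status`, `expect_stdout`) are invented for illustration.

```python
import subprocess
import sys

# Registry mapping a test-type name to the function that runs it.
RUNNERS = {}


def register_runner(name):
    """Decorator that adds a runner function to the registry."""
    def decorator(func):
        RUNNERS[name] = func
        return func
    return decorator


@register_runner("shell")
def run_shell(spec):
    """Run a command and compare its exit status and, optionally, its stdout."""
    proc = subprocess.run(spec["argv"], capture_output=True, text=True)
    status_ok = proc.returncode == spec.get("expect_status", 0)
    stdout_ok = spec.get("expect_stdout", proc.stdout) == proc.stdout
    return status_ok and stdout_ok


def run(spec):
    """Look up the runner for the spec's type and execute the test."""
    return RUNNERS[spec["type"]](spec)
```

Adding another type, such as the http or TCP types listed under future work, would then mean registering one more function under a new name.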

  24. Meeting notes, Day 2
      • Thomas: SSS-OSCAR
      • Current status
        - v0.2a8 prerelease for v1.0 at SC04
        - Two more items are in CVS (at this meeting); need testing
        - Starting work on v0.3 w/ new GOLD pkg
        - OSCAR support for BCWG schema
      • Future work
        - Integrate Gold
        - Integrate APItest in OSCAR: authors create their own test cases
        - Improve documentation for v1.0
        - Start weekly builds for testing
      • Release schedule
        - Nov 8: SC04 release v1.0
        - Oct 4: code freeze
        - Sept: weekly builds, available first day of week by noon for developers to test their components
      • Test resources: ORNL “Test1” cluster
