1 / 19

FBSNG Overview

FBSNG Overview. Jim Fromm Farms and Clustered Systems Group, Computing Division, Fermilab. People. FCS Group: Jim Fromm (Fermilab) Tanya Levshina (Fermilab) Igor Mandrichenko (Fermilab) Krzysztof Genser (Fermilab) Former FCS Group Members: Mark Breitung Marilyn Schweitzer

beulah
Download Presentation

FBSNG Overview

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. FBSNG Overview Jim Fromm Farms and Clustered Systems Group, Computing Division, Fermilab

  2. People • FCS Group: • Jim Fromm (Fermilab) • Tanya Levshina (Fermilab) • Igor Mandrichenko (Fermilab) • Krzysztof Genser (Fermilab) • Former FCS Group Members: • Mark Breitung • Marilyn Schweitzer • FBSNG users/testers who provided significant feedback: • Antonio Wong Chan (Academia Sinica, Taiwan, CDF) • Yen-Chu Chen (Academia Sinica, Taiwan, CDF) • Miroslav Siket (Academia Sinica, Taiwan, CDF) • Heidi Schellman (Northwestern University, D0) • Steve Wolbers (Fermilab, CDF) • Ping Yeh (Academia Sinica, Taiwan, CDF) • Thomas Las (Minooka Junior High, Minooka, IL). http://www-isd.fnal.gov/fbsng

  3. History and Status • FBS project goals • Replace CPS batch in PC farms environment • Develop a farm batch system for file-based parallel data processing style adopted for RunII • Do not preclude event-based parallelism • Milestones • Spring 1998 - initial FBS design, first working prototypes • Fall 1998 - first production users (E871) • Spring 1999 - review by CDF, D0 • Fall 1999 - FBS v2.2 • Fall 1999 - beginning of FBSNG project • July 2000 - FBSNG v1.0 released • Oct 2000 – FBSNG v1.1 released • Currently FBSNG is installed on: • Fixed target farm • CDF, D0 farms • NIKHEF (D0 collaborators) http://www-isd.fnal.gov/fbsng

  4. FBSNG Immediate Project Goals • Stop using LSF as scheduler and job storage as was done with previous versions of FBS. • Reduce maintenance and support costs • Avoid possible scalability problems • Simplify maintenance • Allows for addition of new features (such as resource management). • Decision made to release FBSNG as soon as a core set of features were implemented. First requirement was to “Not break anything!” • Preserve as many features of FBS as reasonably possible • Add few fundamental features such as: • Abstract resources • Customizable scheduler http://www-isd.fnal.gov/fbsng

  5. Long Term Goals • Dynamic re-configuration (implemented in V1.1) • Further development of resource management (resource pools implemented in v1.0) • Integration with FIPC, the Farms Interprocess Communications toolkit developed at Fermilab. • To make FBSNG an “open” system, accomplished through the API. http://www-isd.fnal.gov/fbsng

  6. FBSNG Redesign • Whenever possible, features were carried forward from FBS to FBSNG. • The look and feel of FBSNG is very much like FBS, but FBS and FBSNG are not compatible. Feedback from users is that converting to FBSNG was relatively easy. http://www-isd.fnal.gov/fbsng

  7. FBSNG Design (Big Picture) • BMGR functions: • Scheduling • Resource management • Job storage • Communication with API clients http://www-isd.fnal.gov/fbsng

  8. FBSNG Concepts: Farm Model http://www-isd.fnal.gov/fbsng

  9. FBSNG Concepts: Job Consists of Sections http://www-isd.fnal.gov/fbsng

  10. FBSNG Resources • FBSNG allows for several new ideas in resource management • Global resources • Visible to the entire farm • Examples: • Disk space on NFS server • Network bandwidth • Local resources • Visible on individual nodes • Examples: • CPU • Local disk • Attributes • Attributes are local to a particular node. • Attributes are just there, they aren’t “used up”. • Examples: • Special software installed (Fortran compiler). • Version of OS • FBSNG assumes users know how a job will use resources, and assumes that the user will give this info to FBSNG. • Resources are just counters to FBSNG, it does not know anything about what they represent. http://www-isd.fnal.gov/fbsng

  11. Resource Pools • Pools are collections of similar resources. • The actual resources in a resource pool are referred to as underlying resources. • Examples: • Multiple scratch disks on a given host. Users could specify 2GB of scratch disk, not caring which specific disk has 2GB free. • A user could have a job that needs to run on any version of Linux. A resource pool named Linux could be created with underlying resources (attributes) of Linux52 and Linux61. http://www-isd.fnal.gov/fbsng

  12. Resource Management • Process type per task or project • Soft association between queue and process type (user can override) • User can request additional resources • Queue is more of a scheduling than resource management entity • More flexibility for users • Only per-process type resource quotas http://www-isd.fnal.gov/fbsng

  13. User Interface • Job Submission: Users issue submit command and provide the name of a Job Description File(JDF). The JDF file contains control information needed by FBSNG such as: • Section Name • Executable Name • Queue • Number of processes to spawn. • Job Control: • Monitor status • Kill/cancel • Hold/release • History • Farm node status • Resource utilization statistics • Scheduler status http://www-isd.fnal.gov/fbsng

  14. SECTION Stage QUEUE=IO_Q EXEC=stage.sh VSN123 /stage01 NUMPROC=1 SECTION Reconstructor QUEUE=Long_Q EXEC=reco123.sh /stage01 /stage02 NUMPROC=10 DEPEND=done(Stage) PROC_RESOURCES=disk:5 Linux Blue SECTION Clean-up QUEUE=Short_Q PROC_TYPE = Light EXEC=clean.sh /stage02 VSN123 NUMPROC=1 PRIO_INC = 5 DEPEND=failed(Reconstructor) First section: pre-stage data Queue to submit to Command to execute 1 process Second section: reconstruction 10 processes Only if pre-staging succeeds 5(GB) of local disk, Linux, Blue node Emergency clean-up section Override default process type Run at higher priority Run only if reconstructor fails FBSNG: Example of Job Description File (JDF) http://www-isd.fnal.gov/fbsng

  15. Batch Process Environment • Environment Variables • FBS_JOB_ID • FBS_SECTION_NAME • FBS_JOB_SIZE - number of processes in the section • FBS_SCRATCH - Scratch disk area for user processes. • FBS_PROC_NO - logical process id(1…FBS_JOB_SIZE) • FBS_SECTION_NAME – The name of the section. • FBS_HOSTS - list of nodes assigned to this job. • FBS_PROC_STDOUT - path of processes stdout • FBS_PROC_STDERR - path of processes stderr • HOME - home directory • Others… • Current working directory is HOME • Stderr, stdout as specified in JDF http://www-isd.fnal.gov/fbsng

  16. Scheduler • Algorithms are based on the idea of dynamic priorities • Controllable fair-share scheduling • Projects are assigned relative shares of farm resources • Guaranteed scheduling. Avoids starvation. • No infinite delays for big jobs • Will hold small jobs if necessary http://www-isd.fnal.gov/fbsng

  17. API • FBSNG provides a Python API that allows: • Job submission • Job monitoring and control • Resource management and monitoring • UI, GUI are layered on top of the API http://www-isd.fnal.gov/fbsng

  18. FBSNG Requirements • On control node: • bmgr daemon • logd daemon (optional) • On each worker node: • Launcher (root) • Rstatd (optional) • Software/hardware requirements: • Python (Most of the FBSNG sources are Python) • Tcl/Tk, Tkinter (for GUI) • FCSLIB package available from fermitools. FCSLIB contains some Python modules used by FBSNG. • Configuration files synchronized on all worker nodes (NFS works well) http://www-isd.fnal.gov/fbsng

  19. FBSNG Project Status and Plans for Future • FBSNG V1.0 (IRIX, Linux) released July 2000. • FBSNG V1.1 (IRIX, Linux) released October 2000. • Available in Fermitools • V1.1 is in production. Feedback on V1.0 and V1.1 thus far has been positive. • See http://www-isd.fnal.gov/fbsng for project details. http://www-isd.fnal.gov/fbsng

More Related