
A Database-Centric Approach to System Management The Blue Gene Supercomputer


  1. A Database-Centric Approach to System Management: The Blue Gene Supercomputer. Tom Budnik, Mark Megerian. August 2008

  2. Database-Centric System Management • Why use a database for Blue Gene? • Need a software representation of the Blue Gene hardware • A machine of such large scale requires a persistent means of storing errors (RAS events), job history, block definitions, environmental readings, etc. • DB2 is the central repository of ALL system information • Allows control system components to get hardware information and topology from the database, which is always kept current • Blue Gene Navigator pulls the majority of the data it displays from the database • All current jobs, as well as all completed jobs, are stored • Admins can see a history of every job that has ever been run on the machine • We record start time and end time, as well as the number of nodes used, and this information is used by Navigator to compute machine utilization • All service actions, and replaced hardware, are tracked in the database • DB2 is used as a method of communication between components • Setting values in the database can trigger actions in other components • The design can be simplified by storing policy in the database itself via procedures, triggers, and constraints instead of in the code • Enforces consistency across components and reduces bugs • DB2 and the Control System run on the “Service Node” machine, which controls the Blue Gene nodes (pSeries running Linux)
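As an illustration of the utilization calculation described above, the following uses SQLite standing in for DB2; the job-history table, its columns, and the sample jobs are simplified, hypothetical stand-ins for the real schema.

```python
import sqlite3

# Hypothetical, cut-down analog of the BG/P job history table
# (the real system uses DB2; names here are illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE job_history (
        jobid     INTEGER PRIMARY KEY,
        starttime TEXT NOT NULL,
        endtime   TEXT NOT NULL,
        numnodes  INTEGER NOT NULL
    )""")
conn.executemany(
    "INSERT INTO job_history VALUES (?, ?, ?, ?)",
    [(1, "2008-08-01 00:00:00", "2008-08-01 06:00:00", 512),
     (2, "2008-08-01 00:00:00", "2008-08-01 12:00:00", 1024)])

# Node-hours consumed, as a utilization report might derive it from
# the recorded start time, end time, and node count
(node_hours,) = conn.execute("""
    SELECT SUM(numnodes *
               (julianday(endtime) - julianday(starttime)) * 24)
    FROM job_history""").fetchone()
print(round(node_hours))
```

Dividing such a sum by the machine's total node-hours over the same period yields the utilization figure Navigator displays.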

  3. Database-Centric System Management - Benefits • DB2 provides the storage of all data (except logs). This provides a well-known set of interfaces for: • Querying data using existing tools or SQL • Building web interfaces and browser-based tools using JSF, PHP, Java, CLI, and many other established technologies • Standard classes so all code can easily interact with the database • System administrators can learn DB2 from books and classes • New team members can come up to speed quickly • Customers can write their own tools, no hidden or closed data structures • Functions such as backup and recovery, performance settings, and security are handled by DB2 • DB2 is a robust, commercial database, able to handle large multi-user apps

  4. Basic SQL Concepts • Schema • The collection of objects such as tables, views, indexes, and triggers that define the database • Blue Gene uses BGPSYSDB • Table (most common database object) • A table is a collection of rows of data, organized into columns • The table definition (CREATE TABLE) describes the columns and their names and data types (integer, float, character, timestamp, etc.) • Once a table is created, you can insert, update, and delete rows, and query the contents • Tables can be joined to other tables, and sorted and nested, to create many useful and complex constructions of data Example: CREATE TABLE TBGPBlockUsers ( blockId char(32) NOT NULL, username char(32) NOT NULL, CONSTRAINT BGPBlkUsers_pk PRIMARY KEY (blockId, username), CONSTRAINT BGPBlkUsers_fk FOREIGN KEY (blockId) REFERENCES TBGPBlock(blockId) ON DELETE CASCADE ); CREATE ALIAS BGPBlockUsers for TBGPBlockUsers;

  5. Basic SQL Concepts - continued • Views • A view is a virtual presentation of data; it stores a description of how to retrieve and map the data, but it stores no data itself • Generally used to present the same data in different ways, and acts like a “virtual” table Example: CREATE VIEW BGPMidplane as SELECT serialnumber, productid, machineserialnumber, status, ismaster, posinmachine as location FROM TBGPMidplane; • Index (a stored, sorted set of pointers to rows) • Like a view, an index contains no actual “data” • An index is built to sequence the rows using a certain set of columns that is frequently used for searching and sorting • A full table scan through millions of rows for a particular value would take several minutes, whereas a lookup using an index over that column is often sub-second • Indexes are kept current as the data changes, so a large number of indexes can impact update performance. There is a tradeoff between query and update performance, so only necessary and useful indexes should be created. Example: CREATE INDEX EventLogJ on Tbgpeventlog (jobid, recid desc)
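The scan-versus-index tradeoff can be sketched with SQLite standing in for DB2; the eventlog table here is a hypothetical, cut-down analog of Tbgpeventlog, and the index mirrors the EventLogJ example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE eventlog (
        recid   INTEGER PRIMARY KEY,
        jobid   INTEGER,
        message TEXT
    )""")
conn.executemany("INSERT INTO eventlog (jobid, message) VALUES (?, ?)",
                 [(i % 100, "event") for i in range(1000)])

def plan(sql):
    # Last column of EXPLAIN QUERY PLAN output describes the access path
    return conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()[0][-1]

query = "SELECT * FROM eventlog WHERE jobid = 42 ORDER BY recid DESC"
scan_plan = plan(query)     # full table scan: no suitable index yet

# Analogous to CREATE INDEX EventLogJ on Tbgpeventlog (jobid, recid desc)
conn.execute("CREATE INDEX eventlog_j ON eventlog (jobid, recid DESC)")
indexed_plan = plan(query)  # now satisfied via the index

print(scan_plan)
print(indexed_plan)
```

The second plan names the index, confirming the query no longer touches every row.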

  6. Basic SQL Concepts - continued • Triggers • A trigger allows you to define an action to take place, generally when data is updated • Triggers can be defined on an insert, update, or delete of rows in a table • Triggers can fire “before” the action, and possibly modify the action taking place • Triggers can also fire after the action • Triggers can generate errors to block the action Example: create trigger sc_history_i after insert on tbgpservicecard referencing new as n for each row mode db2sql begin atomic insert into tbgpservicecard_history (serialNumber, productId, midplanepos, status,vpd, action) values (n.serialNumber, n.productId, n.midplanepos, n.status, n.vpd, 'I'); end @
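The same history-trigger pattern can be sketched in SQLite (which lacks DB2's "mode db2sql" and "begin atomic"); the shortened table names and the serial number are hypothetical stand-ins for TBGPServiceCard and its _history companion.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE servicecard (serialnumber TEXT PRIMARY KEY, status TEXT);
    CREATE TABLE servicecard_history (
        serialnumber TEXT, status TEXT, action TEXT);

    -- Fires after each insert and records the new row with action 'I'
    CREATE TRIGGER sc_history_i AFTER INSERT ON servicecard
    BEGIN
        INSERT INTO servicecard_history
        VALUES (NEW.serialnumber, NEW.status, 'I');
    END;
""")

conn.execute("INSERT INTO servicecard VALUES ('SC-0001', 'A')")
row = conn.execute("SELECT * FROM servicecard_history").fetchone()
print(row)
```

No application code wrote the history row; the database enforced the policy itself, which is the design point the deck makes.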

  7. Basic SQL Concepts - continued • Constraint • A constraint is a “rule” that is enforced by the database • Check constraints give a list of valid values for a column • Unique constraints enforce uniqueness on values in a column, or set of columns • Referential integrity constraints enforce that values in a “child” table exist in a “parent” table Example: CREATE TABLE TBGPMidplane ( serialNumber char(19) , productId char(16) NOT NULL, machineSerialNumber char(19) , posInMachine char(6) NOT NULL, CONSTRAINT BGPMidPo_chk CHECK ( posInMachine LIKE 'R__-M_' ), status char(1) NOT NULL WITH DEFAULT 'A' , CONSTRAINT BGPMidSt_chk CHECK ( status IN ('A','M','E', 'S') ), isMaster char(1) NOT NULL WITH DEFAULT 'T', CONSTRAINT BGPMidMs_chk CHECK ( isMaster IN ('T', 'F') ), vpd VARCHAR(4096) FOR BIT DATA, seqId BIGINT NOT NULL WITH DEFAULT 0, CONSTRAINT BGPMidpplane_pk PRIMARY KEY (posInMachine), CONSTRAINT BGPMidMachineId_fk FOREIGN KEY (machineSerialNumber) REFERENCES TBGPMachine (serialNumber) ON DELETE RESTRICT, CONSTRAINT BGPMidplaneType_fk FOREIGN KEY (productId) REFERENCES TBGPProductType (productId) )
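A minimal sketch of check-constraint enforcement, using SQLite in place of DB2; the table is a cut-down, hypothetical version of TBGPMidplane that keeps only the two checked columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE midplane (
        posinmachine TEXT PRIMARY KEY
            CHECK (posinmachine LIKE 'R__-M_'),
        status TEXT NOT NULL DEFAULT 'A'
            CHECK (status IN ('A', 'M', 'E', 'S'))
    )""")

# A valid row: position matches R__-M_ and status is in the allowed set
conn.execute("INSERT INTO midplane VALUES ('R00-M0', 'A')")

# An invalid status ('X') is rejected by the database, not by app code
try:
    conn.execute("INSERT INTO midplane VALUES ('R00-M1', 'X')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)
```

Because the rule lives in the table definition, every component writing to the table gets the same validation for free.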

  8. DB2 Naming Guidelines for BG/P • Tables always start with TBGP, such as TBGPNodeCard or TBGPLinkCard • Names are NOT case sensitive in SQL • For each table, there is a view that has the more user-friendly columns • These are named without the T, such as BGPNodeCard • In some cases, information is omitted from the view • If there is no need for any derived or omitted columns in the view, then an alias is created instead, e.g., BGPClockCard • The net effect is that almost all the time, using the “BGP” name will show what you want • If there is a history being kept, then _history is added to the end

  9. BG/P Tables: TBGPBlock, TBGPBPBlockMap, TBGPSmallBlock, TBGPLinkBlockMap, TBGPProductType, TBGPMachine, TBGPMachineSubnet, TBGPMidplane, TBGPNodeCard, TBGPNode, TBGPServiceCard, TBGPLinkCard, TBGPClockCard, TBGPBulkPowerSupply, TBGPSwitch, TBGPCable, TBGPClockCable, TBGPLinkChip, TBGPICON, TBGPFanModule, TBGPJob, TBGPEthGateway, TBGPEGWMachineMap, TBGPPortBlockMap, TBGPBlockUsers, TBGPMidplaneSubnet, TBGPNodeSubnet, TBGPServiceAction, TBGPUserPrefs, TBGPReplacement_history, TBGPMachine_history, TBGPMidplane_history, TBGPNodeCard_history, TBGPNode_history, TBGPServiceCard_history, TBGPLinkCard_history, TBGPClockCard_history, TBGPLinkChip_history, TBGPIcon_history, TBGPFanModule_history, TBGPJob_history, TBGPServiceCardEnvironment, TBGPFanEnvironment, TBGPClockCardEnvironment, TBGPBULKPOWEREnvironment, TBGPNodeCardPOWEREnvironment, TBGPLinkCardPOWEREnvironment, TBGPSrvcCardPOWEREnvironment, TBGPLinkChipEnvironment, TBGPLinkCardEnvironment, TBGPNodeEnvironment, TBGPNodeCardEnvironment, TBGPEventLog, TBGPERRCodes, TBGPDiagRuns, TBGPDiagBlocks, TBGPDiagResults, TBGPDiagTests

  10. BG/P Views: BGPMidplane, BGPMidplaneAll, BGPNodeCard, BGPNodeCardAll, BGPNode, BGPNodeAll, BGPServiceCard, BGPServiceCardAll, BGPLinkCard, BGPLinkCardAll, BGPClockCardAll, BGPBulkPowerSupplyAll, BGPLinkChip, BGPLinkChipAll, BGPFanModule, BGPFanModuleAll, BGPLink, BGPClockCardEnvironment, BGPDiagTests, BGPNodeCardCount, BGPLinkCardCount, BGPServiceCardCount, BGPNodeCount, BGPBasePartition, BGPBPBlockStatus, BGPSwitchLinks, BGPLinkBlockStatus, BGPSwitchPort, BGPPortBlockStatus, BGPBlockSize

  11. BG/P DB2 Structure • Configuration database is the representation of all the hardware on the system • Operational database contains information and status for things that do not correspond directly to a single piece of hardware, such as jobs, partitions, and history • Environmental database keeps current values for all hardware components on the system, such as fan speeds, temperatures, voltages • RAS database collects hard errors, soft errors, machine checks, and software problems detected from the compute complex

  12. BG/P DB2 Structure: Configuration Database • Configuration database is the representation of all the hardware on the system • Machine • Midplanes • Service Cards • Link Cards • Link Chips • Node Cards • Processor Cards • Compute & I/O Nodes • Cables • Clock Cards • Fan Modules • Populated during initial system install and kept current during hardware service actions

  13. BG/P DB2 Structure: Operational Database • Operational database contains information and status for things that do not correspond directly to a single piece of hardware, such as jobs, partitions, and history • Blocks (partitions) • Jobs • Job history • Switch settings • Link <-> Block map • Block users • Maintained by the Blue Gene control system running on the service node

  14. BG/P DB2 Structure: Environmental Database • Environmental database keeps current values for all hardware components on the system, such as fan speeds, temperatures, voltages • Fan Modules • Desired and actual fan speed • Voltages • Temperatures • Service Cards • Ambient temp • Voltages • Node Cards • Chip temps • Temp limits • Wiring faults • Link Cards • Power Status • Temps • Hardware Monitor reads and stores information on customizable intervals • By default, BG/P purges the data every 3 months (mmcs_envs_purge_months=3). The db.properties configuration can be altered to store more or less data as required by the local environment.
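The purge policy described above can be sketched with SQLite standing in for DB2; the table name, columns, sample readings, and fixed reference date are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE node_environment (location TEXT, temp REAL, ts TEXT)")
conn.executemany("INSERT INTO node_environment VALUES (?, ?, ?)",
                 [("R00-M0-N00-J00", 42.5, "2008-03-01 00:00:00"),
                  ("R00-M0-N00-J00", 43.0, "2008-08-01 00:00:00")])

# Delete readings older than 3 months (mmcs_envs_purge_months=3),
# relative to a fixed "now" so the example is deterministic
conn.execute("""
    DELETE FROM node_environment
    WHERE ts < datetime('2008-08-15 00:00:00', '-3 months')""")

(remaining,) = conn.execute(
    "SELECT COUNT(*) FROM node_environment").fetchone()
print(remaining)  # only the recent reading survives
```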

  15. BG/P DB2 Structure: RAS Database • RAS database collects hard errors, soft errors, machine checks, and software problems detected from the compute complex • RAS events collected for bad hardware, missing cards, bad memory, bad cables • RAS events collected from the compute complex while jobs are running, from kernel interrupts • RAS events generated by HW monitoring, for wiring faults, bad cards, fan speeds, over temps • RAS events generated by MMCS during link training, software errors, file system errors

  16. Putting It All Together – Database Populate/Verification • Install team runs a Perl script (dbPopulate.pl) that populates the database with the expected configuration for the Blue Gene system • The machine is powered on, and the InstallServiceAction program finds all hardware on the service network and verifies the database matches the actual hardware configuration • This information is also modified and kept current during service actions (card replacement, recabling, etc.) • VerifyCables program confirms that the Torus network cabling is correct and VerifyIpAddresses confirms that the I/O card IP addresses are correct • Related views: BGPMidplane, BGPCable, BGPServiceCard, BGPLinkCard, BGPNodeCard, BGPNode

  17. Putting It All Together – Partitioning • Partitions are defined and the information is stored in DB2 • Partitions can be defined using the Navigator block builder, console commands like genblock, the Bridge API pm_add_partition, or dynamically created by an external scheduler or mpirun • Related views: BGPBlock, BGPBpBlockMap, BGPLinkBlockMap, BGPPortBlockMap, BGPSmallBlock

  18. Putting It All Together – Booting • Partition information from the database is used to boot the hardware and prepare it for running jobs • Database contains the kernel images for the I/O nodes and Compute nodes • Database contains all the switch settings needed to program the link chips in order to create the Torus or Mesh • Database relates block information to specific hardware • Related views: BGPBlock, BGPBpBlockMap, BGPLinkBlockMap, BGPPortBlockMap, BGPSmallBlock, BGPMidplane, BGPNodeCard, BGPLinkCard, BGPNode

  19. Putting It All Together – Booting • Database prevents overlap by doing arbitration of nodes, switches, and cables • This allows multiple partitions to be booted, provided they do not share the same nodes, switch ports, or cables • They can, however, share the same switch, which allows for pass-through • Any attempt to boot a partition that overlaps with an already booted partition will fail with a message that the hardware is already in use • Related views: BGPBlock, BGPBpBlockMap, BGPLinkBlockMap, BGPPortBlockMap, BGPSmallBlock

  20. Putting It All Together – Job execution • Jobs are submitted to booted blocks • Job submission is done via console, mpirun, submit, Bridge APIs, or an external scheduler • Submitter must be either the block owner or a block user • Control system polls hardware for RAS events during job execution and writes them into the RAS Event Log table. Each event is identified by the exact piece of hardware on which it occurred. • Control system polls for job completion and writes into the job history table the start and end time, number of nodes, exit status, etc. • Related views: BGPJob, BGPBlock, BGPBlockUsers, BGPEventlog, BGPJob_History

  21. Navigator: Web Interface to DB • Browser interface to view DB2 data • Supports the viewing of RAS data, configuration data, diagnostics data, environmental data and operational data • Can be used to see how the hardware fits together • Can be used to find trouble areas, hardware anomalies • Eliminates the need to have SQL expertise to view basic Blue Gene information from the database

  22. Blue Gene Navigator (Job History)

  23. Blue Gene Navigator (Blocks)

  24. Blue Gene Navigator (RAS Event Log)

  25. Blue Gene Navigator (Block Visualizer)

  26. Summary • DB2 is the central repository of all control system information • Database information is not just passively recorded; it is an integrated communication method for the control system • It has greatly enhanced our product: • Writing reports and queries for job utilization • Querying RAS events and error trends • Building end user tools • Training new people, faster learning curve • Can test the control system without any real hardware • Lower cost of ownership for customers with better tools and accessible data • Stability and performance have been excellent, so it has been one thing we have not had to spend a lot of time tuning or debugging

  27. BG/P Security Tom Budnik August 2008

  28. Security Admin Tool (bguser.pl) • The Security Administration tool assigns authority consistently to users who access the Blue Gene system. The tool authorizes users to various predetermined functions on the system by adding their profile to a selected group. The groups are: bgpuser, bgpdeveloper, bgpservice, and bgpadmin • The groups are created on the Service Node when the Blue Gene/P system is installed • The program can be edited to define the groups differently • Existing Linux users are added to groups by running the bguser.pl utility: ./bguser.pl --user userName --role [user|developer|service|admin]

  29. Service Node • Groups: db2rasdb, db2iadm1 (DB2 client), db2fadm1 (DB2 client), db2asgrp (DB2 client), bgpuser, bgpdeveloper, bgpadmin, bgpservice • Users: bgpsysdb (DB2 server instance), bgpdb2c (DB2 client instance), bgpadmin, mpirun

  30. Front End Node • Groups • bgpadmin • bgpservice • bgpdeveloper • bgpuser • Users • mpirun • Profile • /etc/profile.d/bgp.sh

  31. Database Access on Service Node • The db.properties file contains the information required to access the database. Typically located in /bgsys/local/etc • Keywords of interest: database_name=bgdb0 database_user=bgpsysdb database_password=thepassword database_schema_name=bgpsysdb system=BGP min_pool_connections=1 • Access to Blue Gene DB from FEN is discouraged for security reasons • Reason for front-end and back-end mpirun
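As a sketch, the keyword=value format above can be read with a few lines of Python; this is a simplified stand-in for whatever parser the control system actually uses (the sample content echoes the keywords listed on the slide, with the password omitted).

```python
from io import StringIO

# Sample db.properties content, embedded so the sketch is self-contained
sample = StringIO("""\
database_name=bgdb0
database_user=bgpsysdb
database_schema_name=bgpsysdb
system=BGP
min_pool_connections=1
""")

props = {}
for line in sample:
    line = line.strip()
    if line and not line.startswith("#"):
        # Split on the first '=' only, so values may contain '='
        key, _, value = line.partition("=")
        props[key] = value

print(props["database_name"])   # bgdb0
```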

  32. Navigator Security • Authority • There are three roles defined in Navigator: End User, Service, and Administrator. The Linux group to Navigator role mapping is defined in the local Navigator configuration file (/bgsys/local/etc/navigator-config.xml). Note that the value can be a Linux group name or gid. • Administrator groups • Users in these Linux groups have access to all the sections in Navigator. To have multiple groups, add a <administrator-group>groupname</administrator-group> for each group. • Service groups • Users in these Linux groups have access to the service sections in Navigator. To have multiple groups, add a <service-group>groupname</service-group> for each group. • User groups • Users in these Linux groups have access to only the end-user pages in the Navigator. These pages do not allow any updates to the Blue Gene system. To have multiple groups, add a <user-group>groupname</user-group> for each group.
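Based on the element names described above, a navigator-config.xml fragment might look like the following; the root element name and overall layout are assumptions, so consult the file shipped with the system at /bgsys/local/etc/navigator-config.xml.

```xml
<!-- Hypothetical fragment: maps Linux groups (or gids) to Navigator roles -->
<navigator-config>
    <administrator-group>bgpadmin</administrator-group>
    <service-group>bgpservice</service-group>
    <user-group>bgpuser</user-group>
    <user-group>bgpdeveloper</user-group>
</navigator-config>
```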

  33. Navigator Security - continued • Navigator runs with the profile of the user that starts bgpmaster (typically bgpadmin) • User that starts bgpmaster needs read access to the /etc/shadow file to allow Navigator to perform authentications • Navigator setup script • The DB2 libraries must be made available to Tomcat so that it can access the database, and the Java Authentication and Authorization Service (JAAS) plug-in for interfacing to Linux Pluggable Authentication Modules (PAM) must be available so that Tomcat can authenticate users. A script is provided to do the setup: $ cd /bgsys/drivers/ppcfloor/navigator $ ./setup-tomcat.sh • Set up anonymous access to end-user pages • By default, the previous setup configures the Navigator to allow only authenticated users to access the Web interface. To enable anonymous access to end-user pages, you need to copy the web-withenduser.xml file into Navigator’s configuration: $ cd /bgsys/drivers/ppcfloor/navigator $ cp web-withenduser.xml catalina_base/webapps/BlueGeneNavigator/WEB-INF/web.xml

  34. Navigator Security - continued • PAM authentication: • Navigator uses the bluegene PAM stack to authenticate users. This is set up by creating a file /etc/pam.d/bluegene: #%PAM-1.0 auth include common-auth auth required pam_nologin.so account include common-account password include common-password session include common-session • SSL Configuration for Navigator Tomcat HTTP server instance: • By default the Tomcat instance requires that Secure Sockets Layer (SSL) be configured and the server listens on port 32072 • By default Navigator uses the /bgsys/local/etc/navigator.keystore keystore. This must be created when the system is configured. To do this, the keytool command is used.

  35. mpirun Security • End user IDs are not required to exist on the service node • Requires mpirun_be to run under bgpadmin • User’s uid and gid are collected from the front-end and propagated to CIOD • Used for file system access permissions

  36. mpirun Security - continued • Challenge protocol • A challenge/response protocol is used to authenticate the mpirun client when connecting to the mpirun daemon on the service node • Authentication uses the OpenSSL Secure Hash Algorithm (SHA-2) and a shared secret • Protocol uses the shared secret to create a hash of a random number, thereby verifying the mpirun front end node has access to the secret • To protect the secret, it is stored in a configuration file only accessible by the bgpadmin user on the service node and by a special mpirun user on the front end node • The front end mpirun binary has its setuid flag enabled so it can change its uid to match the mpirun user and read the config file to access the secret • mpirun.cfg file • The mpirun.cfg file contains the shared secret used by the mpirun daemon • File needs to exist on the SN and any FENs that submit jobs using mpirun • Files need to match exactly for authentication to work • The mpirun.cfg file is in the /bgsys/local/etc/ directory • Only bgpadmin and the special mpirun user need to have access to the file • mpirun.cfg file example: CHALLENGE_PASSWORD=BGPmpirunPasswd
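The challenge/response idea can be sketched as follows; this is illustrative only, since the slides do not specify the exact hashing construction or message framing, so the use of HMAC-SHA-256 here is an assumption (the secret is the example value from mpirun.cfg).

```python
import hashlib
import hmac
import os

# Shared secret from mpirun.cfg, present on both the service node (SN)
# and the front end node (FEN); example value from the slide
SECRET = b"BGPmpirunPasswd"

# Server side (mpirun daemon on the SN) issues a random challenge
challenge = os.urandom(16)

# Client side (front-end mpirun) proves knowledge of the secret by
# hashing the random challenge with it, without sending the secret
response = hmac.new(SECRET, challenge, hashlib.sha256).digest()

# Server recomputes the expected response and compares in constant time
expected = hmac.new(SECRET, challenge, hashlib.sha256).digest()
authenticated = hmac.compare_digest(response, expected)
print(authenticated)  # True only when both sides share the same secret
```

If the two mpirun.cfg files differ, the recomputed hash will not match and authentication fails, which is why the slide stresses that the files must match exactly.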
