HP Operations Orchestration

HP Operations Orchestration
HPIT use-case overview
David Akers, HPIT Operations Orchestration Practice Lead
October 2012

HPIT
Who we are
• HPIT is HP’s internal IT organization, supporting HP
• A computing infrastructure supporting a large company and 340,000 employees
• 6 “Data Center Sites” in three “zones”
  • Austin 1 <-> Austin 2
  • Houston 9 <-> Houston 4
  • Atlanta 5 <-> Atlanta 6
  • Sites are miles apart; zones are considerably farther apart
• The HPIT OO infrastructure is present in all six sites
  • Isolated instances of OO
  • Each instance is a two-server OO cluster
  • Normally, flows are distributed across all 12 nodes; however, any flow can run on any instance whenever it is required
• We have invested heavily in tools to keep OO running all the time

Overview of OO in HPIT
How HP’s internal IT function uses Operations Orchestration
• Use-case catalog
  • Use-cases define OO flow development
  • This is where we get return from the OO investment
  • This is the motivation for using OO
• Topics and solutions we have addressed so far that are not use-case-specific
  • How we have approached questions, issues and our HPIT OO Practice
  • Broader “best practices” for OO
  • Strategy decisions
  • OO Architecture
  • Tools that make this all more productive

Major classifications of OO use-cases
• Normal operations
  • Patching of our Windows server environment
  • Building Windows servers
• Less-than-normal operations
  • Automated responses to IT issues (events and incidents)
    • Diagnosis and repair; unrepaired issues (the remainder) are forwarded to people
    • We are about 50% mature in this area
  • Working with people to assist human response to IT issues
    • We are working on DSS2S (Diagnose, Start, Stop, Site-to-Site) flows
    • We are about 2% mature in this area

OO flow use-case catalog
• Normal operations
  • Patching our Windows server environment
• Less-than-normal operations
  • Unresponsive Windows/Unix server nodes (monitoring alerts)
  • Enterprise scheduled batch job failure response (yes, we run a lot of batch jobs and a small percentage fail)

Patching our Windows Server Environment
Normal operations
• Components:
  • OO orchestrates and contains the “intelligence”
  • HPSA does the work
• Methodology
  • All our servers are assigned to one of 28 “maintenance windows” (stored in the uCMDB)
  • OO interacts with the uCMDB and defines a schedule
  • OO commands HPSA to apply specific patches
  • OO performs follow-up verification and re-commands HPSA where needed
[Diagram: infrastructure stack labeled Roster Applications, App Services, Web Services, DB Services, Monitoring, HPSM, uCMDB, HPSA, OS Services, Win Servers, Linux Servers, UX Servers, Network]

Unresponsive Windows/Unix Server Nodes
Less-than-normal operations
• Roles:
  • Automated monitors detect an unresponsive node. The alert creates an IM ticket in HPSM
  • OO performs diagnostics
• Methodology
  • A “Handler” is dispatched for each ticket (< 90 seconds after the alert)
  • A simple diagnostic is performed (ping, interactive log-in, monitoring service check)
  • If the server is OK, the ticket is closed (< 5% of the time)
  • Otherwise, diagnostics are written to the ticket, the ticket moves to a human queue and a “Stakeout” starts
    • OO gathers a tracert (stored in case this turns out to be a communications failure)
    • The server is pinged every 60 seconds
    • If the server becomes pingable, the uptime of the server is compared to the ticket age and the diagnosis is either “Recent reboot” or “Communications failure” (the tracert summary is included in the ticket)
    • After 90 minutes, the Stakeout gives up and writes its data to the ticket
Return: 40 people-hours/month
Investment: 4 people-hours/month (investment decreases somewhat as the initial cost is amortized)
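The Stakeout decision on this slide can be sketched in a few lines of Python. This is an illustration only, not the actual OO flow: the function names, the simulated list of ping results and the assumption that the ticket is about 90 seconds old when the Stakeout begins are all hypothetical. Only the 60-second ping interval, the 90-minute limit and the uptime-versus-ticket-age comparison come from the slide.

```python
PING_INTERVAL_S = 60      # from the slide: one ping every 60 seconds
STAKEOUT_LIMIT_MIN = 90   # from the slide: the Stakeout gives up after 90 minutes


def diagnose(uptime_s, ticket_age_s):
    """Classify a server that has become pingable again.

    A server that has been up for less time than the ticket has existed
    must have restarted during the incident. Otherwise it never went
    down, so the network path is the suspect.
    """
    if uptime_s < ticket_age_s:
        return "recent reboot"
    return "communications failure"


def stakeout(pings, uptime_when_back_s, ticket_age_at_start_s=90.0):
    """Walk through simulated ping results (True means the server answered).

    ticket_age_at_start_s is an assumed value, not taken from the slide.
    """
    max_pings = STAKEOUT_LIMIT_MIN * 60 // PING_INTERVAL_S
    for attempt, reachable in enumerate(pings, start=1):
        if attempt > max_pings:
            break
        if reachable:
            ticket_age_s = ticket_age_at_start_s + attempt * PING_INTERVAL_S
            return diagnose(uptime_when_back_s, ticket_age_s)
    return "gave up: still unresponsive"
```

In a real flow the uptime would be read from the server once it answers; here it is passed in so the logic can be exercised without a network.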

Unresponsive Windows/Unix Server Nodes
The story behind the use-case
Lesson: What people ask for is more about what they are doing manually than what is possible through automation. However, by simply duplicating the manual paradigm, you can reveal what actually needs to be done.
• The initial point of view of first-responders was that unresponsive server nodes were false alarms, and that the automation was supposed to duplicate the diagnostic work done manually and then close those time-wasting tickets
• That point of view made sense when you first looked at an unresponsive server 30-60 minutes into the incident
• When the automation was on the case in 30-60 seconds, a very different point of view became obvious
  • The monitors were right (not really a surprise), but after 60 minutes the situation had changed, external to the support process
• Phase II of the flow was the “Stakeout”
  • Most of the time, the server is actually down (blue screen, hardware failure or needs a power cycle)
  • The next most frequent case is that the server was just being rebooted (planned or spontaneous)
  • There was a small but fascinating number of communications failures
• The bigfoot problem
  • For example: we found a router that was incorrectly configured and had probably been that way for 2 years, causing occasional trouble but never being diagnosed because live humans could never capture forensic data in time

Enterprise Scheduled batch job failure response
Less-than-normal operations
• Roles:
  • Tidal is HP’s Enterprise scheduling system. Batch job failures create IM tickets in HPSM
  • OO performs the first-response tasks
• Methodology
  • A “Handler” is dispatched for each ticket (< 90 seconds after the alert)
  • Based on a “Knowledge database” (~30,000 entries) OO “figures out”:
    • Is the job to be auto-restarted?
      • Has the job been restarted already?
        • If not, it commands the Tidal server to restart the job
        • If so, it “elevates” the ticket to the correct group immediately
      • If not, it elevates the ticket to the correct group immediately
• NB: A single flow (with the external knowledge base) handles all the batch job failures at HP. This strategy is discussed in the Non-use-case topics and solutions section “Blocks and Knowledge”
Return: 550 people-hours/month
Investment: 44 people-hours/month (investment decreases somewhat as the initial cost is amortized)
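The first-response decision for a failed batch job can be sketched as below. This is a hedged illustration, not the production flow: `KnowledgeEntry`, its fields, the job and group names and the `default_group` fallback for jobs missing from the knowledge base are all hypothetical. The slide only states the two questions and the three outcomes.

```python
from typing import NamedTuple


class KnowledgeEntry(NamedTuple):
    """Hypothetical shape of one knowledge-base entry."""
    auto_restart: bool
    support_group: str


def first_response(job_name, knowledge, already_restarted,
                   default_group="scheduling-support"):
    """Return ("restart", job) or ("elevate", group) for a failed job."""
    entry = knowledge.get(job_name)
    if entry is None:
        # Assumption: unknown jobs go straight to a default human queue.
        return ("elevate", default_group)
    if entry.auto_restart and job_name not in already_restarted:
        # Restartable and not yet restarted: ask the scheduler to rerun it.
        return ("restart", job_name)
    # Either not restartable, or the restart was already tried.
    return ("elevate", entry.support_group)


knowledge = {
    "PAYROLL_NIGHTLY": KnowledgeEntry(True, "hr-apps"),
    "LEDGER_CLOSE": KnowledgeEntry(False, "fin-apps"),
}
print(first_response("PAYROLL_NIGHTLY", knowledge, already_restarted=set()))
```

Keeping the rules in data rather than in the flow is what lets a single flow cover every job.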

Non-use-case topics and solutions addressed so far
• Techniques/practices and decisions
  • How do we handle Windows, UX and Linux target servers?
  • How do we interface to HPSM?
  • Blocks and Knowledge
• Tool investments
  • Canonical Flows

Handling Windows/UX/Linux
How we determine what we have and what we do about it
Strategies
• How we detect the target server OS
  • In some cases, the flow trigger context defines it (BSM DSS2S)
  • In some cases, a uCMDB lookup for the ticket being worked tells us
  • We do not (in general) have OO flows determine the OS version
• We tend to write one flow for Windows and another flow for UX/Linux
  • Windows tends to be done with WMI; UX/Linux tends to be done with ssh
  • In addition to the mechanism, the actual commands are different
  • There may be a trend to go through HPSA for many types of flows
• Note: OO has broad access to all servers in the HPIT infrastructure
  • We convinced IT Cyber Security that broad access from OO is safer than keyboard/mouse-click risk

HPIT – Part II – 2013-02-18

How we interact with HPSM
The vision/strategy
Strategies
Rather than create yet another log to check to understand what is going on, we decided to use HPSM ticketing as an exchange medium for OO to interact with the monitoring and human worlds.
• OO only works with “machine-generated” HPSM tickets
• Flow:
  • Monitoring alerts create tickets
  • The OO Dispatcher detects a ticket
  • It figures out what kind of ticket it is looking at (signature-based)
  • The Dispatcher launches an independent OO flow to handle each ticket
  • There is retry/recycle logic built into the ecosystem
• OO flows “write” text to the ticket that reads just as if a human had documented similar findings
  • It is easy to interpret
  • It is easy to tell how OO and our human counterparts worked together on the ticket
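Signature-based dispatch can be pictured as a small table of patterns mapped to handler flows. The sketch below is an assumption about the general idea only: the regular expressions, the handler names and the sample ticket text are invented for illustration and are not the real HPSM ticket signatures.

```python
import re

# Hypothetical signatures: (pattern found in the ticket text, handler flow name).
SIGNATURES = [
    (re.compile(r"node .* unresponsive", re.IGNORECASE), "unresponsive_node_handler"),
    (re.compile(r"tidal job .* failed", re.IGNORECASE), "tidal_failure_handler"),
    (re.compile(r"disk space", re.IGNORECASE), "disk_space_handler"),
]


def pick_handler(ticket_text):
    """Return the first handler whose signature matches, or None.

    None means the ticket is not one the automation recognises,
    so it is left for people to work.
    """
    for pattern, handler in SIGNATURES:
        if pattern.search(ticket_text):
            return handler
    return None


print(pick_handler("Tidal job PAYROLL_NIGHTLY failed with exit code 8"))
```

The order of the table matters: the first matching signature wins, so more specific patterns belong near the top.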

How we interact with HPSM
The technical mechanism
Strategies
The options are the built-in integrations or the HPSM web services.
• We use the built-in HPSM web services
  • The version of HPSM is therefore isolated from the OO environment
  • The HPIT HPSM team (my peers) can control the logic of what OO is allowed to do
  • An HPIT Architecture community decision
• Issues
  • OO does a login for every single web service call (~2 seconds minimum transaction time)
  • The web service has not always been 100% stable (various issues on the HPSM end)
• We built the HPSM Retry CF (subflow library)
  • It retries with decreasing ferocity (60 seconds later, 120 seconds later, etc.)
  • It stops trying after about 40 tries
  • Highly effective with a “one of 3 service points” failure
  • Moderately effective for short-duration issues
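A retry with “decreasing ferocity” might look like the sketch below. The slide gives only the first two waits (60 and 120 seconds) and the limit of about 40 tries; the linear growth of the wait (60, 120, 180, ...) is an assumption, as are the function and parameter names. The `sleep` parameter exists only so the example can run without actually waiting.

```python
import time


def call_with_retry(call, max_tries=40, step_s=60, sleep=time.sleep):
    """Call `call()` until it succeeds or max_tries attempts have failed.

    After the n-th failure the wait is n * step_s seconds, so the
    retries become less frequent the longer the outage lasts.
    """
    for attempt in range(1, max_tries + 1):
        try:
            return call()
        except Exception:
            if attempt == max_tries:
                raise  # out of tries: surface the last error
            sleep(attempt * step_s)


# Usage with a simulated web service that fails twice and then recovers.
waits = []
state = {"calls": 0}


def flaky_service():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("service unavailable")
    return "ticket updated"


print(call_with_retry(flaky_service, sleep=waits.append), waits)
```

Under the linear assumption, 40 tries span a little over 13 hours of waiting in total, which fits an outage at one service point better than a brief blip.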

Blocks and Knowledge
We cannot afford to write 150,000 flows; 50 of them seems better
Strategies
We realized that writing multiple flows per application would be unsustainable. Therefore we did an experiment on two significant applications. It took about 60 flows to make a complete set of D/S/S/S2S (Diagnostic/Start/Stop/Site2Site) flows for two assets. The question was “How many blocks are there, and what are some of them?”
Phase 1 of our knowledge tool is crude, but effective:
• Three use-cases (Tidal job failures, disk space alerts and web application alerts)
• Each is a “Single flow” + knowledge file
• The Tidal knowledge file is about 30,000 lines of text
• The knowledge files are stored in file shares on each OO Central
• We use the OO “Log file scan” operation to rapidly scan the knowledge files, using them as a crude database
• The scans are performed with (sometimes multiple) regular expression patterns
• The performance is acceptable
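Using a text file as a crude database comes down to scanning its lines with one or more regular expressions. The sketch below illustrates that idea in plain Python rather than with the OO “Log file scan” operation. The semicolon-separated line format and the job and group names are hypothetical, since the slides do not show the real knowledge file layout.

```python
import re


def scan_knowledge(lines, *patterns):
    """Return the lines that match every one of the given patterns."""
    compiled = [re.compile(p) for p in patterns]
    return [
        line.rstrip("\n")
        for line in lines
        if all(rx.search(line) for rx in compiled)
    ]


# Hypothetical knowledge-file content: one job per line.
SAMPLE = [
    "PAYROLL_NIGHTLY;auto_restart=yes;group=HR-APPS\n",
    "LEDGER_CLOSE;auto_restart=no;group=FIN-APPS\n",
    "PAYROLL_EXPORT;auto_restart=no;group=HR-APPS\n",
]

print(scan_knowledge(SAMPLE, r"^PAYROLL_", r"auto_restart=yes"))
```

A linear scan of about 30,000 short lines is cheap, which is consistent with the performance being acceptable.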

Blocks and Knowledge
The long-term solution (nearing the first alpha test)
Strategies
Blocks
• We expect to write 20-50 “blocks”: subflows that do very specific things in generalized ways
  • Stop this list of services (in this order) on this list of servers (in this order)
  • Start this list of services (in this order) on this list of servers (in this order)
• One flow can read the SCR system, determine what to do and then use the “blocks” required
Knowledge
• We built a simple but elegant tool, SCR (the Structured Command Repository); think “Selection lists”^n
• “Command Groups” are defined. Examples: “Tidal”, “Disk space alerts”, “Flows available by application” -> “How to Start applications”
• SCR allows us to custom-define the “fields” required by each “Command Group”. The information you need to handle Tidal job failures is very different from the list of tools available for application XYZ.
• Non-flow developers (Command Group Managers) manage “their part” of SCR. SCR appears to be custom-built for each Command Group
• OO interacts with SCR through a web service. A flow can be dispatched to “Start application XYZ”. OO goes to SCR asking for “The to-do list of steps to Start application XYZ” and then uses the “Blocks” necessary to carry out the action.
• SCR is intended to be called recursively
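The relationship between a to-do list and the blocks that carry it out can be sketched as follows. Everything here is illustrative: the dictionary layout of a to-do step, the block names and the `run` callback are hypothetical stand-ins for whatever SCR returns and for the real subflows. The loop order (each server in turn, and each service in turn on that server) is one reading of “in this order” and is also an assumption.

```python
def stop_services(services, servers, run):
    """Block: stop the listed services, in order, on the listed servers, in order."""
    for server in servers:
        for service in services:
            run(server, "stop", service)


def start_services(services, servers, run):
    """Block: start the listed services, in order, on the listed servers, in order."""
    for server in servers:
        for service in services:
            run(server, "start", service)


# Registry of generalized blocks, looked up by name.
BLOCKS = {"stop": stop_services, "start": start_services}


def carry_out(todo, run):
    """Execute a to-do list; each step names a block and supplies its inputs."""
    for step in todo:
        BLOCKS[step["block"]](step["services"], step["servers"], run)


# Hypothetical to-do list for restarting the database tier of an application.
todo = [
    {"block": "stop", "services": ["web", "db"], "servers": ["srv1", "srv2"]},
    {"block": "start", "services": ["db"], "servers": ["srv2"]},
]
carry_out(todo, lambda server, action, service: print(server, action, service))
```

The single driving flow stays the same for every application; only the to-do list changes.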

HPIT – Part III

OO Scheduler
An OO flow that we use to trigger our batch flows
Strategies
We have a pair of Instances (two-node clusters). One is in Atlanta, GA; the other is in Houston, TX. We want an “HA on top of the clustering” solution to trigger batch flows. The OO Scheduler does this.
• Every 15 minutes it changes sites (Houston/Atlanta/Houston/Atlanta)
  • A single instance failure means a 15-minute delay, not a complete failure
  • We spread compute load across more servers
• We can take manual control and manually “move” processing
  • For system upgrades
  • When problems arise
• Currently we maintain two schedules per instance, but work is underway to allow an OO flow to “move” processing, which we hope will ultimately allow us to fully automate periodic service Stop-Starts
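The site alternation can be sketched as a function of the clock. This is a minimal illustration that assumes the 15-minute slots are aligned to the hour and that Houston owns the first slot of the day; neither detail is stated on the slide, and the names are hypothetical.

```python
from datetime import datetime

SITES = ("Houston", "Atlanta")   # alternation order from the slide
SLOT_MINUTES = 15                # from the slide: sites change every 15 minutes


def active_site(now):
    """Return the site that triggers batch flows during the slot containing `now`."""
    slot = (now.hour * 60 + now.minute) // SLOT_MINUTES
    return SITES[slot % len(SITES)]


print(active_site(datetime(2013, 2, 18, 9, 20)))
```

Because a day holds 96 slots, an even number, the alternation continues cleanly across midnight. If the active site is down, its slot is simply missed and the other site picks up the work in the next slot, which is the 15-minute delay the slide describes.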

Structured Command Repository (SCR) Multiple “Lists” in one list

SCR Data management The format is defined by the list

StatefulnessStore: the yellow sticky notes of OO
(Yes, it is a horrible name, but it is a helpful tool)
• A (little) data needs to persist between flow runs
  • Has this Tidal job been restarted today?
  • Have we already touched this ticket?
• We don’t need to remember this data for very long (the most common is 30 hours, the max is 30 days)
• A simple structure
  • (6 “key” strings) (6 “data” strings) (one MAXVARCHAR) (one expiration date)
  • Every record must have an expiration date
  • An hourly OO flow “Cleans the Store”
• A (small) set of Canonical Flows are the interface
  • Remember one state
  • Recall one state
  • Some variants for special use
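The record structure and the remember/recall/clean interface can be modelled in memory as below. This is a sketch under assumptions: the real store sits behind Canonical Flows and presumably a database, and the class, method and parameter names here are hypothetical. Only the shape of a record (up to 6 keys, up to 6 data strings, one large text field, a mandatory expiration date) and the hourly clean-up come from the slide.

```python
from datetime import datetime, timedelta


class StatefulnessStore:
    """In-memory model: 6 key strings -> (6 data strings, blob, expiration)."""

    def __init__(self):
        self._records = {}

    @staticmethod
    def _pad(values):
        # Unused key or data slots are stored as empty strings.
        if len(values) > 6:
            raise ValueError("at most 6 strings are allowed")
        return tuple(values) + ("",) * (6 - len(values))

    def remember(self, keys, data, expires_at, blob=""):
        """Store one state; the expiration date is mandatory."""
        self._records[self._pad(keys)] = (self._pad(data), blob, expires_at)

    def recall(self, keys, now):
        """Return the 6 data strings, or None if missing or expired."""
        record = self._records.get(self._pad(keys))
        if record is None or record[2] <= now:
            return None
        return record[0]

    def clean(self, now):
        """Delete expired records (the hourly job); return how many were removed."""
        expired = [k for k, rec in self._records.items() if rec[2] <= now]
        for key in expired:
            del self._records[key]
        return len(expired)


store = StatefulnessStore()
start = datetime(2013, 2, 18, 8, 0)
store.remember(("tidal", "PAYROLL_NIGHTLY"), ("restarted",), start + timedelta(hours=30))
print(store.recall(("tidal", "PAYROLL_NIGHTLY"), now=start + timedelta(hours=1)))
```

Checking the expiration inside `recall` as well as in `clean` means a stale note is never returned, even if the hourly clean-up has not run yet.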
