The CERN Agile Infrastructure Project: Configuration and Operations Tools. Helge Meinhard / CERN-IT (replacing Manuel Guijarro). HEPiX Spring 2012, 24 April 2012, Praha. Configuration and Operations Tools. https://twiki.cern.ch/twiki/bin/view/AgileInfrastructure


The CERN Agile Infrastructure Project: Configuration and Operations Tools

Helge Meinhard / CERN-IT (replacing Manuel Guijarro)

HEPiX Spring 2012

24 April 2012, Praha

Configuration and Operations Tools

https://twiki.cern.ch/twiki/bin/view/AgileInfrastructure

https://agileinf.cern.ch/jira/

Agile Infrastructure - Configuration and Operation Tools

Project Scope
  • The project is reviewing the entire CERN computer-centre management toolset
    • What happens from the bare metal up
    • Asset management, inventory
    • Sysadmin tools and maintenance workflows
    • Service management and configuration tools
    • Dynamic configuration for ‘virtual’ hosts
    • Operations monitoring
    • Workflow automation and continuous deployment

Why?
  • Current production system built around the Quattor toolset is successfully managing O(10k) servers
    • (CERN) Quattor + many CERN components
  • Why are we changing the toolset?

What are the Issues (1)
  • Uncompressible technical debt
    • The cost to develop and maintain our own solution is not reducing and clearly exceeds our resources
    • Small community (less funding) and general support problem. At CERN, we’ve fallen into the “sticky hands” support model
  • We need better automation and integration between the sub-components
    • Lack of automated workflow: everything is a ticket
      • emailScript™: your added value in the process is often your CERN password
    • The 15-min “CDB commit walk” – context switch cost

What are the Issues (2)
  • Transferable skills and training
    • Learning curve for our tools is steep and remains high
    • It’s easier to hire people who have skills in a widely-used tool than in your internal tools
      • Depending on where you look

Jobs Adverts – indeed.com

Index of millions of worldwide job posts across thousands of job sites

These are the sort of posts our departing staff will be applying for.

[Chart: job adverts mentioning Puppet vs. Quattor]

Integration is Hard
  • IPv6, virtualisation, Windows Server all need a solution
    • We could leverage lots of open source tools
      • But piecemeal integration of these requires high investment due to our complex system
      • Years of organic growth have made the system way too ‘hairy’
      • It’s often easier to reinvent rather than integrate
    • Lack of ‘dynamic-ness’ in the infrastructure
      • We hack the config system for dynamic VMs
  • It’s critical to look at the system as a whole

Use Puppet for the Core
  • The tool space has exploded in the last few years
    • In configuration management and ops
    • Large, shared ‘tool forges’, and lots of experience
  • Puppet and Chef are the clear leaders for the ‘core’ tool
    • Other tools in our ‘scope’ try to integrate with those
  • Many large-scale enterprises use Puppet
    • Its declarative approach fits better with what we are used to
    • Large installations: friendly, wide-base community and commercial support and training
    • You can buy books on it

Scaling Challenges: Nodes
  • Currently we have O(10k) physical nodes
  • IaaS approach:
    • Moving to virtual machines
    • More (smaller, load-balanced) service nodes
    • VMs for raw compute (batch or pilot jobs)
    • Homogeneous: compute + storage on the same node
  • Add another computer centre, 24/48 SMT cores per node, you get 100k – 300k virtual nodes to be managed
    • A 99.6% node-update success rate (seen in a recent intervention on lxbatch) means 1200 manual interventions to “fix it” at 300k nodes
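
The arithmetic behind the intervention count can be checked with a short sketch. The function name `manual_interventions` is made up for illustration; the 99.6% success rate and the node counts are the figures quoted on the slide.

```python
def manual_interventions(nodes: int, success_rate: float) -> int:
    """Expected number of nodes left needing a manual fix after one update."""
    return round(nodes * (1.0 - success_rate))


# Node counts from the slide: today's O(10k) and the projected 100k to 300k.
for nodes in (10_000, 100_000, 300_000):
    print(nodes, manual_interventions(nodes, 0.996))
```

Even a small failure rate grows linearly with the node count, which is why the slide treats manual fixing as a scaling problem.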

Scaling Challenges: People
  • Many, diverse applications (“clusters”) managed by different teams
  • …and 700+ other “unmanaged” Linux nodes in VMs that could benefit from a simple configuration system

Agile Infrastructure 1st Try (1)
  • First started investigating tools in September 2011 using ‘part-time’ resources from several IT groups
    • Trying an iterative “agile-sprint” style (Scrum): short sprints, feedback, sprint reviews, visibility
    • Take a first best guess at architecture and tool selection, then iterate
  • Mixed success with this agile style
    • What works: Good visibility and reviews. Daily “scrum” meeting useful. Weekly review meeting open to management.
    • What doesn’t: The “time boxing” part of Scrum sprints is hard with part-time resources
    • Now more staff available, but still mostly part-time efforts

Agile Infrastructure 1st Try (2)
  • We’re currently running:
    • OpenStack as cloud software for virtual machines, image management, bulk storage
      • See later presentation
    • Puppet for the configuration management core
    • …with Foreman as a dashboard

Foreman Dashboard

Agile Infrastructure 1st Try (2, continued)
  • None of the tools are “perfect” out-of-the-box
    • … but we’d rather submit patches to a good open source tool than re-implement it
    • We’ve experienced very good community support: RFCs and patches are quickly accepted
    • Very active community: often problems are fixed and missing features implemented before you even report them

Agile Infrastructure 1st Try (3)
  • We’re currently running:
    • yum for software distribution (replacing spma)
    • git for template management: why git?
      • Almost all the Puppet (and Chef) usage schemes out there assume you use git to handle the templates
      • Many of the tools we can benefit from also assume git
      • We should not be different from the rest of the community

Puppet
  • Client/server architecture
    • “puppetmaster”: horizontally scalable Ruby application
    • X509 cert authenticated nodes: integrate with CERN CA

  • Puppet runs on the client, applying the configuration changes
    • It detects the current state and only runs if there’s something to do
  • It runs every few minutes
    • New configuration will be ~immediately applied (“fail-fast”)
    • This is a change from CDB where ‘latent’ changes can be stacked up
  • Normal mode is client-side compile (“assume success”)
    • No more CDB commit waits
    • Change from CDB: the compilation fails later
  • Good monitoring is a pre-req: puppet sends reports back to the puppetmaster
    • The Foreman tool can collect these for you
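
The “detect the current state and only act if there is something to do” behaviour can be illustrated with a minimal sketch. This is not Puppet code and does not use any Puppet API: the functions `apply_catalog` and `report`, the setting names and the host names are all hypothetical, and a plain dictionary stands in for the node.

```python
def apply_catalog(desired: dict, node: dict) -> list:
    """Bring `node` in line with `desired`; return the keys that had to change."""
    changed = []
    for key, value in desired.items():
        if node.get(key) != value:   # detect the current state first
            node[key] = value        # act only when there is drift
            changed.append(key)
    return changed


def report(changed: list) -> dict:
    """Summary an agent could send back for a dashboard to collect."""
    return {"status": "changed" if changed else "unchanged", "changed": changed}


# Hypothetical desired state and a simulated node that has drifted.
desired = {"ntp_server": "time.example.org", "sshd_root_login": "no"}
node = {"ntp_server": "old.example.org"}

print(report(apply_catalog(desired, node)))  # first run fixes the drift
print(report(apply_catalog(desired, node)))  # second run has nothing to do
```

Because a run with nothing to do changes nothing, running the agent every few minutes is safe, and the per-run report is what makes failures visible.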

Puppet Language
  • Puppet uses its own Ruby-like language for the templates to “assert” the desired state of the nodes
    • With Ruby fall-back for hard stuff (we’ve only needed this once)
  • Being declarative rather than procedural, there are quirks
    • Takes a bit of practice to ‘get it’
    • There are books, online docs, online cook-books, and a large community to help
  • It dispenses with the need for ncm components
    • All the work is done by puppet on the node itself – you just provide the template part to assert what you want done
    • Less software -> easier to move to new OS versions

Externals
  • Puppet uses an external DB for much of the configuration that we currently store in textual CDB templates
  • Node function + hardware
    • Moving a host between clusters is a DB update
  • Your configuration can use variables the node detects itself
    • e.g. reconfigure daemons based on where a newly live-migrated VM has found itself
  • Query the compiled configuration of other hosts
    • e.g. Open my firewall to the lxadm nodes
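
As a rough illustration of database-driven node classification, the sketch below keeps host data in a table and derives each host's configuration from it. It does not reproduce Puppet's actual classifier interface or stored-configuration schema: `NODE_DB`, `CLUSTER_CLASSES`, the host names, the hardware types and the class names are all invented for the example; only the cluster names lxbatch and lxadm come from the slides.

```python
# Hypothetical stand-in for the external node database.
NODE_DB = {
    "node01.example.org": {"cluster": "lxbatch", "hardware": "type_a"},
    "node02.example.org": {"cluster": "lxadm", "hardware": "type_b"},
    "node03.example.org": {"cluster": "lxadm", "hardware": "type_a"},
}

# Hypothetical mapping from cluster to the configuration classes it receives.
CLUSTER_CLASSES = {
    "lxbatch": ["base", "batch_worker"],
    "lxadm": ["base", "admin_tools"],
}


def classify(hostname, db):
    """Return the classes and parameters for a host, derived from the database."""
    record = db[hostname]
    return {
        "classes": CLUSTER_CLASSES[record["cluster"]],
        "parameters": dict(record),
    }


def move_host(hostname, new_cluster, db):
    """Moving a host between clusters is a single database update."""
    db[hostname]["cluster"] = new_cluster


def hosts_in_cluster(cluster, db):
    """Query data about other hosts, e.g. to build a firewall allow-list."""
    return sorted(h for h, rec in db.items() if rec["cluster"] == cluster)


print(classify("node01.example.org", NODE_DB)["classes"])
print(hosts_in_cluster("lxadm", NODE_DB))
```

The point of the design is that no template has to be edited to move a host: the next configuration run simply picks up the new cluster from the database.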

Moving towards PaaS
  • Parametrisable recipes
    • Just fill in the blanks
  • The aim is to make it easy to use “pre-canned” recipes without even touching a Puppet template
    • e.g. stick a standard CERN SSO-enabled apache / mod_wsgi / Django server on my box
    • …with these parameters
  • Moving us in the PaaS direction
    • Ultimately, it would be better if you never even needed to log into this node
      • (J2EE public service, IT web hosting service, MySQL service)
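
The “just fill in the blanks” idea can be sketched with Python's standard `string.Template`. This only illustrates a parametrised recipe with defaults; the recipe text, the parameter names and the `render` helper are hypothetical and are not the project's actual Puppet modules.

```python
from string import Template

# Hypothetical pre-canned recipe: the blanks are the $placeholders.
RECIPE = Template(
    "server_name = $server_name\n"
    "app_module = $app_module\n"
    "sso_enabled = $sso_enabled\n"
)

# Defaults supplied by the recipe author, so users fill in only what differs.
DEFAULTS = {"sso_enabled": "true"}


def render(recipe, **params):
    """Fill in the blanks; raises KeyError if a required blank is left empty."""
    values = {**DEFAULTS, **params}
    return recipe.substitute(values)


print(render(RECIPE, server_name="myapp.example.org", app_module="myapp.wsgi"))
```

Failing loudly on a missing parameter, rather than rendering a half-filled configuration, keeps the recipe usable by people who never read the template itself.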

Standard Workflow

  • CDB on lxadm (iterate: n minutes)
    • Check out from CDB
    • Update templates
    • CDB commit
    • Run and check on test node
    • Notify with nc-client
    • Check on node(s)
  • Puppet on lxadm (iterate: 1 minute)
    • Check out from git
    • Update templates
    • git commit and push
    • Run and check on test node
    • Notify with mcollective
    • Check on Foreman
  • Puppet-apply on test node (iterate)
    • Check out from git on the test node
    • Update templates
    • Run puppet-apply
    • Check on test node
    • git commit and push
    • Notify with mcollective
    • Check on Foreman

Modernising our Processes (1)
  • Our software processes for the computer centre are fairly limited
    • fire-and-forget broadcasts to project-elfms
  • …and rather manual
    • The manual test/ -> preprod/ -> prod/ template dance
    • Our toolset RPMs are ‘built on laptop’ and uploaded to ‘swrep’ by hand
  • Add standard continuous integration (e.g. Jenkins, Bamboo, Cruise) and automated build (Koji) as the only route to get new packages into the CC
    • … then automate the testing
    • e.g. suitably tagged RPMs are automatically deployed to /test nodes.
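
One way to picture an automated test -> preprod -> prod advancement is a small promotion rule: a package moves one stage forward only when its checks pass. This is a sketch under assumptions, not the project's pipeline; the `next_stage` helper is hypothetical, and only the three stage names come from the slide.

```python
STAGES = ["test", "preprod", "prod"]


def next_stage(current: str, checks_passed: bool) -> str:
    """Advance one stage when checks pass; otherwise stay where we are."""
    if not checks_passed:
        return current
    index = STAGES.index(current)
    return STAGES[min(index + 1, len(STAGES) - 1)]


# A package enters at "test" and is re-evaluated after each automated check run.
stage = "test"
for passed in (True, False, True):
    stage = next_stage(stage, passed)
    print(stage)
```

In a real setup the `checks_passed` flag would come from the continuous integration system rather than from a hard-coded sequence.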

Modernising our Processes (2)
  • We’re working out which of the many puppet / git models suits us
    • code review, sign-off and automated notification for changes that will affect multiple clusters
    • How to automate the test/preprod/prod advancement
  • Pre-req is flexible monitoring and alarming
    • you need to trust that an automation failure will be signaled to you
  • Script-generated emails are banned
    • Need good monitoring to hang these notifications on
  • Integrate components rather than use emailScript™
    • Script-generated tickets (where your value in the process is your password) are banned

Current Tool Snapshot (Liable to Change)

  • Puppet, Foreman
  • mcollective, yum
  • Jenkins
  • AIMS/PXE
  • Foreman
  • JIRA
  • OpenStack Nova
  • git, SVN
  • Koji, Mock
  • Yum repo
  • Pulp
  • Lemon
  • Hardware database
  • Puppet stored config DB

Preliminary Timelines

Aggressive schedule if we are to make it for new data centre

Initial Steps
  • Decided on tools
  • Integrating them to make a production setup
    • We can still change… but we’re starting to commit
  • Looking for early adopters
    • In particular to understand the people-scaling / ACL issues: which of the git/puppet models is best?
      • e.g. PES/OIS services: batch/VMs, JIRA, Drupal
      • https://twiki.cern.ch/twiki/bin/view/AgileInfrastructure/EarlyAdopters2012
    • Help with integration / coding
    • Help with ideas
    • Help with building the task list

Summary
  • IT has started a new project to move our infrastructure to a new toolset based around industry-standard open source components
    • Puppet for the core configuration tool
    • Better integration between components
    • Use of more modern software processes to aid deployment
    • Better monitoring
    • Engage with the community rather than re-implement
  • Overall project scope is wider (see following presentations)
    • Improved monitoring
    • Cloud and virtualisation
  • Actively seeking wide involvement from CERN-IT and feedback from the community
  • https://twiki.cern.ch/twiki/bin/view/AgileInfrastructure

Acknowledgements
  • Many colleagues at CERN-IT, including
    • Tim Bell
    • Ian Bird
    • Bernd Panzer-Steindel
    • Gavin McCance
    • Manuel Guijarro
