
Improving Change Management Andreas Unterkircher


Presentation Transcript


  1. Improving Change Management Andreas Unterkircher

  2. Post mortem of 2008 releases
     • Releases of VOMS and BDII (related) rpms caused problems.
     • Update 22: VOMS (short FQANs, FQAN ordering)
     • Reliance on unsupported functionality
     • Update 30: GFAL (update incompatible with gLite 3.0 BDII)
     • Reliance on obsoleted services
     • Update 33: various BDII related problems
     • 1) bdb backend running out of available locks, which was fixed by adding a configuration parameter. Bug #42727
     • Insufficient testing in production context
     • 2) chown bomb for Quattor sites, caused by not setting a configuration parameter. Bug #42799
     • Alternative fabric management method
     • 3) Recursive protection commented out, causing the information system to fail. No bug or GGUS ticket ('the security bug', fixed in patch #2649, release update 37)
     • Lack of regression testing
     • 4) Change in schema file causing WMS to not match when using the CESEBind. Bug #45278 / GGUS #44201
     • Complex service interaction: the bug was in WMS

  3. Post mortem of 2008 releases
     • The proper fast track fix would have been a rollback of the release.
     • We did not roll back but tried to produce updates quickly.
     • This stems from the fact that our release model currently does not support rollbacks properly.
     • Some of the issues with updates could only have been spotted by running production workflows
     • E.g. user/framework relying on the order of FQANs
     • For a quick fix we were able to produce and distribute updated rpms quickly (< 24h), but
     • Updates were not well documented on the release pages.
     • Broadcasts were sent by different people, potentially confusing the sites.
     • Information on the release pages has to be improved (GGUS ticket complaining that the last release notes are not easy to understand)

  4. Post mortem of 2008 releases
     • “There is a tendency to bundle too many changes in a release” (CMS)
     • More, smaller releases?
     • Much less efficient for the release process.
     • Would the sites follow?
     • Less change in general?
     • Which changes do we drop?

  5. Actions to take
     • Implement well defined rollback procedure (SA3)
     • Implement procedure for fast track releases (SA3 & SA1)
     • Implement a managed rollout of updates in production (SA1)
     • Improve and maintain quality of the release pages (SA3)
     • Maximize representation of experiment use cases in certification (SA3)

  6. Rollback
     • Current status:
     • One shared repository for all node types
     • Planned:
     • “current” repository for every node type (no longer shared) + repositories of the previous update
     • In case of rollback “current” can point to the previous update. This can be done per node type and prevents sites from picking up the bad update.
     • Rollbacks are per node type, not per individual rpm.
     • Sites which have already installed the bad update need to downgrade manually.
     • We can provide a recipe
     • The release team needs to sort out how to achieve this from a technical point of view (this has implications for our scripts and AFS space, we need to ensure consistency of the repositories, etc.)
     • Timeline: end of February 09
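The planned layout on the Rollback slide can be illustrated with a small model. This is only a sketch of the idea, not the release team's implementation: the class `RepoLayout`, its methods, and the update numbers used in the usage note are hypothetical, and a real setup would involve repository directories or links rather than in-memory dictionaries.

```python
class RepoLayout:
    """Toy model: one 'current' pointer per node type, plus the update history.

    All names here are illustrative assumptions, not part of any gLite tooling.
    """

    def __init__(self):
        self._history = {}   # node type -> list of published update numbers, oldest first
        self._current = {}   # node type -> update number that 'current' points to

    def publish(self, node_type, update):
        """Publish a new update for one node type and point 'current' at it."""
        self._history.setdefault(node_type, []).append(update)
        self._current[node_type] = update

    def current(self, node_type):
        """Return the update that sites would pick up for this node type."""
        return self._current[node_type]

    def rollback(self, node_type):
        """Point 'current' back to the previous update for this node type only."""
        history = self._history[node_type]
        position = history.index(self._current[node_type])
        if position == 0:
            raise ValueError(f"no previous update to roll back to for {node_type}")
        self._current[node_type] = history[position - 1]
        return self._current[node_type]
```

For example, after publishing updates 32 and 33 for both `"BDII"` and `"WMS"`, calling `rollback("BDII")` moves only the BDII pointer back to 32 while WMS stays on 33. Sites that already installed the bad update are outside this model and would still need the manual downgrade recipe.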

  7. Procedure for fast track releases
     • Will be defined together with SA1 (PPS team)
     • Roles, broadcasts, release page.
     • Need a consensus on which issues are treated this way.
     • If a problem occurs, rollback should be the default action.
     • Timeline: should be available when rollback is implemented.

  8. Managed rollout
     • Done by SA1.
     • Identify sites wanting to be the first to install an update.
     • Node types will be exposed to production
     • Should capture production workflows.
     • Could use PPS manpower/resources to do this.
     • This is already partly being done now
     • After a new release only a few sites upgrade immediately
     • PPS installed pilot services for certain services (WMS, CREAM)
     • Timeline: to be determined by SA1

  9. Near future
     • No major releases until the rollback has been implemented.
     • Only dedicated, low risk or security updates
     • Among next update candidates: CREAM, VDT, yaim
     • Next release only for CREAM
     • We try to get more experiment use cases and add them to our test base (contact person: Gianni Pucciani).
