
Deal with Production Issues



Presentation Transcript


  1. Deal with Production Issues: Suggestions from ITIL

  2. Problems to solve • Long resolution times • Neglected issues • Issues we lose track of until our users remind us • Recurring issues • Inconsistent response times • Developers are constantly pulled away from their work to resolve issues

  3. Goal • Manage issues in a consistent manner • Fast resolution • Reduce client impact • Proactively resolve issues before they impact clients

  4. Basic Concepts • Incidents • Any event which is not part of the standard operation of a service and which causes, or may cause, an interruption to, or a reduction in, the quality of that service • Problems • A problem is a condition often identified as the cause of multiple incidents that exhibit common symptoms. • Known Errors • A known error is a condition identified by successful diagnosis of the root cause of a problem and the subsequent development of a workaround

  5. Relationship of the three • A problem is the root cause of incidents • An incident is the manifestation of an underlying problem • One problem can cause many incidents • A known error is a problem with a known root cause and a known workaround

  6. Manage Incident vs. Manage Problem • Different goals • Incident management focuses on restoring service operation as quickly as possible • Problem management focuses on finding and eliminating the root cause • Different actions • Incident management applies workarounds or temporary fixes to quickly restore service • Problem management issues a change to eliminate the root cause • Incident management is reactive; problem management is proactive • Incident management emphasizes speed; problem management emphasizes quality

  7. Common mistakes • Spending excessive time and effort on finding the root cause before the service level is restored • Stopping the investigation once a workaround has fixed the incident • Letting the same incident recur without understanding its root cause

  8. Solutions from ITIL • Separate Incident Management and Problem Management into two independent but related processes • Handle incidents (restore service) as quickly as possible • Work on resolving problems proactively and independently of incidents • Manage Known Errors wisely

  9. Incident Management • Always remember the goal is to “Restore service level as quickly as possible” • How to go fast? • Classification • Match known errors and known workarounds • Appropriate escalation • Go fast, but don’t go crazy. Don’t skip these steps: • Record • Prioritize • Follow up

  10. Incident Management Process

  11. Acceptance and Record • Benefits of recording • Helps diagnose new incidents based on known incidents • Helps Problem Management find the root cause • Makes it easy to determine the impact • Makes it possible to track and control issue resolution • Incident Reporting Channels • User • System Monitor/Alert • IT staff

  12. Incident Record • Unique ID • Basic diagnosis info • Timestamp • Symptoms • User info (name, contact info) • Who’s responsible • Additional information • Screenshots • Logs • Status • New, Accepted, Scheduled, Assigned, Active, Suspended, Resolved, Terminated
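The fields on the Incident Record slide map naturally onto a small data structure. The following is only a sketch: the class and field names (`IncidentRecord`, `responsible`, and so on) and the in-memory ID counter are illustrative assumptions, not part of ITIL or of any tracking tool mentioned in the deck.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from itertools import count
from typing import List


class Status(Enum):
    """Statuses as listed on the Incident Record slide."""
    NEW = "New"
    ACCEPTED = "Accepted"
    SCHEDULED = "Scheduled"
    ASSIGNED = "Assigned"
    ACTIVE = "Active"
    SUSPENDED = "Suspended"
    RESOLVED = "Resolved"
    TERMINATED = "Terminated"


# Hypothetical ID source; a real tracking system would issue IDs itself.
_next_id = count(1)


@dataclass
class IncidentRecord:
    # Basic diagnosis info
    symptoms: str
    user_name: str
    user_contact: str
    responsible: str
    # Unique ID and timestamp are filled in automatically on creation
    incident_id: int = field(default_factory=lambda: next(_next_id))
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
    # Additional information
    screenshots: List[str] = field(default_factory=list)
    logs: List[str] = field(default_factory=list)
    status: Status = Status.NEW


# Example usage with made-up data
incident = IncidentRecord(
    symptoms="Login page times out",
    user_name="A. User",
    user_contact="a.user@example.com",
    responsible="Service Desk",
)
incident.logs.append("app.log: connection pool exhausted")
incident.status = Status.ACCEPTED
```

Keeping the status as an enumeration rather than free text makes it harder to record a state that the process does not define.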

  13. Classification • Classify • Possible cause (application, network, database, business logic, etc.) • Supporting group (application group, database group, infrastructure group, network group, etc.) • Prioritize • Priority = Impact × Urgency • Determine the resolution timeline (resolve within a set number of hours) based on the Service Level Agreement
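The rule that priority is the product of impact and urgency, with the resolution timeline then read off the Service Level Agreement, can be sketched as below. The three-level scales and the hour values in `SLA_HOURS` are invented for illustration; a real SLA defines its own levels and targets.

```python
# Assumed three-level scales; a real SLA may use different levels.
IMPACT = {"low": 1, "medium": 2, "high": 3}
URGENCY = {"low": 1, "medium": 2, "high": 3}

# Hypothetical SLA table: (minimum priority score, hours allowed to resolve),
# ordered from the highest score down.
SLA_HOURS = [(9, 1), (6, 4), (3, 24), (1, 72)]


def priority_score(impact: str, urgency: str) -> int:
    """Priority = Impact x Urgency; a higher score is more pressing."""
    return IMPACT[impact] * URGENCY[urgency]


def resolution_hours(score: int) -> int:
    """Look up the resolution target for a priority score."""
    for minimum, hours in SLA_HOURS:
        if score >= minimum:
            return hours
    raise ValueError(f"score {score} is below the lowest SLA band")


score = priority_score("high", "medium")   # 3 * 2 = 6
print(score, resolution_hours(score))      # prints: 6 4
```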

  14. Preliminary Support • Preliminary Response • Acknowledge acceptance • Collect basic info • Provide basic help to the user • Service Requests • A service request is a standard service such as checking a status or resetting a password • Follow the standard procedure to handle service requests

  15. Match • Match known errors • Known solution • Known workaround • Known resolution procedure • Match existing incidents • Link the new incident with the existing incidents • Increase the impact level of the existing incident • If the existing incident is already being worked on, inform the responsible person or group
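One simple way to match a new incident against known errors is to compare symptom keywords. The sketch below assumes symptoms are recorded as short lowercase tags and uses plain set overlap; the `KnownError` class, the `min_overlap` threshold, and the sample data are all hypothetical, and a real knowledge base would more likely rely on full-text search.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class KnownError:
    error_id: str
    symptoms: FrozenSet[str]   # lowercase symptom tags
    workaround: str


def match_known_errors(
    reported: Iterable[str],
    known_errors: Iterable[KnownError],
    min_overlap: int = 2,
) -> List[KnownError]:
    """Return known errors sharing at least min_overlap symptoms, best first."""
    tags = {tag.lower() for tag in reported}
    scored = []
    for known in known_errors:
        overlap = len(tags & known.symptoms)
        if overlap >= min_overlap:
            scored.append((overlap, known))
    # Stable sort: ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [known for _, known in scored]


catalog = [
    KnownError("KE-1", frozenset({"timeout", "login", "database"}),
               "Restart the connection pool"),
    KnownError("KE-2", frozenset({"timeout", "report"}),
               "Run the report off-peak"),
]
matches = match_known_errors(["Timeout", "Login"], catalog)
print([m.error_id for m in matches])   # prints: ['KE-1']
```

When a match is found, the service desk can apply the recorded workaround immediately instead of starting a fresh investigation.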

  16. Investigation and Diagnosis • Escalation • Functional escalation (technical escalation): involve more technical experts, teams in other functional groups, or external suppliers • Hierarchical escalation (management escalation): escalate to a higher-level management team

  17. Escalation by Priorities (figure: escalation levels A (Service Desk), B (Second Line), C (Third Line, Supplier), D (Incident Manager), E (Division Management), F (Corporate Management))
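A time-bound escalation rule over the six levels named on this slide could look like the sketch below. The original figure is not available in the transcript, so the single ladder from A to F, the priority numbering (1 is most urgent), and the minutes per step are all assumptions made for illustration.

```python
# Levels A-F from the slide, treated here as one ladder (an assumption).
LEVELS = [
    "Service Desk",
    "Second Line",
    "Third Line / Supplier",
    "Incident Manager",
    "Division Management",
    "Corporate Management",
]

# Hypothetical minutes an incident may stay at one level before escalating,
# keyed by priority (1 = most urgent).
STEP_MINUTES = {1: 15, 2: 60, 3: 240}


def escalation_level(priority: int, minutes_open: int) -> str:
    """Return the level that should own the incident after minutes_open."""
    step = STEP_MINUTES[priority]
    index = min(minutes_open // step, len(LEVELS) - 1)
    return LEVELS[index]


print(escalation_level(1, 30))   # prints: Third Line / Supplier
print(escalation_level(3, 30))   # prints: Service Desk
```

Higher-priority incidents climb the ladder faster, which is one way to enforce escalation based on priority.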

  18. Investigation Activities • Assign a dedicated support person • Collect basic info • Query historical data • Recent releases • Recent changes • Workload trend • Analyze • Again, don’t spend too much time on finding the root cause. Find a workaround as soon as possible!

  19. Resolve and Recover • Resolution (workaround or permanent fix) • Create a Request For Change (RFC) • Approve the RFC • Implement the change • Record the analysis, the root cause, the workaround, and the solution • Leave the incident in Open status if no resolution has been found

  20. Termination • Contact the user to confirm the incident is resolved • Change the incident status to “Closed” • Update the incident record to reflect the final priority, impact, user, and root cause

  21. Track and Monitor • Assign an owner to each incident. Usually it’s the Service Desk person. • Provide feedback to the users after a change • Enforce the escalation based on the priority

  22. Problem Management • Problem Control • Find the root cause of a problem • Turn a problem into a Known Error • Error Control • Control and Monitor the Known Errors until they are appropriately handled • Proactive Problem Management • Resolve problems before they cause any incidents

  23. Problem Control

  24. Identify Problems • Analyze the trends of incidents • Likely to recur • Likely that more will occur • Likely to have a larger impact • Analyze the weaknesses of the infrastructure • Availability • Capability • A significant incident (outage)
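Analyzing incident trends can start with something as simple as counting how often each component appears in the incident history. The sketch below is a minimal illustration; the `(component, symptom)` tuple layout, the `min_repeats` threshold, and the sample history are assumptions rather than anything prescribed by ITIL.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def problem_candidates(
    incidents: Iterable[Tuple[str, str]], min_repeats: int = 3
) -> List[Tuple[str, int]]:
    """Return (component, count) pairs seen at least min_repeats times,
    most frequent first."""
    counts = Counter(component for component, _symptom in incidents)
    return [(c, n) for c, n in counts.most_common() if n >= min_repeats]


# Made-up incident history: (component, symptom)
history = [
    ("database", "timeout"),
    ("network", "packet loss"),
    ("database", "timeout"),
    ("database", "deadlock"),
    ("application", "error 500"),
]
print(problem_candidates(history))   # prints: [('database', 3)]
```

Components that cross the threshold are candidates for a problem record, since repeated incidents suggest a shared root cause.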

  25. Diagnosis • Recreate the incident in a testing environment • Link the modules with incidents • Review the latest changes • Once the root cause of a problem is found, the problem becomes a Known Error

  26. Temporary Fixes • It’s important to find a temporary fix if the problem causes significant incidents • If the temporary fix involves changes to the infrastructure, a Request For Change must be submitted. (Later, another RFC may be submitted to fix the root cause.) • For urgent problems, the Emergency Change Request Process should be initiated.

  27. Error Control

  28. Identify and Record Known Error • Identify • Find the root cause of a problem • Link a problem with a known error • Record • Assign an ID • Symptoms • Root cause • Status • Notification • Notify incident management team. They can associate new incidents with known errors

  29. Determine the solution • Evaluate based on • Service Level Agreement • Impact and Urgency • Cost and benefit • Possible solutions • Temporary fixes • Permanent fixes • No fix (cost is greater than benefits) • Record the decision in Problem Database

  30. Known Errors from other environments • Known errors from development environment • We may choose to release with some minor known issues • Known errors from suppliers • Usually reported in the release notes • Record, Monitor and Track those known errors • Relate problems with those known errors

  31. PIR (Post Implementation Review) • Normal problems • Confirm all the related incidents are closed • Verify that the problem record is complete (symptoms, root cause, and solutions) • Change the problem status to Resolved • Significant problems • What went well? • What went wrong? • How can we do better next time? • How can we prevent similar issues from happening again?

  32. Track and Monitor • Track the full lifecycle of each known error • Reevaluate impact and urgency. Adjust the priorities accordingly. • Monitor the progress of the diagnosis and implementation of the solution. Monitor the implementation of the RFC.

  33. Proactive Problem Management • Focus on the quality of the service and the infrastructure • Analyze operational trends • Detect the potential incidents and prevent them from happening • Find out the weak points of the infrastructure or the overloaded components

  34. Ideas to improve our Production Support process • Idea 1: Create an independent Problem Management Team • Idea 2: Create a Problem Database • Idea 3: Define the Production Support Procedure • Idea 4: Review and revise the procedures for using TeamTrack • Idea 5: Enforce Post Implementation Review • Idea 6: Proactively manage problems • Idea 7 (optional): Acquire Service Desk software to facilitate the process

  35. Create an independent Problem Management Team • Can be a full-time or a part-time team • Appoint a Problem Management Manager, who must not be the same person as the Production Support Manager, because their goals, schedules, and requirements are different • Responsible for managing all the production problems (not incidents) for multiple applications • Identify problems • Record problems • Find and evaluate solutions • Track progress until closure • Work closely with the existing Production Support team

  36. Create a Problem Database • An easy-to-search knowledge base • Include problems and known errors • Track symptoms, root causes, temporary fixes, workarounds, and permanent solutions • Include all the known errors in DEV and unresolved or deferred defects in QA/RATE environments • Maintained by the Problem Management Team • Used by the Production Support team for matching and fast resolution of incidents

  37. Define the Production Support Procedure (Work Instructions) • Create a formal and detailed document. Train Production Support Team to follow the new procedure • Start with ITIL Incident Management Process. Adjust it to our own situation and tools • Clearly define how to calculate priorities • Clearly define the time-bound escalation procedure • Clearly define the monitoring and tracking steps

  38. Review and revise the procedures for using TeamTrack • TeamTrack is our existing incident tracking system • Review the functions of TeamTrack • Redefine the incident escalation process according to ITIL suggestions • Define the interface between PC Support and the IT Production Support Team • Communication channel • Roles and responsibilities • Escalation • Track and Control • Knowledge sharing

  39. Enforce PIR • Contact each user to confirm all the incidents are closed • Make sure the Problem record is complete and useful • Identify issues in the Incident and Problem Management process. Add those to Problem database.

  40. Proactively Manage Problems • Responsibility of the Problem Management Team • Perform the following activities: • Analyze incidents to find trends • Analyze the infrastructure to identify possible bottlenecks • Run fail-over and stress tests • Apply a problem solution across multiple related applications • Establish and maintain the Production Monitor System to proactively detect system anomalies • Measure how many problems are proactively identified and resolved

  41. Service Desk Software • Evaluate the existing TeamTrack software and see if it covers our needs • Other popular options • HP OpenView Service Desk • Remedy Strategic Service Suite • CA Unicenter Service Desk
