Problem records track the investigation and diagnosis of the root cause of one or more incidents. A problem identifies the underlying issue, so resolving the problem not only resolves the incident but also prevents further incidents. Incident Management is concerned with restoring service as quickly as possible, whereas Problem Management is concerned with determining and eliminating root causes, and hence eliminating repeat incidents.
The primary objectives of Problem Management are to prevent incidents caused by a similar root situation from recurring and to minimize the impact of incidents that cannot be prevented.
Problem Creation and Overview
When an incident is based on an underlying problem, first-response support technicians create a problem record from the incident, automatically creating a link between the incident record and the problem. The new problem record imports relevant information from the incident, such as the linked configuration item. Problems can also be created without being linked to existing incidents, such as in cases where a problem is discovered internally but no Incidents have been reported.
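The creation flow above can be sketched in a few lines. This is an illustrative model only, not the product's API: the `Incident` and `Problem` classes, their field names, and `create_problem_from_incident` are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Incident:
    # Hypothetical, simplified incident record.
    id: int
    summary: str
    configuration_item: Optional[str] = None
    problem_id: Optional[int] = None


@dataclass
class Problem:
    # Hypothetical, simplified problem record.
    id: int
    summary: str
    configuration_item: Optional[str] = None
    incident_ids: List[int] = field(default_factory=list)


def create_problem_from_incident(incident: Incident, problem_id: int) -> Problem:
    """Create a problem from an incident, copying relevant fields and linking both ways."""
    problem = Problem(
        id=problem_id,
        summary=incident.summary,
        configuration_item=incident.configuration_item,  # imported from the incident
        incident_ids=[incident.id],
    )
    incident.problem_id = problem.id  # the link is created automatically
    return problem


incident = Incident(id=101, summary="Printer B2 offline", configuration_item="Printer B2")
problem = create_problem_from_incident(incident, problem_id=1)

# A problem discovered internally, with no reported incidents, has no links.
standalone = Problem(id=2, summary="Disk nearing capacity on file server")
```

The standalone record shows the second path described above: a problem can exist with an empty list of related incidents and be linked later.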
The submitter, along with their department and contact information, is automatically populated based on the person who creates the problem, and the reporting source of the Problem is also captured in the record. Resolution of problems may require changes to the system. Staff addressing the problem may determine that a shared CI needs to be replaced or modified, and may therefore file a change request. All updates throughout the life cycle of the problem record are logged in the record's history along with the date and time, and updates can also be logged in the Additional Notes field, which logs the updater and time stamp of the update.
When the problem is resolved, a technician updates the record with relevant information and closes the record, in turn prompting automation to begin closing procedures for related Incidents. If the problem cannot be resolved, it may be classified as a Known Error and a permanent workaround supplied. This also updates the related Incidents.
Processing of Records
Once created, problem records can be linked to incidents from either the problem record or from an incident record.
Priority, a measure of the problem's urgency and relative importance, is set by default based on the combination of Impact and Urgency and how the Problem Priority Group is configured. The default priority matrix for problems is shown below.
These values may be changed to match a company's preferences. See the Impact, Urgency, and Priority Management section for more details. Problem priority, a measure of impact and risk, particularly with regard to IT service operations, may be determined separately as part of Problem Classification procedures. Together, priority and problem priority help IT managers make decisions for scheduling and planning and help determine the category of related change requests.
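As a rough illustration of how a priority can be derived from Impact and Urgency, the lookup below uses a small table keyed on both values. The level names and the priorities assigned to each combination are placeholders invented for this sketch; they are not the product's default matrix, and `lookup_priority` is a hypothetical helper.

```python
# Placeholder matrix: (impact, urgency) -> priority.
# These values are assumptions for illustration, not the shipped defaults.
PRIORITY_MATRIX = {
    ("High", "High"): "Critical",
    ("High", "Medium"): "High",
    ("High", "Low"): "Medium",
    ("Medium", "High"): "High",
    ("Medium", "Medium"): "Medium",
    ("Medium", "Low"): "Low",
    ("Low", "High"): "Medium",
    ("Low", "Medium"): "Low",
    ("Low", "Low"): "Low",
}


def lookup_priority(impact: str, urgency: str, matrix=PRIORITY_MATRIX) -> str:
    """Return the priority configured for an Impact/Urgency combination."""
    return matrix[(impact, urgency)]


# A company can supply its own matrix to match its preferences.
custom_matrix = dict(PRIORITY_MATRIX)
custom_matrix[("Low", "High")] = "High"

default_result = lookup_priority("Low", "High")
custom_result = lookup_priority("Low", "High", custom_matrix)
```

Keeping the matrix as data rather than branching logic mirrors the idea that the values are configuration and can be changed without touching the lookup itself.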
The problem record follows its own workflow separate from the incident. Ops team members open the problem record in the default status of Pending Diagnosis to indicate that a diagnosis has yet to be determined, or Diagnosed if no further steps are required. The default Problem SLA has specified targets for completing diagnosis, based on the priority, and a diagnosis clock is started upon creation of the problem record. As part of the diagnosis, technicians select the service involved based on existing incident reports or based on the most likely service to be related to the problem. If a particular configuration item is identified as the source of the problem's root cause, staff can quickly link the problem to the CI record.
Once a diagnosis is supplied, staff will move the record into a status of Diagnosed, fill out the Root Cause description and any further analysis, and save the record. They may also suggest a temporary fix, or "workaround", for the problem and push that to related incidents, e.g. "use Printer B3 for now instead", if a permanent solution is not readily available.
Diagnosis and Resolution Clocks
When the status changes from Pending Diagnosis to Diagnosed, Pending Change, Resolved, or Deferred, the diagnosis clock stops, the Diagnosis End Time field automatically populates with the current time, and the resolution clock starts. The diagnosis clock measures the working support hours between creation of the problem and diagnosis of the problem against the SLA diagnosis target.
The resolution clock measures time between when diagnosis is completed and the problem is resolved or deferred, excluding the time when the status is Pending Change, against the SLA resolution target for problems. While the status is in Pending Change, both clocks are stopped, since the people handling problems may not have control over when a change request can be scheduled and implemented.
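The clock rules above can be approximated with a short calculation over a status history. This is a sketch under stated assumptions: it uses elapsed wall-clock time rather than working support hours, treats Deferred as not accumulating resolution time, and `clock_totals` and the history format are hypothetical.

```python
from datetime import datetime, timedelta

# Statuses that end the diagnosis clock when first reached.
DIAGNOSIS_STOPPING = {"Diagnosed", "Pending Change", "Resolved", "Deferred"}
# Statuses during which the resolution clock does not run.
RESOLUTION_NOT_RUNNING = {"Pending Change", "Resolved", "Deferred"}


def clock_totals(history):
    """history: list of (timestamp, status) sorted by time; the first entry is creation.

    Returns (diagnosis_time, resolution_time). diagnosis_time is None if the
    problem has not yet been diagnosed.
    """
    created = history[0][0]
    diagnosis_end = None
    resolution = timedelta()
    for i, (start, status) in enumerate(history):
        if diagnosis_end is None and status in DIAGNOSIS_STOPPING:
            diagnosis_end = start  # corresponds to Diagnosis End Time
        if diagnosis_end is None or i + 1 == len(history):
            continue  # still diagnosing, or no later event closes this interval
        if status in RESOLUTION_NOT_RUNNING:
            continue  # paused (Pending Change) or stopped (Resolved, Deferred)
        resolution += history[i + 1][0] - start
    diagnosis = diagnosis_end - created if diagnosis_end is not None else None
    return diagnosis, resolution


history = [
    (datetime(2024, 1, 8, 9, 0), "Pending Diagnosis"),
    (datetime(2024, 1, 8, 13, 0), "Diagnosed"),
    (datetime(2024, 1, 8, 15, 0), "Pending Change"),
    (datetime(2024, 1, 9, 15, 0), "Diagnosed"),
    (datetime(2024, 1, 9, 16, 0), "Resolved"),
]
diagnosis, resolution = clock_totals(history)
print(diagnosis, resolution)  # 4:00:00 3:00:00
```

In this example the 24 hours spent in Pending Change are excluded, so the resolution total is the two hours before the change request plus the one hour after it.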
If a workaround exists while the problem investigation is in progress, you can send the workaround to any related Incidents and the incident submitters. To do this:
- Enter the details of the workaround into the Workaround field.
- On the Related Records tab, change the Workaround Provided field to Yes.
- Click 'Update Incident with Workaround.' This posts the workaround to all related Incidents and changes their status to Workaround Provided. When the Incident status changes to Workaround Provided, the system sends an email to both the end user (submitter) and the assigned staff person of each Incident, along with any CCs.
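The effect of these steps can be sketched as follows. The `Incident` class, its field names, the initial "Open" status, and `update_incidents_with_workaround` are hypothetical and only model the behavior described: each related incident receives the workaround, moves to Workaround Provided, and its submitter, assignee, and CCs are queued for notification.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Incident:
    # Hypothetical, simplified incident record.
    id: int
    submitter_email: str
    assignee_email: str
    cc_emails: List[str] = field(default_factory=list)
    status: str = "Open"  # assumed starting status for the example
    workaround: str = ""


def update_incidents_with_workaround(
    workaround: str, incidents: List[Incident]
) -> Dict[int, List[str]]:
    """Post the workaround to every related incident and list who gets emailed."""
    notifications: Dict[int, List[str]] = {}
    for incident in incidents:
        incident.workaround = workaround
        incident.status = "Workaround Provided"
        # The status change triggers an email to submitter, assignee, and CCs.
        notifications[incident.id] = [
            incident.submitter_email,
            incident.assignee_email,
            *incident.cc_emails,
        ]
    return notifications


related = [
    Incident(1, "ann@example.com", "tech1@example.com", ["lead@example.com"]),
    Incident(2, "bob@example.com", "tech2@example.com"),
]
recipients = update_incidents_with_workaround("Use Printer B3 for now instead", related)
```

The function returns the recipient lists instead of sending mail so the example stays self-contained.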
Deferring and Seeking More Information
During the root cause diagnosis or determination of the proper solution, staff may need more information from a separate process, for example, incident details. In this case, technicians change the problem's Status to Pending More Information and send an email to staff members with requests for information. If a problem is deemed too risky or of lower priority than more imminent issues, technicians change the status to Deferred to reflect no ongoing diagnosis or pending changes.
In most cases where a problem's root cause deals with a configuration item, a change request will be submitted to make the appropriate fixes to the configuration item. When Change Requests are created directly from the problem record, they are instantly linked. While a problem is waiting for a change request to be implemented, it can be put in the status of Pending Change. If the change request linked to a problem is closed while the problem is in a status of Pending Change, the system notifies the problem assignee to take additional steps to close the problem record.
A problem whose root cause is known but for which there is no permanent resolution is considered a Known Error. Known Errors should have workarounds to allow Incident Management to restore service as quickly as possible. When the problem is resolved either directly, or through a change request, all related incidents can be closed at once. On the Related Records tab, the button Update Incidents with Solution can be used to push the Resolution field of the problem into the Resolution field of the linked incidents, closing the incident and triggering an email to the customer with the resolution.
Once the permanent resolution is determined and implemented, staff users enter the description in the Resolution field and set the status to Resolved. A Closure Category for the resolution of the problem can be selected. In addition, a required Review for Knowledgebase field is displayed when closing, and if set to Yes, a knowledge article can be created from the problem for review to be added to the Knowledge Articles table.
Analysis and Major Problem Reviews
It is important to learn from problems to reduce the likelihood of repeat issues. There are two fields intended to facilitate analysis after a major problem.
- Include in Major Problem Review: this is a required field.
- Major Problem Review notes: this is used to write up notes about what was done well in the crisis, what didn't work so well, and what could be done to prevent such a crisis in the future. This field is only visible if the Include in Major Problem Review field has a Yes value.
Problems can be marked as known errors by setting the Known Error field to a Yes value. All known errors can be easily viewed by support technicians using the Known Errors saved search. Additional saved searches for Known Errors Unresolved and Resolved Known Errors show problem records that are still unaddressed, and those that have been resolved, respectively.
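The three saved searches amount to simple filters over problem records. The sketch below assumes each record is a dictionary with a `known_error` flag and a `status` value, and that "resolved" means a status of Resolved; both the record shape and the function names are hypothetical.

```python
problems = [
    {"id": 1, "known_error": True, "status": "Diagnosed"},
    {"id": 2, "known_error": True, "status": "Resolved"},
    {"id": 3, "known_error": False, "status": "Pending Diagnosis"},
]


def known_errors(records):
    """Equivalent of the Known Errors saved search."""
    return [r for r in records if r["known_error"]]


def known_errors_unresolved(records):
    """Known errors that are still unaddressed (assumed: status is not Resolved)."""
    return [r for r in known_errors(records) if r["status"] != "Resolved"]


def resolved_known_errors(records):
    """Known errors that have been resolved (assumed: status is Resolved)."""
    return [r for r in known_errors(records) if r["status"] == "Resolved"]


all_ids = [r["id"] for r in known_errors(problems)]
unresolved_ids = [r["id"] for r in known_errors_unresolved(problems)]
resolved_ids = [r["id"] for r in resolved_known_errors(problems)]
```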
For known errors that should be visible to end users, we recommend that they are pushed into a knowledgebase article, which can be done directly from the problem record.
Problems are owned by the staff member who creates the problem record. Since only technical staff generally see problem records, groups may share responsibilities between Incident Management and Problem Management and multiple individuals may be given responsibility for a problem without actually owning it.
This section contains an overview and screenshot examples of the information stored in a Problem record in the out-of-the-box system. The common area is shown in all tabs, and shows the progress of the problem, such as its Status, its Team Assignment, as well as the Assigned Person or owner of the problem record.
Details Tab
The Details tab contains most of the information and updates pertaining to the problem, including the Submitter, Submitter Department and Contact Information, Source of the problem (how it was reported), the Location of the problem, the Business Service and Service impacted by the problem, as well as the Impact, Urgency, and resulting Priority of the Problem. If a CI has been identified, it can also be selected here.
Work Status Tab
All Diagnosis details can be logged here along with the Risk and Root Cause analysis, and any additional working diagnosis notes with time stamps. The Resolution Details section in this tab is used to detail a Workaround for the problem, which can later be updated in the related incidents, or to pull an existing Resolution from the Known Error subset of problems. If the status is Resolved, a Closure Category can also be specified for the problem, and a determination must be made as to whether the problem should be included in a future Major Problem Review. The date the problem was resolved is then automatically logged in the History tab of the record to provide an audit trail.
The Time tab displays the SLA thresholds for the problem, including thresholds for the SLA Diagnosis Time, SLA Resolution Time, and the progress against those thresholds. The amount of time spent working on the problem can also be logged here.
Related Records Tab
The Related Records tab shows the list of incidents related to the problem. If a workaround or resolution has been provided, it can be pushed from the problem record to all of its related incidents. The problem can also be linked to an existing change request, or can be used to create a new one, copying over all of the relevant information directly from the problem record to the change request. In addition, the Knowledge Management section of this tab allows the problem to be converted directly into a record in the Knowledge Articles table with a status of Pending Review.
Reporting and Statistics
There are numerous default reports measuring different metrics for problems, and a few of the most relevant and frequently used are listed here as examples.
Known Error Trend Analysis
This is a Trend graph that shows the number of problems marked as Known Errors over the course of a pre-defined period of time (by default, this is since the beginning of the current year).
All Problems by CI Type, segmented by CI Name
This is a Segmented Bar Chart that shows the number of problems that were reported for each CI type, segmented by the name of the Configuration Item.
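The data behind a segmented chart like this is a two-level count: problems grouped by CI type, then by CI name within each type. A minimal sketch of that aggregation follows, assuming each problem record exposes hypothetical `ci_type` and `ci_name` fields; it does not reflect how the product computes the report.

```python
from collections import Counter, defaultdict

problems = [
    {"ci_type": "Printer", "ci_name": "Printer B2"},
    {"ci_type": "Printer", "ci_name": "Printer B2"},
    {"ci_type": "Printer", "ci_name": "Printer B3"},
    {"ci_type": "Server", "ci_name": "File Server 1"},
]


def problems_by_ci_type(records):
    """Count problems per CI type, segmented by CI name."""
    counts = defaultdict(Counter)
    for record in records:
        counts[record["ci_type"]][record["ci_name"]] += 1
    # Convert to plain dicts so the result is easy to print or serialize.
    return {ci_type: dict(names) for ci_type, names in counts.items()}


segments = problems_by_ci_type(problems)
# Bar height per CI type is the sum of its segments.
totals = {ci_type: sum(names.values()) for ci_type, names in segments.items()}
```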
Major Problem Review Report
This report shows the major problems by month over the past year. Additional details are in the HTML portion of the report.