Some Key Terms for Incident Management

Introduction

Incident Management is an important aspect of network management and control solutions. It deals with the reporting, inspection, correlation, and management of events within the network where those events have a negative effect on the network's ability to forward traffic in an optimal way. Incident management extends to include actions taken that work toward recovery of optimal network behavior. A number of work efforts within the IETF seek to provide components of an Incident Management system, such as YANG models or management protocols. It is important that a common terminology is used so that there is a clear understanding of how the elements of the management and control solutions fit together, and how the incidents will be handled. This document sets out some key terms that are fundamental to a common understanding of Incident Management.

Terminology

The terms are presented below in an order intended to let the reader build understanding by reading from top to bottom.

Resource:

A component or commodity that can be used in a valuable way in the performance of some activity.

State:

A particular condition that something is in (at a specific time).

Change:

A modification to the state of a resource in time.

Most changes are not noteworthy (and are not relevant).
Perception of change depends upon the sampling rate/accuracy/detail and perspective.

Occurrence:

A particular relevant change.

The change is potentially without a plan or intent.
An occurrence may be an aggregation or abstraction of smaller occurrences.
Applies to all scales and scopes, i.e., is essentially fractal (can recurse indefinitely).
Note that occurrence is used here with respect to the temporal dimension.

Event:

The state modification in an occurrence.

Compared with a change, which takes place over a period of time, an event happens at a measurable instant.

Incident:

An event that has a negative effect that is not as required/desired.
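As an illustration only, the relationship between an event and an incident can be sketched in code. This is a minimal sketch, not part of the terminology: the `Event` record, its field names, the `is_incident` helper, and the `link_down_policy` rule are all hypothetical, and the decision of what counts as a negative effect is assumed to be supplied by the operator as a policy function.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Event:
    """A state modification observed at a measurable instant (hypothetical record)."""
    resource: str     # name of the resource whose state was modified
    timestamp: float  # instant of the event, in seconds (assumed unit)
    old_state: str
    new_state: str


def is_incident(event: Event, has_negative_effect: Callable[[Event], bool]) -> bool:
    """Treat an event as an incident when the supplied policy says its effect is negative."""
    return has_negative_effect(event)


def link_down_policy(event: Event) -> bool:
    """Hypothetical policy: any transition into the "down" state is undesired."""
    return event.new_state == "down" and event.old_state != "down"


events = [
    Event("link-1", 10.0, "up", "down"),
    Event("link-1", 12.5, "down", "up"),
]

# Only the first event has a negative effect under this policy.
incidents = [e for e in events if is_incident(e, link_down_policy)]
```

Keeping the policy separate from the event record reflects the definition above: the same event may or may not be an incident depending on what is required or desired.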

Problem:

A state regarded as undesirable that needs to be dealt with and overcome.

There is a need to change to a desirable/appropriate state.
Note that there is a historic aspect to this. The current state may be operational, but if there was a failure that remains unexplained, the network is in a state of unexplained recent failure. Although the network has recovered, that state is a problem.
Note that whilst a problem is unresolved it requires attention. A record of a resolved problem may be maintained in a log of history.
Note that the network may be in a state which is considered to be a problem from several perspectives (e.g., there is loss of light causing services to fail). A state change (so that the light recovers) may cause the problem to be resolved from one perspective (the services are now operational) but may still leave the problem as unresolved from another perspective (because the loss of light has not been explained). There can be further developments (the reason for the temporary loss of light is traced to a microbend in the fiber that is repaired) that cause another problem to be resolved. But this leaves a final problem still unresolved (why did the microbend occur in the first place?).
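The loss-of-light example can be walked through with a small sketch. It is an illustration under stated assumptions, not a prescribed model: the `ProblemTracker` class, its method names, and the problem labels are hypothetical. It assumes that each perspective is tracked as a separately named problem, and that resolved problems are moved to a history log.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class ProblemTracker:
    """Tracks unresolved problems and logs resolved ones (hypothetical helper)."""
    unresolved: Set[str] = field(default_factory=set)
    history: List[str] = field(default_factory=list)

    def raise_problem(self, name: str) -> None:
        self.unresolved.add(name)

    def resolve(self, name: str) -> None:
        # Move the problem from the unresolved set to the history log.
        self.unresolved.remove(name)
        self.history.append(name)


tracker = ProblemTracker()

# Loss of light: a problem from two perspectives.
tracker.raise_problem("services failed")
tracker.raise_problem("loss of light unexplained")

# The light recovers: resolved from the service perspective only.
tracker.resolve("services failed")

# The loss of light is traced to a microbend, which is repaired.
# That explains the loss of light but leaves a further open question.
tracker.resolve("loss of light unexplained")
tracker.raise_problem("cause of microbend unexplained")
```

At the end of this sequence the network is operational, yet one problem remains unresolved and two are recorded in the history.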

Alert:

The indication of the potential existence of a problem.

Notification:

Communication of a state change.

May be an alert.

Alarm:

An indication to a human operator highlighting the potential presence of a problem.

The alarm state change is an event.

Transient:

A state, considered as a problem, that persists for a limited amount of time before becoming resolved without direct action by an operator or control system.

Intermittent:

A state that is not maintained, but keeps occurring in some meaningfully short time frame.
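The distinction between transient and intermittent can be illustrated with a short sketch. It rests on several assumptions that are not part of the definitions: the thresholds `max_duration` and `recurrence_window` are hypothetical parameters standing in for "a limited amount of time" and "some meaningfully short time frame", every interval is assumed to have cleared without action by an operator or control system, and the labels "none" and "unclassified" are introduced only for the sketch.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start, end) of a problem state, in seconds


def classify(intervals: List[Interval],
             max_duration: float,
             recurrence_window: float) -> str:
    """Classify observed problem-state intervals (illustrative rule only)."""
    if not intervals:
        return "none"
    ordered = sorted(intervals)
    # Gap between the end of one interval and the start of the next.
    gaps = [nxt[0] - cur[1] for cur, nxt in zip(ordered, ordered[1:])]
    if any(gap <= recurrence_window for gap in gaps):
        # The state keeps recurring within a short time frame.
        return "intermittent"
    if all(end - start <= max_duration for start, end in ordered):
        # Each occurrence cleared within the allowed time and did not soon recur.
        return "transient"
    return "unclassified"


single_blip = classify([(0.0, 2.0)], max_duration=5.0, recurrence_window=60.0)
flapping = classify([(0.0, 2.0), (10.0, 12.0), (20.0, 21.0)],
                    max_duration=5.0, recurrence_window=60.0)
long_outage = classify([(0.0, 100.0)], max_duration=5.0, recurrence_window=60.0)
```

Here `single_blip` is classified as transient, `flapping` as intermittent, and `long_outage` as neither. In practice the thresholds would depend on the resource and on the operator's perspective.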

Cause:

The activity, event, etc. that gives rise to an (undesired) event, condition, or behavior.

Detect:

To notice the presence of something (state, activity, form, etc.).

Hence also to notice a change (from the perspective of the viewer).

Condition:

The state of something with regard to its working order.

Here, this term is used where the state is an issue with operation. For example, "signal degraded" is a condition that indicates an issue with the operation.

Security Considerations

This document specifies terminology and has no direct effect on the security of implementations or deployments. However, protocol solutions and management models need to be aware of several aspects:

The exposure of information pertaining to incidents may make available knowledge of the internal workings of a network (in particular its vulnerabilities) that may be of use to an attacker.
Systems that generate management information (messages, notifications, etc.) when incidents occur may be attacked by causing them to generate so much information that the management system is swamped and unable to properly manage the network.
Reporting false information about incidents (or masking reports of incidents) may cause the management system to function incorrectly.

Privacy Considerations

In general, Incident Management will not expose information about end-user activities or user data. The main privacy concern is for a network operator to keep control of all information about incidents to protect their privacy and the details of how they operate their network.

IANA Considerations

This document makes no requests for IANA action.