Semantic conventions for Uptime Monitoring #185

text/metrics/0185-uptime-monitoring-semantic-conventions.md

# Guide to Uptime Monitoring for Metric Semantic Conventions

A guide to best practices and use cases around uptime monitoring using metrics.

Original authors: Keith Jordy (@kjordy) and Quentin Smith (@quentinmit), adapted for OpenTelemetry by @jsuereth.

## Motivation

Why should we make this change? What new value would it bring? What use cases does it enable?

Comment (Member):
These are just questions and not "motivation".

Comment (Member):
I think this is just a copy of the template: https://github.com/open-telemetry/oteps/blob/main/0000-template.md
@jsuereth probably forgot to update this section :-)

Comment (Contributor Author):
I did, will update later in the week (a bit overloaded today)


## Explanation

Users often want to monitor the health of their long-lived tasks. However, what they mean when they use the word "up" is overloaded.

### Use Cases

We'd like to drive a set of metric conventions around uptime metrics for processes (and possibly other process-like systems) for OpenTelemetry, based around these common use cases:

- Graphing fleet-wide process age
- Alerting on restarts using uptime
- Alerting on current process health
- Alerting on crashing processes

#### Graphing fleet-wide process age

An individual process's uptime produces an "up and to the right" graph when its absolute value is plotted. Graphing many processes' uptimes, either as lines or a heatmap, can identify when there are fleet-wide events that cause processes to restart simultaneously.

This use case can also be met with a restart count metric, but "up and to the right" uptime graphs are more comfortable for users.

#### Alerting on restarts using uptime

Some customers want to configure an alert whenever their process (re)starts. Typically, they do this by configuring an alert that fires when the absolute value of the uptime (time since process start) is below a certain threshold.

This type of alert is considered bad practice because it is, by definition, non-actionable. By the time the uptime is reported as 0, the process has already started and is now running. The alert will self-close as soon as the process has been running for a few minutes.

We recommend in general not configuring alerts on this type of metric, but it is a common practice.

#### Alerting on current process health

Some customers want to configure an alert when a process is unhealthy, or the number of healthy processes falls below a certain level. They can do this using a boolean metric exported from each task that reports if it is currently healthy.

In general, we recommend that services prefer exposing more detailed metrics instead of boolean health. For example, a user should prefer an alert defined as "successful qps > 0.1" instead of an alert defined as "healthy = TRUE". The reason for this is that boolean health metrics essentially move the alert condition definition inside the process, removing the user's ability to control those conditions.

#### Alerting on crashing processes

Users want to know if their processes are restarting frequently (for example, as part of a crash loop). This is the hardest type of metric to report, because restarting processes may not be healthy enough to report that they are crashing.

When there is some kind of supervisor process that is separate from the process being monitored, this can be tracked by having the supervisor process report the number of times a process has restarted. Then an alert can be configured when the rate of that count exceeds a threshold (e.g. "process has restarted ≥ 5 times in one hour").

Since such supervisor processes often do not exist, other ways to partially achieve this are to configure an alert on the average uptime over a long window (e.g. "average uptime over a 60m window < 15m") and/or to alert when a metric is absent.

## Internal details

We propose the following metrics be used to track uptime within OpenTelemetry:

| Name | Description | Units | Instrument Type |
| ---------------------- | ---------------------------- | ----- | -----------------------------|
| *.uptime | Seconds since last restart | s | Asynchronous Gauge |

Comment (Member):
The general recommendations appear to say that .time suffix needs a dot. Should this be *.up.time if we follow the recommendation?

Comment (Member):
It seems *.time is normally used for things that have an additive property.

| *.health | Availability flag. | 1 | Asynchronous Gauge |

Comment (Member):
The paragraph above nicely tells the drawbacks of "health" as a metric. As opposed to that, I like Kubernetes's approach. The "liveliness" and "readyness" are easier to define precisely: "liveliness" means "I am alive, let me run" (and the opposite of that means "I am in trouble, need help, restart me, do something"), and "readyness" means "I can now accept a workload".
I have a hard time assigning a similarly precise meaning to the "Healthy" metric. What does "availability" mean?
Given this, should we perhaps avoid "health" as a metric altogether and instead use "ready" and "lively" (or whatever the names) metrics?

Comment (Contributor Author):
This is meant to be the readyness flag. We could split this into two with readyness being recommended, and liveliness being optional.

Comment:
I like the naming of readyness (readiness?) and liveliness, but if we're going down the road of multiple types, there actually could be an arbitrary number of health statuses. For example, a backend could serve internal and external clients, and be unready for external clients but ready for internal clients. Can we just call it readyness and say "add an attribute if there are multiple distinct ready states"?

| *.restart_count | Number of restarts. | 1 | Asynchronous Counter |

Comment (Member):
There is already a restart_count attribute defined for resources: https://github.com/open-telemetry/opentelemetry-specification/blob/a25d5f03ab58ecf88c09f635df97d2328b5ba237/specification/resource/semantic_conventions/k8s.md#container
It would be useful to tell how these two are related (if they are).

Comment (Contributor Author):
From a metric standpoint, having restart_count as a resource attribute is really really bad. I'm surprised no one commented on that, but I think it should be dropped, as it violates resource identity and causes high cardinality.

The TL;DR, though, is that restart_count is something you'd want to alert on, and for metric systems that means it needs to be a metric. They'd likely be the same value, but one is a resource attribute and the other is a metric data point.

Comment (Member):
> From a metric standpoint, having restart_count as a resource attribute is really really bad. I'm surprised no one commented on that, but I think it should be dropped, as it violates resource identity and causes high cardinality.

It was discussed, see open-telemetry/opentelemetry-specification#1945 (review)

The conclusion was that for a k8s container in a pod it is an identifying attribute, not a metric.

Comment:
The Kubernetes restart count is not actually part of a unique-across-time identifier:

https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.21/#containerstatus-v1-core

> The number of times the container has been restarted, currently based on the number of dead containers that have not yet been removed. Note that this is calculated from dead containers. But those containers are subject to garbage collection. This value will get capped at 5 by GC.

Arguably that means it's not even useful as a metric, but it's certainly not a unique identifier.



### Uptime
uptime is reported as a gauge whose value is the number of seconds that the process has been up. This is written as a gauge because users want the actual number of seconds since the last restart to satisfy the use cases above. Sums are not a good fit for these use cases because most metric backends tend to default cumulative monotonic sums to rate calculations and apply overflow handling that is undesired for this use case.

Sums report a total value that has accumulated over a time window; it is valid, for instance, to subtract the current value of a cumulative sum and reset the start timestamp to now. (OpenTelemetry's Prometheus receiver does this.)
An intended use case of a sum is to produce a meaningful value when aggregating away labels using sum. Such aggregations are not meaningful in the above use cases.
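
To make the gauge shape concrete, here is a minimal sketch of how a process could report this metric with the OpenTelemetry Go metrics API. The `process.uptime` instrument name and the meter name are illustrative assumptions; the table above only specifies the `*.uptime` suffix.

```go
package uptimeexample

import (
	"context"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/metric"
)

var processStart = time.Now()

// registerUptimeGauge observes seconds since process start at every
// collection cycle. Data is only exported if an SDK MeterProvider has
// been configured; otherwise the global meter is a no-op.
func registerUptimeGauge() error {
	meter := otel.Meter("uptime-example") // hypothetical meter name
	_, err := meter.Float64ObservableGauge(
		"process.uptime", // assumed name; only the "*.uptime" suffix is proposed
		metric.WithUnit("s"),
		metric.WithDescription("Seconds since last restart"),
		metric.WithFloat64Callback(func(_ context.Context, o metric.Float64Observer) error {
			o.Observe(time.Since(processStart).Seconds())
			return nil
		}),
	)
	return err
}
```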

Comment:
One thing implied here is that aggregating sums doesn't add together the seconds since last restart of all processes. Aggregating counters requires time windows to be aligned. That alignment changes the value in the sum to be for the new time window, which doesn't preserve the actual value of seconds since last restart as stated in the first paragraph of this section.


### Health
health is a GAUGE with a boolean value (or 0|1) that indicates whether the process is available. This satisfies the "Alerting on current process health" use case above. Health often reflects more than just whether the process is alive; e.g. a process that is in the middle of (re)loading data might affirmatively report FALSE during that time. Because metrics are sampled periodically, this metric isn't well suited to rapidly changing values (i.e. it is likely to miss a restart).
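
As an illustration only, here is a minimal sketch of a process reporting such a flag with the OpenTelemetry Go metrics API. The `service.health` instrument name and the readiness flag are assumptions for the example; the table above only specifies the `*.health` suffix.

```go
package healthexample

import (
	"context"
	"sync/atomic"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/metric"
)

// ready is flipped to true once the process can accept work, and back
// to false while it is, e.g., (re)loading data.
var ready atomic.Bool

func registerHealthGauge() error {
	meter := otel.Meter("health-example") // hypothetical meter name
	_, err := meter.Int64ObservableGauge(
		"service.health", // assumed name; only the "*.health" suffix is proposed
		metric.WithUnit("1"),
		metric.WithDescription("Availability flag."),
		metric.WithInt64Callback(func(_ context.Context, o metric.Int64Observer) error {
			if ready.Load() {
				o.Observe(1)
			} else {
				o.Observe(0)
			}
			return nil
		}),
	)
	return err
}
```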

### Restart Count
restart_count is a monotonic sum of the number of times that a process has restarted. This metric should be generated by an external observer of the system. The start timestamp of this metric is the start time of whatever process is observing restarts.
A process *may* report its own restarts, but this would likely need to be done via a DELTA sum that is aggregated by some external observer.
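
As a sketch of the external-observer pattern, the following assumes a supervisor written in Go that relaunches a child process and reports the cumulative restart total with the OpenTelemetry Go metrics API; the `process.restart_count` name and the supervisor wiring are illustrative assumptions.

```go
package supervisorexample

import (
	"context"
	"sync/atomic"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/metric"
)

// restarts is incremented (restarts.Add(1)) by the supervisor each time
// it relaunches the child process it is watching.
var restarts atomic.Int64

// registerRestartCounter reports the cumulative restart total. The sum's
// start timestamp is the supervisor's own start time, matching the
// semantics described above.
func registerRestartCounter() error {
	meter := otel.Meter("supervisor-example") // hypothetical meter name
	_, err := meter.Int64ObservableCounter(
		"process.restart_count", // assumed name; only the "*.restart_count" suffix is proposed
		metric.WithUnit("1"),
		metric.WithDescription("Number of restarts."),
		metric.WithInt64Callback(func(_ context.Context, o metric.Int64Observer) error {
			o.Observe(restarts.Load())
			return nil
		}),
	)
	return err
}
```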


## Trade-offs and mitigations

The biggest trade-off here is defining `uptime` metrics as non-monotonic sums vs. either pure gauges or monotonic sums. The fundamental question is whether default sum-based aggregation is meaningful for this metric, in addition to the default query capabilities of common backends for cumulative sums. The proposal trades off allowing an external observer to monitor uptime (with resets) against common assumptions about querying rates for cumulative sums.

Comment:
nit, small typo on non-montonic

Comment:
also fundmental => fundamental


## Prior art and alternatives

The biggest prior art in this space is Prometheus, which defines some built-in, uptime-like conventions across its ecosystem:

- `up` is a gauge with a `{0, 1}` value indicating whether Prometheus successfully scraped metrics from the target. If a task is known but can't be scraped, up is reported as `0`. If a task is unknown (e.g. the container is not scheduled), up is not reported.
- `process_uptime` is a counter reporting the number of seconds since the process has started. The Prometheus server does not actually store or use the type of a metric, so `process_uptime` is graphed as an absolute value despite being reported as a counter.
- Restart count is not a "global" or "built-in" feature. However, in Kubernetes deployments with kube-state-metrics, `kube_pod_container_status_restarts_total` reports the number of times a container has restarted.

There are obvious differences between the proposed `health` metric and the `up` metric in Prometheus. We believe these serve different use cases, but they can be complementary. Generally, `up` in Prometheus is reported by an external observer, while `health` can be reported by the process itself.

`uptime` lines up with Prometheus, but we allow non-monotonic sums so external observers can report uptime on behalf of a process.

`restart_count` lines up with what is done by kube-state-metrics in the Prometheus ecosystem.


## Open questions

Should OpenTelemetry specify `up` metrics as "exactly what Prometheus does"?

## Future possibilities

What are some future changes that this proposal would enable?