Metrics Start-time resource semantic convention #1273

jmacd · 2020-12-03T23:37:13Z

What are you trying to achieve?

There has been some discussion about an Uptime metric. For example, the OpenTelemetry-Go runtime instrumentation includes one:

https://github.com/open-telemetry/opentelemetry-go-contrib/blob/d1534b84593e617bff9a848454a992a7af49385c/instrumentation/runtime/runtime.go#L122

There is a related request for an up metric, meaning something like "was able to produce metrics" in #1078. The uptime metric is different and can be used for monitoring process longevity, for example. There is a question of whether we should standardize a semantic-conventional metric name for uptime.

However, when we know the process start time, we can deduce the uptime, provided we know that the process was up. Logically, an up metric and a process.start_time resource combine so that we can synthesize a process.uptime metric.
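A minimal sketch of that deduction, for illustration only. It assumes `up` samples arrive as (timestamp, value) pairs with value 1 when the process reported, and that process.start_time is available as Unix seconds; both representations and the `synthesize_uptime` name are hypothetical, not part of any specification.

```python
from typing import Iterable, List, Tuple


def synthesize_uptime(
    start_time_unix: float,
    up_samples: Iterable[Tuple[float, int]],
) -> List[Tuple[float, float]]:
    """Derive (timestamp, uptime_seconds) points from `up` samples.

    Only samples where the process was observed up (value == 1) yield an
    uptime point; no uptime is claimed for a failed observation.
    """
    points = []
    for timestamp, up in up_samples:
        if up == 1 and timestamp >= start_time_unix:
            points.append((timestamp, timestamp - start_time_unix))
    return points
```

For a process started at t=100 with `up` samples at t=110 (up), t=120 (down), and t=130 (up), this yields uptime points only for t=110 and t=130.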

I've encountered a reason to prefer a start_time resource and an up metric over a process.uptime metric, stated as follows.

An UpDownSumObserver instrument writes an OTLP Non-Monotonic Cumulative Sum data point, and there is a well-defined conversion to Gauge in systems such as Prometheus that do not recognize Non-Monotonic Cumulatives. An UpDownCounter instrument writes an OTLP Non-Monotonic Delta Sum data point for the Stateless export configuration, but it is converted to a Cumulative in the default configuration. As long as the state that we maintain in an SDK for Delta-to-Cumulative conversion is never reset, there is no difference to the consumer of an OTLP Non-Monotonic Cumulative Sum (OTLP-NMCS) data point whether it was originally an UpDownSumObserver or an UpDownCounter.

If we move the Delta-to-Cumulative conversion out of the process (e.g., into a sidecar), then there may be a difference between an OTLP-NMCS that was reset and one that was never reset. We could use the start-time resource to detect this difference. This feels significant because ultimately, if the user is going to view a Cumulative Sum as its current, total value, then we should know whether it's the cumulative from the beginning of the process or cumulative from an arbitrary reset point. In a user interface for an OTLP-NMCS timeseries, I would consider generating an error to say that, for Non-Monotonic Sums that have been reset, you should only use Rate views, not Total views.
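The following sketch illustrates the idea under stated assumptions. `DeltaToCumulative` is a hypothetical out-of-process converter, not an OpenTelemetry API; timestamps are Unix seconds, and the one-second `tolerance` is an arbitrary choice made for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class CumulativePoint:
    value: float
    start_time: float  # Unix seconds at which accumulation began


class DeltaToCumulative:
    """Hypothetical sidecar state that sums deltas per series."""

    def __init__(self) -> None:
        self._state: Dict[str, CumulativePoint] = {}

    def add(self, series: str, delta: float, interval_start: float) -> CumulativePoint:
        point = self._state.get(series)
        if point is None:
            # First delta seen since this converter started or was reset.
            point = CumulativePoint(0.0, interval_start)
            self._state[series] = point
        point.value += delta
        return CumulativePoint(point.value, point.start_time)

    def reset(self) -> None:
        """Simulate losing converter state, e.g. a sidecar restart."""
        self._state.clear()


def total_view_is_valid(
    point: CumulativePoint, process_start_time: float, tolerance: float = 1.0
) -> bool:
    """True if the sum covers the whole process lifetime (within tolerance)."""
    return point.start_time - process_start_time <= tolerance
```

A point whose start time matches process.start_time can be shown as a total; after a converter reset the start time moves forward, and a UI could restrict that series to Rate views.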

Concretely speaking the proposed semantic convention would be named process.start_time and would be documented here.


jmacd · 2020-12-04T17:48:40Z

Another reason for this semantic convention worth mentioning:

There is interest in translating Prometheus Remote Write streams into OTLP streams, where data points with Cumulative temporality SHOULD have a start time. Traditional Prometheus reporting does not include this information, so a reset heuristic is used to detect when cumulative series are reset. When a process.start_time resource is present, Prometheus Remote Write streams can be converted to OTLP streams with correct start times (note: this also requires the recently added Prometheus Remote Write metadata support).
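A rough sketch of the difference, for illustration only. It assumes a monotonic cumulative series delivered as time-ordered (timestamp, value, process_start_time) tuples, where the last element is None when no start-time resource is available. The tuple layout and the `assign_start_times` name are hypothetical and do not reflect the Remote Write wire format.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float, Optional[float]]  # (timestamp, value, process_start_time)


def assign_start_times(samples: List[Sample]) -> List[Tuple[float, float, float]]:
    """Return (start_time, timestamp, value) for each cumulative sample."""
    out = []
    start: Optional[float] = None
    prev_value: Optional[float] = None
    for timestamp, value, process_start in samples:
        if process_start is not None:
            # The resource tells us exactly when accumulation began.
            start = process_start
        elif prev_value is None or value < prev_value:
            # Reset heuristic: the first sample, or a drop in value, is
            # treated as the beginning of a new cumulative sequence.
            start = timestamp
        out.append((start, timestamp, value))
        prev_value = value
    return out
```

The heuristic misses a restart whenever the counter climbs past its previous value before the next sample arrives, for example 5 followed by 7 after a restart; with process.start_time attached, that case is still detected.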

andrzej-stencel · 2022-09-20T15:47:59Z

@jmacd we discussed this briefly during today's SIG Spec meeting. Is my understanding correct that today you would go with the "more traditional" approach of a process.uptime metric vs. the process.start_time attribute?

Maybe we can add both process.uptime metric name and process.start_time attribute to the convention? After all, this does not mean that OpenTelemetry must emit those, it just specifies the canonical name for these things; whether or not these will be used by an OT component is not a concern here. Is this correct? Example: Assuming there's a metrics generation engine that generates the "process uptime" metric (e.g. Telegraf) and a user wants to collect metrics from that engine with OT, that would help them define the OT name for it. Same with the "process start time" attribute. Does it make sense?

jmacd · 2022-09-20T17:18:36Z

Yes. I agree that both specifications are good to have.

process.uptime: defined as a non-monotonic counter to signal that reset is not meaningfully permitted
process.start_time: an attribute with a start timestamp (in a specified format)

It would be nice to establish a semantic connection between these; that is the suggestion made in this issue originally. If you are holding a Span object with a process.start_time resource, you may infer semantically that the process had an uptime of Span.start_time - Resource[process.start_time] when the span started and Span.end_time - Resource[process.start_time] when it finished.
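That inference can be sketched as follows. The `Span` class here is a simplified stand-in written for illustration, not the OpenTelemetry API, and it assumes process.start_time is stored as Unix nanoseconds, a format this thread has not settled on.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class Span:
    """Simplified stand-in for a span; all times are Unix nanoseconds."""
    start_time: int
    end_time: int
    resource: Dict[str, int] = field(default_factory=dict)


def uptime_during_span(span: Span) -> Tuple[int, int]:
    """Return process uptime (ns) at span start and at span end."""
    process_start = span.resource["process.start_time"]
    return (span.start_time - process_start, span.end_time - process_start)
```

For a process started at 1 s and a span running from 5 s to 7 s, the inferred uptimes are 4 s at span start and 6 s at span end.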

andrzej-stencel · 2022-09-26T12:15:16Z

I've just noticed that the Elastic Common Schema defines this as process.start with a value of e.g. 2016-05-23T08:05:34.853Z, i.e. a UTC, ISO-formatted time stamp (with a millisecond precision, or perhaps with an undefined precision?). Perhaps this is the way to go, what do you think @jmacd?
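For reference, a timestamp in that shape can be produced and parsed with the Python standard library alone. This is only a sketch: the function names are hypothetical, and millisecond precision is an assumption taken from the example value above.

```python
from datetime import datetime, timezone


def format_start_time(unix_seconds: float) -> str:
    """Render a Unix timestamp as UTC ISO 8601 with millisecond precision."""
    dt = datetime.fromtimestamp(unix_seconds, tz=timezone.utc)
    return dt.isoformat(timespec="milliseconds").replace("+00:00", "Z")


def parse_start_time(text: str) -> datetime:
    """Parse a UTC ISO 8601 timestamp that ends in 'Z'."""
    # Replace the 'Z' suffix so older Python versions can parse it too.
    return datetime.fromisoformat(text.replace("Z", "+00:00"))
```

A string format is easy to read, while a numeric format such as Unix nanoseconds makes the uptime subtraction trivial; either works as long as the convention fixes one.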

dmitryax · 2022-09-27T16:41:53Z

I like process.start_time as the more readable option. We also have a similar resource attribute in the Collector's k8s processor, k8s.pod.start_time, that should be defined in the spec as well.

jmacd added the area:semantic-conventions (related to semantic conventions), spec:metrics (related to the specification/metrics directory), and release:allowed-for-ga (editorial changes that can still be added before GA since they don't require action by SIGs) labels Dec 3, 2020

andrewhsu assigned jmacd Dec 8, 2020

andrewhsu added the priority:p2 (medium priority level) and spec:protocol (related to the specification/protocol directory) labels Dec 8, 2020

jmacd mentioned this issue Dec 11, 2020

Semantic conventions for up metric on Resources. #1078 #1102

Closed

jmacd added the area:data-model (for issues related to data model) label Feb 4, 2021

jmacd mentioned this issue Apr 6, 2021

Metrics: Make StartTimeUnixNanos required for aggregation_temporality=CUMULATIVE open-telemetry/opentelemetry-proto#292

Closed

jmacd mentioned this issue Jul 25, 2022

Metrics SDK: define standard circuit breakers for uncollected, infinite cardinality recordings #1891

Closed

jmacd mentioned this issue Aug 5, 2022

Rename or remove runtime.uptime metric from instrumentation/runtime package open-telemetry/opentelemetry-go-contrib#2625

Closed

tigrannajaryan mentioned this issue Sep 20, 2022

Add system uptime metric open-telemetry/semantic-conventions#648

Closed

andrzej-stencel mentioned this issue Sep 23, 2022

Add process.start_time resource attribute to semantic conventions #2825

Closed

andrzej-stencel mentioned this issue Sep 26, 2022

[receiver/hostmetrics] Add process.start resource attribute open-telemetry/opentelemetry-collector-contrib#14479

Closed

jmacd mentioned this issue Nov 15, 2022

How to create an "up" metric #2923

Open
