Metrics Start-time resource semantic convention #1273

Open
jmacd opened this issue Dec 3, 2020 · 5 comments
Labels
area:data-model For issues related to data model area:semantic-conventions Related to semantic conventions priority:p2 Medium priority level release:allowed-for-ga Editorial changes that can still be added before GA since they don't require action by SIGs spec:metrics Related to the specification/metrics directory spec:protocol Related to the specification/protocol directory

Comments

@jmacd
Contributor

jmacd commented Dec 3, 2020

What are you trying to achieve?

There has been some discussion about an Uptime metric. For example, the OpenTelemetry-Go runtime instrumentation includes one:

https://github.com/open-telemetry/opentelemetry-go-contrib/blob/d1534b84593e617bff9a848454a992a7af49385c/instrumentation/runtime/runtime.go#L122

There is a related request in #1078 for an up metric, meaning something like "was able to produce metrics". The uptime metric is different and can be used, for example, to monitor process longevity. The question is whether we should standardize a semantic-conventional metric name for uptime.

However, note that when we know the process start time, we can deduce the uptime, provided we know that the process was up. Logically, the up metric and a process.start_time resource combine so that we can synthesize a process.uptime metric.
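As a rough sketch of that deduction, assume an up sample is a 0/1 value with a timestamp and that process.start_time is available as Unix seconds. The function name and the representation are illustrative only and are not part of any OpenTelemetry API:

```python
from typing import Optional


def synthesize_uptime(up: int, observed_at: float, process_start_time: float) -> Optional[float]:
    """Derive an uptime value in seconds from an `up` sample and a start-time resource.

    Returns None when the process was not known to be up at `observed_at`.
    All names are hypothetical; timestamps are Unix seconds.
    """
    if up != 1:
        return None  # no evidence the process was running, so uptime is unknown
    return observed_at - process_start_time


# A process that started at t=1000 and reported up=1 at t=1060.
uptime_when_up = synthesize_uptime(1, 1060.0, 1000.0)
uptime_when_down = synthesize_uptime(0, 1060.0, 1000.0)
print(uptime_when_up, uptime_when_down)
```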

I've encountered a reason to prefer a start_time resource and an up metric over a process.uptime metric, stated as follows.

An UpDownSumObserver instrument writes an OTLP Non-Monotonic Cumulative Sum data point, and there is a well-defined conversion to Gauge for systems such as Prometheus that do not recognize Non-Monotonic Cumulatives. An UpDownCounter instrument writes an OTLP Non-Monotonic Delta Sum data point in the Stateless export configuration, but it is converted to a Cumulative in the default configuration. As long as the state that we maintain in an SDK for Delta-to-Cumulative conversion is never reset, the consumer of an OTLP Non-Monotonic Cumulative Sum (OTLP-NMCS) data point sees no difference between one that was originally an UpDownSumObserver and one that was originally an UpDownCounter.
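To make the Delta-to-Cumulative state concrete, here is a minimal sketch of the kind of accumulator an SDK or sidecar might keep. The class and field names are hypothetical and do not reflect any actual SDK implementation:

```python
class DeltaToCumulative:
    """Accumulates non-monotonic delta points into a running cumulative total."""

    def __init__(self, start_time: float) -> None:
        self.start_time = start_time  # when this accumulation state was created
        self.total = 0

    def add(self, delta: int) -> int:
        """Apply one delta (which may be negative) and return the new cumulative value."""
        self.total += delta
        return self.total


# Deltas as an UpDownCounter might report them in a stateless export.
conv = DeltaToCumulative(start_time=1000.0)
cumulative = [conv.add(d) for d in [5, -2, 4]]
print(cumulative)  # [5, 3, 7]
```

If `conv` is ever replaced by a fresh instance, the running total restarts from zero and `start_time` moves forward; that is the reset case.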

If we move the Delta-to-Cumulative conversion out of the process (e.g., into a sidecar), then there may be a difference between an OTLP-NMCS that was reset and one that was never reset. We could use the start-time resource to detect this difference. This feels significant because ultimately, if the user is going to view a Cumulative Sum as its current, total value, then we should know whether it is cumulative from the beginning of the process or cumulative from an arbitrary reset point. In a user interface for an OTLP-NMCS timeseries, I would consider generating an error saying that, for Non-Monotonic Sums that have been reset, you should only use Rate views, not Total views.
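A sketch of how a consumer might apply this check, assuming both the data point's start time and process.start_time are available as comparable Unix timestamps. The function names and the view labels are made up for illustration, and a real implementation would likely need a tolerance for clock skew:

```python
from typing import List


def was_reset(point_start_time: float, process_start_time: float) -> bool:
    """True when the cumulative began accumulating after the process started."""
    return point_start_time > process_start_time


def allowed_views(point_start_time: float, process_start_time: float) -> List[str]:
    """Views that are meaningful for a Non-Monotonic Cumulative Sum."""
    if was_reset(point_start_time, process_start_time):
        # The total no longer reflects the whole process lifetime.
        return ["rate"]
    return ["rate", "total"]


print(allowed_views(1000.0, 1000.0))  # never reset
print(allowed_views(1500.0, 1000.0))  # reset after process start
```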

Concretely speaking the proposed semantic convention would be named process.start_time and would be documented here.

@jmacd jmacd added area:semantic-conventions Related to semantic conventions spec:metrics Related to the specification/metrics directory release:allowed-for-ga Editorial changes that can still be added before GA since they don't require action by SIGs labels Dec 3, 2020
@jmacd
Contributor Author

jmacd commented Dec 4, 2020

Another reason for this semantic convention worth mentioning:

There is interest in translating Prometheus Remote Write streams into OTLP streams, where data points with Cumulative temporality SHOULD have a start time. Traditional Prometheus reporting does not include this information, so it uses a reset heuristic to detect when cumulative series are reset. When a process.start_time resource is present, Prometheus Remote Write streams can be converted to OTLP streams with correct start times (note: this also requires the recently added Prometheus Remote Write metadata support).
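The two cases can be sketched as follows for a monotonic cumulative series. This is an illustration under assumptions, not the behavior of any particular translator: samples are `(timestamp, value)` pairs, and in the heuristic branch the timestamp of the first sample after a detected drop is used as a best-guess start time.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]          # (timestamp, cumulative value)
Point = Tuple[float, float, float]    # (start_time, timestamp, cumulative value)


def assign_start_times(samples: List[Sample],
                       process_start_time: Optional[float] = None) -> List[Point]:
    """Attach a start time to each cumulative sample (hypothetical helper)."""
    points: List[Point] = []
    start = process_start_time
    prev_value: Optional[float] = None
    for ts, value in samples:
        if process_start_time is None:
            # Reset heuristic: the first sample, or a drop in value, marks a new start.
            if prev_value is None or value < prev_value:
                start = ts
        points.append((start, ts, value))
        prev_value = value
    return points


samples = [(10.0, 5.0), (20.0, 8.0), (30.0, 2.0)]
print(assign_start_times(samples))                            # heuristic start times
print(assign_start_times(samples[:2], process_start_time=3.0))  # known start time
```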

@andrzej-stencel
Member

@jmacd we discussed this briefly during today's SIG Spec meeting. Is my understanding correct that today you would go with the "more traditional" approach of a process.uptime metric vs. the process.start_time attribute?

Maybe we can add both process.uptime metric name and process.start_time attribute to the convention? After all, this does not mean that OpenTelemetry must emit those, it just specifies the canonical name for these things; whether or not these will be used by an OT component is not a concern here. Is this correct? Example: Assuming there's a metrics generation engine that generates the "process uptime" metric (e.g. Telegraf) and a user wants to collect metrics from that engine with OT, that would help them define the OT name for it. Same with the "process start time" attribute. Does it make sense?

@jmacd
Contributor Author

jmacd commented Sep 20, 2022

Yes. I agree that both specifications are good to have.

process.uptime: defined as a non-monotonic counter to signal that reset is not meaningfully permitted
process.start_time: an attribute with a start timestamp (in a specified format)

It would be nice to establish a semantic connection between these; that is the suggestion made in this issue originally. If you are holding a Span object with a process.start_time resource, you may infer semantically that the process had an uptime of Span.start_time - Resource[process.start_time] when the span started and Span.end_time - Resource[process.start_time] when it finished.
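The inference can be sketched like this, assuming for illustration that span times and process.start_time are all Unix nanoseconds (the attribute's actual format is not settled in this thread). The `Span` class here is a stand-in, not the SDK type:

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class Span:
    """Stand-in for an SDK span; times are Unix nanoseconds (an assumption)."""
    start_time: int
    end_time: int
    resource: Dict[str, int] = field(default_factory=dict)


def uptime_at_span_bounds(span: Span) -> Tuple[int, int]:
    """Process uptime when the span started and when it finished."""
    process_start = span.resource["process.start_time"]
    return span.start_time - process_start, span.end_time - process_start


span = Span(start_time=5_000, end_time=7_500, resource={"process.start_time": 1_000})
uptime_at_start, uptime_at_end = uptime_at_span_bounds(span)
print(uptime_at_start, uptime_at_end)
```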

@andrzej-stencel
Member

andrzej-stencel commented Sep 26, 2022

I've just noticed that the Elastic Common Schema defines this as process.start with a value of e.g. 2016-05-23T08:05:34.853Z, i.e. a UTC, ISO-formatted timestamp (with millisecond precision, or perhaps with undefined precision?). Perhaps this is the way to go, what do you think @jmacd?
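For reference, a value in that shape can be read with the standard library alone. This sketch assumes the string always carries fractional seconds and a trailing `Z`; the helper name is made up:

```python
from datetime import datetime, timezone


def parse_process_start(value: str) -> datetime:
    """Parse a UTC timestamp such as 2016-05-23T08:05:34.853Z (hypothetical helper)."""
    naive = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    return naive.replace(tzinfo=timezone.utc)  # the trailing Z means UTC


start = parse_process_start("2016-05-23T08:05:34.853Z")
print(start.isoformat())
```

`%f` accepts one to six fractional digits, so the same format string covers millisecond and microsecond precision, but not nanoseconds.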

@dmitryax
Member

dmitryax commented Sep 27, 2022

I like process.start_time as the more readable option. We also have a similar resource attribute, k8s.pod.start_time, in the collector's k8s processor; it should be defined in the spec as well.

4 participants