Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prometheus receiver should not use process_start_time_seconds to set start time. #7411

Closed
dashpole opened this issue Jan 27, 2022 · 15 comments · Fixed by #7780
Closed

Prometheus receiver should not use process_start_time_seconds to set start time. #7411

dashpole opened this issue Jan 27, 2022 · 15 comments · Fixed by #7780
Assignees
Labels
comp:prometheus Prometheus related issues

Comments

@dashpole
Copy link
Contributor

The Problem

Support for process_start_time_seconds was added in open-telemetry/opentelemetry-collector#394 in order to simplify setting start time for prometheus metrics. However, as pointed out in open-telemetry/opentelemetry-specification#2266 (review), this isn't correct in the general case.

Describe the solution you'd like

We should deprecate and remove the use_start_time_metric and start_time_metric_regex options from the prometheusreceiver. All metrics cumulative metrics collected by prometheus should follow the specification for handling unknown start times: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/datamodel.md#cumulative-streams-handling-unknown-start-time.

We should plan on a slow deprecation of the feature, to make sure we understand the use-cases users are using it for, and can offer them reasonable replacements. To accomplish this, we should use a feature gate. If the feature gate is enabled, setting either of the deprecation options in the config will return an error. When the feature reaches beta (enabled by default), we should let it remain in beta for a few releases to ensure we can collect feedback from users while they can still disable the feature gate (which would allow them to continue using the feature).

Describe alternatives you've considered
We could consider keeping it, but it does not correctly handle many common cases.

cc @open-telemetry/wg-prometheus

@dashpole
Copy link
Contributor Author

dashpole commented Feb 2, 2022

During alpha: we should log a warning.

@quentinmit
Copy link
Contributor

This config option is important for correctly interpreting counter metrics, but it can't be enabled by default because some counters might be coming from outside the process being scraped.

Why does the fact that it's not always the right thing to do mean the option needs to be removed entirely? It is useful and correct for the typical use case of scraping a user process exposing in-process counters.

@Aneurysm9
Copy link
Member

It's not necessarily correct for in-process counters as new counters can be created after the process start time.

@quentinmit
Copy link
Contributor

quentinmit commented Feb 8, 2022

It's not necessarily correct for in-process counters as new counters can be created after the process start time.

The process start time is still correct for those counters; a counter that doesn't exist yet hasn't counted any events, so it's valid to count the time between the process start time and the counter's creation as part of the time interval. A counter point with a time interval of [start,end] is reporting the count of events that happened in the window [start,end]. The correct value is the same whether the start time is the process start time or the counter's instantiation time.

The only case where this is wrong is when there are events in the window of [process start,counter instantiation] that weren't counted or events before process start that are included in the count, such as when the counter is being read from outside the process. That's why process_start_time_seconds can't always be used as the start time without knowledge of the process being scraped.

@dashpole
Copy link
Contributor Author

dashpole commented Feb 9, 2022

We discussed this at the workgroup today. We should not remove this functionality, as it does work in some use-cases. As is the case today, it should be disabled by default. We should update documentation with a warning to highlight cases where it doesn't work correctly.

@dashpole dashpole self-assigned this Feb 9, 2022
@jmacd
Copy link
Contributor

jmacd commented Feb 9, 2022

I was able to attend part of this discussion. There are other refinements that can be pursued, if users are truly interested. In the https://github.com/lightstep/opentelemetry-prometheus-sidecar I faced a similar question, and we can refine the answer based on whether you've seen the series reset or not.

  1. The first time you see a target produce a series, use the process start time as the start time of the series.
  2. For any Prometheus Counter value received that has a value less than the last-reported value for that series, reset the start time to the last-observed timestamp.

This requires state in the receiver, but the same state is needed to track staleness, so I'd guess it's not much cost to add this sort of variation.

@Aneurysm9
Copy link
Member

  • The first time you see a target produce a series, use the process start time as the start time of the series.
  • For any Prometheus Counter value received that has a value less than the last-reported value for that series, reset the start time to the last-observed timestamp.

The receiver performs this calculation by default currently. The issue in question here is the alternative mechanism it has for determining metric start time, which is using a gauge value from a metric in the scrape, named process_start_time_seconds by default but configurable to use any metric matching a provided regex.

As for staleness, the Prometheus scrape manager handles that for us and the receiver merely passes it through.

@quentinmit
Copy link
Contributor

  • The first time you see a target produce a series, use the process start time as the start time of the series.
  • For any Prometheus Counter value received that has a value less than the last-reported value for that series, reset the start time to the last-observed timestamp.

The receiver performs this calculation by default currently. The issue in question here is the alternative mechanism it has for determining metric start time, which is using a gauge value from a metric in the scrape, named process_start_time_seconds by default but configurable to use any metric matching a provided regex.

As for staleness, the Prometheus scrape manager handles that for us and the receiver merely passes it through.

This calculation as described is not correct. You can never use the process start time as the start time of a point unless specifically requested, because the counter might come from outside the process (as discussed upthread). The best you can do is "now" (the same as what happens when a value is seen to decrease), but you need to make sure that you also subtract the observed value for that point from all subsequent points. i.e. if you don't have a start time metric and your scrapes find:

t=5 5
t=10 10
t=15 3
t=20 6

then the only points you can emit are

[start=5,end=10] 5
[start=15,end=20] 3

It's not correct to emit [start=5,end=5] 5, nor is it correct to emit [start=5,end=10] 10. The former point is claiming that 5 events happened in 0 time, which is an infinite rate. The latter is claiming that 10 events happened from 5 to 10, but they didn't (we know 5 of them happened at some time before 5).

I believe the relevant portion of the spec is:

Sums in OTLP consist of the following:
<snip>
A set of data points, each containing:
<snip>
A time window (of (start, end]) time for which the Sum was calculated.

If the value includes data outside the time window, then the value is not "calculated" for that time window.

This isn't something that can be done by a separate processor/downstream system because that requires the downstream code to maintain indefinite state.

It's also worth noting for completeness that "reset detection" in this form is still not foolproof; if the counter is reset but then increments past the previous value before the next scrape, we will miss the fact that the counter reset. I think the only way to resolve that is to have an explicit start time. (I believe a native Prometheus server would be similarly confused.)

@jmacd
Copy link
Contributor

jmacd commented Feb 10, 2022

@quentinmit By the way, see https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/datamodel.md#cumulative-streams-handling-unknown-start-time for a description of the following.

It's not correct to emit [start=5,end=5] 5

The data model contradicts that statement. It's saying that at time 5 the value was 5, and the observation window also started at time 5. The contribution to the rate by this point is zero events over zero time.

This doesn't invalidate the statement under discussion. Using the process start time is at best correct once, and in a stateless environment where your server may restart w/o the above-mentioned memory, then double-counting will surely result. That's why the discussion about inserting true reset points, which is to zero out the observations: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/datamodel.md#cumulative-streams-inserting-true-reset-points

@quentinmit
Copy link
Contributor

@quentinmit By the way, see https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/datamodel.md#cumulative-streams-handling-unknown-start-time for a description of the following.

It's not correct to emit [start=5,end=5] 5

The data model contradicts that statement. It's saying that at time 5 the value was 5, and the observation window also started at time 5. The contribution to the rate by this point is zero events over zero time.

It's possible to interpret from a [start=5,end=5] 5 point as "zero rate", but it's impossible to reliably interpret a subsequent [start=5,end=10] 10 point without having state from the "zero rate" point - the spec you linked covers this as

If the first point in an unknown start-time reset sequence is lost, the consumer of this data might overcount the rate contribution of the second point, as it then appears like a "true" reset.

This isn't just something that happens in exceptional conditions; any setup with load balancing can process a subset of the points through different paths that don't necessarily have access to earlier state. The only reliable place to handle this is in the receiver - as described in the spec at "inserting true reset points".

IMO it would be better to simply not populate the state timestamp than to populate it with an arbitrary time; you still wouldn't be able to reliably tell when the stream resets, but at least downstream processors would be able to distinguish such streams from streams that do have true reset points.

This doesn't invalidate the statement under discussion. Using the process start time is at best correct once, and in a stateless environment where your server may restart w/o the above-mentioned memory, then double-counting will surely result. That's why the discussion about inserting true reset points, which is to zero out the observations: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/datamodel.md#cumulative-streams-inserting-true-reset-points

I think we agree that "inserting true reset points" is the most reliable approach to take. I don't understand your "using the process start time is at best correct once", though - if you know the counter is valid for the lifetime of the process, why is it not okay to continue using that after a restart?

@jmacd
Copy link
Contributor

jmacd commented Feb 11, 2022

@quentinmit Thanks for adding depth. You're right, I stand corrected.

The distinction that I was maybe reaching for without saying so, is that it depends whether the processor is a "single-writer" or not. Inserting true reset points can't be done in a multi-path collection infrastructure without breaking the single-writer principle.

If process or timeseries resets can happen, and if a single-writer is performing true reset points after the first reset, then it can't safely use the process start time (i.e., not even once).

This was debated -- some -- a year ago. I welcome you to file a new issue to discuss whether unknown start time should be handled differently. I'm happy with the current state, I prefer it to having to use a heuristic to infer start time downstream of unknown resets. I view the point with coordinate [Start=X, End=X] as a better solution than that heuristic, but I'm aware of the conflict or ambiguity here. If we view the cumulative value as significant in itself, as opposed to viewing it primarily as a first derivative that expresses a rate, then some of the logic here fails.

@quentinmit
Copy link
Contributor

@quentinmit Thanks for adding depth. You're right, I stand corrected.

The distinction that I was maybe reaching for without saying so, is that it depends whether the processor is a "single-writer" or not. Inserting true reset points can't be done in a multi-path collection infrastructure without breaking the single-writer principle.

I am confused a bit by your distinction. I think inserting true reset points is something that must be done at the beginning of the pipeline, where it is still single-writer, before those points are sent downstream to a multi-path collection infrastructure. Once you're processing points in a multi-path scenario, you can no longer insert true reset points or properly interpret values that did not have true reset points already inserted, hence why I'm discussing it here in the Prometheus receiver.

If process or timeseries resets can happen, and if a single-writer is performing true reset points after the first reset, then it can't safely use the process start time (i.e., not even once).

What is the scenario where a "process reset" doesn't also change the process start time? I think we violently agree that process start time can't be used if timeseries resets are not correlated with process resets.

This was debated -- some -- a year ago. I welcome you to file a new issue to discuss whether unknown start time should be handled differently. I'm happy with the current state, I prefer it to having to use a heuristic to infer start time downstream of unknown resets. I view the point with coordinate [Start=X, End=X] as a better solution than that heuristic, but I'm aware of the conflict or ambiguity here. If we view the cumulative value as significant in itself, as opposed to viewing it primarily as a first derivative that expresses a rate, then some of the logic here fails.

As I see it, the current state requires that downstream consumers implement a heuristic to interpret points because they are indistinguishable if the start time might not be "true". I agree this is in conflict with having the cumulative value be significant on its own. I'm curious if you have any examples where a value makes sense to be interpreted both as a rate and a gauge? I suspect many of these instances are actually cases where the counter should have been exposed as a gauge to begin with.

Which repo should I file that issue in? I was discussing it here because we're discussing the algorithm the Prometheus receiver could use, which could be a subset of what is allowed in the overall spec.

If we're willing to actually discuss changes at the spec level, it occurs to me that you could get the best of all of these worlds if start_time were paired with start_value inside each point. Then it would be unambiguous how to interpret each point, and consumers could subtract the start_value if they want to get a "true reset" stream, even in a multi-path scenario. I don't know what the appetite would be for adding a field like that to the protos, though.

@jmacd
Copy link
Contributor

jmacd commented Feb 14, 2022

@quentinmit

I realize that I thought you were arguing a different position here, and now I'd recommend that you bring this discussion to the larger group.

Please file an issue in the specification repository, https://github.com/open-telemetry/opentelemetry-specification/issues.

It would be Great IMO to specify a resource attribute that carries the start time, and there are several ways it can be done w/ a semantic convention.

My favorite proposal is to specify a special monotonic cumulative Sum stream named alive that MUST be reset when a process restarts, MUST have the value 1, and MUST have start_time set accordingly. Then for any metric data point with that resource, you know the correct start time. (Waving my hands a bit.)

@jmacd
Copy link
Contributor

jmacd commented Feb 14, 2022

What is the scenario where a "process reset" doesn't also change the process start time?

I believe we are discussing a scenario where the container reports a start time when the container is created, but there's a process inside it that can crash and restart w/o restarting the container. Then you're likely to see counters reset w/o changing the process_start_time.

@quentinmit
Copy link
Contributor

Please file an issue in the specification repository, https://github.com/open-telemetry/opentelemetry-specification/issues.

I'll work on that.

It would be Great IMO to specify a resource attribute that carries the start time, and there are several ways it can be done w/ a semantic convention.

My favorite proposal is to specify a special monotonic cumulative Sum stream named alive that MUST be reset when a process restarts, MUST have the value 1, and MUST have start_time set accordingly. Then for any metric data point with that resource, you know the correct start time. (Waving my hands a bit.)

I think these are both solving different problems (both worthy problems to solve, but I don't think they're directly applicable to the issue at hand). A separate stream doesn't help because there's no guarantee that separate metrics are kept together in a multi-path environment, and as noted above the process start time is not necessarily the correct start time for a metric. A resource attribute could work but again each metric could have a different start time.

What is the scenario where a "process reset" doesn't also change the process start time?

I believe we are discussing a scenario where the container reports a start time when the container is created, but there's a process inside it that can crash and restart w/o restarting the container. Then you're likely to see counters reset w/o changing the process_start_time.

If a metric is reporting the container start time then it's not process_start_time, it's container_start_time :) (and it wouldn't actually be exported by the process itself either way, no? Something in the container runtime would presumably be reporting the container start time.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
comp:prometheus Prometheus related issues
Projects
None yet
4 participants