-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prometheus receiver should not use process_start_time_seconds to set start time. #7411
Comments
During alpha: we should log a warning. |
This config option is important for correctly interpreting counter metrics, but it can't be enabled by default because some counters might be coming from outside the process being scraped. Why does the fact that it's not always the right thing to do mean the option needs to be removed entirely? It is useful and correct for the typical use case of scraping a user process exposing in-process counters. |
It's not necessarily correct for in-process counters as new counters can be created after the process start time. |
The process start time is still correct for those counters; a counter that doesn't exist yet hasn't counted any events, so it's valid to count the time between the process start time and the counter's creation as part of the time interval. A counter point with a time interval of The only case where this is wrong is when there are events in the window of |
We discussed this at the workgroup today. We should not remove this functionality, as it does work in some use-cases. As is the case today, it should be disabled by default. We should update documentation with a warning to highlight cases where it doesn't work correctly. |
I was able to attend part of this discussion. There are other refinements that can be pursued, if users are truly interested. In the https://github.com/lightstep/opentelemetry-prometheus-sidecar I faced a similar question, and we can refine the answer based on whether you've seen the series reset or not.
This requires state in the receiver, but the same state is needed to track staleness, so I'd guess it's not much cost to add this sort of variation. |
The receiver performs this calculation by default currently. The issue in question here is the alternative mechanism it has for determining metric start time, which is using a gauge value from a metric in the scrape, named As for staleness, the Prometheus scrape manager handles that for us and the receiver merely passes it through. |
This calculation as described is not correct. You can never use the process start time as the start time of a point unless specifically requested, because the counter might come from outside the process (as discussed upthread). The best you can do is "now" (the same as what happens when a value is seen to decrease), but you need to make sure that you also subtract the observed value for that point from all subsequent points. i.e. if you don't have a start time metric and your scrapes find:
then the only points you can emit are
It's not correct to emit I believe the relevant portion of the spec is:
If the value includes data outside the time window, then the value is not "calculated" for that time window. This isn't something that can be done by a separate processor/downstream system because that requires the downstream code to maintain indefinite state. It's also worth noting for completeness that "reset detection" in this form is still not foolproof; if the counter is reset but then increments past the previous value before the next scrape, we will miss the fact that the counter reset. I think the only way to resolve that is to have an explicit start time. (I believe a native Prometheus server would be similarly confused.) |
@quentinmit By the way, see https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/datamodel.md#cumulative-streams-handling-unknown-start-time for a description of the following.
The data model contradicts that statement. It's saying that at time 5 the value was 5, and the observation window also started at time 5. The contribution to the rate by this point is zero events over zero time. This doesn't invalidate the statement under discussion. Using the process start time is at best correct once, and in a stateless environment where your server may restart w/o the above-mentioned memory, then double-counting will surely result. That's why the discussion about inserting true reset points, which is to zero out the observations: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/datamodel.md#cumulative-streams-inserting-true-reset-points |
It's possible to interpret from a
This isn't just something that happens in exceptional conditions; any setup with load balancing can process a subset of the points through different paths that don't necessarily have access to earlier state. The only reliable place to handle this is in the receiver - as described in the spec at "inserting true reset points". IMO it would be better to simply not populate the state timestamp than to populate it with an arbitrary time; you still wouldn't be able to reliably tell when the stream resets, but at least downstream processors would be able to distinguish such streams from streams that do have true reset points.
I think we agree that "inserting true reset points" is the most reliable approach to take. I don't understand your "using the process start time is at best correct once", though - if you know the counter is valid for the lifetime of the process, why is it not okay to continue using that after a restart? |
@quentinmit Thanks for adding depth. You're right, I stand corrected. The distinction that I was maybe reaching for without saying so, is that it depends whether the processor is a "single-writer" or not. Inserting true reset points can't be done in a multi-path collection infrastructure without breaking the single-writer principle. If process or timeseries resets can happen, and if a single-writer is performing true reset points after the first reset, then it can't safely use the process start time (i.e., not even once). This was debated -- some -- a year ago. I welcome you to file a new issue to discuss whether unknown start time should be handled differently. I'm happy with the current state, I prefer it to having to use a heuristic to infer start time downstream of unknown resets. I view the point with coordinate [Start=X, End=X] as a better solution than that heuristic, but I'm aware of the conflict or ambiguity here. If we view the cumulative value as significant in itself, as opposed to viewing it primarily as a first derivative that expresses a rate, then some of the logic here fails. |
I am confused a bit by your distinction. I think inserting true reset points is something that must be done at the beginning of the pipeline, where it is still single-writer, before those points are sent downstream to a multi-path collection infrastructure. Once you're processing points in a multi-path scenario, you can no longer insert true reset points or properly interpret values that did not have true reset points already inserted, hence why I'm discussing it here in the Prometheus receiver.
What is the scenario where a "process reset" doesn't also change the process start time? I think we violently agree that process start time can't be used if timeseries resets are not correlated with process resets.
As I see it, the current state requires that downstream consumers implement a heuristic to interpret points because they are indistinguishable if the start time might not be "true". I agree this is in conflict with having the cumulative value be significant on its own. I'm curious if you have any examples where a value makes sense to be interpreted both as a rate and a gauge? I suspect many of these instances are actually cases where the counter should have been exposed as a gauge to begin with. Which repo should I file that issue in? I was discussing it here because we're discussing the algorithm the Prometheus receiver could use, which could be a subset of what is allowed in the overall spec. If we're willing to actually discuss changes at the spec level, it occurs to me that you could get the best of all of these worlds if |
I realize that I thought you were arguing a different position here, and now I'd recommend that you bring this discussion to the larger group. Please file an issue in the specification repository, https://github.com/open-telemetry/opentelemetry-specification/issues. It would be Great IMO to specify a resource attribute that carries the start time, and there are several ways it can be done w/ a semantic convention. My favorite proposal is to specify a special monotonic cumulative Sum stream named |
I believe we are discussing a scenario where the container reports a start time when the container is created, but there's a process inside it that can crash and restart w/o restarting the container. Then you're likely to see counters reset w/o changing the process_start_time. |
I'll work on that.
I think these are both solving different problems (both worthy problems to solve, but I don't think they're directly applicable to the issue at hand). A separate stream doesn't help because there's no guarantee that separate metrics are kept together in a multi-path environment, and as noted above the process start time is not necessarily the correct start time for a metric. A resource attribute could work but again each metric could have a different start time.
If a metric is reporting the container start time then it's not |
The Problem
Support for
process_start_time_seconds
was added in open-telemetry/opentelemetry-collector#394 in order to simplify setting start time for prometheus metrics. However, as pointed out in open-telemetry/opentelemetry-specification#2266 (review), this isn't correct in the general case.Describe the solution you'd like
We should deprecate and remove the
use_start_time_metric
andstart_time_metric_regex
options from theprometheusreceiver
. All metrics cumulative metrics collected by prometheus should follow the specification for handling unknown start times: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/datamodel.md#cumulative-streams-handling-unknown-start-time.We should plan on a slow deprecation of the feature, to make sure we understand the use-cases users are using it for, and can offer them reasonable replacements. To accomplish this, we should use a feature gate. If the feature gate is enabled, setting either of the deprecation options in the config will return an error. When the feature reaches beta (enabled by default), we should let it remain in beta for a few releases to ensure we can collect feedback from users while they can still disable the feature gate (which would allow them to continue using the feature).
Describe alternatives you've considered
We could consider keeping it, but it does not correctly handle many common cases.
cc @open-telemetry/wg-prometheus
The text was updated successfully, but these errors were encountered: