Metrics semantic convention: "up" metric #1078
Comments
I'd like to work on this one with @cwildman. You can assign it to me @andrewhsu. |
Wondering if we should mark this release:allowed-for-ga? This is an important step for a push-based metrics pipeline to produce data equivalent to what a pull-based metrics pipeline would produce, which is a prerequisite for existing Prometheus users (or a barrier to migration to OTel SDKs), but it is not a requirement (maybe?). |
Objective: Define how to transform an OTel-Collector Prometheus receiver (pull) AND identify a way to transform an OTel SDK's exporter (push) stream into Background:
|
NaNs are actually valid series values, too. Since there are many possible NaN representations, Prometheus defines two specific ones: "value NaNs" and "staleness NaNs" (see pkg/value). So a proper consumer of Prometheus data has to distinguish between these two NaN variants. In Prometheus itself that happens at read time in the query engine. In OT it would need to happen in the write path.
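To make the NaN distinction concrete, here is a minimal Python sketch of how a consumer might tell the two variants apart by bit pattern. The two constants are, to my understanding, the values Prometheus defines in its value package; the helper names `float_bits` and `classify_bits` are hypothetical and exist only for this illustration.

```python
import struct

# Bit patterns as (to my understanding) defined by Prometheus.
NORMAL_NAN_BITS = 0x7FF8000000000001  # "value NaN": an ordinary sample that is NaN
STALE_NAN_BITS = 0x7FF0000000000002   # "staleness NaN": marks a series as stale


def float_bits(x: float) -> int:
    """Return the IEEE 754 bit pattern of a Python float as an unsigned 64-bit int."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def classify_bits(bits: int) -> str:
    """Classify a 64-bit sample pattern as 'stale', 'nan', or 'value'.

    Hypothetical helper: the comparison must be done on raw bits, because
    every NaN compares unequal to every float, including itself.
    """
    if bits == STALE_NAN_BITS:
        return "stale"
    exponent = (bits >> 52) & 0x7FF
    mantissa = bits & ((1 << 52) - 1)
    if exponent == 0x7FF and mantissa != 0:
        return "nan"  # any other NaN is treated as a regular NaN-valued sample
    return "value"
```

The point of working on the integer pattern is that a write-path component cannot use `x != x` or `math.isnan` to separate the two cases; both return the same answer for either NaN.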
But also about many other things in between being configured and running correctly. To properly cover all cases, the OT collector would need to run some form of target discovery like in Prometheus and pre-initialize Whether the effort is worth it (to implement but more importantly for the user to configure additional target discovery for a single metric) is questionable. But without it I'm not sure it's a win for the user to have a simulated
Is In general the proposal means that the OT collector needs to track state for all series that have been passing through? That seems to give up on one of the major benefits of push-based clients. To reliably implement tracking (mostly it's garbage collection) it seems like some form of target discovery in the OT collector would be necessary.
Note that staleness markers are also set in Prometheus if a series disappears between two successful scrapes. So in principle one has to keep state and diff the series of one scrape against the last. (The Prometheus scrape library does this out-of-the-box.) |
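The "keep state and diff" step can be sketched as follows. This is only an illustration under stated assumptions: `ScrapeDiffer`, its `observe` method, and the use of plain strings as series identifiers are all hypothetical, and this is not how the Prometheus scrape library is actually structured.

```python
from typing import Dict, Iterable, Set


class ScrapeDiffer:
    """Remembers the series seen in the last successful scrape of each target."""

    def __init__(self) -> None:
        self._previous: Dict[str, Set[str]] = {}

    def observe(self, target: str, series: Iterable[str]) -> Set[str]:
        """Record a successful scrape and return the series that disappeared.

        The caller would write a staleness marker for each returned series.
        """
        current = set(series)
        disappeared = self._previous.get(target, set()) - current
        self._previous[target] = current
        return disappeared


differ = ScrapeDiffer()
differ.observe("target-a", {"http_requests_total", "queue_depth"})
gone = differ.observe("target-a", {"http_requests_total"})
# `gone` now holds the series that should receive a staleness marker.
```

Note that the per-target state never shrinks unless targets themselves are removed, which is the garbage-collection concern raised above.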
@fabxc Thank you! This is very helpful. It begins to look like the |
The problem at hand, also stated in open-telemetry/prometheus-interoperability-spec#8, may be connected with this thread about late-binding resources #1298 (comment). Suppose we replace the "Service Discovery" component in an OTel-Collector Prometheus receiver with an independent service discovery metrics receiver that simply produces the In the terms used in #1298, the |
@jmacd Let me see if I can rephrase this design in josh-bullet-points:
Is that the summary of the proposal? Questions:
|
@jsuereth Yes! I'm going to try to restate an example without referring to "receivers" other than OTLP. Everything pushes OTLP in this example; the collector receives only OTLP from external producers. Let's say the service discovery producer writes
An OpenMetrics "pusher" could subscribe to (a shard of) the In a successful example, the OpenMetrics component pushes
Both the service discovery and OpenMetrics components can be modeled as standalone producers of OTLP. Let's suppose that both of these metrics enter an OTel collector. A collector stage can be defined that joins the
The |
Ok, let me restate once again to verify my understanding of responsibilities: Service Discovery
SDK entrypoints
Collector
Implications / Questions
Is there a good way for us to move some of this discussion into a "proposal"/"design" document to verbage up sections/implications and considerations? I'm really having trouble with GH issues and tracking all the things I want to ask or add :) |
As far as semantics go it seems we are on the same page. But I still don't fully understand who would produce It sounds like Prometheus scrapes would also start emitting an I like the idea of a "service-discovery-exporter" for the Assuming a Prometheus backend (just because that's what I'm most familiar with), I'd see no issue for pull-based metrics to be accompanied by |
Is there a reason alive/present should be a part of the metrics spec? It seems like it's a generally useful concept for aliveness, service discovery and reporting resource attributes. Reporting alive/present as a metric is one of the use cases. For example, in PRW exporter's case, it could be turned into an "up" metric and PRW can also rely on this signal to identify staleness. But generally speaking, this topic is bigger than metrics cases and maybe we need a "service discovery and aliveness spec" to tackle the problem. Having said that, part of what alive/present can provide is already captured by service discovery, app directories and health. Is it fair to reposition OpenTelemetry to capture these use cases at this point? What's the cost of rethinking this problem and making it part of the data model in the long term? For the sake of simplicity and orthogonality, is it possible to start with producing the "up" at the PRW exporter and think about a more comprehensive solution in the long term? |
I am not well versed in the metrics world but I had to read this discussion three times before understanding |
We had a lot of discussion on this topic across a few SIGs. I'd like to call out a few points and what I think is the consensus.
Should we move this discussion to a "semantic convention" discussion given the above? |
About the terminology proposed above, "present" may not be the greatest term to describe which services are available. Other terms potentially: "available": I believe that Google's Monarch uses this term. |
FYI Lightstep prototyped the push-based metric described here in this collector branch: open-telemetry/opentelemetry-collector@main...lightstep:saladbar We will continue this effort and share here. (CC: @paivagustavo) |
If service discovery is determining part of this then there are at least preliminary checks being done. |
Closing this in favor of whatever is decided in open-telemetry/oteps#185 |
What are you trying to achieve?
up is a standard metric in Prometheus systems to indicate that a particular combination of job and instance was observed to be healthy. This comes from an active role taken by Prometheus servers in collecting metrics, but OpenTelemetry OTLP exporters can synthesize the same information on export to indicate that they are, in fact, up.
This issue proposes we introduce and specify this metric. Prometheus specifies this as a 0- or 1-valued metric labeled by the job and instance labels. In OpenTelemetry the natural expression of this would be a label-free metric named "up", again 0- or 1-valued, reported along with the monitored Resource.
Additional context.
This variable would be synthesized in receivers for other metrics protocols. For example, the OTel collector's Prometheus or OpenMetrics receiver would be changed to generate this metric when it scrapes a target.
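A rough sketch of what "generate this metric when it scrapes a target" could look like, assuming a receiver that wraps each scrape attempt. The `MetricPoint` type, the `scrape_with_up` function, and the resource attribute keys are hypothetical and only illustrate the idea; they are not the collector's API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class MetricPoint:
    name: str
    value: float
    # Identity travels with the Resource rather than as metric labels.
    resource: Dict[str, str] = field(default_factory=dict)


def scrape_with_up(
    resource: Dict[str, str],
    scrape: Callable[[], List[MetricPoint]],
) -> List[MetricPoint]:
    """Run one scrape and append a synthesized, label-free "up" point.

    "up" is 1 if the scrape succeeded and 0 if it raised.
    """
    try:
        points = scrape()
        up = 1.0
    except Exception:
        points = []
        up = 0.0
    return points + [MetricPoint("up", up, dict(resource))]


def failing_scrape() -> List[MetricPoint]:
    raise ConnectionError("target unreachable")


# Assumed resource attributes, chosen to mirror the Prometheus job/instance pair.
resource = {"job": "checkout", "instance": "10.0.0.7:9090"}
result = scrape_with_up(resource, failing_scrape)
# `result` contains a single "up" point with value 0.0 for this resource.
```

A push-based exporter could reuse the same shape by always emitting the "up" point with value 1 on export, since it has no failed-scrape case to observe for itself.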