-
Notifications
You must be signed in to change notification settings - Fork 897
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Semantic conventions for up
metric on Resources. #1078
#1102
Semantic conventions for up
metric on Resources. #1078
#1102
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I love this simple proposal.
I have one question for the group. Should we specify exporter behavior with a SHOULD?
They SHOULD inherit the attributes of the encapsulating Resource ...
If not, maybe we should make this a MAY?
@arminru I think I addressed your feedback. Thank you for the review! |
@open-telemetry/specs-metrics-approvers this is ready for review! |
|
||
## Metric Instruments | ||
|
||
The following metric instruments SHOULD be synthesized every time a Resource produces some |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am wondering if this SHOULD should be restricted to OTLP exporters? I think we explicitly do not want to export this for a Prometheus exporter, since the receiving end will synthesize up
.
We could elaborate on the purpose. Suppose a process starts a metrics SDK but the application registers no instrumentation packages and no instruments. A push OTLP exporter is installed, and now wants to push a single collection interval, consisting of 0 metrics. This metric exists to show the resource is up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok that is a good use case for an up metric with OTLP exporters that I hadn't thought of.
Regarding whether this SHOULD
should be restricted to OTLP exporters, what about the prometheus remote write exporter? That is non OTLP but I think we want the up metric.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. @cwildman yes! Maybe we need a term to say when this rule applies, like "original" to indicate when instrumentation is generated, particularly when it is first aggregated into an exposition format. I used the term "aggregated" thinking of the Collector: it will not apply the up
metric in any of its exporters but it could apply the metric in some of its receivers. OTel SDKs that export Prometheus remote write would apply this format, agreed.
|
||
| Name | Instrument | Units | Description | | ||
|----------------------|---------------|--------------|-------------| | ||
| `resource.up` | ValueRecorder | 1 | A value of `1` indicates the Resource is healthy, `0` if the Resource is unhealthy. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name resource.up
is great, but I don't think we should use the word "healthy". Maybe just "produced telemetry"?
For a push exporter, a resource that is not up means "no longer pushing". When might we see a push exporter explicitly push a zero for resource.up
? I was thinking through connection error states where some metric data is dropped. Should the resource.up
variable be set to zero when data is lost?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking as a UpDownSumObserver
. In a metrics pipeline we could drop resource attributes, lowering cardinality, which will naturally aggregate as a sum. This makes the resource.up
metric effectively a cardinality counter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"produced telemetry" makes sense to me.
For a push exporter, a resource that is not up means "no longer pushing". When might we see a push exporter explicitly push a zero for
resource.up
? I was thinking through connection error states where some metric data is dropped. Should theresource.up
variable be set to zero when data is lost?
If I'm following you correctly, isn't it a possibility that the resource.up
metric itself is lost as part of those connection errors? I feel like if that's the case this would be an unreliable signal.
I was thinking as a
UpDownSumObserver
. In a metrics pipeline we could drop resource attributes, lowering cardinality, which will naturally aggregate as a sum. This makes theresource.up
metric effectively a cardinality counter.
I'm curious why UpDownSumObserver
instead of just SumObserver
in that scenario if we never have a value less than zero?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like if that's the case this would be an unreliable signal.
True. I was thinking but didn't say anything about this case. The resource.up
metric might be treated specially inside an SDK (we might say MUST) in a way that it was buffered even when other timeseries are dropped. This kind of spec could be added in the ../protocol/exporter.md document, maybe.
There are differences in the push model, and one of the reasons we might not want to generate This means that we can standardize any one metric that MUST be builtin to OTel SDKs. In the OTel-Go contrib However an uptime metric doesn't solve one of the problems I was hoping to solve with this. An end user that has built up monitoring and alerting over Prometheus data is used to an Whether (1) or (2), the user's alerting and monitoring should keep working. We could logically synthesize the I worry a little about introducing a name conflict so early on. If OTel uses |
Totally agree, I tried to articulate this in the description but you've said it much clearer.
In case (1) shouldn't the Prometheus receiver be responsible for synthesizing the up metric since it is replacing the prometheus server which does that normally?
What exactly do you mean by standardize translation stages? I'm picturing we could have this as a default relabel rule for places where it's obviously necessary. For example a prometheus remote write exporter should make sure |
I think you've got the picture. This extra relabeling just just more complexity for the sake of using |
This PR was marked stale due to lack of activity. It will be closed in 7 days. |
@jmacd @bogdandrutu per your discussion at last weeks metrics meeting we have two semantic conventions we want to define:
I'd like to keep this PR focused on #1, the compatibility issue. Do we want a separate OTEL issue for #2? |
3caa9d1
to
94ef57b
Compare
This PR was marked stale due to lack of activity. It will be closed in 7 days. |
Closed as inactive. Feel free to reopen if this PR is still being worked on. |
@jmacd and @bogdandrutu, any thoughts on the above proposal and question from @cwildman? |
This PR was marked stale due to lack of activity. It will be closed in 7 days. |
Closed as inactive. Feel free to reopen if this PR is still being worked on. |
Fixes #1078
Changes
Describe the
up
metric on Resources.PR Description & Discussion
This PR lays the groundwork for an
up
metric to identify Resources that are healthy or unhealthy. There are a few open questions on where this metric should be generated and if it should have any labels. I'll mention those questions here but for now I tried to leave the semantic convention fairly open-ended.The
up
metric makes sense in a telemetry system using a pull model. This is because the component doing the pulling already has to keep state about which targets it is pulling metrics from and when. If a pull from a specific target ever fails, this component can synthesize theup
metric with a value of 0 indicating it is unhealthy.In a telemetry system using a push model, it's less clear how beneficial the
up
metric is. If the server side was to synthesize theup
metric in a push system, there are a few problems: 1. the server would be forced to keep state about the Resources that have reported so that it can generate the "unhealthy"up
metric when they stop reporting, 2. is a Resource that stops reporting always unhealthy? Or perhaps it simply no longer exists. If the client side was to synthesize theup
metric in a push system can it ever really report unhealthy? If the process pushing these metrics has died then it can't report as unhealthy. Then the indication that this Resource is unhealthy becomes anytime this metric is not present. That seems superfluous since you could use any metric from this Resource to try to determine that.In summary I think it would be useful to generate the
up
metric in receivers that are using a pull model. I am less convinced it makes sense to synthesize theup
metric in all exporters, where unhealthy just becomes the lack of presence of a metric. I could see generating theup
metric in prometheus exporters that push data (prometheus remote write) for backwards compatibility. A prometheus exporter that simply acts as a scrape target shouldn't need to synthesize up since the prometheus server will do it automatically.I agree that this metric doesn't have any explicit labels but, we at least want some attributes from the Resource this metric is reported for added to the metric when it is exported. Without those labels the metric is useless. Maybe that is obvious and doesn't need to be called out in the semantic convention.
Related issues #1078
Mentioning @jmacd and @bogdandrutu for reviews.