-
Notifications
You must be signed in to change notification settings - Fork 836
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ability to remove/flush unused attributes in metric instruments #2997
Comments
@legendecas There are at least four issues being discussed here and in #3105. If you're using delta temporality, you should be able to avoid memory buildup regardless of which attributes are in use. Memory buildup is a problem for the consumer downstream, in this case. If you're using cumulative temporality and you find a need to If you're using cumulative temporality and you simply want to strip attributes, so that less memory is needed in the long term, restarting (or reconfiguring) the SDK with a View that eliminates those attributes should give the correct results. That is the sense in which open-telemetry/opentelemetry-specification#1297 defines safe attribute removal for SDKs--if you've implemented views you already have the machinery in place to safely remove attributes. The final potential topic here is being discussed in open-telemetry/opentelemetry-specification#1891, question is what we want to do when cardinality is high and there are stale entries that could be removed from memory. Please join that discussion. We're able to do this with per-series start timestamps; however this complicates other things, and I'd like official guidance from Prometheus on this. |
When using delta temporality with an async instrument, the SDK needs to export delta values for the async instrument thus it has to remember the last reported metric streams: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/supplementary-guidelines.md#asynchronous-example-delta-temporality. |
Thanks @legendecas. You are correct. My statement about delta temporality only applies to synchronous instruments. This is the sense in which Lightstep would like to support a temporality preference named "stateless" that would configure delta temporality for synchronous and cumulative temporality for asynchronous instruments. As you point out, asynchronous instruments when configured with delta temporality have the same problem as synchronous instruments configured with cumulative temporality, making async/delta have roughly the same problem as discussed in open-telemetry/opentelemetry-specification#1891. Please head to that discussion, I will continue there with a takeaway from this one. |
@legendecas what would you think about documenting this issue and handling handling attribute removal after GA? I would like to get the GA release out before kubecon and this is the last big topic holding it. |
Given that the specification of metrics api is already declared stable, I'm fine with tagging this issue as after-GA. |
I've raised an issue for adding Delete() for an instrument in the API specifications: open-telemetry/opentelemetry-specification#3062 |
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days. |
Does this issue cover asynchronous instruments as well? I have encountered the issue where SDK keeps exporting metrics which no longer get reported by my callbacks: this.objectState = capiMeter().createObservableGauge('capi.object.state', {
valueType: ValueType.INT,
unit: '{state}',
description: 'The current state of an object'
});
this.objectState.addCallback(async (result) => {
for (const obj of await this.objectStorage.getMetrics()) {
result.observe(obj.active ? 1 : 0, {
'capi.object.name': obj.name
});
}
});
// SDK will keep exporting all metrics for all attribute sets that ever got observed inside the callback, not just the recent call |
@eplightning Yep, that's also something that would be covered by it. Does it keep reporting also when un-registering the callback (calling |
Doesn't seem to change anything: old metrics remain. Tested on Prometheus and console exporters: const cb = async (result: ObservableResult<ShardMetricLabels>) => {
for (const shard of await this.shardStorage.getMetrics()) {
result.observe(shard.active ? 1 : 0, {
'capi.shard.name': `${shard.name}`
});
}
};
this.metricShardState.addCallback(cb);
console.log('registered');
setTimeout(() => {
this.metricShardState.removeCallback(cb);
console.log('unregistered');
}, 1000 * 120); |
Spec update was recently merged which I believe concerns this issue as well (at least the async instruments): open-telemetry/opentelemetry-specification#3242 |
@eplightning thanks for mentioning this and for double-checking! I created an issue to implement the spec for asynchronous instruments. #4096 I think we can keep this issue here at |
@pichlermarc I was wondering what the status is on this one. There seems to be no way (outside of restarting the application) to flush any old attribute sets that were not observed for a long time. Is there any workaround for this that you know of? |
This is a problem for Kafka Consumer. Due to how they work, partitions are assigned dynamically. Also, metrics are commonly reported per partition. Due to this issue, when a partition is unassigned OpenTelemetry will keep reporting the "last known value" for the metrics of partitions for which the consumer that is no longer assigned. This causes all kind of problems. The only workaround we could find is to send "0" as the value for metrics corresponding to partitions that are not currently assigned. This is far from ideal, as 0 is actually a valid value. We end up having some logic when consuming the metrics so we can exclude the invalid values. |
@pkeuter there is currently no workaround for this. The way forward here is to implement #4095 (which would limit memory use as old metrics age out) or drive the specification issue open-telemetry/opentelemetry-specification#1297, then implement the outcome here. Contributions are welcome, but keep in mind that Implementing #4095 is likely non-trivial. |
How about using async instruments? Then you control which partition metrics is reported for. |
Removing |
This is such a fundamental limitation with the opentelemetry SDK (and specification) that I recommend my clients to skip the SDK for dynamic attributes and emits metrics datapoints themselves, e.g. by attaching a custom metrics producer using the experimental
|
Removing/flushing unused dynamic attributes in metric instruments is very important to not leaking memories.
Spec issue: open-telemetry/opentelemetry-specification#1297
Prometheus Suggestion: https://prometheus.io/docs/instrumenting/writing_clientlibs/#labels
Opening this issue to keep track of this problem.
The text was updated successfully, but these errors were encountered: