Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ability to remove/flush unused attributes in metric instruments #2997

Open
legendecas opened this issue May 27, 2022 · 18 comments
Open

Ability to remove/flush unused attributes in metric instruments #2997

legendecas opened this issue May 27, 2022 · 18 comments
Labels
feature-request never-stale signal:metrics Issues and PRs related to general metrics signal waiting-for-spec

Comments

@legendecas
Copy link
Member

Removing/flushing unused dynamic attributes in metric instruments is very important to not leaking memories.

Spec issue: open-telemetry/opentelemetry-specification#1297
Prometheus Suggestion: https://prometheus.io/docs/instrumenting/writing_clientlibs/#labels

Metrics with labels SHOULD support a remove() method with the same signature as labels() that will remove a Child from the metric no longer exporting it, and a clear() method that removes all Children from the metric. These invalidate caching of Children.

Opening this issue to keep track of this problem.

@dyladan dyladan added this to the Metrics GA milestone Jun 1, 2022
@dyladan dyladan added the up-for-grabs Good for taking. Extra help will be provided by maintainers label Jun 1, 2022
@legendecas legendecas mentioned this issue Jul 25, 2022
2 tasks
@jmacd
Copy link

jmacd commented Jul 27, 2022

@legendecas There are at least four issues being discussed here and in #3105.

If you're using delta temporality, you should be able to avoid memory buildup regardless of which attributes are in use. Memory buildup is a problem for the consumer downstream, in this case.

If you're using cumulative temporality and you find a need to delete() the way Prometheus suggersts, you may instead (in the current OTel spec) use an asynchronous instrument and unregister the callback producing data for that instrument when you want to stop reporting it. Support for callback unregister was explicitly for this purpose, as (IMO) we do not want to try to define the delete operation. This was debated in open-telemetry/opentelemetry-specification#2317.

If you're using cumulative temporality and you simply want to strip attributes, so that less memory is needed in the long term, restarting (or reconfiguring) the SDK with a View that eliminates those attributes should give the correct results. That is the sense in which open-telemetry/opentelemetry-specification#1297 defines safe attribute removal for SDKs--if you've implemented views you already have the machinery in place to safely remove attributes.

The final potential topic here is being discussed in open-telemetry/opentelemetry-specification#1891, question is what we want to do when cardinality is high and there are stale entries that could be removed from memory. Please join that discussion. We're able to do this with per-series start timestamps; however this complicates other things, and I'd like official guidance from Prometheus on this.

@legendecas
Copy link
Member Author

If you're using delta temporality, you should be able to avoid memory buildup regardless of which attributes are in use. Memory buildup is a problem for the consumer downstream, in this case.

When using delta temporality with an async instrument, the SDK needs to export delta values for the async instrument thus it has to remember the last reported metric streams: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/supplementary-guidelines.md#asynchronous-example-delta-temporality.

@jmacd
Copy link

jmacd commented Aug 4, 2022

Thanks @legendecas. You are correct. My statement about delta temporality only applies to synchronous instruments. This is the sense in which Lightstep would like to support a temporality preference named "stateless" that would configure delta temporality for synchronous and cumulative temporality for asynchronous instruments.

As you point out, asynchronous instruments when configured with delta temporality have the same problem as synchronous instruments configured with cumulative temporality, making async/delta have roughly the same problem as discussed in open-telemetry/opentelemetry-specification#1891. Please head to that discussion, I will continue there with a takeaway from this one.

@dyladan
Copy link
Member

dyladan commented Sep 14, 2022

@legendecas what would you think about documenting this issue and handling handling attribute removal after GA? I would like to get the GA release out before kubecon and this is the last big topic holding it.

@legendecas
Copy link
Member Author

Given that the specification of metrics api is already declared stable, I'm fine with tagging this issue as after-GA.

@asafm
Copy link

asafm commented Feb 22, 2023

I've raised an issue for adding Delete() for an instrument in the API specifications: open-telemetry/opentelemetry-specification#3062

@github-actions
Copy link

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days.

@eplightning
Copy link

eplightning commented Aug 4, 2023

Does this issue cover asynchronous instruments as well?

I have encountered the issue where SDK keeps exporting metrics which no longer get reported by my callbacks:

    this.objectState = capiMeter().createObservableGauge('capi.object.state', {
      valueType: ValueType.INT,
      unit: '{state}',
      description: 'The current state of an object'
    });
    
    this.objectState.addCallback(async (result) => {
      for (const obj of await this.objectStorage.getMetrics()) {
        result.observe(obj.active ? 1 : 0, {
          'capi.object.name': obj.name
        });
      }
    });
    
    // SDK will keep exporting all metrics for all attribute sets that ever got observed inside the callback, not just the recent call

@pichlermarc
Copy link
Member

Does this issue cover asynchronous instruments as well?

I have encountered the issue where SDK keeps exporting metrics which no longer get reported by my callbacks:

    this.objectState = capiMeter().createObservableGauge('capi.object.state', {
      valueType: ValueType.INT,
      unit: '{state}',
      description: 'The current state of an object'
    });
    
    this.objectState.addCallback(async (result) => {
      for (const obj of await this.objectStorage.getMetrics()) {
        result.observe(obj.active ? 1 : 0, {
          'capi.object.name': obj.name
        });
      }
    });
    
    // SDK will keep exporting all metrics for all attribute sets that ever got observed inside the callback, not just the recent call

@eplightning Yep, that's also something that would be covered by it. Does it keep reporting also when un-registering the callback (calling .removeCallback() on objectState)? 🤔

@eplightning
Copy link

eplightning commented Aug 11, 2023

@pichlermarc

Doesn't seem to change anything: old metrics remain. Tested on Prometheus and console exporters:

    const cb = async (result: ObservableResult<ShardMetricLabels>) => {
      for (const shard of await this.shardStorage.getMetrics()) {
        result.observe(shard.active ? 1 : 0, {
          'capi.shard.name': `${shard.name}`
        });
      }
    };

    this.metricShardState.addCallback(cb);
    console.log('registered');

    setTimeout(() => {
      this.metricShardState.removeCallback(cb);
      console.log('unregistered');
    }, 1000 * 120);

@eplightning
Copy link

eplightning commented Aug 11, 2023

Spec update was recently merged which I believe concerns this issue as well (at least the async instruments): open-telemetry/opentelemetry-specification#3242

@pichlermarc
Copy link
Member

Spec update was recently merged which I believe concerns this issue as well (at least the async instruments): open-telemetry/opentelemetry-specification#3242

@eplightning thanks for mentioning this and for double-checking! I created an issue to implement the spec for asynchronous instruments. #4096

I think we can keep this issue here at waiting-for-spec as we're waiting for open-telemetry/opentelemetry-specification#1297

@pkeuter
Copy link

pkeuter commented Mar 28, 2024

@pichlermarc I was wondering what the status is on this one. There seems to be no way (outside of restarting the application) to flush any old attribute sets that were not observed for a long time. Is there any workaround for this that you know of?

@enzo-cappa
Copy link

This is a problem for Kafka Consumer. Due to how they work, partitions are assigned dynamically. Also, metrics are commonly reported per partition.

Due to this issue, when a partition is unassigned OpenTelemetry will keep reporting the "last known value" for the metrics of partitions for which the consumer that is no longer assigned. This causes all kind of problems.

The only workaround we could find is to send "0" as the value for metrics corresponding to partitions that are not currently assigned. This is far from ideal, as 0 is actually a valid value. We end up having some logic when consuming the metrics so we can exclude the invalid values.

@pichlermarc
Copy link
Member

@pichlermarc I was wondering what the status is on this one. There seems to be no way (outside of restarting the application) to flush any old attribute sets that were not observed for a long time. Is there any workaround for this that you know of?

@pkeuter there is currently no workaround for this.

The way forward here is to implement #4095 (which would limit memory use as old metrics age out) or drive the specification issue open-telemetry/opentelemetry-specification#1297, then implement the outcome here.

Contributions are welcome, but keep in mind that Implementing #4095 is likely non-trivial.

@asafm
Copy link

asafm commented Jun 1, 2024

How about using async instruments? Then you control which partition metrics is reported for.

@pichlermarc pichlermarc removed the up-for-grabs Good for taking. Extra help will be provided by maintainers label Oct 15, 2024
@pichlermarc
Copy link
Member

Removing up-for-grabs as this is not actionable without a specification to implement.

@netiperher
Copy link

This is such a fundamental limitation with the opentelemetry SDK (and specification) that I recommend my clients to skip the SDK for dynamic attributes and emits metrics datapoints themselves, e.g. by attaching a custom metrics producer using the experimental metricProducers api:

  const fooMetric: GaugeMetricData = {
    dataPointType: DataPointType.GAUGE,
    descriptor: {
      name: 'foo',
      description: 'Foo metrics',
      unit: '',
      type: InstrumentType.OBSERVABLE_GAUGE,
      valueType: ValueType.DOUBLE,
    },
    aggregationTemporality: AggregationTemporality.CUMULATIVE,
    dataPoints: [],
  };
  const startTime = millisToHrTime(Date.now());

  const rawMetricsProducer: MetricProducer = {
    collect: async () => {
      return {
        resourceMetrics: {
          resource,
          scopeMetrics: [
            {
              scope: {
                name: 'raw',
              },
              metrics: [
                {
                  ...fooMetric,
                  dataPoints: [
                    {
                      startTime,
                      endTime: millisToHrTime(Date.now()),
                      attributes: { 'some-attribute': 'some-dynamic-value' },
                      value: 42,
                    },
                  ],
                },
              ],
            },
          ],
        },
        errors: [],
      };
    },
  };

  const meterProvider = new MeterProvider({
    resource,
    readers: [
      new PeriodicExportingMetricReader({
        exporter: new OTLPMetricExporter(),
        metricProducers: [rawMetricsProducer],
      }),
    ],
  });

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature-request never-stale signal:metrics Issues and PRs related to general metrics signal waiting-for-spec
Projects
None yet
Development

No branches or pull requests

9 participants