Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[receiver/vcenter] Network Packet Metrics Have Metadata Issues #32835

Closed
StefanKurek opened this issue May 2, 2024 · 3 comments
Closed

[receiver/vcenter] Network Packet Metrics Have Metadata Issues #32835

StefanKurek opened this issue May 2, 2024 · 3 comments
Labels
bug Something isn't working receiver/vcenter

Comments

@StefanKurek
Copy link
Contributor

Component(s)

receiver/vcenter

What happened?

Description

Currently there are some issues with the way the packet metrics are presented & their metadata. The vcenter.*.network.packet.count metrics are both incorrectly marked with rate units in the metadata. They are also marked as non-monotonic cumulative sums. The actual returned datapoints represent delta sums of packets transmitted over previous 20s intervals in succession.

A similar issue exists for vcenter.host.network.packet.errors, but only for the discrepancy between non-monotonic cumulative sum & delta sums.

Steps to Reproduce

Collect against any vCenter environment with VMs.

Expected Result

vcenter.vm.network.packet.count would be returned where each datapoint represents a packet transmission rate (for each VM).
vcenter.host.network.packet.count is returned with datapoints each representing a packet transmission rate (for each Host).
vcenter.host.network.packet.errors is returned with datapoints each representing a packet error rate (for each Host).

Actual Result

vcenter.vm.network.packet.count is returned with a single datapoint representing 20s of accumulated packet count data (for each VM).
vcenter.host.network.packet.count is returned with 5 datapoints each representing the previous 20s of accumulated packet count data (for each Host).
vcenter.host.network.packet.errors is returned with 5 datapoints each representing the previous 20s of accumulated packet error count data (for each Host).

Collector version

v1.6.0/v0.99.0

Environment information

No response

OpenTelemetry Collector configuration

extensions:
  basicauth/prom:
    client_auth:
      username: [PROMUSER]
      password: [PROMPASS]

exporters:
  prometheusremotewrite:
    endpoint: [PROMENDPOINT]
    auth:
      authenticator: basicauth/prom
    resource_to_telemetry_conversion:
      enabled: true # Convert resource attributes to metric labels

processors:
  batch:
    # https://github.com/open-telemetry/opentelemetry-collector/tree/main/processor/batchprocessor

receivers:
  vcenter:
    endpoint: https://[VCENTERHOST]
    username: [VCENTERUSER]
    password: [VCENTERPASS]
    tls:
      insecure: true
    collection_interval: 1m
    initial_delay: 1s

service:
  extensions: [basicauth/prom]
  pipelines:
    metrics:
      receivers: [vcenter]
      processors: [batch]
      exporters: [prometheusremotewrite]

Log output

No response

Additional context

@djaglowski @schmikei I don't actually want to change the metadata to monotonic cumulative deltas. This would cause these metrics to not work with the prometheus exporters, and that would be problematic for my current use case.

Instead, I think it might make more sense to do something like convert them to rates (by dividing the values returned by the interval). We could make/keep the units as rates in this case. We then could either convert to Gauges (or I guess keep as they are as I don't think it hurts anything).

If we want to make it a rate, we also have the option to use a larger interval (5m) to get the deltas for these.

Whether or not we convert to rates, we also have the option to backfill these datapoints to try and "fill up" the collection interval to make this data more "honest" (not sure if that is the right word) to the end user.

In my use case (Importing into Grafana with a prometheus based exporter which just ends up taking the latest datapoint), I could always just take my sample and convert to a 20s rate on the Grafana side of things.

@StefanKurek StefanKurek added bug Something isn't working needs triage New item requiring triage labels May 2, 2024
@djaglowski
Copy link
Member

I don't actually want to change the metadata to monotonic cumulative deltas.

I don't think a delta is the correct way to model these 20s long intervals anyways, unless we can guarantee the scraper is not getting data points whose intervals overlap.

Instead, I think it might make more sense to do something like convert them to rates (by dividing the values returned by the interval).

I think this is the most correct way to represent the data, given the limitations of the API.

We then could either convert to Gauges (or I guess keep as they are as I don't think it hurts anything).

I would leave them as the sum across multiple data points streams is still useful information. e.g. total rate of packets received across all VMs.

@StefanKurek
Copy link
Contributor Author

@djaglowski OK. I'm not as worried about whether or not it's labeled as a Cumulative Sum vs. Gauge (other than wanting it to be accurate).

As far as turning these metrics to rates, how do you feel about how the collector currently gathers any performance metrics in general (including these packet metrics)?

For VMs, it grabs a sample which represents an aggregation of the last 20s of data (might be just accumulated counts, might be an average, a max, or even an avg rate calculated over this time) which we then turn into a single datapoint.
For Hosts, this is mostly the same, except we specifically ask for 5 datapoints from the API (1 for the last 20s interval, 1 for the 20s interval before that, etc). We then turn all of these into datapoints for the OTEL metric.

So for for something short like a 1m collection interval, VMs will report these metrics with a single datapoint of 20s of some sort of aggregation. 40s of information is ultimately "lost" at the moment. This would compound for a larger collection time like 30m. We'd ultimately have 20s of data represented in a single datapoint for every 30m collection.

For Hosts and a short 1m collection interval, like I mentioned hosts will report 5 of these datapoints. So we get 100s of previous data split across 5 datapoints, but that implies that there is overlap with the previous collection's data as well. With a 30m interval, we no longer have overlap but also still have a lot of missing information that might not be apparent to a use (considering that we're marking this metric as a cumulative sum?)

I know this is a bit tangential, but I feel like these packet metrics are where I'm noticing this "issue" in the forefront of my mind.

Now I'll circle back and ask the question again on if we should do something like change the intervals for just these packet metrics to something like 5m, and then label these metrics in the receiver as "packets.rate.average" or "packet.error.rate.average" or something like that?

Copy link
Contributor

github-actions bot commented May 3, 2024

Pinging code owners for receiver/vcenter: @djaglowski @schmikei @StefanKurek. See Adding Labels via Comments if you do not have permissions to add labels yourself.

@crobert-1 crobert-1 removed the needs triage New item requiring triage label May 3, 2024
djaglowski pushed a commit that referenced this issue May 7, 2024
**Description:** <Describe what has changed.>
Adds new rate packet metrics which correctly report as per second rates
(avg over 20s).

Adds warnings for existing packet metrics that they will be removed in
v0.102.0.

**Link to tracking Issue:** <Issue number if applicable>
#32835 

**Testing:** <Describe what testing was performed and which tests were
added.>
Unit/integration tests updated and tested. Local environment tested.

**Documentation:** <Describe the documentation added.>
New documentation generated based on the metadata.
djaglowski pushed a commit that referenced this issue May 22, 2024
**Description:** <Describe what has changed.>
This mostly replaced deprecated packet metrics which already had a
warning. Namely `vcenter.host.network.packet.errors`,
`vcenter.host.network.packet.count`, and
`vcenter.vm.network.packet.count`. They are replaced through enabling by
default (while removing the existing warnings) the metrics
`vcenter.host.network.packet.error.rate`,
`vcenter.host.network.packet.rate`, and `vcenter.vm.network.packet.rate`
respectively.

The metric `vcenter.vm.network.packet.rate` is also enabled by default
(while removing its current warning).

**Link to tracking Issue:** <Issue number if applicable>
#32929 #32835

**Testing:** <Describe what testing was performed and which tests were
added.>
Unit tests updated/ran. Integration tests ran. Local environment
checked.

**Documentation:** <Describe the documentation added.>
Documentation regenerated from metadata.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working receiver/vcenter
Projects
None yet
Development

No branches or pull requests

3 participants