[receiver/vcenter] Network Packet Metrics Have Metadata Issues #32835
Comments
I don't think a delta is the correct way to model these 20s long intervals anyways, unless we can guarantee the scraper is not getting data points whose intervals overlap.
I think this is the most correct way to represent the data, given the limitations of the API.
I would leave them as they are, since the sum across multiple data point streams is still useful information, e.g. the total rate of packets received across all VMs.
@djaglowski OK. I'm not as worried about whether it's labeled as a Cumulative Sum vs. a Gauge (other than wanting it to be accurate).

As far as turning these metrics into rates, how do you feel about how the collector currently gathers performance metrics in general (including these packet metrics)? For VMs, it grabs a sample which represents an aggregation of the last 20s of data (it might be accumulated counts, an average, a max, or even an average rate calculated over this time), which we then turn into a single datapoint. So for something short like a 1m collection interval, VMs will report these metrics with a single datapoint covering 20s of some sort of aggregation. 40s of information is ultimately "lost" at the moment. This compounds for a larger collection interval like 30m: we'd ultimately have 20s of data represented in a single datapoint for every 30m collection.

For Hosts and a short 1m collection interval, as I mentioned, hosts will report 5 of these datapoints. So we get 100s of previous data split across 5 datapoints, but that implies there is overlap with the previous collection's data as well. With a 30m interval, we no longer have overlap, but we still have a lot of missing information that might not be apparent to a user (considering that we're marking this metric as a cumulative sum).

I know this is a bit tangential, but these packet metrics are where I'm noticing this "issue" the most. So I'll circle back and ask the question again: should we change the interval for just these packet metrics to something like 5m, and then label these metrics in the receiver as "packets.rate.average" or "packet.error.rate.average" or something like that?
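To make the coverage arithmetic above concrete, here is a minimal sketch. It assumes each scrape returns a fixed number of back-to-back 20s samples ending at scrape time (1 per VM, 5 per host, as described in this thread). The function name `sample_coverage` and its parameters are hypothetical and are not part of the receiver.

```python
def sample_coverage(collection_interval_s, samples_per_scrape, sample_window_s=20):
    """Return (covered_s, uncovered_s, overlap_s) for one collection interval.

    Assumption: samples are contiguous and end at scrape time, so any
    sampled time beyond the collection interval overlaps the previous scrape.
    """
    sampled_s = samples_per_scrape * sample_window_s
    covered_s = min(sampled_s, collection_interval_s)
    uncovered_s = collection_interval_s - covered_s
    overlap_s = max(0, sampled_s - collection_interval_s)
    return covered_s, uncovered_s, overlap_s


# VM (1 sample) and host (5 samples) at 1m and 30m collection intervals.
for label, interval_s, samples in [
    ("vm, 1m", 60, 1),
    ("host, 1m", 60, 5),
    ("vm, 30m", 1800, 1),
    ("host, 30m", 1800, 5),
]:
    print(label, sample_coverage(interval_s, samples))
```

Under these assumptions, a VM at a 1m interval has 40s uncovered, while a host at a 1m interval has nothing uncovered but 40s of overlap with the previous scrape. At 30m both have well over 1700s uncovered.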
Pinging code owners for receiver/vcenter: @djaglowski @schmikei @StefanKurek. See Adding Labels via Comments if you do not have permissions to add labels yourself.
**Description:** Adds new packet rate metrics which correctly report as per-second rates (averaged over 20s). Adds warnings to the existing packet metrics stating that they will be removed in v0.102.0.

**Link to tracking Issue:** #32835

**Testing:** Unit/integration tests updated and tested. Local environment tested.

**Documentation:** New documentation generated based on the metadata.
**Description:** This mostly replaces the deprecated packet metrics which already had a warning, namely `vcenter.host.network.packet.errors`, `vcenter.host.network.packet.count`, and `vcenter.vm.network.packet.count`. They are replaced by enabling by default (and removing the existing warnings from) the metrics `vcenter.host.network.packet.error.rate`, `vcenter.host.network.packet.rate`, and `vcenter.vm.network.packet.rate` respectively. The metric `vcenter.vm.network.packet.rate` is also enabled by default (while removing its current warning).

**Link to tracking Issue:** #32929 #32835

**Testing:** Unit tests updated and run. Integration tests run. Local environment checked.

**Documentation:** Documentation regenerated from metadata.
Component(s)
receiver/vcenter
What happened?
Description
Currently there are some issues with the way the packet metrics and their metadata are presented. The `vcenter.*.network.packet.count` metrics are both incorrectly marked with rate units in the metadata. They are also marked as non-monotonic cumulative sums. The actual returned datapoints represent delta sums of packets transmitted over previous 20s intervals in succession.

A similar issue exists for `vcenter.host.network.packet.errors`, but only for the discrepancy between non-monotonic cumulative sum and delta sums.

Steps to Reproduce
Collect against any vCenter environment with VMs.
Expected Result
- `vcenter.vm.network.packet.count` would be returned where each datapoint represents a packet transmission rate (for each VM).
- `vcenter.host.network.packet.count` is returned with datapoints each representing a packet transmission rate (for each Host).
- `vcenter.host.network.packet.errors` is returned with datapoints each representing a packet error rate (for each Host).

Actual Result

- `vcenter.vm.network.packet.count` is returned with a single datapoint representing 20s of accumulated packet count data (for each VM).
- `vcenter.host.network.packet.count` is returned with 5 datapoints each representing the previous 20s of accumulated packet count data (for each Host).
- `vcenter.host.network.packet.errors` is returned with 5 datapoints each representing the previous 20s of accumulated packet error count data (for each Host).

Collector version
v1.6.0/v0.99.0
Environment information
No response
OpenTelemetry Collector configuration
Log output
No response
Additional context
@djaglowski @schmikei I don't actually want to change the metadata to monotonic cumulative deltas. This would cause these metrics to not work with the Prometheus exporters, and that would be problematic for my current use case.
Instead, I think it might make more sense to convert them to rates (by dividing the returned values by the interval length). We could keep the units as rates in this case. We could then either convert them to Gauges or keep them as they are, as I don't think it hurts anything.
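As a rough illustration of the conversion suggested above, the sketch below divides a 20s delta count by the interval length to get a per-second rate. The function name `delta_to_rate` and the sample values are made up for illustration and do not come from the receiver or from a real vCenter environment.

```python
def delta_to_rate(delta_count, interval_s=20):
    """Convert a packet count accumulated over interval_s seconds
    into an average packets-per-second rate."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return delta_count / interval_s


# Hypothetical host scrape: 5 successive 20s delta counts.
host_deltas = [1500, 1200, 0, 300, 2000]
host_rates = [delta_to_rate(d) for d in host_deltas]
print(host_rates)  # [75.0, 60.0, 0.0, 15.0, 100.0]
```

Each resulting value is an average over its 20s window, so it says nothing about packets sent outside that window.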
If we want to make it a rate, we also have the option to use a larger interval (5m) to get the deltas for these.
Whether or not we convert to rates, we also have the option to backfill these datapoints to try and "fill up" the collection interval to make this data more "honest" (not sure if that is the right word) to the end user.
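A minimal sketch of what "filling up" the collection interval could mean, assuming 20s samples and that enough of them can be requested to span the interval (whether the API allows this is not settled in this thread). The helper names `samples_to_backfill` and `average_rate` are hypothetical.

```python
import math


def samples_to_backfill(collection_interval_s, sample_window_s=20):
    """Number of back-to-back samples needed to span one collection interval."""
    return math.ceil(collection_interval_s / sample_window_s)


def average_rate(delta_counts, sample_window_s=20):
    """Average packets-per-second rate across several successive delta counts."""
    if not delta_counts:
        raise ValueError("delta_counts must not be empty")
    return sum(delta_counts) / (len(delta_counts) * sample_window_s)


print(samples_to_backfill(60))           # 3 samples for a 1m interval
print(samples_to_backfill(1800))         # 90 samples for a 30m interval
print(average_rate([1200, 1500, 900]))   # 60.0 packets/s over 60s
```

With a 1m collection interval this would mean 3 samples per scrape instead of 1 for VMs, and 3 instead of 5 for hosts, which would also remove the overlap between scrapes.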
In my use case (importing into Grafana with a Prometheus-based exporter, which just ends up taking the latest datapoint), I could always take my sample and convert it to a 20s rate on the Grafana side of things.