Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[receiver/vcenter] Repeated timeseries values fixed #23669

Conversation

BominRahmani
Copy link
Contributor

@BominRahmani BominRahmani commented Jun 25, 2023

Description:
There was an issue where multiple same attribute values of a metric (time-series) would have the same timestamps.This PR adds a new resource attribute that allows the metrics that are being emitted with the repeating timestamps in the time-series to be more accurately represented.

// Increment values of timestamps when finished with all nested values
if j == len(val.Value)-1 {
for ind, metricMapVal := range metricMap[val.Name] {
incrementedTime := time.Duration(metricMapVal.Interval) * time.Duration(j+1) * time.Nanosecond
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you help me understand what this is doing?

Is vcenter returning multiple data points with the same timestamp and no other differentiating information?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. processHostPerformance() goes through a map with a Value array (image 1)
image
The issue is that several Values of the same name
image
will go through and report their datapoint values. There are 5 example timestamps that are supposed to go 1-1 with the length of the inner value it seems, but since they get reported multiple times, the timestamps get recycled. the above code you mentioned is just me keeping track and incrementing those timestamps so when they get encountered, its not being recycled. The differentiating factor between these values seems to be the Instance
image
but the recording of the data doesn't take the difference of the Instance into account.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but the recording of the data doesn't take the difference of the Instance into account.

That seems to be the problem. If I'm understanding correctly, we need to differentiate these data points with a resource attribute. In other words, we need to emit multiple resources, each with one of these data points.

@schmikei, WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay, was on vacation!

It feels like this is best solved by adding a vcenter.host.disk.instance attribute or something. What's being reported from the PerformanceManager API for this particular host seems to have 2 drives and that is the crux of the case you're showing right now. Seems like that is the approach I think makes the most sense @BominRahmani messing with the timestamps seems like the incorrect approach.

@BominRahmani BominRahmani requested a review from djaglowski June 27, 2023 13:27
@BominRahmani BominRahmani force-pushed the bugfix/vcenter-repeated-timeseries branch from 941844e to 2fc1a85 Compare July 10, 2023 17:34
Comment on lines 251 to 234
if idx >= 1 && val.Instance != m.Value[idx-1].Instance {
v.mb.EmitForResource(metadata.WithVcenterClusterName(clusterName),
metadata.WithVcenterHostName(hostname),
metadata.WithVcenterVMID(vmUUID),
metadata.WithVcenterVMName(vmMain.Name()),
metadata.WithVcenterSystemDeviceID(m.Value[idx-1].Instance))
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is going on here? What are we checking for in the condition and why are we emitting a resource when we find it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the new logic is saying "If we just finished all of the metrics for one Instance, emit those metrics," but I think it's going to emit the metrics for the first item of each instance with the metrics for the previous instance.

Instead of just sorting on line 219, we should instead create a new map[string][]object & use it to group by instance, then iterate over the map and emit after parsing each grouped list.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah just chiming in and saying that the approach @dehaansa suggests would make what you're trying to do here more readable @BominRahmani. I also would probably want to validate more on if there is a content in the Instance variable. I know at least particularly for the VM network metrics that the instance can actually be the VM name itself which is an auto aggregation of the rest of the network interfaces...

All that to say is that we should do our due diligence and make sure content on this change is validated and the expected.yaml is strictly looked at to ensure we know which of the performance metrics are going to be affected.

image

Comment on lines 251 to 234
if idx >= 1 && val.Instance != m.Value[idx-1].Instance {
v.mb.EmitForResource(metadata.WithVcenterClusterName(clusterName),
metadata.WithVcenterHostName(hostname),
metadata.WithVcenterVMID(vmUUID),
metadata.WithVcenterVMName(vmMain.Name()),
metadata.WithVcenterSystemDeviceID(m.Value[idx-1].Instance))
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the new logic is saying "If we just finished all of the metrics for one Instance, emit those metrics," but I think it's going to emit the metrics for the first item of each instance with the metrics for the previous instance.

Instead of just sorting on line 219, we should instead create a new map[string][]object & use it to group by instance, then iterate over the map and emit after parsing each grouped list.

Comment on lines 251 to 234
if idx >= 1 && val.Instance != m.Value[idx-1].Instance {
v.mb.EmitForResource(metadata.WithVcenterClusterName(clusterName),
metadata.WithVcenterHostName(hostname),
metadata.WithVcenterVMID(vmUUID),
metadata.WithVcenterVMName(vmMain.Name()),
metadata.WithVcenterSystemDeviceID(m.Value[idx-1].Instance))
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah just chiming in and saying that the approach @dehaansa suggests would make what you're trying to do here more readable @BominRahmani. I also would probably want to validate more on if there is a content in the Instance variable. I know at least particularly for the VM network metrics that the instance can actually be the VM name itself which is an auto aggregation of the rest of the network interfaces...

All that to say is that we should do our due diligence and make sure content on this change is validated and the expected.yaml is strictly looked at to ensure we know which of the performance metrics are going to be affected.

image

@@ -31,6 +31,10 @@ resource_attributes:
description: The instance UUID of the virtual machine.
enabled: true
type: string
vcenter.system.device.id:
description: The unique identifier of the specific hardware or virtual component being utilized in the vCenter environment.
enabled: true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
enabled: true
enabled: false

@djaglowski I know this receiver is still in alpha state, but what's your thoughts on this RA being enabled by default?

I have a slight concern about cardinality increase for current users even if it dropped information previously

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm mostly concerned with whether or not data points are uniquely identifiable. If we disable the attribute, then we can't tell the data points apart, so I think it has to correlate with disabling the data points as well.

That said, would you feel better about it if we switch back to having device id as a data point attribute? I've recently been convinced that this is allowable based on #23565 (comment), specifically the suggestion that "We typically put multiple entities in a Resource."

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm of the opinion that the first fix for this case might just be to find and only report the datapoint referring to the correct instance of a host or VM. I caught that at least for most of these metrics that there exists a datapoint that correlates correctly to the host system and the VM. So rather than recording on every datapoint, we just find the corresponding correct resource and record only that datapoint for the current metrics.

I think further work beyond that would be to effectively record Network Interfaces, Disks, and other fields of the Instance fields on the sampleInfo.

Like for this example with "net.bytesRx.average", we can see that there is an already aggregated value for the instance of the VM along with the corresponding Network Interfaces.

image

So there is more hierarchy here
| VM
| -- vmnic-0
| -- vmnic-1
| -- vmnic-2
and we have a metric vcenter.vm.network.throughput

So that is a peculiarity that is interesting in this case. I believe that the "correct" is to correlate that datapoint only and then we can add more metrics later for the different network interfaces and other instances and create identifying resource attributes on those.

Those are just my opinions though and you probably know a tad bit more about the correctness of the data model so I'll defer to you, but I just wanted to write out my thoughts on what seemed like the correct and least destructive approach.

Let me know if those ramblings make sense or where we disagree on the approach. I could imagine an alternative approach that we move away from a metric named vcenter.vm.network.throughput in favor of vcenter.vm.network.interface.throughput, but I'm not sure if that would warrant the removal of vcenter.vm.network.throughput.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're confident that we can isolate only one data point which correctly represents the entity, then let's do that.

@BominRahmani BominRahmani force-pushed the bugfix/vcenter-repeated-timeseries branch 3 times, most recently from dc9ce9f to 81e788b Compare August 4, 2023 20:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants