metrics node_network_receive_bytes_total error, and node average network traffic received is too big #1849
Comments
The node_exporter only reports what the kernel reports via Starting at
If you could copy and paste the text from the Value column into a file and attach it to the issue, I would like to look over the raw sample data.
This is the raw sample data for one day. There are many machines in our Kubernetes cluster. The network metrics of all CPU nodes are normal, but about half of the GPU nodes report abnormal network metrics. 10.196.1.6 is a GPU node; its kernel version is 3.10.0 and its GPU is an Nvidia Tesla V100.
This very much seems like something wrong with these systems. If you look at the raw values with a simple awk script, you can see the jumps in the data:
1833299510051384000 1600570841.518
Like I said, the node_exporter does not alter data; it takes the values from the kernel and reports them directly.
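For readers who want to repeat this kind of check, here is a minimal Python sketch of the same idea (it is not the awk script mentioned above). It assumes each line of the raw data holds a counter value followed by a Unix timestamp, as in the sample line. The names `parse_samples`, `find_jumps`, and `ASSUMED_LINK_LIMIT`, the 100 Gbit/s link speed, and every sample row except the third are hypothetical and only serve the illustration.

```python
def parse_samples(text):
    """Parse 'value timestamp' lines into (timestamp, value) tuples."""
    samples = []
    for line in text.strip().splitlines():
        value, timestamp = line.split()
        # Keep the counter as an int so large values do not lose precision.
        samples.append((float(timestamp), int(value)))
    return samples


def find_jumps(samples, max_bytes_per_second):
    """Return (start, end, rate) for intervals whose rate exceeds the limit."""
    jumps = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        if elapsed <= 0:
            continue  # skip duplicate or out-of-order samples
        rate = (v1 - v0) / elapsed
        if rate > max_bytes_per_second:
            jumps.append((t0, t1, rate))
    return jumps


# Assumption: a 100 Gbit/s link, expressed in bytes per second.
ASSUMED_LINK_LIMIT = 100e9 / 8

# Hypothetical data; only the third row is taken from the thread.
RAW = """
1000000000 1600570781.518
1060000000 1600570811.518
1833299510051384000 1600570841.518
1833299510111384000 1600570871.518
"""

jumps = find_jumps(parse_samples(RAW), ASSUMED_LINK_LIMIT)
for start, end, rate in jumps:
    print(f"jump between {start} and {end}: {rate:.3e} bytes/s")
```

Any interval flagged this way implies a receive rate the link could not physically carry, which points at the counter source rather than at real traffic.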
@null-test-7 Did you ever resolve the source of this issue?
This seems to be a node / kernel issue. Closing as stale.
I just observed this behavior on hosts running kernel |
I've observed this starting from kernel 6.2. Up to and including kernel 6.1, I didn't see such spikes. It is definitely a kernel ICE driver bug. I've contacted Intel support, but they weren't able to reproduce it. The latest kernel I've tried is 6.5, and the spikes still occur. No luck for now.
The spikes are there even in active-backup bonding mode, so it's not LACP related.
@Nebojsa92 I posted to the Intel driver ML - https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20231113/038041.html with two spikes documented via continuous logging of |
I have a similar issue on Linux kernel 5.15 with the Intel ICE driver, and no solution has been found.
p.s. |
@frittentheke we have tried version of Node Exporter that uses |
@Nebojsa92 If you can reproduce it, I would recommend writing a very simple script that pulls the data from |
@Nebojsa92 I posted my findings to the ML at https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20231113/038041.html, please kindly join in and provide your debug info there as well.
@Nebojsa92 @null-test-7 it seems Intel has indeed found an issue and will provide a fix soon. It would be cool if we could all then verify that the issue is gone for good. It seems to really depend on the way the NIC is used; otherwise others would have complained a lot more already.
@Nebojsa92 @null-test-7 @YZ775 the patch has now been sent in for the next kernel. See: https://lore.kernel.org/netdev/20240227143124.21015-1-przemyslaw.kitszel@intel.com/
The average network traffic received per second over the last 2 minutes is too high and exceeds the limit of the network interface card.
Then I queried the metric node_network_receive_bytes_total in Prometheus.
(1571379713472972500 - 1567490476219662000) * 8 / 60 / 1024 / 1024 / 1024 / 1024 = 471.63 Tbps
471.63 Tbps is an erroneous value, because it is far too big.
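The calculation above can be reproduced with a short Python sketch. The two counter values and the 60-second interval come from the report; the function name `receive_rate_tbps` is hypothetical. It follows the report's convention of dividing by 1024 four times, so the result is strictly in binary terabits per second.

```python
def receive_rate_tbps(first_bytes, last_bytes, interval_seconds):
    """Average receive rate between two counter samples.

    Follows the arithmetic in the report: bytes -> bits, divide by the
    interval, then divide by 1024**4.
    """
    delta_bits = (last_bytes - first_bytes) * 8
    return delta_bits / interval_seconds / 1024**4


# Counter values quoted in the report, 60 seconds apart.
rate = receive_rate_tbps(1567490476219662000, 1571379713472972500, 60)
print(f"{rate:.2f} Tbps")
```

A rate of this size is orders of magnitude beyond what any single network interface can carry, which is why the counter itself is suspect.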
The node_exporter is deployed as a Docker container. The deployment file is https://github.com/kayrus/prometheus-kubernetes/blob/master/node-exporter-ds.yaml, and the image is prom/node-exporter:v1.0.1.