
metrics node_network_receive_bytes_total error, and node average network traffic received is too big #1849

Closed
null-test-7 opened this issue Sep 18, 2020 · 15 comments


@null-test-7

The average network traffic received per second over the last 2 minutes is too high: the result exceeds the capacity of the network interface card.


Then I queried the metric node_network_receive_bytes_total in Prometheus.


(1571379713472972500 - 1567490476219662000) * 8 / 60 / 1024 / 1024 / 1024 / 1024 = 471.63 Tbps
471.63 Tbps is an erroneous value; it is far larger than any NIC could deliver.
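For reference, a "received bits per second" panel like this is typically built from an expression along these lines (a sketch; the exact dashboard query and the device label are assumptions):

# average received throughput over the last 2 minutes, in bits per second
rate(node_network_receive_bytes_total{instance="10.196.1.6",device="bond0"}[2m]) * 8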

The node_exporter is deployed as a Docker container; the deployment manifest is https://github.com/kayrus/prometheus-kubernetes/blob/master/node-exporter-ds.yaml and the image is prom/node-exporter:v1.0.1.

@SuperQ (Member) commented Sep 20, 2020

The node_exporter only reports what the kernel reports via /proc/net/dev. There's no translation or other manipulation of the data. So this is likely a kernel bug. What would be useful for debugging is if you could paste the raw data for a full day.

Starting at 2020-09-17 14:00

node_network_receive_bytes_total{instance="10.196.1.6",device="bond0"}[1d]

If you could copy-and-paste the text from the Value column into a file and attach it to the issue, I would like to look over the raw sample data.
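One way to export that data (a sketch; the Prometheus address is a placeholder and jq is assumed to be available):

$ curl -s 'http://localhost:9090/api/v1/query' \
    --data-urlencode 'query=node_network_receive_bytes_total{instance="10.196.1.6",device="bond0"}[1d]' \
  | jq -r '.data.result[0].values[] | "\(.[1]) \(.[0])"' > raw.txt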

@null-test-7 (Author)

This is raw sample data for one day.
node_network_receive_bytes_total{instance="10.196.1.6",device="bond2"}[1d]
raw.txt

There are many machines in our Kubernetes cluster. All CPU nodes' network metrics are normal, but about half of the GPU nodes' network metrics are abnormal.

10.196.1.6 is a GPU node; the kernel version is 3.10.0 and the GPU is an NVIDIA Tesla V100.

@SuperQ (Member) commented Sep 24, 2020

This very much looks like something is wrong with these systems.

If you look at the raw values with a simple awk script, you can see the jumps in the data:

$ awk 'NR == 1{old = $1; line = 1 ; next} {print line, $1 - old; line += 1 ; old = $1}' raw.txt
1 180504452352
2 184214446848
3 182921131264
4 183894123520
5 179318487808
6 182999593216
7 186817020672
8 3311125748826624
9 186636443904
10 177499428608
11 178987272448
12 183984390400
13 152058929920
14 171973385728
15 171997525248
16 170380139264
17 165839404032
18 1722169465976320
19 168533415424
20 167896554496

1833299510051384000 1600570841.518
1833299690555836400 1600570901.518
1833299874770283300 1600570961.518
1833300057691414500 1600571021.518
1833300241585538000 1600571081.518
1833300420904025900 1600571141.518
1833300603903619000 1600571201.518
1833300790720639700 1600571261.518
1836611916469466400 1600571321.518
1836612103105910300 1600571381.697
1836612280605339000 1600571441.518
1836612459592611300 1600571501.518
1836612643577001700 1600571561.518
1836612795635931600 1600571621.518
1836612967609317400 1600571681.518
1836613139606842600 1600571741.518
1836613309986982000 1600571801.518
1836613475826386000 1600571861.518
1838335645292362200 1600571921.518
1838335813825777700 1600571981.518
1838335981722332200 1600572041.518

Like I said, the node_exporter does not alter data; it takes the values from the kernel and reports them directly.

@Nebojsa92

@null-test-7 Did you ever resolve the source of this issue?

@SuperQ (Member) commented May 23, 2023

This seems to be a node / kernel issue. Closing as stale.

SuperQ closed this as completed May 23, 2023
@frittentheke (Contributor) commented Nov 7, 2023

@Nebojsa92 @null-test-7

I just observed this behavior on hosts running kernel 6.2.0-36-generic (uname: Linux hostname 6.2.0-36-generic #37~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon Oct 9 15:34:04 UTC 2 x86_64 x86_64 x86_64 GNU/Linux) with a 2x100G NIC (Ethernet controller: Intel Corporation Ethernet Controller E810-C for QSFP (rev 02)) running in a bond (LACP).

@Nebojsa92

I've observed this starting from kernel 6.2; up to and including kernel 6.1 I didn't have such spikes. It is definitely a kernel ice driver bug. I've contacted Intel support, but they weren't able to reproduce it. The last kernel I tried was 6.5, and the spikes still occur.

No luck for now.

@frittentheke (Contributor)

[...] a 2x100G NIC Ethernet controller: Intel Corporation Ethernet Controller E810-C for QSFP (rev 02) running in bond (LACP).

The spikes are there even in active-backup bonding mode, so it's not LACP-related.

@frittentheke (Contributor)

@Nebojsa92 I posted to the Intel driver mailing list (https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20231113/038041.html) with two spikes documented via continuous logging of /proc/net/dev. Did you ever try the netlink mode (#2777) of node_exporter and also observe those spikes?

@YZ775 commented Nov 29, 2023

I have a similar issue on Linux kernel 5.15 with the Intel ice driver, and no solution has been found.
node_network_transmit_bytes_total and node_network_receive_bytes_total spike occasionally during the day.
Does anyone have a solution?

node_exporter:
  version 1.5.0

NIC:
  Intel Corporation Ethernet Controller E810-XXV for SFP (rev 02)

output of ethtool -i:
  driver: ice
  version: 5.15.122-flatcar
  firmware-version: 4.00 0x800118b4 21.5.9
  expansion-rom-version: 
  bus-info: 0000:63:00.1
  supports-statistics: yes
  supports-test: yes
  supports-eeprom-access: yes
  supports-register-dump: yes
  supports-priv-flags: yes

P.S.
I think #2777 is a setting for ARP metrics.
Since version 1.4.0, node_exporter has used netlink instead of parsing /proc/net/dev by default.
We can force node_exporter to use the old /proc/net/dev path with the option from #2509.
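For example (a sketch; the exact flag name is an assumption based on the linked PRs — check node_exporter --help for your version):

# force the netdev collector back to parsing /proc/net/dev instead of netlink (1.4.0+)
node_exporter --no-collector.netdev.netlink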

@Nebojsa92

@frittentheke we have tried the version of node_exporter that uses /proc/net/dev and newer ones that use netlink, but the behavior is the same. Something is odd with the ice driver. I did raise a kernel bug with the Intel maintainers, but they couldn't reproduce it on their side.

@SuperQ (Member) commented Dec 26, 2023

@Nebojsa92 If you can reproduce it, I would recommend writing a very simple script that pulls the data from /proc/net/dev every 5-15 seconds, to prove to Intel that they have a bug.
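For example, a minimal sketch (interface name and interval are placeholders):

#!/bin/sh
# Log the kernel's raw counters for one interface every 10 seconds,
# prefixed with a Unix timestamp, independent of node_exporter/Prometheus.
IFACE=${1:-bond0}
while true; do
  printf '%s ' "$(date +%s)"
  grep "$IFACE:" /proc/net/dev
  sleep 10
done >> "netdev-$IFACE.log"

A jump in the logged receive-bytes column that does not match real traffic then points squarely at the driver/kernel.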

@frittentheke (Contributor)

@Nebojsa92 I posted my findings to the mailing list at https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20231113/038041.html; please join in and provide your debug info there as well.

@frittentheke (Contributor)


@Nebojsa92 @null-test-7 it seems Intel has indeed found an issue and will provide a fix soon. It would be great if we could all then verify that the issue is gone for good. It seems to really depend on how the NIC is used; otherwise others would have complained a lot more already.
