
metrics node_network_receive_bytes_total error, and node average network traffic received is too big #1849

Closed
null-test-7 opened this issue Sep 18, 2020 · 15 comments


@null-test-7

The average network traffic received per second over the last 2 minutes is too high: the result exceeds the capacity of the network interface card.


Then I queried the metric node_network_receive_bytes_total in Prometheus.


(1571379713472972500 - 1567490476219662000) * 8 / 60 / 1024 / 1024 / 1024 / 1024 = 471.63 Tbps
471.63 Tbps is an erroneous value; it is far larger than any NIC could deliver.
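For reference, a "received bits per second" panel like this is typically built from an expression along these lines (a sketch; the exact dashboard query and the device label are assumptions):

# average received throughput over the last 2 minutes, in bits per second
rate(node_network_receive_bytes_total{instance="10.196.1.6",device="bond0"}[2m]) * 8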

The node_exporter is deployed as a Docker container; the deployment manifest is https://github.com/kayrus/prometheus-kubernetes/blob/master/node-exporter-ds.yaml and the image is prom/node-exporter:v1.0.1.

@SuperQ (Member) commented Sep 20, 2020

The node_exporter only reports what the kernel reports via /proc/net/dev. There's no translation or other manipulation of the data. So this is likely a kernel bug. What would be useful for debugging is if you could paste the raw data for a full day.

Starting at 2020-09-17 14:00

node_network_receive_bytes_total{instance="10.196.1.6",device="bond0"}[1d]

If you could copy-and-paste the text from the Value column into a file and attach it to the issue, I would like to look over the raw sample data.
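One way to export that data (a sketch; the Prometheus address is a placeholder and jq is assumed to be available):

$ curl -s 'http://localhost:9090/api/v1/query' \
    --data-urlencode 'query=node_network_receive_bytes_total{instance="10.196.1.6",device="bond0"}[1d]' \
  | jq -r '.data.result[0].values[] | "\(.[1]) \(.[0])"' > raw.txt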

@null-test-7 (Author)

This is raw sample data for one day.
node_network_receive_bytes_total{instance="10.196.1.6",device="bond2"}[1d]
raw.txt

There are many machines in our Kubernetes cluster. All CPU nodes' network metrics are normal, but about half of the GPU nodes' network metrics are abnormal.

10.196.1.6 is a GPU node; the kernel version is 3.10.0 and the GPU is an NVIDIA Tesla V100.

@SuperQ (Member) commented Sep 24, 2020

This very much looks like something is wrong with these systems.

If you look at the raw values with a simple awk script, you can see the jumps in the data:

$ awk 'NR == 1{old = $1; line = 1 ; next} {print line, $1 - old; line += 1 ; old = $1}' raw.txt
1 180504452352
2 184214446848
3 182921131264
4 183894123520
5 179318487808
6 182999593216
7 186817020672
8 3311125748826624
9 186636443904
10 177499428608
11 178987272448
12 183984390400
13 152058929920
14 171973385728
15 171997525248
16 170380139264
17 165839404032
18 1722169465976320
19 168533415424
20 167896554496

1833299510051384000 1600570841.518
1833299690555836400 1600570901.518
1833299874770283300 1600570961.518
1833300057691414500 1600571021.518
1833300241585538000 1600571081.518
1833300420904025900 1600571141.518
1833300603903619000 1600571201.518
1833300790720639700 1600571261.518
1836611916469466400 1600571321.518
1836612103105910300 1600571381.697
1836612280605339000 1600571441.518
1836612459592611300 1600571501.518
1836612643577001700 1600571561.518
1836612795635931600 1600571621.518
1836612967609317400 1600571681.518
1836613139606842600 1600571741.518
1836613309986982000 1600571801.518
1836613475826386000 1600571861.518
1838335645292362200 1600571921.518
1838335813825777700 1600571981.518
1838335981722332200 1600572041.518

Like I said, the node_exporter does not alter data; it takes the values from the kernel and reports them directly.

@Nebojsa92

@null-test-7 Did you ever resolve the source of this issue?

@SuperQ (Member) commented May 23, 2023

This seems to be a node / kernel issue. Closing as stale.

SuperQ closed this as completed May 23, 2023
@frittentheke (Contributor) commented Nov 7, 2023

@Nebojsa92 @null-test-7

I just observed this behavior on hosts running kernel 6.2.0-36-generic (uname: Linux hostname 6.2.0-36-generic #37~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon Oct 9 15:34:04 UTC 2 x86_64 x86_64 x86_64 GNU/Linux) with a 2x100G NIC (Ethernet controller: Intel Corporation Ethernet Controller E810-C for QSFP (rev 02)) running in a bond (LACP).

@Nebojsa92

I've observed this starting from kernel 6.2; up to and including kernel 6.1 I didn't have such spikes. It is definitely a kernel ice driver bug. I've contacted Intel support, but they weren't able to reproduce it. The last kernel I tried was 6.5, and the spikes still occur.

No luck for now.

@frittentheke (Contributor)

[...] a 2x100G NIC Ethernet controller: Intel Corporation Ethernet Controller E810-C for QSFP (rev 02) running in bond (LACP).

The spikes are there even in active-backup bonding mode, so it's not LACP-related.

@frittentheke (Contributor)

@Nebojsa92 I posted to the Intel driver mailing list (https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20231113/038041.html) with two spikes documented via continuous logging of /proc/net/dev. Did you ever try the netlink mode (#2777) of node_exporter and also observe those spikes?

@YZ775 commented Nov 29, 2023

I have a similar issue on Linux kernel 5.15 with the Intel ice driver, and no solution has been found.
node_network_transmit_bytes_total and node_network_receive_bytes_total spike occasionally during the day.
Does anyone have a solution?

node_exporter:
  version 1.5.0

NIC:
  Intel Corporation Ethernet Controller E810-XXV for SFP (rev 02)

output of ethtool -i:
  driver: ice
  version: 5.15.122-flatcar
  firmware-version: 4.00 0x800118b4 21.5.9
  expansion-rom-version: 
  bus-info: 0000:63:00.1
  supports-statistics: yes
  supports-test: yes
  supports-eeprom-access: yes
  supports-register-dump: yes
  supports-priv-flags: yes

P.S.
I think #2777 is a setting for ARP metrics.
Since version 1.4.0, node_exporter has used netlink instead of parsing /proc/net/dev by default.
We can force node_exporter to use the old /proc/net/dev path with the option from #2509.
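For example (a sketch; the exact flag name is an assumption based on the linked PRs — check node_exporter --help for your version):

# force the netdev collector back to parsing /proc/net/dev instead of netlink (1.4.0+)
node_exporter --no-collector.netdev.netlink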

@Nebojsa92

@frittentheke we have tried the version of node_exporter that uses /proc/net/dev and newer ones that use netlink, but the behavior is the same. Something is odd with the ice driver. I did raise a kernel bug with the Intel maintainers, but they couldn't reproduce it on their side.

@SuperQ (Member) commented Dec 26, 2023

@Nebojsa92 If you can reproduce it, I would recommend writing a very simple script that pulls the data from /proc/net/dev every 5-15 seconds, to prove to Intel that they have a bug.
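For example, a minimal sketch (interface name and interval are placeholders):

#!/bin/sh
# Log the kernel's raw counters for one interface every 10 seconds,
# prefixed with a Unix timestamp, independent of node_exporter/Prometheus.
IFACE=${1:-bond0}
while true; do
  printf '%s ' "$(date +%s)"
  grep "$IFACE:" /proc/net/dev
  sleep 10
done >> "netdev-$IFACE.log"

A jump in the logged receive-bytes column that does not match real traffic then points squarely at the driver/kernel.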

@frittentheke (Contributor)

@Nebojsa92 I posted my findings to the mailing list at https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20231113/038041.html; please join in and provide your debug info there as well.

@frittentheke (Contributor)


@Nebojsa92 @null-test-7 it seems Intel has indeed found an issue and will provide a fix soon. It would be great if we could all then verify that the issue is gone for good. It seems to really depend on how the NIC is used; otherwise others would have complained a lot more already.
