Internet Speed Monitor Input Plugin - fetching server list failed: unable to retrieve server list #9852

Closed
phantomski77 opened this issue Oct 3, 2021 · 10 comments
Labels
bug unexpected problem or unintended behavior

Comments

@phantomski77

Relevant telegraf.conf:

# Monitors internet speed in the network
[[inputs.internet_speed]]
  ## Sets if runs file download test
  ## Default: false
  enable_file_download = false
  interval = "30m"

System info:

System: Raspberry Pi 4 Model B Rev 1.4 8GB
OS/Kernel: Ubuntu 21.04 (Linux 5.11.0-1019-raspi) arm64 arch
Version: TELEGRAF_VERSION 1.20.0 from hub.docker.com tag telegraf:1.20.0

Docker

Docker CE 20.10.8
Separate Overlay network for InfluxDB 2.0.8 and Telegraf 1.20.0.
All other plugins and Telegraf container able to communicate with the internet.

Steps to reproduce:

  1. Add plugin in telegraf.conf as per code above
  2. Restart telegraf container.
  3. Wait for interval to execute the plugin
  4. No data stored in InfluxDB and error message produced

Expected behavior:

Internet Speed measured by plugin and stored in InfluxDB database.

Actual behavior:

Telegraf container log:

2021-10-03T18:00:00Z E! [inputs.internet_speed] Error in plugin: fetching server list failed: unable to retrieve server list

The same error is produced on every interval attempt.

Additional info:

When I used the plugin without the interval parameter (so defaulting to the agent's 10s) or with a shorter interval (5m) during initial experiments, it worked fine for about half an hour. Then it suddenly stopped working without any other configuration changes. Other plugins still work fine, collecting local system, local network and internet-based metrics. Restarting the Telegraf container, the InfluxDB container or the whole system does not help; the problem has now lasted about 24 hours.

@phantomski77 phantomski77 added the bug unexpected problem or unintended behavior label Oct 3, 2021
@phantomski77
Author

Just to update: the problem still occurs in v1.20.2 with Docker CE 20.10.9.
Not a single internet_speed check has run without error for 7 days.

@powersj
Contributor

powersj commented Oct 11, 2021

What do you have the agent interval set to? The default of 10s?

Because you are still successfully collecting other metrics, and based on the error message you received, I do not think this is a bug in Telegraf. The error message "fetching server list failed: unable to retrieve server list" comes from the internet speed test Go library and not Telegraf.

This could mean you cannot get to the service. From the system where you are seeing errors, can you see the following page:

https://www.speedtest.net/speedtest-servers-static.php
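
For a scripted version of this check, a minimal Python sketch follows. It only tests whether the URL above answers with HTTP 200 from the machine it runs on; it does not reproduce how the speed test library fetches its server list. The function name `check_url` and the `opener` parameter are illustrative, not part of Telegraf or speedtest-go.

```python
import urllib.error
import urllib.request

# URL taken from the comment above.
SERVER_LIST_URL = "https://www.speedtest.net/speedtest-servers-static.php"


def check_url(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return (ok, detail) for a GET request to url.

    `opener` is a hypothetical hook so the function can be exercised
    without network access; by default it is urllib's urlopen.
    """
    try:
        with opener(url, timeout=timeout) as resp:
            body = resp.read()
            return resp.status == 200, f"HTTP {resp.status}, {len(body)} bytes"
    except (urllib.error.URLError, OSError) as exc:
        # Covers DNS failures, refused connections, timeouts and HTTP errors.
        return False, f"request failed: {exc}"
```

Calling `check_url(SERVER_LIST_URL)` on the affected host returns a pair such as `(True, "HTTP 200, ... bytes")` when the page is reachable, or `(False, "request failed: ...")` otherwise.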

@phantomski77
Author

> What do you have the agent interval set to? The default of 10s?

Global Agent - yes, 10s. This particular plugin - no, changed locally to 30m (as per code above).

> Because you are still successfully collecting other metrics and based on the error message you received, I do not think this is a bug in Telegraf. The error message "fetching server list failed: unable to retrieve server list" comes from the internet speed test go library and not telegraf.
>
> This could mean you cannot get to the service. From the system where you are seeing errors can you see the following page:
>
> https://www.speedtest.net/speedtest-servers-static.php

I can connect without any issues from the Telegraf container: wget https://www.speedtest.net/speedtest-servers-static.php returns 200 and fetches the file successfully (with the correct server list contents). Other locations can be retrieved too, and ping works. It’s just this plugin.

@powersj
Contributor

powersj commented Oct 11, 2021

In the same place as you are running telegraf, can you try running the speedtest-go binary itself? Again, the error message is coming from the library, not telegraf, so something is not working on that side.

@powersj
Contributor

powersj commented Oct 11, 2021

I am going to try running your config locally overnight and see if mine starts erroring like yours. I did try the speedtest-go binary for 30 minutes, running it every minute, and only saw one issue: a broken pipe during an upload.

@powersj
Contributor

powersj commented Oct 12, 2021

I ran this config overnight. I did get a handful of failures to get the server list, but hundreds of successful speed tests. The common thing I see across the errors is the time: every error occurs exactly at the top or bottom of the hour.

Is your test attempting to run then? Do your logs show something similar?
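
One way to check a Telegraf log for this pattern is to bucket the internet_speed error lines by minute of the hour. The Python sketch below assumes the log line format shown in the original report; the function name `error_minutes` is illustrative, and the sample lines other than the first are made up for demonstration.

```python
import re
from collections import Counter

# Matches error lines from the internet_speed input, e.g.
# 2021-10-03T18:00:00Z E! [inputs.internet_speed] Error in plugin: ...
LINE_RE = re.compile(
    r"^\d{4}-\d{2}-\d{2}T\d{2}:(\d{2}):\d{2}Z E! \[inputs\.internet_speed\]"
)


def error_minutes(log_lines):
    """Count internet_speed error lines by minute of the hour."""
    counts = Counter()
    for line in log_lines:
        match = LINE_RE.match(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


sample = [
    # Line quoted from the report above.
    "2021-10-03T18:00:00Z E! [inputs.internet_speed] Error in plugin: "
    "fetching server list failed: unable to retrieve server list",
    # Synthetic lines, for illustration only.
    "2021-10-03T18:30:00Z E! [inputs.internet_speed] Error in plugin: "
    "fetching server list failed: unable to retrieve server list",
    "2021-10-03T18:31:12Z I! an unrelated informational line",
]

print(dict(error_minutes(sample)))  # prints {0: 1, 30: 1}
```

If nearly all counts fall under the keys 0 and 30, the log shows the same top-and-bottom-of-the-hour pattern.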

@achurak

achurak commented Oct 12, 2021

My logs show exactly the same. I haven't had a single successful run, and all of the failures do seem to happen at :00 (the interval is set to 60 minutes). That still seems like a bug to me, though I am not sure whether it is in Telegraf or in speedtest-go.

@powersj
Contributor

powersj commented Oct 13, 2021

I took the upstream speedtest-go binary and set it to run from cron at the top and bottom of the hour:

0 * * * * /home/ubuntu/speedtest-go >> /home/ubuntu/cron.log 2>&1
30 * * * * /home/ubuntu/speedtest-go >> /home/ubuntu/cron.log 2>&1

It also returned the same issue:

$ cat cron.log 
Testing From IP: 75.174.219.60, (CenturyLink) [43.5784, -116.2179]
2021/10/13 15:30:01 unable to retrieve server list
Testing From IP: 75.174.219.60, (CenturyLink) [43.5784, -116.2179]
2021/10/13 16:00:01 unable to retrieve server list

I am going to go ahead and close this as it is not specific to Telegraf. However, I would suggest opening a bug with the speedtest-go project to see what they say as well.

Thanks!

@powersj powersj closed this as completed Oct 13, 2021
@phantomski77
Author

Thank you very much for your testing and effort @powersj
I have run similar tests and the results are exactly the same. Somehow, tests that run at exactly :00 and :30 fail. When I change the interval to 33m, everything runs well all the time.

@Hipska
Contributor

Hipska commented May 20, 2022

You could still run every 60m or 30m but add an offset to it, such as 5m; it might even work with an offset of only a few seconds.
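
To illustrate the arithmetic behind an offset, the Python sketch below counts how many runs of a schedule start in minute :00 or :30. It is a simplified model, not Telegraf's scheduler: it assumes the first run starts at the top of an hour and that later runs follow at exact multiples of the interval. The names `run_times` and `count_hits` and the start date are illustrative.

```python
from datetime import datetime, timedelta

# Arbitrary top-of-the-hour starting point (assumption for this model).
START = datetime(2021, 10, 13, 0, 0, 0)


def run_times(interval, offset=timedelta(0), count=48, start=START):
    """Yield `count` run times: start + offset + k * interval."""
    for k in range(count):
        yield start + offset + k * interval


def count_hits(interval, offset=timedelta(0), count=48, start=START):
    """Count runs that start in minute :00 or :30 of any hour."""
    return sum(
        1 for t in run_times(interval, offset, count, start)
        if t.minute in (0, 30)
    )


schedules = [
    ("30m, no offset", timedelta(minutes=30), timedelta(0)),
    ("30m, 5m offset", timedelta(minutes=30), timedelta(minutes=5)),
    ("60m, 5m offset", timedelta(minutes=60), timedelta(minutes=5)),
    ("33m, no offset", timedelta(minutes=33), timedelta(0)),
]
for label, interval, offset in schedules:
    hits = count_hits(interval, offset)
    print(f"{label}: {hits} of 48 runs start in minute :00 or :30")
```

Under this model the 30m schedule hits :00 or :30 on all 48 runs, while the same schedule with a 5m offset never does. A 33m interval hits on 5 of 48 runs in this model, so it avoids most but not all of those minutes; the real result depends on when the first run starts.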

4 participants