wish: implement a retry policy #12
Comments
Oh, that's nice - yeah, should be doable 🙂 Would you see such a setting as global or per-policy?
Whatever is easier to implement - I don't actually have a use case for having it separate per policy, as I don't see how exec(lxc) can fail differently depending on the policy. Maybe if snapshot removal at the ZFS level is slower depending on the snapshots around it, but that's a stretch... although openzfs/zfs#11933 - still a stretch ;)
One more thing here, somewhat related: lxc can also get stuck and never exit, so it would be nice to have a timeout on exec calls to lxc. It happened to me just now, and since it was in a systemd unit that was missing TimeoutStartSec, it was happily hanging in there "as a service" for two weeks until I realized there were no more snapshots ;)
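For illustration, the requested behavior - a hard deadline on each lxc invocation - can be sketched like this. This is a hypothetical Python sketch, not lxd-snapper's actual code (the tool itself is written in Rust); the helper name `run_with_timeout` is made up:

```python
import subprocess

# Hypothetical sketch (not lxd-snapper's actual implementation): wrap each
# `lxc` invocation with a hard timeout so a hung client cannot block the
# whole run. `subprocess.run(..., timeout=...)` kills the child process and
# raises `TimeoutExpired` once the deadline passes.
def run_with_timeout(cmd, timeout):
    try:
        return subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return None  # hung: caller decides whether to retry, skip, or fail

# e.g. run_with_timeout(["lxc", "list"], timeout=60)
```

The point is that a hang is converted into an ordinary, handleable failure instead of blocking the systemd unit indefinitely.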
Oh, this I can implement pretty quickly! 😄 Check out current master - I've just added an `lxc-timeout` setting there.
I've tried it out, and it actually makes every call to lxc take the full `lxc-timeout` time instead of timing it out ;)
Huh, that's pretty random - I've just re-checked on my machine and everything seems to be working as intended, i.e. the commands complete without any extra delay.
Which OS and kernel are you using? 👀
I'm on CentOS 8 Stream, 4.18.0-408.el8.x86_64. Maybe you should add at least one more machine to be able to see it - in my case the delays are between the machines, i.e. [OK] appears immediately, but it then waits the full lxc-timeout before skipping to the next one.
Yeah, I did check on multiple machines - even with a few different kernel versions (4.14, 4.9 & 5.4) 🤔 Would you mind checking this binary? (it's lxd-snapper built via Nix, through …)
Same result. I have noticed, however, that according to `ps -e f` it spawns lxc list and hangs there for the duration of the timeout. An identical lxc list command issued on the command line returns within seconds, so it might be something else, not the timeout per se. The version that works for me is the last release (v1.3.0), so it might be something added to master after that.
Okie, I've just prepared a different implementation - feel free to check out the current master branch if you find a minute 🙂
Hi,
Thanks for a wonderful tool, it saved my life a couple of times already :)
I have a large(ish) cluster - 6 nodes, 150+ containers - and there is always something going on: a backup, devs playing around and overloading individual nodes, upgrades, maintenance, etc. So more often than not lxc times out, and then the complete service "fails", like this:
In some cases this is a problem, as a snapshot that is not deleted on time uses disk space, which is sometimes scarce. So would it be possible to implement some kind of retry policy, preferably configurable, like:
retry: 5
retry-interval: 30s
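The proposed `retry` / `retry-interval` policy could be sketched like this. This is a hypothetical Python illustration - the key names mirror the config suggested above, not real lxd-snapper options, and `run_with_retry` is an invented helper:

```python
import subprocess
import time

# Hypothetical sketch of the proposed `retry` / `retry-interval` policy;
# the parameter names mirror the config keys suggested in the issue, not
# real lxd-snapper options. A hang (timeout) is treated like a failure.
def run_with_retry(cmd, retry=5, retry_interval=30.0, timeout=60.0):
    last = None
    for attempt in range(1 + retry):  # one initial try plus `retry` retries
        if attempt > 0:
            time.sleep(retry_interval)  # back off between attempts
        try:
            last = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
            if last.returncode == 0:
                return last  # success: stop retrying
        except subprocess.TimeoutExpired:
            last = None  # the command hung and was killed
    return last  # retries exhausted; None if the final attempt timed out
```

With `retry: 5` and `retry-interval: 30s`, a transiently overloaded node would get up to six attempts over ~2.5 minutes before the service reports a failure.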