ETCD "wal sync: duration" error. #10414
From a distance, just looking at these metrics, my guess is that etcd is using HDD rather than SSD to store state. You need faster disks to overcome this I/O bottleneck.
Thanks for the reply @hexfusion. Actually, this issue appeared in our cluster suddenly, after it had been running for a long time without any problems. Thanks
Hi @ronakpandya7, just because you were able to operate for some period of time with the current hardware does not mean it is not the problem, right? Increases in workload, the size of your db, etc. could simply have caught up with you. The metrics and the logging both point to slow disk. Also, I am pretty certain that k8s 1.10.5 does not support etcd 3.3; support will be included in 1.11.4 (kubernetes/kubernetes#61326). Not saying that is the issue, but something to consider as well. Can you answer my question about state storage, HDD vs SSD? If you can give me full metrics as attachments, I can look for more issues and/or additional logging around the problem. ref: https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/hardware.md#disks
Here is a nice fio benchmark you can run to get an idea of your disks' write IOPS. I have been meaning to add this to the repo; curious what output you see. Fair warning: the test will generate a 4GB file.
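The exact fio command was not preserved in this thread, but a sketch of the kind of run being referred to looks like the following. The flags mirror etcd's WAL write pattern (small sequential writes, each followed by `fdatasync`); `--size` is kept small here, whereas the original test reportedly generated a 4GB file.

```shell
# Hedged sketch of an fdatasync-heavy fio run; flags are real fio options,
# but the specific values are illustrative, not the commenter's originals.
if command -v fio >/dev/null 2>&1; then
  fio --rw=write --ioengine=sync --fdatasync=1 \
      --directory=. --bs=2300 --size=22m --name=etcd-wal-test
  rm -f etcd-wal-test*            # clean up the data file fio creates
else
  echo "fio not installed"
fi
```

The fdatasync latency percentiles in fio's output are the numbers to compare against the hardware guide.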
Hello @hexfusion, for the storage state we are using 8 disks (HDD) of 300 GB in RAID 1+0. We do not have the metrics.txt.
Here is the output of the spew command, which shows the disk I/O performance, in case it helps you with debugging.
From the hardware guide linked above:
From your test:
Do we need to research this more? Please make sure that your disks are not shared with anything else and are dedicated to etcd. If that is already the case, then your disks are too slow and you need to upgrade them, preferably to SSD.
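When fio is not available, a much cruder check is possible with GNU dd (an assumption on my part, not something suggested in this thread): time a batch of small synchronous writes. etcd's documentation commonly cites ~10ms as the upper bound for the 99th-percentile WAL fsync, so multi-millisecond average writes here indicate a disk that is too slow.

```shell
# Crude proxy for a proper fio benchmark (assumes GNU coreutils dd):
# oflag=dsync forces each 512-byte write to hit stable storage, roughly
# mimicking etcd's per-entry WAL fsync. Read the reported throughput/time.
dd if=/dev/zero of=dd-dsync-test bs=512 count=1000 oflag=dsync
rm -f dd-dsync-test   # remove the scratch file
```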
Why is my memory/tmpfs too slow?
Hello,
We are using Kubernetes version 1.10.5 and etcd version 3.3.3. We are facing a "wal sync duration" issue in etcd.
The status of our etcd cluster is continuously flapping from healthy to unhealthy and back again.
After some time...
We checked the etcd service logs; they are shown below.
etcd logs:
And below are the etcd metrics for "etcd_disk_wal_fsync_duration_seconds_bucket", the WAL fsync histogram:
And below are the etcd metrics for "etcd_disk_backend_commit_duration_seconds_bucket", the backend commit histogram:
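These `_bucket` metrics are cumulative histograms: each `le` bucket counts operations at or below that duration, so the fraction of slow fsyncs can be read off directly (in Prometheus the usual query is `histogram_quantile(0.99, sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) by (le))`). A small illustration of the arithmetic, with invented sample numbers rather than the reporter's actual values:

```shell
# Estimate the share of WAL fsyncs slower than 10ms from cumulative buckets.
# The two sample lines below are invented for illustration only.
cat > buckets.txt <<'EOF'
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.01"} 9000
etcd_disk_wal_fsync_duration_seconds_bucket{le="+Inf"} 10000
EOF
awk -F'} ' '
  /le="0.01"/  { fast  = $2 }   # fsyncs completing within 10ms
  /le="\+Inf"/ { total = $2 }   # all fsyncs
  END { printf "%.1f%% of fsyncs exceeded 10ms\n", 100 * (total - fast) / total }
' buckets.txt
rm -f buckets.txt
```

With these sample numbers the script prints `10.0% of fsyncs exceeded 10ms`; anything far above a percent or two at the 10ms bucket is the signature of a disk that cannot keep up.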
So what can we do here to make etcd stable and working fine?
Thanks
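One stopgap that the thread does not spell out (this comes from etcd's tuning guide, not from a maintainer here): while waiting for faster disks, relaxing the heartbeat and election timeouts makes momentary fsync stalls less likely to trigger leader elections. The flags are real etcd options, but the specific values below are illustrative assumptions, not recommendations for this cluster.

```shell
# Illustrative values only; tune to your measured disk and network latency.
HEARTBEAT_MS=300    # etcd default is 100ms
ELECTION_MS=3000    # etcd default is 1000ms; keep roughly 10x the heartbeat
echo "etcd --heartbeat-interval=${HEARTBEAT_MS} --election-timeout=${ELECTION_MS}"
```

This only masks the symptom (fewer spurious elections); the WAL fsync latency itself is still bounded by the disk.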