etcd timeouts and election problems #915
While writing this, we suffered another failover. Here's the log from the server that was the master
Here is the log from the server that took over as master
I'm looking at my ping log, and the highest latency logged is half a millisecond. Here is my current leader info:
I should also mention that nothing is talking to the etcd cluster yet, so it's not processing commands or anything like that. My leader changed 10+ times while I slept last night.
Here are my self statistics for each node:
app01
app02
app03
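To quantify the flapping described above, one option is to poll each member's stats endpoint on a schedule and count leader transitions. A minimal sketch, assuming the `leaderInfo.leader` field shape of the 0.4-era `/v2/stats/self` response (the field names and the sample payloads are assumptions to verify against your version):

```python
import json

def count_leader_changes(samples):
    """Count leader transitions across successive /v2/stats/self
    responses (raw JSON strings)."""
    changes, prev = 0, None
    for raw in samples:
        leader = json.loads(raw).get("leaderInfo", {}).get("leader")
        if prev is not None and leader != prev:
            changes += 1
        prev = leader
    return changes

# Hypothetical samples: the leader moves app01 -> app02 -> app01.
samples = [
    '{"leaderInfo": {"leader": "app01"}}',
    '{"leaderInfo": {"leader": "app01"}}',
    '{"leaderInfo": {"leader": "app02"}}',
    '{"leaderInfo": {"leader": "app01"}}',
]
print(count_leader_changes(samples))  # 2
```

Polling this once a minute from cron and logging the result would give a timeline to correlate against the missed-heartbeat messages in the etcd log.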
I looked at my state information, and I am seeing gets increase even though no clients should be connected. I did an strace and I am seeing this over and over again in the network traffic:
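Since the store stats are cumulative counters, diffing two samples taken a minute apart shows exactly which operations are happening on an "idle" cluster. A small sketch (the counter names mirror the `getsSuccess`-style keys quoted in this thread; the exact set and the sample values are assumptions):

```python
def stat_deltas(before, after):
    """Return only the counters that grew between two samples of
    /v2/stats/store output, so internal traffic stands out."""
    return {k: after[k] - before[k]
            for k in after
            if k in before and after[k] > before[k]}

# Hypothetical samples taken a minute apart with no clients connected.
before = {"getsSuccess": 1200, "setsSuccess": 40, "expireCount": 0}
after  = {"getsSuccess": 1350, "setsSuccess": 40, "expireCount": 0}
print(stat_deltas(before, after))  # {'getsSuccess': 150}
```

One hypothesis consistent with a 12-standby cluster is that the standbys' periodic sync requests account for the growing gets, but that would need confirming against the 0.4 standby implementation.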
@WillPlatnick
@unihorn I will blow my configs away and start a new cluster, stand by.
@unihorn
I will give more info if/as things degrade.
So far, it's going just as it did when I installed it earlier. Here's how we start: with missed heartbeats, even though LAN communication is fine and these boxes are not taxed at all. I'll update with more logs as things progress.
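One thing worth checking here: a heartbeat is missed on the worst-case delay, not the average, and mean ping latency can hide rare multi-hundred-millisecond stalls (GC pauses, disk syncs, scheduler hiccups). A small sketch illustrating why the median is the wrong metric (nearest-rank percentile; the sample values are made up):

```python
def percentile(samples, p):
    """Nearest-rank percentile of a list of latency samples (ms)."""
    ordered = sorted(samples)
    rank = max(1, int(round(p / 100.0 * len(ordered))))
    return ordered[rank - 1]

# Hypothetical round-trip times: 99 fast pings plus one 300 ms stall.
rtts = [0.05] * 99 + [300.0]
print(percentile(rtts, 50))   # 0.05 -- the median looks perfect
print(percentile(rtts, 100))  # 300.0 -- the worst case blows any sub-300 ms timeout
```

Logging per-request timings and looking at the maximum rather than the mean would show whether such stalls line up with the missed-heartbeat messages.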
@WillPlatnick
@unihorn https://gist.github.com/WillPlatnick/d4be17b1704ebb4ae511 CPU usage is higher than it should be given that we're not querying etcd at all; it's just sitting there. It doesn't start off this high; it grows.
Here is my store information. Please note getsSuccess is growing, even though nothing is querying it.
Here's my config
And here is an example of leader statistics:
@WillPlatnick I guess that you may have turned off the snapshot feature, and that may be the reason for the frequent leader changes. With snapshots disabled, etcd keeps all entries in memory, which slows the heap allocator.
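In other words, with snapshotting off the log is never compacted, so memory use and allocator pressure grow without bound on a long-running cluster. As a hedged illustration, the relevant line in a 0.4-era config file would look something like this (key name taken from that era's sample config; verify against your version's documentation):

```
# etcd 0.4-style config file fragment (illustrative; check your
# version's sample config for the exact key names)
snapshot = true
```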
Here is what the server looked like over the last 3 days:
Here's the disk usage:
And my config:
@WillPlatnick I think if you could start etcd with
We have a gig pipe, the kernel has plenty of open ports, and ulimits seem fine, so I don't think so. We would also see it reflected on our app side, which we don't. I have turned on snapshot (I just grabbed the config file defaults from your documentation, so snapshotting may be a good thing to have turned on by default) and I will let you know if frequent elections still happen.
I'm happy to say that since turning snapshot to true, the election issue seems to have stopped! I am still getting missed heartbeats, though. What can I do to gather more info on this?
@WillPlatnick
No, I gave a graph that showed system resource usage. These boxes are not taxed at all when this happens.
On Thu, Aug 14, 2014 at 3:56 AM, Yicheng Qin notifications@github.com
@unihorn Should
@carmstrong Yes, the default config file is an old one and has been updated in upstream. @WillPlatnick I have no idea why that happens. We will possibly try to add more logging for it. |
@WillPlatnick Sorry we are getting back to this issue so late, but I'm glad to see things have improved for you. As far as the timeouts are concerned, there are just so many reasons this can happen. I would recommend following the tuning guide and bumping the timeouts little by little until the issue goes away. Finding the root cause of the missed timeouts will require some in-depth troubleshooting. If this is a production environment, I would strongly recommend running etcd on a dedicated set of nodes without normal workloads. With that said, I'm closing this ticket out due to age and the fact that things have improved for you.
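For reference, bumping the timeouts "little by little" in the 0.4 era was done with peer timeout flags. An illustrative invocation (the flag names follow the 0.4-era tuning guide and the values are placeholders, not recommendations; confirm against `etcd -h` for your version):

```
etcd -name app01 \
     -peer-heartbeat-timeout=100 \
     -peer-election-timeout=1000
```

Keeping the election timeout several times the heartbeat timeout, as the tuning guide suggests, gives transient delays room to resolve before they trigger an election.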
Hello,
I am new to etcd, built a cluster last night using 0.4.6 running on Debian Wheezy, and I am experiencing timeouts. The cluster has 3 active members, and 12 standbys.
I am really wondering why I am having timeouts. These are baremetal servers on the same gigabit LAN. While I was running these tests, I was also running ping tests, and there was never any increase in latency. There are 12+ gigs of memory free on each of these servers, and according to our graphs, CPU on the boxes has not gone above 15%. The network pipe is nowhere near saturated. These boxes are active application servers, so we would know through the tons of different metrics we track if there were issues with our network connection.
I've read the tuning guide, and for a gigabit LAN, it seems we really shouldn't have to do any tuning at all. Average latency (as determined by ICMP) is 50 microseconds.
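The back-of-the-envelope arithmetic supports that reading: the budget a heartbeat has to arrive in dwarfs the measured RTT. A quick sanity check in Python (the 50 ms heartbeat figure is an assumed 0.4-era default, not taken from this cluster's config; verify for your version):

```python
# Rough budget check: compare the measured LAN RTT against the
# heartbeat timeout. The timeout value is an assumed 0.4-era
# default, not read from this cluster's config.
rtt_ms = 0.05        # 50 microseconds, measured over ICMP
heartbeat_ms = 50.0  # assumed default peer heartbeat timeout
print(round(heartbeat_ms / rtt_ms))  # 1000 -> ~1000x headroom on the wire
```

With that much headroom, missed heartbeats point away from the network and toward per-node stalls (GC, disk, scheduling) rather than latency.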
Where do I go from here? What information would be most helpful?
Current etcd usage from ps
We're also occasionally (a few per hour, more in the middle of the afternoon today, which is one of our slow times) seeing missed heartbeats. These started immediately after starting the cluster.