
etcd timeouts and election problems #915

Closed
WillPlatnick opened this issue Aug 1, 2014 · 20 comments

@WillPlatnick

Hello,
I am new to etcd. I built a cluster last night using 0.4.6 on Debian Wheezy, and I am experiencing timeouts. The cluster has 3 active members and 12 standbys.

I am really wondering why I am having timeouts. These are bare-metal servers on the same gigabit LAN. While running these tests I was also running ping tests, and there was never any increase in latency. There is 12+ GB of memory free on each of these servers, and according to our graphs, CPU on each box has not gone above 15%. The network pipe is nowhere near saturated. These boxes are active application servers, so we would know through the many metrics we track if there were issues with our network connection.

I've read the tuning guide, and for a gigabit LAN, it seems we really shouldn't have to do any tuning at all. Average latency (as determined by ICMP) is 50 microseconds.

Where do I go from here? What information would be most helpful?

[etcd] Jul 31 15:15:29.411 INFO      | app03: warning: heartbeat near election timeout: 168.757749ms
[etcd] Jul 31 15:17:29.570 INFO      | app03: warning: heartbeat near election timeout: 161.028528ms
[etcd] Jul 31 15:19:29.720 INFO      | app03: warning: heartbeat near election timeout: 160.664617ms
[etcd] Jul 31 15:23:30.030 INFO      | app03: warning: heartbeat near election timeout: 169.964666ms
[etcd] Jul 31 15:27:30.336 INFO      | app03: warning: heartbeat near election timeout: 176.612334ms
[etcd] Jul 31 15:29:30.493 INFO      | app03: warning: heartbeat near election timeout: 183.208334ms
[etcd] Jul 31 15:31:30.653 INFO      | app03: warning: heartbeat near election timeout: 193.302576ms
[etcd] Jul 31 15:33:30.815 INFO      | app03: warning: heartbeat near election timeout: 205.103987ms
[etcd] Jul 31 15:35:30.971 INFO      | app03: warning: heartbeat near election timeout: 160.976343ms
[etcd] Jul 31 15:39:31.276 INFO      | app03: warning: heartbeat near election timeout: 166.767389ms
[etcd] Jul 31 15:41:31.430 INFO      | app03: warning: heartbeat near election timeout: 169.963436ms
[etcd] Jul 31 15:43:31.588 INFO      | app03: warning: heartbeat near election timeout: 178.110018ms
[etcd] Jul 31 15:45:31.735 INFO      | app03: warning: heartbeat near election timeout: 174.857319ms
[etcd] Jul 31 15:49:32.052 INFO      | app03: warning: heartbeat near election timeout: 192.572154ms
[etcd] Jul 31 15:51:32.219 INFO      | app03: warning: heartbeat near election timeout: 208.909646ms
[etcd] Jul 31 15:55:32.538 INFO      | app03: warning: heartbeat near election timeout: 178.398576ms
[etcd] Jul 31 15:57:32.707 INFO      | app03: warning: heartbeat near election timeout: 197.037774ms
[etcd] Jul 31 15:59:32.860 INFO      | app03: state changed from 'follower' to 'candidate'.
[etcd] Jul 31 15:59:32.861 INFO      | app03: leader changed from 'app01' to ''.
[etcd] Jul 31 15:59:32.862 INFO      | app03: state changed from 'candidate' to 'leader'.
[etcd] Jul 31 15:59:32.862 INFO      | app03: leader changed from '' to 'app03'.

Current etcd usage from ps

root      8522  0.7  0.2  99936 89168 ?        Sl   Jul30  10:23 /usr/bin/etcd -config /etc/etcd/etcd.conf

We're also occasionally seeing missed heartbeats (a few per hour, with more in the middle of the afternoon today, which is one of our slower times). These started immediately after starting the cluster.

[etcd] Jul 31 19:27:49.027 INFO      | app01: warning: heartbeat time out peer="app02" missed=1 backoff="2s"
[etcd] Jul 31 19:44:34.067 INFO      | app01: warning: heartbeat time out peer="app02" missed=1 backoff="2s"
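
As a starting point for the "what information would be most helpful" question, etcd exposes its runtime statistics over the client API; a minimal sketch, assuming it is run on one of the members and that the cluster uses the default client port 4001 shown in the config later in this thread:

# Pull etcd's built-in statistics from a member (client port 4001 assumed).
curl -s http://127.0.0.1:4001/v2/stats/self     # role, current leader, uptime, append counters
curl -s http://127.0.0.1:4001/v2/stats/store    # get/set/compare-and-swap counters, watcher count
# Only meaningful when run against the current leader:
curl -s http://127.0.0.1:4001/v2/stats/leader   # per-follower latency and failure counts
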
@WillPlatnick
Author

While writing this, we suffered another failover.

Here's the log from the server that was the master

[etcd] Jul 31 19:14:34.516 INFO      | app01: state changed from 'follower' to 'candidate'.
[etcd] Jul 31 19:14:34.516 INFO      | app01: leader changed from 'app03' to ''.
[etcd] Jul 31 19:14:34.517 INFO      | app01: state changed from 'candidate' to 'leader'.
[etcd] Jul 31 19:14:34.517 INFO      | app01: leader changed from '' to 'app01'.
[etcd] Jul 31 19:14:34.567 INFO      | app01: warning: heartbeat time out peer="app03" missed=1 backoff="2s"
[etcd] Jul 31 19:27:49.027 INFO      | app01: warning: heartbeat time out peer="app02" missed=1 backoff="2s"
[etcd] Jul 31 19:44:34.067 INFO      | app01: warning: heartbeat time out peer="app02" missed=1 backoff="2s"
[etcd] Jul 31 19:53:51.331 INFO      | app01: state changed from 'leader' to 'follower'.
[etcd] Jul 31 19:53:51.331 INFO      | app01: term #7 started.
[etcd] Jul 31 19:53:51.331 INFO      | app01: leader changed from 'app01' to ''.

Here is the log from the server that took over as master

[etcd] Jul 31 19:14:34.516 INFO      | app02: leader changed from 'app03' to ''.
[etcd] Jul 31 19:15:47.974 INFO      | app02: warning: heartbeat near election timeout: 206.743366ms
[etcd] Jul 31 19:17:48.151 INFO      | app02: warning: heartbeat near election timeout: 183.463733ms
[etcd] Jul 31 19:19:48.326 INFO      | app02: warning: heartbeat near election timeout: 208.428745ms
[etcd] Jul 31 19:21:48.504 INFO      | app02: warning: heartbeat near election timeout: 186.814225ms
[etcd] Jul 31 19:23:48.681 INFO      | app02: warning: heartbeat near election timeout: 213.856737ms
[etcd] Jul 31 19:29:49.225 INFO      | app02: warning: heartbeat near election timeout: 207.1543ms
[etcd] Jul 31 19:31:49.391 INFO      | app02: warning: heartbeat near election timeout: 173.001583ms
[etcd] Jul 31 19:33:49.587 INFO      | app02: warning: heartbeat near election timeout: 219.494006ms
[etcd] Jul 31 19:35:49.749 INFO      | app02: warning: heartbeat near election timeout: 181.632944ms
[etcd] Jul 31 19:37:49.920 INFO      | app02: warning: heartbeat near election timeout: 202.313727ms
[etcd] Jul 31 19:39:50.086 INFO      | app02: warning: heartbeat near election timeout: 168.645856ms
[etcd] Jul 31 19:41:50.261 INFO      | app02: warning: heartbeat near election timeout: 193.292717ms
[etcd] Jul 31 19:43:50.440 INFO      | app02: warning: heartbeat near election timeout: 222.085741ms
[etcd] Jul 31 19:45:50.639 INFO      | app02: warning: heartbeat near election timeout: 221.036645ms
[etcd] Jul 31 19:47:50.815 INFO      | app02: warning: heartbeat near election timeout: 197.009793ms
[etcd] Jul 31 19:49:50.994 INFO      | app02: warning: heartbeat near election timeout: 220.67798ms
[etcd] Jul 31 19:51:51.157 INFO      | app02: warning: heartbeat near election timeout: 189.628257ms
[etcd] Jul 31 19:53:51.325 INFO      | app02: state changed from 'follower' to 'candidate'.
[etcd] Jul 31 19:53:51.325 INFO      | app02: leader changed from 'app01' to ''.
[etcd] Jul 31 19:53:51.326 INFO      | app02: state changed from 'candidate' to 'leader'.
[etcd] Jul 31 19:53:51.326 INFO      | app02: leader changed from '' to 'app02'.

I'm looking at my ping log, and the highest latency logged is half a millisecond.

Here is my current leader info

{
    "followers": {
        "app01": {
            "counts": {
                "fail": 0,
                "success": 10626
            },
            "latency": {
                "average": 4.188319562676455,
                "current": 0.356693,
                "maximum": 172.631804,
                "minimum": 0.288977,
                "standardDeviation": 11.928285111683419
            }
        },
        "app03": {
            "counts": {
                "fail": 0,
                "success": 10588
            },
            "latency": {
                "average": 4.949293496883256,
                "current": 0.394716,
                "maximum": 340.710121,
                "minimum": 0.326099,
                "standardDeviation": 15.646117752647621
            }
        }
    },
    "leader": "app02"
}
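
For what it's worth, the maximum follower latencies above (roughly 173 ms and 341 ms) are in the same range as the "near election timeout" values in the warning logs, so one quick check is to pull just those maxima out of the leader stats. A sketch, assuming jq is installed and the client port is 4001:

curl -s http://127.0.0.1:4001/v2/stats/leader \
  | jq '.followers | to_entries[] | {peer: .key, max_latency_ms: .value.latency.maximum}'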

@WillPlatnick changed the title from "etcd timeouts" to "etcd timeouts and election problems" on Aug 1, 2014
@WillPlatnick
Author

I should also mention that nothing is talking to the etcd cluster yet, so it's not being bogged down processing commands or anything like that. My leader changed 10+ times while I slept last night.

@WillPlatnick
Author

Here are my self statistics

app01

{
    "leaderInfo": {
        "leader": "app02",
        "startTime": "2014-08-01T10:51:17.327092451-05:00",
        "uptime": "1m55.957327286s"
    },
    "name": "app01",
    "recvAppendRequestCnt": 1259105,
    "recvBandwidthRate": 900.1873044457652,
    "recvPkgRate": 20.100196593631022,
    "sendAppendRequestCnt": 2982741,
    "startTime": "2014-07-30T20:29:53.913685961-05:00",
    "state": "follower"
}

app02

{
    "leaderInfo": {
        "leader": "app02",
        "startTime": "2014-08-01T10:51:17.374857263-05:00",
        "uptime": "2m44.543917287s"
    },
    "name": "app02",
    "recvAppendRequestCnt": 2166321,
    "sendAppendRequestCnt": 1115667,
    "sendBandwidthRate": 1773.0685496022056,
    "sendPkgRate": 39.6039434800582,
    "startTime": "2014-07-30T20:49:59.389824018-05:00",
    "state": "leader"
}

app03

{
    "leaderInfo": {
        "leader": "app02",
        "startTime": "2014-08-01T10:51:17.323766255-05:00",
        "uptime": "3m8.317772415s"
    },
    "name": "app03",
    "recvAppendRequestCnt": 678255,
    "recvBandwidthRate": 900.3151676261217,
    "recvPkgRate": 20.10080749332712,
    "sendAppendRequestCnt": 885476,
    "startTime": "2014-07-31T19:14:34.293943442-05:00",
    "state": "follower"
}
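
To catch the overnight leader changes with timestamps, one option is to poll the self stats in a loop on each node. A rough sketch, again assuming jq and the local client port 4001:

# Log state and current leader once a minute to spot unplanned elections.
while true; do
  echo "$(date '+%F %T') $(curl -s http://127.0.0.1:4001/v2/stats/self | jq -r '.state + " leader=" + .leaderInfo.leader')"
  sleep 60
done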

@WillPlatnick
Author

I looked at my state information, and I am seeing gets increase even though no clients should be connected. I ran strace and I am seeing this over and over again in the network traffic:

5251  accept4(9, {sa_family=AF_INET6, sin6_port=htons(21101), inet_pton(AF_INET6, "::ffff:10.17.185.139", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28], SOCK_CLOEXEC|SOCK_NONBLOCK) = 11
5251  getsockname(11, {sa_family=AF_INET6, sin6_port=htons(7001), inet_pton(AF_INET6, "::ffff:10.16.145.49", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 0
5251  setsockopt(11, SOL_TCP, TCP_NODELAY, [1], 4) = 0
5251  accept4(9, 0xc2164e9af0, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
5251  --- SIGPIPE (Broken pipe) @ 0 (0) ---
5251  accept4(9, {sa_family=AF_INET6, sin6_port=htons(14356), inet_pton(AF_INET6, "::ffff:10.17.185.130", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28], SOCK_CLOEXEC|SOCK_NONBLOCK) = 11
5251  getsockname(11, {sa_family=AF_INET6, sin6_port=htons(7001), inet_pton(AF_INET6, "::ffff:10.16.145.49", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 0
5251  setsockopt(11, SOL_TCP, TCP_NODELAY, [1], 4) = 0
5251  accept4(9, 0xc2164e4f50, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
5251  accept4(9, {sa_family=AF_INET6, sin6_port=htons(9108), inet_pton(AF_INET6, "::ffff:10.16.145.48", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28], SOCK_CLOEXEC|SOCK_NONBLOCK) = 11
5251  getsockname(11, {sa_family=AF_INET6, sin6_port=htons(7001), inet_pton(AF_INET6, "::ffff:10.16.145.49", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 0
5251  setsockopt(11, SOL_TCP, TCP_NODELAY, [1], 4) = 0
5251  accept4(9, 0xc2164d4230, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
5251  --- SIGPIPE (Broken pipe) @ 0 (0) ---
5251  accept4(9, {sa_family=AF_INET6, sin6_port=htons(9332), inet_pton(AF_INET6, "::ffff:10.16.145.60", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28], SOCK_CLOEXEC|SOCK_NONBLOCK) = 11
5251  getsockname(11, {sa_family=AF_INET6, sin6_port=htons(7001), inet_pton(AF_INET6, "::ffff:10.16.145.49", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 0
5251  setsockopt(11, SOL_TCP, TCP_NODELAY, [1], 4) = 0
5251  accept4(9, 0xc2164c4850, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
5220  accept4(9, {sa_family=AF_INET6, sin6_port=htons(38526), inet_pton(AF_INET6, "::ffff:10.16.145.58", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28], SOCK_CLOEXEC|SOCK_NONBLOCK) = 11
5220  getsockname(11, {sa_family=AF_INET6, sin6_port=htons(7001), inet_pton(AF_INET6, "::ffff:10.16.145.49", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 0
5220  setsockopt(11, SOL_TCP, TCP_NODELAY, [1], 4) = 0
5220  accept4(9, 0xc2164b7af0, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
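
For reference, a trace like the one above can be reproduced by attaching strace to the running etcd process and filtering to network-related syscalls (assuming root access and a single etcd process on the box):

sudo strace -f -tt -e trace=network -p "$(pidof etcd)"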

@yichengq
Contributor

yichengq commented Aug 1, 2014

@WillPlatnick
Could you share the logs of these three nodes and the standby nodes with us?
Have you tried using 3 active members only? Does it work well overnight?
It would also be helpful to share the cluster config info with us: curl http://127.0.0.1:7001/v2/admin/config
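
Besides the admin config endpoint above, the 0.4 admin API also serves the machine list on the peer port, which helps confirm which nodes are active peers and which are standbys; a sketch, hedged in that the exact response shape may vary across 0.4.x releases:

curl -s http://127.0.0.1:7001/v2/admin/config     # activeSize, removeDelay, syncInterval
curl -s http://127.0.0.1:7001/v2/admin/machines   # per-node name, state (peer or standby), peer/client URLs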

@WillPlatnick
Author

@unihorn I will blow my configs away and start a new cluster, stand by

@WillPlatnick
Author

@unihorn
Here is my cluster config

{
    "activeSize": 3,
    "removeDelay": 180,
    "syncInterval": 5
}

I will give more info if/as things degrade.

@WillPlatnick
Author

So far, it's going just as it did when I installed it earlier.

Here's how it starts: with missed heartbeats, even though LAN communication is fine and these boxes are not taxed at all. I'll update with more logs as things progress.

[etcd] Aug  1 14:44:58.100 INFO      | app01: warning: heartbeat time out peer="app02" missed=1 backoff="2s"
[etcd] Aug  1 14:49:28.815 INFO      | app01: warning: heartbeat time out peer="app03" missed=1 backoff="2s"
[etcd] Aug  1 14:49:38.269 INFO      | app01: warning: heartbeat time out peer="app02" missed=1 backoff="2s"
[etcd] Aug  1 14:49:43.350 INFO      | app01: warning: heartbeat time out peer="app02" missed=1 backoff="2s"
[etcd] Aug  1 14:49:44.665 INFO      | app01: warning: heartbeat time out peer="app03" missed=1 backoff="2s"
[etcd] Aug  1 14:49:49.550 INFO      | app01: warning: heartbeat time out peer="app02" missed=1 backoff="2s"
[etcd] Aug  1 14:49:51.215 INFO      | app01: warning: heartbeat time out peer="app03" missed=1 backoff="2s"
[etcd] Aug  1 14:49:51.620 INFO      | app01: warning: heartbeat time out peer="app02" missed=5 backoff="4s"
[etcd] Aug  1 14:49:53.316 INFO      | app01: warning: heartbeat time out peer="app03" missed=6 backoff="4s"
[etcd] Aug  1 14:50:03.915 INFO      | app01: warning: heartbeat time out peer="app03" missed=1 backoff="2s"
[etcd] Aug  1 14:50:13.050 INFO      | app01: warning: heartbeat time out peer="app02" missed=1 backoff="2s"
[etcd] Aug  1 14:50:48.050 INFO      | app01: warning: heartbeat time out peer="app02" missed=1 backoff="2s"

@yichengq
Contributor

yichengq commented Aug 4, 2014

@WillPlatnick
Is CPU usage always low for all of the etcd processes?
Does the cluster still consist of 3 peers and 12 standbys?
Could you share the system logs of etcd with me? That would be super helpful.

@WillPlatnick
Author

@unihorn
The cluster is now just 3 active members, as per your request. Here are the etcd logs for all 3 servers in this gist. As you can see, we are going through constant elections, about once per minute right now.

https://gist.github.com/WillPlatnick/d4be17b1704ebb4ae511

CPU usage is higher than it should be, given that we're not querying etcd at all; it's just sitting there. It doesn't start off this high, it grows over time.

root      1400  4.3  1.0 345016 334352 ?       Sl   Aug01 239:54 /usr/bin/etcd -config /etc/etcd/etcd.conf
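
To confirm that the growth is steady rather than spiky, a simple sampling loop works; a sketch, assuming a single etcd process on the host:

# Sample etcd's resident memory (KB) and CPU% every five minutes.
while true; do
  echo "$(date '+%F %T') $(ps -o rss=,pcpu= -p "$(pidof etcd)")"
  sleep 300
done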

Here is my store information. Please note getsSuccess is growing, even though nothing is querying it.

{
    "compareAndDeleteFail": 0,
    "compareAndDeleteSuccess": 0,
    "compareAndSwapFail": 0,
    "compareAndSwapSuccess": 0,
    "createFail": 0,
    "createSuccess": 3,
    "deleteFail": 0,
    "deleteSuccess": 0,
    "expireCount": 0,
    "getsFail": 6,
    "getsSuccess": 502135,
    "setsFail": 0,
    "setsSuccess": 1,
    "updateFail": 0,
    "updateSuccess": 0,
    "watchers": 0
}
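
The growth rate of getsSuccess is easy to eyeball with watch, assuming jq is available:

watch -n 5 "curl -s http://127.0.0.1:4001/v2/stats/store | jq .getsSuccess"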

Here's my config

{
    "activeSize": 3,
    "removeDelay": 180,
    "syncInterval": 5
}

And here is an example of leader statistics:

{
    "followers": {
        "app01": {
            "counts": {
                "fail": 125,
                "success": 1610594
            },
            "latency": {
                "average": 5.244450236231638,
                "current": 0.744285,
                "maximum": 468.251326,
                "minimum": 0.172269,
                "standardDeviation": 16.623390540371723
            }
        },
        "app02": {
            "counts": {
                "fail": 260,
                "success": 1609752
            },
            "latency": {
                "average": 4.569998826313543,
                "current": 0.578571,
                "maximum": 742.599151,
                "minimum": 0.188828,
                "standardDeviation": 15.5175726296063
            }
        }
    },
    "leader": "app03"
}

@yichengq
Contributor

yichengq commented Aug 7, 2014

@WillPlatnick
Could you check the disk utilization on the three machines? etcd does a lot of sync and write operations on disk.
Could you also share your etcd configuration?

The number of getsSuccess is reasonable, because cluster info is also fetched from the store.

My guess is that you may have turned off the snapshot feature, and that may be the reason for the frequent leader changes: it makes etcd keep all entries in memory, which slows down the heap allocator.
That doesn't fully explain a heartbeat miss at least once every five minutes, though.
I started 3 etcd machines on AWS using m1.small instances. They didn't report any heartbeat misses in 8 hours, which should be the normal case.
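
A quick way to check the disk utilization in question is iostat from the sysstat package (assuming it is installed); sustained high %util or await on the device backing the data dir would point at the disk as the culprit:

# Extended per-device stats, refreshed every 5 seconds.
iostat -dx 5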

@WillPlatnick
Author

Here is what the server looked like over the last 3 days:
https://www.evernote.com/shard/s5/sh/d8a44797-67bb-4aa6-a3a4-1178a39ac627/0d75285b02adbce9b90038b4427bca8c

Here's the disk usage:
https://www.evernote.com/shard/s5/sh/b768741e-9800-489f-a1ee-8c9af81a806b/6d59d4236224dc269aed7eb5b7ffebc4

And my config:

addr = "10.x.x.x:4001"
ca_file = ""
cert_file = ""
cors = []
cpu_profile_file = ""
data_dir = "/etc/etcd"
discovery = "https://discovery.etcd.io/xxxxxxxxxx"
http_read_timeout = 10.0
http_write_timeout = 10.0
key_file = ""
peers = []
peers_file = ""
max_cluster_size = 3
max_result_buffer = 1024
max_retry_attempts = 3
name = "app01"
snapshot = false
verbose = false
very_verbose = false

[peer]
addr = "10.x.x.x:7001"
ca_file = ""
cert_file = ""
key_file = ""

[cluster]
active_size = 3
remove_delay = 180.0
sync_interval = 5.0

@yichengq
Contributor

yichengq commented Aug 8, 2014

@WillPlatnick
The network bandwidth used by a 3-node etcd cluster is about 10 KB/s. Could network usage from other processes account for the delay fluctuation?

I think that if you start etcd with snapshot = true, it could solve the frequent leader changes. If etcd doesn't snapshot, it keeps all data in memory and the allocator slows everything down.
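
For reference, enabling it amounts to flipping the snapshot line in the config file shown earlier and restarting etcd; a sketch, assuming the config path from the ps output above and that etcd runs under a standard init service (the restart step is an assumption about how etcd is supervised on these hosts):

sudo sed -i 's/^snapshot = false/snapshot = true/' /etc/etcd/etcd.conf
sudo service etcd restart    # adjust to however etcd is actually supervised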

@WillPlatnick
Author

We have a gigabit pipe, the kernel has plenty of open ports, and ulimits seem fine, so I don't think so. We would also see it reflected on our app side, which we don't.

I have turned on snapshot (I just grabbed the default config file from your documentation, so that may be a good thing to have turned on by default) and I will let you know if frequent elections still happen.

@WillPlatnick
Author

I'm happy to say that since setting snapshot to true, the election issue seems to have stopped!

I am still getting missed heartbeats, though. What can I do to gather more info on this?

@yichengq
Contributor

@WillPlatnick
Do the machines run other processes also? Is it possible that they consume too many resources?

@WillPlatnick
Author

No, I shared a graph earlier that shows system resource usage. These boxes are not taxed at all when this happens.


@carmstrong

@unihorn Should snapshot = true be the default, then? We've run into this more than once ourselves, and it has been reported by our users.

@yichengq
Contributor

@carmstrong Yes, the default config file is an old one and has been updated upstream.

@WillPlatnick I have no idea why that happens. We will possibly try to add more logging for it.

@kelseyhightower
Contributor

@WillPlatnick Sorry we are getting back to this issue so late, but I'm glad to see things have improved for you. As far as the timeouts are concerned, there are just so many reasons this can happen. I would recommend following the tuning guide and bumping the timeouts little by little until the issue goes away.

Finding the root cause of the missed heartbeats will require some in-depth troubleshooting. If this is a production environment, I would strongly recommend running etcd on a dedicated set of nodes without normal workloads.
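
For the "bump the timeouts little by little" step, the tuning guide exposes the Raft timing as command-line flags given in milliseconds. The flag names below are as recalled for the 0.4.x series (later releases renamed them to -heartbeat-interval and -election-timeout), so confirm them against etcd -h before relying on this sketch:

# Raise both values gradually; keep the election timeout several times the heartbeat timeout.
etcd -config /etc/etcd/etcd.conf \
     -peer-heartbeat-timeout=100 \
     -peer-election-timeout=500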

With that said, I'm closing this ticket out due to age and the fact that things have improved for you.

@xiang90 mentioned this issue Dec 15, 2014