Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[0.9.2] timeouts from CQ lead to 500 for all subsequent writes, dies on first restart #3488

Closed
beckettsean opened this issue Jul 28, 2015 · 1 comment

Comments

@beckettsean
Copy link
Contributor

This is on my local box, OS X laptop. The laptop was closed during this entire log snippet so it could be related to OS X powersleep.

Telegraf writes to InfluxDB every 7 seconds. The posts succeeded all night until a timeout in a CQ (the only CQ running in the database) and then all writes after that failed.

[http] 2015/07/27 23:49:49 127.0.0.1 - - [27/Jul/2015:23:49:49 -0700] POST /write?consistency=&db=telegraf&precision=&rp= HTTP/1.1 204 0 - telegraf d78bea53-34f4-11e5-8da0-000000000000 3.834168ms
[http] 2015/07/27 23:49:56 127.0.0.1 - - [27/Jul/2015:23:49:56 -0700] POST /write?consistency=&db=telegraf&precision=&rp= HTTP/1.1 204 0 - telegraf dbb73ca2-34f4-11e5-8da1-000000000000 2.288269ms
[http] 2015/07/27 23:50:03 127.0.0.1 - - [27/Jul/2015:23:50:03 -0700] POST /write?consistency=&db=telegraf&precision=&rp= HTTP/1.1 204 0 - telegraf dfe41937-34f4-11e5-8da2-000000000000 2.101293ms
[continuous_querier] 2015/07/27 23:50:10 timeout
[continuous_querier] 2015/07/27 23:50:10 error: timeout. running: SELECT mean(value) INTO "telegraf"."default".cpu_idle_5m FROM "telegraf"."default".cpu_idle WHERE time >= '2015-07-28T06:50:00Z' AND time < '2015-07-28T06:55:00Z' GROUP BY time(5m), cpu
[continuous_querier] 2015/07/27 23:50:10 error executing query: CREATE CONTINUOUS QUERY foo ON telegraf BEGIN SELECT mean(value) INTO "telegraf"."default".cpu_idle_5m FROM "telegraf"."default".cpu_idle GROUP BY time(5m), cpu END: err = timeout
[http] 2015/07/27 23:50:15 127.0.0.1 - - [27/Jul/2015:23:50:10 -0700] POST /write?consistency=&db=telegraf&precision=&rp= HTTP/1.1 500 32 - telegraf e40fd44b-34f4-11e5-8da3-000000000000 5.009289605s
[http] 2015/07/27 23:50:22 127.0.0.1 - - [27/Jul/2015:23:50:17 -0700] POST /write?consistency=&db=telegraf&precision=&rp= HTTP/1.1 500 32 - telegraf e83becaf-34f4-11e5-8da4-000000000000 5.006411918s
[http] 2015/07/27 23:50:29 127.0.0.1 - - [27/Jul/2015:23:50:24 -0700] POST /write?consistency=&db=telegraf&precision=&rp= HTTP/1.1 500 32 - telegraf ec6896ba-34f4-11e5-8da5-000000000000 5.006055648s
[http] 2015/07/27 23:50:36 127.0.0.1 - - [27/Jul/2015:23:50:31 -0700] POST /write?consistency=&db=telegraf&precision=&rp= HTTP/1.1 500 32 - telegraf f094d9f2-34f4-11e5-8da6-000000000000 5.002237544s

This morning I did pkill influxd and restarted the process. It refused to stay up.

[http] 2015/07/28 11:27:13 127.0.0.1 - - [28/Jul/2015:11:27:08 -0700] POST /write?consistency=&db=telegraf&precision=&rp= HTTP/1.1 500 32 - telegraf 4188b90d-3556-11e5-90ce-000000000000 5.010216697s
[http] 2015/07/28 11:27:20 127.0.0.1 - - [28/Jul/2015:11:27:15 -0700] POST /write?consistency=&db=telegraf&precision=&rp= HTTP/1.1 500 32 - telegraf 45b4cd29-3556-11e5-90cf-000000000000 5.000855157s
[http] 2015/07/28 11:27:27 127.0.0.1 - - [28/Jul/2015:11:27:22 -0700] POST /write?consistency=&db=telegraf&precision=&rp= HTTP/1.1 500 32 - telegraf 49e01ffd-3556-11e5-90d0-000000000000 5.001232953s

<pkill here>

[run] 2015/07/28 11:27:34 signal received, initializing clean shutdown...
[run] 2015/07/28 11:27:34 waiting for clean shutdown...
[snapshot] 2015/07/28 11:27:34 snapshot listener closed
[tcp] 2015/07/28 11:27:34 cluster service accept error: network connection closed
[http] 2015/07/28 11:27:34 127.0.0.1 - - [28/Jul/2015:11:27:29 -0700] POST /write?consistency=&db=telegraf&precision=&rp= HTTP/1.1 500 32 - telegraf 4e0cfc77-3556-11e5-90d1-000000000000 5.005786746s

<restart here>

[metastore] 2015/07/28 11:27:37 [INFO] raft: Node at 127.0.0.1:8088 [Follower] entering Follower state
[metastore] 2015/07/28 11:27:37 read local node id: 1
[metastore] 2015/07/28 11:27:39 [WARN] raft: Heartbeat timeout reached, starting election
[metastore] 2015/07/28 11:27:39 [INFO] raft: Node at 127.0.0.1:8088 [Candidate] entering Candidate state
[metastore] 2015/07/28 11:27:39 [DEBUG] raft: Votes needed: 1
[metastore] 2015/07/28 11:27:39 [DEBUG] raft: Vote granted. Tally: 1
[metastore] 2015/07/28 11:27:39 [INFO] raft: Election won. Tally: 1
[metastore] 2015/07/28 11:27:39 [INFO] raft: Node at 127.0.0.1:8088 [Leader] entering Leader state
[metastore] 2015/07/28 11:27:39 [INFO] raft: Disabling EnableSingleNode (bootstrap)
[metastore] 2015/07/28 11:27:39 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[metastore] 2015/07/28 11:27:39 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[metastore] 2015/07/28 11:27:39 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[metastore] 2015/07/28 11:27:39 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[metastore] 2015/07/28 11:27:39 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[metastore] 2015/07/28 11:27:39 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[metastore] 2015/07/28 11:27:39 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[metastore] 2015/07/28 11:27:39 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[metastore] 2015/07/28 11:27:39 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[http] 2015/07/28 11:27:41 127.0.0.1 - - [28/Jul/2015:11:27:36 -0700] POST /write?consistency=&db=telegraf&precision=&rp= HTTP/1.1 500 32 - telegraf 523875a1-3556-11e5-90d2-000000000000 5.003825849s
[shard-precreation] 2015/07/28 11:27:48 precreation service terminating
[snapshot] 2015/07/28 11:27:48 snapshot listener closed
[tcp] 2015/07/28 11:27:48 cluster service accept error: network connection closed
run: open server: open service: listen tcp :8083: bind: address already in use
[http] 2015/07/28 11:27:48 127.0.0.1 - - [28/Jul/2015:11:27:43 -0700] POST /write?consistency=&db=telegraf&precision=&rp= HTTP/1.1 500 32 - telegraf 5665076a-3556-11e5-90d3-000000000000 5.000718384s
[http] 2015/07/28 11:27:55 127.0.0.1 - - [28/Jul/2015:11:27:50 -0700] POST /write?consistency=&db=telegraf&precision=&rp= HTTP/1.1 500 32 - telegraf 5a912f9a-3556-11e5-90d4-000000000000 5.005867478s
[http] 2015/07/28 11:28:02 127.0.0.1 - - [28/Jul/2015:11:27:57 -0700] POST /write?consistency=&db=telegraf&precision=&rp= HTTP/1.1 500 32 - telegraf 5ebce539-3556-11e5-90d5-000000000000 5.000814744s
[run] 2015/07/28 11:28:04 time limit reached, initializing hard shutdown

<process died here> 

I restarted it again and everything seems fine:

[metastore] 2015/07/28 11:29:50 [INFO] raft: Node at 127.0.0.1:8088 [Follower] entering Follower state
[metastore] 2015/07/28 11:29:50 read local node id: 1
[shard] 2015/07/28 11:29:50 flush 1114 points in 0.019s
[shard] 2015/07/28 11:29:50 flush 919 points in 0.016s
[shard] 2015/07/28 11:29:50 flush 650 points in 0.011s
[shard] 2015/07/28 11:29:50 flush 1097 points in 0.018s
[shard] 2015/07/28 11:29:50 flush 1033 points in 0.017s
[shard] 2015/07/28 11:29:50 flush 905 points in 0.015s
[shard] 2015/07/28 11:29:50 flush 828 points in 0.014s
[shard] 2015/07/28 11:29:50 flush 954 points in 0.015s
[admin] 2015/07/28 11:29:50 listening on HTTP: [::]:8083
[httpd] 2015/07/28 11:29:50 authentication enabled: false
[httpd] 2015/07/28 11:29:50 listening on HTTP: [::]:8086
2015/07/28 11:29:50 InfluxDB starting, version 0.9.2, commit 6c0a91f775f9fc0e625d17ffa04a3fe86945ba09
2015/07/28 11:29:50 GOMAXPROCS set to 8
[run] 2015/07/28 11:29:50 listening for signals
[metastore] 2015/07/28 11:29:51 [WARN] raft: Heartbeat timeout reached, starting election
[metastore] 2015/07/28 11:29:51 [INFO] raft: Node at 127.0.0.1:8088 [Candidate] entering Candidate state
[metastore] 2015/07/28 11:29:51 [DEBUG] raft: Votes needed: 1
[metastore] 2015/07/28 11:29:51 [DEBUG] raft: Vote granted. Tally: 1
[metastore] 2015/07/28 11:29:51 [INFO] raft: Election won. Tally: 1
[metastore] 2015/07/28 11:29:51 [INFO] raft: Node at 127.0.0.1:8088 [Leader] entering Leader state
[metastore] 2015/07/28 11:29:51 [INFO] raft: Disabling EnableSingleNode (bootstrap)
[metastore] 2015/07/28 11:29:51 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[metastore] 2015/07/28 11:29:51 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[metastore] 2015/07/28 11:29:51 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[metastore] 2015/07/28 11:29:51 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[metastore] 2015/07/28 11:29:51 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[metastore] 2015/07/28 11:29:51 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[metastore] 2015/07/28 11:29:51 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[metastore] 2015/07/28 11:29:51 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[metastore] 2015/07/28 11:29:51 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[metastore] 2015/07/28 11:29:51 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
2015/07/28 11:29:51 Sending anonymous usage statistics to m.influxdb.com
[http] 2015/07/28 11:29:56 127.0.0.1 - - [28/Jul/2015:11:29:56 -0700] POST /write?consistency=&db=telegraf&precision=&rp= HTTP/1.1 204 0 - telegraf a5aaf263-3556-11e5-8001-000000000000 2.348823ms
[http] 2015/07/28 11:30:03 127.0.0.1 - - [28/Jul/2015:11:30:03 -0700] POST /write?consistency=&db=telegraf&precision=&rp= HTTP/1.1 204 0 - telegraf a9d6da27-3556-11e5-8002-000000000000 2.194528ms

Not sure I can repro. Will stay alert for another occurrence.

@beckettsean
Copy link
Contributor Author

closing in favor of #3469

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant