collectd, failed to write batch: timeout #3199

Closed · jasonrm opened this issue Jun 30, 2015 · 31 comments

jasonrm commented Jun 30, 2015

I have about 18 instances of collectd sending data to the collectd listener of InfluxDB. For the first couple of minutes everything seems to run fine, then the following message starts to appear.

[collectd] 2015/06/30 20:46:08 failed to write batch: timeout

As far as I can tell, as soon as that shows up, queries for data just hang.

-- Once the 'failed to write batch' messages start, queries like this just hang
SELECT max(value) FROM "sensors_value" WHERE type = 'temperature' AND time > now() - 1h GROUP BY time(1m), host ORDER BY asc

-- However this works without issue the entire time
SHOW TAG KEYS FROM cpu_value

versions

Linux hostname 4.0.6-1-ARCH #1 SMP PREEMPT Tue Jun 23 14:25:08 CEST 2015 x86_64 GNU/Linux
influxdb - git master - a2bf480
grafana - git master - a38a06a
collectd - 5.5.0

I happen to be running a ZFS root as well, but as far as I can tell that shouldn't be an issue.

log (after restarting influxdb)

using configuration at: /etc/influxdb.conf
2015/06/30 20:39:00 [INFO] raft: Node at 127.0.0.1:8088 [Follower] entering Follower state
2015/06/30 20:39:00 read local node id: 1
[shard] 2015/06/30 20:39:01 flush 17981 points in 0.577s
[shard] 2015/06/30 20:39:01 flush 17528 points in 0.567s
2015/06/30 20:39:01 [WARN] raft: Heartbeat timeout reached, starting election
2015/06/30 20:39:01 [INFO] raft: Node at 127.0.0.1:8088 [Candidate] entering Candidate state
2015/06/30 20:39:01 [DEBUG] raft: Votes needed: 1
2015/06/30 20:39:01 [DEBUG] raft: Vote granted. Tally: 1
2015/06/30 20:39:01 [INFO] raft: Election won. Tally: 1
2015/06/30 20:39:01 [INFO] raft: Node at 127.0.0.1:8088 [Leader] entering Leader state
2015/06/30 20:39:01 [INFO] raft: Disabling EnableSingleNode (bootstrap)
2015/06/30 20:39:01 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
2015/06/30 20:39:01 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
2015/06/30 20:39:01 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
2015/06/30 20:39:01 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
2015/06/30 20:39:01 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[shard] 2015/06/30 20:39:02 flush 16957 points in 0.524s
[shard] 2015/06/30 20:39:02 flush 17782 points in 0.410s
[shard] 2015/06/30 20:39:03 flush 17500 points in 0.506s
[shard] 2015/06/30 20:39:04 flush 17917 points in 0.547s
[shard] 2015/06/30 20:39:04 flush 17831 points in 0.505s
[shard] 2015/06/30 20:39:04 flush 812 points in 0.026s
[shard] 2015/06/30 20:39:05 flush 99 points in 0.019s
[shard] 2015/06/30 20:39:05 flush 650 points in 0.042s
[shard] 2015/06/30 20:39:05 flush 1126 points in 0.028s
[shard] 2015/06/30 20:39:05 flush 574 points in 0.032s
[shard] 2015/06/30 20:39:05 flush 578 points in 0.037s
[shard] 2015/06/30 20:39:05 flush 692 points in 0.049s
[shard] 2015/06/30 20:39:05 flush 1104 points in 0.041s
[httpd] 2015/06/30 20:39:05 listening on HTTP: [::]:8086
[collectd] 2015/06/30 20:39:05 collectd UDP started
[udp] 2015/06/30 20:39:05 Started listening on :2054
2015/06/30 20:39:05 InfluxDB starting, version 0.9, commit unknown
2015/06/30 20:39:05 GOMAXPROCS set to 8
2015/06/30 20:39:06 Sending anonymous usage statistics to m.influxdb.com
[collectd] 2015/06/30 20:46:08 failed to write batch: timeout
[collectd] 2015/06/30 20:46:13 failed to write batch: timeout
[collectd] 2015/06/30 20:46:18 failed to write batch: timeout
[collectd] 2015/06/30 20:46:23 failed to write batch: timeout
[collectd] 2015/06/30 20:46:28 failed to write batch: timeout
[collectd] 2015/06/30 20:46:33 failed to write batch: timeout
[collectd] 2015/06/30 20:46:38 failed to write batch: timeout
[collectd] 2015/06/30 20:46:43 failed to write batch: timeout
    <snip>
[collectd] 2015/06/30 21:08:45 failed to write batch: timeout
[collectd] 2015/06/30 21:08:50 failed to write batch: timeout
[collectd] 2015/06/30 21:08:55 failed to write batch: timeout
[retention] 2015/06/30 21:08:57 retention policy enforcement check commencing
[retention] 2015/06/30 21:08:57 retention policy shard deletion check commencing
[shard] 2015/06/30 21:08:58 flush 725 points in 0.028s
[shard] 2015/06/30 21:08:58 flush 117 points in 0.015s
[shard] 2015/06/30 21:08:58 flush 452 points in 0.018s
[shard] 2015/06/30 21:08:58 flush 1060 points in 0.019s
[shard] 2015/06/30 21:08:58 flush 489 points in 0.025s
[shard] 2015/06/30 21:08:58 flush 538 points in 0.017s
[shard] 2015/06/30 21:08:58 flush 479 points in 0.017s
[shard] 2015/06/30 21:08:59 flush 1008 points in 0.023s
[collectd] 2015/06/30 21:09:00 failed to write batch: timeout
[collectd] 2015/06/30 21:09:05 failed to write batch: timeout
[collectd] 2015/06/30 21:09:10 failed to write batch: timeout
[collectd] 2015/06/30 21:09:15 failed to write batch: timeout
    <snip>

/etc/influxdb.conf

reporting-disabled = false

[meta]
  dir = "/var/lib/influxdb/meta"
  hostname = "localhost"
  bind-address = ":8088"
  retention-autocreate = true
  election-timeout = "1s"
  heartbeat-timeout = "1s"
  leader-lease-timeout = "500ms"
  commit-timeout = "50ms"

[data]
  dir = "/var/lib/influxdb/data"

[cluster]
  shard-writer-timeout = "5s"

[retention]
  enabled = true
  check-interval = "10m"

[admin]
  enabled = true
  bind-address = ":8083"

[http]
  enabled = true
  bind-address = ":8086"
  auth-enabled = false
  log-enabled = false
  write-tracing = false
  pprof-enabled = false

[collectd]
  enabled = true
  bind-address = ":2100"
  database = "collectd"
  typesdb = "/etc/collectd/types.db"
  batch-size = 512
  batch-timeout = "30s"

[udp]
  enabled = true
  bind-address = ":2054"
  database = "udp"
  batch-size = 1024
  batch-timeout = "10s"

[monitoring]
  enabled = true
  write-interval = "24h"

[continuous_queries]
  enabled = true
  recompute-previous-n = 2
  recompute-no-older-than = "10m"
  compute-runs-per-interval = 10
  compute-no-more-than = "2m"

[hinted-handoff]
  enabled = true
  dir = "/var/lib/influxdb/hh"
  max-size = 1073741824
  max-age = "168h"
  retry-rate-limit = 0
  retry-interval = "1s"

/etc/collectd.conf

TypesDB "/etc/collectd/types.db"
Interval 30

LoadPlugin syslog

<Plugin syslog>
    LogLevel warning
</Plugin>

LoadPlugin battery
LoadPlugin cpu
LoadPlugin df
LoadPlugin disk
LoadPlugin entropy
LoadPlugin interface
LoadPlugin irq
LoadPlugin load
LoadPlugin memory
LoadPlugin network
LoadPlugin processes
LoadPlugin sensors
LoadPlugin swap
LoadPlugin users

<Plugin "cpu">
  ReportByState true
  ReportByCpu true
  ValuesPercentage true
</Plugin>

<Plugin "network">
    Server "127.0.0.1" "2100"
</Plugin>

otoolep commented Jun 30, 2015

Thanks for the report @jasonrm

@jwilder -- would increasing the shard timeout help? Not as a solution, but to understand this issue better?


jwilder commented Jun 30, 2015

@otoolep Possibly. It's currently 5s but should probably be configurable. https://github.com/influxdb/influxdb/blob/master/cluster/points_writer.go#L17

If it's a single node, that means that a local write is taking longer than 5s to write a batch. If it's in a cluster, all the nodes took longer than 5s to write. The shard flushes look like they are happening pretty quickly so I wonder if something is getting deadlocked.
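
As a sketch (assuming the shard-writer-timeout key under [cluster], already present in the config above, is the setting that maps to this code path), raising that timeout in /etc/influxdb.conf while diagnosing would look like:

[cluster]
  # Sketch only: raise the shard write timeout from the 5s default while diagnosing.
  # Assumes this key maps to the timeout discussed above.
  shard-writer-timeout = "30s"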


pdf commented Jul 2, 2015

I've had InfluxDB stop accepting writes a couple of times in as many days now, testing on a single node.

Logs here; I killed it to produce a stack trace.


otoolep commented Jul 2, 2015

@pdf -- are you using the collectd input? If not, how are you writing data into the system?


pdf commented Jul 2, 2015

TSDB, as indicated by the logs. Is this likely input-specific? I figured it was likely something on the back of the retention policies stopping the world.


otoolep commented Jul 2, 2015

I haven't checked the logs yet; I just wanted to quickly understand whether you were also working with collectd.

Retention policy enforcement is designed not to be an issue, and even then the only shards affected are those that are older than the retention period of the policy. Of course, there could always be a bug. Are you writing data into the past?


pdf commented Jul 2, 2015

Can you describe what writing data into the past entails? I'd surmise it's possible for multiple clients to send data to the same measurement with a timestamp that precedes one on a record that's already been sent - this seems like something that would be difficult to prevent with multiple writers. In this case, I'm sending metrics via scollector from the bosun project.


otoolep commented Jul 2, 2015

By "writing into the past" I mean writing data that has timestamps
(significantly) earlier than now. For example, writing data with timestamps
representing yesterday. If the target retention policy is, say, 6 hours it
would mean you would be creating data in the system that would be deleted
almost immediately.


pdf commented Jul 2, 2015

Gotcha, nothing like that sort of period - we'd be looking at seconds out of order rather than hours.


otoolep commented Jul 2, 2015

#3214 also reported that the system hung when under write load. There are also messages in the logs regarding retention enforcement.

Retention enforcement runs, by default, every 10 minutes, so it will be a source of much logging. It may be coincidental that both logs contain messages about retention when the problem happens. It may not be. I'm running some tests this morning to see if I can bring out any issues related to retention enforcement.


pdf commented Jul 2, 2015

My naive guess would be that #3214 is different (2GB RAM total, OOM-killed, maybe?), but I did note that the retention logs appeared immediately before the consistent timeouts on both occasions I encountered them. This certainly could be coincidence though. Is there something more useful I can do than cause a stack-trace if I encounter it again?


otoolep commented Jul 2, 2015

@pdf -- what are your retention policy settings? Did you create a retention policy yourself?

If you have the disk space, you could disable retention policy enforcement, and see if your system runs for longer. You can re-enable retention enforcement at any time (it will require a restart) which will then delete any old data. The configuration to change is this line here:

https://github.com/influxdb/influxdb/blob/master/etc/config.sample.toml#L54

Set it to false, and you should see no more log messages about it. This isn't a fix, but it might allow us to rule out retention enforcement as a factor.
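
As a sketch, the corresponding change in the [retention] section of /etc/influxdb.conf (the same section shown in the config above) would be:

[retention]
  # Temporarily disable retention enforcement while diagnosing; re-enable later.
  enabled = false
  check-interval = "10m"

Restart influxd after editing for the change to take effect.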


pdf commented Jul 2, 2015

I altered the default retention policy to 30d. If I can reproduce the error again, I'll disable retention enforcement then, since I'll be moderately certain that it's likely to occur regularly. I might have just hit two flukes in quick succession, so disabling it now may not really tell us anything - smells like a race or something similarly nasty to track down.


otoolep commented Jul 2, 2015

Thanks @pdf -- that may be sufficient. Let me know if you continue to see the issue.


jasonrm commented Jul 2, 2015

Still seeing "failed to write point batch to database" with both graphite and collectd. I've set retention enabled = false and am running 2b2cc3c. Usually it only takes about one to two minutes after I restart the service before I start to see the errors. iostat shows disk utilization under 10%, and all the influxd processes are usually in the sleeping state.

I am able to use a PHP library to send data to the udp listener without issue.

I've poked around with perf a bit, but haven't seen anything that really jumps out at me.


otoolep commented Jul 2, 2015

@jasonrm -- we just released 0.9.1, which has new timeout configuration options. You could try increasing this setting: https://github.com/influxdb/influxdb/blob/master/etc/config.sample.toml#L41, though 2 seconds may be sufficient (it is 100ms in the code you are running now). We don't consider this a "fix" per se, but it is part of our work towards improving write throughput.


otoolep commented Jul 2, 2015

@jasonrm -- do let us know if this improves anything.


jasonrm commented Jul 2, 2015

Sorry, I forgot to mention that I had updated my config to include the new timeout configuration options. At both 2 and 15 seconds I would still get the error within one to two minutes after a restart.

I hadn't noticed it before, but queries to the database I use for the udp listener don't hang.


otoolep commented Jul 2, 2015

OK, well at least we know the issue you are seeing @jasonrm is not related to retention enforcement -- thanks.

otoolep self-assigned this Jul 2, 2015

otoolep commented Jul 2, 2015

@jasonrm -- there are two other timing controls you should try changing. I'm not saying these are the "fix", but changing them may allow us to understand better what is happening within your system.

https://github.com/influxdb/influxdb/blob/master/etc/config.sample.toml#L51
https://github.com/influxdb/influxdb/blob/master/etc/config.sample.toml#L52

The second is only in master right now; either build from source or pull down a nightly build in about a day or so.

https://influxdb.com/download/index.html

Try bumping each to, say, 30 seconds. That should be plenty long enough for the data to get through your system, assuming it's not a total lock-up due to a bug.

What are your data rates like? 18 machines sending out data every 30 seconds is probably not huge.
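
As for the two timing controls above, here is a minimal sketch of the change, assuming they correspond to the shard-writer-timeout and write-timeout keys in the [cluster] section (assumed key names -- please verify against config.sample.toml in your build):

[cluster]
  # Assumed key names; check config.sample.toml for your build before applying.
  shard-writer-timeout = "30s"   # time a shard has to respond to a write request
  write-timeout = "30s"          # time a write request has to complete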


jasonrm commented Jul 2, 2015

I'm almost sure at this point that it is only triggered by a query, something similar to https://groups.google.com/forum/#!topic/influxdb/H11ivieFPG4

I can run this query without issue,

-- works
SELECT mean(value) FROM "dbi" WHERE type = 'order_payment_method' AND time > now() - 6h GROUP BY time(1h), type_instance ORDER BY asc

However, change mean to difference (or anything that isn't a valid function) and within seconds I start to see the "failed to write point batch" errors.

-- breaks
SELECT difference(value) FROM "dbi" WHERE type = 'order_payment_method' AND time > now() - 6h GROUP BY time(1h), type_instance ORDER BY asc


otoolep commented Jul 2, 2015

Does the problematic query return?


jasonrm commented Jul 2, 2015

In the admin UI I get back Server returned error: function not found: "difference"


toddboom commented Jul 2, 2015

I'm pretty sure difference isn't wired up yet - see #1825.


jasonrm commented Jul 3, 2015

I think I was running into #2685, where trying to call any function that doesn't exist results in the database being locked (or something that has similar appearance).


otoolep commented Jul 3, 2015

OK, if this is a query-related issue, I'll take a look. Much of the query code is being re-written at the moment.


pdf commented Jul 7, 2015

Having spent some more time on this, it does look like invalid queries are causing the database to be locked in my case too.

olivierHa commented:

Hello,

I can trigger this bug using a "mode" query, for example (from Grafana):

[http] 2015/07/10 14:27:36 10.22.111.40 - - [10/Jul/2015:14:27:36 +0200] GET /query?db=graphite&epoch=ms&p=zzzzz&q=SELECT+mode%28value%29+FROM+%22count%22+WHERExxxx 5.376143ms

[graphite] 2015/07/10 14:27:46 failed to write point batch to database "graphite": timeout
[graphite] 2015/07/10 14:27:51 failed to write point batch to database "graphite": timeout

I am using a nightly build: "InfluxDB starting, version 0.9.1-rc1-111-g8b67872, commit 8b67872"

Regards

Olivier


otoolep commented Jul 10, 2015

@olivierHa -- any chance you can use the "developer" functionality of your browser to show the actual request that Grafana is making to InfluxDB?

olivierHa commented:

The query is:

SELECT mode(value) FROM "count" WHERE provider = 'XXX' AND time > now() - 1h GROUP BY time(10s), api ORDER BY asc

I can also reproduce the issue using the influxdb CLI with the same query:

The return is: ERR: function not found: "mode"

Using Grafana:

{"results":[{"error":"function not found: "mode""}]}

In both cases, I get the "failed to write point batch" timeout.

Regards

Olivier

beckettsean commented:

Major changes to the write and query paths landed in 0.9.3, as well as better plugin support. Please let us know if these issues still happen with 0.9.3 installed.
