collectd, failed to write batch: timeout #3199

Closed · jasonrm opened this issue Jun 30, 2015 · 31 comments

jasonrm commented Jun 30, 2015

I have about 18 instances of collectd sending data to the collectd listener of InfluxDB. For the first couple of minutes everything seems to run fine, then the following message starts to appear.

[collectd] 2015/06/30 20:46:08 failed to write batch: timeout

As far as I can tell, as soon as that shows up, queries for data just hang.

-- Once the 'failed to write batch' messages start, queries like this just hang
SELECT max(value) FROM "sensors_value" WHERE type = 'temperature' AND time > now() - 1h GROUP BY time(1m), host ORDER BY asc

-- However this works without issue the entire time
SHOW TAG KEYS FROM cpu_value

versions

Linux hostname 4.0.6-1-ARCH #1 SMP PREEMPT Tue Jun 23 14:25:08 CEST 2015 x86_64 GNU/Linux
influxdb - git master - a2bf480
grafana - git master - a38a06a
collectd - 5.5.0

I happen to be running a ZFS root as well, but as far as I can tell that shouldn't be an issue.

log (after restarting influxdb)

using configuration at: /etc/influxdb.conf
2015/06/30 20:39:00 [INFO] raft: Node at 127.0.0.1:8088 [Follower] entering Follower state
2015/06/30 20:39:00 read local node id: 1
[shard] 2015/06/30 20:39:01 flush 17981 points in 0.577s
[shard] 2015/06/30 20:39:01 flush 17528 points in 0.567s
2015/06/30 20:39:01 [WARN] raft: Heartbeat timeout reached, starting election
2015/06/30 20:39:01 [INFO] raft: Node at 127.0.0.1:8088 [Candidate] entering Candidate state
2015/06/30 20:39:01 [DEBUG] raft: Votes needed: 1
2015/06/30 20:39:01 [DEBUG] raft: Vote granted. Tally: 1
2015/06/30 20:39:01 [INFO] raft: Election won. Tally: 1
2015/06/30 20:39:01 [INFO] raft: Node at 127.0.0.1:8088 [Leader] entering Leader state
2015/06/30 20:39:01 [INFO] raft: Disabling EnableSingleNode (bootstrap)
2015/06/30 20:39:01 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
2015/06/30 20:39:01 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
2015/06/30 20:39:01 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
2015/06/30 20:39:01 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
2015/06/30 20:39:01 [DEBUG] raft: Node 127.0.0.1:8088 updated peer set (2): [127.0.0.1:8088]
[shard] 2015/06/30 20:39:02 flush 16957 points in 0.524s
[shard] 2015/06/30 20:39:02 flush 17782 points in 0.410s
[shard] 2015/06/30 20:39:03 flush 17500 points in 0.506s
[shard] 2015/06/30 20:39:04 flush 17917 points in 0.547s
[shard] 2015/06/30 20:39:04 flush 17831 points in 0.505s
[shard] 2015/06/30 20:39:04 flush 812 points in 0.026s
[shard] 2015/06/30 20:39:05 flush 99 points in 0.019s
[shard] 2015/06/30 20:39:05 flush 650 points in 0.042s
[shard] 2015/06/30 20:39:05 flush 1126 points in 0.028s
[shard] 2015/06/30 20:39:05 flush 574 points in 0.032s
[shard] 2015/06/30 20:39:05 flush 578 points in 0.037s
[shard] 2015/06/30 20:39:05 flush 692 points in 0.049s
[shard] 2015/06/30 20:39:05 flush 1104 points in 0.041s
[httpd] 2015/06/30 20:39:05 listening on HTTP: [::]:8086
[collectd] 2015/06/30 20:39:05 collectd UDP started
[udp] 2015/06/30 20:39:05 Started listening on :2054
2015/06/30 20:39:05 InfluxDB starting, version 0.9, commit unknown
2015/06/30 20:39:05 GOMAXPROCS set to 8
2015/06/30 20:39:06 Sending anonymous usage statistics to m.influxdb.com
[collectd] 2015/06/30 20:46:08 failed to write batch: timeout
[collectd] 2015/06/30 20:46:13 failed to write batch: timeout
[collectd] 2015/06/30 20:46:18 failed to write batch: timeout
[collectd] 2015/06/30 20:46:23 failed to write batch: timeout
[collectd] 2015/06/30 20:46:28 failed to write batch: timeout
[collectd] 2015/06/30 20:46:33 failed to write batch: timeout
[collectd] 2015/06/30 20:46:38 failed to write batch: timeout
[collectd] 2015/06/30 20:46:43 failed to write batch: timeout
    <snip>
[collectd] 2015/06/30 21:08:45 failed to write batch: timeout
[collectd] 2015/06/30 21:08:50 failed to write batch: timeout
[collectd] 2015/06/30 21:08:55 failed to write batch: timeout
[retention] 2015/06/30 21:08:57 retention policy enforcement check commencing
[retention] 2015/06/30 21:08:57 retention policy shard deletion check commencing
[shard] 2015/06/30 21:08:58 flush 725 points in 0.028s
[shard] 2015/06/30 21:08:58 flush 117 points in 0.015s
[shard] 2015/06/30 21:08:58 flush 452 points in 0.018s
[shard] 2015/06/30 21:08:58 flush 1060 points in 0.019s
[shard] 2015/06/30 21:08:58 flush 489 points in 0.025s
[shard] 2015/06/30 21:08:58 flush 538 points in 0.017s
[shard] 2015/06/30 21:08:58 flush 479 points in 0.017s
[shard] 2015/06/30 21:08:59 flush 1008 points in 0.023s
[collectd] 2015/06/30 21:09:00 failed to write batch: timeout
[collectd] 2015/06/30 21:09:05 failed to write batch: timeout
[collectd] 2015/06/30 21:09:10 failed to write batch: timeout
[collectd] 2015/06/30 21:09:15 failed to write batch: timeout
    <snip>

/etc/influxdb.conf

reporting-disabled = false

[meta]
  dir = "/var/lib/influxdb/meta"
  hostname = "localhost"
  bind-address = ":8088"
  retention-autocreate = true
  election-timeout = "1s"
  heartbeat-timeout = "1s"
  leader-lease-timeout = "500ms"
  commit-timeout = "50ms"

[data]
  dir = "/var/lib/influxdb/data"

[cluster]
  shard-writer-timeout = "5s"

[retention]
  enabled = true
  check-interval = "10m"

[admin]
  enabled = true
  bind-address = ":8083"

[http]
  enabled = true
  bind-address = ":8086"
  auth-enabled = false
  log-enabled = false
  write-tracing = false
  pprof-enabled = false

[collectd]
  enabled = true
  bind-address = ":2100"
  database = "collectd"
  typesdb = "/etc/collectd/types.db"
  batch-size = 512
  batch-timeout = "30s"

[udp]
  enabled = true
  bind-address = ":2054"
  database = "udp"
  batch-size = 1024
  batch-timeout = "10s"

[monitoring]
  enabled = true
  write-interval = "24h"

[continuous_queries]
  enabled = true
  recompute-previous-n = 2
  recompute-no-older-than = "10m"
  compute-runs-per-interval = 10
  compute-no-more-than = "2m"

[hinted-handoff]
  enabled = true
  dir = "/var/lib/influxdb/hh"
  max-size = 1073741824
  max-age = "168h"
  retry-rate-limit = 0
  retry-interval = "1s"

/etc/collectd.conf

TypesDB "/etc/collectd/types.db"
Interval 30

LoadPlugin syslog

<Plugin syslog>
    LogLevel warning
</Plugin>

LoadPlugin battery
LoadPlugin cpu
LoadPlugin df
LoadPlugin disk
LoadPlugin entropy
LoadPlugin interface
LoadPlugin irq
LoadPlugin load
LoadPlugin memory
LoadPlugin network
LoadPlugin processes
LoadPlugin sensors
LoadPlugin swap
LoadPlugin users

<Plugin "cpu">
  ReportByState true
  ReportByCpu true
  ValuesPercentage true
</Plugin>

<Plugin "network">
    Server "127.0.0.1" "2100"
</Plugin>

otoolep commented Jun 30, 2015

Thanks for the report @jasonrm

@jwilder -- would increasing the shard timeout help? Not as a solution, but to understand this issue better?


jwilder commented Jun 30, 2015

@otoolep Possibly. It's currently 5s but should probably be configurable. https://github.com/influxdb/influxdb/blob/master/cluster/points_writer.go#L17

If it's a single node, that means that a local write is taking longer than 5s to write a batch. If it's in a cluster, all the nodes took longer than 5s to write. The shard flushes look like they are happening pretty quickly so I wonder if something is getting deadlocked.
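
As a sketch (assuming the shard-writer-timeout key under [cluster], already present in the config above, is the setting that maps to this code path), raising that timeout in /etc/influxdb.conf while diagnosing would look like:

[cluster]
  # Sketch only: raise the shard write timeout from the 5s default while diagnosing.
  # Assumes this key maps to the timeout discussed above.
  shard-writer-timeout = "30s"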


pdf commented Jul 2, 2015

I've had InfluxDB stop accepting writes a couple of times in as many days now, testing on a single node.

Logs here; I killed it to produce a stack trace.


otoolep commented Jul 2, 2015

@pdf -- are you using the collectd input? If not, how are you writing data into the system?


pdf commented Jul 2, 2015

TSDB, as indicated by the logs. Is this likely input-specific? I figured it was likely something on the back of the retention policies stopping the world.


otoolep commented Jul 2, 2015

I haven't checked the logs yet; I just wanted to quickly understand whether you were also working with collectd.

Retention policy enforcement is designed not to be an issue, and even then the only shards affected are those that are older than the retention period of the policy. Of course, there could always be a bug. Are you writing data into the past?


pdf commented Jul 2, 2015

Can you describe what writing data into the past entails? I'd surmise it's possible for multiple clients to send data to the same measurement with a timestamp that precedes one on a record that's already been sent - this seems like something that would be difficult to prevent with multiple writers. In this case, I'm sending metrics via scollector from the bosun project.


otoolep commented Jul 2, 2015

By "writing into the past" I mean writing data that has timestamps
(significantly) earlier than now. For example, writing data with timestamps
representing yesterday. If the target retention policy is, say, 6 hours it
would mean you would be creating data in the system that would be deleted
almost immediately.


pdf commented Jul 2, 2015

Gotcha, nothing like that sort of period - we'd be looking at seconds out of order rather than hours.


otoolep commented Jul 2, 2015

#3214 also reported that the system hung when under write load. There are also messages in the logs regarding retention enforcement.

Retention enforcement runs, by default, every 10 minutes, so it will be a source of much logging. It may be coincidental that both logs contain messages about retention when the problem happens. It may not be. I'm running some tests this morning to see if I can bring out any issues related to retention enforcement.


pdf commented Jul 2, 2015

My naive guess would be that #3214 is different (2GB RAM total, OOM-killed, maybe?), but I did note that the retention logs appeared immediately before the consistent timeouts on both occasions I encountered them. This certainly could be coincidence though. Is there something more useful I can do than cause a stack-trace if I encounter it again?


otoolep commented Jul 2, 2015

@pdf -- what are your retention policy settings? Did you create a retention policy yourself?

If you have the disk space, you could disable retention policy enforcement, and see if your system runs for longer. You can re-enable retention enforcement at any time (it will require a restart) which will then delete any old data. The configuration to change is this line here:

https://github.com/influxdb/influxdb/blob/master/etc/config.sample.toml#L54

Set it to false, and you should see no more log messages about it. This isn't a fix, but it might allow us to rule out retention enforcement as a factor.
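
As a sketch, the corresponding change in the [retention] section of /etc/influxdb.conf (the same section shown in the config above) would be:

[retention]
  # Temporarily disable retention enforcement while diagnosing; re-enable later.
  enabled = false
  check-interval = "10m"

Restart influxd after editing for the change to take effect.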


pdf commented Jul 2, 2015

I altered the default retention policy to 30d. If I can reproduce the error again, I'll disable retention enforcement then, since I'll be moderately certain that it's likely to occur regularly. I might have just hit two flukes in quick succession, so disabling it now may not really tell us anything - smells like a race or something similarly nasty to track down.


otoolep commented Jul 2, 2015

Thanks @pdf -- that may be sufficient. Let me know if you continue to see the issue.


jasonrm commented Jul 2, 2015

Still seeing "failed to write point batch to database" with both graphite and collectd. I've set retention enabled = false and am running 2b2cc3c. Usually it only takes about one to two minutes after I restart the service before I start to see the errors. iostat shows disk utilization under 10%, and all the influxd processes are usually in the sleeping state.

I am able to use a PHP library to send data to the udp listener without issue.

I've poked around with perf a bit, but haven't seen anything that really jumps out at me.


otoolep commented Jul 2, 2015

@jasonrm -- we just released 0.9.1, which has new timeout configuration options. You could try increasing this setting: https://github.com/influxdb/influxdb/blob/master/etc/config.sample.toml#L41, though 2 seconds may be sufficient (it is 100ms in the code you are running now). We don't consider this a "fix" per se, but it is part of our work towards improving write throughput.


otoolep commented Jul 2, 2015

@jasonrm -- do let us know if this improves anything.


jasonrm commented Jul 2, 2015

Sorry, I forgot to mention that I had updated my config to include the new timeout configuration options. At both 2 and 15 seconds I would still get the error within one to two minutes after a restart.

I hadn't noticed it before, but queries to the database I use for the udp listener don't hang.


otoolep commented Jul 2, 2015

OK, well at least we know the issue you are seeing @jasonrm is not related to retention enforcement -- thanks.

otoolep self-assigned this Jul 2, 2015

otoolep commented Jul 2, 2015

@jasonrm -- there are two other timing controls you should try changing. I'm not saying these are the "fix", but changing them may allow us to understand better what is happening within your system.

https://github.com/influxdb/influxdb/blob/master/etc/config.sample.toml#L51
https://github.com/influxdb/influxdb/blob/master/etc/config.sample.toml#L52

The second is only in master right now; either build from source or pull down a nightly build in about a day or so.

https://influxdb.com/download/index.html

Try bumping each to, say, 30 seconds. That should be plenty long enough for the data to get through your system, assuming it's not a total lock-up due to a bug.

What are your data rates like? 18 machines sending out data every 30 seconds is probably not huge.
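
As for the two timing controls above, here is a minimal sketch of the change, assuming they correspond to the shard-writer-timeout and write-timeout keys in the [cluster] section (assumed key names -- please verify against config.sample.toml in your build):

[cluster]
  # Assumed key names; check config.sample.toml for your build before applying.
  shard-writer-timeout = "30s"   # time a shard has to respond to a write request
  write-timeout = "30s"          # time a write request has to complete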


jasonrm commented Jul 2, 2015

I'm almost sure at this point that it is only triggered by a query, something similar to https://groups.google.com/forum/#!topic/influxdb/H11ivieFPG4

I can run this query without issue,

-- works
SELECT mean(value) FROM "dbi" WHERE type = 'order_payment_method' AND time > now() - 6h GROUP BY time(1h), type_instance ORDER BY asc

However, change mean to difference (or anything that isn't a valid function) and within seconds I start to see the "failed to write point batch" errors.

-- breaks
SELECT difference(value) FROM "dbi" WHERE type = 'order_payment_method' AND time > now() - 6h GROUP BY time(1h), type_instance ORDER BY asc


otoolep commented Jul 2, 2015

Does the problematic query return?


jasonrm commented Jul 2, 2015

In the admin UI I get back Server returned error: function not found: "difference"


toddboom commented Jul 2, 2015

I'm pretty sure difference isn't wired up yet - see #1825.


jasonrm commented Jul 3, 2015

I think I was running into #2685, where trying to call any function that doesn't exist results in the database being locked (or something that has similar appearance).


otoolep commented Jul 3, 2015

OK, if this is a query-related issue, I'll take a look. Much of the query code is being re-written at the moment.


pdf commented Jul 7, 2015

Having spent some more time on this, it does look like invalid queries are causing the database to be locked in my case too.

olivierHa commented:

Hello,

I can trigger this bug using a "mode" query, for example (from Grafana):

[http] 2015/07/10 14:27:36 10.22.111.40 - - [10/Jul/2015:14:27:36 +0200] GET /query?db=graphite&epoch=ms&p=zzzzz&q=SELECT+mode%28value%29+FROM+%22count%22+WHERExxxx 5.376143ms

[graphite] 2015/07/10 14:27:46 failed to write point batch to database "graphite": timeout
[graphite] 2015/07/10 14:27:51 failed to write point batch to database "graphite": timeout

I am using a nightly build: "InfluxDB starting, version 0.9.1-rc1-111-g8b67872, commit 8b67872"

Regards

Olivier


otoolep commented Jul 10, 2015

@olivierHa -- any chance you can use the "developer" functionality of your browser to show the actual request that Grafana is making to InfluxDB?

olivierHa commented:

The query is:

SELECT mode(value) FROM "count" WHERE provider = 'XXX' AND time > now() - 1h GROUP BY time(10s), api ORDER BY asc

I can also reproduce the issue using the influxdb CLI with the same query:

The return is: ERR: function not found: "mode"

Using Grafana:

{"results":[{"error":"function not found: "mode""}]}

In both cases, I get the "failed to write point batch" timeout.

Regards

Olivier

beckettsean commented:

Major changes to the write and query paths landed in 0.9.3, as well as better plugin support. Please let us know if these issues still happen with 0.9.3 installed.
