collectd, failed to write batch: timeout #3199
Comments
@otoolep Possibly. It's currently 5s. If it's a single node, that means that a local write is taking longer than 5s to write a batch. If it's in a cluster, all the nodes took longer than that to acknowledge the write.
I've had InfluxDB stop accepting writes a couple of times in as many days now, testing on a single node. Logs here, I killed it to produce a stack trace.
@pdf -- are you using the collectd input? If not, how are you writing data into the system?
TSDB, as indicated by the logs. Is this likely input-specific? I figured it was likely something on the back of the retention policies stopping the world.
I didn't check the logs yet, and quickly wanted to understand if you were also working with collectd. Retention policy enforcement is designed not to be an issue, and even then the only shards affected are those that are older than the retention period of the policy. Of course, there could always be a bug. Are you writing data into the past?
Can you describe what writing data into the past entails? I'd surmise it's possible for multiple clients to send data to the same measurement with a timestamp that precedes one on a record that's already been sent - this seems like something that would be difficult to prevent with multiple writers. In this case, I'm sending metrics via scollector from the bosun project.
By "writing into the past" I mean writing data that has timestamps Philip On Wed, Jul 1, 2015 at 9:24 PM, Peter Fern notifications@github.com wrote:
|
Gotcha, nothing like that sort of period - we'd be looking at seconds out of order rather than hours.
#3214 also reported that the system hung when under write load. There are also messages in the logs regarding retention enforcement. Retention enforcement runs, by default, every 10 minutes, so it will be the source of much logging. It may be coincidental that both logs contain messages about retention when the problem happens. It may not be. I'm running some tests this morning to see if I can bring out any issues related to retention enforcement.
My naive guess would be that #3214 is different (2GB RAM total, OOM-killed, maybe?), but I did note that the retention logs appeared immediately before the consistent timeouts on both occasions I encountered them. This certainly could be coincidence though. Is there something more useful I can do than cause a stack-trace if I encounter it again?
@pdf -- what are your retention policy settings? Did you create a retention policy yourself? If you have the disk space, you could disable retention policy enforcement, and see if your system runs for longer. You can re-enable retention enforcement at any time (it will require a restart), which will then delete any old data. The configuration to change is this line here: https://github.com/influxdb/influxdb/blob/master/etc/config.sample.toml#L54 Set it to false, and you should see no more log messages about it. This isn't a fix, but it might allow us to rule out retention enforcement as a factor.
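(A minimal sketch of that change, assuming the 0.9.x [retention] block; the exact section and key names should be checked against the config.sample.toml line linked above. The 10-minute check interval matches the default mentioned earlier in this thread.)

```toml
# /etc/influxdb.conf -- retention enforcement (assumed 0.9.x section/key names)
[retention]
  enabled = false         # disable retention policy enforcement (requires a restart)
  check-interval = "10m"  # how often enforcement runs when enabled; default is 10 minutes
```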
I altered the default retention policy to 30d. If I can reproduce the error again, I'll disable retention enforcement then, since I'll be moderately certain that it's likely to occur regularly. I might have just hit two flukes in quick succession, so disabling it now may not really tell us anything - smells like a race or something similarly nasty to track down.
Thanks @pdf -- that may be sufficient. Let me know if you continue to see the issue.
Still seeing the "failed to write batch: timeout" errors. I am able to use a PHP library to send data to the UDP listener without issue. I've poked around with
@jasonrm -- we just released 0.9.1, which has new timeout configuration options. You could try increasing this setting: https://github.com/influxdb/influxdb/blob/master/etc/config.sample.toml#L41 though 2 seconds may be sufficient (it is 100ms in the code you are running now). We don't consider this a "fix" per se, but it is part of our work towards improving write throughput.
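(A hedged sketch of what bumping such a setting looks like in the TOML config; the section and key names below are assumptions for illustration, since the comment identifies the option only by a line number in config.sample.toml.)

```toml
# /etc/influxdb.conf -- illustrative only; verify the real option at the linked
# config.sample.toml line. Durations use Go-style strings such as "2s" or "15s".
[cluster]
  write-timeout = "2s"  # assumed key name; the comment above says ~100ms is the current in-code value
```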
@jasonrm -- do let us know if this improves anything.
Sorry, forgot to include that I had updated my config to include the new timeout configuration options. At both 2 and 15 seconds I still would get the error within one to two minutes after a restart. I hadn't noticed it before, but queries to the database I use for the udp listener don't hang.
OK, well at least we know the issue you are seeing @jasonrm is not related to retention enforcement -- thanks.
@jasonrm -- there are two other timing controls you should try to change. I'm not saying these are the "fix", but it may allow us to understand better what is happening within your system. The first is here: https://github.com/influxdb/influxdb/blob/master/etc/config.sample.toml#L51 The second is only in master right now; either build from source or pull down a nightly build in about a day or so: https://influxdb.com/download/index.html Try bumping each to, say, 30 seconds. That should be plenty long enough for the data to get through your system, assuming it's not a total lock-up due to a bug. What are your data rates like? 18 machines sending out data every 30 seconds is probably not huge.
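(For rough scale, a back-of-envelope estimate of that write load; the per-host metric count is an assumption, not a number from this thread.)

```
18 hosts x ~100 points per 30 s collectd interval
= 1800 points / 30 s
= ~60 points/s sustained
```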
I'm almost sure at this point that it is only triggered by a query, something similar to https://groups.google.com/forum/#!topic/influxdb/H11ivieFPG4. I can run this query without issue:

```sql
-- works
SELECT mean(value) FROM "dbi" WHERE type = 'order_payment_method' AND time > now() - 6h GROUP BY time(1h), type_instance ORDER BY asc
```

However, changing mean to difference breaks it:

```sql
-- breaks
SELECT difference(value) FROM "dbi" WHERE type = 'order_payment_method' AND time > now() - 6h GROUP BY time(1h), type_instance ORDER BY asc
```
Does the problematic query return?
In the admin UI I get back
I'm pretty sure `difference` isn't a valid function.
I think I was running into #2685, where trying to call any function that doesn't exist results in the database being locked (or something with a similar appearance).
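(A minimal sketch of how to check for that failure mode from the influx CLI; the database and measurement names are placeholders, and the function name is deliberately bogus. The error format matches the one reported later in this thread.)

```
$ influx
> use mydb
> SELECT nosuchfunc(value) FROM "cpu" WHERE time > now() - 1h
ERR: function not found: "nosuchfunc"
```

If #2685 is the culprit, writes to the node reportedly start timing out after a failed call like this until the process is restarted.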
OK, if this is a query-related issue, I'll take a look. Much of the query code is being re-written at the moment.
Having spent some more time on this, it does look like invalid queries are causing the database to be locked in my case too.
Hello, I can trigger this bug using a "mode" query, for example (from Grafana):

```
[http] 2015/07/10 14:27:36 10.22.111.40 - - [10/Jul/2015:14:27:36 +0200] GET /query?db=graphite&epoch=ms&p=zzzzz&q=SELECT+mode%28value%29+FROM+%22count%22+WHERExxxx 5.376143ms
[graphite] 2015/07/10 14:27:46 failed to write point batch to database "graphite": timeout
```

I am using a nightly build: "InfluxDB starting, version 0.9.1-rc1-111-g8b67872, commit 8b67872"

Regards,
Olivier
@olivierHa -- any chance you can use the "developer" functionality of your browser to show the actual request that Grafana is making to InfluxDB?
The query is:

```sql
SELECT mode(value) FROM "count" WHERE provider = 'XXX' AND time > now() - 1h GROUP BY time(10s), api ORDER BY asc
```

I can also reproduce the issue using the influxdb CLI with the same query. The return is:

```
ERR: function not found: "mode"
```

Using Grafana:

```
{"results":[{"error":"function not found: \"mode\""}]}
```

In both cases, I get the "failed to write point batch" timeout in the logs.

Regards,
Olivier
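(For completeness, the same request can be reproduced outside Grafana with curl against the /query endpoint visible in the http log line above; the host, port, and credentials here are placeholders.)

```
curl -G 'http://localhost:8086/query' \
  --data-urlencode "db=graphite" \
  --data-urlencode "q=SELECT mode(value) FROM \"count\" WHERE provider = 'XXX' AND time > now() - 1h GROUP BY time(10s), api ORDER BY asc"
```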
Major changes to the write and query paths landed in 0.9.3, as well as better plugin support. Please let us know if these issues still happen with 0.9.3 installed.
I have about 18 instances of collectd sending data to the collectd listener of InfluxDB, and for the first couple of minutes everything seems to run fine until the "failed to write batch: timeout" message starts to appear.
Far as I can tell, as soon as that shows up, queries for data just hang.
versions
Linux hostname 4.0.6-1-ARCH #1 SMP PREEMPT Tue Jun 23 14:25:08 CEST 2015 x86_64 GNU/Linux
influxdb - git master - a2bf480
grafana - git master - a38a06a
collectd - 5.5.0
I happen to be running zfs root as well, but far as I can tell that shouldn't be an issue.
log (after restarting influxdb)
/etc/influxdb.conf
/etc/collectd.conf