[0.9.2-nightly] unresponsive on large queries/failed to write batch point, graphite plugin #3316
Here is a log snippet of the issue occurring. Because of the time of day, I was able to let it recover naturally instead of restarting influxd to force Grafana back to being responsive. Note that the browser had long been closed after the Grafana UI became unresponsive waiting for the response. https://gist.github.com/Inisfair/604be50944d688535135 In this log you can see a continuous stream of incoming data being written; eventually a large enough request comes in from Grafana to stall InfluxDB, which results in some dropped writes. Eventually (after about 25 minutes) it recovered and the stream of incoming data resumed.
Similar issue here; a snippet of the logs:
Seems like once in a while there would be a
The 0.9.3 release addresses the issue of queries affecting writes: https://influxdb.com/download/index.html
@otoolep So I need to remove all previous data and start a brand new cluster to use the bz1 engine?
@cannium -- no, you simply need to create a new database (
That new database will have bz1 shards.
OK, I'll try.
Great, let us know how it goes.
I would also try out a larger batch size, and definitely a larger timeout than what was shown above; 20 ms may be too low (see the sketch below).
My concern with a timeout as low as 20 ms is that you're not batching as much as you think. Granted, a larger timeout may introduce some latency (on the order of a second), but it should give better throughput.
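Something along these lines, for illustration only: the values are assumptions rather than anyone's actual config, and the keys assume the 0.9.3-era [[graphite]] section:

```toml
[[graphite]]
  enabled = true
  bind-address = ":2003"
  # Illustrative values: a larger batch and a timeout on the order of a second,
  # rather than the 20 ms discussed above.
  batch-size = 5000
  batch-timeout = "1s"
```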
This is my current and previous config:
I would use TCP for performance testing; otherwise it is difficult to know whether points are actually being received or silently dropped.
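As a rough, illustrative sketch of that suggestion (not anyone's actual config), that means setting the graphite listener's protocol to TCP rather than UDP, since UDP gives no feedback when packets are dropped:

```toml
[[graphite]]
  enabled = true
  bind-address = ":2003"
  # "tcp" rather than "udp": with UDP there is no feedback when packets are lost.
  protocol = "tcp"
```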
I installed 0.9.3 in a test environment and have been inserting the same data into it in parallel. So far, at least, I have not been able to make it fail to insert data via the graphite plugin due to a large chart request from Grafana. However, today I was reviewing the influxdb.log file and noticed the following entries. This does not appear to be related to large queries/Grafana requests, but possibly to the compaction task?
I have been looking at a similar problem with 0.9.2. The issue arises when there is a long-running query that spans a large number of shards, and a write request then arrives for a shard that overlaps the current time, requiring Bolt DB to acquire a write lock on its mmaplock mutex, presumably to perform extra allocation. That write lock may be blocked by the read locks on mmaplock held by the long-running query, and the pending write lock in turn prevents new queries from obtaining a read lock on mmaplock. I for one would like a way to prevent queries from running longer than a fixed amount of time, because it is quite hard to stop a Grafana user, for example, from specifying a time range that results in a long-running query.
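(For reference, later InfluxDB releases (the 1.x line, well after this thread) expose coordinator-level query limits along these lines. A minimal sketch, assuming a 1.x-era config with illustrative values; none of these settings exist in 0.9.2.)

```toml
[coordinator]
  # Abort queries that run longer than this ("0s" means no limit).
  query-timeout = "60s"
  # Log any query that runs longer than this, to identify the offenders.
  log-queries-after = "10s"
  # Maximum number of points a SELECT may process (0 means no limit).
  max-select-point = 0
```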
There were some deadlocks fixed in the 0.9.5 release. I recommend upgrading if possible, as a number of improvements and bug fixes have gone into the product since 0.9.2.
This is fixed by the
Suspect this is related to #3275, #3282, #3199. I have data coming in via the Graphite plugin, several hundred thousand metric points every couple of minutes. This works fine for days, until someone requests a large set of data for charting via Grafana. Once that happens, all rendering via Grafana stops working, and eventually I start to see "failed to write point batch to database" errors in the InfluxDB logs. The only fix seems to be to restart InfluxDB, at which point data is lost, but the errors stop and Grafana becomes responsive again until someone makes another large request. In this case, a large request is 7 days of data, which amounts to over a million raw data points being summarized.
Tuning the [data] section as suggested in #3282 (see the sketch below) seems to have significantly reduced the likelihood of a complete lockup; however, with enough read queries requesting a large enough amount of data, it is still easily reproducible in my environment.
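Roughly the kind of [data] settings that tuning touches, as a hedged sketch only: the keys are the 0.9.x WAL options as I understand them, and the values are illustrative rather than my exact config (the sample config generated by your build is the authority):

```toml
[data]
  dir = "/var/opt/influxdb/data"    # illustrative path
  # Flush the WAL more aggressively so large flushes don't pile up behind reads.
  max-wal-size = 104857600          # bytes of WAL before a forced flush
  wal-flush-interval = "10m"        # maximum time data may sit in the WAL
  wal-partition-flush-delay = "2s"  # delay between flushing each WAL partition
```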
There are currently no Continuous Queries configured.
This is on a 2x4 VM running as a single node.
Version:
Config: