[0.9.2-nightly] unresponsive on large queries/failed to write batch point, graphite plugin #3316

Closed
Inisfair opened this issue Jul 13, 2015 · 15 comments

Comments

@Inisfair

Suspect this is related to #3275, #3282, #3199. I have data coming in via the Graphite plugin, several hundred thousand metric points every couple of minutes. This works fine for days until someone requests a large set of data for charting via grafana. Once that happens, all rendering via grafana stops working, and eventually I start to see "failed to write point batch to database" errors in the influx logs. The only fix seems to be to restart influxDB, at which point data is lost but the errors stop and grafana becomes responsive again, until someone makes another large request. In this case, a large request is 7 days of data, which amounts to over a million raw data points being summarized.

Tuning the [data] section as suggested in #3282 seems to have significantly reduced the likelihood of a complete lockup; however, with enough read queries requesting a large enough amount of data, it is easily reproducible in my environment.
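
Concretely, the [data] settings as currently tuned (see the full config below) are:

[data]
  max-wal-size = 1024
  max-flush-interval = "15s"
  max-partition-flush-delay = "2s"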

There are currently no Continuous Queries configured.

This is on a 2x4 VM running as a single node.

Version:

inisfair@influxdb01 => uname -a
Linux influxdb01 2.6.32-504.el6.x86_64 #1 SMP Tue Sep 16 01:56:35 EDT 2014 x86_64 x86_64 x86_64 GNU/Linux

2015/07/13 13:22:23 InfluxDB starting, version 0.9.1-rc1-111-g8b67872, commit 8b67872d6bd1013d455e3b12af3e79b24d78bf0d

Config:

# no internet access...
reporting-disabled = true

[meta]
  dir = "/var/opt/influxdb/meta"
  hostname = "influxdb01"
  bind-address = ":8088"
  retention-autocreate = true
  election-timeout = "1s"
  heartbeat-timeout = "1s"
  leader-lease-timeout = "500ms"
  commit-timeout = "50ms"

[data]
 dir = "/var/opt/influxdb/data"
 max-wal-size = 1024
 max-flush-interval = "15s"
 max-partition-flush-delay = "2s"

[cluster]
  shard-writer-timeout = "90s"
  write-timeout = "90s"

[retention]
  enabled = true
  check-interval = "10m"

[admin]
  enabled = true
  bind-address = ":8083"

[http]
  enabled = true
  bind-address = ":8086"
  auth-enabled = false
  log-enabled = true
  write-tracing = false
  pprof-enabled = false

[[graphite]]
  enabled = true
  bind-address = ":2003"
  protocol = "tcp"
  consistency-level = "one"
  name-separator = "."

  database = "graphite_raw"
  batch-size = 250
  batch-timeout = "20ms"

  templates = [
    "tldbdial.* source.app.role.logical_host.datacenter.environment.mw_layer.api_name.mw_source.mw_provider.status.measurement template=tldbdial",
    "mwvolume.* .app.role.datacenter.environment.hostname.mw_layer.api_name.mw_source.mw_provider.status.measurement source=tldb,template=mwvolume",
    "logwatcher.* source.app.role.datacenter.environment.hostname.measurement status=totals,template=logwatcher",
    "source.app.role.datacenter.environment.hostname.measurement template=default",
  ]
  ignore-unnamed = false

[collectd]
  enabled = false
  # bind-address = ""
  # database = ""
  # typesdb = ""

[opentsdb]
  enabled = false
  # bind-address = ""
  # database = ""
  # retention-policy = ""

[udp]
  enabled = false
  # bind-address = ""
  # database = ""
  # batch-size = 0
  # batch-timeout = "0"

[monitoring]
  enabled = true
  write-interval = "24h"

[continuous_queries]
  enabled = true
  recompute-previous-n = 2
  recompute-no-older-than = "30m"
  compute-runs-per-interval = 10
  compute-no-more-than = "2m"

[hinted-handoff]
  enabled = true
  dir = "/var/opt/influxdb/hh"
  max-size = 1073741824
  max-age = "168h"
  retry-rate-limit = 0
  retry-interval = "1s"
@beckettsean beckettsean changed the title [0.9.2-nightly] unresponsive on large queries/failed to write batch point [0.9.2-nightly] unresponsive on large queries/failed to write batch point, graphite plugin Jul 14, 2015
@beckettsean beckettsean added this to the 0.9.2 milestone Jul 15, 2015
@Inisfair
Author

Here is a log snippet of the issue occurring. Because of the time of day, I was able to let it recover naturally instead of restarting influxd to force grafana back to being responsive. Note that the browser had long since been closed after the grafana UI became unresponsive waiting for the response.

https://gist.github.com/Inisfair/604be50944d688535135

In this log you can see a continuous stream of incoming data being written; then a request from grafana comes in that is large enough to stall out influxdb, resulting in some dropped writes. Eventually (after about 25 minutes) it recovered and the stream of incoming data resumed.

@toddboom toddboom modified the milestones: 0.9.2, 0.9.3 Jul 29, 2015
@beckettsean beckettsean modified the milestones: Next Point Release, 0.9.3 Aug 6, 2015
@cannium
Contributor

cannium commented Aug 27, 2015

Similar issue here, a snippet of logs:

[b1] 2015/08/27 15:14:25 flush 156818 points in 35.160s
[b1] 2015/08/27 15:15:09 flush 159034 points in 40.072s
[b1] 2015/08/27 15:15:51 flush 158285 points in 38.009s
[b1] 2015/08/27 15:16:36 flush 160066 points in 41.388s
[b1] 2015/08/27 15:17:17 flush 13118 points in 36.533s
[b1] 2015/08/27 15:17:29 flush 11838 points in 7.961s
[b1] 2015/08/27 15:18:08 flush 12081 points in 35.295s
[graphite] 2015/08/27 15:18:30 failed to write point batch to database "graphite": timeout
[b1] 2015/08/27 15:18:34 flush 9370 points in 22.244s
[b1] 2015/08/27 15:18:59 flush 7903 points in 20.771s
[b1] 2015/08/27 15:19:20 flush 6693 points in 17.271s
[graphite] 2015/08/27 15:19:30 failed to write point batch to database "graphite": timeout
[b1] 2015/08/27 15:19:44 flush 5377 points in 19.472s
[b1] 2015/08/27 15:20:02 flush 5375 points in 13.706s
[graphite] 2015/08/27 15:20:30 failed to write point batch to database "graphite": timeout

It seems that once in a while there is a timeout and some data points are lost. I can see from iostat that IOPS is very high, but as @Inisfair reported, it also happens when there are only a few hundred points to write, so I suspect there are some hidden bugs.

@otoolep
Contributor

otoolep commented Aug 27, 2015

The 0.9.3 release addresses the issue of queries affecting write performance, in many circumstances. You should upgrade and see if performance improves. You will need to write to a brand new database to ensure you use the bz1 engine.

https://influxdb.com/download/index.html

@cannium
Contributor

cannium commented Aug 28, 2015

@otoolep So I need to remove all previous data and start a brand-new cluster to use the bz1 engine?

@otoolep
Contributor

otoolep commented Aug 28, 2015

@cannium -- no, you simply need to create a new database (CREATE DATABASE) and start writing to it.
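
For example (graphite_bz1 is just a placeholder name): run CREATE DATABASE graphite_bz1 from the influx shell, then point the graphite listener at it in the config and restart:

[[graphite]]
  enabled = true
  bind-address = ":2003"
  # point the listener at the freshly created database; under 0.9.3 its
  # shards will use the bz1 engine, while the old database keeps its b1 shards
  database = "graphite_bz1"   # placeholder name, created via CREATE DATABASE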

@otoolep
Contributor

otoolep commented Aug 28, 2015

That new database will have bz1 shards.

@cannium
Contributor

cannium commented Aug 28, 2015

OK, I'll try

@otoolep
Contributor

otoolep commented Aug 28, 2015

Great, let us know how it goes.

@otoolep
Contributor

otoolep commented Aug 28, 2015

I would also try out a larger batch size, and definitely a larger timeout than what was shown above. 20 ms may be too low. E.g.

  batch-size = 1000
  batch-timeout = "1s"

My concern with a timeout as low as 20ms is that you're not batching as much as you think. Granted, it may introduce some latency (up to a second), but it should provide better throughput.

@cannium
Contributor

cannium commented Aug 28, 2015

This is my current config, which is the same as before:

[[graphite]]
  enabled = true
  bind-address = ":2003"
  protocol = "udp"
  consistency-level = "any"
  name-separator = "."

  batch-size = 10240 # will flush if this many points get buffered
  batch-timeout = "1s" # will flush at least this often even if we haven't hit buffer limit

@otoolep
Contributor

otoolep commented Aug 28, 2015

I would use TCP for performance testing. Otherwise it is difficult to know if the server is keeping up.
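
For example, keeping the rest of your posted settings and changing only the protocol, a sketch:

[[graphite]]
  enabled = true
  bind-address = ":2003"
  # TCP gives senders back-pressure and connection errors, so it is much
  # easier to tell whether the server is keeping up than with UDP,
  # which silently drops packets under load
  protocol = "tcp"
  consistency-level = "any"

  batch-size = 10240
  batch-timeout = "1s"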

@Inisfair
Author

I installed 0.9.3 in a test environment and have been inserting the same data into it in parallel. So far, at least, I have not been able to make it fail to insert data via the graphite plugin due to a large chart request from grafana. However, today I was reviewing the influxdb.log file and noticed the following entries.

This does not appear to be related to large queries or grafana requests, but possibly to the compaction task?

[wal] 2015/08/31 15:04:17 Flushing 0 measurements and 214 series to index
[wal] 2015/08/31 15:04:31 Metadata flush took 13.150238408s
[retention] 2015/08/31 15:04:34 retention policy enforcement check commencing
[retention] 2015/08/31 15:04:34 retention policy shard deletion check commencing
[graphite] 2015/08/31 15:07:33 failed to write point batch to database "graphite_raw": timeout
[wal] 2015/08/31 15:07:34 Flush due to memory. Flushing 39 series with 1786887 bytes from partition 3. Compacting 12368 series
[wal] 2015/08/31 15:07:42 compaction of partition 3 took 7.186054746s
[wal] 2015/08/31 15:07:56 Flush due to memory. Flushing 75 series with 632281 bytes from partition 5. Compacting 13338 series
[wal] 2015/08/31 15:08:05 compaction of partition 5 took 8.704178898s
[wal] 2015/08/31 15:09:05 Flush due to memory. Flushing 87 series with 744447 bytes from partition 1. Compacting 13480 series
[graphite] 2015/08/31 15:09:17 failed to write point batch to database "graphite_raw": timeout
[graphite] 2015/08/31 15:09:22 failed to write point batch to database "graphite_raw": timeout
[graphite] 2015/08/31 15:09:37 failed to write point batch to database "graphite_raw": timeout
[wal] 2015/08/31 15:09:37 Flush due to memory. Flushing 79 series with 677994 bytes from partition 2. Compacting 13359 series
[wal] 2015/08/31 15:09:38 compaction of partition 1 took 32.98032967s
[graphite] 2015/08/31 15:10:14 failed to write point batch to database "graphite_raw": timeout
[wal] 2015/08/31 15:10:14 Flush due to memory. Flushing 2606 series with 14185922 bytes from partition 5. Compacting 10840 series
[graphite] 2015/08/31 15:10:30 failed to write point batch to database "graphite_raw": timeout
[wal] 2015/08/31 15:10:37 compaction of partition 2 took 1m0.063656549s
[wal] 2015/08/31 15:11:44 compaction of partition 5 took 1m29.754678882s
[wal] 2015/08/31 15:14:17 Flushing 0 measurements and 260 series to index
[wal] 2015/08/31 15:14:27 Metadata flush took 9.687606789s
[retention] 2015/08/31 15:14:34 retention policy enforcement check commencing
[retention] 2015/08/31 15:14:34 retention policy shard deletion check commencing
[wal] 2015/08/31 15:17:23 Flush due to idle. Flushing 12571 series with 7251333 bytes from partition 5. Compacting 0 series
[wal] 2015/08/31 15:17:29 Flush due to memory. Flushing 2596 series with 14224954 bytes from partition 2. Compacting 10900 series
[wal] 2015/08/31 15:17:33 Flush due to memory. Flushing 2584 series with 14196496 bytes from partition 1. Compacting 11040 series
[http] 2015/08/31 15:17:50 127.0.0.1 - grafana [31/Aug/2015:15:17:23 -0600] GET /query?db=graphite_raw&epoch=ms&p=%5BREDACTED HTTP/1.1 200 2719 http://localhost/dashboard/db/sxs-prod-overview Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36 ac031917-5025-11e5-8c88-000000000000 26.850590558s
[http] 2015/08/31 15:17:54 127.0.0.1 - grafana [31/Aug/2015:15:17:23 -0600] GET /query?db=graphite_raw&epoch=ms&p=%5BREDACTED HTTP/1.1 200 2679 http://localhost/dashboard/db/sxs-prod-overview Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36 ac05aef2-5025-11e5-8c89-000000000000 30.638434664s
[graphite] 2015/08/31 15:18:13 failed to write point batch to database "graphite_raw": timeout
[graphite] 2015/08/31 15:18:18 failed to write point batch to database "graphite_raw": timeout
[wal] 2015/08/31 15:18:20 compaction of partition 5 took 56.949228554s
[wal] 2015/08/31 15:18:43 compaction of partition 2 took 1m13.486551683s
[wal] 2015/08/31 15:19:03 compaction of partition 1 took 1m29.649347006s
[wal] 2015/08/31 15:19:27 Flush due to memory. Flushing 665 series with 8346167 bytes from partition 3. Compacting 11901 series
[wal] 2015/08/31 15:19:36 compaction of partition 3 took 8.547117345s
[wal] 2015/08/31 15:24:17 Flushing 0 measurements and 212 series to index
[wal] 2015/08/31 15:24:24 Metadata flush took 6.679940445s

@jonseymour
Contributor

I have been looking at a similar problem with 0.9.2. The issue arises when a long-running query spans a large number of shards and a write request then arrives that requires BoltDB to acquire a write lock on the BoltDB mmaplock mutex, presumably to perform extra allocation in a shard that overlaps the current time. The acquisition of the write lock may be blocked by read locks on mmaplock held by the long-running query, and the pending write lock in turn prevents new queries from obtaining a read lock on mmaplock.

I for one would like a way to prevent long-running queries from running more than a fixed amount of time, because it is quite hard to prevent a grafana user, for example, from specifying a time range that results in a long-running query.
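
Something along these lines in the config would be enough. This is a hypothetical sketch, since 0.9.2 has no such option; as far as I know, much later releases did add similar knobs (query-timeout, log-queries-after) under a [coordinator] section:

[coordinator]
  # abort any query that runs longer than this ("0s" would mean no limit)
  query-timeout = "60s"
  # additionally log queries that exceed this threshold, to spot the
  # dashboards responsible for long-running queries
  log-queries-after = "10s"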

@beckettsean
Contributor

There were some deadlocks fixed in the 0.9.5 release. I recommend upgrading if possible, as a number of improvements and bug fixes have gone into the product since 0.9.2.

@jwilder
Contributor

jwilder commented Mar 18, 2016

This is fixed by the tsm engine.
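
For anyone finding this later: if I remember the versions right, the tsm engine could be opted into for newly created shards via the [data] section starting around 0.9.5/0.9.6, and it became the default in later releases; existing b1/bz1 shards keep their old format. A sketch, assuming the paths from the config above:

[data]
  dir = "/var/opt/influxdb/data"
  # newly created shards use the tsm engine; existing b1/bz1 shards are
  # not converted automatically
  engine = "tsm1"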

@jwilder jwilder closed this as completed Mar 18, 2016
@timhallinflux timhallinflux modified the milestones: 0.10.0, Future Point Release Dec 19, 2016