[0.9.2-nightly] unresponsive on large queries/failed to write batch point, graphite plugin #3316

Closed
Inisfair opened this issue Jul 13, 2015 · 15 comments

Comments

@Inisfair

Suspect this is related to #3275, #3282, #3199. I have data coming in via the Graphite plugin, several hundred thousand metric points every couple of minutes. This works fine for days until someone requests a large set of data for charting via grafana. Once that happens, all rendering via grafana stops working, and eventually I start to see "failed to write point batch to database" errors in the influx logs. The only fix seems to be to restart influxDB, at which point data is lost but the errors stop and grafana becomes responsive again, until someone makes another large request. In this case, a large request is 7 days of data, which amounts to over a million raw data points being summarized.

Tuning the [data] section as suggested in #3282 seems to have significantly reduced the likelihood of a complete lockup; however, with enough read queries requesting a large enough amount of data, it is easily reproducible in my environment.
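
Concretely, the [data] settings as currently tuned (see the full config below) are:

[data]
  max-wal-size = 1024
  max-flush-interval = "15s"
  max-partition-flush-delay = "2s"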

There are currently no Continuous Queries configured.

This is on a 2x4 VM running as a single node.

Version:

inisfair@influxdb01 => uname -a
Linux influxdb01 2.6.32-504.el6.x86_64 #1 SMP Tue Sep 16 01:56:35 EDT 2014 x86_64 x86_64 x86_64 GNU/Linux

2015/07/13 13:22:23 InfluxDB starting, version 0.9.1-rc1-111-g8b67872, commit 8b67872d6bd1013d455e3b12af3e79b24d78bf0d

Config:

# no internet access...
reporting-disabled = true

[meta]
  dir = "/var/opt/influxdb/meta"
  hostname = "influxdb01"
  bind-address = ":8088"
  retention-autocreate = true
  election-timeout = "1s"
  heartbeat-timeout = "1s"
  leader-lease-timeout = "500ms"
  commit-timeout = "50ms"

[data]
 dir = "/var/opt/influxdb/data"
 max-wal-size = 1024
 max-flush-interval = "15s"
 max-partition-flush-delay = "2s"

[cluster]
  shard-writer-timeout = "90s"
  write-timeout = "90s"

[retention]
  enabled = true
  check-interval = "10m"

[admin]
  enabled = true
  bind-address = ":8083"

[http]
  enabled = true
  bind-address = ":8086"
  auth-enabled = false
  log-enabled = true
  write-tracing = false
  pprof-enabled = false

[[graphite]]
  enabled = true
  bind-address = ":2003"
  protocol = "tcp"
  consistency-level = "one"
  name-separator = "."

  database = "graphite_raw"
  batch-size = 250
  batch-timeout = "20ms"

  templates = [
    "tldbdial.* source.app.role.logical_host.datacenter.environment.mw_layer.api_name.mw_source.mw_provider.status.measurement template=tldbdial",
    "mwvolume.* .app.role.datacenter.environment.hostname.mw_layer.api_name.mw_source.mw_provider.status.measurement source=tldb,template=mwvolume",
    "logwatcher.* source.app.role.datacenter.environment.hostname.measurement status=totals,template=logwatcher",
    "source.app.role.datacenter.environment.hostname.measurement template=default",
  ]
  ignore-unnamed = false

[collectd]
  enabled = false
  # bind-address = ""
  # database = ""
  # typesdb = ""

[opentsdb]
  enabled = false
  # bind-address = ""
  # database = ""
  # retention-policy = ""

[udp]
  enabled = false
  # bind-address = ""
  # database = ""
  # batch-size = 0
  # batch-timeout = "0"

[monitoring]
  enabled = true
  write-interval = "24h"

[continuous_queries]
  enabled = true
  recompute-previous-n = 2
  recompute-no-older-than = "30m"
  compute-runs-per-interval = 10
  compute-no-more-than = "2m"

[hinted-handoff]
  enabled = true
  dir = "/var/opt/influxdb/hh"
  max-size = 1073741824
  max-age = "168h"
  retry-rate-limit = 0
  retry-interval = "1s"
@beckettsean beckettsean changed the title [0.9.2-nightly] unresponsive on large queries/failed to write batch point [0.9.2-nightly] unresponsive on large queries/failed to write batch point, graphite plugin Jul 14, 2015
@beckettsean beckettsean added this to the 0.9.2 milestone Jul 15, 2015
@Inisfair
Author

Here is a log snippet of the issue occurring. Because of the time of day, I was able to let it recover naturally instead of restarting influxd to force grafana back to being responsive. Note that the browser had long since been closed after the grafana UI became unresponsive waiting for the response.

https://gist.github.com/Inisfair/604be50944d688535135

In this log you can see a continuous stream of incoming data being written; then a request from grafana comes in that is large enough to stall out influxdb, resulting in some dropped writes. Eventually (after about 25 minutes) it recovered and the stream of incoming data resumed.

@toddboom toddboom modified the milestones: 0.9.2, 0.9.3 Jul 29, 2015
@beckettsean beckettsean modified the milestones: Next Point Release, 0.9.3 Aug 6, 2015
@cannium
Contributor

cannium commented Aug 27, 2015

Similar issue here, a snippet of logs:

[b1] 2015/08/27 15:14:25 flush 156818 points in 35.160s
[b1] 2015/08/27 15:15:09 flush 159034 points in 40.072s
[b1] 2015/08/27 15:15:51 flush 158285 points in 38.009s
[b1] 2015/08/27 15:16:36 flush 160066 points in 41.388s
[b1] 2015/08/27 15:17:17 flush 13118 points in 36.533s
[b1] 2015/08/27 15:17:29 flush 11838 points in 7.961s
[b1] 2015/08/27 15:18:08 flush 12081 points in 35.295s
[graphite] 2015/08/27 15:18:30 failed to write point batch to database "graphite": timeout
[b1] 2015/08/27 15:18:34 flush 9370 points in 22.244s
[b1] 2015/08/27 15:18:59 flush 7903 points in 20.771s
[b1] 2015/08/27 15:19:20 flush 6693 points in 17.271s
[graphite] 2015/08/27 15:19:30 failed to write point batch to database "graphite": timeout
[b1] 2015/08/27 15:19:44 flush 5377 points in 19.472s
[b1] 2015/08/27 15:20:02 flush 5375 points in 13.706s
[graphite] 2015/08/27 15:20:30 failed to write point batch to database "graphite": timeout

It seems that once in a while there is a timeout and some data points are lost. I can see from iostat that IOPS is very high, but as @Inisfair reported, it also happens when there are only a few hundred points to write, so I suspect there are some hidden bugs.

@otoolep
Contributor

otoolep commented Aug 27, 2015

The 0.9.3 release addresses the issue of queries affecting write performance, in many circumstances. You should upgrade and see if performance improves. You will need to write to a brand new database to ensure you use the bz1 engine.

https://influxdb.com/download/index.html

@cannium
Contributor

cannium commented Aug 28, 2015

@otoolep So I need to remove all previous data and start a brand-new cluster to use the bz1 engine?

@otoolep
Contributor

otoolep commented Aug 28, 2015

@cannium -- no, you simply need to create a new database (CREATE DATABASE) and start writing to it.
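
For example (graphite_bz1 is just a placeholder name): run CREATE DATABASE graphite_bz1 from the influx shell, then point the graphite listener at it in the config and restart:

[[graphite]]
  enabled = true
  bind-address = ":2003"
  # point the listener at the freshly created database; under 0.9.3 its
  # shards will use the bz1 engine, while the old database keeps its b1 shards
  database = "graphite_bz1"   # placeholder name, created via CREATE DATABASE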

@otoolep
Contributor

otoolep commented Aug 28, 2015

That new database will have bz1 shards.

@cannium
Contributor

cannium commented Aug 28, 2015

OK, I'll try

@otoolep
Contributor

otoolep commented Aug 28, 2015

Great, let us know how it goes.

@otoolep
Contributor

otoolep commented Aug 28, 2015

I would also try out a larger batch size, and definitely a larger timeout than what was shown above. 20 ms may be too low. E.g.

  batch-size = 1000
  batch-timeout = "1s"

My concern with a timeout as low as 20ms is that you're not batching as much as you think. Granted, it may introduce some latency (up to a second), but it should provide better throughput.

@cannium
Contributor

cannium commented Aug 28, 2015

This is my current config, which is the same as before:

[[graphite]]
  enabled = true
  bind-address = ":2003"
  protocol = "udp"
  consistency-level = "any"
  name-separator = "."

  batch-size = 10240 # will flush if this many points get buffered
  batch-timeout = "1s" # will flush at least this often even if we haven't hit buffer limit

@otoolep
Contributor

otoolep commented Aug 28, 2015

I would use TCP for performance testing. Otherwise it is difficult to know if the server is keeping up.
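
For example, keeping the rest of your posted settings and changing only the protocol, a sketch:

[[graphite]]
  enabled = true
  bind-address = ":2003"
  # TCP gives senders back-pressure and connection errors, so it is much
  # easier to tell whether the server is keeping up than with UDP,
  # which silently drops packets under load
  protocol = "tcp"
  consistency-level = "any"

  batch-size = 10240
  batch-timeout = "1s"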

@Inisfair
Author

I installed 0.9.3 in a test environment and have been inserting the same data into it in parallel. So far, at least, I have not been able to make it fail to insert data via the graphite plugin due to a large chart request from grafana. However, today I was reviewing the influxdb.log file and noticed the following entries.

This does not appear to be related to large queries or grafana requests, but possibly to the compaction task?

[wal] 2015/08/31 15:04:17 Flushing 0 measurements and 214 series to index
[wal] 2015/08/31 15:04:31 Metadata flush took 13.150238408s
[retention] 2015/08/31 15:04:34 retention policy enforcement check commencing
[retention] 2015/08/31 15:04:34 retention policy shard deletion check commencing
[graphite] 2015/08/31 15:07:33 failed to write point batch to database "graphite_raw": timeout
[wal] 2015/08/31 15:07:34 Flush due to memory. Flushing 39 series with 1786887 bytes from partition 3. Compacting 12368 series
[wal] 2015/08/31 15:07:42 compaction of partition 3 took 7.186054746s
[wal] 2015/08/31 15:07:56 Flush due to memory. Flushing 75 series with 632281 bytes from partition 5. Compacting 13338 series
[wal] 2015/08/31 15:08:05 compaction of partition 5 took 8.704178898s
[wal] 2015/08/31 15:09:05 Flush due to memory. Flushing 87 series with 744447 bytes from partition 1. Compacting 13480 series
[graphite] 2015/08/31 15:09:17 failed to write point batch to database "graphite_raw": timeout
[graphite] 2015/08/31 15:09:22 failed to write point batch to database "graphite_raw": timeout
[graphite] 2015/08/31 15:09:37 failed to write point batch to database "graphite_raw": timeout
[wal] 2015/08/31 15:09:37 Flush due to memory. Flushing 79 series with 677994 bytes from partition 2. Compacting 13359 series
[wal] 2015/08/31 15:09:38 compaction of partition 1 took 32.98032967s
[graphite] 2015/08/31 15:10:14 failed to write point batch to database "graphite_raw": timeout
[wal] 2015/08/31 15:10:14 Flush due to memory. Flushing 2606 series with 14185922 bytes from partition 5. Compacting 10840 series
[graphite] 2015/08/31 15:10:30 failed to write point batch to database "graphite_raw": timeout
[wal] 2015/08/31 15:10:37 compaction of partition 2 took 1m0.063656549s
[wal] 2015/08/31 15:11:44 compaction of partition 5 took 1m29.754678882s
[wal] 2015/08/31 15:14:17 Flushing 0 measurements and 260 series to index
[wal] 2015/08/31 15:14:27 Metadata flush took 9.687606789s
[retention] 2015/08/31 15:14:34 retention policy enforcement check commencing
[retention] 2015/08/31 15:14:34 retention policy shard deletion check commencing
[wal] 2015/08/31 15:17:23 Flush due to idle. Flushing 12571 series with 7251333 bytes from partition 5. Compacting 0 series
[wal] 2015/08/31 15:17:29 Flush due to memory. Flushing 2596 series with 14224954 bytes from partition 2. Compacting 10900 series
[wal] 2015/08/31 15:17:33 Flush due to memory. Flushing 2584 series with 14196496 bytes from partition 1. Compacting 11040 series
[http] 2015/08/31 15:17:50 127.0.0.1 - grafana [31/Aug/2015:15:17:23 -0600] GET /query?db=graphite_raw&epoch=ms&p=%5BREDACTED HTTP/1.1 200 2719 http://localhost/dashboard/db/sxs-prod-overview Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36 ac031917-5025-11e5-8c88-000000000000 26.850590558s
[http] 2015/08/31 15:17:54 127.0.0.1 - grafana [31/Aug/2015:15:17:23 -0600] GET /query?db=graphite_raw&epoch=ms&p=%5BREDACTED HTTP/1.1 200 2679 http://localhost/dashboard/db/sxs-prod-overview Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36 ac05aef2-5025-11e5-8c89-000000000000 30.638434664s
[graphite] 2015/08/31 15:18:13 failed to write point batch to database "graphite_raw": timeout
[graphite] 2015/08/31 15:18:18 failed to write point batch to database "graphite_raw": timeout
[wal] 2015/08/31 15:18:20 compaction of partition 5 took 56.949228554s
[wal] 2015/08/31 15:18:43 compaction of partition 2 took 1m13.486551683s
[wal] 2015/08/31 15:19:03 compaction of partition 1 took 1m29.649347006s
[wal] 2015/08/31 15:19:27 Flush due to memory. Flushing 665 series with 8346167 bytes from partition 3. Compacting 11901 series
[wal] 2015/08/31 15:19:36 compaction of partition 3 took 8.547117345s
[wal] 2015/08/31 15:24:17 Flushing 0 measurements and 212 series to index
[wal] 2015/08/31 15:24:24 Metadata flush took 6.679940445s

@jonseymour
Contributor

I have been looking at a similar problem with 0.9.2. The issue arises when a long-running query spans a large number of shards and a write request then arrives that requires BoltDB to acquire a write lock on the BoltDB mmaplock mutex, presumably to perform extra allocation in a shard that overlaps the current time. The acquisition of the write lock may be blocked by read locks on mmaplock held by the long-running query, and the pending write lock in turn prevents new queries from obtaining a read lock on mmaplock.

I for one would like a way to prevent long-running queries from running more than a fixed amount of time, because it is quite hard to prevent a grafana user, for example, from specifying a time range that results in a long-running query.
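
Something along these lines in the config would be enough. This is a hypothetical sketch, since 0.9.2 has no such option; as far as I know, much later releases did add similar knobs (query-timeout, log-queries-after) under a [coordinator] section:

[coordinator]
  # abort any query that runs longer than this ("0s" would mean no limit)
  query-timeout = "60s"
  # additionally log queries that exceed this threshold, to spot the
  # dashboards responsible for long-running queries
  log-queries-after = "10s"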

@beckettsean
Contributor

There were some deadlocks fixed in the 0.9.5 release. I recommend upgrading if possible, as a number of improvements and bug fixes have gone into the product since 0.9.2.

@jwilder
Contributor

jwilder commented Mar 18, 2016

This is fixed by the tsm engine.
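
For anyone finding this later: if I remember the versions right, the tsm engine could be opted into for newly created shards via the [data] section starting around 0.9.5/0.9.6, and it became the default in later releases; existing b1/bz1 shards keep their old format. A sketch, assuming the paths from the config above:

[data]
  dir = "/var/opt/influxdb/data"
  # newly created shards use the tsm engine; existing b1/bz1 shards are
  # not converted automatically
  engine = "tsm1"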

@jwilder jwilder closed this as completed Mar 18, 2016
@timhallinflux timhallinflux modified the milestones: 0.10.0, Future Point Release Dec 19, 2016