
Deadlock in statsd input plugin #2927

Closed
buckleypm opened this issue Jun 15, 2017 · 4 comments

buckleypm commented Jun 15, 2017

Bug report

Relevant telegraf.conf:

```toml
[agent]
interval = "60s"
round_interval = true
metric_batch_size = 5000
metric_buffer_limit = 10000
collection_jitter = "0s"
flush_interval = "60s"
flush_jitter = "5s"
precision = ""
debug = false
quiet = false
logfile = "/mnt/log/telegraf/telegraf.log"

hostname = ""
omit_hostname = false

# Custom output plugins censored

[[inputs.statsd]]
service_address = "127.0.0.1:8125"
delete_gauges = true
delete_counters = false
delete_sets = true
delete_timings = true
percentiles = [90, 95, 99]
metric_separator = "."
parse_data_dog_tags = false
allowed_pending_messages = 10000
percentile_limit = 1000

[[aggregators.rate]]
period = "60s"
drop_original = true
```

System info:

Originally seen on Telegraf 1.1; rebased off master and the issue still exists.

Steps to reproduce:

  1. One of the acc.AddFields() calls at https://github.com/influxdata/telegraf/blob/master/plugins/inputs/statsd/statsd.go#L206 hangs (possibly related to #2919, "Outputs block inputs when batch size is reached")

Expected behavior:

Metrics are gathered normally

Actual behavior:

Incoming metrics get blocked because the same mutex is held at
https://github.com/influxdata/telegraf/blob/master/plugins/inputs/statsd/statsd.go#L181
and at
https://github.com/influxdata/telegraf/blob/master/plugins/inputs/statsd/statsd.go#L330
so a blocked acc.AddFields() call stalls the goroutine that processes incoming messages, eventually causing the incoming buffer to fill up and new metrics to be dropped.
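
To make the failure mode concrete, here is a minimal sketch of that pattern. It is paraphrased, not the actual plugin code: the real types (telegraf.Accumulator, the cached metric maps) are reduced to plain Go, and the blocked output path is simulated with an unbuffered channel that nobody reads.

```go
package main

// Minimal sketch of the blocking pattern, paraphrasing the statsd plugin.
// Not Telegraf code: types and channel sizes are simplified for illustration.

import (
	"fmt"
	"sync"
	"time"
)

type statsd struct {
	sync.Mutex
	in     chan string        // incoming statsd lines (allowed_pending_messages deep)
	gauges map[string]float64 // cached metrics, flushed by Gather
}

// Gather holds the mutex for the whole flush; if the accumulator (out) is
// blocked, the lock is never released.
func (s *statsd) Gather(out chan<- float64) {
	s.Lock()
	defer s.Unlock()
	for _, v := range s.gauges {
		out <- v // blocks while the downstream buffer is full
	}
}

// parser needs the same mutex for every incoming line, so it stalls behind a
// stuck Gather and s.in eventually fills up.
func (s *statsd) parser() {
	for line := range s.in {
		s.Lock()
		s.gauges[line]++
		s.Unlock()
	}
}

func main() {
	s := &statsd{
		in:     make(chan string, 10000),
		gauges: map[string]float64{"pre.existing.gauge": 1},
	}
	go s.parser()

	out := make(chan float64) // nothing ever reads this: simulates a blocked output
	go s.Gather(out)

	for i := 0; ; i++ {
		select {
		case s.in <- fmt.Sprintf("metric%d:1|g", i):
			time.Sleep(time.Microsecond)
		default:
			// Mirrors "statsd message queue full. We have dropped ... messages".
			fmt.Println("message queue full, dropping")
			return
		}
	}
}
```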

Additional info:

```
2017-06-13T01:28:00Z E! Error in plugin [inputs.statsd]: took longer to collect than collection interval (1m0s)
2017-06-13T01:28:16Z E! Error: statsd message queue full. We have dropped 10000 messages so far. You may want to increase allowed_pending_messages in the config
```

Proposal:

Remove the mutex locking in favor of dedicated per-metric-type goroutines and channels.
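
One possible shape for that proposal, purely as an illustration (the names below are invented for this sketch, not taken from Telegraf): each metric type gets a goroutine that owns its own cache, and both the parser and Gather talk to it over channels, so parsing and flushing never share a lock.

```go
// Illustrative sketch only, not Telegraf's API: a gauge worker that owns its
// cache and serves both updates and flush requests over channels.
type gaugeEvent struct {
	name  string
	value float64
}

type gaugeWorker struct {
	events chan gaugeEvent
	flush  chan chan map[string]float64 // Gather sends a reply channel here
}

func newGaugeWorker() *gaugeWorker {
	w := &gaugeWorker{
		events: make(chan gaugeEvent, 1024),
		flush:  make(chan chan map[string]float64),
	}
	go w.run()
	return w
}

func (w *gaugeWorker) run() {
	cache := map[string]float64{}
	for {
		select {
		case e := <-w.events:
			cache[e.name] = e.value // only this goroutine touches the cache
		case reply := <-w.flush:
			reply <- cache
			cache = map[string]float64{} // delete_gauges = true behaviour
		}
	}
}
```

Counters, sets, and timings would each get an equivalent worker; Gather would send a reply channel on flush and range over the returned map instead of taking a lock.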

@danielnelson
Contributor

These don't look like they can deadlock; I assume AddFields is blocking because the metric buffer is full. Have you seen #2914? Maybe try removing your aggregator.

@buckleypm
Author

@danielnelson Thank you for pointing me at the aggregator. https://github.com/influxdata/telegraf/blob/master/internal/models/running_aggregator.go#L27 I increased this value to match my metric_buffer_limit and have now gone 12 hours without a telegraf daemon getting wedged. This works great while the future of metricC is decided.
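
For reference, the workaround amounts to enlarging the aggregator's buffered metric channel. A hedged sketch, not the upstream code (the struct is trimmed to the relevant field and the constructor is simplified; 10000 is just the metric_buffer_limit from the config above):

```go
// Hedged sketch of the local workaround, not the actual running_aggregator.go:
// metricC is the aggregator's buffered metric channel, and giving it a capacity
// matching metric_buffer_limit lets it absorb flush stalls instead of
// back-pressuring the statsd input immediately.
type RunningAggregator struct {
	metricC chan telegraf.Metric
	// ... other fields elided
}

func newRunningAggregator() *RunningAggregator {
	return &RunningAggregator{
		metricC: make(chan telegraf.Metric, 10000), // the upstream default is much smaller
	}
}
```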

Also, I got the go-ahead to submit #2935. This is my first time submitting something through Google's OSS process, so hopefully I got all the t's crossed and i's dotted.

@buckleypm buckleypm reopened this Jun 16, 2017
@buckleypm
Author

We have seen a recurrence of this issue, albeit at a much lower rate: previously about 50% of our Telegraf daemons were wedged within 24 hours; after two weeks with the metric_buffer_limit change we are at about 2%.

@danielnelson danielnelson added this to the 1.4.0 milestone Aug 2, 2017
@danielnelson
Contributor

@buckleypm I believe this will be fixed in 1.4 by #3016
