
Deadlock in statsd input plugin #2927

Closed
buckleypm opened this issue Jun 15, 2017 · 4 comments

buckleypm commented Jun 15, 2017

Bug report

Relevant telegraf.conf:

```toml
[agent]
interval = "60s"
round_interval = true
metric_batch_size = 5000
metric_buffer_limit = 10000
collection_jitter = "0s"
flush_interval = "60s"
flush_jitter = "5s"
precision = ""
debug = false
quiet = false
logfile = "/mnt/log/telegraf/telegraf.log"

hostname = ""
omit_hostname = false

# Custom output plugins censored

[[inputs.statsd]]
service_address = "127.0.0.1:8125"
delete_gauges = true
delete_counters = false
delete_sets = true
delete_timings = true
percentiles = [90, 95, 99]
metric_separator = "."
parse_data_dog_tags = false
allowed_pending_messages = 10000
percentile_limit = 1000

[[aggregators.rate]]
period = "60s"
drop_original = true
```

System info:

Originally seen on Telegraf 1.1; rebased off master and the issue still exists.

Steps to reproduce:

  1. One of the acc.AddFields() calls at https://github.com/influxdata/telegraf/blob/master/plugins/inputs/statsd/statsd.go#L206 hangs (possibly related to #2919, "Outputs block inputs when batch size is reached")

Expected behavior:

Metrics are gathered normally

Actual behavior:

Incoming metrics get blocked because the same mutex is held at
https://github.com/influxdata/telegraf/blob/master/plugins/inputs/statsd/statsd.go#L181
and at
https://github.com/influxdata/telegraf/blob/master/plugins/inputs/statsd/statsd.go#L330
so a blocked acc.AddFields() call stalls the goroutine that processes incoming messages, eventually causing the incoming buffer to fill up and new metrics to be dropped.
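
To make the failure mode concrete, here is a minimal sketch of that pattern. It is paraphrased, not the actual plugin code: the real types (telegraf.Accumulator, the cached metric maps) are reduced to plain Go, and the blocked output path is simulated with an unbuffered channel that nobody reads.

```go
package main

// Minimal sketch of the blocking pattern, paraphrasing the statsd plugin.
// Not Telegraf code: types and channel sizes are simplified for illustration.

import (
	"fmt"
	"sync"
	"time"
)

type statsd struct {
	sync.Mutex
	in     chan string        // incoming statsd lines (allowed_pending_messages deep)
	gauges map[string]float64 // cached metrics, flushed by Gather
}

// Gather holds the mutex for the whole flush; if the accumulator (out) is
// blocked, the lock is never released.
func (s *statsd) Gather(out chan<- float64) {
	s.Lock()
	defer s.Unlock()
	for _, v := range s.gauges {
		out <- v // blocks while the downstream buffer is full
	}
}

// parser needs the same mutex for every incoming line, so it stalls behind a
// stuck Gather and s.in eventually fills up.
func (s *statsd) parser() {
	for line := range s.in {
		s.Lock()
		s.gauges[line]++
		s.Unlock()
	}
}

func main() {
	s := &statsd{
		in:     make(chan string, 10000),
		gauges: map[string]float64{"pre.existing.gauge": 1},
	}
	go s.parser()

	out := make(chan float64) // nothing ever reads this: simulates a blocked output
	go s.Gather(out)

	for i := 0; ; i++ {
		select {
		case s.in <- fmt.Sprintf("metric%d:1|g", i):
			time.Sleep(time.Microsecond)
		default:
			// Mirrors "statsd message queue full. We have dropped ... messages".
			fmt.Println("message queue full, dropping")
			return
		}
	}
}
```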

Additional info:

```
2017-06-13T01:28:00Z E! Error in plugin [inputs.statsd]: took longer to collect than collection interval (1m0s)
2017-06-13T01:28:16Z E! Error: statsd message queue full. We have dropped 10000 messages so far. You may want to increase allowed_pending_messages in the config
```

Proposal:

Remove the mutex locking in favor of dedicated per-metric-type goroutines and channels.
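
One possible shape for that proposal, purely as an illustration (the names below are invented for this sketch, not taken from Telegraf): each metric type gets a goroutine that owns its own cache, and both the parser and Gather talk to it over channels, so parsing and flushing never share a lock.

```go
// Illustrative sketch only, not Telegraf's API: a gauge worker that owns its
// cache and serves both updates and flush requests over channels.
type gaugeEvent struct {
	name  string
	value float64
}

type gaugeWorker struct {
	events chan gaugeEvent
	flush  chan chan map[string]float64 // Gather sends a reply channel here
}

func newGaugeWorker() *gaugeWorker {
	w := &gaugeWorker{
		events: make(chan gaugeEvent, 1024),
		flush:  make(chan chan map[string]float64),
	}
	go w.run()
	return w
}

func (w *gaugeWorker) run() {
	cache := map[string]float64{}
	for {
		select {
		case e := <-w.events:
			cache[e.name] = e.value // only this goroutine touches the cache
		case reply := <-w.flush:
			reply <- cache
			cache = map[string]float64{} // delete_gauges = true behaviour
		}
	}
}
```

Counters, sets, and timings would each get an equivalent worker; Gather would send a reply channel on flush and range over the returned map instead of taking a lock.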

@danielnelson
Contributor

These don't look like they can deadlock; I assume AddFields is blocking because the metric buffer is full. Have you seen #2914? Maybe try removing your aggregator.

@buckleypm
Author

@danielnelson Thank you for pointing me at the aggregator. https://github.com/influxdata/telegraf/blob/master/internal/models/running_aggregator.go#L27 I increased this value to match my metric_buffer_limit and have now gone 12 hours without a telegraf daemon getting wedged. This works great while the future of metricC is decided.
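
For reference, the workaround amounts to enlarging the aggregator's buffered metric channel. A hedged sketch, not the upstream code (the struct is trimmed to the relevant field and the constructor is simplified; 10000 is just the metric_buffer_limit from the config above):

```go
// Hedged sketch of the local workaround, not the actual running_aggregator.go:
// metricC is the aggregator's buffered metric channel, and giving it a capacity
// matching metric_buffer_limit lets it absorb flush stalls instead of
// back-pressuring the statsd input immediately.
type RunningAggregator struct {
	metricC chan telegraf.Metric
	// ... other fields elided
}

func newRunningAggregator() *RunningAggregator {
	return &RunningAggregator{
		metricC: make(chan telegraf.Metric, 10000), // the upstream default is much smaller
	}
}
```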

Also, I got the go-ahead to submit #2935. This is my first time submitting something through Google's OSS process, so hopefully I got all the t's crossed and i's dotted.

@buckleypm buckleypm reopened this Jun 16, 2017
@buckleypm
Author

We have seen a recurrence of this issue, albeit at a much lower rate: previously about 50% of our Telegraf daemons were wedged within 24 hours; after two weeks with the metric_buffer_limit change we are at about 2%.

@danielnelson danielnelson added this to the 1.4.0 milestone Aug 2, 2017
@danielnelson
Contributor

@buckleypm I believe this will be fixed in 1.4 by #3016
