Deadlock in statsd input plugin #2927
Comments
These don't look like they can deadlock; I assume AddFields is blocking because it is full. Have you seen #2914? Maybe try removing your aggregator.
@danielnelson Thank you for pointing me at the aggregator. https://github.com/influxdata/telegraf/blob/master/internal/models/running_aggregator.go#L27 I increased this value to match my metric_buffer_limit and have now gone 12 hours without a telegraf daemon getting wedged. This works great while the future of metricC is decided. Also, I got the go-ahead to submit #2935. This is the first time I'm submitting something through Google's OSS process, so hopefully I got all the t's crossed and i's dotted.
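For context on why the buffer size matters: in Go, a send on a full buffered channel blocks until a receiver drains it, so a slow aggregator or output path eventually stalls the producer. Below is a minimal, self-contained sketch of that behavior (illustration only, not Telegraf code; metricC here is just a stand-in name for the aggregator's channel):

package main

import (
	"fmt"
	"time"
)

func main() {
	// Stand-in for the aggregator's metricC: a small fixed-capacity buffer.
	// Raising the capacity (as described above) only delays the point at
	// which the producer blocks; it does not remove the blocking.
	metricC := make(chan string, 3)
	done := make(chan struct{})

	// Slow consumer, standing in for a slow aggregator/output path.
	go func() {
		for m := range metricC {
			time.Sleep(200 * time.Millisecond)
			fmt.Println("flushed", m)
		}
		close(done)
	}()

	// Producer, standing in for AddFields: once 3 items are queued, the
	// next send blocks until the consumer makes room.
	for i := 0; i < 10; i++ {
		metricC <- fmt.Sprintf("metric-%d", i)
		fmt.Println("queued metric", i)
	}
	close(metricC)
	<-done
}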
We have seen a recurrence of this issue, albeit at a smaller percentage: we were seeing 50% of our telegraf daemons wedged within 24 hours, and after two weeks with the metric_buffer_limit change it is down to about 2%.
@buckleypm I believe this will be fixed in 1.4 by #3016
Bug report
Relevant telegraf.conf:
[agent]
interval = "60s"
round_interval = true
metric_batch_size = 5000
metric_buffer_limit = 10000
collection_jitter = "0s"
flush_interval = "60s"
flush_jitter = "5s"
precision = ""
debug = false
quiet = false
logfile = "/mnt/log/telegraf/telegraf.log"
hostname = ""
omit_hostname = false
Custom output plugins censored
[[inputs.statsd]]
service_address = "127.0.0.1:8125"
delete_gauges = true
delete_counters = false
delete_sets = true
delete_timings = true
percentiles = [90, 95, 99]
metric_separator = "."
parse_data_dog_tags = false
allowed_pending_messages = 10000
percentile_limit = 1000
[[aggregators.rate]]
period = "60s"
drop_original = true
System info:
Originally seen on 1.1; rebased off master and the issue still exists.
Steps to reproduce:
Expected behavior:
Metrics are gathered normally
Actual behavior:
Incoming metrics get blocked due to mutex contention between
https://github.com/influxdata/telegraf/blob/master/plugins/inputs/statsd/statsd.go#L181
and
https://github.com/influxdata/telegraf/blob/master/plugins/inputs/statsd/statsd.go#L330
eventually causing the incoming buffer to fill up and new metrics to be dropped.
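To make the blocking pattern easier to follow, here is a simplified sketch with hypothetical names (statsd, Gather, parser, addFields); it is not the actual plugin code, only an illustration of how holding the metric-cache mutex while a potentially blocking accumulator call is in flight can starve the goroutine that drains the UDP queue:

package main

import "sync"

type statsd struct {
	mu     sync.Mutex
	in     chan string        // incoming UDP payloads (allowed_pending_messages deep)
	gauges map[string]float64 // cache protected by mu
}

// Gather runs on the collection interval. It holds the lock while pushing
// cached metrics downstream; if addFields blocks (as it can when the
// downstream channel is full), the lock is never released.
func (s *statsd) Gather(addFields func(name string, value float64)) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for name, v := range s.gauges {
		addFields(name, v) // may block while s.mu is still held
	}
}

// parser runs in its own goroutine, draining s.in. It needs the same lock to
// update the cache, so once Gather is stuck the channel fills and new UDP
// messages are dropped ("statsd message queue full").
func (s *statsd) parser() {
	for line := range s.in {
		s.mu.Lock()
		s.gauges[line]++
		s.mu.Unlock()
	}
}

func main() {
	s := &statsd{in: make(chan string, 10000), gauges: map[string]float64{}}
	go s.parser()
	s.in <- "my.gauge"
	s.Gather(func(name string, value float64) { /* forward to the accumulator */ })
}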
Additional info:
2017-06-13T01:28:00Z E! Error in plugin [inputs.statsd]: took longer to collect than collection interval (1m0s)
2017-06-13T01:28:16Z E! Error: statsd message queue full. We have dropped 10000 messages so far. You may want to increase allowed_pending_messages in the config
Proposal:
Remove mutex locking in favor of dedicated per-metric-type goroutines and channels.
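As a rough illustration of what this could look like (hypothetical code, not an actual patch; names such as gauge and gaugeWorker are invented), each metric type gets a goroutine that owns its cache and exchanges updates and snapshot requests over channels, so the collection path never shares a mutex with the parser:

package main

import "fmt"

type gauge struct {
	name  string
	value float64
}

// gaugeWorker owns the gauge cache for one metric type. Updates arrive on
// updates; a snapshot request arrives on snapshots and is answered with a
// copy of the cache, so no shared mutex is needed.
func gaugeWorker(updates <-chan gauge, snapshots <-chan chan map[string]float64) {
	cache := map[string]float64{}
	for {
		select {
		case g, ok := <-updates:
			if !ok {
				return
			}
			cache[g.name] = g.value
		case reply := <-snapshots:
			snap := make(map[string]float64, len(cache))
			for k, v := range cache {
				snap[k] = v
			}
			reply <- snap // the caller gets a copy; the worker keeps ownership
		}
	}
}

func main() {
	updates := make(chan gauge, 1000) // parser side: buffered, never shares a lock
	snapshots := make(chan chan map[string]float64)
	go gaugeWorker(updates, snapshots)

	// Parser side: push an update without taking any lock.
	updates <- gauge{name: "cpu.load", value: 0.42}

	// Gather side: request a snapshot; even if the downstream output is slow,
	// the worker is only briefly occupied answering this one request.
	reply := make(chan map[string]float64)
	snapshots <- reply
	fmt.Println("snapshot:", <-reply)

	close(updates)
}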