telegraf needs better support for partial failure/success #1446
Comments
I like the idea of having a function on the accumulator. PRs are welcome.
Have a (very lightly tested) implementation at: master...phemmer:add_error

I'm not sure about it though. It's a little messy, because the places where accumulator channels (shutdown, ticker, metrics) are created and read, and where input errors are handled, are scattered around and aren't implemented the same way between service inputs, normal inputs, and agent.Test. This also maintains support for the existing …
The latter seems better, to avoid confusion about what error a plugin should return from Gather if it has already used AddError(). I will help update plugins as necessary.
@sparrc Does influxdata/influxdb#6976 affect your decision any? If influxdb supports a logger with multiple log levels per-subsystem, should telegraf also support such a thing to be consistent between projects? Would this replace the need for …
@phemmer I think one way you could clean up your code would be to make the errorC channel internal to the accumulator. That way you don't need to change the call signature of NewAccumulator. You can simply spin off a goroutine when NewAccumulator is called, which monitors an internal …

Also, two things to note are that (1) Gather() of service inputs is always a dummy call, errors are handled internally to each plugin, and (2) don't worry at all about the Test() function; that has no real utility, so it can just be made to conform to whatever makes sense for the real gathering routines.

I would support replacing the interface to return errors from …

I'm not sure each plugin should have its own logger, and I'm also not sure if this change would prevent us from doing that in the future anyways.
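(Editor's note: a minimal sketch of what that suggestion could look like. The type, field, and function names here are illustrative assumptions, not the actual telegraf code.)

```go
package agent

import "log"

// accumulator is an illustrative stand-in for telegraf's accumulator type.
type accumulator struct {
	errorC chan error // internal error channel, never exposed to callers
}

// NewAccumulator keeps its original call signature; the error channel is
// created internally and drained by a goroutine started here.
func NewAccumulator() *accumulator {
	a := &accumulator{errorC: make(chan error, 100)}
	go func() {
		for err := range a.errorC {
			log.Printf("ERROR in input plugin: %v", err)
		}
	}()
	return a
}

// AddError reports a non-fatal error from a plugin without changing
// the Gather() return path.
func (a *accumulator) AddError(err error) {
	if err == nil {
		return
	}
	a.errorC <- err
}
```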
Wouldn't prevent it, no, but it would make …

I'll formalize the code and put up a PR (with …
Just adding another note, more as a reminder for myself: the jolokia plugin will need to be adjusted to remove the …

Which is very confusing, as nothing in the message indicates that it came from jolokia.
@phemmer if you are working on this code, please submit it in stages. I.e., just submit the addition of the AddError function before changing every plugin please :)
re: changing the …

I'm not sure we should change this interface. I think it's OK to leave it as-is and document that returning an error will be treated as a "fatal" error, exiting telegraf. Otherwise users should be encouraged to use acc.AddError. I think this will be easy to do with documentation and examples.

The reason behind this is that there needs to be a way to indicate "fatal" errors, and returning them to the accumulator doesn't seem like the right way to do it. To me they should get returned directly to the "agent", which is calling the Gather() function.
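(Editor's note: a rough illustration of that convention, using a hypothetical plugin and assuming the proposed AddError method exists on telegraf.Accumulator: a misconfiguration is returned from Gather() as a fatal error, while per-server failures go through acc.AddError() and gathering continues.)

```go
package example

import (
	"fmt"

	"github.com/influxdata/telegraf"
)

// Example is a hypothetical input plugin used only to illustrate the convention.
type Example struct {
	Servers []string
}

// Gather returns an error only for a fatal problem; per-server failures are
// reported through acc.AddError() so the rest of the gather can continue.
func (e *Example) Gather(acc telegraf.Accumulator) error {
	if len(e.Servers) == 0 {
		// Fatal: misconfiguration, return it to the agent.
		return fmt.Errorf("no servers configured")
	}
	for _, server := range e.Servers {
		fields, err := e.fetchStats(server)
		if err != nil {
			// Partial failure: record it and keep gathering from the other servers.
			acc.AddError(fmt.Errorf("error gathering from %s: %s", server, err))
			continue
		}
		acc.AddFields("example", fields, map[string]string{"server": server})
	}
	return nil
}

// fetchStats stands in for whatever per-server collection the plugin would do.
func (e *Example) fetchStats(server string) (map[string]interface{}, error) {
	return map[string]interface{}{"up": 1}, nil
}
```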
I don't think I'm fond of this idea. If just one plugin goes bad, this would stop all plugins, not just the one. What about reinitializing that plugin instead?
If a plugin is misconfigured or experiences a fatal error, I think telegraf should fail and exit. Not sure what you mean by "going bad", but if telegraf is being run as a service it will get reloaded anyways.
If a plugin is misconfigured, telegraf shouldn't even start. A separate mechanism should be provided for a config check (#1438). Having …
Your idea sounds like you want telegraf to exit if a plugin goes into an unrecoverable state.
Not all service managers automatically restart telegraf if it dies.
Seems to me like it would be difficult to anticipate all runtime errors ahead of time. "Reloading" a plugin sounds like a simple concept but is actually quite difficult and dangerous. What if the plugin has spun off hung processes? What if it has unclosable connections still open? The only way to guarantee that some of these get cleaned up is by restarting the process.

This is one of the drawbacks of a "monolith" style of architecture, which Telegraf certainly is. We are not spinning up plugins as independent processes; they are part of the same process, and thus fatal errors can happen, and it's impossible to completely isolate and protect one plugin from another.
And by the way, I'm not advocating for a change in behavior at the moment; currently we do not exit on fatal errors and we can continue with the same behavior.
A restart won't fix anything. The processes will still be hung.
There's no such thing as an unclosable connection. A connection can always be closed. It might go into a TIME_WAIT state, but this only lasts a little while.
I just don't like the idea that one part of my monitoring system failing can cause the whole thing to fail. I would rather have some metrics than no metrics.
Restarting the process will kill child processes, so that would clean up the hung processes.

From a kernel perspective there are no closable connections. From a programming perspective there are many client libraries, and even Go idioms, which don't provide proper Close() methods.

The benefit of failing is that users cannot complain about "bad" or "wrong" telegraf behavior. If the process is failing, then the user obviously has the responsibility to fix the error. If the process is not failing, and is just logging or internally trying to clean itself up, then it's telegraf's fault for not fixing itself. Mainly I worry that it gets into an "unwinnable" state where it can't really do anything to fix itself.
No, it won't. Restarting will abandon the child processes.
This doesn't make any sense. Can you clarify what you mean?
EBADF - obviously already closed
I disagree. If a telegraf plugin has some sort of issue where it continuously fails, preventing telegraf from staying running, I would describe this as bad behavior.
By "failing" I assume you mean exiting with an error code. And if so, I would argue that it's the user's responsibility to restart telegraf, not telegraf's responsibility to restart itself.
I don't think one can argue that it's ever a process's responsibility to fix itself. Self-heal mechanisms are next to unheard of. When applications get into a bad state, but are still partially operational, they log it. The admin then has to come along and fix it. I think that trying to self-heal will cause more problems than it solves.
Sorry, I meant to say "from a kernel perspective, there are no _un_closable connections".
That's my entire point: if it's failing, it's a bug that should be fixed or a user error. Why do you think it should cover up a bug? Telegraf should fail if it's failing.
I'm not advocating for telegraf to exit on every failure; I'm advocating for telegraf to exit on fatal errors.
CORRECT! That is what AddError does. There are still fatal errors that can exist upon initialization. There are also panics which can be recovered from properly and then exited upon.
Obviously neither of us is going to back down from our viewpoint. So how about an option to control the behavior? As long as I have the ability to disable it, I'll be happy.
Please don't. I'm running telegraf on FreeBSD, which is not very well supported. The ZFS zpool plugin, for example: …

You can't possibly catch all those cases in development on all the different OS versions I have.
Correct, I am talking about fatal errors... In some ways we are arguing for the same thing.

I am saying that plugins should be encouraged to use …

But as @kaey said, stopping …

So returning …
@sparrc So what's the resolution? Return err on fatal error? Should we start updating plugins or wait until after 1.0 is released?
wait until after 1.0 |
Feature Request
Proposal:
The telegraf code needs a better way of supporting plugins with partial success/failure.
Current behavior:
Currently, partial success/failure is handled, or not handled, in numerous ways, many of which are very buggy. I started going through the plugins looking for issues (see #1440, #1441, #1442, #1443, #1444, #1445), but stopped because of how prevalent they are.
There seem to be 4 general categories into which plugins fall when they have the possibility of partial success/failure.
Desired behavior:
We instead need a simpler way for plugins to report multiple errors back to the telegraf core.
I think the simplest way might be to just add an `AddError()` method to `telegraf.Accumulator`.

The other option might be to expose a general purpose logger on the accumulator. This might be better, as then plugins could log things at different log levels (error, warning, info, debug, etc), and when a user wants to troubleshoot an issue, they can just bump up the log level.
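(Editor's note: for concreteness, a sketch of what the proposed interface addition could look like. The existing Accumulator methods are elided except one representative, and the exact signatures are assumptions rather than the final design.)

```go
package telegraf

import "time"

// Accumulator approximates the proposed change: only AddError is new here.
type Accumulator interface {
	// AddFields stands in for the existing, unchanged accumulator methods.
	AddFields(measurement string, fields map[string]interface{}, tags map[string]string, t ...time.Time)

	// AddError reports a non-fatal error encountered during gathering, so a
	// plugin can record several partial failures and still return nil from Gather().
	AddError(err error)
}
```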
Use case:
Make it easier (and more consistent) for plugin developers to handle errors, and less likely for them to introduce buggy code.