docs(specs): Add `probe` as value to `startup_error_behavior` #16052
Merged: DStrand1 merged 17 commits into influxdata:master from LandonTClipp:LandonTClipp/probe_spec on Dec 11, 2024
Changes from 15 commits
Commits
c19a5a4 docs: Add `probe` as value to `startup_error_behavior` (LandonTClipp)
4b4d955 Update docs/specs/tsd-008-probe-on-startup.md (LandonTClipp)
d42b9d5 Update docs/specs/tsd-008-probe-on-startup.md (LandonTClipp)
096615a Address PR comments (LandonTClipp)
938ff72 Add `probe` to the `startup-error-behavior` spec. (LandonTClipp)
6b45aba Add link to spec (LandonTClipp)
290aaf4 Add whitespace to limit lines to 80 columns. (LandonTClipp)
fb226eb Update docs/specs/tsd-006-startup-error-behavior.md (LandonTClipp)
d245b17 Update docs/specs/tsd-008-probe-on-startup.md (LandonTClipp)
50e34d2 Update docs/specs/tsd-008-probe-on-startup.md (LandonTClipp)
7d5ce02 Update docs/specs/tsd-008-probe-on-startup.md (LandonTClipp)
23090cc Update docs/specs/tsd-008-probe-on-startup.md (LandonTClipp)
d8b6c46 Remove duplicated config section (LandonTClipp)
3c50a74 fix typos (LandonTClipp)
3025e68 Rename to tsd-009 (LandonTClipp)
65eb8ec remove trailing whitespaces (LandonTClipp)
019c480 more linting fixes (LandonTClipp)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,69 @@ | ||
# Probing plugins after startup | ||
|
||
## Objective | ||
|
||
Allow Telegraf to probe plugins during startup to enable enhanced plugin error | ||
detection, such as checking the availability of hardware or services. | ||
|
||
## Keywords | ||
|
||
inputs, outputs, startup, probe, error, ignore, behavior | ||
|
||
## Overview | ||
|
||
When plugins are first instantiated, Telegraf will call the plugin's `Start()` | ||
method (for inputs) or `Connect()` (for outputs), which will initialize its | ||
configuration based on config options and the running environment. It is | ||
sometimes the case that while the initialization step succeeds, the upstream | ||
service on which the plugin relies is not actually running, or cannot be | ||
reached due to incorrect configuration or environmental | ||
problems. In situations like this, Telegraf does not detect that the plugin's | ||
upstream service is not functioning properly, and thus it will continually call | ||
the plugin during each `Gather()` iteration. This often has the effect of | ||
polluting journald and system logs with voluminous error messages, which creates | ||
issues for system administrators who rely on such logs to identify other | ||
unrelated system problems. | ||
|
||
More background discussion on this option, including other possible avenues, can | ||
be found in [issue #16028](https://github.com/influxdata/telegraf/issues/16028). | ||
|
||
## Probing | ||
|
||
Probing is an action whereby the plugin should ensure, on a best-effort basis, | ||
that it will be fully functional. This may comprise communicating with its | ||
external service, trying to access required devices, entities or executables, | ||
etc., to ensure that the plugin will not produce errors during e.g. data | ||
collection or data output. Probing must *not* produce, process or output any | ||
metrics. | ||
|
||
Plugins that support probing must implement the `ProbePlugin` interface. Such | ||
plugins must behave in the following manner: | ||
|
||
1. Return an error if the external dependencies (hardware, services, | ||
executables, etc.) of the plugin are not available. | ||
2. Return an error if information cannot be gathered (in the case of inputs) or | ||
sent (in the case of outputs) due to unrecoverable issues such as invalid | ||
authentication, missing permissions, or non-existent endpoints. | ||
3. Otherwise, return `nil` indicating the plugin will be fully functional. | ||
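
The three rules above can be sketched in Go roughly as follows. The spec text shown here names the `ProbePlugin` interface but not its method set, so the single `Probe() error` method is an assumption, and `deviceInput` is a hypothetical plugin invented purely for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// ProbePlugin mirrors the interface named in the spec. The single
// Probe() error method is an assumption made for this sketch.
type ProbePlugin interface {
	Probe() error
}

// deviceInput is a hypothetical input plugin that reads from a device file.
type deviceInput struct {
	Path string
}

// Probe reports whether the plugin's external dependency is usable.
// It only inspects the path and never produces metrics.
func (d *deviceInput) Probe() error {
	_, err := os.Stat(d.Path)
	switch {
	case errors.Is(err, os.ErrNotExist):
		// Rule 1: the external dependency is not available.
		return fmt.Errorf("device %q not available: %w", d.Path, err)
	case errors.Is(err, os.ErrPermission):
		// Rule 2: an unrecoverable issue such as missing permissions.
		return fmt.Errorf("device %q not accessible: %w", d.Path, err)
	case err != nil:
		return fmt.Errorf("probing %q failed: %w", d.Path, err)
	}
	// Rule 3: nothing is known to be wrong, so report success.
	return nil
}

func main() {
	plugins := []ProbePlugin{
		&deviceInput{Path: "."},
		&deviceInput{Path: "/nonexistent/telegraf-probe-example"},
	}
	for _, p := range plugins {
		if err := p.Probe(); err != nil {
			fmt.Println("probe failed:", err)
			continue
		}
		fmt.Println("probe succeeded")
	}
}
```

A real plugin would typically go further than a `Stat` call, for example by opening the device or contacting its service, but the error-or-`nil` contract stays the same.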
|
||
## Plugin Requirements | ||
|
||
Plugins that allow probing must implement the `ProbePlugin` interface. The | ||
exact implementation depends on the plugin's functionality and requirements, | ||
but generally it should take the same actions as it would during normal | ||
operation, e.g. calling `Gather()` or `Write()`, and check whether errors | ||
occur. If probing fails, it must be safe to call the plugin's `Close()` method. | ||
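
As a rough illustration of the "safe to call `Close()` after a failed probe" requirement, the sketch below shows a caller that probes a plugin and closes it on failure. Everything here is hypothetical: `probeCloser`, `fakeOutput` and `probeOrClose` are illustrative names, the `Probe() error` and `Close() error` signatures are assumptions, and what Telegraf itself does with a failed probe is defined outside this section.

```go
package main

import (
	"errors"
	"fmt"
)

// probeCloser combines the assumed Probe() error method with a Close()
// method. Both signatures are assumptions for this sketch.
type probeCloser interface {
	Probe() error
	Close() error
}

// fakeOutput is a hypothetical output plugin with a simulated service.
type fakeOutput struct {
	reachable bool // simulates whether the service answers
	closed    bool // records that Close was called
}

// Probe performs the check a write would depend on, without sending metrics.
func (o *fakeOutput) Probe() error {
	if !o.reachable {
		return errors.New("service not reachable")
	}
	return nil
}

// Close tolerates a plugin that never connected, so it is safe to call
// after a failed probe.
func (o *fakeOutput) Close() error {
	o.closed = true
	return nil
}

// probeOrClose probes the plugin and closes it when probing fails, so the
// caller can drop the plugin without leaking resources.
func probeOrClose(p probeCloser) error {
	if err := p.Probe(); err != nil {
		if cerr := p.Close(); cerr != nil {
			return fmt.Errorf("probe failed: %w (close also failed: %v)", err, cerr)
		}
		return fmt.Errorf("probe failed: %w", err)
	}
	return nil
}

func main() {
	for _, out := range []*fakeOutput{{reachable: true}, {reachable: false}} {
		if err := probeOrClose(out); err != nil {
			fmt.Println("skipping plugin:", err)
			continue
		}
		fmt.Println("plugin ready")
	}
}
```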
|
||
Input plugins must *not* produce metrics; output plugins must *not* send any | ||
metrics to the service. Plugins must *not* influence later data processing or | ||
collection by modifying the internal state of the plugin or the external state | ||
of the service or hardware. For example, file offsets or other service states | ||
must be reset so that no data is lost during the first gather or write cycle. | ||
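
The file-offset example can be sketched as below: the probe performs a test read and then restores the offset, so the first real read still starts where it would have without probing. `fileInput` and `newTempInput` are hypothetical helpers written for this illustration, and the `Probe() error` signature is again an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"os"
)

// fileInput is a hypothetical input plugin reading from an open file.
type fileInput struct {
	file *os.File
}

// Probe checks that the file is readable and then restores the read
// offset so that the probe does not consume any data.
func (f *fileInput) Probe() error {
	offset, err := f.file.Seek(0, io.SeekCurrent)
	if err != nil {
		return fmt.Errorf("reading current offset: %w", err)
	}
	_, readErr := f.file.Read(make([]byte, 1))
	// Restore the offset regardless of the outcome of the test read.
	if _, err := f.file.Seek(offset, io.SeekStart); err != nil {
		return fmt.Errorf("restoring offset: %w", err)
	}
	// An empty file is readable, so EOF does not count as a failure.
	if readErr != nil && !errors.Is(readErr, io.EOF) {
		return fmt.Errorf("file not readable: %w", readErr)
	}
	return nil
}

// newTempInput creates a temporary file holding content, rewinds it and
// returns the plugin together with a cleanup function.
func newTempInput(content string) (*fileInput, func(), error) {
	tmp, err := os.CreateTemp("", "probe-example-*")
	if err != nil {
		return nil, nil, err
	}
	cleanup := func() {
		tmp.Close()
		os.Remove(tmp.Name())
	}
	if _, err := tmp.WriteString(content); err != nil {
		cleanup()
		return nil, nil, err
	}
	if _, err := tmp.Seek(0, io.SeekStart); err != nil {
		cleanup()
		return nil, nil, err
	}
	return &fileInput{file: tmp}, cleanup, nil
}

func main() {
	in, cleanup, err := newTempInput("first line\n")
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	defer cleanup()

	if err := in.Probe(); err != nil {
		fmt.Println("probe failed:", err)
		return
	}
	// The first real read still sees the data from the beginning.
	data, err := io.ReadAll(in.file)
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	fmt.Printf("first read after probe: %q\n", data)
}
```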
|
||
Plugins must return `nil` upon successful probing or an error otherwise. | ||
|
||
## Related Issues | ||
|
||
- [#16028](https://github.com/influxdata/telegraf/issues/16028) | ||
- [#15916](https://github.com/influxdata/telegraf/pull/15916) | ||
- [#16001](https://github.com/influxdata/telegraf/pull/16001) | ||
|
It is not clear what happens if probing does not return an error. As written, it reads as though probing only needs to be done when startup returns errors.
@Hipska perhaps one more paragraph added here: https://github.com/influxdata/telegraf/pull/16052/files#diff-2d519c82d1022b0befbd7601817fd1b2073ad5e9b607cd9c2205f0529b7d0fffR47 more clearly outlining what Telegraf does on Probe success is what you're looking for? I'm happy to add more clarification.
Yeah indeed.
Also, is probing really only done on startup? Meaning that if conditions change and resources become available, that will only take effect when Telegraf gets restarted?
That's correct. Sorry for the delayed response on the update to the spec, I will get to it soon.
This spec is backwards compatible so if you don't want this behavior, you don't need to do anything.
As said in the other comment, maybe the spec should clarify a use case? When would it be handy to only be able to recover by restarting Telegraf completely?