docs(specs): Add probe as value to startup_error_behavior #16052

Merged
merged 17 commits into from
Dec 11, 2024
13 changes: 13 additions & 0 deletions docs/specs/tsd-006-startup-error-behavior.md
@@ -75,6 +75,19 @@ must *not* fail on startup errors and should continue running. On startup error,
Telegraf must ignore the plugin as if it was not configured at all, i.e. the
plugin must be completely removed from processing.

### `probe` behavior

When using the `probe` setting for the `startup_error_behavior` option, Telegraf
must *not* fail on startup errors and should continue running. On startup error,
Telegraf must ignore the plugin as if it was not configured at all, i.e. the
plugin must be completely removed from processing, similar to the `ignore`
behavior. Additionally, Telegraf must probe the plugin (as defined in
[TSD-009][tsd_009]) after startup if it implements the `ProbePlugin` interface.
If probing is available *and* returns an error, Telegraf must *ignore* the
plugin as if it was not configured at all.
Comment on lines +86 to +87
Contributor

It is not clear what happens if probing does not return an error. As written, it reads as if probing only needs to be done when startup returns errors.

Contributor Author

@Hipska perhaps one more paragraph added here: https://github.com/influxdata/telegraf/pull/16052/files#diff-2d519c82d1022b0befbd7601817fd1b2073ad5e9b607cd9c2205f0529b7d0fffR47 more clearly outlining what Telegraf does on Probe success is what you're looking for? I'm happy to add more clarification.

Contributor

Yeah indeed.

Also, is probing really only on startup? Meaning if conditions change and resources become available, that will only take effect when Telegraf gets restarted?

Contributor Author

That's correct. Sorry for the delayed response on the update to the spec, I will get to it soon.

This spec is backwards compatible so if you don't want this behavior, you don't need to do anything.

Contributor

As said in the other comment, maybe the spec should clarify a use case? When would it be handy to only be able to recover by restarting Telegraf completely?


[tsd_009]: /docs/specs/tsd-009-probe-on-startup.md

## Plugin Requirements

Plugins participating in handling startup errors must implement the `Start()`
69 changes: 69 additions & 0 deletions docs/specs/tsd-009-probe-on-startup.md
@@ -0,0 +1,69 @@
# Probing plugins after startup

## Objective

Allow Telegraf to probe plugins during startup to enable enhanced plugin error
detection, such as the availability of hardware or services.

## Keywords

inputs, outputs, startup, probe, error, ignore, behavior

## Overview

When plugins are first instantiated, Telegraf will call the plugin's `Start()`
method (for inputs) or `Connect()` (for outputs), which will initialize its
configuration based on config options and the running environment. It is
sometimes the case that while the initialization step succeeds, the upstream
service on which the plugin relies is not actually running, or cannot be
reached due to incorrect configuration or environmental
problems. In situations like this, Telegraf does not detect that the plugin's
upstream service is not functioning properly, and thus it will continually call
the plugin during each `Gather()` iteration. This often has the effect of
polluting journald and system logs with voluminous error messages, which creates
issues for system administrators who rely on such logs to identify other
unrelated system problems.

More background discussion on this option, including other possible avenues, can
be viewed [here](https://github.com/influxdata/telegraf/issues/16028).

## Probing

Probing is an action whereby the plugin checks, on a best-effort basis, that it
will be fully functional. This may comprise communicating with
its external service, trying to access required devices, entities or executables,
etc., to ensure that the plugin will not produce errors during e.g. data collection
or data output. Probing must *not* produce, process or output any metrics.

Plugins that support probing must implement the `ProbePlugin` interface. Such
plugins must behave in the following manner:

1. Return an error if the external dependencies (hardware, services,
executables, etc.) of the plugin are not available.
2. Return an error if information cannot be gathered (in the case of inputs) or
sent (in the case of outputs) due to unrecoverable issues. For example, invalid
authentication, missing permissions, or non-existent endpoints.
3. Otherwise, return `nil` indicating the plugin will be fully functional.
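
The three rules above can be illustrated with a minimal Go sketch. It assumes that `ProbePlugin` consists of a single `Probe() error` method, which this spec names but does not spell out. The `deviceInput` type, its `Path` field, and the `runDeviceDemo` helper are hypothetical and exist only for illustration.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// ProbePlugin is assumed here to be a single-method interface;
// the spec names the interface but does not show its definition.
type ProbePlugin interface {
	Probe() error
}

// deviceInput is a hypothetical input plugin that reads from a device file.
type deviceInput struct {
	Path string
}

// Probe checks that the device can be opened for reading.
// It produces no metrics and returns nil only if the device is usable.
func (d *deviceInput) Probe() error {
	f, err := os.Open(d.Path)
	if err != nil {
		return fmt.Errorf("device %q is not available: %w", d.Path, err)
	}
	return f.Close()
}

// Compile-time check that deviceInput satisfies the assumed interface.
var _ ProbePlugin = (*deviceInput)(nil)

// runDeviceDemo probes one existing and one missing path in a temp directory.
func runDeviceDemo() (present, missing error) {
	dir, err := os.MkdirTemp("", "probe-demo")
	if err != nil {
		return err, err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "device")
	if err := os.WriteFile(path, []byte("ok"), 0o600); err != nil {
		return err, err
	}
	present = (&deviceInput{Path: path}).Probe()
	missing = (&deviceInput{Path: filepath.Join(dir, "absent")}).Probe()
	return present, missing
}

func main() {
	present, missing := runDeviceDemo()
	fmt.Println("existing device:", present)
	fmt.Println("missing device:", missing)
}
```

Wrapping the underlying error with `%w` keeps the original cause available to callers that want to inspect it.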

## Plugin Requirements

Plugins that allow probing must implement the `ProbePlugin` interface. The
exact implementation depends on the plugin's functionality and requirements,
but generally it should take the same actions as it would during normal operation,
e.g. calling `Gather()` or `Write()`, and check whether errors occur. If probing fails,
it must be safe to call the plugin's `Close()` method.

Input plugins must *not* produce metrics, and output plugins must *not* send any
metrics to the service. Plugins must *not* influence the later data processing or
collection by modifying the internal state of the plugin or the external state of the
service or hardware. For example, file offsets or other service states must be
reset so that no data is lost during the first gather or write cycle.
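
As a rough illustration of the file-offset example, the following Go sketch probes a file by reading one byte and then restores the previous offset, so the first real read still sees all data. The `tailInput` type and the `runOffsetDemo` helper are hypothetical, and the `Probe() error` signature is an assumption rather than something this spec defines.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"os"
)

// tailInput is a hypothetical input plugin that reads from an open file.
type tailInput struct {
	file *os.File
}

// Probe verifies the file is readable, then restores the read offset so
// that probing does not change what the first gather cycle will see.
func (t *tailInput) Probe() error {
	// Remember where the next regular read would start.
	offset, err := t.file.Seek(0, io.SeekCurrent)
	if err != nil {
		return fmt.Errorf("querying offset: %w", err)
	}

	// Try to read a single byte; an empty file (EOF) is not a failure.
	buf := make([]byte, 1)
	_, readErr := t.file.Read(buf)

	// Always restore the offset, even if the read failed.
	if _, err := t.file.Seek(offset, io.SeekStart); err != nil {
		return fmt.Errorf("restoring offset: %w", err)
	}
	if readErr != nil && !errors.Is(readErr, io.EOF) {
		return fmt.Errorf("reading file: %w", readErr)
	}
	return nil
}

// runOffsetDemo probes a temp file and returns what a later read sees.
func runOffsetDemo() (string, error) {
	f, err := os.CreateTemp("", "probe-offset")
	if err != nil {
		return "", err
	}
	defer os.Remove(f.Name())
	defer f.Close()

	if _, err := f.WriteString("first line\n"); err != nil {
		return "", err
	}
	if _, err := f.Seek(0, io.SeekStart); err != nil {
		return "", err
	}

	plugin := &tailInput{file: f}
	if err := plugin.Probe(); err != nil {
		return "", err
	}
	data, err := io.ReadAll(f)
	return string(data), err
}

func main() {
	data, err := runOffsetDemo()
	fmt.Printf("read after probe: %q (err: %v)\n", data, err)
}
```

Restoring the offset before inspecting the read error means a failed probe also leaves the file position untouched.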

Plugins must return `nil` upon successful probing or an error otherwise.
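
How a caller might act on that return value under `startup_error_behavior = "probe"` can be sketched as follows. This is not Telegraf's implementation: the `startable` interface, the `fakePlugin` and `noProbePlugin` types, and the `keepPlugin` function are hypothetical, and the `Start()` signature is simplified. The sketch also assumes that a plugin which starts and probes without error, or which does not implement `ProbePlugin`, simply keeps running.

```go
package main

import (
	"errors"
	"fmt"
)

// ProbePlugin is assumed to be a single-method interface.
type ProbePlugin interface {
	Probe() error
}

// startable is a hypothetical, simplified stand-in for a plugin's startup step.
type startable interface {
	Start() error
}

// errDemo is a placeholder failure used by the examples below.
var errDemo = errors.New("demo failure")

// fakePlugin supports probing and returns preconfigured errors.
type fakePlugin struct {
	startErr error
	probeErr error
}

func (p *fakePlugin) Start() error { return p.startErr }
func (p *fakePlugin) Probe() error { return p.probeErr }

// noProbePlugin starts successfully but does not implement ProbePlugin.
type noProbePlugin struct{}

func (noProbePlugin) Start() error { return nil }

// keepPlugin reports whether the plugin stays in processing when the
// "probe" behavior is configured.
func keepPlugin(p startable) bool {
	// A startup error removes the plugin, as with the "ignore" behavior.
	if err := p.Start(); err != nil {
		return false
	}
	// Probing only applies to plugins that implement the interface.
	prober, ok := p.(ProbePlugin)
	if !ok {
		return true
	}
	// A probe error also removes the plugin.
	return prober.Probe() == nil
}

func main() {
	fmt.Println("healthy plugin kept:", keepPlugin(&fakePlugin{}))
	fmt.Println("start error kept:", keepPlugin(&fakePlugin{startErr: errDemo}))
	fmt.Println("probe error kept:", keepPlugin(&fakePlugin{probeErr: errDemo}))
	fmt.Println("no probe support kept:", keepPlugin(noProbePlugin{}))
}
```

The type assertion keeps probing optional, so plugins without a `Probe` method are unaffected by the setting.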

## Related Issues

- [#16028](https://github.com/influxdata/telegraf/issues/16028)
- [#15916](https://github.com/influxdata/telegraf/pull/15916)
- [#16001](https://github.com/influxdata/telegraf/pull/16001)
