Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement probe_failure_info Metric #1334

Draft
wants to merge 25 commits into
base: master
Choose a base branch
from

Conversation

slrtbtfs
Copy link

@slrtbtfs slrtbtfs commented Dec 12, 2024

This is a first draft for implementing #1077.

This make information about Probe failures that is already logged is additionally available via the probe_failure_info metric.

The underlying mechanisms are mainly implemented in prober/prober.go and prober/handler.go, the rest of the diff is mostly refactoring.

The main change is adding a new type ProbeResult, which replaces the boolean return type of Probes. For failed Probes this type includes diagnostic information which is then published in the probe_failure_reason metric. If error messages contain information that is quite large or may change very frequently, leading to a high cardinality, its not included in this metric and instead just logged. .

The underlying mechanisms are mainly implemented in prober/prober.go and prober/handler.go, the rest is mostly refactoring. Since this new type is used in almost every Probe function and their tests, the diff got quite big. I hope that doesn't make the review to cumbersome

As a side effect, now most tests not only check whether a probe succeeded, but also whether failed tests failed with the correct error message. Having the error message in the return type also allowed to remove some more complex testing code that used to extract this information from the prober logs.

Before merging, this should still get some improved tests and polishing, however I'd like to get some early feedback about whether you agree with the implementation approach in general.

There is still space for improvement as some error messages provide only limited detail (e.g. the generic HTTP request failed one). IMO this makes most sense as a follow up once the basic implementation of this metric is merged.

Signed-off-by: Tobias Guggenmos <guggenmos@dfn-cert.de>
Signed-off-by: Tobias Guggenmos <guggenmos@dfn-cert.de>
Signed-off-by: Tobias Guggenmos <guggenmos@dfn-cert.de>
Signed-off-by: Tobias Guggenmos <guggenmos@dfn-cert.de>
Signed-off-by: Tobias Guggenmos <guggenmos@dfn-cert.de>
Signed-off-by: Tobias Guggenmos <guggenmos@dfn-cert.de>
Signed-off-by: Tobias Guggenmos <guggenmos@dfn-cert.de>
Signed-off-by: Tobias Guggenmos <guggenmos@dfn-cert.de>
Signed-off-by: Tobias Guggenmos <guggenmos@dfn-cert.de>
Signed-off-by: Tobias Guggenmos <guggenmos@dfn-cert.de>
Signed-off-by: Tobias Guggenmos <guggenmos@dfn-cert.de>
Signed-off-by: Tobias Guggenmos <guggenmos@dfn-cert.de>
Signed-off-by: Tobias Guggenmos <guggenmos@dfn-cert.de>
Signed-off-by: Tobias Guggenmos <guggenmos@dfn-cert.de>
Signed-off-by: Tobias Guggenmos <guggenmos@dfn-cert.de>
Signed-off-by: Tobias Guggenmos <guggenmos@dfn-cert.de>
Signed-off-by: Tobias Guggenmos <guggenmos@dfn-cert.de>
Signed-off-by: Tobias Guggenmos <guggenmos@dfn-cert.de>
Signed-off-by: Tobias Guggenmos <guggenmos@dfn-cert.de>
Signed-off-by: Tobias Guggenmos <guggenmos@dfn-cert.de>
Signed-off-by: Tobias Guggenmos <guggenmos@dfn-cert.de>
Signed-off-by: Tobias Guggenmos <guggenmos@dfn-cert.de>
Signed-off-by: Tobias Guggenmos <guggenmos@dfn-cert.de>
Signed-off-by: Tobias Guggenmos <guggenmos@dfn-cert.de>
Signed-off-by: Tobias Guggenmos <guggenmos@dfn-cert.de>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant