Treat error response payloads from Slack as errors #3121
Conversation
@simonpasquier are you the best person to review this?
Just rebased to make this mergeable. @gotjosh, @simonpasquier – any thoughts on this?
Is there anything I can do to get this reviewed, folks? cc @gotjosh, @simonpasquier
Is there any way that I can reproduce the issue locally?
Force-pushed from bc12cb2 to d5dfcee (compare).
Sure, I found this while using https://slack.com/api/chat.postMessage as the API URL.
This is something that the Grafana equivalent does much better; can we borrow the idea from what we do there instead? https://github.com/grafana/alerting/blob/main/receivers/slack/slack.go#L191-L207
@gotjosh the grafana/alerting code is AGPL, right?
This situation happened a couple of months ago and I consulted internally on the steps necessary to allow that. The result is #2845 (comment) as an example. I'll use similar wording as I approve the PR.
Force-pushed from 2a10762 to 9ddc399 (compare).
Sure! Given alertmanager's use of common … Also expanded the test to cover: …
I believe functionally this should now align with what Grafana are doing. PTAL?
Force-pushed from a465be9 to 7317c32 (compare).
notify/slack/slack.go
Outdated
@@ -210,16 +210,66 @@ func (n *Notifier) Notify(ctx context.Context, as ...*types.Alert) (bool, error)
	if err != nil {
		return true, notify.RedactURL(err)
	}
-	defer notify.Drain(resp)
+	defer resp.Body.Close()
I'm not 100% familiar with notify.Drain, but why would we stop using it here? The call is deferred, so we can still read resp.Body from the function call on line 226.
It was no longer necessary to always drain & close, so I changed it to use standard Go libraries to make it easier to understand. I agree it does not matter, so I returned notify.Drain if you prefer that.
Thank you!
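For readers unfamiliar with the pattern discussed above, here is a minimal, self-contained sketch. drainAndClose is a hypothetical stand-in for Alertmanager's notify.Drain, which (as I understand it) discards and closes the response body; the point of the thread is that because the call is deferred, it only runs after the body has been read, so either form still allows reading resp.Body:

package main

import (
	"fmt"
	"io"
	"net/http"
)

// drainAndClose is a hypothetical stand-in for notify.Drain: it discards any
// unread body bytes (so the underlying connection can be reused) and closes it.
func drainAndClose(resp *http.Response) {
	io.Copy(io.Discard, resp.Body)
	resp.Body.Close()
}

func fetch(url string) ([]byte, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	// Deferred: runs only after io.ReadAll below has consumed the body.
	defer drainAndClose(resp)

	return io.ReadAll(resp.Body)
}

func main() {
	body, err := fetch("https://example.com")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("read %d bytes\n", len(body))
}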
// (https://api.slack.com/messaging/webhooks#handling_errors)
func checkTextResponseError(body []byte) (bool, error) {
	if !bytes.Equal(body, []byte("ok")) {
		return false, fmt.Errorf("received an error response from Slack: %s", string(body))
This should be true, shouldn't it? Don't we want to retry failed webhooks?
Suggested change:
-	return false, fmt.Errorf("received an error response from Slack: %s", string(body))
+	return true, fmt.Errorf("received an error response from Slack: %s", string(body))
Oh I see, in Grafana we don't retry these errors.
This is the case when we got a 200 back with an error message. I suspect in most cases it will not indicate an infrastructure error (or an otherwise transient issue) and retrying it will not yield a different result.
The function comment here links to Slack documentation that describes some of those errors, and none of them seem like something where retrying would help.
Yes I agree! Sorry for the original confusing comment!
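Putting this thread together, a plausible completed form of the plain-text check is sketched below. The closing return false, nil and the imports are assumptions (the diff excerpt above is truncated); the first return value means "retry", and per the discussion these 200-with-error-body responses are not retried:

import (
	"bytes"
	"fmt"
)

// checkTextResponseError handles Slack incoming-webhook responses, which are
// plain text: the literal body "ok" on success, an error string otherwise
// (https://api.slack.com/messaging/webhooks#handling_errors).
// The bool reports whether the notification should be retried.
func checkTextResponseError(body []byte) (bool, error) {
	if !bytes.Equal(body, []byte("ok")) {
		// Not retried: Slack accepted the request but rejected its content.
		return false, fmt.Errorf("received an error response from Slack: %s", string(body))
	}
	return false, nil
}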
notify/slack/slack.go
Outdated
	var data response
	if err := json.Unmarshal(body, &data); err != nil {
		return false, errors.Wrapf(err, "could not unmarshal JSON response %q", string(body))
Same question here. Earlier we retried if the body could not be read:
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return true, errors.Wrap(err, "could not read response body")
	}
Shouldn't we retry if the JSON cannot be unmarshalled too?
Suggested change:
-	return false, errors.Wrapf(err, "could not unmarshal JSON response %q", string(body))
+	return true, errors.Wrapf(err, "could not unmarshal JSON response %q", string(body))
Oh I see, in Grafana we don't retry these errors.
Yes, my original proposal (2bd3d4d) retried unmarshalling errors, but I have been asked here to match Grafana's behaviour.
I am happy either way, I just want alertmanager to not silently ignore errors while delivering notifications.
On the retry side, I think we should keep Alertmanager's behaviour intact (for now); we can discuss in a separate PR whether we should retry or not.
I took a quick look and we don't have a precedent in Alertmanager for these specific kinds of errors. I'm inclined to return true if parsing the response JSON fails, as that's what we also do if reading the response buffer fails.
Changed this to retry on JSON unmarshalling errors.
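For reference, a sketch of the JSON variant as it stands after this thread. The ok/error field names follow Slack's Web API documentation, and the response struct here is an assumption (the merged code may differ); errors.Wrapf presumably comes from github.com/pkg/errors, matching the excerpt above. Unmarshalling failures are retried, payload errors are not:

import (
	"encoding/json"

	"github.com/pkg/errors"
)

// response mirrors the error envelope of Slack's Web API
// (e.g. chat.postMessage): {"ok": false, "error": "..."}.
type response struct {
	OK    bool   `json:"ok"`
	Error string `json:"error"`
}

// checkJSONResponseError returns (retry, err) for a JSON response body.
func checkJSONResponseError(body []byte) (bool, error) {
	var data response
	if err := json.Unmarshal(body, &data); err != nil {
		// Retried, mirroring how an unreadable response body is handled.
		return true, errors.Wrapf(err, "could not unmarshal JSON response %q", string(body))
	}
	if !data.OK {
		// Not retried: Slack processed the request and reported an error.
		return false, errors.Errorf("error response from Slack: %s", data.Error)
	}
	return false, nil
}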
@gotjosh @simonpasquier something I missed is that these changes update Alertmanager to understand error responses from both Slack's incoming webhooks API and the chat.postMessage API. Alertmanager supports both, but the fact that it does so seems coincidental rather than intentional. I think we should accept this change because I suspect Alertmanager users are using this integration for both APIs. However, I want to highlight that adding error handling for chat.postMessage responses makes that support rather more deliberate, so it may be worth reflecting in the documentation.
LGTM once #3121 (comment) has been fixed! Thanks for your contribution! ❤️
@grobinson-grafana would you be OK with updating the documentation for that?
Sure! I just wanted to confirm that's what we want 🙂
I just tested a couple of different scenarios.
PR to update the docs: #3455
LGTM
Thank you very much for your contribution.
As described in the "More error types" section below, Slack API can return errors with a 200 response code:
https://slack.dev/node-slack-sdk/web-api#handle-errors
This change adds parsing of API response to extract error messages.
Signed-off-by: Anton Tolchanov <anton@tailscale.com>
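To make the commit message concrete, here is a minimal, self-contained sketch (the server, payload, and field names are hypothetical examples, not the PR's code). It stands up a fake Slack-like endpoint that answers HTTP 200 while the body reports an error, which is exactly the case that stays invisible unless the payload is inspected:

package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

func main() {
	// A hypothetical Slack-like endpoint: the HTTP status is 200, but the
	// JSON payload says the request failed.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, `{"ok":false,"error":"invalid_blocks"}`)
	}))
	defer srv.Close()

	resp, err := http.Post(srv.URL, "application/json", strings.NewReader(`{"text":"hi"}`))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	var data struct {
		OK    bool   `json:"ok"`
		Error string `json:"error"`
	}
	if err := json.Unmarshal(body, &data); err != nil {
		panic(err)
	}
	// The status alone looks fine; only the payload reveals the failure.
	fmt.Printf("status=%d ok=%v error=%q\n", resp.StatusCode, data.OK, data.Error)
}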