
Treat error response payloads from Slack as errors #3121

Merged: 1 commit into prometheus:main on Aug 10, 2023

Conversation

@knyar (Contributor) commented Oct 28, 2022

As described in the "More error types" section of the page linked below, the Slack API can return errors with a 200 response code:
https://slack.dev/node-slack-sdk/web-api#handle-errors

This change adds parsing of the API response to extract error messages.

Signed-off-by: Anton Tolchanov <anton@tailscale.com>
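
A minimal sketch of the approach (illustrative names, not the exact code merged in this PR): Slack's Web API signals failure in the JSON body via "ok": false even when the HTTP status is 200, while incoming webhooks answer with plaintext, either the literal "ok" or an error string.

package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "strings"
)

// slackResponse mirrors the fields of interest in a Slack Web API reply.
type slackResponse struct {
    OK    bool   `json:"ok"`
    Error string `json:"error"`
}

// checkSlackResponse inspects a body that arrived with HTTP 200 and returns
// (retry, err): retry reports whether the failure looks worth retrying.
func checkSlackResponse(contentType string, body []byte) (bool, error) {
    if strings.HasPrefix(contentType, "application/json") {
        var data slackResponse
        if err := json.Unmarshal(body, &data); err != nil {
            // The body could not be parsed at all; treat it like a read failure and retry.
            return true, fmt.Errorf("could not unmarshal JSON response %q: %w", string(body), err)
        }
        if !data.OK {
            // Slack reported an application-level error; retrying is unlikely to help.
            return false, fmt.Errorf("error response from Slack: %s", data.Error)
        }
        return false, nil
    }
    // Incoming webhooks answer with the literal text "ok" on success.
    if !bytes.Equal(body, []byte("ok")) {
        return false, fmt.Errorf("received an error response from Slack: %s", string(body))
    }
    return false, nil
}

func main() {
    retry, err := checkSlackResponse("application/json", []byte(`{"ok":false,"error":"not_authed"}`))
    fmt.Println(retry, err) // false error response from Slack: not_authed
}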

@knyar knyar marked this pull request as ready for review October 28, 2022 11:38
@knyar (Contributor, Author) commented Oct 28, 2022

@simonpasquier are you the best person to review this?

@gotjosh gotjosh self-requested a review October 28, 2022 18:42
@knyar (Contributor, Author) commented Dec 22, 2022

Just rebased to make this mergeable.

@gotjosh, @simonpasquier – any thoughts on this?

@knyar (Contributor, Author) commented Apr 12, 2023

Is there anything I can do to get this reviewed, folks? cc @gotjosh, @simonpasquier

@simonpasquier (Member) left a comment

is there any way that I can reproduce the issue locally?

Review thread on notify/slack/slack.go (outdated; resolved)
@knyar knyar force-pushed the slack_errors branch 3 times, most recently from bc12cb2 to d5dfcee Compare April 29, 2023 11:44
@knyar (Contributor, Author) commented Apr 29, 2023

is there any way that I can reproduce the issue locally?

Sure, I found this while using https://slack.com/api/chat.postMessage as slack_api_url (and passing a bot auth token via http_config.bearer_token). Notifications work, but errors are not reported, because:

$ curl -w "\ncode: %{http_code}\n" -X POST 'https://slack.com/api/chat.postMessage'
{"ok":false,"error":"not_authed"}
code: 200
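
For reference, a receiver configuration along those lines might look like this (a sketch with placeholder values; api_url is the per-receiver counterpart of the global slack_api_url):

receivers:
  - name: slack
    slack_configs:
      - api_url: 'https://slack.com/api/chat.postMessage'
        channel: '#test-alerts'
        http_config:
          bearer_token: 'xoxb-REPLACE-ME'  # Slack bot token (placeholder)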

@gotjosh (Member) left a comment

This is something that the Grafana equivalent does much better. Can we instead borrow the idea from what we do there? https://github.com/grafana/alerting/blob/main/receivers/slack/slack.go#L191-L207

@simonpasquier (Member) commented

@gotjosh the grafana/alerting code is AGPL, right? I'm not a lawyer, but how can we be sure that it won't create an issue?

@gotjosh (Member) commented May 5, 2023

This situation came up a couple of months ago, and I consulted internally about the steps necessary to allow it.

An example of the result is #2845 (comment).

I'll use similar wording when I approve the PR.

@knyar knyar force-pushed the slack_errors branch 2 times, most recently from 2a10762 to 9ddc399 Compare May 13, 2023 11:24
@knyar (Contributor, Author) commented May 13, 2023

This is something that the Grafana equivalent does much better, can we instead borrow the idea from what we do there?

Sure! Given Alertmanager's use of the common Retrier and ErrorWithReason, it's not particularly straightforward to reuse parts of that code in a meaningful way here, but I have extended this PR to handle plaintext (non-JSON) responses as well. I have not added parsing of string error IDs (like channel_is_archived) because they seem expressive enough.

Also expanded the test to cover:

  • 1xx HTTP responses;
  • Plaintext 2xx successful and error responses;
  • JSON parsing errors;
  • Classification of errors as retryable;
  • Text of error messages.

I believe this should now align functionally with what Grafana is doing. PTAL?
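
As a rough illustration of that kind of coverage, a hypothetical table-driven test against the checkSlackResponse sketch shown earlier (not the test added in this PR):

package main

import "testing"

func TestCheckSlackResponse(t *testing.T) {
    cases := []struct {
        name        string
        contentType string
        body        string
        wantRetry   bool
        wantErr     bool
    }{
        {"plaintext success", "text/plain", "ok", false, false},
        {"plaintext error", "text/plain", "invalid_payload", false, true},
        {"JSON success", "application/json", `{"ok":true}`, false, false},
        {"JSON error payload", "application/json", `{"ok":false,"error":"channel_is_archived"}`, false, true},
        {"JSON parse failure", "application/json", "oops", true, true},
    }
    for _, tc := range cases {
        t.Run(tc.name, func(t *testing.T) {
            retry, err := checkSlackResponse(tc.contentType, []byte(tc.body))
            if retry != tc.wantRetry {
                t.Errorf("retry = %v, want %v", retry, tc.wantRetry)
            }
            if (err != nil) != tc.wantErr {
                t.Errorf("err = %v, wantErr %v", err, tc.wantErr)
            }
        })
    }
}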

@knyar knyar requested a review from gotjosh May 13, 2023 11:25
@knyar knyar force-pushed the slack_errors branch 2 times, most recently from a465be9 to 7317c32 Compare May 13, 2023 11:36
@@ -210,16 +210,66 @@ func (n *Notifier) Notify(ctx context.Context, as ...*types.Alert) (bool, error)
 	if err != nil {
 		return true, notify.RedactURL(err)
 	}
-	defer notify.Drain(resp)
+	defer resp.Body.Close()
Contributor:

I'm not 100% familiar with notify.Drain, but why would we stop using it here? The call is deferred, so we can still read resp.Body from the function call on Line 226.

Contributor (Author):

It was no longer necessary to always drain and close, so I changed it to use the standard library to make it easier to understand. I agree it does not matter, so I have put notify.Drain back if you prefer that.

Contributor:

Thank you!
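
For context on the pattern discussed in this thread, a small sketch of how the deferred close interacts with reading the body (illustrative code, not the PR's; notify.Drain is Alertmanager's drain-and-close helper for response bodies):

package slackexample

import (
    "fmt"
    "io"
    "net/http"
)

// readBody shows the ordering being discussed: Close (or a drain-and-close
// helper such as notify.Drain) is deferred, so it only runs after the body
// has been read below and the surrounding function returns.
func readBody(client *http.Client, req *http.Request) (bool, []byte, error) {
    resp, err := client.Do(req)
    if err != nil {
        return true, nil, err
    }
    defer resp.Body.Close() // the PR originally used this in place of defer notify.Drain(resp)

    body, err := io.ReadAll(resp.Body)
    if err != nil {
        return true, nil, fmt.Errorf("could not read response body: %w", err)
    }
    return false, body, nil
}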

// (https://api.slack.com/messaging/webhooks#handling_errors)
func checkTextResponseError(body []byte) (bool, error) {
	if !bytes.Equal(body, []byte("ok")) {
		return false, fmt.Errorf("received an error response from Slack: %s", string(body))
Contributor:

This should be true, shouldn't it? Don't we want to retry failed webhooks?

Suggested change:
-	return false, fmt.Errorf("received an error response from Slack: %s", string(body))
+	return true, fmt.Errorf("received an error response from Slack: %s", string(body))

Contributor:

Oh I see, in Grafana we don't retry these errors.

Contributor (Author):

This is the case when we got a 200 back with an error message. I suspect in most cases it will not indicate an infrastructure error (or an otherwise transient issue) and retrying it will not yield a different result.

The function comment here links to Slack documentation that describes some of those errors, and none of them seem like something where retrying would help.

Contributor:

Yes, I agree! Sorry for the confusing original comment!


	var data response
	if err := json.Unmarshal(body, &data); err != nil {
		return false, errors.Wrapf(err, "could not unmarshal JSON response %q", string(body))
Contributor:

Same question here, earlier we retried if the body could not be read:

body, err := io.ReadAll(resp.Body)
if err != nil {
	return true, errors.Wrap(err, "could not read response body")
}

Shouldn't we retry if the JSON cannot be unmarshalled too?

Suggested change:
-	return false, errors.Wrapf(err, "could not unmarshal JSON response %q", string(body))
+	return true, errors.Wrapf(err, "could not unmarshal JSON response %q", string(body))

Contributor:

Oh I see, in Grafana we don't retry these errors.

Contributor (Author):

Yes, my original proposal (2bd3d4d) retried unmarshalling errors, but I have been asked here to match Grafana's behaviour.

I am happy either way; I just want Alertmanager not to silently ignore errors while delivering notifications.

Member:

On the retry side, I think we should keep Alertmanager's behaviour intact (for now); we can discuss whether or not we should retry as part of a separate PR.

Contributor:

I took a quick look, and we don't have a precedent in Alertmanager for these specific kinds of errors. I'm inclined to return true if parsing the response JSON fails, as that's what we also do if reading the response buffer fails.

Contributor (Author):

Changed this to retry on JSON unmarshalling errors.
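
For context on why that boolean matters, here is a simplified sketch of how such a retry flag is typically consumed by a retry loop; this is illustrative and not Alertmanager's actual Retrier:

package main

import (
    "errors"
    "fmt"
)

// notifyFunc stands in for a notifier's Notify method: it returns whether the
// failure is worth retrying, plus the error itself.
type notifyFunc func() (retry bool, err error)

// deliver is a simplified retry loop keyed off the retry flag; Alertmanager's
// real retry logic is more involved (backoff, context cancellation, etc.).
func deliver(notify notifyFunc, maxAttempts int) error {
    for attempt := 1; attempt <= maxAttempts; attempt++ {
        retry, err := notify()
        if err == nil {
            return nil
        }
        if !retry {
            // e.g. a 200 response carrying {"ok":false,...}: retrying will not help.
            return fmt.Errorf("notify retry canceled due to unrecoverable error after %d attempts: %w", attempt, err)
        }
        // Transient failure (unreadable body, unparsable JSON, 5xx): try again.
    }
    return errors.New("giving up after reaching the maximum number of attempts")
}

func main() {
    err := deliver(func() (bool, error) {
        return false, errors.New("error response from Slack: channel_is_archived")
    }, 3)
    fmt.Println(err)
}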

@grobinson-grafana (Contributor) commented Aug 9, 2023

@gotjosh @simonpasquier something I missed is that these changes update Alertmanager to understand error responses from both Slack's chat.postMessage API and incoming webhooks. These are two different APIs with different authentication schemes and different features.

Alertmanager supports both, but the fact that it does so seems coincidental rather than intentional (for example, slack_configs contains http_config so that a Bearer token can be set for chat.postMessage). This is also the case in Grafana, and I have suggested to the team in the past that Slack be separated into two integrations: one for chat.postMessage and another for incoming webhooks.

I think we should accept this change because I suspect Alertmanager users are using this integration for both APIs. However, I want to highlight that adding error handling for chat.postMessage further encourages the use of this integration for either API, despite the documentation stating that this integration is just for incoming webhooks.

@grobinson-grafana (Contributor) commented

LGTM once #3121 (comment) has been fixed! Thanks for your contribution! ❤️

@gotjosh (Member) commented Aug 9, 2023

further encourages the use of this integration for either API, despite the documentation stating that this integration is just for incoming webhooks.

@grobinson-grafana would you be OK with updating the documentation for that?

@grobinson-grafana (Contributor) commented

Sure! I just wanted to confirm that's what we want 🙂

@grobinson-grafana (Contributor) commented

I just tested a couple of different scenarios.

  1. text/plain response with "oops"
ts=2023-08-10T09:27:48.060Z caller=dispatch.go:352 level=error component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="slack/slack[0]: notify retry canceled due to unrecoverable error after 1 attempts: channel \"#test-alerts\": received an error response from Slack: oops\n"
  2. application/json response with invalid JSON "oops"
ts=2023-08-10T09:28:29.707Z caller=notify.go:745 level=warn component=dispatcher receiver=slack integration=slack[0] aggrGroup={}:{} msg="Notify attempt failed, will retry later" attempts=7 err="channel \"#test-alerts\": could not unmarshal JSON response \"oops\\n\": invalid character 'o' looking for beginning of value"
  3. application/json response with valid JSON
ts=2023-08-10T09:29:16.225Z caller=dispatch.go:352 level=error component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="slack/slack[0]: notify retry canceled due to unrecoverable error after 9 attempts: channel \"#test-alerts\": error response from Slack: oops"

@grobinson-grafana (Contributor) commented

PR to update the docs #3455

@gotjosh (Member) left a comment

LGTM

Thank you very much for your contribution.

@gotjosh gotjosh merged commit 94625df into prometheus:main Aug 10, 2023
radek-ryckowski pushed a commit to goldmansachs/alertmanager that referenced this pull request Nov 6, 2023
qinxx108 pushed a commit to amazon-contributing/alertmanager that referenced this pull request Mar 28, 2024