
improved error when victorops fails #1207

Merged
merged 6 commits into prometheus:master from caarlos0:log on Jan 29, 2018

Conversation

@caarlos0
Contributor

I am getting errors like this:

level=error ts=2018-01-25T11:14:09.509893874Z caller=dispatch.go:266 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="cancelling notify retry for \"victorops\" due to unrecoverable error: unexpected status code 404"
level=error ts=2018-01-25T11:14:11.027745774Z caller=notify.go:302 component=dispatcher msg="Error on notify" err="cancelling notify retry for \"victorops\" due to unrecoverable error: unexpected status code 404"
level=error ts=2018-01-25T11:14:11.027791108Z caller=dispatch.go:266 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="cancelling notify retry for \"victorops\" due to unrecoverable error: unexpected status code 404"
level=error ts=2018-01-25T11:14:11.038292052Z caller=notify.go:302 component=dispatcher msg="Error on notify" err="cancelling notify retry for \"victorops\" due to unrecoverable error: unexpected status code 404"
level=error ts=2018-01-25T11:14:11.038324649Z caller=dispatch.go:266 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="cancelling notify retry for \"victorops\" due to unrecoverable error: unexpected status code 404"

It's hard to find out which alerts are erroring; I thought that maybe improving the error by adding the alert names might help...

@stuartnelson3
Contributor

I would prefer to leave the error message as is, and add the alert names to the debug level logging above.

@caarlos0
Contributor Author

@stuartnelson3 done

notify/notify.go Outdated
@@ -600,7 +600,11 @@ func (r RetryStage) Exec(ctx context.Context, l log.Logger, alerts ...*types.Ale
case <-tick.C:
if retry, err := r.integration.Notify(ctx, alerts...); err != nil {
numFailedNotifications.WithLabelValues(r.integration.name).Inc()
level.Debug(l).Log("msg", "Notify attempt failed", "attempt", i, "integration", r.integration.name, "err", err)
var alertnames []string
@stuartnelson3
Contributor

one final nit... could you change this to

alertnames := make([]string, 0, len(alerts))

This makes sure we only allocate memory once.

@caarlos0
Contributor Author

sure! :)
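
For context, a rough sketch of the suggested shape. The alertNames helper and the a.Name() accessor are illustrative assumptions, not the literal patch:

import "github.com/prometheus/alertmanager/types"

// alertNames collects the names of the given alerts. Pre-sizing the slice
// via make([]string, 0, len(alerts)) means append never has to grow and
// re-allocate the backing array.
func alertNames(alerts ...*types.Alert) []string {
	names := make([]string, 0, len(alerts))
	for _, a := range alerts {
		// Name() returns the value of the alert's "alertname" label.
		names = append(names, a.Name())
	}
	return names
}

The resulting slice still has to be joined into a single string before logging, as the following comments show.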

@caarlos0
Contributor Author

@stuartnelson3 done

@caarlos0
Contributor Author

yeah, that didn't seem to work:

level=debug ts=2018-01-25T11:51:09.010972223Z caller=notify.go:607 component=dispatcher msg="Notify attempt failed" attempt=1 integration=victorops alerts="unsupported value type" err="unexpected status code 404"

@caarlos0
Contributor Author

joining strings works as expected.
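
A standalone reproduction of that go-kit logfmt behaviour (the alert names here are made up for illustration):

package main

import (
	"os"
	"strings"

	"github.com/go-kit/kit/log"
	"github.com/go-kit/kit/log/level"
)

func main() {
	l := log.NewLogfmtLogger(os.Stderr)
	names := []string{"HighErrorRate", "InstanceDown"}

	// A []string is not a value type the logfmt encoder can render, so this
	// line prints alerts="unsupported value type" instead of the names.
	level.Debug(l).Log("msg", "Notify attempt failed", "alerts", names)

	// Joining the names into a single string logs them as expected:
	// alerts="HighErrorRate,InstanceDown"
	level.Debug(l).Log("msg", "Notify attempt failed", "alerts", strings.Join(names, ","))
}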

@stuartnelson3
Contributor

Actually, now that I'm thinking about this, would it really be the fault of a particular alert for failure? I think it would be your configuration of a particular receiver, rather than the alert.

@caarlos0
Contributor Author

caarlos0 commented Jan 25, 2018

Actually, now that I'm thinking about this, would it really be the fault of a particular alert for failure? I think it would be your configuration of a particular receiver, rather than the alert.

yeah, but I have dozens of different receivers on 2 different accounts.

I tried to access the receiver from the alert but wasn't able to... can I do that?

Anyway, with the alert names I was able to find the issue, I think.

Another question: alerts are "bucketed" to be sent to victorops, right? It seems like one of the accounts' api_key was revoked (still looking into that), and some alerts from the other api_key were not routed to victorops either. Is it possible that a failing alert in a bucket aborts the entire bucket (thus not even trying to send the other alerts)?

@caarlos0
Contributor Author

actually seems like one of the accounts was removed... so that explains the 404

@stuartnelson3
Contributor

The receiver (and its name) is available in the createStage function. You could add that to the creation of the retry stage.

I'm not sure what you mean by failing alert, but I'm assuming that if the connection info to victor ops is correct, it will receive the alerts successfully.

@caarlos0 caarlos0 closed this Jan 25, 2018
@caarlos0 caarlos0 reopened this Jan 25, 2018
@caarlos0
Contributor Author

(sorry)

@caarlos0
Contributor Author

caarlos0 commented Jan 25, 2018

the problem was the following:

  • we had 2 accounts set up (let's say engineering and sales)
  • several alerts, some going to engineering and others to sales
  • the sales account was deleted (on victorops), but alerts were still routed to it
  • some alerts that should have gone to the engineering account didn't, and there were lots of 404s, as in the description of this PR
  • some other alerts were routed to the engineering account just fine

so my question is: is it possible that one bad config (sales) affects alerts from another config (engineering)?

as far as the stage thing goes, I'll try to do that, thanks! :)

@stuartnelson3
Contributor

The two receivers should be totally separate and with separate API keys. Since you were sending lots of failing alerts for sales (and retrying), potentially you were being rate-limited based on your alertmanager IP address, which caused the engineering alerts to get blocked. I don't know unfortunately. Typically there's rate limiting per API token, but they might also have a more general IP rate limiting in place. This would be a question for victor ops.

notify/notify.go Outdated
@@ -238,7 +238,7 @@ func createStage(rc *config.Receiver, tmpl *template.Template, wait func() time.
var s MultiStage
s = append(s, NewWaitStage(wait))
s = append(s, NewDedupStage(notificationLog, recv))
s = append(s, NewRetryStage(i))
s = append(s, NewRetryStage(i, recv))
@stuartnelson3
Contributor

could you update this to only pass in the GroupName? No reason to pass around more data than is necessary.

@caarlos0
Contributor Author

done
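
For reference, a sketch of where this thread ended up, with the retry stage carrying only the receiver's name. The field and constructor shapes are a reading of the diff above, not necessarily the merged code verbatim:

// RetryStage notifies via the passed integration with exponential backoff
// until it succeeds, aborting if the context is canceled or times out.
type RetryStage struct {
	integration Integration
	groupName   string
}

// NewRetryStage returns a RetryStage that knows the receiver's name without
// holding on to the whole *config.Receiver.
func NewRetryStage(i Integration, groupName string) *RetryStage {
	return &RetryStage{integration: i, groupName: groupName}
}

createStage would then call something like NewRetryStage(i, rc.Name) instead of NewRetryStage(i, recv).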

@stuartnelson3 left a comment
Contributor

Thanks! Sorry about all the back and forth :)

@stuartnelson3 stuartnelson3 merged commit 23f31d7 into prometheus:master Jan 29, 2018
@caarlos0 caarlos0 deleted the log branch January 29, 2018 16:01
@caarlos0
Contributor Author

Thanks! Sorry about all the back and forth :)

no problem, thanks!
