
improved error when victorops fails #1207

Merged
merged 6 commits into prometheus:master from caarlos0:log on Jan 29, 2018

Conversation

@caarlos0
Contributor

I am getting errors like this:

level=error ts=2018-01-25T11:14:09.509893874Z caller=dispatch.go:266 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="cancelling notify retry for \"victorops\" due to unrecoverable error: unexpected status code 404"
level=error ts=2018-01-25T11:14:11.027745774Z caller=notify.go:302 component=dispatcher msg="Error on notify" err="cancelling notify retry for \"victorops\" due to unrecoverable error: unexpected status code 404"
level=error ts=2018-01-25T11:14:11.027791108Z caller=dispatch.go:266 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="cancelling notify retry for \"victorops\" due to unrecoverable error: unexpected status code 404"
level=error ts=2018-01-25T11:14:11.038292052Z caller=notify.go:302 component=dispatcher msg="Error on notify" err="cancelling notify retry for \"victorops\" due to unrecoverable error: unexpected status code 404"
level=error ts=2018-01-25T11:14:11.038324649Z caller=dispatch.go:266 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="cancelling notify retry for \"victorops\" due to unrecoverable error: unexpected status code 404"

It's hard to find out which alerts are erroring; I thought that maybe improving the error by adding the alert names might help...

@stuartnelson3
Contributor

I would prefer to leave the error message as is, and add the alert names to the debug level logging above.

@caarlos0
Contributor Author

@stuartnelson3 done

notify/notify.go Outdated
@@ -600,7 +600,11 @@ func (r RetryStage) Exec(ctx context.Context, l log.Logger, alerts ...*types.Ale
case <-tick.C:
if retry, err := r.integration.Notify(ctx, alerts...); err != nil {
numFailedNotifications.WithLabelValues(r.integration.name).Inc()
level.Debug(l).Log("msg", "Notify attempt failed", "attempt", i, "integration", r.integration.name, "err", err)
var alertnames []string
@stuartnelson3
Contributor

one final nit... could you change this to

alertnames := make([]string, 0, len(alerts))

This makes sure we only allocate memory once.

@caarlos0
Contributor Author

sure! :)
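
For context, a rough sketch of the suggested shape. The alertNames helper and the a.Name() accessor are illustrative assumptions, not the literal patch:

import "github.com/prometheus/alertmanager/types"

// alertNames collects the names of the given alerts. Pre-sizing the slice
// via make([]string, 0, len(alerts)) means append never has to grow and
// re-allocate the backing array.
func alertNames(alerts ...*types.Alert) []string {
	names := make([]string, 0, len(alerts))
	for _, a := range alerts {
		// Name() returns the value of the alert's "alertname" label.
		names = append(names, a.Name())
	}
	return names
}

The resulting slice still has to be joined into a single string before logging, as the following comments show.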

@caarlos0
Contributor Author

@stuartnelson3 done

@caarlos0
Contributor Author

yeah, that didn't seem to work:

level=debug ts=2018-01-25T11:51:09.010972223Z caller=notify.go:607 component=dispatcher msg="Notify attempt failed" attempt=1 integration=victorops alerts="unsupported value type" err="unexpected status code 404"

@caarlos0
Contributor Author

joining strings works as expected.
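
A standalone reproduction of that go-kit logfmt behaviour (the alert names here are made up for illustration):

package main

import (
	"os"
	"strings"

	"github.com/go-kit/kit/log"
	"github.com/go-kit/kit/log/level"
)

func main() {
	l := log.NewLogfmtLogger(os.Stderr)
	names := []string{"HighErrorRate", "InstanceDown"}

	// A []string is not a value type the logfmt encoder can render, so this
	// line prints alerts="unsupported value type" instead of the names.
	level.Debug(l).Log("msg", "Notify attempt failed", "alerts", names)

	// Joining the names into a single string logs them as expected:
	// alerts="HighErrorRate,InstanceDown"
	level.Debug(l).Log("msg", "Notify attempt failed", "alerts", strings.Join(names, ","))
}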

@stuartnelson3
Contributor

Actually, now that I'm thinking about this, would it really be the fault of a particular alert for failure? I think it would be your configuration of a particular receiver, rather than the alert.

@caarlos0
Contributor Author

caarlos0 commented Jan 25, 2018

Actually, now that I'm thinking about this, would it really be the fault of a particular alert for failure? I think it would be your configuration of a particular receiver, rather than the alert.

yeah, but I have dozens of different receivers on 2 different accounts.

I tried to access the receiver from the alert but wasn't able to... can I do that?

Anyway, with the alert names I was able to find the issue, I think.

Another question: alerts are "bucketed" to be sent to victorops, right? It seems like one of the accounts' api_key was revoked (still looking into that), and some alerts from the other api_key were not routed to victorops either. Is it possible that a failing alert in a bucket aborts the entire bucket (thus not even trying to send the other alerts)?

@caarlos0
Contributor Author

actually seems like one of the accounts was removed... so that explains the 404

@stuartnelson3
Contributor

The receiver (and its name) is available in the createStage function. You could add that to the creation of the retry stage.

I'm not sure what you mean by failing alert, but I'm assuming that if the connection info to victor ops is correct, it will receive the alerts successfully.

@caarlos0 caarlos0 closed this Jan 25, 2018
@caarlos0 caarlos0 reopened this Jan 25, 2018
@caarlos0
Contributor Author

(sorry)

@caarlos0
Contributor Author

caarlos0 commented Jan 25, 2018

the problem was the following:

  • we had 2 accounts set up (let's say engineering and sales)
  • several alerts, some going to engineering and others to sales
  • the sales account was deleted (on victorops), but alerts were still routed to it
  • some alerts that should have gone to the engineering account didn't, and there were lots of 404s, as in the description of this PR
  • some other alerts were routed to the engineering account just fine

so my question is: is it possible that one bad config (sales) affects alerts from another config (engineering)?

as far as the stage thing goes, I'll try to do that, thanks! :)

@stuartnelson3
Contributor

The two receivers should be totally separate and with separate API keys. Since you were sending lots of failing alerts for sales (and retrying), potentially you were being rate-limited based on your alertmanager IP address, which caused the engineering alerts to get blocked. I don't know unfortunately. Typically there's rate limiting per API token, but they might also have a more general IP rate limiting in place. This would be a question for victor ops.

notify/notify.go Outdated
@@ -238,7 +238,7 @@ func createStage(rc *config.Receiver, tmpl *template.Template, wait func() time.
var s MultiStage
s = append(s, NewWaitStage(wait))
s = append(s, NewDedupStage(notificationLog, recv))
s = append(s, NewRetryStage(i))
s = append(s, NewRetryStage(i, recv))
@stuartnelson3
Contributor

could you update this to only pass in the GroupName? No reason to pass around more data than is necessary.

@caarlos0
Contributor Author

done
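
For reference, a sketch of where this thread ended up, with the retry stage carrying only the receiver's name. The field and constructor shapes are a reading of the diff above, not necessarily the merged code verbatim:

// RetryStage notifies via the passed integration with exponential backoff
// until it succeeds, aborting if the context is canceled or times out.
type RetryStage struct {
	integration Integration
	groupName   string
}

// NewRetryStage returns a RetryStage that knows the receiver's name without
// holding on to the whole *config.Receiver.
func NewRetryStage(i Integration, groupName string) *RetryStage {
	return &RetryStage{integration: i, groupName: groupName}
}

createStage would then call something like NewRetryStage(i, rc.Name) instead of NewRetryStage(i, recv).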

@stuartnelson3 left a comment
Contributor

Thanks! Sorry about all the back and forth :)

@stuartnelson3 stuartnelson3 merged commit 23f31d7 into prometheus:master Jan 29, 2018
@caarlos0 caarlos0 deleted the log branch January 29, 2018 16:01
@caarlos0
Contributor Author

Thanks! Sorry about all the back and forth :)

no problem, thanks!
